Location
Library Room 1576
Date and Time
Abstract
Word embeddings are vector-based representations of semantic relationships between words that enable computational analysis across large text corpora. However, the stability and validity of word embeddings remain a significant concern in NLP, particularly for languages with limited corpora. We define stability as the percentage overlap between a word's nearest neighbors, identified via cosine similarity, across embedding spaces. This project examines how different embedding algorithms influence model stability across resource levels, focusing on fastText in comparison to Word2Vec and GloVe. FastText incorporates subword information through character-level n-gram modeling, a feature that has shown particular promise for languages with limited training data. These results have broader implications for low-resource language technologies and computational linguistics, especially in domains where reproducibility and robustness are essential.
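For concreteness, the stability metric described above can be computed as the average overlap of top-k nearest-neighbor lists between two embedding spaces. The sketch below is a minimal illustration, not the project's actual code: it assumes two embedding matrices whose rows are aligned to a shared vocabulary, and the function names and k = 10 default are illustrative.

```python
import numpy as np

def top_k_neighbors(emb: np.ndarray, k: int = 10) -> np.ndarray:
    """Indices of each word's k nearest neighbors by cosine similarity."""
    # Normalize rows so the dot product equals cosine similarity.
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # exclude each word as its own neighbor
    return np.argsort(-sims, axis=1)[:, :k]  # sort descending, keep top k

def stability(emb_a: np.ndarray, emb_b: np.ndarray, k: int = 10) -> float:
    """Mean percent overlap of k-nearest-neighbor sets between two embedding
    spaces over the same vocabulary (rows aligned word-for-word)."""
    nn_a = top_k_neighbors(emb_a, k)
    nn_b = top_k_neighbors(emb_b, k)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]
    return 100.0 * float(np.mean(overlaps))

# Toy example: a second space simulating a slightly perturbed retraining.
rng = np.random.default_rng(0)
emb1 = rng.normal(size=(1000, 100))
emb2 = emb1 + 0.1 * rng.normal(size=(1000, 100))
print(f"Stability at k=10: {stability(emb1, emb2):.1f}%")
```

In practice the two matrices would come from separate training runs (e.g. fastText versus Word2Vec, or repeated runs on the same corpus), and stability is typically reported as this overlap averaged over the vocabulary at a fixed k.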