A linear classifier on top of a sentence2vec-style model (sentence vectors are averaged word/subword embeddings).
Main idea: the morphological structure of a word carries important information about its meaning, which traditional [word embeddings](/tag/word_embedding) do not take into account. This is especially significant for morphologically rich languages (e.g., German, Turkish), in which a single word can have a large number of morphological forms, each of which might occur rarely, making it hard to train good word embeddings.
FastText attempts to solve this by treating each word as the aggregation of its subwords: it uses character n-grams as features, which also avoids the OOV (out-of-vocabulary) problem.
(FastText represents a word as the sum of its character n-gram representations, trained with a skip-gram model; see the sketch below.)
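As a toy illustration of the subword scheme, here is a minimal sketch of the n-gram extraction and the summed word representation; the random embedding table and Python's built-in `hash` are stand-ins for fastText's trained vectors and FNV hashing:

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word, wrapped in boundary markers as in fastText."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(wrapped) - n + 1)]

# Toy embedding table: fastText hashes n-grams into a fixed number of buckets;
# here each bucket just gets a random vector (stand-in for trained parameters).
DIM, BUCKETS = 50, 100_000
rng = np.random.default_rng(0)
bucket_vectors = rng.normal(size=(BUCKETS, DIM)).astype(np.float32)

def word_vector(word):
    """Word vector = sum of the vectors of its character n-grams."""
    idx = [hash(g) % BUCKETS for g in char_ngrams(word)]
    return bucket_vectors[idx].sum(axis=0)

print(char_ngrams("where", 3, 3))      # ['<wh', 'whe', 'her', 'ere', 're>']
vec = word_vector("Autobahnkreuzung")  # an unseen word still gets a vector
```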
Embeddings learned using fastText (trained on Wikipedia) are available in [many languages](https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md).
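A minimal sketch of loading one of these pretrained models with gensim (the file name is illustrative; the `.bin` model is the one that keeps the subword information needed for OOV lookup):

```python
# pip install gensim; download e.g. wiki.de.bin from the link above first.
from gensim.models.fasttext import load_facebook_vectors

wv = load_facebook_vectors("wiki.de.bin")  # assumed local path to the German model

print(wv["Autobahn"][:5])            # vector of an in-vocabulary word
print(wv["Autobahnkreuzungen"][:5])  # rare compound: built from its n-grams, no KeyError
print(wv.most_similar("Autobahn", topn=3))
```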
Under the hood: Multilingual embeddings (Engineering Blog, Facebook Code): with this technique, embeddings for every language exist in the same vector space, and maintain the property that words with similar meanings (regardless of language) are close together in vector space.
> To train these multilingual word embeddings, we first trained separate embeddings for each language using fastText and a combination of data from Facebook and Wikipedia. We then used dictionaries to project each of these embedding spaces into a common space (English). The dictionaries are automatically induced from parallel data — meaning data sets that consist of a pair of sentences in two different languages that have the same meaning — which we use for training translation systems.
GitHub - [Babylonpartners/fastText_multilingual](https://github.com/Babylonpartners/fastText_multilingual): Multilingual word vectors. Aligning the fastText vectors of 78 languages.
> In a recent paper at ICLR 2017, we showed how the SVD can be used to learn a linear transformation (a matrix), which aligns monolingual vectors from two languages in a single vector space. In this repository we provide 78 matrices, which can be used to align the majority of the fastText languages in a single space.
[How to align two vector spaces for myself!](https://github.com/Babylonpartners/fastText_multilingual/blob/master/align_your_own.ipynb)
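A sketch of that SVD trick (orthogonal Procrustes), assuming two matrices whose rows are the vectors of dictionary-paired words; the data below is synthetic, whereas the repo ships precomputed alignment matrices:

```python
import numpy as np

def learn_alignment(X, Y):
    """SVD-based linear map (orthogonal matrix) aligning source rows X to target rows Y."""
    # Normalize rows so inner products behave like cosine similarities.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt  # apply as X @ W to project X into Y's space

# Synthetic stand-ins for two monolingual spaces related by a rotation:
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 300))                     # e.g. German dictionary-word vectors
W_true = np.linalg.qr(rng.normal(size=(300, 300)))[0]
Y = X @ W_true                                       # paired "English" vectors
W = learn_alignment(X, Y)
print(np.allclose(W, W_true, atol=1e-6))             # True: the rotation is recovered
```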
(fastText) Euclidean distance instead of cosine similarity? **The norm of a word vector is somewhat related to the frequency with which the word occurs in the training corpus.** Cosine distance ignores the norm, so a common word like "frog" can still be similar to a less frequent word like "Anura", which is its scientific name. (Hence the use of cosine distance.)
> That the inner product relates to the PMI between the vectors is for the most part an empirical result and there is very little theoretical background behind this finding
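A made-up numpy example of the effect: two vectors pointing in the same direction (similar meaning) but with different norms (different corpus frequency):

```python
import numpy as np

def cosine_similarity(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

frog  = np.array([2.0, 4.0, 6.0])  # frequent word -> larger norm (illustrative values)
anura = np.array([1.0, 2.0, 3.0])  # rare synonym -> smaller norm, same direction

print(np.linalg.norm(frog - anura))    # Euclidean distance ~3.74: looks dissimilar
print(cosine_similarity(frog, anura))  # cosine similarity 1.0: direction matches
```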
[Bag of Tricks for Efficient Text Classification](https://arxiv.org/abs/1607.01759) (arXiv, 2016): a simple and efficient baseline for text classification.
> **Our word features can be averaged** together to form good sentence representations. Our experiments show that fastText is often on par with deep learning classifiers in terms of accuracy, and many orders of magnitude faster for training and evaluation. We can train fastText on more than one billion words in less than ten minutes using a standard multicore CPU, and classify half a million sentences among 312K classes in less than a minute.
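A minimal sketch of the supervised classifier via the `fasttext` Python package; file paths, hyperparameters, and labels are illustrative:

```python
# pip install fasttext
# train.txt / test.txt (assumed to exist): one example per line, labels
# prefixed with "__label__", e.g.:
#   __label__positive this movie was great
import fasttext

model = fasttext.train_supervised(
    input="train.txt",
    lr=0.5,
    epoch=5,
    wordNgrams=2,   # adding bigram features is one of the paper's "tricks"
)
print(model.predict("this movie was great"))  # (top label(s), probabilities)
print(model.test("test.txt"))                 # (N, precision@1, recall@1)
```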