
Word Embeddings

Exploiting word embeddings (word2vec, GloVe, fastText…).

Embeddings are vector representations of linguistic units. They are ubiquitous in nearly all systems and applications that process natural language input, because they encode much richer information than discrete symbols can.

Embeddings are used to represent every kind of linguistic unit: character, syllable, word, phrase and document. The most famous ones are word embeddings, which, when trained with methods such as word2vec or GloVe, result in vectors that may capture the semantic meaning of the words. Such embeddings can be trained on raw text, without any annotation beyond the words themselves. General-purpose word embeddings are thus trained on huge datasets for a large variety of languages, and bring valuable lexical semantic information to most text-based applications through transfer learning. Exploiting such pretrained embeddings therefore makes it possible to train efficient systems even on small datasets with a limited number of annotations for the target task.
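As a concrete illustration of this transfer-learning idea, here is a minimal sketch of loading pretrained word vectors and inspecting their lexical semantics. It assumes the gensim library is installed and uses "glove-wiki-gigaword-50", one of the pretrained sets distributed through gensim's downloader; any similar set of pretrained vectors would do.

```python
# Minimal sketch: load pretrained GloVe vectors via gensim's downloader
# (assumes gensim is installed; "glove-wiki-gigaword-50" is one of the
# pretrained vector sets it can fetch).
import gensim.downloader as api

# Download on first call, then load 50-dimensional GloVe vectors.
vectors = api.load("glove-wiki-gigaword-50")

# Each word is mapped to a dense 50-dimensional vector.
print(vectors["king"].shape)                # (50,)

# Nearest neighbours in the embedding space often reflect word meaning.
print(vectors.most_similar("king", topn=5))
```

Even without any task-specific training, the nearest neighbours returned by `most_similar` already reflect semantic relatedness, which is exactly the information a downstream system can reuse.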

Because of their success, many variants of these methods have been proposed, for instance Skip-Thought and InferSent for sentence-level embeddings, or BERT for context-dependent word embeddings. In this course, we will not explore all of these variants; rather, we will introduce the fundamental principles on which all kinds of embeddings are based, and show how we can concretely download, exploit and adapt these embeddings in various types of downstream applications, as in the sketch below.
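To give a flavour of what "adapting" pretrained embeddings means, here is a hypothetical sketch that plugs the gensim vectors loaded above into a toy PyTorch classifier. The model and the example sentence are made up for illustration; the point is simply that the embedding layer starts from the pretrained weights instead of random ones, and can then be fine-tuned on the target task.

```python
# Hypothetical sketch: reuse the pretrained vectors loaded above to
# initialise the embedding layer of a small downstream model (PyTorch).
import torch
import torch.nn as nn

# Build the embedding matrix from the gensim KeyedVectors.
weights = torch.tensor(vectors.vectors)                # (vocab_size, 50)
embedding = nn.Embedding.from_pretrained(weights, freeze=False)  # fine-tunable

# A toy classifier: average the word vectors, then a linear layer.
classifier = nn.Linear(weights.shape[1], 2)

# Encode a (made-up) sentence as word indices known to the pretrained vocabulary.
word_ids = torch.tensor([[vectors.key_to_index[w] for w in ["the", "king", "speaks"]]])
sentence_vec = embedding(word_ids).mean(dim=1)         # (1, 50)
logits = classifier(sentence_vec)                      # (1, 2)
print(logits.shape)
```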