A word embedding is a learned representation for text where words that have the similar meaning have a similar representation. It is an approach to representing words and documents that may be considered one of the key breakthroughs for the impressive performance of deep learning methods on challenging natural language processing problems.

Word Embedding Algorithms

There are three main algorithms for learning a word embedding from text data. An embedding layer is a word embedding that is learned jointly with a neural network model on a specific natural language processing task. It requires that document text be cleaned and prepared such that each word is one-hot encoded. The embedding layer is used on the front end of a neural network. It is fit in a supervised way using the backpropagation algorithm. Word2Vec is a statistical method for efficiently learning a standalone word embedding from a text corpus. It was developed to make the neural-network-based training of the embedding more efficient and since then has become the de facto standard for developing pre-trained word embedding. The Global Vectors or GloVe algorithm is an extension to the Word2Vec method for efficiently learning word vectors. GloVe is an approach to marry both the global statistics of matrix factorization techniques with the local context-based learning in word2vec.

Gensim Python Library

Gensim is an open source Python library for natural language processing, with a focus on topic modeling. Gensim is a mature, focused, and efficient suite of NLP tools for topic modeling. It also provides tools for loading pre-trained word embeddings in a few formats and for making use and querying a loaded embedding.