The Bag of Words Model

Kinder Chen
2 min read · May 17, 2021

The bag-of-words (BoW) model is a simple and flexible approach to representing text data when modeling text with machine learning algorithms. In this blog, we will give an introduction to the bag-of-words model for feature extraction in natural language processing (NLP).

Bag of Words

The bag-of-words model is the most common approach to turning text into vectors. A bag of words is a representation of text that describes the occurrence of words within a document; any information about the order or structure of the words is discarded. The model captures which words appear in the text and how often, but not in what order.

In the model, we build a vocabulary of known words from the corpus and, for each document, throw its words into a bag. The simplest way to create a bag of words is to count how many times each vocabulary word appears in a given document. Once every word has a count, each bag can be treated as a vector, which opens up all kinds of machine learning tools for use.
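The counting step above can be sketched with nothing but the standard library. This is a minimal illustration on a made-up two-document corpus; the vocabulary, documents, and helper name are all assumptions for the example:

```python
from collections import Counter

# A toy corpus of two short documents (illustrative only)
docs = [
    "the cat sat on the mat",
    "the dog ate the cat",
]

# Build the vocabulary: every unique word seen in the corpus
vocab = sorted({word for doc in docs for word in doc.split()})

# Represent a document as a vector of word counts over the vocabulary
def bag_of_words(doc):
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bag_of_words(doc) for doc in docs]
print(vocab)    # ['ate', 'cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[0, 1, 0, 1, 1, 1, 2], [1, 1, 1, 0, 0, 0, 2]]
```

In practice a library vectorizer (for example, scikit-learn's CountVectorizer) does the same thing with tokenization and sparse storage handled for you.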

Limitations

The bag-of-words model has been used with great success on prediction problems like language modeling and document classification. Nevertheless, it suffers from some shortcomings. The vocabulary requires careful design to manage its size, which directly affects the sparsity of the document representations, and sparse representations are harder to model for both computational and informational reasons. Moreover, discarding word order throws away context, and with it the meaning of words in the document. Context and meaning can offer a lot to a model, so it matters that the bag-of-words representation cannot distinguish the same words differently arranged, nor recognize synonyms as related.
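The word-order limitation is easy to demonstrate: two sentences with opposite meanings can produce identical bags of words. A small sketch with hypothetical example sentences:

```python
from collections import Counter

# Two sentences with very different meanings but identical word counts
a = "the dog bit the man"
b = "the man bit the dog"

# Their bag-of-words representations are indistinguishable
same = Counter(a.split()) == Counter(b.split())
print(same)  # True
```

Any model trained on these vectors literally cannot tell the two sentences apart.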
