Introduction to Natural Language Processing — Feature Engineering

Kinder Chen
1 min read · May 10, 2021

Natural Language Processing (NLP) has become increasingly popular and has attracted considerable attention. Feature engineering, such as removing stop words, stemming, lemmatization, and n-grams, is essential for understanding the dynamics of a text dataset. This blog gives a brief introduction to some common feature engineering techniques.

Feature Engineering

Sometimes the same word shows up as different tokens because of its tense; for example, “walk”, “walked” and “walking” all count as separate tokens. NLP methods can deal with this problem by reducing each word token down to its root word.

Stemming removes the ends of words, where the ending signals a derivational change to the word. It is a crude, heuristic process, so the resulting stems do not have to be actual English words. However, it is easy to implement. Lemmatization is similar to stemming, but it examines the morphology of each word and attempts to reduce it to its most basic form, or lemma. As a result, its output often differs a bit from that of stemming.
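To make the difference concrete, here is a minimal sketch in plain Python. It is a toy under stated assumptions, not a real stemmer or lemmatizer: the suffix list, the lookup table, and the names `crude_stem` and `toy_lemmatize` are all made up for illustration. In practice you would probably reach for a library implementation such as NLTK's `PorterStemmer` and `WordNetLemmatizer`.

```python
# Hypothetical suffix list for a toy stemmer (not the Porter algorithm).
SUFFIXES = ("ing", "ies", "ed", "es", "s")

# Hypothetical lookup table standing in for a real morphological dictionary.
LEMMA_TABLE = {
    "studies": "study",
    "studying": "study",
    "running": "run",
    "ran": "run",
}


def crude_stem(word):
    """Chop off the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the table; fall back to the word itself."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)


words = ["studies", "studying", "running", "ran"]
stems = [crude_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]

print(stems)   # ['stud', 'study', 'runn', 'ran']
print(lemmas)  # ['study', 'study', 'run', 'run']
```

The stems “stud” and “runn” are not English words, and the suffix rules cannot connect “ran” to “run”, while the dictionary-based lookup maps every form to its lemma. That is roughly the trade-off described above: stemming is cheap but crude, and lemmatization needs knowledge about the words.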

Some words occur very frequently in a text but carry little information, such as “a”, “the” and “of”. These are called stop words. Stop words are often removed after tokenization is complete in order to reduce the dimensionality of the corpus down to only the words that contain important information.
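As a rough illustration of removing stop words after tokenization, the sketch below uses only the Python standard library. The tiny `STOP_WORDS` set, the regex tokenizer, and the function names are assumptions made for this example; real projects usually rely on a curated list, such as the one shipped with NLTK.

```python
import re

# Hypothetical, deliberately tiny stop word list for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "to", "is"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop word set."""
    return [t for t in tokens if t not in stop_words]


tokens = tokenize("The dynamics of a text dataset")
filtered = remove_stop_words(tokens)

print(tokens)    # ['the', 'dynamics', 'of', 'a', 'text', 'dataset']
print(filtered)  # ['dynamics', 'text', 'dataset']
```

Half of the six tokens disappear here, which is the dimensionality reduction mentioned above in miniature.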
