Introduction to Natural Language Processing

1 min read · May 3, 2021

Natural language refers to the way humans communicate with each other. Natural Language Processing (NLP) is broadly defined as the automatic manipulation of natural language. The field has been studied for more than 50 years and grew out of linguistics with the rise of computers. Deep learning methods have been evaluated on a broad suite of NLP problems and have achieved their greatest success on challenging problems such as text classification, language modeling, speech recognition, caption generation, machine translation, document summarization and question answering.

Text Cleaning

Text data usually requires more cleaning and preprocessing than most structured data before it can be used with statistical or machine learning methods. Cleaning a text dataset usually means splitting it into words and normalizing punctuation, upper/lower case characters, numbers and dates, spelling mistakes, regional variations, and Unicode characters. However, text cleaning can be tricky, and it requires decisions based on the text and the goals of the task.
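A few of the normalization steps above can be sketched in plain Python. This is a minimal illustration, not a complete cleaning pipeline: the function name clean_text is hypothetical, and it only handles Unicode normalization, case, ASCII punctuation and whitespace. Numbers, dates, spelling and regional variations would need their own handling.

```python
import string
import unicodedata


def clean_text(text):
    """Hypothetical helper: normalize Unicode form, case, punctuation, whitespace."""
    # Fold compatibility characters (e.g. full-width letters) into a standard form.
    text = unicodedata.normalize("NFKC", text)
    # Lowercase so that "Natural" and "natural" are treated as the same word.
    text = text.lower()
    # Drop ASCII punctuation only; punctuation from other scripts is left in place.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of whitespace into single spaces and trim the ends.
    return " ".join(text.split())


print(clean_text("  Hello,   World!  "))
```

Whether to drop punctuation or keep numbers depends on the goal, so each step here is a choice rather than a rule.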

Tokenization

Tokenization is the process of turning raw text into a list of word tokens. We can write the cleaning code by hand, and this is often a good approach because it lets us tokenize each text dataset in the way that suits it.

For example, the sentence “How did you study the Natural Language Processing?” would look like [‘how’, ‘did’, ‘you’, ‘study’, ‘the’, ‘natural’, ‘language’, ‘processing’] when cleaned and tokenized.
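The example sentence can be tokenized by hand with a regular expression. This is a minimal sketch, assuming that a token is any run of lowercase letters, digits or apostrophes; the function name tokenize is hypothetical, and a different dataset might call for a different pattern.

```python
import re


def tokenize(text):
    """Hypothetical helper: lowercase the text and split it into word tokens."""
    # Keep runs of letters, digits and apostrophes; everything else
    # (spaces, question marks, commas) acts as a separator and is dropped.
    return re.findall(r"[a-z0-9']+", text.lower())


tokens = tokenize("How did you study the Natural Language Processing?")
print(tokens)
```

The pattern only matches ASCII characters, so accented or non-Latin text would need a broader character class.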

Written by Kinder Chen