Natural Language Processing
Chenghao Ding, 06 May 2021
- Question 1: Can we separate spam emails from ham (legitimate) emails?
- Question 2: Can we identify real disaster tweets among 10,000 tweets?
In this project, several datasets are used to show how different machine learning models can be applied to common NLP tasks such as text classification and sentiment analysis.
1. Loading Data and Exploratory Data Analysis
As we can see, the classes are imbalanced, so resampling techniques may be worth considering.
We can also see that ham messages tend to be shorter than spam messages.
2. Data Pre-processing
Before we feed texts into our machine learning models, we need to clean the corpus: remove stop words and apply stemming or lemmatization. After that, we encode the texts into vectors.
2.1 Cleaning the corpus
We make the text lowercase, remove text in square brackets, remove links, remove punctuation, and remove words containing numbers.
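A minimal sketch of such a cleaning function with regular expressions (the sample message is illustrative):

```python
import re
import string

def clean_text(text):
    """Lowercase, then strip bracketed text, links, punctuation, and words with digits."""
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)                 # text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)   # links
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\w*\d\w*', '', text)                # words containing numbers
    return text

print(clean_text('WIN a [free] prize!! Visit http://spam.example 2day'))
```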
Remove Stopwords
Stop words are commonly used English words that carry little contextual meaning in a sentence, so we remove them before classification. The figure below shows the cleaned texts after stop words are removed.
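A minimal sketch of the removal step using NLTK's English stop word list:

```python
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords', quiet=True)
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    """Keep only words that are not English stop words."""
    return ' '.join(w for w in text.split() if w not in stop_words)

print(remove_stopwords('this is an example of a cleaned message'))
# -> 'example cleaned message'
```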
2.2 Stemming/Lemmatization
Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and it often includes the removal of derivational affixes.
Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word.
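A quick NLTK comparison of the two approaches (expected outputs shown as comments):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet', quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ['studies', 'studying', 'cries']:
    # Stemming chops suffixes; lemmatization returns a dictionary form.
    print(word, '->', stemmer.stem(word), '/', lemmatizer.lemmatize(word))
# studies -> studi / study
# studying -> studi / studying
# cries -> cri / cry
```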
2.3 Target encoding
Now the corpus has been cleaned; the figure below shows the cleaned sentences.
Next, we encode the target: ham is labeled 0 and spam is labeled 1.
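A minimal sketch of the encoding with pandas (the DataFrame contents are placeholders):

```python
import pandas as pd

df = pd.DataFrame({'label': ['ham', 'spam', 'ham'],
                   'text':  ['see you at 5', 'win a free prize now', 'ok thanks']})

# Map labels to integers: ham -> 0, spam -> 1
df['target'] = df['label'].map({'ham': 0, 'spam': 1})
print(df[['label', 'target']])
```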
3. Tokens visualization (Word Cloud) and Vectorization
Top Words for HAM emails
Top Words for SPAM emails
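Clouds like the ones above can be generated with the wordcloud package; a minimal sketch, with a placeholder corpus standing in for the joined ham messages:

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

ham_text = ' '.join(['go home ok', 'call me later', 'see you tonight'])  # placeholder

wc = WordCloud(width=800, height=400, background_color='white').generate(ham_text)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.title('Top Words for HAM emails')
plt.show()
```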
We will first use scikit-learn's CountVectorizer, which converts a collection of text documents to a matrix of token counts. Tuning CountVectorizer: it exposes several parameters, such as stop_words, ngram_range, min_df, max_df, and max_features, as shown in the sketch below.
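```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['free prize waiting claim now', 'see you at dinner tonight']  # placeholder

vectorizer = CountVectorizer(stop_words='english',
                             ngram_range=(1, 2),   # unigrams and bigrams
                             min_df=1, max_df=0.95,
                             max_features=10000)
X = vectorizer.fit_transform(corpus)  # sparse document-term count matrix
print(X.shape)
```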
Word Embeddings: GloVe
We can derive semantic relationships between words from the co-occurrence matrix. We load the embedding vectors of the words that appear in the GloVe dictionary; words not found in it are initialized to zero vectors.
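A sketch of building the embedding matrix, assuming the pre-trained glove.6B.100d.txt file (available from https://nlp.stanford.edu/projects/glove/) and a word_index like the one a fitted Keras Tokenizer produces:

```python
import numpy as np

embedding_dim = 100
word_index = {'fire': 1, 'flood': 2, 'hello': 3}  # placeholder; normally Tokenizer.word_index

# Parse the GloVe text file: one word and its vector per line.
embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

# Rows for words missing from GloVe stay all-zero.
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector
```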
4. Build Model
4.1 Naive Bayes
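A minimal sketch of the classifier, assuming X is the count matrix from CountVectorizer and y holds the 0/1 targets from section 2.3:

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# X: document-term count matrix, y: 0/1 labels (assumed from earlier steps)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

nb = MultinomialNB()
nb.fit(X_train, y_train)
print('Test accuracy:', accuracy_score(y_test, nb.predict(X_test)))
```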
Accuracy = 0.96.
4.2 XGBoost
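A comparable sketch with XGBoost, reusing the split from the Naive Bayes sketch; the hyperparameters are illustrative, not the ones behind the reported scores:

```python
from xgboost import XGBClassifier

xgb = XGBClassifier(n_estimators=200, max_depth=6,
                    learning_rate=0.1, eval_metric='logloss')
xgb.fit(X_train, y_train)
print('Train accuracy:', xgb.score(X_train, y_train))
print('Test accuracy:',  xgb.score(X_test, y_test))
```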
Train Accuracy = 0.98. Test Accuracy = 0.96.
4.3. LSTM Model
After epoch 3, the validation loss stopped improving, which indicates overfitting.
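One common remedy is early stopping; a sketch assuming a compiled Keras model named model and the padded training arrays from this section:

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop once validation loss has not improved for 2 epochs,
# rolling back to the best weights seen so far.
early_stop = EarlyStopping(monitor='val_loss', patience=2,
                           restore_best_weights=True)
model.fit(X_train, y_train, validation_split=0.2,
          epochs=10, batch_size=32, callbacks=[early_stop])
```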
5. Disaster Tweets
5.1 Data Pre-Processing
5.2 Word Cloud
Top Words for Real Disaster Tweets
Top Words for Fake Disaster Tweets
5.3 GloVe-LSTM Model
First, an embedding layer maps the words to their GloVe vectors; those vectors are then fed into LSTM layers, followed by a Dense layer with sigmoid activation.
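A sketch of that architecture in Keras, assuming the embedding_matrix built in section 3 and illustrative values for the vocabulary size and sequence length:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

vocab_size, embedding_dim, max_len = 20000, 100, 50  # illustrative values

model = Sequential([
    # Embedding layer initialized with the GloVe matrix and kept frozen;
    # embedding_matrix is assumed to have shape (vocab_size, embedding_dim).
    Embedding(vocab_size, embedding_dim, weights=[embedding_matrix],
              input_length=max_len, trainable=False),
    LSTM(64, return_sequences=True),
    LSTM(32),
    Dropout(0.2),
    Dense(1, activation='sigmoid'),  # probability that a tweet is a real disaster
])
model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=['accuracy'])
model.summary()
```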
Accuracy: 0.81