Transforming Text to Insights: Natural Language Processing & Sentiment Analysis
1 Introduction
In a constantly evolving world of modeling, Keras has become a trailblazer in open-source software, providing a Python interface for building artificial neural networks. By introducing these deep-learning techniques and complex pattern-recognition algorithms, data scientists have been able to apply code to solve many niche use cases that were previously unreasonable to attempt. Example applications include Image Classification, Natural Language Processing (NLP), Audio Processing, and Anomaly Detection, which can lead to powerful models that make actionable decisions, reduce significant analysis time, and complete tasks that previously required substantial human brainpower.
Specifically, organizations like law enforcement agencies can utilize these models for facial or object detection, helping them locate individuals of interest, while a more business-oriented application could be to perform sentiment analysis on a company's mass of textual reviews, leveraging the power of NLP. In this example walk-through we focus on the latter, conducting a sentiment analysis of Amazon reviews that determines whether a review was positive or negative based on the text it contains.
1.1 Data Preparation
First, we need to load and prepare our dataset. This involves reading the data from CSV files, ensuring that it is loaded correctly, and re-coding the labels to make them more suitable for our machine learning use case. In this first chunk of code, we read the training and test CSVs into their own dataframes, specifying the three column names in order: polarity, title, and text, which were determined by inspecting the Kaggle data card supplied with our set.
import os
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from keras import layers, models

# Load the datasets without headers and assign column names
train_df = pd.read_csv('/Users/rileysvensson/Downloads/archive (6)/train.csv', header=None, names=['polarity', 'title', 'text'])
test_df = pd.read_csv('/Users/rileysvensson/Downloads/archive (6)/test.csv', header=None, names=['polarity', 'title', 'text'])
1.1.0.1 Sanity Check #1: View Training Data
# Get the count of rows in the train_df, should be 3600000
print(f"Number of rows in train_df: {train_df.shape[0]}")

# Display the train_df to ensure it loaded correctly with our column labels and data
train_df.head(5)
Number of rows in train_df: 3600000
| | polarity | title | text | sentiment_positive |
|---|---|---|---|---|
| 0 | 2 | Stuning even for the non-gamer | This sound track was beautiful! It paints the ... | 1 |
| 1 | 2 | The best soundtrack ever to anything. | I'm reading a lot of reviews saying that this ... | 1 |
| 2 | 2 | Amazing! | This soundtrack is my favorite music of all ti... | 1 |
| 3 | 2 | Excellent Soundtrack | I truly like this soundtrack and I enjoy video... | 1 |
| 4 | 2 | Remember, Pull Your Jaw Off The Floor After He... | If you've played the game, you know how divine... | 1 |
The next step is to transform the target variable we want to predict: 'polarity', i.e. the sentiment the reviewer expresses in their textual feedback. The transformation applied is technically not needed; however, in data science we typically build models to predict in the form of 1's, indicating that something of interest occurred, and 0's, indicating the opposite. In our case we could have left the target variable as 2 for positive reviews and 1 for negative reviews, but for best practice and clearly defined interpretations, we re-coded this variable to binarily represent positive sentiment. Simply put, positive reviews will be predicted as 1 and negative reviews as 0. To ensure a correct application of this process, we created a new variable named sentiment_positive to hold the new 1's and 0's, as a way to visually check that the lambda function kept all the data's patterns and only changed the labels. The next lines of code extract the text and the re-coded sentiment labels from the DataFrame into separate arrays, which are then used for training and testing the machine learning model.
# Re-code the labels to be more appropriate for a ML task: 0 for negative sentiment (polarity 1), 1 for positive (polarity 2)
# Create a new column for the recoded sentiment labels
train_df['sentiment_positive'] = train_df['polarity'].apply(lambda x: 0 if x == 1 else 1)
test_df['sentiment_positive'] = test_df['polarity'].apply(lambda x: 0 if x == 1 else 1)

# Extract the relevant columns for modeling, x (text) and y (sentiment_positive)
train_texts = train_df['text'].values
train_labels = train_df['sentiment_positive'].values

test_texts = test_df['text'].values
test_labels = test_df['sentiment_positive'].values

# View df to ensure that polarity was correctly re-coded (2's are now 1's and 1's are 0's)
train_df.head(10)
| | polarity | title | text | sentiment_positive |
|---|---|---|---|---|
| 0 | 2 | Stuning even for the non-gamer | This sound track was beautiful! It paints the ... | 1 |
| 1 | 2 | The best soundtrack ever to anything. | I'm reading a lot of reviews saying that this ... | 1 |
| 2 | 2 | Amazing! | This soundtrack is my favorite music of all ti... | 1 |
| 3 | 2 | Excellent Soundtrack | I truly like this soundtrack and I enjoy video... | 1 |
| 4 | 2 | Remember, Pull Your Jaw Off The Floor After He... | If you've played the game, you know how divine... | 1 |
| 5 | 2 | an absolute masterpiece | I am quite sure any of you actually taking the... | 1 |
| 6 | 1 | Buyer beware | This is a self-published book, and if you want... | 0 |
| 7 | 2 | Glorious story | I loved Whisper of the wicked saints. The stor... | 1 |
| 8 | 2 | A FIVE STAR BOOK | I just finished reading Whisper of the Wicked ... | 1 |
| 9 | 2 | Whispers of the Wicked Saints | This was a easy to read book that made me want... | 1 |
1.1.0.2 Sanity Check #2: View Value Counts of Both Datasets
# Verify the distribution of the new sentiment_positive column, which should be 1,800,000 for each in train and 200,000 each in test
print(train_df['sentiment_positive'].value_counts())
print(test_df['sentiment_positive'].value_counts())
1 1800000
0 1800000
Name: sentiment_positive, dtype: int64
1 200000
0 200000
Name: sentiment_positive, dtype: int64
1.2 Data Standardization & Vectorization
Next, we needed to standardize and vectorize the text data, which we did using our own function and Keras's built-in vectorization layer. Standardization involves cleaning the text data by removing HTML tags like breaks (<br />) and punctuation, using regular expressions from the re package and the punctuation constant from the string package, to ensure the text is in a consistent format before modeling. Vectorization transforms the text data into numerical form so that our neural network can process it, and we use Keras's TextVectorization layer for this purpose.
import re
import string

# Define a custom standardization function to clean the text data
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    return tf.strings.regex_replace(stripped_html, f"[{re.escape(string.punctuation)}]", "")

# Instantiate a TextVectorization layer to process the text data
max_features = 20000
sequence_length = 500

vectorize_layer = layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    output_mode="int",
    output_sequence_length=sequence_length
)

# Adapt the vectorize_layer to the training text data
vectorize_layer.adapt(train_texts)
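As a quick sanity check on the cleaning step, we can run custom_standardization on a made-up review fragment (the sample string below is purely illustrative, not from the dataset):

# Illustrative check on a made-up string: lowercasing, <br /> removal, and punctuation stripping
sample = tf.constant(["Great product!<br />Would BUY again..."])
print(custom_standardization(sample))
# Expected: tf.Tensor([b'great product would buy again'], shape=(1,), dtype=string)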
After vectorizing the text data, we transformed the raw text into numerical sequences that can be fed into a neural network for training. This step is crucial because neural networks require numerical input to perform calculations. The TextVectorization
layer in Keras helps automate this process by converting text into integer sequences, where each integer represents a specific token in the vocabulary, allowing us to efficiently train our model on the preprocessed text data.
# Convert the text data to integer sequences
train_texts = vectorize_layer(tf.constant(train_texts))
test_texts = vectorize_layer(tf.constant(test_texts))
2 Modeling & Evaluation
Now that we have our preprocessed data, we can build and compile the neural network model. This model consists of an embedding layer for word embeddings, convolutional layers for feature extraction, and dense layers for classification. We use a 1D convolutional neural network (CNN) for this task. A CNN is a type of deep learning model that is particularly effective for tasks involving spatial data, like images or text sequences, because it can automatically detect and learn hierarchical patterns and features from the input data. Due to the mass of data accessible to us, we only needed one epoch to train the model effectively and achieve a test accuracy of 92.44%. The term "epoch" refers to one cycle through the full training dataset. In our case, a single epoch was sufficient for the model to learn and accurately predict on unfamiliar data, which is the end goal of a modeling application.
# Create and compile the model
model = models.Sequential([
    layers.Embedding(max_features, 128),
    layers.Dropout(0.5),
    layers.Conv1D(128, 7, activation='relu'),
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(128, 7, activation='relu'),
    layers.GlobalMaxPooling1D(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid')
])

# Compile model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model using the training data
history = model.fit(train_texts, train_labels, epochs=1, validation_split=0.2)

# Evaluate the model using the test data
model.evaluate(test_texts, test_labels)
90000/90000 ━━━━━━━━━━━━━━━━━━━━ 4969s 55ms/step - accuracy: 0.8931 - loss: 0.2621 - val_accuracy: 0.9262 - val_loss: 0.1926
12500/12500 ━━━━━━━━━━━━━━━━━━━━ 181s 14ms/step - accuracy: 0.9229 - loss: 0.2002
[0.19694775342941284, 0.9244725108146667]
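To double-check that the layer stack matches the architecture described above, an optional inspection step is to print the model summary; the Embedding layer alone accounts for max_features × 128 = 2,560,000 of the parameters:

# Optional inspection step: print each layer's output shape and parameter count
model.summary()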
2.1 Discussion of Metrics
In this model, we use binary_crossentropy to measure our loss (or error) because we are dealing with a binary classification problem (positive vs. negative sentiment), while the optimizer used is adam, an efficient gradient-descent algorithm. The key metric we focus on in this example is accuracy, which tells us the proportion of correctly classified reviews. In this case, our model achieved an accuracy of approximately 92%, meaning that the model correctly identifies whether a review is positive or negative 92% of the time when applied to the test set of unseen reviews.
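To make the loss concrete, here is a small hand-check of binary cross-entropy on a few made-up labels and predicted probabilities (illustrative values, not actual model output):

import numpy as np

# Binary cross-entropy by hand: loss = -mean(y*log(p) + (1-y)*log(1-p))
y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.1, 0.8])   # confident, mostly-correct predictions
manual_bce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
keras_bce = tf.keras.losses.BinaryCrossentropy()(y_true, y_pred).numpy()
print(manual_bce, keras_bce)   # both ≈ 0.1446; confident correct predictions yield low loss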
2.2 Streamline the Process: End-to-End Model
To further enhance usability, we created a final end-to-end model capable of processing raw text inputs, which allows reviews to be fed directly into the model and a predicted sentiment to be returned as output. This model includes all preprocessing steps within the neural network itself, so it can handle raw text data directly, making it easier to deploy in real-world applications.
# Create an end-to-end model capable of processing raw strings
inputs = tf.keras.Input(shape=(1,), dtype="string")
indices = vectorize_layer(inputs)
outputs = model(indices)

end_to_end_model = tf.keras.Model(inputs, outputs)
end_to_end_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Ensure test_texts are in the correct shape (raw strings)
test_texts = test_df['text'].values

# Test it with the test data
end_to_end_model.evaluate(tf.data.Dataset.from_tensor_slices((test_texts, test_labels)).batch(32))
12500/12500 ━━━━━━━━━━━━━━━━━━━━ 186s 15ms/step - accuracy: 0.9229 - loss: 0.0000e+00
[0.0, 0.0, 0.9244725108146667, 0.9244725108146667]
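With the end-to-end model in place, a single raw review string can be scored directly; the review below is made up for illustration:

# Score a made-up review string; the sigmoid output is the probability the review is positive
sample_review = tf.constant([["This book was a complete waste of money."]])
prob_positive = end_to_end_model.predict(sample_review)[0][0]
print(f"P(positive) = {prob_positive:.3f}")   # values below 0.5 suggest a negative review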
3 Conclusion
Achieving 92% accuracy indicates that our model is highly effective at distinguishing between positive and negative reviews. For practical applications, this level of accuracy can significantly aid businesses in understanding their customers’ sentiment, which in turn can lead to many actionable decisions. For example, businesses can:
Monitor Customer Feedback: By quickly identifying and responding to negative reviews, businesses can improve their customer service and satisfaction, enhancing their reputation and retaining customers by promptly addressing concerns (see the sketch after this list).
Perform Market Analysis: By analyzing trends in customer sentiment over time, businesses can better inform their marketing strategies. Understanding how customers feel about different products or services allows them to tailor their marketing efforts more effectively, targeting areas that resonate positively with their audience.
Improve Products: Gaining insights into the specific features or aspects of products that customers feel strongly about, whether positively or negatively, can be invaluable for product development teams, helping them enhance features that customers love and address issues that lead to dissatisfaction.
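As a sketch of the first application, incoming reviews could be batch-scored and those the model deems likely negative surfaced for follow-up (the review strings and the 0.5 cutoff below are assumptions for illustration):

# Hypothetical monitoring sketch: flag reviews the model scores as likely negative
incoming_reviews = tf.constant([[r] for r in [
    "Arrived broken and support never replied.",
    "Exactly what I needed, five stars!",
]])
probs = end_to_end_model.predict(incoming_reviews)[:, 0]
for review, p in zip(incoming_reviews[:, 0].numpy(), probs):
    if p < 0.5:   # assumed cutoff for a likely-negative review
        print(f"Flag for follow-up ({p:.2f}): {review.decode()}")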
In summary, the model’s ability to accurately classify sentiment provides businesses with powerful insights into customer opinions, enabling them to make data-driven decisions that can enhance customer experience, optimize marketing strategies, and drive product improvements. As a final reminder, this was only possible by leveraging Keras’s NLP best practices and structure, which can be applied to countless other data science use cases.