Positive or Negative? Spam or Not-spam? A Simple Text Classification Problem Using Python

Shanika Perera
4 min read · Jun 28, 2019

Do you know how your favorite email service provider decides whether the emails you receive are spam or not? It uses text classification to determine whether an incoming email goes to the inbox or to the spam folder. Interesting, isn’t it? First, we’ll learn what text classification really means.

What is text classification?

Text classification (TC) is the process of assigning tags or categories to text according to its content. It is one of the fundamental tasks in Natural Language Processing (NLP). Text classifiers can be used to organize, structure, and categorize pretty much any kind of text.

There are many approaches to text classification such as:

  • Rule-based systems
  • Machine learning-based systems
  • Hybrid systems

Text classification is mostly used for sentiment analysis, topic labeling, spam detection, and intent detection. Here are some applications of text classification in information retrieval:

  1. Detecting a document’s encoding (ASCII, Unicode UTF-8, etc.) [1]
  2. Word segmentation
  3. Truecasing [2]
  4. Identifying the language of a document
  5. The automatic detection of spam pages
  6. The automatic detection of sexually explicit content
  7. Sentiment analysis
  8. Personal email sorting
  9. Topic-specific or vertical search

Text classification algorithms are at the heart of a variety of software systems that process text data at scale. There are several reasons why text classification is so widely used:

  • Scalability: Manually analyzing and organizing text takes a human a lot of time. Machine learning lets you do it quickly and accurately.
  • Real-time analysis: Some companies use text classification for time-critical problems such as sentiment analysis. TC can make accurate predictions and help you make decisions right away.
  • Consistent criteria: A centralized text classifier applies the same criteria to every text, which reduces the errors that creep in when humans do the work by hand.

Today, I’ll be focusing on a sentiment analysis problem. Since you have a slight idea about what text classification is now, let’s get right to it 😉

What is sentiment analysis?

Sentiment analysis is the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer’s attitude towards a particular topic, product, etc. is positive, negative, or neutral. In simpler words, you take the text of a review as input and predict its sentiment class as output: positive or negative.

For example, a positive review contains something like this:

The hotel is really beautiful. Very nice and helpful service at the front desk.

And for a negative review:

We had problems with the Wi-fi. The food was also not so great.

For us, it is easy to read this and understand whether this is a positive or a negative review. But for computers, it is somewhat harder than that. So, let’s see what we can do about this.

I’m using the movie_reviews corpus in the nltk library for this process. A corpus is simply a large collection of texts. It is a body of written or spoken material upon which a linguistic analysis is based. I'm using the Naive Bayes classifier as the text classification algorithm.

Step 01: Create a Python file and import the following packages.
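
A minimal sketch of the imports the later steps rely on, assuming NLTK is already installed (for example with `pip install nltk`). The exact set of imports is my assumption; only names that the steps below actually need are pulled in.

```python
# NLTK's Naive Bayes implementation, used to train the sentiment model.
from nltk.classify import NaiveBayesClassifier

# Helper that scores a trained classifier against labeled test data.
from nltk.classify.util import accuracy

# Lazy handle to the movie_reviews corpus; no data is read at import time.
from nltk.corpus import movie_reviews
```

Importing `movie_reviews` does not download anything. The corpus data itself has to be fetched once, which is covered in Step 03.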

Step 02: Define a function to extract features.
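
NLTK classifiers expect each example as a dictionary that maps feature names to values. One simple option, sketched here under the assumption that a plain bag-of-words model is enough, is to mark every word that occurs in a review as `True`. The function name `extract_features` is only illustrative.

```python
def extract_features(words):
    """Map each word in a review to True (bag-of-words presence features)."""
    return {word: True for word in words}


# Repeated words collapse into a single key, so only presence is recorded.
print(extract_features(["a", "really", "good", "good", "movie"]))
```

This ignores word order and word counts, which keeps the example small but throws away information such as negation ("not good").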

Step 03: To get the training data, use the following movie reviews from NLTK.
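
The corpus data is not bundled with the `nltk` package, so it has to be downloaded once. The sketch below assumes network access on first use; the helper names `ensure_movie_reviews` and `describe_corpus` are hypothetical and only serve the illustration.

```python
import nltk
from nltk.corpus import movie_reviews


def ensure_movie_reviews():
    """Download the movie_reviews corpus if it is not available locally."""
    try:
        nltk.data.find("corpora/movie_reviews")
    except LookupError:
        # One-time download; needs network access.
        nltk.download("movie_reviews")


def describe_corpus(corpus=movie_reviews):
    """Return the category names and the total number of review files."""
    return corpus.categories(), len(corpus.fileids())


# Usage, once the corpus is available:
# ensure_movie_reviews()
# categories, n_files = describe_corpus()
# print(categories, n_files)
```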

Step 04: Now we will separate the positive and negative reviews.
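
The `movie_reviews` corpus is organized into a `pos` and a `neg` category, so the separation can be done by asking for the file IDs of each category. A minimal sketch, with `split_by_sentiment` as a hypothetical helper that works on any corpus object exposing `fileids(categories=...)`:

```python
def split_by_sentiment(corpus):
    """Return (positive_fileids, negative_fileids) for a categorized corpus."""
    positive_fileids = corpus.fileids(categories="pos")
    negative_fileids = corpus.fileids(categories="neg")
    return positive_fileids, negative_fileids


# Usage with NLTK (assumes the movie_reviews data has been downloaded):
# from nltk.corpus import movie_reviews
# positive_fileids, negative_fileids = split_by_sentiment(movie_reviews)
# print(len(positive_fileids), len(negative_fileids))
```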

Step 05: Since we need 2 datasets for this process, divide the data into training and testing datasets.
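
One way to do the split is to cut each list at a fixed fraction. The 80/20 ratio and the helper name `split_dataset` below are assumptions for illustration, not values taken from this post.

```python
def split_dataset(items, train_fraction=0.8):
    """Split a list into (train, test) by position; no shuffling is done."""
    cutoff = int(train_fraction * len(items))
    return items[:cutoff], items[cutoff:]


# Hypothetical file names standing in for real corpus file IDs.
fileids = [f"review_{i}.txt" for i in range(10)]
train_ids, test_ids = split_dataset(fileids)
print(len(train_ids), len(test_ids))
```

Applying the split to the positive and negative lists separately keeps both classes equally represented in the training and testing sets.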

Step 06: Extract the features.
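
A sketch of how the file IDs might be turned into labeled feature sets, which is the `(features, label)` format NLTK classifiers train on. `build_feature_sets` is a hypothetical helper, and the label strings `"Positive"` and `"Negative"` are my choice.

```python
def extract_features(words):
    """Map each word in a review to True (bag-of-words presence features)."""
    return {word: True for word in words}


def build_feature_sets(corpus, fileids, label):
    """Pair each review's word features with its sentiment label."""
    return [
        (extract_features(corpus.words(fileids=[fileid])), label)
        for fileid in fileids
    ]


# Usage with NLTK (train_pos_ids and train_neg_ids come from the split step):
# from nltk.corpus import movie_reviews
# features_train = (
#     build_feature_sets(movie_reviews, train_pos_ids, "Positive")
#     + build_feature_sets(movie_reviews, train_neg_ids, "Negative")
# )
# print("Number of training datapoints:", len(features_train))
```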

Step 07: Use the Naive Bayes classifier. Define the object and train it.
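
A minimal sketch of training and scoring with `NaiveBayesClassifier.train` and `nltk.classify.util.accuracy`. To keep it runnable without the corpus, it uses a tiny hand-made dataset; in the real script `features_train` and `features_test` would come from the movie reviews.

```python
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy

# Toy stand-ins for the real feature sets: (features, label) pairs.
features_train = [
    ({"great": True, "fun": True}, "Positive"),
    ({"wonderful": True, "great": True}, "Positive"),
    ({"boring": True, "awful": True}, "Negative"),
    ({"awful": True, "dull": True}, "Negative"),
]
features_test = [
    ({"great": True}, "Positive"),
    ({"awful": True}, "Negative"),
]

# train() is a class method that returns a fitted classifier.
classifier = NaiveBayesClassifier.train(features_train)

# accuracy() is the fraction of test examples labeled correctly.
print("Accuracy of the classifier:", accuracy(classifier, features_test))
```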

Step 08: To find the most informative words the classifier uses to decide whether a review is positive or negative, print the following.
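
NLTK's Naive Bayes classifier exposes `most_informative_features(n)`, which returns `(feature_name, feature_value)` pairs ordered from most to least informative. The sketch below trains on a toy dataset of my own so that it runs without the corpus; the number of features shown (5) is arbitrary.

```python
from nltk.classify import NaiveBayesClassifier

# Toy training data standing in for the movie review feature sets.
features_train = [
    ({"great": True, "fun": True}, "Positive"),
    ({"wonderful": True, "great": True}, "Positive"),
    ({"boring": True, "awful": True}, "Negative"),
    ({"awful": True, "dull": True}, "Negative"),
]
classifier = NaiveBayesClassifier.train(features_train)

# Each entry is a (feature_name, feature_value) pair.
top_features = classifier.most_informative_features(5)
for name, value in top_features:
    print(name, value)
```

If you also want the likelihood ratios between the labels, `classifier.show_most_informative_features(5)` prints them as a small table.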

Step 09: Create some random movie reviews of your own.
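
Any short sentences will do here. The reviews below are made up for illustration, and the whitespace tokenization is an assumption. Lowercasing is also an assumption, meant to match a lowercase training vocabulary.

```python
# Made-up sample reviews; replace them with your own.
input_reviews = [
    "It is an amazing movie",
    "This is a dull movie and I would never recommend it",
    "The cinematography is pretty great",
    "The direction was terrible and the story was a mess",
]

# Lowercase and split on whitespace so words can be looked up as features.
tokenized_reviews = [review.lower().split() for review in input_reviews]
print(tokenized_reviews[0])
```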

Step 10: Now, run the classifier on those sentences and obtain the predictions.
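
A sketch of the prediction loop using `prob_classify`, which returns a probability distribution over the labels. The classifier here is trained on a toy dataset of my own so the block runs on its own, and the two sample sentences are made up.

```python
from nltk.classify import NaiveBayesClassifier


def extract_features(words):
    """Map each word in a review to True (bag-of-words presence features)."""
    return {word: True for word in words}


# Toy training data standing in for the movie review feature sets.
features_train = [
    ({"great": True, "fun": True}, "Positive"),
    ({"wonderful": True, "great": True}, "Positive"),
    ({"boring": True, "awful": True}, "Negative"),
    ({"awful": True, "dull": True}, "Negative"),
]
classifier = NaiveBayesClassifier.train(features_train)

input_reviews = ["A great and fun movie", "An awful and dull story"]

predictions = []
for review in input_reviews:
    # Words the classifier never saw during training are ignored.
    probdist = classifier.prob_classify(extract_features(review.lower().split()))
    sentiment = probdist.max()  # most probable label
    predictions.append((review, sentiment, round(probdist.prob(sentiment), 2)))

for review, sentiment, probability in predictions:
    print("Review:", review)
    print("Predicted sentiment:", sentiment)
    print("Probability:", probability)
```

If only the label is needed, `classifier.classify(features)` returns it directly without the probabilities.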

Step 11: Tada! It’s done. Now you can print the output.

As you can see, the model predicted the sentiments with almost 90% accuracy. You can get the entire script from my GitHub account.

I hope this small tutorial helped you understand what sentiment analysis really is. Keep in touch for more cool stuff! ❤️
