Twitter Sentiment Analysis

Analyzing tweets might sound fun 🐦 but it’s actually a tough challenge for beginners. People often struggle with cleaning messy text, choosing the right model, and understanding how to measure performance 📉. Sentiment analysis can feel overwhelming without a clear path.

In this blog post, we will cover how to clean and prepare Twitter data 🧹, train a machine learning model to detect sentiment 🧠, and evaluate how well it performs using real-world examples 📊.

🚀 Let’s explore how we can turn noisy social media posts into meaningful insights using machine learning!

Overview

Item	Details
Category	Classification
Goal	Predict sentiment from tweets
Data Source	Kaggle (Twitter Sentiment Dataset)
Task Type	Multi-class Classification
Data Type	Text
Algorithms	Logistic Regression
Evaluation Metrics	Accuracy, Precision, Recall, F1 Score
IDE	Jupyter
Tools	Pandas, Scikit-learn (TfidfVectorizer, LabelEncoder, LogisticRegression), Matplotlib, Seaborn

Loading

import pandas as pd

df = pd.read_csv('twitter_training.csv')
df.columns = ['id', 'topic', 'sentiment', 'text']
df.head()

A table displaying tweet data including columns for ID, topic, sentiment, and text related to 'Borderlands', with all entries marked as positive sentiment.

✅ Key Insights:

Read the twitter_training.csv file into a DataFrame called df.
Assign new names to the columns for clarity:
- id: Identifier for each tweet.
- topic: The main subject or theme of the tweet.
- sentiment: The label indicating the sentiment (e.g., Positive, Negative, Neutral, Irrelevant).
- text: The actual content of the tweet.
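One thing worth checking before renaming columns: if the CSV file has no header row, pd.read_csv with default settings uses the first tweet as the column names, and that row is lost. The sketch below uses two made-up rows held in memory (not the real file) to show how header=None and names= keep every row. Whether twitter_training.csv actually lacks a header is an assumption to verify by opening the file.

```python
import io
import pandas as pd

# Two made-up rows standing in for a header-less CSV file.
sample_csv = io.StringIO(
    '1,Borderlands,Positive,"first tweet"\n'
    '2,Borderlands,Negative,"second tweet"\n'
)

columns = ['id', 'topic', 'sentiment', 'text']

# header=None tells pandas that line 1 is data; names= supplies the labels.
sample_df = pd.read_csv(sample_csv, header=None, names=columns)
print(sample_df.shape)  # (2, 4): both rows are kept
```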

Exploring

Shape

df.shape

✅ Key Insights:

Output: (74681, 4)
74,681 rows (tweets) and 4 columns (id, topic, sentiment, text)

Info()

df.info()

A screenshot of a pandas DataFrame displaying the structure of a dataset with 74,681 entries and 4 columns: id, topic, sentiment, and text.

✅ Key Insights:

text has 73995 non-null values — meaning 686 missing values.
id is of type int64, while the others (topic, sentiment, text) are object (text data).

Sentiment Distribution

df['sentiment'].value_counts()

Bar chart displaying the distribution of sentiments from a Twitter dataset with categories: Negative, Positive, Neutral, and Irrelevant.

✅ Key Insights:

This shows how the sentiment labels are distributed.
“Irrelevant” is later merged into “Neutral” during cleaning.
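Raw counts are easier to compare as shares of the whole dataset. A minimal sketch, using a handful of made-up labels in place of df['sentiment']:

```python
import pandas as pd

# Made-up labels; in the post this would be df['sentiment'].
labels = pd.Series(['Negative', 'Negative', 'Positive', 'Neutral', 'Irrelevant', 'Negative'])

counts = labels.value_counts()                 # absolute counts per label
shares = labels.value_counts(normalize=True)   # fraction of rows per label

print(counts)
print(shares.round(2))
```

If one label dominates the shares, accuracy alone can look better than the model really is, which is why the later sections also report precision, recall, and F1.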

Unique Values

df.nunique()

A screenshot showing a DataFrame with four columns: 'id', 'topic', 'sentiment', and 'text', along with their respective counts.

✅ Key Insights:

id: 12,447 unique IDs across 74,681 rows, so each ID appears in about six rows on average (several tweets share the same ID).
topic: 32 unique topics.
sentiment: 4 unique labels.
text: 69,490 unique tweets — shows some repeated entries.

Missing Values

df.isnull().sum()

✅ Key Insights:

text: 686 missing entries.
All other columns have 0 missing values.

Duplicate Rows

df.duplicated().sum()

✅ Key Insights:

Output: 2700 duplicate rows.
These duplicates are cleaned later.
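df.duplicated() only flags rows that match an earlier row in every column. A tweet whose text repeats under a different ID or label is not counted. The toy frame below (made-up rows) shows the difference between the two checks:

```python
import pandas as pd

toy = pd.DataFrame({
    'id': [1, 1, 2, 3],
    'sentiment': ['Positive', 'Positive', 'Negative', 'Positive'],
    'text': ['great game', 'great game', 'great game', 'so buggy'],
})

# Rows identical in every column (what df.duplicated() counts by default).
full_row_dupes = toy.duplicated().sum()

# Rows whose text repeats an earlier row, regardless of id or label.
text_dupes = toy.duplicated(subset=['text']).sum()

print(full_row_dupes, text_dupes)  # 1 2
```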

Cleaning

Removing Missing Texts

df = df.dropna(subset=['text'])
df.isnull().sum()

✅ Key Insights:

This line removes rows where the text column is missing.
After this, there are no missing values in the dataset.

Removing Duplicates

df = df.drop_duplicates()
df.duplicated().sum()

✅ Key Insights:

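Every remaining fully duplicated row is dropped. The label counts reported in the next step sum to 71,655 rows, which implies about 2,340 rows are removed here; the rest of the 2,700 duplicates counted earlier had missing text and were already dropped.
The result is a dataset with no fully duplicated rows (the same text can still appear under a different ID or label).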

Merging ‘Irrelevant’ into ‘Neutral’

df['sentiment'] = df['sentiment'].replace('Irrelevant', 'Neutral')
df['sentiment'].value_counts()

✅ Key Insights:

‘Irrelevant’ tweets are merged into the ‘Neutral’ category to simplify the classification task.
New sentiment distribution:
- Neutral: 30,245
- Negative: 21,698
- Positive: 19,712

Cleaning Text

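import re

def clean_text(text):
    if isinstance(text, str):
        text = text.lower()                          # lowercase everything
        text = re.sub(r'http\S+|www\S+', '', text)   # remove URLs
        text = re.sub(r'@\w+', '', text)             # remove @mentions
        text = re.sub(r'#\w+', '', text)             # remove hashtags, including the tag word
        text = re.sub(r'[^\w\s]', '', text)          # remove punctuation and symbols
        text = re.sub(r'\d+', '', text)              # remove digits
        text = re.sub(r'\s+', ' ', text).strip()     # collapse repeated whitespace
        return text
    else:
        # Non-string values (for example NaN) become an empty string.
        return ""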

✅ Key Insights:

This function does multiple things:
- Converts all text to lowercase.
- Removes URLs, mentions, hashtags, punctuation, numbers, and extra spaces.
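To see what these steps do in practice, here is a quick check on a made-up tweet. The function is repeated in compact form only so the snippet runs on its own; the input string and the example.com URL are invented for illustration.

```python
import re

def clean_text(text):
    """Compact copy of the cleaning steps above, for a standalone demo."""
    if not isinstance(text, str):
        return ""
    text = text.lower()
    text = re.sub(r'http\S+|www\S+', '', text)   # URLs
    text = re.sub(r'@\w+', '', text)             # mentions
    text = re.sub(r'#\w+', '', text)             # hashtags (the tag word is dropped too)
    text = re.sub(r'[^\w\s]', '', text)          # punctuation and other symbols
    text = re.sub(r'\d+', '', text)              # digits
    return re.sub(r'\s+', ' ', text).strip()

tweet = "@gearbox Loving #Borderlands 3!!! 10/10 https://example.com"  # made-up input
print(clean_text(tweet))   # loving
print(clean_text(None))    # prints an empty line
```

Note how aggressive the cleaning is: the hashtag word, the rating, and the exclamation marks all disappear, leaving a single token. That keeps the vocabulary small, but it also discards signals (like "10/10") that a person would read as sentiment.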

Applying the Cleaning Function

df['clean_text'] = df['text'].apply(clean_text)
df[['text', 'clean_text']].head(20)

✅ Key Insights:

A new column clean_text is created containing cleaned tweet text.
Example:
- Original: "I'm getting on Borderlands and I will kill you!"
- Cleaned: "im getting on borderlands and i will kill you"

Preprocessing

Splitting the Data

from sklearn.model_selection import train_test_split

X = df['clean_text']
y = df['sentiment']

X_text_train, X_text_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

✅ Key Insights:

X: Features (cleaned tweet texts)
y: Labels (sentiment)
train_test_split: Splits data into training (80%) and testing (20%) sets.
random_state=42 ensures the same split each time for reproducibility.
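The split above is purely random, so the label proportions in the test set can drift a little from those in the full dataset. One common alternative, shown here as a sketch on made-up labels rather than as what this post uses, is to pass stratify= so each class keeps its share in both splits:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up, imbalanced toy labels: 60 Neutral, 30 Negative, 10 Positive.
texts = [f"tweet {i}" for i in range(100)]
labels = ['Neutral'] * 60 + ['Negative'] * 30 + ['Positive'] * 10

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print(Counter(y_te))  # 12 Neutral, 6 Negative, 2 Positive
```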

Vectorization

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000)

X_train = vectorizer.fit_transform(X_text_train)
X_test = vectorizer.transform(X_text_test)

✅ Key Insights:

Importing TF-IDF Vectorizer
- TF-IDF stands for Term Frequency–Inverse Document Frequency; TfidfVectorizer is scikit-learn's implementation of it.
- It weights each word by how often it appears in a tweet, discounted by how many tweets in the dataset contain it.
Creating the Vectorizer
- max_features=5000 limits the vocabulary to the 5,000 terms with the highest total frequency in the training texts.
- This keeps the feature space manageable while retaining important information.
Fitting and Transforming the Training Data
- Learns the vocabulary and the IDF weights from the training data only.
- Converts the training tweets into a sparse matrix of TF-IDF scores, one row per tweet.
Transforming the Test Data
- Applies the same vocabulary learned from the training set to the test set.
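To make the weighting concrete, the sketch below computes TF-IDF by hand on three made-up, already-cleaned tweets. It assumes scikit-learn's default settings (smoothed IDF, raw term counts, L2 normalization of each row); it is an illustration of the formula, not a replacement for TfidfVectorizer.

```python
import math
from collections import Counter

docs = ["good game", "bad game", "good good fun"]   # made-up, already-cleaned tweets
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many tweets does each term appear?
df_counts = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """TF-IDF with smoothed IDF and L2 normalization (scikit-learn's defaults)."""
    tf = Counter(tokens)
    raw = {t: c * (math.log((1 + n_docs) / (1 + df_counts[t])) + 1) for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

for doc, tokens in zip(docs, tokenized):
    print(doc, {t: round(w, 3) for t, w in tfidf(tokens).items()})
```

In "bad game", the word "bad" ends up with a higher weight than "game" because it appears in only one tweet, while "game" appears in two.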

Encoding

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

y_train = le.fit_transform(y_train)
y_test = le.transform(y_test)

✅ Key Insights:

Importing LabelEncoder
- LabelEncoder is a simple utility that converts text labels into integers.
Creating and Fitting the Encoder
- fit_transform: learns the mapping from text → numbers and applies it to y_train.
- transform: applies the same mapping to y_test.
- LabelEncoder sorts the class names alphabetically, so the mapping here is:
  - 'Negative' → 0
  - 'Neutral' → 1
  - 'Positive' → 2
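Rather than relying on memory, the mapping can be read straight from the fitted encoder. A small sketch with made-up labels (le_demo is a throwaway name so it does not clash with le above):

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(['Positive', 'Neutral', 'Negative', 'Neutral'])

# classes_ is sorted, so the integer codes follow alphabetical order.
mapping = {label: int(code) for code, label in enumerate(le_demo.classes_)}
print(mapping)            # {'Negative': 0, 'Neutral': 1, 'Positive': 2}
print(encoded.tolist())   # [2, 1, 0, 1]
```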

Training

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

✅ Key Insights:

A Logistic Regression model is trained using the TF-IDF vectorized tweets and numerically encoded labels.
max_iter=1000: Raises the cap on solver iterations (the scikit-learn default is 100) so the optimizer has enough room to converge on the high-dimensional TF-IDF features.
The model is now ready to make predictions on new/unseen tweets.
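New tweets have to go through the same vectorizer that was fitted on the training data before the model can score them. One way to keep those steps together is a scikit-learn pipeline. The sketch below is an alternative arrangement, not the code used in this post: it trains on six made-up tweets and passes string labels directly instead of using LabelEncoder. In practice the new tweets would also go through clean_text first.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up training set, only to make the sketch runnable.
train_texts = ["love this game", "great fun love it", "hate this bug", "awful hate it",
               "patch released today", "servers open today"]
train_labels = ["Positive", "Positive", "Negative", "Negative", "Neutral", "Neutral"]

# The pipeline applies the fitted vectorizer before the classifier on every call.
pipe = make_pipeline(TfidfVectorizer(max_features=5000), LogisticRegression(max_iter=1000))
pipe.fit(train_texts, train_labels)

new_tweets = ["i love it", "hate hate hate"]
print(pipe.predict(new_tweets))
print(pipe.predict_proba(new_tweets).round(2))  # columns follow pipe.classes_
```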

Prediction

y_pred = model.predict(X_test)

✅ Key Insights:

The model predicts sentiment labels on the test data.
These predictions are stored and will be evaluated and visualized next.

Evaluation

Accuracy Score

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

accuracy_score(y_test, y_pred)

✅ Key Insights:

Output: 0.7247
This means the model correctly predicted sentiment for about 72.5% of the tweets.
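Accuracy is simply the share of predictions that match the true label, and it is easier to judge next to a naive baseline. The sketch below computes accuracy by hand on made-up labels, then derives a majority-class baseline from the label counts reported earlier in this post (always predicting Neutral). The baseline is computed on the full cleaned dataset, so the exact figure for the test split may differ slightly.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up labels for illustration.
y_true = ['Neutral', 'Negative', 'Positive', 'Neutral', 'Negative']
y_hat  = ['Neutral', 'Negative', 'Neutral',  'Neutral', 'Positive']
print(accuracy(y_true, y_hat))  # 0.6

# Majority-class baseline from the label counts reported in this post.
counts = Counter({'Neutral': 30245, 'Negative': 21698, 'Positive': 19712})
baseline = max(counts.values()) / sum(counts.values())
print(round(baseline, 3))  # 0.422
```

Against a baseline of roughly 42%, an accuracy of about 72.5% shows the model has learned a real signal from the text.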

Classification Report

print(classification_report(y_test, y_pred, target_names=le.classes_))

✅ Key Insights:

The model performs best on Negative and Neutral tweets.
Positive tweets are more often misclassified (lower recall).
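For readers new to these metrics, the sketch below computes precision, recall, and F1 for one class by hand. The labels are made up to mimic the pattern described above, where some Positive tweets are predicted as Neutral; the numbers are not the model's actual results.

```python
def precision_recall_f1(y_true, y_pred, label):
    """One-vs-rest precision, recall and F1 for a single class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels: two of the four Positive tweets are predicted as Neutral.
y_true = ['Positive'] * 4 + ['Neutral'] * 4
y_hat  = ['Positive', 'Positive', 'Neutral', 'Neutral'] + ['Neutral'] * 4

print(precision_recall_f1(y_true, y_hat, 'Positive'))  # roughly (1.0, 0.5, 0.667)
print(precision_recall_f1(y_true, y_hat, 'Neutral'))   # roughly (0.667, 1.0, 0.8)
```

Missed Positive tweets lower Positive recall and, at the same time, lower Neutral precision, because they land in the Neutral column as false positives.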

Confusion Matrix

confusion_matrix(y_test, y_pred)

import matplotlib.pyplot as plt
import seaborn as sns

cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=le.classes_, yticklabels=le.classes_)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

✅ Key Insights:

Positive tweets had the most confusion, often mistaken for Neutral.
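Raw counts in a confusion matrix are hard to compare when the classes have different sizes. Dividing each row by its total turns the diagonal into per-class recall. The matrix below is made up for illustration and is not the output of the model in this post:

```python
import numpy as np

# Made-up counts; rows are actual classes, columns are predicted classes.
class_names = ['Negative', 'Neutral', 'Positive']
cm_demo = np.array([
    [80, 15,  5],
    [10, 85,  5],
    [ 5, 35, 60],
])

# Divide each row by its total so the diagonal becomes per-class recall.
row_share = cm_demo / cm_demo.sum(axis=1, keepdims=True)

for name, row in zip(class_names, row_share):
    print(name, row.round(2))
```

In this toy matrix, 35% of the actual Positive tweets fall into the Neutral column, which is the kind of pattern to look for in the real heatmap.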

Visualization

results = pd.DataFrame({
    'Tweet': X_text_test.values,
    'True Sentiment': le.inverse_transform(y_test),
    'Predicted Sentiment': le.inverse_transform(y_pred)
})

results.sample(10)

✅ Key Insights:

With about 72.5% test accuracy, the model gets the sentiment of most tweets right.
It sometimes misclassifies short, vague, or nuanced tweets, especially between Neutral and other sentiments.
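Random samples are a good start, but filtering for the mistakes is usually more informative. A minimal sketch on a made-up table with the same column names as results (results_demo is a hypothetical stand-in):

```python
import pandas as pd

# Made-up rows in the same shape as the results table above.
results_demo = pd.DataFrame({
    'Tweet': ['love it', 'ok', 'worst patch ever', 'new map today'],
    'True Sentiment': ['Positive', 'Positive', 'Negative', 'Neutral'],
    'Predicted Sentiment': ['Positive', 'Neutral', 'Negative', 'Neutral'],
})

# Keep only rows where the prediction disagrees with the label.
errors = results_demo[results_demo['True Sentiment'] != results_demo['Predicted Sentiment']]

# Count each (true, predicted) pair to see which confusions dominate.
print(errors.groupby(['True Sentiment', 'Predicted Sentiment']).size())
```

Reading through the filtered rows often shows whether the errors come from very short tweets, sarcasm, or labels that are themselves questionable.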