Analyzing tweets might sound fun 🐦 but it’s actually a tough challenge for beginners. People often struggle with cleaning messy text, choosing the right model, and understanding how to measure performance 📉. Sentiment analysis can feel overwhelming without a clear path.
In this blog post, we will cover how to clean and prepare Twitter data 🧹, train a machine learning model to detect sentiment 🧠, and evaluate how well it performs using real-world examples 📊.
🚀 Let’s explore how we can turn noisy social media posts into meaningful insights using machine learning!
Overview
| Item | Details |
|---|---|
| Category | Classification |
| Goal | Predict sentiment from tweets |
| Data Source | Kaggle (Twitter Sentiment Dataset) |
| Task Type | Multi-class Classification |
| Data Type | Text |
| Algorithms | Logistic Regression |
| Evaluation Metrics | Accuracy, Precision, Recall, F1 Score |
| IDE | Jupyter |
| Tools | Pandas, Scikit-learn (TfidfVectorizer, LabelEncoder, LogisticRegression), Matplotlib, Seaborn |
Loading
import pandas as pd
df = pd.read_csv('twitter_training.csv')
df.columns = ['id', 'topic', 'sentiment', 'text']
df.head()

✅ Key Insights:
- Read the `twitter_training.csv` file into a DataFrame called `df`.
- Assign new names to the columns for clarity:
  - `id`: Identifier for each tweet.
  - `topic`: The main subject or theme of the tweet.
  - `sentiment`: The label indicating the sentiment (e.g., Positive, Negative, Neutral, Irrelevant).
  - `text`: The actual content of the tweet.
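One thing worth double-checking at this step: by default `pd.read_csv` treats the first line of a file as the header. The sketch below uses a tiny in-memory CSV with two made-up rows (not the Kaggle data) to show how renaming columns afterwards differs from passing `names=` up front. Whether the real file starts with a header row is an assumption you should verify on your own copy.

```python
import io
import pandas as pd

# Two invented rows standing in for a header-less CSV (not the real dataset).
raw = "1,TopicA,Positive,great game\n2,TopicB,Negative,bad patch\n"
cols = ['id', 'topic', 'sentiment', 'text']

# Default behaviour: the first line is consumed as the header, then renamed.
df_default = pd.read_csv(io.StringIO(raw))
df_default.columns = cols

# Passing names= tells pandas there is no header line, so every row is kept.
df_named = pd.read_csv(io.StringIO(raw), names=cols)

print(len(df_default), len(df_named))  # 1 2
```

If the file does have a proper header line, the default call is the right one and `names=` would turn that header into a data row instead.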
Exploring
Shape
df.shape
✅ Key Insights:
- Output: `(74681, 4)`
- 74,681 rows (tweets) and 4 columns (`id`, `topic`, `sentiment`, `text`)
Info()
df.info()

✅ Key Insights:
- `text` has 73,995 non-null values, meaning 686 missing values.
- `id` is of type `int64`, while the others (`topic`, `sentiment`, `text`) are `object` (text data).
Sentiment Distribution
df['sentiment'].value_counts()

✅ Key Insights:
- This shows how the sentiment labels are distributed.
- “Irrelevant” is later merged into “Neutral” during cleaning.
Unique Values
df.nunique()

✅ Key Insights:
- `id`: 12,447 unique IDs (meaning many tweets are part of the same thread or topic).
- `topic`: 32 unique topics.
- `sentiment`: 4 unique labels.
- `text`: 69,490 unique tweets, which shows some repeated entries.
Missing Values
df.isnull().sum()

✅ Key Insights:
- `text`: 686 missing entries.
- All other columns have 0 missing values.
Duplicate Rows
df.duplicated().sum()
✅ Key Insights:
- Output: `2700` duplicate rows.
- These duplicates are cleaned later.
Cleaning
Removing Missing Texts
df = df.dropna(subset=['text'])
df.isnull().sum()
✅ Key Insights:
- This line removes rows where the `text` column is missing.
- After this, there are no missing values in the dataset.
Removing Duplicates
df = df.drop_duplicates()
df.duplicated().sum()
✅ Key Insights:
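- Exact duplicate rows (identical in every column) are dropped. Going by the counts in this post (73,995 rows before this step and 71,655 after it), that is 2,340 rows; the rest of the 2,700 duplicates counted earlier had missing text and were already removed.
- The result is a cleaner dataset with no repeated rows.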
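To see what the two cleaning calls do, here is a small sketch on a toy DataFrame. The rows are invented for illustration, and `toy`, `no_missing`, and `deduped` are hypothetical names that are not part of the original notebook.

```python
import pandas as pd

# Invented rows: one exact duplicate and one row with missing text.
toy = pd.DataFrame({
    'id': [1, 1, 2, 3],
    'sentiment': ['Positive', 'Positive', 'Negative', 'Neutral'],
    'text': ['love it', 'love it', None, 'ok'],
})

no_missing = toy.dropna(subset=['text'])  # drops the row whose text is missing
deduped = no_missing.drop_duplicates()    # drops rows identical in every column

print(len(toy), len(no_missing), len(deduped))  # 4 3 2
```

Note that `drop_duplicates()` with no arguments compares all columns, so two rows with the same text but different labels would both be kept.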
Merging ‘Irrelevant’ into ‘Neutral’
df['sentiment'] = df['sentiment'].replace('Irrelevant', 'Neutral')
df['sentiment'].value_counts()
✅ Key Insights:
- ‘Irrelevant’ tweets are merged into the ‘Neutral’ category to simplify the classification task.
- New sentiment distribution:
- Neutral: 30,245
- Negative: 21,698
- Positive: 19,712
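These counts also give a handy reference point for judging accuracy. A quick calculation from the numbers above shows the share of each class; a model that always answered "Neutral" would be right roughly 42% of the time on data with this label mix.

```python
# Counts copied from the distribution listed above.
counts = {'Neutral': 30245, 'Negative': 21698, 'Positive': 19712}

total = sum(counts.values())
shares = {label: round(n / total, 3) for label, n in counts.items()}

print(total)   # 71655
print(shares)  # {'Neutral': 0.422, 'Negative': 0.303, 'Positive': 0.275}
```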
Cleaning Text
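import re

def clean_text(text):
    """Lowercase a tweet and strip URLs, mentions, hashtags, punctuation, and digits."""
    if isinstance(text, str):
        text = text.lower()
        text = re.sub(r'http\S+|www\S+', '', text)  # URLs
        text = re.sub(r'@\w+', '', text)            # @mentions
        text = re.sub(r'#\w+', '', text)            # hashtags (symbol and word)
        text = re.sub(r'[^\w\s]', '', text)         # punctuation
        text = re.sub(r'\d+', '', text)             # numbers
        text = re.sub(r'\s+', ' ', text).strip()    # collapse extra whitespace
        return text
    else:
        # Non-string input (e.g. NaN) becomes an empty string.
        return ""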
✅ Key Insights:
- This function does multiple things:
- Converts all text to lowercase.
- Removes URLs, mentions, hashtags, punctuation, numbers, and extra spaces.
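A quick way to sanity-check the function is to run it on a single made-up tweet. The block below repeats the function so it runs on its own; the sample tweet and the `sample` name are invented for illustration.

```python
import re

def clean_text(text):
    if isinstance(text, str):
        text = text.lower()
        text = re.sub(r'http\S+|www\S+', '', text)  # URLs first, while '/' and '.' are still present
        text = re.sub(r'@\w+', '', text)            # @mentions
        text = re.sub(r'#\w+', '', text)            # hashtags, including the word itself
        text = re.sub(r'[^\w\s]', '', text)         # punctuation
        text = re.sub(r'\d+', '', text)             # numbers
        text = re.sub(r'\s+', ' ', text).strip()    # collapse whitespace
        return text
    return ""

sample = "Loving the new update!! 10/10 @devteam #gaming https://example.com/patch"
print(clean_text(sample))      # loving the new update
print(repr(clean_text(None)))  # ''
```

The order of the substitutions matters: URLs are removed before punctuation, otherwise stripping `:` and `/` would leave URL fragments behind as ordinary words. Also note that the hashtag rule deletes the tagged word, so `#gaming` contributes nothing to the cleaned text.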
Applying the Cleaning Function
df['clean_text'] = df['text'].apply(clean_text)
df[['text', 'clean_text']].head(20)

✅ Key Insights:
- A new column `clean_text` is created containing cleaned tweet text.
- Example:
  - Original: "I'm getting on Borderlands and I will kill you!"
  - Cleaned: "im getting on borderlands and i will kill you"
Preprocessing
Splitting the Data
from sklearn.model_selection import train_test_split
X = df['clean_text']
y = df['sentiment']
X_text_train, X_text_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
✅ Key Insights:
- `X`: Features (cleaned tweet texts)
- `y`: Labels (sentiment)
- `train_test_split`: Splits data into training (80%) and testing (20%) sets.
- `random_state=42` ensures the same split each time for reproducibility.
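Because the three classes are not equally sized, one optional variation (not used in this post) is to pass `stratify=y`, which keeps the label proportions roughly the same in both splits. Here is a minimal sketch on invented data; `X_toy` and `y_toy` are hypothetical names.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten invented texts with an imbalanced label mix (6 Neutral, 4 Positive).
X_toy = pd.Series([f"tweet {i}" for i in range(10)])
y_toy = pd.Series(['Neutral'] * 6 + ['Positive'] * 4)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(len(X_tr), len(X_te))  # 8 2
print(y_te.value_counts().to_dict())
```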
Vectorization
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(X_text_train)
X_test = vectorizer.transform(X_text_test)
✅ Key Insights:
- Importing TF-IDF Vectorizer
  - TF-IDF stands for Term Frequency–Inverse Document Frequency.
  - It weights each word by how often it appears in a tweet and how rare it is across the entire dataset.
- Creating the Vectorizer
  - `max_features=5000` limits the vocabulary to the 5,000 terms with the highest total count in the training texts.
  - This keeps the feature space manageable while retaining important information.
- Fitting and Transforming the Training Data
  - Learns the vocabulary and IDF weights from the training data.
  - Converts the training tweets into a sparse matrix of TF-IDF scores, one row per tweet.
- Transforming the Test Data
  - Applies the same vocabulary learned from the training set to the test set.
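The fit/transform split is easier to see on a tiny example. The sketch below uses three invented training texts and `max_features=3` so the effect of the vocabulary cap is visible; the texts and variable names are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["good game", "bad game", "good good fun fun"]
test_texts = ["good unknownword"]

vec = TfidfVectorizer(max_features=3)
X_tr = vec.fit_transform(train_texts)  # learns vocabulary and IDF weights from training texts only
X_te = vec.transform(test_texts)       # reuses that vocabulary; unseen words are ignored

print(sorted(vec.vocabulary_))  # ['fun', 'game', 'good']
print(X_tr.shape, X_te.shape)   # (3, 3) (1, 3)
```

"bad" is dropped because it is the least frequent term in the training texts, and "unknownword" never appears in the test matrix because it was not in the learned vocabulary. Fitting only on the training split keeps information about the test tweets out of the features.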
Encoding
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_train = le.fit_transform(y_train)
y_test = le.transform(y_test)
✅ Key Insights:
- Importing LabelEncoder
  - `LabelEncoder` is a simple utility that converts text labels into integers.
- Creating and Fitting the Encoder
  - `fit_transform`: learns the mapping from text → numbers and applies it to `y_train`.
  - `transform`: applies the same mapping to `y_test`.
  - Resulting mapping: `'Negative'` → 0, `'Neutral'` → 1, `'Positive'` → 2
(Note: `LabelEncoder` assigns integers in the sorted order of the label names, which is why the mapping comes out this way.)
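If you want to confirm the mapping instead of taking it on faith, the fitted encoder exposes it through `classes_`. A small sketch with a hypothetical `le_demo` encoder fitted on a few example labels:

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()
codes = le_demo.fit_transform(['Positive', 'Neutral', 'Negative', 'Neutral'])

# classes_ holds the labels in sorted order; each label's position is its integer code.
mapping = {label: code for code, label in enumerate(le_demo.classes_)}

print(mapping)         # {'Negative': 0, 'Neutral': 1, 'Positive': 2}
print(codes.tolist())  # [2, 1, 0, 1]
```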
Training
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
✅ Key Insights:
- A Logistic Regression model is trained using the TF-IDF vectorized tweets and numerically encoded labels.
- `max_iter=1000`: Raises the cap on solver iterations (the scikit-learn default is 100) so the optimizer has more room to converge on high-dimensional TF-IDF features.
- The model is now ready to make predictions on new/unseen tweets.
Prediction
y_pred = model.predict(X_test)
✅ Key Insights:
- The model predicts sentiment labels on the test data.
- These predictions are stored and will be evaluated and visualized next.
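To score brand-new tweets, the vectorizer and the model have to be applied in the same order as during training. One optional way to package that (a sketch, not the code used in this post) is scikit-learn's `make_pipeline`. The six training sentences below are invented and far too few for a real model; they are only there so the example runs.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Invented mini training set, for illustration only.
texts = ["i love this game", "this is awesome", "i hate this bug",
         "this is terrible", "the patch is out today", "servers open at noon"]
labels = ["Positive", "Positive", "Negative", "Negative", "Neutral", "Neutral"]

# The pipeline vectorizes raw text and then classifies it in one call.
pipe = make_pipeline(TfidfVectorizer(max_features=5000),
                     LogisticRegression(max_iter=1000))
pipe.fit(texts, labels)

new_tweets = ["love the patch", "terrible servers"]
preds = pipe.predict(new_tweets)
print(list(zip(new_tweets, preds)))
```

With a pipeline, string labels can be passed directly and the predictions come back as label names, so no separate decoding step is needed. In the notebook's setup, the equivalent would be cleaning the new text, calling `vectorizer.transform`, then `model.predict`, then `le.inverse_transform`.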
Evaluation
Accuracy Score
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
accuracy_score(y_test, y_pred)
✅ Key Insights:
- Output: `0.7247`
- This means the model correctly predicted sentiment for about 72.5% of the test tweets.
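Accuracy is simply the fraction of predictions that match the true label. Here is the same calculation done by hand on eight made-up labels (these are not the model's real predictions):

```python
# Invented encoded labels: 0 = Negative, 1 = Neutral, 2 = Positive.
y_true = [0, 1, 2, 1, 0, 2, 1, 1]
y_hat  = [0, 1, 1, 1, 0, 2, 2, 1]

correct = sum(t == p for t, p in zip(y_true, y_hat))
accuracy = correct / len(y_true)

print(correct, len(y_true), accuracy)  # 6 8 0.75
```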
Classification Report
print(classification_report(y_test, y_pred, target_names=le.classes_))

✅ Key Insights:
- The model performs best on Negative and Neutral tweets.
- Positive tweets are more often misclassified (lower recall).
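For readers new to these metrics, the per-class numbers in the report come from three counts: true positives, false positives, and false negatives. A hand-rolled sketch for one class, using made-up labels rather than the model's output:

```python
# Invented encoded labels: 0 = Negative, 1 = Neutral, 2 = Positive.
y_true = [0, 1, 2, 1, 0, 2, 1, 1]
y_hat  = [0, 1, 1, 1, 0, 2, 2, 1]
cls = 2  # evaluate the Positive class

tp = sum(t == cls and p == cls for t, p in zip(y_true, y_hat))  # predicted Positive, was Positive
fp = sum(t != cls and p == cls for t, p in zip(y_true, y_hat))  # predicted Positive, was not
fn = sum(t == cls and p != cls for t, p in zip(y_true, y_hat))  # was Positive, predicted otherwise

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)  # 0.5 0.5 0.5
```

Low recall for a class means many of its true examples were labelled as something else, which is what "more often misclassified" refers to.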
Confusion Matrix
confusion_matrix(y_test, y_pred)

from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=le.classes_, yticklabels=le.classes_)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

✅ Key Insights:
- Positive tweets had the most confusion, often mistaken for Neutral.
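Raw counts can be hard to compare when classes differ in size. One optional extra (assuming scikit-learn 0.22 or newer, where the `normalize` argument is available) is to normalize each row so it shows the fraction of each true class that went to each prediction. The labels below are invented for illustration.

```python
from sklearn.metrics import confusion_matrix

# Invented encoded labels: 0 = Negative, 1 = Neutral, 2 = Positive.
y_true = [0, 1, 2, 1, 0, 2, 1, 1]
y_hat  = [0, 1, 1, 1, 0, 2, 2, 1]

cm_counts = confusion_matrix(y_true, y_hat)
cm_rates = confusion_matrix(y_true, y_hat, normalize='true')  # each row sums to 1

print(cm_counts.tolist())  # [[2, 0, 0], [0, 3, 1], [0, 1, 1]]
print(cm_rates.tolist())   # [[1.0, 0.0, 0.0], [0.0, 0.75, 0.25], [0.0, 0.5, 0.5]]
```

In the normalized matrix the diagonal is the recall of each class, so in this toy example half of the Positive items (row 2) were predicted as Neutral.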
Visualization
results = pd.DataFrame({
    'Tweet': X_text_test.values,
    'True Sentiment': le.inverse_transform(y_test),
    'Predicted Sentiment': le.inverse_transform(y_pred)
})
results.sample(10)

✅ Key Insights:
- The model performs reasonably well on tweets.
- It sometimes misclassifies short, vague, or nuanced tweets, especially between Neutral and other sentiments.
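Random samples are a good start, but filtering down to the mistakes makes patterns easier to spot. A small sketch on an invented table with the same column names; `results_demo`, `errors`, and `pairs` are hypothetical names.

```python
import pandas as pd

# Invented rows shaped like the results table above.
results_demo = pd.DataFrame({
    'Tweet': ['ok i guess', 'love it', 'worst update ever'],
    'True Sentiment': ['Positive', 'Positive', 'Negative'],
    'Predicted Sentiment': ['Neutral', 'Positive', 'Negative'],
})

# Keep only the rows where the prediction disagrees with the label.
errors = results_demo[results_demo['True Sentiment'] != results_demo['Predicted Sentiment']]

# Count how often each (true, predicted) pair occurs among the errors.
pairs = errors.groupby(['True Sentiment', 'Predicted Sentiment']).size()

print(errors)
print(pairs)
```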
Code
👉 Download












