Predicting who survived the Titanic disaster isn’t just a classic machine learning task—it’s also a real challenge filled with tricky patterns and messy data ⚠️. Many people struggle with cleaning the dataset, converting categories into numbers, and figuring out which features actually matter 🤔.
In this blog post, we will cover how to explore the Titanic dataset using visualizations, train a logistic regression model, and evaluate its performance using key metrics like precision and recall.
🚀 Let’s dive into the world of Titanic classification and uncover how we can turn raw passenger data into meaningful survival predictions!
Overview
| Item | Details |
|---|---|
| Category | Classification |
| Goal | Predict survival of passengers on the Titanic |
| Data Source | Kaggle (Titanic Dataset) |
| Task Type | Binary Classification |
| Data Type | Tabular |
| Algorithms | Logistic Regression |
| Evaluation Metrics | Accuracy, Precision, Recall, F1 Score |
| IDE | Jupyter |
| Tools | Pandas, Scikit-learn, Matplotlib, Seaborn |
Loading
import pandas as pd
df = pd.read_csv('train.csv')
df.head()

✅ Key Insights:
- pd.read_csv('train.csv'): Reads the CSV file into a Pandas DataFrame.
- df.head(): Displays the first 5 rows to confirm the data is loaded correctly.
Exploring
🔍 Basic Exploration
df.shape
- Shows that the dataset has 891 rows and 12 columns.
df.describe()

✅ Key Insights
- Survival: About 38% of passengers survived. Most did not. (Survival: 0 = No, 1 = Yes)
- Passenger Class: Many people were in 3rd class, which was the lowest.
- Age: The average age was about 30 years old, but some data is missing.
- Family: Most people traveled alone (no siblings, spouses, or parents with them).
- Fare: Ticket prices varied widely. Some passengers paid nothing, while others paid over 512.
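The numbers in these insights can also be read off with a few one-liners instead of scanning the describe() table. The sketch below uses a tiny hypothetical sample (the `sample` DataFrame and its values are made up for illustration); on the real data you would run the same expressions on df.

```python
import pandas as pd

# Hypothetical six-row sample with the same column names as train.csv.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1],
    "Pclass":   [3, 1, 3, 3, 2, 1],
    "SibSp":    [1, 1, 0, 0, 0, 0],
    "Parch":    [0, 0, 0, 0, 0, 2],
})

# The mean of a 0/1 column is the share of 1s, i.e. the survival rate.
survival_rate = sample["Survived"].mean()

# Number of passengers per class, ordered by class number.
class_counts = sample["Pclass"].value_counts().sort_index()

# Share of passengers with no siblings, spouses, parents, or children aboard.
alone_share = ((sample["SibSp"] + sample["Parch"]) == 0).mean()

print(f"Survival rate: {survival_rate:.2f}")
print(class_counts)
print(f"Travelling alone: {alone_share:.2f}")
```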
📊 Data Visualization
Survival Count
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(6, 4))
sns.countplot(data=df, x='Survived')
plt.title('Survival Count')
plt.xlabel('Survived (0 = No, 1 = Yes)')
plt.ylabel('Count')
plt.show()

✅ Key Insights:
- About 340 passengers (roughly 38%) survived. Most did not.
Survival by Sex
plt.figure(figsize=(6, 4))
sns.barplot(x='Sex', y='Survived', data=df)
plt.title('Survival by Sex')
plt.xlabel('Sex')
plt.ylabel('Survival Rate')
plt.show()

✅ Key Insights:
- Females had a much higher survival rate (around 75%) than males (around 19%).
- This reflects the “women and children first” policy used during the Titanic evacuation.
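If you want the exact rates behind the bars, a groupby gives the same information as numbers. This is a minimal sketch on a hypothetical eight-row sample (the values are invented); on the real data, swap `sample` for df.

```python
import pandas as pd

# Hypothetical sample; the real dataset has 891 rows.
sample = pd.DataFrame({
    "Sex":      ["male", "female", "female", "male", "male", "female", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 1, 0, 0],
})

# sns.barplot plots the mean of y for each x category by default,
# so a groupby mean returns the bar heights as numbers.
rate_by_sex = sample.groupby("Sex")["Survived"].mean()
print(rate_by_sex)
```

The same pattern works for the other bar plots in this section by grouping on 'Pclass' or 'Embarked' instead.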
Survival by Passenger Class
plt.figure(figsize=(6, 4))
sns.barplot(x='Pclass', y='Survived', data=df)
plt.title('Survival by Passenger Class')
plt.xlabel('Passenger Class')
plt.ylabel('Survival Rate')
plt.show()

✅ Key Insights:
- 1st class passengers had the highest survival rate.
- Survival rates decreased as class number increased (i.e., 3rd class had the lowest).
Survival by Port of Embarkation
plt.figure(figsize=(6, 4))
sns.barplot(x='Embarked', y='Survived', data=df)
plt.title('Survival by Port of Embarkation')
plt.xlabel('Embarked')
plt.ylabel('Survival Rate')
plt.show()

✅ Key Insights:
- Passengers who boarded at Cherbourg (C) had the highest survival rate.
Age Distribution by Survival
plt.figure(figsize=(6, 4))
sns.histplot(data=df, x='Age', hue='Survived', bins=30, kde=True)
plt.title('Age Distribution by Survival')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()

✅ Key Insights:
- Survivors appear across all age ranges.
- Children seem to have had slightly better chances than adults.
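A histogram makes it hard to compare rates between age ranges, so one option is to bin ages and compute the survival rate per bin. The sketch below is an assumption-heavy illustration: the bin edges, the group labels, and the sample values are all hypothetical choices, not part of the original analysis.

```python
import pandas as pd

# Hypothetical sample of ages and outcomes.
sample = pd.DataFrame({
    "Age":      [4, 9, 15, 22, 30, 41, 58, 70],
    "Survived": [1, 1, 0, 0, 1, 0, 1, 0],
})

# Illustrative bin edges; intervals are right-inclusive, e.g. (0, 12].
bins = [0, 12, 18, 60, 120]
labels = ["child", "teen", "adult", "senior"]
sample["AgeGroup"] = pd.cut(sample["Age"], bins=bins, labels=labels)

# Survival rate within each age group.
rate_by_age_group = sample.groupby("AgeGroup", observed=True)["Survived"].mean()
print(rate_by_age_group)
```

Rows with a missing Age get no group and are left out of the averages, so this check is best run before the missing values are filled.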
Fare Distribution
plt.figure(figsize=(6, 4))
sns.histplot(data=df, x='Fare', hue='Survived', bins=40, kde=True)
plt.title('Fare Distribution')
plt.xlabel('Fare')
plt.ylabel('Frequency')
plt.show()

✅ Key Insights:
- Survivors generally paid higher fares.
- People in cheaper fare ranges were less likely to survive.
Fare Distribution by Survival (Box Plot)
plt.figure(figsize=(6, 4))
sns.boxplot(data=df, x='Survived', y='Fare')
plt.title('Fare Distribution by Survival')
plt.xlabel('Survived (0 = No, 1 = Yes)')
plt.ylabel('Fare')
plt.show()

✅ Key Insights:
- Survivors tend to have a higher median fare.
- There’s also a wide range of fare values among survivors, with some extreme high values (outliers).
Cleaning
Missing Values
df.isnull().sum()

✅ Key Insights:
- Cabin has too many missing values (over 75%), so it may not be useful.
- Age is important, so it will be filled.
- Embarked only has 2 missing, so it can be filled easily.
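Raw counts are easier to judge as percentages. Since isnull() returns True/False values, taking the mean of each column gives the fraction that is missing. The sample below is hypothetical; on the real data the same expression on df should show Cabin above 75%.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row sample with gaps in Age and Cabin.
sample = pd.DataFrame({
    "Age":      [22.0, np.nan, 26.0, 35.0],
    "Cabin":    [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S"],
})

# Fraction of missing values per column, expressed as a percentage.
missing_pct = sample.isnull().mean() * 100
print(missing_pct.sort_values(ascending=False))
```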
Drop Unnecessary Columns
df.drop(columns=['PassengerId', 'Cabin', 'Ticket', 'Name'], inplace=True)
✅ Key Insights:
- PassengerId: Just an index, not useful for prediction.
- Cabin: Too many missing values.
- Ticket and Name: Hard to extract meaningful info without advanced processing.
Fill Missing values
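# Assign the result back: fillna(inplace=True) on a selected column is deprecated in recent pandas
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])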
✅ Key Insights:
- The median is robust to outliers and represents a typical age.
- The mode is the most frequent value (in this case likely 'S').
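To see why the median is the safer fill value, here is a small sketch with made-up ages. Adding a single 80-year-old pulls the mean up sharply, while the median barely moves.

```python
from statistics import mean, median

# Hypothetical ages, chosen only to illustrate the effect of one outlier.
ages = [22, 24, 26, 28, 30]
ages_with_outlier = ages + [80]

mean_before, median_before = mean(ages), median(ages)
mean_after, median_after = mean(ages_with_outlier), median(ages_with_outlier)

print("Without outlier:", mean_before, median_before)
print("With outlier:   ", mean_after, median_after)
```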
Encoding
Label Encoding: 'Sex'
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
✅ Key Insights:
- This is a binary category, so simple label encoding is enough.
One-Hot Encoding: 'Embarked'
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)

✅ Key Insights:
- Creates two new columns: Embarked_Q, Embarked_S
- Drops one category (Embarked_C) to avoid the dummy variable trap (multicollinearity).
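A tiny example shows what drop_first=True does in practice. The four-row `sample` below is hypothetical. Note that a passenger from Cherbourg is still fully described: both remaining columns are 0 (or False, depending on your pandas version).

```python
import pandas as pd

# Hypothetical sample covering all three ports.
sample = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Categories are sorted alphabetically, so the first one (C) is dropped.
encoded = pd.get_dummies(sample, columns=["Embarked"], drop_first=True)
print(encoded)
```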
Feature Selection
Correlation with Survival
df.corr()['Survived'].sort_values(ascending=False)
| Feature | Correlation with Survived |
|---|---|
| Survived | 1.00 (perfect — itself) |
| Sex | 0.54 (positive) |
| Fare | 0.26 (positive) |
| Parch | 0.08 (weak positive) |
| Embarked_Q | 0.004 (negligible) |
| SibSp | -0.035 (very weak) |
| Age | -0.065 (slightly negative) |
| Embarked_S | -0.15 (negative) |
| Pclass | -0.34 (moderately negative) |
✅ Key Insights:
- Sex and Fare have the strongest positive correlations with survival, so they are likely useful predictors.
- Pclass has a negative correlation—the higher the class number (i.e., 3rd class), the lower the chance of survival.
Define Features and Target
X = df.drop('Survived', axis=1)
y = df['Survived']
✅ Key Insights:
- X: All features (input variables) except 'Survived'
- y: The target variable we want to predict
Data Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
✅ Key Insights:
- test_size=0.2: 20% of the data goes to the test set (179 samples), 80% to training (712 samples)
- random_state=42: Ensures reproducible results (same split every time)
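The split sizes can be checked on placeholder arrays with the same number of rows as train.csv. The sketch also shows the optional stratify argument, which is not used in this post; it is included only as a variation you might try, since it keeps the survival rate roughly equal in both sets. The names `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 891 rows, 549 non-survivors and 342 survivors.
X_demo = np.arange(891).reshape(-1, 1)
y_demo = np.array([0] * 549 + [1] * 342)

# Same settings as in the post.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(len(X_tr), len(X_te))

# Optional variation (an assumption, not the post's method): stratify on y
# so the class balance in the test set mirrors the full dataset.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(f"Overall rate: {y_demo.mean():.3f}, stratified test rate: {y_te_s.mean():.3f}")
```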
Training
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
✅ Key Insights:
- Logistic Regression is a simple and effective model for binary classification tasks.
- max_iter=1000: Ensures the model has enough iterations to converge (reach a good solution). Sometimes the default (100) is too low.
- It finds the best weights (coefficients) to separate survivors and non-survivors.
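To make "weights" more concrete, here is a rough sketch of how a logistic regression turns features into a probability: it computes a weighted sum and passes it through the sigmoid function. The weights and intercept below are invented for illustration and are not the values learned by the model in this post.

```python
import math

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights: being female raises the score, a higher class number lowers it.
weights = {"Sex": 2.5, "Pclass": -1.0}
intercept = 0.0

def survival_probability(passenger):
    # Weighted sum of the features plus the intercept, squashed by the sigmoid.
    z = intercept + sum(weights[name] * passenger[name] for name in weights)
    return sigmoid(z)

p_woman_first_class = survival_probability({"Sex": 1, "Pclass": 1})  # z = 1.5
p_man_third_class = survival_probability({"Sex": 0, "Pclass": 3})    # z = -3.0

print(f"Woman, 1st class: {p_woman_first_class:.2f}")
print(f"Man, 3rd class:   {p_man_third_class:.2f}")
```

After fitting, the learned values are available as model.coef_ and model.intercept_.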
Prediction
y_pred = model.predict(X_test)
print(y_pred)

✅ Key Insights:
- Predict survival (0 or 1) for the passengers in X_test.
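Behind each 0 or 1 is a probability. The sketch below uses synthetic data (the arrays and the `clf` model are hypothetical stand-ins, not the Titanic model) to show that predict() matches a 0.5 cutoff on predict_proba(), and that a stricter cutoff yields fewer positive predictions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data, generated only for this illustration.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Column 1 of predict_proba holds the probability of class 1.
proba = clf.predict_proba(X_demo)[:, 1]

labels_default = clf.predict(X_demo)        # built-in decision rule
labels_manual = (proba > 0.5).astype(int)   # same rule written out by hand
labels_strict = (proba >= 0.7).astype(int)  # stricter, illustrative cutoff

print("Predicted positives at 0.5:", labels_default.sum())
print("Predicted positives at 0.7:", labels_strict.sum())
```

Raising the cutoff usually trades recall for precision on the positive class.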
Evaluation
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))

✅ Key Insights:
- Accuracy
  - About 81% of predictions were correct. A good start for a baseline model.
- Classification Report
  - Class 0 (Did Not Survive):
    - High precision (0.83) means most of those predicted to not survive were correct.
    - High recall (0.86) means the model catches most of the people who didn’t survive.
    - Overall, the model is strong and reliable at identifying passengers who died.
  - Class 1 (Survived):
    - Lower precision (0.79) means some predictions of survival are false positives.
    - Lower recall (0.74) means the model misses some actual survivors.
    - The model is slightly weaker at identifying survivors.
- Confusion Matrix
  - 90 passengers correctly predicted as not survived
  - 55 correctly predicted as survived
  - 15 false positives (predicted survived, actually died)
  - 19 false negatives (predicted died, actually survived)
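All the metrics in the report can be recomputed by hand from the four confusion matrix counts listed above, which is a handy way to check that you are reading the matrix correctly. The variable names are just for this sketch.

```python
# Counts taken from the confusion matrix above.
tn, fp, fn, tp = 90, 15, 19, 55

accuracy = (tp + tn) / (tp + tn + fp + fn)

# Class 1 (Survived)
precision_1 = tp / (tp + fp)   # of predicted survivors, how many survived
recall_1 = tp / (tp + fn)      # of actual survivors, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0 (Did Not Survive): the roles of the counts are mirrored.
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

print(f"Accuracy: {accuracy:.2f}")
print(f"Class 1: precision={precision_1:.2f}, recall={recall_1:.2f}, f1={f1_1:.2f}")
print(f"Class 0: precision={precision_0:.2f}, recall={recall_0:.2f}")
```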
Code
👉 Download












