Titanic Survival Classification

Predicting who survived the Titanic disaster isn’t just a classic machine learning task—it’s also a real challenge filled with tricky patterns and messy data ⚠️. Many people struggle with cleaning the dataset, converting categories into numbers, and figuring out which features actually matter 🤔.

In this blog post, we will cover how to explore the Titanic dataset using visualizations, train a logistic regression model, and evaluate its performance using key metrics like precision and recall.

🚀 Let’s dive into the world of Titanic classification and uncover how we can turn raw passenger data into meaningful survival predictions!

Overview

Item	Details
Category	Classification
Goal	Predict survival of passengers on the Titanic
Data Source	Kaggle (Titanic Dataset)
Task Type	Binary Classification
Data Type	Tabular
Algorithms	Logistic Regression
Evaluation Metrics	Accuracy, Precision, Recall, F1 Score
IDE	Jupyter
Tools	Pandas, Scikit-learn, Matplotlib, Seaborn

Loading

import pandas as pd

df = pd.read_csv('train.csv')
df.head()

✅ Key Insights:

pd.read_csv("train.csv"): Reads the CSV file into a Pandas DataFrame.
df.head(): Displays the first 5 rows to confirm the data is loaded correctly.

Exploring

🔍 Basic Exploration

df.shape

Shows that the dataset has 891 rows and 12 columns.

df.describe()

A table displaying statistical summaries of the Titanic dataset, including columns for PassengerId, Survived, Pclass, Age, SibSp, Parch, and Fare.

✅ Key Insights:

Survival: About 38% of passengers survived. Most did not. (Survival: 0 = No, 1 = Yes)
Passenger Class: Many people were in 3rd class, which was the lowest.
Age: The average age was about 30 years old, but some data is missing.
Family: Most people traveled alone (SibSp and Parch are 0 for most rows, meaning no siblings, spouses, parents, or children aboard).
Fare: Ticket prices varied widely. Some passengers paid nothing, while the highest fare was over 512.
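The numbers above can be read off df.describe(), but they can also be computed directly. The sketch below uses a small hypothetical DataFrame (made-up rows that only borrow the column names from train.csv), so the values it produces are not the real Titanic figures.

```python
import pandas as pd

# Hypothetical toy rows with the same column names as train.csv (not real data).
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0],
    "Pclass":   [3, 1, 3, 3, 2],
    "Age":      [22.0, 38.0, None, 35.0, None],
    "SibSp":    [1, 1, 0, 0, 0],
    "Parch":    [0, 0, 0, 0, 0],
})

# Survived is 0/1, so its mean is the share of survivors.
survival_rate = toy["Survived"].mean()

# Number of passengers in each class.
class_counts = toy["Pclass"].value_counts()

# Number of rows where Age is missing.
missing_age = toy["Age"].isna().sum()

# Share of passengers with no relatives aboard (SibSp + Parch == 0).
alone_share = ((toy["SibSp"] + toy["Parch"]) == 0).mean()

print(survival_rate, missing_age, alone_share)
print(class_counts)
```

Running the same four expressions on the real df should give the figures listed in the insights.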

📊 Data Visualization

Survival Count

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(6, 4))
sns.countplot(data=df, x='Survived')
plt.title('Survival Count')
plt.xlabel('Survived (0 = No, 1 = Yes)')
plt.ylabel('Count')
plt.show()

✅ Key Insights:

About 340 of the 891 passengers survived. Most did not.

Survival by Sex

plt.figure(figsize=(6, 4))
sns.barplot(x='Sex', y='Survived', data=df)
plt.title('Survival by Sex')
plt.xlabel('Sex')
plt.ylabel('Survival Rate')
plt.show()

Bar chart showing survival rates for male and female passengers, highlighting that females had a significantly higher survival rate.

✅ Key Insights:

Females had a much higher survival rate (around 75%) than males (around 19%).
This is consistent with the “women and children first” policy followed during the Titanic evacuation.
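By default, sns.barplot draws the mean of the y column for each x category, so each bar here is simply the average of Survived within that group. A minimal sketch of the same calculation with groupby, on hypothetical toy rows (not the real dataset):

```python
import pandas as pd

# Hypothetical toy rows: 4 women (3 survived) and 5 men (1 survived).
toy = pd.DataFrame({
    "Sex": ["female"] * 4 + ["male"] * 5,
    "Survived": [1, 1, 1, 0, 0, 0, 0, 0, 1],
})

# Mean of a 0/1 column per group = survival rate per group.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
print(rate_by_sex)
```

The same pattern should work for the other bar charts in this section by swapping "Sex" for "Pclass" or "Embarked".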

Survival by Passenger Class

plt.figure(figsize=(6, 4))
sns.barplot(x='Pclass', y='Survived', data=df)
plt.title('Survival by Passenger Class')
plt.xlabel('Passenger Class')
plt.ylabel('Survival Rate')
plt.show()

Bar plot showing survival rates of Titanic passengers by passenger class, with Class 1 showing the highest survival rate, followed by Class 2, and Class 3 with the lowest.

✅ Key Insights:

1st class passengers had the highest survival rate.
Survival rates decreased as class number increased (i.e., 3rd class had the lowest).

Survival by Port of Embarkation

plt.figure(figsize=(6, 4))
sns.barplot(x='Embarked', y='Survived', data=df)
plt.title('Survival by Port of Embarkation')
plt.xlabel('Embarked')
plt.ylabel('Survival Rate')
plt.show()

Bar chart showing the survival rate of Titanic passengers by port of embarkation: Cherbourg (C) has the highest rate, followed by Queenstown (Q) and Southampton (S) with error bars for variability.

✅ Key Insights:

Passengers who boarded at Cherbourg (C) had the highest survival rate.

Age Distribution by Survival

plt.figure(figsize=(6, 4))
sns.histplot(data=df, x='Age', hue='Survived', bins=30, kde=True)
plt.title('Age Distribution by Survival')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()

Histogram showing the age distribution of Titanic passengers, colored by survival status, with counts on the y-axis and age on the x-axis.

✅ Key Insights:

Survivors appear across all age ranges.
Young children seem to have had slightly better chances than other age groups.

Fare Distribution

plt.figure(figsize=(6, 4))
sns.histplot(data=df, x='Fare', hue='Survived', bins=40, kde=True)
plt.title('Fare Distribution')
plt.xlabel('Fare')
plt.ylabel('Frequency')
plt.show()

A histogram showing the distribution of fares paid by passengers on the Titanic, with two curves representing survival status: blue for non-survivors and orange for survivors.

✅ Key Insights:

Survivors generally paid higher fares.
People in cheaper fare ranges were less likely to survive.

Fare Distribution by Survival (Box Plot)

plt.figure(figsize=(6, 4))
sns.boxplot(data=df, x='Survived', y='Fare')
plt.title('Fare Distribution by Survival')
plt.xlabel('Survived (0 = No, 1 = Yes)')
plt.ylabel('Fare')
plt.show()

Box plot illustrating the fare distribution based on survival status in the Titanic dataset, showing higher fares for surviving passengers.

✅ Key Insights:

Survivors tend to have a higher median fare.
There’s also a wide range of fare values among survivors, with some extreme high values (outliers).

Cleaning

Missing Values

df.isnull().sum()

✅ Key Insights:

Cabin has too many missing values (over 75%), so it may not be useful.
Age is missing for 177 passengers (about 20%) but is likely informative, so it will be filled rather than dropped.
Embarked only has 2 missing, so it can be filled easily.
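Raw counts from isnull().sum() are easier to judge as percentages. One way to get them is to take the mean of the boolean mask instead of the sum. The rows below are hypothetical and only illustrate the calculation:

```python
import pandas as pd

# Hypothetical toy rows (not real data) with gaps in Age and Cabin.
toy = pd.DataFrame({
    "Age":      [22.0, None, 26.0, 35.0],
    "Cabin":    [None, "C85", None, None],
    "Embarked": ["S", "C", "S", "S"],
})

# isnull() gives True/False per cell; the column mean is the missing share.
missing_pct = toy.isnull().mean() * 100
print(missing_pct.sort_values(ascending=False))
```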

Drop Unnecessary Columns

df.drop(columns=['PassengerId', 'Cabin', 'Ticket', 'Name'], inplace=True)

✅ Key Insights:

PassengerId: Just an index, not useful for prediction.
Cabin: Too many missing values.
Ticket and Name: Hard to extract meaningful info without advanced processing.

Fill Missing Values

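# Assign the result back instead of using inplace=True on a column selection,
# which triggers a warning in recent pandas and may not modify df at all.
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])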

✅ Key Insights:

The median is robust to outliers and represents a typical age.
The mode is the most frequent value (in this case 'S', Southampton).
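To see why the median is the safer fill value, here is a small sketch with made-up ages that include one extreme value. The numbers are hypothetical and chosen only to make the effect visible:

```python
import pandas as pd

# Hypothetical ages: three young adults, one outlier, one missing value.
ages = pd.Series([20.0, 22.0, 24.0, 80.0, None])

mean_age = ages.mean()      # pulled upward by the outlier (36.5)
median_age = ages.median()  # stays near the typical value (23.0)

# Fill the gap with the median; fillna returns a new Series.
filled_ages = ages.fillna(median_age)

# For a categorical column, the mode plays the same role.
ports = pd.Series(["S", "C", "S", None])
filled_ports = ports.fillna(ports.mode()[0])

print(mean_age, median_age)
print(filled_ages.tolist(), filled_ports.tolist())
```

Both mean() and median() skip missing values by default, so the fill value is computed from the observed ages only.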

Encoding

Label Encoding: 'Sex'

df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})

✅ Key Insights:

This is a binary category, so simple label encoding is enough.

One-Hot Encoding: 'Embarked'

df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)

A tabular dataset displaying information about Titanic passengers, including their status of survival, class, sex, age, number of siblings or spouses aboard, fare paid, and embarkation point.

✅ Key Insights:

Creates two new columns: Embarked_Q, Embarked_S
Drops one category (Embarked_C) to avoid the dummy variable trap (multicollinearity).
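A quick way to check what drop_first=True does is to encode a tiny column by hand. The sketch below uses hypothetical rows and passes dtype=int (an optional choice, not used in the post's code) so the result prints as 0/1 instead of True/False:

```python
import pandas as pd

# Hypothetical toy column with all three ports.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Categories are sorted (C, Q, S) and the first one, C, is dropped.
encoded = pd.get_dummies(toy, columns=["Embarked"], drop_first=True, dtype=int)
print(encoded)
```

The row that was 'C' ends up with 0 in both columns, so no information is lost: "neither Q nor S" already means Cherbourg.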

Feature Selection

Correlation with Survival

df.corr()['Survived'].sort_values(ascending=False)

Feature	Correlation with Survived
Survived	1.00 (perfect — itself)
Sex	0.54 (positive)
Fare	0.26 (positive)
Parch	0.08 (weak positive)
Embarked_Q	0.004 (negligible)
SibSp	-0.035 (very weak)
Age	-0.065 (slightly negative)
Embarked_S	-0.15 (negative)
Pclass	-0.34 (moderately negative)

✅ Key Insights:

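Sex and Fare have the strongest positive correlations with survival. Because female is encoded as 1, the positive value for Sex means women were more likely to survive.
Pclass has a negative correlation: the higher the class number (i.e., 3rd class), the lower the chance of survival.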

Define Features and Target

X = df.drop('Survived', axis=1)
y = df['Survived']

✅ Key Insights:

X: All features (input variables) except 'Survived'
y: The target variable we want to predict

Data Split

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

✅ Key Insights:

test_size=0.2: 20% of the data goes to the test set (179 samples), 80% to training (712 samples).
random_state=42: Ensures reproducible results (same split every time).
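Both points can be checked without the real data. The sketch below splits a placeholder array with 891 rows (the row count of train.csv) and confirms the 712/179 sizes and that a fixed random_state reproduces the same split. The arrays X_demo and y_demo are stand-ins, not Titanic features:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as train.csv.
X_demo = np.arange(891).reshape(-1, 1)
y_demo = np.arange(891) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Same arguments and same random_state give the same split again.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(len(X_tr), len(X_te))
print(np.array_equal(X_te, X_te2))
```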

Training

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

✅ Key Insights:

Logistic Regression is a simple and effective model for binary classification tasks.
max_iter=1000: Ensures the model has enough iterations to converge (reach a good solution). Sometimes the default (100) is too low.
Training finds the weights (coefficients) that best separate survivors from non-survivors.
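To make "weights" concrete: a binary logistic regression computes a weighted sum of the features plus an intercept, then passes it through the sigmoid function to get a probability. The sketch below checks this on a hypothetical one-feature toy problem, not on the Titanic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy problem: larger x means class 1.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression(max_iter=1000)
clf.fit(X_toy, y_toy)

# Weighted sum of features plus intercept (one value per row).
z = X_toy @ clf.coef_.ravel() + clf.intercept_[0]

# Sigmoid turns the score into a probability of class 1.
manual_proba = 1.0 / (1.0 + np.exp(-z))
sklearn_proba = clf.predict_proba(X_toy)[:, 1]

print(clf.coef_, clf.intercept_)
print(np.allclose(manual_proba, sklearn_proba))
```

On the real model, model.coef_ holds one weight per column of X_train, and the sign of each weight shows whether that feature pushes the prediction toward survival or away from it.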

Prediction

y_pred = model.predict(X_test)
print(y_pred)

✅ Key Insights:

model.predict(X_test): Predicts survival (0 or 1) for each passenger in X_test.
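For a binary model, predict() is equivalent to asking for the class 1 probability and checking whether it is above 0.5. A minimal sketch on hypothetical toy data (the stricter 0.7 threshold is an arbitrary value picked for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy problem, unrelated to the Titanic features.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

toy_pred = clf.predict(X_toy)
proba = clf.predict_proba(X_toy)[:, 1]  # probability of class 1

# The default decision rule: class 1 when the probability exceeds 0.5.
thresholded = (proba > 0.5).astype(int)

# A stricter, arbitrary threshold can only keep or reduce the positives.
strict = (proba > 0.7).astype(int)

print(toy_pred, thresholded, strict)
```

Raising the threshold tends to trade recall for precision on the positive class, which is worth keeping in mind when reading the evaluation metrics.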

Evaluation

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))

✅ Key Insights:

Accuracy
- About 81% of predictions were correct. A good start for a baseline model.
Classification Report
- Class 0 (Did Not Survive):
  - High precision (0.83) means most of those predicted to not survive were correct.
  - High recall (0.86) means the model catches most of the people who didn’t survive.
  - Overall, the model is strong and reliable at identifying passengers who died.
- Class 1 (Survived):
  - Lower precision (0.79) means some predictions of survival are false positives.
  - Lower recall (0.74) means the model misses some actual survivors.
  - The model is slightly weaker at identifying survivors than non-survivors.
Confusion Matrix
- 90 passengers correctly predicted as not survived
- 55 correctly predicted as survived
- 15 false positives (predicted survived, actually died)
- 19 false negatives (predicted died, actually survived)
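All the scores in the classification report follow from the four confusion matrix counts. The sketch below recomputes them by hand from the numbers reported above. It assumes the usual scikit-learn layout for labels 0 and 1, where the matrix reads [[TN, FP], [FN, TP]]:

```python
# Counts taken from the confusion matrix reported above.
tn, fp, fn, tp = 90, 15, 19, 55

total = tn + fp + fn + tp
accuracy = (tp + tn) / total

# Class 1 (Survived)
precision_1 = tp / (tp + fp)   # of predicted survivors, how many survived
recall_1 = tp / (tp + fn)      # of actual survivors, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0 (Did Not Survive): the roles of the counts are mirrored.
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

print(f"total={total} accuracy={accuracy:.2f}")
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f}")
```

The counts add up to the 179 test samples, and the rounded values match the report: 0.81 accuracy, 0.79 / 0.74 for class 1, and 0.83 / 0.86 for class 0.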