Titanic Survival Classification

Predicting who survived the Titanic disaster isn’t just a classic machine learning task—it’s also a real challenge filled with tricky patterns and messy data ⚠️. Many people struggle with cleaning the dataset, converting categories into numbers, and figuring out which features actually matter 🤔.

In this blog post, we will cover how to explore the Titanic dataset using visualizations, train a logistic regression model, and evaluate its performance using key metrics like precision and recall.

🚀 Let’s dive into the world of Titanic classification and uncover how we can turn raw passenger data into meaningful survival predictions!

Exploring

🔍 Basic Exploration

  • The dataset has 891 rows and 12 columns.
A table displaying statistical summaries of the Titanic dataset, including columns for PassengerId, Survived, Pclass, Age, SibSp, Parch, and Fare.

✅ Key Insights

  • Survival: About 38% of passengers survived. Most did not. (Survival: 0 = No, 1 = Yes)
  • Passenger Class: Many people were in 3rd class, which was the lowest.
  • Age: The average age was about 30 years old, but some data is missing.
  • Family: Most people traveled alone (no siblings, spouses, or parents with them).
  • Fare: Ticket prices were very different. Some paid nothing, while others paid over 512.
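
The numbers above typically come from a handful of pandas calls. Here is a minimal sketch, assuming the data would normally be read from a local file such as train.csv (that filename is an assumption). To keep the snippet runnable without the file, it uses a few made-up rows that only borrow the Titanic column names.

```python
import pandas as pd

# Made-up stand-in rows (NOT real passenger data), just to make this runnable.
# With the real file you might use: df = pd.read_csv("train.csv")  # assumed path
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0],
    "Pclass":   [3, 1, 3, 2, 3],
    "Age":      [22.0, 38.0, None, 35.0, 54.0],
    "SibSp":    [1, 1, 0, 0, 0],
    "Parch":    [0, 0, 0, 0, 0],
    "Fare":     [7.25, 71.28, 7.93, 13.0, 0.0],
})

n_rows, n_cols = df.shape               # the post reports (891, 12) for the full data
summary = df.describe()                 # count, mean, std, min, quartiles, max
survival_rate = df["Survived"].mean()   # mean of a 0/1 column = share of survivors

print(n_rows, n_cols)
print(summary.loc[["count", "mean", "min", "max"]])
print(f"Survival rate: {survival_rate:.0%}")
```

A count below the number of rows in the describe() output (as for Age here) is a quick hint that a column has missing values.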

📊 Data Visualization

Survival Count

Bar chart showing survival count of Titanic passengers, with counts for those who did not survive (0) and those who did (1).

✅ Key Insights:

  • About 340 of the 891 passengers survived. Most did not.

Survival by Sex

Bar chart showing survival rates for male and female passengers, highlighting that females had a significantly higher survival rate.

✅ Key Insights:

  • Female passengers had a much higher survival rate (around 75%) than male passengers (around 19%).
  • This is consistent with the “women and children first” policy used during the Titanic evacuation.
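
The rates behind a bar chart like this one can be computed directly with a groupby. The sketch below uses a hypothetical helper, survival_rate_by, and a few made-up rows. It is not the post's own code, only one way to get the same kind of numbers.

```python
import pandas as pd

# Made-up rows (NOT real passenger data), only to make the snippet runnable.
df = pd.DataFrame({
    "Sex":      ["male", "female", "female", "male", "male", "female"],
    "Pclass":   [3, 1, 3, 2, 3, 1],
    "Survived": [0, 1, 1, 0, 1, 0],
})

def survival_rate_by(frame: pd.DataFrame, column: str) -> pd.Series:
    """Share of survivors within each category of `column` (hypothetical helper)."""
    return frame.groupby(column)["Survived"].mean()

rate_by_sex = survival_rate_by(df, "Sex")
print(rate_by_sex)
```

The same helper should work for other categorical columns such as Pclass or Embarked.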

Survival by Passenger Class

Bar plot showing survival rates of Titanic passengers by passenger class, with Class 1 showing the highest survival rate, followed by Class 2, and Class 3 with the lowest.

✅ Key Insights:

  • 1st class passengers had the highest survival rate.
  • Survival rates decreased as class number increased (i.e., 3rd class had the lowest).

Survival by Port of Embarkation

Bar chart showing the survival rate of Titanic passengers by port of embarkation: Cherbourg (C) has the highest rate, followed by Queenstown (Q) and Southampton (S) with error bars for variability.

✅ Key Insights:

  • Passengers who boarded at Cherbourg (C) had the highest survival rate.

Age Distribution by Survival

Histogram showing the age distribution of Titanic passengers, colored by survival status, with counts on the y-axis and age on the x-axis.

✅ Key Insights:

  • Survivors appear across all age ranges.
  • Children seem to have had slightly better chances than adults.

Fare Distribution

A histogram showing the distribution of fares paid by passengers on the Titanic, with two curves representing survival status: blue for non-survivors and orange for survivors.

✅ Key Insights:

  • Survivors generally paid higher fares.
  • People in cheaper fare ranges were less likely to survive.

Fare Distribution by Survival (Box Plot)

Box plot illustrating the fare distribution based on survival status in the Titanic dataset, showing higher fares for surviving passengers.

✅ Key Insights:

  • Survivors tend to have a higher median fare.
  • There’s also a wide range of fare values among survivors, with some extreme high values (outliers).

Cleaning

Missing Values

A screenshot showing a DataFrame output in a programming environment, highlighting the count of missing values for various columns related to Titanic passengers.

✅ Key Insights:

  • Cabin has too many missing values (over 75%), so it may not be useful.
  • Age is important, so it will be filled.
  • Embarked only has 2 missing, so it can be filled easily.
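
A missing-value table like this can be produced with isna(). The following is a minimal sketch on made-up rows. The variable names are illustrative, not taken from the original notebook.

```python
import pandas as pd

# Made-up rows (NOT real passenger data) with gaps in Age, Cabin and Embarked.
df = pd.DataFrame({
    "Age":      [22.0, None, 26.0, 35.0],
    "Cabin":    [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
    "Fare":     [7.25, 71.28, 7.93, 53.1],
})

missing_count = df.isna().sum()    # number of missing values per column
missing_share = df.isna().mean()   # fraction of rows that are missing per column

report = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(report.sort_values("missing", ascending=False))
```

Looking at the share as well as the raw count makes it easier to separate columns worth filling from columns that are mostly empty.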

Drop Unnecessary Columns

✅ Key Insights:

  • PassengerId: Just an index, not useful for prediction.
  • Cabin: Too many missing values.
  • Ticket and Name: Hard to extract meaningful info without advanced processing.
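
Dropping these four columns is a one-liner in pandas. Here is a small sketch on a made-up frame that has the same column names.

```python
import pandas as pd

# Made-up rows (NOT real passenger data).
df = pd.DataFrame({
    "PassengerId": [1, 2],
    "Survived":    [0, 1],
    "Name":        ["Passenger A", "Passenger B"],
    "Ticket":      ["T1", "T2"],
    "Cabin":       [None, "C1"],
    "Fare":        [7.25, 71.28],
})

# Remove the identifier, the mostly-empty column, and the free-text columns.
df = df.drop(columns=["PassengerId", "Cabin", "Ticket", "Name"])
print(df.columns.tolist())
```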

Fill Missing Values

✅ Key Insights:

  • Age is filled with the median, which is robust to outliers and represents a typical age.
  • Embarked is filled with the mode, the most frequent value (in this case likely 'S').
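
A minimal sketch of median and mode imputation, assuming Age gets the median and Embarked gets the mode. The rows are made up, and one deliberately large age shows why the median is a safer choice than the mean.

```python
import pandas as pd

# Made-up rows (NOT real passenger data); 80.0 acts as an outlier.
df = pd.DataFrame({
    "Age":      [22.0, None, 26.0, 80.0],
    "Embarked": ["S", "C", None, "S"],
})

age_median = df["Age"].median()            # middle value, ignores missing entries
embarked_mode = df["Embarked"].mode()[0]   # most frequent value

# Assign the result back instead of relying on inplace=True.
df["Age"] = df["Age"].fillna(age_median)
df["Embarked"] = df["Embarked"].fillna(embarked_mode)
print(df)
```

Here the mean age would be pulled up to about 42.7 by the single outlier, while the median stays at 26.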

Encoding

Label Encoding: 'Sex'

✅ Key Insights:

  • This is a binary category, so simple label encoding is enough.
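
One simple way to encode a binary column is an explicit mapping. The direction used below (male = 0, female = 1) is an assumption, chosen because the post reports a positive correlation between Sex and Survived. Note that scikit-learn's LabelEncoder sorts labels alphabetically, which would give female = 0 and male = 1 instead.

```python
import pandas as pd

# Made-up rows (NOT real passenger data).
df = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Assumed mapping; flip the values if you prefer the opposite convention.
sex_codes = {"male": 0, "female": 1}
df["Sex"] = df["Sex"].map(sex_codes)
print(df["Sex"].tolist())
```

An explicit dictionary keeps the meaning of 0 and 1 visible in the code, which helps when reading the model coefficients later.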

One-Hot Encoding: 'Embarked'

A tabular dataset displaying information about Titanic passengers, including their status of survival, class, sex, age, number of siblings or spouses aboard, fare paid, and embarkation point.

✅ Key Insights:

  • Creates two new columns: Embarked_Q, Embarked_S
  • Drops one category (Embarked_C) to avoid the dummy variable trap (multicollinearity).
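
The two-column result described above matches what pd.get_dummies produces with drop_first=True. This is a minimal sketch on made-up rows, and whether the original notebook used exactly this call is an assumption.

```python
import pandas as pd

# Made-up rows (NOT real passenger data).
df = pd.DataFrame({
    "Embarked": ["S", "C", "Q", "S"],
    "Fare":     [7.25, 71.28, 7.75, 8.05],
})

# Categories are sorted (C, Q, S); drop_first=True removes the first one, C.
df = pd.get_dummies(df, columns=["Embarked"], drop_first=True, dtype=int)
print(df)
```

A passenger from Cherbourg is then represented by zeros in both Embarked_Q and Embarked_S.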

Feature Selection

Correlation with Survival

Feature       Correlation with Survived
Survived       1.00   (perfect, itself)
Sex            0.54   (positive)
Fare           0.26   (positive)
Parch          0.08   (weak positive)
Embarked_Q     0.004  (negligible)
SibSp         -0.035  (very weak)
Age           -0.065  (slightly negative)
Embarked_S    -0.15   (negative)
Pclass        -0.34   (moderately negative)

✅ Key Insights:

  • Sex (0.54) and Fare (0.26) have the strongest positive correlations with survival, so both are useful predictors.
  • Pclass has a moderate negative correlation (-0.34): the higher the class number (i.e., 3rd class), the lower the chance of survival.
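
A ranking like this can be read off the correlation matrix. The sketch below uses made-up, already-encoded rows, so the values will not match the table. It only illustrates the calls.

```python
import pandas as pd

# Made-up, fully numeric rows (NOT real passenger data).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1],
    "Pclass":   [3, 1, 2, 3, 3, 1],
    "Sex":      [0, 1, 1, 0, 1, 1],   # assumed encoding: female = 1
    "Fare":     [7.25, 71.28, 13.0, 8.05, 7.9, 53.1],
})

# Pearson correlation of every column with the target, strongest positive first.
corr_with_target = df.corr()["Survived"].sort_values(ascending=False)
print(corr_with_target)
```

Pearson correlation only captures linear relationships, so a low value does not by itself prove a feature is useless.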

Define Features and Target

✅ Key Insights:

  • X: All features (input variables) except 'Survived'
  • y: The target variable we want to predict
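
In pandas, this separation might look like the following sketch (made-up rows, with the same X and y names as in the text).

```python
import pandas as pd

# Made-up, already-cleaned rows (NOT real passenger data).
df = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass":   [3, 1, 2],
    "Sex":      [0, 1, 1],
    "Fare":     [7.25, 71.28, 13.0],
})

X = df.drop(columns=["Survived"])   # every column except the target
y = df["Survived"]                  # the 0/1 label to predict
print(X.shape, y.shape)
```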

Data Split

✅ Key Insights:

  • test_size=0.2:
    20% of the data goes to the test set (179 samples), 80% to training (712 samples)
  • random_state=42:
    Ensures reproducible results (same split every time)
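
The sample counts can be checked with placeholder arrays of the same length as the dataset. This sketch uses scikit-learn's train_test_split with the two parameters mentioned above. The arrays themselves are dummies, not passenger data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with 891 rows, the dataset size given in the text.
X = np.arange(891 * 2).reshape(891, 2)
y = np.arange(891) % 2

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (712, 2) (179, 2)

# Same random_state gives the same split on a second call.
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print(np.array_equal(X_test, X_test_again))  # True
```

Since 20% of 891 is 178.2, the test set size is rounded up to 179, which leaves 712 rows for training.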

Training

✅ Key Insights:

  • Logistic Regression is a simple and effective model for binary classification tasks.
  • max_iter=1000: Ensures the model has enough iterations to converge (reach a good solution). Sometimes the default (100) is too low.
  • It finds the best weights (coefficients) to separate survivors and non-survivors.
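
A minimal training sketch with scikit-learn. So that it runs on its own, it fits on synthetic data from make_classification with 8 features (matching the number of feature columns left after cleaning and encoding), not on the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for X_train / y_train (NOT Titanic data).
X_train, y_train = make_classification(n_samples=200, n_features=8, random_state=42)

model = LogisticRegression(max_iter=1000)   # raise the iteration cap from the default 100
model.fit(X_train, y_train)

print(model.coef_.shape)   # one weight per feature
print(model.intercept_)    # bias term
print(model.n_iter_)       # iterations the solver actually used
```

If n_iter_ ends up equal to max_iter, the solver probably stopped before converging, and scaling the features or raising the limit may help.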

Prediction

Output: an array of zeros and ones, one predicted label for each passenger in X_test.

✅ Key Insights:

  • Predict survival (0 or 1) for the passengers in X_test.
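
A sketch of the prediction step, again on synthetic data so it can run standalone. Besides the hard 0/1 labels from predict, predict_proba returns the estimated probability of each class, which is useful when a different decision threshold is wanted.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT Titanic data).
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

y_pred = model.predict(X_test)                     # hard labels: 0 or 1
proba_class_1 = model.predict_proba(X_test)[:, 1]  # estimated P(class = 1)

print(y_pred[:10])
print(proba_class_1[:10].round(2))
```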

Evaluation

Screenshot displaying classification report with accuracy, precision, recall, and F1-score for a binary classification model.

✅ Key Insights:

  • Accuracy
    • About 81% of predictions were correct. A good start for a baseline model.
  • Classification Report
    • Class 0 (Did Not Survive):
      • High precision (0.83) means most of those predicted to not survive were correct.
      • High recall (0.86) means the model catches most of the people who didn’t survive.
      • Overall, the model identifies passengers who died fairly reliably.
    • Class 1 (Survived):
      • Lower precision (0.79) means some predictions of survival are false positives.
      • Lower recall (0.74) means the model misses some actual survivors.
      • The model is slightly weaker at identifying survivors than non-survivors.
  • Confusion Matrix
    • 90 passengers correctly predicted as not survived
    • 55 correctly predicted as survived
    • 15 false positives (predicted survived, actually died)
    • 19 false negatives (predicted died, actually survived)
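
The reported metrics follow directly from these four counts. As a sanity check, the sketch below rebuilds label arrays from the counts and recomputes the metrics with scikit-learn. The arrays are reconstructed for illustration and are not the model's actual output order.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, confusion_matrix, precision_score, recall_score
)

# Counts taken from the confusion matrix in the text.
tn, fp, fn, tp = 90, 15, 19, 55

# Rebuild label arrays that reproduce exactly those counts.
y_true = np.array([0] * tn + [0] * fp + [1] * fn + [1] * tp)
y_pred = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

cm = confusion_matrix(y_true, y_pred)   # rows: actual, columns: predicted
accuracy = accuracy_score(y_true, y_pred)                    # (tn + tp) / total
precision_1 = precision_score(y_true, y_pred)                # tp / (tp + fp)
recall_1 = recall_score(y_true, y_pred)                      # tp / (tp + fn)
precision_0 = precision_score(y_true, y_pred, pos_label=0)   # tn / (tn + fn)
recall_0 = recall_score(y_true, y_pred, pos_label=0)         # tn / (tn + fp)

print(cm)
print(round(accuracy, 2))                           # 0.81
print(round(precision_0, 2), round(recall_0, 2))    # 0.83 0.86
print(round(precision_1, 2), round(recall_1, 2))    # 0.79 0.74
```

The four counts add up to 179, the size of the test set, and 145 correct predictions out of 179 gives the 81% accuracy.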

