Medical Cost Prediction

Understanding how much medical care might cost can be really confusing 😕. Many people struggle with unexpected bills 💸, varying insurance coverage 🏥, and not knowing what factors influence the final price. That’s where data and machine learning come in to help make smarter predictions 📊.

In this blog post, we will cover how to use machine learning to predict medical costs based on factors like age, BMI, smoking status, and more. We’ll walk through the entire process step by step—starting from loading the dataset to evaluating the model’s performance.

🚀 Let’s explore how you can use Python to create a predictive model that makes sense of healthcare expenses and helps make more informed decisions!

Overview

Item	Details
Category	Regression
Goal	Predict medical charges based on patient attributes
Data Source	Kaggle (Medical Cost Personal Dataset)
Task Type	Regression
Data Type	Tabular
Algorithms	Linear Regression
Evaluation Metrics	MSE, RMSE, R² Score
IDE	Jupyter
Tools	Pandas, Scikit-learn, Matplotlib, Seaborn

Loading

import pandas as pd

# Load the dataset
df = pd.read_csv("insurance.csv")

# Show the first 5 rows
df.head()

✅ Key Insights:

pd.read_csv("insurance.csv"): Reads the CSV file into a Pandas DataFrame.
df.head(): Displays the first 5 rows to confirm the data is loaded correctly.
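A quick sanity check right after loading can catch a wrong file or a renamed column early. The sketch below assumes the seven column names used throughout this post. The inline CSV text is a made-up two-row stand-in for insurance.csv so the snippet runs on its own, and `EXPECTED_COLUMNS` is a name introduced here purely for illustration.

```python
import io
import pandas as pd

# Made-up stand-in for insurance.csv so the snippet runs without the file
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16884.92\n"
    "18,male,33.8,1,no,southeast,1725.55\n"
)
df_sample = pd.read_csv(sample_csv)

# Columns this post relies on (hypothetical helper constant)
EXPECTED_COLUMNS = ["age", "sex", "bmi", "children", "smoker", "region", "charges"]

# Collect any expected column that did not make it into the DataFrame
missing = [col for col in EXPECTED_COLUMNS if col not in df_sample.columns]
print("Missing columns:", missing)
print("Shape:", df_sample.shape)
```

With the real file, you would swap `sample_csv` for the path to insurance.csv and run the same check.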

Exploring

🔍 Basic Exploration

# Check the DataFrame’s structure
df.info()

✅ Key Insights:

Total entries: 1338 rows across 7 columns.
Column types:
- Numerical: age, bmi, children, charges
- Categorical: sex, smoker, region
All columns are complete (no null values).

# Summary statistics 
df.describe()

✅ Key Insights:

Column	Meaningful Insights
age	Age ranges from 18 to 64. Median is 39.
bmi	BMI ranges from ~16 to 53. Median ~30.4.
children	Most people have 0–2 children. Max is 5.
charges	Huge range: $1,121 to $63,770. Very skewed.
sex	676 males, 662 females. Nearly balanced.
smoker	1064 non-smokers vs. 274 smokers (very imbalanced).
region	Southeast is the largest group (364 people).
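One thing to keep in mind: by default, df.describe() summarizes only the numeric columns, so counts for sex, smoker, and region have to come from somewhere else. A minimal sketch of one way to get them, using value_counts() on a tiny made-up frame (the values below are illustrative, not rows from the dataset):

```python
import pandas as pd

# Tiny made-up frame with the same categorical columns as the dataset
toy = pd.DataFrame({
    "sex": ["female", "male", "male", "female", "male"],
    "smoker": ["yes", "no", "no", "no", "yes"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northeast"],
})

# Count how often each category appears, one column at a time
for col in ["sex", "smoker", "region"]:
    print(toy[col].value_counts(), "\n")

# Keep one result around to inspect individual counts
smoker_counts = toy["smoker"].value_counts()
print("Non-smokers:", smoker_counts["no"])
```

On the real data, the same loop over `df` would produce the category counts listed in the table.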

📊 Data Visualization

Distribution of Charges

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8, 5))
sns.histplot(df['charges'], bins=40, kde=True)
plt.title("Distribution of Medical Charges")
plt.xlabel("Charges")
plt.ylabel("Count")
plt.show()

Histogram showing the distribution of medical charges, with the x-axis representing charge amounts and the y-axis indicating the count of occurrences.

✅ Key Insights:

A histogram shows how medical charges are distributed.
The KDE (Kernel Density Estimate) curve gives a smoothed line representing the probability density.
Insight: The distribution is right-skewed — most people have lower charges, but a few have very high charges.
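If you want a number to go with the picture, skewness can be measured directly. The sketch below is an optional check that is not part of the workflow in this post. It uses synthetic right-skewed values (generated with NumPy, not taken from the dataset) and also shows how a log transform pulls the skew toward zero.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for df['charges']
rng = np.random.default_rng(0)
charges = pd.Series(rng.lognormal(mean=9.0, sigma=0.9, size=1000))

raw_skew = charges.skew()            # well above 0 for a right-skewed variable
log_skew = np.log1p(charges).skew()  # much closer to 0 after a log transform

print(f"Skewness of raw values: {raw_skew:.2f}")
print(f"Skewness after log1p:   {log_skew:.2f}")
```

On the real data you would call `df['charges'].skew()`; a clearly positive value confirms what the histogram suggests.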

Charges vs Age (Colored by Smoker)

plt.figure(figsize=(8, 5))
sns.scatterplot(data=df, x='age', y='charges', hue='smoker')
plt.title("Charges vs Age (colored by Smoker)")
plt.xlabel("Age")
plt.ylabel("Charges")
plt.show()

Scatter plot showing medical charges against age, with data points colored by smoking status (blue for smokers, orange for non-smokers).

✅ Key Insights:

Charges generally increase with age.
Smokers tend to have significantly higher charges at all ages.

Charges vs BMI (Colored by Smoker)

plt.figure(figsize=(8, 5))
sns.scatterplot(data=df, x='bmi', y='charges', hue='smoker')
plt.title("Charges vs BMI (colored by Smoker)")
plt.xlabel("BMI")
plt.ylabel("Charges")
plt.show()

Scatter plot displaying medical charges in relation to BMI, colored by smoking status. Smokers are represented in blue, and non-smokers in orange, illustrating the trend of increasing charges with higher BMI.

✅ Key Insights:

Charges tend to rise with BMI, but the effect is more extreme for smokers.
Smokers with high BMI often have very high charges.

Box Plot: Charges by Smoker

plt.figure(figsize=(6, 5))
sns.boxplot(data=df, x='smoker', y='charges')
plt.title("Medical Charges by Smoker Status")
plt.xlabel("Smoker")
plt.ylabel("Charges")
plt.show()

Box plot displaying medical charges based on smoker status, with higher median charges for smokers.

✅ Key Insights:

Smokers have a much higher median and a wider range of charges.
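The same comparison can be read as numbers with a groupby. This is a minimal sketch on made-up charges, chosen only so the arithmetic is easy to follow; the real medians come from running it on `df`.

```python
import pandas as pd

# Made-up charges: three smokers and three non-smokers
toy = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes", "no", "yes"],
    "charges": [30000.0, 4000.0, 6000.0, 40000.0, 8000.0, 20000.0],
})

# Count, median, and mean of charges within each smoker group
summary = toy.groupby("smoker")["charges"].agg(["count", "median", "mean"])
print(summary)
```

The median row per group is the same line drawn inside each box of the box plot.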

Box Plot: Charges by Sex

plt.figure(figsize=(6, 5))
sns.boxplot(data=df, x='sex', y='charges')
plt.title("Medical Charges by Sex")
plt.xlabel("Sex")
plt.ylabel("Charges")
plt.show()

Box plot illustrating medical charges categorized by sex, showing median, quartiles, and outliers.

✅ Key Insights:

The box plot shows no obvious difference in medical charges between males and females.

Box Plot: Charges by Region

plt.figure(figsize=(8, 5))
sns.boxplot(data=df, x='region', y='charges')
plt.title("Medical Charges by Region")
plt.xlabel("Region")
plt.ylabel("Charges")
plt.show()

Box plot showing medical charges by region, displaying different charge distributions for the southwest, southeast, northwest, and northeast regions, including median and outlier values.

✅ Key Insights:

Charges vary only slightly between regions, so region does not look like a major factor.

📊 Correlation Analysis

Correlation Between Numerical Features

df.corr(numeric_only=True)

Correlation matrix showing relationships between age, BMI, children, and medical charges.

plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()

A correlation heatmap displaying the relationship between age, BMI, children, and medical charges, with color coding to indicate the strength of correlation.

✅ Key Insights:

Feature Pair	Correlation	Insight
`charges` & `age`	0.30	Older people tend to have higher charges.
`charges` & `bmi`	0.20	Higher BMI can slightly increase charges.
`charges` & `children`	0.07	Very weak correlation. Having more children doesn’t strongly affect charges.
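To see what a value like 0.30 actually measures, here is the Pearson correlation computed by hand and compared with pandas. The five data points are made up for illustration, and `r_manual` and `r_pandas` are names introduced here.

```python
import numpy as np
import pandas as pd

# Made-up ages and charges, only to illustrate the calculation
toy = pd.DataFrame({
    "age": [20, 30, 40, 50, 60],
    "charges": [2500.0, 4100.0, 3900.0, 9000.0, 12000.0],
})

x = toy["age"].to_numpy(dtype=float)
y = toy["charges"].to_numpy(dtype=float)

# Pearson r: how x and y vary together, scaled by how much each varies alone
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# pandas computes the same quantity
r_pandas = toy["age"].corr(toy["charges"])
print(r_manual, r_pandas)
```

Values near 1 or -1 mean a tight linear relationship, and values near 0 mean little linear relationship.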

Include Categorical Features for Correlation

df_temp = df.copy()

# Convert categorical variables to numeric variables
df_temp['smoker'] = df_temp['smoker'].map({'no': 0, 'yes': 1})
df_temp['sex'] = df_temp['sex'].map({'female': 0, 'male': 1})

df_temp = pd.get_dummies(df_temp, columns=['region'], drop_first=True)

df_temp.corr()['charges'].sort_values(ascending=False)

A correlation coefficients table displaying the relationships between 'charges' and various features such as 'smoker', 'age', 'bmi', 'region', 'children', and 'sex'.

plt.figure(figsize=(8, 6))
sns.heatmap(df_temp.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()

A heatmap displaying the correlations between various features related to medical costs, including age, sex, BMI, children, smoker status, and charges.

✅ Key Insights:

Feature	Correlation with Charges	Insight
`smoker`	0.79	Strong positive correlation: smokers tend to have much higher charges.
`age`	0.30	Older age is associated with higher charges.
`bmi`	0.20	Higher BMI is weakly associated with higher charges.
`region_southeast`	0.07	Very weak positive relationship.
`children`	0.07	Very weak positive relationship.
`sex`	0.06	Very weak relationship.
`region_northwest`	-0.04	Very weak negative relationship.
`region_southwest`	-0.04	Similar to northwest.

Smoking status has by far the strongest correlation with medical charges.
Age and BMI also matter but less strongly.
Sex and region have little impact on charges.
Children have minimal influence.

Cleaning

Missing Values

df.isnull().sum()

✅ Key Insights:

All columns have 0 missing values. No action needed.

Duplicates

df.duplicated().sum()

df.drop_duplicates(inplace=True)

✅ Key Insights:

df.duplicated().sum() finds 1 duplicate row.
df.drop_duplicates(inplace=True) removes it, leaving 1337 rows.
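Before dropping anything, it can help to look at the duplicated rows themselves. A small sketch on a made-up frame: `keep=False` flags every copy of a repeated row, and the non-inplace form of drop_duplicates() returns a new DataFrame, which is an alternative to the inplace call used above.

```python
import pandas as pd

# Made-up rows; the first and third are identical
toy = pd.DataFrame({
    "age": [19, 23, 19, 45],
    "bmi": [30.6, 27.9, 30.6, 25.0],
    "charges": [1639.56, 2500.00, 1639.56, 7000.00],
})

# keep=False marks every copy of a duplicated row, so both can be inspected
print(toy[toy.duplicated(keep=False)])

# Non-inplace variant: returns a new frame and renumbers the index
deduped = toy.drop_duplicates().reset_index(drop=True)
print(len(toy), "->", len(deduped))
```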

Inconsistent Data in Categorical Columns

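# Print each column's distinct values (a notebook cell only displays its last expression)
print(df['sex'].unique())
print(df['smoker'].unique())
print(df['region'].unique())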

✅ Key Insights:

sex: ['female', 'male'] → Clean and consistent.
smoker: ['yes', 'no'] → No typos or inconsistencies.
region: ['southwest', 'southeast', 'northwest', 'northeast'] → All expected values.
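This dataset turned out to be clean, but if the check had revealed stray spaces or mixed capitalization, a common fix is to normalize the strings. The sketch below uses made-up messy values that do not appear in the real data.

```python
import pandas as pd

# Made-up messy values; the real dataset did not contain any of these
messy = pd.Series([" Yes", "no", "NO ", "yes", "No"], name="smoker")

# Trim surrounding whitespace, then lowercase everything
cleaned = messy.str.strip().str.lower()
print(cleaned.unique())
```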

Detecting Outliers

sns.boxplot(data=df, x='charges')

Box plot illustrating the distribution of medical charges, highlighting the median, interquartile range, and outliers.

✅ Key Insights:

There are extremely high charges (outliers) far from the main distribution.
These outliers are expected in medical data (e.g., due to surgeries or chronic illness).
We keep them for now.
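The points a box plot draws beyond its whiskers follow the 1.5 × IQR rule by default, and the same rule can be applied directly to count them. A minimal sketch on made-up charges (not values from the dataset):

```python
import pandas as pd

# Made-up charges with one extreme value
charges = pd.Series([1200.0, 2500.0, 4000.0, 5200.0, 6100.0, 7500.0, 9000.0, 48000.0])

# Interquartile range: the spread of the middle 50% of values
q1 = charges.quantile(0.25)
q3 = charges.quantile(0.75)
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the quartiles is flagged
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = charges[(charges < lower_fence) | (charges > upper_fence)]

print("Upper fence:", upper_fence)
print("Flagged values:", outliers.tolist())
```

Flagging is not the same as removing. Here the flagged values are only counted, which matches the decision to keep them.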

Feature Selection

# Features: every column except the target
X = df.drop('charges', axis=1)
# Target: the value we want to predict
y = df['charges']

X.head()

✅ Key Insights:

Keep all 6 features (age, sex, bmi, children, smoker, region) for now.
Linear Regression can handle these well once we preprocess the categorical variables.

Preprocessing

🛠️ One-Hot Encoding

X = pd.get_dummies(X, drop_first=True)

✅ Key Insights:

drop_first=True: Drops one dummy column per categorical feature to avoid the dummy variable trap (perfect multicollinearity).
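To make the encoding concrete, here is what get_dummies produces on a tiny made-up frame with the same categorical columns. The first category of each feature in alphabetical order (female, no, northeast) is dropped and becomes the implicit baseline.

```python
import pandas as pd

# Made-up rows covering every category once
toy = pd.DataFrame({
    "age": [19, 18, 28, 33],
    "sex": ["female", "male", "male", "female"],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})

# Numeric columns pass through; each categorical column becomes 0/1 indicators
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
```

A row with sex_male = 0, smoker_yes = 0, and all three region columns at 0 therefore means a female non-smoker from the northeast.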

🛠️ Train-Test Split

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

✅ Key Insights:

test_size=0.2: 80% training, 20% testing
random_state=42: ensures reproducibility
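A small sketch of what these two parameters do, using made-up arrays instead of the real features: 10 rows split 80/20 gives 8 training rows and 2 test rows, and repeating the call with the same random_state returns the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows, 2 features
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Same seed, same split
X_tr2, X_te2, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

print(X_tr.shape, X_te.shape)
print("Same split both times:", np.array_equal(X_te, X_te2))
```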

Training

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

✅ Key Insights:

The model learns the weights (coefficients) for each feature to minimize error between predicted and actual charges.
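LinearRegression fits by ordinary least squares, which means it picks the weights that make the sum of squared errors as small as possible. The sketch below shows that idea with plain NumPy on made-up data that follows an exact linear rule. It illustrates the principle and is not a description of scikit-learn's internals.

```python
import numpy as np

# Made-up data that follows y = 3 + 2*x1 + 5*x2 exactly
X_toy = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0], [5.0, 0.0]])
y_toy = 3.0 + 2.0 * X_toy[:, 0] + 5.0 * X_toy[:, 1]

# Add a column of ones so the first weight acts as the intercept
X_design = np.column_stack([np.ones(len(X_toy)), X_toy])

# Solve for the weights that minimize the sum of squared errors
weights, *_ = np.linalg.lstsq(X_design, y_toy, rcond=None)
print("intercept, coef_x1, coef_x2:", weights)
```

Because the toy data has no noise, the recovered weights match the rule that generated it. Real data like the medical charges never fits this cleanly.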

coefficients = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_
})
coefficients

✅ Key Insights:

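This shows how much the predicted charge changes for a one-unit increase in each feature, with the other features held fixed.
Smoker status is by far the most impactful feature. It’s strongly positively associated with higher charges.
Because the features are on different scales (years of age, BMI points, 0/1 flags), coefficient sizes are not directly comparable across features.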

model.intercept_

✅ Key Insights:

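This is the predicted value of charges when all feature values are 0 (age 0, BMI 0, no children, and the baseline category for every encoded feature).
No patient in the data looks like that, so treat it as a mathematical offset for the model rather than a realistic baseline medical cost.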
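Putting the intercept and coefficients together: a linear model's prediction is the intercept plus each feature value multiplied by its coefficient. The sketch below checks that on a tiny made-up dataset (two features standing in for age and a 0/1 smoker flag); `toy_model` is a name introduced here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up rows: [age, smoker flag] with made-up charges
X_toy = np.array([[25.0, 0.0], [40.0, 1.0], [55.0, 0.0], [30.0, 1.0], [60.0, 1.0]])
y_toy = np.array([3000.0, 28000.0, 11000.0, 24000.0, 36000.0])

toy_model = LinearRegression().fit(X_toy, y_toy)

# Prediction by hand: intercept + sum of (feature value * coefficient)
manual = toy_model.intercept_ + X_toy @ toy_model.coef_

print("Matches model.predict:", np.allclose(manual, toy_model.predict(X_toy)))
```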

Prediction

y_test_pred = model.predict(X_test)

✅ Key Insights:

This line uses the trained linear regression model to predict medical charges for each observation in the test dataset.

comparison = pd.DataFrame({
    'Actual': y_test.values,
    'Predicted': y_test_pred
})

comparison.head(10)

✅ Key Insights:

This creates a side-by-side table of the true charges (Actual) and the model’s predictions (Predicted) for each patient in the test set.

plt.figure(figsize=(8, 6))
sns.scatterplot(x='Actual', y='Predicted', data=comparison, alpha=0.6)

# Draw the y = x reference line across the full range of both columns
max_val = comparison.max().max()
min_val = comparison.min().min()
plt.plot([min_val, max_val], [min_val, max_val], color='red', linestyle='--')

plt.title("Actual vs Predicted Medical Charges")
plt.xlabel("Actual Charges")
plt.ylabel("Predicted Charges")
plt.grid(True)
plt.show()

✅ Key Insights:

Each point in the plot represents one patient.
If the prediction is perfect, the point will lie exactly on the diagonal line.
Most predictions follow the general trend.
For very high actual charges (e.g., $40,000+), the model tends to underpredict.
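Underprediction can be checked with residuals (actual minus predicted): a positive residual means the model guessed too low. A minimal sketch on a made-up comparison table; the $40,000 cutoff mirrors the example above and the numbers are illustrative only.

```python
import pandas as pd

# Made-up actual and predicted charges
toy_comparison = pd.DataFrame({
    "Actual": [5000.0, 12000.0, 45000.0, 52000.0],
    "Predicted": [6000.0, 11000.0, 38000.0, 40000.0],
})

# Positive residual = the model predicted too low
toy_comparison["Residual"] = toy_comparison["Actual"] - toy_comparison["Predicted"]

# Compare the average residual for high-charge rows against the rest
high = toy_comparison["Actual"] >= 40000
mean_residual_high = toy_comparison.loc[high, "Residual"].mean()
mean_residual_rest = toy_comparison.loc[~high, "Residual"].mean()

print("Mean residual, charges >= 40,000:", mean_residual_high)
print("Mean residual, everything else:  ", mean_residual_rest)
```

Running the same steps on the real `comparison` table would show whether the high-charge group really has a large positive average residual.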

Evaluation

from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Training performance
train_mse = mean_squared_error(y_train, y_train_pred)
train_rmse = np.sqrt(train_mse)
train_r2 = r2_score(y_train, y_train_pred)

# Testing performance
test_mse = mean_squared_error(y_test, y_test_pred)
test_rmse = np.sqrt(test_mse)
test_r2 = r2_score(y_test, y_test_pred)


print("Train MSE:", train_mse)
print("Train RMSE:", train_rmse)
print("Train R²:", train_r2)

print("\nTest MSE:", test_mse)
print("Test RMSE:", test_rmse)
print("Test R²:", test_r2)

✅ Key Insights:

Good predictive performance: R² ≈ 0.78 on the test set.
No sign of overfitting: the test RMSE ($5,796) is slightly lower than the train RMSE ($6,105).
Reasonable error magnitude: given the wide range of charges (from around $1,000 to over $60,000), an RMSE of $5,796 is a reasonable error.
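For reference, here is how the three metrics are defined, computed by hand and checked against scikit-learn. The four values are made up so the arithmetic is easy to follow and are unrelated to the results above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted values
y_true = np.array([3000.0, 8000.0, 12000.0, 40000.0])
y_pred = np.array([4000.0, 7000.0, 15000.0, 35000.0])

# MSE: average of the squared errors
mse = np.mean((y_true - y_pred) ** 2)

# RMSE: square root of MSE, back in the same units as charges
rmse = np.sqrt(mse)

# R²: 1 minus (squared error of the model / squared error of always predicting the mean)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MSE:", mse, "| sklearn:", mean_squared_error(y_true, y_pred))
print("RMSE:", rmse)
print("R²:", r2, "| sklearn:", r2_score(y_true, y_pred))
```

RMSE is usually the easiest of the three to explain, since it reads as a typical prediction error in dollars.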