Understanding how much medical care might cost can be really confusing 😕. Many people struggle with unexpected bills 💸, varying insurance coverage 🏥, and not knowing what factors influence the final price. That’s where data and machine learning come in to help make smarter predictions 📊.
In this blog post, we will cover how to use machine learning to predict medical costs based on factors like age, BMI, smoking status, and more. We’ll walk through the entire process step by step—starting from loading the dataset to evaluating the model’s performance.
🚀 Let’s explore how you can use Python to create a predictive model that makes sense of healthcare expenses and helps make more informed decisions!
Overview
| Item | Details |
|---|---|
| Category | Regression |
| Goal | Predict medical charges based on patient attributes |
| Data Source | Kaggle (Medical Cost Personal Dataset) |
| Task Type | Regression |
| Data Type | Tabular |
| Algorithms | Linear Regression |
| Evaluation Metrics | MSE, RMSE, R² Score |
| IDE | Jupyter |
| Tools | Pandas, Scikit-learn, Matplotlib, Seaborn |
Loading
import pandas as pd
# Load the dataset
df = pd.read_csv("insurance.csv")
# Show the first 5 rows
df.head()

✅ Key Insights:
- pd.read_csv("insurance.csv"): Reads the CSV file into a Pandas DataFrame.
- df.head(): Displays the first 5 rows to confirm the data is loaded correctly.
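A quick sanity check after loading can catch a wrong file or a renamed column early. The sketch below is only an illustration: it reads three made-up rows from an in-memory string (a hypothetical stand-in for insurance.csv) and checks that the expected column names are present.

```python
import io
import pandas as pd

# Hypothetical stand-in for insurance.csv: three made-up rows with the same columns.
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "25,female,22.5,0,no,northeast,3200.50\n"
    "47,male,31.2,2,yes,southeast,41000.00\n"
    "33,male,28.0,1,no,southwest,5100.75\n"
)
sample_df = pd.read_csv(sample_csv)

# Columns this tutorial relies on later.
expected_columns = ["age", "sex", "bmi", "children", "smoker", "region", "charges"]
missing = [col for col in expected_columns if col not in sample_df.columns]

print("Missing columns:", missing)
print(sample_df.dtypes)
```

With the real file, the same check can be run on df instead of sample_df.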
Exploring
🔍 Basic Exploration
# Check the DataFrame's structure
df.info()

✅ Key Insights
- Total entries: 1338 rows
- Column types:
  - Numerical: age, bmi, children, charges
  - Categorical: sex, smoker, region
- All columns are complete (no null values).
# Summary statistics
df.describe()

✅ Key Insights
| Column | Meaningful Insights |
|---|---|
| age | Age ranges from 18 to 64. Median is 39. |
| bmi | BMI ranges from ~16 to 53. Median ~30.4. |
| children | Most people have 0–2 children. Max is 5. |
| charges | Huge range: $1,121 to $63,770. Very skewed. |
| sex | 676 males, 662 females. |
| smoker | 1064 non-smokers vs. 274 smokers (very imbalanced). |
| region | Southeast is the most common region (364 people). |
📊 Data Visualization
Distribution of Charges
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(8, 5))
sns.histplot(df['charges'], bins=40, kde=True)
plt.title("Distribution of Medical Charges")
plt.xlabel("Charges")
plt.ylabel("Count")
plt.show()

✅ Key Insights:
- A histogram shows how medical charges are distributed.
- The KDE (Kernel Density Estimate) curve gives a smoothed line representing the probability density.
- Insight: The distribution is right-skewed — most people have lower charges, but a few have very high charges.
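The skew can also be put into numbers instead of being read off the plot. The sketch below uses synthetic log-normal values as a stand-in for df['charges'] (the numbers are generated, not taken from the dataset) and compares the mean with the median. In a right-skewed distribution the mean is pulled above the median, and the skewness is positive.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed values, used here only as a stand-in for df['charges'].
charges_demo = pd.Series(rng.lognormal(mean=9.0, sigma=0.9, size=1000))

mean_value = charges_demo.mean()
median_value = charges_demo.median()
skewness = charges_demo.skew()  # positive for a right-skewed distribution

print(f"Mean:     {mean_value:,.0f}")
print(f"Median:   {median_value:,.0f}")
print(f"Skewness: {skewness:.2f}")
```

On the real data, df['charges'].skew() gives the same kind of summary.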
Charges vs Age (Colored by Smoker)
plt.figure(figsize=(8, 5))
sns.scatterplot(data=df, x='age', y='charges', hue='smoker')
plt.title("Charges vs Age (colored by Smoker)")
plt.xlabel("Age")
plt.ylabel("Charges")
plt.show()

✅ Key Insights:
- Charges generally increase with age.
- Smokers tend to have significantly higher charges at all ages.
Charges vs BMI (Colored by Smoker)
plt.figure(figsize=(8, 5))
sns.scatterplot(data=df, x='bmi', y='charges', hue='smoker')
plt.title("Charges vs BMI (colored by Smoker)")
plt.xlabel("BMI")
plt.ylabel("Charges")
plt.show()

✅ Key Insights:
- Charges tend to rise with BMI, but the effect is more extreme for smokers.
- Smokers with high BMI often have very high charges.
Box Plot: Charges by Smoker
plt.figure(figsize=(6, 5))
sns.boxplot(data=df, x='smoker', y='charges')
plt.title("Medical Charges by Smoker Status")
plt.xlabel("Smoker")
plt.ylabel("Charges")
plt.show()

✅ Key Insights:
- Smokers have a much higher median and a wider range of charges.
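A box plot can be backed up with a small summary table. The sketch below is a minimal example on made-up values (not rows from the dataset) that groups charges by smoker status and reports the median and range, which are the same quantities the box plot draws.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes", "no", "no"],
    "charges": [30000.0, 4000.0, 6000.0, 40000.0, 5000.0, 9000.0],
})

# Median, minimum, and maximum charges per smoker group.
summary = toy.groupby("smoker")["charges"].agg(["median", "min", "max"])
print(summary)
```

Replacing toy with df should give the corresponding numbers for the real data.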
Box Plot: Charges by Sex
plt.figure(figsize=(6, 5))
sns.boxplot(data=df, x='sex', y='charges')
plt.title("Medical Charges by Sex")
plt.xlabel("Sex")
plt.ylabel("Charges")
plt.show()

✅ Key Insights:
- There’s no significant difference in medical charges based on sex.
Box Plot: Charges by Region
plt.figure(figsize=(8, 5))
sns.boxplot(data=df, x='region', y='charges')
plt.title("Medical Charges by Region")
plt.xlabel("Region")
plt.ylabel("Charges")
plt.show()

✅ Key Insights:
- Slight variation between regions, but not a major factor in charge differences.
📊 Correlation Analysis
Correlation Between Numerical Features
df.corr(numeric_only=True)

plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()

✅ Key Insights:
| Feature Pair | Correlation | Insight |
|---|---|---|
| charges & age | 0.30 | Older people tend to have higher charges. |
| charges & bmi | 0.20 | Higher BMI can slightly increase charges. |
| charges & children | 0.07 | Very weak correlation. Having more children doesn’t strongly affect charges. |
Include Categorical Features for Correlation
df_temp = df.copy()
# Convert categorical variables to numeric variables
df_temp['smoker'] = df_temp['smoker'].map({'no': 0, 'yes': 1})
df_temp['sex'] = df_temp['sex'].map({'female': 0, 'male': 1})
df_temp = pd.get_dummies(df_temp, columns=['region'], drop_first=True)
df_temp.corr()['charges'].sort_values(ascending=False)

plt.figure(figsize=(8, 6))
sns.heatmap(df_temp.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()

✅ Key Insights:
| Feature | Correlation with Charges | Insight |
|---|---|---|
| smoker | 0.79 | Very strong positive correlation: smokers have much higher charges. |
| age | 0.30 | Older age is associated with higher charges. |
| bmi | 0.20 | Higher BMI is associated with somewhat higher charges. |
| region_southeast | 0.07 | Slight increase in charges for this region. |
| sex | 0.06 | Very weak relationship. |
| region_northwest | -0.04 | Slightly lower charges. |
| region_southwest | -0.04 | Similar effect to northwest. |
- Smoking status is the most significant predictor of medical charges.
- Age and BMI also matter but less strongly.
- Sex and region have little impact on charges.
- Children have minimal influence.
Cleaning
Missing Values
df.isnull().sum()
✅ Key Insights:
- All columns have 0 missing values. No action needed.
Duplicates
df.duplicated().sum()
df.drop_duplicates(inplace=True)
✅ Key Insights:
- df.duplicated().sum() finds 1 duplicate row.
- df.drop_duplicates(inplace=True) removes it from the dataset.
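To see what these two calls do, here is a minimal sketch on a made-up three-row DataFrame. duplicated() marks a row as a duplicate only when every column matches an earlier row, and drop_duplicates() keeps the first occurrence.

```python
import pandas as pd

# Made-up rows: the second row is an exact copy of the first.
toy = pd.DataFrame({
    "age": [19, 19, 45],
    "smoker": ["no", "no", "yes"],
    "charges": [1600.0, 1600.0, 39000.0],
})

n_duplicates = toy.duplicated().sum()                   # rows that repeat an earlier row
deduped = toy.drop_duplicates().reset_index(drop=True)  # keep the first occurrence

print("Duplicates found:", n_duplicates)
print("Rows after dropping:", len(deduped))
```

Resetting the index afterwards is optional. It just keeps the row labels contiguous.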
Inconsistent Data in Categorical Columns
df['sex'].unique()
df['smoker'].unique()
df['region'].unique()
✅ Key Insights:
- sex: ['female', 'male'] → Clean and consistent.
- smoker: ['yes', 'no'] → No typos or inconsistencies.
- region: ['southwest', 'southeast', 'northwest', 'northeast'] → All expected values.
Detecting Outliers
sns.boxplot(data=df, x='charges')

✅ Key Insights:
- There are extremely high charges (outliers) far from the main distribution.
- These outliers are expected in medical data (e.g., due to surgeries or chronic illness).
- Keep for now.
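The points drawn beyond the whiskers of a box plot follow the 1.5 × IQR rule by default. The sketch below applies that rule by hand to made-up values, so the outliers can be counted and not just eyeballed. It is one common convention, not the only way to define an outlier.

```python
import pandas as pd

# Made-up values with one obviously extreme entry.
values = pd.Series([1000, 2000, 3000, 4000, 5000, 6000, 7000, 50000], dtype=float)

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Anything outside these fences is flagged as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("Fences:", lower_fence, upper_fence)
print("Outliers:", outliers.tolist())
```

Running the same steps on df['charges'] would show how many rows fall above the upper fence.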
Feature Selection
X = df.drop('charges', axis=1)
y = df['charges']
X.head()

✅ Key Insights:
- Keep all 6 features (age, sex, bmi, children, smoker, region) for now.
- Linear Regression can handle these well once we preprocess the categorical variables.
Preprocessing
🛠️ One-Hot Encoding
X = pd.get_dummies(X, drop_first=True)
✅ Key Insights:
- drop_first=True: Drops the first category of each encoded column to avoid the dummy variable trap (perfect multicollinearity among the dummy columns).
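Here is a small illustration of what drop_first=True produces, using a made-up DataFrame with the same category labels as the dataset. The first category in alphabetical order (female, northeast) is dropped and becomes the reference level that the remaining dummy columns are compared against.

```python
import pandas as pd

# Made-up rows covering every category once.
toy = pd.DataFrame({
    "sex": ["female", "male", "male", "female"],
    "region": ["northeast", "northwest", "southeast", "southwest"],
})

encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
# ['sex_male', 'region_northwest', 'region_southeast', 'region_southwest']
```

A row with zeros in all three region columns therefore belongs to the northeast.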
🛠️ Train-Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
✅ Key Insights:
- test_size=0.2: 80% training, 20% testing
- random_state=42: ensures reproducibility
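Both parameters can be checked on a tiny example. The sketch below splits 10 made-up rows twice with the same random_state and confirms that the sizes are 8 and 2 and that the two splits are identical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print("Train shape:", X_tr.shape)  # (8, 2)
print("Test shape:", X_te.shape)   # (2, 2)

# Same random_state means the same rows end up in the test set.
same_split = np.array_equal(split_a[1], split_b[1])
print("Identical test sets:", same_split)
```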
Training
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
✅ Key Insights:
- The model learns the weights (coefficients) for each feature to minimize error between predicted and actual charges.
coefficients = pd.DataFrame({
'Feature': X.columns,
'Coefficient': model.coef_
})
coefficients

✅ Key Insights:
- This shows the influence of each feature on the predicted charges.
- Smoker status is by far the most impactful feature. It’s strongly positively associated with higher charges.
model.intercept_

✅ Key Insights:
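- This is the predicted value of charges when all feature values are 0.
- It represents the model’s baseline before any input feature contributes. Because an age or BMI of 0 never occurs in the data, treat it as a mathematical offset rather than a realistic cost.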
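To make the roles of the coefficients and the intercept concrete, the sketch below fits a LinearRegression on synthetic data (generated numbers, not the insurance dataset) and rebuilds the predictions by hand as intercept + sum of coefficient × feature value. The name demo_model is hypothetical and separate from the tutorial's model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: 50 samples, 3 features, known weights plus a little noise.
X_demo = rng.normal(size=(50, 3))
y_demo = 2.0 + X_demo @ np.array([1.5, -0.5, 3.0]) + rng.normal(scale=0.1, size=50)

demo_model = LinearRegression().fit(X_demo, y_demo)

# A prediction is the intercept plus the weighted sum of the feature values.
manual_pred = demo_model.intercept_ + X_demo @ demo_model.coef_
matches = np.allclose(manual_pred, demo_model.predict(X_demo))
print("Manual formula matches predict():", matches)

# With every feature set to 0, the prediction is just the intercept.
zero_pred = demo_model.predict(np.zeros((1, 3)))[0]
print("Prediction at all zeros:", zero_pred)
print("Intercept:", demo_model.intercept_)
```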
Prediction
y_test_pred = model.predict(X_test)
✅ Key Insights:
- This line uses the trained linear regression model to predict medical charges for each observation in the test dataset.
comparison = pd.DataFrame({
'Actual': y_test.values,
'Predicted': y_test_pred
})
comparison.head(10)

✅ Key Insights:
- This creates a side-by-side table of the true charges (Actual) and the model’s predictions (Predicted) for each patient in the test set.
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Actual', y='Predicted', data=comparison, alpha=0.6)
max_val = max(comparison.max())
min_val = min(comparison.min())
plt.plot([min_val, max_val], [min_val, max_val], color='red', linestyle='--')
plt.title("Actual vs Predicted Medical Charges")
plt.xlabel("Actual Charges")
plt.ylabel("Predicted Charges")
plt.grid(True)
plt.show()

✅ Key Insights:
- Each point in the plot represents one patient.
- If the prediction is perfect, the point will lie exactly on the diagonal line.
- Most predictions follow the general trend.
- For very high actual charges (e.g., $40,000+), the model tends to underpredict.
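Underprediction can be checked numerically with residuals (actual minus predicted). A positive residual means the model predicted too little. The sketch below uses a made-up comparison table, and the $40,000 threshold is just an illustrative cutoff.

```python
import pandas as pd

# Made-up actual/predicted pairs for illustration only.
comparison_demo = pd.DataFrame({
    "Actual": [5000.0, 12000.0, 45000.0, 48000.0],
    "Predicted": [5500.0, 11000.0, 36000.0, 39000.0],
})

# Residual > 0 means the prediction was too low.
comparison_demo["Residual"] = comparison_demo["Actual"] - comparison_demo["Predicted"]

high = comparison_demo[comparison_demo["Actual"] >= 40000]
mean_high_residual = high["Residual"].mean()

print(comparison_demo)
print("Mean residual for charges >= 40,000:", mean_high_residual)
```

Applying the same steps to the real comparison table would show how large the gap is on the test set.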
Evaluation
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
# Training performance
train_mse = mean_squared_error(y_train, y_train_pred)
train_rmse = np.sqrt(train_mse)
train_r2 = r2_score(y_train, y_train_pred)
# Testing performance
test_mse = mean_squared_error(y_test, y_test_pred)
test_rmse = np.sqrt(test_mse)
test_r2 = r2_score(y_test, y_test_pred)
print("Train MSE:", train_mse)
print("Train RMSE:", train_rmse)
print("Train R²:", train_r2)
print("\nTest MSE:", test_mse)
print("Test RMSE:", test_rmse)
print("Test R²:", test_r2)

✅ Key Insights:
- Good predictive performance: R² ≈ 0.78 on the test set.
- No sign of overfitting: the test RMSE ($5,796) is slightly lower than the train RMSE ($6,105).
- Reasonable error magnitude: given the wide range of charges (from around $1,000 to $60,000+), an RMSE of $5,796 is reasonable.
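For readers who want to see what these metrics compute, here is a minimal sketch that works them out by hand on four made-up values and compares the result with scikit-learn. MSE is the mean of the squared errors, RMSE is its square root (so it is in the same units as charges), and R² is 1 minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up values for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 10.0])

errors = y_true - y_pred

mse_manual = np.mean(errors ** 2)                # mean squared error
rmse_manual = np.sqrt(mse_manual)                # back in the units of the target
ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print("MSE:", mse_manual)    # 0.75
print("RMSE:", rmse_manual)
print("R²:", r2_manual)      # 0.85

# The hand-computed values agree with scikit-learn's implementations.
print(np.isclose(mse_manual, mean_squared_error(y_true, y_pred)))
print(np.isclose(r2_manual, r2_score(y_true, y_pred)))
```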
Code
👉 Download












