Medical Cost Prediction

Understanding how much medical care might cost can be really confusing 😕. Many people struggle with unexpected bills 💸, varying insurance coverage 🏥, and not knowing what factors influence the final price. That’s where data and machine learning come in to help make smarter predictions 📊.

In this blog post, we will cover how to use machine learning to predict medical costs based on factors like age, BMI, smoking status, and more. We’ll walk through the entire process step by step—starting from loading the dataset to evaluating the model’s performance.

🚀 Let’s explore how you can use Python to create a predictive model that makes sense of healthcare expenses and helps make more informed decisions!

Exploring

🔍 Basic Exploration

DataFrame structure summary with 1338 entries and 7 columns including age, sex, bmi, children, smoker, region, and charges.

✅ Key Insights

  • Total entries: 1338 rows
  • Column types:
    • Numerical: age, bmi, children, charges
    • Categorical: sex, smoker, region
  • All columns are complete (no null values).

A table displaying summary statistics of a medical dataset, including columns for age, sex, BMI, children, smoker status, region, and charges.

✅ Key Insights

| Column | Meaningful Insights |
| --- | --- |
| age | Age ranges from 18 to 64. Median is 39. |
| bmi | BMI ranges from ~16 to 53. Median ~30.4. |
| children | Most people have 0–2 children. Max is 5. |
| charges | Huge range: $1,121 to $63,770. Strongly right-skewed. |
| sex | 676 males, 662 females. |
| smoker | 1064 non-smokers vs. 274 smokers (very imbalanced). |
| region | The Southeast is the most common region (364). |
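
The structure summary and statistics above can be reproduced with a few pandas calls. This is a minimal sketch, not the post's original code: the eight rows below are invented stand-ins with the same seven columns, so the printed numbers will not match the real dataset.

```python
import pandas as pd

# Invented stand-in rows (NOT the real dataset), same seven columns as the post.
df = pd.DataFrame({
    "age":      [23, 45, 31, 60, 52, 19, 38, 27],
    "sex":      ["female", "male", "male", "female", "male", "female", "female", "male"],
    "bmi":      [24.5, 31.2, 28.0, 35.1, 29.4, 22.3, 33.8, 26.7],
    "children": [0, 2, 1, 0, 3, 0, 2, 1],
    "smoker":   ["no", "yes", "no", "no", "yes", "no", "no", "no"],
    "region":   ["southwest", "southeast", "northwest", "northeast",
                 "southeast", "southwest", "northeast", "northwest"],
    "charges":  [2500.0, 39000.0, 4800.0, 13500.0, 42000.0, 1800.0, 7200.0, 3900.0],
})

df.info()                                  # row count, dtypes, non-null counts
missing = df.isnull().sum()                # missing values per column
summary = df.describe()                    # min, max, median (50%), etc. for numeric columns
smoker_counts = df["smoker"].value_counts()  # class balance of a categorical column

print(missing)
print(summary)
print(smoker_counts)
```

On the real data, the same calls would be run on the DataFrame loaded from the dataset file.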

📊 Data Visualization

Distribution of Charges

Histogram showing the distribution of medical charges, with the x-axis representing charge amounts and the y-axis indicating the count of occurrences.

✅ Key Insights:

  • A histogram shows how medical charges are distributed.
  • The KDE (Kernel Density Estimate) curve gives a smoothed line representing the probability density.
  • Insight: The distribution is right-skewed — most people have lower charges, but a few have very high charges.
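
Right skew can also be checked numerically instead of by eye. The sketch below assumes a synthetic, log-normally distributed series as a stand-in for the charges column (the real values are not reproduced here). A positive sample skewness and a mean above the median both point to a long right tail.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the charges column: log-normal values have a long right tail.
rng = np.random.default_rng(0)
charges = pd.Series(rng.lognormal(mean=9.0, sigma=1.0, size=1000), name="charges")

skewness = charges.skew()  # > 0 indicates right skew
print(f"mean={charges.mean():.0f}, median={charges.median():.0f}, skew={skewness:.2f}")
```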

Charges vs Age (Colored by Smoker)

Scatter plot showing medical charges against age, with data points colored by smoking status (blue for smokers, orange for non-smokers).

✅ Key Insights:

  • Charges generally increase with age.
  • Smokers tend to have significantly higher charges at all ages.

Charges vs BMI (Colored by Smoker)

Scatter plot displaying medical charges in relation to BMI, colored by smoking status. Smokers are represented in blue, and non-smokers in orange, illustrating the trend of increasing charges with higher BMI.

✅ Key Insights:

  • Charges tend to rise with BMI, but the effect is more extreme for smokers.
  • Smokers with high BMI often have very high charges.

Box Plot: Charges by Smoker

Box plot displaying medical charges based on smoker status, with higher median charges for smokers.

✅ Key Insights:

  • Smokers have a much higher median and a wider range of charges.
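
The numbers behind a box plot like this can be pulled out with a groupby. A small sketch, using invented charges for five non-smokers and three smokers:

```python
import pandas as pd

# Invented values for illustration only.
df = pd.DataFrame({
    "smoker":  ["no", "no", "no", "no", "no", "yes", "yes", "yes"],
    "charges": [1800.0, 2500.0, 3900.0, 4800.0, 13500.0, 21000.0, 39000.0, 42000.0],
})

# Median and spread of charges per smoker group.
by_smoker = df.groupby("smoker")["charges"].agg(["median", "min", "max"])
by_smoker["range"] = by_smoker["max"] - by_smoker["min"]
print(by_smoker)
```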

Box Plot: Charges by Sex

Box plot illustrating medical charges categorized by sex, showing median, quartiles, and outliers.

✅ Key Insights:

  • The box plots show no clear difference in medical charges between the sexes.

Box Plot: Charges by Region

Box plot showing medical charges by region, displaying different charge distributions for the southwest, southeast, northwest, and northeast regions, including median and outlier values.

✅ Key Insights:

  • Charges vary slightly between regions, but region is not a major factor in charge differences.

📊 Correlation Analysis

Correlation Between Numerical Features

Correlation matrix showing relationships between age, BMI, children, and medical charges.
A correlation heatmap displaying the relationship between age, BMI, children, and medical charges, with color coding to indicate the strength of correlation.

✅ Key Insights:

| Feature Pair | Correlation | Insight |
| --- | --- | --- |
| charges & age | 0.30 | Older people tend to have higher charges. |
| charges & bmi | 0.20 | Higher BMI can slightly increase charges. |
| charges & children | 0.07 | Very weak correlation. Having more children doesn’t strongly affect charges. |
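
A correlation matrix like the one above comes from `DataFrame.corr()`, which computes Pearson correlation by default. The sketch below uses invented rows, so its coefficients will differ from the table.

```python
import pandas as pd

# Invented numeric rows (not the real dataset).
df = pd.DataFrame({
    "age":      [23, 45, 31, 60, 52, 19, 38, 27],
    "bmi":      [24.5, 31.2, 28.0, 35.1, 29.4, 22.3, 33.8, 26.7],
    "children": [0, 2, 1, 0, 3, 0, 2, 1],
    "charges":  [2500.0, 39000.0, 4800.0, 13500.0, 42000.0, 1800.0, 7200.0, 3900.0],
})

corr = df.corr()  # pairwise Pearson correlation between numeric columns
print(corr.round(2))
print(corr["charges"].drop("charges").sort_values(ascending=False))
```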

Include Categorical Features for Correlation

A correlation coefficients table displaying the relationships between 'charges' and various features such as 'smoker', 'age', 'bmi', 'region', 'children', and 'sex'.
A heatmap displaying the correlations between various features related to medical costs, including age, sex, BMI, children, smoker status, and charges.

✅ Key Insights:

| Feature | Correlation with Charges | Insight |
| --- | --- | --- |
| smoker | 0.79 | Very strong positive correlation: smokers have much higher charges. |
| age | 0.30 | Older age is associated with higher charges. |
| bmi | 0.20 | Higher BMI is weakly associated with higher charges. |
| region_southeast | 0.07 | Slight increase in charges for this region. |
| sex | 0.06 | Very weak relationship. |
| region_northwest | -0.04 | Slightly lower charges. |
| region_southwest | -0.04 | Similar effect to northwest. |

  • Smoking status is the most significant predictor of medical charges.
  • Age and BMI also matter but less strongly.
  • Sex and region have little impact on charges.
  • The number of children has minimal influence.
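
To include categorical features in a correlation, they first have to be turned into numbers. One possible approach, assumed here rather than taken from the post, is to one-hot encode them with `pd.get_dummies` and then correlate every column with charges. The rows are invented, and the dummy column names (`smoker_yes`, `sex_male`, and so on) are what pandas generates, which may differ from the labels used in the post's table.

```python
import pandas as pd

# Invented stand-in rows (not the real dataset).
df = pd.DataFrame({
    "age":      [23, 45, 31, 60, 52, 19, 38, 27],
    "sex":      ["female", "male", "male", "female", "male", "female", "female", "male"],
    "bmi":      [24.5, 31.2, 28.0, 35.1, 29.4, 22.3, 33.8, 26.7],
    "children": [0, 2, 1, 0, 3, 0, 2, 1],
    "smoker":   ["no", "yes", "no", "no", "yes", "no", "no", "no"],
    "region":   ["southwest", "southeast", "northwest", "northeast",
                 "southeast", "southwest", "northeast", "northwest"],
    "charges":  [2500.0, 39000.0, 4800.0, 13500.0, 42000.0, 1800.0, 7200.0, 3900.0],
})

# Encode categoricals as 0/1 columns; the first category of each is dropped.
encoded = pd.get_dummies(df, columns=["sex", "smoker", "region"],
                         drop_first=True, dtype=int)

# Correlation of every feature with charges, strongest first.
corr_with_charges = (
    encoded.corr()["charges"].drop("charges").sort_values(ascending=False)
)
print(corr_with_charges.round(2))
```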

Cleaning

Missing Values

✅ Key Insights:

  • All columns have 0 missing values. No action needed.

Duplicates

✅ Key Insights:

  • Found 1 duplicate row.
  • The duplicate entry is dropped, leaving 1,337 rows.
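
Finding and dropping duplicates takes two pandas calls. A minimal sketch on an invented three-row frame where the first two rows are identical:

```python
import pandas as pd

# Invented rows; the second row repeats the first.
df = pd.DataFrame({
    "age":     [19, 19, 30],
    "bmi":     [30.6, 30.6, 25.0],
    "charges": [1640.0, 1640.0, 4200.0],
})

n_duplicates = int(df.duplicated().sum())              # rows identical to an earlier row
df_clean = df.drop_duplicates().reset_index(drop=True)  # keep the first occurrence

print(f"duplicates: {n_duplicates}, rows before: {len(df)}, rows after: {len(df_clean)}")
```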

Inconsistent Data in Categorical Columns

✅ Key Insights:

  • sex: ['female', 'male'] → Clean and consistent.
  • smoker: ['yes', 'no'] → No typos or inconsistencies.
  • region: ['southwest', 'southeast', 'northwest', 'northeast'] → All expected values.
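
A quick way to catch typos or stray labels is to compare the unique values of each categorical column against the values you expect. The sketch below uses invented rows, and the `expected` dictionary is a helper introduced here for illustration.

```python
import pandas as pd

# Invented categorical rows.
df = pd.DataFrame({
    "sex":    ["female", "male", "male", "female"],
    "smoker": ["no", "yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})

# Hypothetical helper: the labels each column is allowed to contain.
expected = {
    "sex":    {"female", "male"},
    "smoker": {"yes", "no"},
    "region": {"southwest", "southeast", "northwest", "northeast"},
}

unexpected = {}
for col, allowed in expected.items():
    found = set(df[col].unique())
    print(col, sorted(found))
    unexpected[col] = found - allowed  # anything here needs cleaning

print(unexpected)
```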

Detecting Outliers

Box plot illustrating the distribution of medical charges, highlighting the median, interquartile range, and outliers.

✅ Key Insights:

  • There are extremely high charges (outliers) far from the main distribution.
  • These outliers are expected in medical data (e.g., due to surgeries or chronic illness).
  • We keep them for now.
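
The points a box plot draws as outliers typically follow the 1.5 × IQR rule, which can be applied directly. A sketch on ten invented charge values, assuming that rule:

```python
import pandas as pd

# Invented charges with one extreme value.
charges = pd.Series([2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 60000])

q1, q3 = charges.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # values above this are flagged as outliers
lower_fence = q1 - 1.5 * iqr

outliers = charges[(charges > upper_fence) | (charges < lower_fence)]
print(f"upper fence: {upper_fence}, outliers: {outliers.tolist()}")
```

Flagging is separate from removing: here the flagged rows stay in the data, in line with the decision above.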

Feature Selection

A table displaying data from a medical dataset includes columns for age, sex, bmi, number of children, smoker status, and region.

✅ Key Insights:

  • Keep all 6 features (age, sex, bmi, children, smoker, region) for now.
  • Linear Regression can handle these well once we preprocess the categorical variables.

Preprocessing

🛠️ One-Hot Encoding

✅ Key Insights:

  • drop_first=True: drops the first category of each encoded column to avoid the dummy variable trap (perfect multicollinearity)
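
Here is a minimal sketch of one-hot encoding with `pd.get_dummies` on four invented rows. With `drop_first=True`, the alphabetically first category of each column (`female`, `no`, `northeast`) gets no column of its own and becomes the baseline.

```python
import pandas as pd

# Invented rows covering every category once.
df = pd.DataFrame({
    "age":    [23, 45, 31, 60],
    "sex":    ["female", "male", "male", "female"],
    "smoker": ["no", "yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})

X = pd.get_dummies(df, columns=["sex", "smoker", "region"],
                   drop_first=True, dtype=int)
print(X)
print(list(X.columns))
```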

🛠️ Train-Test Split

✅ Key Insights:

  • test_size=0.2: 80% training, 20% testing
  • random_state=42: ensures reproducibility
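
The split itself is a single call to scikit-learn's `train_test_split`. The arrays below are placeholders, not the real features. With 10 rows and `test_size=0.2`, 8 rows go to training and 2 to testing, and a fixed `random_state` returns the same split on every run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and target (10 rows).
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)

# Same random_state, same split.
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print(np.array_equal(X_test, X_test_again))
```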

Training

✅ Key Insights:

  • The model learns the weights (coefficients) for each feature to minimize error between predicted and actual charges.

Table displaying features and coefficients of a linear regression model predicting medical charges.

✅ Key Insights:

  • This shows the influence of each feature on the predicted charges.
  • Smoker status is by far the most impactful feature. It’s strongly positively associated with higher charges.

A screenshot showing the output of a predictive model, displaying a negative value of -11092.652295945965.

✅ Key Insights:

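  • The intercept is the model’s predicted charge when every feature value is 0.
  • It acts as the baseline that the feature contributions are added to. Since no patient has an age or BMI of 0, the negative value is a mathematical offset rather than a realistic cost.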
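
The sketch below fits scikit-learn's `LinearRegression` and reads back the coefficients and intercept. The data is synthetic and noise-free, built from a made-up formula (250 per year of age, 300 per BMI point, 20,000 for smokers, minus 5,000), so the fit should recover those numbers. They are illustrative and are not the coefficients from the post.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic features (NOT the real dataset).
rng = np.random.default_rng(0)
n = 50
X = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.uniform(16, 53, size=n),
    "smoker_yes": rng.integers(0, 2, size=n),
})
# Made-up, noise-free relationship used only to illustrate the fit.
y = 250 * X["age"] + 300 * X["bmi"] + 20000 * X["smoker_yes"] - 5000

model = LinearRegression()
model.fit(X, y)

coefficients = pd.DataFrame({"Feature": X.columns, "Coefficient": model.coef_})
print(coefficients)
print("Intercept:", model.intercept_)

# The intercept equals the prediction for a row where every feature is 0.
all_zero = pd.DataFrame({"age": [0], "bmi": [0.0], "smoker_yes": [0]})
baseline = model.predict(all_zero)[0]
print("Prediction at all-zero features:", baseline)
```

Note that the intercept is negative here too, even though every synthetic charge is positive, because no row sits anywhere near age 0 and BMI 0.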

Prediction

✅ Key Insights:

  • The trained linear regression model is used to predict medical charges for each observation in the test dataset.

A table displaying actual and predicted medical charges for patients, with columns for 'Actual' and 'Predicted' values.

✅ Key Insights:

  • The table places the true charges (Actual) and the model’s predictions (Predicted) side by side for each patient in the test set.

Scatter plot showing actual versus predicted medical charges, with actual charges on the x-axis and predicted charges on the y-axis, including a red dashed line representing perfect predictions.

✅ Key Insights:

  • Each point in the plot represents one patient.
  • If the prediction is perfect, the point will lie exactly on the diagonal line.
  • Most predictions follow the general trend.
  • For very high actual charges (e.g., $40,000+), the model tends to underpredict.
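
The prediction step and the Actual vs. Predicted table can be sketched as follows. The data is synthetic (a made-up linear relationship plus random noise), so the values are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the real dataset).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.normal(30, 6, size=n),
    "smoker_yes": rng.binomial(1, 0.2, size=n),
})
y = (250 * X["age"] + 300 * X["bmi"] + 20000 * X["smoker_yes"] - 5000
     + rng.normal(0, 3000, size=n))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)  # one predicted charge per test row

comparison = pd.DataFrame({"Actual": y_test.to_numpy(), "Predicted": y_pred})
print(comparison.head())
```

Plotting `Actual` against `Predicted` from this table gives the kind of scatter plot described above, with perfect predictions falling on the diagonal.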

Evaluation

A screen displaying the results of a linear regression model's evaluation metrics, including Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R² scores for both training and testing datasets.

✅ Key Insights:

  • Good predictive performance: R² ≈ 0.78 on the test set, so the model explains about 78% of the variance in charges.
  • No sign of overfitting: the test RMSE ($5,796) is slightly lower than the train RMSE ($6,105).
  • Reasonable error magnitude: given the wide range of charges (from around $1,000 to $60,000+), an RMSE of $5,796 is a reasonable error.
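
MSE, RMSE, and R² for both splits can be computed with `sklearn.metrics`. This sketch again uses synthetic data, so its scores will not match the figures above. The `report` helper is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the real dataset).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.normal(30, 6, size=n),
    "smoker_yes": rng.binomial(1, 0.2, size=n),
})
y = (250 * X["age"] + 300 * X["bmi"] + 20000 * X["smoker_yes"] - 5000
     + rng.normal(0, 3000, size=n))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)


def report(name, y_true, y_hat):
    """Print and return (MSE, RMSE, R²) for one split."""
    mse = mean_squared_error(y_true, y_hat)
    rmse = float(np.sqrt(mse))  # same units as charges
    r2 = r2_score(y_true, y_hat)
    print(f"{name}: MSE={mse:.0f}  RMSE={rmse:.0f}  R²={r2:.3f}")
    return mse, rmse, r2


train_metrics = report("Train", y_train, model.predict(X_train))
test_metrics = report("Test", y_test, model.predict(X_test))
```

Comparing the train and test rows is what supports the overfitting check: a test RMSE close to the train RMSE suggests the model generalizes.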

Code

 👉 Download
