Predicting Heart Disease Using Machine Learning: A Step-by-Step Guide



Introduction

Heart disease is one of the leading causes of death worldwide, but what if we could use data to predict whether someone is at risk before symptoms even appear? That's the power of machine learning. By training a model on health data, we can make predictions that could potentially save lives. In this post, we'll walk through building a Heart Disease Prediction Model with machine learning. This project will allow us to predict whether a person is at risk of heart disease based on features like age, cholesterol level, blood pressure, and other health indicators.

By the end of this guide, you’ll have a clear understanding of how machine learning can be used in the healthcare sector to assist doctors in making better decisions.


Step 1: Getting Started with the Data

For this project, we will use the Heart Disease Dataset from the UCI Machine Learning Repository. The dataset contains information about patients, including various features such as:

  • Age: The age of the patient.
  • Sex: Male or Female.
  • Chest Pain Type: Type of chest pain experienced.
  • Resting Blood Pressure: Blood pressure at rest.
  • Serum Cholesterol: Cholesterol levels.
  • Maximum Heart Rate: The maximum heart rate achieved during exercise.
  • ECG: Electrocardiographic results.
  • Thalassemia: A genetic disorder affecting blood cells.

The target variable in this dataset is Heart Disease (0 for no disease, 1 for heart disease).

We will load the dataset using Python and start exploring it.

import pandas as pd

# Load dataset
df = pd.read_csv('heart_disease.csv')

# Show the first 5 rows of the dataset
print(df.head())
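
Beyond the first few rows, it helps to get a quick overview of column types, missing values, and how balanced the target is. Here is a minimal exploration step, using the same 'HeartDisease' target column referenced later in this post:

# Column types and non-null counts (df.info() prints directly)
df.info()

# Summary statistics for the numeric columns
print(df.describe())

# Class balance of the target column
print(df['HeartDisease'].value_counts())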

Step 2: Data Preprocessing

Before we can build a machine learning model, we need to clean the data and ensure it’s ready for analysis. Here are the main steps in the preprocessing stage:

  1. Handling Missing Data: In real-world datasets, missing values are common. We can handle them by either filling them in (for example, with the column median) or removing the affected rows.
# Check for missing values in each column
print(df.isnull().sum())

# Fill missing values in numeric columns with the column median
df.fillna(df.median(numeric_only=True), inplace=True)
  2. Feature Encoding: Some features like sex and chest pain type are categorical (non-numeric). We need to convert these into numerical values for our model.
# Convert categorical variables to numerical using one-hot encoding
df = pd.get_dummies(df, columns=['Sex', 'ChestPainType'])
  3. Feature Scaling: Many machine learning models, including Logistic Regression, train better when the features are on a comparable scale. We'll use StandardScaler from Scikit-learn to give every feature zero mean and unit variance.
from sklearn.preprocessing import StandardScaler

# Scale every feature except the target column
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df.drop('HeartDisease', axis=1))
  4. Splitting the Data: Before training the model, we split the data into a training set and a test set, so we can train on one portion and evaluate performance on data the model has never seen.
from sklearn.model_selection import train_test_split

X = scaled_features
y = df['HeartDisease']

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
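
Because medical datasets are often imbalanced, you may also want to stratify the split so the training and test sets keep roughly the same proportion of positive cases. Here is a variant of the split above using Scikit-learn's stratify parameter:

# Stratified split: preserve the class ratio of y in both train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)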

Step 3: Choosing the Right Machine Learning Algorithm

For this project, we’ll use Logistic Regression, which is a great choice for binary classification problems like predicting heart disease. Logistic Regression estimates the probability that an input belongs to a certain class.

from sklearn.linear_model import LogisticRegression

# Initialize and train the model
model = LogisticRegression()
model.fit(X_train, y_train)
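
If you want to sanity-check this choice against another algorithm, a quick cross-validation comparison on the training data is a common approach. The sketch below reuses X_train and y_train from above and compares Logistic Regression with a Random Forest; the exact scores will depend on your data.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Compare two candidate classifiers with 5-fold cross-validation
candidates = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(random_state=42),
}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X_train, y_train, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.2f} (+/- {scores.std():.2f})")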

Step 4: Model Evaluation

After training the model, we need to evaluate its performance. We will use several metrics to assess how well the model is performing:

  1. Accuracy: The percentage of correctly predicted labels.
# Evaluate the model on the test data
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2f}")
  2. Confusion Matrix: The confusion matrix shows the number of true positives, true negatives, false positives, and false negatives. It gives us a better understanding of how the model is performing across different classes.
from sklearn.metrics import confusion_matrix

y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
  3. ROC Curve: The Receiver Operating Characteristic (ROC) curve helps us visualize the trade-off between true positive rate (sensitivity) and false positive rate. The Area Under the Curve (AUC) is a useful metric to assess the model's ability to distinguish between the classes.
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

fpr, tpr, thresholds = roc_curve(y_test, model.predict_proba(X_test)[:,1])
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, color='blue', label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic Curve')
plt.legend(loc='lower right')
plt.show()
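
Accuracy can also be misleading if the classes are imbalanced, so it is worth looking at precision, recall, and F1-score for each class. A short sketch using Scikit-learn's classification report, reusing y_test and y_pred from above:

from sklearn.metrics import classification_report

# Precision, recall, and F1-score for each class
print(classification_report(y_test, y_pred, target_names=['No Heart Disease', 'Heart Disease']))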

Step 5: Making Predictions

Now that we have a trained model, we can use it to predict whether a new patient is at risk of heart disease. The input must have the same columns, in the same order, as the encoded training data, and it must be scaled with the same scaler. The model then returns a predicted class and, if we ask for it, the estimated probability of heart disease.

# Make a prediction for a new patient. The input must have the same columns,
# in the same order, as the encoded training data, so we build it from those columns.
feature_cols = df.drop('HeartDisease', axis=1).columns
new_patient = pd.DataFrame([dict.fromkeys(feature_cols, 0.0)])
# Example values only (column names assumed); in practice, also set the
# patient's one-hot encoded columns, e.g. the appropriate Sex_* column
new_patient.loc[0, ['Age', 'RestingBP', 'Cholesterol', 'MaxHR']] = [45, 140, 230, 145]
scaled_new_patient = scaler.transform(new_patient)
prediction = model.predict(scaled_new_patient)[0]
print(f"Prediction: {'Heart Disease' if prediction == 1 else 'No Heart Disease'}")

Conclusion

Building a heart disease prediction model using machine learning is a powerful example of how data science can impact healthcare. By using historical data on patients’ health indicators, we can build a model that predicts the likelihood of heart disease, helping doctors to make more informed decisions and ultimately saving lives.

The key takeaways from this project include:

  • Data preprocessing is critical for ensuring your data is clean and ready for modeling.
  • Logistic Regression is a simple yet effective algorithm for binary classification.
  • Evaluating your model’s performance using metrics like accuracy, confusion matrix, and ROC curves is crucial for understanding how well your model is performing.

Now, you can take this knowledge and start building more advanced models, try different algorithms, or use more complex datasets. The possibilities are endless!

