Live Project: Heart Disease Prediction Using Data Science

Project Info:

This project leverages the Framingham Heart Study dataset to predict the 10-year risk of coronary heart disease (CHD) in patients using logistic regression. The dataset includes demographic, behavioral & medical risk factors collected from over 4,000 individuals. Through data cleaning, preprocessing, EDA, model building, & evaluation, this project demonstrates the complete lifecycle of a machine learning pipeline tailored for binary health classification.


Project Implementation:

  • Imported necessary libraries like Pandas, NumPy, Seaborn, Matplotlib, Statsmodels & sklearn

  • Loaded the dataset, dropped irrelevant columns like education, handled null values & renamed relevant columns

  • Performed EDA to visualize the class imbalance using countplots & trend lines of TenYearCHD occurrences

  • Scaled features using StandardScaler & split the dataset into training (70%) & testing (30%) sets

  • Built a logistic regression model using LogisticRegression() from sklearn

  • Predicted outcomes on the test data & evaluated performance using metrics like accuracy, confusion matrix & classification report

  • Visualized the confusion matrix using seaborn’s heatmap for better interpretation of prediction results


Insights & Key Outcomes:

  • The dataset is highly imbalanced, with significantly more negative (no CHD) cases than positive

  • The logistic regression model showed decent accuracy but struggled with predicting positive (CHD) cases

  • Techniques like class rebalancing, threshold tuning or trying different algorithms (e.g. Random Forest, XGBoost) could improve recall & F1-score for class 1

1: Importing Necessary Libraries

We will import NumPy, Pandas, Matplotlib, Seaborn, Statsmodels and scikit-learn (sklearn) in Python.

  • Statsmodels: a library for statistical modeling, e.g. fitting logistic regression with detailed statistical summaries.
  • Sklearn: provides tools for machine learning modeling, including preprocessing, model fitting and evaluation.

import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt
import seaborn as sns

2: Data Preparation

The dataset comes from an ongoing cardiovascular study of residents of the town of Framingham, Massachusetts. The classification goal is to predict whether a patient has a 10-year risk of future coronary heart disease (CHD). The dataset provides the patients’ information and includes over 4,000 records and 15 attributes.

2.1 Loading and Handling Missing Values from the Dataset

We will load the dataset, drop irrelevant features like “education” from the dataset and rename columns.

  • pd.read_csv(): reads the contents of the CSV file into a DataFrame.
  • disease_df.dropna(axis=0, inplace=True): removes any rows with missing values (NaN) from the DataFrame.
  • disease_df.TenYearCHD.value_counts(): prints the count of each unique value in the TenYearCHD column, which indicates whether a patient developed heart disease within ten years (1) or not (0).

disease_df = pd.read_csv("framingham.csv")
disease_df.drop(columns=['education'], inplace = True)
disease_df.rename(columns ={'male':'Sex_male'}, inplace = True)

disease_df.dropna(axis = 0, inplace = True)
disease_df

print(disease_df.TenYearCHD.value_counts())
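Since dropna removes rows wholesale, it can help to check how many values are missing per column first, to see how much data the drop discards. A minimal sketch on a small stand-in DataFrame (in the project, df would be disease_df):

```python
import pandas as pd
import numpy as np

# Small stand-in frame; in the project this would be disease_df
df = pd.DataFrame({'age': [39, 46, np.nan, 61],
                   'glucose': [77, np.nan, 85, 103]})

# Count missing values per column before dropping any rows
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Rows lost if we drop every row containing a NaN
rows_before = len(df)
rows_after = len(df.dropna(axis=0))
print('Rows dropped:', rows_before - rows_after)
```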

3: Splitting the Dataset into Test and Train Sets

We will split the dataset into training and testing portions. But before that we will transform our data by scaling all the features using StandardScaler.

  • X=preprocessing.StandardScaler().fit(X).transform(X): This scales the features in X to have a mean of 0 and standard deviation of 1 using StandardScaler. Scaling is important for many machine learning models, especially when the features have different units or magnitudes.
  • Training set (70% of data, X_train and y_train)
  • Test set (30% of data, X_test and y_test)
  • random_state=4 ensures the split is reproducible.

X = np.asarray(disease_df[['age', 'Sex_male', 'cigsPerDay', 
                           'totChol', 'sysBP', 'glucose']])
y = np.asarray(disease_df['TenYearCHD'])

X = preprocessing.StandardScaler().fit(X).transform(X)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( 
        X, y, test_size = 0.3, random_state = 4)

print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)
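With an imbalanced target like TenYearCHD, passing stratify=y to train_test_split keeps the class ratio identical in the training and test sets. A hedged variation on the split above, demonstrated on synthetic labels rather than the real dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 90 negatives, 10 positives
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 90 + [1] * 10)

# stratify preserves the 90:10 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=4, stratify=y_demo)

print('positives in train:', y_tr.sum(), 'of', len(y_tr))
print('positives in test :', y_te.sum(), 'of', len(y_te))
```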

4: Exploratory Data Analysis of Heart Disease Dataset

In this step we perform Exploratory Data Analysis (EDA) on the heart disease dataset to understand it and gain insights before building a predictive model.

4.1: Ten-Year CHD Record of all the patients available in the dataset:

  • sns.countplot(x='TenYearCHD', data=disease_df, palette="BuGn_r"): creates a count plot using Seaborn that visualizes the distribution of values in the TenYearCHD column, showing how many individuals have heart disease (1) vs. how many don’t (0).

plt.figure(figsize=(7, 5))
sns.countplot(x='TenYearCHD', data=disease_df,
             palette="BuGn_r")
plt.show()

The count plot shows a high imbalance in the dataset where the majority of individuals (over 3000) do not have heart disease (label 0) while only a small number (around 500) have heart disease (label 1).
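The imbalance can be quantified exactly with value_counts(normalize=True); in the project this would be disease_df['TenYearCHD'].value_counts(normalize=True). A minimal sketch using a stand-in Series with roughly the counts visible in the plot:

```python
import pandas as pd

# Stand-in for disease_df['TenYearCHD']: approximate counts from the plot
target = pd.Series([0] * 3000 + [1] * 500)

# normalize=True turns raw counts into class proportions
proportions = target.value_counts(normalize=True)
print(proportions)
```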

4.2: Counting the number of patients affected by CHD (0 = Not Affected; 1 = Affected)


  • disease_df['TenYearCHD'].plot(): draws the 0/1 target values against the row index, giving a quick visual of where the positive (CHD) cases occur across the dataset.

disease_df['TenYearCHD'].plot()
plt.show()

5: Fitting Logistic Regression Model for Heart Disease Prediction

We will create a simple logistic regression model for prediction.

  • logreg=LogisticRegression(): This creates an instance of the Logistic Regression model.
  • logreg.fit(X_train, y_train): This trains the logistic regression model using the training data (X_train for features and y_train for the target).
  • y_pred=logreg.predict(X_test): This uses the trained logistic regression model to make predictions on the test set (X_test). The predicted values are stored in y_pred.
 
 

from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
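Besides hard 0/1 labels, logistic regression also exposes per-patient class probabilities via predict_proba, which is what makes threshold tuning possible later. A small self-contained sketch on synthetic data (the project model would call logreg.predict_proba(X_test)):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data with a linearly separable-ish target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X_demo, y_demo)

# predict_proba returns [P(class 0), P(class 1)] per row;
# predict() just picks the more probable class
proba = model.predict_proba(X_demo[:5])
labels = model.predict(X_demo[:5])
print(np.round(proba, 3))
print(labels)
```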

6: Evaluating Logistic Regression Model

We will evaluate the model on the test set using accuracy, a classification report (precision, recall, F1-score) and a confusion matrix visualized as a heatmap.


from sklearn.metrics import accuracy_score
print('Accuracy of the model is =', 
      accuracy_score(y_test, y_pred))

from sklearn.metrics import confusion_matrix, classification_report

print('Classification report:')
print(classification_report(y_test, y_pred))

cm = confusion_matrix(y_test, y_pred)
conf_matrix = pd.DataFrame(data = cm, 
                           columns = ['Predicted:0', 'Predicted:1'], 
                           index =['Actual:0', 'Actual:1'])

plt.figure(figsize = (8, 5))
sns.heatmap(conf_matrix, annot = True, fmt = 'd', cmap = "Greens")

plt.show()

The model performs well at predicting no heart disease (class 0) but poorly at predicting heart disease (class 1), resulting in imbalanced classification performance. To enhance performance, techniques such as class balancing, adjusting the decision threshold, or experimenting with different algorithms can help the model correctly identify individuals with heart disease.
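Two of those techniques can be sketched in a few lines: class_weight='balanced' reweights the loss toward the minority class, and lowering the decision threshold flags more patients as positive. A self-contained illustration on synthetic imbalanced data (recall is measured on the training data here purely for illustration, not as a proper evaluation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(4)
# Imbalanced synthetic data: only a small fraction of positives,
# loosely mimicking the TenYearCHD target
X = rng.normal(size=(1000, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 1.2).astype(int)

# Baseline vs. class-weighted logistic regression
plain = LogisticRegression().fit(X, y)
balanced = LogisticRegression(class_weight='balanced').fit(X, y)

# Threshold tuning: flag class 1 whenever P(class 1) >= 0.3
# instead of the default 0.5
proba = plain.predict_proba(X)[:, 1]
tuned_pred = (proba >= 0.3).astype(int)

recall_plain = recall_score(y, plain.predict(X))
recall_balanced = recall_score(y, balanced.predict(X))
recall_tuned = recall_score(y, tuned_pred)
print('recall, default 0.5 threshold :', recall_plain)
print('recall, class_weight=balanced:', recall_balanced)
print('recall, 0.3 threshold        :', recall_tuned)
```

Lowering the threshold can only keep or increase recall, at the cost of more false positives; class weighting shifts the model itself rather than the cutoff.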

©2025 All Rights Reserved PrimePoint Institute