Live Project: Heart Disease Prediction Using Data Science

Project Info:

This project leverages the Framingham Heart Study dataset to predict the 10-year risk of coronary heart disease (CHD) in patients using logistic regression. The dataset includes demographic, behavioral & medical risk factors collected from over 4,000 individuals. Through data cleaning, preprocessing, EDA, model building, & evaluation, this project demonstrates the complete lifecycle of a machine learning pipeline tailored for binary health classification.


Project Implementation:

  • Imported necessary libraries like Pandas, NumPy, Seaborn, Matplotlib, Statsmodels & sklearn

  • Loaded the dataset, dropped irrelevant columns like education, handled null values & renamed relevant columns

  • Performed EDA to visualize the class imbalance using countplots & trend lines of TenYearCHD occurrences

  • Scaled features using StandardScaler & split the dataset into training (70%) & testing (30%) sets

  • Built a logistic regression model using LogisticRegression() from sklearn

  • Predicted outcomes on the test data & evaluated performance using metrics like accuracy, confusion matrix & classification report

  • Visualized the confusion matrix using seaborn’s heatmap for better interpretation of prediction results


Insights & Key Outcomes:

  • The dataset is highly imbalanced, with significantly more negative (no CHD) cases than positive

  • The logistic regression model showed decent accuracy but struggled with predicting positive (CHD) cases

  • Techniques like class rebalancing, threshold tuning or trying different algorithms (e.g. Random Forest, XGBoost) could improve recall & F1-score for class 1

1: Importing Necessary Libraries

We will import NumPy, Pandas, Matplotlib, Seaborn, Statsmodels and scikit-learn (sklearn) in Python.

  • Statsmodels: a library for statistical modeling, e.g. fitting logistic regression with detailed statistical summaries.
  • Sklearn: provides tools for machine learning modeling, including preprocessing, model fitting and evaluation.

import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt
import seaborn as sns

2: Data Preparation

The dataset comes from an ongoing cardiovascular study of residents of the town of Framingham, Massachusetts. The classification goal is to predict whether a patient has a 10-year risk of future coronary heart disease (CHD). The dataset provides the patients’ information and includes over 4,000 records and 15 attributes.

2.1 Loading and Handling Missing Values from the Dataset

We will load the dataset, drop irrelevant features like “education” from the dataset and rename columns.

  • pd.read_csv(): reads the contents of the CSV file into a DataFrame.
  • disease_df.dropna(axis=0, inplace=True): removes any rows with missing values (NaN) from the DataFrame.
  • disease_df.TenYearCHD.value_counts(): prints the count of each unique value in the TenYearCHD column, which indicates whether a patient developed heart disease within ten years (1) or not (0).

disease_df = pd.read_csv("framingham.csv")
disease_df.drop(columns=['education'], inplace = True)
disease_df.rename(columns ={'male':'Sex_male'}, inplace = True)

disease_df.dropna(axis = 0, inplace = True)
disease_df

print(disease_df.TenYearCHD.value_counts())
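Since dropna removes rows wholesale, it can help to check how many values are missing per column first, to see how much data the drop discards. A minimal sketch on a small stand-in DataFrame (in the project, df would be disease_df):

```python
import pandas as pd
import numpy as np

# Small stand-in frame; in the project this would be disease_df
df = pd.DataFrame({'age': [39, 46, np.nan, 61],
                   'glucose': [77, np.nan, 85, 103]})

# Count missing values per column before dropping any rows
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Rows lost if we drop every row containing a NaN
rows_before = len(df)
rows_after = len(df.dropna(axis=0))
print('Rows dropped:', rows_before - rows_after)
```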

3: Splitting the Dataset into Test and Train Sets

We will split the dataset into training and testing portions. But before that we will transform our data by scaling all the features using StandardScaler.

  • X=preprocessing.StandardScaler().fit(X).transform(X): This scales the features in X to have a mean of 0 and standard deviation of 1 using StandardScaler. Scaling is important for many machine learning models, especially when the features have different units or magnitudes.
  • Training set (70% of data, X_train and y_train)
  • Test set (30% of data, X_test and y_test)
  • random_state=4 ensures the split is reproducible.

X = np.asarray(disease_df[['age', 'Sex_male', 'cigsPerDay', 
                           'totChol', 'sysBP', 'glucose']])
y = np.asarray(disease_df['TenYearCHD'])

X = preprocessing.StandardScaler().fit(X).transform(X)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( 
        X, y, test_size = 0.3, random_state = 4)

print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)
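With an imbalanced target like TenYearCHD, passing stratify=y to train_test_split keeps the class ratio identical in the training and test sets. A hedged variation on the split above, demonstrated on synthetic labels rather than the real dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 90 negatives, 10 positives
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 90 + [1] * 10)

# stratify preserves the 90:10 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=4, stratify=y_demo)

print('positives in train:', y_tr.sum(), 'of', len(y_tr))
print('positives in test :', y_te.sum(), 'of', len(y_te))
```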

4: Exploratory Data Analysis of Heart Disease Dataset

In this step we perform Exploratory Data Analysis (EDA) on the heart disease dataset to understand it and gain insights before building a predictive model.

4.1: Ten-Year CHD Record of all the patients available in the dataset:

  • sns.countplot(x='TenYearCHD', data=disease_df, palette="BuGn_r"): creates a count plot using Seaborn that visualizes the distribution of values in the TenYearCHD column, showing how many individuals have heart disease (1) vs. how many don’t (0).

plt.figure(figsize=(7, 5))
sns.countplot(x='TenYearCHD', data=disease_df,
             palette="BuGn_r")
plt.show()

The count plot shows a high imbalance in the dataset where the majority of individuals (over 3000) do not have heart disease (label 0) while only a small number (around 500) have heart disease (label 1).
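The imbalance can be quantified exactly with value_counts(normalize=True); in the project this would be disease_df['TenYearCHD'].value_counts(normalize=True). A minimal sketch using a stand-in Series with roughly the counts visible in the plot:

```python
import pandas as pd

# Stand-in for disease_df['TenYearCHD']: approximate counts from the plot
target = pd.Series([0] * 3000 + [1] * 500)

# normalize=True turns raw counts into class proportions
proportions = target.value_counts(normalize=True)
print(proportions)
```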

4.2: Counting the number of patients affected by CHD (0 = Not Affected; 1 = Affected)


  • disease_df['TenYearCHD'].plot(): draws the 0/1 target values against the row index, giving a quick visual of where the positive (CHD) cases occur across the dataset.

disease_df['TenYearCHD'].plot()
plt.show()

5: Fitting Logistic Regression Model for Heart Disease Prediction

We will create a simple logistic regression model for prediction.

  • logreg=LogisticRegression(): This creates an instance of the Logistic Regression model.
  • logreg.fit(X_train, y_train): This trains the logistic regression model using the training data (X_train for features and y_train for the target).
  • y_pred=logreg.predict(X_test): This uses the trained logistic regression model to make predictions on the test set (X_test). The predicted values are stored in y_pred.
 
 

from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
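Besides hard 0/1 labels, logistic regression also exposes per-patient class probabilities via predict_proba, which is what makes threshold tuning possible later. A small self-contained sketch on synthetic data (the project model would call logreg.predict_proba(X_test)):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data with a linearly separable-ish target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X_demo, y_demo)

# predict_proba returns [P(class 0), P(class 1)] per row;
# predict() just picks the more probable class
proba = model.predict_proba(X_demo[:5])
labels = model.predict(X_demo[:5])
print(np.round(proba, 3))
print(labels)
```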

6: Evaluating Logistic Regression Model

We will evaluate the model on the test set using accuracy, a classification report (precision, recall, F1-score) and a confusion matrix visualized as a heatmap.


from sklearn.metrics import accuracy_score
print('Accuracy of the model is =', 
      accuracy_score(y_test, y_pred))

from sklearn.metrics import confusion_matrix, classification_report

print('Classification report:')
print(classification_report(y_test, y_pred))

cm = confusion_matrix(y_test, y_pred)
conf_matrix = pd.DataFrame(data = cm, 
                           columns = ['Predicted:0', 'Predicted:1'], 
                           index =['Actual:0', 'Actual:1'])

plt.figure(figsize = (8, 5))
sns.heatmap(conf_matrix, annot = True, fmt = 'd', cmap = "Greens")

plt.show()

The model performs well at predicting no heart disease (class 0) but poorly at predicting heart disease (class 1), resulting in imbalanced classification performance. To enhance performance, techniques such as class balancing, adjusting the decision threshold, or experimenting with different algorithms can help the model correctly identify individuals with heart disease.
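Two of those techniques can be sketched in a few lines: class_weight='balanced' reweights the loss toward the minority class, and lowering the decision threshold flags more patients as positive. A self-contained illustration on synthetic imbalanced data (recall is measured on the training data here purely for illustration, not as a proper evaluation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(4)
# Imbalanced synthetic data: only a small fraction of positives,
# loosely mimicking the TenYearCHD target
X = rng.normal(size=(1000, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 1.2).astype(int)

# Baseline vs. class-weighted logistic regression
plain = LogisticRegression().fit(X, y)
balanced = LogisticRegression(class_weight='balanced').fit(X, y)

# Threshold tuning: flag class 1 whenever P(class 1) >= 0.3
# instead of the default 0.5
proba = plain.predict_proba(X)[:, 1]
tuned_pred = (proba >= 0.3).astype(int)

recall_plain = recall_score(y, plain.predict(X))
recall_balanced = recall_score(y, balanced.predict(X))
recall_tuned = recall_score(y, tuned_pred)
print('recall, default 0.5 threshold :', recall_plain)
print('recall, class_weight=balanced:', recall_balanced)
print('recall, 0.3 threshold        :', recall_tuned)
```

Lowering the threshold can only keep or increase recall, at the cost of more false positives; class weighting shifts the model itself rather than the cutoff.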

©2025 All Rights Reserved PrimePoint Institute