QnA Chatbot

Abstract

This study explores the application of machine learning techniques for heart disease prediction using the UCI Heart Disease dataset. The dataset, comprising 920 entries with 16 attributes, underwent extensive preprocessing including handling missing values, outlier treatment, and feature engineering. Eight classification algorithms were implemented to classify patients into heart disease risk categories: K-Nearest Neighbors (KNN), Support Vector Machines (SVM), Logistic Regression, Decision Trees, Random Forest, Naive Bayes, Gradient Boosting, and XGBoost. The preprocessing pipeline used one-hot encoding, ordinal encoding, and imputation to prepare the data. Models were evaluated using metrics such as accuracy, precision, recall, F1 score, and ROC-AUC to identify the most effective classifier. Confusion matrices and visualizations provided insight into the performance of each approach on both training and testing datasets. Performance varied across the algorithms, and the ensemble models scored highest: XGBoost correctly classified 61.41% of records, Gradient Boosting 60.33%, and Random Forest 59.24%, while Naive Bayes trailed at 33.15%. The trained models were saved as pipelines to enable deployment in a Streamlit-based application for real-time predictions. This research highlights the efficacy of machine learning in medical diagnostics, particularly for heart disease, and provides a scalable framework for implementation in clinical decision support systems.

Dataset Description

EDA: Exploratory Data Analysis

The dataset consists of 920 patient records with the following key attributes:

Age: Patient age in years.
Sex: Biological sex (Male/Female).
Chest Pain Type (CP): Type of chest pain experienced.
Resting BP: Systolic blood pressure (mmHg).
Cholesterol: Serum cholesterol (mg/dL).
Fasting Blood Sugar: Whether fasting blood sugar exceeds 120 mg/dL (a value above this threshold indicates diabetes risk).
Resting ECG: Electrocardiogram results.
Max Heart Rate: Maximum achieved heart rate during exercise.
ST Depression: ST depression induced by exercise relative to rest.
Slope: ST segment slope during peak exercise.
CA: Number of major vessels (0-3) colored by fluoroscopy.
Thalassemia: Blood disorder classification.
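
The abstract names imputation and one-hot encoding among the preprocessing steps. As a minimal sketch of what those two transformations do, the plain-Python example below fills a missing numeric value with the column median and expands a categorical column into 0/1 indicators. The column names (`chol`, `cp`) and the records are hypothetical and are not rows from the dataset; the study's own pipeline is not shown in this summary and may differ.

```python
from statistics import median

# Hypothetical records; None marks a missing value. Not real dataset rows.
records = [
    {"chol": 233.0, "cp": "typical angina"},
    {"chol": None, "cp": "asymptomatic"},
    {"chol": 250.0, "cp": "non-anginal"},
    {"chol": 204.0, "cp": "asymptomatic"},
]


def impute_median(rows, column):
    """Replace missing values in a numeric column with the column median."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill = median(observed)
    return [
        {**row, column: fill if row[column] is None else row[column]}
        for row in rows
    ]


def one_hot(rows, column):
    """Expand a categorical column into one 0/1 indicator per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}={category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded


prepared = one_hot(impute_median(records, "chol"), "cp")
```

In a real pipeline the median and the category list would be learned from the training split only and then reused on the test split, so that no information leaks from test to train.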

Methodology

Overview: Eight classification algorithms were evaluated:

K-Nearest Neighbors

Baseline classifier using distance-based voting.
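
Distance-based voting can be sketched in a few lines. The example assumes Euclidean distance and `k=3`, neither of which is stated in this summary, and uses toy feature vectors rather than dataset rows.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Rank training indices by Euclidean distance to the query point.
    order = sorted(
        range(len(train_points)),
        key=lambda i: dist(train_points[i], query),
    )
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy, already-scaled feature vectors (hypothetical, not taken from the dataset).
points = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9), (0.85, 0.75)]
labels = [0, 0, 1, 1, 1]
prediction = knn_predict(points, labels, (0.7, 0.7), k=3)
```

Because the vote depends on raw distances, features on very different scales (for example age versus cholesterol) are usually standardized first.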

Support Vector Machine

Linear kernel to maximize margin between classes.

Logistic Regression

Probabilistic model for binary classification.
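
As a rough illustration of the probabilistic output, the model passes a weighted sum of the features through the sigmoid function and thresholds the result. The coefficients below are hypothetical placeholders, not values fitted to the dataset.

```python
from math import exp


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


def predict_proba(weights, bias, features):
    """Probability of the positive class under a logistic regression model."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)


# Hypothetical coefficients for two standardized features; not fitted to the dataset.
weights = [0.8, -0.5]
bias = -0.2
probability = predict_proba(weights, bias, [1.0, 0.4])
label = int(probability >= 0.5)  # 0.5 is the conventional decision threshold
```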

Decision Tree

Interpretable model based on recursive splits.
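
A single split can be sketched as a search for the threshold that leaves the purest child nodes; a full tree repeats this search recursively on each side. The sketch assumes Gini impurity as the split criterion, which this summary does not specify, and uses hypothetical values.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted impurity) for the best split `value <= threshold`."""
    best = (None, float("inf"))
    # Skip the largest value: splitting there would leave the right side empty.
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


# Hypothetical ages and binary outcomes, for illustration only.
ages = [35, 42, 50, 58, 63, 67]
disease = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_threshold(ages, disease)
```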

Random Forest

Ensemble of trees to improve stability and accuracy.

Gaussian Naive Bayes

Assumes feature independence with Gaussian likelihoods.
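
The independence assumption means the class score is a sum of per-feature Gaussian log-likelihoods plus the log prior. A minimal sketch follows, with hypothetical two-feature rows and a small variance floor added as an assumption for numerical safety.

```python
from math import log, pi
from statistics import mean, pvariance


def fit_gnb(rows, labels):
    """Estimate the class prior and the per-feature mean and variance for each class."""
    model = {}
    for cls in sorted(set(labels)):
        members = [row for row, y in zip(rows, labels) if y == cls]
        columns = list(zip(*members))
        model[cls] = {
            "prior": len(members) / len(rows),
            "means": [mean(col) for col in columns],
            # Small floor (an assumption) avoids division by zero for constant features.
            "variances": [pvariance(col) + 1e-9 for col in columns],
        }
    return model


def log_gaussian(x, mu, var):
    """Log density of a normal distribution with mean `mu` and variance `var`."""
    return -0.5 * log(2 * pi * var) - (x - mu) ** 2 / (2 * var)


def predict_gnb(model, row):
    """Return the class with the highest log posterior (up to a shared constant)."""
    def log_posterior(cls):
        params = model[cls]
        return log(params["prior"]) + sum(
            log_gaussian(x, mu, var)
            for x, mu, var in zip(row, params["means"], params["variances"])
        )
    return max(model, key=log_posterior)


# Hypothetical (resting BP, max heart rate) pairs; not rows from the dataset.
rows = [(130, 150), (120, 160), (125, 155), (160, 110), (150, 120), (155, 115)]
labels = [0, 0, 0, 1, 1, 1]
model = fit_gnb(rows, labels)
```

For example, `predict_gnb(model, (128, 152))` falls near the class 0 means and `predict_gnb(model, (158, 112))` near the class 1 means.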

Gradient Boosting

Sequential boosting to minimize prediction errors.

XGBoost

Optimized gradient boosting with regularization.

Results

Classification Score for Models

Algorithm	Correctly Classified (%)	Incorrectly Classified (%)	Kappa Value	MAE Value	Precision (YES)	Precision (NO)
KNN	56.52	43.48	0.346	0.636	0.345	0.345
SVM	53.80	46.20	0.302	0.658	0.309	0.309
Logistic Regression	54.89	45.11	0.329	0.609	0.317	0.317
Gradient Boosting	60.33	39.67	0.415	0.538	0.447	0.447
Decision Tree	55.43	44.57	0.368	0.625	0.406	0.406
Random Forest	59.24	40.76	0.396	0.576	0.385	0.385
Naive Bayes	33.15	66.85	0.191	1.565	0.265	0.265
XGBoost	61.41	38.59	0.436	0.543	0.521	0.521
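
For readers less familiar with the Kappa and MAE columns, the sketch below shows how accuracy, Cohen's kappa, and mean absolute error can be computed from true and predicted labels. It assumes the MAE is taken over integer class labels, which would explain values above 1 in the table but is not confirmed by this summary. The label lists are hypothetical and do not reproduce the reported figures.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Agreement between labels and predictions, corrected for chance agreement."""
    n = len(y_true)
    observed = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    # Chance agreement: probability that both pick class c independently, summed over classes.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1 - expected)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between integer class labels."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical multi-class labels, for illustration only.
y_true = [0, 0, 1, 1, 2, 2, 0, 1]
y_pred = [0, 1, 1, 1, 2, 0, 0, 2]

acc = accuracy(y_true, y_pred)
kappa = cohen_kappa(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
```

Kappa is lower than accuracy because it discounts the agreement expected by chance, which is why it is a useful companion metric when classes are imbalanced.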

Classification and Comparison of Supervised Machine Learning Algorithms Based on UCI Heart Disease Dataset