This study explores the application of machine learning techniques for heart disease prediction using the UCI Heart Disease dataset. The dataset, comprising 920 entries with 16 attributes, underwent extensive preprocessing including handling missing values, outlier treatment, and feature engineering. Multiple classification algorithms such as K-Nearest Neighbors (KNN), Support Vector Machines (SVM), Logistic Regression, Decision Trees, Random Forest, Naive Bayes, Gradient Boosting, and XGBoost were implemented to classify patients into heart disease risk categories. The preprocessing pipeline utilized transformations like one-hot encoding, ordinal encoding, and imputation to ensure optimal data preparation. Models were evaluated using metrics such as accuracy, precision, recall, F1 score, and ROC-AUC to identify the most effective classifier. Confusion matrices and visualizations provided insight into the performance of each approach on both training and testing datasets. Results demonstrated varying performance among the algorithms, with ensemble models showing higher accuracy and robustness. The trained models were saved as pipelines to enable deployment in a Streamlit-based application for real-time predictions. This research highlights the efficacy of machine learning in medical diagnostics, particularly for heart disease, and provides a scalable framework for implementation in clinical decision support systems.
The dataset consists of 920 patient records pooled from four collection sites (Cleveland, Hungary, Switzerland, and VA Long Beach), with the following key attributes:

- **age**: age in years
- **sex**: patient sex
- **cp**: chest pain type
- **trestbps**: resting blood pressure (mm Hg)
- **chol**: serum cholesterol (mg/dl)
- **fbs**: fasting blood sugar > 120 mg/dl
- **restecg**: resting electrocardiographic results
- **thalach**: maximum heart rate achieved
- **exang**: exercise-induced angina
- **oldpeak**: ST depression induced by exercise relative to rest
- **slope**: slope of the peak exercise ST segment
- **ca**: number of major vessels colored by fluoroscopy
- **thal**: thalassemia test result
- **num**: diagnosed heart disease severity (target)

together with record-id and source-site columns, for 16 attributes in total.
Overview: We evaluated eight classification algorithms:

- **K-Nearest Neighbors (KNN)**: baseline classifier using distance-based voting.
- **Support Vector Machine (SVM)**: linear kernel to maximize the margin between classes.
- **Logistic Regression**: probabilistic model for binary classification.
- **Decision Tree**: interpretable model based on recursive splits.
- **Random Forest**: ensemble of trees to improve stability and accuracy.
- **Naive Bayes**: assumes feature independence with Gaussian likelihoods.
- **Gradient Boosting**: sequential boosting to minimize prediction errors.
- **XGBoost**: optimized gradient boosting with regularization.
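The models above can be benchmarked uniformly by holding them in a dictionary and scoring each on the same held-out split. The sketch below uses synthetic data as a stand-in for the preprocessed feature matrix, and covers the seven scikit-learn estimators; XGBoost's `XGBClassifier` (from the separate `xgboost` package) slots into the same dictionary when installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed feature matrix and labels.
X, y = make_classification(n_samples=400, n_features=12, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(kernel="linear"),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}
# When available: models["XGBoost"] = xgboost.XGBClassifier(random_state=42)

# Fit each model on the training split and score it on the test split.
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

Keeping the estimators behind a common interface is what makes it straightforward to swap any of them into the saved deployment pipeline.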
| Algorithm | Correctly Classified (%) | Incorrectly Classified (%) | Kappa Value | MAE Value | Precision (YES) | Precision (NO) |
|---|---|---|---|---|---|---|
| KNN | 56.52 | 43.48 | 0.346 | 0.636 | 0.345 | 0.345 |
| SVM | 53.80 | 46.20 | 0.302 | 0.658 | 0.309 | 0.309 |
| Logistic Regression | 54.89 | 45.11 | 0.329 | 0.609 | 0.317 | 0.317 |
| Gradient Boosting | 60.33 | 39.67 | 0.415 | 0.538 | 0.447 | 0.447 |
| Decision Tree | 55.43 | 44.57 | 0.368 | 0.625 | 0.406 | 0.406 |
| Random Forest | 59.24 | 40.76 | 0.396 | 0.576 | 0.385 | 0.385 |
| Naive Bayes | 33.15 | 66.85 | 0.191 | 1.565 | 0.265 | 0.265 |
| XGBoost | 61.41 | 38.59 | 0.436 | 0.543 | 0.521 | 0.521 |
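The metrics reported in the table (accuracy, Cohen's kappa, MAE, and per-class precision) can all be computed from a model's predictions with `sklearn.metrics`. The label vectors below are hypothetical placeholders, not results from the study; note that for 0/1 labels the MAE reduces to the misclassification rate, whereas the table's larger MAE values reflect the multi-level severity target.

```python
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             mean_absolute_error, precision_score)

# Hypothetical true and predicted labels (1 = disease, 0 = no disease).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

acc = accuracy_score(y_true, y_pred)        # correctly classified fraction
kappa = cohen_kappa_score(y_true, y_pred)   # agreement beyond chance
mae = mean_absolute_error(y_true, y_pred)   # error rate for binary 0/1 labels
prec_yes = precision_score(y_true, y_pred, pos_label=1)  # Precision (YES)
prec_no = precision_score(y_true, y_pred, pos_label=0)   # Precision (NO)

print(acc, kappa, mae, prec_yes, prec_no)
```

Wrapping these calls over each fitted pipeline yields one table row per algorithm, and the same fitted pipelines can then be persisted (e.g. with `joblib.dump`) for the Streamlit application described in the abstract.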