Machine Learning and Healthcare — Detection of Parkinson’s Disease

Pratham Soneja
Jan 7, 2022


Machine learning has applications across many modern industries, and healthcare is one of them. Given sufficient data, a machine learning model can predict whether or not someone has a disease.

In this article, I’ll explain how to create an XGBoost classification model that predicts whether a person has Parkinson’s disease. The project is coded in Python using a Jupyter notebook.

What are Ensemble Methods?

The goal of ensemble methods in machine learning is to combine the predictions of several base models built with a given learning algorithm in order to improve results. The two main families of ensemble methods are averaging methods (bagging, random forests, etc.) and boosting methods (AdaBoost, gradient tree boosting, etc.).

I’ll discuss boosting methods and walk through a mini project implementing one of the most famous boosting algorithms: Extreme Gradient Boosting (XGBoost).

What is the boosting ensemble method?

Boosting is an iterative method that goes through cycles, adding one model to the ensemble per cycle.

The purpose of boosting is to correct the prediction errors made by the models already in the ensemble. It begins with a single model whose predictions can be quite naive. Those predictions are used to calculate a loss function (mean squared error, for instance). The second model then tries to correct the first model’s prediction errors and reduce the loss; the third model corrects the errors that remain after the second, and so on. The process builds a so-called strong learner out of many purpose-built weak learners.
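To make the cycle concrete, here is a minimal sketch of boosting for regression with squared error, using scikit-learn decision stumps as the weak learners (the synthetic data and the shrinkage factor are illustrative assumptions, not part of the original project):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=200)

prediction = np.zeros_like(y)   # the ensemble starts with a naive prediction
models, nu = [], 0.1            # nu is the shrinkage (learning rate)

for _ in range(100):            # one weak learner per boosting cycle
    residuals = y - prediction              # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
    prediction += nu * tree.predict(X)      # correct the ensemble's errors
    models.append(tree)

print("MSE:", np.mean((y - prediction) ** 2))
```

Each tree is fit not to the targets themselves but to the errors left by the ensemble so far, which is what makes the learners purpose-built.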

AdaBoost was the first algorithm to demonstrate that boosting is an effective ensemble method. Since AdaBoost, many boosting algorithms have been developed, including:

  1. Gradient Boosting
  2. Stochastic Gradient Boosting

What is Gradient Boosting?

Gradient boosting is a machine learning technique used in regression and classification tasks, among others. It produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. When decision trees are the weak learners, the resulting algorithm is called gradient-boosted trees, and it usually outperforms random forests.

A gradient-boosted tree model is built in a stage-wise fashion, as in other boosting methods, but it generalizes them by allowing the optimization of any differentiable loss function via gradient descent.
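In symbols, each stage m adds a weak learner h_m fit to the negative gradient of the loss, scaled by a learning rate ν (this is the standard textbook formulation, not something specific to this project):

```latex
F_m(x) = F_{m-1}(x) + \nu \, h_m(x),
\qquad
h_m \;\text{fit to}\;\;
r_i = -\left[ \frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)} \right]_{F = F_{m-1}}
```

For squared error, the negative gradient r_i is just the residual y_i − F_{m−1}(x_i), which is exactly what each tree in the boosting sketch above is fit to.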

What is XGBoost?

XGBoost stands for Extreme Gradient Boosting. It is an open-source software library designed to be efficient, flexible, and portable, and it implements machine learning algorithms under the gradient boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems quickly and accurately. Its primary focus is performance and speed. (Scikit-learn has its own version of gradient boosting, but XGBoost has some technical advantages.) XGBoost aims at:

  1. Efficient use of available computing resources.
  2. Improved model performance.
  3. Execution speed.

XGBoost Parameters

  1. n_estimators

Specifies how many times to go through the modeling cycle described above. It is equal to the number of models that we include in the ensemble.

  2. early_stopping_rounds

Offers a way to automatically find the ideal value for n_estimators. Early stopping causes the model to stop iterating when the validation score stops improving, even if we aren’t at the hard stop for n_estimators.

  3. learning_rate

Instead of getting predictions by simply adding up the predictions from each component model, we can multiply each model’s predictions by a small number (known as the learning rate) before adding them in. Each tree then contributes less, which generally lets us use more trees without overfitting.

  4. n_jobs

On larger datasets where the runtime is a consideration, you can use parallelism to build your models faster. It’s common to set the parameter n_jobs equal to the number of cores on your machine. On smaller datasets, this won't help.
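Putting the four parameters together, here is a hedged sketch (the synthetic data and the specific values are illustrative; note that recent XGBoost versions take early_stopping_rounds in the constructor, while older versions took it as a fit() argument):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=7)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=7
)

model = XGBClassifier(
    n_estimators=500,          # upper bound on boosting cycles
    learning_rate=0.05,        # shrink each tree's contribution
    n_jobs=4,                  # parallelism; match your core count
    early_stopping_rounds=10,  # stop when the validation score stalls
)
# The eval_set supplies the validation data that early stopping monitors.
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)
print("best iteration:", model.best_iteration)
```

With early stopping, the model typically halts well before the n_estimators cap, which is why a generous cap plus a small learning rate is a common pairing.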

What is Parkinson’s Disease?

Parkinson’s disease is a progressive disorder of the central nervous system that affects movement, inducing tremors and stiffness. It has five stages and, in India alone, is reported to affect more than 1 million individuals every year. The disease is chronic and as yet has no cure. It is a neurodegenerative disorder that affects dopamine-producing neurons in the brain.

Objective — Detection of Parkinson’s Disease

To build a model that accurately predicts whether someone has the disease.

Detection of Parkinson’s Disease — About the machine learning project

In this Python machine learning project, we will build a model using an XGBClassifier with the help of the Python libraries scikit-learn, NumPy, pandas, and XGBoost. We’ll load the data, extract the features and labels, scale the features, split the data set, train the classifier, and then calculate the accuracy of our model.

The data set used here is the UCI ML Parkinson’s data set, available from the UCI Machine Learning Repository. It has 24 columns and 195 records.
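A quick way to confirm the shape after downloading (parkinsons.data is the file name the UCI repository uses; adjust the path to wherever you saved it):

```python
import pandas as pd

df = pd.read_csv("parkinsons.data")
print(df.shape)                     # expected: (195, 24)
print(df["status"].value_counts())  # status: 1 = Parkinson's, 0 = healthy
```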

Steps involved in building the model

  1. Importing the necessary libraries
  2. Importing the data set
  3. Getting the features and labels
  4. Scaling the features
  5. Splitting the data set
  6. Training the model
  7. Calculating the accuracy
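Here is a minimal sketch of the whole pipeline. The details follow the choices this kind of project usually makes (MinMaxScaler to the range (-1, 1), an 80/20 split, and the UCI file layout with a name identifier column and a status label column); treat them as assumptions to adjust for your setup:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Steps 1-2: import libraries and the data set
df = pd.read_csv("parkinsons.data")

# Step 3: features are every column except the 'name' id and the 'status' label
features = df.drop(columns=["name", "status"]).values
labels = df["status"].values

# Step 4: scale the features to (-1, 1)
scaler = MinMaxScaler(feature_range=(-1, 1))
features = scaler.fit_transform(features)

# Step 5: split the data set (80% train, 20% test)
x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=7
)

# Step 6: train the model
model = XGBClassifier()
model.fit(x_train, y_train)

# Step 7: calculate the accuracy
y_pred = model.predict(x_test)
print("Accuracy:", accuracy_score(y_test, y_pred) * 100)
```

With a setup along these lines the project reaches the 94.87% accuracy quoted in the conclusion, though the exact figure can vary with the split and library versions.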

Why use XGBoost?

A design goal was to make the best use of available resources to train the model.

The algorithm differentiates itself in the following ways:

  1. A wide range of applications: Can be used to solve regression, classification, ranking, and user-defined prediction problems.
  2. Portability: Runs smoothly on Windows, Linux, and OS X.
  3. Languages: Supports all major programming languages including C++, Python, R, Java, and Scala.

Some key algorithm implementation features include:

  1. Sparse Aware implementation with automatic handling of missing data values.
  2. Block Structure to support the parallelization of tree construction.
  3. Continued Training so that you can further boost an already fitted model on new data.
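As an illustration of the first and third points, XGBoost accepts missing values directly and can continue boosting from an already fitted model (the tiny arrays here are purely illustrative):

```python
import numpy as np
from xgboost import XGBClassifier

# Sparse-aware: np.nan entries are routed down a learned default branch.
X = np.array([[1.0, np.nan], [2.0, 0.5], [np.nan, 1.5], [3.0, 2.0]] * 25)
y = np.array([0, 1, 0, 1] * 25)

model = XGBClassifier(n_estimators=20).fit(X, y)

# Continued training: pass the fitted booster to keep boosting on new data.
model2 = XGBClassifier(n_estimators=20)
model2.fit(X, y, xgb_model=model.get_booster())
```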

Conclusion

Here the boosting model predicts with an accuracy of 94.87%, which is a pretty good score. With careful parameter tuning, highly accurate models can be trained. XGBoost is one of the leading software libraries for working with standard tabular data, and many Kaggle users rely on it in competitions. Microsoft later released its own gradient boosting framework, LightGBM, led by Guolin Ke. Another alternative to XGBoost is CatBoost, developed by Yandex, which handles categorical features using a permutation-driven method. There is always more to explore in every topic, but I hope I was able to share my thoughts on gradient boosting. I hope you found this article both interesting and insightful. Feel free to share your thoughts in the comments below.

Please feel free to check out my profile, Pratham Soneja, and other articles, and reach out to me on LinkedIn if you have any questions or comments.

References

  1. XGBoost documentation
  2. CatBoost documentation
  3. LightGBM documentation
  4. Ensemble Learning
  5. Ensemble methods — Scikit-learn documentation
  6. Gradient Boosting
  7. A Gentle Introduction to XGBoost
