In this blog post, we delve into classification techniques using the scikit-learn library in Python. We’ll explore the modular development of different classification models and compare their performance on the Breast Cancer Wisconsin (Diagnostic) dataset.

This is a very basic example to get you started with ML and scikit-learn.

I have chosen the classification techniques below.

  • Logistic Regression: A linear model suitable for binary classification.
  • Random Forest: An ensemble of decision trees offering versatility and robustness.
  • XGBoost: An optimized gradient boosting algorithm known for its efficiency.
  • Neural Network (MLP): A multi-layer perceptron for non-linear classification.

Let’s start by loading the data for our models. I have created a “data_processing.py” file, which handles data loading, preprocessing, and splitting. It consists of functions for loading the dataset, scaling the features, and splitting the data into training and testing sets.

# data_processing.py
import pandas as pd
# Loader for the breast cancer dataset bundled with scikit-learn
from sklearn.datasets import load_breast_cancer
# StandardScaler rescales each feature to zero mean and unit variance
# using that feature's mean (μ) and standard deviation (σ)
from sklearn.preprocessing import StandardScaler
# Splits the data into a training set and a test set
from sklearn.model_selection import train_test_split


# Load the dataset into a DataFrame with a 'target' column
def load_data():
    cancer = load_breast_cancer()
    data = pd.DataFrame(data=cancer.data, columns=cancer.feature_names)
    data['target'] = cancer.target
    return data


# Separate the features from the target and standardize the features
def process_data(data):
    X = data.drop('target', axis=1)
    y = data['target']

    # Note: the scaler is fitted on the full dataset here, before the split,
    # so μ and σ also reflect the rows that later end up in the test set.
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    return X_scaled, y


# Split the data into training and test sets
def split_data(X, y, test_size=0.2, random_state=42):
    return train_test_split(X, y, test_size=test_size, random_state=random_state)

Here we use the load_breast_cancer function to load the dataset that ships with scikit-learn. Next, we will create a file for each classification technique. Each file defines one function that takes the same training and test sets, trains the model, and returns the model along with its accuracy, confusion matrix, and classification report.
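
One thing to be aware of: process_data fits the StandardScaler on every row before split_data is called, so the test rows influence the scaling. For a basic example this hardly matters, but a stricter variant splits first and fits the scaler on the training rows only. The sketch below shows that order; the function name split_then_scale is made up for illustration and is not part of data_processing.py.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def split_then_scale(test_size=0.2, random_state=42):
    """Hypothetical helper: split the raw features first, then scale."""
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    scaler = StandardScaler()
    # fit_transform learns the mean and std from the training rows only
    X_train_scaled = scaler.fit_transform(X_train)
    # transform reuses the training mean and std for the test rows
    X_test_scaled = scaler.transform(X_test)
    return X_train_scaled, X_test_scaled, y_train, y_test


X_train_scaled, X_test_scaled, y_train, y_test = split_then_scale()
print(X_train_scaled.shape, X_test_scaled.shape)
```

With this order, the training features have zero mean and unit variance per column, while the test features are only approximately standardized, which is what a model would see on genuinely new data.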

Below is the code for logistic_regression_model.py:

# Import the LogisticRegression model
from sklearn.linear_model import LogisticRegression
# Import a few metrics to see how well the model did
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


# Build, train, and evaluate the logistic regression model
def train_and_evaluate_logistic_regression(X_train, X_test, y_train, y_test):
    # Build and fit the model
    model = LogisticRegression(random_state=42)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)

    accuracy = accuracy_score(y_test,y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)
    class_report = classification_report(y_test, y_pred)

    return model, accuracy, conf_matrix, class_report

Below is the code for random_forest_model.py:

# Import the random forest classifier
from sklearn.ensemble import RandomForestClassifier
# Import a few metrics to see how well the model did
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


# Build, train, and evaluate the random forest model
def train_and_evaluate_random_forest(X_train, X_test, y_train, y_test):
    # Build and fit the model
    model = RandomForestClassifier(random_state=42)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)

    accuracy = accuracy_score(y_test,y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)
    class_report = classification_report(y_test, y_pred)

    return model, accuracy, conf_matrix, class_report

Below is the code for xgboost_model.py:

# Import the XGBoost classifier
from xgboost import XGBClassifier
# Import a few metrics to see how well the model did
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


def train_and_evaluate_xgboost(X_train, X_test, y_train, y_test):
    # Let's build our model
    model = XGBClassifier(random_state=42)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)
    class_report = classification_report(y_test, y_pred)

    return model, accuracy, conf_matrix, class_report

Below is the code for mlpclassifier_model.py:

# mlpclassifier_model.py
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


def train_and_evaluate_MLPClassifier(X_train, X_test, y_train, y_test):
    # Build the model; max_iter is raised so the optimizer has room to converge
    model = MLPClassifier(random_state=42, max_iter=1000)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)
    class_report = classification_report(y_test, y_pred)

    return model, accuracy, conf_matrix, class_report
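
The four model files above differ only in the estimator they construct, so the shared fit, predict, and score steps could also live in one place. The following is a minimal sketch of that idea, not a replacement for the files above; the helper name train_and_evaluate is hypothetical, and it assumes the model passed in exposes fit and predict the way the four classifiers here do.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split


def train_and_evaluate(model, X_train, X_test, y_train, y_test):
    """Hypothetical shared helper: fit any classifier and return the same 4-tuple."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)
    class_report = classification_report(y_test, y_pred)

    return model, accuracy, conf_matrix, class_report


# Usage with a random forest, which does not need scaled features
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model, accuracy, conf_matrix, class_report = train_and_evaluate(
    RandomForestClassifier(random_state=42), X_train, X_test, y_train, y_test
)
print(f"Accuracy: {accuracy:.3f}")
```

Keeping one file per model, as this post does, is easier to read when you are starting out; a shared helper mainly pays off once the list of models grows.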

Let’s wire up everything in main.py

import matplotlib.pyplot as plt
# Data loading, scaling, and splitting helpers
from data_processing import load_data, process_data, split_data
# Logistic regression
from logistic_regression_model import train_and_evaluate_logistic_regression
# Random forest
from random_forest_model import train_and_evaluate_random_forest
# XGBoost
from xgboost_model import train_and_evaluate_xgboost
# Neural network (MLP)
from mlpclassifier_model import train_and_evaluate_MLPClassifier


def plot_performance(models, accuracies):
    plt.figure(figsize=(10, 6))

    plt.bar(models, accuracies, color=['skyblue', 'lightgreen', 'lightcoral', 'lightsalmon'])
    plt.ylim(0, 1)  # Accuracy is a fraction, so the y-axis runs from 0 to 1
    plt.title('Model Performance Comparison')
    plt.ylabel('Accuracy')
    plt.show()

if __name__ == "__main__":
    data = load_data()
    X, y = process_data(data)
    X_train, X_test, y_train, y_test = split_data(X, y)

    models = ['Logistic Regression', 'Random Forest', 'XGBoost', 'Neural Network']

    lr_model, lr_accuracy, _, _ = train_and_evaluate_logistic_regression(X_train, X_test, y_train, y_test)
    rf_model, rf_accuracy, _, _ = train_and_evaluate_random_forest(X_train, X_test, y_train, y_test)
    xg_model, xg_accuracy, _, _ = train_and_evaluate_xgboost(X_train, X_test, y_train, y_test)
    nn_model, nn_accuracy, _, _ = train_and_evaluate_MLPClassifier(X_train, X_test, y_train, y_test)

    all_accuracies = [lr_accuracy, rf_accuracy, xg_accuracy, nn_accuracy]

    print(all_accuracies)
    plot_performance(models, all_accuracies)

The main script orchestrates the entire process:

  • Loads and processes the data.
  • Trains and evaluates each classification model.
  • Compares the accuracy of different models.
  • Plots a bar chart for visualizing performance.
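
main.py keeps only the accuracy and discards the confusion matrix and classification report with `_`. For a medical dataset those two are worth a look, because they show which kind of mistake a model makes. Here is a small sketch for one model, assuming logistic regression with the scaler wrapped in a pipeline (my choice for this example, not how the files above are organized). In this dataset, label 0 means malignant and label 1 means benign.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=42
)

# The scaler sits inside the pipeline, so it is fitted on the training rows only
model = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true labels, columns are predicted labels, both in the order [0, 1]
mal_as_mal, mal_as_ben, ben_as_mal, ben_as_ben = confusion_matrix(y_test, y_pred).ravel()
print(f"malignant predicted as malignant: {mal_as_mal}")
print(f"malignant predicted as benign:    {mal_as_ben}")
print(f"benign predicted as malignant:    {ben_as_mal}")
print(f"benign predicted as benign:       {ben_as_ben}")

# Per-class precision, recall, and F1, labelled with the dataset's class names
print(classification_report(y_test, y_pred, target_names=list(cancer.target_names)))
```

Two models with the same accuracy can differ in how many malignant cases they miss, which the bar chart alone cannot show.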

Code Execution

  1. Import necessary libraries in each module.
  2. Implement functions for loading, processing, and training models.
  3. Use main.py to execute the entire process.

You may have to install the dependent libraries first. To do so, run:

pip install pandas scikit-learn matplotlib xgboost

Ensure you run this command in your Python environment before executing the provided code. If you haven’t installed xgboost before, it may require additional dependencies, such as a C++ compiler. Refer to the XGBoost installation guide for platform-specific instructions.
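
To confirm that the four packages from the pip command are available in the active environment, a small check like the one below could be run first. It is only a sketch: the helper name installed_versions is made up for this post, and it relies on the standard-library importlib.metadata module (Python 3.8 or newer).

```python
from importlib import metadata

# Distribution names exactly as they appear in the pip command
PACKAGES = ["pandas", "scikit-learn", "matplotlib", "xgboost"]


def installed_versions(packages=PACKAGES):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```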

After executing main.py, I got the result below, which shows that almost all of the algorithms performed equally well.

Conclusion

Scikit-learn provides a powerful and unified interface for implementing various classification techniques. By breaking down our code into modular components, we were able to easily compare the performance of different classifiers on a real-world dataset. Whether you’re dealing with linear or non-linear relationships, scikit-learn has you covered with its extensive range of algorithms.
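
The unified interface is easy to see in a loop: every estimator is created, fitted, and scored with the same calls. The sketch below assumes only the three scikit-learn models from this post, so that it runs without xgboost; XGBClassifier follows the same fit/predict convention and could be added to the dictionary in the same way.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

classifiers = {
    "Logistic Regression": LogisticRegression(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Neural Network": MLPClassifier(random_state=42, max_iter=1000),
}

accuracies = {}
for name, clf in classifiers.items():
    # Same three steps for every model: wrap with a scaler, fit, score
    pipeline = make_pipeline(StandardScaler(), clf)
    pipeline.fit(X_train, y_train)
    accuracies[name] = pipeline.score(X_test, y_test)  # mean accuracy on the test split
    print(f"{name}: {accuracies[name]:.3f}")
```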

Feel free to experiment with other datasets and classification techniques using scikit-learn to further enhance your understanding of machine learning classification.
