Introduction to Scikit-learn: classifying poisonous mushrooms and glass types

Introduction

This tutorial is an introduction to using Scikit-learn for machine learning in Python, focused on building a classifier to separate poisonous from edible mushrooms and to separate different types of glass. Scikit-learn is an excellent library for this purpose: it does a lot of useful things right out of the box, which saves coding time, it is easy to use, and plenty of examples can be found around the web.

Among other things, the scikit-learn library can be used for clustering, classification and regression, and extracting features. At the same time, the interface and the way to set up models is consistent no matter what field you apply it to, which makes it feasible to tackle large projects with different components with the same package.

This tutorial focuses on building and training a classifier, after which it is validated and evaluated to cross-check that it functions as intended. It also covers hyper-parameter optimization and a few techniques for getting a better understanding of complex datasets.

Data

We will use two datasets that are popular in the machine learning community in this guide, namely the mushroom dataset and the glass dataset. These are also popular in exercises for students on deep learning.

The glass dataset contains six types of glass, which are identified by the type of minerals found in their composition (such as Fe, K, Na). The data is numerical, which means that it only contains numbers. This makes it easier to work with, particularly with packages such as numpy that are built around numerical arrays.

The mushroom dataset contains data on edible and poisonous mushrooms. It contains some non-numerical values as well. Hence we will encode the information first before we work with the data.

You can download the data from the links below. Save the datasets by right clicking on them and selecting 'Save as':

These data originate from the Machine Learning Repository at University of California Irvine (here and here).

Classifying the glass data

Before we get to the mushroom data, let’s first investigate the more straightforward glass dataset. We will import it in Python using the pandas package, which is most commonly used for data science in Python. Before classifying the data we will use pandas to get it in the right shape.

Make sure you have the necessary Python packages. If not, install them as follows in your terminal:

pip install pandas
pip install numpy
pip install matplotlib
pip install seaborn
pip install scikit-learn

Then import these and the scikit-learn libraries (following conventions imported as 'sklearn') using the following code in a Python file:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import time

# display() is built into Jupyter notebooks; this import makes it available elsewhere
from IPython.display import display

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, LabelEncoder

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

A couple of things to note about the imports:

  • The StandardScaler of scikit-learn - sklearn in the code above - is a class designed for standardizing the features of a dataset (zero mean and unit variance)
  • The LabelEncoder class will be used to label-encode all the categorical features in the mushroom dataset (i.e. assign unique numbers to categories)
  • For visualization, to draw graphs, we use the Seaborn library, which is based on matplotlib
  • The other imports are the classifiers that will be used to classify the dataset

Whilst working on the datasets, it is important to consider questions regarding the data quality and content, such as:

  • Does our dataset contain numerical, categorical, geographic, or any other kind of data?
  • Is our dataset complete, or is it missing some data?
  • Is there any redundant data in our dataset?
  • Are there significant differences between the values of the features? If so we may have to normalize the dataset before proceeding.
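
The questions above can be turned into a few quick pandas checks. The following is a minimal sketch on a made-up toy frame (the column names and values are illustrative assumptions, not taken from the tutorial's datasets):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a real dataset (values are made up for illustration).
df_toy = pd.DataFrame({
    "Na": [13.6, 13.9, np.nan, 13.2],
    "Fe": [0.00, 0.10, 0.00, 0.24],
    "colour": ["clear", "green", "clear", "clear"],
    "constant": [1, 1, 1, 1],
})

# 1) Which kind of data does each column hold?
print(df_toy.dtypes)

# 2) How many values are missing per column?
missing_per_column = df_toy.isnull().sum()
print(missing_per_column)

# 3) Which columns are redundant because they hold a single value?
constant_columns = [col for col in df_toy.columns if df_toy[col].nunique() <= 1]
print(constant_columns)

# 4) Do the numerical features live on very different scales?
print(df_toy.describe().loc[["min", "max", "std"]])
```

The same four checks can be run unchanged on any DataFrame loaded with pd.read_csv().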

Let's load the dataset.

filename_glass = './data/glass.csv'
df_glass = pd.read_csv(filename_glass)

print(df_glass.shape)
display(df_glass.head())
display(df_glass.describe())

As you can see, the data consists of 10 columns and 214 rows. All of them contain numerical data only. There is also no missing information to worry about, and the feature values are around the same order of magnitude.

Hence for now we do not need to spend time on preprocessing the dataset.

The describe() method of the pandas library gives us a quick overview of the dataset, in particular:

  • The number of rows included in the dataset
  • Descriptive statistics such as the mean, standard deviation, maximum and minimum values

Classifying and validating the data

Let's start building and training the actual classifier, which will enable us to classify and find out the type of glass of every entry in the dataset.

The first step is to split the dataset into two separate sets: a training set and a test set. The training set is used to train the classifier, while the test set is used to check how accurate the classifier is on data it has not seen. Conventionally a 70% - 30% ratio is used when splitting the dataset.

It’s important to check the distribution of the different classes in both the training and the test datasets. The two sets should feature roughly the same class distribution, otherwise the results will be skewed when switching from one to the other. Splitting the original dataset randomly achieves this approximately:

def get_train_test(df, y_col, x_cols, ratio):
    mask = np.random.rand(len(df)) < ratio
    df_train = df[mask]
    df_test = df[~mask]

    Y_train = df_train[y_col].values
    X_train = df_train[x_cols].values
    Y_test = df_test[y_col].values
    X_test = df_test[x_cols].values
    return df_train, df_test, X_train, Y_train, X_test, Y_test

y_col_glass = 'typeglass'
x_cols_glass = list(df_glass.columns.values)
x_cols_glass.remove(y_col_glass)

train_test_ratio = 0.7
df_train, df_test, X_train, Y_train, X_test, Y_test = get_train_test(df_glass, y_col_glass, x_cols_glass, train_test_ratio)
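
The random mask above yields only approximately a 70/30 split, and the class proportions in the two parts can drift apart when some classes are rare. One possible alternative, sketched here on made-up data rather than on the glass dataset, is scikit-learn's train_test_split with its stratify argument:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 3 features, imbalanced labels (80 of class 0, 20 of class 1).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([0] * 80 + [1] * 20)

# stratify=y_toy keeps the class proportions equal in both subsets;
# random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.7, stratify=y_toy, random_state=42
)

# Class counts in the training and test parts.
print(np.bincount(y_tr), np.bincount(y_te))
```

Stratification needs at least two samples of every class, so very rare classes may have to be merged or dropped first.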

Now that the dataset is split into training and test sets, we can create the classification model. We have no prior reason to believe that one type of classifier will predict the type of glass more accurately than another on this particular dataset. For instance, we don't know whether a 'functional approach' classifier like Logistic Regression is best, or if we'd better apply Gradient Boosting or a Random Forest approach.

One of the advantages of sklearn is that you can include and test all of these classifiers at the same time. We will include and optimize each to see which classifier performs the best. To do so we will create a dictionary that includes all desired classifiers, using their names as keys and instances of the said classifiers as values.

dict_classifiers = {
    "Logistic Regression": LogisticRegression(),
    "Nearest Neighbors": KNeighborsClassifier(),
    # SVC() uses an RBF kernel by default; pass kernel='linear' for a linear SVM
    "SVM (RBF kernel)": SVC(),
    "Gradient Boosting Classifier": GradientBoostingClassifier(n_estimators=1000),
    "Decision Tree": tree.DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=1000),
    "Neural Net": MLPClassifier(alpha = 1),
    "Naive Bayes": GaussianNB(),
    #"AdaBoost": AdaBoostClassifier(),
    #"QDA": QuadraticDiscriminantAnalysis(),
    #"Gaussian Process": GaussianProcessClassifier()
}

We now iterate over each item in the dictionary, whilst doing the following:

  • Train the classifier with .fit(X_train, Y_train)
  • See how a particular classifier performs on the training set by using .score(X_train, Y_train)
  • Do the same for the test set with .score(X_test, Y_test)
  • Save our trained models in the end, along with respective training and test scores, as well as the training times, into a new dictionary. You can use the pickle module to save the dictionary as a special Python object directly to disk.
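
As a minimal sketch of the last step, the snippet below pickles a small results dictionary and loads it again. The synthetic data, the file name dict_models.pkl and the temporary directory are assumptions made for illustration only:

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Small synthetic dataset, only here so that there is a model to save.
X_demo, y_demo = make_classification(n_samples=60, n_features=4, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Same layout as the results dictionary described above.
results = {"Decision Tree": {"model": model, "train_score": model.score(X_demo, y_demo)}}

# Write the dictionary to disk and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "dict_models.pkl"  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(results, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model makes the same predictions as the original one.
restored_model = restored["Decision Tree"]["model"]
same_predictions = bool((restored_model.predict(X_demo) == model.predict(X_demo)).all())
print(same_predictions)
```

Pickle files can execute code when loaded, so only unpickle files that you created yourself or otherwise trust.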

Let's define a function for this, which takes the training and test sets, i.e. the matrices X and Y. It then fits all of the classifiers specified in dict_classifiers above on them. The trained models and corresponding accuracies are saved in a dictionary.

  • Note: the SVM, Random Forest and Gradient Boosting classifiers usually take some more time to train. Hence with large datasets it is best to train them on a smaller subsample first and then decide, based on the accuracy they deliver, whether you want to comment them out or not.

def batch_classify(X_train, Y_train, X_test, Y_test, n_classifiers = 5, verbose = True):
    """Fit the first n_classifiers of dict_classifiers; return models, scores and training times."""
    dict_models = {}
    for classifier_name, classifier in list(dict_classifiers.items())[:n_classifiers]:
        # time only the training step
        t_start = time.time()
        classifier.fit(X_train, Y_train)
        t_diff = time.time() - t_start

        train_score = classifier.score(X_train, Y_train)
        test_score = classifier.score(X_test, Y_test)

        dict_models[classifier_name] = {'model': classifier, 'train_score': train_score, 'test_score': test_score, 'train_time': t_diff}
        if verbose:
            print("trained {c} in {f:.2f} s".format(c=classifier_name, f=t_diff))
    return dict_models


def display_dict_models(dict_models, sort_by='test_score'):
    """Show one row per classifier, sorted by the chosen column (highest first)."""
    rows = [{'classifier': name,
             'train_score': result['train_score'],
             'test_score': result['test_score'],
             'train_time': result['train_time']}
            for name, result in dict_models.items()]
    df2 = pd.DataFrame(rows, columns = ['classifier', 'train_score', 'test_score', 'train_time'])
    display(df2.sort_values(by=sort_by, ascending=False))

The score() method in the code above returns the mean accuracy, the same value that the accuracy_score() function of the metrics module computes. This module contains a lot of useful functions to evaluate classification and regression models.

To calculate the precision, the recall and the f1-score of each class in the dataset, the classification_report function is used. This information is very important if you want to improve the accuracy of a classifier, or if you want to cross-check whether you obtain the expected results in terms of accuracy.
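
A minimal sketch of both functions, using made-up true and predicted labels for a three-class problem rather than output from the glass classifiers:

```python
from sklearn.metrics import accuracy_score, classification_report

# Made-up true and predicted labels for a three-class problem.
y_true = [1, 1, 1, 2, 2, 3, 3, 3, 3, 3]
y_pred = [1, 1, 2, 2, 2, 3, 3, 3, 1, 3]

# Fraction of labels predicted correctly (8 out of 10 here).
accuracy = accuracy_score(y_true, y_pred)
print("accuracy:", accuracy)

# Text table with precision, recall and f1-score per class.
print(classification_report(y_true, y_pred))

# output_dict=True returns the same numbers as a nested dictionary.
report = classification_report(y_true, y_pred, output_dict=True)
print(report["3"]["recall"])
```

With a fitted classifier, y_pred would come from classifier.predict(X_test) and y_true would be Y_test.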

The accuracy on the training set and the test set is stored for every classifier as well: the display_dict_models() function can be used to inspect the results and order them by score/accuracy.

dict_models = batch_classify(X_train, Y_train, X_test, Y_test, n_classifiers = 8)
display_dict_models(dict_models)

It may not look pretty from a Pythonic perspective, but this is a convenient way to deal with more than one classifier at the same time and see which one delivers the best results regarding the dataset in question. After the initial testing phase, you can further refine this, for instance by looping over the top three classifiers and tweaking their parameters to improve their accuracy.
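
Because the split is random, the scores from a single train/test split change from run to run. One way to get a steadier comparison, which is a different technique from the single split used above, is k-fold cross-validation. The sketch below uses scikit-learn's bundled iris data as a stand-in and two of the classifiers from the dictionary:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# The bundled iris data is a stand-in here; it is not one of the tutorial's datasets.
X_iris, y_iris = load_iris(return_X_y=True)

candidates = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
}

cv_results = {}
for name, clf in candidates.items():
    # cv=5: the data is cut into five folds, each used once as the held-out part.
    scores = cross_val_score(clf, X_iris, y_iris, cv=5)
    cv_results[name] = (scores.mean(), scores.std())
    print("{}: {:.3f} +/- {:.3f}".format(name, scores.mean(), scores.std()))
```

The standard deviation over the folds gives a rough idea of how much of the difference between two classifiers is just noise.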

The most accurate model in this case is the Gradient Boosting classifier. This classifier, and similar ensemble methods like Random Forest, xgboost or some other form of boosting, usually perform pretty well (e.g. on Kaggle problems). Scikit-learn comes with extensive documentation on the theory behind these classifiers and why some may perform better, in some cases, than others.

Hyper-parameter optimization on the chosen classifier

We tried different approaches to select a classifier, and determined which one performs best. Now, we can tweak it by optimizing its hyper-parameters as follows:

GDB_params = {
    'n_estimators': [250, 500, 2000],
    'learning_rate': [0.5, 0.1, 0.01, 0.001],
    'criterion': ['friedman_mse', 'squared_error']  # older scikit-learn releases also accepted 'mse' and 'mae'
}

df_train, df_test, X_train, Y_train, X_test, Y_test = get_train_test(df_glass, y_col_glass, x_cols_glass, 0.6)

for n_est in GDB_params['n_estimators']:
    for lr in GDB_params['learning_rate']:
        for crit in GDB_params['criterion']:
            clf = GradientBoostingClassifier(n_estimators=n_est,
                                            learning_rate = lr,
                                            criterion = crit)
            clf.fit(X_train, Y_train)
            train_score = clf.score(X_train, Y_train)
            test_score = clf.score(X_test, Y_test)
            print("For ({}, {}, {}) - train, test score: \t {:.5f} \t-\t {:.5f}".format(n_est, lr, crit[:4], train_score, test_score))
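
The nested loops above compare parameter combinations by their test score, which lets information from the test set leak into the choice. A common alternative, not used in the code above, is scikit-learn's GridSearchCV, which cross-validates every combination on the training set only. The sketch below uses the bundled iris data and a deliberately tiny grid as stand-ins:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# The bundled iris data is a stand-in here; it is not one of the tutorial's datasets.
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, train_size=0.7, stratify=y_iris, random_state=0
)

# Deliberately small grid so that the sketch runs in seconds.
param_grid = {
    "n_estimators": [20, 50],
    "learning_rate": [0.1, 0.01],
}

search = GridSearchCV(GradientBoostingClassifier(random_state=0), param_grid, cv=3)
search.fit(X_tr, y_tr)  # cross-validates every combination on the training set only

print("best parameters:", search.best_params_)
print("cross-validated accuracy: {:.3f}".format(search.best_score_))
print("held-out test accuracy:   {:.3f}".format(search.score(X_te, y_te)))
```

The test set is touched only once, at the very end, so the final score stays an honest estimate.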

Classifying poisonous versus edible mushrooms

We now investigate the mushroom dataset, containing data on which mushrooms are edible and which are poisonous. There are 8124 mushrooms defined in the dataset, of which 4208 are edible and 3916 are poisonous. Each of these is characterized by 22 features.

The difference with the glass dataset is that we do not have numerical values to work with. The mushroom dataset instead contains categorical values. Hence we take the extra step in the classification process of encoding the values. First we load the data:

filename_mushrooms = './data/mushrooms.csv'
df_mushrooms = pd.read_csv(filename_mushrooms)
display(df_mushrooms.head())

To find out the categories, we print the unique values in each column. We will then also check whether there are missing values or redundant columns that can be removed.

for col in df_mushrooms.columns.values:
    print(col, df_mushrooms[col].unique())

There are 22 categorical features in the dataset. One of the features, called “veil-type”, contains only one value “p”, which is not very helpful and does not add any value to the classifier. Hence we could remove this column and similar ones:

for col in df_mushrooms.columns.values:
    if len(df_mushrooms[col].unique()) <= 1:
        print("Removing column {}, which only contains the value: {}".format(col, df_mushrooms[col].unique()[0]))
        # actually drop the constant column
        del df_mushrooms[col]

Missing values

Datasets might contain missing values, represented as '', NaN or Null depending on the type of variable (numeric vs string) and the Python library (e.g. numpy vs pandas) you use to load the data. We need to deal with these values. In the mushroom dataset, missing values in the 'stalk-root' column are marked with '?'.

Here are some of the things to consider:

  • Depending on the number of rows with missing data, you can drop them from the dataset altogether. You may lose information, however, or have a misrepresented dataset, if there are too many.
  • A missing value might also be information in itself and may be classified as a separate entity.
  • You may also want to impute the missing values (e.g. taking the mean or median)

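Dropping entire rows if they contain missing values:

print("Number of rows in total: {}".format(df_mushrooms.shape[0]))
print("Number of rows with a missing 'stalk-root' value: {}".format((df_mushrooms['stalk-root'] == '?').sum()))
# keep only the rows in which 'stalk-root' is known
df_mushrooms_dropped_rows = df_mushrooms[df_mushrooms['stalk-root'] != '?'].copy(deep=True)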

Dropping the entire column if a certain percentage of it is missing:

drop_percentage = 0.8

df_mushrooms_dropped_cols = df_mushrooms.copy(deep=True)
df_mushrooms_dropped_cols.loc[df_mushrooms_dropped_cols['stalk-root'] == '?', 'stalk-root'] = np.nan

for col in df_mushrooms_dropped_cols.columns.values:
    no_rows = df_mushrooms_dropped_cols[col].isnull().sum()
    percentage = no_rows / df_mushrooms_dropped_cols.shape[0]
    if percentage > drop_percentage:
        del df_mushrooms_dropped_cols[col]
        print("Column {} contains {} missing values. This is {:.1%}. Dropping this column.".format(col, no_rows, percentage))

Filling missing values with zeros:

df_mushrooms_zerofill = df_mushrooms.copy(deep = True)
df_mushrooms_zerofill.loc[df_mushrooms_zerofill['stalk-root'] == '?', 'stalk-root'] = np.nan
df_mushrooms_zerofill.fillna(0, inplace=True)

Replacing missing values with a backward fill:

df_mushrooms_bfill = df_mushrooms.copy(deep = True)
df_mushrooms_bfill.loc[df_mushrooms_bfill['stalk-root'] == '?', 'stalk-root'] = np.nan
# bfill() replaces the deprecated fillna(method='bfill')
df_mushrooms_bfill.bfill(inplace=True)

Replacing missing values with a forward fill:

df_mushrooms_ffill = df_mushrooms.copy(deep = True)
df_mushrooms_ffill.loc[df_mushrooms_ffill['stalk-root'] == '?', 'stalk-root'] = np.nan
# ffill() replaces the deprecated fillna(method='ffill')
df_mushrooms_ffill.ffill(inplace=True)
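
Imputation was mentioned above as a further option. The mean and the median do not exist for a categorical column, so a natural counterpart is the most frequent category. This is a minimal sketch on a made-up toy column that uses the same '?' marker; it is not part of the dataframes compared later:

```python
import pandas as pd

# Toy column using the same '?' marker for missing values (values are made up).
df_toy = pd.DataFrame({"stalk-root": ["b", "?", "e", "b", "?", "c", "b"]})

# Turn '?' into a proper missing value, then fill the gaps with the most frequent category.
stalk_root = df_toy["stalk-root"].mask(df_toy["stalk-root"] == "?")
most_frequent = stalk_root.mode()[0]
df_toy["stalk-root"] = stalk_root.fillna(most_frequent)

print(most_frequent)
print(df_toy["stalk-root"].tolist())
```

Filling with the most frequent value keeps every row, but it strengthens the majority category, which is worth keeping in mind when many values are missing.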

Encoding categorical variables

As mentioned before, some classifiers do not play well with non-numerical data, which is why they will be prone to errors if you try to use them on a dataset with mainly categorical values. In such cases, there are two options:

  • 1) Convert the categorical values to numerical values, using label encoding (the LabelEncoder class in scikit-learn)
  • 2) Expand the column in question into multiple columns filled with binary values (dummy values with zeroes and ones), which is usually called one-hot encoding

This is how it would work for a column called “FRUIT” containing the unique values [‘ORANGE’, ‘APPLE’, ‘PEAR’]:

  • Using the first method, this column would be converted to unique values in the form of [0,1,2]
  • Using the second method, the column would be converted into three columns named [‘FRUIT_IS_ORANGE’, ‘FRUIT_IS_APPLE’, ‘FRUIT_IS_PEAR’]. After this step the original ‘FRUIT’ column would be dropped. The three new columns contain binary values representing True or False.

Certain classifiers that use the numerical values of a label-encoded column may do so in ways that do not necessarily reflect the nature of the data. For instance, the Nearest Neighbour algorithm assumes that a value of 1 is closer to 0 than a value of 2 is. Such an ordering does not make sense for label-encoded columns, since the string APPLE is not in any way closer to the string ORANGE than it is to the string PEAR. Hence, you have to be careful whilst including such data.
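
The FRUIT example can be tried out directly. This sketch uses LabelEncoder for the first method and, as an assumption for brevity, pandas' get_dummies for the second one instead of a hand-written expansion:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df_fruit = pd.DataFrame({"FRUIT": ["ORANGE", "APPLE", "PEAR", "APPLE"]})

# Method 1: label encoding, one integer per category (assigned in sorted order).
le = LabelEncoder()
labels = le.fit_transform(df_fruit["FRUIT"])
print(le.classes_.tolist())  # ['APPLE', 'ORANGE', 'PEAR']
print(labels.tolist())       # [1, 0, 2, 0]

# Method 2: one binary column per category; the original column is dropped.
df_onehot = pd.get_dummies(df_fruit, columns=["FRUIT"], prefix="FRUIT_IS", dtype=int)
print(df_onehot.columns.tolist())
print(df_onehot.iloc[0].tolist())  # first row is ORANGE
```

Note that the integer labels follow alphabetical order, which is exactly the kind of artificial ordering described above.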

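Below we label-encode the columns. Note that the resulting dataframes carry an _ohe suffix in their names, even though the encoding applied here is label encoding: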

def label_encode(df, columns):
    """Replace the categories in each listed column with integer labels, in place."""
    for col in columns:
        le = LabelEncoder()
        # fit on the unique categories, then map every value to its integer label
        le.fit(list(df[col].unique()))
        df[col] = le.transform(list(df[col].values))


df_mushrooms_ohe = df_mushrooms.copy(deep=True)
to_be_encoded_cols = df_mushrooms_ohe.columns.values
label_encode(df_mushrooms_ohe, to_be_encoded_cols)
display(df_mushrooms_ohe.head())

## Now we do the same thing for the other dataframes
df_mushrooms_dropped_rows_ohe = df_mushrooms_dropped_rows.copy(deep = True)
df_mushrooms_zerofill_ohe = df_mushrooms_zerofill.copy(deep = True)
df_mushrooms_bfill_ohe = df_mushrooms_bfill.copy(deep = True)
df_mushrooms_ffill_ohe = df_mushrooms_ffill.copy(deep = True)

label_encode(df_mushrooms_dropped_rows_ohe, to_be_encoded_cols)
label_encode(df_mushrooms_zerofill_ohe, to_be_encoded_cols)
label_encode(df_mushrooms_bfill_ohe, to_be_encoded_cols)
label_encode(df_mushrooms_ffill_ohe, to_be_encoded_cols)

And we expand the columns containing categorical data:

def expand_cols(df, list_columns):
    for column in list_columns:
        colvalues = df[column].unique()
        for colvalue in colvalues:
            newcol_name = "{}_is_{}".format(column, colvalue)
            df.loc[df[column] == colvalue, newcol_name] = 1
            df.loc[df[column] != colvalue, newcol_name] = 0
    df.drop(list_columns, inplace=True, axis=1)

y_col = 'class'
cols_to_expand = list(df_mushrooms.columns.values)
cols_to_expand.remove(y_col)

df_mushrooms_expanded = df_mushrooms.copy(deep=True)
label_encode(df_mushrooms_expanded, [y_col])
expand_cols(df_mushrooms_expanded, cols_to_expand)

## Now we do the same thing for all other dataframes
df_mushrooms_dropped_rows_expanded = df_mushrooms_dropped_rows.copy(deep = True)
df_mushrooms_zerofill_expanded = df_mushrooms_zerofill.copy(deep = True)
df_mushrooms_bfill_expanded = df_mushrooms_bfill.copy(deep = True)
df_mushrooms_ffill_expanded = df_mushrooms_ffill.copy(deep = True)

label_encode(df_mushrooms_dropped_rows_expanded, [y_col])
label_encode(df_mushrooms_zerofill_expanded, [y_col])
label_encode(df_mushrooms_bfill_expanded, [y_col])
label_encode(df_mushrooms_ffill_expanded, [y_col])

expand_cols(df_mushrooms_dropped_rows_expanded, cols_to_expand)
expand_cols(df_mushrooms_zerofill_expanded, cols_to_expand)
expand_cols(df_mushrooms_bfill_expanded, cols_to_expand)
expand_cols(df_mushrooms_ffill_expanded, cols_to_expand)

Classifying and validating the data

Like before, we can 'brute-force' our way into finding out which classifier works best on the mushroom dataset by trying out all of them at once.

Again, the goal is to find the classifier with the highest accuracy. We will use the same 70% / 30% ratio for splitting the data into training and test sets.

dict_dataframes = {
    "df_mushrooms_ohe": df_mushrooms_ohe,
    "df_mushrooms_dropped_rows_ohe": df_mushrooms_dropped_rows_ohe,
    "df_mushrooms_zerofill_ohe": df_mushrooms_zerofill_ohe,
    "df_mushrooms_bfill_ohe": df_mushrooms_bfill_ohe,
    "df_mushrooms_ffill_ohe": df_mushrooms_ffill_ohe,
    "df_mushrooms_expanded": df_mushrooms_expanded,
    "df_mushrooms_dropped_rows_expanded": df_mushrooms_dropped_rows_expanded,
    "df_mushrooms_zerofill_expanded": df_mushrooms_zerofill_expanded,
    "df_mushrooms_bfill_expanded": df_mushrooms_bfill_expanded,
    "df_mushrooms_ffill_expanded": df_mushrooms_ffill_expanded
}

y_col = 'class'
train_test_ratio = 0.7

for df_key, df in dict_dataframes.items():
    x_cols = list(df.columns.values)
    x_cols.remove(y_col)
    df_train, df_test, X_train, Y_train, X_test, Y_test = get_train_test(df, y_col, x_cols, train_test_ratio)
    dict_models = batch_classify(X_train, Y_train, X_test, Y_test, n_classifiers = 8, verbose=False)

    print()
    print(df_key)
    display_dict_models(dict_models)

The accuracy of the classifiers in this case is pretty high.

Working with complex datasets and understanding them

Not all datasets are easy to work with. Sometimes it’s difficult to figure out which features are helping with Classification or Regression, and which are actually just adding unnecessary noise to the results.

To get a better understanding of such datasets, you can look at a few methods used to understand how a certain dataset is characterized by its features.

The correlation matrix

For a better understanding of how strongly each individual feature is related to the type of glass in our first dataset, it is helpful to calculate and plot a correlation matrix using the following code:

correlation_matrix = df_glass.corr()
plt.figure(figsize=(10,8))
ax = sns.heatmap(correlation_matrix, vmax=1, square=True, annot=True,fmt='.2f', cmap ='GnBu', cbar_kws={"shrink": .5}, robust=True)
plt.title('Correlation matrix between the features', fontsize=20)
plt.show()

From the results we can see that the Aluminum and Magnesium content is much more strongly correlated with the type of glass than the other features. On the other hand, the content of Calcium does not play a very important role. If your dataset includes features that have no correlation at all with the variable of interest, removing them entirely may help the model, since they will just add noise to your results.

Correlate a single feature with multiple other features

In a dataset with a large number of features, the correlation matrix can become very large and hard to read. Thankfully, you can also look at the correlations of a single feature with all the others, and visualize the results in the form of a bar chart:

def display_corr_with_col(df, col):
    correlation_matrix = df.corr()
    correlation_type = correlation_matrix[col].copy()
    abs_correlation_type = correlation_type.apply(lambda x: abs(x))
    desc_corr_values = abs_correlation_type.sort_values(ascending=False)
    y_values = list(desc_corr_values.values)[1:]
    x_values = range(0,len(y_values))
    xlabels = list(desc_corr_values.keys())[1:]
    fig, ax = plt.subplots(figsize=(8,8))
    ax.bar(x_values, y_values)
    ax.set_title('Correlation of all features with {}'.format(col), fontsize=20)
    ax.set_ylabel('Pearson correlation coefficient', fontsize=16)
    plt.xticks(x_values, xlabels, rotation='vertical')
    plt.show()

display_corr_with_col(df_glass, 'typeglass')

Cumulative Explained Variance

This method allows you to see how much of the variance is captured by the first N principal components, i.e. linear combinations of the standardized features ordered by the amount of variance they explain. For example, the following plot shows you how the first four components capture 90% of the variance in the dataset.

X = df_glass[x_cols_glass].values
X_std = StandardScaler().fit_transform(X)

pca = PCA().fit(X_std)
var_ratio = pca.explained_variance_ratio_
components = pca.components_
#print(pca.explained_variance_)
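# x runs from 1 so that it reads as "the first N components"
plt.plot(np.arange(1, len(var_ratio) + 1), np.cumsum(var_ratio))
plt.xlim(1, len(var_ratio))
plt.xlabel('Number of principal components', fontsize=16)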
plt.ylabel('Cumulative explained variance', fontsize=16)
plt.show()
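
Instead of reading the number of components off the plot, it can also be computed from the cumulative sum. The sketch below uses scikit-learn's bundled wine data as a stand-in, and the 90% threshold is an assumption chosen to match the text:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# The bundled wine data is a stand-in here; it is not one of the tutorial's datasets.
X_wine, _ = load_wine(return_X_y=True)
X_wine_std = StandardScaler().fit_transform(X_wine)

# Running total of the explained variance ratio, one entry per principal component.
cumulative = np.cumsum(PCA().fit(X_wine_std).explained_variance_ratio_)

# Index of the first entry that reaches the threshold, plus one to turn it into a count.
threshold = 0.90
n_needed = int(np.argmax(cumulative >= threshold)) + 1
print("{} of {} components explain at least {:.0%} of the variance".format(
    n_needed, len(cumulative), threshold))
```

Replacing X_wine with the standardized glass features gives the corresponding number for the glass dataset.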

You may want to remove the features with the lowest correlation if your Classification/Regression model has low accuracy. You can also add the features with the highest correlation incrementally and check whether the results improve.

Pairwise relationships between various features

Besides the correlation matrix, it’s possible to look at the pairwise relationships between certain features and plot them in order to observe how they are correlated.

# pairplot returns a grid of subplots, so the title is set on the whole figure
g = sns.pairplot(df_glass, hue='typeglass')
g.figure.suptitle('Pairwise relationships between the features', y=1.02)
plt.show()

Conclusion

The Scikit-learn library is a very useful and beginner-friendly library for machine learning. It packs a hefty punch in terms of features, and can work with a variety of datasets, regardless of their type.
