Note: This post is an extension of the Loan Approval Application that I made using the Flask web framework.
import pandas as pd
import numpy as np

# pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler

# model
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# displays HTML representation in a Jupyter context
from sklearn import set_config
set_config(display='diagram') # default: display='text'

The dataset is downloaded from a GitHub repository and contains information about loan applicants. There are 12 independent columns and 1 dependent column, as follows:

  • Loan_ID: A unique ID assigned to every loan applicant
  • Gender: Gender of the applicant (Male, Female)
  • Married: The marital status of the applicant (Yes, No)
  • Dependents: No. of people dependent on the applicant (0,1,2,3+)
  • Education: Education level of the applicant (Graduate, Not Graduate)
  • Self_Employed: If the applicant is self-employed or not (Yes, No)
  • ApplicantIncome: The amount of income the applicant earns
  • CoapplicantIncome: The amount of income the co-applicant earns
  • LoanAmount: The amount of loan the applicant has requested for
  • Loan_Amount_Term: The no. of days over which the loan will be paid
  • Credit_History: A record of a borrower's responsible repayment of debts (1- has all debts paid, 0- not paid)
  • Property_Area: The type of location where the applicant's property lies (Rural, Semiurban, Urban)
  • Loan_Status: Loan granted or not (1 for granted, 0 for not granted)

Data Preprocessing before Pipeline

There are several steps to execute before we use the sklearn pipeline:

  1. Load the data: Read the loan.csv file and set Loan_ID as the index
  2. Manual feature selection: Drop irrelevant columns (or keep only the relevant ones)
  3. Data type conversion: This makes it easier to select columns in the Pipeline
  4. Feature-target split: Separate the independent and dependent variables into two objects
  5. Train-test split: The training set will be used to train the model, while the testing set will be treated as unseen data for model evaluation
# 1. Load the data
loan = pd.read_csv("data_input/loan.csv", index_col='Loan_ID')

# 2. Drop irrelevant column
loan.drop(columns=['Unnamed: 0'], inplace=True)

# 3. Data type conversion
loan['Credit_History'] = loan['Credit_History'].astype('bool').astype('object')

# 4. Feature-target split
X = loan.drop('Loan_Status', axis=1)
y = loan.Loan_Status

# 5. Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=123)
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")
X_train shape: (368, 11)
X_test shape: (123, 11)
y_train shape: (368,)
y_test shape: (123,)

Check which columns contain missing values:

na_values = X_train.isna().mean() * 100
na_values.sort_values(ascending=False).round(2).to_frame('Percentage of NA').astype(str) + ' %'
Percentage of NA
Self_Employed 5.98 %
LoanAmount 2.99 %
Loan_Amount_Term 2.72 %
Gender 1.9 %
Dependents 1.36 %
Property_Area 0.0 %
Credit_History 0.0 %
CoapplicantIncome 0.0 %
ApplicantIncome 0.0 %
Education 0.0 %
Married 0.0 %

Note: The missing values will be handled during the pipeline implementation. The numerical columns (LoanAmount, Loan_Amount_Term) will be imputed using the median, whereas the categorical columns (Credit_History, Self_Employed, Gender, Dependents) will be imputed with a constant "Undefined" value.

Pipeline

Why is a pipeline preferred? A pipeline chains several estimators into one sequential process. The advantages are:

  1. Shorter workflow: we only need to call .fit() and .predict() once on the whole pipeline.
  2. Parameter selection: all parameters (including preprocessing parameters) in the pipeline can be combined into a single hyperparameter search (e.g., if you want to compare model performance between mean and median imputation strategies).
  3. No data leakage, i.e., no statistical leakage from test data into the model during cross-validation. The most common case is imputation: suppose we use the mean strategy, then the mean should be calculated based only on the training set, never on the whole data (a minimal sketch follows the note below).

Note: every estimator in the pipeline must have a .transform() method, except for the last one.
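To make point 3 concrete, here is a minimal sketch (reusing two numeric columns from the X_train defined earlier; the pipeline name and the 5-fold CV are illustrative choices) showing that cross-validation over a pipeline re-fits the imputer inside every training fold:

from sklearn.model_selection import cross_val_score

# Each CV fold re-computes the imputer's mean on that fold's training
# part only, so no statistic leaks from the held-out part.
leak_free = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('model', LogisticRegression(max_iter=1000))
])
scores = cross_val_score(
    leak_free, X_train[['ApplicantIncome', 'LoanAmount']], y_train, cv=5)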

Build Pipeline

First, we define the transformation pipeline for numerical and categorical columns separately:

  • The numerical columns will be imputed using the median strategy and then scaled using z-score standardization
  • The categorical columns will be imputed with the constant "Undefined" value and then one-hot encoded.
num_features = ['ApplicantIncome', 'CoapplicantIncome',
                'LoanAmount', 'Loan_Amount_Term']
cat_features = ['Gender', 'Married', 'Dependents', 'Education',
                'Self_Employed', 'Credit_History', 'Property_Area']

num_transformer = Pipeline(
    [
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())  # z-score
    ]
)

cat_transformer = Pipeline(
    [
        ('imputer', SimpleImputer(strategy='constant', fill_value='Undefined')),
        ('encoder', OneHotEncoder(handle_unknown='ignore'))
    ]
)

data_transformer = ColumnTransformer(
    [
        ('num', num_transformer, num_features),
        ('cat', cat_transformer, cat_features)
    ]
)

data_transformer
<div id="sk-45a63a3a-5b09-47ec-8269-45be6d8fd550" class"sk-top-container">
ColumnTransformer(transformers=[('num',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='median')),
                                                 ('scaler', StandardScaler())]),
                                 ['ApplicantIncome', 'CoapplicantIncome',
                                  'LoanAmount', 'Loan_Amount_Term']),
                                ('cat',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(fill_value='Undefined',
                                                                strategy='constant')),
                                                 ('encoder',
                                                  OneHotEncoder(handle_unknown='ignore'))]),
                                 ['Gender', 'Married', 'Dependents',
                                  'Education', 'Self_Employed',
                                  'Credit_History', 'Property_Area'])])
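As a quick sanity check (a sketch; the expected 24 columns are the 4 scaled numeric features plus 20 one-hot features, matching the encoded feature names shown later), we can fit and apply the transformer on the training set and inspect the output shape:

# 4 numeric + 20 one-hot-encoded categorical columns = 24 features
Xt = data_transformer.fit_transform(X_train)
print(Xt.shape)  # expected: (368, 24)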

Next, the transformation pipeline is appended to a classifier. In this case, we use a simple LogisticRegression model.

pipe = Pipeline(
    [
        ('preprocess', data_transformer),
        ('classifier', LogisticRegression(max_iter=np.inf, random_state=123))
    ]
)

pipe
<div id="sk-bff6454a-c93a-4023-8da5-619f47ea2b6c" class"sk-top-container">
Pipeline(steps=[('preprocess',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['ApplicantIncome',
                                                   'CoapplicantIncome',
                                                   'LoanAmount',
                                                   'Loan_Amount_Term']),
                                                 ('cat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value='Undefined',
                                                                                 strategy='constant')),
                                                                  ('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Gender', 'Married',
                                                   'Dependents', 'Education',
                                                   'Self_Employed',
                                                   'Credit_History',
                                                   'Property_Area'])])),
                ('classifier',
                 LogisticRegression(max_iter=inf, random_state=123))])

Use the Pipeline

By calling the .fit() method, the statistics of the training data are learned by both the transformers (for imputation, scaling, and one-hot encoding) and the classifier.

pipe.fit(X_train, y_train)
<div id="sk-bc57611d-81a8-432d-8913-cb5e61b77a65" class"sk-top-container">
Pipeline(steps=[('preprocess',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['ApplicantIncome',
                                                   'CoapplicantIncome',
                                                   'LoanAmount',
                                                   'Loan_Amount_Term']),
                                                 ('cat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value='Undefined',
                                                                                 strategy='constant')),
                                                                  ('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Gender', 'Married',
                                                   'Dependents', 'Education',
                                                   'Self_Employed',
                                                   'Credit_History',
                                                   'Property_Area'])])),
                ('classifier',
                 LogisticRegression(max_iter=inf, random_state=123))])

Then we use the trained pipeline to predict the target variable, and the trained model is evaluated using classification_report.

print(classification_report(y_train, pipe.predict(X_train)))
print(classification_report(y_test, pipe.predict(X_test)))
              precision    recall  f1-score   support

           0       0.88      0.43      0.58       113
           1       0.79      0.97      0.87       255

    accuracy                           0.81       368
   macro avg       0.83      0.70      0.73       368
weighted avg       0.82      0.81      0.78       368

              precision    recall  f1-score   support

           0       0.90      0.51      0.65        35
           1       0.83      0.98      0.90        88

    accuracy                           0.85       123
   macro avg       0.87      0.75      0.78       123
weighted avg       0.85      0.85      0.83       123

Up to this point, if we are satisfied with the model's performance, we can serialize the pipeline using joblib.dump():

import joblib
joblib.dump(pipe, 'cache/loan_pipeline.joblib')
['cache/loan_pipeline.joblib']

Conversely, we can deserialize (load) the pickled pipeline using joblib.load():

pipe = joblib.load('cache/loan_pipeline.joblib')
pipe
<div id="sk-aab38f93-2222-4105-ab20-c7dd4f258bc9" class"sk-top-container">
Pipeline(steps=[('preprocess',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['ApplicantIncome',
                                                   'CoapplicantIncome',
                                                   'LoanAmount',
                                                   'Loan_Amount_Term']),
                                                 ('cat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value='Undefined',
                                                                                 strategy='constant')),
                                                                  ('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Gender', 'Married',
                                                   'Dependents', 'Education',
                                                   'Self_Employed',
                                                   'Credit_History',
                                                   'Property_Area'])])),
                ('classifier',
                 LogisticRegression(max_iter=inf, random_state=123))])

Warning: joblib.load() relies on the pickle module and can execute arbitrary Python code. It should therefore never be used to load files from untrusted sources.
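Because all preprocessing lives inside the pipeline, the loaded object can predict directly on raw, unprocessed rows. A quick sketch using the first row of the test set:

applicant = X_test.iloc[[0]]          # one raw applicant, no manual preprocessing
print(pipe.predict(applicant))        # predicted class
print(pipe.predict_proba(applicant))  # class probabilities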

How to Access the Fitted Pipeline?

Sometimes, we want a more in-depth inspection of what each component of the pipeline has learned from the training set.

Case 1: Print out the learned statistics of SimpleImputer

pd.DataFrame({
    'Numerical Features': num_features,
    'Learned Statistics (Median)': pipe['preprocess'].named_transformers_['num'].named_steps['imputer'].statistics_
})
Numerical Features Learned Statistics (Median)
0 ApplicantIncome 3859.0
1 CoapplicantIncome 1293.5
2 LoanAmount 128.0
3 Loan_Amount_Term 360.0

Note: These are the values that will be imputed into unseen data.
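The same indexing pattern reaches any other fitted step. For example (a sketch using standard sklearn attributes), we can inspect the scaler's learned statistics or the classifier's coefficients:

# Fitted scaler inside the numeric sub-pipeline
scaler = pipe['preprocess'].named_transformers_['num'].named_steps['scaler']
print(scaler.mean_)   # per-column means learned from X_train
print(scaler.scale_)  # per-column standard deviations

# Coefficients of the fitted logistic regression, one per encoded feature
print(pipe['classifier'].coef_.shape)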

Case 2: Transform new data using the fitted data_transformer

new_data = pd.DataFrame({
    'Gender': ['Male'],
    'Married': ['No'],
    'Dependents': [0],
    'Education': ['Graduate'],
    'Self_Employed': ['No'],
    'ApplicantIncome': [10000],
    'CoapplicantIncome': [np.nan],
    'LoanAmount': [1000],
    'Loan_Amount_Term': [36],
    'Credit_History': [True],
    'Property_Area': [np.nan]
})

new_data
Gender Married Dependents Education Self_Employed ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History Property_Area
0 Male No 0 Graduate No 10000 NaN 1000 36 True NaN

First, we would like to extract the column names after OneHotEncoder is applied:

fitted_encoder = pipe['preprocess'].named_transformers_['cat'].named_steps['encoder']
fitted_encoder.get_feature_names(cat_features)
array(['Gender_Female', 'Gender_Male', 'Gender_Undefined', 'Married_No',
       'Married_Yes', 'Dependents_0', 'Dependents_1', 'Dependents_2',
       'Dependents_3+', 'Dependents_Undefined', 'Education_Graduate',
       'Education_Not Graduate', 'Self_Employed_No',
       'Self_Employed_Undefined', 'Self_Employed_Yes',
       'Credit_History_False', 'Credit_History_True',
       'Property_Area_Rural', 'Property_Area_Semiurban',
       'Property_Area_Urban'], dtype=object)

Second, we can directly call the .transform() method on the new data using the fitted transformation pipeline.

data_transformer.transform(new_data)
array([[ 0.67810314, -0.10995112, 10.06889387, -4.86515444,  0.        ,
         1.        ,  0.        ,  1.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  1.        ,
         0.        ,  1.        ,  0.        ,  0.        ,  0.        ,
         1.        ,  0.        ,  0.        ,  0.        ]])

The two steps above can be combined into a single cell, producing the following DataFrame:

pd.DataFrame(
    data_transformer.transform(new_data),
    columns=num_features + list(fitted_encoder.get_feature_names(cat_features))
)
ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term Gender_Female Gender_Male Gender_Undefined Married_No Married_Yes Dependents_0 ... Education_Graduate Education_Not Graduate Self_Employed_No Self_Employed_Undefined Self_Employed_Yes Credit_History_False Credit_History_True Property_Area_Rural Property_Area_Semiurban Property_Area_Urban
0 0.678103 -0.109951 10.068894 -4.865154 0.0 1.0 0.0 1.0 0.0 0.0 ... 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0

1 rows × 24 columns

GridSearch over the Pipeline

One advantage of using a pipeline is that we can perform a grid search over all parameters inside it. Here is the list of tunable parameters based on the previous pipeline. A parameter name that starts with classifier__ means we are tuning the model, whereas preprocess__ means we are tuning the preprocessing (transformation) step.

sorted(list(pipe.get_params().keys()))
['classifier',
 'classifier__C',
 'classifier__class_weight',
 'classifier__dual',
 'classifier__fit_intercept',
 'classifier__intercept_scaling',
 'classifier__l1_ratio',
 'classifier__max_iter',
 'classifier__multi_class',
 'classifier__n_jobs',
 'classifier__penalty',
 'classifier__random_state',
 'classifier__solver',
 'classifier__tol',
 'classifier__verbose',
 'classifier__warm_start',
 'memory',
 'preprocess',
 'preprocess__cat',
 'preprocess__cat__encoder',
 'preprocess__cat__encoder__categories',
 'preprocess__cat__encoder__drop',
 'preprocess__cat__encoder__dtype',
 'preprocess__cat__encoder__handle_unknown',
 'preprocess__cat__encoder__sparse',
 'preprocess__cat__imputer',
 'preprocess__cat__imputer__add_indicator',
 'preprocess__cat__imputer__copy',
 'preprocess__cat__imputer__fill_value',
 'preprocess__cat__imputer__missing_values',
 'preprocess__cat__imputer__strategy',
 'preprocess__cat__imputer__verbose',
 'preprocess__cat__memory',
 'preprocess__cat__steps',
 'preprocess__cat__verbose',
 'preprocess__n_jobs',
 'preprocess__num',
 'preprocess__num__imputer',
 'preprocess__num__imputer__add_indicator',
 'preprocess__num__imputer__copy',
 'preprocess__num__imputer__fill_value',
 'preprocess__num__imputer__missing_values',
 'preprocess__num__imputer__strategy',
 'preprocess__num__imputer__verbose',
 'preprocess__num__memory',
 'preprocess__num__scaler',
 'preprocess__num__scaler__copy',
 'preprocess__num__scaler__with_mean',
 'preprocess__num__scaler__with_std',
 'preprocess__num__steps',
 'preprocess__num__verbose',
 'preprocess__remainder',
 'preprocess__sparse_threshold',
 'preprocess__transformer_weights',
 'preprocess__transformers',
 'preprocess__verbose',
 'steps',
 'verbose']
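As an aside, the same double-underscore addressing works outside of grid search too; .set_params() can change a nested parameter in place (a quick sketch; the grid search below tunes these same parameters anyway, so this is just to illustrate the syntax):

# Switch the numeric imputation strategy and the regularization strength
# directly on the pipeline via nested parameter names.
pipe.set_params(preprocess__num__imputer__strategy='mean',
                classifier__C=1.0)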

In the cell below we prepare the parameter combinations as follows:

  • On the preprocessing step:
    • Try the mean and median imputation strategies
    • Try no scaling (passthrough), standardization, and min-max scaling
  • On the classifier step:
    • Try C (inverse regularization strength) values of 0.1, 1, and 10
param_grid = {
    # combination for numeric imputer
    'preprocess__num__imputer__strategy': ['mean', 'median'],

    # combination for numeric scaling
    'preprocess__num__scaler': ['passthrough', StandardScaler(), MinMaxScaler()],

    # combination for logistic regression hyperparameter
    'classifier__C': [0.1, 1, 10]
}

pipe_grid = GridSearchCV(pipe, param_grid, cv=3)

pipe_grid.fit(X_train, y_train)
<div id="sk-07f55928-8301-4537-bedc-c6f7f548effd" class"sk-top-container">
GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('preprocess',
                                        ColumnTransformer(transformers=[('num',
                                                                         Pipeline(steps=[('imputer',
                                                                                          SimpleImputer(strategy='median')),
                                                                                         ('scaler',
                                                                                          StandardScaler())]),
                                                                         ['ApplicantIncome',
                                                                          'CoapplicantIncome',
                                                                          'LoanAmount',
                                                                          'Loan_Amount_Term']),
                                                                        ('cat',
                                                                         Pipeline(steps=[('imputer',
                                                                                          SimpleImputer(fill_value='Undefined',
                                                                                                        strategy='constant...
                                                                                          OneHotEncoder(handle_unknown='ignore'))]),
                                                                         ['Gender',
                                                                          'Married',
                                                                          'Dependents',
                                                                          'Education',
                                                                          'Self_Employed',
                                                                          'Credit_History',
                                                                          'Property_Area'])])),
                                       ('classifier',
                                        LogisticRegression(max_iter=inf,
                                                           random_state=123))]),
             param_grid={'classifier__C': [0.1, 1, 10],
                         'preprocess__num__imputer__strategy': ['mean',
                                                                'median'],
                         'preprocess__num__scaler': ['passthrough',
                                                     StandardScaler(),
                                                     MinMaxScaler()]})

Print out the grid search cross-validation results:

pd.DataFrame(pipe_grid.cv_results_).sort_values('rank_test_score')[['rank_test_score', 'params', 'mean_test_score', 'std_test_score']]
rank_test_score params mean_test_score std_test_score
2 1 {'classifier__C': 0.1, 'preprocess__num__imput... 0.806966 0.030669
5 1 {'classifier__C': 0.1, 'preprocess__num__imput... 0.806966 0.030669
8 3 {'classifier__C': 1, 'preprocess__num__imputer... 0.801546 0.024111
11 3 {'classifier__C': 1, 'preprocess__num__imputer... 0.801546 0.024111
1 5 {'classifier__C': 0.1, 'preprocess__num__imput... 0.798814 0.030696
4 5 {'classifier__C': 0.1, 'preprocess__num__imput... 0.798814 0.030696
0 7 {'classifier__C': 0.1, 'preprocess__num__imput... 0.793327 0.044085
9 8 {'classifier__C': 1, 'preprocess__num__imputer... 0.790684 0.021143
10 8 {'classifier__C': 1, 'preprocess__num__imputer... 0.790684 0.020074
14 10 {'classifier__C': 10, 'preprocess__num__impute... 0.787974 0.020709
15 10 {'classifier__C': 10, 'preprocess__num__impute... 0.787974 0.018458
17 10 {'classifier__C': 10, 'preprocess__num__impute... 0.787974 0.020709
7 13 {'classifier__C': 1, 'preprocess__num__imputer... 0.787951 0.023938
6 13 {'classifier__C': 1, 'preprocess__num__imputer... 0.787951 0.024841
12 15 {'classifier__C': 10, 'preprocess__num__impute... 0.785241 0.022270
3 16 {'classifier__C': 0.1, 'preprocess__num__imput... 0.782487 0.034336
16 17 {'classifier__C': 10, 'preprocess__num__impute... 0.777089 0.022301
13 17 {'classifier__C': 10, 'preprocess__num__impute... 0.777089 0.022301

Note: Based on these results, the scaling method barely affects the performance of Logistic Regression on this dataset.

We can print out the best combination of parameters based on the grid search:

pipe_grid.best_params_
{'classifier__C': 0.1,
 'preprocess__num__imputer__strategy': 'mean',
 'preprocess__num__scaler': MinMaxScaler()}
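Besides best_params_, GridSearchCV also exposes the mean cross-validated score of the winning combination and the pipeline refitted on the whole training set:

print(pipe_grid.best_score_)           # mean CV accuracy of best_params_
best_pipe = pipe_grid.best_estimator_  # best pipeline refitted on all of X_train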

Lastly, we can call the .predict() method directly on pipe_grid; it will automatically use the best_params_ combination (by default, GridSearchCV refits the best estimator on the whole training set):

print(classification_report(y_train, pipe_grid.predict(X_train)))
print(classification_report(y_test, pipe_grid.predict(X_test)))
              precision    recall  f1-score   support

           0       0.92      0.41      0.56       113
           1       0.79      0.98      0.88       255

    accuracy                           0.81       368
   macro avg       0.85      0.70      0.72       368
weighted avg       0.83      0.81      0.78       368

              precision    recall  f1-score   support

           0       0.89      0.46      0.60        35
           1       0.82      0.98      0.89        88

    accuracy                           0.83       123
   macro avg       0.85      0.72      0.75       123
weighted avg       0.84      0.83      0.81       123


Explainable AI (XAI): Local Interpretable Model-Agnostic Explanations (LIME)

X_train_preprocessed is the X_train data preprocessed by the best pipeline (the previous GridSearchCV result):

X_train_preprocessed = pd.DataFrame(
    pipe_grid.best_estimator_['preprocess'].transform(X_train),
    columns=num_features +
    list(fitted_encoder.get_feature_names(cat_features)),
    index=X_train.index
)
X_train_preprocessed.head()
ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term Gender_Female Gender_Male Gender_Undefined Married_No Married_Yes Dependents_0 ... Education_Graduate Education_Not Graduate Self_Employed_No Self_Employed_Undefined Self_Employed_Yes Credit_History_False Credit_History_True Property_Area_Rural Property_Area_Semiurban Property_Area_Urban
Loan_ID
LP001367 0.035894 0.030440 0.131122 0.74359 0.0 1.0 0.0 0.0 1.0 0.0 ... 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0
LP002137 0.076475 0.135443 0.382306 0.74359 0.0 1.0 0.0 0.0 1.0 1.0 ... 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0
LP002113 0.020779 0.000000 0.204194 0.74359 1.0 0.0 0.0 1.0 0.0 0.0 ... 0.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0
LP001664 0.049981 0.000000 0.162717 0.74359 0.0 1.0 0.0 1.0 0.0 1.0 ... 1.0 0.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0
LP001421 0.067013 0.063303 0.249605 0.74359 0.0 1.0 0.0 0.0 1.0 1.0 ... 1.0 0.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0

5 rows × 24 columns

categorical_names_mapping is a dictionary mapping each categorical column to the categories learned by OneHotEncoder.

fitted_encoder_best = pipe_grid.best_estimator_['preprocess'].named_transformers_['cat'].named_steps['encoder']
categorical_names_mapping = dict(zip(cat_features, fitted_encoder_best.categories_))
categorical_names_mapping
{'Gender': array(['Female', 'Male', 'Undefined'], dtype=object),
 'Married': array(['No', 'Yes'], dtype=object),
 'Dependents': array(['0', '1', '2', '3+', 'Undefined'], dtype=object),
 'Education': array(['Graduate', 'Not Graduate'], dtype=object),
 'Self_Employed': array(['No', 'Undefined', 'Yes'], dtype=object),
 'Credit_History': array([False, True], dtype=object),
 'Property_Area': array(['Rural', 'Semiurban', 'Urban'], dtype=object)}

Set up the LIME explainer for tabular data; it is fitted on the training data only.

Note: Install LIME with pip install lime in your Python environment.
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    training_data=X_train_preprocessed.values,  # must be a numeric array
    mode='classification',
    feature_names=X_train_preprocessed.columns,
    class_names=['Not Granted', 'Granted'],
    categorical_names=categorical_names_mapping)

Next, provide the data that we want to interpret, namely data testing that has been preprocessed based on the best pipeline (the previous GridSearchCV result).

X_test_preprocessed = pd.DataFrame(
    pipe_grid.best_estimator_['preprocess'].transform(X_test),
    columns=num_features + list(fitted_encoder.get_feature_names(cat_features)),
    index=X_test.index
)
X_test_preprocessed.head()
ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term Gender_Female Gender_Male Gender_Undefined Married_No Married_Yes Dependents_0 ... Education_Graduate Education_Not Graduate Self_Employed_No Self_Employed_Undefined Self_Employed_Yes Credit_History_False Credit_History_True Property_Area_Rural Property_Area_Semiurban Property_Area_Urban
Loan_ID
LP001279 0.027409 0.074800 0.187994 0.74359 0.0 1.0 0.0 1.0 0.0 1.0 ... 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0
LP001322 0.049264 0.000000 0.165877 0.74359 0.0 1.0 0.0 1.0 0.0 1.0 ... 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0
LP001086 0.015980 0.000000 0.028436 0.74359 0.0 1.0 0.0 1.0 0.0 1.0 ... 0.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0
LP002974 0.038120 0.057629 0.143760 0.74359 0.0 1.0 0.0 0.0 1.0 1.0 ... 1.0 0.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0
LP001404 0.037316 0.067471 0.216430 0.74359 1.0 0.0 0.0 0.0 1.0 1.0 ... 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0

5 rows × 24 columns

LIME only provides local interpretations (one observation at a time). In the cell below, we want to see the interpretation results for the second observation (using .iloc[[1]]) of X_test_preprocessed.

predict_fn = lambda x: pipe_grid.best_estimator_['classifier'].predict_proba(x).astype(float)

exp = explainer.explain_instance(
    X_test_preprocessed.iloc[[1]].values[0],  # must be a one-dimensional numpy array
    predict_fn,  # prediction function that outputs probability
    num_features=5  # top n features to be explained
)
print(f"LIME Explanation for ID {X_test_preprocessed.iloc[[1]].index.values[0]}")
exp.show_in_notebook(show_table=True, show_all=False)
LIME Explanation for ID LP001322

Insight for the second observation (Loan ID LP001322):

  • The loan application is predicted to be granted with 80% probability
  • The reasons supporting granting the loan are:
    • The credit history is good (Credit_History_False = 0, i.e., Credit_History = True)
    • The property is in a semiurban area, not a rural one (Property_Area_Semiurban = 1 and Property_Area_Rural = 0)
    • The number of dependents is not 1 (Dependents_1 = 0)
  • The reason against granting the loan is:
    • The applicant is not married (Married_No = 1)

The bar plot visualization above is slightly truncated, so we can view it more clearly as a DataFrame:

pd.DataFrame(exp.as_list(), columns=['Condition', 'Influence Score'])
Condition Influence Score
0 Credit_History_False <= 0.00 0.239573
1 0.00 < Property_Area_Semiurban <= 1.00 0.064764
2 Property_Area_Rural <= 0.00 0.058795
3 Dependents_1 <= 0.00 0.050181
4 Gender_Female <= 0.00 0.042407

Note: The influence score is not a coefficient of the logistic regression; it is used to compare one condition against another. The larger the absolute value, the more influential the condition. A positive value indicates a condition that supports the positive class, while a negative value indicates a condition that contradicts it.

LIME fits a simple, interpretable model to study the relationship between the features and the predict_proba results. It therefore reports an explanation fit value that describes how well LIME explains the black-box model (analogous to the R-squared value of a linear model). The explanation fit ranges from 0 to 1 (the larger the value, the better the fit).

print(f"Explanation fit: {exp.__dict__['score']}")
Explanation fit: 0.7204616471931928

Note: The LimeTabularExplainer model can also be tuned like a classification model. Parameter reference
