Interpretable Loan Approval Pipeline Model
A simple workflow demonstrating how to use a scikit-learn pipeline to classify whether a loan application should be granted. Grid search over the pipeline is applied to evaluate combinations of preprocessing and classifier hyperparameters, and the best pipeline is interpreted using the Local Interpretable Model-Agnostic Explanations (LIME) method.
- Data Preprocessing before Pipeline
- Pipeline
- Explainable AI (XAI): Local Interpretable Model-Agnostic Explanations (LIME)
- Deployment: Flask web framework
import pandas as pd
import numpy as np
# pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler
# model
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
# displays HTML representation in a jupyter context
from sklearn import set_config
set_config(display='diagram') # default: display='text'
The dataset is downloaded from a GitHub repository and contains information about loan applicants. There are 12 independent columns and 1 dependent column, as follows:
- `Loan_ID`: A unique ID assigned to every loan applicant
- `Gender`: Gender of the applicant (Male, Female)
- `Married`: The marital status of the applicant (Yes, No)
- `Dependents`: Number of people dependent on the applicant (0, 1, 2, 3+)
- `Education`: Education level of the applicant (Graduate, Not Graduate)
- `Self_Employed`: Whether the applicant is self-employed (Yes, No)
- `ApplicantIncome`: The amount of income the applicant earns
- `CoapplicantIncome`: The amount of income the co-applicant earns
- `LoanAmount`: The amount of loan the applicant has requested
- `Loan_Amount_Term`: The number of days over which the loan will be paid
- `Credit_History`: A record of the borrower's responsible repayment of debts (1 = all debts paid, 0 = not paid)
- `Property_Area`: The type of location where the applicant's property lies (Rural, Semiurban, Urban)
- `Loan_Status`: Loan granted or not (1 for granted, 0 for not granted)
There are several steps to execute before we use the sklearn Pipeline:

1. Load the data: read the `loan.csv` file and set `Loan_ID` as the index
2. Manual feature selection: drop irrelevant columns (or select relevant columns only)
3. Data type conversion: convert types to make it easier to select columns in the Pipeline
4. Feature-target split: separate the independent and dependent variables into two objects
5. Train-test split: the training set will be used to train the model, while the testing set will be treated as unseen data for model evaluation
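# 1. Load the data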
loan = pd.read_csv("data_input/loan.csv", index_col='Loan_ID')
# 2. Drop irrelevant column
loan.drop(columns=['Unnamed: 0'], inplace=True)
# 3. Data type conversion
loan['Credit_History'] = loan['Credit_History'].astype('bool').astype('object')
# 4. Feature-target split
X = loan.drop('Loan_Status', axis=1)
y = loan.Loan_Status
# 5. Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=123)
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")
Check which columns contain missing values:
na_values = X_train.isna().mean() * 100
na_values.sort_values(ascending=False).round(2).to_frame('Percentage of NA').astype(str) + ' %'
The numerical columns with missing values (`LoanAmount`, `Loan_Amount_Term`) will be imputed with the median, whereas the categorical columns (`Credit_History`, `Self_Employed`, `Gender`, `Dependents`) will be imputed with a constant "Undefined" value.
Why is a pipeline preferred? A pipeline is used to chain several estimators into one sequential process. The advantages are:

- Shorter workflow: we only need to call `.fit()` and `.predict()` once on the whole pipeline.
- Parameter selection: all parameters in the pipeline (including preprocessing parameters) can be combined into a single hyperparameter search (e.g., comparing the model's performance between mean and median imputation strategies).
- Prevention of data leakage, i.e., statistical leakage from the test data into the model during cross-validation. The most common case is imputation: if we use the mean strategy, the mean should be calculated from the training set only, not from the whole data (see the sketch below).

Note that all intermediate steps of a pipeline must implement the `.transform()` method; only the last estimator is exempt.
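As a minimal sketch of the data-leakage point (not from the original notebook; `X_num` and `y_num` are small synthetic arrays made up here for illustration), compare imputing before cross-validation with imputing inside a pipeline:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(123)
X_num = rng.normal(size=(100, 3))
X_num[rng.random((100, 3)) < 0.1] = np.nan            # inject missing values
y_num = rng.integers(0, 2, size=100)

# Leaky: the imputer sees all rows, including future validation folds
X_leaky = SimpleImputer(strategy='mean').fit_transform(X_num)
print(cross_val_score(LogisticRegression(), X_leaky, y_num, cv=3).mean())

# Leak-free: the imputer is refit on the training fold of every CV split
safe_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('clf', LogisticRegression())
])
print(cross_val_score(safe_pipe, X_num, y_num, cv=3).mean())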
First, we define the transformation pipeline for numerical and categorical columns separately:
- The numerical columns will be imputed using the median strategy and then scaled using z-score standardization.
- The categorical columns will be imputed using the constant "Undefined" value and then one-hot encoded.
num_features = ['ApplicantIncome', 'CoapplicantIncome',
'LoanAmount', 'Loan_Amount_Term']
cat_features = ['Gender', 'Married', 'Dependents', 'Education',
'Self_Employed', 'Credit_History', 'Property_Area']
num_transformer = Pipeline(
[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler()) # z-score
]
)
cat_transformer = Pipeline(
[
('imputer', SimpleImputer(strategy='constant', fill_value='Undefined')),
('encoder', OneHotEncoder(handle_unknown='ignore'))
]
)
data_transformer = ColumnTransformer(
[
('num', num_transformer, num_features),
('cat', cat_transformer, cat_features)
]
)
data_transformer
Next, the transformation pipeline is appended to a classifier. In this case, we use a simple `LogisticRegression` model.
pipe = Pipeline(
[
('preprocess', data_transformer),
('classifier', LogisticRegression(max_iter=10_000, random_state=123))  # large finite max_iter to ensure convergence
]
)
pipe
By calling the `.fit()` method, the statistics of the training data are learned by both the transformers (for imputation, scaling, and one-hot encoding) and the classifier.
pipe.fit(X_train, y_train)
Then we use the trained pipeline to predict the target variable, and the model is evaluated using `classification_report`.
print(classification_report(y_train, pipe.predict(X_train)))
print(classification_report(y_test, pipe.predict(X_test)))
Up to this point, if we are satisfied with the model's performance, we can serialize the pipeline using `joblib.dump()`:
import joblib
joblib.dump(pipe, 'cache/loan_pipeline.joblib')
Conversely, we can deserialize (load) the pickled pipeline using `joblib.load()`:
pipe = joblib.load('cache/loan_pipeline.joblib')
pipe
Note that `joblib.load()` relies on the `pickle` module and can execute arbitrary Python code. It should therefore never be used to load files from untrusted sources.
Sometimes, we want a more in-depth inspection of what each component of the pipeline has learned from the training set.
pd.DataFrame({
'Numerical Features': num_features,
'Learned Statistics (Median)': pipe['preprocess'].named_transformers_['num'].named_steps['imputer'].statistics_
})
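Next, suppose we receive a new loan application as a single-row DataFrame. We can pass it through the fitted transformation pipeline to inspect exactly what the model receives.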
new_data = pd.DataFrame({
'Gender': ['Male'],
'Married': ['No'],
'Dependents': ['0'],  # the encoder learned string categories ('0', '1', '2', '3+')
'Education': ['Graduate'],
'Self_Employed': ['No'],
'ApplicantIncome': [10000],
'CoapplicantIncome': [np.nan],
'LoanAmount': [1000],
'Loan_Amount_Term': [36],
'Credit_History': [True],
'Property_Area': [np.nan]
})
new_data
First, we extract the column names after the `OneHotEncoder` is applied:
fitted_encoder = pipe['preprocess'].named_transformers_['cat'].named_steps['encoder']
fitted_encoder.get_feature_names(cat_features)
Second, we can directly call the `.transform()` method on the new data with the trained transformation pipeline.
data_transformer.transform(new_data)
The two steps above can be combined into one code cell to produce the following DataFrame:
pd.DataFrame(
data_transformer.transform(new_data),
columns=num_features + list(fitted_encoder.get_feature_names(cat_features))
)
One advantage of using a pipeline is that we can perform a grid search over all parameters inside it. Below is the list of parameters that can be tuned for the previous pipeline. A parameter name starting with `classifier__` means we are tuning the model, whereas `preprocess__` means we are tuning the preprocessing (transformation) parameters.
sorted(list(pipe.get_params().keys()))
In the cell below we prepare the parameter combinations as follows:

- On the preprocessing step:
  - Try mean and median for the imputation strategy
  - Try no scaling (`passthrough`), standardization, and min-max scaling
- On the classifier step:
  - Try regularization strengths `C` = 0.1, 1, and 10
param_grid = {
# combination for numeric imputer
'preprocess__num__imputer__strategy': ['mean', 'median'],
# combination for numeric scaling
'preprocess__num__scaler': ['passthrough', StandardScaler(), MinMaxScaler()],
# combination for logistic regression hyperparameter
'classifier__C': [0.1, 1, 10]
}
pipe_grid = GridSearchCV(pipe, param_grid, cv=3)
pipe_grid.fit(X_train, y_train)
Print out the grid search cross-validation results:
pd.DataFrame(pipe_grid.cv_results_).sort_values('rank_test_score')[['rank_test_score', 'params', 'mean_test_score', 'std_test_score']]
We can print out the best combination of parameters based on the grid search:
pipe_grid.best_params_
Lastly, we can directly call the `.predict()` method on `pipe_grid`; it will automatically use the refitted pipeline with the `best_params_` combination:
print(classification_report(y_train, pipe_grid.predict(X_train)))
print(classification_report(y_test, pipe_grid.predict(X_test)))
References for the sklearn pipeline:

- Pipeline component
- Preprocessing data
  - Scaling: standardization, normalization
  - Non-linear transformation
  - Categorical encoding
  - Binning (numerical to categorical)
  - Missing value imputation
  - Polynomial features
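The binning and polynomial-features references above can also be used as drop-in pipeline steps. Below is a hedged sketch (not in the original notebook; the step names and parameter values are illustrative) of an alternative numeric branch:

from sklearn.preprocessing import KBinsDiscretizer, PolynomialFeatures

num_transformer_alt = Pipeline(
    [
        ('imputer', SimpleImputer(strategy='median')),
        ('poly', PolynomialFeatures(degree=2, include_bias=False)),    # add interaction terms
        ('binner', KBinsDiscretizer(n_bins=5, encode='onehot-dense'))  # numerical -> categorical bins
    ]
)
# It could replace the 'num' branch of the ColumnTransformer, or even be searched over,
# e.g. param_grid = {'preprocess__num': [num_transformer, num_transformer_alt]}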
`X_train_preprocessed` is the `X_train` data preprocessed by the best pipeline (the previous `GridSearchCV` result):
X_train_preprocessed = pd.DataFrame(
pipe_grid.best_estimator_['preprocess'].transform(X_train),
columns=num_features +
list(fitted_encoder.get_feature_names(cat_features)),
index=X_train.index
)
X_train_preprocessed.head()
`categorical_names_mapping` is a dictionary mapping each categorical column to the categories learned by the `OneHotEncoder` while encoding it.
fitted_encoder_best = pipe_grid.best_estimator_['preprocess'].named_transformers_['cat'].named_steps['encoder']
categorical_names_mapping = dict(zip(cat_features, fitted_encoder_best.categories_))
categorical_names_mapping
Set up the LIME explainer for tabular data; only the training data is used. If the `lime` package is not yet installed, run `pip install lime` in your Python environment.
from lime.lime_tabular import LimeTabularExplainer
explainer = LimeTabularExplainer(
training_data=X_train_preprocessed.values, # must be a numeric array
mode='classification',
feature_names=X_train_preprocessed.columns,
class_names=['Not Granted', 'Granted'],
categorical_names=categorical_names_mapping)
Next, provide the data that we want to interpret, namely the testing data that has been preprocessed by the best pipeline (the previous `GridSearchCV` result).
X_test_preprocessed = pd.DataFrame(
pipe_grid.best_estimator_['preprocess'].transform(X_test),
columns=num_features + list(fitted_encoder.get_feature_names(cat_features)),
index=X_test.index
)
X_test_preprocessed.head()
LIME only provides local interpretations (one observation at a time). In the cell below, we look at the interpretation for the second observation (using `.iloc[[1]]`) of `X_test_preprocessed`.
predict_fn = lambda x: pipe_grid.best_estimator_['classifier'].predict_proba(x).astype(float)
exp = explainer.explain_instance(
X_test_preprocessed.iloc[[1]].values[0], # must be a one-dimensional numpy array
predict_fn, # prediction function that outputs probability
num_features=5 # top n features to be explained
)
print(f"LIME Explanation for ID {X_test_preprocessed.iloc[[1]].index.values[0]}")
exp.show_in_notebook(show_table=True, show_all=False)
Insight for the second observation (Loan ID LP001322):

- The loan application is predicted to be granted with 80% probability.
- The reasons that support granting the loan are:
  - The credit history is good (`Credit_History_False = 0` means `Credit_History = True`, i.e., value 1)
  - The property is in a semiurban area, not in a rural area (`Property_Area_Semiurban = 1` and `Property_Area_Rural = 0`)
  - The number of dependents is not 1 (`Dependents_1 = 0`)
- The reason that contradicts granting the loan is:
  - The applicant is not married (`Married_No = 1`)
Part of the bar plot visualization above is slightly truncated, so we can view it more clearly as a DataFrame:
pd.DataFrame(exp.as_list(), columns=['Condition', 'Influence Score'])
LIME builds a simple, interpretable model to study the relationship between the features and the `predict_proba` results. It therefore reports an explanation fit value that describes how well LIME explains the black-box model (analogous to the R-squared value of a linear model). The explanation fit ranges from 0 to 1; the bigger the value, the better the fit.
print(f"Explanation fit: {exp.__dict__['score']}")
The `LimeTabularExplainer` can also be tuned like a classification model; see its parameter reference for the available options.
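As a hedged illustration (the parameter values below are arbitrary choices for demonstration, not recommendations from the original notebook), some of the tunable `LimeTabularExplainer` parameters look like this:

explainer_tuned = LimeTabularExplainer(
    training_data=X_train_preprocessed.values,
    mode='classification',
    feature_names=list(X_train_preprocessed.columns),
    class_names=['Not Granted', 'Granted'],
    kernel_width=3,                  # width of the locality kernel around the explained row
    feature_selection='lasso_path',  # how the surrogate model picks its features
    discretize_continuous=True,      # bin numeric features into quartiles
    sample_around_instance=True,     # perturb samples around the explained instance
    random_state=123                 # reproducible perturbation sampling
)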