Interpretable Loan Approval Pipeline Model
A simple workflow demonstrating how to use a scikit-learn pipeline to classify whether a loan application should be granted. Grid search over the pipeline is applied to evaluate combinations of preprocessing and classifier hyperparameters, and the best pipeline is interpreted using the Local Interpretable Model-Agnostic Explanations (LIME) method.
- Data Preprocessing before Pipeline
- Pipeline
- Explainable AI (XAI): Local Interpretable Model-Agnostic Explanations (LIME)
- Deployment: Flask web framework
import pandas as pd
import numpy as np
# pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler
# model
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
# displays HTML representation in a jupyter context
from sklearn import set_config
set_config(display='diagram') # default: display='text'
The dataset is downloaded from a GitHub repository and contains information about loan applicants. There are 12 independent columns and 1 dependent column, as follows:
- `Loan_ID`: A unique ID assigned to every loan applicant
- `Gender`: Gender of the applicant (Male, Female)
- `Married`: The marital status of the applicant (Yes, No)
- `Dependents`: Number of people dependent on the applicant (0, 1, 2, 3+)
- `Education`: Education level of the applicant (Graduate, Not Graduate)
- `Self_Employed`: Whether the applicant is self-employed (Yes, No)
- `ApplicantIncome`: The amount of income the applicant earns
- `CoapplicantIncome`: The amount of income the co-applicant earns
- `LoanAmount`: The amount of loan the applicant has requested
- `Loan_Amount_Term`: The number of days over which the loan will be paid
- `Credit_History`: A record of the borrower's responsible repayment of debts (1 = all debts paid, 0 = not paid)
- `Property_Area`: The type of location where the applicant's property lies (Rural, Semiurban, Urban)
- `Loan_Status`: Loan granted or not (1 for granted, 0 for not granted)
There are several steps to execute before we use the sklearn Pipeline:

1. Load the data: read the `loan.csv` file and set `Loan_ID` as the index
2. Manual feature selection: drop irrelevant columns (or select relevant columns only)
3. Data type conversion: convert types to make it easier to select columns in the Pipeline
4. Feature-target split: separate the independent and dependent variables into two objects
5. Train-test split: the training set will be used to train the model, while the testing set will be treated as unseen data for model evaluation
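# 1. Load the data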
loan = pd.read_csv("data_input/loan.csv", index_col='Loan_ID')
# 2. Drop irrelevant column
loan.drop(columns=['Unnamed: 0'], inplace=True)
# 3. Data type conversion
loan['Credit_History'] = loan['Credit_History'].astype('bool').astype('object')
# 4. Feature-target split
X = loan.drop('Loan_Status', axis=1)
y = loan.Loan_Status
# 5. Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=123)
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")
Check which columns contain missing values:
na_values = X_train.isna().mean() * 100
na_values.sort_values(ascending=False).round(2).to_frame('Percentage of NA').astype(str) + ' %'
The numerical columns with missing values (`LoanAmount`, `Loan_Amount_Term`) will be imputed with the median, whereas the categorical columns (`Credit_History`, `Self_Employed`, `Gender`, `Dependents`) will be imputed with a constant "Undefined" value.
Why is a pipeline preferred? A pipeline is used to chain several estimators into one sequential process. The advantages are:

- Shorter workflow: we only need to call `.fit()` and `.predict()` once on the whole pipeline.
- Parameter selection: all parameters in the pipeline (including preprocessing parameters) can be combined into a single hyperparameter search (e.g., comparing the model's performance between mean and median imputation strategies).
- Prevention of data leakage, i.e., statistical leakage from the test data into the model during cross-validation. The most common case is imputation: if we use the mean strategy, the mean should be calculated from the training set only, not from the whole data (see the sketch below).

Note that all intermediate steps of a pipeline must implement the `.transform()` method; only the last estimator is exempt.
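As a minimal sketch of the data-leakage point (not from the original notebook; `X_num` and `y_num` are small synthetic arrays made up here for illustration), compare imputing before cross-validation with imputing inside a pipeline:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(123)
X_num = rng.normal(size=(100, 3))
X_num[rng.random((100, 3)) < 0.1] = np.nan            # inject missing values
y_num = rng.integers(0, 2, size=100)

# Leaky: the imputer sees all rows, including future validation folds
X_leaky = SimpleImputer(strategy='mean').fit_transform(X_num)
print(cross_val_score(LogisticRegression(), X_leaky, y_num, cv=3).mean())

# Leak-free: the imputer is refit on the training fold of every CV split
safe_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('clf', LogisticRegression())
])
print(cross_val_score(safe_pipe, X_num, y_num, cv=3).mean())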
First, we define the transformation pipeline for numerical and categorical columns separately:
- The numerical columns will be imputed using the median strategy and then scaled using z-score standardization.
- The categorical columns will be imputed using the constant "Undefined" value and then one-hot encoded.
num_features = ['ApplicantIncome', 'CoapplicantIncome',
'LoanAmount', 'Loan_Amount_Term']
cat_features = ['Gender', 'Married', 'Dependents', 'Education',
'Self_Employed', 'Credit_History', 'Property_Area']
num_transformer = Pipeline(
[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler()) # z-score
]
)
cat_transformer = Pipeline(
[
('imputer', SimpleImputer(strategy='constant', fill_value='Undefined')),
('encoder', OneHotEncoder(handle_unknown='ignore'))
]
)
data_transformer = ColumnTransformer(
[
('num', num_transformer, num_features),
('cat', cat_transformer, cat_features)
]
)
data_transformer
Next, the transformation pipeline is appended to a classifier. In this case, we use a simple `LogisticRegression` model.
pipe = Pipeline(
[
('preprocess', data_transformer),
('classifier', LogisticRegression(max_iter=10_000, random_state=123))  # large finite max_iter to ensure convergence
]
)
pipe
By calling the `.fit()` method, the statistics of the training data are learned by both the transformers (for imputation, scaling, and one-hot encoding) and the classifier.
pipe.fit(X_train, y_train)
Then we use the trained pipeline to predict the target variable, and the model is evaluated using `classification_report`.
print(classification_report(y_train, pipe.predict(X_train)))
print(classification_report(y_test, pipe.predict(X_test)))
Up to this point, if we are satisfied with the model's performance, we can serialize the pipeline using `joblib.dump()`:
import joblib
joblib.dump(pipe, 'cache/loan_pipeline.joblib')
Conversely, we can deserialize (load) the pickled pipeline using `joblib.load()`:
pipe = joblib.load('cache/loan_pipeline.joblib')
pipe
Note that `joblib.load()` relies on the `pickle` module and can execute arbitrary Python code. It should therefore never be used to load files from untrusted sources.
Sometimes, we want a more in-depth inspection of what each component of the pipeline has learned from the training set.
pd.DataFrame({
'Numerical Features': num_features,
'Learned Statistics (Median)': pipe['preprocess'].named_transformers_['num'].named_steps['imputer'].statistics_
})
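Next, suppose we receive a new loan application as a single-row DataFrame. We can pass it through the fitted transformation pipeline to inspect exactly what the model receives.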
new_data = pd.DataFrame({
'Gender': ['Male'],
'Married': ['No'],
'Dependents': ['0'],  # the encoder learned string categories ('0', '1', '2', '3+')
'Education': ['Graduate'],
'Self_Employed': ['No'],
'ApplicantIncome': [10000],
'CoapplicantIncome': [np.nan],
'LoanAmount': [1000],
'Loan_Amount_Term': [36],
'Credit_History': [True],
'Property_Area': [np.nan]
})
new_data
First, we extract the column names after the `OneHotEncoder` is applied:
fitted_encoder = pipe['preprocess'].named_transformers_['cat'].named_steps['encoder']
fitted_encoder.get_feature_names(cat_features)
Second, we can directly call the `.transform()` method on the new data with the trained transformation pipeline.
data_transformer.transform(new_data)
The two steps above can be combined into one code cell to produce the following DataFrame:
pd.DataFrame(
data_transformer.transform(new_data),
columns=num_features + list(fitted_encoder.get_feature_names(cat_features))
)
One advantage of using a pipeline is that we can perform a grid search over all parameters inside it. Below is the list of parameters that can be tuned for the previous pipeline. A parameter name starting with `classifier__` means we are tuning the model, whereas `preprocess__` means we are tuning the preprocessing (transformation) parameters.
sorted(list(pipe.get_params().keys()))
In the cell below we prepare the parameter combinations as follows:

- On the preprocessing step:
  - Try mean and median for the imputation strategy
  - Try no scaling (`passthrough`), standardization, and min-max scaling
- On the classifier step:
  - Try regularization strengths `C` = 0.1, 1, and 10
param_grid = {
# combination for numeric imputer
'preprocess__num__imputer__strategy': ['mean', 'median'],
# combination for numeric scaling
'preprocess__num__scaler': ['passthrough', StandardScaler(), MinMaxScaler()],
# combination for logistic regression hyperparameter
'classifier__C': [0.1, 1, 10]
}
pipe_grid = GridSearchCV(pipe, param_grid, cv=3)
pipe_grid.fit(X_train, y_train)
Print out the grid search cross-validation results:
pd.DataFrame(pipe_grid.cv_results_).sort_values('rank_test_score')[['rank_test_score', 'params', 'mean_test_score', 'std_test_score']]
We can print out the best combination of parameters based on the grid search:
pipe_grid.best_params_
Lastly, we can directly call the `.predict()` method on `pipe_grid`; it will automatically use the refitted pipeline with the `best_params_` combination:
print(classification_report(y_train, pipe_grid.predict(X_train)))
print(classification_report(y_test, pipe_grid.predict(X_test)))
References for the sklearn pipeline:

- Pipeline component
- Preprocessing data
  - Scaling: standardization, normalization
  - Non-linear transformation
  - Categorical encoding
  - Binning (numerical to categorical)
  - Missing value imputation
  - Polynomial features
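The binning and polynomial-features references above can also be used as drop-in pipeline steps. Below is a hedged sketch (not in the original notebook; the step names and parameter values are illustrative) of an alternative numeric branch:

from sklearn.preprocessing import KBinsDiscretizer, PolynomialFeatures

num_transformer_alt = Pipeline(
    [
        ('imputer', SimpleImputer(strategy='median')),
        ('poly', PolynomialFeatures(degree=2, include_bias=False)),    # add interaction terms
        ('binner', KBinsDiscretizer(n_bins=5, encode='onehot-dense'))  # numerical -> categorical bins
    ]
)
# It could replace the 'num' branch of the ColumnTransformer, or even be searched over,
# e.g. param_grid = {'preprocess__num': [num_transformer, num_transformer_alt]}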
`X_train_preprocessed` is the `X_train` data preprocessed by the best pipeline (the previous `GridSearchCV` result):
X_train_preprocessed = pd.DataFrame(
pipe_grid.best_estimator_['preprocess'].transform(X_train),
columns=num_features +
list(fitted_encoder.get_feature_names(cat_features)),
index=X_train.index
)
X_train_preprocessed.head()
`categorical_names_mapping` is a dictionary mapping each categorical column to the categories learned by the `OneHotEncoder` while encoding it.
fitted_encoder_best = pipe_grid.best_estimator_['preprocess'].named_transformers_['cat'].named_steps['encoder']
categorical_names_mapping = dict(zip(cat_features, fitted_encoder_best.categories_))
categorical_names_mapping
Set up the LIME explainer for tabular data; only the training data is used. If the `lime` package is not yet installed, run `pip install lime` in your Python environment.
from lime.lime_tabular import LimeTabularExplainer
explainer = LimeTabularExplainer(
training_data=X_train_preprocessed.values, # must be a numeric array
mode='classification',
feature_names=X_train_preprocessed.columns,
class_names=['Not Granted', 'Granted'],
categorical_names=categorical_names_mapping)
Next, provide the data that we want to interpret, namely the testing data that has been preprocessed by the best pipeline (the previous `GridSearchCV` result).
X_test_preprocessed = pd.DataFrame(
pipe_grid.best_estimator_['preprocess'].transform(X_test),
columns=num_features + list(fitted_encoder.get_feature_names(cat_features)),
index=X_test.index
)
X_test_preprocessed.head()
LIME only provides local interpretations (one observation at a time). In the cell below, we look at the interpretation for the second observation (using `.iloc[[1]]`) of `X_test_preprocessed`.
predict_fn = lambda x: pipe_grid.best_estimator_['classifier'].predict_proba(x).astype(float)
exp = explainer.explain_instance(
X_test_preprocessed.iloc[[1]].values[0], # must be a one-dimensional numpy array
predict_fn, # prediction function that outputs probability
num_features=5 # top n features to be explained
)
print(f"LIME Explanation for ID {X_test_preprocessed.iloc[[1]].index.values[0]}")
exp.show_in_notebook(show_table=True, show_all=False)
Insight for the second observation (Loan ID LP001322):

- The loan application is predicted to be granted with 80% probability.
- The reasons that support granting the loan are:
  - The credit history is good (`Credit_History_False = 0` means `Credit_History = True`, i.e., value 1)
  - The property is in a semiurban area, not in a rural area (`Property_Area_Semiurban = 1` and `Property_Area_Rural = 0`)
  - The number of dependents is not 1 (`Dependents_1 = 0`)
- The reason that contradicts granting the loan is:
  - The applicant is not married (`Married_No = 1`)
Part of the bar plot visualization above is slightly truncated, so we can view it more clearly as a DataFrame:
pd.DataFrame(exp.as_list(), columns=['Condition', 'Influence Score'])
LIME builds a simple, interpretable model to study the relationship between the features and the `predict_proba` results. It therefore reports an explanation fit value that describes how well LIME explains the black-box model (analogous to the R-squared value of a linear model). The explanation fit ranges from 0 to 1; the bigger the value, the better the fit.
print(f"Explanation fit: {exp.__dict__['score']}")
The `LimeTabularExplainer` can also be tuned like a classification model; see its parameter reference for the available options.
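As a hedged illustration (the parameter values below are arbitrary choices for demonstration, not recommendations from the original notebook), some of the tunable `LimeTabularExplainer` parameters look like this:

explainer_tuned = LimeTabularExplainer(
    training_data=X_train_preprocessed.values,
    mode='classification',
    feature_names=list(X_train_preprocessed.columns),
    class_names=['Not Granted', 'Granted'],
    kernel_width=3,                  # width of the locality kernel around the explained row
    feature_selection='lasso_path',  # how the surrogate model picks its features
    discretize_continuous=True,      # bin numeric features into quartiles
    sample_around_instance=True,     # perturb samples around the explained instance
    random_state=123                 # reproducible perturbation sampling
)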