Introduction

The number of patients with liver disease has been continuously increasing because of excessive alcohol consumption, inhalation of harmful gases, and intake of contaminated food, pickles, and drugs. This use case is provided on Kaggle using a dataset from the UCI ML Repository. The dataset contains 416 liver patient records and 167 non-liver patient records collected from the North East of Andhra Pradesh, India. The collected data is used for evaluating prediction algorithms to reduce the burden on doctors.

Import Libraries

First of all, let us import all necessary libraries, mainly for data analysis, visualization, and modeling.

import pandas as pd
import numpy as np

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.cbook import boxplot_stats

# preprocessing (pre-modeling)
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# modeling
import statsmodels.api as sm
from sklearn.neighbors import KNeighborsClassifier

# evaluation (post-modeling)
from sklearn.metrics import (recall_score, precision_score, f1_score,
                             accuracy_score, matthews_corrcoef,
                             roc_curve, roc_auc_score)
from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy import stats
from scipy.special import logit

Set Notebook Options

  • Suppress package warnings
  • Set the plot style and colors for better contrast
  • Change the float format to five decimal places
  • Display the full content of pandas columns

import warnings
warnings.filterwarnings("ignore")
sns.set(style="ticks", color_codes=True)
pd.options.display.float_format = '{:.5f}'.format
pd.options.display.max_colwidth = None

Data Wrangling

Before we jump into any visualization or modeling step, we have to make sure our data is ready.

Import Data

Let's import indian_liver_patient.csv and analyze the data structure.

liver = pd.read_csv("data_input/indian_liver_patient.csv")
liver.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 583 entries, 0 to 582
Data columns (total 11 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Age                         583 non-null    int64  
 1   Gender                      583 non-null    object 
 2   Total_Bilirubin             583 non-null    float64
 3   Direct_Bilirubin            583 non-null    float64
 4   Alkaline_Phosphotase        583 non-null    int64  
 5   Alamine_Aminotransferase    583 non-null    int64  
 6   Aspartate_Aminotransferase  583 non-null    int64  
 7   Total_Protiens              583 non-null    float64
 8   Albumin                     583 non-null    float64
 9   Albumin_and_Globulin_Ratio  579 non-null    float64
 10  Dataset                     583 non-null    int64  
dtypes: float64(5), int64(5), object(1)
memory usage: 50.2+ KB

The dataset contains 583 observations of patient records with 11 columns as follows:

  • Age: Age of the patient. Any patient whose age exceeded 89 is listed as being age 90.
  • Gender: Gender of the patient, either Male or Female.
  • Total_Bilirubin: Bilirubin is an orange-yellow pigment, a waste product primarily produced by the normal breakdown of heme. This test measures the amount of bilirubin in the blood to evaluate a person's liver function or to help diagnose anemias caused by red blood cell destruction (hemolytic anemia). Measured in mg/dL.
  • Direct_Bilirubin: Water-soluble forms of bilirubin. Measured in mg/dL.
  • Alkaline_Phosphotase (ALP): Enzyme found in several tissues throughout the body. The highest concentrations of ALP are present in the cells that comprise bone and the liver. Elevated levels of ALP in the blood are most commonly caused by liver disease or bone disorders. Measured in U/L.
  • Alamine_Aminotransferase (ALT): Enzyme found mostly in the cells of the liver and kidney. Normally, ALT levels in the blood are low, but when the liver is damaged, ALT is released into the blood and the level increases. This test measures the level of ALT in the blood and is useful for early detection of liver disease. Measured in U/L.
  • Aspartate_Aminotransferase: Enzyme found in cells throughout the body but mostly in the heart and liver and, to a lesser extent, in the kidneys and muscles. In healthy individuals, levels of AST in the blood are low. When liver or muscle cells are injured, they release AST into the blood. This makes AST a useful test for detecting or monitoring liver damage. Measured in U/L.
  • Total_Protiens: Measures the amount of protein in g/dL.
  • Albumin: Made by the liver and makes up about 60% of the total protein. Albumin keeps fluid from leaking out of blood vessels, nourishes tissues, and transports hormones, vitamins, drugs, and substances like calcium throughout the body. Measured in g/dL.
  • Albumin_and_Globulin_Ratio: Compares the amount of Albumin with Globulin. Globulin makes up the remaining 40% of proteins in the blood; it is a varied group of proteins, some produced by the liver and some by the immune system, that help fight infection and transport nutrients. Measured in g/dL.
  • Dataset: A class label used to divide the patients into two groups. Value 1 indicates a patient with liver disease and 2 otherwise.

Rename Columns

Rename the columns to their abbreviations to make them shorter. The dictionary below maps each new abbreviated name to its original descriptive name; we assign its keys as the new column names.

column_names = {
    "Age": "Age",
    "Gender": "Gender",
    "TB": "Total Bilirubin",
    "DB": "Direct Bilirubin",
    "ALP": "Alkaline Phosphotase",
    "ALT": "Alamine Aminotransferase",
    "AST": "Aspartate Aminotransferase",
    "TP": "Total Proteins",
    "Albumin": "Albumin",
    "A/G Ratio": "Albumin and Globulin Ratio",
    "Disease": "Dataset"
}
liver.columns = column_names.keys()
liver.head()
Age Gender TB DB ALP ALT AST TP Albumin A/G Ratio Disease
0 65 Female 0.70000 0.10000 187 16 18 6.80000 3.30000 0.90000 1
1 62 Male 10.90000 5.50000 699 64 100 7.50000 3.20000 0.74000 1
2 62 Male 7.30000 4.10000 490 60 68 7.00000 3.30000 0.89000 1
3 58 Male 1.00000 0.40000 182 14 20 6.80000 3.40000 1.00000 1
4 72 Male 3.90000 2.00000 195 27 59 7.30000 2.40000 0.40000 1

Rename Target Variable

Rename the values in the target variable Disease:

  • 1 indicates a patient with liver disease; we substitute this with "Yes"
  • 2 indicates a non-liver disease patient; we substitute this with "No"

liver.loc[liver["Disease"] == 1, "Disease"] = "Yes"
liver.loc[liver["Disease"] == 2, "Disease"] = "No"
liver["Disease"].unique()
array(['Yes', 'No'], dtype=object)

Data Type Conversion

Convert these two categorical columns, from object to category type:

  • Gender with two levels: "Female" and "Male"
  • Disease with two levels: "Yes" and "No"

liver[["Gender", "Disease"]] = liver[["Gender", "Disease"]].astype("category")
liver.dtypes
Age             int64
Gender       category
TB            float64
DB            float64
ALP             int64
ALT             int64
AST             int64
TP            float64
Albumin       float64
A/G Ratio     float64
Disease      category
dtype: object

Identify Missing Values

We have to make sure our data is complete, meaning that there are no missing or NA values.

liver.isna().sum()
Age          0
Gender       0
TB           0
DB           0
ALP          0
ALT          0
AST          0
TP           0
Albumin      0
A/G Ratio    4
Disease      0
dtype: int64

Four observations of A/G Ratio are missing; this will be handled after the feature engineering step.

Feature Engineering

Feature engineering is the process of using domain knowledge to extract features from raw data. These features can improve the performance of machine learning models. In this section, we'll engineer features to reduce the correlation between the predictors and thus the risk of multicollinearity in the model. Let's plot the Pearson correlation heatmap to find out which predictors are strongly correlated.

def plotCorrelationHeatmap(data, figsize=(12, 6)):
    plt.figure(figsize=figsize)
    corr_val = data.corr(method="pearson", numeric_only=True)
    # mask the upper triangle so each pair appears only once
    mask = np.zeros_like(corr_val, dtype=bool)
    mask[np.triu_indices_from(mask, k=1)] = True
    corr_heatmap = sns.heatmap(corr_val, mask=mask,
                               annot=True, fmt='.3f', linewidths=3, cmap="Reds")
    corr_heatmap.set_title("PEARSON CORRELATION HEATMAP",
                           fontsize=15, fontweight="bold")


plotCorrelationHeatmap(liver)

Considering correlations above 0.6 as strong, we minimize the risk of multicollinearity in our model by treating these feature pairs:

  • TB and DB (0.875): Merge into DB/TB Percentage.
  • AST and ALT (0.792): We could merge them into an AST/ALT Ratio, but that turned out to be an insignificant predictor later in modeling. Thus, we keep them as separate predictors.
  • Albumin and A/G Ratio (0.69): Calculate the Globulin level, given the A/G ratio.
  • Albumin and TP (0.784): Albumin is a group of proteins, so it naturally has a strong correlation with TP (Total Proteins). These two variables will be discussed later.

liver["DB/TB Percentage"] = liver["DB"] / liver["TB"] * 100
# liver["AST/ALT Ratio"] = liver["AST"]/liver["ALT"]
liver["Globulin"] = liver["Albumin"] / liver["A/G Ratio"]
liver = liver.drop(["DB", "TB", "A/G Ratio"], axis=1)
liver.head()
Age Gender ALP ALT AST TP Albumin Disease DB/TB Percentage Globulin
0 65 Female 187 16 18 6.80000 3.30000 Yes 14.28571 3.66667
1 62 Male 699 64 100 7.50000 3.20000 Yes 50.45872 4.32432
2 62 Male 490 60 68 7.00000 3.30000 Yes 56.16438 3.70787
3 58 Male 182 14 20 6.80000 3.40000 Yes 40.00000 3.40000
4 72 Male 195 27 59 7.30000 2.40000 Yes 51.28205 6.00000

Here is the new correlation heatmap after feature engineering:

plotCorrelationHeatmap(liver, (14, 6))

Interestingly, TP is now strongly correlated with both Albumin and Globulin. Let's manually compute the sum of Albumin and Globulin as Albumin+Globulin, then further investigate its correlation with TP:

AG_df = liver[["Albumin", "Globulin", "TP"]].copy()  # copy to avoid SettingWithCopyWarning
AG_df["Albumin+Globulin"] = AG_df["Albumin"] + AG_df["Globulin"]
plotCorrelationHeatmap(AG_df, (6, 5))

There is a strong (almost perfect) correlation between TP and Albumin+Globulin, indicating that TP can be explained by Albumin and Globulin together. We will use this information to handle the missing values.

Handle Missing Value

Remember the missing values in A/G Ratio? They have now moved to Globulin, since we calculated it from A/G Ratio.

missing = liver["Globulin"].isna()
liver[missing]
Age Gender ALP ALT AST TP Albumin Disease DB/TB Percentage Globulin
209 45 Female 189 23 33 6.60000 3.90000 Yes 33.33333 nan
241 51 Male 230 24 46 6.50000 3.10000 Yes 25.00000 nan
253 35 Female 180 12 15 5.20000 2.70000 No 33.33333 nan
312 27 Male 106 25 54 8.50000 4.80000 No 46.15385 nan

Let's impute the missing Globulin values by fitting a linear regression line $TP = \beta_0 + \beta_1 \cdot Albumin + \beta_2 \cdot Globulin$ using the non-missing data.

Important: Make sure to fit the regression line on the training data only, since we don't want data leakage to occur. Statistics should not be shared between training and testing data. Use the same random_state to ensure the split in this section matches the split in the modeling section.

liver_train, liver_test = train_test_split(liver, test_size=0.25, random_state=888)
X_impute = liver_train[~missing][["Albumin", "Globulin"]]
y_impute = liver_train[~missing]["TP"].values
lin_reg = sm.OLS(y_impute, sm.add_constant(X_impute)).fit()
beta = lin_reg.params.values

print("Adjusted R-squared: {:.3f}%".format(100*lin_reg.rsquared_adj))
print("Estimate:", beta)
Adjusted R-squared: 90.619%
Estimate: [0.4775957  1.06058543 0.77073515]

The Adjusted R-squared indicates that the model is a good fit. Now, we change the subject of the formula from TP to Globulin as follows:

$TP = \beta_0 + \beta_1 \cdot Albumin + \beta_2 \cdot Globulin$

$Globulin = \dfrac{TP - \beta_0 - \beta_1 \cdot Albumin}{\beta_2}$

where:

  • $\beta_0$ is the intercept of the regression line; $TP$ equals $\beta_0$ when $Albumin = 0$ and $Globulin = 0$
  • $\beta_1$ is the coefficient of $Albumin$: a one unit increase in $Albumin$ increases $TP$ by $\beta_1$
  • $\beta_2$ is the coefficient of $Globulin$: a one unit increase in $Globulin$ increases $TP$ by $\beta_2$

We will use this formula to impute the missing value of Globulin based on TP and Albumin.

liver["Globulin"] = liver.apply(
    lambda row: (row["TP"] - beta[0] - beta[1] * row["Albumin"]) / beta[2]
    if np.isnan(row["Globulin"]) 
    else row.Globulin,
    axis=1
)
liver[missing]
Age Gender ALP ALT AST TP Albumin Disease DB/TB Percentage Globulin
209 45 Female 189 23 33 6.60000 3.90000 Yes 33.33333 2.57692
241 51 Male 230 24 46 6.50000 3.10000 Yes 25.00000 3.54803
253 35 Female 180 12 15 5.20000 2.70000 No 33.33333 2.41175
312 27 Male 106 25 54 8.50000 4.80000 No 46.15385 3.80363

We drop TP as it is strongly correlated with Albumin and Globulin.

liver = liver.drop(["TP"], axis=1)
liver.head()
Age Gender ALP ALT AST Albumin Disease DB/TB Percentage Globulin
0 65 Female 187 16 18 3.30000 Yes 14.28571 3.66667
1 62 Male 699 64 100 3.20000 Yes 50.45872 4.32432
2 62 Male 490 60 68 3.30000 Yes 56.16438 3.70787
3 58 Male 182 14 20 3.40000 Yes 40.00000 3.40000
4 72 Male 195 27 59 2.40000 Yes 51.28205 6.00000

Exploratory Data Analysis

Our data is ready; now we visualize the distributions and proportions of the variables.

Pair Plot

pair_plot = sns.pairplot(liver, hue="Disease",
                         diag_kind="kde", corner=True, markers='+')
pair_plot.fig.suptitle("PAIR PLOT OF NUMERICAL VARIABLES",
                       size=25, fontweight="bold")
pair_plot;

Note: From the pair plot of numerical variables, we see little to no correlation between the independent variables, as expected after the feature engineering step, except between ALT and AST. The variables ALP, ALT, AST, and DB/TB Percentage have highly positively skewed distributions. We'll analyze them further using box plots.

Box Plot

fig, axes = plt.subplots(2, 4, figsize=(15, 8))

for ax, col in zip(axes.flat, liver.select_dtypes('number').columns):
    sns.boxplot(x="Disease", y=col, data=liver, ax=ax)
    # Outlier Count
    outlier_count = 0
    for disease in liver["Disease"].cat.categories:
        liver_disease = liver.loc[liver["Disease"] == disease, col]
        outlier_list = boxplot_stats(liver_disease).pop(0)['fliers']
        outlier_count += len(outlier_list)
    ax.set_title("Outlier Count: {} ({:.2f}%)".format(
        outlier_count, 100*outlier_count/liver.shape[0]))

axes[-1, -1].axis("off")
plt.tight_layout()
fig.suptitle("BOX PLOT OF NUMERICAL VARIABLES",
             size=28, y=1.05, fontweight="bold")
plt.show()

Note: Comparing medians, patients who have liver disease have higher Age, lower Albumin, and slightly higher Globulin than those who don't. This will be further explained in the modeling step. More than 10% of the observations in ALP, ALT, and AST are considered outliers. We'll apply transformations during modeling to handle these outliers.

Modeling: Logistic Regression

Logistic regression is a classification algorithm that models the log-odds of the positive class as a linear combination of the predictors: $\log{\frac{p}{1-p}} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_n X_n$ where:

  • $p$ is the probability of an observation to be in the positive class
  • $\frac{p}{1-p}$ is called the odds
  • $\log{\frac{p}{1-p}}$ is called the log-odds or logit
  • $X_1, X_2, ..., X_n$ are the predictors
  • $\beta_0$ is a constant value
  • $\beta_1, \beta_2, ..., \beta_n$ are the coefficients of $X_1, X_2, ..., X_n$ respectively
  • $n$ is the number of predictors

The formula above can be re-written as a sigmoid curve:

$p = \dfrac{1}{1 + e^{-z}}$ where $z = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_n X_n$

z_plot = np.linspace(-10, 10)
plt.plot(z_plot, 1/(1 + np.exp(-z_plot)))
plt.axvline(0, color="k", ls="--", alpha=0.25)
plt.axhline(0.5, color="k", ls="--", alpha=0.25)
plt.xlabel("z")
plt.ylabel("p")
plt.title("ILLUSTRATION: SIGMOID CURVE", fontweight="bold")
plt.show()

Data Preparation

We have to do some data preparation specifically for modeling:

  • Create dummy variables for the categorical variables Gender and Disease
  • Separate the target variable from the predictors
  • Train-test split in order to evaluate our model, with 75% train and 25% test

liver_dummy = pd.get_dummies(
    liver,
    columns=liver.select_dtypes('category').columns,
    drop_first=True)
X = liver_dummy.drop(["Disease_Yes"], axis=1)
y = liver_dummy["Disease_Yes"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=888)

print("X Train:", X_train.shape)
print("X Test:", X_test.shape)
print("y Train:", y_train.shape)
print("y Test:", y_test.shape)
X Train: (437, 8)
X Test: (146, 8)
y Train: (437,)
y Test: (146,)

Imbalance Data

Next, we check the frequency of each level of the target variable Disease in the training and testing sets.

def plotProportion(data):
    data['Disease'] = data['Disease'].replace({0: 'No', 1: 'Yes'})
    ax = data.groupby(['Disease', 'Set']).size().unstack().plot(kind='bar', stacked=True)
    for rect in ax.patches:
        height = rect.get_height()
        width = rect.get_width()
        padding = 0.25

        ax.text(rect.get_x() + width - padding,
                rect.get_y() + height / 2,
                int(height),
                ha='center', va='center',
                color="white")

data = pd.concat([
    pd.DataFrame({'Disease': y_train, 'Set': 'Training'}),
    pd.DataFrame({'Disease': y_test, 'Set': 'Testing'})
])
plotProportion(data)
plt.title("PROPORTION OF TARGET VARIABLE", fontsize=14, fontweight="bold")
plt.show()

data_prop = pd.crosstab(
    index=data.Set,
    columns=data.Disease,
    margins=True,
    normalize="index") * 100
data_prop.round(2).astype(str) + ' %'
Disease No Yes
Set
Testing 28.77 % 71.23 %
Training 28.6 % 71.4 %
All 28.64 % 71.36 %

We consider our data imbalanced with a 71:29 ratio. Therefore, we can do either upsampling or downsampling to balance the positive and negative classes of Disease.

  • Upsampling is a method to randomly sample (with replacement) observations from the minority class to make the dataset balanced.
  • Downsampling is a method to randomly subsample observations from the majority class to make the dataset balanced.

If we choose to downsample the data, we end up with only a small number of observations and lose information. Therefore, in this case we prefer upsampling, using the Synthetic Minority Over-sampling Technique (SMOTE), which synthesizes new minority-class observations by interpolating between existing neighboring ones rather than merely duplicating them.
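
For contrast, here is a minimal sketch of what plain random upsampling would look like with sklearn.utils.resample (imported earlier); this block is for illustration only, and the actual pipeline below uses SMOTE instead.

train_df = X_train.assign(Disease=y_train)
majority = train_df[train_df["Disease"] == 1]  # liver disease (71%)
minority = train_df[train_df["Disease"] == 0]  # no liver disease (29%)

# Upsample: draw from the minority class WITH replacement
# until it matches the majority class size
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=888)
upsampled = pd.concat([majority, minority_up])
upsampled["Disease"].value_counts()

# Downsampling would be the reverse:
# resample(majority, replace=False, n_samples=len(minority), random_state=888)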

from imblearn.over_sampling import SMOTE
sampler = SMOTE(random_state=888)
X_train_smote, y_train_smote = sampler.fit_resample(X_train, y_train)

data_smote = pd.concat([
    pd.DataFrame({'Disease': y_train_smote, 'Set': 'Training'}),
    pd.DataFrame({'Disease': y_test, 'Set': 'Testing'})
])
plotProportion(data_smote)
plt.title("PROPORTION OF TARGET VARIABLE (UPSAMPLED)",
          fontsize=14, fontweight="bold")
plt.show()

data_smote_prop = pd.crosstab(
    index=data_smote.Set,
    columns=data_smote.Disease,
    margins=True,
    normalize="index") * 100
data_smote_prop.round(2).astype(str) + ' %'
Disease No Yes
Set
Testing 28.77 % 71.23 %
Training 50.0 % 50.0 %
All 45.97 % 54.03 %

Important: Notice how the proportion in the testing set is left untouched. The common practice is to balance only the training set, since we want to set the testing set aside as unseen data.

Base Model

As a starting point, let us fit a logistic regression model with all existing predictors and see how it goes.

def fitLogisticRegression(X, y):
    model = sm.Logit(y, sm.add_constant(X))
    result = model.fit()
    return result

model_all = fitLogisticRegression(X_train_smote, y_train_smote)
model_all.summary()
Optimization terminated successfully.
         Current function value: 88.807037
         Iterations 8
Logit Regression Results
Dep. Variable: y No. Observations: 624
Model: Logit Df Residuals: 615
Method: MLE Df Model: 8
Date: Tue, 26 Oct 2021 Pseudo R-squ.: inf
Time: 21:03:57 Log-Likelihood: -55416.
converged: True LL-Null: 0.0000
Covariance Type: nonrobust LLR p-value: 1.000
coef std err z P>|z| [0.025 0.975]
const -3.7837 0.790 -4.790 0.000 -5.332 -2.236
Age 0.0244 0.006 4.066 0.000 0.013 0.036
ALP 0.0013 0.001 1.936 0.053 -1.6e-05 0.003
ALT 0.0112 0.005 2.458 0.014 0.002 0.020
AST 0.0069 0.003 2.249 0.025 0.001 0.013
Albumin -0.2874 0.135 -2.133 0.033 -0.551 -0.023
DB/TB Percentage 0.0201 0.008 2.505 0.012 0.004 0.036
Globulin 0.3830 0.151 2.538 0.011 0.087 0.679
Gender_Male 0.7252 0.208 3.480 0.001 0.317 1.134

Can we further improve our logistic regression model? One way is to handle the outliers present in our predictors.

Handle Outliers

In this section, we'll transform the predictors that contain many outliers: ALP, ALT, and AST. We will compare the following:

  1. Log transformation
  2. Square root transformation

fig, axes = plt.subplots(1, 2, figsize=(10, 5))

x_axis = np.linspace(1, 5000, 1000)
y_axis = [np.log(x_axis), np.sqrt(x_axis)]
title_list = ["LOG: $y = \log(x)$", "SQUARE ROOT: $y = \sqrt{x}$"]

for y, title, ax in zip(y_axis, title_list, axes):
    ax.plot(x_axis, y)
    ax.set_xlabel("Original Value")
    ax.set_ylabel("Transformed Value")
    ax.set_title(title)

plt.tight_layout()
fig.suptitle("TRANSFORMATION FUNCTION", size=28, y=1.05, fontweight="bold")
plt.show()
transform_col = ["ALP", "ALT", "AST"]
X_train_untransformed = X_train_smote.drop(transform_col, axis=1)
X_test_untransformed = X_test.drop(transform_col, axis=1)

Note: Logarithm and square root functions can be used to handle outliers since they compress the range of the original values (horizontal axis) into a much narrower range of transformed values (vertical axis). For example, the maximum AST reading of 4929 U/L maps to log(4929) ≈ 8.5 and sqrt(4929) ≈ 70.2, pulling it much closer to the typical values.

Log Transformation

Transform the predictors ALP, ALT, and AST using the natural logarithm function: $y = \log(x)$

def transform(data, transform_col, func):
    data = data[transform_col].transform(func)
    data.columns = [f"{func.__name__}_{col}" for col in transform_col]
    return data


X_train_smote_log = transform(X_train_smote, transform_col, np.log)
X_test_log = transform(X_test, transform_col, np.log)
X_train_smote_log.head()
log_ALP log_ALT log_AST
0 5.20949 4.51086 3.97029
1 5.14166 3.09104 2.77259
2 6.19441 4.74493 4.51086
3 5.55296 4.38203 4.72739
4 5.37064 5.24702 6.85646

Note: When using a transformation function, we have to apply it to both the training and testing sets, since the model expects predictor values in the same range.

def boxPlotTransformedData(data, figsize=(10, 4)):
    fig, axes = plt.subplots(1, data.shape[1]-1, figsize=figsize)
    for ax, col in zip(axes, data.columns[:-1]):
        sns.boxplot(x="Disease", y=col, data=data, ax=ax)
        # Outlier Count
        outlier_count = 0
        for flag in data["Disease"].cat.categories:
            flag_disease = data.loc[data["Disease"] == flag, col]
            outlier_list = boxplot_stats(flag_disease).pop(0)['fliers']
            outlier_count += len(outlier_list)
        ax.set_title("Outlier Count: {} ({:.2f}%)".format(
            outlier_count, 100*outlier_count/liver.shape[0]))
    plt.tight_layout()


boxPlotTransformedData(
    pd.concat([
        X_train_smote_log,
        pd.Series(y_train_smote, name="Disease").replace({0: 'No', 1: 'Yes'}).astype('category')
    ],
        axis=1)
)
plt.suptitle("BOX PLOT OF LOG-TRANSFORMED VARIABLES",
             size=25, y=1.05, fontweight="bold")
plt.show()

The outlier count has been reduced, now let's fit the log-transformed data into the logistic regression model.

X_train_smote_log = pd.concat([X_train_untransformed, X_train_smote_log], axis=1)
X_test_log = pd.concat([X_test_untransformed, X_test_log], axis=1)

model_log = fitLogisticRegression(X_train_smote_log, y_train_smote)
model_log.summary()
Optimization terminated successfully.
         Current function value: 90.611754
         Iterations 6
Logit Regression Results
Dep. Variable: y No. Observations: 624
Model: Logit Df Residuals: 615
Method: MLE Df Model: 8
Date: Tue, 26 Oct 2021 Pseudo R-squ.: inf
Time: 21:03:58 Log-Likelihood: -56542.
converged: True LL-Null: 0.0000
Covariance Type: nonrobust LLR p-value: 1.000
coef std err z P>|z| [0.025 0.975]
const -8.8601 1.409 -6.290 0.000 -11.621 -6.099
Age 0.0253 0.006 4.196 0.000 0.013 0.037
Albumin -0.2775 0.135 -2.055 0.040 -0.542 -0.013
DB/TB Percentage 0.0217 0.008 2.712 0.007 0.006 0.037
Globulin 0.3583 0.151 2.371 0.018 0.062 0.655
Gender_Male 0.7102 0.209 3.396 0.001 0.300 1.120
log_ALP 0.5216 0.236 2.206 0.027 0.058 0.985
log_ALT 0.6592 0.230 2.863 0.004 0.208 1.111
log_AST 0.2873 0.200 1.439 0.150 -0.104 0.679

Square Root Transformation

Transform the predictors ALP, ALT, and AST using the square root function: $y = \sqrt{x}$

X_train_smote_sqrt = transform(X_train_smote, transform_col, np.sqrt)
X_test_sqrt = transform(X_test, transform_col, np.sqrt)
X_train_smote_sqrt.head()
sqrt_ALP sqrt_ALT sqrt_AST
0 13.52775 9.53939 7.28011
1 13.07670 4.69042 4.00000
2 22.13594 10.72381 9.53939
3 16.06238 8.94427 10.63015
4 14.66288 13.78405 30.82207

boxPlotTransformedData(
    pd.concat([
        X_train_smote_sqrt,
        pd.Series(y_train_smote, name="Disease").replace({0: 'No', 1: 'Yes'}).astype('category')
    ],
        axis=1)
)
plt.suptitle("BOX PLOT OF SQRT-TRANSFORMED VARIABLES",
             size=25, y=1.05, fontweight="bold")
plt.show()

The outlier count has been reduced. Now, let's fit the square root transformed data into the logistic regression model.

X_train_smote_sqrt = pd.concat([X_train_untransformed, X_train_smote_sqrt], axis=1)
X_test_sqrt = pd.concat([X_test_untransformed, X_test_sqrt], axis=1)

model_sqrt = fitLogisticRegression(X_train_smote_sqrt, y_train_smote)
model_sqrt.summary()
Optimization terminated successfully.
         Current function value: 90.255749
         Iterations 7
Logit Regression Results
Dep. Variable: y No. Observations: 624
Model: Logit Df Residuals: 615
Method: MLE Df Model: 8
Date: Tue, 26 Oct 2021 Pseudo R-squ.: inf
Time: 21:03:59 Log-Likelihood: -56320.
converged: True LL-Null: 0.0000
Covariance Type: nonrobust LLR p-value: 1.000
coef std err z P>|z| [0.025 0.975]
const -5.2668 0.868 -6.067 0.000 -6.968 -3.565
Age 0.0251 0.006 4.155 0.000 0.013 0.037
Albumin -0.2765 0.135 -2.048 0.041 -0.541 -0.012
DB/TB Percentage 0.0201 0.008 2.512 0.012 0.004 0.036
Globulin 0.3677 0.151 2.431 0.015 0.071 0.664
Gender_Male 0.7173 0.209 3.430 0.001 0.307 1.127
sqrt_ALP 0.0568 0.026 2.149 0.032 0.005 0.109
sqrt_ALT 0.1758 0.068 2.577 0.010 0.042 0.310
sqrt_AST 0.1010 0.052 1.953 0.051 -0.000 0.202

Evaluation

Next, we evaluate our binary logistic regression classifier by using a confusion matrix as follows:

pd.DataFrame(
    [["True Positive (TP)", "False Negative (FN)"], ["False Positive (FP)", "True Negative (TN)"]],
    index = [["Actual", "Actual"], ["Positive", "Negative"]],
    columns = [["Predicted", "Predicted"], ["Positive", "Negative"]],
)
Predicted
Positive Negative
Actual Positive True Positive (TP) False Negative (FN)
Negative False Positive (FP) True Negative (TN)

There are several metrics considered in this case:

  • Recall is the proportion of actual positives that are classified correctly.
$Recall = \dfrac{TP}{TP+FN}$
  • Precision is the proportion of TP out of all observations predicted as positive.
$Precision = \dfrac{TP}{TP+FP}$
  • F1-score is the harmonic mean of precision and recall.
$F1 = \dfrac{2 \times Precision \times Recall}{Precision+Recall}$

However, Precision, Recall, and F1-score do not account for the TN. We'll introduce other metrics that measure the overall performance of the model using all values in the confusion matrix:

  • Accuracy is the proportion of observations that are classified correctly. This metric is not robust when the classes are imbalanced.
$Accuracy = \dfrac{TP+TN}{TP+TN+FP+FN}$
  • MCC, which stands for Matthews Correlation Coefficient, ranges between -1 and 1.

    • MCC = 1 indicates a perfect positive correlation, the classifier is perfect $(FP = FN = 0)$
    • MCC = -1 indicates a perfect negative correlation, the classifier always misclassifies $(TP = TN = 0)$

    • MCC = 0 indicates no correlation; the classifier classifies observations at random

$MCC = \dfrac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$
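
As a quick sanity check of these formulas, here's a toy confusion matrix worked through by hand (hypothetical counts, for illustration only):

# Hypothetical counts: TP=80, FN=20, FP=30, TN=70
TP, FN, FP, TN = 80, 20, 30, 70

recall = TP / (TP + FN)                             # 0.800
precision = TP / (TP + FP)                          # 0.727
f1 = 2 * precision * recall / (precision + recall)  # 0.762
accuracy = (TP + TN) / (TP + TN + FP + FN)          # 0.750
mcc = (TP * TN - FP * FN) / np.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))  # 0.503

print(f"Recall={recall:.3f}, Precision={precision:.3f}, F1={f1:.3f}, "
      f"Accuracy={accuracy:.3f}, MCC={mcc:.3f}")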

def evaluateLogReg(result, X_true, y_true):
    eval_list = []
    y_pred_prob = result.predict(sm.add_constant(X_true))
    for threshold in np.linspace(0, 0.99, 100):
        y_pred_cl = (y_pred_prob > threshold).astype(int)
        eval_res = {
            "Threshold": threshold,
            "Recall": recall_score(y_true, y_pred_cl),
            "Precision": precision_score(y_true, y_pred_cl),
            "F1": f1_score(y_true, y_pred_cl),
            "Accuracy": accuracy_score(y_true, y_pred_cl),
            "MCC": matthews_corrcoef(y_true, y_pred_cl)
        }
        eval_list.append(eval_res)
    eval_df = pd.DataFrame(eval_list)
    return eval_df

eval_logreg_all = evaluateLogReg(model_all, X_test, y_test)
eval_logreg_log = evaluateLogReg(model_log, X_test_log, y_test)
eval_logreg_sqrt = evaluateLogReg(model_sqrt, X_test_sqrt, y_test)

eval_logreg_list = [eval_logreg_all, eval_logreg_log, eval_logreg_sqrt]

title_list = ["ALL PREDICTORS", "LOG-TRANSFORMED", "SQRT-TRANSFORMED"]
fig, axes = plt.subplots(3, 1, figsize=(15, 10))

thresh_list = []
for ax, eval_df, title in zip(axes.flat, eval_logreg_list, title_list):
    # LINE PLOT
    eval_df = eval_df.drop(["Accuracy"], axis=1)
    lineplot = eval_df.plot(x="Threshold", color="rgbk", legend=False, ax=ax)

    # IDENTIFY CENTER
    diff = abs(eval_df["Recall"] - eval_df["Precision"])
    thresh_eq = eval_df[diff == min(diff)]["Threshold"].values[0]
    ax.axvline(x=thresh_eq, ls='--', color="y")
    ax.text(x=thresh_eq + 0.01, y=0.05,
            s="CENTER",
            fontsize=12, color="y")

    # F1 MEASURE
    row_max_F1 = eval_df[eval_df["F1"] == max(eval_df["F1"])]
    thresh_max_F1 = row_max_F1["Threshold"].values[0]
    ax.axvline(x=thresh_max_F1, ls='--', color="b")
    ax.text(x=thresh_max_F1 - 0.01, y=0.7,
            s="MAX F1",
            horizontalalignment='right',
            fontsize=12, color="b")

    # LOCATE MCC
    mcc = row_max_F1["MCC"].values[0]
    ax.plot(thresh_max_F1, mcc, marker='x', markersize=10, color="k")
    ax.text(x=thresh_max_F1 - 0.025, y=mcc,
            s="MCC = {:.3f}".format(mcc),
            horizontalalignment='right',
            fontsize=12, fontweight="bold", color="k")

    ax.set_xticks([0, 1] + [thresh_eq, thresh_max_F1])
    ax.set_title(title, fontweight="bold")
    handles, labels = ax.get_legend_handles_labels()
    thresh_list.append(thresh_max_F1)

plt.tight_layout()
plt.legend(handles=handles, loc="center",
           bbox_to_anchor=(0.5, -0.5),
           shadow=True, ncol=4)
fig.suptitle("LOGISTIC REGRESSION MODEL EVALUATION",
             size=28, y=1.05, fontweight="bold")
plt.show()

The output of our logistic regression is $p$, the probability of a particular patient being classified as having liver disease. For each model, we iterate the threshold from 0 to 1: if $p > threshold$, the patient is classified as having liver disease. Our goal is to find the optimum threshold by comparing metrics. Here's the thought process:

  • First, identify the center threshold, where Recall equals Precision.
  • In this case, we want to minimize False Negatives, where a patient with liver disease is predicted to be healthy. This can be a threat to the patient's life because they will not get treatment. Therefore, we prioritize Recall over Precision, which pushes the chosen threshold below the center threshold.
  • In reality, we also care about minimizing False Positives, where a healthy patient is predicted to have liver disease. This can trigger panic, but it is not as severe as a False Negative. Therefore, we choose the threshold with the maximum F1 score, which takes Precision into account.
  • Lastly, out of the three models, we pick the best overall model by the highest MCC value.

eval_logreg_df = pd.concat([eval_df[eval_df["Threshold"] == thresh_list[idx]]
                           for idx, eval_df in enumerate(eval_logreg_list)])
eval_logreg_df.index = title_list
eval_logreg_df
Threshold Recall Precision F1 Accuracy MCC
ALL PREDICTORS 0.19000 0.98077 0.73913 0.84298 0.73973 0.24591
LOG-TRANSFORMED 0.20000 0.98077 0.75556 0.85356 0.76027 0.33453
SQRT-TRANSFORMED 0.19000 0.98077 0.74453 0.84647 0.74658 0.27750

In conclusion, we choose logistic regression model with log-transformed data at threshold = 0.2. Next, we interpret and check the assumptions of this model.

Model Interpretation

One of the advantages of the logistic regression model is its interpretability. We can interpret each coefficient and its significance with respect to the target variable.

final_model = model_log
final_model.summary()
Logit Regression Results
Dep. Variable: y No. Observations: 624
Model: Logit Df Residuals: 615
Method: MLE Df Model: 8
Date: Tue, 26 Oct 2021 Pseudo R-squ.: inf
Time: 21:04:01 Log-Likelihood: -56542.
converged: True LL-Null: 0.0000
Covariance Type: nonrobust LLR p-value: 1.000
coef std err z P>|z| [0.025 0.975]
const -8.8601 1.409 -6.290 0.000 -11.621 -6.099
Age 0.0253 0.006 4.196 0.000 0.013 0.037
Albumin -0.2775 0.135 -2.055 0.040 -0.542 -0.013
DB/TB Percentage 0.0217 0.008 2.712 0.007 0.006 0.037
Globulin 0.3583 0.151 2.371 0.018 0.062 0.655
Gender_Male 0.7102 0.209 3.396 0.001 0.300 1.120
log_ALP 0.5216 0.236 2.206 0.027 0.058 0.985
log_ALT 0.6592 0.230 2.863 0.004 0.208 1.111
log_AST 0.2873 0.200 1.439 0.150 -0.104 0.679

Recall that the formula of logistic regression is $\log{\frac{p}{1-p}} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_n X_n$ where the coefficients are exactly the $\beta$s. A coefficient represents the change in the logit value for a one unit increase in the predictor. For example, a one unit increase in Age increases the logit value by 0.0253, whereas a one unit increase in Albumin decreases the logit value by 0.2775.

We can exponentiate the coefficients to get odds ratios. An odds ratio represents the multiplicative change in the odds for a one unit increase in the predictor. Let's take a look at the table below.

model_interpret = pd.DataFrame(final_model.params, columns=["Logit Difference"])
model_interpret["Ratio of Odds"] = model_interpret["Logit Difference"].transform(np.exp)
model_interpret
Logit Difference Ratio of Odds
const -8.86010 0.00014
Age 0.02533 1.02565
Albumin -0.27747 0.75770
DB/TB Percentage 0.02174 1.02198
Globulin 0.35831 1.43090
Gender_Male 0.71016 2.03432
log_ALP 0.52158 1.68469
log_ALT 0.65923 1.93330
log_AST 0.28731 1.33284

The odds ratio of Age being 1.02565 means that for a one unit increase in Age, we expect a 2.565% increase in the odds of a patient having liver disease. On the other hand, the odds ratio of Albumin being 0.75770 means that for a one unit increase in Albumin, we expect a 24.23% decrease in the odds of a patient having liver disease.

Note: The table above shows the change in the odds, not the probability. The ratio of probabilities changes depending on the predictor values, as follows:

for age in range(20, 23):
    print(f"\nComparison of Age {age} with Age {age+1}")
    print("=================================")
    interpret_df = pd.DataFrame([[1, age, 5, 50, 5, 1, 5, 5, 5],
                                 [1, age+1, 5, 50, 5, 1, 5, 5, 5]],
                                columns=["const"] + list(X_train_smote_log.columns))
    prob_interpret = final_model.predict(interpret_df)
    logit_interpret = prob_interpret.transform(logit)
    odds_age = np.exp(logit_interpret[0])
    odds_age_plus_1 = np.exp(logit_interpret[1])

    print("Logit Difference: {:.5f}".format(
        logit_interpret[1] - logit_interpret[0]))
    print("Ratio of Odds: {:.5f}".format(odds_age_plus_1 / odds_age))
    print("Ratio of Probability: {:.5f}".format(
        prob_interpret[1] / prob_interpret[0]))
Comparison of Age 20 with Age 21
=================================
Logit Difference: 0.02533
Ratio of Odds: 1.02565
Ratio of Probability: 1.00587

Comparison of Age 21 with Age 22
=================================
Logit Difference: 0.02533
Ratio of Odds: 1.02565
Ratio of Probability: 1.00576

Comparison of Age 22 with Age 23
=================================
Logit Difference: 0.02533
Ratio of Odds: 1.02565
Ratio of Probability: 1.00565

How do we interpret the coefficients of log_ALP, log_ALT, and log_AST? The original value must be log-transformed first, then interpreted just like above. Here's an illustration of what a one unit increase of a log-transformed value looks like:

x_range = (1, 4.25)
x_axis = np.linspace(*x_range, 100)
x_ticks = np.arange(*x_range)
y_ticks = np.exp(x_ticks)

plt.plot(x_axis, np.exp(x_axis))
plt.scatter(x_ticks, y_ticks, marker="x")
for x, y in zip(x_ticks, y_ticks):
    plt.axvline(x, color="k", ls="--", alpha=0.25)
    plt.axhline(y, color="k", ls="--", alpha=0.25)

plt.xticks(x_ticks)
plt.yticks(y_ticks)
plt.xlabel("Log-Transformed Value")
plt.ylabel("Original Value")
plt.title("ILLUSTRATION: ONE UNIT INCREASE OF LOG", fontweight="bold")
plt.show()

Model Assumptions

There are three assumptions of logistic regression model:

Linearity

We assume that the independent variables (predictors) are linearly related to the log odds of the target variable Disease, meaning that the data is assumed to be linearly separable.
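
There is no single definitive test for this, but a quick empirical check is to bin a continuous predictor, compute the observed log-odds of Disease within each bin, and see whether the relationship looks roughly linear. Below is a minimal sketch for Age, assuming the X_train_smote_log and y_train_smote objects from the modeling section:

check_df = pd.DataFrame({"Age": X_train_smote_log["Age"],
                         "Disease": y_train_smote})
check_df["Age_bin"] = pd.qcut(check_df["Age"], q=8, duplicates="drop")

grouped = check_df.groupby("Age_bin")
p_bin = grouped["Disease"].mean().clip(0.01, 0.99)  # avoid logit(0) and logit(1)

plt.plot(grouped["Age"].mean(), logit(p_bin), marker="o")
plt.xlabel("Age (bin mean)")
plt.ylabel("Empirical log-odds of Disease")
plt.title("LINEARITY CHECK: AGE VS LOG-ODDS", fontweight="bold")
plt.show()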

No Multicollinearity

We expect the model to have little to no multicollinearity, a condition where at least two predictors have a strong linear relationship. A common rule of thumb is that multicollinearity exists if the Variance Inflation Factor (VIF) is greater than 10.

vif_list = []
for idx, col in enumerate(final_model.model.exog_names[1:]):
    vif_dict = {"Variable": col,
                "VIF":  variance_inflation_factor(final_model.model.exog, idx+1)}
    vif_list.append(vif_dict)

pd.DataFrame(vif_list)
Variable VIF
0 Age 1.12372
1 Albumin 1.21778
2 DB/TB Percentage 1.07372
3 Globulin 1.12135
4 Gender_Male 1.11552
5 log_ALP 1.29119
6 log_ALT 3.48703
7 log_AST 3.47282

Note: We can conclude that there is little multicollinearity in our model, since all VIFs are below 10.

Independence of Observation

We assume that the observations are independent and are not a repeated measurement of the same patient.

Modeling: K-Nearest Neighbor

K-Nearest Neighbor (KNN) is a non-parametric, lazy learning classification algorithm. No model is learned; the training examples themselves are used to classify new data points. Here's the algorithm:

  1. Specify a positive integer $k$
  2. Compute the similarity/distance from the unlabeled data point to every point in the training examples. The most commonly used measure is the Euclidean distance.
  3. Assign the class by majority voting among the $k$ nearest training examples. If there is a tie, assign randomly; to avoid ties, choose $k$ to be an odd positive number.
  4. Repeat steps 2-3 for each unlabeled data point in the test set. A minimal sketch of steps 2-3 is shown right after this list.
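
To make steps 2-3 concrete, here is a minimal NumPy sketch of the prediction rule for a single unlabeled point (illustration only; the actual modeling uses sklearn's KNeighborsClassifier):

def knn_predict_one(X_train, y_train, x_new, k=5):
    # Step 2: Euclidean distance from x_new to every training example
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: majority vote among the k nearest training examples
    nearest_labels = y_train[np.argsort(distances)[:k]]
    return np.bincount(nearest_labels).argmax()

# Tiny usage example with made-up points
X_demo = np.array([[0, 0], [1, 1], [0, 1], [5, 5], [6, 5]])
y_demo = np.array([0, 0, 0, 1, 1])
knn_predict_one(X_demo, y_demo, np.array([5, 6]), k=3)  # -> 1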

Note: Classical KNN is not suitable for categorical data, since the distance metric used is the Euclidean distance. So, for this case we drop the dummy variable Gender_Male.
X_train_knn = X_train_smote.drop(columns=['Gender_Male'])
X_test_knn = X_test.drop(columns=['Gender_Male'])

Feature Scaling

This is a necessary step before fitting the data to KNN. The ranges of the predictors are not the same, yet they should be treated equally so that each feature's contribution to the distance computation is equally weighted. We will compare two types of scaling to shrink/expand the range:

  1. Min-Max Normalization: scales a variable to the range between 0 and 1
  2. Standardization: transforms data to have a mean of 0 and a standard deviation of 1
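
Concretely, for each feature $x$ the two transformations are:

$x_{minmax} = \dfrac{x - x_{min}}{x_{max} - x_{min}} \qquad\qquad x_{std} = \dfrac{x - \mu}{\sigma}$

where $x_{min}$, $x_{max}$, $\mu$, and $\sigma$ are computed from the training data only.
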
X_train_knn.describe().loc[['min', 'max']]
Age ALP ALT AST Albumin DB/TB Percentage Globulin
min 4.00000 63.00000 10.00000 10.00000 0.90000 4.80000 1.00000
max 90.00000 2110.00000 2000.00000 4929.00000 5.50000 500.00000 6.25000

def scaling(train, test, scaler):
    s = scaler().fit(train)
    train_scaled = pd.DataFrame(s.transform(train), columns=train.columns)
    test_scaled = pd.DataFrame(s.transform(test), columns=test.columns)
    return train_scaled, test_scaled

Important: We only call the .fit() method on the training data, since we treat the testing data as unseen data whose statistics are unknown to the model.

# min-max scaler
X_train_knn_minmax, X_test_knn_minmax = scaling(X_train_knn, X_test_knn, MinMaxScaler)
X_train_knn_minmax.describe().loc[['min', 'max']]
Age ALP ALT AST Albumin DB/TB Percentage Globulin
min 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
max 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000

Using min-max scaler, the min and max of all variables are 0 and 1 respectively.

# standard scaler
X_train_knn_std, X_test_knn_std = scaling(X_train_knn, X_test_knn, StandardScaler)
X_train_knn_std.describe().loc[['mean', 'std']]
Age ALP ALT AST Albumin DB/TB Percentage Globulin
mean 0.00000 -0.00000 0.00000 0.00000 0.00000 0.00000 -0.00000
std 1.00080 1.00080 1.00080 1.00080 1.00080 1.00080 1.00080

Using standard scaler, the mean and standard deviation of all variables are 0 and 1 respectively.

Evaluation

Since there is no model to be learned, we directly evaluate our binary KNN classifier using the same confusion matrix metrics as before:

def evaluateKNN(X_train, y_train, X_test, y_test, range_k):
    eval_list = []
    for k in range(*range_k, 2):
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, y_train)
        y_pred_cl = knn.predict(X_test)
        eval_res = {
            "k": k,
            "Recall": recall_score(y_test, y_pred_cl),
            "Precision": precision_score(y_test, y_pred_cl),
            "F1": f1_score(y_test, y_pred_cl),
            "Accuracy": accuracy_score(y_test, y_pred_cl),
            "MCC": matthews_corrcoef(y_test, y_pred_cl)
        }
        eval_list.append(eval_res)
    eval_df = pd.DataFrame(eval_list)
    return eval_df

range_k = (1, 50)

# min-max scaler
eval_knn_minmax = evaluateKNN(
    X_train_knn_minmax, y_train_smote,
    X_test_knn_minmax, y_test,
    range_k)

# standard scaler
eval_knn_std = evaluateKNN(
    X_train_knn_std, y_train_smote,
    X_test_knn_std, y_test,
    range_k)

Next, we compare the performance of the two scaling types based on Recall, F1-score, and MCC.

metric_list = ["Recall", "F1", "MCC"]
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

eval_minmax = pd.melt(eval_knn_minmax, id_vars="k",
                      var_name="Metric", value_name="Value")
eval_minmax["Scaling Type"] = "Min-Max"
eval_standard = pd.melt(eval_knn_std, id_vars="k",
                        var_name="Metric", value_name="Value")
eval_standard["Scaling Type"] = "Standard"
eval_df = pd.concat([eval_minmax, eval_standard])

for ax, metric in zip(axes.flat, metric_list):
    df = eval_df[eval_df["Metric"] == metric]
    line = sns.lineplot(data=df, x="k", y="Value",
                        hue="Scaling Type", palette="gray", ax=ax)
    line.legend_.remove()
    ax.set_title(metric.upper(), size=15, fontweight="bold")

plt.tight_layout()
plt.legend(handles=axes[0].get_legend_handles_labels()[0], loc="center",
           bbox_to_anchor=(-0.7, -0.2),
           shadow=True, ncol=3)
fig.suptitle("K-NEAREST NEIGHBOR MODEL EVALUATION",
             size=28, y=1.05, fontweight="bold")
plt.show()

As explained in the logistic regression evaluation, we prioritize Recall over Precision while taking the maximum F1 score. So, we choose the model with standardized data because its Recall and F1 are relatively higher than with min-max scaling. Next, we choose the optimal number of neighbors $k$.

fig, ax = plt.subplots(figsize=(8, 5))
eval_df = eval_knn_std.drop(["Accuracy"], axis=1)

# LINE PLOT
eval_df.plot(x="k", color="rgbk", ax=ax)

# F1 SCORE
# row_max_F1 = eval_df[eval_df["F1"] == max(eval_df["F1"])]
# k_max_F1 = row_max_F1["k"].values[0]
# ax.axvline(x=k_max_F1, ls='--', color="b")
# ax.text(x=k_max_F1 + 5, y=0.2,
#         s="MAX F1",
#         fontsize=12, color="b")
# ax.set_xticks(list(ax.get_xticks())[1:-1] + [k_max_F1])

# SQRT NROWS
sqrt_n = int(X_train_smote.shape[0] ** 0.5)
ax.axvline(x=sqrt_n, ls='--', color='b')
ax.text(x=sqrt_n + 1, y=0.2,
        s="$\sqrt{nrows}$",
        fontsize=12, color="b")

# CHOOSE K MANUALLY
choose_k = 5
ax.axvline(x=choose_k, ls='--', color='y')
ax.set_xticks(list(ax.get_xticks())[1:-1] + [sqrt_n, choose_k])

plt.legend(loc="center", bbox_to_anchor=(0.5, -0.2),
           shadow=True, ncol=4)
fig.suptitle("K-NEAREST NEIGHBOR METRICS (STANDARDIZATION)",
             size=16, fontweight="bold")
plt.show()

The most common choice of $k$ is $\sqrt{nrows}$, which is 24 in this case, but it doesn't result in good metrics here. From the plot, we instead choose $k = 5$ manually, sacrificing some Precision for a better Recall score.

eval_knn_list = [eval_knn_minmax, eval_knn_std]
eval_knn_df = pd.concat([eval_df[eval_df["k"] == choose_k]
                        for eval_df in eval_knn_list])
eval_knn_df.index = ["MIN-MAX NORMALIZATION", "STANDARDIZATION"]
eval_knn_df
k Recall Precision F1 Accuracy MCC
MIN-MAX NORMALIZATION 5 0.50962 0.76812 0.61272 0.54110 0.11666
STANDARDIZATION 5 0.55769 0.84058 0.67052 0.60959 0.26820

In this case, using the same number of neighbors $k$, standardization overall outperforms min-max normalization.

Conclusion

Finally, let's compare the performance of the logistic regression and k-nearest neighbor models side by side.

pd.DataFrame(
    [eval_logreg_df.loc["LOG-TRANSFORMED"][1:],
     eval_knn_df.loc["STANDARDIZATION"][1:]]
)
Recall Precision F1 Accuracy MCC
LOG-TRANSFORMED 0.98077 0.75556 0.85356 0.76027 0.33453
STANDARDIZATION 0.55769 0.84058 0.67052 0.60959 0.26820

The logistic regression with log-transformed data is better in recall but worse in precision.

To further evaluate our models, let's introduce another performance measurement: the Receiver Operating Characteristic (ROC) curve. It plots the True Positive Rate (Recall) against the False Positive Rate across all thresholds. The area under the ROC curve, called the Area Under the Curve (AUC), measures the degree of class separability: the higher the AUC, the better the model is at distinguishing between patients with and without liver disease.

  • AUC near 1 indicates the model has a good measure of separability.
  • AUC near 0 means it has the worst measure of separability (its predictions are reversed).
  • AUC of 0.5 means the model has no class separation capacity whatsoever.

# BASELINE
base_probs = np.zeros(len(y_test))
base_fpr, base_tpr, _ = roc_curve(y_test, base_probs)
base_auc = roc_auc_score(y_test, base_probs)

# LOGISTIC REGRESSION
logreg_probs = model_log.predict(sm.add_constant(X_test_log))
logreg_fpr, logreg_tpr, _ = roc_curve(y_test, logreg_probs)
logreg_auc = roc_auc_score(y_test, logreg_probs)

# KNN
knn_opt_model = KNeighborsClassifier(n_neighbors=choose_k).fit(X_train_knn_std, y_train_smote)
knn_probs = knn_opt_model.predict_proba(X_test_knn_std)[:, 1]  # P(Disease = Yes)
knn_fpr, knn_tpr, _ = roc_curve(y_test, knn_probs)
knn_auc = roc_auc_score(y_test, knn_probs)

# PLOT ROC
plt.plot(base_fpr, base_tpr, linestyle='--',
         label="Baseline (AUC = {:.3f})".format(base_auc))
plt.plot(logreg_fpr, logreg_tpr, linestyle='-',
         label="Logistic Regression (AUC = {:.3f})".format(logreg_auc), color="r")
plt.plot(knn_fpr, knn_tpr, linestyle='-',
         label="KNN (AUC = {:.3f})".format(knn_auc), color="g")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (Recall)")
plt.title("RECEIVER OPERATING CHARACTERISTIC (ROC) CURVE", fontweight="bold")
plt.legend()
plt.show()

In conclusion, we choose the logistic regression model, first because its AUC is greater than KNN's, indicating a better capability of separating patients with and without liver disease regardless of the threshold. Secondly, the model's interpretability gives us insight into which predictors matter when classifying whether a patient has liver disease, whereas KNN offers no such interpretation. From the model summary, we can conclude:

  • Older patients have a significantly higher probability of having liver disease.
  • Higher Albumin in the blood significantly decreases the probability of a patient having liver disease.
  • Being male (Gender_Male = 1) significantly increases the probability of having liver disease compared to being female (Gender_Male = 0).
  • Higher DB/TB Percentage, Globulin, ALP, and ALT also significantly increase the probability of liver disease, while log_AST has a positive coefficient but is not significant at the 5% level (p = 0.150).
final_model.summary()
Logit Regression Results
Dep. Variable: y No. Observations: 624
Model: Logit Df Residuals: 615
Method: MLE Df Model: 8
Date: Tue, 26 Oct 2021 Pseudo R-squ.: inf
Time: 21:04:03 Log-Likelihood: -56542.
converged: True LL-Null: 0.0000
Covariance Type: nonrobust LLR p-value: 1.000
coef std err z P>|z| [0.025 0.975]
const -8.8601 1.409 -6.290 0.000 -11.621 -6.099
Age 0.0253 0.006 4.196 0.000 0.013 0.037
Albumin -0.2775 0.135 -2.055 0.040 -0.542 -0.013
DB/TB Percentage 0.0217 0.008 2.712 0.007 0.006 0.037
Globulin 0.3583 0.151 2.371 0.018 0.062 0.655
Gender_Male 0.7102 0.209 3.396 0.001 0.300 1.120
log_ALP 0.5216 0.236 2.206 0.027 0.058 0.985
log_ALT 0.6592 0.230 2.863 0.004 0.208 1.111
log_AST 0.2873 0.200 1.439 0.150 -0.104 0.679

By using log-transformed data and threshold = 0.2, here's our final logistic regression model performance:

pd.DataFrame(eval_logreg_df.loc["LOG-TRANSFORMED"][1:])
LOG-TRANSFORMED
Recall 0.98077
Precision 0.75556
F1 0.85356
Accuracy 0.76027
MCC 0.33453