Liver Disease Classification using Logistic Regression and k-Nearest Neighbors
A machine learning workflow on a binary classification problem of Indian liver patient records using the Logistic Regression and K-Nearest Neighbor (KNN) algorithms. Linear regression is used to impute missing values, imbalanced classes are addressed with SMOTE, and logarithm and square root transformations are applied to predictors with outliers. Both models are evaluated using the confusion matrix, the Matthews Correlation Coefficient (MCC), and the ROC curve. Logistic regression is further improved through threshold tuning, while KNN is tuned by comparing scaling methods (min-max normalization or standardization) and manually selecting the number of neighbors (k).
- Introduction
- Data Wrangling
- Exploratory Data Analysis
- Modeling: Logistic Regression
- Modeling: K-Nearest Neighbor
- Conclusion
Introduction
The number of patients with liver disease has been continuously increasing because of excessive alcohol consumption, inhalation of harmful gases, and intake of contaminated food, pickles, and drugs. This use case is provided on Kaggle, using a dataset from the UCI ML Repository. The dataset contains 416 liver patient records and 167 non-liver patient records collected from the northeast of Andhra Pradesh, India. The collected data is used to evaluate prediction algorithms and reduce the burden on doctors.
import pandas as pd
import numpy as np
# visualization
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.cbook import boxplot_stats
# preprocessing (pre-modeling)
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
# modeling
import statsmodels.api as sm
from sklearn.neighbors import KNeighborsClassifier
# evaluation (post-modeling)
from sklearn.metrics import *
from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy import stats
from scipy.special import logit
import warnings
warnings.filterwarnings("ignore")
sns.set(style="ticks", color_codes=True)
pd.options.display.float_format = '{:.5f}'.format
pd.options.display.max_colwidth = None
Data Wrangling
liver = pd.read_csv("data_input/indian_liver_patient.csv")
liver.info()
The dataset contains 583 observations of patient records with 11 columns as follows (Reference):
- Age: Age of the patient. Any patient whose age exceeded 89 is listed as being age 90.
- Gender: Gender of the patient, either Male or Female.
- Total_Bilirubin: Bilirubin is an orange-yellow pigment, a waste product primarily produced by the normal breakdown of heme. This test measures the amount of bilirubin in the blood to evaluate a person's liver function or to help diagnose anemias caused by red blood cell destruction (hemolytic anemia). Measured in mg/dL.
- Direct_Bilirubin: The water-soluble form of bilirubin. Measured in mg/dL.
- Alkaline_Phosphotase (ALP): Enzyme found in several tissues throughout the body. The highest concentrations of ALP are present in the cells that comprise bone and the liver. Elevated levels of ALP in the blood are most commonly caused by liver disease or bone disorders. Measured in U/L.
- Alamine_Aminotransferase (ALT): Enzyme found mostly in the cells of the liver and kidney. Normally, ALT levels in the blood are low, but when the liver is damaged, ALT is released into the blood and the level increases. This test measures the level of ALT in the blood and is useful for early detection of liver disease. Measured in U/L.
- Aspartate_Aminotransferase (AST): Enzyme found in cells throughout the body but mostly in the heart and liver and, to a lesser extent, in the kidneys and muscles. In healthy individuals, levels of AST in the blood are low. When liver or muscle cells are injured, they release AST into the blood. This makes AST a useful test for detecting or monitoring liver damage. Measured in U/L.
- Total_Protiens: Measures the amount of protein in the blood. Measured in g/dL.
- Albumin: Made by the liver, albumin makes up about 60% of the total protein. It keeps fluid from leaking out of blood vessels, nourishes tissues, and transports hormones, vitamins, drugs, and substances like calcium throughout the body. Measured in g/dL.
- Albumin_and_Globulin_Ratio: Compares the amount of Albumin with Globulin. Globulin makes up the remaining 40% of proteins in the blood and is a varied group of proteins, some produced by the liver and some by the immune system. They help fight infection and transport nutrients. Measured in g/dL.
- Dataset: A class label used to divide the patients into two groups. Value 1 indicates a patient with liver disease and 2 otherwise.
column_names = {
"Age": "Age",
"Gender": "Gender",
"TB": "Total Bilirubin",
"DB": "Direct Bilirubin",
"ALP": "Alkaline Phosphotase",
"ALT": "Alamine Aminotransferase",
"AST": "Aspartate Aminotransferase",
"TP": "Total Proteins",
"Albumin": "Albumin",
"A/G Ratio": "Albumin and Globulin Ratio",
"Disease": "Dataset"
}
liver.columns = list(column_names.keys())  # rename to the short column names; the dict values keep the full descriptions
liver.head()
liver.loc[liver["Disease"] == 1, "Disease"] = "Yes"
liver.loc[liver["Disease"] == 2, "Disease"] = "No"
liver["Disease"].unique()
liver[["Gender", "Disease"]] = liver[["Gender", "Disease"]].astype("category")
liver.dtypes
liver.isna().sum()
Four observations of A/G Ratio are missing; these will be handled later, after the feature engineering step.
Feature Engineering
Feature engineering is the process of using domain knowledge to extract features from raw data. These features can be used to improve the performance of machine learning models. In this section, we'll engineer features to reduce the correlation between the predictors and thus the risk of multicollinearity in the model. Let's plot the Pearson correlation heatmap to find out which predictors are strongly correlated.
def plotCorrelationHeatmap(data, figsize=(12, 6)):
    plt.figure(figsize=figsize)
    # compute pairwise Pearson correlation on the numeric columns only
    corr_val = data.select_dtypes('number').corr(method="pearson")
    # mask the upper triangle (excluding the diagonal) to avoid duplicate cells
    mask = np.zeros_like(corr_val, dtype=bool)
    mask[np.triu_indices_from(mask, k=1)] = True
    corr_heatmap = sns.heatmap(corr_val, mask=mask,
                               annot=True, fmt='.3f', linewidths=3, cmap="Reds")
    corr_heatmap.set_title("PEARSON CORRELATION HEATMAP",
                           fontsize=15, fontweight="bold")
    return corr_heatmap
plotCorrelationHeatmap(liver)
Considering correlations above 0.6 as strong, we want to minimize the risk of multicollinearity in our model by handling the following pairs of features:
- TB and DB (0.875): merge into DB/TB Percentage (Reference).
- AST and ALT (0.792): we could merge them into AST/ALT Ratio (Reference), but that turned out to be an insignificant predictor later on in modeling. Thus, we keep them as separate predictors.
- Albumin and A/G Ratio (0.69): calculate the Globulin level from Albumin and the A/G Ratio.
- Albumin and TP (0.784): Albumin is a group of proteins, so it naturally has a strong correlation with TP (Total Proteins). These two variables will be discussed later.
liver["DB/TB Percentage"] = liver["DB"] / liver["TB"] * 100
# liver["AST/ALT Ratio"] = liver["AST"]/liver["ALT"]
liver["Globulin"] = liver["Albumin"] / liver["A/G Ratio"]
liver = liver.drop(["DB", "TB", "A/G Ratio"], axis=1)
liver.head()
Here is the new correlation heatmap after feature engineering:
plotCorrelationHeatmap(liver, (14, 6))
Interestingly, TP is now strongly correlated with both Albumin and Globulin. Let's manually add the values of Albumin and Globulin as Albumin+Globulin, then further investigate the correlation with TP:
AG_df = liver[["Albumin", "Globulin", "TP"]]
AG_df["Albumin+Globulin"] = AG_df["Albumin"] + AG_df["Globulin"]
plotCorrelationHeatmap(AG_df, (6, 5))
There is a very strong (almost perfect) correlation between TP and Albumin+Globulin, indicating that TP can be explained by Albumin and Globulin together. We will use this information to handle the missing values.
Remember the missing values in A/G Ratio? They have now moved to Globulin, since we calculated it using A/G Ratio.
missing = liver["Globulin"].isna()
liver[missing]
Let's impute the missing Globulin values by fitting a linear regression line $TP = Albumin + Globulin + constant$ using the non-missing data.
Note that we split the data first and fit the regression on the training portion only, using the same random_state as in the modeling section, to ensure the split in this section is the same as the split used for modeling.
liver_train, liver_test = train_test_split(liver, test_size=0.25, random_state=888)
X_impute = liver_train[~missing][["Albumin", "Globulin"]]  # rows where Globulin is known
y_impute = liver_train[~missing]["TP"].values
lin_reg = sm.OLS(y_impute, sm.add_constant(X_impute)).fit()
beta = lin_reg.params.values
print("Adjusted R-squared: {:.3f}%".format(100*lin_reg.rsquared_adj))
print("Estimate:", beta)
The Adjusted R-squared indicates that the model is a good fit. Now, we rearrange the formula to make Globulin the subject instead of TP, as follows:
$TP = \beta_0 + \beta_1*Albumin + \beta_2*Globulin$
$Globulin = \dfrac{TP - \beta_0 - \beta_1*Albumin}{\beta_2}$
where:
- $\beta_0$ is the intercept of the regression line: $TP$ equals $\beta_0$ when $Albumin = 0$ and $Globulin = 0$
- $\beta_1$ is the coefficient of $Albumin$: a one-unit increase in $Albumin$ increases $TP$ by $\beta_1$
- $\beta_2$ is the coefficient of $Globulin$: a one-unit increase in $Globulin$ increases $TP$ by $\beta_2$
We will use this formula to impute the missing values of Globulin based on TP and Albumin.
liver["Globulin"] = liver.apply(
lambda row: (row["TP"] - beta[0] - beta[1] * row["Albumin"]) / beta[2]
if np.isnan(row["Globulin"])
else row.Globulin,
axis=1
)
liver[missing]
We drop TP since it is strongly correlated with both Albumin and Globulin.
liver = liver.drop(["TP"], axis=1)
liver.head()
Exploratory Data Analysis
pair_plot = sns.pairplot(liver, hue="Disease",
diag_kind="kde", corner=True, markers='+')
pair_plot.fig.suptitle("PAIR PLOT OF NUMERICAL VARIABLES",
size=25, fontweight="bold")
pair_plot;
From the pair plot, we can see a fairly strong linear relationship between ALT and AST. The variables ALP, ALT, AST, and DB/TB Percentage have very positively skewed distributions. We'll analyze them further using box plots.
fig, axes = plt.subplots(2, 4, figsize=(15, 8))
for ax, col in zip(axes.flat, liver.select_dtypes('number').columns):
    sns.boxplot(x="Disease", y=col, data=liver, ax=ax)
    # Outlier Count
    outlier_count = 0
    for disease in liver["Disease"].cat.categories:
        liver_disease = liver.loc[liver["Disease"] == disease, col]
        outlier_list = boxplot_stats(liver_disease).pop(0)['fliers']
        outlier_count += len(outlier_list)
    ax.set_title("Outlier Count: {} ({:.2f}%)".format(
        outlier_count, 100*outlier_count/liver.shape[0]))
axes[-1, -1].axis("off")
plt.tight_layout()
fig.suptitle("BOX PLOT OF NUMERICAL VARIABLES",
             size=28, y=1.05, fontweight="bold")
plt.show()
From the box plots, patients with liver disease tend to have a higher Age, lower Albumin, and slightly higher Globulin compared to those who don't. This will be further explained in the modeling step. More than 10% of the observations in ALP, ALT, and AST are considered outliers. We'll apply transformations during modeling to handle these outliers.
Modeling: Logistic Regression
Logistic regression is a classification algorithm that fits the regression curve $\log{\frac{p}{1-p}} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_n X_n$ where:
- $p$ is the probability of an observation being in the positive class
- $\frac{p}{1-p}$ is called the odds
- $\log{\frac{p}{1-p}}$ is called the log-odds or logit
- $X_1, X_2, ..., X_n$ are the predictors
- $\beta_0$ is the intercept (a constant)
- $\beta_1, \beta_2, ..., \beta_n$ are the coefficients of $X_1, X_2, ..., X_n$ respectively
- $n$ is the number of predictors
The formula above can be re-written as a sigmoid curve:
$p = \dfrac{1}{1 + e^{-z}}$ where $z = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_n X_n$
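As a quick numeric sanity check of this relationship (the z values below are chosen arbitrarily for illustration), a logit of 0 maps to p = 0.5 and larger logits push p toward 1:
# Quick numeric check of the sigmoid relationship (arbitrary z values).
for z in [-2, 0, 2]:
    p = 1 / (1 + np.exp(-z))
    print("z = {:>2}  ->  p = {:.3f}".format(z, p))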
z_plot = np.linspace(-10, 10)
plt.plot(z_plot, 1/(1 + np.exp(-z_plot)))
plt.axvline(0, color="k", ls="--", alpha=0.25)
plt.axhline(0.5, color="k", ls="--", alpha=0.25)
plt.xlabel("z")
plt.ylabel("p")
plt.title("ILLUSTRATION: SIGMOID CURVE", fontweight="bold")
plt.show()
liver_dummy = pd.get_dummies(
liver,
columns=liver.select_dtypes('category').columns,
drop_first=True)
X = liver_dummy.drop(["Disease_Yes"], axis=1)
y = liver_dummy["Disease_Yes"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=888)
print("X Train:", X_train.shape)
print("X Test:", X_test.shape)
print("y Train:", y_train.shape)
print("y Test:", y_test.shape)
def plotProportion(data):
    data['Disease'] = data['Disease'].replace({0: 'No', 1: 'Yes'})
    ax = data.groupby(['Disease', 'Set']).size().unstack().plot(kind='bar', stacked=True)
    # annotate each bar segment with its count
    for rect in ax.patches:
        height = rect.get_height()
        width = rect.get_width()
        padding = 0.25
        ax.text(rect.get_x() + width - padding,
                rect.get_y() + height / 2,
                int(height),
                ha='center', va='center',
                color="white")
data = pd.concat([
pd.DataFrame({'Disease': y_train, 'Set': 'Training'}),
pd.DataFrame({'Disease': y_test, 'Set': 'Testing'})
])
plotProportion(data)
plt.title("PROPORTION OF TARGET VARIABLE", fontsize=14, fontweight="bold")
plt.show()
data_prop = pd.crosstab(
index=data.Set,
columns=data.Disease,
margins=True,
normalize="index") * 100
data_prop.round(2).astype(str) + ' %'
We consider our data imbalanced, with a 71:29 ratio. Therefore, we can do either upsampling or downsampling to balance the positive and negative classes of Disease.
- Upsampling is a method that randomly resamples (with replacement) observations from the minority class to make the dataset balanced.
- Downsampling is a method that randomly subsamples observations from the majority class to make the dataset balanced.
If we choose to downsample the data, we end up with only a small number of observations and lose some information from the data. Therefore, in this case we prefer to upsample using the Synthetic Minority Over-sampling Technique (SMOTE).
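Before running it, here is a minimal sketch of the core SMOTE idea on toy, made-up data: a synthetic observation is created by interpolating between a minority-class sample and one of its nearest minority-class neighbors. This is only an illustration; the actual resampling below uses the imblearn implementation.
# Minimal sketch of the SMOTE idea (toy data, illustration only).
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(888)
minority_toy = rng.normal(size=(10, 2))            # hypothetical minority-class samples

nn = NearestNeighbors(n_neighbors=3).fit(minority_toy)
_, neighbor_idx = nn.kneighbors(minority_toy)      # nearest neighbors within the minority class

sample = minority_toy[0]
neighbor = minority_toy[neighbor_idx[0, 1]]        # one of its nearest minority neighbors
gap = rng.uniform(0, 1)                            # random interpolation factor in [0, 1]
synthetic = sample + gap * (neighbor - sample)     # new synthetic minority observation
print(synthetic)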
from imblearn.over_sampling import SMOTE
sampler = SMOTE(random_state=888)
X_train_smote, y_train_smote = sampler.fit_resample(X_train, y_train)
data_smote = pd.concat([
pd.DataFrame({'Disease': y_train_smote, 'Set': 'Training'}),
pd.DataFrame({'Disease': y_test, 'Set': 'Testing'})
])
plotProportion(data_smote)
plt.title("PROPORTION OF TARGET VARIABLE (UPSAMPLED)",
fontsize=14, fontweight="bold")
plt.show()
data_smote_prop = pd.crosstab(
index=data_smote.Set,
columns=data_smote.Disease,
margins=True,
normalize="index") * 100
data_smote_prop.round(2).astype(str) + ' %'
def fitLogisticRegression(X, y):
    # fit a logistic regression (statsmodels Logit) with an intercept term
    model = sm.Logit(y, sm.add_constant(X))
    result = model.fit()
    return result
model_all = fitLogisticRegression(X_train_smote, y_train_smote)
model_all.summary()
Can we further improve our logistic regression model? One way is to handle the outliers present in our predictors by transforming them.
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
x_axis = np.linspace(1, 5000, 1000)
y_axis = [np.log(x_axis), np.sqrt(x_axis)]
title_list = [r"LOG: $y = \log(x)$", r"SQUARE ROOT: $y = \sqrt{x}$"]
for y, title, ax in zip(y_axis, title_list, axes):
    ax.plot(x_axis, y)
    ax.set_xlabel("Original Value")
    ax.set_ylabel("Transformed Value")
    ax.set_title(title)
plt.tight_layout()
fig.suptitle("TRANSFORMATION FUNCTION", size=28, y=1.05, fontweight="bold")
plt.show()
transform_col = ["ALP", "ALT", "AST"]
X_train_untransformed = X_train_smote.drop(transform_col, axis=1)
X_test_untransformed = X_test.drop(transform_col, axis=1)
def transform(data, transform_col, func):
    # apply func (e.g. np.log or np.sqrt) to the selected columns and rename them
    data = data[transform_col].transform(func)
    data.columns = [f"{func.__name__}_{col}" for col in transform_col]
    return data
X_train_smote_log = transform(X_train_smote, transform_col, np.log)
X_test_log = transform(X_test, transform_col, np.log)
X_train_smote_log.head()
def boxPlotTransformedData(data, figsize=(10, 4)):
    fig, axes = plt.subplots(1, data.shape[1]-1, figsize=figsize)
    for ax, col in zip(axes, data.columns[:-1]):
        sns.boxplot(x="Disease", y=col, data=data, ax=ax)
        # Outlier Count
        outlier_count = 0
        for flag in data["Disease"].cat.categories:
            flag_disease = data.loc[data["Disease"] == flag, col]
            outlier_list = boxplot_stats(flag_disease).pop(0)['fliers']
            outlier_count += len(outlier_list)
        ax.set_title("Outlier Count: {} ({:.2f}%)".format(
            outlier_count, 100*outlier_count/liver.shape[0]))
    plt.tight_layout()
boxPlotTransformedData(
pd.concat([
X_train_smote_log,
pd.Series(y_train_smote, name="Disease").replace({0: 'No', 1: 'Yes'}).astype('category')
],
axis=1)
)
plt.suptitle("BOX PLOT OF LOG-TRANSFORMED VARIABLES",
size=25, y=1.05, fontweight="bold")
plt.show()
The outlier count has been reduced. Now, let's fit the log-transformed data into the logistic regression model.
X_train_smote_log = pd.concat([X_train_untransformed, X_train_smote_log], axis=1)
X_test_log = pd.concat([X_test_untransformed, X_test_log], axis=1)
model_log = fitLogisticRegression(X_train_smote_log, y_train_smote)
model_log.summary()
X_train_smote_sqrt = transform(X_train_smote, transform_col, np.sqrt)
X_test_sqrt = transform(X_test, transform_col, np.sqrt)
X_train_smote_sqrt.head()
boxPlotTransformedData(
pd.concat([
X_train_smote_sqrt,
pd.Series(y_train_smote, name="Disease").replace({0: 'No', 1: 'Yes'}).astype('category')
],
axis=1)
)
plt.suptitle("BOX PLOT OF SQRT-TRANSFORMED VARIABLES",
size=25, y=1.05, fontweight="bold")
plt.show()
The outlier count has been reduced. Now, let's fit the square root transformed data into the logistic regression model.
X_train_smote_sqrt = pd.concat([X_train_untransformed, X_train_smote_sqrt], axis=1)
X_test_sqrt = pd.concat([X_test_untransformed, X_test_sqrt], axis=1)
model_sqrt = fitLogisticRegression(X_train_smote_sqrt, y_train_smote)
model_sqrt.summary()
pd.DataFrame(
[["True Positive (TP)", "False Negative (FN)"], ["False Positive (FP)", "True Negative (TN)"]],
index = [["Actual", "Actual"], ["Positive", "Negative"]],
columns = [["Predicted", "Predicted"], ["Positive", "Negative"]],
)
There are several metrics to consider in this case:
- Recall is the proportion of actual positives that are classified correctly.
- Precision is the proportion of TP out of all observations predicted as positive.
- F1-score is the harmonic mean of Precision and Recall.
But Precision, Recall, and F1-score do not account for the TN. We'll introduce other metrics that measure the overall performance of the model using all values in the confusion matrix (a worked sketch of all these formulas follows the list):
- Accuracy is the proportion of observations that are classified correctly. This metric is not robust when the classes are imbalanced.
- MCC, the Matthews Correlation Coefficient, ranges between -1 and 1:
  - MCC = 1 indicates a perfect positive correlation; the classifier is perfect $(FP = FN = 0)$
  - MCC = -1 indicates a perfect negative correlation; the classifier always misclassifies $(TP = TN = 0)$
  - MCC = 0 indicates no correlation; the classifier classifies observations randomly
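As a small sketch of how these metrics are computed from the confusion-matrix cells, here are the formulas applied to hypothetical counts (the numbers are made up purely for illustration):
# Hypothetical confusion-matrix counts, only to illustrate the formulas above.
import numpy as np

TP, FN, FP, TN = 80, 20, 30, 70

recall = TP / (TP + FN)                       # actual positives classified correctly
precision = TP / (TP + FP)                    # predicted positives that are truly positive
f1 = 2 * precision * recall / (precision + recall)
accuracy = (TP + TN) / (TP + TN + FP + FN)
mcc = (TP * TN - FP * FN) / np.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

print("Recall: {:.3f} | Precision: {:.3f} | F1: {:.3f} | Accuracy: {:.3f} | MCC: {:.3f}".format(
    recall, precision, f1, accuracy, mcc))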
def evaluateLogReg(result, X_true, y_true):
    eval_list = []
    y_pred_prob = result.predict(sm.add_constant(X_true))
    # sweep the classification threshold from 0 to 0.99
    for threshold in np.linspace(0, 0.99, 100):
        y_pred_cl = (y_pred_prob > threshold).astype(int)
        eval_res = {
            "Threshold": threshold,
            "Recall": recall_score(y_true, y_pred_cl),
            "Precision": precision_score(y_true, y_pred_cl),
            "F1": f1_score(y_true, y_pred_cl),
            "Accuracy": accuracy_score(y_true, y_pred_cl),
            "MCC": matthews_corrcoef(y_true, y_pred_cl)
        }
        eval_list.append(eval_res)
    eval_df = pd.DataFrame(eval_list)
    return eval_df
eval_logreg_all = evaluateLogReg(model_all, X_test, y_test)
eval_logreg_log = evaluateLogReg(model_log, X_test_log, y_test)
eval_logreg_sqrt = evaluateLogReg(model_sqrt, X_test_sqrt, y_test)
eval_logreg_list = [eval_logreg_all, eval_logreg_log, eval_logreg_sqrt]
title_list = ["ALL PREDICTORS", "LOG-TRANSFORMED", "SQRT-TRANSFORMED"]
fig, axes = plt.subplots(3, 1, figsize=(15, 10))
thresh_list = []
for ax, eval_df, title in zip(axes.flat, eval_logreg_list, title_list):
    # LINE PLOT
    eval_df = eval_df.drop(["Accuracy"], axis=1)
    lineplot = eval_df.plot(x="Threshold", color="rgbk", legend=False, ax=ax)
    # IDENTIFY CENTER
    diff = abs(eval_df["Recall"] - eval_df["Precision"])
    thresh_eq = eval_df[diff == min(diff)]["Threshold"].values[0]
    ax.axvline(x=thresh_eq, ls='--', color="y")
    ax.text(x=thresh_eq + 0.01, y=0.05,
            s="CENTER",
            fontsize=12, color="y")
    # F1 MEASURE
    row_max_F1 = eval_df[eval_df["F1"] == max(eval_df["F1"])]
    thresh_max_F1 = row_max_F1["Threshold"].values[0]
    ax.axvline(x=thresh_max_F1, ls='--', color="b")
    ax.text(x=thresh_max_F1 - 0.01, y=0.7,
            s="MAX F1",
            horizontalalignment='right',
            fontsize=12, color="b")
    # LOCATE MCC
    mcc = row_max_F1["MCC"].values[0]
    ax.plot(thresh_max_F1, mcc, marker='x', markersize=10, color="k")
    ax.text(x=thresh_max_F1 - 0.025, y=mcc,
            s="MCC = {:.3f}".format(mcc),
            horizontalalignment='right',
            fontsize=12, fontweight="bold", color="k")
    ax.set_xticks([0, 1] + [thresh_eq, thresh_max_F1])
    ax.set_title(title, fontweight="bold")
    handles, labels = ax.get_legend_handles_labels()
    thresh_list.append(thresh_max_F1)
plt.tight_layout()
plt.legend(handles=handles, loc="center",
bbox_to_anchor=(0.5, -0.5),
shadow=True, ncol=4)
fig.suptitle("LOGISTIC REGRESSION MODEL EVALUATION",
size=28, y=1.05, fontweight="bold")
plt.show()
The output of our logistic regression is $p$, the probability that a particular patient has liver disease. For each model, we iterate the threshold from 0 to 1: if $p > threshold$, the patient is classified as having liver disease. Our goal is to find the optimum threshold by comparing metrics. Here's the thought process:
- First, identify the center threshold, where Recall is equal to Precision.
- In this case, we want to minimize False Negatives, where a patient with liver disease is predicted to be healthy. This can be a threat to the patient's life because they will not get treatment. Therefore, we prioritize Recall over Precision, which means a threshold smaller than the center threshold.
- In reality, we also care about minimizing False Positives, where a healthy patient is predicted to have liver disease. This can trigger panic, but it is not as severe as a False Negative. Therefore we choose the threshold with the maximum F1 score, which takes Precision into account.
- Lastly, out of the three models, we pick the best overall model by the highest MCC value.
eval_logreg_df = pd.concat([eval_df[eval_df["Threshold"] == thresh_list[idx]]
for idx, eval_df in enumerate(eval_logreg_list)])
eval_logreg_df.index = title_list
eval_logreg_df
In conclusion, we choose the logistic regression model with log-transformed data at threshold = 0.2; the short sketch below shows how this cutoff is applied to the test set. Next, we interpret and check the assumptions of this model.
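As a short sketch (not part of the original evaluation code), this is how the chosen cutoff would be applied to the log-transformed model's test-set probabilities to obtain hard class labels:
# Apply the chosen threshold to the log-transformed model's test-set probabilities.
chosen_threshold = 0.2
y_pred_prob = model_log.predict(sm.add_constant(X_test_log))
y_pred_class = (y_pred_prob > chosen_threshold).astype(int)

print(confusion_matrix(y_test, y_pred_class))
print("Recall: {:.3f}".format(recall_score(y_test, y_pred_class)))
print("MCC: {:.3f}".format(matthews_corrcoef(y_test, y_pred_class)))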
final_model = model_log
final_model.summary()
Recall that the formula of logistic regression is $\log{\frac{p}{1-p}} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_n X_n$, where the fitted coefficients are exactly the $\beta$s. This means each coefficient represents the change in the logit value for a one-unit increase in the predictor. For example, a one-unit increase in Age increases the logit value by 0.0196, whereas a one-unit increase in Albumin decreases the logit value by 0.2576.
We can exponentiate the coefficients to get odds ratios. An odds ratio represents the percentage change in the odds for a one-unit increase in the predictor. Let's take a look at the table below.
model_interpret = pd.DataFrame(final_model.params, columns=["Logit Difference"])
model_interpret["Ratio of Odds"] = model_interpret["Logit Difference"].transform(np.exp)
model_interpret
The odds ratio of Age is 1.02565, meaning that for a one-unit increase in Age, we expect to see a 2.565% increase in the odds of a patient having liver disease. On the other hand, the odds ratio of Albumin is 0.75770, meaning that for a one-unit increase in Albumin, we expect to see a 24.23% decrease in the odds of a patient having liver disease.
for age in range(20, 23):
    print(f"\nComparison of Age {age} with Age {age+1}")
    print("=================================")
    # all other predictors held constant; only Age changes by one unit
    interpret_df = pd.DataFrame([[1, age, 5, 50, 5, 1, 5, 5, 5],
                                 [1, age+1, 5, 50, 5, 1, 5, 5, 5]],
                                columns=["const"] + list(X_train_smote_log.columns))
    prob_interpret = final_model.predict(interpret_df)
    logit_interpret = prob_interpret.transform(logit)
    odds_lower = np.exp(logit_interpret[0])
    odds_upper = np.exp(logit_interpret[1])
    print("Logit Difference: {:.5f}".format(
        logit_interpret[1] - logit_interpret[0]))
    print("Ratio of Odds: {:.5f}".format(odds_upper/odds_lower))
    print("Ratio of Probability: {:.5f}".format(
        prob_interpret[1]/prob_interpret[0]))
How do we interpret the coefficients of log_ALP, log_ALT, and log_AST? The original value must be log-transformed first and then interpreted just as above: a one-unit increase in a log-transformed predictor corresponds to multiplying the original value by $e \approx 2.718$. Here's an illustration of what a one-unit increase of a log-transformed value looks like:
x_range = (1, 4.25)
x_axis = np.linspace(*x_range, 100)
x_ticks = np.arange(*x_range)
y_ticks = np.exp(x_ticks)
plt.plot(x_axis, np.exp(x_axis))
plt.scatter(x_ticks, y_ticks, marker="x")
for x, y in zip(x_ticks, y_ticks):
    plt.axvline(x, color="k", ls="--", alpha=0.25)
    plt.axhline(y, color="k", ls="--", alpha=0.25)
plt.xticks(x_ticks)
plt.yticks(y_ticks)
plt.xlabel("Log-Transformed Value")
plt.ylabel("Original Value")
plt.title("ILLUSTRATION: ONE UNIT INCREASE OF LOG", fontweight="bold")
plt.show()
Finally, we check the multicollinearity assumption of the model using the Variance Inflation Factor (VIF); as a common rule of thumb, VIF values below 10 indicate no serious multicollinearity.
vif_list = []
for idx, col in enumerate(final_model.model.exog_names[1:]):
    vif_dict = {"Variable": col,
                "VIF": variance_inflation_factor(final_model.model.exog, idx+1)}
    vif_list.append(vif_dict)
pd.DataFrame(vif_list)
Modeling: K-Nearest Neighbor
K-Nearest Neighbor (KNN) is a non-parametric, lazy-learning classification algorithm. No model is learned; the training examples themselves are used to classify new data points. Here's the algorithm (a minimal sketch follows the list):
- Specify a positive integer $k$.
- Compute the similarity/distance from the unlabeled data point to every point in the training set. The most commonly used measure is Euclidean distance.
- Assign the class by majority voting among the $k$ nearest training examples. If there is a tie, assign randomly; to avoid ties, choose $k$ to be an odd number.
- Repeat steps 2-3 for each unlabeled data point in the test set.
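For intuition, here is a minimal NumPy sketch of steps 1-3 for a single unlabeled point on toy, made-up data (X_train_demo, y_train_demo, and x_new are hypothetical); the actual modeling below uses sklearn's KNeighborsClassifier.
# Minimal sketch of the KNN voting idea for one new point (toy data, illustration only).
import numpy as np

X_train_demo = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0], [1.2, 0.5]])
y_train_demo = np.array([0, 0, 1, 1, 0])
x_new = np.array([1.4, 1.5])
k = 3                                              # odd k to avoid ties

# Step 2: Euclidean distance from the new point to every training example
distances = np.sqrt(((X_train_demo - x_new) ** 2).sum(axis=1))

# Step 3: majority vote among the k nearest training examples
nearest_idx = np.argsort(distances)[:k]
votes = y_train_demo[nearest_idx]
predicted_class = np.bincount(votes).argmax()
print(predicted_class)   # 0, since the 3 nearest neighbors are all class 0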
For KNN we use only the numeric predictors, so we drop the Gender_Male dummy column:
X_train_knn = X_train_smote.drop(columns=['Gender_Male'])
X_test_knn = X_test.drop(columns=['Gender_Male'])
Feature Scaling
This is a necessary step before fitting the data to KNN. The predictors have very different ranges, and they should be treated equally so that each feature's contribution to the distance is equally weighted. We will try two types of scaling to shrink/expand the ranges:
- Min-Max Normalization scales a variable to the range between 0 and 1: $x' = \dfrac{x - x_{min}}{x_{max} - x_{min}}$
- Standardization transforms the data to have a mean of 0 and a standard deviation of 1: $z = \dfrac{x - \mu}{\sigma}$
X_train_knn.describe().loc[['min', 'max']]
def scaling(train, test, scaler):
    # fit the scaler on the training data only, then transform both sets
    s = scaler().fit(train)
    train_scaled = pd.DataFrame(s.transform(train), columns=train.columns)
    test_scaled = pd.DataFrame(s.transform(test), columns=test.columns)
    return train_scaled, test_scaled
Note that we only call the .fit() method on the training data, since we treat the testing data as unseen data whose statistics are unknown to the model.
# min-max scaler
X_train_knn_minmax, X_test_knn_minmax = scaling(X_train_knn, X_test_knn, MinMaxScaler)
X_train_knn_minmax.describe().loc[['min', 'max']]
Using min-max scaler, the min and max of all variables are 0 and 1 respectively.
# standard scaler
X_train_knn_std, X_test_knn_std = scaling(X_train_knn, X_test_knn, StandardScaler)
X_train_knn_std.describe().loc[['mean', 'std']]
Using standard scaler, the mean and standard deviation of all variables are 0 and 1 respectively.
def evaluateKNN(X_train, y_train, X_test, y_test, range_k):
    eval_list = []
    # try odd values of k within range_k
    for k in range(*range_k, 2):
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, y_train)
        y_pred_cl = knn.predict(X_test)
        eval_res = {
            "k": k,
            "Recall": recall_score(y_test, y_pred_cl),
            "Precision": precision_score(y_test, y_pred_cl),
            "F1": f1_score(y_test, y_pred_cl),
            "Accuracy": accuracy_score(y_test, y_pred_cl),
            "MCC": matthews_corrcoef(y_test, y_pred_cl)
        }
        eval_list.append(eval_res)
    eval_df = pd.DataFrame(eval_list)
    return eval_df
range_k = (1, 50)
# min-max scaler
eval_knn_minmax = evaluateKNN(
X_train_knn_minmax, y_train_smote,
X_test_knn_minmax, y_test,
range_k)
# standard scaler
eval_knn_std = evaluateKNN(
X_train_knn_std, y_train_smote,
X_test_knn_std, y_test,
range_k)
Next, we compare the performance of the two scaling types based on Recall, F1-score, and MCC.
metric_list = ["Recall", "F1", "MCC"]
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
eval_minmax = pd.melt(eval_knn_minmax, id_vars="k",
var_name="Metric", value_name="Value")
eval_minmax["Scaling Type"] = "Min-Max"
eval_standard = pd.melt(eval_knn_std, id_vars="k",
var_name="Metric", value_name="Value")
eval_standard["Scaling Type"] = "Standard"
eval_df = pd.concat([eval_minmax, eval_standard])
for ax, metric in zip(axes.flat, metric_list):
    df = eval_df[eval_df["Metric"] == metric]
    line = sns.lineplot(data=df, x="k", y="Value",
                        hue="Scaling Type", palette="gray", ax=ax)
    line.legend_.remove()
    ax.set_title(metric.upper(), size=15, fontweight="bold")
plt.tight_layout()
plt.legend(handles=axes[0].get_legend_handles_labels()[0], loc="center",
bbox_to_anchor=(-0.7, -0.2),
shadow=True, ncol=3)
fig.suptitle("K-NEAREST NEIGHBOR MODEL EVALUATION",
size=28, y=1.05, fontweight="bold")
plt.show()
As explained in the logistic regression model evaluation, we prioritize Recall over Precision while keeping the F1 score high. So, we choose the model with standardized data because its Recall and F1 are relatively higher than with min-max scaling. Next, we choose the optimal number of neighbors $k$.
fig, ax = plt.subplots(figsize=(8, 5))
eval_df = eval_knn_std.drop(["Accuracy"], axis=1)
# LINE PLOT
eval_df.plot(x="k", color="rgbk", ax=ax)
# F1 SCORE
# row_max_F1 = eval_df[eval_df["F1"] == max(eval_df["F1"])]
# k_max_F1 = row_max_F1["k"].values[0]
# ax.axvline(x=k_max_F1, ls='--', color="b")
# ax.text(x=k_max_F1 + 5, y=0.2,
# s="MAX F1",
# fontsize=12, color="b")
# ax.set_xticks(list(ax.get_xticks())[1:-1] + [k_max_F1])
# SQRT NROWS
sqrt_n = int(X_train_smote.shape[0] ** 0.5)
ax.axvline(x=sqrt_n, ls='--', color='b')
ax.text(x=sqrt_n + 1, y=0.2,
s="$\sqrt{nrows}$",
fontsize=12, color="b")
# CHOOSE K MANUALLY
choose_k = 5
ax.axvline(x=choose_k, ls='--', color='y')
ax.set_xticks(list(ax.get_xticks())[1:-1] + [sqrt_n, choose_k])
plt.legend(loc="center", bbox_to_anchor=(0.5, -0.2),
shadow=True, ncol=4)
fig.suptitle("K-NEAREST NEIGHBOR METRICS (STANDARDIZATION)",
size=16, fontweight="bold")
plt.show()
The most common choice of $k$ is $\sqrt{nrows}$, which is 24 in this case, but it doesn't result in good metrics. From the plot, we instead choose k=5 manually, sacrificing some Precision for a better Recall score.
eval_knn_list = [eval_knn_minmax, eval_knn_std]
eval_knn_df = pd.concat([eval_df[eval_df["k"] == choose_k]
for eval_df in eval_knn_list])
eval_knn_df.index = ["MIN-MAX NORMALIZATION", "STANDARDIZATION"]
eval_knn_df
In this case, using the same number of neighbors $k$, standardization overall outperforms min-max normalization.
pd.DataFrame(
[eval_logreg_df.loc["LOG-TRANSFORMED"][1:],
eval_knn_df.loc["STANDARDIZATION"][1:]]
)
The logistic regression with log-transformed data is better in recall but worse in precision.
To further evaluate our models, let's introduce another performance measurement: the Receiver Operating Characteristic (ROC) curve. It plots the True Positive Rate (Recall) against the False Positive Rate at every threshold. The area under the ROC curve, called the Area Under the Curve (AUC), measures the degree of class separability: the higher the AUC, the better the model is at distinguishing between patients with and without liver disease.
- AUC near 1 indicates the model has a good measure of separability.
- AUC near 0 means it has the worst measure of separability (it predicts the classes reversed).
- An AUC of 0.5 means the model has no class separation capacity whatsoever.
# BASELINE
base_probs = np.zeros(len(y_test))
base_fpr, base_tpr, _ = roc_curve(y_test, base_probs)
base_auc = roc_auc_score(y_test, base_probs)
# LOGISTIC REGRESSION
logreg_probs = model_log.predict(sm.add_constant(X_test_log))
logreg_fpr, logreg_tpr, _ = roc_curve(y_test, logreg_probs)
logreg_auc = roc_auc_score(y_test, logreg_probs)
# KNN
knn_opt_model = KNeighborsClassifier(n_neighbors=choose_k).fit(X_train_knn_std, y_train_smote)
knn_probs = knn_opt_model.predict_proba(X_test_knn_std)[:, 1]  # positive-class probability on the standardized test set
knn_fpr, knn_tpr, _ = roc_curve(y_test, knn_probs)
knn_auc = roc_auc_score(y_test, knn_probs)
# PLOT ROC
plt.plot(base_fpr, base_tpr, linestyle='--',
label="Baseline (AUC = {:.3f})".format(base_auc))
plt.plot(logreg_fpr, logreg_tpr, linestyle='-',
label="Logistic Regression (AUC = {:.3f})".format(logreg_auc), color="r")
plt.plot(knn_fpr, knn_tpr, linestyle='-',
label="KNN (AUC = {:.3f})".format(knn_auc), color="g")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (Recall)")
plt.title("RECEIVER OPERATING CHARACTERISTIC (ROC) CURVE", fontweight="bold")
plt.legend()
plt.show()
In conclusion, we choose the logistic regression model because its AUC score is greater than KNN's, indicating that the model has a good capability of separating patients with and without liver disease regardless of the threshold. Secondly, the model's interpretability gives us insight into which predictors matter when classifying whether a patient has liver disease, whereas we cannot interpret KNN in the same way. From the model summary, we can conclude:
- Older patients have a significantly higher probability of having liver disease.
- Higher Albumin in the blood significantly decreases the probability of a patient having liver disease.
- Being male (Gender_Male = 1) significantly increases the probability of having liver disease compared to being female (Gender_Male = 0).
- Higher values of the other test results, DB/TB Percentage, Globulin, ALP, ALT, and AST, significantly increase the probability of a patient having liver disease.
final_model.summary()
By using log-transformed data and threshold = 0.2, here is our final logistic regression model's performance:
pd.DataFrame(eval_logreg_df.loc["LOG-TRANSFORMED"][1:])
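As a closing usage sketch, this is how the final model could score a single new patient: the raw ALP, ALT, and AST values are log-transformed, a constant is added, the probability is predicted, and the chosen threshold = 0.2 is applied. The patient's feature values below are hypothetical, and the column order is taken from X_train_smote_log.
# Score a hypothetical new patient with the final log-transformed model.
new_patient = pd.DataFrame([{
    "Age": 45,
    "Albumin": 3.2,
    "DB/TB Percentage": 40.0,
    "Globulin": 3.5,
    "Gender_Male": 1,
    "log_ALP": np.log(210),    # raw ALP, ALT, AST must be log-transformed first
    "log_ALT": np.log(60),
    "log_AST": np.log(75),
}])[X_train_smote_log.columns]                     # same column order as the training data

exog_new = sm.add_constant(new_patient, has_constant="add")
prob = float(final_model.predict(exog_new)[0])
print("Probability of liver disease: {:.3f}".format(prob))
print("Predicted class:", "Yes" if prob > 0.2 else "No")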