Load Libraries

import pandas as pd
import numpy as np

# data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import graphviz

# modeling
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, precision_score, f1_score, classification_report
from sklearn.model_selection import GridSearchCV

# other
from IPython.display import display, Math, HTML

# setting
plt.style.use('seaborn')
pd.set_option('display.float_format', lambda x: '%.5f' % x)
pd.set_option('display.max_colwidth', None)
pd.options.plotting.backend = "plotly"

Data Loading

The breast cancer dataset used here is provided by Kaggle.

cancer = pd.read_csv("data_input/data.csv")
cancer.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             569 non-null    float64
 15  area_se                  569 non-null    float64
 16  smoothness_se            569 non-null    float64
 17  compactness_se           569 non-null    float64
 18  concavity_se             569 non-null    float64
 19  concave points_se        569 non-null    float64
 20  symmetry_se              569 non-null    float64
 21  fractal_dimension_se     569 non-null    float64
 22  radius_worst             569 non-null    float64
 23  texture_worst            569 non-null    float64
 24  perimeter_worst          569 non-null    float64
 25  area_worst               569 non-null    float64
 26  smoothness_worst         569 non-null    float64
 27  compactness_worst        569 non-null    float64
 28  concavity_worst          569 non-null    float64
 29  concave points_worst     569 non-null    float64
 30  symmetry_worst           569 non-null    float64
 31  fractal_dimension_worst  569 non-null    float64
 32  Unnamed: 32              0 non-null      float64
dtypes: float64(31), int64(1), object(1)
memory usage: 146.8+ KB

There are 569 observations and 33 columns with the following description:

  • id: ID number
  • diagnosis: M = Malignant, B = Benign (target variable)

Ten real-valued features are computed for each cell nucleus:

  • radius: mean of distances from center to points on the perimeter
  • texture: standard deviation of gray-scale values
  • perimeter
  • area
  • smoothness: local variation in radius lengths
  • compactness: perimeter^2 / area - 1.0
  • concavity: severity of concave portions of the contour
  • concave points: number of concave portions of the contour
  • symmetry
  • fractal dimension: "coastline approximation" - 1

Each of the ten features above is recorded as its mean, se (standard error), and worst (largest) value, giving 30 feature columns in total.
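As a quick, purely illustrative sanity check, we can confirm that the 30 feature columns are exactly the ten base measurements crossed with these three statistics:

# 10 base measurements x 3 statistics = 30 feature columns
feature_cols = [c for c in cancer.columns if c not in ('id', 'diagnosis', 'Unnamed: 32')]
bases = {c.rsplit('_', 1)[0] for c in feature_cols}
stats = {c.rsplit('_', 1)[1] for c in feature_cols}
print(len(feature_cols), sorted(stats))  # 30 ['mean', 'se', 'worst']
print(sorted(bases))                     # the 10 base measurement names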

Drop Columns

There are no missing values in the data, except for the completely empty column Unnamed: 32.

missing = cancer.isnull().sum()
missing[missing>0]
Unnamed: 32    569
dtype: int64

We can remove the id column because it is not used as a feature, and the Unnamed: 32 column because it is an empty column.

cancer.drop(['id', 'Unnamed: 32'], axis=1, inplace=True)

Data Visualization

Target Proportion

The target variable diagnosis has 212 Malignant and 357 Benign observations. We treat this proportion as reasonably balanced, so no resampling is applied.

sns.countplot(x='diagnosis', data=cancer);
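The exact counts can also be confirmed directly (a quick optional check, equivalent to the plot above):

cancer['diagnosis'].value_counts()  # B: 357, M: 212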

Correlation

From the correlation heatmap, we see that some measurements are essentially derived from others; for example, radius, perimeter, and area are strongly correlated with each other.

plt.figure(figsize=(15, 15))
sns.heatmap(cancer.corr(), annot=True, linewidths=0.5, fmt='.2f');
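To make the point concrete (an optional check), the pairwise correlations among the size-related mean measurements can be printed on their own; these columns are already in the data:

# pairwise correlations among the size-related mean features
cancer[['radius_mean', 'perimeter_mean', 'area_mean']].corr()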

Distribution

cancer_longer = cancer.melt(id_vars='diagnosis', var_name='features')
g = sns.FacetGrid(data=cancer_longer, col='features',
                  col_wrap=10, sharey=False)
g.map_dataframe(sns.boxplot, x='diagnosis', y='value', hue='diagnosis')
g.add_legend()
plt.show()

median = cancer_longer.groupby(['diagnosis', 'features']).quantile(0.5)['value'].unstack(level='diagnosis')
median['Median Difference'] = median['M'] - median['B']
median = median.sort_values('Median Difference')
pd.concat([median.head(3), median.tail(3)])
diagnosis               B          M  Median Difference
features
texture_se        1.10800    1.10250           -0.00550
symmetry_se       0.01909    0.01770           -0.00139
smoothness_se     0.00653    0.00621           -0.00032
perimeter_worst  86.92000  138.00000           51.08000
area_mean       458.40000  932.00000          473.60000
area_worst      547.40000 1303.00000          755.60000

The table above shows the median value of each feature for Benign and Malignant cancer. The median difference is calculated by subtracting the Benign median from the Malignant median. For the majority of the features, the median value is greater for Malignant than for Benign; the exceptions are texture_se, symmetry_se, and smoothness_se. The most striking feature is area, where the Malignant median is more than twice the Benign median.
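To back up the "more than twice" statement, we can also compute the ratio of the Malignant median to the Benign median, reusing the median table built above (a small optional extension):

# ratio of Malignant to Benign medians for a few size-related features
ratio = (median['M'] / median['B']).rename('M/B Ratio')
ratio[['area_mean', 'area_worst', 'perimeter_worst']]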

Data Preprocessing

We have to encode the target variable diagnosis using LabelEncoder. Encoding is not needed for the features since all feature values are already numeric.

X = cancer.drop(columns=['diagnosis'])

le = LabelEncoder()
y = le.fit_transform(cancer['diagnosis'])
le_name_mapping = dict(zip(le.classes_, le.transform(le.classes_)))
le_name_mapping
{'B': 0, 'M': 1}

Note: Using LabelEncoder, the target value Benign is mapped to 0, while Malignant is mapped to 1.

Train-test splitting with a 75:25 proportion.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")
X_train shape: (426, 30)
X_test shape: (143, 30)
y_train shape: (426,)
y_test shape: (143,)
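Since the split above is not stratified, a quick optional check confirms that both splits still contain a sensible share of each class:

# class proportions in each split (0 = Benign, 1 = Malignant)
print(np.bincount(y_train) / len(y_train))
print(np.bincount(y_test) / len(y_test))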

Data Modeling

We are going to compare four binary classification models: logistic regression, support vector classifier, decision tree, and random forest. The positive target class is Malignant (1), and the negative class is Benign (0).

We are going to consider the following evaluation metrics:

  • Recall: what percentage of patients with Malignant can the model detect?
  • Precision: of the patients predicted to be Malignant, what percentage were truly Malignant?
  • F1 score: harmonic mean of Recall and Precision.

$F1 = 2 \times \frac{Recall \times Precision}{Recall+Precision}$
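As a quick worked example, Recall = 0.94 and Precision = 0.98 give $F1 = 2 \times \frac{0.94 \times 0.98}{0.94 + 0.98} \approx 0.96$.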

Types of errors:

  • False Positive: patients who actually have a benign tumor (Benign) but are predicted by the model as Malignant. This can cause unnecessary anxiety, but the patient will receive further consultation or treatment as a preventive measure, so patient safety is not compromised.
  • False Negative: patients who actually have a malignant tumor (Malignant) but are predicted by the model as Benign. This jeopardizes patient safety because the case is not treated with the urgency it requires.

Note: To minimize False Negative cases, a high recall is expected. Precision should also remain high, but it can be a second priority. If both metrics are to be balanced, use the F1 score.
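A minimal toy example (with hypothetical labels, not taken from the dataset) shows how a single missed Malignant case lowers recall:

# toy illustration with hypothetical labels: 1 = Malignant, 0 = Benign
y_true_toy = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 1, 0, 0, 0, 1, 0, 0]      # one false negative, one false positive
print(recall_score(y_true_toy, y_pred_toy))       # 4 of 5 Malignant cases detected -> 0.80
print(precision_score(y_true_toy, y_pred_toy))    # 4 of 5 Malignant predictions correct -> 0.80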

Logistic Regression

logreg = LogisticRegression(max_iter=np.inf, random_state=123)  # np.inf removes the iteration cap; newer scikit-learn versions require an integer (e.g. 10000)
logreg.fit(X_train, y_train)
LogisticRegression(max_iter=inf, random_state=123)

We define a threshold_tuning function that searches over probability thresholds for the one giving the largest F1 score.

def threshold_tuning(model, X, y):
    y_pred_prob = model.predict_proba(X)[:, 1]  # probability

    eval_list = []
    for threshold in np.linspace(0, 0.99, 100):
        y_pred = (y_pred_prob > threshold).astype(int)  # threshold cut-off

        # evaluation metric
        eval_dict = {
            'Threshold': threshold,
            'Recall': recall_score(y, y_pred),
            'Precision': precision_score(y, y_pred),
            'F1': f1_score(y, y_pred)
        }

        eval_list.append(eval_dict)

    # tuning result
    eval_df = pd.DataFrame(eval_list).set_index('Threshold')
    eval_df.columns = eval_df.columns.set_names('Metrics')
    max_f1 = eval_df.sort_values('F1', ascending=False).head(1)
    optimal_threshold = max_f1.index.values[0]

    # plotting
    fig = eval_df.plot(title="Threshold Tuning")
    fig.add_shape(dict(type="line",
                       x0=optimal_threshold, y0=0,
                       x1=optimal_threshold, y1=1))

    # print classification report using max F1
    y_pred_optimal = (y_pred_prob > optimal_threshold).astype(int)
    print(classification_report(y, y_pred_optimal,
          target_names=['Benign', 'Malignant']))

    return eval_df.sort_values('F1', ascending=False), fig
logreg_tuning, fig = threshold_tuning(logreg, X_test, y_test)
HTML(fig.to_html(include_plotlyjs='cdn'))
              precision    recall  f1-score   support

      Benign       0.97      0.99      0.98        90
   Malignant       0.98      0.94      0.96        53

    accuracy                           0.97       143
   macro avg       0.97      0.97      0.97       143
weighted avg       0.97      0.97      0.97       143

For logistic regression, we use threshold = 0.84 to get the largest F1 score on the testing set. Next, we interpret the model via its estimates (intercept and coefficients). The predictors below are sorted by coefficient from largest to smallest:

logreg_interpret = pd.DataFrame(np.append(logreg.intercept_, logreg.coef_), columns=['Estimate (Logit)'])
logreg_interpret.index = ['intercept'] + list(X.columns)
logreg_interpret['Odds Ratio'] = logreg_interpret['Estimate (Logit)'].transform(np.exp)
logreg_interpret.sort_values('Estimate (Logit)', ascending=False)
                         Estimate (Logit)  Odds Ratio
concavity_worst                   1.07945     2.94308
symmetry_worst                    0.62870     1.87517
concave points_worst              0.46480     1.59169
compactness_worst                 0.44890     1.56659
texture_worst                     0.39236     1.48047
concavity_mean                    0.32357     1.38205
symmetry_mean                     0.31655     1.37239
smoothness_worst                  0.27173     1.31223
perimeter_worst                   0.23309     1.26249
concave points_mean               0.21682     1.24212
smoothness_mean                   0.15523     1.16793
perimeter_mean                    0.15401     1.16651
compactness_mean                  0.13147     1.14050
area_se                           0.11032     1.11663
fractal_dimension_worst           0.06806     1.07043
concave points_se                 0.02301     1.02327
symmetry_se                       0.02242     1.02267
fractal_dimension_mean            0.02136     1.02159
radius_se                         0.01969     1.01989
smoothness_se                     0.01406     1.01416
area_worst                        0.01009     1.01014
perimeter_se                     -0.00461     0.99540
concavity_se                     -0.01838     0.98179
fractal_dimension_se             -0.02138     0.97885
area_mean                        -0.02385     0.97644
compactness_se                   -0.09884     0.90588
texture_mean                     -0.16468     0.84816
radius_worst                     -0.28772     0.74997
radius_mean                      -0.50493     0.60355
texture_se                       -0.98308     0.37416
intercept                       -30.78763     0.00000

Note: Coefficient interpretation: a 1-unit increase in a predictor multiplies the odds of malignant cancer by its Odds Ratio. A positive estimate corresponds to an Odds Ratio above 1 (the odds increase), while a negative estimate corresponds to an Odds Ratio below 1 (the odds decrease).

Examples:

  • An increase of 1 unit in concavity_worst multiplies the odds of a cancer being malignant by 2.94308, i.e. increases them by 194.308% (Odds Ratio - 1).
  • An increase of 1 unit in texture_se multiplies those odds by 0.37416, i.e. reduces them by 62.584% (1 - Odds Ratio).
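A minimal sketch of the arithmetic, assuming a hypothetical baseline probability p (the coefficient is taken from the table above):

p = 0.30                                 # hypothetical baseline probability of malignancy
coef = 1.07945                           # estimate for concavity_worst (see table above)
new_odds = (p / (1 - p)) * np.exp(coef)  # a 1-unit increase multiplies the odds by exp(coef)
new_p = new_odds / (1 + new_odds)        # back to a probability, roughly 0.56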

Support Vector Classifier

Use grid search cross-validation for the support vector model to get the largest F1 value.

svc_parameters = {
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'C': [0.1, 1, 10]
}
svc = SVC(random_state=123)
svc_grid = GridSearchCV(svc, svc_parameters, cv=3, scoring='f1')
svc_grid.fit(X_train, y_train)
GridSearchCV(cv=3, estimator=SVC(random_state=123),
             param_grid={'C': [0.1, 1, 10],
                         'kernel': ['linear', 'poly', 'rbf', 'sigmoid']},
             scoring='f1')
pd.DataFrame(svc_grid.cv_results_).sort_values('rank_test_score').head(3)[['rank_test_score', 'params', 'mean_test_score', 'std_test_score']]
   rank_test_score                          params  mean_test_score  std_test_score
8                1   {'C': 10, 'kernel': 'linear'}          0.93955         0.02547
4                2    {'C': 1, 'kernel': 'linear'}          0.93936         0.01144
0                3  {'C': 0.1, 'kernel': 'linear'}          0.93888         0.01207

Re-train the support vector model with the parameter combination that produced the largest F1.

Note: We could use svc_grid directly to predict classes, but here we want predicted probabilities in order to tune the threshold. Setting probability=True during the grid search would be slow because probabilities would be estimated for every parameter combination; we only need them for the best combination.
svc_best = SVC(**svc_grid.best_params_, probability=True, random_state=123)
svc_best.fit(X_train, y_train)
SVC(C=10, kernel='linear', probability=True, random_state=123)

We further tune the support vector model using the previously defined threshold_tuning function.

svc_tuning, fig = threshold_tuning(svc_best, X_test, y_test)
HTML(fig.to_html(include_plotlyjs='cdn'))
              precision    recall  f1-score   support

      Benign       0.99      0.97      0.98        90
   Malignant       0.95      0.98      0.96        53

    accuracy                           0.97       143
   macro avg       0.97      0.97      0.97       143
weighted avg       0.97      0.97      0.97       143

For support vector, we use threshold = 0.42 to get the largest F1 score on the testing set.

Decision Tree Classifier

Use grid search cross-validation for the decision tree model to get the largest F1 value.

dtc_parameters = {
    'criterion': ['gini', 'entropy'],
    'splitter': ['best', 'random'],
    'max_features': [2, 'sqrt', 'log2'],
    'min_samples_leaf': range(1, 10, 2)
}
dtc = DecisionTreeClassifier(random_state=123)
dtc_grid = GridSearchCV(dtc, dtc_parameters, cv=3, scoring='f1')
dtc_grid.fit(X_train, y_train)
GridSearchCV(cv=3, estimator=DecisionTreeClassifier(random_state=123),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_features': [2, 'sqrt', 'log2'],
                         'min_samples_leaf': range(1, 10, 2),
                         'splitter': ['best', 'random']},
             scoring='f1')
pd.DataFrame(dtc_grid.cv_results_).sort_values('rank_test_score').head(3)[['rank_test_score', 'params', 'mean_test_score', 'std_test_score']]
    rank_test_score                                                                                        params  mean_test_score  std_test_score
56                1  {'criterion': 'entropy', 'max_features': 'log2', 'min_samples_leaf': 7, 'splitter': 'best'}           0.91301         0.01684
50                2  {'criterion': 'entropy', 'max_features': 'log2', 'min_samples_leaf': 1, 'splitter': 'best'}           0.91081         0.00144
0                 3          {'criterion': 'gini', 'max_features': 2, 'min_samples_leaf': 1, 'splitter': 'best'}           0.90929         0.02982

We further tune the decision tree model using the previously defined threshold_tuning function. Unlike the support vector section, we do not need to re-train the best model, because dtc_grid can be used directly to predict probabilities.

dtc_tuning, fig = threshold_tuning(dtc_grid, X_test, y_test)
HTML(fig.to_html(include_plotlyjs='cdn'))
              precision    recall  f1-score   support

      Benign       0.95      0.93      0.94        90
   Malignant       0.89      0.92      0.91        53

    accuracy                           0.93       143
   macro avg       0.92      0.93      0.93       143
weighted avg       0.93      0.93      0.93       143

For decision tree, we use threshold = 0.5 to get the largest F1 score on the testing set. Next, we would like to interpret the model visually.

dot_data = export_graphviz(
    dtc_grid.best_estimator_,
    feature_names=X.columns,
    class_names=['Benign', 'Malignant'],
    impurity=False,
    leaves_parallel=True,
    filled=True,
    rounded=True)
graph = graphviz.Source(dot_data)
# display(graph)
graph.format = "png"
graph.render("assets/2020-10-26-breast-cancer-wisconsin-classification/decision_tree_breast_cancer");

The decision tree can be traversed from the root (top) to the leaves (bottom). Each leaf is labeled with its predicted class. Each internal node splits on a condition: if the condition is True, traverse to the left branch, otherwise to the right.
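As an alternative to the rendered image, the same tree can be printed as plain text with scikit-learn's export_text (a small optional sketch):

from sklearn.tree import export_text

# text version of the fitted tree: one indentation level per split,
# read from the root down to the class-labeled leaves
print(export_text(dtc_grid.best_estimator_, feature_names=list(X.columns)))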

Random Forest Classifier

Use grid search cross-validation for the random forest model to get the largest F1 value.

rfc_parameters = {
    'n_estimators': [100, 200, 300],
    'criterion': ['gini', 'entropy'],
    'max_features': [2, 'sqrt', 'log2']
}
rfc = RandomForestClassifier(random_state=123)
rfc_grid = GridSearchCV(rfc, rfc_parameters, cv=3, scoring='f1')
rfc_grid.fit(X_train, y_train)
GridSearchCV(cv=3, estimator=RandomForestClassifier(random_state=123),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_features': [2, 'sqrt', 'log2'],
                         'n_estimators': [100, 200, 300]},
             scoring='f1')
pd.DataFrame(rfc_grid.cv_results_).sort_values('rank_test_score').head(3)[['rank_test_score', 'params', 'mean_test_score', 'std_test_score']]
    rank_test_score                                                                  params  mean_test_score  std_test_score
1                 1           {'criterion': 'gini', 'max_features': 2, 'n_estimators': 200}          0.93867         0.01034
10                2        {'criterion': 'entropy', 'max_features': 2, 'n_estimators': 200}          0.93862         0.01725
13                3   {'criterion': 'entropy', 'max_features': 'sqrt', 'n_estimators': 200}          0.93626         0.00933

We further tune the random forest model using the previously defined threshold_tuning function. As with the decision tree, we do not need to re-train the best model, because rfc_grid can be used directly to predict probabilities.

rfc_tuning, fig = threshold_tuning(rfc_grid, X_test, y_test)
HTML(fig.to_html(include_plotlyjs='cdn'))
              precision    recall  f1-score   support

      Benign       0.99      0.97      0.98        90
   Malignant       0.95      0.98      0.96        53

    accuracy                           0.97       143
   macro avg       0.97      0.97      0.97       143
weighted avg       0.97      0.97      0.97       143

For random forest, we use threshold = 0.46 to get the largest F1 score on the testing set. Next, we would like to interpret the model based on feature importances. Here are the top 10 features that are most important in classifying whether a cancer is benign or malignant:

pd.options.plotting.backend = "matplotlib"

var_imp = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rfc_grid.best_estimator_.feature_importances_
}).set_index('Feature')

var_imp.sort_values('Importance', ascending=False).head(10).sort_values(
    'Importance').plot.barh(title='Top 10 Feature Importances');

Conclusion

The following are the performance results of each binary classification model:

final_result = pd.concat([
    logreg_tuning.head(1),
    svc_tuning.head(1),
    dtc_tuning.head(1),
    rfc_tuning.head(1)]).reset_index()
final_result.index = ['Logistic Regression', 'Support Vector Classifier',
                      'Decision Tree Classifier', 'Random Forest Classifier']
final_result
Metrics                     Threshold   Recall  Precision       F1
Logistic Regression           0.84000  0.94340    0.98039  0.96154
Support Vector Classifier     0.42000  0.98113    0.94545  0.96296
Decision Tree Classifier      0.50000  0.92453    0.89091  0.90741
Random Forest Classifier      0.46000  0.98113    0.94545  0.96296

Each model has its advantages and disadvantages as shown in the image below:

If we have to choose among the four models, we prefer the one with the highest classification performance, since patient safety and lives are at stake in the medical field. We therefore choose the Random Forest Classifier with a threshold of 0.46: it achieves the highest F1 score (tied with the Support Vector Classifier), and the contribution of each feature can be inspected through its feature importances (although, unlike Logistic Regression, the direction of each feature's influence is not given).