Hotel Booking Cancellation Prediction using Random Forest
Machine learning workflow to predict the possibility of hotel booking cancellation using Random Forest model. Features are selected using the Predictive Power Score (PPS) method. Several techniques of model improvement are done using downsampling, random search cross-validation, and threshold tuning. Models are evaluated using the confusion matrix and ROC curve.
- Introduction
- Import Libraries
- Data Wrangling
- Exploratory Data Analysis (EDA)
- Modeling and Evaluation
- Conclusion
Introduction
In this post, we'll be using Hotel Booking Demand data from Kaggle. This data set contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things. We are going to visualize the data by using plotly
and also predict the possibility of hotel bookings using random forest model using sklearn
.
Import Libraries
As usual, before we begin any analysis and modeling, let's import several necessary libraries to work with the data. There are two additional packages used in this notebook:
-
pycountry
: contains mapping of ISO country, subdivision, language, currency and script definitions and their translations -
ppscore
: implementation of the Predictive Power Score (PPS)
import pandas as pd
import numpy as np
# visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.offline as py
import plotly.io as pio
import plotly.express as px
sns.set()
pio.templates.default = "plotly_white"
pio.renderers.default = "notebook"
# modeling
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV
from sklearn.metrics import classification_report, roc_curve, roc_auc_score
from sklearn.utils import resample
import multiprocessing
import pickle
# additional packages
import pycountry
import pycountry_convert as pc
import ppscore as pps
from IPython.display import Image, HTML
from tqdm.notebook import tqdm_notebook
import warnings
warnings.filterwarnings('ignore')
hotel = pd.read_csv("data_input/hotel_bookings.csv")
hotel.info()
Our dataframe hotel
contains 119390 rows of bookings and 32 columns with data description as follows:
-
hotel
: Hotel (H1 = Resort Hotel or H2 = City Hotel) -
is_canceled
: Value indicating if the booking was canceled (1) or not (0) -
lead_time
: Number of days that elapsed between the entering date of the booking into the PMS and the arrival date -
arrival_date_year
: Year of arrival date -
arrival_date_month
: Month of arrival date -
arrival_date_week_number
: Week number of year for arrival date -
arrival_date_day_of_month
: Day of arrival date -
stays_in_weekend_nights
: Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel -
stays_in_week_nights
: Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel -
adults
: Number of adults -
children
: Number of children -
babies
: Number of babies -
meal
: Type of meal booked. Categories are presented in standard hospitality meal packages:- Undefined/SC – no meal package;
- BB – Bed & Breakfast;
- HB – Half board (breakfast and one other meal – usually dinner);
- FB – Full board (breakfast, lunch and dinner)
-
country
: Country of origin. Categories are represented in the ISO 3155–3:2013 format -
market_segment
: Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators” -
distribution_channel
: Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators” -
is_repeated_guest
: Value indicating if the booking name was from a repeated guest (1) or not (0) -
previous_cancellations
: Number of previous bookings that were cancelled by the customer prior to the current booking -
previous_bookings_not_canceled
: Number of previous bookings not cancelled by the customer prior to the current booking -
reserved_room_type
: Code of room type reserved. Code is presented instead of designation for anonymity reasons. -
assigned_room_type
: Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons. -
booking_changes
: Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation -
deposit_type
: Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories:- No Deposit – no deposit was made;
- Non Refund – a deposit was made in the value of the total stay cost;
- Refundable – a deposit was made with a value under the total cost of stay.
-
agent
: ID of the travel agency that made the booking -
company
: ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons -
days_in_waiting_list
: Number of days the booking was in the waiting list before it was confirmed to the customer -
customer_type
: Type of booking, assuming one of four categories:- Contract - when the booking has an allotment or other type of contract associated to it;
- Group – when the booking is associated to a group;
- Transient – when the booking is not part of a group or contract, and is not associated to other transient booking;
- Transient-party – when the booking is transient, but is associated to at least other transient booking
-
adr
: Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights -
required_car_parking_spaces
: Number of car parking spaces required by the customer -
total_of_special_requests
: Number of special requests made by the customer (e.g. twin bed or high floor) -
reservation_status
: Reservation last status, assuming one of three categories:- Canceled – booking was canceled by the customer;
- Check-Out – customer has checked in but already departed;
- No-Show – customer did not check-in and did inform the hotel of the reason why
-
reservation_status_date
: Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel.
hotel.isna().sum().sort_values(ascending=False).head(10)
There are four columns with missing values, here's how we handle them:
- Drop columns
agent
andcompany
, since the missing values are too many and we won't use them for prediction - Create "UNKNOWN" category for
country
- Fill
children
with the value 0
hotel.drop(columns=['agent', 'company'], inplace=True)
hotel['country'].fillna("UNKNOWN", inplace=True)
hotel['children'].fillna(0, inplace=True)
Check whether there is another missing values.
hotel.isna().values.any()
# list of columns to be casted
category_cols = ['hotel', 'meal', 'country', 'market_segment', 'distribution_channel',
'reserved_room_type', 'assigned_room_type', 'deposit_type', 'customer_type', 'reservation_status']
boolean_cols = ['is_canceled', 'is_repeated_guest']
# map values
boolean_map = {0: 'No', 1: 'Yes'}
hotel['is_canceled'] = hotel['is_canceled'].map(boolean_map)
hotel['is_repeated_guest'] = hotel['is_repeated_guest'].map(boolean_map)
# re-order categories for readability
hotel[category_cols + boolean_cols] = hotel[category_cols +
boolean_cols].astype('category')
hotel['is_canceled'].cat.reorder_categories(
list(boolean_map.values()), inplace=True)
hotel['is_repeated_guest'].cat.reorder_categories(
list(boolean_map.values()), inplace=True)
hotel['children'].apply(float.is_integer).all()
hotel['children'] = hotel['children'].astype('int')
hotel['reservation_status_date'] = hotel['reservation_status_date'].astype('datetime64')
hotel['reserved_room_type'].cat.set_categories(hotel['assigned_room_type'].cat.categories, inplace=True)
hotel['is_assigned_as_reserved'] = (hotel['assigned_room_type'] == hotel['reserved_room_type']).astype('category')
hotel[['assigned_room_type', 'reserved_room_type', 'is_assigned_as_reserved']].head()
arrival_date_cols = ['arrival_date_year', 'arrival_date_month', 'arrival_date_day_of_month']
hotel[arrival_date_cols] = hotel[arrival_date_cols].astype(str)
hotel['arrival_date'] = pd.to_datetime(hotel[arrival_date_cols].apply('-'.join, axis=1), format="%Y-%B-%d")
hotel.drop(columns = arrival_date_cols + ['arrival_date_week_number'], inplace=True)
hotel[['arrival_date']].head()
hotel['booking_date'] = hotel['arrival_date'] - pd.to_timedelta(hotel['lead_time'], unit='days')
hotel[['booking_date', 'arrival_date', 'lead_time']].head()
def convertCountryCode2Name(code):
additional_code2name = {'TMP': 'East Timor'}
country_name = None
try:
if len(code) == 2:
country_name = pycountry.countries.get(alpha_2=code).name
elif len(code) == 3:
country_name = pycountry.countries.get(alpha_3=code).name
except:
if code in additional_code2name.keys():
country_name = additional_code2name[code]
return country_name if country_name is not None else code
def convertCountryName2Continent(country_name):
additional_name2continent = {
'East Timor': 'Asia',
'United States Minor Outlying Islands': 'North America',
'French Southern Territories': 'Antarctica',
'Antarctica': 'Antarctica'}
continent_name = None
try:
alpha2 = pc.country_name_to_country_alpha2(country_name)
continent_code = pc.country_alpha2_to_continent_code(alpha2)
continent_name = pc.convert_continent_code_to_continent_name(continent_code)
except:
if country_name in additional_name2continent.keys():
continent_name = additional_name2continent[country_name]
else:
continent_name = "UNKNOWN"
return continent_name if continent_name is not None else country_name
hotel['country_name'] = hotel['country'].apply(
convertCountryCode2Name).astype('category')
hotel['continent_name'] = hotel['country_name'].apply(
convertCountryName2Continent).astype('category')
hotel[['country', 'country_name', 'continent_name']].head()
Suspicious Bookings
There are some hidden anomalies present in the hotel bookings. Let's create new variables and plot their frequencies:
-
total_guest
: Total number ofadults
,children
, andbabies
-
total_nights
: Number of nights the guest stayed at the hotel, sum ofstays_in_weekend_nights
andstays_in_week_nights
hotel['total_guest'] = hotel[['adults', 'children', 'babies']].sum(axis=1)
hotel['total_nights'] = hotel[['stays_in_weekend_nights', 'stays_in_week_nights']].sum(axis=1)
data2plot = [hotel['total_guest'].value_counts().sort_index(ascending=False),
hotel['total_nights'].value_counts().sort_index(ascending=False)[-21:]]
ylabs = ["Total Guest per Booking (Person)", "Total Nights per Booking"]
titles = ["FREQUENCY OF TOTAL GUEST PER BOOKING\n",
"FREQUENCY OF TOTAL NIGHTS PER BOOKING\n(UP TO 20 NIGHTS ONLY)"]
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
for ax, data, ylab, title in zip(axes, data2plot, ylabs, titles):
bp = data.plot(kind='barh', rot=0, ax=ax)
for rect in bp.patches:
height = rect.get_height()
width = rect.get_width()
bp.text(rect.get_x() + width,
rect.get_y() + height/2,
int(width),
ha='left',
va='center',
fontsize=8)
bp.set_xlabel("Frequency")
bp.set_ylabel(ylab)
ax.set_title(title, fontweight="bold")
There are 180 bookings without guest (total_guest = 0
) and 715 bookings with zero nights of staying at the hotel (total_nights = 0
). Ideally, such cases should not occur on our bookings data. Therefore, from this point onwards we will ignore the observations with either cases since it can affect our modeling outcome.
hotel = hotel[(hotel['total_guest'] != 0) & (hotel['total_nights'] != 0)]
hotel.shape
We end up with 118565 rows of bookings, originally it was 119390 rows.
df_cancel_status = pd.crosstab(index=hotel.is_canceled,
columns=hotel.reservation_status,
margins=True)
ax = df_cancel_status.iloc[:-1, :-1].plot(kind='bar', stacked=True, rot=0)
for rect in ax.patches:
height = rect.get_height()
width = rect.get_width()
if height != 0:
ax.text(rect.get_x() + width,
rect.get_y() + height/2,
int(height),
ha='left',
va='center',
color="black",
fontsize=10)
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles=handles, labels=labels, title="Reservation Status")
percent_no = (100*df_cancel_status /
df_cancel_status.iloc[-1, -1]).loc["No", "All"]
ax.set_xticklabels(["No\n({:.2f} %)".format(
percent_no), "Yes\n({:.2f} %)".format(100-percent_no)])
ax.set_xlabel("Canceled?")
ax.set_ylabel("Number of Bookings")
plt.title("BOOKING CANCELLATION PROPORTION", fontweight="bold")
plt.show()
The proportion of the target variable is_canceled
is somewhat balanced. There is 37.26% of the bookings which are canceled, divided into two cases:
- Canceled: Booking was canceled by the customer, or
- No-show: Customer did not check-in and did inform the hotel of the reason why.
reservation_status
is a variable that is known after the target variable is_canceled
is known.
df_choropleth = hotel.copy()
df_choropleth['booking_date_year'] = df_choropleth['booking_date'].dt.year
df_country_year_count = df_choropleth.groupby(['country', 'booking_date_year']).count()['hotel'].fillna(0).reset_index() \
.rename(columns={'country': 'country_code', 'booking_date_year': 'year', 'hotel': 'count'})
df_country_year_count['country_name'] = df_country_year_count['country_code'].apply(
convertCountryCode2Name)
df_country_year_count['count'] = df_country_year_count['count'].astype('int')
fig = px.choropleth(df_country_year_count[df_country_year_count["year"] != 2013],
locations="country_code", color="count", animation_frame="year",
hover_name="country_name",
range_color=(0, 5000),
color_continuous_scale=px.colors.sequential.Reds,
projection="natural earth")
fig.update_layout(title='ANNUAL HOTEL BOOKING COUNTS',
template="seaborn")
HTML(fig.to_html(include_plotlyjs='cdn'))
ax = pd.crosstab(index=hotel['continent_name'],
columns=hotel['is_canceled'],
margins=True).sort_values('All').iloc[:-1, :-1].plot(kind='barh')
ax.legend(bbox_to_anchor=(1, 1), title="Canceled?")
ax.set_xlabel("Number of Bookings")
ax.set_ylabel("Continent Name")
ax.set_title("BOOKINGS BY EACH CONTINENT", fontweight="bold")
plt.show()
ax = (pd.crosstab(index=hotel['continent_name'],
columns=hotel['is_canceled'],
normalize='index').sort_values('Yes') * 100).plot(kind='barh', stacked=True)
ax.legend(bbox_to_anchor=(1, 1), title="Canceled?")
ax.set_xlabel("Percentage of Bookings")
ax.set_ylabel("Continent Name")
ax.set_title("PERCENTAGE OF BOOKINGS CANCELLATION BY EACH CONTINENT", fontweight="bold")
plt.show()
continent_name
maybe use as a predictor since the cancellation rate differs amongst continent, with Africe with the greatest cancellation rate.
df_cancellation = hotel.copy()
df_cancellation['date_period'] = df_cancellation['reservation_status_date'].dt.to_period('M')
df_cancellation_percent = df_cancellation.groupby(['date_period', 'is_canceled', 'hotel'])['hotel'].count() \
.groupby(['date_period', 'hotel']).apply(lambda x: 100*x/x.sum()) \
.unstack(level='is_canceled') \
.rename(columns=str).reset_index().rename_axis(None, axis=1).rename(columns={'hotel': 'Hotel Type'})
df_cancellation_percent['date_period'] = df_cancellation_percent['date_period'].values.astype('datetime64[M]')
fig = px.line(df_cancellation_percent, x='date_period',
y='Yes', color='Hotel Type')
fig.update_traces(mode="markers+lines",
hovertemplate="Rate: %{y:.2f}%")
fig.update_layout(title='CANCELLATION RATE OVER TIME BY HOTEL TYPE',
xaxis_title='Cancellation Period',
yaxis_title='Cancellation Rate (%)',
hovermode='x',
template="seaborn",
xaxis=dict(tickformat="%b %Y"))
HTML(fig.to_html(include_plotlyjs='cdn'))
Predictive Power Score (PPS)
According to Florian Wetschoreck, PPS is an asymmetric and data-type-agnostic score that can detect linear or non-linear relationships between two columns. One column acts as an univariate predictor, whereas the other acts as the target variable. The score ranges from 0 (no predictive power) to 1 (perfect predictive power). It can be used as an alternative to the correlation matrix.
The PPS is calculated as follows:
$PPS = \dfrac{F1_{model} - F1_{naive}}{1 - F1_{naive}}$
where:
- $F1_{naive}$ is the weighted F1 score of a naive model that always predicts the most common class of the target column.
- $F1_{model}$ is the weighted F1 score of a classifier using
sklearn.DecisionTreeClassifier
.
Before we investigate the PPS, we do the following data preparation:
- Consider dayofyear instead of datetime for
booking_date
,reservation_status_date
, andarrival_date
. - Ignore
assigned_room_type
andreserved_room_type
because the levels are quite many, insteadis_assigned_as_reserved
will be considered. - Ignore
country
andcountry_name
because the levels are too many, insteadcontinent_name
will be considered. - Convert the categorical columns into dummy variables.
datetime_cols = ['booking_date', 'reservation_status_date', 'arrival_date']
for col in datetime_cols:
hotel[f"{col}_dayofyear"] = hotel[col].dt.dayofyear
ignore_cols = ['assigned_room_type', 'reserved_room_type', 'country', 'country_name']
hotel_pps_data = hotel.drop(datetime_cols + ignore_cols, axis=1)
We calculate PPS for each categorical and numerical predictors and then present the ranking in a bar chart.
pps_score = []
target = 'is_canceled'
for col in hotel_pps_data.columns:
if col == target:
continue
d = {}
d['feature'] = col
d['dtypes'] = hotel_pps_data[col].dtypes
d['pps'] = pps.score(hotel_pps_data, x=col, y=target)['ppscore']
pps_score.append(d)
hotel_pps = pd.DataFrame(pps_score).set_index('feature')
ax = hotel_pps[hotel_pps['dtypes'] == 'category'].sort_values('pps')[:-1]\
.plot(kind='barh', legend=False)
for rect in ax.patches:
height = rect.get_height()
width = rect.get_width()
ax.text(rect.get_x() + 1.01*width,
rect.get_y() + 0.75*height,
round(width, 2),
ha='left',
va='top',
fontsize=8)
ax.set_xlabel("PPS")
ax.set_ylabel("Predictor Variable")
plt.title("CATEGORICAL PREDICTORS PREDICTIVE POWER SCORE\n TARGET: is_canceled",
fontweight="bold")
plt.show()
reservation_status
for the modeling. The score is high because from the business perspective, this value is actually is_canceled
but breakdown into three specific categories. So we cannot use this as a predictor.
ax = hotel_pps[hotel_pps['dtypes'] != 'category'].sort_values('pps')\
.plot(kind='barh', legend=False)
for rect in ax.patches:
height = rect.get_height()
width = rect.get_width()
ax.text(rect.get_x() + width,
rect.get_y() + height/2,
round(width, 2),
ha='left',
va='center',
fontsize=8)
ax.set_xlabel("PPS")
ax.set_ylabel("Predictor Variable")
plt.title("NUMERICAL PREDICTORS PREDICTIVE POWER SCORE\n TARGET: is_canceled",
fontweight="bold")
plt.show()
reservation_status_date_dayofyear
the reason is the same as before when we ignore reservation_status
. We can ignore stay_in_week_nights
and stay_in_weekend_nights
because already explained by total_nights
. Also, adults
, children
, and babies
are explained by total_guest
.
ignore_cols_for_model = ['reservation_status', 'reservation_status_date_dayofyear',
'total_nights', 'total_guest']
hotel_model = hotel_pps_data.drop(columns=ignore_cols_for_model)
print(f"Dimension of data: {hotel_model.shape[0]} rows and {hotel_model.shape[1]} columns")
hotel_model_dummy = pd.get_dummies(hotel_model, drop_first=True)
X = hotel_model_dummy.drop(['is_canceled_Yes'], axis=1)
y = hotel_model_dummy['is_canceled_Yes']
X_train, X_test, y_train, y_test = train_test_split(
X, y, train_size=0.8, shuffle=True, random_state=333)
print(f"Predictors: {X_train.shape[1]}")
print(f"Training dataset: {y_train.shape[0]} observations")
print(f"Testing dataset: {y_test.shape[0]} observations")
model_default = RandomForestClassifier(
n_jobs=multiprocessing.cpu_count()-1, oob_score=True, random_state=333)
model_default.fit(X_train, y_train)
Then we evaluate the model_default
by printing out classification metrics such as:
- Accuracy: the proportion of observations that are classified correctly.
- Precision: the proportion of True Positives out of all observations predicted as positive.
- Recall: the proportion of actual positives that are classified correctly.
- F1-score: the harmonic mean of precision and recall.
def clfReport(y_true, y_pred):
report = classification_report(
y_true, y_pred,
output_dict=True, target_names=["Not Canceled", "Canceled"])
return pd.DataFrame(report).transpose()
def clfReport_default(model, X, y):
clfreport = clfReport(y, model.predict(X))
print(f'Accuracy: {clfreport.loc["accuracy", "support"]}')
return clfreport.iloc[:2, :-1]
clfReport_default(model_default, X_test, y_test)
From the report above, we conclude that model_default
bias towards "Not Canceled" (negative class) because the recall is much higher than "Canceled" (positive class). The cause of this may be because the target variable proportion is not equally balanced, the original ratio is 63:37. In the next section, we'll perform sampling to overcome this problem.
Balancing Data
We can do either upsampling or downsampling to balance the positive and negative class of is_canceled
into a 50:50 ratio.
- Upsampling is a method to randomly subsample the observation from minority class to make the dataset balanced.
- Downsampling is a method to randomly sampled (with replacement) the observation from the majority class to make the dataset balanced.
In this case, we prefer to do downsampling since we still have thousands of observations, which are sufficient for modeling. But we only do downsampling on the training dataset, whereas still retain the ratio of testing data.
def sampleData(data, target_col, method="down"):
freq = data[target_col].value_counts()
class_majority, class_minority = freq.idxmax(), freq.idxmin()
data_majority, data_minority = data[data[target_col] ==
class_majority], data[data[target_col] == class_minority]
if method == "down":
data_hold = data_minority
data_sample = data_majority
replacement = False
elif method == "up":
data_hold = data_majority
data_sample = data_minority
replacement = True
data_sampled = resample(data_sample,
n_samples=data_hold.shape[0],
replace=replacement,
random_state=333)
return pd.concat([data_sampled, data_hold])
hotel_model_train_down = sampleData(
pd.concat([X_train, y_train], axis=1), "is_canceled_Yes", "down")
X_train_down = hotel_model_train_down.drop(['is_canceled_Yes'], axis=1)
y_train_down = hotel_model_train_down['is_canceled_Yes']
Here is the proportion of target variable before and after downsampling.
pd.concat([
y_train.value_counts().rename("Before Sampling"),
y_train_down.value_counts().rename("After Sampling")],
axis=1).set_index(pd.Series(["Not Canceled", "Canceled"]))
model_default_down = RandomForestClassifier(
n_jobs=multiprocessing.cpu_count()-1, oob_score=True, random_state=333)
model_default_down.fit(X_train_down, y_train_down)
clfReport_default(model_default_down, X_test, y_test)
Good, now the recall gap is not too huge. Next, let's do train-test evaluation to check whether the model is a good fit or not.
def trainTestEval(model, pos_class, X_train, X_test, y_train, y_test):
report_train = clfReport(y_train, model.predict(X_train))
report_test = clfReport(y_test, model.predict(X_test))
df = pd.concat([report_train.loc[pos_class, :].rename("Train"),
report_test.loc[pos_class, :].rename("Test")], axis=1).transpose()
df.insert(loc=0, column="accuracy",
value=[
report_train.loc["accuracy", "support"],
report_test.loc["accuracy", "support"]])
df["observation"] = np.array([X_train.shape[0], X_test.shape[0]])
df["pos_proportion"] = df["support"]/df["observation"]
return df.drop(columns="support")
trainTestEval(model_default_down, "Canceled",
X_train_down, X_test, y_train_down, y_test)
model_default_down
is overfitted the data, since the performance on the training dataset is nearly perfect but not so good on the test dataset. To overcome this, we are going to tune the hyperparameter using Random Search Cross-Validation.
Random Search Cross-Validation
This technique narrows down our search by evaluating a wide range of values for each parameter. Using the sklearn
RandomizedSearchCV
method, we define a grid of hyperparameter ranges random_grid
and as the name suggests, it randomly samples from the grid, performing k-fold cross-validation with each combination of values.
random_grid = {
'n_estimators': [100, 200, 300],
'criterion': ['gini', 'entropy'],
'max_depth': [x for x in range(5, 100, 2)],
'max_features': ['sqrt', 'log2', None]
}
# number of possible combinations
mul = 1
for key in random_grid.keys():
mul *= len(random_grid[key])
print(f"Number of possible combinations: {mul}")
From the above number of possible combinations, we decided to build models by using only 200 combinations. Using 3-fold cross-validation means that each combination of parameters will be evaluated 3 times with a 67-33 train-test proportion. In this case, we want to maximize recall score on the testing set. Anyway, the code below takes a while to be executed, 2-3 hours to be specific, since we train them on CPU - maybe should consider using GPU.
model_base = RandomForestClassifier(warm_start=True,
n_jobs=multiprocessing.cpu_count()-1,
random_state=333)
rf_random = RandomizedSearchCV(estimator=model_base,
param_distributions=random_grid,
scoring='recall',
n_iter=250,
cv=3,
verbose=1,
n_jobs=multiprocessing.cpu_count()-1,
random_state=333)
rf_random.fit(X_train_down, y_train_down)
def saveModel(model, pickle_name):
with open(pickle_name, 'wb') as f:
pickle.dump(model, f)
saveModel(rf_random, "cache/rf_randomcv_500_down.pkl")
Let's take a look of the result from best to worst in terms of Recall and print one combination of the parameter which yield the best recall.
with open("cache/rf_randomcv_500_down.pkl", 'rb') as f:
rf_random = pickle.load(f)
pd.DataFrame(rf_random.cv_results_).sort_values('rank_test_score').head()
rf_random.best_params_
Just like we did on model_default
, let's evaluate whether the tuned model is still overfit or not:
model_tuned_metrics = trainTestEval(
rf_random, "Canceled", X_train_down, X_test, y_train_down, y_test)
model_tuned_metrics
Threshold Tuning
We can further tune model_tuned
by changing the threshold when classifying the probabilities to a class. So far, the model use default threshold which is 0.5, means that if a booking is predicted to have more than 50% chance of cancellation, then it is classified as Canceled, otherwise Not Canceled. We create tuningThresholdPlot
function to iteratively move the threshold from 0 to 1 and plot the classification metrics.
def tuningThresholdPlot(model, X, y):
tuning_list = []
y_pred_prob = model.predict_proba(X)[:, 1]
for threshold in np.linspace(0, 1, 101)[:-1]:
y_pred_class = y_pred_prob > threshold
report = clfReport(y, y_pred_class)
tuning_res = {"threshold": threshold,
"accuracy": report.loc["accuracy"][-1]}
tuning_res = {**tuning_res, **
report.loc["Canceled"].to_dict()} # append dicts
tuning_res.pop("support", None)
tuning_list.append(tuning_res)
tuning_df = pd.DataFrame(tuning_list)
# plotting
ax = tuning_df.plot(x="threshold", color="rgyb", lw=3, figsize=(10, 6))
# vertical line for center
diff = abs(tuning_df["recall"] - tuning_df["precision"])
thresh_center = tuning_df[diff == min(diff)]["threshold"].values[0]
ax.axvline(x=thresh_center, ls='--', color="k")
ax.text(x=thresh_center + 0.01, y=1,
s=f"CENTER", fontsize=12, color="k")
ax.set_xticks(list(ax.get_xticks()[1:-1]) + [thresh_center])
ax.legend(loc="upper center", bbox_to_anchor=(
0.5, -0.15), shadow=True, ncol=4)
ax.set_ylabel("Metric")
plt.title("TUNING THRESHOLD", size=15, fontweight="bold")
plt.show()
return tuning_df
tuning_thresh = tuningThresholdPlot(rf_random, X_test, y_test)
The vertical line labeled "CENTER" is when precision equal to recall. As mentioned earlier, we do care about the recall, but on second priority, we have to take into account about the precision too so that our model can correctly classify bookings that are not canceled but predicted as canceled (False Positives). We want to minimize this case in order to reduce the risk of overbooking. In conclusion, we will move the threshold to be less than the CENTER to achieve higher recall, but still maintaining a good precision. Let's compare the center to the default threshold, which is 0.5.
DEFAULT_THRESHOLD = 0.5
CENTER_THRESHOLD = 0.55
compare_thresh = tuning_thresh[tuning_thresh["threshold"].isin(
[DEFAULT_THRESHOLD, CENTER_THRESHOLD])]
compare_thresh
Let's calculate gain/loss of the metrics if we use CENTER_THRESHOLD
instead of DEFAULT_THRESHOLD
compare_thresh.iloc[1, :] - compare_thresh.iloc[0, :]
DEFAULT_THRESHOLD
instead of CENTER_THRESHOLD
in order to maintain recall value.
Receiver Operating Characteristic (ROC) Curve
It is a probability curve which plots True Positive Rate (Recall) against False Positive Rate. The area under the ROC curve, called as Area Under Curve (AUC), measures the degree of classification separability. Higher the AUC, better the model is at distinguishing between canceled and not canceled bookings.
- AUC near to 1 indicates the model has good measure of separability.
- AUC near to 0 means it has worst measure of separability.
- When AUC is 0.5, it means model has no class separation capacity whatsoever.
def rocauc(y_pred_list, y_test):
y_pred_list = [(np.zeros(len(y_test)), "BASELINE")] + y_pred_list
for y_pred, label in y_pred_list:
fpr, tpr, _ = roc_curve(y_test, y_pred)
auc = roc_auc_score(y_test, y_pred)
plt.plot(fpr, tpr, label=f"{label}\n(AUC: {auc:.3f})",
linestyle='--' if label == "BASELINE" else '-')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (Recall)")
plt.title("RECEIVER OPERATING CHARACTERISTIC (ROC) CURVE", fontweight="bold")
plt.legend(bbox_to_anchor=(1, 1))
plt.show()
rocauc([(model_default.predict(X_test), "DEFAULT"),
(rf_random.predict(X_test), "THRESHOLD = 0.5"),
(rf_random.predict_proba(X_test)[:, 1] > CENTER_THRESHOLD, f"THRESHOLD = {CENTER_THRESHOLD}")], y_test)
DEFAULT_THRESHOLD
or CENTER_THRESHOLD
is nearly the same, indicating the same measure of separability while we retain the recall performance if we use DEFAULT_THRESHOLD
.
Feature Importance
This refers to a technique that assign a score to features based on how useful they are at predicting a target variable. In this case, we want to know which feature is the most important in predicting hotel booking cancellation using sklearn
built-in attributes feature_importances_
.
def featureImportancesPlot(features, importances, n_features=20):
feature_imp_df = pd.DataFrame({'Feature': features,
'Importance': importances}).set_index('Feature')\
.sort_values('Importance').tail(n_features)
ax = feature_imp_df.plot(kind='barh', legend=False,
title=f"Top {n_features} Feature Importances")
ax.set_xlabel("Relative Importance")
plt.show()
return feature_imp_df.sort_values('Importance', ascending=False)
feature_imp = featureImportancesPlot(
X.columns, rf_random.best_estimator_.feature_importances_)
The top three features based on their importance are:
-
deposit_type_Non Refund
: the customer made a deposit to guarantee the booking in the value of the total stay cost -
lead_time
: number of days that elapsed between the entering date of the booking into the PMS and the arrival date -
adr
: average daily rate
Conclusion
We successfully visualized the hotel booking data using plotly
to gain insights about the data and modeling using Random Forest to predict the cancellation of the booking based on various predictors. Before modeling, we inspect the PPS score to do feature selection manually. Next, we tune the base model using RandomizedSearchCV
to prevent overfitting, followed by threshold tuning checking. In the end, we achieve a quite good metrics:
- accuracy = 85.6155%
- precision = 78.5129%
- recall = 84.1427%
- f1-score = 81.2304%
There are several things need to be improved:
- Tuning other parameters such as
ccp_alpha
to cut the tree with cost complexity pruning. - SearchCV using GPU instead of CPU, explore more about
cuML
packages. - Use another ensemble method such as XGBoost.