Introduction

In this post, we'll be using the Hotel Booking Demand dataset from Kaggle. This dataset contains booking information for a city hotel and a resort hotel, including when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things. We are going to visualize the data using plotly and predict whether a hotel booking will be canceled using a random forest model from sklearn.

Import Libraries

As usual, before we begin any analysis and modeling, let's import several necessary libraries to work with the data. There are two additional packages used in this notebook:

  • pycountry: contains mapping of ISO country, subdivision, language, currency and script definitions and their translations
  • ppscore: implementation of the Predictive Power Score (PPS)
import pandas as pd
import numpy as np

# visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.offline as py
import plotly.io as pio
import plotly.express as px
sns.set()
pio.templates.default = "plotly_white"
pio.renderers.default = "notebook"

# modeling
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV
from sklearn.metrics import classification_report, roc_curve, roc_auc_score
from sklearn.utils import resample
import multiprocessing
import pickle

# additional packages
import pycountry
import pycountry_convert as pc
import ppscore as pps

from IPython.display import Image, HTML
from tqdm.notebook import tqdm_notebook
import warnings
warnings.filterwarnings('ignore')

Data Wrangling

Before we jump into any visualization or modeling step, we have to make sure our data is ready.

Import Data

Load the downloaded csv and inspect the structure of the data.

hotel = pd.read_csv("data_input/hotel_bookings.csv")
hotel.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   hotel                           119390 non-null  object 
 1   is_canceled                     119390 non-null  int64  
 2   lead_time                       119390 non-null  int64  
 3   arrival_date_year               119390 non-null  int64  
 4   arrival_date_month              119390 non-null  object 
 5   arrival_date_week_number        119390 non-null  int64  
 6   arrival_date_day_of_month       119390 non-null  int64  
 7   stays_in_weekend_nights         119390 non-null  int64  
 8   stays_in_week_nights            119390 non-null  int64  
 9   adults                          119390 non-null  int64  
 10  children                        119386 non-null  float64
 11  babies                          119390 non-null  int64  
 12  meal                            119390 non-null  object 
 13  country                         118902 non-null  object 
 14  market_segment                  119390 non-null  object 
 15  distribution_channel            119390 non-null  object 
 16  is_repeated_guest               119390 non-null  int64  
 17  previous_cancellations          119390 non-null  int64  
 18  previous_bookings_not_canceled  119390 non-null  int64  
 19  reserved_room_type              119390 non-null  object 
 20  assigned_room_type              119390 non-null  object 
 21  booking_changes                 119390 non-null  int64  
 22  deposit_type                    119390 non-null  object 
 23  agent                           103050 non-null  float64
 24  company                         6797 non-null    float64
 25  days_in_waiting_list            119390 non-null  int64  
 26  customer_type                   119390 non-null  object 
 27  adr                             119390 non-null  float64
 28  required_car_parking_spaces     119390 non-null  int64  
 29  total_of_special_requests       119390 non-null  int64  
 30  reservation_status              119390 non-null  object 
 31  reservation_status_date         119390 non-null  object 
dtypes: float64(4), int64(16), object(12)
memory usage: 29.1+ MB

Our dataframe hotel contains 119390 rows of bookings and 32 columns, described as follows:

  • hotel: Hotel (H1 = Resort Hotel or H2 = City Hotel)
  • is_canceled: Value indicating if the booking was canceled (1) or not (0)
  • lead_time: Number of days that elapsed between the entering date of the booking into the PMS and the arrival date
  • arrival_date_year: Year of arrival date
  • arrival_date_month: Month of arrival date
  • arrival_date_week_number: Week number of year for arrival date
  • arrival_date_day_of_month: Day of arrival date
  • stays_in_weekend_nights: Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
  • stays_in_week_nights: Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
  • adults: Number of adults
  • children: Number of children
  • babies: Number of babies
  • meal: Type of meal booked. Categories are presented in standard hospitality meal packages:
    • Undefined/SC – no meal package;
    • BB – Bed & Breakfast;
    • HB – Half board (breakfast and one other meal – usually dinner);
    • FB – Full board (breakfast, lunch and dinner)
  • country: Country of origin. Categories are represented in the ISO 3166-3:2013 format
  • market_segment: Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”
  • distribution_channel: Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”
  • is_repeated_guest: Value indicating if the booking name was from a repeated guest (1) or not (0)
  • previous_cancellations: Number of previous bookings that were cancelled by the customer prior to the current booking
  • previous_bookings_not_canceled: Number of previous bookings not cancelled by the customer prior to the current booking
  • reserved_room_type: Code of room type reserved. Code is presented instead of designation for anonymity reasons.
  • assigned_room_type: Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons.
  • booking_changes: Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation
  • deposit_type: Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories:
    • No Deposit – no deposit was made;
    • Non Refund – a deposit was made in the value of the total stay cost;
    • Refundable – a deposit was made with a value under the total cost of stay.
  • agent: ID of the travel agency that made the booking
  • company: ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons
  • days_in_waiting_list: Number of days the booking was in the waiting list before it was confirmed to the customer
  • customer_type: Type of booking, assuming one of four categories:
    • Contract - when the booking has an allotment or other type of contract associated to it;
    • Group – when the booking is associated to a group;
    • Transient – when the booking is not part of a group or contract, and is not associated to other transient booking;
    • Transient-party – when the booking is transient, but is associated to at least other transient booking
  • adr: Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights
  • required_car_parking_spaces: Number of car parking spaces required by the customer
  • total_of_special_requests: Number of special requests made by the customer (e.g. twin bed or high floor)
  • reservation_status: Reservation last status, assuming one of three categories:
    • Canceled – booking was canceled by the customer;
    • Check-Out – customer has checked in but already departed;
    • No-Show – customer did not check-in and did not inform the hotel of the reason why
  • reservation_status_date: Date at which the last status was set. This variable can be used in conjunction with reservation_status to understand when the booking was canceled or when the customer checked out of the hotel.

Missing Values

Check if there are any missing values in hotel.

hotel.isna().sum().sort_values(ascending=False).head(10)
company                     112593
agent                        16340
country                        488
children                         4
lead_time                        0
arrival_date_year                0
arrival_date_month               0
arrival_date_week_number         0
is_canceled                      0
market_segment                   0
dtype: int64

There are four columns with missing values. Here's how we handle them:

  • Drop columns agent and company, since they have too many missing values and we won't use them for prediction
  • Create an "UNKNOWN" category for country
  • Fill the missing children values with 0

hotel.drop(columns=['agent', 'company'], inplace=True)
hotel['country'].fillna("UNKNOWN", inplace=True)
hotel['children'].fillna(0, inplace=True)

Check whether there are any remaining missing values.

hotel.isna().values.any()
False

Data Type Conversion

Categorical

We convert object columns to the category data type to save memory. We also map the boolean columns is_canceled and is_repeated_guest into categories for readability.

# list of columns to be cast
category_cols = ['hotel', 'meal', 'country', 'market_segment', 'distribution_channel',
                 'reserved_room_type', 'assigned_room_type', 'deposit_type', 'customer_type', 'reservation_status']
boolean_cols = ['is_canceled', 'is_repeated_guest']

# map values
boolean_map = {0: 'No', 1: 'Yes'}
hotel['is_canceled'] = hotel['is_canceled'].map(boolean_map)
hotel['is_repeated_guest'] = hotel['is_repeated_guest'].map(boolean_map)

# re-order categories for readability
hotel[category_cols + boolean_cols] = hotel[category_cols +
                                            boolean_cols].astype('category')
hotel['is_canceled'] = hotel['is_canceled'].cat.reorder_categories(
    list(boolean_map.values()))
hotel['is_repeated_guest'] = hotel['is_repeated_guest'].cat.reorder_categories(
    list(boolean_map.values()))
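
To sanity-check the memory savings, we could compare the same columns stored as object versus category (an illustrative snippet, not part of the original notebook):

# category stores each unique level once plus small integer codes,
# so columns of repeated strings shrink dramatically
obj_mem = hotel[category_cols].astype('object').memory_usage(deep=True).sum()
cat_mem = hotel[category_cols].memory_usage(deep=True).sum()
print(f"object: {obj_mem / 1e6:.1f} MB vs category: {cat_mem / 1e6:.1f} MB")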

Numerical

Convert children from float to integer.

# confirm every value is a whole number before casting
hotel['children'].apply(float.is_integer).all()
hotel['children'] = hotel['children'].astype('int')

Datetime

Convert reservation_status_date to datetime.

hotel['reservation_status_date'] = pd.to_datetime(hotel['reservation_status_date'])

Feature Engineering

Feature engineering is the process of using domain knowledge to extract features from provided raw data. These features can be used to improve the performance of machine learning models.

Room Type Assignment

Instead of considering each assigned and reserved room type separately, we create a new column is_assigned_as_reserved that flags whether the customer got their expected room type.

hotel['reserved_room_type'] = hotel['reserved_room_type'].cat.set_categories(hotel['assigned_room_type'].cat.categories)
hotel['is_assigned_as_reserved'] = (hotel['assigned_room_type'] == hotel['reserved_room_type']).astype('category')
hotel[['assigned_room_type', 'reserved_room_type', 'is_assigned_as_reserved']].head()
assigned_room_type reserved_room_type is_assigned_as_reserved
0 C C True
1 C C True
2 C A False
3 A A True
4 A A True

Arrival Date

Combine arrival_date_year, arrival_date_month, arrival_date_day_of_month into one column arrival_date so that we can extract more information from the date.

arrival_date_cols = ['arrival_date_year', 'arrival_date_month', 'arrival_date_day_of_month']
hotel[arrival_date_cols] = hotel[arrival_date_cols].astype(str)
hotel['arrival_date'] = pd.to_datetime(hotel[arrival_date_cols].apply('-'.join, axis=1), format="%Y-%B-%d")
hotel.drop(columns = arrival_date_cols + ['arrival_date_week_number'], inplace=True)
hotel[['arrival_date']].head()
arrival_date
0 2015-07-01
1 2015-07-01
2 2015-07-01
3 2015-07-01
4 2015-07-01

Booking Date

Create booking_date by subtracting lead_time days from arrival_date.

hotel['booking_date'] = hotel['arrival_date'] - pd.to_timedelta(hotel['lead_time'], unit='days')
hotel[['booking_date', 'arrival_date', 'lead_time']].head()
booking_date arrival_date lead_time
0 2014-07-24 2015-07-01 342
1 2013-06-24 2015-07-01 737
2 2015-06-24 2015-07-01 7
3 2015-06-18 2015-07-01 13
4 2015-06-17 2015-07-01 14

Country and Continent Name

The column country contains country codes in the ISO 3166-3:2013 format. Using the code-to-name mapping provided by the pycountry package, we can derive country_name and continent_name.

def convertCountryCode2Name(code):
    additional_code2name = {'TMP': 'East Timor'}
    country_name = None
    try:
        if len(code) == 2:
            country_name = pycountry.countries.get(alpha_2=code).name
        elif len(code) == 3:
            country_name = pycountry.countries.get(alpha_3=code).name
    except:
        if code in additional_code2name.keys():
            country_name = additional_code2name[code]
    return country_name if country_name is not None else code


def convertCountryName2Continent(country_name):
    additional_name2continent = {
        'East Timor': 'Asia',
        'United States Minor Outlying Islands': 'North America',
        'French Southern Territories': 'Antarctica',
        'Antarctica': 'Antarctica'}
    continent_name = None
    try:
        alpha2 = pc.country_name_to_country_alpha2(country_name)
        continent_code = pc.country_alpha2_to_continent_code(alpha2)
        continent_name = pc.convert_continent_code_to_continent_name(continent_code)
    except:
        if country_name in additional_name2continent.keys():
            continent_name = additional_name2continent[country_name]
        else:
            continent_name = "UNKNOWN"
    return continent_name if continent_name is not None else country_name


hotel['country_name'] = hotel['country'].apply(
    convertCountryCode2Name).astype('category')
hotel['continent_name'] = hotel['country_name'].apply(
    convertCountryName2Continent).astype('category')
hotel[['country', 'country_name', 'continent_name']].head()
country country_name continent_name
0 PRT Portugal Europe
1 PRT Portugal Europe
2 GBR United Kingdom Europe
3 GBR United Kingdom Europe
4 GBR United Kingdom Europe

Suspicious Bookings

There are some hidden anomalies present in the hotel bookings. Let's create new variables and plot their frequencies:

  • total_guest: Total number of adults, children, and babies
  • total_nights: Number of nights the guest stayed at the hotel, sum of stays_in_weekend_nights and stays_in_week_nights

hotel['total_guest'] = hotel[['adults', 'children', 'babies']].sum(axis=1)
hotel['total_nights'] = hotel[['stays_in_weekend_nights', 'stays_in_week_nights']].sum(axis=1)

data2plot = [hotel['total_guest'].value_counts().sort_index(ascending=False),
             hotel['total_nights'].value_counts().sort_index(ascending=False)[-21:]]

ylabs = ["Total Guest per Booking (Person)", "Total Nights per Booking"]
titles = ["FREQUENCY OF TOTAL GUEST PER BOOKING\n",
          "FREQUENCY OF TOTAL NIGHTS PER BOOKING\n(UP TO 20 NIGHTS ONLY)"]

fig, axes = plt.subplots(1, 2, figsize=(15, 5))
for ax, data, ylab, title in zip(axes, data2plot, ylabs, titles):
    bp = data.plot(kind='barh', rot=0, ax=ax)
    for rect in bp.patches:
        height = rect.get_height()
        width = rect.get_width()
        bp.text(rect.get_x() + width,
                rect.get_y() + height/2,
                int(width),
                ha='left',
                va='center',
                fontsize=8)
    bp.set_xlabel("Frequency")
    bp.set_ylabel(ylab)
    ax.set_title(title, fontweight="bold")

There are 180 bookings without any guests (total_guest = 0) and 715 bookings with zero nights of stay at the hotel (total_nights = 0). Ideally, such cases should not occur in our bookings data. Therefore, from this point onwards we will ignore observations with either condition, since they can affect our modeling outcome.

hotel = hotel[(hotel['total_guest'] != 0) & (hotel['total_nights'] != 0)]
hotel.shape
(118565, 33)

We end up with 118565 rows of bookings, down from the original 119390.

Exploratory Data Analysis (EDA)

Before we jump into the modeling step, it is good practice to visualize the data to understand it better.

What is the proportion of booking cancellations by reservation status?

df_cancel_status = pd.crosstab(index=hotel.is_canceled,
                               columns=hotel.reservation_status,
                               margins=True)

ax = df_cancel_status.iloc[:-1, :-1].plot(kind='bar', stacked=True, rot=0)
for rect in ax.patches:
    height = rect.get_height()
    width = rect.get_width()
    if height != 0:
        ax.text(rect.get_x() + width,
                rect.get_y() + height/2,
                int(height),
                ha='left',
                va='center',
                color="black",
                fontsize=10)

handles, labels = ax.get_legend_handles_labels()
ax.legend(handles=handles, labels=labels, title="Reservation Status")

percent_no = (100*df_cancel_status /
              df_cancel_status.iloc[-1, -1]).loc["No", "All"]
ax.set_xticklabels(["No\n({:.2f} %)".format(
    percent_no), "Yes\n({:.2f} %)".format(100-percent_no)])
ax.set_xlabel("Canceled?")
ax.set_ylabel("Number of Bookings")
plt.title("BOOKING CANCELLATION PROPORTION", fontweight="bold")
plt.show()

The proportion of the target variable is_canceled is somewhat balanced. 37.26% of the bookings are canceled, divided into two cases:

  • Canceled: Booking was canceled by the customer, or
  • No-show: Customer did not check-in and did not inform the hotel of the reason why.

Important: When building the model, we have to remove leaky variables - predictors that would not be available at prediction time. In this case, reservation_status is a variable that is only known after the target variable is_canceled is known.
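
To make this concrete, here is a minimal sketch of the idea; the actual column drop happens later in the Feature Selection section:

# sketch only: columns fixed after the outcome is known must not be predictors
leaky_cols = ['reservation_status', 'reservation_status_date']
candidate_predictors = hotel.drop(columns=leaky_cols + ['is_canceled'])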

Where do most of the bookings happen?

df_choropleth = hotel.copy()
df_choropleth['booking_date_year'] = df_choropleth['booking_date'].dt.year
df_country_year_count = df_choropleth.groupby(['country', 'booking_date_year']).count()['hotel'].fillna(0).reset_index() \
    .rename(columns={'country': 'country_code', 'booking_date_year': 'year', 'hotel': 'count'})
df_country_year_count['country_name'] = df_country_year_count['country_code'].apply(
    convertCountryCode2Name)
df_country_year_count['count'] = df_country_year_count['count'].astype('int')

fig = px.choropleth(df_country_year_count[df_country_year_count["year"] != 2013],
                    locations="country_code", color="count", animation_frame="year",
                    hover_name="country_name",
                    range_color=(0, 5000),
                    color_continuous_scale=px.colors.sequential.Reds,
                    projection="natural earth")
fig.update_layout(title='ANNUAL HOTEL BOOKING COUNTS',
                  template="seaborn")

HTML(fig.to_html(include_plotlyjs='cdn'))

Note: From the choropleth map, we can see that Europe is the continent with the most hotel bookings. The specific country with the most bookings is Portugal (PRT).

Which continent has the greatest cancellation rate?

From the previous section, we know Europe is the continent with the most bookings, but how about the cancellation rate?

ax = pd.crosstab(index=hotel['continent_name'],
                 columns=hotel['is_canceled'],
                 margins=True).sort_values('All').iloc[:-1, :-1].plot(kind='barh')
ax.legend(bbox_to_anchor=(1, 1), title="Canceled?")
ax.set_xlabel("Number of Bookings")
ax.set_ylabel("Continent Name")
ax.set_title("BOOKINGS BY EACH CONTINENT", fontweight="bold")
plt.show()

ax = (pd.crosstab(index=hotel['continent_name'],
                  columns=hotel['is_canceled'],
                  normalize='index').sort_values('Yes') * 100).plot(kind='barh', stacked=True)
ax.legend(bbox_to_anchor=(1, 1), title="Canceled?")
ax.set_xlabel("Percentage of Bookings")
ax.set_ylabel("Continent Name")
ax.set_title("PERCENTAGE OF BOOKINGS CANCELLATION BY EACH CONTINENT", fontweight="bold")
plt.show()

Note: It turns out that continent_name may be useful as a predictor, since the cancellation rate differs among continents, with Africa having the greatest cancellation rate.

How is the cancellation rate over time?

df_cancellation = hotel.copy()
df_cancellation['date_period'] = df_cancellation['reservation_status_date'].dt.to_period('M')
df_cancellation_percent = df_cancellation.groupby(['date_period', 'is_canceled', 'hotel'])['hotel'].count() \
    .groupby(['date_period', 'hotel']).apply(lambda x: 100*x/x.sum()) \
    .unstack(level='is_canceled') \
    .rename(columns=str).reset_index().rename_axis(None, axis=1).rename(columns={'hotel': 'Hotel Type'})
df_cancellation_percent['date_period'] = df_cancellation_percent['date_period'].values.astype('datetime64[M]')

fig = px.line(df_cancellation_percent, x='date_period',
              y='Yes', color='Hotel Type')
fig.update_traces(mode="markers+lines",
                  hovertemplate="Rate: %{y:.2f}%")
fig.update_layout(title='CANCELLATION RATE OVER TIME BY HOTEL TYPE',
                  xaxis_title='Cancellation Period',
                  yaxis_title='Cancellation Rate (%)',
                  hovermode='x',
                  template="seaborn",
                  xaxis=dict(tickformat="%b %Y"))
                  
HTML(fig.to_html(include_plotlyjs='cdn'))

Note: Most of the time, the City hotel has a greater cancellation rate than the Resort hotel. The good news is that both rates decrease towards zero in mid-2017, meaning almost none of the bookings were canceled. But we have to be prepared for the beginning of 2018, since in each of the past three years the cancellation rate peaked in January. Therefore, in the next section, we will use a machine learning model to predict whether a booking will be canceled by the customer, based on the predictor variables present in our data.

Predictive Power Score (PPS)

According to Florian Wetschoreck, PPS is an asymmetric, data-type-agnostic score that can detect linear or non-linear relationships between two columns. One column acts as a univariate predictor, while the other acts as the target variable. The score ranges from 0 (no predictive power) to 1 (perfect predictive power), and it can be used as an alternative to the correlation matrix.

The PPS is calculated as follows:

$PPS = \dfrac{F1_{model} - F1_{naive}}{1 - F1_{naive}}$

where:

  • $F1_{naive}$ is the weighted F1 score of a naive model that always predicts the most common class of the target column.
  • $F1_{model}$ is the weighted F1 score of a classifier using sklearn's DecisionTreeClassifier.
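
For a quick sanity check of the formula with made-up numbers: if the naive baseline reaches a weighted F1 of 0.70 and the decision tree reaches 0.85, the score lands halfway between naive and perfect:

# hypothetical F1 scores, for illustration only
f1_naive, f1_model = 0.70, 0.85
pps_example = (f1_model - f1_naive) / (1 - f1_naive)
print(pps_example)  # 0.5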

Note: Detailed explanation of Predictive Power Score is available on GitHub.

Before we investigate the PPS, we do the following data preparation:

  • Consider dayofyear instead of the full datetime for booking_date, reservation_status_date, and arrival_date.
  • Ignore assigned_room_type and reserved_room_type because they have quite many levels; is_assigned_as_reserved will be considered instead.
  • Ignore country and country_name because they have too many levels; continent_name will be considered instead.
  • Convert the categorical columns into dummy variables.

datetime_cols = ['booking_date', 'reservation_status_date', 'arrival_date']
for col in datetime_cols:
    hotel[f"{col}_dayofyear"] = hotel[col].dt.dayofyear

ignore_cols = ['assigned_room_type', 'reserved_room_type', 'country', 'country_name']
hotel_pps_data = hotel.drop(datetime_cols + ignore_cols, axis=1)

We calculate the PPS for each categorical and numerical predictor and then present the rankings in bar charts.

pps_score = []
target = 'is_canceled'
for col in hotel_pps_data.columns:
    if col == target:
        continue
    d = {}
    d['feature'] = col
    d['dtypes'] = hotel_pps_data[col].dtypes
    d['pps'] = pps.score(hotel_pps_data, x=col, y=target)['ppscore']
    pps_score.append(d)

hotel_pps = pd.DataFrame(pps_score).set_index('feature')

ax = hotel_pps[hotel_pps['dtypes'] == 'category'].sort_values('pps')[:-1]\
    .plot(kind='barh', legend=False)
for rect in ax.patches:
    height = rect.get_height()
    width = rect.get_width()
    ax.text(rect.get_x() + 1.01*width,
            rect.get_y() + 0.75*height,
            round(width, 2),
            ha='left',
            va='top',
            fontsize=8)
ax.set_xlabel("PPS")
ax.set_ylabel("Predictor Variable")
plt.title("CATEGORICAL PREDICTORS PREDICTIVE POWER SCORE\n TARGET: is_canceled",
          fontweight="bold")
plt.show()

Note: From the PPS of the categorical variables, we have to ignore reservation_status for the modeling. Its score is high because, from the business perspective, this value is essentially is_canceled broken down into three specific categories, so we cannot use it as a predictor.

ax = hotel_pps[hotel_pps['dtypes'] != 'category'].sort_values('pps')\
    .plot(kind='barh', legend=False)
for rect in ax.patches:
    height = rect.get_height()
    width = rect.get_width()
    ax.text(rect.get_x() + width,
            rect.get_y() + height/2,
            round(width, 2),
            ha='left',
            va='center',
            fontsize=8)
ax.set_xlabel("PPS")
ax.set_ylabel("Predictor Variable")
plt.title("NUMERICAL PREDICTORS PREDICTIVE POWER SCORE\n TARGET: is_canceled",
          fontweight="bold")
plt.show()

Note: From the PPS of the numerical variables, we have to ignore reservation_status_date_dayofyear for the same reason we ignored reservation_status. We will also drop total_guest and total_nights to avoid redundancy, since they are simply the sums of adults + children + babies and stays_in_weekend_nights + stays_in_week_nights, respectively.

Modeling and Evaluation

Now, let's proceed to the modeling step using sklearn's RandomForestClassifier.

Feature Selection

The PPS scores from the previous section help us identify which columns to ignore. We are going to drop four columns before fitting the data to the Random Forest model.

ignore_cols_for_model = ['reservation_status', 'reservation_status_date_dayofyear',
                         'total_nights', 'total_guest']
hotel_model = hotel_pps_data.drop(columns=ignore_cols_for_model)
print(f"Dimension of data: {hotel_model.shape[0]} rows and {hotel_model.shape[1]} columns")
Dimension of data: 118565 rows and 25 columns

Train-test split

We split our dataset into an 80% training set and a 20% test set for model evaluation.

hotel_model_dummy = pd.get_dummies(hotel_model, drop_first=True)
X = hotel_model_dummy.drop(['is_canceled_Yes'], axis=1)
y = hotel_model_dummy['is_canceled_Yes']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, shuffle=True, random_state=333)
print(f"Predictors: {X_train.shape[1]}")
print(f"Training dataset: {y_train.shape[0]} observations")
print(f"Testing dataset: {y_test.shape[0]} observations")
Predictors: 45
Training dataset: 94852 observations
Testing dataset: 23713 observations

Default Model

First, we fit the training dataset with RandomForestClassifier without tuning any parameters, to establish the starting performance of our model as a benchmark.

model_default = RandomForestClassifier(
    n_jobs=multiprocessing.cpu_count()-1, oob_score=True, random_state=333)
model_default.fit(X_train, y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=7,
                       oob_score=True, random_state=333, verbose=0,
                       warm_start=False)

Then we evaluate model_default by printing out the following classification metrics:

  • Accuracy: the proportion of observations that are classified correctly.
  • Precision: the proportion of True Positives out of all observations predicted as positive.
  • Recall: the proportion of actual positives that are classified correctly.
  • F1-score: the harmonic mean of precision and recall.
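
To make these definitions concrete, here is the arithmetic on a small hypothetical confusion matrix, treating "Canceled" as the positive class (the counts are made up for illustration):

tp, fp, fn, tn = 80, 20, 40, 160                    # hypothetical counts
accuracy = (tp + tn) / (tp + fp + fn + tn)          # 240/300 = 0.80
precision = tp / (tp + fp)                          # 80/100 = 0.80
recall = tp / (tp + fn)                             # 80/120 ~ 0.67
f1 = 2 * precision * recall / (precision + recall)  # ~ 0.73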

def clfReport(y_true, y_pred):
    report = classification_report(
        y_true, y_pred,
        output_dict=True, target_names=["Not Canceled", "Canceled"])
    return pd.DataFrame(report).transpose()


def clfReport_default(model, X, y):
    clfreport = clfReport(y, model.predict(X))
    print(f'Accuracy: {clfreport.loc["accuracy", "support"]}')
    return clfreport.iloc[:2, :-1]
clfReport_default(model_default, X_test, y_test)
Accuracy: 0.8594441867330156
precision recall f1-score
Not Canceled 0.858404 0.930393 0.892950
Canceled 0.861684 0.738600 0.795409

From the report above, we conclude that model_default is biased towards "Not Canceled" (the negative class), because its recall is much higher than that of "Canceled" (the positive class). A likely cause is that the target classes are imbalanced; the original ratio is 63:37. In the next section, we'll perform sampling to overcome this problem.

Note: Let's focus on recall because, in this case, we want to minimize False Negatives, where a canceled booking is predicted as not canceled. Suppose the marketing team of this hotel only offers follow-up promotions to customers who are predicted to cancel. If the recall is too low, then potential customers who will cancel their bookings won't be contacted.
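
As an aside, resampling is not the only remedy for class imbalance: RandomForestClassifier also accepts a class_weight parameter that reweights errors instead of discarding or duplicating rows. A minimal sketch of that alternative (not the approach taken below):

# reweight classes inversely proportional to their frequencies
model_weighted = RandomForestClassifier(
    class_weight='balanced', n_jobs=-1, random_state=333)
model_weighted.fit(X_train, y_train)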

Balancing Data

We can do either upsampling or downsampling to balance the positive and negative classes of is_canceled to a 50:50 ratio.

  • Upsampling is a method to randomly resample (with replacement) observations from the minority class to make the dataset balanced.
  • Downsampling is a method to randomly subsample observations from the majority class to make the dataset balanced.

In this case, we prefer downsampling since we still have tens of thousands of observations, which are sufficient for modeling. Note that we only downsample the training dataset, while the testing dataset retains its original ratio.

def sampleData(data, target_col, method="down"):
    freq = data[target_col].value_counts()
    class_majority, class_minority = freq.idxmax(), freq.idxmin()
    data_majority, data_minority = data[data[target_col] ==
                                        class_majority], data[data[target_col] == class_minority]
    if method == "down":
        data_hold = data_minority
        data_sample = data_majority
        replacement = False
    elif method == "up":
        data_hold = data_majority
        data_sample = data_minority
        replacement = True

    data_sampled = resample(data_sample,
                            n_samples=data_hold.shape[0],
                            replace=replacement,
                            random_state=333)
    return pd.concat([data_sampled, data_hold])
hotel_model_train_down = sampleData(
    pd.concat([X_train, y_train], axis=1), "is_canceled_Yes", "down")
X_train_down = hotel_model_train_down.drop(['is_canceled_Yes'], axis=1)
y_train_down = hotel_model_train_down['is_canceled_Yes']

Here is the proportion of the target variable before and after downsampling.

pd.concat([
    y_train.value_counts().rename("Before Sampling"),
    y_train_down.value_counts().rename("After Sampling")],
    axis=1).set_index(pd.Series(["Not Canceled", "Canceled"]))
Before Sampling After Sampling
Not Canceled 59448 35404
Canceled 35404 35404

Default Model (with Downsampled Data)

We fit the downsampled train dataset again into RandomForestClassifier and compare it with model_default.

model_default_down = RandomForestClassifier(
    n_jobs=multiprocessing.cpu_count()-1, oob_score=True, random_state=333)
model_default_down.fit(X_train_down, y_train_down)
clfReport_default(model_default_down, X_test, y_test)
Accuracy: 0.8525703200775946
precision recall f1-score
Not Canceled 0.886786 0.878121 0.882432
Canceled 0.795806 0.809052 0.802374

Good, now the recall gap is much smaller. Next, let's run a train-test evaluation to check whether the model is a good fit.

def trainTestEval(model, pos_class, X_train, X_test, y_train, y_test):
    report_train = clfReport(y_train, model.predict(X_train))
    report_test = clfReport(y_test, model.predict(X_test))
    df = pd.concat([report_train.loc[pos_class, :].rename("Train"),
                    report_test.loc[pos_class, :].rename("Test")], axis=1).transpose()
    df.insert(loc=0, column="accuracy",
              value=[
                  report_train.loc["accuracy", "support"],
                  report_test.loc["accuracy", "support"]])
    df["observation"] = np.array([X_train.shape[0], X_test.shape[0]])
    df["pos_proportion"] = df["support"]/df["observation"]
    return df.drop(columns="support")
trainTestEval(model_default_down, "Canceled",
              X_train_down, X_test, y_train_down, y_test)
accuracy precision recall f1-score observation pos_proportion
Train 0.983801 0.989008 0.978477 0.983715 70808 0.500000
Test 0.852570 0.795806 0.809052 0.802374 23713 0.369924

Note: From the evaluation report above, we can conclude that model_default_down overfits the data, since its performance on the training dataset is nearly perfect but noticeably worse on the test dataset. To overcome this, we are going to tune the hyperparameters using Random Search Cross-Validation.
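
We could also quantify the generalization gap with k-fold cross-validation on the downsampled training set, using the cross_val_score imported earlier (an illustrative sketch, not run in the original notebook):

# mean validation recall across 3 folds approximates out-of-sample performance
cv_recall = cross_val_score(RandomForestClassifier(random_state=333),
                            X_train_down, y_train_down,
                            cv=3, scoring='recall', n_jobs=-1)
print(cv_recall.mean(), cv_recall.std())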

Random Search Cross-Validation

This technique narrows down our search by evaluating a wide range of values for each parameter. Using sklearn's RandomizedSearchCV, we define a grid of hyperparameter ranges in random_grid and, as the name suggests, the search randomly samples combinations from the grid, performing k-fold cross-validation on each.

random_grid = {
    'n_estimators': [100, 200, 300],
    'criterion': ['gini', 'entropy'],
    'max_depth': [x for x in range(5, 100, 2)],
    'max_features': ['sqrt', 'log2', None]
}

# number of possible combinations
mul = 1
for key in random_grid.keys():
    mul *= len(random_grid[key])
print(f"Number of possible combinations: {mul}")
Number of possible combinations: 864

From the number of possible combinations above, we decided to build models using only 250 sampled combinations. Using 3-fold cross-validation means that each combination of parameters is evaluated 3 times with a 67-33 train-validation proportion. In this case, we want to maximize the recall score on the validation folds. Be warned that the code below takes a while to execute, 2-3 hours to be specific, since we train on CPU; a GPU implementation would be worth considering.

model_base = RandomForestClassifier(warm_start=True,
                                    n_jobs=multiprocessing.cpu_count()-1,
                                    random_state=333)

rf_random = RandomizedSearchCV(estimator=model_base,
                               param_distributions=random_grid,
                               scoring='recall',
                               n_iter=250,
                               cv=3,
                               verbose=1,
                               n_jobs=multiprocessing.cpu_count()-1,
                               random_state=333)

rf_random.fit(X_train_down, y_train_down)
Fitting 3 folds for each of 250 candidates, totalling 750 fits
[Parallel(n_jobs=7)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:  7.1min
[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed: 30.5min
[Parallel(n_jobs=7)]: Done 436 tasks      | elapsed: 67.4min
[Parallel(n_jobs=7)]: Done 750 out of 750 | elapsed: 132.6min finished
RandomizedSearchCV(cv=3, error_score='raise-deprecating',
                   estimator=RandomForestClassifier(bootstrap=True,
                                                    class_weight=None,
                                                    criterion='gini',
                                                    max_depth=None,
                                                    max_features='auto',
                                                    max_leaf_nodes=None,
                                                    min_impurity_decrease=0.0,
                                                    min_impurity_split=None,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    min_weight_fraction_leaf=0.0,
                                                    n_estimators='warn',
                                                    n_jobs=7, oob_score...
                   iid='warn', n_iter=250, n_jobs=7,
                   param_distributions={'criterion': ['gini', 'entropy'],
                                        'max_depth': [5, 7, 9, 11, 13, 15, 17,
                                                      19, 21, 23, 25, 27, 29,
                                                      31, 33, 35, 37, 39, 41,
                                                      43, 45, 47, 49, 51, 53,
                                                      55, 57, 59, 61, 63, ...],
                                        'max_features': ['sqrt', 'log2', None],
                                        'n_estimators': [100, 200, 300]},
                   pre_dispatch='2*n_jobs', random_state=333, refit=True,
                   return_train_score=False, scoring='recall', verbose=1)

def saveModel(model, pickle_name):
    with open(pickle_name, 'wb') as f:
        pickle.dump(model, f)


saveModel(rf_random, "cache/rf_randomcv_500_down.pkl")

Let's take a look at the results, sorted from best to worst in terms of recall, and print the parameter combination which yields the best recall.

with open("cache/rf_randomcv_500_down.pkl", 'rb') as f:
    rf_random = pickle.load(f)
pd.DataFrame(rf_random.cv_results_).sort_values('rank_test_score').head()
mean_fit_time std_fit_time mean_score_time std_score_time param_n_estimators param_max_features param_max_depth param_criterion params split0_test_score split1_test_score split2_test_score mean_test_score std_test_score rank_test_score
100 164.374928 0.922007 4.703054 0.291531 300 None 25 entropy {'n_estimators': 300, 'max_features': None, 'm... 0.840451 0.835183 0.833743 0.836459 0.002883 1
130 73.878549 6.213109 5.713560 0.240528 100 None 25 entropy {'n_estimators': 100, 'max_features': None, 'm... 0.839519 0.834167 0.833828 0.835838 0.002607 2
21 126.615590 2.122243 5.434928 0.828088 200 None 33 entropy {'n_estimators': 200, 'max_features': None, 'm... 0.838756 0.831201 0.832896 0.834284 0.003237 3
143 76.089374 0.373065 4.953523 0.353314 100 None 31 entropy {'n_estimators': 100, 'max_features': None, 'm... 0.839095 0.831201 0.831624 0.833974 0.003626 4
215 51.189257 0.704618 3.211668 0.305625 100 None 23 gini {'n_estimators': 100, 'max_features': None, 'm... 0.837231 0.833912 0.830523 0.833889 0.002739 5
rf_random.best_params_
{'n_estimators': 300,
 'max_features': None,
 'max_depth': 25,
 'criterion': 'entropy'}

Just like we did with model_default, let's evaluate whether the tuned model still overfits:

model_tuned_metrics = trainTestEval(
    rf_random, "Canceled", X_train_down, X_test, y_train_down, y_test)
model_tuned_metrics
accuracy precision recall f1-score observation pos_proportion
Train 0.974494 0.967548 0.981923 0.974682 70808 0.500000
Test 0.856155 0.785129 0.841427 0.812304 23713 0.369924

Threshold Tuning

We can further tune the model by changing the threshold used when classifying probabilities into classes. So far, the model uses the default threshold of 0.5, meaning that if a booking is predicted to have more than a 50% chance of cancellation, it is classified as Canceled, otherwise Not Canceled. We create a tuningThresholdPlot function to iteratively move the threshold from 0 to 1 and plot the classification metrics.

def tuningThresholdPlot(model, X, y):
    tuning_list = []
    y_pred_prob = model.predict_proba(X)[:, 1]
    for threshold in np.linspace(0, 1, 101)[:-1]:
        y_pred_class = y_pred_prob > threshold
        report = clfReport(y, y_pred_class)
        tuning_res = {"threshold": threshold,
                      "accuracy": report.loc["accuracy"][-1]}
        tuning_res = {**tuning_res, **
                      report.loc["Canceled"].to_dict()}  # append dicts
        tuning_res.pop("support", None)
        tuning_list.append(tuning_res)
    tuning_df = pd.DataFrame(tuning_list)

    # plotting
    ax = tuning_df.plot(x="threshold", color="rgyb", lw=3, figsize=(10, 6))

    # vertical line for center
    diff = abs(tuning_df["recall"] - tuning_df["precision"])
    thresh_center = tuning_df[diff == min(diff)]["threshold"].values[0]
    ax.axvline(x=thresh_center, ls='--', color="k")
    ax.text(x=thresh_center + 0.01, y=1,
            s=f"CENTER", fontsize=12, color="k")

    ax.set_xticks(list(ax.get_xticks()[1:-1]) + [thresh_center])
    ax.legend(loc="upper center", bbox_to_anchor=(
        0.5, -0.15), shadow=True, ncol=4)
    ax.set_ylabel("Metric")
    plt.title("TUNING THRESHOLD", size=15, fontweight="bold")
    plt.show()

    return tuning_df


tuning_thresh = tuningThresholdPlot(rf_random, X_test, y_test)

The vertical line labeled "CENTER" is where precision equals recall. As mentioned earlier, we care most about recall, but as a second priority we also have to take precision into account, so that the model doesn't classify too many bookings as canceled when they are not (False Positives). We want to limit these cases in order to reduce the risk of overbooking. In conclusion, we will move the threshold below the CENTER to achieve higher recall while still maintaining good precision. Let's compare the center to the default threshold, which is 0.5.

DEFAULT_THRESHOLD = 0.5
CENTER_THRESHOLD = 0.55
compare_thresh = tuning_thresh[tuning_thresh["threshold"].isin(
    [DEFAULT_THRESHOLD, CENTER_THRESHOLD])]
compare_thresh
threshold accuracy precision recall f1-score
50 0.50 0.856155 0.785129 0.841427 0.812304
55 0.55 0.862649 0.815395 0.812699 0.814045

Let's calculate the gain/loss in each metric if we use CENTER_THRESHOLD instead of DEFAULT_THRESHOLD:

compare_thresh.iloc[1, :] - compare_thresh.iloc[0, :]
threshold    0.050000
accuracy     0.006494
precision    0.030266
recall      -0.028728
f1-score     0.001741
dtype: float64

Note: In conclusion, we prefer DEFAULT_THRESHOLD over CENTER_THRESHOLD in order to maintain the recall value.

Receiver Operating Characteristic (ROC) Curve

The ROC curve plots the True Positive Rate (Recall) against the False Positive Rate. The area under the ROC curve, called the Area Under Curve (AUC), measures the degree of class separability: the higher the AUC, the better the model is at distinguishing between canceled and not canceled bookings.

  • An AUC near 1 indicates the model has a good measure of separability.
  • An AUC near 0 means it has the worst measure of separability.
  • An AUC of 0.5 means the model has no class separation capacity whatsoever.

def rocauc(y_pred_list, y_test):
    y_pred_list = [(np.zeros(len(y_test)), "BASELINE")] + y_pred_list
    for y_pred, label in y_pred_list:
        fpr, tpr, _ = roc_curve(y_test, y_pred)
        auc = roc_auc_score(y_test, y_pred)
        plt.plot(fpr, tpr, label=f"{label}\n(AUC: {auc:.3f})",
                 linestyle='--' if label == "BASELINE" else '-')
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate (Recall)")
    plt.title("RECEIVER OPERATING CHARACTERISTIC (ROC) CURVE", fontweight="bold")
    plt.legend(bbox_to_anchor=(1, 1))
    plt.show()


rocauc([(model_default.predict(X_test), "DEFAULT"),
        (rf_random.predict(X_test), "THRESHOLD = 0.5"),
        (rf_random.predict_proba(X_test)[:, 1] > CENTER_THRESHOLD, f"THRESHOLD = {CENTER_THRESHOLD}")], y_test)

Note: The AUC using DEFAULT_THRESHOLD is nearly the same as with CENTER_THRESHOLD, indicating a similar measure of separability, while DEFAULT_THRESHOLD retains the better recall.

Feature Importance

Feature importance refers to techniques that assign a score to each feature based on how useful it is at predicting the target variable. In this case, we want to know which feature is the most important in predicting hotel booking cancellation, using sklearn's built-in feature_importances_ attribute.

def featureImportancesPlot(features, importances, n_features=20):
    feature_imp_df = pd.DataFrame({'Feature': features,
                                   'Importance': importances}).set_index('Feature')\
        .sort_values('Importance').tail(n_features)
    ax = feature_imp_df.plot(kind='barh', legend=False,
                             title=f"Top {n_features} Feature Importances")
    ax.set_xlabel("Relative Importance")
    plt.show()
    return feature_imp_df.sort_values('Importance', ascending=False)


feature_imp = featureImportancesPlot(
    X.columns, rf_random.best_estimator_.feature_importances_)

The top three features based on their importance are:

  1. deposit_type_Non Refund: the customer made a deposit to guarantee the booking in the value of the total stay cost
  2. lead_time: number of days that elapsed between the entering date of the booking into the PMS and the arrival date
  3. adr: average daily rate

Conclusion

We successfully visualized the hotel booking data using plotly to gain insights, and built a Random Forest model to predict booking cancellation from various predictors. Before modeling, we inspected the PPS to perform feature selection manually. Next, we tuned the base model using RandomizedSearchCV to reduce overfitting, followed by threshold tuning. In the end, we achieved quite good metrics:

  • accuracy = 85.6155%
  • precision = 78.5129%
  • recall = 84.1427%
  • f1-score = 81.2304%

There are several things that could be improved:

  • Tuning other parameters such as ccp_alpha to prune the trees with cost-complexity pruning (see the sketch after this list).
  • Running the SearchCV on GPU instead of CPU; explore the cuML package.
  • Use another ensemble method such as XGBoost.
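
As an example of the first point, a pruning sweep over ccp_alpha could be sketched as follows (requires scikit-learn >= 0.22; the alpha grid is arbitrary and for illustration only):

# cost-complexity pruning: larger ccp_alpha prunes trees more aggressively
ccp_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=333, n_jobs=-1),
    param_distributions={'ccp_alpha': list(np.linspace(0.0, 0.02, 21))},
    n_iter=10, cv=3, scoring='recall', random_state=333)
ccp_search.fit(X_train_down, y_train_down)
print(ccp_search.best_params_)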