Sales Forecasting using Prophet
A comprehensive workflow for forecasting software sales using the prophet library developed by Facebook. This post breaks down each component of a time series, including trend, seasonality, holiday effects, and external regressors. The model is then evaluated using RMSLE with expanding window cross validation, and hyperparameters are tuned using grid search.
- Introduction
- Data Loading
- Data Preprocessing
- Visualization
- Modeling using fbprophet
- Forecasting Evaluation
- Hyperparameter Tuning
- Final Forecasting Result
- References
Introduction
Time series data is one of the most common forms of data found in every industry. It is a significant area of interest for most sectors: retail, telecommunications, logistics, engineering, finance, and socio-economics. Time series analysis aims to extract the underlying components of a time series to better understand the nature of the data. In this post, we will tackle a common industry case of business sales.
import numpy as np
import pandas as pd
# data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.io as pio
# visualization settings
plt.style.use('seaborn')
plt.rcParams['figure.facecolor'] = 'white'
pio.renderers.default = 'colab'
# prophet
from fbprophet.diagnostics import cross_validation, performance_metrics
from fbprophet.plot import plot_plotly, plot_components_plotly, plot_cross_validation_metric
from fbprophet import Prophet, hdays
# evaluation metric
from sklearn.metrics import mean_squared_log_error
# others
import itertools
from tqdm import tqdm
Data Loading
We will be using data provided by one of the largest Russian software firms, 1C Company, made available through Kaggle. This is a good example dataset since it contains seasonality as well as distinct noise around special occurrences.
sales = pd.read_csv('data_input/sales_train.csv')
sales.head()
The following is the glossary provided on the Kaggle platform:
- date: the date, provided in dd.mm.yyyy format
- date_block_num: a consecutive month number used for convenience (January 2013 is 0, February 2013 is 1, and so on)
- shop_id: the unique identifier of the shop
- item_id: the unique identifier of the product
- item_price: the price of the item on the specified date
- item_cnt_day: the number of products sold on the specified date
We will mainly work with item_cnt_day and item_price, from which we'll analyze the business demand based on the sales record.
sales.info()
Some insights we can get from the output:
- The data consists of 2,935,849 observations (rows)
- It has 6 variables (columns)
Data Preprocessing
Time series data is defined as observations collected at regular time intervals. In this case, we are working with daily software sales data.
We have to make sure our data is ready to be fitted into a model:
- Convert the date column data type from object to datetime64
- Sort the data ascending by the date column
- Engineer the total_revenue feature, which will be forecasted
sales['date'] = pd.to_datetime(sales['date'], dayfirst=True)
sales.sort_values('date', inplace=True)
sales['total_revenue'] = sales['item_price'] * sales['item_cnt_day']
sales.dtypes
By default, pandas will infer dates in month-first order. Since the sales date is stored in dd.mm.yyyy format, we have to specify the parameter dayfirst=True inside the pd.to_datetime() function.
Take a look at the third observation below; without dayfirst, pandas converts it to January 2nd while the actual data represents February 1st.
date = pd.Series(['30-01-2021', '31-01-2021', '01-02-2021', '02-02-2021'])
pd.to_datetime(date) # format of datetime64 is yyyy-mm-dd
Next, let's check the date range of the sales data. It turns out to range from January 1st, 2013 to October 31st, 2015.
sales['date'].agg(['min', 'max'])
We are interested in analyzing the most popular shops (shop_id) in our sales data. The popularity of a shop is defined by the number of transactions that occur. Let's create a frequency table using .value_counts() as follows:
top_3_shop = sales['shop_id'].value_counts().head(3)
top_3_shop
We now know that shops 31, 25, and 54 are the top three shops with the most recorded sales. Say we would like to analyze their time series attributes. To do that, we can apply conditional subsetting (filtering) to the sales data.
sales_top_3_shop = sales[sales['shop_id'].isin(top_3_shop.index)]
sales_top_3_shop.head()
Now let's highlight again the most important property of a time series: observations collected at regular time intervals. Since multiple transactions are recorded on each date, we aggregate them into daily totals per shop. Let's take a look at the code:
daily_sales = sales_top_3_shop.groupby(['date', 'shop_id'])[['item_cnt_day', 'total_revenue']] \
.sum().reset_index() \
.rename(columns={'item_cnt_day': 'total_qty'})
daily_sales.head()
Visualization
An important aspect of time series analysis is performing visual exploratory analysis. Python is known for its graphical capabilities and has the very popular visualization packages matplotlib and seaborn. Let's take the daily_sales data frame we created earlier and explore it through visualization.
sns.relplot(data=daily_sales,
kind='line',
x='date',
y='total_qty',
col='shop_id',
aspect=6/5)
plt.show()
From the visualization we can conclude that the fluctuation of total_qty is very distinct for each shop. There are some extreme spikes for shop_id 25 and 31 at the end of each year, while shop_id 54 doesn't show any spikes.
daily_sales_31 = daily_sales[daily_sales['shop_id'] == 31].reset_index(drop=True)
daily_sales_31.set_index('date')[['total_qty', 'total_revenue']].plot(subplots=True,
figsize=(10, 5))
plt.suptitle('DAILY SOFTWARE SALES: STORE 31')
plt.show()
From the visualization we can conclude that the fluctuations of total_qty and total_revenue are quite similar for shop_id 31. Logically, when the total_qty sold increases, total_revenue also increases. We can therefore use total_qty as the regressor when forecasting total_revenue.
Modeling using fbprophet
A fundamental part of understanding a time series is being able to decompose its underlying components. A classic way of describing a time series is the General Additive Model (GAM), which defines a time series as a summation of its components. As a starter, we will define a time series with 3 components:
- Trend ($T$): Long term movement in its mean
- Seasonality ($S$): Repeated seasonal effects
- Residuals ($E$): Irregular components or random fluctuations not described by trend and seasonality
The idea of GAM is that each of them is added to describe our time series:
$Y(t) = T(t) + S(t) + E(t)$
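To make the additive idea concrete, below is a minimal synthetic sketch (made-up values, unrelated to our sales data) where a series is built literally as the sum of the three components:
# for illustration purposes only
# a synthetic additive series: Y(t) = T(t) + S(t) + E(t)
t = np.arange(365)
trend = 100 - 0.05 * t                        # T(t): slowly decreasing mean
seasonality = 10 * np.sin(2 * np.pi * t / 7)  # S(t): repeating weekly cycle
residuals = np.random.normal(0, 2, size=365)  # E(t): random fluctuations
y = trend + seasonality + residuals           # GAM: simply add the components
pd.Series(y).plot()
plt.show()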
Prophet enhances the classical trend and seasonality components by adding a holiday effect. It tries to model the effects of holidays that occur on certain dates, which has proven to be really useful in many cases. Take, for example, the Lebaran season: in Indonesia, it is really common for sales to be affected by Lebaran. The effect, however, is a bit different from a classic seasonality effect because it follows an irregular schedule.
Prepare the data
To use the fbprophet package, we first need to prepare our time series data in the specific data frame format required by the package. The data frame requires 2 columns:
- ds: the timestamp column, stored in datetime64 data type
- y: the value to be forecasted
In this example, we will be using total_qty as the value to be forecasted.
daily_total_qty = daily_sales_31[['date', 'total_qty']].rename(
columns={'date': 'ds',
'total_qty': 'y'})
daily_total_qty.head()
model_31 = Prophet()
model_31.fit(daily_total_qty)
Forecasting
Based on the existing data, let's say we would like to forecast 1 year into the future. To do that, we first need to prepare a data frame containing the future timestamp range. Conveniently, fbprophet provides the .make_future_dataframe() method to help us prepare the data:
future_31 = model_31.make_future_dataframe(periods=365, freq='D')
future_31.tail()
Now we have acquired a new future_31 data frame consisting of dates that span from the beginning of the time series to 365 days into the future. We will then use this data frame to perform the forecasting with the .predict() method of our model_31:
forecast_31 = model_31.predict(future_31)
forecast_31[['ds', 'trend', 'weekly', 'yearly', 'yhat']]
Recall that in a General Additive Model, we take the time series components and perform a summation of all of them. In this case, we can see that the model extracts 3 types of components: trend, weekly seasonality, and yearly seasonality. This means that to forecast future values it uses the following formula: $yhat(t) = T(t) + S_{weekly}(t) + S_{yearly}(t)$
We can manually confirm from forecast_31 that the column yhat is equal to trend + weekly + yearly.
forecast_result = forecast_31['yhat']
forecast_add_components = forecast_31['trend'] + forecast_31['weekly'] + forecast_31['yearly']
(forecast_result.round(10) == forecast_add_components.round(10)).all()
.round() is performed to prevent floating point precision errors from breaking the equality check.
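As a quick illustration of why direct equality checks on floats can fail:
# for illustration purposes only
print(0.1 + 0.2 == 0.3)                        # False, due to floating point representation
print(round(0.1 + 0.2, 10) == round(0.3, 10))  # True, after rounding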
Visualize
Now, observe how the .plot() method takes our model_31 and the newly created forecast_31 object to create a matplotlib figure showing the forecasting result. The black points show the actual time series, and the blue line shows the fitted time series along with its forecasted values 365 days into the future.
fig = model_31.plot(forecast_31, xlabel='Date', ylabel='Quantity Sold')
We can also visualize each of the trend and seasonality components using the .plot_components() method.
fig = model_31.plot_components(forecast_31)
From the visualization above, we can get insights such as:
- The trend shows that the total_qty sold is decreasing over time.
- The weekly seasonality shows that sales on weekends are higher than on weekdays.
- The yearly seasonality shows that sales peak at the end of the year.
An interactive figure of the forecast and its components can be created with plotly. You will need to install plotly 4.0 or above separately, as it will not be installed with fbprophet by default. You will also need to install the notebook and ipywidgets packages.
plot_plotly(model_31, forecast_31)
plot_components_plotly(model_31, forecast_31)
To see why a simple trend model falls short, let's first fit a single linear regression line over the whole series:
# for illustration purposes only
from datetime import date
# prepare data
daily_sales_31_copy = daily_sales_31.copy()
daily_sales_31_copy['date_ordinal'] = daily_sales_31_copy['date'].apply(lambda d: d.toordinal())  # `d`, not `date`, to avoid shadowing the datetime import
# visualize regression line
plt.figure(figsize=(10, 5))
ax = sns.regplot(x='date_ordinal',
y='total_qty',
data=daily_sales_31_copy,
scatter_kws={'color': 'black', 's': 2},
line_kws={'color': 'red'})
new_labels = [date.fromordinal(int(item)) for item in ax.get_xticks()]
ax.set_xticklabels(new_labels)
plt.xlabel('date')
plt.show()
Automatic Changepoint Detection
A single regression line, as illustrated above, assumes one constant trend slope over the whole series. Prophet, however, implements changepoint detection, which tries to automatically detect points where the slope changes significantly. It splits the series at several candidate points and estimates a trend slope for each range.
By default, prophet specifies 25 potential changepoints (n_changepoints=25), which are placed uniformly across the first 80% of the time series (changepoint_range=0.8).
# for illustration purposes only
from fbprophet.plot import add_changepoints_to_plot
fig = model_31.plot(forecast_31, xlabel='Date', ylabel='Quantity Sold')
a = add_changepoints_to_plot(fig.gca(), model_31, forecast_31, threshold=0)
From the 25 potential changepoints, it then calculates the magnitude of each slope change and decides which ones are significant. The model detected 3 significant changepoints, separating the series into 4 different trend slopes.
fig = model_31.plot(forecast_31, xlabel='Date', ylabel='Quantity Sold')
a = add_changepoints_to_plot(fig.gca(), model_31, forecast_31)
Adjusting Trend Flexibility
Prophet provides tuning parameters to adjust the detection flexibility:
- n_changepoints (default = 25): the number of potential changepoints; not recommended to be tuned directly, as this is better handled by adjusting the regularization (changepoint_prior_scale)
- changepoint_range (default = 0.8): the proportion of the history in which the trend is allowed to change. Recommended range: [0.8, 0.95]
- changepoint_prior_scale (default = 0.05): the flexibility of the trend, in particular how much the trend changes at the trend changepoints. Recommended range: [0.001, 0.5]
model_tuning_trend = Prophet(
n_changepoints=25, # default = 25
changepoint_range=0.8, # default = 0.8
changepoint_prior_scale=0.5 # default = 0.05
)
model_tuning_trend.fit(daily_total_qty)
# forecasting
future = model_tuning_trend.make_future_dataframe(periods=365, freq='D')
forecast = model_tuning_trend.predict(future)
# visualize
fig = model_tuning_trend.plot(forecast, xlabel='Date', ylabel='Quantity Sold')
a = add_changepoints_to_plot(fig.gca(), model_tuning_trend, forecast)
Seasonality
fig = model_31.plot_components(forecast_31)
By default, Prophet will try to determine the existing seasonalities based on the data provided. In our case, the data is daily, from early 2013 to the end of 2015.
- Any daily sampled data will by default be detected to have a weekly seasonality.
- Yearly seasonality will by default be set to True if the provided data spans more than 2 years of daily samples.
- The other regular seasonality is daily seasonality, which tries to model an hourly pattern of a time series. Since our data does not contain hourly observations, daily seasonality will by default be set to False.
The strength (smoothness) of each detected seasonality can be adjusted through its Fourier order, as in the example below:
model_tuning_seasonality = Prophet(
weekly_seasonality=3, # default = 3
yearly_seasonality=200 # default = 10
)
model_tuning_seasonality.fit(daily_total_qty)
# forecasting
future = model_tuning_seasonality.make_future_dataframe(periods=365, freq='D')
forecast = model_tuning_seasonality.predict(future)
# visualize
fig = model_tuning_seasonality.plot_components(forecast)
Custom Seasonalities
The default seasonalities modelled by Prophet for daily sampled data are weekly and yearly.
Consider this case: sales in your business are heavily affected by payday, so most customers tend to buy your product on certain days of the month. Since this does not follow the default yearly and weekly seasonality, we need to define a non-regular seasonality. There are two steps we have to do:
- Remove a default seasonality (e.g. yearly seasonality) by setting it to False
- Add a seasonality (e.g. monthly seasonality) by using the .add_seasonality() method before fitting the model
We end up with the formula: $yhat(t) = T(t) + S_{weekly}(t) + \bf{S_{monthly}(t)}$
model_custom_seasonality = Prophet(
yearly_seasonality=False # remove seasonality
)
# add seasonality
model_custom_seasonality.add_seasonality(
name='monthly', period=30.5, fourier_order=5)
model_custom_seasonality.fit(daily_total_qty)
# forecasting
future = model_custom_seasonality.make_future_dataframe(periods=365, freq='D')
forecast = model_custom_seasonality.predict(future)
# visualize
fig = model_custom_seasonality.plot_components(forecast)
For the monthly seasonality, we provided period=30.5, indicating that one seasonal cycle spans roughly 30.5 time steps (days). The value 30.5 is a common frequency quantifier for monthly seasonality, since months have 30 or 31 days (some 28 or 29).
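As a quick sanity check on that figure, the average month length in days is:
365.25 / 12  # approximately 30.44 days per month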
Holiday Effects
Let's revisit our first forecast and mark the days on which sales were unusually large, using a threshold of 800 units:
fig = model_31.plot(forecast_31, xlabel='Date', ylabel='Quantity Sold')
plt.axhline(y=800, color='red', ls='--')
plt.show()
The table below shows that the relatively large sales mostly happened at the very end of the year, between December 27th and 31st. Now let's assume that this phenomenon is the result of New Year's Eve, when most people spend the remaining budget of their Christmas or end-of-year bonus to buy our goods.
daily_total_qty[daily_total_qty['y'] > 800]
We'll need to prepare a holiday data frame with the following columns:
- holiday: the unique name identifier of the holiday
- ds: the timestamp
- lower_window: how many time units before the holiday are assumed to be affected (less than or equal to zero)
- upper_window: how many time units after the holiday are assumed to be affected (greater than or equal to zero)
holiday = pd.DataFrame({
'holiday': 'new_year_eve',
'ds': pd.to_datetime(['2013-12-31', '2014-12-31', # past date, historical data
'2015-12-31']), # future date, to be forecasted
'lower_window': -4, # include 27th - 31st December
'upper_window': 0})
holiday
Once we have prepared our holiday data frame, we can pass it into the Prophet() class:
model_holiday = Prophet(holidays=holiday)
model_holiday.fit(daily_total_qty)
# forecasting
future = model_holiday.make_future_dataframe(periods=365, freq='D')
forecast = model_holiday.predict(future)
# visualize
fig = model_holiday.plot(forecast, xlabel='Date', ylabel='Quantity Sold')
Now, observe how the model is more confident in capturing the holiday effect at the end of the year instead of relying on the yearly seasonality effect. If we plot the components, we also get the holidays listed as one of the time series components:
fig = model_holiday.plot_components(forecast)
Prophet also ships with built-in country-specific holidays, which can be added with the .add_country_holidays() method before fitting:
model_holiday_indo = Prophet()
model_holiday_indo.add_country_holidays(country_name='ID')
model_holiday_indo.fit(daily_total_qty)
model_holiday_indo.train_holiday_names
The built-in country holidays are provided by the hdays module. This is useful if we want to take a look at the holiday dates and then manually include only certain holidays.
# Reference code: https://github.com/facebook/prophet/blob/master/python/fbprophet/hdays.py
holidays_indo = hdays.Indonesia()
holidays_indo._populate(2021)
pd.DataFrame([holidays_indo], index=['holiday']).T.rename_axis('ds').reset_index()
Adding Regressors
Additional regressors can be added to the linear part of the model using the .add_regressor() method before fitting the model. In this case, we want to forecast total_revenue based on its own components (trend, seasonality, holiday) and also the total_qty sold as a regressor:
$revenue(t) = T_{revenue}(t) + S_{revenue}(t) + H_{revenue}(t) + \bf{qty(t)}$
model_total_qty = Prophet(
n_changepoints=20, # trend flexibility
changepoint_range=0.9, # trend flexibility
changepoint_prior_scale=0.25, # trend flexibility
weekly_seasonality=5, # seasonality fourier order
yearly_seasonality=False, # remove seasonality
holidays=holiday # new year eve effects
)
# add monthly seasonality
model_total_qty.add_seasonality(name='monthly', period=30.5, fourier_order=5)
model_total_qty.fit(daily_total_qty)
# forecasting
future = model_total_qty.make_future_dataframe(periods=365, freq='D')
forecast_total_qty = model_total_qty.predict(future)
# visualize
fig = model_total_qty.plot(
forecast_total_qty, xlabel='Date', ylabel='Quantity Sold')
The table below shows the forecasted total quantity for the 365 future days (from November 1st, 2015 until October 30th, 2016).
forecasted_total_qty = forecast_total_qty[['ds', 'yhat']].tail(365) \
.rename(columns={'yhat': 'total_qty'})
forecasted_total_qty
On the other hand, the table below shows the actual total quantity which we used for training the model. We have to rename the column to match the previous table exactly.
actual_total_qty = daily_total_qty.rename(columns={'y': 'total_qty'})
actual_total_qty
Now, we have to prepare the concatenated total_qty data to serve as the regressor values for total_revenue:
- First 1031 observations: actual values of total_qty
- Last 365 observations: forecasted values of total_qty
future_with_regressor = pd.concat([actual_total_qty, forecasted_total_qty])
future_with_regressor
daily_total_revenue = daily_sales_31[['date', 'total_revenue', 'total_qty']].rename(
columns={'date': 'ds',
'total_revenue': 'y'})
daily_total_revenue.head()
When fitting a model with a regressor, make sure to:
- Apply the .add_regressor() method before fitting
- Forecast the values using the future_with_regressor data frame we prepared before, containing ds and the regressor values
model_total_revenue = Prophet(
holidays=holiday # new year eve effects
)
# add regressor
model_total_revenue.add_regressor('total_qty')
model_total_revenue.fit(daily_total_revenue)
# forecasting
# use dataframe with regressor, instead of just `ds` column
forecast_total_revenue = model_total_revenue.predict(future_with_regressor)
# visualize
fig = model_total_revenue.plot(
forecast_total_revenue, xlabel='Date', ylabel='Total Revenue')
If we plot the components, we also get the extra regressor listed as one of the time series components:
fig = model_total_revenue.plot_components(forecast_total_revenue)
Forecasting Evaluation
Recall how we visually assessed the performance of our forecasting model earlier. That technique is, in fact, a widely used validation approach. It involves splitting our data into two parts:
- Train data is used to train the time series model so that it acquires the underlying patterns such as trend and seasonality.
- Test data is purposely kept aside so we can validate how the model performs on unseen data.
cutoff = pd.to_datetime('2015-01-01')
daily_total_qty['type'] = daily_total_qty['ds'].apply(
lambda date: 'train' if date < cutoff else 'test')
plt.figure(figsize=(10, 5))
sns.scatterplot(x='ds', y='y', hue='type', s=5,
palette=['black', 'red'],
data=daily_total_qty)
plt.axvline(x=cutoff, color='gray', label='cutoff')
plt.legend()
plt.show()
We can split at a cutoff using conditional subsetting as below:
train = daily_total_qty[daily_total_qty['ds'] < cutoff]
test = daily_total_qty[daily_total_qty['ds'] >= cutoff]
print(f'Train length: {train.shape[0]} days')
print(f'Test length: {test.shape[0]} days')
Now let's train the model using data from 2013-2014 only, and forecast 303 days into the future (until October 31st, 2015).
model_final = Prophet(
holidays=holiday, # holiday effect
yearly_seasonality=True)
# add monthly seasonality
model_final.add_seasonality(name='monthly', period=30.5, fourier_order=5)
model_final.fit(train) # only training set
# forecasting
future_final = model_final.make_future_dataframe(
periods=303, freq='D') # 303 days (test size)
forecast_final = model_final.predict(future_final)
# visualize
fig = model_final.plot(forecast_final, xlabel='Date', ylabel='Quantity Sold')
plt.scatter(x=test['ds'], y=test['y'], s=10, color='red')
plt.show()
Evaluation Metrics
Based on the plot above, we can see that the model is capable of forecasting the actual future values. But for the most part, we need to quantify the error to reach a conclusive result. To quantify an error, we calculate the difference between the actual demand and the forecasted demand. There are several metrics we can use to express this value, including:
- Root Mean Squared Error
- Root Mean Squared Logarithmic Error
Notation:
- $n$: length of time series
- $p_i$: predicted value at time $i$
- $a_i$: actual value at time $i$
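Using this notation, the two metrics can be written as:
$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (p_i - a_i)^2}$
$RMSLE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(\log(p_i + 1) - \log(a_i + 1)\right)^2}$
The $\log(x + 1)$ term makes RMSLE sensitive to relative rather than absolute differences, and it penalizes underestimation more heavily than overestimation, as illustrated below.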
# for illustration purposes only
# calculation
err = np.arange(-500, 500)
p = np.arange(0, 1000)
a = p - err
rmse_plot = np.power(err, 2) ** 0.5
rmsle_plot = np.power(np.log1p(p) - np.log1p(a), 2) ** 0.5
# visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for idx, (ax, y) in enumerate(zip(axes, [rmse_plot, rmsle_plot])):
ax.plot(err[err < 0], y[err < 0], color='red', label="Underestimate")
ax.plot(err[err > 0], y[err > 0], color='blue', label="Overestimate")
ax.set_xlabel("Error (Prediction - Actual)")
ax.set_ylabel(f"RMS{'L' if idx else ''}E")
ax.set_title(f"Root Mean Squared {'Logarithmic ' if idx else ''}Error")
ax.legend()
plt.show()
forecast_train = forecast_final[forecast_final['ds'] < cutoff]
train_rmsle = mean_squared_log_error(y_true=train['y'],
y_pred=forecast_train['yhat']) ** 0.5
train_rmsle
forecast_test = forecast_final[forecast_final['ds'] >= cutoff]
test_rmsle = mean_squared_log_error(y_true=test['y'],
y_pred=forecast_test['yhat']) ** 0.5
test_rmsle
Expanding Window Cross Validation
Instead of doing only a one-time train-test split, we can perform cross validation with multiple cutoffs.
This cross validation procedure is called expanding window and can be done automatically by using the cross_validation() function. There are three parameters to be specified (see the conceptual sketch after this list):
- initial: the length of the initial training period
- horizon: the forecast length
- period: the spacing between cutoff dates
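Before running it, here is a minimal conceptual sketch of how the expanding window grows over our data range (for illustration only; fbprophet actually places cutoffs backwards from the end of the history, so the exact dates may differ slightly):
# for illustration purposes only
start = pd.Timestamp('2013-01-01')   # first observation
end = pd.Timestamp('2015-10-31')     # last observation
initial = pd.Timedelta('730 days')
period = pd.Timedelta('90 days')
horizon = pd.Timedelta('30 days')
cutoff = start + initial
fold = 1
while cutoff + horizon <= end:
    print(f'fold {fold}: train {start.date()} to {cutoff.date()}, '
          f'test {cutoff.date()} to {(cutoff + horizon).date()}')
    cutoff += period
    fold += 1
Now let's run fbprophet's own cross_validation():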
df_cv = cross_validation(model_total_revenue, initial='730 days', horizon='30 days', period='90 days')
df_cv
The cross validation process above is carried out over 4 folds, where at each fold a forecast is made for the next 30 days (the horizon) from the cutoff date. Below is an illustration of each fold:
# for illustration purposes only
df_copy = daily_total_qty[['ds', 'y']].copy()
df_cutoff_horizon = df_cv.groupby('cutoff')[['ds']].max()
for i, (cutoff, horizon) in enumerate(df_cutoff_horizon.iterrows()):
horizon_cutoff = horizon['ds']
df_copy['type'] = df_copy['ds'].apply(
lambda date: 'train' if date < cutoff else
'test' if date < horizon_cutoff else 'unseen')
plt.figure(figsize=(10, 5))
sns.scatterplot(x='ds', y='y', hue='type', s=5,
palette=['black', 'red', 'gray'],
data=df_copy)
plt.axvline(x=cutoff, color='gray', label='cutoff')
plt.axvline(x=horizon_cutoff, color='gray', ls='--', label='horizon')
plt.legend(bbox_to_anchor=(1, 1), loc='upper left')
plt.title(
f"FOLD {i+1}\nTRAIN SIZE: {df_copy['type'].value_counts()['train']} DAYS")
plt.show()
Cross validation error metrics can be evaluated for each fold; here we show RMSLE.
cv_rmsle = df_cv.groupby('cutoff').apply(
lambda x: mean_squared_log_error(y_true=x['y'],
y_pred=x['yhat']) ** 0.5)
cv_rmsle
We can aggregate the metrics using their mean. In other words, we calculate the mean RMSLE to represent the overall model performance.
cv_rmsle.mean()
Error Diagnostics
Prophet provides several frequently used evaluation metrics through performance_metrics():
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- Mean Absolute Error (MAE)
- Mean Absolute Percentage Error (MAPE)
- Median Absolute Percentage Error (MDAPE)
- Coverage: the percentage of actual data that falls within the forecasted uncertainty (confidence) interval
df_p = performance_metrics(df_cv, rolling_window=0)
df_p.tail()
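Coverage can also be verified manually from df_cv, which stores the forecast's uncertainty bounds in the yhat_lower and yhat_upper columns (a quick sketch, assuming the default uncertainty interval):
# fraction of actual values falling inside the forecasted uncertainty interval
inside = (df_cv['y'] >= df_cv['yhat_lower']) & (df_cv['y'] <= df_cv['yhat_upper'])
print(inside.mean())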
Cross validation performance metrics can be visualized with plot_cross_validation_metric; here we show RMSE.
- Dots show the root squared error (RSE) for each prediction in df_cv.
- The blue line shows the RMSE for each horizon.
fig = plot_cross_validation_metric(df_cv, metric='rmse', rolling_window=0)
RMSLE, however, is not among the built-in metrics, so we recreate the diagnostic plot manually, calculating the metric as we did earlier with mean_squared_log_error from the sklearn.metrics module.
- Dots show the root squared logarithmic error (RSLE) for each prediction in df_cv.
- The blue line shows the RMSLE for each horizon.
# calculation
df_cv['horizon'] = (df_cv['ds'] - df_cv['cutoff']).dt.days
df_cv['sle'] = np.power((np.log1p(df_cv['yhat']) - np.log1p(df_cv['y'])), 2) # squared logarithmic error
df_cv['rsle'] = df_cv['sle'] ** 0.5
horizon_rmsle = df_cv.groupby('horizon')['sle'].apply(lambda x: x.mean() ** 0.5) # root mean
# visualization
plt.figure(figsize=(10, 5))
sns.scatterplot(x='horizon', y='rsle',
s=15, color='gray',
data=df_cv)
plt.plot(horizon_rmsle.index, horizon_rmsle, color='blue')
plt.grid(True, which='major', color='gray', linestyle='-')
plt.xlabel('Horizon (days)')
plt.ylabel('RMSLE')
plt.show()
Hyperparameter Tuning
In this section, we implement a grid search algorithm for model tuning using a for-loop. It builds a model for every combination of the specified hyperparameters and then evaluates it. The goal is to choose the set of hyperparameters that minimizes the forecast error (in this case, the smallest mean RMSLE).
param_grid = {
'changepoint_prior_scale': [0.001, 0.01, 0.1],
'changepoint_range': [0.8, 0.95]
}
# Generate all combinations of parameters
all_params = [dict(zip(param_grid.keys(), v))
for v in itertools.product(*param_grid.values())]
rmsles = [] # Store the RMSLEs for each params here
# Use cross validation to evaluate all parameters
for params in tqdm(all_params):
# fitting model
# (TO DO: change the data and add other components: seasonality, holiday, regressor)
model = Prophet(**params,
holidays=holiday)
model.fit(daily_total_qty)
# Expanding window cross validation (TO DO: use different values)
cv = cross_validation(model, initial='730 days', period='90 days', horizon='60 days',
parallel='processes')
# Evaluation metrics: RMSLE
rmsle = cv.groupby('cutoff').apply(
lambda x: mean_squared_log_error(y_true=x['y'],
y_pred=x['yhat']) ** 0.5)
mean_rmsle = rmsle.mean()
rmsles.append(mean_rmsle)
Note: the tuning above is performed on daily_total_qty for the sake of explanation simplicity.
We can observe the error metrics for each hyperparameter combination, and sort by ascending:
tuning_results = pd.DataFrame(all_params)
tuning_results['rmsle'] = rmsles
tuning_results.sort_values(by='rmsle')
The best hyperparameter combination can be extracted as follows:
best_params = all_params[np.argmin(rmsles)]
best_params
Lastly, re-fit the model and use it for forecasting.
model_best = Prophet(**best_params, holidays=holiday)
model_best.fit(daily_total_qty)
model_best
** is an operator for dictionary unpacking: it passes the key-value pairs of a dictionary as keyword arguments to a function.
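A minimal illustration with hypothetical values (unrelated to our tuned parameters):
# for illustration purposes only: the two calls below are equivalent
params = {'changepoint_prior_scale': 0.01, 'changepoint_range': 0.8}
Prophet(**params)
Prophet(changepoint_prior_scale=0.01, changepoint_range=0.8)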
future_best = model_best.make_future_dataframe(periods=365, freq='D')
forecast_best = model_best.predict(future_best)
# visualize
fig = model_best.plot(forecast_best, xlabel='Date', ylabel='Quantity Sold')
The final forecasting result can then be used as a sales target for one year ahead.