import pandas as pd
import numpy as np

# data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# statistical test
from scipy import stats

# suppress warnings
import warnings
warnings.filterwarnings("ignore")

# reproducibility
import os
import random
import torch

# reference: https://github.com/pytorch/pytorch/issues/7068#issuecomment-487907668
def seed_torch(seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed) # if you are using multi-GPU
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

SEED = 123
seed_torch(SEED)

Introduction

Survival Analysis arises from our many concerns about events that might happen and be detrimental to life. We can only prepare meticulous plans for such hazards if we know when the worst outcome is likely to occur. In addition, we also want to know which factors are most correlated with these hazards. The first question that always arises is: how long will an object last?

Survival Analysis is a statistical method used to analyze longitudinal data about the time until an event occurs, for example:

  • What is the probability that a patient will survive after a doctor's diagnosis?
  • How long will a customer keep using our product before churning?
  • How long can a production machine last after three years of use?
  • What is the retention rate of different marketing channels?
  • etc.

The examples above are well worth discussing further. For a deeper understanding, this post explains the basic concepts of Survival Analysis, how it works, and how it is applied in industry.

Concept

Event and Time

Survival Analysis is also known as time-to-event analysis. To better understand the concepts of event and time, let's take a look at the following applications:

  • Marketing Analysis: The goal is to evaluate the retention rate of each marketing channel. As we know, a company offers several services to customers according to its marketing targets. Here, we define the event as the moment a customer unsubscribes from the marketing channel, and time starts when the customer initiates the marketing channel service/subscription. The time scale can be months or weeks.

  • Predictive Maintenance: This analysis is used to determine the useful life of mechanical parts/machines. The event is defined as the moment the machine breaks down, and time is how long the machine can operate continuously in the production process. The time scale can be weeks, months, or years.

Censoring

Survival Analysis sounds very similar to regression. So, why do we use a different modeling technique? Why not just use simple linear regression?

Survival analysis is explicitly designed to handle data about terminal events where some observations experience the event and others do not. Such observations are called "censored". For example, the target variable represents the time to a terminal event, but the duration of the study is limited, so some observations will never experience the event within the observation window. In the predictive maintenance case, some equipment will probably fail during performance monitoring, but some will not.

There are two types of censoring, namely right censored data and left censored data:

  • Right censored data is the most common type of censoring. It occurs when the observation ends, or the individual is removed from the study, before the event occurs. For example, some individuals may still be alive at the end of an observational trial, or may drop out before the study is terminated.
  • Data is called left censored if the initial time at risk is unknown, that is, we do not know when the individual first became at risk of the observed event. For example, we may not know when an individual was first infected with a disease.

The concept of censoring slightly complicates the estimation of the survival function, so we need a technique other than a simple regression model. Thus, in addition to the target (time) variable, survival analysis needs a categorical variable that indicates, for each observation, whether or not the event had occurred by the time of censoring.
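For intuition, here is a minimal sketch (with hypothetical values) of how censored data is typically encoded: one column for the observed duration and one event indicator, where 0 marks a right-censored observation.

# toy illustration (hypothetical values): duration plus event indicator
toy = pd.DataFrame({
    'duration': [5, 8, 3, 10, 7],  # observed time, e.g. in weeks
    'event':    [1, 0, 1, 0, 1]    # 1 = event occurred, 0 = censored at that time
})
toy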

Mathematical Concept

Probability Density Function (PDF) and Cumulative Distribution Function (CDF)

Let $T$ be a positive continuous random variable which represents time until the event happens.

  • Probability density function (PDF) $f_T(t)$ represents the probability value of an event occurring at any time $t$ for the random variable $T$.
  • Cumulative distribution function (CDF) $F_T(t)$ represents a function that adds up the probability value of an event occurring up to time $t$ for the random variable $T$.

Mathematically, $F_T(t) = P(T < t) = \int_0^t f_T(s)\ ds$

# generate dummy data
T = np.round(np.random.normal(loc=100, scale=10, size=10000))
t = np.linspace(T.min(), T.max(), dtype='int')
kde = stats.gaussian_kde(T)
df = pd.Series(kde.pdf(t), index=t, name='PDF').to_frame()
df['CDF'] = df['PDF'].cumsum()/df['PDF'].sum()
df = df.round(10)

# plotting
fig, ax = plt.subplots(1, 2, figsize=(10, 5), sharex=True)

ax[0].plot(df['PDF'])
ax[0].set_xlabel("Time $t$")
ax[0].set_ylabel("$f_T(t)$")
ax[0].set_title("Probability Density Function")

ax[1].plot(df['CDF'])
ax[1].set_xlabel("Time $t$")
ax[1].set_ylabel("$F_T(t)$")
ax[1].set_title("Cumulative Distribution Function")
plt.show()

Survival Function

Survival function $S(t)$ defines the probability that the event has not occurred or the object can survive up to time $t$. Mathematically it can be written as $S(t) = P(T \geq t)$, with the following properties:

  1. $0 \leq S(t) \leq 1$
  2. $S(t) = 1 - F_T(t)$ where $F_T(t)$ is CDF for the random variable $T$
  3. $F_T(t)$ is a monotonically increasing function; conversely, $S(t)$ is a monotonically decreasing function

fig, ax = plt.subplots(1, 2, figsize=(10, 5), sharex=True)

ax[0].plot(df['CDF'])
ax[0].set_xlabel("Time $t$")
ax[0].set_ylabel("$F_T(t)$")
ax[0].set_title("Cumulative Distribution Function")

df['Survival'] = 1-df['CDF']
ax[1].plot(df['Survival'])
ax[1].set_xlabel("Time $t$")
ax[1].set_ylabel("$S(t)$")
ax[1].set_title("Survival Function")
plt.show()

Hazard Function

The hazard function $h(t)$ defines the conditional probability that the event will occur in the interval $[t, t+dt)$ given that the event has not occurred before $t$. Mathematically it can be written as:

$h(t) = \displaystyle{\lim_{dt \to 0} \frac{P(t \leq T < t+dt\ |\ T \geq t)}{dt}}$

The above formula can be rewritten as (derivation attached at the end of this post):

$h(t) = \dfrac{f_T(t)}{S(t)} = - \dfrac{d}{dt} \ln S(t)$

So, survival function can be written in terms of hazard function as: $S(t) = e^{-\int_0^t h(s)\ ds}$

$H(t) = \int_0^t h(s)\ ds$ is further defined as cumulative hazard function.

fig, ax = plt.subplots(1, 2, figsize=(10, 5), sharex=True)

df['Hazard'] = df['PDF']/df['Survival']
ax[0].plot(df['Hazard'])
ax[0].set_xlabel("Time $t$")
ax[0].set_ylabel("$h(t)$")
ax[0].set_title("Hazard Function")

df['CumHazard'] = df['Hazard'].cumsum()
ax[1].plot(df['CumHazard'])
ax[1].set_xlabel("Time $t$")
ax[1].set_ylabel("$H(t)$")
ax[1].set_title("Cumulative Hazard Function")
plt.show()

Risk Score

However, the hazard function is rarely used in its original form. Usually, the $t$ axis is discretized into $J$ parts and the risk score $r(x)$ is calculated for each sample $x$:

$r(x) = \displaystyle{\sum_{j=1}^J H(t_j, x)}$

Note: The faster an object experiences an event (smaller $t$), the higher the risk score it has.

J = 20
j = np.linspace(T.min(), T.max(), J, dtype='int')
j[-1] -= 1  # keep the last partition point inside the reindexed range below
df2 = df.reindex(index=range(int(T.min()), int(T.max())), method='ffill')

plt.plot(df['CumHazard'], label='Cumulative Hazard')
risk = 0
for t_j in j:
    y = df2.loc[t_j, 'CumHazard']
    plt.vlines(x=t_j, ymin=0, ymax=y, linestyles='--', color='red')
    plt.plot(t_j, y, marker='x', color='red')
    risk += y

plt.legend()
plt.xlabel("Time $t$")
plt.ylabel("$H(t)$")
plt.title(f"Risk Score: {risk:.2f} with {J} partitions")
plt.show()

Predictive Maintenance

Predictive maintenance is a maintenance strategy that predicts when a machine is likely to break down, so that technicians can carry out repairs as a preventive measure before the damage happens. According to a survey conducted by Deloitte, predictive maintenance analysis can reduce factory equipment maintenance costs by up to 40%.

Predictive maintenance is built on an analysis of machine condition monitored in real time. The monitoring process is assisted by sensors on the machine that capture pressure, humidity, and temperature information, some of which can be used to measure the machine's performance.

The real-time data captured by the sensors is collected and stored in the cloud. The data can then be analyzed and visualized on a dashboard displayed to technicians. Up to this point, technicians only know the current state of a machine. Then what about the predictive analysis part?

To perform predictive analysis, the existing data is analyzed further and a machine learning technique is applied, namely survival analysis, which produces a model. This model can be used as a decision parameter to determine the probability of damage to a machine over time. Once the model is created and produces results that answer the business question, the maintenance specialist can determine an action plan to take preventive action against machine damage.

Data Loading

With the help of the Internet of Things (IoT), the maintenance dataset can be collected from sensors. There are several measurements such as pressure, moisture, and temperature.

from pysurvival.datasets import Dataset
maintenance = Dataset('maintenance').load()
maintenance.head(10)
lifetime broken pressureInd moistureInd temperatureInd team provider
0 56 0 92.178854 104.230204 96.517159 TeamA Provider4
1 81 1 72.075938 103.065701 87.271062 TeamC Provider4
2 60 0 96.272254 77.801376 112.196170 TeamA Provider1
3 86 1 94.406461 108.493608 72.025374 TeamC Provider2
4 34 0 97.752899 99.413492 103.756271 TeamB Provider1
5 30 0 87.678801 115.712262 89.792105 TeamA Provider1
6 68 0 94.614174 85.702236 142.827001 TeamB Provider2
7 65 1 96.483303 93.046797 98.316190 TeamB Provider3
8 23 0 105.486158 118.291997 96.028822 TeamB Provider2
9 81 1 99.178235 99.138717 95.492965 TeamC Provider4

Here are the data description of maintenance:

  1. Time component

    • lifetime: machine uptime, in weeks
  2. Event component

    • broken: indicates whether the machine has broken (1) or not (0)
  3. Features

    • pressureInd: pressure index, measurement of the flow of liquid through a pipe. A sudden drop in pressure may indicate a leakage.
    • moistureInd: moisture index. Excessive humidity can cause mold and damage the equipment.
    • temperatureInd: temperature index from thermocouple. Improper temperature can cause electrical circuit damage, fire, or even explosion.
    • team: the team operating the machine
    • provider: machine manufacturer
maintenance[['broken', 'team', 'provider']] = maintenance[['broken', 'team', 'provider']].astype('category')
maintenance.dtypes
lifetime             int64
broken            category
pressureInd        float64
moistureInd        float64
temperatureInd     float64
team              category
provider          category
dtype: object

Exploratory Data Analysis

Before modeling, let's explore the maintenance data through visualizations.

Machine Record

First, let's see how much data has been recorded for each team and provider respectively.

maintenance.groupby(['team', 'provider']).count()['broken'].unstack().plot.bar()

plt.xticks(rotation=0)
plt.xlabel('')
plt.ylabel('')
plt.title('Number of Records by Team and Provider')
plt.legend(bbox_to_anchor=(1, 1), loc='upper left', title='Provider')
plt.show()

The number of recorded machines does not differ much across teams and providers; in other words, the distribution is fairly uniform. Next, let's plot the distribution of pressureInd, moistureInd, and temperatureInd for each provider and broken status.

maintenance_melt = maintenance.melt(id_vars=['provider', 'broken'],
                                    value_vars=['pressureInd', 'moistureInd', 'temperatureInd'])

g = sns.FacetGrid(maintenance_melt,
                  row='variable', col='provider', hue='broken',
                  margin_titles=True)
g.map(sns.distplot, 'value', hist_kws=dict(alpha=0.2))
g.add_legend()
plt.show()

At a glance, the bell-shaped curves show that the three variables are roughly normally distributed, and the distributions are more or less similar for broken and non-broken machines. We can perform hypothesis testing to check whether the two distributions are statistically equal.

Mann-Whitney U: Tests whether the distribution of two independent samples is equal or not.

Assumptions:

  • Observations in each sample are independent and identically distributed.
  • Observations in each sample can be ranked.

Hypothesis:

  • $H_0$: distribution of both samples is the same.
  • $H_1$: distribution of the two samples is not the same.

from scipy.stats import mannwhitneyu

H0 = "same"
H1 = "different"
alpha = 0.05

mw_result = []
for var in ['pressureInd', 'moistureInd', 'temperatureInd']:
    for provider in maintenance['provider'].cat.categories:
        group = maintenance['provider'] == provider
        not_broken = maintenance[(maintenance['broken'] == 0) & (group)]
        broken = maintenance[(maintenance['broken'] == 1) & (group)]

        stat, p = mannwhitneyu(not_broken[var], broken[var])

        mw_result.append(
            {
                'provider': provider,
                'variable': var,
                'statistic': stat,
                'pvalue': p,
                f'result (alpha={alpha})': H1 if p < alpha else H0
            }
        )

pd.DataFrame(mw_result)
provider variable statistic pvalue result (alpha=0.05)
0 Provider1 pressureInd 7770.0 0.344450 same
1 Provider2 pressureInd 7492.0 0.214888 same
2 Provider3 pressureInd 6774.0 0.168687 same
3 Provider4 pressureInd 5090.0 0.015711 different
4 Provider1 moistureInd 7835.0 0.386328 same
5 Provider2 moistureInd 7357.0 0.154727 same
6 Provider3 moistureInd 7186.0 0.420178 same
7 Provider4 moistureInd 6005.0 0.380593 same
8 Provider1 temperatureInd 7484.0 0.186542 same
9 Provider2 temperatureInd 7608.0 0.276022 same
10 Provider3 temperatureInd 6919.0 0.244274 same
11 Provider4 temperatureInd 5596.0 0.129266 same

Using alpha = 5%, the majority of distributions are the same, except for pressureInd from Provider4. This indicates that we cannot rely solely on the numerical variables and providers in the modeling stage; we must also involve the lifetime time component when we want to predict the broken state of a machine.

Correlation Heatmap

Here, we are interested in the correlations between the numeric variables in the maintenance data. Judging from the correlation coefficients, there is no strong correlation, which indicates that there is no linear relationship between the numerical variables.

sns.heatmap(maintenance.corr(), annot=True, cmap='Reds')
plt.title("Correlation Heatmap")
plt.show()

Machine Status (Time to Event)

Let's explore the lifetime (time component) to observe when the machine breaks down.

sns.histplot(x='lifetime', hue='broken', 
             bins=25, alpha=0.2,
             data=maintenance)
plt.title("Number of Record by Machine Status")
plt.show()

It turns out that the failure begins when the machine has been active for at least 60 weeks. Then, what if we plot the above in detail for each team and provider?

def multihist(x, hue, n_bins=10, color=None, **kws):
    bins = np.linspace(x.min(), x.max(), n_bins)
    for label, x_i in x.groupby(hue):
        plt.hist(x_i, bins, label=label, **kws)


g = sns.FacetGrid(maintenance, col='provider', margin_titles=True)
g.map(multihist, 'lifetime', 'broken', alpha=0.5)
g.add_legend(title='broken')
plt.show()

We can clearly see the initial time at which machine failure occurs for each provider. Sorted from the smallest failure lifetime: Provider3, Provider1, Provider4, followed by Provider2.

g = sns.FacetGrid(maintenance, col='team', margin_titles=True)
g.map(multihist, 'lifetime', 'broken', alpha=0.5)
g.add_legend(title='broken')
plt.show()

On the other hand, for each team, it can be seen that the initial machine failure time for Team C is earlier than for the other two teams. To be more confident about the two visualizations above, let's test the following hypothesis:

Chi-Squared: Tests whether two categorical variables are related or independent.

Assumptions:

  • Observations used in the calculation of the contingency table are independent.
  • 25 or more records in each value in the contingency table.

Hypothesis:

  • $H_0$: two samples are independent of each other.
  • $H_1$: two samples are mutually dependent.

from scipy.stats import chi2_contingency

H0 = "independent with broken"
H1 = "dependent with broken"
alpha = 0.05

chi2_result = []
for var in ['provider', 'team']:
    table = pd.pivot_table(data=maintenance,
                           index='broken',
                           columns=var,
                           values='lifetime',
                           aggfunc='count')
    stat, p, dof, expected = chi2_contingency(table)
    
    chi2_result.append(
        {
            'variable': var,
            'statistic': stat,
            'pvalue': p,
            f'result (alpha={alpha})': H1 if p < alpha else H0
        }
    )

pd.DataFrame(chi2_result)
variable statistic pvalue result (alpha=0.05)
0 provider 18.673817 0.000319 dependent with broken
1 team 2.264541 0.322301 independent with broken

Using alpha = 5%, we can conclude that the provider is dependent on the broken status of the machine, while the team is independent.

Kaplan-Meier

Kaplan-Meier is the simplest method for estimating the survival function for each category in a population. The Kaplan-Meier estimate is built from the probability of surviving each event time, successively multiplied by the previous probabilities to produce the final estimate. Mathematically it can be written as follows:

$S(t) = \displaystyle{\prod_{j:\ t_j \leq t} \frac{n_j - d_j}{n_j}}$

where $n_j$ represents the number of individuals at risk just before time $t_j$ and $d_j$ is the number of individuals experiencing the event at time $t_j$.
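To make the formula concrete, here is a minimal sketch on a handful of hypothetical durations; the lifelines implementation below does the same computation for the maintenance data.

# manual Kaplan-Meier sketch on hypothetical durations (1 = event, 0 = censored)
durations = np.array([3, 5, 5, 8, 10, 12])
events    = np.array([1, 1, 0, 1, 0, 1])

surv = 1.0
for t_j in np.unique(durations[events == 1]):
    n_j = (durations >= t_j).sum()                     # individuals still at risk at t_j
    d_j = ((durations == t_j) & (events == 1)).sum()   # events observed at t_j
    surv *= (n_j - d_j) / n_j                          # multiply successive factors
    print(f"t={t_j}: at risk={n_j}, events={d_j}, S(t)={surv:.3f}")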

from lifelines import KaplanMeierFitter

kmf = KaplanMeierFitter()
for provider in maintenance['provider'].cat.categories:
    group = maintenance['provider'] == provider
    kmf.fit(maintenance['lifetime'][group],
            maintenance['broken'][group],
            label=provider)
    kmf.plot(ci_show=False)
plt.xlabel('Lifetime')
plt.ylabel('Survival Probability')
plt.title('Survival Function by Provider')
plt.show()

The plot above is in line with our previous findings: the machines that break down fastest are from Provider3 and the slowest are from Provider2. We can run a hypothesis test to be more certain whether each provider has a different survival function.

Log-Rank: Tests whether two event processes have different intensities. That is, given two sequences of events, it tests whether the processes generating the data differ statistically.

Hypothesis:

  • $H_0$: two survival functions are the same.
  • $H_1$: two different survival functions.

from lifelines.statistics import pairwise_logrank_test

def logrank_test(cat, alpha):
    H0 = "same"
    H1 = "different"

    lr_test = pairwise_logrank_test(event_durations=maintenance['lifetime'],
                                    groups=maintenance[cat],
                                    event_observed=maintenance['broken']).summary
    lr_test.drop('-log2(p)', axis=1, inplace=True)
    lr_test[f'result (alpha={alpha})'] = lr_test['p'].apply(
        lambda x: H1 if x < alpha else H0)
    lr_test.index.set_names(['First', 'Second'], inplace=True)

    return lr_test.reset_index()


logrank_test('provider', alpha=0.05)
First Second test_statistic p result (alpha=0.05)
0 Provider1 Provider2 204.519279 2.156144e-46 different
1 Provider1 Provider3 259.180962 2.588498e-58 different
2 Provider1 Provider4 191.174684 1.761680e-43 different
3 Provider2 Provider3 251.730255 1.089516e-56 different
4 Provider2 Provider4 66.552210 3.407479e-16 different
5 Provider3 Provider4 230.643017 4.316305e-52 different

It turns out that all pairs of survival functions are statistically significantly different. We can tell the maintenance team to pay more attention to the machines from Provider3, as they fail the earliest and therefore require early maintenance to prevent unexpected failures.

kmf = KaplanMeierFitter()
for team in maintenance['team'].cat.categories:
    group = maintenance['team'] == team
    kmf.fit(maintenance['lifetime'][group],
            maintenance['broken'][group],
            label=team)
    kmf.plot(ci_show=False)
plt.xlabel('Lifetime')
plt.ylabel('Survival Probability')
plt.title('Survival Function by Team')
plt.legend(loc='lower left')
plt.show()

The plot above is also in line with the previous finding: the survival function of the machines operated by Team C differs from those of Teams A and B. Let's do a log-rank test for the above visualization:

logrank_test('team', alpha=0.05)
First Second test_statistic p result (alpha=0.05)
0 TeamA TeamB 0.054725 8.150366e-01 same
1 TeamA TeamC 34.698639 3.849005e-09 different
2 TeamB TeamC 34.640870 3.964939e-09 different

It turns out that the survival function of the machines operated by TeamC is indeed very different from those of TeamA and TeamB. The maintenance team can check whether TeamC operates the machines correctly, and then conduct training on machine operation to prevent premature machine failure.

Data Preprocessing

In this section, we will prepare the data before modeling as follows:

  • One-hot encoding for categorical features
  • Cross-validation and train-test splitting
  • Split data into three components: feature, time, event

Perform one-hot encoding for provider and team with pd.get_dummies().

categories = ['provider', 'team']
maintenance_dummy = pd.get_dummies(maintenance, columns=categories, drop_first=True)
maintenance_dummy.head()
lifetime broken pressureInd moistureInd temperatureInd provider_Provider2 provider_Provider3 provider_Provider4 team_TeamB team_TeamC
0 56 0 92.178854 104.230204 96.517159 0 0 1 0 0
1 81 1 72.075938 103.065701 87.271062 0 0 1 0 1
2 60 0 96.272254 77.801376 112.196170 0 0 0 0 0
3 86 1 94.406461 108.493608 72.025374 1 0 0 0 1
4 34 0 97.752899 99.413492 103.756271 0 0 0 1 0

Train-test splitting with 80:20 proportion

from sklearn.model_selection import train_test_split
index_train, index_test = train_test_split(range(maintenance_dummy.shape[0]), test_size=0.2, random_state=SEED)
data_train = maintenance_dummy.loc[index_train].reset_index(drop=True)
data_test = maintenance_dummy.loc[index_test].reset_index(drop=True)

Different from a typical supervised machine learning setup, which only divides the data into features X and target y, in survival analysis we divide the data into three components:

  1. X feature contains numeric and categorical predictor columns
  2. Time T contains the column when event E occurred
  3. Event E contains the class of events, in this case whether the machine is broken or not
features = np.setdiff1d(maintenance_dummy.columns, ['lifetime', 'broken']).tolist()
X_train, X_test = data_train[features], data_test[features]
T_train, T_test = data_train['lifetime'].values, data_test['lifetime'].values
E_train, E_test = np.array(data_train['broken'].values), np.array(data_test['broken'].values)

Survival Analysis Model

Cox Proportional Hazard (CoxPH)

The CoxPH model is widely used in statistics for multivariate survival functions because of its relatively easy implementation and high interpretability. CoxPH describes the relationship between the distribution of survival functions and covariates. The predictor variables are expressed by the hazard function as follows:

$$\lambda(t|x) = \lambda_0(t)\ e^{\beta_1x_1 + ... + \beta_nx_n}$$

The CoxPH method is a semi-parametric model because it consists of 2 components:

  • The non-parametric component $\lambda_0(t)$ which is referred to as the baseline hazard, that is, the hazard when all covariates are zero.
  • The parametric component $e^{\beta_1x_1 + ... + \beta_nx_n}$ is called a time-independent partial hazard.
  • In general, the Cox model estimates the log-risk function $\lambda(t|x)$ as a linear combination of static covariates and baseline hazard.

The following is an implementation of the CoxPH model in survival analysis:

from pysurvival.models.semi_parametric import CoxPHModel
coxph = CoxPHModel()
coxph.fit(X_train, T_train, E_train, lr=0.5, l2_reg=1e-2, init_method='zeros')
Performing Newton-Raphson optimization
 * Iteration #1 - Loss = 1588.772 - ||grad||^2 = 319.12143
 * Iteration #2 - Loss = 1260.426 - ||grad||^2 = 164.46688
 * Iteration #3 - Loss = 1135.155 - ||grad||^2 = 99.67192
 * Iteration #4 - Loss = 1053.192 - ||grad||^2 = 61.15221
 * Iteration #5 - Loss = 996.584 - ||grad||^2 = 37.31715
 * Iteration #6 - Loss = 959.229 - ||grad||^2 = 22.80550
 * Iteration #7 - Loss = 935.345 - ||grad||^2 = 14.12548
 * Iteration #8 - Loss = 920.224 - ||grad||^2 = 8.92736
 * Iteration #9 - Loss = 910.730 - ||grad||^2 = 5.73887
 * Iteration #10 - Loss = 904.914 - ||grad||^2 = 3.71561
 * Iteration #11 - Loss = 901.465 - ||grad||^2 = 2.40797
 * Iteration #12 - Loss = 899.464 - ||grad||^2 = 1.55605
 * Iteration #13 - Loss = 898.312 - ||grad||^2 = 0.99669
 * Iteration #14 - Loss = 897.650 - ||grad||^2 = 0.62547
 * Iteration #15 - Loss = 897.272 - ||grad||^2 = 0.37932
 * Iteration #16 - Loss = 897.059 - ||grad||^2 = 0.22022
 * Iteration #17 - Loss = 896.941 - ||grad||^2 = 0.12224
 * Iteration #18 - Loss = 896.878 - ||grad||^2 = 0.06530
 * Iteration #19 - Loss = 896.845 - ||grad||^2 = 0.03393
 * Iteration #20 - Loss = 896.828 - ||grad||^2 = 0.01732
 * Iteration #21 - Loss = 896.819 - ||grad||^2 = 0.00876
 * Iteration #22 - Loss = 896.815 - ||grad||^2 = 0.00440
 * Iteration #23 - Loss = 896.812 - ||grad||^2 = 0.00221
 * Iteration #24 - Loss = 896.811 - ||grad||^2 = 0.00111
 * Iteration #25 - Loss = 896.811 - ||grad||^2 = 0.00055
Converged after 25 iterations.
CoxPHModel

Model Interpretation

The coef values in the model summary can be interpreted much like the coefficients of a linear regression model. A positive coefficient scales up the baseline hazard $\lambda_0(t)$ and indicates that the predictor increases the risk of the event occurring, in this case broken. Conversely, a negative coefficient reduces the risk of the event occurring as the predictor value increases. (Hazard ratios are computed right after the summary table below.)

H0 = "not significant"
H1 = "significant"
alpha = 0.05

coxph_summary = coxph.get_summary()
coxph_summary[f'result (alpha={alpha})'] = coxph_summary['p_values'].astype('float64').apply(lambda x: H1 if x < alpha else H0)
coxph_summary[['variables', 'coef', 'p_values', f'result (alpha={alpha})']]
variables coef p_values result (alpha=0.05)
0 moistureInd -0.158 0.0 significant
1 pressureInd 0.024 0.0 significant
2 provider_Provider2 -13.342 0.015 significant
3 provider_Provider3 10.077 0.072 not significant
4 provider_Provider4 -9.616 0.024 significant
5 team_TeamB 0.044 0.748 not significant
6 team_TeamC 6.957 0.023 significant
7 temperatureInd 0.502 0.0 significant
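
The coefficients can also be read as hazard ratios by exponentiating them: $e^{\beta}$ is the multiplicative change in hazard for a one-unit increase in the predictor. A minimal sketch, assuming the coxph_summary DataFrame computed above with its variables and coef columns:

# hazard ratios from the fitted CoxPH coefficients (sketch)
hazard_ratios = coxph_summary[['variables', 'coef']].copy()
hazard_ratios['coef'] = hazard_ratios['coef'].astype('float64')
hazard_ratios['hazard_ratio'] = np.exp(hazard_ratios['coef'])  # HR > 1 raises the risk of breaking
hazard_ratios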

The key assumption of the CoxPH model is the proportional hazards assumption: the hazard functions of any two objects are proportional at every point in time, and the ratio does not change from the beginning to the end of the observation period. For example, if machine A has twice the risk of breaking compared to machine B, then at any later time the risk ratio remains 2x. (A quick diagnostic sketch follows the property list below.)

Properties of CoxPH model:

  • Independence of observation: The time of occurrence of an event on one object will be independent of other objects.
  • No hazard curves that intersect each other.
  • There is a linear multiplication effect of the estimated covariate value on the hazard function.
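
One way to sanity-check the proportional hazards assumption is with lifelines, which we already use for the Kaplan-Meier curves. This is only a diagnostic sketch alongside the pysurvival model above: it refits an equivalent Cox model on data_train and calls check_assumptions.

from lifelines import CoxPHFitter

# diagnostic sketch: refit a Cox model with lifelines and test the
# proportional hazards assumption on the training data
df_check = data_train.copy()
df_check['broken'] = df_check['broken'].astype(int)  # lifelines expects a numeric event column

cph_check = CoxPHFitter(penalizer=0.01)
cph_check.fit(df_check, duration_col='lifetime', event_col='broken')
cph_check.check_assumptions(df_check, p_value_threshold=0.05)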

Performance Metric

We can evaluate a survival analysis model using two metrics, namely the Concordance Index (C-index) and the Integrated Brier Score (IBS).

  1. C-index indicates the model's ability to correctly rank survival times based on individual risk scores. The C-index is a generalization of the AUC. The closer the C-index is to one, the better the model predicts, whereas a value of 0.5 represents a random prediction (see the toy sketch after this list).

  2. IBS measures the average squared difference between event labels and predicted survival probabilities over time. As a benchmark, a good model will have a Brier score below 0.25. IBS values always range between 0 and 1, with 0 being the best value.
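
As a toy illustration of the C-index (hypothetical numbers), we can use lifelines' concordance_index utility: a model that assigns higher risk to machines that fail earlier gets a C-index close to 1. Note that lifelines expects survival-time-like scores, so we negate the risk scores; this is separate from the pysurvival metric used below.

from lifelines.utils import concordance_index as lifelines_cindex

toy_T    = np.array([10, 20, 30, 40])      # observed durations
toy_E    = np.array([1, 1, 0, 1])          # 1 = event, 0 = censored
toy_risk = np.array([0.9, 0.7, 0.4, 0.2])  # predicted risk (higher = riskier)

# perfectly concordant ranking, so this returns 1.0
lifelines_cindex(toy_T, -toy_risk, toy_E)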

from pysurvival.utils.metrics import concordance_index
from pysurvival.utils.display import integrated_brier_score, compare_to_actual


def evaluate(model, X, T, E, model_name=""):
    errors = compare_to_actual(model, X, T, E, is_at_risk=False)
    c_index = concordance_index(model, X, T, E)
    ibs = integrated_brier_score(model, X, T, E)
    metrics = {'C-index': c_index, 'IBS': ibs}
    eval_df = pd.DataFrame(data={**metrics, **errors}, index=[model_name])
    return eval_df.rename(columns={'root_mean_squared_error': 'RMSE',
                                   'median_absolute_error': 'MADE',
                                   'mean_absolute_error': 'MAE'})
eval_coxph = evaluate(coxph, X_test, T_test, E_test, model_name="Cox PH")
eval_coxph
C-index IBS RMSE MADE MAE
Cox PH 0.962313 0.020073 22.858351 19.684227 18.125846

Multi Task Logistic Regression Models

The Multi Task Logistic Regression (MTLR) model was first introduced by Chun-Nam Yu et al. in 2011 to predict survival times of cancer patients. It is an alternative to the Cox Proportional Hazards model, which has several shortcomings; one of them is that the Cox model works through the hazard function rather than the survival function directly, so it is less able to provide accurate predictions of life expectancy.

MTLR has two approaches for its implementation of survival analysis modeling. The first model is Linear-Multi Task Logistic Regression and the second model is Neural-Multi Task Logistic Regression. The following is an implementation of the two models:

Linear-Multi Task Logistic Regression

Linear-Multi Task Logistic Regression is a set of logistic regression models built at different time intervals to estimate the probability of events occurring in that time span.

Several stages for the Linear MTLR process can be defined as follows:

  1. Divide the time variable into certain intervals
  2. Build a logistic regression model at each time interval
  3. Calculating the loss function to determine the optimum model

In this modeling stage, we create an MTLR model with 100 time bins, i.e., 100 logistic regression models. We use the adamax optimizer and 50 epochs to train the model.

from pysurvival.models.multi_task import LinearMultiTaskModel
l_mtlr = LinearMultiTaskModel(bins=100)
l_mtlr.fit(X_train, T_train, E_train, lr=1e-3, l2_reg=1e-6, l2_smooth=1e-6,
           init_method='orthogonal', optimizer='adamax', num_epochs=50)

LinearMultiTaskModel
eval_l_mtlr = evaluate(l_mtlr, X_test, T_test, E_test, model_name = "Linear MTLR")
eval_l_mtlr
C-index IBS RMSE MADE MAE
Linear MTLR 0.938971 0.044073 5.185306 0.711647 2.276439

The result above shows that the Linear MTLR model has a C-index of 0.938971 and an IBS of 0.044073. Since the C-index is close to 1 and the IBS is close to 0, the model can be said to have very good prediction results.

Neural-Multi Task Logistic Regression

Neural-Multi Task Logistic Regression (N-MTLR) is a model developed from the MTLR model above. The previous two models (CoxPH and linear MTLR) fail to capture non-linear patterns in the data and consequently fall short in certain cases. N-MTLR is supported by a deep learning architecture and can overcome the shortcomings of the previous models.

In the N-MTLR model, there are two improvements from the previous MTLR model:

  1. N-MTLR uses a deep learning framework through multi-layer perceptron (MLP). By replacing the linear core, this model is more flexible because it will not rely on assumptions like the CoxPH model.

  2. The model is implemented in Python using the open-source TensorFlow and Keras libraries, allowing users to combine many deep learning techniques such as:

  • Initialization: Xavier uniform, Xavier gaussian, etc
  • Optimization: Adam, RMSprop, etc
  • Activation function: SeLU, ReLU, Softplus, tanh, etc
  • Miscellaneous Operation: Batch Normalization, Dropout, etc
from pysurvival.models.multi_task import NeuralMultiTaskModel

# simple structure
structure = [{'activation': 'ReLU', 'num_units': 100}, ]

# fitting the model
n_mtlr = NeuralMultiTaskModel(structure=structure, bins=100)
n_mtlr.fit(X_train, T_train, E_train, lr=1e-3, num_epochs=500,
           l2_reg=1e-6, l2_smooth=1e-6,
           init_method='orthogonal', optimizer='rprop')

NeuralMultiTaskModel( Layer(1): activation = ReLU, units = 100 )
eval_mtlr_1 = evaluate(n_mtlr, X_test, T_test, E_test, model_name = "Neural MTLR 1-hidden layer")
eval_mtlr_1
C-index IBS RMSE MADE MAE
Neural MTLR 1-hidden layer 0.853441 0.003763 2.275712 3.622909e-20 0.871271

We can also inspect the loss value at each epoch using the display_loss_values function:

from pysurvival.utils.display import display_loss_values
display_loss_values(n_mtlr, figure_size=(7, 4))

Model Comparison

Up to this stage, we have built 3 different models, namely the CoxPH, L-MTLR, and N-MTLR. The performance of each of these models can be described in the table below.

eval_all = pd.concat([eval_coxph, eval_l_mtlr, eval_mtlr_1])
eval_all
C-index IBS RMSE MADE MAE
Cox PH 0.962313 0.020073 22.858351 1.968423e+01 18.125846
Linear MTLR 0.938971 0.044073 5.185306 7.116467e-01 2.276439
Neural MTLR 1-hidden layer 0.853441 0.003763 2.275712 3.622909e-20 0.871271

Referring to the C-index, the CoxPH model has the best performance among the three, at about 0.96. However, looking at the errors, this model actually has the highest RMSE. So we can say that the CoxPH model is not very good at predicting the actual failure times, since its errors are much larger than those of the other models.

Just like the CoxPH model, the L-MTLR model also has a fairly good C-index, but its errors are still larger than those of the N-MTLR model. So in this case, we choose the N-MTLR model as the best model: its C-index is still quite good and it has the smallest errors compared to the other models.

best_model = n_mtlr

Results

Risk Score

In the model comparison, we concluded that N-MTLR is the best model. In this section, we'll use it for individual prediction and for grouping machines based on their risk scores.

As a first step, we calculate the risk score of each machine. This risk score will later be used for grouping, whether by the score distribution or by other grouping methods.

risk_profile = X_test.copy()
risk_profile['risk_score'] = best_model.predict_risk(risk_profile, use_log=True)
risk_profile.head()
moistureInd pressureInd provider_Provider2 provider_Provider3 provider_Provider4 team_TeamB team_TeamC temperatureInd risk_score
0 102.846778 102.709720 0 0 0 0 0 108.324299 5.858471
1 102.892983 96.436456 0 1 0 0 1 114.853392 7.807072
2 106.214236 68.850651 0 0 0 1 0 115.711970 6.776986
3 104.742757 112.893971 0 1 0 0 0 107.360631 7.345346
4 96.371860 75.997301 1 0 0 1 0 98.934087 6.183013

We group the risk_score using one-dimensional K-Means and obtain 3 clusters, namely low, medium, and high risk. Since the C-index is quite high and the model error is low, the model can be considered quite good at ranking the survival times of random individuals in each group, so that $t_{high} < t_{medium} < t_{low}$.

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=SEED).fit(risk_profile[['risk_score']])
risk_profile['risk_group'] = kmeans.labels_

risk_group_bound = risk_profile.groupby('risk_group')['risk_score'].min().sort_values().to_frame()
risk_group_bound.index = ['low', 'medium', 'high']
risk_group_bound.columns = ['lower_bound']
risk_group_bound['upper_bound'] = risk_group_bound['lower_bound'].shift(periods=-1).fillna(risk_profile['risk_score'].max())
risk_group_bound['color'] = ['red', 'green', 'blue']
risk_group_bound
lower_bound upper_bound color
low 5.373734 6.090510 red
medium 6.090510 7.324216 green
high 7.324216 8.157679 blue

We can illustrate the risk groups using the following histogram:

from pysurvival.utils.display import create_risk_groups

risk_groups = create_risk_groups(
    model=best_model, X=X_test, num_bins=50,
    **risk_group_bound.T.to_dict())

Predict Survival Function

Here are the survival functions of all records with broken status in X_test, for each risk group:

fig, axes = plt.subplots(3, 1, figsize=(15, 10))

for i, (ax, (label, (color, idxs))) in enumerate(zip(axes.flat, risk_groups.items())):
    X = X_test.values[idxs, :]
    T = T_test[idxs]
    E = E_test[idxs]

    broken = np.argwhere((E == 1)).flatten()
    for j in broken:
        survival = best_model.predict_survival(X[j, :]).flatten()
        ax.plot(best_model.times, survival, color=color)

    ax.set_title(f"{label.title()} Risk")

plt.show()

Instead of looking at all the survival functions, let's take one random observation from each group that has experienced failure (event = 1).

plt.figure(figsize=(15, 5))

for i, (label, (color, idxs)) in enumerate(risk_groups.items()):
    record_idx = X_test.iloc[idxs, :].index
    X = X_test.values[idxs, :]
    T = T_test[idxs]
    E = E_test[idxs]

    # choose a machine at random that has experienced failure
    choices = np.argwhere((E == 1)).flatten()
    k = np.random.choice(choices, 1)[0]

    # predict survival function
    survival = best_model.predict_survival(X[k, :]).flatten()
    plt.plot(best_model.times, survival,
             color=color, label=f'Record {record_idx[k]} ({label} risk)')

    # actual failure time
    actual_t = T[k]
    plt.axvline(x=actual_t, color=color, ls='--')
    plt.annotate(f'T={actual_t:.1f}',
                 xy=(actual_t, 0.5*(1+0.2*i)),
                 xytext=(actual_t, 0.5*(1+0.2*i)),
                 fontsize=12)

plt.title("Survival Functions Comparison between High, Medium, Low Risk Machine")
plt.legend()
plt.show()

The plot above shows the survival functions of the three risk groups, taking one random machine record from each. As can be seen, the model successfully anticipates the broken event: there is a sudden decrease in the survival value around the vertical dotted line, which marks the real broken time.

Optional: Mathematical proof for hazard function

It is defined that $h(t) = \displaystyle{\lim_{dt \to 0} \frac{P(t \leq T < t+dt\ |\ T \geq t)}{dt}}$

According to the conditional probability rule $P(A|B) = \dfrac{P(A \cap B)}{P(B)}$ then:

$h(t) = \displaystyle{\lim_{dt \to 0} \frac{P(\{t \leq T < t+dt\} \cap \{T \geq t\})}{P(T \geq t)\ dt}}$

  • Since $\{t \leq T < t+dt\}$ is a subset of $\{T \geq t\}$, then $P(\{t \leq T < t+dt\} \cap \{T \geq t\}) = P(t \leq T < t+dt)$
  • Definition of CDF: $P(t \leq T < t+dt) = F_T(t+dt) - F_T(t)$
  • Definition of survival function: $P(T \geq t) = S(t)$

From the three definitions above: $h(t) = \dfrac{1}{S(t)}\ \displaystyle{\lim_{dt \to 0} \frac{F_T(t+dt) - F_T(t)}{dt}}$

According to the definition of the derivative: $h(t) = \dfrac{1}{S(t)}\ \dfrac{d}{dt}F_T(t)$

Using the relationship between PDF and CDF we get the first equation: $h(t) = \dfrac{f_T(t)}{S(t)}$

Using the relationship between the CDF and the survival function: $h(t) = \dfrac{1}{S(t)}\ \dfrac{d}{dt}[1 - S(t)] = -\dfrac{S'(t)}{S(t)}$

According to the derivative of the logarithmic function $\dfrac{d}{dx} \ln f(x) = \dfrac{f'(x)}{f(x)}$, we get the second equation: $h(t) = - \dfrac{d}{dt} \ln S(t)$

References