探索酒店预订取消的原因#

Screenshot%20from%202020-09-29%2019-08-50.png

我们考虑哪些因素导致酒店预订被取消。本分析基于 Antonio, Almeida 和 Nunes (2019) 的酒店预订数据集。该数据集在 GitHub 上位于 rfordatascience/tidytuesday

预订取消可能有多种原因。客户可能要求了无法提供的服务(例如,停车场),客户后来可能发现酒店不符合他们的要求,或者客户可能仅仅取消了整个行程。其中一些因素,如停车场,是酒店可以采取行动的,而另一些因素,如行程取消,则超出酒店的控制范围。无论如何,我们都希望更好地了解这些因素中哪些会导致预订取消。

找出真相的黄金标准是使用实验,例如**随机对照试验**,其中每个客户被随机分配到两个类别之一,即每个客户要么被分配一个停车位,要么不分配。然而,这样的实验可能成本过高,在某些情况下也可能不道德(例如,如果人们知道酒店是随机分配给不同服务水平的,酒店的声誉就会开始受损)。

我们能否仅使用观测数据或过去收集的数据来回答我们的问题呢?

[1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import dowhy

数据描述#

读者可在此快速浏览特征及其描述:rfordatascience/tidytuesday

[2]:
dataset = pd.read_csv('https://raw.githubusercontent.com/Sid-darthvader/DoWhy-The-Causal-Story-Behind-Hotel-Booking-Cancellations/master/hotel_bookings.csv')
dataset.head()
[2]:
hotel is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults ... deposit_type agent company days_in_waiting_list customer_type adr required_car_parking_spaces total_of_special_requests reservation_status reservation_status_date
0 Resort Hotel 0 342 2015 July 27 1 0 0 2 ... No Deposit NaN NaN 0 Transient 0.0 0 0 Check-Out 2015-07-01
1 Resort Hotel 0 737 2015 July 27 1 0 0 2 ... No Deposit NaN NaN 0 Transient 0.0 0 0 Check-Out 2015-07-01
2 Resort Hotel 0 7 2015 July 27 1 0 1 1 ... No Deposit NaN NaN 0 Transient 75.0 0 0 Check-Out 2015-07-02
3 Resort Hotel 0 13 2015 July 27 1 0 1 1 ... No Deposit 304.0 NaN 0 Transient 75.0 0 0 Check-Out 2015-07-02
4 Resort Hotel 0 14 2015 July 27 1 0 2 2 ... No Deposit 240.0 NaN 0 Transient 98.0 0 1 Check-Out 2015-07-03

5 行 × 32 列

[3]:
dataset.columns
[3]:
Index(['hotel', 'is_canceled', 'lead_time', 'arrival_date_year',
       'arrival_date_month', 'arrival_date_week_number',
       'arrival_date_day_of_month', 'stays_in_weekend_nights',
       'stays_in_week_nights', 'adults', 'children', 'babies', 'meal',
       'country', 'market_segment', 'distribution_channel',
       'is_repeated_guest', 'previous_cancellations',
       'previous_bookings_not_canceled', 'reserved_room_type',
       'assigned_room_type', 'booking_changes', 'deposit_type', 'agent',
       'company', 'days_in_waiting_list', 'customer_type', 'adr',
       'required_car_parking_spaces', 'total_of_special_requests',
       'reservation_status', 'reservation_status_date'],
      dtype='object')

特征工程#

让我们创建一些新的有意义的特征,以降低数据集的维度。- **总入住天数 (Total Stay)** = stays_in_weekend_nights + stays_in_week_nights - **入住人数 (Guests)** = adults + children + babies - **分配不同房间 (Different_room_assigned)** = 如果 reserved_room_type 和 assigned_room_type 不同则为 1,否则为 0。

[4]:
# Total stay in nights
dataset['total_stay'] = dataset['stays_in_week_nights']+dataset['stays_in_weekend_nights']
# Total number of guests
dataset['guests'] = dataset['adults']+dataset['children'] +dataset['babies']
# Creating the different_room_assigned feature
dataset['different_room_assigned']=0
slice_indices =dataset['reserved_room_type']!=dataset['assigned_room_type']
dataset.loc[slice_indices,'different_room_assigned']=1
# Deleting older features
dataset = dataset.drop(['stays_in_week_nights','stays_in_weekend_nights','adults','children','babies'
                        ,'reserved_room_type','assigned_room_type'],axis=1)
dataset.columns
[4]:
Index(['hotel', 'is_canceled', 'lead_time', 'arrival_date_year',
       'arrival_date_month', 'arrival_date_week_number',
       'arrival_date_day_of_month', 'meal', 'country', 'market_segment',
       'distribution_channel', 'is_repeated_guest', 'previous_cancellations',
       'previous_bookings_not_canceled', 'booking_changes', 'deposit_type',
       'agent', 'company', 'days_in_waiting_list', 'customer_type', 'adr',
       'required_car_parking_spaces', 'total_of_special_requests',
       'reservation_status', 'reservation_status_date', 'total_stay', 'guests',
       'different_room_assigned'],
      dtype='object')

我们还移除了其他包含 NULL 值或具有太多唯一值(例如,agent ID)的列。我们还将 country 列的缺失值用出现次数最多的国家填充。我们移除了 distribution_channel,因为它与 market_segment 的重叠度很高。

[5]:
dataset.isnull().sum() # Country,Agent,Company contain 488,16340,112593 missing entries
dataset = dataset.drop(['agent','company'],axis=1)
# Replacing missing countries with most freqently occuring countries
dataset['country']= dataset['country'].fillna(dataset['country'].mode()[0])
[6]:
dataset = dataset.drop(['reservation_status','reservation_status_date','arrival_date_day_of_month'],axis=1)
dataset = dataset.drop(['arrival_date_year'],axis=1)
dataset = dataset.drop(['distribution_channel'], axis=1)
[7]:
# Replacing 1 by True and 0 by False for the experiment and outcome variables
dataset['different_room_assigned']= dataset['different_room_assigned'].replace(1,True)
dataset['different_room_assigned']= dataset['different_room_assigned'].replace(0,False)
dataset['is_canceled']= dataset['is_canceled'].replace(1,True)
dataset['is_canceled']= dataset['is_canceled'].replace(0,False)
dataset.dropna(inplace=True)
print(dataset.columns)
dataset.iloc[:, 5:20].head(100)
Index(['hotel', 'is_canceled', 'lead_time', 'arrival_date_month',
       'arrival_date_week_number', 'meal', 'country', 'market_segment',
       'is_repeated_guest', 'previous_cancellations',
       'previous_bookings_not_canceled', 'booking_changes', 'deposit_type',
       'days_in_waiting_list', 'customer_type', 'adr',
       'required_car_parking_spaces', 'total_of_special_requests',
       'total_stay', 'guests', 'different_room_assigned'],
      dtype='object')
[7]:
meal country market_segment is_repeated_guest previous_cancellations previous_bookings_not_canceled booking_changes deposit_type days_in_waiting_list customer_type adr required_car_parking_spaces total_of_special_requests total_stay guests
0 BB PRT Direct 0 0 0 3 No Deposit 0 Transient 0.00 0 0 0 2.0
1 BB PRT Direct 0 0 0 4 No Deposit 0 Transient 0.00 0 0 0 2.0
2 BB GBR Direct 0 0 0 0 No Deposit 0 Transient 75.00 0 0 1 1.0
3 BB GBR Corporate 0 0 0 0 No Deposit 0 Transient 75.00 0 0 1 1.0
4 BB GBR Online TA 0 0 0 0 No Deposit 0 Transient 98.00 0 1 2 2.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
95 BB PRT Online TA 0 0 0 0 No Deposit 0 Transient 73.80 0 1 2 2.0
96 BB PRT Online TA 0 0 0 0 No Deposit 0 Transient 117.00 0 1 7 2.0
97 HB ESP Offline TA/TO 0 0 0 0 No Deposit 0 Transient 196.54 0 1 7 3.0
98 BB PRT Online TA 0 0 0 0 No Deposit 0 Transient 99.30 1 2 7 3.0
99 BB DEU Direct 0 0 0 0 No Deposit 0 Transient 90.95 0 0 7 2.0

100 行 × 15 列

[8]:
dataset = dataset[dataset.deposit_type=="No Deposit"]
dataset.groupby(['deposit_type','is_canceled']).count()
[8]:
hotel lead_time arrival_date_month arrival_date_week_number meal country market_segment is_repeated_guest previous_cancellations previous_bookings_not_canceled booking_changes days_in_waiting_list customer_type adr required_car_parking_spaces total_of_special_requests total_stay guests different_room_assigned
deposit_type is_canceled
No Deposit False 74947 74947 74947 74947 74947 74947 74947 74947 74947 74947 74947 74947 74947 74947 74947 74947 74947 74947 74947
True 29690 29690 29690 29690 29690 29690 29690 29690 29690 29690 29690 29690 29690 29690 29690 29690 29690 29690 29690
[9]:
dataset_copy = dataset.copy(deep=True)

计算预期计数#

由于取消次数和分配不同房间的次数严重不平衡,我们首先随机选择 1000 个观测值,看看变量“is_cancelled”和“different_room_assigned”取相同值的情况有多少。然后将整个过程重复 10000 次,预期计数接近 50%(即这两个变量随机取相同值的概率)。所以从统计学上讲,现阶段我们没有明确的结论。因此,分配与客户之前预订时预留的房间类型不同的房间,可能导致也可能不导致他/她取消该预订。

[10]:
counts_sum=0
for i in range(1,10000):
        counts_i = 0
        rdf = dataset.sample(1000)
        counts_i = rdf[rdf["is_canceled"]== rdf["different_room_assigned"]].shape[0]
        counts_sum+= counts_i
counts_sum/10000
[10]:
$\displaystyle 588.7789$

现在我们考虑没有预订更改的情况,并重新计算预期计数。

[11]:
# Expected Count when there are no booking changes
counts_sum=0
for i in range(1,10000):
        counts_i = 0
        rdf = dataset[dataset["booking_changes"]==0].sample(1000)
        counts_i = rdf[rdf["is_canceled"]== rdf["different_room_assigned"]].shape[0]
        counts_sum+= counts_i
counts_sum/10000
[11]:
$\displaystyle 572.7711$

在第二种情况下,我们考虑有预订更改(>0)的情况,并重新计算预期计数。

[12]:
# Expected Count when there are booking changes = 66.4%
counts_sum=0
for i in range(1,10000):
        counts_i = 0
        rdf = dataset[dataset["booking_changes"]>0].sample(1000)
        counts_i = rdf[rdf["is_canceled"]== rdf["different_room_assigned"]].shape[0]
        counts_sum+= counts_i
counts_sum/10000
[12]:
$\displaystyle 665.5664$

当预订更改数量非零时,肯定发生了一些变化。这给我们一个提示,**预订更改**可能正在影响房间取消。

但是,**预订更改**是唯一的混杂变量吗?如果存在一些我们数据中没有捕捉到的未观测混杂因素(特征),我们是否仍然能够做出与之前相同的声明?

使用 DoWhy 估计因果效应#

步骤 1. 创建因果图#

使用假设将您对预测建模问题的先验知识表示为 CI 图。不用担心,您现阶段不需要指定完整的图。即使是部分图也足够了,其余的可以由 *DoWhy* 推断出来 ;-)

以下是将假设转化为因果图的列表:-

  • **市场细分**有 2 个级别,“TA”指“旅行社”,“TO”指“旅游运营商”,因此它应该影响提前预订时间(即预订和入住之间的天数)。

  • **国家**也会影响一个人是否提前预订(因此有更长的**提前预订时间**)以及一个人会偏好哪种**餐食**。

  • **提前预订时间**肯定会影响**等待天数**(预订越晚,找到预订的机会越少)。此外,更长的**提前预订时间**也可能导致**取消**。

  • **等待天数**、总入住天数(**总入住天数**)和**入住人数**可能会影响预订是否被取消或保留。

  • **以前未取消的预订**会影响客户是否为重复客户。此外,这两个变量都会影响预订是否被**取消**(例如,一个过去保留了 5 次预订的客户更有可能保留这次预订。同样,一个已经取消过预订的人更有可能重复同样的行为)。

  • **预订更改**会影响客户是否被分配**不同房间**,这可能也会导致**取消**。

  • 最后,**预订更改**作为唯一影响**处理**和**结果**的变量,这种情况极不可能发生,并且可能存在一些**未观测混杂因素**,我们数据中没有捕捉到关于它们的信息。

[13]:
import pygraphviz
causal_graph = """digraph {
different_room_assigned[label="Different Room Assigned"];
is_canceled[label="Booking Cancelled"];
booking_changes[label="Booking Changes"];
previous_bookings_not_canceled[label="Previous Booking Retentions"];
days_in_waiting_list[label="Days in Waitlist"];
lead_time[label="Lead Time"];
market_segment[label="Market Segment"];
country[label="Country"];
U[label="Unobserved Confounders",observed="no"];
is_repeated_guest;
total_stay;
guests;
meal;
hotel;
U->{different_room_assigned,required_car_parking_spaces,guests,total_stay,total_of_special_requests};
market_segment -> lead_time;
lead_time->is_canceled; country -> lead_time;
different_room_assigned -> is_canceled;
country->meal;
lead_time -> days_in_waiting_list;
days_in_waiting_list ->{is_canceled,different_room_assigned};
previous_bookings_not_canceled -> is_canceled;
previous_bookings_not_canceled -> is_repeated_guest;
is_repeated_guest -> {different_room_assigned,is_canceled};
total_stay -> is_canceled;
guests -> is_canceled;
booking_changes -> different_room_assigned; booking_changes -> is_canceled;
hotel -> {different_room_assigned,is_canceled};
required_car_parking_spaces -> is_canceled;
total_of_special_requests -> {booking_changes,is_canceled};
country->{hotel, required_car_parking_spaces,total_of_special_requests};
market_segment->{hotel, required_car_parking_spaces,total_of_special_requests};
}"""

这里,**处理**是指分配与客户预订时预留的房间类型相同的房间。**结果**是指预订是否被取消。**共同原因**表示我们认为对**结果**和**处理**具有因果效应的变量。根据我们的因果假设,满足此标准的两个变量是**预订更改**和**未观测混杂因素**。因此,如果我们没有明确指定图(不推荐!),也可以将这些作为参数提供给下面的函数。

为了帮助识别因果效应,我们从图中移除了未观测的混杂因素节点。(要检查,您可以使用原始图并运行以下代码。identify_effect 方法将发现无法识别效应。)

[14]:
model= dowhy.CausalModel(
        data = dataset,
        graph=causal_graph.replace("\n", " "),
        treatment="different_room_assigned",
        outcome='is_canceled')
model.view_model()
from IPython.display import Image, display
display(Image(filename="causal_model.png"))
/__w/dowhy/dowhy/dowhy/causal_model.py:583: UserWarning: 1 variables are assumed unobserved because they are not in the dataset. Configure the logging level to `logging.WARNING` or higher for additional details.
  warnings.warn(
../_images/example_notebooks_DoWhy-The_Causal_Story_Behind_Hotel_Booking_Cancellations_25_1.png
../_images/example_notebooks_DoWhy-The_Causal_Story_Behind_Hotel_Booking_Cancellations_25_2.png

步骤 2. 识别因果效应#

我们说如果改变**处理**会导致**结果**发生变化,同时保持其他所有条件不变,则处理导致结果。因此,在此步骤中,通过使用因果图的属性,我们识别了要估计的因果效应。

[15]:
#Identify the causal effect
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
print(identified_estimand)
Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
Estimand expression:
            d                                                                  ↪
──────────────────────────(E[is_canceled|days_in_waiting_list,booking_changes, ↪
d[different_room_assigned]                                                     ↪

↪                                                                              ↪
↪ lead_time,total_stay,guests,total_of_special_requests,required_car_parking_s ↪
↪                                                                              ↪

↪
↪ paces,hotel,is_repeated_guest])
↪
Estimand assumption 1, Unconfoundedness: If U→{different_room_assigned} and U→is_canceled then P(is_canceled|different_room_assigned,days_in_waiting_list,booking_changes,lead_time,total_stay,guests,total_of_special_requests,required_car_parking_spaces,hotel,is_repeated_guest,U) = P(is_canceled|different_room_assigned,days_in_waiting_list,booking_changes,lead_time,total_stay,guests,total_of_special_requests,required_car_parking_spaces,hotel,is_repeated_guest)

### Estimand : 2
Estimand name: iv
No such variable(s) found!

### Estimand : 3
Estimand name: frontdoor
No such variable(s) found!

步骤 3. 估计已识别的估计量#

[16]:
estimate = model.estimate_effect(identified_estimand,
                                 method_name="backdoor.propensity_score_weighting",target_units="ate")
# ATE = Average Treatment Effect
# ATT = Average Treatment Effect on Treated (i.e. those who were assigned a different room)
# ATC = Average Treatment Effect on Control (i.e. those who were not assigned a different room)
print(estimate)
*** Causal Estimate ***

## Identified estimand
Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
Estimand expression:
            d                                                                  ↪
──────────────────────────(E[is_canceled|days_in_waiting_list,booking_changes, ↪
d[different_room_assigned]                                                     ↪

↪                                                                              ↪
↪ lead_time,total_stay,guests,total_of_special_requests,required_car_parking_s ↪
↪                                                                              ↪

↪
↪ paces,hotel,is_repeated_guest])
↪
Estimand assumption 1, Unconfoundedness: If U→{different_room_assigned} and U→is_canceled then P(is_canceled|different_room_assigned,days_in_waiting_list,booking_changes,lead_time,total_stay,guests,total_of_special_requests,required_car_parking_spaces,hotel,is_repeated_guest,U) = P(is_canceled|different_room_assigned,days_in_waiting_list,booking_changes,lead_time,total_stay,guests,total_of_special_requests,required_car_parking_spaces,hotel,is_repeated_guest)

## Realized estimand
b: is_canceled~different_room_assigned+days_in_waiting_list+booking_changes+lead_time+total_stay+guests+total_of_special_requests+required_car_parking_spaces+hotel+is_repeated_guest
Target units: ate

## Estimate
Mean value: -0.2622285649999514

/github/home/.cache/pypoetry/virtualenvs/dowhy-oN2hW5jr-py3.8/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:460: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.cn/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.cn/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

结果令人惊讶。这意味着分配不同的房间会**降低**取消的几率。这里还有更多值得深入探讨:这是正确的因果效应吗?是否可能只有在预订的房间不可用时才会分配不同的房间,因此分配不同的房间对客户产生积极影响(而不是不分配房间)?

可能还有其他机制在起作用。也许分配不同的房间只发生在入住时,而客户到达酒店后取消的可能性很低?在这种情况下,图缺少关于这些事件**何时**发生的关键变量。 different_room_assigned 是否主要发生在预订当天?了解该变量可以帮助改进图和我们的分析。

尽管早期的关联分析表明 is_canceleddifferent_room_assigned 之间存在正相关,但使用 DoWhy 估计因果效应呈现了不同的情况。这意味着减少酒店中 different_room_assigned 数量的决策/政策可能是适得其反的。

步骤 4. 反驳结果#

请注意,因果部分并非来自数据。它来自您的**假设**,这些假设导致了**识别**。数据仅用于统计**估计**。因此,验证我们的假设在第一步中是否正确变得至关重要!

当存在另一个共同原因时会发生什么?当处理本身是安慰剂时会发生什么?

方法 1#

**随机共同原因:-** *向数据添加随机抽取的协变量,并重新运行分析,以查看因果估计是否发生变化。如果我们的假设最初是正确的,那么因果估计不应该有太大变化。*

[17]:
refute1_results=model.refute_estimate(identified_estimand, estimate,
        method_name="random_common_cause")
print(refute1_results)
Refute: Add a random common cause
Estimated effect:-0.2622285649999514
New effect:-0.26222856499995134
p value:1.0

方法 2#

**安慰剂处理反驳器:-** *随机将任何协变量指定为处理并重新运行分析。如果我们的假设是正确的,那么新发现的估计应该趋于 0。*

[18]:
refute2_results=model.refute_estimate(identified_estimand, estimate,
        method_name="placebo_treatment_refuter")
print(refute2_results)
Refute: Use a Placebo Treatment
Estimated effect:-0.2622285649999514
New effect:0.05420218503316096
p value:0.0

方法 3#

**数据子集反驳器:-** *创建数据子集(类似于交叉验证),并检查因果估计是否在子集之间变化。如果我们的假设是正确的,则不应该有太大变化。*

[19]:
refute3_results=model.refute_estimate(identified_estimand, estimate,
        method_name="data_subset_refuter")
print(refute3_results)
Refute: Use a subset of data
Estimated effect:-0.2622285649999514
New effect:-0.26239836091544644
p value:0.88

我们可以看到,我们的估计通过了所有三个反驳测试。这并不能证明其正确性,但增强了对估计值的信心。