https://github.com/davidhintelmann/portugal-hotels

Dataset with two hotels, one in Lisbon other is in in the Algarve region.
https://github.com/davidhintelmann/portugal-hotels
Last synced: about 1 year ago
JSON representation
Dataset with two hotels, one in Lisbon other is in in the Algarve region.
Host: GitHub
URL: https://github.com/davidhintelmann/portugal-hotels
Owner: davidhintelmann
Created: 2020-09-17T21:42:29.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2020-10-11T03:49:25.000Z (over 5 years ago)
Last Synced: 2025-02-03T20:03:24.090Z (over 1 year ago)
Language: Jupyter Notebook
Size: 2.8 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project

README

          # Portugal Hotels Booking Demand

# Summary

This jupyter notebook has one dataset that has two different hotels from Portugal. This data will be analyzed to find any trends or patterns with guests booking into either hotel to try and find a way to minimize the amount of canceled bookings. A machine learning model will also be developed to attempt at predicting if a guest will cancel there booking before checking in.

There is a resort hotel in this dataset, found in the Algarve region of Portugal (southern Portugal), and a city hotel found in the captial Lisbon. Data was acquired directly from hotel's Property Managment System (PMS) SQL according to the paper which the data is originally from. The article is called, "Hotel Booking Demand Datasets", written by Nuno Antonio, Ana Almeida, and Luis Nunes for Data in Brief, Volume 22, February 2019. Found at https://www.sciencedirect.com/science/article/pii/S2352340918315191#bib5.

Dataset can also be found on Kaggle at https://www.kaggle.com/jessemostipak/hotel-booking-demand

## Import Python Librarys and Modules

```python

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import LogisticRegression

from matplotlib import rcParams

rcParams['figure.figsize'] = 10,8

sns.set_theme()

```

## Reading and cleaning data

```python

df = pd.read_csv('hotel_bookings.csv')

```

```python

df.head()

```



  

    

      

      hotel

      is_canceled

      lead_time

      arrival_date_year

      arrival_date_month

      arrival_date_week_number

      arrival_date_day_of_month

      stays_in_weekend_nights

      stays_in_week_nights

      adults

      ...

      deposit_type

      agent

      company

      days_in_waiting_list

      customer_type

      adr

      required_car_parking_spaces

      total_of_special_requests

      reservation_status

      reservation_status_date

    

  

  

    

      0

      Resort Hotel

      0

      342

      2015

      July

      27

      1

      0

      0

      2

      ...

      No Deposit

      NaN

      NaN

      0

      Transient

      0.0

      0

      0

      Check-Out

      2015-07-01

    

    

      1

      Resort Hotel

      0

      737

      2015

      July

      27

      1

      0

      0

      2

      ...

      No Deposit

      NaN

      NaN

      0

      Transient

      0.0

      0

      0

      Check-Out

      2015-07-01

    

    

      2

      Resort Hotel

      0

      7

      2015

      July

      27

      1

      0

      1

      1

      ...

      No Deposit

      NaN

      NaN

      0

      Transient

      75.0

      0

      0

      Check-Out

      2015-07-02

    

    

      3

      Resort Hotel

      0

      13

      2015

      July

      27

      1

      0

      1

      1

      ...

      No Deposit

      304.0

      NaN

      0

      Transient

      75.0

      0

      0

      Check-Out

      2015-07-02

    

    

      4

      Resort Hotel

      0

      14

      2015

      July

      27

      1

      0

      2

      2

      ...

      No Deposit

      240.0

      NaN

      0

      Transient

      98.0

      0

      1

      Check-Out

      2015-07-03

    

  

5 rows × 32 columns



```python

df.info()

```

    

    RangeIndex: 119390 entries, 0 to 119389

    Data columns (total 32 columns):

     #   Column                          Non-Null Count   Dtype  

    ---  ------                          --------------   -----  

     0   hotel                           119390 non-null  object 

     1   is_canceled                     119390 non-null  int64  

     2   lead_time                       119390 non-null  int64  

     3   arrival_date_year               119390 non-null  int64  

     4   arrival_date_month              119390 non-null  object 

     5   arrival_date_week_number        119390 non-null  int64  

     6   arrival_date_day_of_month       119390 non-null  int64  

     7   stays_in_weekend_nights         119390 non-null  int64  

     8   stays_in_week_nights            119390 non-null  int64  

     9   adults                          119390 non-null  int64  

     10  children                        119386 non-null  float64

     11  babies                          119390 non-null  int64  

     12  meal                            119390 non-null  object 

     13  country                         118902 non-null  object 

     14  market_segment                  119390 non-null  object 

     15  distribution_channel            119390 non-null  object 

     16  is_repeated_guest               119390 non-null  int64  

     17  previous_cancellations          119390 non-null  int64  

     18  previous_bookings_not_canceled  119390 non-null  int64  

     19  reserved_room_type              119390 non-null  object 

     20  assigned_room_type              119390 non-null  object 

     21  booking_changes                 119390 non-null  int64  

     22  deposit_type                    119390 non-null  object 

     23  agent                           103050 non-null  float64

     24  company                         6797 non-null    float64

     25  days_in_waiting_list            119390 non-null  int64  

     26  customer_type                   119390 non-null  object 

     27  adr                             119390 non-null  float64

     28  required_car_parking_spaces     119390 non-null  int64  

     29  total_of_special_requests       119390 non-null  int64  

     30  reservation_status              119390 non-null  object 

     31  reservation_status_date         119390 non-null  object 

    dtypes: float64(4), int64(16), object(12)

    memory usage: 29.1+ MB

We can see above there are many columns with missing values, and this will be addressed below

---

```python

df['country'].isna().value_counts()

```

    False    118902

    True        488

    Name: country, dtype: int64

Dropping the agent column entirely since it only has a integer value for the listing agent and no information about the company the agent works for or the country of origin for the agency.

There were 4 four rows with NaN value in the column 'children' and it has been assumed that these rooms did not have any children and a value of zero has been put there it its place. Its dtype was then converted to 'int64'.

Any rows that did not have a country of origin has been dropped as this seems to be questionable data (though these guests could likely be from Portugal and simpy did not enter their country of origin). There were only 488 rows dropped.

One row has a value for 'adr' greater than 4000. This means the average daily rate as defined by dividing the sum of all lodging transactions by the total number of staying nights was greater than €4,000. This is only on one row  and has been dropped since it is an extreme outlier.

Some rooms did not have any adults or children registered for that booking, and is likely some data had been incomplete when being filled in, same with rows that did not have a country of origin.

Lastly, all null values for 'company' column has been filled in with integer value zero.

```python

df.drop('agent', axis=1, inplace=True)

df.loc[(df[df['children'].isna()].index.values),'children'] = 0

df.children = df.children.astype('int64')

df.drop(df[df['country'].isna()].index.values,axis=0, inplace=True)

df.drop(df[df['adr'] > 4000].index.values,axis=0, inplace=True) # add to note above

df.drop(df[df['adults']==0].index.values,axis=0, inplace=True)

df['company'].fillna(value=0,inplace=True)

```

# EDA & Visualizing hotel data

We will begin by talking about each column in more depth:

* **hotel**- Either resort hotel, Algarve, or City hotel, Lisbon.  

* **is_canceled**- If a guest has cancelled a booking or not before checking into a hotel, value of 1 or 0 respectively.  

* **lead_time**- The day a guest made their booking, ie number of days before guest is expected to arrive.  

* **arrive_date_year**- The year the guest is expected to arrive at a hotel, from 2016-2017.  

* **arrival_date_month**- The month of the year the guest is expected to arrive at a hotel.  

* **arrival_date_week_number**- The week number (52 weeks in a year) the guest is expected to arrive at a hotel. 

* **arrival_date_day_of_month**- The day of the month the guest is expected to arrive at a hotel.  

* **stays_in_weekend_nights**- The number of nights the guest is going to stay during the weekend.  

* **stays_in_week_nights**- The number of nigths the guest is going to stay during the week.  

* **adults**- Number of adults booked to stay in the room for the duration of their time in the hotel.  

* **children**- Number of children booked to stay in the room for the duration of their time in the hotel.  

* **babies**- Number of babies booked to stay in a room for the duration of their time in the hotel.  

* **meal**- Type of meal booked. Categories are presented in standard hospitality meal packages: 

    * Undefined/SC – no meal package  

    * BB – Bed & Breakfast  

    * HB – Half board (breakfast and one other meal – usually dinner)  

    * FB – Full board (breakfast, lunch and dinner)  

* **country**- Country of origin. Categories are represented in the ISO 3155–3:2013 format.  

* **market_segment**- Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”.  

* **distribution_channel**-Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”.  

* **is_repeated_guest**- Value indicating if the booking name was from a repeated guest (1) or not (0).  

* **previous_cancellations**- Number of previous bookings that were cancelled by the customer prior to the current booking.  

* **previous_bookings_not_canceled**- Number of previous bookings not cancelled by the customer prior to the current booking.  

* **reserved_room_type**- Code of room type reserved. Code is presented instead of designation for anonymity reasons.  

* **assigned_room_type**- Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons.  

* **booking_changes**- Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation.  

* **deposit_type**- Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories:  

    * No Deposit – no deposit was made  

    * Non Refund – a deposit was made in the value of the total stay cost  

    * Refundable – a deposit was made with a value under the total cost of stay. 

* **agent**- ID of the travel agency that made the booking.  

* **company**- ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons.  

* **days_in_waiting_list**- Number of days the booking was in the waiting list before it was confirmed to the customer.  

* **customer_type**- Type of booking, assuming one of four categories:

    * Contract - when the booking has an allotment or other type of contract associated to it

    * Group – when the booking is associated to a group

    * Transient – when the booking is not part of a group or contract, and is not associated to other transient booking  

    * Transient-party – when the booking is transient, but is associated to at least other transient booking  

* **adr**- Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights.  

* **required_car_parking_spaces**- Number of car parking spaces required by the customer.

* **total_of_special_requests**- Number of special requests made by the customer (e.g. twin bed or high floor).  

* **reservation_status**- Reservation last status, assuming one of three categories:  

    * Canceled – booking was canceled by the customer

    * Check-Out – customer has checked in but already departed

    * No-Show – customer did not check-in and did inform the hotel of the reason why  

* **reservation_status_date**- Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel

## Resort & City Hotel

Below we see in figure 1 that there are many more city hotel bookings then there are resort hotel bookings in this dataset.

```python

sns.countplot(x='hotel', data=df, palette='Set2');

plt.title('Number of bookings for Resort and City hotel')

txt_1='Fig.1 - Resort hotel is in Algarve region of Portugal and the city hotel is in Lisbon, the capital of Portugal'

plt.figtext(0.5, -0.1, txt_1, wrap=True, horizontalalignment='center', fontsize=12);

```

![png](img/output_18_0.png)

```python

round(df['is_canceled'].value_counts()[1]/df['is_canceled'].value_counts()[0],4)*100

```

    58.84

```python

sns.countplot(x='is_canceled', data=df);

plt.title('Number of bookings canceled for both hotels')

txt_2='Fig.2 - About 58.85% of bookings were canceled, from both hotels, before the guests checked in.'

plt.figtext(0.5, -0.1, txt_2, wrap=True, horizontalalignment='center', fontsize=12);

```

![png](img/output_20_0.png)

```python

round(df[df['hotel'] == 'City Hotel']['is_canceled'].value_counts()[1]/df[df['hotel'] == 'City Hotel']['is_canceled'].value_counts()[0],4)*100

```

    71.61

```python

sns.countplot(x='is_canceled', data=df[df['hotel'] == 'City Hotel'], palette='rocket');

plt.title('Number of bookings canceled for Lisbon Hotel')

txt_3='Fig.3 - About 71.61% of bookings were canceled for Lisbon hotel before the guests checked in.'

plt.figtext(0.5, -0.1, txt_3, wrap=True, horizontalalignment='center', fontsize=12);

```

![png](img/output_22_0.png)

```python

round(df[df['hotel'] == 'Resort Hotel']['is_canceled'].value_counts()[1]/df[df['hotel'] == 'Resort Hotel']['is_canceled'].value_counts()[0],4)*100

```

    38.43

```python

sns.countplot(x='is_canceled', data=df[df['hotel'] == 'Resort Hotel'], palette='mako');

plt.title('Number of bookings canceled for Algarve Hotel')

txt_4='Fig.4 - About 38.43% of bookings were canceled for Algarve hotel before the guests checked in.'

plt.figtext(0.5, -0.1, txt_4, wrap=True, horizontalalignment='center', fontsize=12);

```

![png](img/output_24_0.png)

Most of the bookings for this dataset are from the city hotel (Lisbon) which also has a higher chance of a guest canceling before they check in at 71.61%, where as the resort hotel (Algarve) only has around 38.43% chance of a guest canceling there booking. This suggests most people are looking around at multiple hotels to stay at in a city, but slightly more committed to pulling trigger for a resort hotel. That being said we would need more hotels from cities and then more resort hotels to confirm this theory.

### Countries Analysis

Now we will investigate what country has the most guests booking rooms, and if some countries guests are more likely to cancel.

```python

df['country'].value_counts().head(10).plot.bar();

plt.title('Number of Bookings from Top 10 Countries for Both Hotels')

txt_5='Fig.5 - Most of the bookings are clearly from Portugal'

plt.figtext(0.5, -0.01, txt_5, wrap=True, horizontalalignment='center', fontsize=12);

```

![png](img/output_28_0.png)

We can see in Figure 5 that most guests are coming from this host country, Portugal, at 56.76%. Only one country is on the list that is not in Europe and that is Brazil.

```python

tmp = df.groupby('country')['is_canceled'].sum()/df.groupby('country')['is_canceled'].count()

tmp.sort_values(ascending=False).loc[df['country'].value_counts().head(10).index.values]

```

    country

    PRT    0.567580

    GBR    0.202313

    FRA    0.185813

    ESP    0.254271

    DEU    0.167102

    ITA    0.353945

    IRL    0.246291

    BEL    0.202494

    BRA    0.372514

    NLD    0.182426

    Name: is_canceled, dtype: float64

```python

df[df['hotel'] == 'City Hotel']['country'].value_counts().head(10).plot.bar();

plt.title('Number of Bookings from Top 10 Countries - Lisbon')

txt_6='Fig.6 - City hotel bookings from top 10 countries.'

plt.figtext(0.5, -0.01, txt_6, wrap=True, horizontalalignment='center', fontsize=12);

```

![png](img/output_31_0.png)

We can see in Figure 6 that most guests are also from Portugal when staying in Lisbon.

```python

tmp = df[df['hotel'] == 'City Hotel'].copy()

tmp_ = tmp.groupby('country')['is_canceled'].sum()/tmp.groupby('country')['is_canceled'].count()

tmp_.sort_values(ascending=False).loc[tmp['country'].value_counts().head(10).index.values]

```

    country

    PRT    0.650777

    FRA    0.195870

    DEU    0.176170

    GBR    0.294407

    ESP    0.288017

    ITA    0.378986

    BEL    0.219382

    BRA    0.405724

    USA    0.264633

    NLD    0.206329

    Name: is_canceled, dtype: float64

```python

df[df['hotel'] == 'Resort Hotel']['country'].value_counts().head(10).plot.bar();

plt.title('Number of Bookings from Top 10 Countries - Algarve')

txt_7='Fig.7 - Resort hotel bookings from top 10 countries staying.'

plt.figtext(0.5, -0.01, txt_7, wrap=True, horizontalalignment='center', fontsize=12);

```

![png](img/output_34_0.png)

We can see in Figure 6 that most guests are also from Portugal when staying in Lisbon.

```python

tmp = df[df['hotel'] == 'Resort Hotel'].copy()

tmp_ = tmp.groupby('country')['is_canceled'].sum()/tmp.groupby('country')['is_canceled'].count()

tmp_.sort_values(ascending=False).loc[tmp['country'].value_counts().head(10).index.values]

```

    country

    PRT    0.422086

    GBR    0.130779

    ESP    0.215116

    IRL    0.199446

    FRA    0.131056

    DEU    0.121363

    CN     0.135211

    NLD    0.108949

    USA    0.150313

    ITA    0.174292

    Name: is_canceled, dtype: float64

```python

sns.set(rc={'figure.figsize':(20,16)})

df_tmp = df[df['arrival_date_year'] == 2015]

sns.barplot(x='arrival_date_month',y='lead_time',hue='is_canceled',data=df_tmp);

plt.title('Lead time for All Bookings for Each Month for the Year 2015')

txt_8='Fig.8 - Number of days guests booked in advance for each month in the year 2015. Each month has data for number of guests who cancelled and how many checked in.'

plt.figtext(0.5, 0.05, txt_8, wrap=True, horizontalalignment='center', fontsize=12);

```

![png](img/output_37_0.png)

```python

df_tmp.groupby('arrival_date_month')[['is_canceled','lead_time']].mean().sort_values(by='is_canceled', ascending=False)

```



    .dataframe tbody tr th:only-of-type {

        vertical-align: middle;

    }

    .dataframe tbody tr th {

        vertical-align: top;

    }

    .dataframe thead th {

        text-align: right;

    }

  

    

      

      is_canceled

      lead_time

    

    

      arrival_date_month

      

      

    

  

  

    

      July

      0.455664

      126.365545

    

    

      August

      0.412174

      99.364457

    

    

      September

      0.408733

      123.068253

    

    

      October

      0.348851

      102.595650

    

    

      December

      0.335517

      52.683793

    

    

      November

      0.209483

      48.476724

    

  



```python

df_tmp = df[df['arrival_date_year'] == 2016]

sns.barplot(x='arrival_date_month',y='lead_time',hue='is_canceled',data=df_tmp);

plt.title('Lead time for All Bookings for Each Month for the Year 2016')

txt_9='Fig.9 - Number of days guests booked in advance for each month in the year 2016. Each month has data for number of guests who cancelled and how many checked in.'

plt.figtext(0.5, 0.05, txt_9, wrap=True, horizontalalignment='center', fontsize=12);

```

![png](img/output_39_0.png)

```python

df_tmp.groupby('arrival_date_month')[['is_canceled','lead_time']].mean().sort_values(by='is_canceled', ascending=False)

```



    .dataframe tbody tr th:only-of-type {

        vertical-align: middle;

    }

    .dataframe tbody tr th {

        vertical-align: top;

    }

    .dataframe thead th {

        text-align: right;

    }

  

    

      

      is_canceled

      lead_time

    

    

      arrival_date_month

      

      

    

  

  

    

      October

      0.406736

      140.018620

    

    

      June

      0.396780

      120.053409

    

    

      April

      0.381014

      86.188379

    

    

      September

      0.375627

      149.689578

    

    

      November

      0.368682

      91.964350

    

    

      December

      0.363114

      90.105016

    

    

      August

      0.360902

      121.638306

    

    

      May

      0.350348

      114.914197

    

    

      February

      0.346383

      39.144672

    

    

      July

      0.327988

      123.523506

    

    

      March

      0.308271

      57.713659

    

    

      January

      0.251467

      32.959819

    

  



```python

df_tmp = df[df['arrival_date_year'] == 2017]

sns.barplot(x='arrival_date_month',y='lead_time',hue='is_canceled',data=df_tmp);

plt.title('Lead time for All Bookings for Each Month for the Year 2017')

txt_10='Fig.10 - Number of days guests booked in advance for each month in the year 2017. Each month has data for number of guests who cancelled and how many checked in.'

plt.figtext(0.5, 0.05, txt_10, wrap=True, horizontalalignment='center', fontsize=12);

```

![png](img/output_41_0.png)

```python

df_tmp.groupby('arrival_date_month')[['is_canceled','lead_time']].mean().sort_values(by='is_canceled', ascending=False)

```



    .dataframe tbody tr th:only-of-type {

        vertical-align: middle;

    }

    .dataframe tbody tr th {

        vertical-align: top;

    }

    .dataframe thead th {

        text-align: right;

    }

  

    

      

      is_canceled

      lead_time

    

    

      arrival_date_month

      

      

    

  

  

    

      May

      0.437510

      120.224933

    

    

      April

      0.434852

      103.667789

    

    

      June

      0.431911

      136.141491

    

    

      July

      0.373424

      152.974026

    

    

      August

      0.368731

      137.798579

    

    

      January

      0.341350

      53.410768

    

    

      March

      0.337710

      82.825288

    

    

      February

      0.327076

      56.514801

    

  



```python

first_year = len(df[df['arrival_date_year'] == 2015])

second_year = len(df[df['arrival_date_year'] == 2016])

third_year = len(df[df['arrival_date_year'] == 2017])

print('Number of guests who stayed in either hotel is {}, in the year 2015'.format(first_year))

print('Number of guests who stayed in either hotel is {}, in the year 2016'.format(second_year))

print('Number of guests who stayed in either hotel is {}, in the year 2017'.format(third_year))

```

    Number of guests who stayed in either hotel is 21863, in the year 2015

    Number of guests who stayed in either hotel is 56435, in the year 2016

    Number of guests who stayed in either hotel is 40604, in the year 2017

The data is not equally distributed throughout the years, and we note in the plots above that year 2016 is the only year with data from all 12 months. 

```python

sns.displot(x='lead_time', hue='is_canceled', multiple="stack", kde=True, data=df[df['arrival_date_year'] == 2015], height=10, aspect=16/10);

```

![png](img/output_45_0.png)

```python

sns.displot(x='lead_time', hue='is_canceled', multiple="stack", kde=True, data=df[df['arrival_date_year'] == 2016], height=10, aspect=16/10);

```

![png](img/output_46_0.png)

```python

sns.displot(x='lead_time', hue='is_canceled', multiple="stack", kde=True, data=df[df['arrival_date_year'] == 2017], height=10, aspect=16/10);

```

![png](img/output_47_0.png)

```python

sns.boxplot(x='assigned_room_type',y='adr',data=df);

```

![png](img/output_48_0.png)

```python

df_adr = df.copy()

```

```python

df_adr['adr_adj'] = df_adr['adr']/(df_adr['adults']+df_adr['children'])

df_adr['adr_adj_wb'] = df_adr['adr']/(df_adr['adults']+df_adr['children']+df_adr['babies'])

df_adr.drop(df_adr[df_adr['adr_adj']>400].index.values,axis=0,inplace=True)

```

```python

sns.boxplot(x='assigned_room_type',y='adr_adj',data=df_adr);

```

![png](img/output_51_0.png)

```python

sns.countplot(data=df[~df['company'].isna()], x="is_canceled")

```

    

![png](img/output_52_1.png)

```python

sns.countplot(data=df[df['company'].isna()], x="is_canceled")

```

    

![png](img/output_53_1.png)

```python

df[~df['company'].isna()].is_canceled.value_counts() #show guests staying on behalf of a company or organization

```

    0    5435

    1    1167

    Name: is_canceled, dtype: int64

```python

x = df[~df['company'].isna()].is_canceled.value_counts()[1]/df[~df['company'].isna()].is_canceled.value_counts()[0]

round(x*100,2)

```

    21.47

We see above that only 21.47% of guests cancel when they register with a company or organization.

```python

df[df['company'].isna()].is_canceled.value_counts() #guests on vacation

```

    0    69016

    1    42890

    Name: is_canceled, dtype: int64

```python

x = df[df['company'].isna()].is_canceled.value_counts()[1]/df[df['company'].isna()].is_canceled.value_counts()[0]

round(x*100,2)

```

    62.15

When a guest is booking to stay for a personal vaction they have a 62.15% of canceling a reservation.

## Dummy Variables

```python

df_columns = ['hotel','arrival_date_month','meal','country','market_segment','distribution_channel',

              'reserved_room_type','assigned_room_type','deposit_type','customer_type']

df_ = df.copy()

df_.drop(['reservation_status','reservation_status_date'],axis=1,inplace=True)

data = pd.get_dummies(df_, prefix=df_columns, columns=df_columns)

```

# Models

```python

X_train, X_test, Y_train, Y_test= train_test_split(data.drop('is_canceled',axis=1), data['is_canceled'], random_state=42, test_size=0.2)

```

## Random Forest Classifier

```python

rf = RandomForestClassifier()

parameter_rf = {

    'n_estimators':[10,50,100,150,200],

    'criterion':('gini','entropy'),

    'max_depth':[None,1,2,3,4,5],

    'min_samples_split':[2,3,4],

    'min_samples_leaf':[1,2,3]

}

clf_rf = GridSearchCV(rf, parameters_rf, cv=5, verbose=10, n_jobs=-1)

clf_rf.fit(X_train, Y_train)

```

```python

rf_tmp = RandomForestClassifier()

#cv_results = cross_validate(rf_tmp, X_train, Y_train, cv=5, verbose=10, n_jobs=-1)

rf_tmp.fit(X_train,Y_train)

```

    RandomForestClassifier()

```python

rf_tmp.score(X_test,Y_test)

```

    0.8906843304362501

```python

feats = {}

for feature, importance in zip(data.drop('is_canceled',axis=1).columns, rf_tmp.feature_importances_):

    feats[feature] = importance

importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'importances'}).sort_values(by='importances', ascending=False)

importances.head(10)

```



  

    

      

      importances

    

  

  

    

      lead_time

      0.102637

    

    

      deposit_type_Non Refund

      0.080077

    

    

      adr

      0.071284

    

    

      deposit_type_No Deposit

      0.057994

    

    

      country_PRT

      0.057188

    

    

      arrival_date_day_of_month

      0.053326

    

    

      total_of_special_requests

      0.052536

    

    

      arrival_date_week_number

      0.046571

    

    

      stays_in_week_nights

      0.037800

    

    

      previous_cancellations

      0.028475

    

  



## Logisitic Regression

```python

lr = LogisticRegression()

"""parameter_lr = {

    'penalty':('l2', 'none'),

    'tol':[1e-5,1e-4,1e-3],

    'C':[0.1,1.0,2.0],

    'solver':('lbfgs','sag','saga'),

    'max_iter':[1000]

}"""

parameter_lr = {

    'penalty':['none'], 

    'tol':[1e-4],

    'C':[1.0],

    'solver':['sag'],

    'max_iter':[1000]

}

clf_lr = GridSearchCV(lr, parameter_lr, cv=5, verbose=10, n_jobs=-1)

clf_lr.fit(X_train, Y_train)

```

    Fitting 5 folds for each of 1 candidates, totalling 5 fits

    [Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.

    [Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:  5.5min remaining:  8.2min

    [Parallel(n_jobs=-1)]: Done   3 out of   5 | elapsed:  5.5min remaining:  3.6min

    [Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  5.5min remaining:    0.0s

    [Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  5.5min finished

    /Users/DavidH/anaconda2/envs/py382/lib/python3.8/site-packages/sklearn/linear_model/_sag.py:329: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge

      warnings.warn("The max_iter was reached which means "

    GridSearchCV(cv=5, estimator=LogisticRegression(), n_jobs=-1,

                 param_grid={'C': [1.0], 'max_iter': [1000], 'penalty': ['none'],

                             'solver': ['sag'], 'tol': [0.0001]},

                 verbose=10)

```python

clf_lr.best_params_

```

    {'C': 1.0, 'max_iter': 1000, 'penalty': 'none', 'solver': 'sag', 'tol': 0.0001}

```python

clf_lr.best_score_

```

    0.804579898626818

```python

clf_lr.best_estimator_.score(X_test,Y_test)

```

    0.8094675554805502
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/davidhintelmann/portugal-hotels

Awesome Lists containing this project

README