New-York-Taxi-Demand-Prediction

Source: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page

The data used in the attached datasets were collected and provided to the NYC Taxi and Limousine Commission (TLC). Note that we consider the 2016 dataset: we experimented on the Jan 2016, Feb 2016 and March 2016 data.

Medium Blog:

https://medium.com/@passionatedevs/taxi-demand-prediction-on-time-series-data-with-holt-winter-forecasting-loss-0-02-2bcdeec48499

Information on taxis:

Yellow Taxi: Yellow Medallion Taxicabs

These are the famous NYC yellow taxis that provide transportation exclusively through street-hails. The number of taxicabs is limited by a finite number of medallions issued by the TLC. You access this mode of transportation by standing in the street and hailing an available taxi with your hand. The pickups are not pre-arranged.

For Hire Vehicles (FHVs)

FHV transportation is accessed by a pre-arrangement with a dispatcher or limo company. These FHVs are not permitted to pick up passengers via street hails, as those rides are not considered pre-arranged.

Green Taxi: Street Hail Livery (SHL)

The SHL program will allow livery vehicle owners to license and outfit their vehicles with green borough taxi branding, meters, credit card machines, and ultimately the right to accept street hails in addition to pre-arranged rides.

Credits: Quora

Footnote:

In the given notebook we consider only the yellow taxis, for the time periods Jan - Mar 2015 and Jan - Mar 2016.

Data Collection

We collected all yellow taxi trip data from Jan 2015 to Dec 2016 (we will be using only the 2015 data).

file name	file size	number of records	number of features
yellow_tripdata_2016-01	1.59 GB	10906858	19
yellow_tripdata_2016-02	1.66 GB	11382049	19
yellow_tripdata_2016-03	1.78 GB	12210952	19
yellow_tripdata_2016-04	1.74 GB	11934338	19
yellow_tripdata_2016-05	1.73 GB	11836853	19
yellow_tripdata_2016-06	1.62 GB	11135470	19
yellow_tripdata_2016-07	884 MB	10294080	17
yellow_tripdata_2016-08	854 MB	9942263	17
yellow_tripdata_2016-09	870 MB	10116018	17
yellow_tripdata_2016-10	933 MB	10854626	17
yellow_tripdata_2016-11	868 MB	10102128	17
yellow_tripdata_2016-12	897 MB	10449408	17
yellow_tripdata_2015-01	1.84 GB	12748986	19
yellow_tripdata_2015-02	1.81 GB	12450521	19
yellow_tripdata_2015-03	1.94 GB	13351609	19
yellow_tripdata_2015-04	1.90 GB	13071789	19
yellow_tripdata_2015-05	1.91 GB	13158262	19
yellow_tripdata_2015-06	1.79 GB	12324935	19
yellow_tripdata_2015-07	1.68 GB	11562783	19
yellow_tripdata_2015-08	1.62 GB	11130304	19
yellow_tripdata_2015-09	1.63 GB	11225063	19
yellow_tripdata_2015-10	1.79 GB	12315488	19
yellow_tripdata_2015-11	1.65 GB	11312676	19
yellow_tripdata_2015-12	1.67 GB	11460573	19

Problem statement :

Predict taxi demand in New York City for the next 10 minutes.

ML Problem Formulation

Time-series forecasting and Regression

- Find the number of pickups, given location coordinates (latitude and longitude) and time, in the query region and the surrounding regions.

To solve this, we use the data collected in Jan - Mar 2015 to predict the pickups in Jan - Mar 2016.

Performance metrics

Mean Absolute Percentage Error (MAPE).
Mean Squared Error (MSE).
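
As a quick illustration of the two metrics, here is a minimal sketch in plain Python. It uses the textbook definitions, and the function names are hypothetical. The choice to skip bins whose actual count is zero (where the percentage error is undefined) is an assumption, not necessarily what the notebook does.

```python
def mape(actual, predicted):
    """Mean absolute percentage error as a fraction (0.13 means 13%).

    Assumption: points where the actual value is 0 are skipped, because
    the percentage error is undefined there.
    """
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    return sum(abs(a - p) / abs(a) for a, p in pairs) / len(pairs)


def mse(actual, predicted):
    """Mean squared error over all points."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Toy pickup counts for three time bins (made-up numbers).
actual = [100, 200, 50]
predicted = [110, 180, 55]
print("MAPE:", mape(actual, predicted))
print("MSE:", mse(actual, predicted))
```

MAPE is scale-free, which makes it convenient for comparing regions with very different demand levels, while MSE penalizes large misses more heavily.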

Data Cleaning

All of the columns below contain outliers:

Pickup Latitude and Pickup Longitude
Dropoff Latitude & Dropoff Longitude
Trip Durations
Speed
Trip Distance
Total Fare

When we plot the BoxCox plot, we observe outliers in the columns above, so we simply remove them.
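
The exact cut-offs are not listed here, so the following is only a sketch of what such a filter could look like. The bounding box, every threshold, and the field names (`pickup_lat`, `duration_min`, and so on) are illustrative assumptions, not the values used in the notebook.

```python
# Assumed bounding box roughly covering New York City (illustrative values).
LAT_MIN, LAT_MAX = 40.5774, 40.9176
LON_MIN, LON_MAX = -74.15, -73.7004


def in_nyc(lat, lon):
    """True if the coordinate falls inside the assumed bounding box."""
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX


def is_clean(trip, max_minutes=720, max_speed_mph=45, max_miles=25, max_fare=1000):
    """True if the trip passes every sanity check (all thresholds are assumptions)."""
    if not in_nyc(trip["pickup_lat"], trip["pickup_lon"]):
        return False
    if not in_nyc(trip["dropoff_lat"], trip["dropoff_lon"]):
        return False
    if not 0 < trip["duration_min"] <= max_minutes:
        return False
    if not 0 < trip["distance_miles"] <= max_miles:
        return False
    # Speed is derived from distance and duration.
    speed_mph = trip["distance_miles"] / (trip["duration_min"] / 60)
    if speed_mph > max_speed_mph:
        return False
    return 0 < trip["total_fare"] <= max_fare


trips = [
    # A plausible Manhattan trip: 3 miles in 15 minutes.
    {"pickup_lat": 40.75, "pickup_lon": -73.99, "dropoff_lat": 40.76,
     "dropoff_lon": -73.98, "duration_min": 15, "distance_miles": 3.0,
     "total_fare": 14.5},
    # Pickup coordinates of (0, 0): outside the bounding box.
    {"pickup_lat": 0.0, "pickup_lon": 0.0, "dropoff_lat": 40.76,
     "dropoff_lon": -73.98, "duration_min": 15, "distance_miles": 3.0,
     "total_fare": 14.5},
    # 20 miles in 1 minute: impossible speed.
    {"pickup_lat": 40.75, "pickup_lon": -73.99, "dropoff_lat": 40.76,
     "dropoff_lon": -73.98, "duration_min": 1, "distance_miles": 20.0,
     "total_fare": 60.0},
]
cleaned = [t for t in trips if is_clean(t)]
print(len(cleaned), "of", len(trips), "trips kept")
```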

Data Preprocess

Segmentation

We divide the whole city into 40 segments, based on the frequency of taxi demand. If a segment has a large number of trips, its area is relatively small; if it has few trips, its area is large.

We used the KMeans algorithm to segment the whole of New York City.
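
To show the idea, here is a minimal k-means (Lloyd's algorithm) sketch in plain Python, so that it runs without dependencies. In practice a library implementation such as scikit-learn's `KMeans` would be the usual choice. The helper names and the toy coordinates are hypothetical, and `k=2` is used only to keep the example small (the project uses 40 segments).

```python
import random


def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: (point[0] - centers[i][0]) ** 2 + (point[1] - centers[i][1]) ** 2,
    )


def kmeans(points, k, iterations=20, seed=0):
    """Cluster (lat, lon) points into k groups and return the k centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                lat = sum(m[0] for m in members) / len(members)
                lon = sum(m[1] for m in members) / len(members)
                new_centers.append((lat, lon))
            else:
                new_centers.append(centers[i])  # keep an empty cluster's center
        if new_centers == centers:  # assignments no longer change
            break
        centers = new_centers
    return centers


# Toy pickups: three around one spot, three around another (made-up coordinates).
pickups = [
    (40.750, -73.990), (40.751, -73.991), (40.749, -73.989),
    (40.640, -73.780), (40.641, -73.781), (40.639, -73.779),
]
centers = kmeans(pickups, k=2)
regions = [nearest(p, centers) for p in pickups]  # region id per pickup
print(sorted(centers))
```

Because the centers are pulled toward where pickups are dense, busy areas end up covered by many small regions and quiet areas by a few large ones, which matches the segmentation described above.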

Time Bin

Just as we divided the whole city into segments, we also divided time into bins.
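
The bin width is not stated here. The sketch below assumes 10-minute bins, taken from the 10-minute horizon in the problem statement, and counts pickups per (region, time bin) pair. The function name and the sample records are hypothetical.

```python
from collections import Counter
from datetime import datetime

BIN_MINUTES = 10  # assumption: matches the 10-minute forecasting horizon


def time_bin(pickup_time, start):
    """Index of the bin that pickup_time falls into, counted from start."""
    return int((pickup_time - start).total_seconds() // (BIN_MINUTES * 60))


start = datetime(2016, 1, 1)

# (region id, pickup timestamp) pairs; made-up records.
pickups = [
    (0, datetime(2016, 1, 1, 0, 3)),
    (0, datetime(2016, 1, 1, 0, 9, 59)),
    (0, datetime(2016, 1, 1, 0, 10)),
    (1, datetime(2016, 1, 1, 0, 25)),
]

# Demand = number of pickups in each (region, bin) cell.
demand = Counter((region, time_bin(t, start)) for region, t in pickups)
print(demand)
```

With 10-minute bins, a 31-day month has 31 * 24 * 6 = 4464 bins per region.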

Baseline Model

Since we are dealing with time-series data, we first consider simple statistical models as baselines:

Simple Moving Average
Weighted Moving Average
Exponential Weighted Moving Averages
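
The three baselines can be sketched in a few lines of plain Python. Each function predicts the next bin's pickup count from the history of one region. The window size, the linear weights, and `alpha` are illustrative assumptions rather than the tuned values from the notebook.

```python
def simple_moving_average(history, window=3):
    """Predict the next value as the plain mean of the last `window` values."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def weighted_moving_average(history, window=3):
    """Like the simple average, but newer values get linearly larger weights."""
    recent = history[-window:]
    weights = range(1, len(recent) + 1)  # oldest gets 1, newest gets the largest
    return sum(w * x for w, x in zip(weights, recent)) / sum(weights)


def exponential_moving_average(history, alpha=0.5):
    """Blend each new value into a running estimate with weight alpha."""
    estimate = history[0]
    for x in history[1:]:
        estimate = alpha * x + (1 - alpha) * estimate
    return estimate


counts = [10, 20, 30, 40]  # made-up pickup counts for four consecutive bins
print(simple_moving_average(counts))       # mean of 20, 30, 40
print(weighted_moving_average(counts))     # (1*20 + 2*30 + 3*40) / 6
print(exponential_moving_average(counts))
```

On a rising series like this one, the weighted and exponential variants track the most recent values more closely than the simple average.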

Train and Test Datasets

Because this is a time-series problem, we cannot shuffle the data. The first 70% of the data is used for training and the remaining 30% for testing.
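
A chronological split can be sketched as follows. The function name is hypothetical, and the series is assumed to be already sorted by time.

```python
def chronological_split(series, train_percent=70):
    """Split a time-ordered sequence: earliest part for training, rest for testing."""
    cut = len(series) * train_percent // 100
    return series[:cut], series[cut:]


bins = list(range(10))  # stand-in for ten consecutive time bins
train, test = chronological_split(bins)
print(train)
print(test)
```

Keeping the order intact means every test point lies strictly after every training point, so the model is never evaluated on data from before what it was trained on.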

Machine Learning Models

Exponential Model MAPE : 0.1349

Linear Regression MAPE : 0.13468

Random Forest MAPE: 0.13438

XGBoost MAPE : 0.1392

Feature Engineering

Holt-Winters forecasting

We then added "Triple Exponential Smoothing" (Holt-Winters) forecasts as a feature and trained Random Forest and XGBoost again.

Random Forest MAPE: 0.04657
XGBoost MAPE : 0.02932
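
For readers unfamiliar with the technique, here is a minimal sketch of additive triple exponential smoothing in plain Python. The initialisation scheme, the parameter values, and the toy series are assumptions for illustration, and how the notebook feeds the forecast into the models is not shown here. The idea is that the one-step-ahead forecast for each time bin can serve as an extra input feature.

```python
def initial_trend(series, season_len):
    """Average per-step change between the first two seasons."""
    total = 0.0
    for i in range(season_len):
        total += (series[i + season_len] - series[i]) / season_len
    return total / season_len


def initial_seasonals(series, season_len):
    """Average deviation of each seasonal position from its season's mean."""
    n_seasons = len(series) // season_len
    means = [
        sum(series[season_len * j: season_len * (j + 1)]) / season_len
        for j in range(n_seasons)
    ]
    return [
        sum(series[season_len * j + i] - means[j] for j in range(n_seasons)) / n_seasons
        for i in range(season_len)
    ]


def holt_winters(series, season_len, alpha, beta, gamma, n_forecast):
    """Return one-step-ahead forecasts for every point in series,
    followed by n_forecast values beyond its end.
    Needs at least two full seasons of data."""
    seasonals = initial_seasonals(series, season_len)
    level = sum(series[:season_len]) / season_len  # assumed init: first-season mean
    trend = initial_trend(series, season_len)
    result = []
    for i in range(len(series) + n_forecast):
        s = i % season_len
        if i >= len(series):
            # Beyond the data: extrapolate the last level and trend.
            steps = i - len(series) + 1
            result.append(level + steps * trend + seasonals[s])
            continue
        # Forecast made before the value at position i is seen.
        result.append(level + trend + seasonals[s])
        value = series[i]
        last_level = level
        level = alpha * (value - seasonals[s]) + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        seasonals[s] = gamma * (value - level) + (1 - gamma) * seasonals[s]
    return result


# Toy demand with a repeating pattern of length 4 (made-up numbers).
series = [10, 20, 30, 20] * 4
forecast = holt_winters(series, season_len=4, alpha=0.5, beta=0.5, gamma=0.5, n_forecast=4)
print(forecast[-4:])  # the four values predicted beyond the end of the series
```

The three smoothing parameters (`alpha` for level, `beta` for trend, `gamma` for seasonality) are the hyperparameters that need tuning.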

Conclusion

Holt-Winters forecasting (triple exponential smoothing) is a tremendously useful feature for this task, and tuning its hyperparameters is crucial. It reduces MAPE to the 0.02 range (0.02932 with XGBoost), which is an excellent result.
Without the Holt-Winters feature, MAPE did not drop below about 0.13 (13%), even with an LSTM model.
