Source: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
The data used in the attached datasets were collected and provided to the NYC Taxi and Limousine Commission (TLC). Note that we consider the 2016 dataset: we have experimented on the Jan 2016, Feb 2016 and March 2016 data.
These are the famous NYC yellow taxis that provide transportation exclusively through street-hails. The number of taxicabs is limited by a finite number of medallions issued by the TLC. You access this mode of transportation by standing in the street and hailing an available taxi with your hand. The pickups are not pre-arranged.
FHV transportation is accessed by a pre-arrangement with a dispatcher or limo company. These FHVs are not permitted to pick up passengers via street hails, as those rides are not considered pre-arranged.
The SHL program will allow livery vehicle owners to license and outfit their vehicles with green borough taxi branding, meters, credit card machines, and ultimately the right to accept street hails in addition to pre-arranged rides.
Credits: Quora
In this notebook we consider only the yellow taxis, for the periods Jan - Mar 2015 and Jan - Mar 2016. We have collected all yellow taxi trip data from Jan 2015 to Dec 2016 (will be using only 2015 data).
file name | file size | number of records | number of features |
---|---|---|---|
yellow_tripdata_2016-01 | 1.59 GB | 10906858 | 19 |
yellow_tripdata_2016-02 | 1.66 GB | 11382049 | 19 |
yellow_tripdata_2016-03 | 1.78 GB | 12210952 | 19 |
yellow_tripdata_2016-04 | 1.74 GB | 11934338 | 19 |
yellow_tripdata_2016-05 | 1.73 GB | 11836853 | 19 |
yellow_tripdata_2016-06 | 1.62 GB | 11135470 | 19 |
yellow_tripdata_2016-07 | 884 MB | 10294080 | 17 |
yellow_tripdata_2016-08 | 854 MB | 9942263 | 17 |
yellow_tripdata_2016-09 | 870 MB | 10116018 | 17 |
yellow_tripdata_2016-10 | 933 MB | 10854626 | 17 |
yellow_tripdata_2016-11 | 868 MB | 10102128 | 17 |
yellow_tripdata_2016-12 | 897 MB | 10449408 | 17 |
yellow_tripdata_2015-01 | 1.84 GB | 12748986 | 19 |
yellow_tripdata_2015-02 | 1.81 GB | 12450521 | 19 |
yellow_tripdata_2015-03 | 1.94 GB | 13351609 | 19 |
yellow_tripdata_2015-04 | 1.90 GB | 13071789 | 19 |
yellow_tripdata_2015-05 | 1.91 GB | 13158262 | 19 |
yellow_tripdata_2015-06 | 1.79 GB | 12324935 | 19 |
yellow_tripdata_2015-07 | 1.68 GB | 11562783 | 19 |
yellow_tripdata_2015-08 | 1.62 GB | 11130304 | 19 |
yellow_tripdata_2015-09 | 1.63 GB | 11225063 | 19 |
yellow_tripdata_2015-10 | 1.79 GB | 12315488 | 19 |
yellow_tripdata_2015-11 | 1.65 GB | 11312676 | 19 |
yellow_tripdata_2015-12 | 1.67 GB | 11460573 | 19 |
Predicting taxi demand in New York City for the next 10 minutes
Time-series forecasting and Regression
- To find the number of pickups, given location coordinates (latitude and longitude) and time, in the query region and surrounding regions.
To solve the above, we use data collected in Jan - Mar 2015 to predict the pickups in Jan - Mar 2016.
- Mean Absolute percentage error.
- Mean Squared error.
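As a reference for the two metrics above, here is a minimal sketch using their textbook definitions. The function names and the sample numbers are made up for illustration, and the notebook may compute MAPE slightly differently (some implementations divide the mean absolute error by the mean of the actual values instead).

```python
def mape(actual, predicted):
    """Mean absolute percentage error, as a fraction (0.13 means 13%)."""
    # Assumption of this sketch: skip zero actuals to avoid division by zero.
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    return sum(abs(a - p) / abs(a) for a, p in pairs) / len(pairs)


def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Made-up pickup counts for three time bins.
actual = [100, 200, 50]
predicted = [110, 180, 50]
print(mape(actual, predicted))  # (0.1 + 0.1 + 0.0) / 3
print(mse(actual, predicted))   # (100 + 400 + 0) / 3
```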
All of the following columns contain outliers:
- Pickup Latitude and Pickup Longitude
- Dropoff Latitude & Dropoff Longitude
- Trip Durations
- Speed
- Trip Distance
- Total Fare
When we draw box plots of the columns above, we observe outliers, so we simply remove them.
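One possible way to do this removal is the whisker rule that a box plot draws: drop values more than 1.5 times the interquartile range outside the quartiles. This is only a sketch. The helper names, the factor `k=1.5`, and the sample distances are assumptions, and the notebook may use fixed percentile or domain cutoffs instead.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) box-plot whisker limits for a list of numbers."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(rows, column, k=1.5):
    """Keep only the rows whose value in `column` lies inside the whisker limits."""
    low, high = iqr_bounds([row[column] for row in rows], k)
    return [row for row in rows if low <= row[column] <= high]


# Made-up trip distances in miles; 250.0 is an obvious outlier.
trips = [{"trip_distance": d} for d in [1.2, 0.8, 2.5, 3.1, 1.9, 2.2, 1.5, 250.0]]
clean = remove_outliers(trips, "trip_distance")
print(len(trips), "->", len(clean))
```

The same filter can be applied column by column to trip duration, speed, and total fare.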
We divide the whole city into 40 segments, based on how frequently taxis are requested in each area. If a segment has a large number of trips, its area is relatively small; if it has few trips, its area is large.
We used the KMeans algorithm to segment New York City.
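The segmentation can be sketched with scikit-learn's `KMeans`, assuming the pickup coordinates are available as an array of (latitude, longitude) pairs. The coordinates below are synthetic and the variable names are illustrative, not taken from the notebook. Because k-means places more centroids where points are dense, busy areas naturally end up with smaller regions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic pickup coordinates roughly inside NYC's bounding box (made-up data).
rng = np.random.default_rng(0)
coords = np.column_stack([
    rng.uniform(40.60, 40.90, size=2000),    # latitude
    rng.uniform(-74.05, -73.75, size=2000),  # longitude
])

# 40 clusters, matching the 40 segments described above.
kmeans = KMeans(n_clusters=40, n_init=10, random_state=0).fit(coords)

region_ids = kmeans.predict(coords)  # cluster index (0..39) for every pickup
centers = kmeans.cluster_centers_    # one (latitude, longitude) centre per region
print(centers.shape)
```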
Just as we divided the city into segments, we also divided time into bins.
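A minimal sketch of the time binning, assuming 10-minute bins to match the 10-minute forecast horizon of the problem statement. The bin width, the start date, and the function name are assumptions made for illustration.

```python
from datetime import datetime

BIN_SECONDS = 10 * 60  # assumed bin width: 10 minutes


def time_bin(pickup_time, start):
    """Index of the 10-minute bin containing `pickup_time`, counted from `start`."""
    return int((pickup_time - start).total_seconds() // BIN_SECONDS)


start = datetime(2015, 1, 1)
print(time_bin(datetime(2015, 1, 1, 0, 9, 59), start))  # 0
print(time_bin(datetime(2015, 1, 1, 0, 10, 0), start))  # 1
print(time_bin(datetime(2015, 1, 2), start))            # 144
```

Counting pickups per (segment, bin) pair then gives one demand time series per segment.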
Since we are dealing with time-series data, we first consider simple statistical models:
- Simple Moving Average
- Weighted Moving Average
- Exponential Weighted Moving Averages
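The three baselines above can be sketched as one-step-ahead predictors. The window size, the linear weights, and the smoothing factor `alpha` are illustrative choices, not the values tuned in the notebook.

```python
def simple_moving_average(series, window):
    """Predict the next value as the plain mean of the last `window` values."""
    recent = series[-window:]
    return sum(recent) / len(recent)


def weighted_moving_average(series, window):
    """Predict the next value with linearly increasing weights (newest is heaviest)."""
    recent = series[-window:]
    weights = range(1, len(recent) + 1)
    return sum(w * x for w, x in zip(weights, recent)) / sum(weights)


def exponential_moving_average(series, alpha):
    """Predict the next value by exponential smoothing with factor `alpha`."""
    smoothed = series[0]
    for x in series[1:]:
        smoothed = alpha * x + (1 - alpha) * smoothed
    return smoothed


pickups = [10, 12, 14, 16]  # made-up counts for four consecutive time bins
print(simple_moving_average(pickups, 3))         # (12 + 14 + 16) / 3
print(weighted_moving_average(pickups, 3))       # (1*12 + 2*14 + 3*16) / 6
print(exponential_moving_average(pickups, 0.5))  # 14.25
```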
Because this is a time-series problem, we cannot shuffle the data before splitting. We use the first 70% of the data for training and the remaining 30% as test data.
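A minimal sketch of such a chronological split, assuming the rows are already sorted by time. The function name and the toy data are illustrative.

```python
def chronological_split(series, train_percent=70):
    """Split a time-ordered sequence into (train, test) without shuffling."""
    cut = len(series) * train_percent // 100  # integer arithmetic avoids float rounding
    return series[:cut], series[cut:]


counts = list(range(10))  # stand-in for time-ordered pickup counts
train, test = chronological_split(counts)
print(train)  # [0, 1, 2, 3, 4, 5, 6]
print(test)   # [7, 8, 9]
```

Every test point comes after every training point, so the model is never evaluated on data that precedes what it was trained on.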
Exponential Model MAPE: 0.1349
Linear Regression MAPE: 0.13468
Random Forest MAPE: 0.13438
XGBoost MAPE: 0.1392
Holt-Winters forecasting
We then used "Triple Exponential Smoothing" (Holt-Winters) forecasts as an additional feature and retrained Random Forest and XGBoost.
Random Forest MAPE: 0.04657
XGBoost MAPE : 0.02932
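For readers unfamiliar with triple exponential smoothing, here is a generic additive Holt-Winters sketch. It is not the notebook's code: the initialisation heuristic, the parameter values, and the toy series are assumptions. In the setup described above, the forecast it produces would be fed to the regressors as one more feature.

```python
def holt_winters_additive(series, season_len, alpha, beta, gamma, n_forecast):
    """Additive triple exponential smoothing; returns `n_forecast` future values."""
    m = season_len
    if len(series) < 2 * m:
        raise ValueError("need at least two full seasons of data")

    first_mean = sum(series[:m]) / m
    second_mean = sum(series[m:2 * m]) / m
    # Initial trend: average per-step change between the first two seasons.
    trend = (second_mean - first_mean) / m
    # Initial seasonal offsets: first season minus its own straight-line trend.
    mid = (m - 1) / 2
    seasonal = [series[i] - (first_mean + trend * (i - mid)) for i in range(m)]
    # Initial level: the trend line extrapolated to one step before the data.
    level = first_mean - trend * (mid + 1)

    for t, value in enumerate(series):
        s = seasonal[t % m]
        prev_level = level
        level = alpha * (value - s) + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonal[t % m] = gamma * (value - level) + (1 - gamma) * s

    n = len(series)
    return [
        level + (h + 1) * trend + seasonal[(n + h) % m]
        for h in range(n_forecast)
    ]


# Toy series: a linear trend plus a repeating pattern of length 4 (made-up data).
pattern = [1, -1, 2, -2]
series = [t + pattern[t % 4] for t in range(12)]
forecast = holt_winters_additive(series, 4, alpha=0.3, beta=0.1, gamma=0.2, n_forecast=4)
print(forecast)  # close to [13, 12, 16, 13]
```

The three smoothing factors `alpha`, `beta`, and `gamma` are the hyperparameters that need tuning.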
- The Holt-Winters (triple exponential smoothing) forecast is a very strong feature for this task. Tuning the Holt-Winters hyperparameters is also crucial. With this feature, MAPE drops to about 0.029 (XGBoost).
- Without the Holt-Winters feature, MAPE did not drop below about 0.13 (13%), even with an LSTM model.