To drive a yellow New York taxi, you have to hold a "medallion" from the city's Taxi and Limousine Commission. Recently, one of those changed hands for over one million dollars, which shows how lucrative the job can be.
But this is the age of business intelligence and analytics! Even taxi drivers stand to benefit from careful investigation of the data, which can guide them to maximize their profits. In this project, we will analyze a random sample of 49,999 New York taxi journeys made in 2013. We will then use regression trees and random forests to build a model that can predict the locations and times when the biggest fares can be earned.
Let's start by taking a look at the data!
# Loading the tidyverse
library('tidyverse')
# Reading in the taxi data
taxi <- read_csv('datasets/taxi.csv')
# Taking a look at the first couple of rows in taxi
head(taxi)
As you can see above, the taxi dataset contains the times and prices of a large number of taxi trips. Importantly, we also get to know the location, the longitude and latitude, where each trip started.
Cleaning data is a large part of any data scientist's daily work. It may not seem glamorous, but it makes the difference between a successful model and a failure. The taxi dataset needs a bit of polishing before we're ready to use it.
# Renaming the location variables,
# dropping any journeys with zero fares and zero tips,
# and creating the total variable as the log sum of fare and tip
taxi <- taxi %>%
    rename(lat = pickup_latitude, long = pickup_longitude) %>%
    filter(fare_amount > 0 | tip_amount > 0) %>%
    mutate(total = log(fare_amount + tip_amount))
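As a quick sanity check (an optional snippet, not part of the original analysis), we can confirm how many trips remain and that every total is a finite number:
# Checking the cleaned data: trip count and that all totals are finite
taxi %>%
    summarize(n_trips = n(),
              all_finite = all(is.finite(total)))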
While the dataset contains taxi trips from all over New York City, the bulk of the trips are to and from Manhattan, so let's focus only on trips initiated there.
# Reducing the data to taxi trips starting in Manhattan
# Manhattan is bounded by the rectangle with
# latitude from 40.70 to 40.83 and
# longitude from -74.025 to -73.93
taxi <- taxi %>%
filter(between(lat, 40.70, 40.83) & between(long, -74.025, -73.93))
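To confirm the bounding-box filter worked as intended (again, an optional check), the coordinate ranges should now sit inside the rectangle described above:
# Verifying that lat and long now fall within the Manhattan rectangle
summary(select(taxi, lat, long))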
It's time to draw a map! We're going to use the excellent ggmap package together with ggplot2 to visualize where in Manhattan people tend to start their taxi journeys.
# Loading in ggmap and viridis for nice colors
library("ggmap")
library("viridis")
# Retrieving a stored map object which originally was created by
# manhattan <- get_map("manhattan", zoom = 12, color = "bw")
manhattan <- readRDS("datasets/manhattan.rds")
# Drawing a density map with the number of journey start locations
ggmap(manhattan, darken = 0.5) +
scale_fill_viridis(option = 'plasma') +
geom_bin2d(data=taxi, aes(x=long, y=lat), bins=60, alpha=0.6) +
labs(x="Longitude", y="Latitude", fill="Journeys")
The map from the previous task showed that the journeys are highly concentrated in the business and tourist areas. We also see that some taxi trips originating in Brooklyn slipped through, but that's fine.
We're now going to use a regression tree to predict the total fare with lat and long as the predictors. The tree algorithm will try to find cutpoints in those predictors that result in the decision tree with the best predictive capability.
# Loading in the tree package
library('tree')
# Fitting a tree to lat and long
fitted_tree <- tree(total ~ lat + long, data=taxi)
# Draw a diagram of the tree structure
plot(fitted_tree)
text(fitted_tree)
The tree above looks a bit frugal: it includes only one split. It predicts that trips where lat < 40.7237 are more expensive, which makes sense as that is downtown Manhattan. But that's it. It didn't even include long, as tree deemed that it didn't improve the predictions. Taxi drivers will need more information than this, and any driver paying for your data-driven insights would be disappointed with that. As we know from Robert De Niro, it's best not to upset New York taxi drivers.
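If you'd rather read the split from the model object than from the plot, the tree package stores the fitted structure in a frame (the exact cutpoint may differ slightly depending on your sample):
# Inspecting the tree's structure: one row per node,
# including the splitting variable and cutpoint
fitted_tree$frame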
Let's start by adding some more predictors related to the time the taxi trip was made.
# Loading in the lubridate package
library('lubridate')
# Generate the three new time variables
taxi <- taxi %>%
mutate(hour=hour(pickup_datetime),
wday=wday(pickup_datetime, label=TRUE),
month=month(pickup_datetime, label=TRUE))
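A quick peek (optional) confirms that the new columns were derived correctly from pickup_datetime:
# Spot-checking the derived time variables
taxi %>%
    select(pickup_datetime, hour, wday, month) %>%
    head()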
Let's try fitting a new regression tree where we include the new time variables.
# Fitting a tree with total as the outcome and
# lat, long, hour, wday, and month as predictors
fitted_tree <- tree(total ~ lat + long +
hour + wday + month, data=taxi)
# Drawing a diagram of the tree structure
plot(fitted_tree)
text(fitted_tree)
# Summarizing the performance of the tree
summary(fitted_tree)
The regression tree has not changed after including the three time variables. This is likely because latitude is still the most promising first variable to split the data on, and after that split, the other variables are not informative enough to be included. A random forest model, where many different trees are fitted to subsets of the data, may well include the other variables in some of the trees that make it up.
# Loading in the randomForest package
library('randomForest')
# Fitting a random forest
fitted_forest <- randomForest(total ~ lat +
long + hour + wday + month,
data=taxi,
ntree=80,
sampsize=10000)
# Printing the fitted_forest object
fitted_forest
In the output of fitted_forest you should see the Mean of squared residuals, that is, the average of the squared errors the model makes. If you scroll up and check the summary of fitted_tree, you'll find Residual mean deviance, which is essentially the same measure for the tree. If you compare these numbers, you'll see that fitted_forest has a slightly lower error. Neither predictive model is very good; in statistical terms, they explain only about 3% of the variance.
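That roughly 3% figure can be read off the printed output, but it is also stored on the model object; here is one way to extract it (rsq holds the running pseudo R-squared, one value per tree grown):
# Percentage of variance explained after all 80 trees
tail(fitted_forest$rsq, 1) * 100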
Now, let's take a look at the predictions of fitted_forest projected back onto Manhattan.
# Extracting the out-of-bag predictions from fitted_forest
taxi$pred_total <- fitted_forest$predicted
# Plotting the mean trip prices predicted by the random forest
ggmap(manhattan, darken = 0.5) +
scale_fill_viridis(option = 'plasma') +
stat_summary_2d(data=taxi, aes(x=long, y=lat, z=pred_total),
fun=mean, bins=60, alpha=0.6) +
labs(x="Longitude", y="Latitude", fill="Predicted Fare")
Looking at the map with the predicted fares, we see that fares in downtown Manhattan are predicted to be high, while midtown is lower. This map only shows the prediction as a function of lat and long. We could also plot the predictions over time, or over a combination of time and space; a minimal example of the time dimension follows, but we'll leave a fuller exploration for another time.
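As a minimal sketch of that time dimension (illustrative only, using the pred_total and hour columns created earlier), here is the mean predicted total by pickup hour:
# Mean predicted (log) total by pickup hour
taxi %>%
    group_by(hour) %>%
    summarize(mean_pred = mean(pred_total)) %>%
    ggplot(aes(x = hour, y = mean_pred)) +
    geom_line() +
    labs(x = "Hour of day", y = "Mean predicted log total")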
For now, let's compare the map with the predicted fares with a new map showing the mean fares according to the data.
# Function that returns the mean *if* there are 15 or more datapoints
mean_if_enough_data <- function(x) {
ifelse( length(x) >= 15, mean(x), NA)
}
# Plotting the mean trip prices from the data
ggmap(manhattan, darken = 0.5) +
scale_fill_viridis(option = 'plasma') +
stat_summary_2d(data=taxi, aes(x=long, y=lat, z=total),
fun=mean_if_enough_data, bins=60, alpha=0.6) +
labs(x="Longitude", y="Latitude", fill="Mean Fare")
So it looks like the random forest model captured some of the patterns in our data. At this point in the analysis, there are many more things we could do that we haven't done. We could add more predictors if we have the data. We could try to fine-tune the parameters of randomForest. And we should definitely test the model on a hold-out test dataset; a sketch of how that could look follows. But for now, let's be happy with what we have achieved!
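As a sketch of what a hold-out evaluation could look like (assumed code, not part of the original project; the split proportion and seed are arbitrary choices):
# Splitting the data 80/20 and measuring error on the held-out 20%
set.seed(42)
in_train <- runif(nrow(taxi)) < 0.8
forest_train <- randomForest(total ~ lat + long + hour + wday + month,
                             data = taxi[in_train, ],
                             ntree = 80, sampsize = 10000)
test_pred <- predict(forest_train, newdata = taxi[!in_train, ])
mean((taxi$total[!in_train] - test_pred)^2)  # hold-out mean squared error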
So, if you are a taxi driver in NYC, where in Manhattan would you expect people to spend the most on a taxi ride?
# Where are people spending the most on their taxi trips?
spends_most_on_trips <- "downtown" # "uptown" or "downtown"