
Dr. Iain Stitt
Data Scientist and Neuroscientist
NYC street network analysis
For this project I defined a graph theoretical measure that quantifies the redundancy (or overlap) of taxi trips along certain routes in New York City. This measure can be used to assess the degree of redundancy for certain routes in the city. Routes that have a high degree of redundancy have a greater capacity for improved efficiency, and therefore represent a more viable market for ride sharing services.
​
The code for this project can be found on my Github page here: https://github.com/iainstitt/NYC-street-network-analysis
​
How can we assess the efficiency of taxi trips?
I have thought about this question quite a bit, and for me efficiency can be considered the following: What proportion of taxi rides are redundant? That is to say, what proportion of my taxi trip could have been spent riding in someone else's taxi that was traveling in a similar direction?
​
If many taxis are taking passengers along similar roads in similar directions, then we can say that there is a high degree of redundancy in the taxi transportation system. And because of this redundancy, many trips could potentially be shared, indicating that there is room for more efficiency in the system. On the other hand, if most taxi trips travel on different roads and in different directions, then there is little redundancy, and thus little room for improving efficiency.
​
Above I have outlined my overarching ideas about how to tackle the problem of efficiency in street networks. I will outline the details of how I translate this idea into a workable algorithm below.
​
Project pickup and drop-off locations onto a street network graph
I am using the built-in geolocation information from the street network graph of NYC (using the osmnx python toolbox) to define the nodes (or intersections) that correspond to certain taxi pickup and drop-off locations. In this graph, each node represents an intersection, and each edge represents a street, with the length of the street encoded by the weight of the edge.
​
When we have the pickup and drop-off nodes, we can then simply use the networkx toolbox to compute the shortest path between the two nodes using Dijkstra's algorithm. I am making a baseline assumption here that the taxis take the shortest path available, but I understand that taxi behavior may indeed change according to factors such as congestion, city dynamics, etc.


Definition of trip redundancy metric
As you can see above, these two trips seem to have a decent amount of overlap. We can quantify how similar these two trips are by comparing which network edges were traversed during both trips, and compute the fraction of overlapping edges between trips to the number of total edges traveled.
​
I did play around with a few different variations of this metric, but in the end I settled on dividing the number of shared edges by the length of the shortest route. Thus, this will always be a number between 0 and 1, with a vaue of 1 indicating that the shorter route is essentially a subset of the larger route, and that at least one ride is totally redundant (and thus ride sharing would be a great option).
​
Randomly select pairs of rides to compare
Now that we have a metric to help us estimate the viability for ride sharing between pairs of taxi trips, we want to apply this metric at scale on the NYC taxi dataset. To restrict the dataset I only queried taxi trips that had pickup and drop-off zones that fall within specific geographical areas. To establish a simple baseline, I started with taxi trips that had pickup and drop-off locations within the neighborhood of Astoria, Queens.
​
I then randomly selected pairs of taxi trips, and computed the pairwise redundancy metric outlined above, repeating this process thousands of times to build a probability distribution of redundancy scores for the neighborhood of Astoria. The fraction of ride pairs within Astoria that have no overlapping edges/streets is 70.05%. For those that have overlapping segments, here is the distribution of pairwise redundancy scores:

Compare to Manhattan-bound taxi trips
To compare between neighborhood to the within neighborhood trip redundancy, I built another dataset of taxi trips that originate in Astoria, and have a drop-off location in Manhattan.
The fraction of ride pairs from Astoria to Manhattan that have no overlapping edges/streets is 42.35% For those that have overlapping segments, here is the breakdown of how much overlap there is:


As you can see in the plot above, Manhattan-bound trips originating in Astoria have a greater capacity for higher efficiency, as denoted by the larger proportion of rides displaying a higher pairwise redundancy score. This indicates that there is indeed more redundancy in these trips, and that ride sharing between Astoria and Manhattan makes sense.
​
Can LGA trips be shared with riders going to and from Astoria?
The La Guardia Airport (LGA) is located to the northeast of Astoria in Queens, and is the closest airport to Manhattan. Given that a large volume of people travel between LGA and Manhattan every day, and that the neighborhood of Astoria is located in between these destinations, such trips may be great candidates for ride sharing. To test this hypothesis, I computed the pairwise redundancy score between randomly selected LGA to Astoria and LGA to Manhattan taxi trips.
​
The fraction of ride pairs from LGA -> Astoria and LGA -> Manhattan that have no overlapping edges/streets is 2.56% For those that have overlapping segments, here is the breakdown of how much overlap there is:

As you can see in the figure above, there appears to be a large degree of redundancy when it comes to trips from LGA to Manhattan and LGA to Astoria. This would mean that if ride sharing services opted to include Astoria in the service area, they may be able to drop off passengers in Astoria on the way to Manhattan with minimal fuss.
​
Now let's perform the same analysis, but for trips that travel in the opposite direction: Manhattan to LGA and Astoria to LGA

This is interesting. There seems to be an asymmetry in the redundancy score with trips bound for LGA, and trips bound for either Manhattan or Astoria from LGA. To dig into this a bit deeper, I am going to have a look at the spatial distribution of taxi trips that either originate or end in LGA.
​
Spatial distribution of taxi pickup and drop-off locations
​
First, let's have a look at the drop-off location for taxi rides that originate in LGA. In the plot below, each little white dot is the dropoff location of a taxi ride that originated at LGA:

And when we zoom in on Queens:

There appear to be three different hot spots in Queens where passengers that were picked up at La Guardia Airport tend to be dropped off. By cross-referencing this plot with Google maps, I was able to determine that the main hotspot in Astoria corresponds to the Astoria Blvd train station. Similarly, the other two hotspots correspond to the Jackson Heights and Woodside subway stations. So it seems that people that take taxis from La Guardia Airport tend to want to travel to the nearest train station, so that they can probably catch the train for a lower cost (especially if they have a monthly metro card). I guess this may also have something to do with the fact that public transport options to La Guardia are quite limited, despite its vicinity to the city.
​
Now let's contrast the plots above with similar plots of the spatial distribution of pickup locations for LGA-bound trips:

The distribution of pickup locations for trips that are destined to La Guardia Airport looks quite different form the previous plots of dropoff locations of trips that originate at La Guardia.
​
I feel that these differences in pickup/drop-off locations, coupled with different user behavior (ie, taking taxis to the nearest metro station) may explain some of the asymmetry in the redundancy scores of trips going to and from La Guardia (and starting/ending in Astoria or Manhattan).
​
​
Conclusions
Some of the details of the points below were not fully fleshed out in this post, so for more information please look at my github repo for this project
-
Taxi trips that originate and end within Astoria displayed a similar distribution of redundancy, when compared to rides from the upper east side. Given that ride sharing services are successful in the upper east side, it may be expected that a similar service will be viable in the Astoria as well.
-
Trips from Astoria to Manhattan, and from Manhattan to Astoria displayed a large degree of overlap/redundancy. Indicating that these journeys represent viable targets for the ride sharing model.
-
There was especially high redundancy for trips that originate in La Guardia, and drop off passengers in either Astoria and Manhattan. Further exploration of the data revealed that this may be in part due to the tendency of many taxi trips from La Guardia to drop off passengers at the Astoria Blvd train station.
-
Trips from Manhattan to La Guardia Airport (and back) display a large degree of overlap/redundancy with trips to La Guardia from Astoria. This suggests that Astoria services to both Manhattan and La Guardia could be integrated within the current framework of servicing La Guardia Airport from Manhattan.
-
The demand for service in Astoria more generally follows the demand for taxis between Manhattan and La Guardia Airport. This is great, because it means that the demand for ride share trips within Astoria will co fluctuate with the amount of vehicles traveling through the neighborhood either to or from La Guardia Airport.
-
While there is indeed variation in the demand for service across the days of the week, I believe that the magnitude of these fluctuations do not justify restricting service to certain days. Since restricting service may lead to increased churn.
​
Further considerations
I fully acknowledge that the time component was not considered when computing the efficiency of rides within and between neighborhoods. However, I believe that the analytical framework that I have outlined above could, with some more work, be generalized to include the time of certain trips.
​
To include time information I would do the following:
-
Rather than searching randomly for pairs of rides to compute the efficiency metric, I would restrict the analysis to rides that occur within a certain time window (eg, 20 minutes) on the same day.
-
To achieve more power for this approach, I would collapse the results of the pairwise comparison across days within the same time window. That is to say, the pairs of rides that will be compared always occurred on the same day at similar times, but the overall distribution of the efficiency metric will be built by aggregating pairwise comparisons across many days.
-
To achieve temporal resolution for this metric, I would compute the efficiency score using a sliding window approach. For example, the window length could be 20 minutes (eg, compare rides in a window from 8:00-8:20am), with 5 minute increments (next step would be 8:05-8:25am, and so on). This would result in a smooth estimate (because some ride pairs will contribute to neighboring bins) of the efficiency metric with a temporal resolution of 5 minutes, resolved for the whole day.