What can data tell us about the design of the Citibike system?

This project set out to answer a simple question about the different user types in Citibike, the NYC bike sharing system.

The idea:

SUBSCRIBERS ARE LESS LIKELY TO PAY EXTRA TIME FEES PER RIDE THAN CUSTOMERS

Users are defined as follows:
Subscriber = person enrolled in a 1-year contract to use Citibikes
Customer = person who buys a Day-pass or a 3-Day-pass to use Citibike, normally associated with visitors and tourists.
Note: the ride time limit before extra charges apply differs by user type:
Max time for Customers = 30min = 1800 sec
Max time for Subscribers = 45min = 2700 sec
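
As a minimal sketch of the rule above, a trip can be flagged as extra charged by comparing its duration with the limit for its user type. The function name, the sample trips, and the choice to treat a trip of exactly the limit as not charged are assumptions for illustration, not taken from the original analysis.

```python
# Free-ride limits in seconds, taken from the definitions above.
TIME_LIMIT_SEC = {"Customer": 1800, "Subscriber": 2700}


def is_extra_charged(user_type: str, trip_duration_sec: float) -> bool:
    """Return True when the trip ran past the free limit for its user type.

    Assumption: a trip lasting exactly the limit is not charged.
    """
    return trip_duration_sec > TIME_LIMIT_SEC[user_type]


# Hypothetical trips (user type, duration in seconds), for illustration only.
sample_trips = [
    ("Customer", 1500),
    ("Customer", 2400),
    ("Subscriber", 2400),
    ("Subscriber", 3000),
]

flags = [is_extra_charged(user, duration) for user, duration in sample_trips]
print(flags)  # [False, True, False, True]
```

Note that the same 2400-second ride is charged for a Customer but not for a Subscriber, which is why the two groups have to be compared against their own limits.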

The experiment used data from February, April and June of 2016; these months represent different seasonal conditions that can affect ride usage.

The graph on the left shows the number of rides grouped by user type, split into trips within the trip duration limit and extra-charged trips. The graph on the right shows the same data as ratios, revealing that:

Ratio of Customers Charged Extra: 25.63%
Ratio of Subscribers Charged Extra: 0.91%

You can review the statistical significance of the result, tested with the Chi-square method, on GitHub.
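
For readers who want a feel for the test, here is a rough sketch of a Pearson Chi-square statistic on a 2x2 table (user type by charged or not charged). This is one common form of the test and not necessarily the exact procedure used in the repository. The counts are made-up toy numbers, not the Citibike data, and the function name is hypothetical.

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic (no continuity correction) for a 2x2 table.

    Rows are user types, columns are (within limit, extra charged).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if user type and charge were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Toy counts for illustration only (NOT the real Citibike numbers).
toy_table = [
    [20, 30],  # hypothetical Customers: within limit, extra charged
    [30, 20],  # hypothetical Subscribers: within limit, extra charged
]

stat = chi_square_2x2(toy_table)
CRITICAL_VALUE_DF1_ALPHA_005 = 3.841  # chi-square critical value, 1 degree of freedom

print(stat)  # 4.0
print(stat > CRITICAL_VALUE_DF1_ALPHA_005)  # True
```

With one degree of freedom, a statistic above 3.841 means the difference between the two user types would be considered significant at the 5% level.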

The data reveals that trips from subscribers account for 89.5% of the total trips, with customers representing just 10.5% of the rides. In other words, subscribers took about 8.5 times as many trips as customers.

At this point, I started to get curious about the trips exceeding the time limit and triggering an extra charge, and the patterns in the data.

Below, you can see the distribution of trips throughout the day. This graph helps us understand the characteristics of each user type and the way they use the Citibike services.

As we can observe, there are two peak hours for the Subscribers (associated with commuters, residents of the city), at 8:15am and 6:15pm. These are common commuting hours, so we can infer that Subscribers are regular bike commuters. If we set the peaks aside, both distributions are similar in shape but not in magnitude. It is noteworthy that the Customers' usage starts ramping up later in the morning and peaks at 3:00pm.

Fig 3.- Distribution of trips during day hours, identified by user type.
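
A distribution like this can be built by grouping trips by start time. The sketch below assumes 15-minute buckets (suggested by the 8:15am and 6:15pm peaks, but not confirmed by the original analysis) and uses a handful of invented trips; the function name is hypothetical.

```python
from collections import Counter
from datetime import datetime


def quarter_hour_bucket(start: datetime) -> str:
    """Round a start time down to its 15-minute bucket, e.g. 08:17 -> '08:15'."""
    minute = (start.minute // 15) * 15
    return f"{start.hour:02d}:{minute:02d}"


# Invented trips (user type, start time), for illustration only.
sample_trips = [
    ("Subscriber", datetime(2016, 2, 1, 8, 17)),
    ("Subscriber", datetime(2016, 2, 1, 8, 29)),
    ("Subscriber", datetime(2016, 2, 1, 18, 20)),
    ("Customer", datetime(2016, 2, 1, 15, 5)),
]

# Count trips per (user type, time-of-day bucket).
trips_per_bucket = Counter(
    (user, quarter_hour_bucket(start)) for user, start in sample_trips
)

for (user, bucket), count in sorted(trips_per_bucket.items()):
    print(user, bucket, count)
```

Plotting the counts per bucket, one line per user type, gives the shape of the daily usage curve.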

Now, let’s focus only on Extra Charged Trips, i.e. the trips that exceeded their time limit (Customers = 30min and Subscribers = 45min).

Below, we can visualize the distribution of Extra Charged Trips during the day; the trips are allocated based on their start time. Here it is obvious that the number of Extra Charged Trips is much higher for Customers than for Subscribers. We can also see that the peak for Customers (associated with visitors and tourists) is at 3:00pm, while the peaks for Subscribers (associated with NYC residents using Citibike for their commute) are at 9:00am and between 5:00pm and 6:00pm, when people leave work.

Fig 4.- Distribution of number of Extra Charged trips during day hours, identified by user type.

Fig 5.- Distribution of extra charged trips duration during day hours.

The next graph, on the right, explores the duration of the trips to look for a behavioral pattern in the biking habits of the different user types.

Trip durations range from 0.5 hours to 1014 hours.

Unless people use Citibike to train for the Tour de France, a trip lasting more than 3 hours is hard to believe. This might be an error in the data collection system or something similar. Still, there is a clear pattern in the data: an empty space between the 10-hour and 20-hour trips that starts at 8:00am and moves down continuously, hour by hour, until midnight.

I really have no idea what this means, so if you have any clue let me know in the comments section.

With more questions than answers, I then looked for a way to visualize how the trip durations behave. I created a cumulative distribution graph of the Extra Charged Trips for each user type, and the result shows some singular characteristics.

The extra charges for each user type start at a different point, but right after their respective time limits both show a similar slope in the cumulative distribution, followed by an inflection point. For Customers it comes earlier, at around 1.5 hours, by which point 85% of their extra charged trips are accounted for. For Subscribers it comes later, at 1.75 hours, accounting for 75% of the trips of that user type. After these points, the two user types approach 100% at different speeds, showing that Subscribers' trips tend to run longer. For example, Subscribers take 5.5 hours to reach 90%, while Customers reach that point at 2.3 hours.
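
To make the cumulative distribution concrete, here is a small sketch of how the share of trips at or below a given duration, and the duration at which a given share is reached, could be computed. The durations are invented toy values (not the Citibike data) and both function names are hypothetical.

```python
from bisect import bisect_right


def share_at_or_below(durations_hours, limit_hours):
    """Fraction of trips whose duration is less than or equal to limit_hours."""
    ordered = sorted(durations_hours)
    return bisect_right(ordered, limit_hours) / len(ordered)


def duration_at_percent(durations_hours, percent):
    """Smallest observed duration at which the cumulative share reaches percent.

    percent is an integer from 1 to 100; integer arithmetic avoids
    floating-point surprises when rounding up.
    """
    ordered = sorted(durations_hours)
    n = len(ordered)
    # Ceiling of percent * n / 100, minus one to get a zero-based index.
    index = max(0, -(-percent * n // 100) - 1)
    return ordered[index]


# Invented extra charged trip durations in hours, for illustration only.
toy_durations = [0.6, 0.8, 1.0, 1.2, 1.5, 2.0, 3.0, 5.0, 8.0, 12.0]

print(share_at_or_below(toy_durations, 1.5))  # 0.5
print(duration_at_percent(toy_durations, 90))  # 8.0
```

Running the second function separately for each user type is one way to arrive at comparisons such as "Subscribers reach 90% at 5.5 hours while Customers reach it at 2.3 hours".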

The other graph, on the right, shows the total number of trips per trip duration; we can see that beyond 5 hours there are more Subscriber trips than Customer trips.

Of course, there is some level of error here, but identifying the source is complicated. Still, the behavior in the cumulative distribution graph tells us up to which point real usage behavior is captured by the system (the inflection points). The rest of the data (after the inflection points) could be a combination of broken bikes, broken docks, broken data, or stolen bikes. What is also curious is the difference in trip duration between Customers and Subscribers after the inflection point: 10% of subscribers’ extra charged trips last more than 5 hours.

Fig 7.- Citibike station, failed docking

My interpretation

Putting aside the problems related to broken data, it seems that, proportionally, it is more common for Subscribers to overlook that something went wrong during the docking process. This could be explained by the design of the system itself. Citibike provides audiovisual feedback to confirm that the docking was done correctly, but the light is not easy to see because of its location, and the chirp is not loud enough for a noisy urban environment or for riders wearing headphones. Another aspect of the design is that, even when the bike is not docked correctly, it can remain standing in place. In the manufacturing world, this kind of situation is avoided with systems called poka-yoke, i.e. the design makes sure that there is one and only one way to perform the operation. Other bike share systems use poka-yoke.

Finally, doing things constantly and repeatedly creates a learning process, but it also makes people more careless because they develop trust in the system and the process. If you are doing something for the first time, you will be more careful in order to understand the feedback and the process.

While intensive usage of some highly-demanded stations can cause the docking stations to wear out, the feedback of an unsuccessful docking process is not reaching the user.

By the way, other bike share systems, like the one in Barcelona, look less expensive. Let me know your thoughts!

Fig 8.- Barcelona's bike sharing system Bicing dock station.