How We Filtered Our Dataset

Step 1: Choosing the Time Period with Most Reliable Data

In this research, we used data from GPS transmitters attached to Los Angeles Metro buses, which record the time and location of each vehicle along its route. This information is compiled into Automatic Vehicle Location (AVL) datasets. We chose to examine the period from Oct. 1, 2015 to Sept. 30, 2016 because it offered the most complete data and covered a full year, capturing possible seasonal variations in the public transportation system.
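As a rough illustration of restricting records to the study period, the sketch below keeps only those whose service date falls within the chosen window. The record layout and the "service_date" field name are assumptions made for this example, not the actual AVL schema.

```python
from datetime import date

# Study period stated in the text (both endpoints inclusive).
PERIOD_START = date(2015, 10, 1)
PERIOD_END = date(2016, 9, 30)


def in_study_period(service_date):
    """Return True if the date falls inside the study period."""
    return PERIOD_START <= service_date <= PERIOD_END


def filter_by_period(records):
    """Keep records whose (hypothetical) 'service_date' field is in range."""
    return [r for r in records if in_study_period(r["service_date"])]
```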

Step 2: Excluding Schedule Deviation Outliers

The schedule deviations reported in the AVL datasets included outliers for a variety of reasons, such as idle buses or inaccurate GPS readings. We filtered out schedule deviations that fell far outside the normal range. After excluding these outliers, our final dataset contained only schedule deviations for buses that arrived between 5 minutes early and 10 minutes late.
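One way to express this filter is sketched below. It assumes schedule deviation is stored in minutes, with negative values meaning early and positive values meaning late. That sign convention and the "schedule_deviation_min" field name are assumptions for illustration only, as is the choice to include the boundary values.

```python
# Window described in the text: up to 5 minutes early, up to 10 minutes late.
# Assumed convention: negative = early, positive = late, units = minutes.
EARLY_LIMIT_MIN = -5.0
LATE_LIMIT_MIN = 10.0


def within_window(deviation_min):
    """Return True if the deviation lies inside the accepted window."""
    return EARLY_LIMIT_MIN <= deviation_min <= LATE_LIMIT_MIN


def filter_outliers(records):
    """Drop records whose (hypothetical) deviation field is outside the window."""
    return [r for r in records if within_window(r["schedule_deviation_min"])]
```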

Step 3: Excluding Data from Departure Stations

The AVL data reported from the first stop of a route were unreliable because buses continued to report schedule deviation even while waiting to embark on their next trip. Therefore, we disregarded the data collected at the departure stations.
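A minimal sketch of this exclusion follows. It assumes each record carries a stop sequence number in which 1 marks the first stop of a trip. The "stop_sequence" field and that numbering are hypothetical and may not match how the AVL data identifies departure stations.

```python
def drop_departure_stops(records, first_stop_sequence=1):
    """Remove records reported at the first stop of a trip.

    'stop_sequence' is a hypothetical field; 1 is assumed to be the
    departure station.
    """
    return [r for r in records if r["stop_sequence"] != first_stop_sequence]
```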

Step 4: Trajectory Normalization

The AVL dataset included multiple instances in which a bus sent duplicate data from the same location, which skewed the results. Therefore, we normalized the data so that each report was counted only once.
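The sketch below shows one simple way to remove such duplicates: keep the first report for each combination of vehicle, trip and location, and discard later repeats. The field names and the choice of key are assumptions for illustration; the normalization actually applied to the dataset may have used different criteria.

```python
def deduplicate_reports(records):
    """Keep the first report per (vehicle, trip, location); drop repeats.

    All field names here are hypothetical placeholders.
    """
    seen = set()
    unique = []
    for r in records:
        key = (r["vehicle_id"], r["trip_id"], r["latitude"], r["longitude"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique
```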