Baselight

NJ Transit + Amtrak (NEC) Rail Performance

Granular performance data from 150k+ NJ Transit and Amtrak train trips

@kaggle.pranavbadami_nj_transit_amtrak_nec_performance

About this Dataset

Context

NJ Transit is the second-largest commuter rail network in the United States by ridership; it spans New Jersey and connects the state to New York City. Amtrak also operates passenger service on the Northeast Corridor, the busiest passenger rail line in the United States; together, NJ Transit and Amtrak operate nearly 750 trains across the NJ Transit rail network.

Although the network serves over 300,000 riders on the average weekday, no granular, trip-level performance data is publicly available for the NJ Transit rail network or Amtrak. This dataset aims to make such data public.

Content

This dataset contains monthly CSVs covering the performance of nearly every train trip on the NJ Transit rail network.

As of May 19, 2019:

  • Stop-level, minute-resolution data on 287,000+ train trips (248,000+ NJ Transit trips, 38,000+ Amtrak trips)
  • Coverage from March 1, 2018 to April 30, 2019 (updated monthly)
  • Transparent reporting on train trips for which data was missing/invalid, or that were scraped or parsed incorrectly (97.5% of train trips were correctly captured)
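
A minimal sketch of how the monthly CSVs might be read and summarized, using only the Python standard library. The column names (`type`, `delay_minutes`, and the rest) and the sample rows are assumptions made up for illustration; check the actual CSV headers before relying on them.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Made-up sample standing in for one monthly CSV.
# Column names are hypothetical, not confirmed by the dataset description.
SAMPLE_MONTH = """train_id,date,stop_sequence,to,delay_minutes,type
3805,2018-03-01,1,Newark Penn Station,0.5,NJ Transit
3805,2018-03-01,2,Secaucus Upper Lvl,2.0,NJ Transit
A2150,2018-03-01,1,Newark Penn Station,6.5,Amtrak
"""

def load_rows(csv_files):
    """Yield dict rows from an iterable of open text files (one per month)."""
    for f in csv_files:
        yield from csv.DictReader(f)

def mean_delay_by_operator(rows):
    """Average the per-stop delay (minutes) for each operator type."""
    delays = defaultdict(list)
    for row in rows:
        if row["delay_minutes"]:  # skip stops with a missing delay value
            delays[row["type"]].append(float(row["delay_minutes"]))
    return {op: mean(vals) for op, vals in delays.items()}

summary = mean_delay_by_operator(load_rows([io.StringIO(SAMPLE_MONTH)]))
print(summary)
```

With real data, the `io.StringIO` object would be replaced by file handles opened on each monthly CSV.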

Since February 2018, I have been running a scraper that gathers stop-level, minute-resolution data for NJ Transit and Amtrak train trips operating on the NJ Transit rail network. The scraper polls the NJ Transit DepartureVision Real Time Train Status service every minute. The raw, timestamped train status pages are stored in a data lake and then parsed into tabular form; the parser is implemented as a state machine.
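
To give a feel for what parsing with a state machine can look like, here is a simplified sketch. It is not the project's actual parser: the states, the status strings ("Departed", "Cancelled"), and the snapshot format are all assumptions chosen for illustration. Each stop starts as pending, and the first snapshot that reports a departure or cancellation fixes the recorded time.

```python
from enum import Enum, auto

class StopState(Enum):
    PENDING = auto()
    DEPARTED = auto()
    CANCELLED = auto()

# Allowed transitions keyed by normalized status text (hypothetical strings).
# DEPARTED and CANCELLED are terminal, so they have no outgoing transitions.
TRANSITIONS = {
    StopState.PENDING: {
        "DEPARTED": StopState.DEPARTED,
        "CANCELLED": StopState.CANCELLED,
    },
    StopState.DEPARTED: {},
    StopState.CANCELLED: {},
}

def parse_stop(snapshots):
    """Feed minute-by-minute (timestamp, status_text) snapshots for one stop
    through the state machine; return (final_state, time_of_transition)."""
    state, changed_at = StopState.PENDING, None
    for timestamp, status_text in snapshots:
        next_state = TRANSITIONS[state].get(status_text.strip().upper())
        if next_state is not None:  # unrecognized text leaves the state as-is
            state, changed_at = next_state, timestamp
    return state, changed_at

snapshots = [
    ("2018-03-01 08:14", "in 2 min"),
    ("2018-03-01 08:15", "in 1 min"),
    ("2018-03-01 08:16", "Departed"),
    ("2018-03-01 08:17", "Departed"),
]
result = parse_stop(snapshots)
print(result)
```

Because terminal states have no outgoing transitions, repeated "Departed" snapshots do not overwrite the first recorded departure time.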

For more details on these processes and ancillary metadata (such as schedules and station locations) from the NJ Transit Developer Portal, check out the project GitHub repo.

Inspiration

Lots of interesting, high-impact projects could be driven by this data:

  • Robust prediction: This data could be used to build a system-level delay prediction model for the NJ Transit network. Such a model could provide intelligent, targeted advance warnings of delays or cancellations for millions of riders.
  • Combining datasets: Weather data and service alert data could be incorporated to look at the effect of weather events and analyze the impacts of specific kinds of service interruptions.
  • Data visualization: Visualizing this data could provide robust insight into the system-level mechanics of the NJ Transit rail network, as well as more engaging reporting on NJ Transit.
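
As one small example of the "combining datasets" idea, the sketch below joins daily delay averages with daily precipitation and compares wet and dry days. All values are made up for illustration, and the 1 mm wet-day threshold is an arbitrary assumption; a real analysis would derive the daily averages from the CSVs and pull weather from an external source.

```python
from statistics import mean

# Made-up daily aggregates: mean delay in minutes per service date.
daily_delay = {"2018-03-01": 3.0, "2018-03-02": 9.0, "2018-03-03": 2.0}
# Made-up weather records: precipitation in millimetres per date.
daily_precip_mm = {"2018-03-01": 0.0, "2018-03-02": 25.0, "2018-03-03": 0.0}

def delay_by_weather(delays, precip, wet_threshold_mm=1.0):
    """Join the two series on date and average the delay for wet vs dry days."""
    groups = {"wet": [], "dry": []}
    for date, delay in delays.items():
        if date in precip:  # inner join: keep dates present in both sources
            key = "wet" if precip[date] >= wet_threshold_mm else "dry"
            groups[key].append(delay)
    return {k: mean(v) for k, v in groups.items() if v}

comparison = delay_by_weather(daily_delay, daily_precip_mm)
print(comparison)
```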

For more inspiration, see these Medium articles that Michael Zhang and I wrote using this data:

  1. The 5 Stages of a System Breakdown on NJ Transit
  2. What are the chances that NJ Transit will cause you to miss the Dinky?
  3. How data can help fix NJ Transit

Acknowledgements

A special thanks to Michael Zhang for his valuable work on using and preparing this data, as well as general support throughout the project.
