BIG Data or BIG Mistake: Practical Advice on How to Get the Best out of New Data Sources




Authors

Tim Day-Pollard, Mott MacDonald

Description

A recent model update has provided a chance to re-examine methods for processing GPS data.
Focussing on commute trips, we compare estimates produced by RSI, GPS and synthetic data sources against the travel to work data from the 2011 Census.

Abstract

In recent years there has been growing interest in new BIG data sources for matrix building. These new sources (of which GPS is the most common) often purport to solve many of the problems faced by traditional intercept survey methods.
Thanks to their larger sample sizes and larger geographic catchments, they often result in origin-destination (OD) matrices with a much lower percentage of zero cells, whether observed or unobserved, than, for example, matrices produced from Road Side Interview (RSI) data. This is very attractive to transport modellers forecasting demand: Daly et al (2011) showed that a well-filled base year observed matrix is essential in pivot-point models.
Despite this, BIG data sources come with their own challenges. For example: true origin zones can be hard to pinpoint, either because of data security or because of inaccuracies in the underlying technology; the purpose of the trip is not captured and so must be synthesised; and trying to understand sample rates in a way consistent with traditional data sources can be fraught with difficulty.
A recent model update has given the author a chance to re-examine their methods in the light of newer research (for example a paper on Data Fusion by Allos et al, 2014) and to re-combine the data, this time also making use of 2011 Census journey to work data.
We will look at what data-cleaning and bias-correction techniques were required for the BIG data and how the merging of the data was achieved. In particular we will focus on the commute purpose, comparing the estimates produced by RSI, GPS (using both our previous and our new methodology for purposing) and synthetic data sources against the travel to work data from the 2011 Census.
Through use of select-link analysis, GIS techniques, trip-length distributions and the MSSIM comparator (first introduced to transport by Djukic et al, 2013), we shall compare the reliability of each data source at different levels of aggregation, with the aim of producing a final trip matrix that makes best use of each data set where it is most reliable.
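
As an illustration of the kind of comparison the MSSIM comparator makes, the following is a minimal sketch of a mean structural similarity calculation between two OD matrices of the same shape. It is not the implementation used in this work: the function names, the 3x3 sliding window and the small stabilising constants c1 and c2 are all assumptions chosen for the example, and the toy matrices are invented.

```python
import numpy as np


def ssim_window(x, y, c1=1e-10, c2=1e-10):
    """Structural similarity of two equally sized blocks of cells.

    c1 and c2 are small stabilising constants (assumed values) that
    avoid division by zero when a block is entirely empty.
    """
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def mssim(a, b, window=3):
    """Mean SSIM over every window x window block of two OD matrices.

    The window size is an assumption; it controls the level of spatial
    aggregation at which structure is compared.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("matrices must have the same shape")
    rows, cols = a.shape
    if window > min(rows, cols):
        raise ValueError("window is larger than the matrix")
    scores = [
        ssim_window(a[i:i + window, j:j + window],
                    b[i:i + window, j:j + window])
        for i in range(rows - window + 1)
        for j in range(cols - window + 1)
    ]
    return float(np.mean(scores))


# Toy 4x4 matrices standing in for two estimates of the same demand
# (hypothetical values, for illustration only).
estimate_a = np.arange(16, dtype=float).reshape(4, 4)
estimate_b = estimate_a[::-1, ::-1].copy()

print(mssim(estimate_a, estimate_a))  # identical structure
print(mssim(estimate_a, estimate_b))  # reversed structure scores lower
```

A score near 1 indicates that the two matrices share the same local structure, while lower scores indicate disagreement. Note that the result depends on how zones are ordered in the matrix, so any real comparison would need a consistent and meaningful zone ordering.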

Publisher

Association for European Transport