Data Errors and Their Impacts, the Case of DATELINE



Authors

Cornelis Dirk Van Goeverden, Delft University of Technology, Bart Van Arem, Delft University of Technology, Rob Van Nes, Delft University of Technology

Description

DATELINE was the most important European project on long-distance travel. The paper discusses the magnitude of several kinds of data errors and their impact on the outcomes of analyses of travel volume and of the association between variables.

Abstract

In the scientific publication process, much attention is given to the correct performance of statistical analyses. Generally, less attention is paid to the quality of the data used for the analyses. However, just like methods that are not fully correct statistically, errors in data make the results less accurate. The paper examines data errors in the DATELINE project. DATELINE is a large survey on long-distance travel by European residents that was conducted in 2001-2002. The examination covers the characteristics of the errors, their magnitude, and their impact on the outcomes of two kinds of analyses: a description of travel volume and a statistical analysis of modal choice. The well-known problem of underreporting of long-distance journeys is not discussed in the paper; this is not a data error but an observation error.

One general problem in travel surveys concerns the identification of topographical locations. This problem applies to DATELINE as well. The DATELINE data include many kinds of locations, such as the home cities of respondents and the destinations of journeys. A significant number of locations are wrongly identified. Mistakes in identification occur when the name of a reported city is not unique, when the name is incorrectly spelled, and when a city is missing from the geodatabase used for coding locations in the DATELINE databases. Mistakes in the manipulation of data caused additional errors in the selection of locations. We estimate that for 11% of the DATELINE journeys the assumed origin or destination is not correct. For excursions, that is, day trips made during a journey, the estimated proportion is as high as 30%.
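The three failure modes named above (non-unique name, misspelled name, city missing from the geodatabase) can be illustrated with a minimal sketch. It is not the DATELINE coding procedure: the function name `classify_location`, the dictionary-based gazetteer, and the similarity cutoff of 0.8 are all assumptions made for illustration.

```python
import difflib

def classify_location(name, gazetteer, cutoff=0.8):
    """Classify a reported city name against a gazetteer.

    gazetteer is a hypothetical dict mapping a place name to the list of
    places that carry that name. The cutoff is an illustrative assumption.
    """
    candidates = gazetteer.get(name, [])
    if len(candidates) == 1:
        return "unique"       # exactly one place has this name
    if len(candidates) > 1:
        return "ambiguous"    # the name is not unique
    # No exact hit: look for a similarly spelled name in the gazetteer.
    close = difflib.get_close_matches(name, list(gazetteer), n=1, cutoff=cutoff)
    return "possible_misspelling" if close else "missing"
```

A record flagged as ambiguous, possibly misspelled, or missing would then need manual or rule-based review before a location is assigned.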
A second data problem, typical for DATELINE, is the inaccurate calculation of distances. The DATELINE respondents were not asked for the distances they travelled in journeys, trips or excursions; crow-fly distances were added to the databases afterwards, calculated from the geoids of the origin and destination locations. However, the calculated distances are not accurate. They generally underestimate the actual distances; the average underestimation is 6-7%.

Journeys whose distance was assumed to be shorter than 100 km (the defined lower limit for long distance) were removed from the database. Because both wrong identification of locations and incorrect calculation of distances produced calculated distances that differ from the actual ones, the actual distance of a number of removed journeys was longer than 100 km. We estimate that 2-3% of the journeys were erroneously deleted.
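To make the interaction between underestimated distances and the 100 km limit concrete, the following is a minimal sketch of a crow-fly (great-circle) distance and the long-distance filter. The haversine formula, the mean Earth radius, and the function names are assumptions for illustration; the abstract does not state which formula DATELINE used.

```python
import math

EARTH_RADIUS_KM = 6371.0        # mean Earth radius; an assumption of this sketch
LONG_DISTANCE_LIMIT_KM = 100.0  # lower limit for long distance, as stated in the text

def crow_fly_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def is_long_distance(distance_km):
    """Journeys shorter than the limit are the ones that would be removed."""
    return distance_km >= LONG_DISTANCE_LIMIT_KM
```

Under these assumptions, a journey with an actual distance of 105 km whose calculated distance is 6.5% too low (about 98.2 km) would fail `is_long_distance` and be removed, although it belongs in the database.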

We improved the data by correcting the majority of wrongly identified locations and recalculating the crow-fly distances, and by adding expansion factors for erroneously removed journeys.
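The abstract does not describe how the expansion factors were derived. As a hedged sketch only, one simple scheme weights each remaining journey by the inverse of the retained share; the function names and the interpretation of the removed share as a fraction of the true total are assumptions.

```python
def expansion_factor(share_removed):
    """Weight for remaining journeys when a known share was erroneously removed.

    share_removed is assumed to be a fraction of the true total, e.g. 0.025.
    """
    if not 0.0 <= share_removed < 1.0:
        raise ValueError("share_removed must be in [0, 1)")
    return 1.0 / (1.0 - share_removed)

def expanded_count(n_remaining, share_removed):
    """Estimated true number of journeys after applying the expansion factor."""
    return n_remaining * expansion_factor(share_removed)
```

With a removed share of 2.5%, each remaining journey would count for roughly 1.026 journeys under this scheme. In practice such factors would likely vary by group, for example by distance band, since journeys close to the 100 km limit are the ones at risk of removal.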

The aggregate impact of the data problems on travel volume differs between the kinds of movements. It is rather small for regular journeys (underestimation of 4% in numbers and 7% in mileage) but large for excursions (overestimation of 11% and 61%, respectively) and commuting journeys (varying results for different countries, ranging from significant overestimation to extreme underestimation).

The impact on the results of the modal choice analysis is small and partly marginal. The performance of the model improved marginally, and the results regarding significant variables and their ranking are similar. However, for the influence of distance, the variable most affected by the corrections, the analysis based on corrected data gives significantly different, and presumably better, results than the analysis based on the original data.

Publisher

Association for European Transport