Data Pooling for Choice Modelling
A Sivakumar, J Polak, Imperial College London, UK
This paper explores techniques for combining disparate sources of revealed preference data for estimating different kinds of behavioural choice models, with a reliable quantification of the measurement uncertainty.
Combining data sources for the purpose of model development is neither a recent idea nor an uncommon one. Variously known as data pooling, data fusion or data integration, the process of combining different sources of data to develop a more complete and more reliable evidence base is widely used in fields such as business and financial analysis and sensor data fusion for biometric verification. Data pooling is also common in transport planning, for example in O-D matrix estimation and in the combination of short- and long-distance travel data to develop joint travel demand models. Even the practice of using level-of-service data extracted from network models to complement travel survey data is a case of data pooling, although the pooling is not explicitly accounted for in the models. A more rigorous example is the combination of revealed preference (RP) and stated preference (SP) data sources, in which the models explicitly account for the different scales of error in these data.
Despite the vast body of literature on data pooling, few studies attempt to combine two very disparate sources of survey data collected from entirely disjoint samples. Studies that combine RP-SP and SP-SP data are an exception (see, for example, Hensher and Bradley, 1993; Bradley and Daly, 1997), but RP-RP studies are less common. Pooling data from disjoint samples has potentially far-reaching implications for behavioural choice modelling. First, it enables hypotheses to be tested before travel survey extensions are proposed; being able to test a hypothesis with some degree of reliability before collecting large, integrated surveys could save considerable resources. Second, if the data pooling is successful, it may be possible to avoid the burden of large, integrated surveys entirely. Third, being able to pool data and undertake reliable statistical hypothesis tests on the pooled data opens avenues to more efficient use of existing data resources. The preliminary analysis presented in Sivakumar and Polak (2009) supports this proposition. However, a number of issues need to be explored before data fusion can be reliably used to develop choice models.
Sivakumar and Polak (2009) combine the UK National Travel Survey (NTS) data and the UK Time Use Survey (TUS) data for the development of in-home and out-of-home leisure activity participation models as a function of household technology holdings. They propose the conditional probability model (CPM), developed from first principles, as the preferred means of combining disparate data sources compared to more approximate solutions such as ad-hoc cluster sampling and multiple imputation. However, the CPM rapidly becomes infeasible if the data fusion requires the "imputation" of a large number of categorical and continuous variables.
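As an illustration only, and not necessarily the formulation used in Sivakumar and Polak (2009), a conditional-probability approach to pooling can be written by marginalising the choice probability over the variable that is missing from the choice survey. The notation is assumed here: y_n is the observed choice, x_n the variables common to both surveys, z a categorical variable observed only in the other survey, and beta and gamma the parameters of the choice model and of the conditional distribution of z.

```latex
P(y_n \mid x_n;\, \beta, \gamma)
  = \sum_{z \in \mathcal{Z}} P(y_n \mid x_n, z;\, \beta)\, P(z \mid x_n;\, \gamma)
```

Under this reading, which assumes that the conditional distribution of z given x_n transfers between the two samples, each additional categorical variable multiplies the number of terms in the sum and each continuous variable replaces a sum with an integral. This is consistent with the observation that the approach becomes infeasible as the number of such variables grows.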
In this paper we extend the work of Sivakumar and Polak (2009) to examine the value of several approaches, including ad-hoc cluster sampling, multiple imputation and a modified Bayesian Belief Network approach. Specifically, we quantify the degree of uncertainty and measurement error associated with these approximations to the conditional probability model (CPM), thus identifying the technique that most closely approaches the reliability of the CPM. We will also examine the effects of an increasing number of categorical and continuous "imputed" variables on the measurement errors associated with these approaches.
Further, as we would expect the effects of data fusion to vary with the underlying model structure, each of the data fusion techniques will be tested for a number of different underlying model structures, including discrete choice multinomial logit, error components discrete choice and discrete-continuous models. The model estimations will be undertaken initially on simulated datasets in order to enable precise quantification of the accuracy of the data fusion process. This will then be followed by models estimated by combining the UK NTS and TUS data. The primary contribution of the paper will be to provide the practitioner with a means of combining disparate sources of revealed preference data for estimating different kinds of behavioural choice models, with a reliable quantification of the measurement uncertainty.
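The role of simulated data can be sketched as follows. This is a minimal illustration, not the authors' experimental design: it simulates multinomial logit choices from assumed "true" coefficients, re-estimates them by maximum likelihood, and reports the estimation error. The function names, sample size and coefficient values are all hypothetical, and the estimator is written for exactly two coefficients to keep the sketch short.

```python
import math
import random


def choice_probs(x, beta):
    """Multinomial logit probabilities for one decision maker."""
    v = [sum(b * a for b, a in zip(beta, row)) for row in x]
    m = max(v)  # subtract the maximum utility for numerical stability
    e = [math.exp(vi - m) for vi in v]
    s = sum(e)
    return [ei / s for ei in e]


def simulate_mnl(n_obs, beta, n_alts=3, seed=0):
    """Draw standard-normal attributes and sample a choice from the MNL probabilities."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_obs):
        x = [[rng.gauss(0.0, 1.0) for _ in beta] for _ in range(n_alts)]
        p = choice_probs(x, beta)
        r, cum, chosen = rng.random(), 0.0, n_alts - 1
        for j, pj in enumerate(p):
            cum += pj
            if r < cum:
                chosen = j
                break
        data.append((x, chosen))
    return data


def estimate_mnl(data, n_iter=20, tol=1e-8):
    """Newton-Raphson maximum likelihood for a two-coefficient MNL."""
    beta = [0.0, 0.0]
    for _ in range(n_iter):
        g = [0.0, 0.0]                 # gradient of the log-likelihood
        a = [[0.0, 0.0], [0.0, 0.0]]   # negative Hessian
        for x, chosen in data:
            p = choice_probs(x, beta)
            xbar = [sum(pj * row[k] for pj, row in zip(p, x)) for k in range(2)]
            for k in range(2):
                g[k] += x[chosen][k] - xbar[k]
            for pj, row in zip(p, x):
                d = [row[0] - xbar[0], row[1] - xbar[1]]
                for k in range(2):
                    for l in range(2):
                        a[k][l] += pj * d[k] * d[l]
        # Solve the 2x2 system a * step = g by hand
        det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
        step = [(a[1][1] * g[0] - a[0][1] * g[1]) / det,
                (a[0][0] * g[1] - a[1][0] * g[0]) / det]
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta


TRUE_BETA = [1.0, -0.5]  # hypothetical "true" coefficients
sample = simulate_mnl(2000, TRUE_BETA, seed=42)
beta_hat = estimate_mnl(sample)
errors = [bh - bt for bh, bt in zip(beta_hat, TRUE_BETA)]
print("estimated:", [round(b, 3) for b in beta_hat])
print("error:", [round(e, 3) for e in errors])
```

Because the generating coefficients are known, the same comparison could in principle be repeated after splitting the simulated data into two disjoint samples and fusing them, so that any additional error can be attributed to the fusion technique rather than to the choice model.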
Bradley, M.A. and A.J. Daly (1997) "Estimation of logit choice models using mixed stated preference and revealed preference information", In: P.R. Stopher and M. Lee-Gosselin, Editors, Understanding travel behaviour in an era of change, Pergamon, Oxford, pp. 209–232.
Hensher, D.A. and M. Bradley (1993) "Using stated response data to enrich revealed preference discrete choice models", Marketing Letters, 4 (2), pp. 139–152.
Sivakumar, A. and J.W. Polak (2008) "Modelling the endogeneity in activity participation and technology holdings: An exploration of data pooling techniques", presented at the International Choice Modelling Conference, Leeds.
Association for European Transport