## Using Bayesian Belief Networks and Process Metadata to Address Large Scale Data Integration Problems

### Authors

John Polak, Rajesh Krishnan, Charles Lindveld, Miles Logie, Andrew Westlake, Imperial College London, UK; Kay Axhausen, ETH Zurich, CH; Eric Cornelis, FUNDP, BE; Mike Collop, Transport for London, UK; Thomas Haupt, PTV AG, DE

### Description

A new method for combining information from multiple incomplete data sources is presented. It uses Bayesian belief networks to consistently encode modelling knowledge and propagate uncertainty, and process metadata methods to manage information flows.

### Abstract

An increasingly important problem affecting many areas of transport planning, operations and management is the need to combine information from a variety of different data sources in order to provide the best possible estimate of certain parameters of interest. Problems of this type arise for a variety of reasons.

- No single data source contains sufficient information by itself.

- Multiple data sources naturally arise (e.g. through observations at different levels of spatial or temporal aggregation or by means of different survey methods), resulting in a need to reconcile potentially conflicting estimates.

- An existing set of data and parameter estimates needs to be updated or transferred when additional information becomes available.

Although methods have been developed for several specific instances of this problem in different areas of transport studies (e.g., O-D matrix estimation, synthetic population generation, network performance estimation), there is not yet a coherent set of general purpose methods for dealing with data combination problems. Moreover, because appropriate general purpose techniques are lacking, data integration is in practice often undertaken in an ad hoc fashion, potentially both reducing efficiency and exposing the analysis to the risk of biases of various sorts.

In this paper, we argue that in order to address problems of this sort, innovation at two levels is required in current practice. The first is in the methods used for the consistent description and management of transport data sources and modelling processes and the second is in the methods used for the characterisation and propagation of data and modelling uncertainty during analysis.

We describe the development of a general Bayesian framework for data integration problems of this sort, together with associated process metadata tools to support model application. This framework is designed to enable the use of existing structural knowledge (in the form of existing transport models) and existing measurement knowledge (in the form of characterisations of sampling and non-sampling errors) to inform the data integration task. The transport system of interest is characterised by a state vector X, which might represent, for example, an O-D matrix, a set of link flows, a set of link travel times, or a combination of some or all of these quantities. Full information about the system is provided by the probability distribution P(X). The data integration problem arises because, in general, complete observations of realisations of the state vector X will not be available. Instead, what can be observed are realisations of another stochastic vector Y, which is related to X. In the simplest case, that of a direct measurement, Y is related to X through a measurement process characterised by sampling and non-sampling variation. The vector Y may contain several direct measurements of the same underlying state X, arising for example from the application of different measurement methods. Direct measurements are complemented by indirect measurements, which comprise observations of realisations of quantities that are distinct from, but structurally related to, the state vector X. Thus the vector Y will in general be a combination of direct and indirect measurements on the state vector X and will embody both measurement and structural information. Given this setup, the data integration problem is to determine the best estimate of P(X) given the observations Y and (possibly) a prior estimate of X. We show that this formulation of the data integration problem subsumes a number of existing problems in the literature.
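In symbols, this formulation amounts to an application of Bayes' rule. The expression below is a sketch only: the split of Y into a direct component Y_d and an indirect component Y_s, and the assumption that the two are conditionally independent given X, are illustrative assumptions that the abstract does not state. For a discrete state vector the integral becomes a sum.

$$
P(X \mid Y) \;=\; \frac{P(Y \mid X)\,P(X)}{\int P(Y \mid X')\,P(X')\,\mathrm{d}X'} \;\propto\; P(Y_d \mid X)\,P(Y_s \mid X)\,P(X)
$$

Here P(X) plays the role of the prior estimate, P(Y_d | X) carries the measurement knowledge (sampling and non-sampling error), and P(Y_s | X) carries the structural knowledge linking the indirectly observed quantities to the state.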

Our approach to addressing this problem is to encode existing domain structural and measurement knowledge (which we term our general a priori model or GAPM) in the form of a Bayesian belief network (BBN) and to use the BBN representation of the GAPM to compute the posterior distribution of X conditional on Y. Except in very specific special cases, this posterior distribution will not be available in closed form, so the properties of the posterior must be determined empirically. There are two key advantages in this context to adopting a Bayesian approach. The first is that, in principle, it allows us to treat both observational information from sample surveys (and other data sources) and information encoded in structural and measurement modelling assumptions in the GAPM in a consistent fashion. The second is that recent developments in computational Bayesian techniques, in particular the emergence of Markov Chain Monte Carlo methods (e.g., the Hastings-Metropolis and Gibbs samplers), provide a rich set of tools for sampling from complex posterior distributions. Notwithstanding these recent developments, however, the practical implementation of this approach still poses considerable challenges, especially in dealing with the high dimensionality of typical transport network problems. A number of strategies for dealing with this problem are discussed, including: re-parameterisation of the state space, hierarchical geographical decomposition of the BBN, functional partitioning of the BBN, parallelisation of the standard samplers, and the development of special purpose samplers that exploit the characteristics of particular problems.
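To make the sampling idea concrete, the following is a minimal sketch of a random-walk Metropolis sampler (the symmetric-proposal special case of the Hastings-Metropolis algorithm) on a toy problem. It is not the authors' sampler or their BBN. Everything specific is a hypothetical assumption chosen for illustration: a state of two O-D flows, a log-normal prior, a Gaussian direct measurement of the first flow, a Poisson link count on the sum of both flows, and all numeric values and function names. Sampling the log-flows rather than the flows is one simple example of re-parameterising the state space so that proposals stay positive.

```python
import numpy as np


def log_posterior(theta, y_direct, sd_direct, y_count, prior_mean, prior_sd):
    """Unnormalised log posterior of theta = log(X) for a two-flow toy problem.

    All model choices here are illustrative assumptions, not the paper's GAPM.
    """
    x = np.exp(theta)  # flows are positive by construction
    # Prior: independent normal on the log-flows (log-normal prior on X).
    lp = -0.5 * np.sum(((theta - prior_mean) / prior_sd) ** 2)
    # Direct measurement: noisy survey estimate of the first flow.
    lp += -0.5 * ((y_direct - x[0]) / sd_direct) ** 2
    # Indirect measurement: Poisson link count whose mean is the sum of flows.
    lam = x.sum()
    lp += y_count * np.log(lam) - lam
    return lp


def random_walk_metropolis(log_post, theta0, n_iter, step, rng):
    """Draw n_iter samples using a symmetric Gaussian random-walk proposal."""
    theta = np.asarray(theta0, dtype=float)
    current = log_post(theta)
    samples = np.empty((n_iter, theta.size))
    accepted = 0
    for i in range(n_iter):
        proposal = theta + step * rng.standard_normal(theta.size)
        candidate = log_post(proposal)
        # Accept with probability min(1, posterior ratio).
        if np.log(rng.uniform()) < candidate - current:
            theta, current = proposal, candidate
            accepted += 1
        samples[i] = theta
    return samples, accepted / n_iter


rng = np.random.default_rng(0)
prior_mean = np.log([100.0, 100.0])  # hypothetical prior centre for both flows


def toy_log_post(theta):
    # Hypothetical observations: survey estimate 120 (sd 15), link count 300.
    return log_posterior(theta, y_direct=120.0, sd_direct=15.0, y_count=300,
                         prior_mean=prior_mean, prior_sd=1.0)


samples, acc_rate = random_walk_metropolis(toy_log_post, prior_mean,
                                           n_iter=20000, step=0.1, rng=rng)
flows = np.exp(samples[5000:])        # discard burn-in, map back to flows
posterior_mean = flows.mean(axis=0)   # empirical posterior mean of X
print("acceptance rate:", round(acc_rate, 2))
print("posterior mean flows:", posterior_mean.round(1))
```

In this toy setting the second flow is never measured directly; information about it comes only from the link count and the prior, which is the sense in which indirect measurements inform the state. A real network problem would have far more dimensions, which is where the decomposition and special purpose sampler strategies listed above become relevant.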

Alongside the modelling work, the project has developed a metadata framework for characterising data inputs and model processing and storing a complete audit trail, covering the specification and fitting of statistical models. This addresses key concerns regarding the provenance and reliability of model-based estimates.

The structure of the paper is as follows. Following a brief introduction, the second section sets out a general description of the data integration problem and describes the Bayesian approach to addressing it. In the third section we discuss some practical considerations, particularly the problem of dimensionality, and present a number of special purpose samplers that have been developed to deal with common transport applications. The fourth section discusses the metadata issues, focusing in particular on the motivation for this approach and describing the process metadata tools developed to support the modelling. The fifth section illustrates the application of the methods in a number of case studies, including (a) the use of data from household surveys, census records and network flow counts to produce augmented O-D matrices and (b) synthetic population generation. The paper concludes with a general discussion of potential directions for future research in this area.

#### Publisher

Association for European Transport