The Importance of Synthetic Datasets in Empirical Testing: Comparison of NL Models and MNL with Error Components Models


Laurie Garrow and Tudor Bodea, Georgia Institute of Technology, US


We use synthetic datasets to compare two approaches for incorporating correlation in discrete choice models. Initial results indicate that using synthetic datasets can lead to systematic bias, which must be accounted for when evaluating the performance of the competing models.


Synthetic data is widely used to model real-life phenomena and to empirically validate new discrete choice models. However, little is known about how synthetic data generation influences the properties of parameter estimates and the validity of recommendations based on those estimates. Garrow and Bodea acknowledged the importance of synthetic data generation in their 2005 ETC paper and provided guidance on how to create correlation among alternatives for generalized extreme value (GEV) models, using an approach similar in spirit to methods found in the simulation literature. In that paper, however, they encountered problems when generating datasets of 10,000 observations to support an empirical comparison of NL models and MNL with error components models: the synthetic datasets exhibited an unacceptably high degree of variation in the estimates of the logsum coefficients. Consequently, it became critical to identify potential sources of this variation in order to design a set of experiments that could be used to empirically compare NL models and MNL with error components models (the primary objective of the 2005 ETC paper, which uncovered more subtle theoretical issues than we had anticipated).
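As a minimal illustration of this style of data generation (a hedged sketch, not necessarily the exact procedure of Garrow and Bodea, 2005): correlation among alternatives in a nest can be induced by adding a normally distributed error component, shared within each nest, on top of i.i.d. Gumbel errors. The systematic utilities, nest structure, and scale parameter below are hypothetical.

```python
import numpy as np

def simulate_nested_choices(V, nest_of, sigma, n_obs, rng):
    """Simulate choices whose unobserved errors are correlated within nests.

    Correlation is induced by a standard-normal error component shared by
    all alternatives in the same nest (scaled by sigma), added to i.i.d.
    Gumbel errors, so alternatives within a nest become closer substitutes,
    as in a nested logit.
    """
    n_alt = len(V)
    n_nests = max(nest_of) + 1
    # One shared normal draw per nest per observation.
    z = rng.standard_normal((n_obs, n_nests))
    # i.i.d. Gumbel (extreme value type I) errors for every alternative.
    eps = rng.gumbel(size=(n_obs, n_alt))
    # Total utility = systematic part + shared nest component + Gumbel error.
    U = V + sigma * z[:, nest_of] + eps
    return U.argmax(axis=1)

rng = np.random.default_rng(42)
V = np.array([0.5, 0.5, 0.0, 0.0])   # hypothetical systematic utilities
nest_of = np.array([0, 0, 1, 1])     # alternatives 0,1 in nest A; 2,3 in nest B
choices = simulate_nested_choices(V, nest_of, sigma=1.5, n_obs=10_000, rng=rng)
shares = np.bincount(choices, minlength=4) / len(choices)
```

Varying `sigma` and the nest assignment controls the amount of within-nest correlation, and varying `V` controls the choice frequency of each alternative, which are exactly the experimental factors examined in the paper.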

This paper has two key purposes. First, we present the results of an empirical analysis showing that the ability to recover logsum parameter estimates that exhibit a low degree of variation across multiple datasets depends critically on the number of observations, the frequency of chosen alternatives, and the amount of correlation in the nests. Moreover, even after controlling for variation across datasets, logsum parameter estimates are biased in specific ways. Regarding variation, we find that: (1) the range of logsum parameter estimates obtained from 20 datasets of 10,000 observations (0.125) shrinks to 0.006 when 20 datasets of 1,000,000 observations are used; (2) as the frequency of chosen alternatives in a nest decreases, so too does the ability to recover logsum parameter estimates for that nest; and (3) as the logsum coefficients decrease (that is, as the amount of correlation in the nest increases), the ability to recover parameter estimates slightly decreases. With regard to coefficient bias, we find that: (1) all logsum coefficients are biased upwards in synthetic datasets generated using the procedure described in Garrow and Bodea (2005); (2) the bias increases dramatically for nests that have a low choice frequency; and (3) the bias tends to be more pronounced for nests with high correlations among alternatives.

The second key purpose of this paper is to complete the empirical comparison of NL models and MNL with error components models. There are two common approaches for incorporating correlation among alternatives. The first uses a mixed MNL model and allows the parameters of the utility function to vary across alternatives in such a way that analogs to GEV models, such as the NL model, are created; the attributes used to create correlation and/or heterogeneity are called error components. The second approach uses a more complicated GEV model, such as the NL model, to represent correlation among alternatives directly. The benefit of this approach is that it involves fewer dimensions of integration and should therefore require less computational time; the disadvantage is that the researcher needs to program a more complicated log-likelihood function.
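To make the comparison concrete, the two representations can be written in their standard textbook forms (the notation here is ours, not necessarily that of the paper). For an alternative $i$ in nest $B_m$ with logsum coefficient $\mu_m$, the NL choice probability has a closed form, while the error components approach replaces the nesting structure with a shared normal term and yields a mixed MNL integral:

```latex
% Nested logit: closed-form choice probability, logsum coefficient \mu_m
P_i = \frac{e^{V_i/\mu_m}\left(\sum_{j \in B_m} e^{V_j/\mu_m}\right)^{\mu_m - 1}}
           {\sum_{\ell}\left(\sum_{j \in B_\ell} e^{V_j/\mu_\ell}\right)^{\mu_\ell}}

% Error components: U_j = V_j + \sigma_{m(j)} z_{m(j)} + \varepsilon_j,
% with z_m \sim N(0,1) shared within nest m, \varepsilon_j i.i.d. Gumbel,
% and m(j) the nest containing j, giving the mixed MNL probability
P_i = \int \frac{e^{V_i + \sigma_{m(i)} z_{m(i)}}}
                {\sum_j e^{V_j + \sigma_{m(j)} z_{m(j)}}}\,\phi(z)\,dz
```

The first expression requires no integration; the second has no closed form and must be approximated by simulation over draws of $z$, which is why the dimensionality of integration and the number of simulation draws matter for the comparison.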

Based on the findings described above, we use 100,000 observations to empirically compare these two approaches for representing correlation among alternatives. Specifically, this study explores the sensitivity of empirical identification in mixed MNL models to several factors, including the choice frequency of alternatives, the amount of correlation in the nests, and the number of Halton draws used as support points. Results indicate a clear lack of empirical identification for nests that have a low choice frequency (defined as each alternative in the low-frequency nest being chosen approximately 2,500 times). Moreover, while models with equal choice frequencies (defined as each alternative being chosen approximately 16,667 times) converge, the coefficients associated with the error components are biased. Both findings motivate the estimation of mixed GEV models.
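For readers unfamiliar with Halton draws, the sketch below shows how such quasi-random support points can be generated and mapped to the standard normal for simulating the error components; the estimation machinery itself is not shown, and the base and number of draws are arbitrary choices for illustration.

```python
from statistics import NormalDist

def halton(n, base):
    """Return the first n points of the 1-D Halton sequence in the given base.

    Halton points are quasi-random: they cover (0, 1) more evenly than
    pseudo-random draws, so fewer support points are typically needed when
    approximating the mixed MNL integral by simulation.
    """
    points = []
    for i in range(1, n + 1):
        f, x, k = 1.0, 0.0, i
        while k > 0:
            # Peel off base-`base` digits of i, reflecting them about
            # the radix point (the van der Corput construction).
            f /= base
            x += f * (k % base)
            k //= base
        points.append(x)
    return points

# Map the uniform Halton points to standard-normal support points for the
# error components via the inverse normal CDF.
draws = [NormalDist().inv_cdf(u) for u in halton(100, 2)]
```

In a multi-dimensional setting, a distinct prime base is used per dimension to avoid correlation between coordinates.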


Association for European Transport