Clustering and Profiling Traffic Roads by Means of Accident Data
K Geurts, G Wets, T Brijs, K Vanhoof, Limburg University, BE
In Belgium, approximately 50,000 injury accidents occur in traffic every year, with almost 70,000 victims, including 1,500 deaths (Belgian Institute for Traffic Safety, 2000). Not only does the steady increase in traffic intensity place a heavy burden on society in terms of the number of casualties, the insecurity on the roads also has an important effect on the economic costs associated with traffic accidents. Accordingly, traffic safety is currently one of the highest priorities of the Belgian government.
Cameron (1997) indicates that clustering methods are an important tool when analyzing traffic accidents as these methods are able to identify groups of road users, vehicles and road segments which would be suitable targets for countermeasures. More specifically, cluster analysis is a statistical technique that groups items together on the basis of similarities or dissimilarities (Anderberg, 1973). In Ng, Hung and Wong (2002) a combination of cluster analysis, regression analysis and Geographical Information System (GIS) techniques is used to group homogeneous accident data together, estimate the number of traffic accidents and assess the risk of traffic accidents in a study area.
The results will help authorities effectively allocate resources to improve safety levels in those areas with high accident risk. In addition, the results will provide information for urban planners to develop a safer city.
Furthermore, according to Kononov (2002), it is not possible to develop effective countermeasures to improve traffic safety without being able to properly and systematically relate accident frequency and severity to a large number of variables such as traffic, geometric and environmental factors. Lee, Saccomanno and Hellinga (2002) indicate that in the past, statistical models have been widely used to analyze road crashes. However, Chen and Jovanis (2002) demonstrate that certain problems may arise when using classic statistical analysis on data sets of such high dimensionality, for example an exponential increase in the number of parameters as the number of variables increases, and the invalidity of statistical tests as a consequence of sparse data in large contingency tables. Data mining is the nontrivial extraction of implicit, previously unknown, and potentially useful information from large amounts of data (Frawley et al, 1991). Data mining methods are therefore particularly useful in the context of large data sets on road accidents to identify the relevant variables that make a strong contribution towards a better understanding of accident circumstances.
The identification of geographical locations with high accident risk by means of clustering techniques and profiling them in terms of accident related data and location characteristics by means of data mining techniques must therefore provide valuable input for government actions towards traffic safety.
In the first part of this research, an innovative method based on latent class clustering (also called model-based clustering or finite mixture modelling) (McLachlan and Peel, 2000) is used to cluster traffic roads into distinct groups based on their similar accident frequencies. The data are obtained from the Belgian "Analysis Form for Traffic Accidents" that should be filled out by a police officer for each traffic accident with injured or fatally wounded casualties on a public road in Belgium. These traffic accident data are a rich source of information on the different circumstances in which the accidents have occurred: course of the accident (type of collision, road users, injuries, ...), traffic conditions (maximum speed, priority regulation, ...), environmental conditions (weather, light conditions, time of the accident, ...), road conditions (road surface, obstacles, ...), human conditions (fatigue, alcohol, ...) and geographical conditions (location, physical characteristics, ...). On average, 45 attributes are available for each accident in the data set. More specifically, this analysis focuses on 19 central roads of the city of Hasselt for 3 consecutive time periods of 3 years each: 1992-1994, 1995-1997 and 1998-2000.
The observed accident frequencies are assumed to originate from a mixture of density distributions for which the parameters of the distribution, the size and the number of segments are unknown. It is the objective of latent class clustering to "unmix" the distributions and to find the optimal parameters of the distributions and the number and size of the segments, given the underlying data.
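As a point of reference, the mixture assumption above can be written in the generic finite mixture form found in standard treatments such as McLachlan and Peel (2000). The notation is an assumption introduced here, not taken from the paper: k is the number of segments, p_j the size (mixing proportion) of segment j, f_j the component density with parameters θ_j, and y_1, ..., y_n the observed vectors of accident counts per road.

```latex
\begin{align}
  f(y \mid \Theta) &= \sum_{j=1}^{k} p_j \, f_j(y \mid \theta_j),
  \qquad p_j \ge 0, \quad \sum_{j=1}^{k} p_j = 1, \\
  \ell(\Theta) &= \sum_{i=1}^{n} \log \left( \sum_{j=1}^{k} p_j \, f_j(y_i \mid \theta_j) \right).
\end{align}
```

"Unmixing" then amounts to choosing k, the proportions p_j and the component parameters θ_j so that the log-likelihood ℓ is as large as possible, subject to a penalty for model complexity.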
A 3-variate Poisson distribution (Y1, Y2, Y3) with one common covariance term is defined (Li et al, 1999) with Yi = the number of accidents in period i and all X's independent univariate Poisson variables with respective parameters (λ1, λ2, λ3, λ123). Since the occurrence of accidents over several time periods may be related (e.g. due to bad infrastructure), correlations between the observations in each latent class cluster are allowed. Therefore, the parameter λ123 is identified, which can be considered as a covariance factor that measures the risk of the area common to all time periods (Karlis, 2000). To estimate the parameters, we maximize the loglikelihood using a non-linear iterative fitting algorithm (nlp). To prevent the algorithm from finding a local rather than a global optimum, we use multiple sets of starting values. Next, we observe the evolution of the loglikelihood for different restarts of the algorithm. Finally, to determine the number of segments (k) in the mixture model, different information criteria (Akaike (AIC), Consistent Akaike (CAIC) and Bayes (BIC)) are used to evaluate the quality of a cluster solution (Schwarz, 1978).
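The following is a minimal sketch of the likelihood that such a model implies, not the authors' implementation. It assumes the usual common-covariance construction Yi = Xi + X123, in which one shared Poisson variable is added to each period-specific one; the abstract does not spell this construction out. All function and argument names are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """Univariate Poisson probability P(X = k) for a rate lam >= 0."""
    if k < 0:
        return 0.0
    if lam == 0:
        return 1.0 if k == 0 else 0.0
    # Work on the log scale to avoid overflow in lam**k and k!.
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def trivariate_poisson_pmf(y, lam, lam_common):
    """P(Y1=y1, Y2=y2, Y3=y3), ASSUMING Y_i = X_i + X_123.

    y          -- tuple of three observed counts (one per period)
    lam        -- tuple of the three period-specific rates
    lam_common -- rate of the shared term, i.e. the common covariance
    """
    total = 0.0
    # Sum over every value s the shared term could have taken.
    for s in range(min(y) + 1):
        term = poisson_pmf(s, lam_common)
        for y_i, lam_i in zip(y, lam):
            term *= poisson_pmf(y_i - s, lam_i)
        total += term
    return total


def mixture_loglik(data, weights, lams, lam_commons):
    """Log-likelihood of a finite mixture of trivariate Poisson components.

    Assumes every observation has positive probability under the mixture
    (true whenever all rates are strictly positive).
    """
    loglik = 0.0
    for y in data:
        density = sum(
            w * trivariate_poisson_pmf(y, lam, c)
            for w, lam, c in zip(weights, lams, lam_commons)
        )
        loglik += math.log(density)
    return loglik
```

Under this construction each marginal Yi is Poisson with mean equal to its own rate plus the common rate, and the covariance between any two periods equals the common rate. An optimizer would maximize `mixture_loglik` from several starting points and keep the best run, which is the restart strategy described above.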
Results show that, although the loglikelihood of the model increases as the number of segments increases, the information criteria do not select the maximum possible number of segments to cluster the data. Taking model complexity into account, the AIC selects 3 clusters whereas the CAIC and the BIC select only 2 clusters. This difference can be explained by the fact that the AIC does not consider the size of the data set, whereas the CAIC and the BIC do penalize for this factor. Furthermore, in the 2-component common covariance model the average number of accidents increases each period for the first cluster and decreases each period for the second cluster. Additionally, the observed average accident rate per period for cluster 1 depends mainly on the average accident frequency of the period concerned and less on the covariance factor.
For cluster 2, the covariance term does play an important role in the observed average accident rate per period. The explanation is that this cluster has a strong factor common to all periods, which relates to the accident "risk" on these roads.
Analogously, the results for the 3-component common covariance model can be analysed. Here, one should remark that the value for λ1 in cluster 1 and the value for λ2 in cluster 2 are very small, meaning that the observed average accident frequency for cluster 1 in the first period and for cluster 2 in the second period will mainly be influenced by the overall accident risk on the roads.
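The disagreement between the AIC and the CAIC and BIC reported above comes down to how heavily each criterion penalizes an extra parameter. The sketch below uses the textbook definitions of the three criteria. The parameter count (four rates per segment plus k - 1 free segment sizes) and the choice of the number of roads as the sample size n are assumptions made for illustration, and the function names are hypothetical.

```python
import math


def n_parameters(k):
    """Free parameters of a k-segment model (ASSUMED count):
    4 Poisson rates per segment plus k - 1 free mixing proportions."""
    return 4 * k + (k - 1)


def information_criteria(loglik, k, n):
    """Textbook AIC, BIC and CAIC for a fitted k-segment model.

    loglik -- maximized log-likelihood of the model
    n      -- number of observations (assumed here: number of roads)
    Lower values indicate a better trade-off between fit and complexity.
    """
    d = n_parameters(k)
    return {
        "AIC": -2.0 * loglik + 2.0 * d,
        "BIC": -2.0 * loglik + d * math.log(n),
        "CAIC": -2.0 * loglik + d * (math.log(n) + 1.0),
    }


def best_k(logliks, n, criterion):
    """Return the k whose model minimizes the chosen criterion.

    logliks -- dict mapping k to the maximized log-likelihood for that k
    """
    return min(
        logliks,
        key=lambda k: information_criteria(logliks[k], k, n)[criterion],
    )
```

With n = 19, each additional parameter costs 2 units under the AIC, about 2.94 under the BIC and about 3.94 under the CAIC. A modest gain in loglikelihood can therefore justify a third segment under the AIC while failing to do so under the other two criteria.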
In the second part of this paper, the data mining technique of association rules is used to profile each cluster of traffic roads in terms of the available traffic accident data. The strength of this approach lies in the identification of relevant variables that make a strong contribution towards a better understanding of the accident circumstances for each group of traffic roads (Geurts et al, 2002). Since the clusters show different results for the overall accident "risk" on the roads, one could expect that not every accident variable will be of equal importance when describing the different groups of traffic roads. Therefore, a comparative analysis between the accident characteristics of the different clusters is conducted, which provides new insights into the complexity and causes of road accidents.
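To make the profiling idea concrete, the sketch below computes the three standard measures used to judge an association rule (support, confidence and lift) on a toy set of accident records. The attribute values are invented for illustration and are not taken from the Belgian accident data, and the helper names are hypothetical. A real analysis would first mine the frequent itemsets per cluster, for example with an Apriori-style algorithm, and then compare the rules across clusters.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    a = frozenset(antecedent)
    c = frozenset(consequent)
    s_a = support(transactions, a)
    s_c = support(transactions, c)
    s_ac = support(transactions, a | c)
    confidence = s_ac / s_a if s_a > 0 else 0.0
    lift = confidence / s_c if s_c > 0 else 0.0
    return {"support": s_ac, "confidence": confidence, "lift": lift}


# Toy accident records for one cluster of roads (INVENTED values).
accidents = [
    frozenset({"wet_road", "night", "rear_end"}),
    frozenset({"wet_road", "rear_end"}),
    frozenset({"dry_road", "night", "side_collision"}),
    frozenset({"wet_road", "day", "side_collision"}),
]

metrics = rule_metrics(accidents, {"wet_road"}, {"rear_end"})
```

A lift above 1 indicates that the consequent occurs more often together with the antecedent than it would if the two were independent. Comparing the same rule's metrics between clusters is one simple way to see which accident characteristics distinguish a group of roads.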
Association for European Transport