By Nam Huynh
Micro-simulations, such as epidemiology models or activity-based models for urban transport demand forecasting, usually involve a large number of agents representing the real population living in the area being studied. It is, however, extremely expensive, if not impossible (due to stringent privacy laws in certain countries), to carry out a survey that yields a fully disaggregated dataset describing the demographics and characteristics of the agents of interest. An alternative to collecting such disaggregated data is to construct a synthetic population that satisfactorily matches the demographics of the real population. Synthetic population generation has consequently attracted increasing attention from research groups around the world. A number of works, for example those by Huang and Williamson (2001), Bowman (2004), Ryan et al. (2009), Muller and Axhausen (2010), and Tanton et al. (2014), review and compare the population synthesis methods in the literature.
The basic principle behind the majority of population synthesisers found in the literature is to integrate an aggregate dataset with a disaggregate dataset. The aggregate dataset is a set of joint distributions (or cross-tabulations) that describes the demographics of a relatively small geographical area for which the synthetic population needs to be generated (the target area). Such a dataset is normally available from census data, such as the Summary Files in the US, the Small Area Statistics file in the UK, and the Community Profiles in Australia. The disaggregate dataset is normally survey data on sample households, with the demographic attributes of each household and of its residents. Examples of such survey data are the Public-Use Microdata Samples in the US, the Sample of Anonymised Records in the UK, and the Confidentialised Unit Record File in Australia. The survey data normally covers a much larger geographical area (the seed area) than the area for which the synthetic population is required.
One critical assumption in the aforementioned population synthesisers is the availability of a disaggregate dataset from which household records are drawn to form the resulting population in the target area. This assumption does not always hold, either because a sample survey does not exist or, more often, because it is inaccessible. Even when such survey data is available, its sample size must be large enough to be fully representative of the demographic distributions of each target area. This condition is critical to the convergence of the iterative processes (e.g. IPF, IPU, HIPF) used in the majority of the above approaches.
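To make the role of the seed data concrete, here is a minimal sketch of two-dimensional iterative proportional fitting (IPF), the simplest of the iterative processes named above. It is a generic textbook version, not the implementation of any synthesiser cited here, and the seed table, the marginal totals and the function name `ipf` are made-up assumptions for illustration only.

```python
def ipf(seed, row_targets, col_targets, tol=1e-9, max_iter=1000):
    """Fit a 2-D table to row and column totals by iterative proportional fitting."""
    table = [row[:] for row in seed]  # work on a copy of the seed table
    for _ in range(max_iter):
        # Scale each row so that it sums to its target total.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            if total > 0:
                table[i] = [cell * target / total for cell in table[i]]
        # Scale each column so that it sums to its target total.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                for row in table:
                    row[j] *= target / total
        # Columns match right after the column step, so only rows need checking.
        error = max(abs(sum(table[i]) - t) for i, t in enumerate(row_targets))
        if error < tol:
            break
    return table


# Hypothetical seed: survey counts by age group (rows) and household size (columns).
seed = [[10.0, 20.0], [30.0, 40.0]]
# Hypothetical census marginals for one target area (both sets sum to 100).
age_totals = [40.0, 60.0]
size_totals = [35.0, 65.0]

fitted = ipf(seed, age_totals, size_totals)
```

In this sketch, a seed row that sums to zero can never be scaled up to a positive target, so the loop simply runs out of iterations. That is one way an unrepresentative sample can prevent the fitting from converging.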
Researchers at the SMART Infrastructure Facility have developed an algorithm that follows the sample-free approach to synthesise a population for agent-based modelling purposes. This algorithm is among the very few in the literature that do not rely on sample survey data to construct a synthetic population, and thus has potentially wider application where such survey data is unavailable or inaccessible. Unlike existing sample-free algorithms, the population synthesiser presented in this paper applies heuristics to part of the allocation of synthetic individuals into synthetic households. As a result, the iterative process that allocates individuals into households, which is normally the most computationally demanding and time-consuming step, is required only for a subset of the synthetic individuals. The population synthesiser in this work is therefore computationally efficient enough for practical use in building a large synthetic population (many millions of people) for many thousands of target areas at the smallest possible geographical level. This capability helps preserve the geographical heterogeneity of the resulting synthetic population. The paper also presents the application of the new method to synthesise the population of New South Wales, Australia, in 2006. The resulting synthetic population has approximately 6 million people living in over 2.3 million households residing in private dwellings across over 11,000 Census Collection Districts. Analyses indicate that the synthetic population closely matches the census data across seven demographic attributes that characterise the population at both the household level and the individual level.
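As a rough illustration of what "sample-free" means in practice, the toy sketch below creates individuals directly from aggregate counts and then groups them into households with one simple rule. It is not the SMART algorithm: the counts, the household sizes, the function names and the "one adult per household" rule are all assumptions invented for this example.

```python
import random

# Hypothetical census counts for one target area (not real data).
person_counts = {
    ("adult", "F"): 3,
    ("adult", "M"): 3,
    ("child", "F"): 1,
    ("child", "M"): 2,
}
# Hypothetical household sizes for the same area; they sum to the 9 people above.
household_sizes = [1, 2, 3, 3]


def expand_counts(counts):
    """Create one record per person directly from the aggregate counts."""
    return [
        {"age_group": age_group, "sex": sex}
        for (age_group, sex), n in counts.items()
        for _ in range(n)
    ]


def allocate(people, sizes, rng):
    """Toy rule: seat one adult in every household, then fill the rest at random."""
    adults = [p for p in people if p["age_group"] == "adult"]
    others = [p for p in people if p["age_group"] != "adult"]
    if len(adults) < len(sizes) or len(people) != sum(sizes):
        raise ValueError("counts are inconsistent with the household sizes")
    rng.shuffle(adults)
    households = [[adults.pop()] for _ in sizes]
    remaining = adults + others
    rng.shuffle(remaining)
    for household, size in zip(households, sizes):
        while len(household) < size:
            household.append(remaining.pop())
    return households


people = expand_counts(person_counts)
households = allocate(people, household_sizes, random.Random(0))
```

No survey records are drawn at any point: every synthetic person comes from the census counts themselves. A real synthesiser would replace the random fill with rules or an iterative procedure that also respects household-level distributions.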
A manuscript on this population synthesiser is under revision for publication in the Journal of Artificial Societies and Social Simulation. The manuscript and the source code of the population synthesis algorithm are available upon request.
References
Huang, Z., and Williamson, P. (2001) A comparison of synthetic reconstruction and combinatorial optimisation approaches to the creation of small-area microdata, http://pcwww.liv.ac.uk/~william/microdata/workingpapers/hw_wp_2001_2.pdf, accessed on 30/03/2015.
Bowman, J.L. (2004) A comparison of population synthesizers used in microsimulation models of activity and travel demand, http://jbowman.net/papers/2004.Bowman.Comparison_of_PopSyns.pdf, accessed on 30/03/2015.
Muller, K., and Axhausen, K.W. (2010) Population synthesis for microsimulation: State of the art, 10th Swiss Transport Research Conference, Ascona, 2010.
Tanton, R., Williamson, P., and Harding, A. (2014) Comparing two methods of reweighting a survey file to small area data, International Journal of Microsimulation, 7(1), 76-79.