Synthetic Data for an Imaginary Country, Sample, 2023
A synthetic hierarchical dataset for simulation and training purposes
This dataset is part of a collection of fully synthetic data generated, for training and simulation purposes, for an imaginary middle-income country. The dataset is available in English and French. A full population dataset (~10 million individuals) is also available in English and French as a "synthetic census dataset".
The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, asset ownership). The data only include ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for the purpose of training and simulation and is not intended to be representative of any specific country.
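As an illustration of the two-file relational structure, the following minimal Python sketch links individual records to their household and counts household members. The column names (`hid`, `pid`, `urban`, `age`) and the toy records are hypothetical; the actual variable names are documented in the data dictionary distributed with the files.

```python
from collections import Counter

# Toy records; "hid", "pid", "urban", and "age" are hypothetical column names.
households = [
    {"hid": 1, "urban": 1},
    {"hid": 2, "urban": 0},
]
individuals = [
    {"hid": 1, "pid": 1, "age": 34},
    {"hid": 1, "pid": 2, "age": 6},
    {"hid": 2, "pid": 1, "age": 51},
]


def attach_household(individuals, households, key="hid"):
    """Copy household-level variables onto each individual record."""
    by_id = {h[key]: h for h in households}
    return [{**by_id[p[key]], **p} for p in individuals]


def household_sizes(individuals, key="hid"):
    """Count the individuals belonging to each household."""
    return Counter(p[key] for p in individuals)


merged = attach_household(individuals, households)
sizes = household_sizes(individuals)
```

Keeping the two levels in separate files avoids repeating household variables on every individual row; the join is only materialized when an analysis needs both levels.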
The full-population dataset (with about 10 million individuals) is also distributed as open data.
Kind of Data
Unit of Analysis
V. 2023-05-01 8K HH EN
Dataset generated using REaLTabFormer (sample of 8,000 households), with post-processing. English version.
water and sanitation
The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.
province (admin1), district (admin2)
The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.
Producers and sponsors
Development Data Group, Data Analytics Unit
UNHCR-World Bank Joint Data Center on Forced Displacement
Sponsored research work for the development of synthetic data for the purpose of assessing statistical disclosure risk measures.
The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In a first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
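The two-stage design described above can be sketched in Python as follows. This is not the R script provided as an external resource; it is a minimal illustration under stated assumptions: stratum "size" is taken to be a household count, rounding uses the largest-remainder method, and the function names and example stratum labels are hypothetical.

```python
import random

SAMPLE_SIZE = 8000                  # households, from the sampling description
HH_PER_EA = 25                      # households per enumeration area
N_EAS = SAMPLE_SIZE // HH_PER_EA    # 320 enumeration areas in total


def allocate_eas(stratum_sizes, n_eas=N_EAS):
    """Allocate enumeration areas to strata proportionally to stratum size.

    Assumption: fractional quotas are rounded with the largest-remainder
    method so that the allocation sums exactly to n_eas.
    """
    total = sum(stratum_sizes.values())
    quotas = {s: n_eas * size / total for s, size in stratum_sizes.items()}
    alloc = {s: int(q) for s, q in quotas.items()}
    leftover = n_eas - sum(alloc.values())
    by_remainder = sorted(quotas, key=lambda s: quotas[s] - alloc[s], reverse=True)
    for s in by_remainder[:leftover]:
        alloc[s] += 1
    return alloc


def draw_households(ea_household_ids, n=HH_PER_EA, seed=0):
    """Second stage: simple random sample of n households within one EA."""
    rng = random.Random(seed)
    return rng.sample(ea_household_ids, n)


# Hypothetical strata (geo_1 x urban/rural) with made-up household counts.
sizes = {("P1", "urban"): 120_000, ("P1", "rural"): 310_000,
         ("P2", "urban"): 95_000, ("P2", "rural"): 275_000}
allocation = allocate_eas(sizes)
selected = draw_households(list(range(1, 181)))
```

With 25 households per enumeration area, the 8,000-household target corresponds to 320 enumeration areas across all strata.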
This is a synthetic dataset; the "response rate" is 100%.
Sample weights were calculated that take the stratification into account. See the R script provided as an external resource.
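For readers unfamiliar with design weights, a generic version for a two-stage design is the inverse of the overall selection probability. The sketch below assumes that enumeration areas are drawn by simple random sampling within each stratum, which the description does not confirm; the function and parameter names are hypothetical, and the R script remains the reference for the weights actually distributed.

```python
def household_weight(n_eas_in_stratum, n_eas_sampled, n_hh_in_ea, n_hh_sampled=25):
    """Design weight = inverse of the overall selection probability.

    Assumption: EAs are selected with equal probability within a stratum,
    and households with equal probability within a selected EA.
    """
    ea_factor = n_eas_in_stratum / n_eas_sampled   # inverse first-stage probability
    hh_factor = n_hh_in_ea / n_hh_sampled          # inverse second-stage probability
    return ea_factor * hh_factor


# Example: 100 EAs in the stratum, 10 sampled; 250 households in the EA, 25 sampled.
w = household_weight(100, 10, 250)
```

Under these assumptions, each sampled household in the example stands for 100 households in the population.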
Dates of Data Collection
Data Collection Mode
The dataset is a synthetic dataset. Although the variables it contains are typically collected in sample surveys or population censuses, no questionnaire is available for this dataset. A "fake" questionnaire was, however, created for the sample dataset extracted from this dataset, to be used as training material.
The synthetic data generation process included a set of "validators": consistency checks against which synthetic observations were assessed and, when needed, rejected or replaced. Some post-processing was also applied to produce the distributed data files.
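The idea of a validator can be illustrated with a small Python sketch. The rules, field names (`age`, `n_children`), and thresholds below are hypothetical examples of consistency checks, not the rules used when generating this dataset.

```python
def check_individual(person):
    """Return a list of rule violations for one synthetic individual.

    The rules here are illustrative assumptions only.
    """
    problems = []
    if not 0 <= person["age"] <= 120:
        problems.append("age out of range")
    if person["age"] < 10 and person.get("n_children", 0) > 0:
        problems.append("young child reported as parent")
    return problems


def filter_valid(records, validator=check_individual):
    """Split records into accepted and rejected according to the validator."""
    accepted, rejected = [], []
    for rec in records:
        (rejected if validator(rec) else accepted).append(rec)
    return accepted, rejected


candidates = [
    {"age": 30, "n_children": 2},
    {"age": 5, "n_children": 1},   # fails the parent rule
    {"age": 130},                  # fails the age-range rule
]
accepted, rejected = filter_valid(candidates)
```

In a generation pipeline, a rejected observation would typically be replaced by drawing a new candidate from the model and checking it again; that step is not shown here.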
The dataset was generated as a fully-synthetic dataset. The model used to create the synthetic observations includes multiple procedures to avoid overfitting and data-copying. Also, the data used for training the model went through processes of sampling and recoding that make it impossible to link a synthetic observation to an actual observation. The dataset is thus safe for dissemination. It can be used with no restriction and is shared as open data.
World Bank, Microdata Library
Location of Data Collection
World Bank Microdata Library
Disclaimer and copyrights
The data are to be used for training or simulation purposes only. They are not intended to be representative of any particular country and should not be used for inference purposes.