Synthetic Data for an Imaginary Country, Sample, 2023
A synthetic hierarchical dataset for simulation and training purposes

World, 2023

Reference ID

WLD_2023_SYNTH-SVY-EN_v01_M

Producer(s)

Development Data Group, Data Analytics Unit

Created on

Jul 07, 2023

Last modified

Jul 07, 2023

Page views

16103

Downloads

3867

Identification

Survey ID number

WLD_2023_SYNTH-SVY-EN_v01_M

Title

Synthetic Data for an Imaginary Country, Sample, 2023

Subtitle

A synthetic hierarchical dataset for simulation and training purposes

Country/Economy

Name	Country code
World	WLD

Study type

Synthetic data

Series Information

This dataset is part of a collection of fully synthetic data generated, for training and simulation purposes, for an imaginary middle-income country. The dataset is available in English and French. A full population dataset (~10 million individuals) is also available in English and French as a "synthetic census dataset".

Other identifiers

Type	Identifier
DOI	https://doi.org/10.48529/MC1F-QH23

Abstract

The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, asset ownership). The data include only ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for the purpose of training and simulation and is not intended to be representative of any specific country.

The full-population dataset (with about 10 million individuals) is also distributed as open data.

Kind of Data

Sample survey data [ssd]

Unit of Analysis

Household, Individual

Version

Version Description

V. 2023-05-01 8K HH EN

Version Date

2023-05-01T04:00:00.000Z

Version Responsibility Statement

World Bank, Development Data Group

Version Notes

Dataset generated using REaLTabFormer (sample of 8,000 households), with post-processing. English version.

Scope

Keywords

synthetic data; open data; safe data; demographics; education; mortality; fertility; child malnutrition; labor, employment; housing; dwelling; water and sanitation; household expenditure; migration

Coverage

Geographic Coverage

The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.

Geographic Unit

province (admin1), district (admin2)

Universe

The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.

Producers and sponsors

Primary investigators

Name	Affiliation
Development Data Group, Data Analytics Unit	World Bank

Funding Agency/Sponsor

Name	Abbreviation	Grant number	Role
UNHCR-World Bank Joint Data Center on Forced Displacement	JDC	KP-P174174-TF0B5124	Sponsored research work for the development of synthetic data for the purpose of assessing statistical disclosure risk measures.

Sampling

Sample frame

Sample frame name

Synthetic Data for an Imaginary Country, Full Population, 2023 (WLD_2023_SYNTH-CENS-EN_v01_M)

Custodian

World Bank

Unit Type

Household

Is Primary

true

Sampling Procedure

The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In a first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
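As a rough illustration of the two-stage design described above, the Python sketch below allocates 320 enumeration areas (8,000 / 25) across strata in proportion to stratum size, then draws 25 households within each selected area. The stratum names and sizes, the frame layout, the largest-remainder rounding, and the equal-probability selection of enumeration areas are all assumptions made for the example; the R script distributed with the dataset remains the authoritative implementation.

```python
import random

HH_PER_EA = 25                      # households drawn per enumeration area (from the text)
SAMPLE_SIZE = 8000                  # total sample size in households (from the text)
N_EAS = SAMPLE_SIZE // HH_PER_EA    # 320 enumeration areas overall


def allocate_eas(stratum_sizes, n_eas):
    """Split n_eas across strata proportionally to size (largest-remainder rounding)."""
    total = sum(stratum_sizes.values())
    quotas = {s: n_eas * size / total for s, size in stratum_sizes.items()}
    alloc = {s: int(q) for s, q in quotas.items()}
    leftover = n_eas - sum(alloc.values())
    # Give the remaining areas to the strata with the largest fractional parts.
    by_fraction = sorted(quotas, key=lambda s: quotas[s] - alloc[s], reverse=True)
    for s in by_fraction[:leftover]:
        alloc[s] += 1
    return alloc


def draw_sample(frame, alloc, rng):
    """frame maps stratum -> {ea_id: [household ids]}; returns (stratum, ea, hid) tuples."""
    sample = []
    for stratum, n_selected in alloc.items():
        # Stage 1: select enumeration areas (equal probability is an assumption here).
        for ea in rng.sample(sorted(frame[stratum]), n_selected):
            # Stage 2: select a fixed number of households within the area.
            for hid in rng.sample(frame[stratum][ea], HH_PER_EA):
                sample.append((stratum, ea, hid))
    return sample


# Hypothetical stratum sizes (households per geo_1 x urban/rural cell).
stratum_sizes = {
    "prov1_urban": 500_000,
    "prov1_rural": 300_000,
    "prov2_urban": 150_000,
    "prov2_rural": 50_000,
}
alloc = allocate_eas(stratum_sizes, N_EAS)
```

Passing a seeded `random.Random` instance to `draw_sample` keeps the draw reproducible, which is convenient when the sample is used as training material.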

Response Rate

This is a synthetic dataset; the "response rate" is 100%.

Weighting

Sample weights that take the stratification into account were calculated. See the R script provided as an external resource.
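One common way to build stratification-aware weights is to use the inverse of the sampling fraction in each stratum. The sketch below shows that approach with invented stratum names and counts; it is an assumption about the general technique, not a description of the distributed R script, which may use a different formula.

```python
def stratum_weights(frame_counts, sample_counts):
    """Expansion weight per stratum: households in the frame / households sampled."""
    weights = {}
    for stratum, n_sampled in sample_counts.items():
        if n_sampled <= 0:
            raise ValueError(f"no sampled households in stratum {stratum!r}")
        weights[stratum] = frame_counts[stratum] / n_sampled
    return weights


# Hypothetical household counts per stratum (frame vs. sample).
frame_counts = {"prov1_urban": 500_000, "prov1_rural": 300_000}
sample_counts = {"prov1_urban": 4_000, "prov1_rural": 2_400}

weights = stratum_weights(frame_counts, sample_counts)

# By construction, the weighted sample counts add up to the frame total.
estimated_total = sum(weights[s] * sample_counts[s] for s in sample_counts)
```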

Survey instrument

Questionnaires

The dataset is a synthetic dataset. Although the variables it contains are typically collected from sample surveys or population censuses, no actual questionnaire exists for this dataset. A "fake" questionnaire was, however, created for the sample dataset (this dataset, extracted from the full-population dataset), to be used as training material.

Methodology notes

The dataset was generated using REaLTabFormer, a four-level hierarchical generative model. The first-level model is the household composition generator, which generates variables that define each household's composition (household size and basic demographic profile of members, including age and relationship to the head of household). The second-level model is the household-level variables generator, which generates the variables whose values are common to all household members (such as dwelling characteristics) based on the household composition. The third-level model is the household-head generator, which generates observations for the head of each household based on the output of the previous two models. The fourth-level model is the household member generator, which generates data on the household members, excluding the head, for households of size two and above. The household member generator uses the data generated by the household composition, household-level variables, and household-head generators. This hierarchical model provides relational dependencies within a household that would not be guaranteed if all records were generated independently.

To implement the different models, we adopted a transformer architecture. The household composition generator is a decoder model that generates data from normally distributed noise. The other three models use a sequence-to-sequence model inspired by the application of deep learning to language translation.
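The chaining of the four generators can be pictured with the toy Python sketch below. It uses simple random rules as stand-ins for the transformer models, so it only illustrates how each level is conditioned on the output of the previous ones; every variable name other than "hid", and every probability, is invented for the example.

```python
import random


def gen_composition(rng):
    """Level 1: household size plus age and relationship of each member."""
    size = rng.randint(1, 6)
    members = [{"relationship": "head", "age": rng.randint(18, 80)}]
    for _ in range(size - 1):
        members.append({
            "relationship": rng.choice(["spouse", "child", "other"]),
            "age": rng.randint(0, 90),
        })
    return {"size": size, "members": members}


def gen_household_vars(composition, rng):
    """Level 2: variables shared by all members, conditioned on the composition."""
    rooms = max(1, composition["size"] // 2 + rng.randint(0, 2))
    return {"rooms": rooms, "urban": rng.random() < 0.5}


def gen_head(composition, hh_vars, rng):
    """Level 3: record for the head, conditioned on levels 1 and 2."""
    head = dict(composition["members"][0])
    p_employed = 0.75 if hh_vars["urban"] else 0.65
    head["employed"] = head["age"] < 65 and rng.random() < p_employed
    return head


def gen_members(composition, hh_vars, head, rng):
    """Level 4: remaining members, conditioned on levels 1 to 3 (empty for size 1)."""
    p_school = 0.9 if hh_vars["urban"] or head["employed"] else 0.8
    others = []
    for member in composition["members"][1:]:
        record = dict(member)
        record["in_school"] = 6 <= record["age"] <= 17 and rng.random() < p_school
        others.append(record)
    return others


def gen_household(hid, rng):
    """Run the four levels in order and return one household row plus person rows."""
    comp = gen_composition(rng)
    hh_vars = gen_household_vars(comp, rng)
    head = gen_head(comp, hh_vars, rng)
    others = gen_members(comp, hh_vars, head, rng)
    household_row = {"hid": hid, "size": comp["size"], **hh_vars}
    person_rows = [{"hid": hid, **person} for person in [head] + others]
    return household_row, person_rows
```

Because the person rows are derived from the composition produced at the first level, the number of individuals always matches the household size, which is the kind of within-household consistency that independent record generation would not guarantee.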

More detailed information is available in the Technical Documentation provided as an external PDF document.

Data collection

Dates of Data Collection

Start	End
2023	2023

Time periods

Start date	End date
2023	2023

Mode of data collection

other

Data processing

Data Processing

Type	Description
synthetic data generation	The synthetic data generation process is described in detail in the technical document "Generating a relational synthetic dataset for an imaginary country - Technical documentation", provided as an external resource.

Data Editing

The synthetic data generation process included a set of "validators" (consistency checks against which synthetic observations were assessed and, when needed, rejected and replaced). Some post-processing was also applied to produce the distributed data files.
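The reject-and-replace idea behind such validators can be sketched as follows. The specific checks (exactly one head, a minimum head age of 15, children younger than the head) and the toy generator are hypothetical; the actual validators are described in the technical documentation, not in this record.

```python
import random


def valid_household(persons):
    """Hypothetical consistency checks on one synthetic household."""
    heads = [p for p in persons if p["relationship"] == "head"]
    if len(heads) != 1:
        return False
    head = heads[0]
    if head["age"] < 15:
        return False
    # Each child of the head must be younger than the head.
    children = [p for p in persons if p["relationship"] == "child"]
    return all(child["age"] < head["age"] for child in children)


def generate_valid(generator, validator, rng, max_tries=100):
    """Draw candidates until one passes validation (reject and replace)."""
    for _ in range(max_tries):
        candidate = generator(rng)
        if validator(candidate):
            return candidate
    raise RuntimeError("no valid candidate produced within max_tries")


def toy_generator(rng):
    """Stand-in generator that sometimes produces inconsistent households."""
    head = {"relationship": "head", "age": rng.randint(10, 80)}
    child = {"relationship": "child", "age": rng.randint(0, 40)}
    return [head, child]
```

Capping the number of attempts with `max_tries` avoids an endless loop if a generator can never satisfy the checks.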

Quality standards

Other quality statement

The synthetic dataset is intended to provide a realistic representation of a middle-income country. A set of summary indicators and tables was produced to check that the data are realistic.

Access policy

Location of Data Collection

World Bank Microdata Library

URL for Location of Data Collection

https://microdata.worldbank.org/index.php/catalog/study/WLD_2023_SYNTH-SVY-EN_v01_M

Number of Files

2 (one at the household level, one at the individual level). The two data files can be merged using the variable "hid" as the merge key.
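A minimal sketch of the many-to-one merge on "hid", using plain Python dictionaries, is shown below. Only the key name "hid" comes from this record; the other variable names and the example rows are made up for illustration.

```python
def merge_on_hid(households, individuals):
    """Attach household-level variables to each individual record (many-to-one on hid)."""
    by_hid = {}
    for hh in households:
        if hh["hid"] in by_hid:
            raise ValueError(f"duplicate hid {hh['hid']!r} in household file")
        by_hid[hh["hid"]] = hh
    merged = []
    for person in individuals:
        hh = by_hid.get(person["hid"])
        if hh is None:
            raise KeyError(f"individual refers to unknown hid {person['hid']!r}")
        # Individual-level values win if a variable name appears in both files.
        merged.append({**hh, **person})
    return merged


# Hypothetical rows standing in for the two data files.
households = [{"hid": 1, "rooms": 3}, {"hid": 2, "rooms": 1}]
individuals = [
    {"hid": 1, "age": 40},
    {"hid": 1, "age": 12},
    {"hid": 2, "age": 67},
]
merged = merge_on_hid(households, individuals)
```

With the real files, rows could be read with `csv.DictReader` (or an equivalent reader for the distributed format) before being passed to the function.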

Notes

Data available as open data (CC BY 4.0 license)

Data Access

Access authority

Name
World Bank, Microdata Library

Restrictions

The dataset was generated as a fully-synthetic dataset. The model used to create the synthetic observations includes multiple procedures to avoid overfitting and data-copying. Also, the data used for training the model went through processes of sampling and recoding that make it impossible to link a synthetic observation to an actual observation. The dataset is thus safe for dissemination. It can be used with no restriction and is shared as open data.

Disclaimer and copyrights

Disclaimer

The data are to be used for training or simulation purposes only. They are not intended to be representative of any particular country and should not be used for statistical inference.

Metadata production

Producers

Name	Affiliation
OD	World Bank

Date of Metadata Production

2023-05-01T04:00:00.000Z

Metadata version

DDI Document version

1.0 EN

Version date

2023-05-01T04:00:00.000Z
