The World Bank Working for a World Free of Poverty Microdata Library

Synthetic Data for an Imaginary Country, Sample, 2023
A synthetic hierarchical dataset for simulation and training purposes

World, 2023
Reference ID
WLD_2023_SYNTH-SVY-EN_v01_M
Producer(s)
Development Data Group, Data Analytics Unit
Metadata
DDI/XML JSON
Created on
Jul 07, 2023
Last modified
Jul 07, 2023
Page views
8700
Downloads
1226
    Identification

    Survey ID number

    WLD_2023_SYNTH-SVY-EN_v01_M

    Title

    Synthetic Data for an Imaginary Country, Sample, 2023

    Subtitle

    A synthetic hierarchical dataset for simulation and training purposes

    Country/Economy
    Name: World
    Country code: WLD
    Study type

    Synthetic data

    Series Information

    This dataset is part of a collection of fully synthetic data generated, for training and simulation purposes, for an imaginary middle-income country. The dataset is available in English and French. A full population dataset (~10 million individuals) is also available in English and French as a "synthetic census dataset".

    Other identifiers
    DOI: https://doi.org/10.48529/MC1F-QH23
    Abstract
    The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, asset ownership). The data include only ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. It was created for training and simulation purposes and is not intended to be representative of any specific country.

    The full-population dataset (with about 10 million individuals) is also distributed as open data.
    Kind of Data

    ssd

    Unit of Analysis

    Household, Individual

    Version

    Version Description

    V. 2023-05-01 8K HH EN

    Version Date

    2023-05-01T04:00:00.000Z

    Version Responsibility Statement

    World Bank, Development Data Group

    Version Notes

    Dataset generated using REaLTabFormer (sample of 8,000 households), with post-processing. English version.

    Scope

    Keywords
    synthetic data; open data; safe data; demographics; education; mortality; fertility; child malnutrition; labor, employment; housing; dwelling; water and sanitation; household expenditure; migration

    Coverage

    Geographic Coverage

    The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.

    Geographic Unit

    province (admin1), district (admin2)

    Universe

    The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.

    Producers and sponsors

    Primary investigators
    Name: Development Data Group, Data Analytics Unit
    Affiliation: World Bank

    Funding Agency/Sponsor
    Name: UNHCR-World Bank Joint Data Center on Forced Displacement
    Abbreviation: JDC
    Grant number: KP-P174174-TF0B5124
    Role: Sponsored research work for the development of synthetic data for the purpose of assessing statistical disclosure risk measures.

    Sampling

    Sample frame

    Sample frame name

    Synthetic Data for an Imaginary Country, Full Population, 2023 (WLD_2023_SYNTH-CENS-EN_v01_M)

    Custodian

    World Bank

    Unit Type

    Household

    Is Primary

    true

    Sampling Procedure

    The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In a first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
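
    The two-stage design described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the R script distributed with the dataset. With 8,000 households and 25 households per enumeration area, 320 enumeration areas are needed in total. The largest-remainder rounding, the simple random selection of enumeration areas within a stratum, the function names, and the toy frame are all assumptions made for the example.

```python
import random

HOUSEHOLDS_PER_EA = 25
SAMPLE_SIZE = 8000
N_EAS = SAMPLE_SIZE // HOUSEHOLDS_PER_EA  # 320 enumeration areas in total


def allocate_eas(stratum_sizes, n_eas):
    """Split n_eas across strata in proportion to stratum size.

    Rounding uses the largest-remainder rule (an assumption of this sketch).
    """
    total = sum(stratum_sizes.values())
    quotas = {s: n_eas * size / total for s, size in stratum_sizes.items()}
    alloc = {s: int(q) for s, q in quotas.items()}
    leftover = n_eas - sum(alloc.values())
    by_remainder = sorted(quotas, key=lambda s: quotas[s] - alloc[s], reverse=True)
    for s in by_remainder[:leftover]:
        alloc[s] += 1
    return alloc


def draw_sample(frame, alloc, per_ea, rng):
    """frame maps stratum -> {ea_id: [household ids]}.

    Returns a list of (stratum, ea_id, household_id) tuples.
    Assumes every enumeration area holds at least per_ea households.
    """
    sample = []
    for stratum, eas in frame.items():
        # Stage 1: pick enumeration areas at random within the stratum.
        for ea in rng.sample(sorted(eas), alloc[stratum]):
            # Stage 2: pick a fixed number of households within each area.
            for hid in rng.sample(eas[ea], per_ea):
                sample.append((stratum, ea, hid))
    return sample


# Toy frame: 2 strata, 10 enumeration areas each, 30 households per area
# (made-up numbers, far smaller than the real frame).
frame = {
    stratum: {
        f"{stratum}-ea{e}": [f"{stratum}-ea{e}-hh{h}" for h in range(30)]
        for e in range(10)
    }
    for stratum in ("urban", "rural")
}
sizes = {s: sum(len(hhs) for hhs in eas.values()) for s, eas in frame.items()}
alloc = allocate_eas(sizes, n_eas=4)
sample = draw_sample(frame, alloc, HOUSEHOLDS_PER_EA, random.Random(0))
```

    In the real design the strata are the combinations of geo_1 and urban/rural, and the frame is the full-population synthetic census.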

    Response Rate

    This is a synthetic dataset; the "response rate" is 100%.

    Weighting

    Sample weights were calculated that take the stratification into account. See the R script provided as an external resource.
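
    One common way to build weights that reflect stratification is to divide the number of households in each stratum of the frame by the number of households sampled from that stratum. The sketch below shows that calculation. It is an assumption about the general approach, not a description of the R script, and the counts and names are made up.

```python
from collections import Counter


def stratum_weights(population_households, sampled_strata):
    """Weight per stratum = households in the frame / households in the sample."""
    n_sampled = Counter(sampled_strata)
    return {s: population_households[s] / n for s, n in n_sampled.items()}


# Made-up counts for two strata.
population_households = {"urban": 1000, "rural": 3000}
# Stratum label of each sampled household.
sampled_strata = ["urban"] * 50 + ["rural"] * 100

weights = stratum_weights(population_households, sampled_strata)
# Summing the weights over the sample recovers the frame total.
estimated_total = sum(weights[s] for s in sampled_strata)
```

    A useful sanity check on any set of household weights is that they sum to the number of households in the frame, as `estimated_total` does here.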

    Survey instrument

    Questionnaires

    The dataset is a synthetic dataset. Although the variables it contains are variables typically collected from sample surveys or population censuses, no questionnaire is available for this dataset. A "fake" questionnaire was however created for the sample dataset extracted from this dataset, to be used as training material.

    Methodology notes

    The dataset was generated using REaLTabFormer, a four-level hierarchical generative model. The first-level model is the household composition generator, which generates variables that define each household's composition (household size and basic demographic profile of members, including age and relationship to the head of household). The second-level model is the household-level variables generator, which generates the variables whose values are common to all household members (such as dwelling characteristics) based on the household composition. The third-level model is the household-head generator, which generates observations for the head of the households based on the output of the previous two models. The fourth-level model is the household member generator, which generates data on the household members, excluding the head, for households of size two and above. The household member generator model uses the data generated by the household composition, household-level variables, and household head generator models. This hierarchical model provides relational dependencies within a household that would not be guaranteed if all records were generated independently.

    To implement the different models, we adopted a transformer architecture. The household composition generator is a decoder model that generates data from normally distributed noise. The other three models use a sequence-to-sequence model inspired by the application of deep learning to language translation.

    More detailed information is available in the Technical Documentation provided as an external PDF document.
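
    The order in which the four models feed each other can be illustrated with a toy pipeline. The sketch below replaces each transformer model with a few random draws, so it shows only the dependency structure, not REaLTabFormer itself. All function names, variables ("rooms", "employed"), and value ranges are hypothetical.

```python
import random


def gen_composition(rng):
    """Level 1: household size plus age and relationship of each member."""
    size = rng.randint(1, 5)
    profile = [{"relation": "head", "age": rng.randint(18, 80)}]
    for _ in range(size - 1):
        profile.append({"relation": "other", "age": rng.randint(0, 90)})
    return profile


def gen_household_vars(profile, rng):
    """Level 2: variables shared by all members, conditioned on composition."""
    return {"hhsize": len(profile), "rooms": len(profile) // 2 + rng.randint(1, 2)}


def gen_head(profile, hh_vars, rng):
    """Level 3: the head's record, conditioned on levels 1 and 2."""
    return {**profile[0], "employed": rng.random() < 0.7}


def gen_members(profile, hh_vars, head, rng):
    """Level 4: the other members, conditioned on levels 1 to 3."""
    return [
        {**p, "employed": p["age"] >= 15 and rng.random() < 0.5}
        for p in profile[1:]
    ]


def generate_household(rng):
    profile = gen_composition(rng)
    hh_vars = gen_household_vars(profile, rng)
    head = gen_head(profile, hh_vars, rng)
    members = gen_members(profile, hh_vars, head, rng)
    return {"household": hh_vars, "individuals": [head] + members}


household = generate_household(random.Random(0))
```

    Because every level receives the output of the levels before it, the number of individual records always matches the household size, which independent record-by-record generation would not guarantee.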

    Data collection

    Dates of Data Collection
    Start End
    2023 2023
    Time periods
    Start date End date
    2023 2023
    Mode of data collection
    • other

    Data processing

    Data Processing
    Type: synthetic data generation
    Description: The synthetic data generation process is described in detail in the technical document "Generating a relational synthetic dataset for an imaginary country - Technical documentation", provided as an external resource.
    Data Editing

    The synthetic data generation process included a set of "validators": consistency checks against which each synthetic observation was assessed and, when needed, rejected and replaced. Some post-processing was also applied to produce the distributed data files.
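
    A reject-and-replace loop of this kind can be sketched as follows. The two checks, the age threshold, the variable names, and the toy generator are hypothetical; the actual validators are described in the technical documentation.

```python
import random


def head_is_adult(hh):
    # Hypothetical rule: the first member is the head and must be 15 or older.
    return hh["members"][0]["age"] >= 15


def size_matches(hh):
    # Hypothetical rule: the recorded size equals the number of member records.
    return hh["hhsize"] == len(hh["members"])


VALIDATORS = [head_is_adult, size_matches]


def generate_valid(generate, validators, max_tries=100):
    """Draw candidates until one passes every validator."""
    for _ in range(max_tries):
        candidate = generate()
        if all(check(candidate) for check in validators):
            return candidate
    raise RuntimeError("no valid observation within max_tries")


rng = random.Random(0)


def toy_generator():
    """Stand-in for the generative model: random ages, random household size."""
    members = [{"age": rng.randint(5, 80)} for _ in range(rng.randint(1, 5))]
    return {"hhsize": len(members), "members": members}


household = generate_valid(toy_generator, VALIDATORS)
```

    Capping the number of attempts keeps a validator that can never be satisfied from looping forever.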

    Quality standards

    Other quality statement

    The synthetic dataset is intended to provide a realistic representation of a middle-income country. A set of summary indicators and tables was produced to check that the data are realistic.

    Access policy

    Location of Data Collection

    World Bank Microdata Library

    URL for Location of Data Collection

    https://microdata.worldbank.org/index.php/catalog/study/WLD_2023_SYNTH-SVY-EN_v01_M

    Number of Files

    2 (one at the household level, one at the individual level). The two data files can be merged using variable "hid" as merging key.
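
    As a minimal illustration of the merge, the sketch below attaches household-level variables to each individual record using "hid" as the key. Only "hid" comes from this page; the other variable names and the values are made up, and plain Python dictionaries stand in for the two data files.

```python
# Toy stand-ins for the two files; only "hid" is a real variable name.
households = [
    {"hid": 1, "urban": 1},
    {"hid": 2, "urban": 0},
]
individuals = [
    {"hid": 1, "age": 40},
    {"hid": 1, "age": 12},
    {"hid": 2, "age": 67},
]


def merge_on_hid(individuals, households):
    """Many-to-one join: copy each household's variables onto its members."""
    by_hid = {h["hid"]: h for h in households}
    return [{**by_hid[p["hid"]], **p} for p in individuals]


merged = merge_on_hid(individuals, households)
```

    The result has one row per individual. With pandas, the equivalent would be `DataFrame.merge` with `on="hid"`, where `validate="many_to_one"` can confirm that "hid" is unique in the household file.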

    Notes

    Data available as open data (CC BY 4.0 license)

    Data Access

    Access authority
    Name
    World Bank, Microdata Library
    Restrictions

    The dataset was generated as a fully-synthetic dataset. The model used to create the synthetic observations includes multiple procedures to avoid overfitting and data-copying. Also, the data used for training the model went through processes of sampling and recoding that make it impossible to link a synthetic observation to an actual observation. The dataset is thus safe for dissemination. It can be used with no restriction and is shared as open data.

    Disclaimer and copyrights

    Disclaimer

    The data are to be used for training or simulation purposes only. They are not intended to be representative of any particular country and should not be used for inference.

    Metadata production

    Producers
    Name: OD
    Affiliation: World Bank
    Date of Metadata Production

    2023-05-01T04:00:00.000Z

    Metadata version

    DDI Document version

    1.0 EN

    Version date

    2023-05-01T04:00:00.000Z


    © The World Bank Group, All Rights Reserved.
