The World Bank Working for a World Free of Poverty Microdata Library

Synthetic Data for an Imaginary Country, Full Population, 2023
A synthetic hierarchical dataset for simulation and training purposes

World, 2023
Reference ID
WLD_2023_SYNTH-CEN-EN_v01_M
Producer(s)
Development Data Group, Data Analytics Unit
Metadata
DDI/XML JSON
Created on
Jul 03, 2023
Last modified
Jul 03, 2023
Page views
8201
Downloads
537
    Identification

    Survey ID number

    WLD_2023_SYNTH-CEN-EN_v01_M

    Title

    Synthetic Data for an Imaginary Country, Full Population, 2023

    Subtitle

    A synthetic hierarchical dataset for simulation and training purposes

    Country/Economy
    Name | Country code
    World | WLD
    Study type

    Synthetic data

    Series Information

    This dataset is part of a collection of fully synthetic data generated, for training and simulation purposes, for an imaginary middle-income country. The dataset is available in English and French. A subset of 8,000 households is also available in English and French as a "synthetic survey dataset".

    Other identifiers
    Type | Identifier
    DOI | https://doi.org/10.48529/78M1-AE09
    Abstract
    The dataset is a relational dataset of 10,003,891 individuals (2,501,755 households), representing the entire population of an imaginary middle-income country. It contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, asset ownership). The data include only ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. It was created for training and simulation purposes and is not intended to be representative of any specific country.

    A sample dataset of 8,000 households was drawn from this full-population dataset and is also distributed as open data.
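
    Drawing a household sample from a two-file relational dataset means selecting households first and then keeping every member of the selected households. The sketch below illustrates that step on toy data. It assumes simple random sampling, which this record does not state; the actual sample design is not described here. The names `hhsize` and `member` are hypothetical; only `hid` comes from this dataset.

```python
import random

def sample_households(households, individuals, n, seed=0):
    """Draw n households at random and keep all of their members.

    households:  list of dicts, each with a unique "hid"
    individuals: list of dicts, each carrying the "hid" of its household
    """
    rng = random.Random(seed)
    sampled = rng.sample(households, n)       # sampling without replacement
    keep = {h["hid"] for h in sampled}        # identifiers of selected households
    members = [p for p in individuals if p["hid"] in keep]
    return sampled, members

# Toy population: 100 households with 1 to 5 members each (invented values).
households = [{"hid": i, "hhsize": 1 + i % 5} for i in range(100)]
individuals = [{"hid": h["hid"], "member": m}
               for h in households for m in range(h["hhsize"])]

hh_sample, ind_sample = sample_households(households, individuals, n=8)
```

    Sampling at the household level, not the individual level, keeps each sampled household complete, so within-household relationships survive in the extract.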
    Kind of Data

    cen

    Unit of Analysis

    Household, individual

    Version

    Version Description

    V. 2023-05-01 10M PP EN

    Version Date

    2023-05-01

    Version Responsibility Statement

    World Bank, Development Data Group

    Version Notes

    Dataset generated using REaLTabFormer (with 10,003,891 individuals and 2,501,755 households), with post-processing. English version.

    Scope

    Notes

    The dataset is a synthetic dataset generated using a model trained on data from IPUMS International, the Demographic and Health Surveys (DHS) Program, and the World Bank Global Consumption Database (microdata). The detailed list of datasets used to create the training datasets is available in the Technical Documentation (provided as an external resource).

    Keywords
    synthetic data; open data; safe data; demographics; education; mortality; fertility; child malnutrition; labor, employment; housing; dwelling; water and sanitation; household expenditure; migration

    Coverage

    Geographic Coverage

    The dataset is a synthetic dataset for an imaginary country. It was created to represent the full national population of this country, by province and district (equivalent to admin1 and admin2 levels) and by urban/rural areas of residence.

    Geographic Unit

    province (admin1), district (admin2)

    Universe

    The dataset is a fully synthetic dataset representative of the resident population of ordinary households of an imaginary middle-income country.

    Producers and sponsors

    Primary investigators
    Name | Affiliation
    Development Data Group, Data Analytics Unit | World Bank
    Funding Agency/Sponsor
    Name | Abbreviation | Grant number | Role
    UNHCR-World Bank Joint Data Center on Forced Displacement | JDC | KP-P174174-TF0B5124 | Sponsored research work for the development of synthetic data for the purpose of assessing statistical disclosure risk measures.

    Sampling

    Weighting

    The dataset is the equivalent of a census dataset. No weighting applies.

    Survey instrument

    Questionnaires

    The dataset is a synthetic dataset. Although the variables it contains are typically collected in sample surveys or population censuses, no questionnaire is available for this dataset. A "fake" questionnaire was, however, created for the sample dataset extracted from this dataset, to be used as training material.

    Methodology notes

    The dataset was generated using REaLTabFormer, a four-level hierarchical generative model. The first-level model is the household composition generator, which generates variables that define each household's composition (household size and basic demographic profile of members, including age and relationship to the head of household). The second-level model is the household-level variables generator, which generates the variables whose values are common to all household members (such as dwelling characteristics) based on the household composition. The third-level model is the household-head generator, which generates observations for the head of each household based on the output of the previous two models. The fourth-level model is the household member generator, which generates data on the household members, excluding the head, for households of size two and above. The household member generator uses the data generated by the household composition, household-level variables, and household-head generators. This hierarchical design preserves relational dependencies within a household that would not be guaranteed if all records were generated independently.

    To implement the different models, we adopted a transformer architecture. The household composition generator is a decoder model that generates data from normally distributed noise. The other three models use a sequence-to-sequence model inspired by the application of deep learning to language translation.

    More detailed information is available in the Technical Documentation provided as an external PDF document.
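
    The data flow between the four levels described above can be sketched as follows. This is a toy illustration only: each generator is a random stand-in, not a transformer, and none of the function names, variables (`rooms`, `urban`, `employed`), or probabilities come from REaLTabFormer or the Technical Documentation. Only the ordering and conditioning of the four stages, and the key `hid`, follow the description above.

```python
import random

rng = random.Random(42)

def gen_composition():
    """Level 1: household size plus age and relationship of each member."""
    size = rng.randint(1, 6)
    members = [{"relation": "head", "age": rng.randint(18, 80)}]
    members += [{"relation": "other", "age": rng.randint(0, 80)}
                for _ in range(size - 1)]
    return {"size": size, "members": members}

def gen_household_vars(comp):
    """Level 2: variables shared by all members, conditioned on level 1."""
    return {"rooms": max(1, comp["size"] // 2 + rng.randint(0, 2)),
            "urban": rng.random() < 0.5}

def gen_head(comp, hh):
    """Level 3: the head's record, conditioned on levels 1 and 2."""
    head = dict(comp["members"][0])
    head["employed"] = rng.random() < (0.7 if hh["urban"] else 0.6)
    return head

def gen_members(comp, hh, head):
    """Level 4: the other members, conditioned on levels 1 to 3."""
    others = []
    for m in comp["members"][1:]:      # empty for one-person households
        rec = dict(m)
        p_work = 0.6 if head["employed"] else 0.4
        if hh["urban"]:
            p_work += 0.1
        rec["employed"] = rec["age"] >= 15 and rng.random() < p_work
        others.append(rec)
    return others

def generate_household(hid):
    """Run the four levels in order and return one household row and its person rows."""
    comp = gen_composition()
    hh = gen_household_vars(comp)
    head = gen_head(comp, hh)
    others = gen_members(comp, hh, head)
    household_row = {"hid": hid, "size": comp["size"], **hh}
    person_rows = [{"hid": hid, **p} for p in [head] + others]
    return household_row, person_rows

household_row, person_rows = generate_household(hid=1)
```

    Because every person row is produced from the same composition record, the number of person rows always equals the household size and each household has exactly one head, which is the kind of within-household consistency that independent record generation would not guarantee.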

    Data collection

    Dates of Data Collection
    Start | End
    2023 | 2023
    Time Method

    cross section

    Time periods
    Start date | End date
    2023 | 2023
    Mode of data collection
    • other

    Data processing

    Data Processing
    Type | Description
    synthetic data generation | The synthetic data generation process is described in detail in the technical document "Generating a relational synthetic dataset for an imaginary country - Technical documentation", provided as an external resource.
    Data Editing

    The synthetic data generation process included a set of "validators": consistency checks against which synthetic observations were assessed and, when needed, rejected and replaced. Some post-processing was also applied to produce the distributed data files.
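
    A reject-and-replace loop of the kind described above might look like the following minimal sketch. The two checks, the field names, and the toy candidate generator are all hypothetical; the validators actually used are documented in the Technical Documentation, not here.

```python
import random

def head_is_adult(hh):
    """Hypothetical check: the household head is at least 15 years old."""
    return hh["head_age"] >= 15

def size_matches_members(hh):
    """Hypothetical check: declared size equals the head plus listed members."""
    return hh["size"] == len(hh["member_ages"]) + 1

VALIDATORS = [head_is_adult, size_matches_members]

def draw_candidate(rng):
    """Toy generator that sometimes produces an inconsistent household."""
    size = rng.randint(1, 5)
    # About one draw in ten lists one member too many on purpose.
    n_members = size - 1 if rng.random() < 0.9 else size
    return {"size": size,
            "head_age": rng.randint(10, 80),
            "member_ages": [rng.randint(0, 80) for _ in range(n_members)]}

def generate_valid(rng, max_tries=1000):
    """Redraw until a candidate passes every validator."""
    for _ in range(max_tries):
        candidate = draw_candidate(rng)
        if all(check(candidate) for check in VALIDATORS):
            return candidate
    raise RuntimeError("no valid candidate found")

rng = random.Random(0)
accepted = [generate_valid(rng) for _ in range(50)]
```

    Keeping the checks in a plain list makes it easy to add or remove a rule without touching the generation loop.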

    Quality standards

    Other quality statement

    The synthetic dataset is intended to provide a realistic representation of a middle-income country. A set of summary indicators and tables was produced to check that the data are realistic.

    Access policy

    Location of Data Collection

    World Bank Microdata Library

    URL for Location of Data Collection

    https://microdata.worldbank.org/index.php/catalog/study/WLD_2023_SYNTH-CEN-EN_v01_M

    Number of Files

    2 (one at the household level, one at the individual level). The two data files can be merged using variable "hid" as merging key.
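
    The merge on "hid" can be sketched with the Python standard library as below. The two CSV strings are tiny invented stand-ins for the distributed files, and the columns `urban` and `age` are hypothetical; only the key name `hid` comes from this record.

```python
import csv
import io

# Tiny in-memory stand-ins for the household and individual files.
household_csv = "hid,urban\n1,1\n2,0\n"
individual_csv = "hid,age\n1,34\n1,6\n2,51\n"

# Index household rows by "hid" so each lookup is a dictionary access.
households = {row["hid"]: row
              for row in csv.DictReader(io.StringIO(household_csv))}

merged = []
for person in csv.DictReader(io.StringIO(individual_csv)):
    hh = households[person["hid"]]    # KeyError would flag an orphan individual
    merged.append({**hh, **person})   # household variables, then person variables
```

    This is a many-to-one join: each individual matches exactly one household, so the merged table has one row per individual. With pandas installed, `pd.merge(individuals, households, on="hid", validate="many_to_one")` expresses the same join and raises an error if "hid" is not unique in the household file.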

    Notes

    Data available as open data (CC BY 4.0 license)

    Data Access

    Access authority
    Name
    World Bank, Microdata Library
    Restrictions

    The dataset was generated as a fully synthetic dataset. The model used to create the synthetic observations includes multiple procedures to avoid overfitting and data copying. In addition, the data used for training the model went through sampling and recoding processes that make it impossible to link a synthetic observation to an actual observation. The dataset is thus safe for dissemination. It can be used with no restriction and is shared as open data.

    Disclaimer and copyrights

    Disclaimer

    The data are to be used for training or simulation purposes only. They are not intended to be representative of any particular country and should not be used for inference purposes.

    Metadata production

    Producers
    Name | Affiliation
    OD | World Bank
    Date of Metadata Production

    2023-05-01

    Metadata version

    DDI Document version

    1.0 EN

    Version date

    2023-05-01

    Version notes

    English version


    © The World Bank Group, All Rights Reserved.
