{"doc_desc":{"producers":[{"name":"OD","affiliation":"World Bank"}],"version_statement":{"version_date":"2023-05-01","version":"1.0 EN","version_notes":"English version"},"prod_date":"2023-05-01","title":"DDI Codebook documentation for the dataset \"Synthetic Full Population Dataset 2023\""},"study_desc":{"title_statement":{"idno":"WLD_2023_SYNTH-CEN-EN_v01_M","title":"Synthetic Data for an Imaginary Country, Full Population, 2023","sub_title":"A synthetic hierarchical dataset for simulation and training purposes ","identifiers":[{"type":"DOI","identifier":"https:\/\/doi.org\/10.48529\/78M1-AE09"}]},"version_statement":{"version":"V. 2023-05-01 10M PP EN","version_notes":"Dataset generated using REaLTabFormer (with 10,003,891 individuals and 2,501,755 households), with post-processing. English version.","version_date":"2023-05-01","version_resp":"World Bank, Development Data Group"},"authoring_entity":[{"name":"Development Data Group, Data Analytics Unit","affiliation":"World Bank"}],"series_statement":{"series_name":"Synthetic data ","series_info":"This dataset is part of a collection of fully synthetic data generated, for training and simulation purposes, for an imaginary middle-income country. The dataset is available in English and French. A subset of 8,000 households is also available in English and French as a \"synthetic survey dataset\"."},"study_info":{"data_kind":"cen","analysis_unit":"household, individual","abstract":"The dataset is a relational dataset of 10,003,891 individuals (2,501,755 households), representing the entire population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other with variables at the individual level. 
It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, asset ownership). The data include only ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for the purpose of training and simulation and is not intended to be representative of any specific country.  \n\nA sample dataset of 8,000 households was extracted from this full-population dataset and is also distributed as open data.","keywords":[{"keyword":"synthetic data"},{"keyword":"open data"},{"keyword":"safe data"},{"keyword":"demographics"},{"keyword":"education"},{"keyword":"mortality"},{"keyword":"fertility"},{"keyword":"child malnutrition"},{"keyword":"labor, employment"},{"keyword":"housing"},{"keyword":"dwelling"},{"keyword":"water and sanitation"},{"keyword":"household expenditure"},{"keyword":"migration"}],"universe":"The dataset is a fully synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.","time_periods":[{"start":"2023","end":"2023"}],"nation":[{"name":"World","abbreviation":"WLD"}],"geog_coverage":"The dataset is a synthetic dataset for an imaginary country. It was created to represent the full national population of this country, by province and district (equivalent to admin1 and admin2 levels) and by urban\/rural areas of residence.","coll_dates":[{"start":"2023","end":"2023"}],"quality_statement":{"other_quality_statement":"The synthetic dataset is intended to provide a realistic representation of a middle-income country. A set of summary indicators\/tables was produced to check that the data are realistic. 
"},"geog_unit":"province (admin1), district (admin2)","notes":"The dataset is a synthetic dataset generated using a model that used data from IPUMS International, from the Demographic and Health Surveys (DHS) Program, and from the World Bank Global Consumption Database (microdata) as training data. The detailed list of datasets used to create the training datasets is available in the Technical Documentation (document provided as an external resource)."},"method":{"data_collection":{"weight":"The dataset is the equivalent of a census dataset. No weighting applies.","time_method":"cross section","coll_mode":["other"],"research_instrument":"The dataset is a synthetic dataset. Although the variables it contains are variables typically collected from sample surveys or population censuses, no questionnaire is available for this dataset. A \"fake\" questionnaire was, however, created for the sample dataset extracted from this dataset, to be used as training material.","cleaning_operations":"The synthetic data generation process included a set of \"validators\" (consistency checks, based on which synthetic observations were assessed and rejected\/replaced when needed). Some post-processing was also applied to produce the distributed data files. "},"data_processing":[{"type":"synthetic data generation","description":"The synthetic data generation process is described in detail in a technical document \"Generating a relational synthetic dataset for an imaginary country - Technical documentation\" provided as an external resource."}],"method_notes":"The dataset was generated using REaLTabFormer, a four-level hierarchical generative model. The first-level model is the household composition generator, which generates variables that define each household's composition (household size and basic demographic profile of members, including age and relationship to the head of household). 
The second-level model is the household-level variables generator, which generates the variables whose values are common to all household members (such as dwelling characteristics) based on the household composition. The third-level model is the household-head generator, which generates observations for the head of each household based on the output of the previous two models. The fourth-level model is the household member generator, which generates data on the household members, excluding the head, for households of size two and above. The household member generator model uses the data generated by the household composition, household-level variables, and household-head generator models. This hierarchical model provides relational dependencies within a household that would not be guaranteed if all records were generated independently.\n\nTo implement the different models, we adopted a transformer architecture. The household composition generator is a decoder model that generates data from normally distributed noise. The other three models use a sequence-to-sequence model inspired by the application of deep learning to language translation.\n\nMore detailed information is available in the Technical Documentation provided as an external PDF document."},"data_access":{"dataset_availability":{"access_place":"World Bank Microdata Library","file_quantity":"2 (one at the household level, one at the individual level). The two data files can be merged using the variable \"hid\" as the merge key.","access_place_url":"https:\/\/microdata.worldbank.org\/index.php\/catalog\/study\/WLD_2023_SYNTH-CEN-EN_v01_M","notes":"Data available as open data (CC BY 4.0 license)"},"dataset_use":{"restrictions":"The dataset was generated as a fully synthetic dataset. The model used to create the synthetic observations includes multiple procedures to avoid overfitting and data-copying. 
Also, the data used for training the model went through processes of sampling and recoding that make it impossible to link a synthetic observation to an actual observation. The dataset is thus safe for dissemination. It can be used without restriction and is shared as open data.  ","disclaimer":"The data are to be used for training or simulation purposes only. They are not intended to be representative of any particular country, and should not be used for inference purposes. ","contact":[{"name":"World Bank, Microdata Library"}]}},"production_statement":{"prod_date":"2023-04","prod_place":"World Bank, Washington, DC, USA","funding_agencies":[{"grant":"KP-P174174-TF0B5124","role":"Sponsored research work for the development of synthetic data for the purpose of assessing statistical disclosure risk measures. ","name":"UNHCR-World Bank Joint Data Center on Forced Displacement","abbr":"JDC"}]},"study_development":[]},"schematype":"survey"}