Microdata Catalog

Latest revision as of 14:09, 1 November 2022

The Microdata Library is an online platform that offers free access to microdata produced not only by the World Bank, but also by other international organizations, statistical agencies and different actors in developing countries. It includes datasets from surveys implemented as part of impact evaluations and research on development, as well as administrative data.

Read first

  • Datasets published in the Microdata Library are typically survey data. The Microdata Catalog Checklist lists data format, documentation requirements and instructions on how to deposit data
  • When submitting data, it is recommended to include as much information about the study and the data as possible. This reduces the number of future queries received from both catalog staff preparing the data and users trying to properly understand the survey process
  • We recommend submitting the data as soon as it is collected, so that all relevant information is documented and safely stored, making transitions between team members easier and reducing the risk of not remembering details when analysis is done
  • As part of the submission process, it is possible to choose from different access conditions under which the data will be shared. It is also possible to make changes to access terms, as well as to the data, after the initial submission

Guidelines for submission

Submission to the Microdata Library is done after the initial data cleaning for a round of data collection is finished. That means one impact evaluation may have different rounds published in the catalog, for example baseline, midline and endline. Datasets submitted to the Microdata Library must be de-identified and accompanied by data documentation and a study description. The Microdata Catalog Checklist lists data format and documentation requirements as well as instructions on how to deposit datasets.

World Bank staff can deposit their data directly to the online data deposit application. Data can also be deposited by external researchers if necessary. Deposits originating from outside the Bank should include the names and contact details of the World Bank staff the depositors are working with on their project. The approver of access to licensed data should be a World Bank staff member who can be easily contacted by the Microdata Library team. If the survey is owned by someone other than the World Bank, in addition to the deposit form, documentation is needed from the data provider, signed by an authorized signatory, explicitly authorizing the Microdata Library to disseminate the related survey data and specifying the access type.

Datasets

Data may be uploaded in different formats, including Stata, SPSS and SAS, and must be de-identified and minimally cleaned. The required data cleaning aims to provide a clear indication of what information is to be found in any given variable, so both variable and value labels must be present, including labels for extended missing values. To protect the confidentiality of respondents, all Personally Identifiable Information (PII) must be removed. Variables containing sensitive information such as PII can be flagged in the "Data Distribution" section to indicate they should not be distributed.
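
The labeling and PII requirements above can be checked before depositing. The following is a minimal sketch in Python, not part of the deposit application: the function names, the dictionary layout for labels and the list of PII variable names are all assumptions made for illustration.

```python
def find_label_problems(variable_labels, value_labels):
    """Return a sorted list of labeling problems.

    variable_labels: dict of variable name -> label text ("" or None if missing)
    value_labels:    dict of variable name -> dict of code -> label text
    Both layouts are hypothetical; adapt them to however labels are exported.
    """
    problems = []
    for name, label in variable_labels.items():
        if not label or not label.strip():
            problems.append(f"{name}: missing variable label")
    for name, codes in value_labels.items():
        for code, label in codes.items():
            # Extended missing values (for example ".a") need labels too.
            if not label or not label.strip():
                problems.append(f"{name}: value {code!r} has no label")
    return sorted(problems)


def flag_pii(variable_names, pii_names):
    """Return the variables whose names match a user-supplied PII list.

    Matching is by name only and case-insensitive; it cannot detect PII
    hidden in free-text fields, so a manual review is still needed.
    """
    pii = {name.lower() for name in pii_names}
    return [v for v in variable_names if v.lower() in pii]
```

For example, `flag_pii(["hhid", "Name", "q1"], ["name", "phone"])` returns `["Name"]`, which would then be removed or flagged in the "Data Distribution" section.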

Supporting documents

All relevant material that would allow the users to better understand the data and interpret the results should be included. A non-comprehensive list of documents that may be relevant is included below. Note that some of the material in the list below may contain sensitive information (for example in the form of options listed in the questionnaire), so it should also be checked and de-identified.

  • Questionnaires (paper format equivalent is better than CAPI form)
  • Enumerator manuals
  • Fieldwork documentation
  • Methodology description
  • Data cleaning documentation

Study description

During submission, it is necessary to fill in a form collecting information on the survey (metadata). Not all fields are mandatory, but providing as much information as possible makes it easier for the users to understand and explore the data. This reduces the number of future queries received from both catalog staff preparing the data and users trying to properly understand the survey process.

  • Mandatory Fields:
    • Title
    • Country
    • Dates of Data Collection
    • Access policy
    • Catalogue where the data should be published
  • Recommended Fields:
    • Abstract
    • Geographic Coverage
    • Primary Investigator
    • Funding
    • Sampling Procedure
    • Weighting
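
A team can check its study description against the two field lists above before opening the deposit form. The sketch below is an illustration only: the dictionary keys are hypothetical names chosen here and are not the field identifiers used by the deposit application.

```python
# Hypothetical keys standing in for the mandatory and recommended fields.
MANDATORY_FIELDS = [
    "title", "country", "collection_dates", "access_policy", "catalog",
]
RECOMMENDED_FIELDS = [
    "abstract", "geographic_coverage", "primary_investigator",
    "funding", "sampling_procedure", "weighting",
]


def review_study_description(metadata):
    """Return (missing_mandatory, missing_recommended) for a metadata dict.

    A field counts as missing if it is absent, None, or only whitespace.
    """
    def is_empty(key):
        value = metadata.get(key)
        return value is None or str(value).strip() == ""

    missing_mandatory = [k for k in MANDATORY_FIELDS if is_empty(k)]
    missing_recommended = [k for k in RECOMMENDED_FIELDS if is_empty(k)]
    return missing_mandatory, missing_recommended
```

An empty first list means all mandatory fields are filled. The second list is a reminder of recommended fields that could still be added.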

Access conditions

The World Bank Microdata Library disseminates data under the Microdata Terms of Use for the World Bank. When submitting data, it is possible to indicate whether the datasets should be available only to World Bank staff or to external users. It is also possible to embargo any data submitted for a specified period of time. To protect the confidentiality of individual information and to meet the requirements of the data owners who provide the microdata, there are five principal types of access that may be applied:

  • Open access: this is the least restrictive access policy. Datasets and the related documentation are available to users for commercial and non-commercial purposes at no cost. Users do not need to be logged in to the application.
  • Direct access: relevant datasets and the related documentation are made freely available to registered and unregistered users for statistical and scientific research purposes only, and may not be distributed. Any publications employing this type of data must cite the source, in line with the citation requirement provided with the dataset.
  • Public Use Files: PUFs are available to anyone agreeing to respect a core set of easy-to-meet conditions. These data are made easily accessible because the risk of identifying individual respondents or data providers is considered to be low. Terms of use are the same as direct access, but users are required to register before obtaining the datasets.
  • Licensed files: files whose dissemination is restricted to bona fide users. Access is granted to authenticated users who have received authorization to access them after submitting a documented application and signing an agreement governing the data's use. These users must be acting on behalf of an organization, which must take responsibility for the use. To release data under this license, a World Bank staff member must be indicated as the point of contact to grant access to the data. That person will be contacted by the data catalog manager, who works with the team to approve or reject requests.
  • External Repositories: The World Bank Microdata Library operates both as a data catalog for World Bank owned or licensed data and as a portal to data held in a number of external repositories. The aim of the Microdata Library is to provide users with the most comprehensive catalog of development-related microdata possible. To this end, studies conducted and owned by other institutions, as well as links to those studies, are listed in the Microdata Library Catalog. Datasets provided by external agencies are not owned or controlled by the World Bank and have their own conditions of use. When a user accesses external repositories, the terms governing the use of those external repositories shall govern access to their data.
  • No access: some datasets have no access policy defined, or are not accessible. In some situations, we may include a limited number of such datasets for the sake of completeness and for the purpose of providing access to questionnaires and reports. Note that any datasets with no access will not be published on the external-facing catalog.
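
As a quick reference, the registration requirement implied by each access type above can be written down as a lookup table. This is a reading of the descriptions on this page, sketched in Python with names invented here; it is not an official specification and does not reflect how the Microdata Library implements access control.

```python
from enum import Enum


class AccessType(Enum):
    OPEN = "Open access"
    DIRECT = "Direct access"
    PUBLIC_USE = "Public Use Files"
    LICENSED = "Licensed files"
    EXTERNAL = "External Repositories"
    NO_ACCESS = "No access"


# True: users must register or be authenticated before obtaining data.
# False: registered and unregistered users can obtain the data.
# None: not determined by this page (external terms apply, or no data access).
REGISTRATION_REQUIRED = {
    AccessType.OPEN: False,
    AccessType.DIRECT: False,
    AccessType.PUBLIC_USE: True,
    AccessType.LICENSED: True,  # plus an approved application and signed agreement
    AccessType.EXTERNAL: None,
    AccessType.NO_ACCESS: None,
}


def registration_required(access_type):
    """Return True, False, or None (unknown) for the given AccessType."""
    return REGISTRATION_REQUIRED[access_type]
```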

Collections available

The Microdata Library operates as a portal for datasets originating from the World Bank and other international, regional and national organizations. These contributions make up the Central Microdata Catalog, which can also be viewed and searched by collection. When submitting data to the Catalog, it is necessary to specify in which collection it should be filed. Impact evaluation surveys are filed in the Impact Evaluation Survey Collection, even if treatment variables are temporarily embargoed.

  • World Bank catalogs
    • Global Financial Inclusion (Global Findex) Database
    • Service Delivery Facility Surveys
    • The STEP Skills Measurement Program
    • The World Bank Group Country Opinion Survey Program (COS)
    • Development Research Microdata
    • Enterprise Surveys
    • Impact Evaluation Surveys
    • Living Standards Measurement Study (LSMS)
    • Migration and Remittances Surveys
  • External catalogs
    • Global Health Data Exchange (GHDx), Institute for Health Metrics and Evaluation (IHME)
    • Integrated Public Use Microdata Series (IPUMS) International
    • MEASURE DHS: Demographic and Health Surveys
    • Millennium Challenge Corporation (MCC)
    • UNICEF Multiple Indicator Cluster Surveys (MICS)
    • WHO’s Multi-Country Studies Programmes
    • DataFirst, University of Cape Town, South Africa

Releasing data before publication

One common concern among researchers is under which conditions to submit data from studies that are still ongoing and whose results have not yet been published. There are several options available.

First of all, it is best practice to submit the data as soon as it is collected. The review process will guarantee that documentation is submitted, reducing the risk of forgetting important details about how the data was processed, an issue that often arises if the analysis is only carried out when the intervention is completed or after endline data is collected. Furthermore, once deposited, the data is safely stored, reducing the less likely but even more worrying chance of losing data. Depositing data early makes transitions between team members smoother and reduces the information lost over time.

The different access conditions can be used to withhold from release any information that may create issues if made public prior to publication. One possibility is to submit a dataset and embargo any treatment assignment variables until results are published. This means that these variables will only become available to users after an established date. If this solution is chosen, it is important to indicate in the documentation that such variables have been removed and will be released at a future date, as users expect treatment variables to be present in impact evaluation datasets. It is recommended to add a clause indicating that no conclusions on impact can be drawn until all rounds are published.

Alternatively, it is possible to embargo the whole dataset, making it "no access". In this case, the metadata will not be made available to the external audience and may not be available to World Bank staff if the embargo applies internally as well. Another option is to first submit a "censored" version of the data, without treatment variables, and update the submission to include all variables after publication.
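
Preparing a "censored" version amounts to dropping the treatment variables from a copy of the data and keeping a record of what was dropped for the documentation. The sketch below assumes the data is held as a list of dictionaries, one per observation. The function name and variable names are hypothetical, and the same step could equally be done in Stata or another tool.

```python
def make_censored_copy(rows, embargoed_variables):
    """Return (censored_rows, removed) without modifying the input.

    rows:                list of dicts, one per observation
    embargoed_variables: names of variables to withhold, e.g. treatment assignment
    removed:             sorted names that were actually present and dropped,
                         to be listed in the documentation
    """
    embargoed = set(embargoed_variables)
    censored_rows = [
        {key: value for key, value in row.items() if key not in embargoed}
        for row in rows
    ]
    removed = sorted({key for row in rows for key in row if key in embargoed})
    return censored_rows, removed
```

The `removed` list can be pasted into the study description so that users know which variables are withheld until publication.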

DIME Datasets on Microdata Catalog


Back to Parent

This article is part of the topic Publishing Data