= Data Cleaning =

Data cleaning is an essential step between data collection and data analysis. Raw primary data is always imperfect and needs to be prepared so that it is easy to use in the analysis. This is the high-level goal of data cleaning. In extremely rare cases, the only preparation needed is to document the data set, for example by adding labels. In the vast majority of cases, however, there are many small things that need to be addressed in the data set itself. This can mean both correcting data points that are incorrect and replacing values that are not real data points but codes explaining why a real data point is missing.
<br />
== Read First ==

*See this [[Checklist:_Data_Cleaning|checklist]] that can be used to make sure that common cleaning actions have been done when applicable.
*As a [[Impact_Evaluation_Team#Research_Assistant|Research Assistant]] (RA) or [[Impact_Evaluation_Team#Field_Coordinator|Field Coordinator]] (FC), do not spend time trying to fix irregularities in the data at the expense of identifying as many irregularities as possible.
*The quality of the analysis will never be better than the quality of the data cleaning.
*There is no exhaustive list of what to do during data cleaning, as each project has individual cleaning needs, but this article provides a very good place to start.
*After the data cleaning for each round of data collection is finished, the data can be [[Publishing Data|released]].
<br />
== The Goal of Cleaning ==

There are two main goals when cleaning a data set:

#Cleaning individual data points that invalidate or incorrectly bias the analysis.
#Preparing a clean data set that is easy for other researchers to use, both inside and outside your team.

[[File:Picture2.png|700px|link=|center]]
Another overarching goal of the cleaning process is to understand the data and the data collection really well. Much of this understanding feeds directly into the two points above, but a really good data cleaning process should also result in documented lessons learned that can be used in future data collection, both in later rounds of the same project and in similar projects.
<br />
=== Cleaning individual data points ===

In impact evaluations, the analysis often comes down to testing for statistical differences in means between the control group and any of the treatment arms. We do so through regression analysis where we include control variables, fixed effects, and different error estimators, among many other tools. In essence, though, one can think of it as an advanced comparison of means. While this is far from a complete description of impact evaluation analysis, it gives the person cleaning a data set for the first time a framework for what cleaning should achieve.

It is difficult to have an intuition for the math behind a regression, but it is easy to have an intuition for the math behind a mean. Anything that biases a mean will bias a regression, and while there are many more things that can bias a regression, this is a good place to start for anyone cleaning a data set for the first time. The researcher in charge of the analysis is trained in what else needs to be done for the specific regression models used. The sections below go through specific examples, but it is probably obvious to most readers that outliers, typos in the data, and survey codes (often values like -999 or -888) bias means, so it is never wrong to start with those.
<br />
=== Prepare a clean data set ===

The second goal of the data cleaning is to document the data set so that variables, values, and anything else are as self-explanatory as possible. This helps other researchers to whom you grant access to the data set, but it also helps you and your research team when accessing the data set in the future. At the time of the data collection or of the data cleaning, you know the data set much better than you will at any time in the future. Carefully documenting this knowledge so that it can be used at the time of analysis is often the difference between a good analysis and a great analysis.
<br />
== Role Division during Data Cleaning ==
As a [[Impact_Evaluation_Team#Research_Assistant|Research Assistant]] (RA) or [[Impact_Evaluation_Team#Field_Coordinator|Field Coordinator]] (FC), spend your time identifying and documenting irregularities in the data. It is never bad to suggest corrections to irregularities, but a common mistake RAs and FCs make is to spend too much time trying to fix irregularities at the expense of having enough time to identify and document as many as possible. One major reason is that different regression models might require different ways of correcting issues, a perspective often only the PI has. In such cases, much time might be spent on a correction that is not valid given the regression model used in the analysis.

Eventually the [[Impact_Evaluation_Team#Principal_Investigator|Principal Investigator]] (PI) and the RA or FC will have a common understanding of which correction calls can be made without involving the PI, but until then, it is recommended that the RA focus his or her time on identifying and documenting as many issues as possible rather than spending a lot of time on how to fix them. It is no problem to do both, as long as the fixing does not happen at the cost of identifying as many issues as possible.
<br />
== Import Data ==

The first step in cleaning the data is to import the data. If you work with secondary data (data prepared by someone else), this step is often straightforward, but it is often underestimated when working with primary data. It is very important that any change, no matter how small, always be made in Stata (or in R or any other scripting language). Even if you know that there are incorrect submissions in your raw data (duplicates, pilot data mixed with the main data, etc.), those deletions should always be done in such a way that they can be replicated by re-running the code. Without such a record, the analysis might no longer be valid. See the article on [[DataWork_Survey_Round#Raw_Folder|raw data folders]] for more details.
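A minimal sketch of such a reproducible import, assuming hypothetical file paths and variable names (<code>submissiondate</code> imported as a Stata date, <code>key</code> as the unique submission ID):

<pre>
* Import the raw data; the file path is hypothetical
import delimited using "raw/survey_data.csv", clear

* Drop pilot submissions in code rather than by hand, so the
* deletion is documented and can be replicated by re-running this
drop if submissiondate < date("2018-01-15", "YMD")

* Drop a known incorrect duplicate submission by its unique key
drop if key == "uuid:example-duplicate-submission"
</pre>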
<br />
=== Importing Primary Survey Data ===

All modern CAPI survey data collection tools provide methods for importing the raw data in a way that drastically reduces the amount of work needed when cleaning the data. These methods typically include a Stata do-file that generates labels and much more from the questionnaire code and then applies them to the raw data as it is being imported. If you are working in SurveyCTO, see this article on [[SurveyCTO Stata Template | SurveyCTO's Stata Template]].
<br />
== Examples of Data Cleaning Actions ==

The material in this section has been written with primary survey data in mind, although many of these practices are also applicable when cleaning other types of data sets.

'''Data Cleaning Check List'''. This [[Checklist:_Data_Cleaning|checklist]] can be used to make sure that all common aspects of data cleaning have been covered. Note that it is not an exhaustive list. Such a list is impossible to create, as individual data sets and the analysis methods used on them all require different cleaning, the details of which depend on the context of each data set.
<br />
===ID Variables===
It is important that the clean dataset be uniquely and fully identifiable by a single variable. It is often the case that when [[Primary Data Collection|primary data]] is imported, there are [[Duplicates and Survey Logs|duplicated entries]]. These cases must be carefully documented, and should only be corrected after discussing with the [[Field Coordinator]] and field team what caused them, so that the right observations are kept in the dataset. [[ieduplicates]], a command in [[Stata Coding Practices#ietoolkit|ietoolkit]], is useful for identifying and correcting duplicated entries. Once duplicates are corrected, the observations can be linked to the [[Master Data Set|master dataset]], and the dataset can be [[De-identification|de-identified]].
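A hedged sketch of this workflow; the ID variable <code>hhid</code> and the report path are assumptions, and the exact <code>ieduplicates</code> syntax may differ across versions, so check its help file:

<pre>
* Install ietoolkit, which contains ieduplicates
ssc install ietoolkit

* Write all duplicates to an Excel report in which corrections are
* documented, then re-applied each time the code runs
ieduplicates hhid using "duplicates_report.xlsx", uniquevars(key)

* Once corrections are applied, confirm the ID is unique
isid hhid
</pre>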
<br />
=== Incorrect Data and Other Irregularities ===

There are countless ways in which a primary data set can have irregularities, so there is no way to make an exhaustive list of what should be done. This section gives a few examples:

'''Outliers'''. There are many rules of thumb for how to define an outlier, but there is no silver bullet. One rule of thumb is to flag any data point that is more than three standard deviations from the mean of the same variable across all observations. This can be a starting point, but one always needs to consider qualitatively whether it is the correct approach. Observations with outliers should not be dropped, but in some cases the data point for that observation is replaced with a missing value. There are often better approaches. One common approach is winsorization, where any values larger than a certain percentile, often the 99th, are replaced with the value at that percentile. This prevents very large values from biasing the mean. It also has an equality-of-impact aspect: if all the benefit of a project went to a single observation in the treatment group, the mean would still be high, but that is rarely a desired outcome in development, so winsorization penalizes inequitable distribution of the benefits of a project.
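A minimal winsorization sketch, assuming a hypothetical variable <code>income</code>:

<pre>
* Compute detailed summary statistics, which store the 99th
* percentile in r(p99)
summarize income, detail

* Replace values above the 99th percentile with that percentile
replace income = r(p99) if income > r(p99) & !missing(income)
</pre>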
<br />
'''Illogical Values'''. These are checks that one data point is possible in relation to another value. For example, if a respondent is male, then he cannot answer that he is pregnant. Simple cases like this can and should be programmed into the questionnaire so that they never happen, but no questionnaire can ever be pre-programmed to control for every such case.

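A sketch of such a consistency check, assuming hypothetical 0/1 variables <code>female</code> and <code>pregnant</code> and an ID variable <code>key</code>:

<pre>
* List the offending observations so they can be documented
list key female pregnant if female == 0 & pregnant == 1

* Stop execution if any such observation remains after cleaning
assert !(female == 0 & pregnant == 1)
</pre>
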
'''Typos'''. If it is obvious beyond any doubt that a response is incorrect due to a simple typo, then it is a good idea to correct the typo, as long as it is done in a documented and reproducible way.
<br />
=== Survey Codes and Missing Values ===

Almost all data collection done through surveys allows the respondent to answer something like "Do not know" or "Declined to answer" to individual questions. These answers are usually recorded using survey codes in the format -999, -888, or something similar. These numbers will obviously bias means and regressions if left as they are, so they must be replaced with missing values in Stata.

Stata has several missing values. The most well known is the regular missing value, represented by a single ".", but we would lose the difference in meaning between "Do not know" and "Declined to answer" if both codes were replaced with the regular missing value. Stata offers a solution with its extended missing values, represented by ".a", ".b", ".c", and so on all the way to ".z". Stata handles these values the same as "." in commands that expect a numeric value, but they can be labeled differently, so the original information is not lost. Make sure that each letter ".a", ".b", etc. always represents only one thing across your project. The missing values should be assigned value labels so that they can be interpreted. See the [http://www.stata.com/manuals13/u12.pdf#u12.2 Stata Manual on missing values] for more details.
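A sketch of this replacement, assuming the variable name <code>income</code> and the survey codes -999 and -888:

<pre>
* Replace survey codes with extended missing values
replace income = .a if income == -999   // Do not know
replace income = .b if income == -888   // Declined to answer

* Label the extended missing values so their meaning is preserved
label define income_lbl .a "Do not know" .b "Declined to answer"
label values income income_lbl
</pre>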
<br />
Missing values can be used for much more than survey codes. Any value that we remove because we found out it is incorrect should be replaced with a missing value. In a [[Master Data Set]], there should be no regular missing values: every missing value in a master data set should carry an explanation of why we do not have that information for that observation.
<br />
=== No Strings ===

All data should be stored in numeric format. There are multiple reasons for this, but the two most important are that (1) numbers are stored more efficiently and (2) many Stata commands expect values to be stored numerically. Categorical string variables should be stored as numeric codes and have value labels assigned.

There are two exceptions where string variables are allowed:

'''Numbers that cannot be stored correctly numerically'''. There are two cases of this exception. The first is when a number is more than 15 digits long, which can happen when working with some national IDs. If a continuous variable has more than 15 digits, it should be rounded and converted to a different scale, as a precision of 16 digits is not meaningful even in the natural sciences. An ID, for obvious reasons, cannot be rounded. The other case is numbers starting with a zero, which happens with some national IDs and with telephone numbers in some countries. Stata removes any leading zeros, so these values have to be stored as strings.

'''Non-categorical text'''. Text answers that cannot be converted into categories need to be stored as strings. One example is open-ended questions. Open-ended questions should in general be avoided, but sometimes the questionnaire asks the respondent to answer a question in his or her own words, and then that answer has to be stored as a string. Another example is when the respondent is asked to specify the answer after answering ''Other'' in a multiple choice question. A different case where string format is needed is some proper names, for example the name of the respondent. Not all proper names should be stored as strings, as some can be made into categories. For example, if you collect data on respondents and multiple respondents live in the same villages, the variable with the village names should be converted into a categorical numeric variable with a value label assigned. See the section on value labels below.
<br />
=== Labels ===
There are several ways to add helpful descriptive text to a data set in Stata, but the two most common and important are variable labels and value labels.

'''Variable Labels'''
All variables in a clean data set should have variable labels describing the variable. A label can be up to 80 characters long, so there is a limit to how much information can be included. In addition to a brief explanation of the variable, it is usually good to include information such as the unit or currency used in the variable and anything else that cannot be read from the values themselves.

'''Value Labels'''
Categorical variables should always be stored numerically and have value labels that describe what each numeric code represents. For example, yes/no questions should be stored as 0 and 1 and have the label ''No'' for data cells with 0 and the label ''Yes'' for data cells with 1. This should be applied to all multiple choice variables.

There are tools in Stata to convert categorical string variables to categorical numeric variables where the strings are automatically applied as value labels. The most common is the command <code>encode</code>. However, if you use <code>encode</code>, you should always use the two options <code>label()</code> and <code>noextend</code>. Without these options, Stata assigns a code to each string value in alphabetical order, and there is no guarantee that this order is preserved when observations are added or removed, or if someone else makes changes earlier in the code. <code>label()</code> forces you to manually create the label before using <code>encode</code> (this requires some manual work, but it is worth it). <code>noextend</code> throws an error if there is a value in the data that does not exist in the pre-defined label. This way you are notified that you need to add the new value to the value label you created manually, or you can change the string value if a typo or similar is the reason that string value was not assigned a value label.
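A minimal sketch of this workflow; the string variable <code>village_str</code> and the village names are assumptions:

<pre>
* Manually define the value label first, so the codes are stable
label define village_lbl 1 "Village A" 2 "Village B" 3 "Village C"

* Convert the string variable to a labeled numeric variable;
* noextend throws an error if the data contains a village name
* that is not yet in village_lbl, instead of silently extending it
encode village_str, generate(village) label(village_lbl) noextend
</pre>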
<br />
== Additional Resources ==

[[Category: Data Cleaning ]]

= Synthetic Control Method =
'''THIS IS A STUB PAGE - CONTRIBUTIONS REQUESTED'''

The synthetic control method is a statistical method to evaluate treatment effects in comparative case studies. It creates a synthetic version of the [[Treatment Group | treated units]] by weighting variables and observations in the [[Control Group | control group]].
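A hedged usage sketch based on the <code>synth</code> package's well-known smoking example (see the resources below); treat the install steps, dataset, and option values as assumptions and check the package's help file:

<pre>
* Install the synth package together with its example data
ssc install synth, all replace
use smoking, clear
tsset state year

* Build a synthetic California (unit 3) from pre-1989 predictors
synth cigsale beer lnincome retprice age15to24 ///
    cigsale(1988) cigsale(1980) cigsale(1975), ///
    trunit(3) trperiod(1989) fig
</pre>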
<br />
== Read First ==<br />
<br />
== Guidelines ==<br />
<br />
== Back to Parent ==<br />
<br />
This article is part of the topic [[Quasi-Experimental Methods]].<br />
<br />
== Additional Resources ==<br />
<br />
[http://onlinelibrary.wiley.com/doi/10.1111/ajps.12116/full Abadie, Alberto, Alexis Diamond, and Jens Hainmueller. "Comparative politics and the synthetic control method." ''American Journal of Political Science'' 59, no. 2 (2015): 495-510.]
<br />
[https://web.stanford.edu/~jhain/synthpage.html Synth - Package for Synthetic Control Method in Stata, R and MATLAB]<br />
<br />
[[Category: Quasi-Experimental Methods]][[Category: Data Analysis]]

= Experimental Methods =

Impact evaluations aim to identify the impact of a particular intervention or program (a "treatment") by comparing treated units (households, groups, villages, schools, firms, etc.) to control units. Well-designed impact evaluations estimate the impact that can be causally attributed to the treatment, i.e. the impact that was a result of the treatment itself and not of other factors. The main challenge in designing a rigorous impact evaluation is identifying a control group that is comparable to the treatment group. The gold-standard method for assigning treatment and control is randomization. To design a rigorous impact evaluation, it is essential to have a clear understanding of the [[Theory of Change]].
<br />
== Read First ==<br />
Experimental methods are research designs in which the investigator explicitly and intentionally induces exogenous variation in the uptake of the program to be evaluated. Experimental methods, such as [[Randomized Control Trials]], are typically considered the gold standard design for impact evaluation, since by construction the takeup of the treatment is uncorrelated with other characteristics of the treated population. Under these conditions, it is always possible for the analyst to construct a regression model in which the estimate of the treatment effect is unbiased. <br />
<br />
<br />
== The Power of Experimental Methods ==<br />
Experimental methods, such as [[Randomized Control Trials]], are the gold standard for impact evaluation. Experimental variation imposes known variation on the study population. This guarantees that the intervention effect is not confounded (since it is not correlated with any external variable) and that causality is identified, since selection into the randomization is not possible. However, this leads to natural concerns about the structure of differential [[takeup]] and [[attrition]] in a randomization setting which must be addressed in every sample where [[noncompliance]] is a possibility.<br />
<br />
Experimental methods address two primary sources of bias: 
<br />
* First, the estimate may be confounded, in the sense that it masks an effect produced in reality by another, correlated variable. For example, schooling may improve the quality of job offers via network exposure while the education itself adds no value. In this case the result remains "correct" in the sense that those who got more schooling got higher earnings, but "incorrect" in the sense that the estimate is not the marginal value of education.
<br />
* Second, the direction of causality may be reversed or simultaneous. For example, individuals who are highly motivated may choose to complete more years of schooling as well as being more competent at work in general; or those who are highly motivated by financial returns in the workplace may choose more schooling because of that motivation.<br />
<br />
<br />
== Common Types of Experimental Methods ==<br />
<br />
Experimental methods typically include directly randomized variation of programs or interventions offered to study populations. This variation is usually broadly summarized as "[[Randomized Control Trials]]", but can include cross-unit variation with one or more periods ([[Cross-sectional Data|cross-sectional]] or [[Difference-in-Differences|difference-in-differences]] designs); within-participant variation ([[Panel Data |panel]] studies); or treatment randomization at a [[Randomized Control Trials#Clustered RCTs | clustered]] level with further variation within clusters (multi-level), for example. A minimal sketch of a cluster-level assignment follows.
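A hedged sketch of a simple assignment randomized at the village level; the variable name <code>village_id</code> is an assumption:

<pre>
* Make the assignment reproducible
set seed 20180212

* Collapse to one row per village and assign half to treatment
preserve
keep village_id
duplicates drop
gen double u = runiform()
sort u
gen treatment = (_n <= _N/2)
tempfile assignment
save `assignment'
restore

* Merge the village-level assignment back onto the full data
merge m:1 village_id using `assignment', nogenerate
</pre>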
<br />
Experimental variation is also possible on the research side through randomized variation in the survey methodology. For example, public health surveys have used "[[mystery patients]]" to identify the quality of medical advice given to people in primary care settings; by comparing the outcomes with other health care providers given [[medical vignettes]] instead of mystery patients, or by changing the information given from the patient to the provider, or by changing the setting in which the interaction is conducted, causal differences in outcomes can be estimated.<br />
<br />
Additionally, designs like [[endorsement experiments]] and [[list experiments]] randomly vary the contents of the survey itself to elicit accurate responses from participants when there is concern about [[social desirability bias]] or [[Hawthorne effects]].<br />
<br />
== Designing Experimental Impact Evaluations ==<br />
<br />
== Additional Resources ==<br />
<br />
[http://runningres.com/ Running Randomized Evaluations] - the website includes all content from the book Running Randomized Evaluations, supplemental materials like case studies, and a blog.<br />
<br />
Impact Evaluation Toolkit from the Results-Based Financing Team at the World Bank - [http://web.worldbank.org/WBSITE/EXTERNAL/TOPICS/EXTHEALTHNUTRITIONANDPOPULATION/EXTHSD/EXTIMPEVALTK/0,,contentMDK:23262154~pagePK:64168427~piPK:64168435~theSitePK:8811876,00.html Impact Evaluation Questions]
<br />
[http://www.oecd.org/dac/evaluation/dcdndep/37671602.pdf Impact Evaluation Design Principles] from OECD<br />
<br />
[[Category: Experimental Methods]]

= Randomization Inference =

Randomization inference is a statistical practice for calculating regression p-values that reflect the variation in experimentally assigned data arising from the randomization itself. When the researcher controls the treatment assignment of the entire observed group, variation arises from the treatment assignment rather than from the sampling strategy, and p-values based on the randomization may therefore be more appropriate than "standard" p-values.
<br />
==Motivation: Baseline Balance in Experimental Data==<br />
<br />
[http://blogs.worldbank.org/impactevaluations/should-we-require-balance-t-tests-baseline-observables-randomized-experiments Recent discussions] have pointed out that "baseline balance" t-tests on datasets where treatment was randomly assigned are conceptually challenging. This is because the p-values from t-tests are properly interpreted as the estimated probability that the observed difference between the sampled groups would have been observed if those samples had been drawn from underlying sampling frames with no true mean difference. However, in a randomization framework, there is no underlying universe of observations from which the samples are drawn: the observed data comprises the full universe of eligible units and therefore the differences are exact, so "testing" them reveals no information in this view.<br />
<br />
==Randomization Inference: Is the Treatment Effect Significant?==<br />
<br />
[https://jasonkerwin.com/nonparibus/2017/09/25/randomization-inference-vs-bootstrapping-p-values/ The same logic extends to differences in outcome variables] that the researcher wants to investigate for causal response to a randomly assigned treatment. The differences between the treatment and control groups are in general exact because the full universes are observed in data. This means that asymptotically-motivated "sampling variation" cannot be used to calculate whether the difference between the treatment and control groups is statistically significant. Rather than estimating the variation in draws from a hypothesized infinite underlying distribution (the mathematical approach of "standard" p-values), the researcher should instead compute p-values based on the knowable variation in hypothetical ''treatment assignments'', using the randomization process as the source of variation for the estimate.<br />
<br />
==Calculating Randomization Inference p-Values==<br />
<br />
Although the practice is not yet required by most journals, randomization inference is straightforward to implement with [http://blogs.worldbank.org/impactevaluations/print/finally-way-do-easy-randomization-inference-stata modern statistical software]. The steps are conceptually straightforward in a Monte Carlo framework (a sketch follows the list):

#Preserve the original treatment assignment
#Generate placebo treatment statuses according to the original assignment method
#Estimate the original regression equation with an additional term for the placebo treatment
#Repeat steps 2–3
#The randomization inference p-value is ''the proportion of times the placebo treatment effect was larger (in absolute value) than the estimated treatment effect''

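A hedged sketch of one simple variant of this procedure; the variable names <code>y</code> and <code>treat</code> and the 1,000 replications are assumptions, and the user-written <code>ritest</code> command automates a more complete version:

<pre>
* Estimate the actual treatment effect once
set seed 20180212
regress y treat
scalar b_actual = _b[treat]

* Re-assign treatment at random many times and re-estimate
local larger = 0
forvalues i = 1/1000 {
    preserve
    gen double u = runiform()
    sort u
    gen placebo = (_n <= _N/2)   // same share treated as original
    regress y placebo
    if abs(_b[placebo]) >= abs(b_actual) local ++larger
    restore
}

* Share of placebo effects at least as large as the actual effect
display "Randomization inference p-value: " `larger'/1000
</pre>
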
Because the treatment assignment is the source of variation in the experimental design, the p-value is correctly interpretable as "the probability that a similar size treatment effect would have been observed under different hypothetical realizations of the chosen randomization method".<br />
<br />
==Implications for Experimental Design==<br />
<br />
When planning to use randomization inference for an experimental analysis, it is important to consider the difference in the source of variation already at the experimental design stage. In particular, this means performing the power calculations and the actual randomization in a way that accounts for the randomization-inference method of p-value calculation.
<br />
[https://www.povertyactionlab.org/sites/default/files/publications/athey_imbens_june19.pdf Athey and Imbens (2016)] provide an extensive guide to these considerations. Major takeaways include:<br />
<br />
#Power is maximized by forcing treatment-control balance on relevant baseline observables or outcome levels. This is achieved in theory by maximally partitioning into strata (2 treatment units and 2 control units in each, assuming a balanced design with one treatment arm), with fixed effects for the strata in the final regression.<br />
#Pairwise randomization is inappropriate because within-strata variances cannot be computed.<br />
#The "re-randomization" approach to force balance is typically inappropriate.<br />
<br />
<br />
[[Category:Data Analysis]]

= Difference-in-Differences =

Difference-in-differences is a method to determine treatment effects in [[Natural Experiments | natural experiments]]. It compares the average change in the [[Treatment Group | treatment group]] to the average change in the [[Control Group | control group]]. It requires [[Panel Data | panel data]] to be implemented.
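A minimal regression sketch, assuming hypothetical variables <code>y</code>, <code>treat</code>, <code>post</code>, and a cluster ID <code>id</code>:

<pre>
* The coefficient on the interaction 1.treat#1.post is the
* difference-in-differences estimate
regress y i.treat##i.post, vce(cluster id)
</pre>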
<br />
== Read First == <br />
<br />
== Guidelines == <br />
<br />
== Back to Parent == <br />
This article is part of the topic [[Quasi-Experimental Methods]]<br />
<br />
== Additional Resources == <br />
[[Category: Quasi-Experimental Methods]][[Category: Data Analysis]]

= Propensity Score Matching =

'''Propensity Score Matching (PSM)''' is a quasi-experimental impact evaluation technique that attempts to estimate the effect of a treatment by matching control group participants to treatment group participants based on propensity scores (the predicted probability of participation given observed characteristics). This is done to reduce the [[Selection Bias | selection bias]] that may be present in non-experimental data.
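A hedged sketch using Stata's built-in <code>teffects psmatch</code>; the outcome, treatment, and covariate names are assumptions:

<pre>
* Match on the propensity score estimated from x1 and x2 and
* report the average treatment effect on the treated
teffects psmatch (y) (treat x1 x2), atet
</pre>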
<br />
<br />
== Read First ==<br />
<br />
* The efficacy of a PSM design depends mostly on how well the observed characteristics determine program participation. If the bias from unobserved characteristics is likely to be very small, PSM provides good estimates; if it is large, the PSM estimates can be sizably biased.
<br />
== Guidelines ==<br />
<br />
<br />
== Back to Parent ==<br />
<br />
This article is part of the topic [[Quasi-Experimental Methods]].<br />
<br />
== Additional Resources ==<br />
<br />
[[Category: Quasi-Experimental Methods]][[Category: Data Analysis]]</div>501238https://dimewiki.worldbank.org/index.php?title=Data_Cleaning&diff=4415Data Cleaning2018-02-12T17:54:42Z<p>501238: /* ID Variables */</p>
<hr />
<div><span style="font-size:150%"><br />
</span><br />
</span><br />
<br />
Data cleaning is an essential step between data collection and data analysis. Raw primary data is always imperfect and needs to be prepared so that it is easy to use in the analysis. This is the high level goal of Data Cleaning. In extremely rare cases, the only preparation needed is to document the data set, for example - by using labels. However, in the vast majority of cases, there are many small things that need to be addressed in the data set itself. This could both be addressing data points that are incorrect, or replacing values that are not real data points, but codes explaining why there is no real data point.<br />
<br />
== Read First ==<br />
<br />
*See this check list (not done yet) that can be used to make sure that common cleaning actions have been done when applicable.<br />
*As a [[Impact_Evaluation_Team#Research_Assistant|Research Assistant]] (RA) or [[Impact_Evaluation_Team#Field_Coordinator|Field Coordinator]] (FC), do not spend time trying to fix irregularities in the data at the expense of not having time to identifying as many irregularities as possible.<br />
*The quality of the analysis will never be better than the quality of data cleaning.<br />
*There is no such thing as an exhaustive list of what to do during data cleaning as each project will have individual cleaning needs, but this article provides a very good place to start.<br />
*After finishing the data cleaning for each round of data collection, data can be [[Publishing Data|released]]<br />
<br />
== The Goal of Cleaning ==<br />
<br />
There are two main goals when cleaning the data set:<br />
<br />
#Cleaning individual data points that invalidate or incorrectly bias the analysis.<br />
#Preparing a clean data set so that it is easy to use for other researchers. Both for researchers inside your team and outside your team.<br />
<br />
[[File:Picture2.png|700px|link=|center]]<br />
Another overarching goal of the cleaning process is to understand the data and the data collection really well. Much of this understanding feeds directly into the two points above, but a really good data cleaning process should also result in documented lessons learned that can be used in future data collection. Both in later data collection rounds in the same project, but also in data collections in other similar projects.<br />
<br />
=== Cleaning individual data points ===<br />
<br />
In impact evaluations, our analysis often come down to test for statistical differences in the mean between the control group and any of the treatment arms. We do so through regression analysis where we include control variables, fixed effects, and different error estimators, among many other tools. In essence, though, one can think of it as an advanced comparison of means. While this is far from a complete description of impact evaluation analysis, it might give the person cleaning a data set for the first time a framework on what cleaning a data set should achieve.<br />
<br />
It is difficult to have an intuition for the math behind a regression, but it easy to have an intuition for the math behind a mean. Anything that biases a mean will bias a regression, and while there are many more things that can bias a regression, this is a good place to start for anyone cleaning a data set for the first time. The researcher in charge of the analysis is trained in what else that needs to be done for the specific regression models used. The articles linked to below will go through specific examples, but it is probably obvious to most readers that outliers, typos in data, survey codes (often values like -999 or -888) etc. bias means, so it is never wrong to start with those examples.<br />
<br />
=== Prepare a clean data set ===<br />
<br />
The second goal of the data cleaning is to document the data set so that variables, values, and anything else is as self-explanatory as possible. This will help other researchers that you grant access to this data set, but it will also help you and your research team when accessing the data set in the future. At the time of the data collection or at the time of the data cleaning, you know the data set much better than you will at any time in the future. Carefully documenting this knowledge so that it can be used at the time of analysis is often the difference between a good analysis and a great analysis.<br />
<br />
== Role Division during Data Cleaning ==<br />
As a [[Impact_Evaluation_Team#Research_Assistant|Research Assistant]] (RA) or [[Impact_Evaluation_Team#Field_Coordinator|Field Coordinator]] (FC), spend time identifying and documenting irregularities in the data. It is never bad to suggest corrections to irregularities, but a common mistake RAs or FCs do is that they spend too much time on trying to fix irregularities at the expense of having enough time to identify and document as many as possible. One major reason for that is that different regression models might require different ways to correct issues and this is often a perspective only the PI has. In such cases, much time might have been spent on coming up with a correction that is not valid given the regression model used in the analysis.<br />
<br />
Eventually the [[Impact_Evaluation_Team#Principal_Investigator|Principal Investigator]] (PI) and the RA or FC will have a common understanding on what correction calls can be made without involving the PI, but until then, it's recommended that the RA focus her/his time on identifying and documenting as many issues as possible rather than spending a lot of time on how to fix the issues. It is no problem to do both as long as the fixing doesn't happen at the cost of identifying as many issues as possible.<br />
<br />
== Import Data ==<br />
<br />
The first step in cleaning the data is to import the data. If you work with secondary data (data prepared by someone else) then this step is often straightforward, but this is a step often underestimated when working with primary data. It is very important for any change, no matter how small, to always be made in Stata (or in R or any other scripting language). Even if you know that there are incorrect submissions in your raw data (duplicates, pilot data mixed with the main data etc.), those deletions should always be done in such a way that they can be replicated by re-running code. Without this information, the analysis might no longer be valid. See the article on [[DataWork_Survey_Round#Raw_Folder|raw data folders]] for more details.<br />
<br />
=== Importing Primary Survey Data ===<br />
<br />
All modern CAPI survey data collections tools provided methods for importing the raw data in a way that drastically reduces the amount of work that needs to be done when cleaning the data. These methods typically include a Stata do-file that generates labels and much more from the questionnaire code and then applies that to the raw data as it is being imported. If you are working in SurveyCTO see this article on [[SurveyCTO Stata Template | SurveyCTO's Stata Template]].<br />
<br />
== Examples of Data Cleaning Actions ==<br />
<br />
The material in this section has been generated with primary survey data in mind, although a lot of these practices are also applicable when cleaning other types of data sets.<br />
<br />
'''Data Cleaning Check List'''. This is a check list that can be used to make sure that all common aspects of data cleaning has been covered. Note that this is not an exhaustive list. Such a list is impossible to create as the individual data sets and the analysis methods used on them all require different cleaning that in the details depends on the context of that data set.<br />
<br />
===ID Variables===<br />
It's important that the clean dataset be uniquely and fully identifiable by a single variable. It often is the case that when [[Primary Data Collection|primary data]] is imported, there are [[Duplicates and Survey Logs|duplicated entries]]. These cases must be carefully documented, and should only be corrected after discussing with the [[Field Coordinator]] and field team what caused them, so the right observations are kept in the dataset. [[ieduplicates]], a command in [[Stata Coding Practices#ietoolkit|ietoolkit]] is a useful command to identify and correct duplicated entries. Once duplicates are corrected, the observations can be linked to the [[Master Data Set|master dataset]], and the dataset, [[De-identification|de-identified]].<br />
<br />
=== Incorrect Data and Other Irregularities ===<br />
<br />
There are countless ways that there can be irregularities in a primary data set, so there is no way to do an exhaustive list of what should be done. This section gives a few examples: <br />
<br />
'''Outliers'''. There are many rules of thumb for how to define an outlier but there is no silver bullet. One rule of thumb is that any data point that is three standard deviations away from the mean of the same data point for all observations. This may always be a starting point, but one needs to qualitatively consider if this is a correct approach. Observations with outliers should not be dropped, but in some cases, the data point for that observation is replaced with a missing value. There are often better approaches. One common approach is to use winsorization, where any values bigger than a certain percentile, often the 99th, are replaced with the value at that percentile. This way very large values are prevented from biasing the mean. This also has an equality of impact aspect. For example, if all benefit of a project went to a single observation in the treatment group, then the mean would still be high, but that is rarely a desired outcome in development. So winsorization penalizes inequitable distribution of the benefits of a project.<br />
<!----- EDIT HERE -------><br />
<br />
'''Illogical Values'''. This is to test that one data point is possible in relation to another value. For example, if a respondent is male, then the respondent cannot answer that he is pregnant. This simple case is something that can and should be programmed into the questionnaire so that it does not happen. However, no questionnaire ever can be pre-programmed to control for every such case.<br />
<br />
'''Typos'''. If it is obvious beyond any doubt that the response is incorrect due to a simple typo, then it is a good idea to correct the type as long as it is done in a documented and reproducible way.<br />
<br />
=== Survey Codes and Missing Values ===<br />
<br />
Almost all data collection done through surveys of any sort allows the respondent to answer something like "Do not know" or "Declined to answer" for individual questions. These answers are usually recorded using survey codes on the format -999, -88 or something similar. It is obvious that these numbers will bias means and regressions if they are left as such. These values must be replaced with missing values in Stata. <br />
<br />
Stata has several missing values. The most well know is the regular missing value represented by a single "." but we would lose the difference in meaning between "Do not know" and "Declined to answer" if both codes representing them were replaced with the regular missing value. Stata offers a solution with its extended missing values. They are represented by ".a", ".b", ".c" etc. all the way to ".z". Stata handles these values the same as "." in commands that expect a numeric value, but they can be labeled differently and the original information is therefore not lost. Make sure that the same letter ".a", ".b" etc. is used to always represent only one thing across your project. The missing values should be assigned value labels so that they can be interpreted. See [http://www.stata.com/manuals13/u12.pdf#u12.2 Stata Manual Missing Values] for more details on missing values.<br />
<br />
Missing values can be used for much more than just survey codes. Any value that we remove because we found out is incorrect should be replaced with a missing value. In a [[Master Data Set]], there should be no regular missing values. All missing values in a master data set should contain an explanation of why we do not have that information for that observation.<br />
<br />
=== No Strings ===<br />
<br />
All data should be stored in numeric format. There are multiple reasons for this, but the two most important are that (1) numbers are stored more efficiently and (2) many Stata commands expect values to be stored numerically. Categorical string variables should be stored as numeric codes and have value labels assigned.<br />
<br />
There are two exceptions where string variables are allowed. The two examples are listed below:<br />
<br />
'''Numbers that cannot be stored correctly numerically'''. There are two cases of this exception. The first case is when a number is more than 15 digits long. This can happen when working with some national IDs. If a continuous variable has more than 15 digits, then it should be rounded and converted to a different scale, as a precision of 16 digits is not even possible in natural sciences. An ID can for obvious reasons not be rounded. The other case is that of numbers starting with a zero. This is sometimes the case in some national IDs and it is also sometimes the case with telephone numbers in some countries. Any leading zeros are removed by Stata and therefore have to be stored as a string.<br />
<br />
'''Non-categorical text'''. Text answers that cannot be converted into categories need to be stored as strings. One example is open-ended questions. Open-ended questions should, in general, be avoided, but sometimes the questionnaire asks the respondent to answer a question in his or her own words, and then that answer has to be stored as strings. Another example is if the respondent is asked to specify the answer after answering ''Other'' in a multiple choice question. A different example where string format is needed is some cases of proper names, for example, the name of the respondent. Not all proper names should be stored as string as some can be made into categories. For example, if you collect data on respondents and multiple respondents live in the same villages, then the variable with the village names should be converted into a categorical numeric variable and have a value label assigned. See the section on value labels below.<br />
<br />
=== Labels ===<br />
There are several ways to add helpful descriptive text to a data set in Stata, but the two most common and important ways are variables labels and value labels.<br />
<br />
'''Variable Labels'''<br />
All variables in a clean data set should have variable labels describing the variable. The label can be up to 80 characters long so there is a limitation to how much information can be included here. In addition to a brief explanation of the variable, it is usually good to include information such as unit or currency used in the variable and other things that are not possible to read from the values themselves.<br />
<br />
'''Value Labels'''<br />
Categorical variables should always be stored numerically and have value labels that describe what the numeric code represents. For example, yes and no questions should be stored as 0 and 1 and have the label ''No'' for data cells with 0, and the label ''Yes'' for all data cells with 1. This should be applied to all multiple choice variables.<br />
<br />
There are tools in Stata to convert categorical string variables to a categorical numeric variable where the strings are automatically applied as value labels. The most common tool is the command <code>encode</code>. However, if you use <code>encode</code>, you should always use the two options <code>label()</code> and <code>noextend</code>. Without these two options, Stata assigns a code to each string value in alphabetic order. There is no guarantee that the alphabetic order is changed when observations are added or removed, or if someone else makes changes earlier in the code. <code>label()</code> forces you to manually create the label before using encode (this requires some manual work but it is worth it). <code>noextend</code> throws an error if there is a value in the data that does not exist in the pre-defined label. This way you are notified that you need to add the new value to the value label you created manually. Or you can change the string value if there is a typo or similar that is the reason why that string value was not assigned a value label.<br />
<br />
== Additional Resources ==<br />
* <br />
<br />
[[Category: Data Cleaning ]]</div>501238https://dimewiki.worldbank.org/index.php?title=Data_Cleaning&diff=4414Data Cleaning2018-02-12T17:54:18Z<p>501238: /* ID Variables */</p>
<hr />
<div><span style="font-size:150%"><br />
</span><br />
</span><br />
<br />
Data cleaning is an essential step between data collection and data analysis. Raw primary data is always imperfect and needs to be prepared so that it is easy to use in the analysis. This is the high level goal of Data Cleaning. In extremely rare cases, the only preparation needed is to document the data set, for example - by using labels. However, in the vast majority of cases, there are many small things that need to be addressed in the data set itself. This could both be addressing data points that are incorrect, or replacing values that are not real data points, but codes explaining why there is no real data point.<br />
<br />
== Read First ==<br />
<br />
*See this check list (not done yet) that can be used to make sure that common cleaning actions have been done when applicable.<br />
*As a [[Impact_Evaluation_Team#Research_Assistant|Research Assistant]] (RA) or [[Impact_Evaluation_Team#Field_Coordinator|Field Coordinator]] (FC), do not spend time trying to fix irregularities in the data at the expense of not having time to identifying as many irregularities as possible.<br />
*The quality of the analysis will never be better than the quality of data cleaning.<br />
*There is no such thing as an exhaustive list of what to do during data cleaning as each project will have individual cleaning needs, but this article provides a very good place to start.<br />
*After finishing the data cleaning for each round of data collection, data can be [[Publishing Data|released]]<br />
<br />
== The Goal of Cleaning ==<br />
<br />
There are two main goals when cleaning the data set:<br />
<br />
#Cleaning individual data points that invalidate or incorrectly bias the analysis.<br />
#Preparing a clean data set so that it is easy to use for other researchers. Both for researchers inside your team and outside your team.<br />
<br />
[[File:Picture2.png|700px|link=|center]]<br />
Another overarching goal of the cleaning process is to understand the data and the data collection really well. Much of this understanding feeds directly into the two points above, but a really good data cleaning process should also result in documented lessons learned that can be used in future data collection. Both in later data collection rounds in the same project, but also in data collections in other similar projects.<br />
<br />
=== Cleaning individual data points ===<br />
<br />
In impact evaluations, our analysis often come down to test for statistical differences in the mean between the control group and any of the treatment arms. We do so through regression analysis where we include control variables, fixed effects, and different error estimators, among many other tools. In essence, though, one can think of it as an advanced comparison of means. While this is far from a complete description of impact evaluation analysis, it might give the person cleaning a data set for the first time a framework on what cleaning a data set should achieve.<br />
<br />
It is difficult to have an intuition for the math behind a regression, but it easy to have an intuition for the math behind a mean. Anything that biases a mean will bias a regression, and while there are many more things that can bias a regression, this is a good place to start for anyone cleaning a data set for the first time. The researcher in charge of the analysis is trained in what else that needs to be done for the specific regression models used. The articles linked to below will go through specific examples, but it is probably obvious to most readers that outliers, typos in data, survey codes (often values like -999 or -888) etc. bias means, so it is never wrong to start with those examples.<br />
<br />
=== Prepare a clean data set ===<br />
<br />
The second goal of the data cleaning is to document the data set so that variables, values, and anything else is as self-explanatory as possible. This will help other researchers that you grant access to this data set, but it will also help you and your research team when accessing the data set in the future. At the time of the data collection or at the time of the data cleaning, you know the data set much better than you will at any time in the future. Carefully documenting this knowledge so that it can be used at the time of analysis is often the difference between a good analysis and a great analysis.<br />
<br />
== Role Division during Data Cleaning ==<br />
As a [[Impact_Evaluation_Team#Research_Assistant|Research Assistant]] (RA) or [[Impact_Evaluation_Team#Field_Coordinator|Field Coordinator]] (FC), spend time identifying and documenting irregularities in the data. It is never bad to suggest corrections to irregularities, but a common mistake RAs or FCs do is that they spend too much time on trying to fix irregularities at the expense of having enough time to identify and document as many as possible. One major reason for that is that different regression models might require different ways to correct issues and this is often a perspective only the PI has. In such cases, much time might have been spent on coming up with a correction that is not valid given the regression model used in the analysis.<br />
<br />
Eventually the [[Impact_Evaluation_Team#Principal_Investigator|Principal Investigator]] (PI) and the RA or FC will have a common understanding on what correction calls can be made without involving the PI, but until then, it's recommended that the RA focus her/his time on identifying and documenting as many issues as possible rather than spending a lot of time on how to fix the issues. It is no problem to do both as long as the fixing doesn't happen at the cost of identifying as many issues as possible.<br />
<br />
== Import Data ==<br />
<br />
The first step in cleaning the data is to import the data. If you work with secondary data (data prepared by someone else) then this step is often straightforward, but this is a step often underestimated when working with primary data. It is very important for any change, no matter how small, to always be made in Stata (or in R or any other scripting language). Even if you know that there are incorrect submissions in your raw data (duplicates, pilot data mixed with the main data etc.), those deletions should always be done in such a way that they can be replicated by re-running code. Without this information, the analysis might no longer be valid. See the article on [[DataWork_Survey_Round#Raw_Folder|raw data folders]] for more details.<br />
<br />
=== Importing Primary Survey Data ===<br />
<br />
All modern CAPI survey data collections tools provided methods for importing the raw data in a way that drastically reduces the amount of work that needs to be done when cleaning the data. These methods typically include a Stata do-file that generates labels and much more from the questionnaire code and then applies that to the raw data as it is being imported. If you are working in SurveyCTO see this article on [[SurveyCTO Stata Template | SurveyCTO's Stata Template]].<br />
<br />
== Examples of Data Cleaning Actions ==<br />
<br />
The material in this section has been generated with primary survey data in mind, although a lot of these practices are also applicable when cleaning other types of data sets.<br />
<br />
'''Data Cleaning Check List'''. This is a check list that can be used to make sure that all common aspects of data cleaning has been covered. Note that this is not an exhaustive list. Such a list is impossible to create as the individual data sets and the analysis methods used on them all require different cleaning that in the details depends on the context of that data set.<br />
<br />
===ID Variables===<br />
It's important that the clean dataset be uniquely and fully identifiable by a single variable. It often is the case that when [[Primary Data Collection|primary data]] is imported, there are [[Duplicates and Survey Logs|duplicated entries]]. These cases must be carefully documented, and should only be corrected after discussing with the [[Field Coordinator]] and field team what caused them, so the right observations are kept in the dataset. [[ieduplicates]], a command in [[Stata Coding Practices#ietoolkit|ietoolkit]] is a useful command to identify and correct duplicated entries. Once duplicates are corrected, the observations can be linked to the [[Master Dataset]], and the dataset, [[De-identification|de-identified]].<br />
<br />
=== Incorrect Data and Other Irregularities ===<br />
<br />
There are countless ways that irregularities can appear in a primary data set, so it is impossible to make an exhaustive list of what should be done. This section gives a few examples:<br />
<br />
'''Outliers'''. There are many rules of thumb for how to define an outlier, but there is no silver bullet. One rule of thumb is to treat as an outlier any data point more than three standard deviations from the mean of that variable across all observations. This can be a starting point, but one always needs to qualitatively consider whether it is the correct approach. Observations with outliers should not be dropped, but in some cases the data point is replaced with a missing value. There are often better approaches. One common approach is winsorization, where any values above a certain percentile, often the 99th, are replaced with the value at that percentile. This prevents very large values from biasing the mean. Winsorization also has an equality-of-impact aspect: if all the benefit of a project went to a single observation in the treatment group, the mean would still be high, but that is rarely a desired outcome in development. Winsorization therefore penalizes an inequitable distribution of a project's benefits.<br />
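<br />
A minimal sketch of both approaches, assuming a hypothetical continuous variable <code>income</code> (results are stored in locals so they are not overwritten between commands):<br />
<pre>
* Flag values more than three standard deviations from the mean
summarize income
local mean = r(mean)
local sd   = r(sd)
gen byte income_outlier = abs(income - `mean') > 3 * `sd' if !missing(income)

* Winsorize at the 99th percentile into a new variable, leaving the
* original variable unchanged
summarize income, detail
local p99 = r(p99)
gen income_w99 = income
replace income_w99 = `p99' if income > `p99' & !missing(income)
</pre>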
<br />
'''Illogical Values'''. These are tests of whether one data point is possible in relation to another. For example, if a respondent is male, then the respondent cannot answer that he is pregnant. Simple cases like this can and should be programmed into the questionnaire so that they never occur, but no questionnaire can ever be pre-programmed to control for every such case.<br />
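<br />
A minimal sketch, assuming hypothetical variables <code>gender</code> (1 = male) and <code>pregnant</code> (1 = yes) and an ID variable <code>hhid</code>:<br />
<pre>
* Flag, rather than silently overwrite, combinations that cannot be true
gen byte flag_male_pregnant = (gender == 1 & pregnant == 1)

* List the flagged observations so they can be documented and discussed
list hhid gender pregnant if flag_male_pregnant == 1
</pre>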
<br />
'''Typos'''. If it is obvious beyond any doubt that a response is incorrect due to a simple typo, then it is a good idea to correct the typo, as long as it is done in a documented and reproducible way.<br />
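<br />
For example, a documented, reproducible correction might look like this (the variable names, values, and explanation are hypothetical):<br />
<pre>
* Respondent 1042: age recorded as 272; confirmed with the field team that
* the enumerator typed an extra digit
replace age = 27 if hhid == 1042 & age == 272
</pre>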
<br />
=== Survey Codes and Missing Values ===<br />
<br />
Almost all data collection done through surveys of any sort allows the respondent to answer something like "Do not know" or "Declined to answer" for individual questions. These answers are usually recorded using survey codes in the format -999, -888 or something similar. These numbers will obviously bias means and regressions if they are left as they are, so they must be replaced with missing values in Stata.<br />
<br />
Stata has several missing values. The most well known is the regular missing value, represented by a single ".", but we would lose the difference in meaning between "Do not know" and "Declined to answer" if both codes were replaced with the regular missing value. Stata offers a solution with its extended missing values, represented by ".a", ".b", ".c" and so on, all the way to ".z". Stata handles these values the same as "." in commands that expect a numeric value, but they can be labeled differently, so the original information is not lost. Make sure that each letter ".a", ".b", etc. always represents only one thing across your project. The missing values should be assigned value labels so that they can be interpreted. See the [http://www.stata.com/manuals13/u12.pdf#u12.2 Stata Manual on Missing Values] for more details.<br />
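<br />
A minimal sketch using Stata's built-in <code>mvdecode</code>, assuming the survey used -999 for "Do not know" and -888 for "Declined to answer" consistently across all variables:<br />
<pre>
* Recode the survey codes to extended missing values in all numeric variables
ds, has(type numeric)
mvdecode `r(varlist)', mv(-999 = .a \ -888 = .b)

* Extended missing values can be labeled like any other value
label define yesno 0 "No" 1 "Yes" .a "Do not know" .b "Declined to answer"
label values consented yesno    // consented is a hypothetical variable
</pre>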
<br />
Missing values can be used for much more than just survey codes. Any value that we remove because we found out it is incorrect should be replaced with a missing value. In a [[Master Data Set]], there should be no regular missing values: every missing value in a master data set should carry an explanation of why we do not have that information for that observation.<br />
<br />
=== No Strings ===<br />
<br />
All data should be stored in numeric format. There are multiple reasons for this, but the two most important are that (1) numbers are stored more efficiently and (2) many Stata commands expect values to be stored numerically. Categorical string variables should be stored as numeric codes and have value labels assigned.<br />
<br />
There are two exceptions where string variables are allowed, and both are described below:<br />
<br />
'''Numbers that cannot be stored correctly numerically'''. There are two cases of this exception. The first is a number more than 15 digits long, which can happen with some national IDs, as Stata cannot store more than roughly 15 significant digits numerically. If a continuous variable has more digits than that, it should be rounded or converted to a different scale, since 16 digits of precision is rarely meaningful; an ID, for obvious reasons, cannot be rounded and must be stored as a string instead. The second case is numbers starting with a zero, which occurs in some national IDs and in telephone numbers in some countries. Stata removes any leading zeros from numeric values, so these numbers also have to be stored as strings.<br />
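<br />
If such a number was mistakenly imported as numeric, the leading zeros can sometimes be restored when converting it to a string, provided the intended length is known. A minimal sketch, assuming hypothetical 10-digit phone numbers:<br />
<pre>
* %010.0f pads the number back out to 10 digits with leading zeros
gen phone_str = string(phone, "%010.0f")
drop phone
</pre>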
<br />
'''Non-categorical text'''. Text answers that cannot be converted into categories need to be stored as strings. One example is open-ended questions. These should in general be avoided, but sometimes the questionnaire asks the respondent to answer in his or her own words, and then that answer has to be stored as a string. Another example is when the respondent is asked to specify an answer after selecting ''Other'' in a multiple choice question. A further example is some cases of proper names, such as the name of the respondent. Not all proper names should be stored as strings, as some can be made into categories. For example, if multiple respondents live in the same villages, then the variable with the village names should be converted into a categorical numeric variable with a value label assigned. See the section on value labels below.<br />
<br />
=== Labels ===<br />
There are several ways to add helpful descriptive text to a data set in Stata, but the two most common and important ways are variable labels and value labels.<br />
<br />
'''Variable Labels'''<br />
All variables in a clean data set should have variable labels describing the variable. A label can be up to 80 characters long, so there is a limit to how much information it can include. In addition to a brief explanation of the variable, it is usually good to include information such as the unit or currency used and anything else that cannot be read from the values themselves.<br />
<br />
'''Value Labels'''<br />
Categorical variables should always be stored numerically and have value labels describing what each numeric code represents. For example, yes/no questions should be stored as 0 and 1, with the label ''No'' for cells with 0 and ''Yes'' for cells with 1. This should be applied to all multiple choice variables.<br />
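<br />
A minimal sketch of both kinds of labels, using hypothetical variable names:<br />
<pre>
* Variable label: a short description plus details that cannot be read
* from the values themselves
label variable income "Household income last month (local currency)"

* Value label: define it once, then apply it to every variable that
* uses the same codes
label define yesno 0 "No" 1 "Yes"
label values owns_phone yesno
</pre>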
<br />
There are tools in Stata to convert a categorical string variable to a categorical numeric variable where the strings are automatically applied as value labels. The most common is the command <code>encode</code>. However, if you use <code>encode</code>, you should always use the two options <code>label()</code> and <code>noextend</code>. Without them, Stata assigns a code to each string value in alphabetic order, and there is no guarantee that this alphabetic order stays the same when observations are added or removed, or if someone makes changes earlier in the code. <code>label()</code> forces you to manually create the value label before using <code>encode</code> (this requires some manual work, but it is worth it). <code>noextend</code> throws an error if there is a value in the data that does not exist in the pre-defined label. This way you are notified that you need to add the new value to the value label you created manually, or to correct the string value if a typo or similar is the reason it was not assigned a code.<br />
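<br />
A minimal sketch, assuming a hypothetical string variable <code>village</code> whose expected values are known in advance:<br />
<pre>
* Manually define the value label first, so the codes are stable by design
label define village_lbl 1 "Village A" 2 "Village B" 3 "Village C"

* noextend throws an error if village contains a value missing from the
* label, instead of silently assigning it a new code
encode village, generate(village_code) label(village_lbl) noextend
drop village
</pre>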
<br />
== Additional Resources ==<br />
<br />
[[Category: Data Cleaning ]]</div>501238https://dimewiki.worldbank.org/index.php?title=Data_Documentation&diff=4411Data Documentation2018-02-12T17:41:52Z<p>501238: </p>
<hr />
<div>Documenting any aspect of the data work that may affect the analysis is a crucial part of dealing with data. Impact evaluation projects often take years to complete and are executed by large teams. If the data work is not documented while it is ongoing, many details will likely be lost, and a considerable amount of time will be spent trying to understand what was previously done. For example, say it became clear during field work that some respondents did not understand a test that was applied because they had reading difficulties. If the [[Impact Evaluation Team#Field Coordinator | field coordinator]] did not document this issue, the [[Impact Evaluation Team#Research Assistant | research assistant]] will not know to flag these respondents during [[Data Cleaning | data cleaning]]. And if the [[Impact Evaluation Team#Research Assistant | research assistant]] does not document why the observations were flagged and what the flag means, they may not be correctly dealt with during [[Data Analysis | analysis]].<br />
<br />
There are different ways to document data work. One widespread practice is to e-mail reports of issues to the team. Though this is easily done, it makes answers time-consuming to find later in the project, since someone on the team has to remember that an e-mail was sent and then dig it up. For data cleaning, data analysis and variable construction, it is best practice to document the data work through comments in the code. However, even though such comments are very helpful to someone reading the code carefully, if they are not documented elsewhere it may still take a long time to go through all the do-files to find the answer to a specific question. It is usually advisable to keep all data work documentation in one file or folder, though how it is structured and when, how and by whom it is updated will vary from one project to another. One advantage of submitting code for [[Code Review| code review]] and depositing data in the [[Microdata Catalog | microdata catalog]] is that in both cases the data work documentation will be reviewed, though this does not guarantee that everything that should be documented actually is, as reviewers cannot ask about issues unknown to them.<br />
<br />
== Read first ==<br />
<br />
== Field Work Documentation ==<br />
=== Sampling ===<br />
* Sample selection<br />
* Replacement criteria<br />
<br />
=== Field work dates === <br />
<br />
=== Tracking respondents === <br />
* Total number of respondents listed<br />
* Total number of respondents visited<br />
* Refusal rates<br />
* Total number of respondents in final sample<br />
<br />
=== Issues in the field ===<br />
Report any problems that occurred during the administration of the survey (strikes, inclement weather, inability to enter parts of the country).<br />
<br />
== Data Cleaning Documentation ==<br />
=== Outliers ===<br />
=== Inconsistencies === <br />
=== Survey Codes and Missing values ===<br />
<br />
== Variables Construction Documentation ==<br />
=== Sampling ===<br />
=== Weights and expansion factors=== <br />
=== Outliers ===<br />
=== Inconsistencies === <br />
=== Variables definition === <br />
=== References === <br />
<br />
== Datasets Documentation == <br />
=== Dataset creation === <br />
=== Linking data sets === <br />
<br />
== Additional Resources ==<br />
<br />
<br />
[[Category: Data Cleaning]][[Category: Data Management]]</div>501238https://dimewiki.worldbank.org/index.php?title=ID_Variable_Properties&diff=4410ID Variable Properties2018-02-12T17:37:16Z<p>501238: /* First property: Uniquely Identifying */</p>
<hr />
<div>An ID variable that identifies an observation should have the properties listed below. Note that this relates to the ID variable that identifies observations across data sets in our project folder. Some commands in Stata, for example <code>reclink</code>, require temporary ID variables (in <code>reclink</code>'s case through the options <code>idmaster()</code> and <code>idusing()</code>), and ID variables created temporarily for such commands do not need to have all of these properties.<br />
<br />
== Read First ==<br />
* The first and second properties are the ones everyone should always test for before working with a new data set. The third, fourth and fifth properties are more relevant when creating an ID variable or when assigning values to newly encountered observations.<br />
* Only keep one variable as the ID variable. One common exception to this rule is a panel data set, where the combination of the ID variable and a time variable identifies the observations.<br />
<br />
==First property: Uniquely Identifying==<br />
<br />
The first and second properties are the most commonly referred-to properties of an ID variable. An ID variable is uniquely identifying when there are no [[Duplicates and Survey Logs|duplicates]] -- that is, no two observations share a value in the ID variable. The next paragraph shows that this is easy to test for a single data set; however, the first property does not only apply to a single data set, it applies to the full project. To test the first property for a full project, one must first make sure that all observations have been added to the [[Master Data Set|master data set]], and then test the first property on the master data set as described in the next paragraph.<br />
<br />
There are several ways to test for this in Stata, for example <code>duplicates report idvar</code>, where <code>idvar</code> is the ID variable. It is also possible to test the first property using <code>isid idvar</code>. While <code>duplicates report</code> provides a more informative output, <code>isid</code> is a quick and easy way to test for both the first and the second property.<br />
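<br />
A minimal sketch, assuming the ID variable is named <code>hhid</code>:<br />
<pre>
* Tabulate how many observations share each value of hhid
duplicates report hhid

* isid exits with an error unless hhid uniquely and fully identifies the
* data; in a panel, test the combination instead, e.g. isid hhid round
isid hhid
</pre>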
<br />
==Second property: Fully Identifying==<br />
<br />
An ID variable is fully identifying when all observations have a value in the ID variable. Like the first property, this is very easy to test on a single data set; for a full project, the test depends on how well the master data set has been kept up to date. If all observations in all data sets have been added to the master data set, then they should all have been given a value in the ID variable, but each time you modify the master data set you should test for this property to be sure.<br />
<br />
There are several ways to test for this in Stata, but the command <code>isid idvar</code>, where <code>idvar</code> is the ID variable, is often used as it tests for both the first and the second property. Note that a missing value should not be used as an ID value, even though a missing value technically could identify a single observation. Missing values imply that the information is missing, so <code>isid</code> treats a missing value as if the ID variable does not fully identify the data set.<br />
<br />
==Third property: Constant Across a Project==<br />
<br />
The third property says that no observation should have different IDs in different data sets. Data sets collected from different sources might have different IDs when they are first included in the project, but one ID variable should be made the dominant one, and if there is any reason to keep the other ID variable in the data set at all, it should be clearly marked as not being the project's main ID variable.<br />
<br />
There is no specific test for this; it is a rule to follow when creating an ID variable. If the best practice of carefully adding all observations to the master data set is followed, then that usually ensures that no observation has two values in the ID variable, and it is also easy to keep only the same primary ID variable in all data sets after the observations have been added to the master data set.<br />
<br />
==Fourth property: Constant Throughout the Duration of a Project==<br />
<br />
The fourth property is similar to the third, but it says that the same observation should have the same value in the ID variable throughout the project. The ID that an observation was assigned at baseline (or whenever it was assigned) should not be changed during the rest of the project. One exception to this rule is obviously when we find a mistake in the ID variable. This hopefully happens rarely, as it is very labor-intensive to go over all do-files in a project to make sure that no values have to be updated for the code to work as intended.<br />
<br />
Another example is if the format of the ID variable needs to be extended because a project runs out of IDs. This is one of the rare cases where it could be justified, though never best practice, to have more than one ID variable. Here it might be a good idea to create a new ID variable whose value is based on the old value, for example by adding two more digits. The old ID variable can then be kept so that old code does not have to be updated. It is best practice to update all references to the old ID variable to the new one, but this can be infeasible if it would take too much time.<br />
<br />
==Fifth property: Anonymous IDs==<br />
<br />
The fifth property is less a requirement and more a good practice. Sometimes we have access to IDs that satisfy all four properties above, but we should be very careful before using them. Examples include individual national IDs, public company IDs, a hospital's patient IDs, etc. Since records using those IDs are available to people outside our team, there is no way for us to guarantee that we can protect the privacy of the data we collect if we use them. In all such cases we need to create our own ID that has no association with the ID variable created by someone else, is unique to our project, and is therefore an anonymous ID that identifies the observation only to us. In the master data set we can include the other ID to enable us to merge data quickly, but then the information in the master data set becomes even more sensitive than usual.<br />
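<br />
A minimal sketch of creating such an anonymous ID in the master data set (the variable names are hypothetical, and the seed is set so the assignment is reproducible):<br />
<pre>
* Sort on the external ID only to make the starting order reproducible
isid national_id, sort

* Assign project IDs in a random order so they reveal nothing about
* the sensitive external ID
set seed 20180212
gen double rand = runiform()
sort rand
gen long project_id = _n
drop rand
</pre>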
<br />
There is an exception to this rule that can simplify the data work, but it should only be used after careful consideration. If a project has a high-level unit of observation for which the project team is absolutely certain that it will not collect sensitive data, and there is an official code for it, then we can sometimes use this code. This could, for example, be done for districts or regions so that we can more easily include publicly available data on those districts or regions. However, if there is any probability that we would include data that is not publicly available, for example district budgets, then we need to make our own ID variable even for these units of observation. Also, if we have a unit of observation where one or more instances have only a few observations of another level mapped to them, for example a school with few students or a village with few households, then we have to create anonymous IDs for ''all'' instances at that level -- not just that one school or village, but all schools or villages. Otherwise the ID of the school or village can be used to work out who each of those students or households is, even though the student ID and the household ID are anonymous.<br />
<br />
It is never incorrect to create an anonymous ID, so if there is any uncertainty whether a public ID can be used, then always go for the anonymous option.<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[Data Management]]<br />
<br />
<br />
== Additional Resources ==<br />
<br />
[[Category: Data Management ]]</div>501238https://dimewiki.worldbank.org/index.php?title=Collaboration_Tools&diff=4409Collaboration Tools2018-02-12T16:19:16Z<p>501238: /* GitHub */</p>
<hr />
<div>== Introduction == <br />
<br />
Collaboration among a research team is a critical part of nearly all impact evaluations. With the general availability of low-cost cloud collaboration tools, it is important to use tools that effectively achieve the goals of sharing access to data and content, while protecting the privacy, integrity, and history of that content and imposing as low a learning cost as possible on other collaborators.<br />
<br />
==Collaboration tools for data analysis ==<br />
===GitHub===<br />
[https://github.com/ GitHub] allows multiple data analysts to work on the same project simultaneously by keeping a local copy of all analysis code and merging final versions together in a centralized repository. GitHub also preserves the full version history of code and outputs, making it easy to recover old code snippets after they are deleted from the main production branch, as well as to make them publicly accessible so others can review analyses that did not appear in the final publication.<br />
<br />
===Cloud Sync Tools===<br />
Cloud Sync tools, such as Dropbox, are commonly used to share code, data, and outputs with collaborators. Unlike GitHub, sync tools have limited version histories and typically do not allow multiple collaborators to work simultaneously on the same file without version conflicts arising. However, they bear more similarity to traditional filesystem structures, meaning the learning curve is nearly zero for working on files in sync tools.<br />
<br />
== Collaboration tools for paper writing==<br />
===Overleaf===<br />
[https://www.overleaf.com/ Overleaf] is a web-based LaTeX collaboration tool that allows multiple authors to simultaneously edit documents. It maintains a folder structure containing a main document, a bibliography document, and images and other resources. It supports limited integrations with Git and Dropbox, and is currently under active redevelopment following a merger with ShareLaTeX, a similar service.<br />
<br />
While Overleaf is based on a LaTeX structure, it now offers a "Google-Docs-like" editor and limited comments and version histories, making it easier to collaborate with coauthors who are more comfortable with [https://en.wikipedia.org/wiki/WYSIWYG WYSIWYG] editors like Word.<br />
<br />
== Additional Resources ==<br />
<br />
[[Category: Collaboration Tools]]</div>
<br />
[[Category: Collaboration Tools]]</div>501238https://dimewiki.worldbank.org/index.php?title=Microdata_Catalog&diff=4406Microdata Catalog2018-02-12T16:15:58Z<p>501238: /* Additional Resources */</p>
<hr />
<div>The [http://microdata.worldbank.org/index.php/home Microdata Library] is an online platform that offers free access to microdata produced not only by the World Bank, but also by other international organizations, statistical agencies, and other actors in developing countries. It includes datasets from surveys implemented as part of impact evaluations and research on development, as well as administrative data.<br />
<br />
== Read first ==<br />
* Datasets published in the Microdata Library are typically survey data. The [[Checklist: Microdata Catalog submission|Microdata Catalog Checklist]] lists data format and documentation requirements, and gives instructions on how to deposit data.<br />
* When submitting data, it is recommended to include as much information about the study and the data as possible. This reduces the number of future queries from both catalog staff preparing the data and users trying to properly understand the survey process.<br />
* We recommend submitting the data as soon as it is collected, so that all relevant information is documented and safely stored, making transitions between team members easier and reducing the risk of forgetting details by the time the analysis is done.<br />
* As part of the submission process, it is possible to choose from different access conditions under which the data will be shared. It is also possible to make changes to the access terms, as well as to the data, after the initial submission.<br />
<br />
== Guidelines for submission ==<br />
Submission to the Microdata Library is done after the initial data cleaning for a round of data collection is finished. That means one impact evaluation may have different rounds published in the catalog, for example baseline, midline and endline. Datasets submitted to the [http://microdata.worldbank.org/index.php/home Microdata Library] must be de-identified and accompanied by data documentation and study description. The [[Checklist: Microdata Catalog submission|Microdata Catalog Checklist]] lists data format and documentation requirements as well as instructions on how to deposit datasets.<br />
<br />
World Bank staff can deposit their data directly through the online [http://microdatalib.worldbank.org/index.php/home Data Deposit Application]. Data can also be deposited by external researchers if necessary. Deposits originating from outside the Bank should include the names and contact details of the World Bank staff the depositors are working with on their project. It is also recommended that the approver of access to licensed data be a World Bank staff member who can be easily contacted by the Microdata Library team. If the survey is owned by someone other than the World Bank, then in addition to the deposit form, documentation is needed from the data provider, signed by an authorized signatory, explicitly authorizing the Microdata Library to disseminate the related survey data and specifying the access type.<br />
<br />
=== Datasets ===<br />
Data may be uploaded in different formats, including Stata, SPSS and SAS, and must be [[De-identification | de-identified]] and minimally [[Data Cleaning | cleaned]]. The required cleaning aims to provide a clear indication of what information is to be found in any given variable, so both variable and value labels must be present, including [[Data Cleaning#Survey Codes and Missing Values | labels for extended missing values]]. To protect the confidentiality of respondents, all [[De-identification#Personally Identifiable Information | Personally Identifiable Information]] must be removed. Variables containing sensitive information such as PII can be flagged in the "Data Distribution" section to indicate that they should not be distributed.<br />
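<br />
As a minimal sketch of these labeling requirements in Stata (the variable names and labels here are hypothetical), a well-prepared dataset labels every variable and attaches value labels, including labels for extended missing values:<br />
<pre>
* Label variables so users can tell what each one contains
label variable hh_income "Household income in the last 30 days (USD)"

* Define value labels, including extended missing values
* such as .a (don't know) and .b (refused to answer)
label define yesno 0 "No" 1 "Yes" .a "Don't know" .b "Refused"
label values owns_land yesno
</pre>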
<br />
=== Supporting documents ===<br />
All relevant material that would allow the users to better understand the data and interpret the results should be included. A non-comprehensive list of documents that may be relevant is included below. Note that some of the material in the list below may contain sensitive information (for example in the form of options listed in the questionnaire), so it should also be checked and de-identified.<br />
* Questionnaires (a paper-format equivalent is preferable to the CAPI form)<br />
* Enumerator manuals<br />
*[[Data Documentation#Field work documentation | Fieldwork documentation ]]<br />
* Methodology description<br />
*[[Data Documentation#Data cleaning documentation | Data cleaning documentation]]<br />
*[[Data Documentation#Construct documentation | Variables construction documentation]], if applicable<br />
* Outputs such as reports, presentations, publications and papers<br />
<br />
=== Study description ===<br />
During submission, it is necessary to fill in a form collecting information about the survey (metadata). Not all fields are mandatory, but providing as much information as possible makes it easier for users to understand and explore the data. This reduces the number of future queries from both catalog staff preparing the data and users trying to properly understand the survey process.<br />
<br />
*Mandatory Fields:<br />
** Title<br />
** Country<br />
** Dates of Data Collection<br />
** Access policy<br />
** Catalog where the data should be published<br />
<br />
* Recommended Fields:<br />
** Abstract<br />
** Geographic Coverage<br />
** Primary Investigator<br />
** Funding<br />
** Sampling Procedure<br />
** Weighting<br />
<br />
=== Access conditions ===<br />
The World Bank Microdata Library disseminates data under the [https://data.worldbank.org/summary-terms-of-use Microdata Terms of Use for the World Bank]. When submitting data, it is possible to indicate whether the datasets should be available only to World Bank staff or also to external users. It is also possible to embargo any data submitted for a specified period of time. To protect the confidentiality of individual information and to meet the requirements of the data owners who provide the microdata, there are six principal [http://microdata.worldbank.org/index.php/terms-of-use types of access] that may be applied:<br />
<br />
* '''Open access''': this is the least restrictive access policy. Datasets and the related documentation are available to users for commercial and non-commercial purposes at no cost. There is no need to be logged in to the application.<br />
<br />
* '''Direct access''': relevant datasets and the related documentation are made freely available to registered and unregistered users for statistical and scientific research purposes only, and may not be redistributed. Any publications employing this type of data must cite the source, in line with the citation requirement provided with the dataset.<br />
<br />
* '''Public Use Files''': PUFs are available to anyone agreeing to respect a core set of easy-to-meet conditions. These data are made easily accessible because the risk of identifying individual respondents or data providers is considered to be low. Terms of use are the same as for direct access, but users are required to register before obtaining the datasets.<br />
<br />
* '''Licensed files''': files whose dissemination is restricted to bona fide users. Access is granted to authenticated users who have received authorization after submitting a documented application and signing an agreement governing the data's use. These users must be acting on behalf of an organization, which must take responsibility for the use. To release data under this license, a World Bank staff member must be indicated as the point of contact who grants access to the data. That person will be contacted by the data catalog manager, who works with the team to approve or reject requests.<br />
<br />
* '''External Repositories''': the World Bank Microdata Library operates both as a data catalog for World Bank-owned or World Bank-licensed data and as a portal to data held in a number of external repositories. The aim of the Microdata Library is to provide users with the most comprehensive catalog of development-related microdata possible. To this end, studies conducted and owned by other institutions, as well as links to those studies, are listed in the Microdata Library Catalog. Datasets provided by external agencies are not owned or controlled by the World Bank and have their own conditions of use. When a user accesses external repositories, the terms governing those repositories govern access to their data.<br />
<br />
* '''No access''': some datasets have no access policy defined, or are not accessible. In some limited situations, a limited number of such datasets may be included for the sake of completeness and to provide access to questionnaires and reports. Note that datasets with no access will not be published on the external-facing catalog.<br />
<br />
== Collections available ==<br />
The Microdata Library operates as a portal for datasets originating from the World Bank and other international, regional and national organizations. These contributions make up the Central Microdata Catalog, which can also be viewed and searched by collection. When submitting data to the Catalog, it is necessary to specify in which collection it should be filed. Impact evaluation surveys are filed in the Impact Evaluation Survey Collection, even if treatment variables are temporarily embargoed.<br />
<br />
* '''World Bank catalogs'''<br />
** Global Financial Inclusion (Global Findex) Database<br />
** Service Delivery Facility Surveys<br />
** The STEP Skills Measurement Program<br />
** The World Bank Group Country Opinion Survey Program (COS)<br />
** Development Research Microdata<br />
** Enterprise Surveys<br />
** Impact Evaluation Surveys<br />
** Living Standards Measurement Study (LSMS)<br />
** Migration and Remittances Surveys<br />
<br />
*''' External catalogs'''<br />
**Global Health Data Exchange (GHDx), Institute for Health Metrics and Evaluation (IHME)<br />
**Integrated Public Use Microdata Series (IPUMS) International<br />
**MEASURE DHS: Demographic and Health Surveys<br />
**Millennium Challenge Corporation (MCC)<br />
**UNICEF Multiple Indicator Cluster Surveys (MICS)<br />
**WHO’s Multi-Country Studies Programmes<br />
**DataFirst, University of Cape Town, South Africa<br />
<br />
== Releasing data before publication == <br />
One common concern among researchers is under which conditions to submit data from studies that are still ongoing and whose results have not yet been published. There are several options available.<br />
<br />
First of all, it is best practice to submit the data as soon as it is collected. The review process guarantees that documentation is submitted, reducing the risk of forgetting important details about how the data was processed, an issue that often arises if the analysis is only carried out once the intervention is completed, or after endline data is collected. Furthermore, once deposited, the data is safely stored, reducing the less likely, but even more worrying, risk of losing data altogether. Depositing data early on also makes transitions between team members smoother, so less information is lost over time.<br />
<br />
The different access conditions can be used to withhold from release any information that may create issues if made public prior to publication. One possibility is to submit a dataset and embargo any treatment assignment variables until results are published, meaning that these variables only become available to users after an established date. If this solution is chosen, it is important to indicate in the documentation that such variables have been removed and will be released at a future date, as users expect treatment variables to be present in impact evaluation datasets. It is recommended to add a clause indicating that no conclusions on impact can be drawn until all rounds are published.<br />
<br />
Alternatively, it is also possible to embargo the whole data set, making it "no access". In this case, the metadata will not be made available to the external audience and may not be available to World Bank staff if the embargo applies internally as well. Another option is to first submit the "censored" version of the data, without treatment variables, and update the submission to include all variables after publication. <br />
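<br />
As a minimal sketch of the "censored" option in Stata (the file and variable names here are hypothetical), a public version of a round can be produced by dropping the embargoed treatment assignment before submission:<br />
<pre>
* Create a censored public version of the baseline data,
* withholding treatment assignment until results are published
use "baseline_clean.dta", clear
drop treatment_arm                  // embargoed variable
save "baseline_public.dta", replace
</pre>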
<br />
== DIME Datasets on Microdata Catalog ==<br />
*[[DIME_Datasets_on_Microdata_Catalog#By_region|By region]]<br />
*[[DIME_Datasets_on_Microdata_Catalog#By_area|By area]]<br />
<br />
<br />
==Back to Parent==<br />
This article is part of the topic [[Publishing Data]]<br />
<br />
[[Category:Publishing Data]]</div>501238https://dimewiki.worldbank.org/index.php?title=DIME_Datasets_on_Microdata_Catalog&diff=4405DIME Datasets on Microdata Catalog2018-02-12T16:15:24Z<p>501238: </p>
<hr />
<div>== By region ==<br />
===Asia===<br />
*[http://microdata.worldbank.org/index.php/catalog/2826 Bangladesh - Impact Evaluation of the Integrated Agricultural Productivity Project 2015, Endline Household Survey]<br />
*[http://microdata.worldbank.org/index.php/catalog/2817 Bangladesh - Impact Evaluation of the Integrated Agricultural Productivity Project 2014, Midline Household Survey - Round 2]<br />
*[http://microdata.worldbank.org/index.php/catalog/2816 Bangladesh - Impact Evaluation of the Integrated Agricultural Productivity Project 2013, Household Midline Survey - Round 1]<br />
*[http://microdata.worldbank.org/index.php/catalog/2815 Bangladesh - Impact Evaluation of the Integrated Agricultural Productivity Project 2012, Baseline Household Survey]<br />
===Africa===<br />
* [http://microdata.worldbank.org/index.php/catalog/290 Eritrea - Indoor Residual Spraying (IRS) Impact Evaluation Survey 2009]<br />
* [http://microdata.worldbank.org/index.php/catalog/2735 Nigeria - Subsidy Reinvestment and Empowerment Programme Maternal and Child Health Initiative Impact Evaluation (SURE-P MCH) 2013, Baseline Survey]<br />
* [http://microdata.worldbank.org/index.php/catalog/1041 South Africa - Impact Evaluation of the Upgrading of Informal Settlements Programme 2010]<br />
* [http://microdata.worldbank.org/index.php/catalog/2232 Tanzania - Impact Evaluation of Scaling-up Handwashing and Rural Sanitation Behavior Projects in Tanzania 2012, Endline Survey]<br />
<br />
== By area == <br />
===Agriculture===<br />
*[http://microdata.worldbank.org/index.php/catalog/2826 Bangladesh - Impact Evaluation of the Integrated Agricultural Productivity Project 2015, Endline Household Survey]<br />
*[http://microdata.worldbank.org/index.php/catalog/2817 Bangladesh - Impact Evaluation of the Integrated Agricultural Productivity Project 2014, Midline Household Survey - Round 2]<br />
*[http://microdata.worldbank.org/index.php/catalog/2816 Bangladesh - Impact Evaluation of the Integrated Agricultural Productivity Project 2013, Household Midline Survey - Round 1]<br />
*[http://microdata.worldbank.org/index.php/catalog/2815 Bangladesh - Impact Evaluation of the Integrated Agricultural Productivity Project 2012, Baseline Household Survey]<br />
<br />
===Health===<br />
* [http://microdata.worldbank.org/index.php/catalog/290 Eritrea - Indoor Residual Spraying (IRS) Impact Evaluation Survey 2009]<br />
* [http://microdata.worldbank.org/index.php/catalog/2735 Nigeria - Subsidy Reinvestment and Empowerment Programme Maternal and Child Health Initiative Impact Evaluation (SURE-P MCH) 2013, Baseline Survey]<br />
* [http://microdata.worldbank.org/index.php/catalog/2232 Tanzania - Impact Evaluation of Scaling-up Handwashing and Rural Sanitation Behavior Projects in Tanzania 2012, Endline Survey]<br />
<br />
==Back to Parent==<br />
This article is part of the topic [[Publishing Data]]<br />
<br />
[[Category:Publishing Data]]</div>501238https://dimewiki.worldbank.org/index.php?title=Publishing_Data&diff=4404Publishing Data2018-02-12T16:12:36Z<p>501238: /* DIME data releases */</p>
<hr />
<div>Making data available to other researchers in some form is a key requirement of research transparency and reproducibility. However, it is generally not possible or advisable to release raw data. [[Primary Data Collection | Primary data]] usually contains [[De-identification#Personally Identifiable Information | personally-identifying information (PII)]] such as names, locations, or financial records that would be unethical to make public; [[Secondary Data Sources | secondary data]] is often owned by an entity other than the research team and may therefore face legal issues in public release. It is therefore important to structure both data management and analysis so that the published data reproduces the researcher's primary results as closely as possible, and so that the released data is appropriately accessible.<br />
<br />
== Guidelines==<br />
=== Publishing Primary Data ===<br />
<br />
==== Preparing data for release ====<br />
The main issue with releasing primary data is maintaining the privacy of respondents. It is essential to carefully [[De-identification | de-identify]] any sensitive or personally-identifying information contained in the dataset. Datasets released should be easily understandable by users, so [[Data Documentation | documentation]], including variable dictionaries and survey instruments, should be released with the data. See the [[Checklist: Microdata Catalog submission|Microdata Catalog Checklist]] for instructions on how to prepare primary data for release.<br />
<br />
==== DIME data releases ==== <br />
[[DIME_Datasets_on_Microdata_Catalog| DIME survey data]] is released through the [[Microdata Catalog]]. However, access to the data may be restricted and some variables may be embargoed prior to publication.<br />
<br />
=== Publishing Analysis Data ===<br />
Some journals require datasets used in [[Data Analysis | data analysis]] to be released when a paper is published. This is intended to make research more [[Research Ethics#Research Transparency | transparent]] and allow readers to [[Reproducible Research | reproduce findings]].<br />
<br />
==== Preparing data for release ====<br />
The objective of the data release is to allow users to reproduce the results in the paper. The released dataset therefore needs to contain all variables used in [[Data Analysis | data analysis]], as well as all [[ID Variable Properties | identifying variables]]. Analysis datasets should be easily understandable to researchers trying to replicate results, so it is important that they are [[Data_Cleaning#Labels | well-labelled]] and [[Data Documentation | documented]].<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[Publishing Data]]<br />
<br />
[[Category: Publishing Data]]</div>501238https://dimewiki.worldbank.org/index.php?title=Reproducible_Research&diff=4403Reproducible Research2018-02-12T16:11:12Z<p>501238: /* Data publication */</p>
<hr />
<div>In most scientific fields, results are validated through replication: different scientists run the same experiment independently on different samples and reach similar conclusions. That standard is not always feasible in development research. More often than not, the phenomena we analyze cannot be artificially re-created, and even in the case of field experiments, different populations can respond differently to a treatment, and the costs involved are high.<br />
<br />
Even in such cases, however, we should still require reproducibility: different researchers running the same analysis on the same data should find the same results. That may seem obvious, but unfortunately it is not as widely observed as we would like. The bottom line of research reproducibility is that the path used to get to your results is as much a research output as the results themselves, making the research process fully transparent. This means that researchers should make available not only the final findings, but also the data, code and documentation behind them.<br />
<br />
== Code replication ==<br />
Replicating results is the most important part of reproducible research. The easiest way to guarantee that results can be replicated is to have code that runs all the data work and can be run by anyone who has access to it. Different researchers running the same code on the same data should get the same results, so to guarantee that research is transparent and reproducible, code and data should be shared.<br />
<br />
It is possible for the same data and code to produce different results if the right measures are not taken. In Stata, for example, setting the seed and version is essential for [[Randomization in Stata |replicable sampling and randomization]], and sorting observations before any random draw is also frequently necessary. Having a [[Master do-files|master do-file]] greatly improves the replicability of results, since it makes it possible to standardize settings and run do-files in a pre-specified order; a minimal sketch of such a master do-file follows below. If different languages or software are used in the same project, a shell script can be used for the same purpose.<br />
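<br />
This sketch assumes hypothetical do-file names; the point is that the version, seed, and execution order are all fixed in one place:<br />
<pre>
* master.do -- runs all data work in a fixed, replicable order
version 13                   // pin the Stata version so random-number behavior is stable
set seed 510637              // fix the seed for replicable sampling and randomization
set more off

do "01_cleaning.do"          // data cleaning
do "02_randomization.do"     // sorts observations, then draws sample and treatment
do "03_analysis.do"          // regressions and outputs
</pre>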
<br />
Another important property of replicable code is that it should be understandable. That is, if someone else runs your code and replicates all the results, but does not understand what was being done, then your research is still not [[Research Ethics#Research Transparency|transparent]]. Commenting code to make it clear where and why decisions were made is a crucial part of making your work transparent. For example, if observations are dropped or values are changed, the code should be commented to explain why that was done, as in the short example below.<br />
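<br />
A short illustration of such a comment (the variable name and cutoff date are hypothetical):<br />
<pre>
* Drop pilot interviews: households surveyed before 01 Jun 2017
* were part of the questionnaire pilot, not the research sample
drop if submissiondate < td(01jun2017)
</pre>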
<br />
===Software for Code Replication===<br />
Git is free version-control software. Files are stored in Git repositories, most commonly on [https://github.com/ GitHub]. Repositories allow the user to track changes to the code and to record messages explaining why the changes were made, which improves [[Data Documentation|documentation]]. Sharing Git repositories is a way to make code publicly available and to allow other researchers to read and replicate your code.<br />
<br />
To learn GitHub, there is an [https://services.github.com/on-demand/intro-to-github/ introductory training] available through GitHub Services, and multiple tutorials available through [https://guides.github.com/ GitHub Guides].<br />
<br />
== Data publication ==<br />
Research results can only be replicated if the data used for [[Publishing_Data#Publishing_Analysis_Data|analysis]] is [[Publishing Data|shared]]. Though being able to reproduce all steps from [[Data Cleaning|data cleaning]] to [[Data Analysis|data analysis]] is ideal to guarantee reproducibility and [[Research Ethics#Research Transparency|transparency]], that is not always possible, as some of the data used may be [[De-identification#Personally Identifiable Information|personally identifiable]] or confidential. However, sharing the final data is necessary for reproducibility. Some journals require datasets to be submitted along with papers, and some researchers prefer to make data available upon request.<br />
<br />
== Dynamic documents == <br />
Dynamic documents allow researchers to write papers and reports that automatically import or display results. This reduces the amount of manual work involved between analysing data and publishing the output of this analysis, so there's less room for error and manipulation.<br />
<br />
Different software packages allow for different degrees of automation. Using [https://rmarkdown.rstudio.com/ R Markdown], for example, users can write text and code simultaneously, running analyses in different programming languages and printing results in the final document along with the text. Stata 15 also allows users to create such documents using [https://www.stata.com/manuals/pdyndoc.pdf dyndoc]. The output is a file, usually a PDF or HTML file, that contains text, tables and graphs created simultaneously. With this kind of document, whenever data is updated or a change is made to the analysis, it is only necessary to run one file to generate a new final paper or report, with no copy-pasting or manual changes.<br />
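<br />
As a minimal sketch of a Stata dynamic document (the file name is hypothetical), the source file mixes Markdown text with dynamic tags:<br />
<pre>
# Summary of vehicle prices

<<dd_do>>
sysuse auto, clear
summarize price
<</dd_do>>

The average price is <<dd_display: %9.2f r(mean)>> dollars.
</pre>
Running "dyndoc report.txt, replace" in Stata then regenerates the report, with the summary table and the displayed mean recomputed from the current data.<br />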
<br />
LaTeX is another tool widely used in the scientific community. It is a typesetting system that allows users to reference code outputs such as tables and graphs so that they can be easily updated in a text document. With LaTeX, data analysis and writing are two separate processes: first, the data is analyzed using whatever software you prefer and the results are exported into TeX format ([https://cran.r-project.org/web/packages/stargazer/vignettes/stargazer.pdf R's stargazer] is commonly used for that, and Stata has different options such as [http://repec.org/bocode/e/estout/esttab.html esttab] and [http://repec.org/bocode/o/outreg2.html outreg2]); then a LaTeX document is written that imports these outputs. The advantage of using LaTeX is that whenever results are updated, it is only necessary to recompile the LaTeX document for the new tables and graphs to be displayed.<br />
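<br />
A minimal sketch of this workflow with esttab (the output file name is hypothetical):<br />
<pre>
* In Stata: store a regression and export it as a LaTeX table
ssc install estout, replace      // user-written package providing eststo and esttab
sysuse auto, clear
eststo clear
eststo: regress price mpg weight
esttab using "price_table.tex", replace se label booktabs

* In the .tex document, import the table with \input{price_table.tex};
* recompiling the document then picks up any updated results
</pre>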
<br />
===Other software for dynamic documents===<br />
*[http://jupyter.org/ Jupyter Notebook] is used to create and share code in different programming languages, including Python, R, Julia, and Scala. It can also create dynamic documents in HTML, LaTeX and other formats.<br />
<br />
* [https://www.overleaf.com/ Overleaf] is a web-based platform for collaboration on TeX documents.<br />
<br />
* [https://osf.io/ Open Science Framework] is a web-based project management platform that combines registration, data storage (through Dropbox, Box, Google Drive and other platforms), code version control (through GitHub) and document composition (through Overleaf).<br />
<br />
== Additional Resources ==<br />
From Data Colada: <br />
*[http://datacolada.org/69 8 tips to make open research more findable and understandable]<br />
<br />
From the Abdul Latif Jameel Poverty Action Lab (J-PAL)<br />
* [https://www.povertyactionlab.org/research-resources/transparency-and-reproducibility Transparency and Reproducibility]<br />
<br />
From Innovations for Poverty Action (IPA)<br />
* [http://www.poverty-action.org/sites/default/files/publications/IPA%27s%20Best%20Practices%20for%20Data%20and%20Code%20Management_Nov2015.pdf Reproducible Research: Best Practices for Data and Code Management] <br />
* [http://www.poverty-action.org/sites/default/files/Guidelines-for-data-publication.pdf Guidelines for data publication]<br />
* [https://dataverse.harvard.edu/dataverse/socialsciencercts Randomized Control Trials in the Social Science Dataverse]<br />
<br />
Center for Open Science<br />
* [https://cos.io/our-services/top-guidelines/ Transparency and Openness Guidelines], summarized in a [https://osf.io/pvf56/?_ga=1.225140506.1057649246.1484691980 1-Page Handout]<br />
<br />
Berkeley Initiative for Transparency in the Social Sciences<br />
* [http://www.bitss.org/education/manual-of-best-practices/ Manual of Best Practices in Transparent Social Science Research]<br />
<br />
Reproducible Research in R<br />
* [https://www.coursera.org/learn/reproducible-research Johns Hopkins' Online Course on Reproducible Research]<br />
<br />
Reproducible Research in Stata<br />
* [https://huapeng01016.github.io/reptalk/#/hua-pengstatacorphpeng Incorporating Stata into reproducible documents ]<br />
<br />
[[Category: Reproducible Research]]<br />
<br />
</div>501238https://dimewiki.worldbank.org/index.php?title=De-identification&diff=4400De-identification2018-02-12T15:43:40Z<p>501238: /* Folder Encryption */</p>
<hr />
<div>== Read First ==<br />
* Some survey variables allow identification of individual respondents. This is called Personally Identifiable Information (PII). Which variables are considered PII varies with the context of the survey. It is the responsibility of researchers to make sure this data is kept private and safely stored, and no PII may ever be publicly released without explicit consent.<br />
* Variables containing personally identifiable information that is not related to the research question should be dropped as early as possible in the project, and all PII must be stored in an encrypted folder. PII variables that are needed for analysis can be either encoded or masked, depending on the type of information they contain and who has access to the data.<br />
<br />
==Personally Identifiable Information ==<br />
In the context of a survey, personally identifiable information (PII) comprises the variables that can, either on their own or in combination with other variables, lead to identifying a single surveyed individual with reasonable certainty. Here is a list of variables that may lead to personal identification:<br />
* Names of survey respondents, household members, enumerators and other individuals<br />
* Names of schools, clinics, villages and possibly other administrative units (depending on the survey)<br />
* Dates of birth<br />
* GPS coordinates<br />
* Contact information<br />
* Record identifier (social security number, process number, medical record number, national clinic code, license plate, IP address)<br />
* Pictures (of individuals, houses, etc)<br />
<br />
<br />
A few examples of sensitive variables that, depending on survey context, may contain personally identifying information:<br />
* Age<br />
* Gender<br />
* Ethnicity<br />
* Grades, salary, job position<br />
<br />
<br />
As these variables exemplify, what exactly is PII will depend on the context of each survey. For example, if a survey covers a small farming community, variables such as plot size and crops cultivated can be combined to identify an individual household. Administrative units can be considered PII if there are few individuals in each of them. <br />
Details on how to calculate the disclosure risk -- that is, the risk of someone being able to identify individual respondents from the available data -- can be found in [https://dimewiki.worldbank.org/wiki/De-identification#Additional_Resources Additional Resources]. It is common to define a threshold on the minimum number of individuals who must share a given value of a variable for it to be considered safe to disclose. For example, if a school has fewer than 10 students of a certain age, then age is considered PII, as it may be used with other information to identify these students. The value of this threshold depends on the context of the survey; a sketch of such a check follows below.<br />
The guidelines for dealing with PII are discussed below, but four common solutions are (1) restricting access to the data, (2) dropping PII variables, (3) using anonymous codes for categorical variables, and (4) masking values. The first two solutions make the data unavailable, while the last two edit the information shared compared to the original survey data.<br />
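<br />
As a minimal sketch of such a threshold check in Stata (the variable names and the threshold of 10 are hypothetical), one can count how many respondents share each combination of quasi-identifiers and flag small cells:<br />
<pre>
* Flag cells where a combination of quasi-identifiers
* is shared by fewer than 10 respondents
bysort school gender age: gen cell_size = _N
gen disclosure_risk = cell_size < 10
tabulate disclosure_risk
</pre>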
<br />
==Access restriction==<br />
Datasets that are only available to the research team may contain identifiable information, while publicly released data, such as analysis datasets submitted as replication files for academic papers, must be carefully de-identified. In between these two extremes, it is also common to share relatively identifiable data under conditional access. The conditions required to access the data depend on how easy it is to identify an individual from it.<br />
==De-identification==<br />
There are different ways to de-identify datasets, resulting in different levels of information loss. It is advisable to remove immediately identifying variables such as names and contact information as early as possible in the project and to store them under encryption; what other information should be de-identified depends on how relevant the information is to the research question, and on who has access to the data.<br />
Any identifiable information that is not related to the research question should be dropped, but there is a trade-off between ensuring data privacy and losing information and results quality when dealing with relevant variables. For example, a common practice is to create perturbed data, meaning some change is made to the shared variable compared to the original survey. Different methods of introducing change affect regression results and inference in different ways, so it is important to document the type of changes introduced so that researchers can take them into account.<br />
<br />
=== Drop variables===<br />
Variables such as individual names (including survey respondents, family members, employees and enumerators), household coordinates, birth dates, contact information, IP addresses and job positions should be dropped. This applies to any PII that is not necessary for analysis. Such variables may be needed for high-frequency checks, back-checks and monitoring of intervention implementation and survey progress, but they should be dropped from any datasets that are not used for exactly those purposes.<br />
<br />
===Encode variables===<br />
Personally identifiable categorical variables that are needed for analysis, such as administrative units or ethnicity, can be de-identified by encoding. That means dropping the [https://dimewiki.worldbank.org/wiki/Data_Cleaning#Labels value label] of a factor variable, so it is possible to tell which individuals are in the same group, but not what group that is. Be careful to use [https://dimewiki.worldbank.org/wiki/ID_Variable_Properties#Fifth_property:_Anonymous_IDs anonymous IDs] in this case, not a pre-existing code such as the state code used by the National Statistics Bureau or another authority. A minimal sketch follows below.<br />
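<br />
This sketch (with hypothetical variable names) replaces a named administrative unit with an anonymous sequential code in Stata:<br />
<pre>
* Create an anonymous group code, then drop the named variable
egen district_code = group(district_name)
label variable district_code "District (anonymous code)"
drop district_name
</pre>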
<br />
===Mask values===<br />
For numeric variables that are related to the research question and may be used to identify individuals, there are different methods that can be used to limit disclosure. This is necessary if the data is publicly available. Some of the most used methods, as well as their advantages and disadvantages, are discussed below. See [https://dimewiki.worldbank.org/wiki/De-identification#Additional_Resources Additional Resources] for more detailed information on how to implement each of them.<br />
When editing a variable's values, make sure to do it in a way that cannot be reversed, for example by adding different random values to different variables and observations. If you shift every GPS coordinate two kilometres south, the original coordinates can easily be recovered. Similarly, if you create one single noise variable with different values for each observation and add it to multiple variables to de-identify them, their original values can be recovered more easily than if you add independently drawn noise to each variable. Two of the methods below are sketched in code after this list.<br />
* '''Categorization''': continuous variables can be transformed into categoric variables. This is done by reporting such variable in ranges instead of an individual’s specific value. For example, you can categorize ages and say that an individual is between 18 and 25 years old instead of 22. The range of each category will depend on how many individual observations exist in each of them.<br />
* '''Micro-aggregation''': groups are formed with a certain number of observations, and individual values are substituted with the group mean. This may affect estimation: even though the variable's mean is not affected, the variance is. However, the change in the variance is small if the groups are small.<br />
* '''Adding noise''': white noise can be created by generating a new variable with mean zero and positive variance and adding it to the original variable. This alters the variable's variance, therefore affecting inference.<br />
* '''Rounding''': consists of defining, often randomly, a rounding base and rounding each observation to its nearest multiple.<br />
* '''Top-coding''': when only a few extreme values can be individually identified, such values can be capped so that, for example, any farmer producing more than a certain quantity of a crop is assigned that quantity.<br />
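<br />
A minimal Stata sketch of adding noise and top-coding (the variable names, noise scale and percentile are hypothetical):<br />
<pre>
* Adding noise: an independent draw for each variable and observation
summarize income
generate income_masked = income + rnormal(0, 0.1 * r(sd))

* Top-coding: cap extreme crop quantities at the 99th percentile
_pctile crop_kg, p(99)
generate crop_kg_masked = min(crop_kg, r(r1))
</pre>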
<br />
===Anonymous IDs===<br />
When a survey sample comes from a previously existing registry, or when survey data needs to be matched to administrative data, it is common to use a pre-existing ID variable from that registry or database, e.g. state codes or clinic registries. Note that if these codes are publicly available, a dataset created with them will still be personally identifiable, even if all names are deleted.<br />
<br />
In general, it is not recommended to use IDs that people outside the team have access to; it is preferable to create a new, anonymous code. However, there are exceptions to this general rule. Read the [https://dimewiki.worldbank.org/wiki/ID_Variable_Properties#Fifth_property:_Anonymous_IDs Anonymous IDs] article for more information on how to deal with this specific issue.<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[Data Cleaning]]<br />
<br />
<br />
== Additional Resources ==<br />
*[https://projecteuclid.org/download/pdfview_1/euclid.ssu/1296828958 Matthews, Gregory J., and Ofer Harel. "Data confidentiality: A review of methods for statistical disclosure limitation and methods for assessing privacy." Statistics Surveys 5 (2011): 1-29.]<br />
*[http://repository.cmu.edu/jpc/vol2/iss1/7/ Shlomo, Natalie (2010) "Releasing Microdata: Disclosure Risk Estimation, Data Masking and Assessing Utility," Journal of Privacy and Confidentiality: Vol. 2 : Iss. 1 , Article 7. ]<br />
*[https://nces.ed.gov/pubs2011/2011603.pdf Guidelines for Protecting PII from the Institute of Education Sciences]<br />
[[Category: Data Cleaning]] [[Category: Publishing Data]]</div>501238https://dimewiki.worldbank.org/index.php?title=Reproducible_Research&diff=4399Reproducible Research2018-02-12T15:43:01Z<p>501238: /* Code replication */</p>
<hr />
<div>In most scientific fields, results are validated through replication: different scientists run the same experiment independently on different samples and reach similar conclusions. That standard is not always feasible in development research. More often than not, the phenomena we analyze cannot be artificially re-created. Even in the case of field experiments, different populations can respond differently to a treatment, and the costs involved are high.<br />
<br />
Even in such cases, however, we should still require reproducibility: this means that different researchers, when running the same analysis on the same data, should find the same results. That may seem obvious, but unfortunately it is not as widely observed as we would like. The bottom line of research reproducibility is that the path used to get to your results is as much a research output as the results themselves, making the research process fully transparent. This means that not only the final findings should be made available by researchers: the data, code and documentation are also of great relevance to the public.<br />
<br />
== Code replication ==<br />
Replicating results is the most important part of reproducible research. The easiest way to guarantee that results can be replicated is to have code that runs all the data work and can be run by anyone who has access to it. Different researchers running the same code on the same data should get the same results. So to guarantee that research is transparent and reproducible, code and data should be shared.<br />
<br />
It is possible for the same data and code to create different results if the right measures are not taken. In Stata, for example, setting the seed and version is essential to [[Randomization in Stata |replicable sampling and randomization]]. Sorting observations on a unique ID is also frequently necessary. Having a [[Master do-files|master do-file]] greatly improves the replicability of results, since it makes it possible to standardize settings and run do-files in a pre-specified order, as in the sketch below. If different languages or software are used in the same project, a shell script can be used for the same purpose.<br />
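As a minimal sketch, the replication settings at the top of a master do-file might look as follows (the version number, seed and file names are illustrative):<br />
<pre>
* Master do-file sketch: standardize settings, then run do-files in order
version 13            // freeze Stata's behavior, including the random-number generator
set seed 439621       // one fixed seed so sampling and randomization are replicable
set more off
* Inside each do-file, sort on a unique ID (e.g. -isid household_id- followed by
* -sort household_id-) before any command whose result depends on observation order
do "cleaning.do"
do "analysis.do"
</pre>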
<br />
Another important part of replicable code is that it should be understandable. That is, if someone else runs your code and replicates all the results, but doesn't understand what was being done, then your research is still not [[Research Ethics#Research Transparency|transparent]]. Commenting code to make it clear where and why decisions were made is a crucial part of making your work transparent. For example, if observations are dropped or values are changed, the code should be commented to explain why that was done.<br />
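For example, a commented data decision might look like this minimal sketch (the IDs, dates and codes are illustrative):<br />
<pre>
* Household 1047 was interviewed twice by mistake; the field team confirmed
* on 2017-06-12 that only the first record is valid
drop if household_id == 1047 & duplicate_record == 1

* -999 is the survey code for "refused to answer", not a real income value
replace income = . if income == -999
</pre>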
<br />
===Software for Code Replication===<br />
Git is free version-control software. Files are stored in Git repositories, most commonly on [https://github.com/ GitHub]. Repositories allow the user to track changes to the code and create messages explaining why the changes were made, which improves [[Data Documentation|documentation]]. Sharing Git repositories is a way to make code publicly available and allow other researchers to read and replicate your code.<br />
<br />
To learn GitHub, there is an [https://services.github.com/on-demand/intro-to-github/ introductory training] available through GitHub Services, and multiple tutorials available through [https://guides.github.com/ GitHub Guides].<br />
<br />
== Data publication ==<br />
<br />
== Dynamic documents == <br />
*R Markdown is a widely adopted tool for creating fully reproducible documents. It allows users to write text and code simultaneously, running analyses in different programming languages and printing results in the final document along with the text. Stata 15 also allows users to create dynamic documents using the dyndoc command; see the sketch after this list. <br />
<br />
*[http://jupyter.org/ Jupyter Notebook] is used to create and share code in different programming languages, including Python, R, Julia, and Scala. It can also create dynamic documents in HTML, LaTeX and other formats.<br />
<br />
*LaTeX is another widely used tool in the scientific community. It is a typesetting system that allows users to reference code outputs such as tables and graphs so that they can be easily updated in a text document. Overleaf is a web-based platform for collaboration on TeX documents.<br />
<br />
* The Open Science Framework is a web-based project management platform that combines registration, data storage (through Dropbox, Box, Google Drive and other platforms), code version control (through GitHub) and document composition (through Overleaf).<br />
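As an illustration of how dynamic documents work, a dyndoc source file for Stata might look like the minimal sketch below (the dataset and statistic are illustrative); running <code>dyndoc report.txt, replace</code> on it produces an HTML file in which the code output and the displayed mean are filled in automatically:<br />
<pre>
<<dd_do>>
sysuse auto, clear
summarize price
<</dd>>

The average price in the sample is <<dd_display: %6.0f r(mean)>> dollars.
</pre>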
<br />
== Additional Resources ==<br />
From Data Colada: <br />
*[http://datacolada.org/69 8 tips to make open research more findable and understandable]<br />
<br />
From the Abdul Latif Jameel Poverty Action Lab (J-PAL)<br />
* [https://www.povertyactionlab.org/research-resources/transparency-and-reproducibility Transparency and Reproducibility]<br />
<br />
From Innovations for Poverty Action (IPA)<br />
* [http://www.poverty-action.org/sites/default/files/publications/IPA%27s%20Best%20Practices%20for%20Data%20and%20Code%20Management_Nov2015.pdf Reproducible Research: Best Practices for Data and Code Management] <br />
* [http://www.poverty-action.org/sites/default/files/Guidelines-for-data-publication.pdf Guidelines for data publication]<br />
* [https://dataverse.harvard.edu/dataverse/socialsciencercts Randomized Control Trials in the Social Science Dataverse]<br />
<br />
Center for Open Science<br />
* [https://cos.io/our-services/top-guidelines/ Transparency and Openness Guidelines], summarized in a [https://osf.io/pvf56/?_ga=1.225140506.1057649246.1484691980 1-Page Handout]<br />
<br />
Berkeley Initiative for Transparency in the Social Sciences<br />
* [http://www.bitss.org/education/manual-of-best-practices/ Manual of Best Practices in Transparent Social Science Research]<br />
<br />
Reproducible Research in R<br />
* [https://www.coursera.org/learn/reproducible-research Johns Hopkins' Online Course on Reproducible Research]<br />
<br />
Reproducible Research in Stata<br />
* [https://huapeng01016.github.io/reptalk/#/hua-pengstatacorphpeng Incorporating Stata into reproducible documents ]<br />
<br />
[[Category: Reproducible Research]]<br />
<br />
</div>501238https://dimewiki.worldbank.org/index.php?title=Reproducible_Research&diff=4398Reproducible Research2018-02-12T15:42:50Z<p>501238: /* Software for Code Replication */</p>
<hr />
<div>In most scientific fields, results are validated through replication: that means that different scientists run the same experiment independently in different samples and find similar conclusions. That standard is not always feasible in development research. More often than not, the phenomena we analyze cannot be artificially re-created. Even in the case of field experiments, different populations can respond differently to a treatment, and the costs involved are high.<br />
<br />
Even in such cases, however, we should still require reproducibility: this means that different researchers, when running the same analysis on the same data, should find the same results. That may seem obvious, but unfortunately it is not as widely observed as we would like. The bottom line of research reproducibility is that the path used to get to your results is as much a research output as the results themselves, making the research process fully transparent. This means that not only the final findings should be made available by researchers: the data, code and documentation are also of great relevance to the public.<br />
<br />
== Code replication ==<br />
Replicating results is the most important part of reproducible research. The easiest way to guarantee that results can be replicated is to have a code that runs all data work and can be run by anyone who has access to it. Different researchers running the same code on the same data should get the same results. So to guarantee that research is transparent and reproducible, codes and data should be shared.<br />
<br />
It is possible for the same data and code to create different results if the right measures are not taken. In Stata, for example, setting the seed and version is essential to [[Randomization in Stata |replicable sampling and randomization]]. Sorting observations is also frequently necessary. Having a [[Master do-file|master do-file]] greatly improves the replicability of results since it's possible to standardize setting and run do-files in a pre-specified order from it. If different languages or software are used in the same project, a shell script can be used for the same purpose.<br />
<br />
Another important part of replicable code is that it should be understandable. That is, if someone else runs your code and replicates all the results, but doesn't understand what was being done, then your research is still not [[Research Ethics#Research Transparency|transparent]]. Commenting code to make it clear where and why decisions were made is a crucial part of making your work transparent. For example, if observations are dropped or values are changed, the code should be commented to explain why that was done.<br />
<br />
===Software for Code Replication===<br />
Git is a free version-control software. Files are stored in Git Repositories, most commonly on [https://github.com/ GitHub]. They allow the user to track changes to the code and create messages explaining why the changes were made, which improves [[Data Documentation|documentation]]. Sharing Git repositories is a way to make code publicly available, and allow other researchers to read and replicate your code.<br />
<br />
To learn GitHub, there is an [https://services.github.com/on-demand/intro-to-github/ introductory training] available through GitHub Services, and multiple tutorials available through [https://guides.github.com/ GitHub Guides].<br />
<br />
== Data publication ==<br />
<br />
== Dynamic documents == <br />
*R-markdown is a widely adopted tool for creating fully reproducible documents. It allows users to write text and code simultaneously, running analyses in different programming languages and printing results in the final document along with the text. Stata 15 also allows users to create dynamic documents using dyndoc. <br />
<br />
*[http://jupyter.org/ Jupyter Notebook] is used to create and share code in different programming languages, including Python, R, Julia, and Scala. It can also create dynamic documents in HTML, LaTeX and other formats.<br />
<br />
*LaTeX is another widely used tool in the scientific community. It is a type-setting system that allows users to reference code outputs such as tables and graphs so that they can be easily updated in a text document. Overleaf is a web based platform for collaboration in TeX documents.<br />
<br />
* Open science framework is a web based project management platform that combines registration, data storage (through Dropbox, Box, Google Drive and other platforms), code version control (through GitHub) and document composition (through Overleaf).<br />
<br />
== Additional Resources ==<br />
From Data Colada: <br />
*[http://datacolada.org/69 8 tips to make open research more findable and understandable]<br />
<br />
From the Abdul Latif Jameel Poverty Action Lab (J-PAL)<br />
* [https://www.povertyactionlab.org/research-resources/transparency-and-reproducibility Transparency and Reproducibility]<br />
<br />
From Innovations for Poverty Action (IPA)<br />
* [http://www.poverty-action.org/sites/default/files/publications/IPA%27s%20Best%20Practices%20for%20Data%20and%20Code%20Management_Nov2015.pdf Reproducible Research: Best Practices for Data and Code Management] <br />
* [http://www.poverty-action.org/sites/default/files/Guidelines-for-data-publication.pdf Guidelines for data publication]<br />
* [https://dataverse.harvard.edu/dataverse/socialsciencercts Randomized Control Trials in the Social Science Dataverse]<br />
<br />
Center for Open Science<br />
* [https://cos.io/our-services/top-guidelines/ Transparency and Openness Guidelines], summarized in a [https://osf.io/pvf56/?_ga=1.225140506.1057649246.1484691980 1-Page Handout]<br />
<br />
Berkeley Initiative for Transparency in the Social Sciences<br />
* [http://www.bitss.org/education/manual-of-best-practices/ Manual of Best Practices in Transparent Social Science Research]<br />
<br />
Reproducible Research in R<br />
* [https://www.coursera.org/learn/reproducible-research Johns Hopkins' Online Course on Reproducible Research]<br />
<br />
Reproducible Research in Stata<br />
* [https://huapeng01016.github.io/reptalk/#/hua-pengstatacorphpeng Incorporating Stata into reproducible documents ]<br />
<br />
[[Category: Reproducible Research]]<br />
<br />
</div>501238https://dimewiki.worldbank.org/index.php?title=Reproducible_Research&diff=4397Reproducible Research2018-02-12T15:42:32Z<p>501238: </p>
<hr />
<div>In most scientific fields, results are validated through replication: that means that different scientists run the same experiment independently in different samples and find similar conclusions. That standard is not always feasible in development research. More often than not, the phenomena we analyze cannot be artificially re-created. Even in the case of field experiments, different populations can respond differently to a treatment, and the costs involved are high.<br />
<br />
Even in such cases, however, we should still require reproducibility: this means that different researchers, when running the same analysis on the same data, should find the same results. That may seem obvious, but unfortunately it is not as widely observed as we would like. The bottom line of research reproducibility is that the path used to get to your results is as much a research output as the results themselves, making the research process fully transparent. This means that not only the final findings should be made available by researchers: the data, code and documentation are also of great relevance to the public.<br />
<br />
== Code replication ==<br />
Replicating results is the most important part of reproducible research. The easiest way to guarantee that results can be replicated is to have a code that runs all data work and can be run by anyone who has access to it. Different researchers running the same code on the same data should get the same results. So to guarantee that research is transparent and reproducible, codes and data should be shared.<br />
<br />
It is possible for the same data and code to create different results if the right measures are not taken. In Stata, for example, setting the seed and version is essential to [[Randomization in Stata |replicable sampling and randomization]]. Sorting observations is also frequently necessary. Having a [[Master do-file|master do-file]] greatly improves the replicability of results since it's possible to standardize setting and run do-files in a pre-specified order from it. If different languages or software are used in the same project, a shell script can be used for the same purpose.<br />
<br />
Another important part of replicable code is that it should be understandable. That is, if someone else runs your code and replicates all the results, but doesn't understand what was being done, then your research is still not [[Research Ethics#Research Transparency|transparent]]. Commenting code to make it clear where and why decisions were made is a crucial part of making your work transparent. For example, if observations are dropped or values are changed, the code should be commented to explain why that was done.<br />
<br />
===Software for Code Replication===<br />
Git is a free version-control software. Files are stored in Git Repositories, most commonly on [https://github.com/ GitHub]. They allow the user to track changes to the code and create messages explaining why the changes were made, which improves [[Data Documentation|documentation]]. Sharing Git repositories is a way to make code publicly available, and allow other researchers to read and replicate your code.<br />
<br />
To learn GitHub, there is an [https://services.github.com/on-demand/intro-to-github/ introductory training] available through GitHub Services, and multiple tutorials available through [https://guides.github.com/ GitHub Guides]<br />
<br />
== Data publication ==<br />
<br />
== Dynamic documents == <br />
*R-markdown is a widely adopted tool for creating fully reproducible documents. It allows users to write text and code simultaneously, running analyses in different programming languages and printing results in the final document along with the text. Stata 15 also allows users to create dynamic documents using dyndoc. <br />
<br />
*[http://jupyter.org/ Jupyter Notebook] is used to create and share code in different programming languages, including Python, R, Julia, and Scala. It can also create dynamic documents in HTML, LaTeX and other formats.<br />
<br />
*LaTeX is another widely used tool in the scientific community. It is a type-setting system that allows users to reference code outputs such as tables and graphs so that they can be easily updated in a text document. Overleaf is a web based platform for collaboration in TeX documents.<br />
<br />
* Open science framework is a web based project management platform that combines registration, data storage (through Dropbox, Box, Google Drive and other platforms), code version control (through GitHub) and document composition (through Overleaf).<br />
<br />
== Additional Resources ==<br />
From Data Colada: <br />
*[http://datacolada.org/69 8 tips to make open research more findable and understandable]<br />
<br />
From the Abdul Latif Jameel Poverty Action Lab (J-PAL)<br />
* [https://www.povertyactionlab.org/research-resources/transparency-and-reproducibility Transparency and Reproducibility]<br />
<br />
From Innovations for Poverty Action (IPA)<br />
* [http://www.poverty-action.org/sites/default/files/publications/IPA%27s%20Best%20Practices%20for%20Data%20and%20Code%20Management_Nov2015.pdf Reproducible Research: Best Practices for Data and Code Management] <br />
* [http://www.poverty-action.org/sites/default/files/Guidelines-for-data-publication.pdf Guidelines for data publication]<br />
* [https://dataverse.harvard.edu/dataverse/socialsciencercts Randomized Control Trials in the Social Science Dataverse]<br />
<br />
Center for Open Science<br />
* [https://cos.io/our-services/top-guidelines/ Transparency and Openness Guidelines], summarized in a [https://osf.io/pvf56/?_ga=1.225140506.1057649246.1484691980 1-Page Handout]<br />
<br />
Berkeley Initiative for Transparency in the Social Sciences<br />
* [http://www.bitss.org/education/manual-of-best-practices/ Manual of Best Practices in Transparent Social Science Research]<br />
<br />
Reproducible Research in R<br />
* [https://www.coursera.org/learn/reproducible-research Johns Hopkins' Online Course on Reproducible Research]<br />
<br />
Reproducible Research in Stata<br />
* [https://huapeng01016.github.io/reptalk/#/hua-pengstatacorphpeng Incorporating Stata into reproducible documents ]<br />
<br />
[[Category: Reproducible Research]]<br />
<br />
</div>501238https://dimewiki.worldbank.org/index.php?title=Reproducible_Research&diff=4396Reproducible Research2018-02-12T15:22:01Z<p>501238: /* Code replication */</p>
<hr />
<div>In most scientific fields, results are validated through replication: that means that different scientists run the same experiment independently in different samples and find similar conclusions. That standard is not always feasible in development research. More often than not, the phenomena we analyze cannot be artificially re-created. Even in the case of field experiments, different populations can respond differently to a treatment, and the costs involved are high.<br />
<br />
Even in such cases, however, we should still require reproducibility: this means that different researchers, when running the same analysis on the same data, should find the same results. That may seem obvious, but unfortunately it is not as widely observed as we would like. The bottom line of research reproducibility is that the path used to get to your results is as much a research output as the results themselves, making the research process fully transparent. This means that not only the final findings should be made available by researchers: the data, code and documentation are also of great relevance to the public.<br />
<br />
== Code replication ==<br />
Replicating <br />
<br />
===Software for Code Replication===<br />
*Git is a free version-control software. Files are stored in Git Repositories, most commonly on [https://github.com/ GitHub]. To learn GitHub, there is an [https://services.github.com/on-demand/intro-to-github/ introductory training] available through GitHub Services, and multiple tutorials available through [https://guides.github.com/ GitHub Guides]<br />
<br />
== Data publication ==<br />
<br />
== Dynamic documents == <br />
*R-markdown is a widely adopted tool for creating fully reproducible documents. It allows users to write text and code simultaneously, running analyses in different programming languages and printing results in the final document along with the text. Stata 15 also allows users to create dynamic documents using dyndoc. <br />
<br />
*[http://jupyter.org/ Jupyter Notebook] is used to create and share code in different programming languages, including Python, R, Julia, and Scala. It can also create dynamic documents in HTML, LaTeX and other formats.<br />
<br />
*LaTeX is another widely used tool in the scientific community. It is a type-setting system that allows users to reference code outputs such as tables and graphs so that they can be easily updated in a text document. Overleaf is a web based platform for collaboration in TeX documents.<br />
<br />
* Open science framework is a web based project management platform that combines registration, data storage (through Dropbox, Box, Google Drive and other platforms), code version control (through GitHub) and document composition (through Overleaf).<br />
<br />
== Additional Resources ==<br />
From Data Colada: <br />
*[http://datacolada.org/69 8 tips to make open research more findable and understandable]<br />
<br />
From the Abdul Latif Jameel Poverty Action Lab (J-PAL)<br />
* [https://www.povertyactionlab.org/research-resources/transparency-and-reproducibility Transparency and Reproducibility]<br />
<br />
From Innovations for Poverty Action (IPA)<br />
* [http://www.poverty-action.org/sites/default/files/publications/IPA%27s%20Best%20Practices%20for%20Data%20and%20Code%20Management_Nov2015.pdf Reproducible Research: Best Practices for Data and Code Management] <br />
* [http://www.poverty-action.org/sites/default/files/Guidelines-for-data-publication.pdf Guidelines for data publication]<br />
* [https://dataverse.harvard.edu/dataverse/socialsciencercts Randomized Control Trials in the Social Science Dataverse]<br />
<br />
Center for Open Science<br />
* [https://cos.io/our-services/top-guidelines/ Transparency and Openness Guidelines], summarized in a [https://osf.io/pvf56/?_ga=1.225140506.1057649246.1484691980 1-Page Handout]<br />
<br />
Berkeley Initiative for Transparency in the Social Sciences<br />
* [http://www.bitss.org/education/manual-of-best-practices/ Manual of Best Practices in Transparent Social Science Research]<br />
<br />
Reproducible Research in R<br />
* [https://www.coursera.org/learn/reproducible-research Johns Hopkins' Online Course on Reproducible Research]<br />
<br />
Reproducible Research in Stata<br />
* [https://huapeng01016.github.io/reptalk/#/hua-pengstatacorphpeng Incorporating Stata into reproducible documents ]<br />
<br />
[[Category: Reproducible Research]]<br />
<br />
</div>501238https://dimewiki.worldbank.org/index.php?title=Reproducible_Research&diff=4395Reproducible Research2018-02-12T15:12:02Z<p>501238: /* Code replication */</p>
<hr />
<div>In most scientific fields, results are validated through replication: that means that different scientists run the same experiment independently in different samples and find similar conclusions. That standard is not always feasible in development research. More often than not, the phenomena we analyze cannot be artificially re-created. Even in the case of field experiments, different populations can respond differently to a treatment, and the costs involved are high.<br />
<br />
Even in such cases, however, we should still require reproducibility: this means that different researchers, when running the same analysis on the same data, should find the same results. That may seem obvious, but unfortunately it is not as widely observed as we would like. The bottom line of research reproducibility is that the path used to get to your results is as much a research output as the results themselves, making the research process fully transparent. This means that not only the final findings should be made available by researchers: the data, code and documentation are also of great relevance to the public.<br />
<br />
== Code replication ==<br />
<br />
===Software for Code Replication===<br />
*Git is a free version-control software. Files are stored in Git Repositories, most commonly on [https://github.com/ GitHub]. To learn GitHub, there is an [https://services.github.com/on-demand/intro-to-github/ introductory training] available through GitHub Services, and multiple tutorials available through [https://guides.github.com/ GitHub Guides]<br />
<br />
== Data publication ==<br />
<br />
== Dynamic documents == <br />
*R-markdown is a widely adopted tool for creating fully reproducible documents. It allows users to write text and code simultaneously, running analyses in different programming languages and printing results in the final document along with the text. Stata 15 also allows users to create dynamic documents using dyndoc. <br />
<br />
*[http://jupyter.org/ Jupyter Notebook] is used to create and share code in different programming languages, including Python, R, Julia, and Scala. It can also create dynamic documents in HTML, LaTeX and other formats.<br />
<br />
*LaTeX is another widely used tool in the scientific community. It is a type-setting system that allows users to reference code outputs such as tables and graphs so that they can be easily updated in a text document. Overleaf is a web based platform for collaboration in TeX documents.<br />
<br />
* Open science framework is a web based project management platform that combines registration, data storage (through Dropbox, Box, Google Drive and other platforms), code version control (through GitHub) and document composition (through Overleaf).<br />
<br />
== Additional Resources ==<br />
From Data Colada: <br />
*[http://datacolada.org/69 8 tips to make open research more findable and understandable]<br />
<br />
From the Abdul Latif Jameel Poverty Action Lab (J-PAL)<br />
* [https://www.povertyactionlab.org/research-resources/transparency-and-reproducibility Transparency and Reproducibility]<br />
<br />
From Innovations for Poverty Action (IPA)<br />
* [http://www.poverty-action.org/sites/default/files/publications/IPA%27s%20Best%20Practices%20for%20Data%20and%20Code%20Management_Nov2015.pdf Reproducible Research: Best Practices for Data and Code Management] <br />
* [http://www.poverty-action.org/sites/default/files/Guidelines-for-data-publication.pdf Guidelines for data publication]<br />
* [https://dataverse.harvard.edu/dataverse/socialsciencercts Randomized Control Trials in the Social Science Dataverse]<br />
<br />
Center for Open Science<br />
* [https://cos.io/our-services/top-guidelines/ Transparency and Openness Guidelines], summarized in a [https://osf.io/pvf56/?_ga=1.225140506.1057649246.1484691980 1-Page Handout]<br />
<br />
Berkeley Initiative for Transparency in the Social Sciences<br />
* [http://www.bitss.org/education/manual-of-best-practices/ Manual of Best Practices in Transparent Social Science Research]<br />
<br />
Reproducible Research in R<br />
* [https://www.coursera.org/learn/reproducible-research Johns Hopkins' Online Course on Reproducible Research]<br />
<br />
Reproducible Research in Stata<br />
* [https://huapeng01016.github.io/reptalk/#/hua-pengstatacorphpeng Incorporating Stata into reproducible documents ]<br />
<br />
[[Category: Reproducible Research]]<br />
<br />
</div>501238https://dimewiki.worldbank.org/index.php?title=De-identification&diff=4394De-identification2018-02-12T15:06:59Z<p>501238: /* Additional Resources */</p>
<hr />
<div>== Read First ==<br />
* Some survey variables allow identification of individual respondents. This is called Personally Identifiable Information (PII). Which variables are considered PII varies with the context of the survey. It is the responsibility of researchers to make sure this data is private and safely stored, and no PII can ever be publicly released without explicit consent.<br />
* Variables including personally identifiable information that is not related to the research question should be dropped as soon as possible in the project, and all PII must be stored in an encrypted folder. PII variables that are needed for analysis can be either encoded or masked, depending on the type of information they contain and who has access to the data.<br />
<br />
==Personally Identifiable Information ==<br />
In the context of a survey, personally identifiable information (PII) consists of the variables that can, either on their own or in combination with other variables, lead to identifying a single surveyed individual with reasonable certainty. Here's a list of variables that may lead to personal identification:<br />
* Names of survey respondents, household members, enumerators and other individuals<br />
* Names of schools, clinics, villages and possibly other administrative units (depending on the survey)<br />
* Dates of birth<br />
* GPS coordinates<br />
* Contact information<br />
* Record identifier (social security number, process number, medical record number, national clinic code, license plate, IP address)<br />
* Pictures (of individuals, houses, etc.)<br />
<br />
<br />
A few examples of sensitive variables that depending on survey context may contain personally identifying information:<br />
* Age<br />
* Gender<br />
* Ethnicity<br />
* Grades, salary, job position<br />
<br />
<br />
As these variables exemplify, what exactly is PII will depend on the context of each survey. For example, if a survey covers a small farming community, variables such as plot size and crops cultivated can be combined to identify an individual household. Administrative units can be considered PII if there are few individuals in each of them. <br />
Details on how to calculate the disclosure risk (that is, the risk of someone being able to identify individual respondents from the available data) can be found in [https://dimewiki.worldbank.org/wiki/De-identification#Additional_Resources Additional Resources]. It is common to define a threshold for the minimum number of individuals that must share a given value of a variable for that value to be considered safe to disclose. For example, if a school has fewer than 10 students of a certain age, then age is considered PII, as it may be used with other information to identify these students. The value of this threshold depends on the context of the survey; a minimal sketch of such a check follows below.<br />
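A minimal sketch of such a threshold check in Stata (the variable names and the threshold of 10 are illustrative):<br />
<pre>
* Flag values of age observed for fewer than `threshold' students per school
local threshold 10
bysort school_id age: generate cell_size = _N
generate flag_disclosive = cell_size < `threshold'
tabulate age if flag_disclosive    // review which values are too rare to release
</pre>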
The guidelines for dealing with PII will be discussed below; four common solutions are to (1) restrict access to the data, (2) drop PII variables, (3) use anonymous codes for categorical variables, and (4) mask their values. The first two solutions make the data unavailable, while the last two edit the information shared compared to the original survey data.<br />
<br />
==Folder Encryption==<br />
==Access restriction==<br />
Data sets that are only available to the research team may contain identifiable information, but publicly released data, such as analysis datasets submitted as replication files for academic papers, must be carefully de-identified. In between these two extremes, it is also common to share some relatively identifiable data under conditional access. The conditions required to access the data depend on how easy it is to identify an individual from it.<br />
==De-identification==<br />
There are different ways to de-identify data sets, resulting in different levels of information loss. It is advisable to remove immediately identifying variables such as names and contact information as early as possible in the project and to store them under encryption; what other information should be de-identified depends on how relevant the information is to the research question, and on who has access to the data.<br />
Any identifiable information that is not related to the research question should be dropped, but there is a trade-off between ensuring data privacy and losing information and results quality when dealing with relevant variables. For example, a common practice is to create perturbed data, meaning that some change is made to the shared variable compared to the original survey. Different methods of introducing change affect regression results and inference in different ways, and it is important to document the type of change introduced so researchers can take this into account.<br />
<br />
=== Drop variables===<br />
Variables such as individual names (including those of survey respondents, family members, employees and enumerators), household coordinates, birth dates, contact information, IP addresses and job positions should be dropped. This applies to any PII that is not necessary for analysis. They may be needed for high-frequency checks, back-checks and monitoring of intervention implementation and survey progress, but should be dropped from any data sets that are not used exactly for that.<br />
<br />
===Encode variables===<br />
Personally identifiable categorical variables that are needed for analysis, such as administrative units, ethnicity, etc., can be de-identified by encoding. That means dropping the [https://dimewiki.worldbank.org/wiki/Data_Cleaning#Labels value label] of a factor variable, so it is possible to tell which individuals are in the same group, but not what group that is. Be careful to use [https://dimewiki.worldbank.org/wiki/ID_Variable_Properties#Fifth_property:_Anonymous_IDs anonymous IDs] in this case, not a pre-existing code such as the state code used by the National Statistics Bureau or another authority; a minimal sketch follows below.<br />
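A minimal sketch of anonymous encoding in Stata, assuming an illustrative string variable ''state_name''. Note that a plain -encode- or -egen group()- numbers groups in alphabetical order, which may be reversible if the list of group names is known, so the sketch assigns the codes in random order instead:<br />
<pre>
* Anonymous group codes: same code for the same state, but label-free
set seed 92817
bysort state_name: generate double rand1 = runiform() if _n == 1
bysort state_name: egen double grp_rand = max(rand1)   // one random draw per state
egen state_code = group(grp_rand)                      // codes follow the random draws
drop state_name rand1 grp_rand
</pre>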
<br />
===Mask values===<br />
For numeric variables that are related to the research question and may be used to identify individuals, there are different methods that can be used to limit disclosure. This is necessary if the data is publicly available. Some of the most commonly used methods, as well as their advantages and disadvantages, are discussed below, and a short Stata sketch of the categorization, micro-aggregation and top-coding methods follows the list. See [https://dimewiki.worldbank.org/wiki/De-identification#Additional_Resources Additional Resources] for more detailed information on how to implement each of them.<br />
When editing a variable’s values, make sure to do it in a way that cannot be reversed, for example by adding different random values to different variables and observations. For example, if you displace every GPS coordinate two kilometres south, the original coordinates can easily be traced back. Similarly, if you create one single noise variable with different values for each observation and add it to multiple variables to de-identify them, their original values can be obtained more easily than if you add a different noise draw to each variable.<br />
* '''Categorization''': continuous variables can be transformed into categorical variables. This is done by reporting the variable in ranges instead of an individual’s specific value. For example, you can categorize ages and say that an individual is between 18 and 25 years old instead of 22. The range of each category will depend on how many individual observations exist in each of them.<br />
* '''Micro-aggregation''': this is done by forming groups with a certain number of observations and substituting the individual values with the group mean. This may affect estimation: even though the variable's mean is not affected, its variance is. However, the change in the variance is small if the groups are small.<br />
* '''Adding noise''': white noise can be created by generating a new variable with mean zero and positive variance and adding it to the original variable. This causes the variable’s variance to be altered, therefore affecting inference.<br />
* '''Rounding''': consists of defining a rounding base, often at random, and rounding each observation to its nearest multiple.<br />
* '''Top-coding''': when only a few extreme values can be individually identified, such values can be rounded so that, for example, any farmer producing more than a certain quantity of a crop is assigned that quantity.<br />
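A minimal Stata sketch of the categorization, micro-aggregation and top-coding methods (the variable names, cut-offs and group size are illustrative):<br />
<pre>
* Categorization: report age in ranges instead of exact years
egen age_cat = cut(age), at(18,26,36,46,56,66,120)

* Micro-aggregation: replace values with means of small groups of neighbours
sort crop_output
generate ma_group = ceil(_n / 3)                       // groups of 3 adjacent values
bysort ma_group: egen crop_output_ma = mean(crop_output)

* Top-coding: producers above the cut-off are all assigned the cut-off value
replace crop_output = 500 if crop_output > 500 & !missing(crop_output)
</pre>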
<br />
===Anonymous IDs===<br />
When a survey sample comes from a previously existing registry, or when survey data needs to be matched to administrative data, it is common to use a pre-existing ID variable from that registry or database, e.g. state codes or clinic registry numbers. Note that if these codes are publicly available, the data set created with them will still be personally identifiable, even if all names are deleted.<br />
<br />
In general, it is not recommended to use IDs that people outside the team have access to. It is preferable to create a new, anonymous code. However, there are exceptions to this general rule. Read the [https://dimewiki.worldbank.org/wiki/ID_Variable_Properties#Fifth_property:_Anonymous_IDs Anonymous IDs] article for more information on how to deal with this specific issue.<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[Data Cleaning]]<br />
<br />
<br />
== Additional Resources ==<br />
*[https://projecteuclid.org/download/pdfview_1/euclid.ssu/1296828958 Matthews, Gregory J., and Ofer Harel. "Data confidentiality: A review of methods for statistical disclosure limitation and methods for assessing privacy." Statistics Surveys 5 (2011): 1-29.]<br />
*[http://repository.cmu.edu/jpc/vol2/iss1/7/ Shlomo, Natalie (2010) "Releasing Microdata: Disclosure Risk Estimation, Data Masking and Assessing Utility," Journal of Privacy and Confidentiality: Vol. 2 : Iss. 1 , Article 7. ]<br />
*[https://nces.ed.gov/pubs2011/2011603.pdf Guidelines for Protecting PII from the Institute of Education Sciences]<br />
[[Category: Data Cleaning]] [[Category: Publishing Data]]</div>501238https://dimewiki.worldbank.org/index.php?title=Checklist:_Microdata_Catalog_submission&diff=4393Checklist: Microdata Catalog submission2018-02-12T15:06:38Z<p>501238: </p>
<hr />
<div>Get printable version by clicking on ''printable version'' in the menu to the left. Find instructions for editing the checklist [https://github.com/worldbank/DIMEwiki/tree/master/Topics/Checklists here]. The latest version of this checklist can be found at https://dimewiki.worldbank.org/wiki/Checklist:_Microdata_Catalog_submission. <br />
<br />
For more detailed instructions on sections 1 and 2 of this checklist, see [[Data Cleaning]] and [[De-identification]].<br />
<br />
<div id="chk_microdata"></div><br />
<br />
==Back to Parent == <br />
This article is part of the topic [[Check Lists]]. It's also related to [[Microdata Catalog]]<br />
<br />
<br />
[[Category: Data Cleaning]] [[Category: Check Lists]] [[Category: Publishing Data]]</div>501238https://dimewiki.worldbank.org/index.php?title=Publishing_Data&diff=4392Publishing Data2018-02-12T15:05:41Z<p>501238: /* Preparing data for release */</p>
<hr />
<div>Making data available to other researchers in some form is a key requirement of research transparency and reproducibility. However, it is not generally possible or advisable to release raw data. [[Primary Data Collection | Primary data]] usually contains [[De-identification#Personally Identifiable Information | personally-identifying information (PII)]] such as names, locations, or financial records that are unethical to make public; [[Secondary Data Sources | secondary data]] is often owned by an entity other than the research team and may therefore face legal issues in public release. It is therefore important to structure both data management and analytics so that the published data reproduces the researcher's primary results as closely as possible, and so that the released data is appropriately accessible.<br />
<br />
== Guidelines==<br />
=== Publishing Primary Data ===<br />
<br />
==== Preparing data for release ====<br />
The main issue with releasing primary data is maintaining the privacy of respondents. It is essential to carefully [[De-identification | de-identify]] any sensitive or personally-identifying information contained in the dataset. Datasets released should be easily understandable by users, so [[Data Documentation | documentation]], including variable dictionaries and survey instruments, should be released with the data. See the [[Checklist: Microdata Catalog submission|Microdata Catalog Checklist]] for instructions on how to prepare primary data for release.<br />
<br />
==== DIME data releases ==== <br />
DIME survey data is released through the [[Microdata Catalog]]. However, access to the data may be restricted and some variables may be embargoed prior to publication.<br />
<br />
=== Publishing Analysis Data ===<br />
Some journals require datasets used in [[Data Analysis | data analysis]] to be released when a paper is published. This is intended to make research more [[Research Ethics#Research Transparency | transparent]] and allow readers to [[Reproducible Research | reproduce findings]].<br />
<br />
==== Preparing data for release ====<br />
The objective of the data release is to allow users to reproduce results in the paper. Therefore, the released dataset needs to contain all variables used in [[Data Analysis | data analysis]], as well as all [[ID Variable Properties | identifying variables]]. Analysis datasets should be easily understandable to researchers trying to replicate results, so it is important that they are [[Data_Cleaning#Labels | well-labelled]] and [[Data Documentation | documented]].<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[Publishing Data]]<br />
<br />
== Additional Resources==<br />
<br />
<br />
[[Category: Publishing Data]]</div>501238https://dimewiki.worldbank.org/index.php?title=Microdata_Catalog&diff=4391Microdata Catalog2018-02-12T15:04:24Z<p>501238: </p>
<hr />
<div>The [http://microdata.worldbank.org/index.php/home Microdata Library] is an online platform that offers free access to microdata produced not only by the World Bank, but also by other international organizations, statistical agencies and different actors in developing countries. It includes datasets from surveys implemented as part of impact evaluations and research on development, as well as administrative data.<br />
<br />
== Read first ==<br />
* Data sets published in the Microdata Library are typically survey data. The [[Checklist: Microdata Catalog submission|Microdata Catalog Checklist]] lists data formats, documentation requirements and instructions on how to deposit data<br />
* When submitting data, it is recommended to include as much information about the study and the data as possible. This reduces the number of future queries received from both catalog staff preparing the data and users trying to properly understand the survey process<br />
* We recommend submitting the data as soon as it is collected, so that all relevant information is documented and safely stored, making transitions between team members easier and reducing the risk of not remembering details when analysis is done<br />
* As part of the submission process, it is possible to choose from different access conditions under which the data will be shared. It is also possible to make changes to access terms, as well as to the data, after the initial submission<br />
<br />
== Guidelines for submission ==<br />
Submission to the Microdata Library is done after the initial data cleaning for a round of data collection is finished. That means one impact evaluation may have different rounds published in the catalog, for example baseline, midline and endline. Datasets submitted to the [http://microdata.worldbank.org/index.php/home Microdata Library] must be de-identified and accompanied by data documentation and study description. The [[Checklist: Microdata Catalog submission|Microdata Catalog Checklist]] lists data format and documentation requirements as well as instructions on how to deposit datasets.<br />
<br />
World Bank staff can deposit their data directly to the online [http://microdatalib.worldbank.org/index.php/home Data Deposit Application]. Data can also be deposited by external researchers if necessary. It is recommended that deposits originating from outside the Bank include the names and contact details of the World Bank staff they are working with on their project. It is also recommended that the approver of access to licensed data be a World Bank staff member who can be easily contacted by the Microdata Library team. If the survey is owned by someone other than the World Bank, in addition to the deposit form, documentation is needed from the data provider, signed by an authorized signatory and explicitly authorizing the Microdata Library to disseminate the related survey data and specifying the access type.<br />
<br />
=== Datasets ===<br />
Data may be uploaded in different formats, including Stata, SPSS and SAS, and must be [[De-identification | de-identified]] and minimally [[Data Cleaning | cleaned]]. The data cleaning required aims to provide a clear indication of what information is to be found in any given variable, so both variable and value labels must be present, including [[Data Cleaning#Survey Codes and Missing Values | labels for extended missing values]]. To protect the confidentiality of respondents, all [[De-identification#Personally Identifiable Information | Personally Identifiable Information]] must be removed. Variables containing sensitive information such as PII can be flagged in the "Data Distribution" section to indicate they should not be distributed.<br />
<br />
=== Supporting documents ===<br />
All relevant material that would allow the users to better understand the data and interpret the results should be included. A non-comprehensive list of documents that may be relevant is included below. Note that some of the material in the list below may contain sensitive information (for example in the form of options listed in the questionnaire), so it should also be checked and de-identified.<br />
* Questionnaires (paper format equivalent is better than CAPI form)<br />
* Enumerator manuals<br />
*[[Data Documentation#Field work documentation | Fieldwork documentation ]]<br />
* Methodology description<br />
*[[Data Documentation#Data cleaning documentation | Data cleaning documentation]]<br />
*[[Data Documentation#Construct documentation | Variables construction documentation]], if applicable<br />
* Outputs such as reports, presentations, publications and papers<br />
<br />
=== Study description ===<br />
During submission, it is necessary to fill in a form collecting information on the survey (metadata). Not all fields are mandatory, but providing as much information as possible makes it easier for users to understand and explore the data. This reduces the number of future queries received from both catalogue staff preparing the data and users trying to properly understand the survey process.<br />
<br />
*Mandatory Fields:<br />
** Title<br />
** Country<br />
** Dates of Data Collection<br />
** Access policy<br />
** Catalogue where the data should be published<br />
<br />
* Recommended Fields:<br />
** Abstract<br />
** Geographic Coverage<br />
** Primary Investigator<br />
** Funding<br />
** Sampling Procedure<br />
** Weighting<br />
<br />
=== Access conditions ===<br />
The World Bank Microdata Library disseminates data under the [https://data.worldbank.org/summary-terms-of-use Microdata Terms of Use for the World Bank]. When submitting data, it is possible to indicate whether the datasets should be available only to World Bank staff or to external users. It is also possible to embargo any data submitted for a specified period of time. To protect the confidentiality of individual information and to meet the requirements of the data owners who provide the microdata, there are five principal [http://microdata.worldbank.org/index.php/terms-of-use types of access] that may be applied:<br />
<br />
* '''Open access''': this is the least restrictive access policy. Datasets and the related documentation are available to users for commercial and non-commercial purposes at no cost. There is no need to be logged into the application.<br />
<br />
* '''Direct access''': relevant datasets and the related documentation are made freely available to registered and unregistered users for statistical and scientific research purposes only, and may not be distributed. Any publications employing this type of data must cite the source, in line with the citation requirement provided with the dataset.<br />
<br />
* '''Public Use Files''': PUFs are available to anyone agreeing to respect a core set of easy-to-meet conditions. These data are made easily accessible because the risk of identifying individual respondents or data providers is considered to be low. Terms of use are the same as direct access, but users are required to register before obtaining the data sets.<br />
<br />
* '''Licensed files''': these are files whose dissemination is restricted to bona fide users. Access is granted to authenticated users who have received authorization after submitting a documented application and signing an agreement governing the data's use. These users must be acting on behalf of an organization, which must take responsibility for the use. To release data under this license, a World Bank staff member must be indicated as the point of contact to grant access to the data. That person will be contacted by the data catalogue manager, who works with the team to approve or reject requests.<br />
<br />
* '''External Repositories''': The World Bank Microdata Library operates both as a data catalog for World Bank owned or licensed data and as a portal to data held in a number of external repositories. The aim of the Microdata Library is to provide users with the most comprehensive catalog of development-related microdata possible. To this end, studies conducted and owned by other institutions, as well as links to those studies, are listed in the Microdata Library Catalog. Datasets provided by external agencies are not owned or controlled by the World Bank and have their own conditions of use. When a user accesses external repositories, the terms governing the use of those external repositories shall govern access to their data.<br />
<br />
* '''No access''': some datasets have no access policy defined, or are not accessible. In some limited situations, we may include a limited number of such datasets for the sake of completeness and for the purpose of providing access to questionnaires and reports. Note here that any datasets with no access will not be published on the external facing catalog.<br />
<br />
== Collections available ==<br />
The Microdata Library operates as a portal for datasets originating from the World Bank and other international, regional and national organizations. These contributions make up the Central Microdata Catalog, which can also be viewed and searched by collection. When submitting data to the Catalog, it is necessary to specify in which collection it should be filed. Impact evaluation surveys are filed in the Impact Evaluation Survey Collection, even if treatment variables are temporarily embargoed.<br />
<br />
* '''World Bank catalogs'''<br />
** Global Financial Inclusion (Global Findex) Database<br />
** Service Delivery Facility Surveys<br />
** The STEP Skills Measurement Program<br />
** The World Bank Group Country Opinion Survey Program (COS)<br />
** Development Research Microdata<br />
** Enterprise Surveys<br />
** Impact Evaluation Surveys<br />
** Living Standards Measurement Study (LSMS)<br />
** Migration and Remittances Surveys<br />
<br />
*''' External catalogs'''<br />
**Global Health Data Exchange (GHDx), Institute for Health Metrics and Evaluation (IHME)<br />
**Integrated Public Use Microdata Series (IPUMS) International<br />
**MEASURE DHS: Demographic and Health Surveys<br />
**Millennium Challenge Corporation (MCC)<br />
**UNICEF Multiple Indicator Cluster Surveys (MICS)<br />
**WHO’s Multi-Country Studies Programmes<br />
**DataFirst , University of Cape Town, South Africa<br />
<br />
== Releasing data before publication == <br />
One common concern among researchers is under which conditions to submit data from studies that are still ongoing and whose results have not yet been published. There are several options available.<br />
<br />
First of all, it is best practice to submit the data as soon as it is collected. The review process will guarantee that documentation is submitted, reducing the risk of not remembering important details about how the data was processed, an issue that arises often if the analysis is only carried out when the intervention is completed, or after endline data is collected. Furthermore, once deposited, the data is safely stored, reducing the less likely, but even more worrying, chance of losing any data. Depositing data early on guarantees that transitions between team members are smoother and less information is lost over time.<br />
<br />
The different access conditions can be used to withhold from release any information that may create issues if made public prior to publication. One possibility is to submit a data set and embargo any treatment assignment variables until results are published. This means that these variables will only become available to users after an established date. If this solution is chosen, it is important to indicate in the documentation that such variables have been removed and will be released at a future date, as users expect treatment variables to be present in impact evaluation datasets. It is recommended to add a clause indicating that no conclusions on impact can be drawn until all rounds are published.<br />
<br />
Alternatively, it is also possible to embargo the whole data set, making it "no access". In this case, the metadata will not be made available to the external audience and may not be available to World Bank staff if the embargo applies internally as well. Another option is to first submit the "censored" version of the data, without treatment variables, and update the submission to include all variables after publication. <br />
<br />
== DIME Datasets on Microdata Catalog ==<br />
*[[DIME_Datasets_on_Microdata_Catalog#By_region|By region]]<br />
*[[DIME_Datasets_on_Microdata_Catalog#By_area|By area]]<br />
<br />
<br />
== Additional Resources ==<br />
<br />
[[Category: Publishing Data]]</div>501238https://dimewiki.worldbank.org/index.php?title=Publishing_Data&diff=4390Publishing Data2018-02-12T15:02:09Z<p>501238: /* Publishing Analysis Data */</p>
<hr />
<div>Making data available to other researchers in some form is a key need of research transparency and reproducibility. However, it is not generally possible or advisable to release raw data. [[Primary Data Collection | Primary data]] usually contains [[De-identification#Personally Identifiable Information | personally-identifying information (PII)]] such as names, locations, or financial records that are unethical to make public; [[Secondary Data Sources | secondary data]] is often owned by an entity other than the research team and therefore may face legal issues in public release. It is therefore important to structure both data management and analytics such that the data that is published replicates the researcher's primary results to the best degree possible and that the data that is released is appropriately accessible.<br />
<br />
== Guidelines==<br />
=== Publishing Primary Data ===<br />
<br />
==== Preparing data for release ====<br />
The main issue with releasing primary data is maintaining the privacy of respondents. It is essential to carefully [[De-identification | de-identify]] any sensitive or personally-identifying information contained in the dataset. Datasets released should be easily understandable by users, so [[Data Documentation | documentation]], including variable dictionaries and survey instruments, should be released with the data.<br />
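A minimal sketch of this step in Stata is shown below; the file and variable names are hypothetical and will differ by project.<br />
<br />
''* Load the raw data (hypothetical file name)<br />
use "raw_survey.dta", clear<br />
* Drop direct identifiers before release (hypothetical variable names)<br />
drop respondent_name phone_number gps_latitude gps_longitude<br />
* Confirm the anonymized ID still uniquely identifies each respondent<br />
isid hh_id<br />
save "deidentified_survey.dta", replace''<br />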
<br />
==== DIME data releases ==== <br />
DIME survey data is released through the [[Microdata Catalog]]. However, access to the data may be restricted and some variables may be embargoed prior to publication.<br />
<br />
=== Publishing Analysis Data ===<br />
Some journals require datasets used in [[Data Analysis | data analysis]] to be released when a paper is published. This is intended to make research more [[Research Ethics#Research Transparency | transparent]] and allow readers to [[Reproducible Research | reproduce findings]].<br />
<br />
==== Preparing data for release ====<br />
The objective of the data release is to allow users to reproduce results in the paper. Therefore, the released dataset needs to contain all variables used in [[Data Analysis | data analysis]], as well as all [[ID Variable Properties | identifying variables]]. Analysis datasets should be easily understandable to researchers trying to replicate results, so it is important that they are [[Data_Cleaning#Labels | well-labelled]] and [[Data Documentation | documented]].<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[Publishing Data]]<br />
<br />
== Additional Resources==<br />
<br />
<br />
[[Category: Publishing Data]]</div>501238https://dimewiki.worldbank.org/index.php?title=Publishing_Data&diff=4389Publishing Data2018-02-12T14:56:00Z<p>501238: /* Preparing data for release */</p>
<hr />
<div>Making data available to other researchers in some form is a key requirement of research transparency and reproducibility. However, it is not generally possible or advisable to release raw data. [[Primary Data Collection | Primary data]] usually contains [[De-identification#Personally Identifiable Information | personally-identifying information (PII)]] such as names, locations, or financial records that are unethical to make public; [[Secondary Data Sources | secondary data]] is often owned by an entity other than the research team and therefore may face legal issues in public release. It is therefore important to structure both data management and analysis so that the published data reproduces the researcher's primary results as closely as possible, and so that the released data is appropriately accessible.<br />
<br />
== Guidelines==<br />
=== Publishing Primary Data ===<br />
<br />
==== Preparing data for release ====<br />
The main issue with releasing primary data is maintaining the privacy of respondents. It is essential to carefully [[De-identification | de-identify]] any sensitive or personally-identifying information contained in the dataset. Datasets released should be easily understandable by users, so [[Data Documentation | documentation]], including variable dictionaries and survey instruments, should be released with the data.<br />
<br />
==== DIME data releases ==== <br />
DIME survey data is released through the [[Microdata Catalog]]. However, access to the data may be restricted and some variables may be embargoed prior to publication.<br />
<br />
=== Publishing Analysis Data ===<br />
Some journals require datasets used in [[Data Analysis | data analysis]] to be released when a paper is published. This is intended to make research more transparent and allow readers to [[Reproducible Research | reproduce findings]].<br />
<br />
==== Preparing data for release ====<br />
The objective of the data release is to allow users to reproduce results in the paper. Therefore, the released dataset needs to contain all variables used in [[Data Analysis | data analysis]], as well as all [[ID Variable Properties | identifying variables]].<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[Publishing Data]]<br />
<br />
== Additional Resources==<br />
<br />
<br />
[[Category: Publishing Data]]</div>501238https://dimewiki.worldbank.org/index.php?title=Sampling_%26_Power_Calculations&diff=4382Sampling & Power Calculations2018-02-10T17:05:46Z<p>501238: /* Sotware for Sampling */</p>
<hr />
<div>Creating a statistically valid sample representative of the population of interest for the impact evaluation is a crucial aspect of impact evaluation design. This task can be roughly divided into two phases: sample design and implementation. Implementation typically means writing a software program to enact the sampling strategy. <br />
<br />
<br />
== Read First ==<br />
* To calculate exact sample size, you need to know the effect of the program and the mean and standard deviation of your outcome of interest for both the treatment and the control group. You cannot know these with certainty at the start of an impact evaluation. For this reason, power calculations require estimates and assumptions, and can seem like more of an art than a science. <br />
<br />
* Sampling code requires extra care! Errors cannot be corrected after the intervention (or survey) has started. Always ask a second person to double-check your code before you use the sample it generates in the field. For DIME projects, you should always consult a member of DIME Analytics before sending a sample to the field. Do not randomize the sample from a temporary data set or a data set constructed only for this purpose. Instead, always randomize from a [[Master_Data_Set|Master data set]]. If no master data set exists for the [[Unit_of_Observation|unit of observation]] you are sampling on, then it is very important that you start by creating one.<br />
<br />
== Sampling ==<br />
<br />
=== Sample Size===<br />
Power calculations are a statistical tool to help determine [[Sample Size]]. This is important: a sample that is too small means that you will not be able to detect a statistically significant effect, and a sample that is too large can be a waste of limited resources. <br />
You can estimate either sample size or minimum detectable effect. Which you should estimate depends on the research design and constraints of a specific impact evaluation. The types of questions you can answer through power calculations include:<br />
* Given that I want to be able to statistically distinguish program impact of a 10% change in my outcome of interest, what is the minimum sample size needed?<br />
* Given that I only have budget to sample 1,000 households, what is the minimum effect size that I will be able to distinguish from a null effect? (this is known as [[Minimum Detectable Effect]])<br />
<br />
<br />
Power calculations should be done at the [[Impact Evaluation Design]] stage. They are most typically done using [https://www.stata.com/ Stata] or [http://hlmsoft.net/od/ Optimal Design] (see [[Power Calculations in Optimal Design]] and [[Power Calculations in Stata]]). Power calculations can be used to determine either sample size (using the standard assumption of 80% power) or power (if sample size is constrained). <br />
<br />
Intuition: <br />
[[Media:Sample Size Intuition.png|Summary of Determinants of Sample Size ]]<br />
<br />
=== Sample Design ===<br />
''Population'': What is the population of interest for the impact evaluation? In other words, what population does your sample need to represent? This will vary depending on the study design. Some data on the overall population is required, in order to draw a representative sample.<br />
<br />
''Stratification'': To ensure a representative sample you can use [[Stratified Random Sample|stratification]]. A typical variable to stratify on is gender. When you stratify on gender you guarantee that your sample has the same proportion of women as the population frame you are sampling from, as the sketch below illustrates.<br />
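The snippet below is a minimal sketch of gender stratification in Stata, drawing a 50% sample within each stratum; the data set and variable names are hypothetical.<br />
<br />
''* Draw a 50% sample within each gender stratum (hypothetical names)<br />
set seed 20180210<br />
use "master_households.dta", clear<br />
isid hh_id, sort  // unique IDs and a stable sort order<br />
gen rand = runiform()<br />
bysort gender (rand) : gen sampled = (_n <= _N/2)''<br />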
<br />
=== Sample Selection ===<br />
The most basic sampling technique is a Simple Random Sample. This works well for studies of small populations, with a complete sampling frame for the population. More typically, impact evaluations rely on [[Multi-stage (Cluster) Sampling | multi-stage or clustered sampling]], often with [[Stratified Random Sample|stratification]].<br />
<br />
You should always work from a [[Master_Data_Set|master data set]] of the population (sampling frame). If you do not have a master data set for the [[Unit_of_Observation|unit of observation]] you are sampling from (for example, households, villages, clinics, schools) you should always start by creating one. In the field, this is done by a [[listing]] at the lowest level of clustering possible. If it is impossible to do a listing, an alternative is to do an "on-the-spot" randomization. There are a few different methods here, for example, a ‘random walk’ by enumerators where they spin a bottle to determine a random direction. But without knowing the total number of households this will always be biased towards the households at the center of the village. In addition, it’s hard to monitor whether protocols are adhered to in the field, and there isn’t a systematic way of tracing when replacements were used and how they were established.<br />
<br />
== Software Tools ==<br />
===Software for Sampling===<br />
<br />
All sampling code you produce must be reproducible. Any code that includes randomization needs to set the Stata version, set a seed, and enforce a unique sort order to be reproducible. If [[Randomization in Stata|randomization in Stata]] is feasible, it should always be preferred, since it is more easily reproducible. If using Stata is not an option, it is also possible to use [[Randomization in Excel|Excel for randomization]].<br />
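As a minimal sketch, assuming a hypothetical master data set and sample size, a reproducible Stata randomization combines all three elements:<br />
<br />
''* Version, seed and sort: the three requirements for reproducibility<br />
version 12.1<br />
set seed 654321  // seed chosen in advance, e.g. from random.org<br />
use "master_households.dta", clear<br />
isid hh_id, sort  // errors out unless IDs are unique; enforces a stable sort<br />
gen rand = runiform()<br />
sort rand<br />
gen sampled = (_n <= 500)  // flag the first 500 observations as the sample''<br />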
<br />
===Software for Power Calculations===<br />
[http://www.stata.com/ Stata] is better for [[Reproducible Research|reproducible research]], in that the power calculations are codified in a do file. However, it is less visual and intuitive than [[Power Calculations in Optimal Design|Optimal Design]], and Stata's built-in program for sample size calculations, ''power'', does not allow for corrections for clustering (there are user-written programs to do this, but all have some pitfalls). See [[Power Calculations in Stata]] for details. <br />
<br />
[https://sites.google.com/site/optimaldesignsoftware/home Optimal Design] creates graphs to visualize trade-offs and relationships between the various components of the sample size equation. However, transparency is an issue when using this software: most people simply save the graphs it creates, which can be difficult to replicate later. Other issues with Optimal Design are:<br />
* It cannot calculate power for an individual-level randomization with binary outcome<br />
* It assumes equal mean and variance for treatment and control (for an RCT this is generally okay)<br />
* It only gives you the total number of clusters or sample size, assuming an equal split, whereas you might want to fix the size of your treatment group (say, due to budget constraints) and calculate the control group size<br />
See [[Power Calculations in Optimal Design]] for details.<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[Sampling & Power Calculations]]<br />
<br />
== Additional Resources ==<br />
*[https://www.povertyactionlab.org/sites/default/files/resources/2017.01.11-The-Danger-of-Underpowered-Evaluations.pdf The Danger of Underpowered Evaluations], JPAL North America<br />
* [http://unstats.un.org/unsd/demographic/sources/surveys/Series_F98en.pdf Designing Household Survey Samples: Practical Guidelines] United Nations, Department of Economic and Social Affairs, Statistics Division - 2008<br />
* [http://andrewgelman.com/2017/03/03/yes-makes-sense-design-analysis-power-calculations-data-collected/ Why it makes sense to revisit power calculations after data has been collected], Andrew Gelman<br />
* Development Impact Blog: [http://blogs.worldbank.org/impactevaluations/power-calculations-what-software-should-i-use "Power Calculations: What software should I use?"]<br />
<br />
[[Category: Sampling & Power Calculations ]]</div>501238https://dimewiki.worldbank.org/index.php?title=Sampling_%26_Power_Calculations&diff=4381Sampling & Power Calculations2018-02-10T17:02:10Z<p>501238: /* Randomization in Stata */</p>
<hr />
<div>Creating a statistically valid sample representative of the population of interest for the impact evaluation is a crucial aspect of impact evaluation design. This task can be roughly divided into two phases: sample design and implementation. Implementation typically means writing a software program to enact the sampling strategy. <br />
<br />
<br />
== Read First ==<br />
* To calculate exact sample size, you need to know the effect of the program and the mean and standard deviation of your outcome of interest for both the treatment and the control group. You cannot know these with certainty at the start of an impact evaluation. For this reason, power calculations require estimates and assumptions, and can seem like more of an art than a science. <br />
<br />
* Sampling code requires extra care! Errors cannot be corrected after the intervention (or survey) has started. Always ask a second person to double-check your code before you use the sample it generates in the field. For DIME projects, you should always consult a member of DIME Analytics before sending a sample to the field. Do not randomize the sample from a temporary data set or a data set constructed only for this purpose. Instead, always randomize from a [[Master_Data_Set|Master data set]]. If no master data set exists for the [[Unit_of_Observation|unit of observation]] you are sampling on, then it is very important that you start by creating one.<br />
<br />
== Sampling ==<br />
<br />
=== Sample Size===<br />
Power calculations are a statistical tool to help determine [[Sample Size]]. This is important: a sample that is too small means that you will not be able to detect a statistically significant effect, and a sample that is too large can be a waste of limited resources. <br />
You can estimate either sample size or minimum detectable effect. Which you should estimate depends on the research design and constraints of a specific impact evaluation. The types of questions you can answer through power calculations include:<br />
* Given that I want to be able to statistically distinguish program impact of a 10% change in my outcome of interest, what is the minimum sample size needed?<br />
* Given that I only have budget to sample 1,000 households, what is the minimum effect size that I will be able to distinguish from a null effect? (this is known as [[Minimum Detectable Effect]])<br />
<br />
<br />
Power calculations should be done at the [[Impact Evaluation Design]] stage. They are most typically done using [https://www.stata.com/ Stata] or [http://hlmsoft.net/od/ Optimal Design] (see [[Power Calculations in Optimal Design]] and [[Power Calculations in Stata]]). Power calculations can be used to determine either sample size (using the standard assumption of 80% power) or power (if sample size is constrained). <br />
<br />
Intuition: <br />
[[Media:Sample Size Intuition.png|Summary of Determinants of Sample Size ]]<br />
<br />
=== Sample Design ===<br />
''Population'': What is the population of interest for the impact evaluation? In other words, what population does your sample need to represent? This will vary depending on the study design. Some data on the overall population is required, in order to draw a representative sample.<br />
<br />
''Stratification'': To ensure a representative sample you can use [[Stratified Random Sample|stratification]]. A typical variable to stratify on is gender. When you stratify on gender you guarantee that your sample has the same proportion of women as the population frame you are sampling from.<br />
<br />
=== Sample Selection ===<br />
The most basic sampling technique is a Simple Random Sample. This works well for studies of small populations, with a complete sampling frame for the population. More typically, impact evaluations rely on [[Multi-stage (Cluster) Sampling | multi-stage or clustered sampling]], often with [[Stratified Random Sample|stratification]].<br />
<br />
You should always work from a [[Master_Data_Set|master data set]] of the population (sampling frame). If you do not have a master data set for the [[Unit_of_Observation|unit of observation]] you are sampling from (for example, households, villages, clinics, schools) you should always start by creating one. In the field, this is done by a [[listing]] at the lowest level of clustering possible. If it is impossible to do a listing, an alternative is to do an "on-the-spot" randomization. There are a few different methods here, for example, a ‘random walk’ by enumerators where they spin a bottle to determine a random direction. But without knowing the total number of households this will always be biased towards the households at the center of the village. In addition, it’s hard to monitor whether protocols are adhered to in the field, and there isn’t a systematic way of tracing when replacements were used and how they were established.<br />
<br />
== Software Tools ==<br />
===Software for Sampling===<br />
<br />
===Software for Power Calculations===<br />
[http://www.stata.com/ Stata] is better for [[Reproducible Research|reproducible research]], in that the power calculations are codified in a do file. However, it is less visual and intuitive than [[Power Calculations in Optimal Design|Optimal Design]], and Stata's built-in program for sample size calculations, ''power'', does not allow for corrections for clustering (there are user-written programs to do this, but all have some pitfalls). See [[Power Calculations in Stata]] for details. <br />
<br />
[https://sites.google.com/site/optimaldesignsoftware/home Optimal Design] creates graphs to visualize trade-offs and relationships between the various components of the sample size equation. However, transparency is an issue when using this software: most people simply save the graphs it creates, which can be difficult to replicate later. Other issues with Optimal Design are:<br />
* It cannot calculate power for an individual-level randomization with binary outcome<br />
* It assumes equal mean and variance for treatment and control (for an RCT this is generally okay)<br />
* It only gives you the total number of clusters or sample size, assuming an equal split, whereas you might want to fix the size of your treatment group (say, due to budget constraints) and calculate the control group size<br />
See [[Power Calculations in Optimal Design]] for details.<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[Sampling & Power Calculations]]<br />
<br />
== Additional Resources ==<br />
*[https://www.povertyactionlab.org/sites/default/files/resources/2017.01.11-The-Danger-of-Underpowered-Evaluations.pdf The Danger of Underpowered Evaluations], JPAL North America<br />
* [http://unstats.un.org/unsd/demographic/sources/surveys/Series_F98en.pdf Designing Household Survey Samples: Practical Guidelines] United Nations, Department of Economic and Social Affairs, Statistics Division - 2008<br />
* [http://andrewgelman.com/2017/03/03/yes-makes-sense-design-analysis-power-calculations-data-collected/ Why it makes sense to revisit power calculations after data has been collected], Andrew Gelman<br />
* Development Impact Blog: [http://blogs.worldbank.org/impactevaluations/power-calculations-what-software-should-i-use "Power Calculations: What software should I use?"]<br />
<br />
[[Category: Sampling & Power Calculations ]]</div>501238https://dimewiki.worldbank.org/index.php?title=Sampling_%26_Power_Calculations&diff=4380Sampling & Power Calculations2018-02-10T17:01:22Z<p>501238: /* Power Calculations */</p>
<hr />
<div>Creating a statistically valid sample representative of the population of interest for the impact evaluation is a crucial aspect of impact evaluation design. This task can be roughly divided into two phases: sample design and implementation. Implementation typically means writing a software program to enact the sampling strategy. <br />
<br />
<br />
== Read First ==<br />
* To calculate exact sample size, you need to know the effect of the program and the mean and standard deviation of your outcome of interest for both the treatment and the control group. You cannot know these with certainty at the start of an impact evaluation. For this reason, power calculations require estimates and assumptions, and can seem like more of an art than a science. <br />
<br />
* Sampling code requires extra care! Errors cannot be corrected after the intervention (or survey) has started. Always ask a second person to double-check your code before you use the sample it generates in the field. For DIME projects, you should always consult a member of DIME Analytics before sending a sample to the field. Do not randomize the sample from a temporary data set or a data set constructed only for this purpose. Instead, always randomize from a [[Master_Data_Set|Master data set]]. If no master data set exists for the [[Unit_of_Observation|unit of observation]] you are sampling on, then it is very important that you start by creating one.<br />
<br />
== Sampling ==<br />
<br />
=== Sample Size===<br />
Power calculations are a statistical tool to help determine [[Sample Size]]. This is important: a sample that is too small means that you will not be able to detect a statistically significant effect, and a sample that is too large can be a waste of limited resources. <br />
You can estimate either sample size or minimum detectable effect. Which you should estimate depends on the research design and constraints of a specific impact evaluation. The types of questions you can answer through power calculations include:<br />
* Given that I want to be able to statistically distinguish program impact of a 10% change in my outcome of interest, what is the minimum sample size needed?<br />
* Given that I only have budget to sample 1,000 households, what is the minimum effect size that I will be able to distinguish from a null effect? (this is known as [[Minimum Detectable Effect]])<br />
<br />
<br />
Power calculations should be done at the [[Impact Evaluation Design]] stage. They are most typically done using [https://www.stata.com/ Stata] or [http://hlmsoft.net/od/ Optimal Design] (see [[Power Calculations in Optimal Design]] and [[Power Calculations in Stata]]). Power calculations can be used to determine either sample size (using the standard assumption of 80% power) or power (if sample size is constrained). <br />
<br />
Intuition: <br />
[[Media:Sample Size Intuition.png|Summary of Determinants of Sample Size ]]<br />
<br />
=== Sample Design ===<br />
''Population'': What is the population of interest for the impact evaluation? In other words, what population does your sample need to represent? This will vary depending on the study design. Some data on the overall population is required, in order to draw a representative sample.<br />
<br />
''Stratification'': To ensure a representative sample you can use [[Stratified Random Sample|stratification]]. A typical variable to stratify on is gender. When you stratify on gender you guarantee that your sample has the same proportion of women as the population frame you are sampling from.<br />
<br />
=== Sample Selection ===<br />
The most basic sampling technique is a Simple Random Sample. This works well for studies of small populations, with a complete sampling frame for the population. More typically, impact evaluations rely on [[Multi-stage (Cluster) Sampling | multi-stage or clustered sampling]], often with [[Stratified Random Sample|stratification]].<br />
<br />
You should always work from a [[Master_Data_Set|master data set]] of the population (sampling frame). If you do not have a master data set for the [[Unit_of_Observation|unit of observation]] you are sampling from (for example, households, villages, clinics, schools) you should always start by creating one. In the field, this is done by a [[listing]] at the lowest level of clustering possible. If it is impossible to do a listing, an alternative is to do an "on-the-spot" randomization. There are a few different methods here, for example, a ‘random walk’ by enumerators where they spin a bottle to determine a random direction. But without knowing the total number of households this will always be biased towards the households at the center of the village. In addition, it’s hard to monitor whether protocols are adhered to in the field, and there isn’t a systematic way of tracing when replacements were used and how they were established.<br />
<br />
=== Randomization in Stata ===<br />
All sampling code you produce must be reproducible. Any code that includes randomization needs to set the version, set a seed, and enforce a unique sort order to be reproducible. See [[Randomization in Stata|reproducible randomization in Stata]] for details.<br />
<br />
== Software Tools ==<br />
===Software for Sampling===<br />
<br />
===Software for Power Calculations===<br />
[http://www.stata.com/ Stata] is better for [[Reproducible Research|reproducible research]], in that the power calculations are codified in a do file. However, it is less visual and intuitive than [[Power Calculations in Optimal Design|Optimal Design]], and Stata's built-in program for sample size calculations, ''power'', does not allow for corrections for clustering (there are user-written programs to do this, but all have some pitfalls). See [[Power Calculations in Stata]] for details. <br />
<br />
[https://sites.google.com/site/optimaldesignsoftware/home Optimal Design] creates graphs to visualize trade-offs and relationships between the various components of the sample size equation. However, transparency is an issue when using this software: most people simply save the graphs it creates, which can be difficult to replicate later. Other issues with Optimal Design are:<br />
* It cannot calculate power for an individual-level randomization with binary outcome<br />
* It assumes equal mean and variance for treatment and control (for an RCT this is generally okay)<br />
* It only gives you the total number of clusters or sample size, assuming an equal split, whereas you might want to fix the size of your treatment group (say, due to budget constraints) and calculate the control group size<br />
See [[Power Calculations in Optimal Design]] for details.<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[Sampling & Power Calculations]]<br />
<br />
== Additional Resources ==<br />
*[https://www.povertyactionlab.org/sites/default/files/resources/2017.01.11-The-Danger-of-Underpowered-Evaluations.pdf The Danger of Underpowered Evaluations], JPAL North America<br />
* [http://unstats.un.org/unsd/demographic/sources/surveys/Series_F98en.pdf Designing Household Survey Samples: Practical Guidelines] United Nations, Department of Economic and Social Affairs, Statistics Division - 2008<br />
* [http://andrewgelman.com/2017/03/03/yes-makes-sense-design-analysis-power-calculations-data-collected/ Why it makes sense to revisit power calculations after data has been collected], Andrew Gelman<br />
* Development Impact Blog: [http://blogs.worldbank.org/impactevaluations/power-calculations-what-software-should-i-use "Power Calculations: What software should I use?"]<br />
<br />
[[Category: Sampling & Power Calculations ]]</div>501238https://dimewiki.worldbank.org/index.php?title=Power_Calculations_in_Stata&diff=4379Power Calculations in Stata2018-02-10T17:00:08Z<p>501238: </p>
<hr />
<div>There is no single Stata command that meets all needs for power calculations. This article discusses the pros and cons of the different available options. <br />
<br />
<br />
== Guidelines ==<br />
<br />
=== What data do I need? ===<br />
<br />
You must have: <br />
* Mean and variance for outcome variable for your population <br />
** Typically can assume mean and SD are the same for treatment and control groups if randomized<br />
<br />
* Sample size (assuming you are calculating MDES (δ))<br />
** If individual randomization, number of people/units (n)<br />
** If clustered, number of clusters (k), number of units per cluster (m), intracluster correlation (ICC, ρ) and ideally, variation of cluster size<br />
<br />
* The following standard conventions<br />
** Significance level (α) = 0.05<br />
** Power = 0.80 (i.e., probability of type II error (β) = 0.20)<br />
<br />
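For intuition, these inputs combine into one standard approximation of the minimum detectable effect size for a design with k equal-sized clusters of m units in each arm (a sketch of the relationship, not a substitute for the commands below):<br />
<br />
<math>\delta = \left(z_{1-\alpha/2} + z_{1-\beta}\right)\,\sigma\,\sqrt{\frac{2\left(1 + (m-1)\rho\right)}{km}}</math><br />
<br />
Setting ρ = 0 and m = 1 recovers the individual-level randomization case with n = k units per arm.<br />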
<br />
Ideally, you will also have:<br />
* Baseline correlation of outcome with covariates<br />
** Covariates (individual and/or cluster level) reduce the residual variance of the outcome variable, leading to lower required sample sizes<br />
*** Reducing individual level residual variance is akin to increasing # obs per cluster (bigger effect if ICC low)<br />
*** Reducing cluster level residual variance is akin to increasing # of clusters (bigger effect if ICC and m high)<br />
**If you have baseline data, this is easy to obtain<br />
*** Including baseline autocorrelation will improve power (keep only time invariant portion of variance) <br />
<br />
* Number of follow-up surveys<br />
<br />
* Autocorrelation of the outcome between follow-up rounds<br />
<br />
=== How do I get this data? ===<br />
<br />
You will basically never have the data you need for your exact population of interest at the time when you first do power calculations. <br />
<br />
You will need to use the best available data to estimate values for each parameter. Sources to consider:<br />
* High-quality nationally representative survey (e.g. LSMS)<br />
* Data from DIME IE in same country (or region, if pressed)<br />
* Review the literature – especially published papers on the sector and country. What kind of effects? Summary stats available?<br />
<br />
If you can’t come up with a specific value you feel very confident in, run a few different power calculations with alternate assumptions and create bounded estimates. <br />
<br />
=== Stata Command Options ===<br />
<br />
Quick Reference on options: <br />
<br />
[[File:Power_Calcs_in_Stata_Quick_Reference.png|500px]]<br />
<br />
<br />
==== ''power'' ====<br />
Stata's newest update to power calculations. Introduced with Stata 13, it replaces ''sampsi''.<br />
<br />
Pros<br />
* More flexible in terms of input/output choices<br />
* Better output: more info, graph option<br />
* Automatically saves output to a file<br />
* Can compute sample size of control group given treatment group size (or vice versa)<br />
* Directly calculate MDES<br />
<br />
Cons<br />
* Doesn’t allow for clustering<br />
* No straightforward way to control for repeated measures<br />
* Allows for treatment and control groups of different sizes<br />
<br />
'''When to use?''' Simple randomizations (no clustering)<br />
<br />
Useful options <br />
* ''power onemean'' – assume means same in tmt & control<br />
* ''n'' sample size<br />
* ''n1()'' control group size, ''n2()'' treatment group size<br />
* ''nratio'' ratio of n1/n2, default is 1 (not necessary to specify if you list n1 and n2)<br />
* ''power, table'' outputs results in table format<br />
* ''power, saving(filename, [replace])'' saves results in .dta format<br />
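As a short illustration (the means and standard deviation below are hypothetical):<br />
<br />
''* Required sample size to detect a change from 200 to 210 (sd 30, 80% power)<br />
power twomeans 200 210, sd(30) power(0.8)<br />
<br />
* Minimum detectable effect given a fixed total sample of 1,000 units<br />
power twomeans 200, sd(30) n(1000) power(0.8)''<br />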
<br />
==== ''sampsi'' ====<br />
No longer an officially supported Stata command (it was replaced by ''power''), though it continues to work. <br />
The default is to compute sample size. To compute power, specify n1 or n2. To compare means (not proportions), specify sd1(#) or sd2(#). For repeated measures, sd1(#) or sd2(#) must be specified.<br />
<br />
Pros<br />
* Works with Stata13 or earlier <br />
* Allows repeated measures (multiple follow-ups)<br />
<br />
Cons<br />
* Does not allow clustering<br />
* Have to impute MDES<br />
* Defaults to 90% power (not really a con, but be aware)<br />
<br />
Useful Options<br />
* ''onesample'': use if randomized (assume means the same between treatment and control)<br />
* Sample size<br />
** ''n1(#)'' size of treatment group<br />
** ''n2(#)'' size of control group<br />
** ''ratio()'' n1/n2, default is 1<br />
* Repeated measures<br />
** ''pre'' number of baseline measurements<br />
** ''post'' number of follow-up measurements<br />
** ''r0(#)'' correlation between baseline measures (default r0 = r1)<br />
** ''r1(#)'' correlation between follow-up measures<br />
** ''r01(#)'' correlation between baseline and follow-up<br />
* ''method(post change anova or all)'', default is all<br />
<br />
How to use ''sampsi'' to compute MDES?<br />
* This has to be done through a guess-and-check method: the difference between the baseline mean and the hypothesized mean is the MDES. Compute power using different hypothesized means, aiming for power = 0.8, as in the sketch below.<br />
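A minimal sketch of the guess-and-check loop, with hypothetical values (baseline mean 200, sd 30, 500 observations per arm):<br />
<br />
''* Report power for a range of hypothesized treatment means;<br />
* the smallest difference reaching power = 0.8 is roughly the MDES<br />
foreach alt of numlist 203(1)208 {<br />
    sampsi 200 `alt', n1(500) n2(500) sd(30)<br />
}''<br />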
<br />
''sampclus'' is an add-on to ''sampsi'' that allows for clustering. It must be directly preceded by a ''sampsi'' command. For example:<br />
<br />
''sampsi 200 185, alpha(.01) power(.8) sd(30)''<br />
<br />
''sampclus, obsclus(10) rho(.2)''<br />
<br />
This corrects the sample size and computes the number of clusters from a t-test, adjusting the sample size calculation for 10 observations per cluster and an ICC of 0.2.<br />
<br />
==== ''clsampsi'' ====<br />
<br />
Pros<br />
* Allows for clustering<br />
<br />
Cons<br />
* Have to impute MDES<br />
* Does not allow for repeated measures<br />
* Does not allow for baseline correlation<br />
<br />
Useful options<br />
* ''m(#)'' cluster size in treatment and control assuming equal cluster size in tmt & control<br />
** alternative ''m1(#)'' and ''m2(#)''<br />
* ''k(#)'' number of clusters in tmt and control assuming equal number in tmt & control<br />
** Alternative ''k1(#)'' and ''k2(#)''<br />
* ''sd(#)'' standard deviation assuming same sd in tmt & control<br />
** Alternative ''sd1(#)'' and ''sd2(#)''<br />
* ''rho(#)'' ICC assuming same in tmt & control <br />
** Alternatively ''rho1'' and ''rho2''<br />
* ''sampsi'' determines power of means (or proportion) comparison using the standard sampsi command<br />
* ''varm(#)'' cluster size variation assuming same in tmt & ctl<br />
** only affects power if larger than m(#) and rho(#)>0<br />
<br />
==== ''clustersampsi'' ====<br />
Pros<br />
* Allows for clustering<br />
* Allows for baseline correlations<br />
* Directly calculates MDES<br />
<br />
Cons<br />
* Doesn’t allow for different sized treatment / control groups<br />
* Doesn’t allow for repeated measures<br />
<br />
Useful options<br />
* ''detectabledifference'' calculate MDES<br />
** Alternative options: ''power, samplesize''<br />
** to use ''detectabledifference'' you must specify ''m'', ''k'' and ''mu1''<br />
* ''rho(#)'' ICC<br />
* ''k(#)'' number of clusters in each arm<br />
* ''m(#)'' average cluster size<br />
* ''size_cv(#)'' coefficient of variation of cluster sizes (default is 0). Can be any number greater than 1.<br />
* ''mu1'' mean for tmt (''mu2'' = mean for control)<br />
* ''sd1'' standard deviation for tmt (''sd2'' = standard deviation for control)<br />
* ''base_correl'' correlation btw baseline measurements (or other predictive covariates) and outcome<br />
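Putting these options together, a hypothetical MDES call might look as follows. The syntax is assembled only from the options listed above, so verify it against the command's help file before relying on it:<br />
<br />
''* MDES with 30 clusters of 20 units per arm, ICC 0.15 (hypothetical values)<br />
clustersampsi, detectabledifference k(30) m(20) rho(.15) mu1(200) sd1(30)''<br />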
<br />
== Back to Parent ==<br />
This article is part of the topic [[Sampling & Power Calculations]]<br />
<br />
<br />
== Additional Resources ==<br />
Please add here related articles, including a brief description and link. <br />
<br />
[[Category: Sampling & Power Calculations]]</div>501238https://dimewiki.worldbank.org/index.php?title=Sampling_%26_Power_Calculations&diff=4378Sampling & Power Calculations2018-02-10T16:58:45Z<p>501238: /* Sample Selection */</p>
<hr />
<div>Creating a statistically valid sample representative of the population of interest for the impact evaluation is a crucial aspect of impact evaluation design. This task can be roughly divided into two phases: sample design and implementation. Implementation typically means writing a software program to enact the sampling strategy. <br />
<br />
<br />
== Read First ==<br />
* To calculate exact sample size, you need to know the effect of the program and the mean and standard deviation of your outcome of interest for both the treatment and the control group. You cannot know these with certainty at the start of an impact evaluation. For this reason, power calculations require estimates and assumptions, and can seem like more of an art than a science. <br />
<br />
* Sampling code requires extra care! Errors cannot be corrected after the intervention (or survey) has started. Always ask a second person to double-check your code before you use the sample it generates in the field. For DIME projects, you should always consult a member of DIME Analytics before sending a sample to the field. Do not randomize the sample from a temporary data set or a data set constructed only for this purpose. Instead, always randomize from a [[Master_Data_Set|Master data set]]. If no master data set exists for the [[Unit_of_Observation|unit of observation]] you are sampling on, then it is very important that you start by creating one.<br />
<br />
== Sampling ==<br />
<br />
=== Sample Size===<br />
Power calculations are a statistical tool to help determine [[Sample Size]]. This is important: a sample that is too small means that you will not be able to detect a statistically significant effect, and a sample that is too large can be a waste of limited resources. <br />
You can estimate either sample size or minimum detectable effect. Which you should estimate depends on the research design and constraints of a specific impact evaluation. The types of questions you can answer through power calculations include:<br />
* Given that I want to be able to statistically distinguish program impact of a 10% change in my outcome of interest, what is the minimum sample size needed?<br />
* Given that I only have budget to sample 1,000 households, what is the minimum effect size that I will be able to distinguish from a null effect? (this is known as [[Minimum Detectable Effect]])<br />
<br />
<br />
Power calculations should be done at the [[Impact Evaluation Design]] stage. They are most typically done using [https://www.stata.com/ Stata] or [http://hlmsoft.net/od/ Optimal Design] (see [[Power Calculations in Optimal Design]] and [[Power Calculations in Stata]]). Power calculations can be used to determine either sample size (using the standard assumption of 80% power) or power (if sample size is constrained). <br />
<br />
Intuition: <br />
[[Media:Sample Size Intuition.png|Summary of Determinants of Sample Size ]]<br />
<br />
=== Sample Design ===<br />
''Population'': What is the population of interest for the impact evaluation? In other words, what population does your sample need to represent? This will vary depending on the study design. Some data on the overall population is required, in order to draw a representative sample.<br />
<br />
''Stratification'': To ensure a representative sample you can use [[Stratified Random Sample|stratification]]. A typical variable to stratify on is gender. When you stratify on gender you guarantee that your sample has the same proportion of women as the population frame you are sampling from.<br />
<br />
=== Sample Selection ===<br />
The most basic sampling technique is a Simple Random Sample. This works well for studies of small populations, with a complete sampling frame for the population. More typically, impact evaluations rely on [[Multi-stage (Cluster) Sampling | multi-stage or clustered sampling]], often with [[Stratified Random Sample|stratification]].<br />
<br />
You should always work from a [[Master_Data_Set|master data set]] of the population (sampling frame). If you do not have a master data set for the [[Unit_of_Observation|unit of observation]] you are sampling from (for example, households, villages, clinics, schools) you should always start by creating one. In the field, this is done by a [[listing]] at the lowest level of clustering possible. If it is impossible to do a listing, an alternative is to do an "on-the-spot" randomization. There are a few different methods here, for example, a ‘random walk’ by enumerators where they spin a bottle to determine a random direction. But without knowing the total number of households this will always be biased towards the households at the center of the village. In addition, it’s hard to monitor whether protocols are adhered to in the field, and there isn’t a systematic way of tracing when replacements were used and how they were established.<br />
<br />
=== Randomization in Stata ===<br />
All sampling code you produce must be reproducible. Any code that includes randomization needs to set the version, set a seed, and enforce a unique sort order to be reproducible. See [[Randomization in Stata|reproducible randomization in Stata]] for details.<br />
<br />
== Power Calculations ==<br />
<br />
===Software for Power Calculations===<br />
[http://www.stata.com/ Stata] is better for [[Reproducible Research|reproducible research]], in that the power calculations are codified in a do file. However, it is less visual and intuitive than [[Power Calculations in Optimal Design|Optimal Design]], and Stata's built-in program for sample size calculations, ''power'', does not allow for corrections for clustering (there are user-written programs to do this, but all have some pitfalls). See [[Power Calculations in Stata]] for details. <br />
<br />
[https://sites.google.com/site/optimaldesignsoftware/home Optimal Design] creates graphs to visualize trade-offs and relationships between the various components of the sample size equation. However, transparency is an issue when using this software: most people simply save the graphs it creates, which can be difficult to replicate later. Other issues with Optimal Design are:<br />
* It cannot calculate power for an individual-level randomization with binary outcome<br />
* It assumes equal mean and variance for treatment and control (for an RCT this is generally okay)<br />
* It only gives you the total number of clusters or sample size, assuming an equal split, whereas you might want to fix the size of your treatment group (say, due to budget constraints) and calculate the control group size<br />
See [[Power Calculations in Optimal Design]] for details. <br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[Sampling & Power Calculations]]<br />
<br />
== Additional Resources ==<br />
*[https://www.povertyactionlab.org/sites/default/files/resources/2017.01.11-The-Danger-of-Underpowered-Evaluations.pdf The Danger of Underpowered Evaluations], JPAL North America<br />
* [http://unstats.un.org/unsd/demographic/sources/surveys/Series_F98en.pdf Designing Household Survey Samples: Practical Guidelines] United Nations, Department of Economic and Social Affairs, Statistics Division - 2008<br />
* [http://andrewgelman.com/2017/03/03/yes-makes-sense-design-analysis-power-calculations-data-collected/ Why it makes sense to revisit power calculations after data has been collected], Andrew Gelman<br />
* Development Impact Blog: [http://blogs.worldbank.org/impactevaluations/power-calculations-what-software-should-i-use "Power Calculations: What software should I use?"]<br />
<br />
[[Category: Sampling & Power Calculations ]]</div>501238https://dimewiki.worldbank.org/index.php?title=Sampling_%26_Power_Calculations&diff=4377Sampling & Power Calculations2018-02-10T16:54:26Z<p>501238: </p>
<hr />
<div>Creating a statistically valid sample representative of the population of interest for the impact evaluation is a crucial aspect of impact evaluation design. This task can be roughly divided into two phases: sample design and implementation. Implementation typically means writing a software program to enact the sampling strategy. <br />
<br />
<br />
== Read First ==<br />
* To calculate exact sample size, you need to know the effect of the program and the mean and standard deviation of your outcome of interest for both the treatment and the control group. You cannot know these with certainty at the start of an impact evaluation. For this reason, power calculations require estimates and assumptions, and can seem like more of an art than a science. <br />
<br />
* Sampling code requires extra care! Errors cannot be corrected after the intervention (or survey) has started. Always ask a second person to double-check your code before you use the sample it generates in the field. For DIME projects, you should always consult a member of DIME Analytics before sending a sample to the field. Do not randomize the sample from a temporary data set or a data set constructed only for this purpose. Instead, always randomize from a [[Master_Data_Set|Master data set]]. If no master data set exists for the [[Unit_of_Observation|unit of observation]] you are sampling on, then it is very important that you start by creating one.<br />
<br />
== Sampling ==<br />
<br />
=== Sample Size===<br />
Power calculations are a statistical tool to help determine [[Sample Size]]. This is important: a sample that is too small means that you will not be able to detect a statistically significant effect, and a sample that is too large can be a waste of limited resources. <br />
You can estimate either sample size or minimum detectable effect. Which you should estimate depends on the research design and constraints of a specific impact evaluation. The types of questions you can answer through power calculations include:<br />
* Given that I want to be able to statistically distinguish program impact of a 10% change in my outcome of interest, what is the minimum sample size needed?<br />
* Given that I only have budget to sample 1,000 households, what is the minimum effect size that I will be able to distinguish from a null effect? (this is known as [[Minimum Detectable Effect]])<br />
<br />
<br />
Power calculations should be done at the [[Impact Evaluation Design]] stage. They are most typically done using [https://www.stata.com/ Stata] or [http://hlmsoft.net/od/ Optimal Design] (see [[Power Calculations in Optimal Design]] and [[Power Calculations in Stata]]). Power calculations can be used to determine either sample size (using the standard assumption of 80% power) or power (if sample size is constrained). <br />
<br />
Intuition: <br />
[[Media:Sample Size Intuition.png|Summary of Determinants of Sample Size ]]<br />
<br />
=== Sample Design ===<br />
''Population'': What is the population of interest for the impact evaluation? In other words, what population does your sample need to represent? This will vary depending on the study design. Some data on the overall population is required, in order to draw a representative sample.<br />
<br />
''Stratification'': To ensure a representative sample you can use [[Stratified Random Sample|stratification]]. A typical variable to stratify on is gender. When you stratify on gender you guarantee that your sample has the same proportion of women as the population frame you are sampling from.<br />
<br />
=== Sample Selection ===<br />
The most basic sampling technique is a Simple Random Sample. This works well for studies of small populations, with a complete sampling frame for the population. More typically, impact evaluations rely on [[Multi-stage (Cluster) Sampling|Multi-Stage]] or [[Multi-stage (Cluster) Sampling|Clustered]] Sampling, often with [[Stratified Random Sample|stratification]].<br />
<br />
You should always work from a [[Master_Data_Set|master data set]] of the population (sampling frame). If you do not have a master data set for the [[Unit_of_Observation|unit of observation]] you are sampling from (for example, households, villages, clinics, schools) you should always start by creating one. In the field, this is done by a [[listing]] at the lowest level of clustering possible. If it is impossible to do a listing, an alternative is to do an "on-the-spot" randomization. There are a few different methods here, for example, a ‘random walk’ by enumerators where they spin a bottle to determine a random direction. But without knowing the total number of households this will always be biased towards the households at the center of the village. In addition, it’s hard to monitor whether protocols are adhered to in the field, and there isn’t a systematic way of tracing when replacements were used and how they were established.<br />
<br />
=== Randomization in Stata ===<br />
All sampling code you produce must be reproducible. Any code that includes randomization needs to set the version, set a seed, and enforce a unique sort order to be reproducible. See [[Randomization in Stata|reproducible randomization in Stata]] for details.<br />
<br />
== Power Calculations ==<br />
<br />
===Software for Power Calculations===<br />
[http://www.stata.com/ Stata] is better for [[Reproducible Research|reproducible research]], in that the power calculations are codified in a do file. However, it is less visual and intuitive than [[Power Calculations in Optimal Design|Optimal Design]], and Stata's built-in program for sample size calculations, ''power'', does not allow for corrections for clustering (there are user-written programs to do this, but all have some pitfalls). See [[Power Calculations in Stata]] for details. <br />
<br />
[https://sites.google.com/site/optimaldesignsoftware/home Optimal Design] creates graphs to visualize trade-offs and relationships between the various components of the sample size equation. However, transparency is an issue when using this software: most people simply save the graphs it creates, which can be difficult to replicate later. Other issues with Optimal Design are:<br />
* It cannot calculate power for an individual-level randomization with binary outcome<br />
* It assumes equal mean and variance for treatment and control (for an RCT this is generally okay)<br />
* It only gives you the total number of clusters or sample size, assuming an equal split, whereas you might want to fix the size of your treatment group (say, due to budget constraints) and calculate the control group size<br />
See [[Power Calculations in Optimal Design]] for details. <br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[Sampling & Power Calculations]]<br />
<br />
== Additional Resources ==<br />
*[https://www.povertyactionlab.org/sites/default/files/resources/2017.01.11-The-Danger-of-Underpowered-Evaluations.pdf The Danger of Underpowered Evaluations], JPAL North America<br />
* [http://unstats.un.org/unsd/demographic/sources/surveys/Series_F98en.pdf Designing Household Survey Samples: Practical Guidelines] United Nations, Department of Economic and Social Affairs, Statistics Division - 2008<br />
* [http://andrewgelman.com/2017/03/03/yes-makes-sense-design-analysis-power-calculations-data-collected/ Why it makes sense to revisit power calculations after data has been collected], Andrew Gelman<br />
* Development Impact Blog: [http://blogs.worldbank.org/impactevaluations/power-calculations-what-software-should-i-use "Power Calculations: What software should I use?"]<br />
<br />
[[Category: Sampling & Power Calculations ]]</div>501238https://dimewiki.worldbank.org/index.php?title=Unit_of_Observation&diff=4376Unit of Observation2018-02-10T16:31:59Z<p>501238: /* Methods to confirm the Unit of Observation in a data set */</p>
<hr />
<div>While the specific term ''Unit of Observation'' is not always well known, it is a concept that everyone who works with data has come across. Having an exact understanding of this concept, getting into the habit of thinking about your data sets in terms of their unit of observation, and organizing your data sets and your project folder accordingly are key to efficient data work. Mistakes related to this concept are more common than one would expect, and those mistakes will bias your analysis.<br />
<br />
== Read First ==<br />
* Never trust the file name by itself as an indicator of what the unit of observation is. Always perform some tests to convince yourself of the unit of observation.<br />
* A data set always has exactly one unit of observation. It is incorrect to include two different units of observation in a single data set.<br />
<br />
==Definition==<br />
<br />
The ''unit of observation'' is the who or what about which data is collected in a survey, or the who or what that is being studied in an analysis. In a data set, this is represented by a row. ''Unit of observation'' refers to the category, type or classification that each who or what belongs to, not to the specific people or objects included. The term ''unit of analysis'' is synonymous with ''unit of observation'' in the context of analysis, but it is used slightly differently in the context of data collection. For example, if data is collected on both students and schools, but the analysis focuses only on students, then both schools and students are units of observation, but schools would rarely be referred to as the unit of analysis in this case.<br />
<br />
In many cases, there is little risk of confusion in terms of the ''unit of observation'', but errors due to an unclear understanding of it are more common than one might first think. Just as distance data does not make sense unless we know whether it is measured in miles or kilometers, we need to know the unit of our data set. We often have a good idea what the ''unit of observation'' is at first glance at a data set, but do not trust this impression: always test that your assumption is correct, and make sure you are certain beyond any reasonable doubt before working with the data set.<br />
<br />
A data set is always incorrectly constructed if it has more than one ''unit of observation''. Even if the two units of observation have the same variables, including them in the same data set is incorrect, bad practice, and a huge source of error. All such data sets should be separated into two data sets.<br />
<br />
===Methods to confirm the Unit of Observation in a data set===<br />
<br />
The first time you use a data set you have not created yourself, you should always start by making sure that you have no doubt what the ''unit of observation'' is. You often get this information from the name of the data file, but you should always test it before believing it. The most obvious way to make sure you know the unit of observation is to ask the person who sent you the data set, but the rest of this section assumes that, for whatever reason, you cannot confirm the ''unit of observation'' that easily.<br />
<br />
If you open a data set for which you have good reason to believe the ''unit of observation'' is, for example, the household, then look for a household ID variable and test whether it [[ID Variable Properties|uniquely and fully identifies]] the data set. If this is the case, then you are done. However, if you do not find such a variable, you will have to find other information that uniquely and fully identifies the data set. For example, in this case, you would look for variables containing the household head's name. Test whether this variable uniquely identifies all observations. Names are often not unique across a country, so you might have to add region name and village name to the test.<br />
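A minimal sketch of these tests in Stata, with hypothetical variable names:<br />
<br />
''* Errors out unless hh_id is never missing and uniquely identifies the rows<br />
isid hh_id<br />
<br />
* No ID variable? Test a combination of identifying fields instead<br />
duplicates report region_name village_name head_name''<br />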
<br />
==Usages other than in data sets==<br />
<br />
The examples below all have many similarities to how ''unit of observation'' is used in the context of a data set. They are included to give further explanation to the concept or highlight small differences in usage.<br />
<br />
===Regressions===<br />
The unit of observation in a regression is what the N (or number of observations) represents. That is very much related to how the concept is used in the data set, as the N is the number of rows from the data set included in the regression. Interpreting a regression correctly therefore depends on understanding the ''unit of observation''. In most cases this is trivial, but we have had issues where regressions on monitoring data were misinterpreted because the data was believed to have households as the unit of observation, while it actually was packages distributed to households. Since the vast majority of households received only one package each, this mistake was easier to make than it might first seem.<br />
<br />
Note that some regressions collapse your data set, so the unit of observation in the regression is different from the unit of observation in your data set. This is one example where the unit of observation cannot be described as a row in the data set.<br />
<br />
===Surveys===<br />
The concept of unit of observation can also be used to describe, for example, surveys. The unit of observation in a survey is the type of respondent: for example, household, company or school. In the cases of a company and a school, the respondent is a person, for example the CEO or the principal, but they provide answers about the company or the school. If they were asked questions about themselves, then the ''unit of observation'' would be CEOs and principals.<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[Data Management]]<br />
<br />
<br />
== Additional Resources ==<br />
Please add here related articles, including a brief description and link. <br />
<br />
[[Category: Data Management ]]</div>501238https://dimewiki.worldbank.org/index.php?title=Unit_of_Observation&diff=4375Unit of Observation2018-02-10T16:30:21Z<p>501238: /* Methods to confirm the Unit of Observation in a data set */</p>
<hr />
<div>While the specific term ''Unit of Observation'' is not always well known, it is a concept that everyone who works with data has come across. Having an exact understanding of this concept, getting into the habit of thinking about your data sets in terms of their unit of observation, and organizing your data sets and your project folder accordingly are key to efficient data work. Mistakes related to this concept are more common than one would expect, and those mistakes will bias your analysis.<br />
<br />
== Read First ==<br />
* Never trust the file name by itself as an indicator of what the unit of observation is. Always perform some tests to convince yourself of the unit of observation.<br />
* A data set should always have exactly one unit of observation. It is incorrect to include two different units of observation in a single data set.<br />
<br />
==Definition==<br />
<br />
The ''unit of observation'' is the who or what about which data is collected in a survey, or the who or what that is being studied in an analysis. In a data set, each unit is represented by a row. ''Unit of observation'' refers to the category, type or classification that each who or what belongs to, not to the specific people or objects included. The term ''unit of analysis'' is synonymous with ''unit of observation'' in the context of analysis, but it is used slightly differently in the context of data collection. For example, if data is collected on both students and schools, but the analysis only focuses on students, then both schools and students are units of observation, but schools would rarely be referred to as the ''unit of analysis'' in this case.<br />
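<br />
As a toy illustration, the Stata sketch below creates a made-up data set whose ''unit of observation'' is the household: each row is one household, identified by the hypothetical variable hh_id.<br />
<pre>
* A toy data set whose unit of observation is the household.
* Every name and value below is a made-up example.
clear
input hh_id str10 village income
1 "Kibaha"   1200
2 "Kibaha"    800
3 "Bagamoyo"  950
end
list
</pre>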
<br />
In many cases, there is little risk of confusion in terms of the ''unit of observation'', but errors due to an unclear understanding of it are more common than one might first think. Just as distance data does not make sense unless we know whether it is measured in miles or kilometers, we need to know the unit of our data set. We often have a good idea what the ''unit of observation'' is at first glance of a data set, but do not trust that first impression: always test that your assumption is correct, beyond any reasonable doubt, before working with the data set.<br />
<br />
A data set is always incorrectly constructed if it contains more than one ''unit of observation''. Even if the two units of observation have the same variables, including them in the same data set is incorrect, bad practice, and a huge source of error. All such data sets should be separated into two data sets.<br />
<br />
===Methods to confirm the Unit of Observation in a data set===<br />
<br />
The first time you use a data set you have not created yourself, you should always start by making sure that you have no doubt what the ''unit of observation'' is. You often get this information from the name of the data file, but you should always test it before believing it. The most obvious method is to ask the person who sent you the data set, but the rest of this section assumes that, for whatever reason, you cannot confirm the ''unit of observation'' that easily.<br />
<br />
If you open up a data set for which you have good reason to believe the ''unit of observation'' is, for example, household, then look for a household ID variable and test whether it is [[ID Variable Properties|uniquely and fully identifying ]] the data set (see that page for instructions). If it is, then you are done. However, if you do not find such a variable, you will have to find other information that uniquely and fully identifies the data set. For example, in this case, you would look for variables with information on the household head's name. Test whether this variable uniquely identifies all observations. Names are often not unique across a country, so you might have to add region name and village name to the test.<br />
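<br />
One hedged way to run this test in Stata, with hypothetical file and variable names, uses the duplicates command to report how many times each value of the candidate ID variable occurs:<br />
<pre>
* Report whether any value of the candidate ID variable occurs more
* than once. All file and variable names are hypothetical examples.
use "household_data.dta", clear
duplicates report hh_id

* "Fully identifying" also requires that the ID is never missing
assert !missing(hh_id)
</pre>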
<br />
==Usages other than in data sets==<br />
<br />
The examples below all relate closely to how ''unit of observation'' is used in the context of a data set. They are included to further explain the concept or to highlight small differences in usage.<br />
<br />
===Regressions===<br />
The unit of observation in a regression is what the N (the number of observations) represents. This is closely related to how the concept is used in a data set, as N is the number of rows from the data set included in the regression. Interpreting the regression correctly therefore depends on understanding the ''unit of observation''. In most cases this is trivial, but we have had issues where regressions were misinterpreted because monitoring data was believed to have the unit of observation "households" when it actually was "packages distributed to households". Since the vast majority of households received only one package each, this mistake was easier to make than it might first seem.<br />
<br />
Note that some regressions collapse your data set, so that the unit of observation in the regression is different from the unit of observation in your data set. This is one example of when the unit of observation cannot be described as a row in a data set.<br />
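<br />
The hedged Stata sketch below (file and variable names hypothetical) shows a simple habit that catches such mistakes: check which variable identifies a row, and compare the row count with the N the regression reports:<br />
<pre>
* Check what one row means before interpreting N in a regression.
* All file and variable names below are hypothetical examples.
use "monitoring_data.dta", clear
isid package_id               // a row is a package, not a household
count                         // total number of rows in the data set

regress outcome treatment
display e(N)                  // observations actually used in the regression
</pre>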
<br />
===Surveys===<br />
The concept of unit of observation can also be used to describe, for example, surveys. The unit of observation in a survey is the type of respondent: for example, household, company or school. In the cases of a company and a school, the respondent is a person, for example the CEO or the principal, but they provide answers about the company or the school. If they were asked questions about themselves, then the ''unit of observation'' would be CEOs and principals.<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[Data Management]]<br />
<br />
<br />
== Additional Resources ==<br />
Please add here related articles, including a brief description and link. <br />
<br />
[[Category: Data Management ]]</div>501238https://dimewiki.worldbank.org/index.php?title=Research_Ethics&diff=4374Research Ethics2018-02-10T16:03:55Z<p>501238: /* Research Transparency */</p>
<hr />
<div>Impact evaluations often involve the direct manipulation of people's personal or economic situations, collection of [[De-identification#Personally-Identifying Information | personal and/or sensitive data]] about people, and publication of results that have direct implications for political or economic governance. Ensuring that these tasks are undertaken in a way that is both protective of the individuals who are part of the study population and broadly ethical for the research question and context is a critical responsibility of research designers. It is also a requirement for institutional, [[Human_Subjects_Approval#IRB_Approval | IRB]], and government approval and support of any study, and should be set out in the [[Pre-Analysis Plan | pre-analysis plan]].<br />
<br />
== Research with Human Subjects ==<br />
<br />
Any research that involves economic intervention or [[Primary Data Collection | data collection]] on specific individuals is almost certainly subject to [[Human Subjects Approval | human subjects]] ethics rules. This means that pre-approval by an [[Human_Subjects_Approval#IRB_Approval | institutional review board]] is required, as well as a [https://humansubjects.nih.gov/requirement-education human subjects education certificate] from the NIH or another body for each researcher or assistant handling implementation or data.<br />
<br />
In practice, this may or may not require [[Human_Subjects_Approval#Informed Consent | informed consent]] from individual research participants, depending on the design and purpose of the study. For example, an [[Human_Subjects_Approval#IRB_Approval | IRB]] may grant approval to collect administrative data or health care provider data for public health reasons, given written consent from the appropriate government ministry or office. Similarly, when the information to be collected is not especially sensitive and affirmative consent might endanger the feasibility of the study, it may be possible to waive individualized consent.<br />
<br />
== Handling Personally-Identifying Information (PII) ==<br />
<br />
Whether or not individualized informed consent is required, research that involves the collection of sensitive information – including but not limited to names, addresses, mobile phone numbers, bank or credit accounts, or location information – should be handled from collection to publication in a way that ensures the privacy of research participants. This means using appropriately secure electronic methods to collect and store data, appropriate [[De-identification#Folder Encryption | data encryption]] on devices like laptops or hard drives, and [[De-identification#De-identification | anonymization]] of data before any [[Publishing Data | public release]].<br />
<br />
== Research Transparency ==<br />
One common concern in research is the possibility of manipulating results. Political factors, [[Publication Bias | publication bias]] and other circumstances may pressure researchers to target findings, for example through [[P-Hacking | P-hacking]] and [[Selective Reporting | selective reporting]]. To address this concern, researchers often choose to develop [[Pre-Analysis Plan | pre-analysis plans]] and [[Pre-Registration | pre-register]] their studies. Sharing [[Publishing Data | data]] and code improves [[Reproducible Research | research reproducibility]].</div>501238https://dimewiki.worldbank.org/index.php?title=Reproducible_Research&diff=4373Reproducible Research2018-02-10T00:04:34Z<p>501238: </p>
<hr />
<div>In most scientific fields, results are validated through replication: different scientists run the same experiment independently in different samples and find similar conclusions. That standard is not always feasible in development research. More often than not, the phenomena we analyze cannot be artificially re-created. Even in the case of field experiments, different populations can respond differently to a treatment, and the costs involved are high.<br />
<br />
Even in such cases, however, we should still require reproducibility: different researchers running the same analysis on the same data should find the same results. That may seem obvious, but unfortunately it is not as widely observed as we would like. The bottom line of research reproducibility is that the path used to get to your results is as much a research output as the results themselves, making the research process fully transparent. This means that not only should the final findings be made available by researchers, but the data, code and documentation are also of great relevance to the public.<br />
<br />
== Code replication ==<br />
*Git is free version-control software. Files are stored in Git repositories, most commonly on [https://github.com/ GitHub]. To learn GitHub, there is an [https://services.github.com/on-demand/intro-to-github/ introductory training] available through GitHub Services, and multiple tutorials available through [https://guides.github.com/ GitHub Guides].<br />
<br />
== Data publication ==<br />
<br />
== Dynamic documents == <br />
*R Markdown is a widely adopted tool for creating fully reproducible documents. It allows users to write text and code simultaneously, running analyses in different programming languages and printing results in the final document along with the text. Stata 15 also allows users to create dynamic documents using dyndoc (see the sketch after this list). <br />
<br />
*[http://jupyter.org/ Jupyter Notebook] is used to create and share code in different programming languages, including Python, R, Julia, and Scala. It can also create dynamic documents in HTML, LaTeX and other formats.<br />
<br />
*LaTeX is another widely used tool in the scientific community. It is a typesetting system that allows users to reference code outputs such as tables and graphs so that they can be easily updated in a text document. Overleaf is a web-based platform for collaborating on TeX documents.<br />
<br />
* The Open Science Framework is a web-based project management platform that combines registration, data storage (through Dropbox, Box, Google Drive and other platforms), code version control (through GitHub) and document composition (through Overleaf).<br />
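<br />
As a minimal illustration of the dyndoc workflow mentioned in the first bullet above (the file report.txt and the data set are hypothetical examples), text and Stata code live in one source file, and running dyndoc report.txt executes the code while building the document:<br />
<pre>
<!-- report.txt: a minimal dynamic document, built with: dyndoc report.txt -->
# Household income summary

<<dd_do>>
use "household_data.dta", clear
summarize income
<</dd_do>>
</pre>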
<br />
== Additional Resources ==<br />
From Data Colada: <br />
*[http://datacolada.org/69 8 tips to make open research more findable and understandable]<br />
<br />
From the Abdul Latif Jameel Poverty Action Lab (J-PAL)<br />
* [https://www.povertyactionlab.org/research-resources/transparency-and-reproducibility Transparency and Reproducibility]<br />
<br />
From Innovations for Poverty Action (IPA)<br />
* [http://www.poverty-action.org/sites/default/files/publications/IPA%27s%20Best%20Practices%20for%20Data%20and%20Code%20Management_Nov2015.pdf Reproducible Research: Best Practices for Data and Code Management] <br />
* [http://www.poverty-action.org/sites/default/files/Guidelines-for-data-publication.pdf Guidelines for data publication]<br />
* [https://dataverse.harvard.edu/dataverse/socialsciencercts Randomized Control Trials in the Social Science Dataverse]<br />
<br />
Center for Open Science<br />
* [https://cos.io/our-services/top-guidelines/ Transparency and Openness Guidelines], summarized in a [https://osf.io/pvf56/?_ga=1.225140506.1057649246.1484691980 1-Page Handout]<br />
<br />
Berkeley Initiative for Transparency in the Social Sciences<br />
* [http://www.bitss.org/education/manual-of-best-practices/ Manual of Best Practices in Transparent Social Science Research]<br />
<br />
Reproducible Research in R<br />
* [https://www.coursera.org/learn/reproducible-research Johns Hopkins' Online Course on Reproducible Research]<br />
<br />
Reproducible Research in Stata<br />
* [https://huapeng01016.github.io/reptalk/#/hua-pengstatacorphpeng Incorporating Stata into reproducible documents ]<br />
<br />
[[Category: Reproducible Research]]<br />
<br />
</div>501238https://dimewiki.worldbank.org/index.php?title=Sampling_%26_Power_Calculations&diff=4372Sampling & Power Calculations2018-02-09T23:59:04Z<p>501238: /* Read First */</p>
<hr />
<div>Creating a statistically valid sample that is representative of the population of interest is a crucial aspect of impact evaluation design. This task can be roughly divided into two phases: sample design and implementation. Implementation typically means writing a software program to enact the sampling strategy. <br />
<br />
<br />
== Read First ==<br />
* To calculate the exact sample size, you need to know the effect of the program and the mean and standard deviation of your outcome of interest for both the treatment and the control group. You cannot know these with certainty at the start of an impact evaluation. For this reason, power calculations require estimates and assumptions, and can seem like more of an art than a science. <br />
<br />
* Sampling code requires extra care! Errors cannot be corrected after the intervention (or survey) has started. Always ask a second person to double-check your code before you use the sample it generated in the field. For DIME projects, you should always consult a member of DIME Analytics before sending a sample to the field. Do not randomize the sample from a temporary data set or a data set constructed only for this purpose. Instead, always randomize from a [[Master_Data_Set|master data set]]. If no master data set exists for the [[Unit_of_Observation|unit of observation]] you are sampling on, then it is very important that you start by creating one.<br />
<br />
== Guidelines ==<br />
<br />
=== Sample Size===<br />
Power calculations are a statistical tool to help determine [[Sample Size]]. This is important: a sample that is too small means that you will not be able to detect a statistically significant effect, and a sample that is too large can be a waste of limited resources. <br />
You can estimate either sample size or minimum detectable effect. Which you should estimate depends on the research design and constraints of a specific impact evaluation. The types of questions you can answer through power calculations include:<br />
* Given that I want to be able to statistically distinguish a program impact of a 10% change in my outcome of interest, what is the minimum sample size needed?<br />
* Given that I only have the budget to sample 1,000 households, what is the minimum effect size that I will be able to distinguish from a null effect? (This is known as the [[Minimum Detectable Effect]].)<br />
<br />
<br />
Power calculations should be done at the [[Impact Evaluation Design]] stage. They are most typically done using [https://www.stata.com/ Stata] or [http://hlmsoft.net/od/ Optimal Design] (see [[Power Calculations in Optimal Design]] and [[Power Calculations in Stata]]). Power calculations can be used to determine either sample size (using the standard assumption of 80% power) or power (if sample size is constrained). <br />
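<br />
For example, a hedged Stata sketch using the built-in power command, where every number is an illustrative assumption rather than a recommendation:<br />
<pre>
* Required sample size per group to detect a change in the outcome mean
* from 0.50 (control) to 0.55 (treatment), assuming a standard deviation
* of 0.25, 80% power and 5% significance. All numbers are illustrative.
power twomeans 0.50 0.55, sd(0.25) power(0.8) alpha(0.05)
</pre>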
<br />
Intuition: <br />
[[Media:Sample Size Intuition.png|Summary of Determinants of Sample Size ]]<br />
<br />
=== Sample Design ===<br />
''Population'': What is the population of interest for the impact evaluation? In other words, what population does your sample need to represent? This will vary depending on the study design. Some data on the overall population is required, in order to draw a representative sample.<br />
<br />
''Stratification'': To ensure a representative sample you can use [[Stratified Random Sample|stratification]]. A typical variable to stratify on is gender: when you stratify on gender, you guarantee that your sample has the same proportion of women as the sampling frame you are sampling from.<br />
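<br />
A minimal sketch of a gender-stratified draw in Stata, where all file and variable names and the 10% sampling share are hypothetical examples:<br />
<pre>
* Draw the same share (here 10%) within each gender stratum.
* All file and variable names are hypothetical examples.
use "master_individuals.dta", clear
isid person_id, sort     // unique ID; sorting on it keeps the draw reproducible
set seed 20180209
gen random = runiform()
sort gender random
by gender: gen insample = (_n <= ceil(0.10 * _N))
</pre>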
<br />
=== Sample Selection ===<br />
The most basic sampling technique is a Simple Random Sample. This works well for studies of small populations for which a complete sampling frame exists. More typically, impact evaluations rely on [[Multi-stage (Cluster) Sampling|Multi-Stage]] or [[Multi-stage (Cluster) Sampling|Clustered]] Sampling, often with [[Stratified Random Sample|stratification]].<br />
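<br />
For a simple random sample, a hedged Stata sketch (file and variable names hypothetical) might look as follows:<br />
<pre>
* Draw a simple random sample of exactly 100 households from the frame.
* All file and variable names are hypothetical examples.
use "master_households.dta", clear
isid hh_id, sort         // unique ID; sorting on it keeps the draw reproducible
set seed 20180209
sample 100, count        // keep a random subset of exactly 100 observations
</pre>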
<br />
You should always work from a [[Master_Data_Set|master data set]] of the population (the sampling frame). If you do not have a master data set for the [[Unit_of_Observation|unit of observation]] you are sampling from (for example, households, villages, clinics or schools), you should always start by creating one. In the field, this is done by a [[listing]] at the lowest level of clustering possible. If a listing is impossible, an alternative is an "on-the-spot" randomization. There are a few different methods here, for example a "random walk" by enumerators, where they spin a bottle to determine a random direction. But without knowing the total number of households, this will always be biased towards the households at the center of the village. In addition, it is hard to monitor whether protocols are adhered to in the field, and there is no systematic way of tracking when replacements were used and how they were selected.<br />
<br />
=== Randomization in Stata ===<br />
All sampling code you produce must be reproducible. Any code that includes randomization needs a fixed version, seed and sort order to be reproducible. See [[Randomization in Stata|reproducible randomization in Stata]] for details.<br />
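<br />
A minimal sketch of the version-seed-sort recipe in Stata, with hypothetical file and variable names and an arbitrary example seed; see [[Randomization in Stata|reproducible randomization in Stata]] for the full guidance:<br />
<pre>
* Reproducible treatment assignment: fix the version, the sort order and
* the seed. All file and variable names and the seed are examples only.
version 13                     // fix the random-number generator across versions
use "master_households.dta", clear
isid hh_id, sort               // verify a unique ID and sort on it
set seed 20180209              // document the seed in the code

gen random = runiform()
sort random
gen treatment = (_n <= _N/2)   // first half of the random order is treated
</pre>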
<br />
== Back to Parent ==<br />
This article is part of the topic [[Sampling & Power Calculations]]<br />
<br />
<br />
== Additional Resources ==<br />
<br />
*[https://www.povertyactionlab.org/sites/default/files/resources/2017.01.11-The-Danger-of-Underpowered-Evaluations.pdf The Danger of Underpowered Evaluations], JPAL North America<br />
* [http://unstats.un.org/unsd/demographic/sources/surveys/Series_F98en.pdf Designing Household Survey Samples: Practical Guidelines] United Nations, Department of Economic and Social Affairs, Statistics Division - 2008<br />
* Why it makes sense to revisit power calculations after data has been collected: http://andrewgelman.com/2017/03/03/yes-makes-sense-design-analysis-power-calculations-data-collected/<br />
<br />
[[Category: Sampling & Power Calculations ]]</div>501238https://dimewiki.worldbank.org/index.php?title=Pre-Registration&diff=4371Pre-Registration2018-02-09T23:57:48Z<p>501238: Created page with "Trial registries offer researchers the chance to upload and timestamp their study designs before they have been conducted. The aim of these registries is to build research tra..."</p>
<hr />
<div>Trial registries offer researchers the chance to upload and timestamp their study designs before the studies have been conducted. The aim of these registries is to build research transparency by reducing selective reporting and to provide researchers with an overview of ongoing studies in their field. While trial registration is commonplace in clinical health trials (see, for example, https://clinicaltrials.gov/), its use in development economics is more recent.<br />
<br />
== Guidelines ==<br />
===Where can I register?===<br />
The American Economic Association (AEA) hosts a trial registry specifically for randomized controlled trials [https://www.socialscienceregistry.org/]. The International Initiative for Impact Evaluation (3ie) provides a registry for experimental and quasi-experimental research in developing countries [http://www.ridie.org/]. <br />
<br />
===What information should be included?===<br />
The information required for registering a trial typically includes the country and title, a brief description of the project, the timeline, outcomes, sample size, study design, and ethical approval details. Some of the details provided can be uploaded and time-stamped, but hidden from public view prior to study completion. A pre-analysis plan providing a detailed description of how the analysis will be conducted can also be uploaded, but this is typically not mandatory for registration. <br />
<br />
===When should I register?===<br />
While clinical trials in health are expected to be registered before patient enrolment [http://icmje.org/recommendations/browse/publishing-and-editorial-issues/clinical-trial-registration.html], there is currently no formal requirement for development economics trials to be registered by a particular stage of the research. In cases where intervention delivery is uncertain, development economics researchers sometimes wait to register their trials until after baseline and interventions have been completed, but before any follow-up data collection or analysis [http://blogs.worldbank.org/impactevaluations/trying-out-new-trial-registries].<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[Research Ethics]]<br />
<br />
[[Category: Research Ethics]]</div>501238https://dimewiki.worldbank.org/index.php?title=Reproducible_Research&diff=4370Reproducible Research2018-02-09T23:56:03Z<p>501238: /* Pre-registration */</p>
<hr />
<div>== Read First ==<br />
*In most scientific fields, results are validated through replication: different scientists run the same experiment independently in different samples and find similar conclusions. That standard is not always feasible in development research. More often than not, the phenomena we analyze cannot be artificially re-created. Even in the case of field experiments, different populations can respond differently to a treatment, and the costs involved are high.<br />
*Even in such cases, however, we should still require reproducibility: different researchers running the same analysis on the same data should find the same results. That may seem obvious, but unfortunately it is not as widely observed as we would like.<br />
*The bottom line of research reproducibility is that the path used to get to your results is as much a research output as the results themselves, making the research process fully transparent. This means that not only should the final findings be made available by researchers, but the data, code and documentation are also of great relevance to the public.<br />
<br />
== Code replication ==<br />
*Git is free version-control software. Files are stored in Git repositories, most commonly on [https://github.com/ GitHub]. To learn GitHub, there is an [https://services.github.com/on-demand/intro-to-github/ introductory training] available through GitHub Services, and multiple tutorials available through [https://guides.github.com/ GitHub Guides].<br />
<br />
== Data publication ==<br />
<br />
== Dynamic documents == <br />
*R-markdown is a widely adopted tool for creating fully reproducible documents. It allows users to write text and code simultaneously, running analyses in different programming languages and printing results in the final document along with the text. Stata 15 also allows users to create dynamic documents using dyndoc. <br />
<br />
*[http://jupyter.org/ Jupyter Notebook] is used to create and share code in different programming languages, including Python, R, Julia, and Scala. It can also create dynamic documents in HTML, LaTeX and other formats.<br />
<br />
*LaTeX is another widely used tool in the scientific community. It is a typesetting system that allows users to reference code outputs such as tables and graphs so that they can be easily updated in a text document. Overleaf is a web-based platform for collaborating on TeX documents.<br />
<br />
* The Open Science Framework is a web-based project management platform that combines registration, data storage (through Dropbox, Box, Google Drive and other platforms), code version control (through GitHub) and document composition (through Overleaf).<br />
<br />
== Additional Resources ==<br />
From Data Colada: <br />
*[http://datacolada.org/69 8 tips to make open research more findable and understandable]<br />
<br />
From the Abdul Latif Jameel Poverty Action Lab (J-PAL)<br />
* [https://www.povertyactionlab.org/research-resources/transparency-and-reproducibility Transparency and Reproducibility]<br />
<br />
From Innovations for Poverty Action (IPA)<br />
* [http://www.poverty-action.org/sites/default/files/publications/IPA%27s%20Best%20Practices%20for%20Data%20and%20Code%20Management_Nov2015.pdf Reproducible Research: Best Practices for Data and Code Management] <br />
* [http://www.poverty-action.org/sites/default/files/Guidelines-for-data-publication.pdf Guidelines for data publication]<br />
* [https://dataverse.harvard.edu/dataverse/socialsciencercts Randomized Control Trials in the Social Science Dataverse]<br />
<br />
Center for Open Science<br />
* [https://cos.io/our-services/top-guidelines/ Transparency and Openness Guidelines], summarized in a [https://osf.io/pvf56/?_ga=1.225140506.1057649246.1484691980 1-Page Handout]<br />
<br />
Berkeley Initiative for Transparency in the Social Sciences<br />
* [http://www.bitss.org/education/manual-of-best-practices/ Manual of Best Practices in Transparent Social Science Research]<br />
<br />
Reproducible Research in R<br />
* [https://www.coursera.org/learn/reproducible-research Johns Hopkins' Online Course on Reproducible Research]<br />
<br />
Reproducible Research in Stata<br />
* [https://huapeng01016.github.io/reptalk/#/hua-pengstatacorphpeng Incorporating Stata into reproducible documents ]<br />
<br />
[[Category: Reproducible Research]]<br />
<br />
</div>501238https://dimewiki.worldbank.org/index.php?title=Research_Ethics&diff=4369Research Ethics2018-02-09T23:55:37Z<p>501238: </p>
<hr />
<div>Impact evaluations often involve the direct manipulation of people's personal or economic situations, collection of [[De-identification#Personally-Identifying Information | personal and/or sensitive data]] about people, and publication of results that have direct implications for political or economic governance. Ensuring that these tasks are undertaken in a way that is both protective of the individuals who are part of the study population and broadly ethical for the research question and context is a critical responsibility of research designers. It is also a requirement for institutional, [[Human_Subjects_Approval#IRB_Approval | IRB]], and government approval and support of any study, and should be set out in the [[Pre-Analysis Plan | pre-analysis plan]].<br />
<br />
== Research with Human Subjects ==<br />
<br />
Any research that involves economic intervention or [[Primary Data Collection | data collection]] on specific individuals is almost certainly subject to [[Human Subjects Approval | human subjects]] ethics rules. This means that pre-approval by an [[Human_Subjects_Approval#IRB_Approval | institutional review board]] is required, as well as a [https://humansubjects.nih.gov/requirement-education human subjects education certificate] from the NIH or another body for each researcher or assistant handling implementation or data.<br />
<br />
In practice, this may or may not require [[Human_Subjects_Approval#Informed Consent | informed consent]] from individual research participants, depending on the design and purpose of the study. For example, an [[Human_Subjects_Approval#IRB_Approval | IRB]] may grant approval to collect administrative data or health care provider data for public health reasons, given written consent from the appropriate government ministry or office. Similarly, when the information to be collected is not especially sensitive and affirmative consent might endanger the feasibility of the study, it may be possible to waive individualized consent.<br />
<br />
== Handling Personally-Identifying Information (PII) ==<br />
<br />
Whether or not individualized informed consent is required, research that involves the collection of sensitive information – including but not limited to names, addresses, mobile phone numbers, bank or credit accounts, or location information – should be handled from collection to publication in a way that ensures the privacy of research participants. This means using appropriately secure electronic methods to collect and store data, appropriate [[De-identification#Folder Encryption | data encryption]] on devices like laptops or hard drives, and [[De-identification#De-identification | anonymization]] of data before any [[Publishing Data | public release]].<br />
<br />
== Research Transparency ==<br />
One common concern in research is the possibility of manipulating results. Political factors, [[Publication Bias | publication bias]] and other circumstances may pressure researchers to target findings, for example through [[P-Hacking | P-hacking]] and [[Selective Reporting | selective reporting]]. To address this concern, researchers often choose to develop [[Pre-Analysis Plan | pre-analysis plans]] and [[Pre-Registration | pre-register]] their studies. Sharing [[Publishing Data | data]] and code improves [[Reproducible Research | research reproducibility]].</div>501238https://dimewiki.worldbank.org/index.php?title=Pre-Analysis_Plan&diff=4368Pre-Analysis Plan2018-02-09T23:41:30Z<p>501238: </p>
<hr />
<div>A pre-analysis plan (PAP) lays out, at the design stage of an impact evaluation, how the researcher will analyze the data. The objective of a PAP is to prevent data mining and specification searching. <br />
<br />
<br />
== Read First ==<br />
While most economics journals do not currently require PAPs as a condition for publication, researchers may choose to produce a PAP prior to data analysis to: (i) increase the credibility of their findings; and (ii) help fine-tune their analysis strategy.<br />
<br />
While PAPs provide the benefit of potentially reducing the prevalence of spurious results, this comes at the cost of tying researchers' hands more formally to ex ante analysis plans, which may limit the potential for exploratory learning. Benjamin Olken provides a summary of the costs and benefits associated with fully pre-specifying the analysis for a development economics RCT [https://www.aeaweb.org/articles?id=10.1257/jep.29.3.61]. He notes that "forcing all papers to be fully pre-specified from start to end would likely result in simpler papers, which could potentially lose some of the nuance of current work", but that "in many contexts, pre-specification of one (or a few) key primary outcome variables, statistical specifications, and control variables offers a number of advantages".<br />
<br />
== Guidelines ==<br />
<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[Research Ethics]]<br />
<br />
<br />
== Additional Resources ==<br />
*Olken, Benjamin A. 2015. "[https://www.aeaweb.org/articles?id=10.1257/jep.29.3.61 Promises and Perils of Pre-analysis Plans]." Journal of Economic Perspectives, 29(3): 61-80.<br />
DOI: 10.1257/jep.29.3.61<br />
* [https://www.bitss.org/wp-content/uploads/2015/12/Pre-Analysis-Plan-Template.pdf Pre-Analysis Plan Template]<br />
* [http://blogs.worldbank.org/impactevaluations/a-pre-analysis-plan-checklist Pre-analysis plan checklist from Development Impact Blog]<br />
<br />
[[Category: Research Ethics]]</div>501238https://dimewiki.worldbank.org/index.php?title=Reproducible_Research&diff=4367Reproducible Research2018-02-09T23:40:52Z<p>501238: </p>
<hr />
<div>== Read First ==<br />
*In most scientific fields, results are validated through replication: different scientists run the same experiment independently in different samples and find similar conclusions. That standard is not always feasible in development research. More often than not, the phenomena we analyze cannot be artificially re-created. Even in the case of field experiments, different populations can respond differently to a treatment, and the costs involved are high.<br />
*Even in such cases, however, we should still require reproducibility: different researchers running the same analysis on the same data should find the same results. That may seem obvious, but unfortunately it is not as widely observed as we would like.<br />
*The bottom line of research reproducibility is that the path used to get to your results is as much a research output as the results themselves, making the research process fully transparent. This means that not only should the final findings be made available by researchers, but the data, code and documentation are also of great relevance to the public.<br />
<br />
== Pre-registration ==<br />
Trial registries offer researchers the chance to upload and timestamp their study designs before the studies have been conducted. The aim of these registries is to build research transparency by reducing selective reporting and to provide researchers with an overview of ongoing studies in their field. While trial registration is commonplace in clinical health trials (see, for example, https://clinicaltrials.gov/), its use in development economics is more recent.<br />
<br />
===Where can I register?===<br />
The American Economic Association (AEA) hosts a trial registry specifically for randomized controlled trials [https://www.socialscienceregistry.org/]. The International Initiative for Impact Evaluation (3ie) provides a registry for experimental and quasi-experimental research in developing countries [http://www.ridie.org/]. <br />
<br />
===What information should be included?===<br />
The information required for registering a trial typically includes the country and title, a brief description of the project, the timeline, outcomes, sample size, study design, and ethical approval details. Some of the details provided can be uploaded and time-stamped, but hidden from public view prior to study completion. A pre-analysis plan providing a detailed description of how the analysis will be conducted can also be uploaded, but this is typically not mandatory for registration. <br />
<br />
===When should I register?===<br />
While clinical trials in health are expected to be registered before patient enrolment [http://icmje.org/recommendations/browse/publishing-and-editorial-issues/clinical-trial-registration.html], there is currently no formal requirement for development economics trials to be registered by a particular stage of the research. In cases where intervention delivery is uncertain, development economics researchers sometimes wait to register their trials until after baseline and interventions have been completed, but before any follow-up data collection or analysis [http://blogs.worldbank.org/impactevaluations/trying-out-new-trial-registries].<br />
<br />
== Code replication ==<br />
*Git is free version-control software. Files are stored in Git repositories, most commonly on [https://github.com/ GitHub]. To learn GitHub, there is an [https://services.github.com/on-demand/intro-to-github/ introductory training] available through GitHub Services, and multiple tutorials available through [https://guides.github.com/ GitHub Guides].<br />
<br />
== Data publication ==<br />
<br />
== Dynamic documents == <br />
*R-markdown is a widely adopted tool for creating fully reproducible documents. It allows users to write text and code simultaneously, running analyses in different programming languages and printing results in the final document along with the text. Stata 15 also allows users to create dynamic documents using dyndoc. <br />
<br />
*[http://jupyter.org/ Jupyter Notebook] is used to create and share code in different programming languages, including Python, R, Julia, and Scala. It can also create dynamic documents in HTML, LaTeX and other formats.<br />
<br />
*LaTeX is another widely used tool in the scientific community. It is a typesetting system that allows users to reference code outputs such as tables and graphs so that they can be easily updated in a text document. Overleaf is a web-based platform for collaborating on TeX documents.<br />
<br />
* The Open Science Framework is a web-based project management platform that combines registration, data storage (through Dropbox, Box, Google Drive and other platforms), code version control (through GitHub) and document composition (through Overleaf).<br />
<br />
== Additional Resources ==<br />
From Data Colada: <br />
*[http://datacolada.org/69 8 tips to make open research more findable and understandable]<br />
<br />
From the Abdul Latif Jameel Poverty Action Lab (J-PAL)<br />
* [https://www.povertyactionlab.org/research-resources/transparency-and-reproducibility Transparency and Reproducibility]<br />
<br />
From Innovations for Poverty Action (IPA)<br />
* [http://www.poverty-action.org/sites/default/files/publications/IPA%27s%20Best%20Practices%20for%20Data%20and%20Code%20Management_Nov2015.pdf Reproducible Research: Best Practices for Data and Code Management] <br />
* [http://www.poverty-action.org/sites/default/files/Guidelines-for-data-publication.pdf Guidelines for data publication]<br />
* [https://dataverse.harvard.edu/dataverse/socialsciencercts Randomized Control Trials in the Social Science Dataverse]<br />
<br />
Center for Open Science<br />
* [https://cos.io/our-services/top-guidelines/ Transparency and Openness Guidelines], summarized in a [https://osf.io/pvf56/?_ga=1.225140506.1057649246.1484691980 1-Page Handout]<br />
<br />
Berkeley Initiative for Transparency in the Social Sciences<br />
* [http://www.bitss.org/education/manual-of-best-practices/ Manual of Best Practices in Transparent Social Science Research]<br />
<br />
Reproducible Research in R<br />
* [https://www.coursera.org/learn/reproducible-research Johns Hopkins' Online Course on Reproducible Research]<br />
<br />
Reproducible Research in Stata<br />
* [https://huapeng01016.github.io/reptalk/#/hua-pengstatacorphpeng Incorporating Stata into reproducible documents ]<br />
<br />
[[Category: Reproducible Research]]<br />
<br />
</div>501238