Data Cleaning

<span style="font-size:150%">
Data cleaning is an essential step between [[Primary Data Collection | data collection]] and [[Data Analysis | data analysis]]. Raw primary data is always imperfect and needs to be prepared for high-quality analysis and overall [[Reproducible Research | replicability]]. In extremely rare cases, the only preparation needed is [[Data Documentation | dataset documentation]]. However, in the vast majority of cases, data cleaning requires significant energy and attention, typically on the part of the [[Impact_Evaluation_Team#Research_Assistant|Research Assistant]] (RA). This page outlines the goals of data cleaning, recommends a division of roles, outlines common issues encountered, and presents approaches to resolve them.
</span>
 
<onlyinclude>
Data cleaning is an essential step between data collection and data analysis. Raw primary data is always imperfect and needs to be prepared so that it is easy to use in the analysis. This is the high-level goal of data cleaning. In extremely rare cases, the only preparation needed is to document the dataset, for example by using labels. However, in the vast majority of cases, many small things need to be addressed in the dataset itself. This could mean correcting data points that are incorrect, or replacing values that are not real data points but codes explaining why there is no real data point.
</onlyinclude>
== Read First ==


*The goal of data cleaning is to clean individual data points and to make the dataset easily usable and understandable for the research team and external users.  
*As a [[Impact_Evaluation_Team#Research_Assistant|Research Assistant]] (RA) or [[Impact_Evaluation_Team#Field_Coordinator|Field Coordinator]] (FC), do not spend so much time trying to fix irregularities in the data that you run out of time to identify as many irregularities as possible.
*The quality of the analysis will never be better than the quality of data cleaning.
*There is no such thing as an exhaustive list of what to do during data cleaning: each project will have individual cleaning needs. This article provides a very good place to start.
*After finishing the data cleaning for each round of data collection, the data can be [[Publishing Data|released]].
*See this [[Checklist:_Data_Cleaning| data cleaning checklist]] to ensure that common cleaning actions have been completed.
 
== The Goals of Cleaning ==


The data cleaning process seeks to fulfill two goals: 1) to clean individual data points that invalidate or incorrectly bias the analysis, and 2) to make the dataset easily usable and understandable for researchers both within and outside of the research team. A thorough data cleaning process should also result in documented insights about the data and the data collection that can inform future data collection, whether in a later round of the same project or in other similar projects.


[[File:Picture2.png|700px|link=|center]]


=== Cleaning Individual Data Points ===


[[Randomized Control Trials | RCT]] analysis typically tests for statistical differences between the means of the control and treatment groups. To do so, research teams use regression analyses; in essence, one can think of these as an advanced comparison of means. While this is, of course, far from a complete description of impact evaluation analysis, it may provide a useful framework and perspective to an RA cleaning a dataset for the first time. It is difficult to have an intuition for the math behind a regression, but it is easy to have an intuition for the math behind a mean.


Anything that biases a mean will bias a regression: outliers, missing values, typos, erroneous survey codes, illogical values, duplicates, etc. While many more things can also bias a regression, this conceptualization is a good starting place for anyone cleaning a dataset for the first time. The researcher leading the analysis is trained in the other granular details and knowledge necessary for the specific regression models.


=== Making the Dataset Usable and Understandable ===


The second goal of data cleaning is to code and document the dataset to make it as self-explanatory as possible. At the time of data collection and data cleaning, you know the dataset much better than you will at any time in the future. Carefully documenting this knowledge often makes the difference between a good analysis and a great analysis. A usable and understandable dataset will not only help you and your research team in the present and the future, but also other researchers who use the dataset down the road.


== Role Division during Data Cleaning ==
[[Impact_Evaluation_Team#Research_Assistant|Research Assistants]] (RAs) and [[Impact_Evaluation_Team#Field_Coordinator|Field Coordinators]] (FCs) should prioritize their time on identifying and documenting irregularities in the data rather than correcting them. It is never bad to suggest corrections to irregularities. However, many RAs or FCs spend too much time trying to fix irregularities at the expense of having enough time to identify and document as many as possible. This is often inefficient, as different regression models and/or PI preferences may require different corrections. In such cases, time-consuming corrections may not be valid given the regression model used in the analysis.
 


Eventually, the [[Impact_Evaluation_Team#Principal_Investigator|Principal Investigator]] (PI) and the RA or FC will develop a common understanding of which correction decisions can be made without involving the PI. Until then, the RA should focus her/his time on identifying and documenting as many issues as possible rather than fixing them. Again, it is no problem to do both, so long as the time spent fixing doesn't prevent the RA from identifying and documenting as many issues as possible.


== Importing Data ==


The first step in cleaning the data is to import the data. When working with secondary data (e.g. [[Administrative and Monitoring Data | administrative data]]), this step is often straightforward. However, with primary data, this step is often underestimated.


All modern [[Computer-Assisted Personal Interviews (CAPI) | CAPI]] survey data collection tools provide methods for importing the raw data in a way that drastically reduces cleaning work. These methods typically include a Stata do-file that generates labels and other features from the [[Questionnaire Design | questionnaire]] code and then applies them to the raw data during the import. If you are working in SurveyCTO, see this article on [[SurveyCTO Stata Template | SurveyCTO's Stata Template]].


Ensure that any change, no matter how small, is always made in Stata, R, or whichever scripting language the project uses. When dealing with incorrect submissions in raw data (e.g. [[Duplicates and Survey Logs | duplicates]], pilot data mixed with the main data, etc.), handle these issues and deletions in such a way that they can be [[Reproducible Research | replicated]] by re-running code: without this record, the analysis may no longer be valid. See the article on [[DataWork_Survey_Round#Raw_Folder|raw data folders]] for more details.
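For example, here is a minimal sketch of dropping pilot submissions in code rather than by hand. File paths, variable names, and dates are hypothetical, and <code>submissiondate</code> is assumed to be a Stata %tc datetime:

<pre>
* Drop pilot submissions in code so that the deletion can be
* replicated by re-running the do-file
use "rawdata/survey_raw.dta", clear

* The pilot ended before the survey launched on 15 Jan 2019
* (dofc() converts a %tc datetime to a daily date)
drop if dofc(submissiondate) < date("2019-01-15", "YMD")

save "intermediate/survey_nopilot.dta", replace
</pre>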


== Data Issues and Approaches ==


Countless irregularities may appear in a primary dataset, requiring a multitude of data cleaning actions. This section does not provide an exhaustive list, but rather a few examples of irregularities and approaches to resolve them.


=== ID Variables ===
Observations in the dataset should be [[ID Variable Properties | uniquely and fully identifiable]] by a single variable. Often, raw [[Primary Data Collection|primary data]] includes [[Duplicates and Survey Logs|duplicate entries]]. Carefully [[Data Documentation | document]] these cases and, to ensure accuracy, only correct them after discussing with the [[Field Coordinator]] and field team what caused them. [[ieduplicates]], a command in [[Stata Coding Practices#ietoolkit|ietoolkit]], identifies duplicated entries, while [[iecompdup]] helps to correct them. Once duplicates are corrected, the observations can be linked to the [[Master Data Set|master dataset]].
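A minimal sketch of this workflow, assuming SurveyCTO-style data with a hypothetical ID variable <code>hhid</code> and the device-generated <code>key</code> variable:

<pre>
* Flag duplicates on the ID variable and export them to an Excel
* report for documentation and review (path is hypothetical)
ieduplicates hhid using "duplicates_report.xlsx", uniquevars(key)

* Once the report has been filled in with the field team's decisions,
* re-running the command applies the documented corrections. Verify:
isid hhid
</pre>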
=== Outliers ===
While there are many rules of thumb for how to define an outlier, there is no silver bullet. One common rule of thumb is to treat any data point more than three standard deviations away from the mean of that variable across all observations as an outlier. This can be a starting point, but one needs to qualitatively consider whether it is the correct approach for the data at hand. A non-exhaustive list of approaches to correcting outliers follows (see the code sketch after the list):


*Missing value: replace outlier values with a missing value.
*Winsorization: replace any values larger than a certain percentile, often the 99th, with the value at that percentile. This prevents very large values from biasing the mean. It also has an equality-of-impact aspect: if all project benefits went to a single observation in the treatment group, the mean would still be high, but that is rarely a desired outcome in development. Winsorization thus penalizes inequitable distribution of the benefits of a project.
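A minimal winsorization sketch, assuming a hypothetical <code>income</code> variable and the 99th percentile as the cutoff (community-contributed commands such as <code>winsor2</code> automate this):

<pre>
* Replace values above the 99th percentile with the value at that
* percentile; guard against missing values, which Stata treats as
* larger than any number
summarize income, detail
replace income = r(p99) if income > r(p99) & !missing(income)
</pre>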


=== Illogical Values ===
Illogical values are data points that are impossible in relation to other values in the data. For example, if a respondent is male, then the respondent cannot answer that he is pregnant. Good [[Questionnaire Design | questionnaire design]] includes logic checks that prevent such cases from being entered in the first place. However, no questionnaire can ever be pre-programmed to control for every such case, so the data should still be tested for illogical values during cleaning.
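A minimal sketch of testing for one such illogical combination after the fact (variable names and codings are hypothetical: <code>sex</code> 1 = male, <code>pregnant</code> 1 = yes):

<pre>
* Flag male respondents recorded as pregnant and list them
* for follow-up with the field team
gen flag_male_pregnant = (sex == 1 & pregnant == 1)
list hhid sex pregnant if flag_male_pregnant == 1
</pre>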


=== Typos ===
If it is obvious beyond any doubt that a response is incorrect due to a simple typo, then it is a good idea to correct the typo, as long as the change is documented in a replicable way.
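A sketch of what such a documented correction might look like (the ID and values are hypothetical); conditioning on the old value keeps the fix safe to re-run:

<pre>
* Respondent 104 reported a household size of 50; the field team
* confirmed the true value is 5
replace hh_size = 5 if hhid == 104 & hh_size == 50
</pre>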
 


=== Survey Codes and Missing Values ===


Almost all data collection done through surveys of any sort allows respondents to answer something like "Do not know" or "Decline to answer" for individual questions. These answers are usually recorded using survey codes in the format -999, -88 or something similar. If left as such, these numbers will bias means and regressions; they must be replaced with missing values in Stata.  


Stata has several missing values. The most well-known is the regular missing value represented by a single "." but Stata also offers extended missing values: ".a", ".b", ".c" etc. all the way to ".z". Stata handles these values the same as "." in commands that expect a numeric value. Conveniently, these extended missing values accept value labels that allow you to distinguish between, for example, "Do not know" and "Decline to answer." You might label ".d", for example, as "Decline to answer", and ".k" as "Do not know." Make sure to always assign value labels to extended missing values so that they can be precisely interpreted. Finally, make sure to consistently use the same letter ".a", ".b" etc. to represent only one response across your project. See [http://www.stata.com/manuals13/u12.pdf#u12.2 Stata Manual Missing Values] for more details on missing values.
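A minimal sketch of such replacements (the survey codes, letter choices, and variable names are project-specific assumptions):

<pre>
* Replace survey codes with extended missing values
foreach var of varlist income hh_size n_children {
    replace `var' = .d if `var' == -999   // "Decline to answer"
    replace `var' = .k if `var' == -888   // "Do not know"
}

* Extended missing values can be labeled through value labels
label define income_lbl .d "Decline to answer" .k "Do not know"
label values income income_lbl
</pre>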


Missing values can be used for much more than just survey codes. Any value that is removed because it was found to be incorrect should be replaced with a missing value. In a [[Master Data Set]], there should be no regular missing values: every missing value in a master dataset should carry an explanation of why that information is not available for that observation.
=== No Strings ===


All data should be stored in numeric format, since, among other reasons, (1) Stata stores numbers more efficiently than strings and (2) many Stata commands expect values to be stored numerically. During cleaning, clean categorical string variables and convert them into numeric codes. Then assign value labels for clarity. The commands <code>destring</code> and/or <code>encode</code> may be useful during this process.
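For example, a minimal sketch of converting a number stored as a string (the variable name is hypothetical):

<pre>
* Convert a string variable that holds numbers into numeric format.
* Without the force option, destring leaves the variable unchanged
* and prints a message if non-numeric characters remain, so those
* cases can be inspected first
destring age, replace
</pre>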


There are two exceptions in which string variables are acceptable:
#If the number cannot be stored correctly numerically. This may occur in two scenarios:  
##If the number is more than 15 digits long, it cannot be stored exactly in Stata's numeric formats. An [[ID Variable Properties | ID]] cannot, for obvious reasons, be rounded, so it may remain a string. A continuous variable with more than 15 digits, however, should be rounded and converted to a different scale; after all, a precision of 16 digits is not even possible in the natural sciences.
##If the number begins with 0, as is sometimes the case for national IDs and telephone numbers. In this case, continue storing the number as a string, as Stata would remove any leading zeros when destringing (see the sketch after this list).


#Non-categorical text. It is acceptable to store text answers that cannot be converted into categories as strings. A few examples follow:
##Open-ended questions: open-ended questions should, in general, be avoided, but sometimes the questionnaire asks the respondent to answer a question in his or her own words.
##Other specifications: the respondent is asked to specify the answer after answering ''other'' in a multiple choice question.  
##Proper names: names of people, etc. Note that not all proper names should be stored as string as some can be made into categories. For example, if you collect data on respondents and multiple respondents live in the same villages, then convert the variable with the village names into a categorical numeric variable and assign a value label.
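As an illustration of the first exception, a sketch of handling an ID with leading zeros (the variable name and width are hypothetical):

<pre>
* A ten-digit national ID with leading zeros must stay a string:
* destring would turn "0012345678" into 12345678. Validate the
* expected width instead of converting
assert strlen(national_id) == 10
</pre>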


== Labels ==
There are several ways to add helpful descriptive text to a data set in Stata, but the two most common and important ways are variable labels and value labels.


=== Variable Labels ===
All variables in a clean data set should have variable labels describing the variable. In addition to a brief explanation of the variable and perhaps the question number from which it comes, you may also include information such as the unit or currency used in the variable and other things that are not possible to read from the values themselves. The label can be up to 80 characters long, so there is a limitation to how much information can be included here.  
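A minimal variable-label sketch (the variable names, question numbers, and units are hypothetical):

<pre>
* Include units, currencies, and question numbers that cannot be
* read from the values themselves
label variable income  "Q12. Total household income last month (local currency)"
label variable hh_size "Q3. Number of household members"
</pre>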


=== Value Labels ===


Always store categorical variables numerically and use value labels to indicate what each numeric code represents. For example, yes and no questions should be stored as 0 and 1, with the value label ''No'' for cells containing 0 and ''Yes'' for cells containing 1. The same concept applies to multiple choice variables.
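A minimal value-label sketch (the variable name is hypothetical):

<pre>
* Store a yes/no question as 0/1 and label the codes
label define yesno 0 "No" 1 "Yes"
label values owns_land yesno
</pre>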


There are tools in Stata to convert categorical string variables to a categorical numeric variable where the strings are automatically applied as value labels. The most common tool is the command <code>encode</code>. However, if you use <code>encode</code>, you should always use the two options <code>label()</code> and <code>noextend</code>.  
*<code>label()</code> forces you to manually create the label before using encode (this requires some manual work but it is worth it).  
*<code>noextend</code> throws an error if there is a value in the data that does not exist in the pre-defined label. This way you are notified that you need to add the new value to the value label you created manually. Or you can change the string value if there is a typo or similar that is the reason why that string value was not assigned a value label.
Without these two options, Stata assigns a code to each string value in alphabetic order. There is no guarantee that this alphabetic order stays the same when observations are added or removed, or when someone else makes changes earlier in the code, so the numeric codes could silently change meaning.
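A minimal sketch of this safer workflow (the village and variable names are hypothetical):

<pre>
* Manually define the value label first, then encode against it
label define village_lbl 1 "Kigoma" 2 "Mwanza" 3 "Tabora"
encode village_name, gen(village_code) label(village_lbl) noextend

* noextend makes encode fail if village_name contains a value that
* is not in village_lbl, for example a typo such as "Mwanzaa"
</pre>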


== Additional Resources ==
* DIME Analytics' ''Data Management and Cleaning''
* The Stata Cheat Sheets on [http://geocenter.github.io/StataTraining/pdf/StataCheatsheet_processing_15_June_2016_TE-REV.pdf Data Processing] and [http://geocenter.github.io/StataTraining/pdf/StataCheatsheet_Transformation15_June_2016_TE-REV.pdf Data Transformation] are helpful reminders of relevant Stata code.
* The [https://github.com/Quartz/bad-data-guide#values-are-missing Quartz guide to bad data] on GitHub has many helpful tips for dealing with the kinds of data problems that often come up in real-world settings.
*See this [[Checklist:_Data_Cleaning| data cleaning checklist]] to ensure that common cleaning actions have been completed. Note that this is not an exhaustive list. Such a list is impossible to create as the individual datasets and the analysis require different cleaning depending on context.


[[Category: Data Cleaning ]]
