Difference between revisions of "Data Cleaning"

Jump to: navigation, search
Line 88: Line 88:
Categorical variables should always be stored numerically and have a value labels that describe what the numeric code represent. For example, a yes and no questions should be stored as 0 and 1 and where the label ''No'' is put to data cells with 0, and the label ''Yes'' to all data cells with 1. This should be applied to all multiple choice variables.
Categorical variables should always be stored numerically and have a value labels that describe what the numeric code represent. For example, a yes and no questions should be stored as 0 and 1 and where the label ''No'' is put to data cells with 0, and the label ''Yes'' to all data cells with 1. This should be applied to all multiple choice variables.


There are tools in Stata to convert categorical string variables to a categorical numeric variable where the strings are automatically applied as value labels. The most common tool is the command <code>encode</code>.
There are tools in Stata to convert categorical string variables to a categorical numeric variable where the strings are automatically applied as value labels. The most common tool is the command <code>encode</code>. However, if you use <code>encode</code> you should always use the two options <code>label()</code> and <code>noextend</code>.


== Additional Resources ==
== Additional Resources ==

Revision as of 19:40, 19 April 2017

Data cleaning is an essential step between data collection and data analysis. The aim is to (i) identify data errors, (ii) correct errors, and (iii) improve data collection process.


Read First

  • See this check list that can be used to make sure that common cleaning actions have been done when applicable.

It is really difficult to have a fully efficient data collection procedure in place that would generate error-free raw data. Any output of raw data needs some level of cleaning, either minor or major. Through the cleaning process, the research team can learn lessons and feed such information into next round's data collection, and to make the whole process more efficient.

Data cleaning becomes essential because without it any analytical work loses validity. Models used in research work assume data to be clean at the least.

Data cleaning is an important aspect of any impact evaluation project. Almost every research team keep research assistant(s) solely for the purpose of data cleaning, hence the additional costs.

The Goal of Cleaning

Picture2.png

There are two main goals when cleaning the data set:

  1. Cleaning individual data points that invalidate or incorrectly bias the analysis
  2. Prepare a clean data set so that it is easy to use to other researcher. Both for researchers inside your team and outside your team.

Another overarching goal of the cleaning process is to understand the data and the data collection really well. Much of this understanding feeds directly into the two points above, but a really good data cleaning process should also result in documented lessons learned that can be used in future data collection. Both in later data collection rounds in the same project, but also in data collections in other similar projects.

Cleaning individual data points

In impact evaluations our analysis often come down to test for statistical differences in the mean between the control group and any of the treatment arms. We do so through advance regression analysis where we include control variables, fixed effects, different error estimators among many other tools, but in essence one can think of it as an advanced comparison of means. While this is far from a complete description of impact evaluation analysis it might give the person cleaning a data set for the first time a framework on what cleaning a data set should achieve.

It is difficult to have an intuition for the math behind a regression, but it easy to have an intuition for the math behind a mean. Anything that bias a mean will bias a regression, and while there are many more things that can bias a regression, this is a good place to start for anyone cleaning a data set for the first time. The researcher in charge of the analysis is trained in what else that needs to be done for the specific regression models used. The articles linked to below will go through specific examples, but it is probably obvious to most readers that outliers, typos in data, survey codes (often values like -999 or -888) etc. bias means, so it is never wring to start with those examples.

Prepare a clean data set

The second goal of the data cleaning is to document that data set so that variables, values and anythings else is as self-explanatory as possible. This will help other researchers that you grant access to this data set, but it will also help you and your research team when access the data set in the future. At the time of the data collection or at the time of the data cleaning, you know the data set much better than you will at any time in the future. Carefully documenting this knowledge so that it can be used at the time of analysis is often the difference between a good analysis and a great analysis.

Role Division during Data Cleaning

As a Research Assistant (RA) or Field Coordinator (FC), spend time identifying and documenting irregularities in the data. It is never bad to suggest corrections to irregularities, but a common mistake RAs or FCs do is that they spend too much time on trying to fix irregularities on the expense of having enough time to identify and document as many as possible.

Eventually the Principal Investigator (PI) and the RA or FC will have a common understanding on what corrections calls can made without involving the PI, but until then, it's recommended that the RA focus her/his time on identifying and documenting as many issues as possible rather than spending a lot of time on how to fix the issues. It is no problem to do both unless fixing happens to the cost of identifying as much as possible. One major reason is that different regression models might require different ways to correct issues and this is often a perspective only the PI have.

Import Data

The first step in cleaning the data is to import the data. If you work with secondary data (data prepared by someone else) then this step is often straightforward, but this is a step often underestimated when working with primary data. It is very important that any change, no matter how small, always is done in Stata (or in R or any other scripting language). Even if you know that there are incorrect submission in your raw data (duplicates, pilot data mixed with the main data etc.) those deletions should always be done so that it can be replicated by re-running code. Without this information the analysis might not longer be valid. See the article on raw data folders for more details.

Importing Primary Survey Data

All modern CAPI survey data collections tools provided methods for importing the raw data in a way that drastically reduces the amount of work that needs to be done when cleaning the data. These methods typically includes a Stata do-file that generates labels and much more from the questionnaire code and then applies that to the raw data as it is being imported. If you are working in SurveyCTO see this article on SurveyCTO's Stata Template.

Examples of Data Cleaning Actions

The material in this section has been generated having primary survey data in mind. Although, a lot of these practices are also applicable when cleaning other types of data sets.

Data Cleaning Check List. This is a check list that can be used to make sure that all common aspects of data cleaning has been covered. Note that this is not a exhaustive list. Such a list is impossible to create as the individual data sets and the analysis methods used on them all require different cleaning that in the details depends on the context of that data set.

Incorrect Data and Other Irregularities

Survey Codes and Missing Values

Almost all data collection done through surveys of any sort allow the respondent to answer something like "Do not know" or "Declined to answer" for individual questions. These answers are usually recorded using survey codes on the format -999, -88 or something similar. It is obvious that these numbers will bias means and regressions if they are left out. These values must be replaces with missing values in Stata.

Stata has several missing values. The most well know is the regular missing value represented by a single "." but we would lose the difference in meaning between "Do not know" and "Declined to answer" if both codes representing them would be replaced with the regular missing value. Stata offers a solution with its extended missing values. They are represented by ".a", ".b", ".c" etc. all the way to ".z". Stata handles these values the same as "." in commands that expect a numeric value, but they can be labeled differently and the original information is therefore not lost. Make sure that the same letter ".a", ".b" etc. is used to always represent only one thing across your project. The missing values should be assigned value labels so that See Stata Manual Missing Values for more details on missing values.

Missing values can be used for much more than just survey codes. Any value that we remove because we found out is incorrect should be replaced with a missing value. In a Master Data Set there should be no regular missing values. All missing values in a master data set should be explained why we do not have that information for that observation.

No Strings

All data should be stored in numeric format. There are multiple reasons for this, but the two most important is that it is much more efficiently stored and a lot Stata commands expect values to be stored numerically. Categorical string variables should be stored as numeric codes and have value labels assigned.

There are two exceptions where string variables are allowed. The two examples are listed below:

Numbers that cannot be stored correctly numerically. There are two cases of this exception. The first case is when a number is more than 15 digits long. This can happen when working with some national IDs. If a continuous variable has more than 15 digits, then it should be rounded and converted to a different scale, as a precision of 16 digits is not even possible in natural sciences. An ID can for obvious reasons not be rounded. The other case are numbers starting with a zero. This is sometimes the case in some national IDs and it is also sometimes the case with telephone numbers in some countries. Any leading zeros are removed by Stata and those cases therefore have to be stored as string.

Non-categorical text. Text answers that cannot be converted into categories needs to be stored as strings. One example is open ended questions. Open ended questions should in general be avoided, but sometimes the questionnaire ask the respondent to answer a question in his or her own words, and then that answer has to be stored as strings. Another example is if the respondent is asked to specify the answer after answering Other in a multiple choice question. A different example where string format is needed is some cases of proper names, for example the name of the respondent. Not all proper names should be stored as string as some can be made into categories. For example, if you collect data on respondents and multiple respondents live in the same villages, then the variable with the village names should be converted into a categorical numeric variables and have a value label assigned. See the section on value labels below.

Labels

There are several ways to add helpful descriptive text to a data set in Stata, but the two most common and important ways are variables labels and value labels.

Variable Labels All variables in a clean data set should have variable labels describing the variable. The label can be up to 80 characters long so there is a limitation to how much information can be included here. In addition to a brief explanation of the variable, it is usually good to include information such as unit or currency used in the variable and other things that are not possible to read from the values themselves.

Value Labels Categorical variables should always be stored numerically and have a value labels that describe what the numeric code represent. For example, a yes and no questions should be stored as 0 and 1 and where the label No is put to data cells with 0, and the label Yes to all data cells with 1. This should be applied to all multiple choice variables.

There are tools in Stata to convert categorical string variables to a categorical numeric variable where the strings are automatically applied as value labels. The most common tool is the command encode. However, if you use encode you should always use the two options label() and noextend.

Additional Resources