Data Cleaning

Revision as of 22:06, 14 April 2017 by Kbjarkefur (talk | contribs)

Data cleaning is an essential step between data collection and data analysis. The aim is to (i) identify data errors, (ii) correct those errors, and (iii) improve the data collection process.


Read First


It is very difficult to put in place a data collection procedure that generates error-free raw data. Any raw data needs some level of cleaning, whether minor or major. Through the cleaning process, the research team can learn lessons and feed them into the next round of data collection, making the whole process more efficient.

Data cleaning is essential because without it any analytical work loses validity. Models used in research assume, at the very least, that the data are clean.

Data cleaning is an important aspect of any impact evaluation project. Almost every research team keeps one or more research assistants solely for the purpose of data cleaning, which adds to the cost of the project.

The Goal of Cleaning

There are two main goals when cleaning the data set:

  1. Clean individual data points that invalidate or incorrectly bias the analysis
  2. Prepare a clean data set that is easy for other researchers to use, both inside and outside your team

Cleaning individual data points

In impact evaluations our analysis often comes down to testing for statistical differences in the mean between the control group and any of the treatment arms. We do so through advanced regression analysis where we include control variables, fixed effects and different error estimators, among many other tools, but in essence one can think of it as an advanced comparison of means. While this is far from a complete description of impact evaluation analysis, it might give a person cleaning a data set for the first time a framework for what cleaning a data set should achieve.

It is difficult to have an intuition for the math behind a regression, but it is easy to have an intuition for the math behind a mean. Anything that biases a mean will bias a regression, and while there are many more things that can bias a regression, this is a good place to start for anyone cleaning a data set for the first time. The researcher in charge of the analysis will be trained in what else needs to be done for the specific regression models used. The articles linked to below will go through specific examples, but it is probably obvious to most readers that outliers, typos in the data, survey codes (often values like -999 or -888) and similar issues bias means, so it is never wrong to start with those examples.
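To make the point about survey codes concrete, here is a minimal Python sketch of how a single -999 distorts a mean. The variable name, the values and the set of codes are made up for illustration; the actual codes depend on the questionnaire.

```python
from statistics import mean

# Assumed survey codes for illustration; check the questionnaire for the real ones.
SURVEY_CODES = {-999, -888}

def mean_without_codes(values, codes=SURVEY_CODES):
    """Return the mean of the values after dropping survey codes."""
    valid = [v for v in values if v not in codes]
    return mean(valid)

# Made-up household sizes; one respondent's answer was recorded as -999.
household_size = [4, 5, 3, -999, 6, 4]

naive_mean = mean(household_size)                 # the survey code is treated as a real value
clean_mean = mean_without_codes(household_size)   # the survey code is excluded

print("Mean including the survey code:", naive_mean)
print("Mean excluding the survey code:", clean_mean)
```

The naive mean is negative, which is impossible for a household size, while the mean computed without the survey code is a plausible value. The same distortion carries over to any regression that uses the variable.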

Prepare a clean data set

The second goal of data cleaning is to document the data set so that variables, values and anything else are as self-explanatory as possible. This will help other researchers to whom you grant access to the data set, but it will also help you and your research team when you access the data set in the future. At the time of the data collection or at the time of the data cleaning, you know the data set much better than you will at any time in the future. Carefully documenting this knowledge so that it can be used at the time of analysis is often the difference between a good analysis and a great analysis.
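One simple way to check this kind of documentation is to compare the variables in the data set against a codebook. The sketch below is only an illustration: the codebook structure and the variable names (hhid, treat, q12_b) are hypothetical, and in practice the labels are often stored in the data set itself by the statistical software.

```python
# Hypothetical codebook: variable name -> description and value labels.
codebook = {
    "hhid": {"label": "Unique household ID", "values": {}},
    "treat": {
        "label": "Treatment assignment",
        "values": {0: "Control", 1: "Treatment"},
    },
}

def undocumented_variables(variables, codebook):
    """Return the variables that have no non-empty label in the codebook."""
    return [v for v in variables if not codebook.get(v, {}).get("label")]

# Made-up list of the variables present in the data set.
dataset_variables = ["hhid", "treat", "q12_b"]

missing_labels = undocumented_variables(dataset_variables, codebook)
print("Variables still lacking documentation:", missing_labels)
```

Here q12_b is reported because nobody has described it yet, which is exactly the kind of variable that becomes hard to interpret months after the data collection.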

Role Division during Data Cleaning

Spend time identifying and documenting irregularities in the data. It is never bad to suggest corrections to irregularities, but a common mistake RAs make is to spend too much time trying to fix irregularities, at the expense of having enough time to identify and document as many as possible.

Eventually you and your RA will have an understanding of which corrections you can decide on yourselves, but until then, focus your time on identifying and documenting any issues.
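The idea of documenting irregularities before fixing them can be sketched as a small issue log. This is an illustrative Python example under assumed names (hhid, age) and an assumed valid range; it records suspicious values and leaves the data untouched, so the decision about any correction can be taken later.

```python
def find_irregularities(rows, id_var, var, low, high):
    """Return one log entry per value outside [low, high]; the rows are not modified."""
    issues = []
    for row in rows:
        value = row[var]
        if value is None:
            continue  # missing values would be logged by a separate check
        if not (low <= value <= high):
            issues.append({
                "id": row[id_var],
                "variable": var,
                "value": value,
                "note": f"outside expected range {low} to {high}",
            })
    return issues

# Made-up records: one likely typo (340) and one survey code (-999).
records = [
    {"hhid": 101, "age": 34},
    {"hhid": 102, "age": 340},
    {"hhid": 103, "age": -999},
]

issue_log = find_irregularities(records, id_var="hhid", var="age", low=0, high=120)
for issue in issue_log:
    print(issue)
```

The log lists households 102 and 103 with the offending values and a note. It can then be shared with the person responsible for deciding which corrections to make.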

Examples of Data Cleaning Actions

This is a checklist that can be used to make sure that all common aspects of data cleaning have been covered. Note that this is not an exhaustive list. Such a list is impossible to create, as each data set and the analysis methods used on it require different cleaning, the details of which depend on the context of that data set.

Import Data

Incorrect Data and Other Irregularities

Missing Values

No Strings

Labels

Additional Resources

  • list here other articles related to this topic, with a brief description and link