Data Cleaning

Revision as of 20:41, 14 April 2017

Data cleaning is an essential step between data collection and data analysis. The aim is to (i) identify data errors, (ii) correct those errors, and (iii) improve the data collection process.


Read First


It is difficult to put in place a data collection procedure that generates error-free raw data, so any raw data set needs some level of cleaning, whether minor or major. Through the cleaning process, the research team can also learn lessons that feed into the next round of data collection and make the whole process more efficient.

Data cleaning is essential because analytical work done on uncleaned data loses validity. The models used in research assume, at a minimum, that the data are clean.

Data cleaning is an important aspect of any impact evaluation project. Almost every research team keeps one or more research assistants solely for data cleaning, which adds to the cost of the project.

The Goal of Cleaning

There are two main goals when cleaning the data set:

  1. Clean individual data points that invalidate or incorrectly bias the analysis.
  2. Prepare a clean data set that is easy for other researchers to use, both inside and outside your team.

Cleaning individual data points

Prepare a clean data set

Role Division during Data Cleaning

Spend time identifying and documenting irregularities in the data. It is never bad to suggest corrections, but a common mistake among RAs is to spend so much time trying to fix irregularities that too little time is left to identify and document as many as possible.

Over time, you and your RA will develop an understanding of which corrections you can decide on yourself. Until then, focus your time on identifying and documenting any issues.
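
One way to keep identification and documentation separate from correction is to write checks that only record issues and never modify the data. The following is a minimal Python sketch of that idea, not a prescribed workflow. The variable names (hhid, age, hh_size), the valid ranges, the sample rows, and the helper find_issues are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    """One documented irregularity; the raw value is recorded, not changed."""
    row_id: str
    variable: str
    value: object
    problem: str


def find_issues(rows, id_var, valid_ranges):
    """Return a list of Issue entries without modifying `rows`.

    rows         -- list of dicts, one per observation
    id_var       -- name of the unique ID variable
    valid_ranges -- dict mapping variable name to (low, high) bounds
    """
    issues = []
    seen_ids = set()
    for row in rows:
        row_id = row.get(id_var)
        # Duplicate IDs are flagged on the second and later occurrences.
        if row_id in seen_ids:
            issues.append(Issue(str(row_id), id_var, row_id, "duplicate ID"))
        seen_ids.add(row_id)
        for var, (low, high) in valid_ranges.items():
            value = row.get(var)
            if value is None or value == "":
                issues.append(Issue(str(row_id), var, value, "missing value"))
            elif not isinstance(value, (int, float)):
                issues.append(Issue(str(row_id), var, value, "non-numeric value"))
            elif not low <= value <= high:
                issues.append(Issue(str(row_id), var, value, "outside expected range"))
    return issues


# Hypothetical survey rows, invented for illustration only.
survey = [
    {"hhid": "A01", "age": 34, "hh_size": 5},
    {"hhid": "A02", "age": 340, "hh_size": 4},     # implausible age
    {"hhid": "A02", "age": 28, "hh_size": None},   # repeated ID, missing size
]

# Assumed plausible ranges; a real project would take these from the survey design.
issue_log = find_issues(survey, "hhid", {"age": (0, 120), "hh_size": (1, 30)})
for issue in issue_log:
    print(f"{issue.row_id}\t{issue.variable}\t{issue.value!r}\t{issue.problem}")
```

The resulting log can be shared with the rest of the team, who can then decide which irregularities to correct and how, while the raw data stay untouched.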

Import Data

Incorrect Data and Other Irregularities

Missing Values

No Strings

Labels

Additional Resources
