Checklist: Data Cleaning


For more detailed instructions on how to implement the tasks in this checklist, see Data Cleaning.

Project name: _______________________________________
Country: ___________________________________________
District: ____________________________________________
Year, Month and/or Day: _____________________________
1. Before data cleaning: Importing the data
Initials  No.   Checklist item
[ __ ]    1.1   Check for importing issues, such as broken lines, when importing .csv files
[ __ ]    1.2   Make sure you have unique IDs
[ __ ]    1.3   De-identify all data and save in a new .dta file
[ __ ]    1.4   Never make any changes to the raw data
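
Items 1.1 and 1.2 can be checked mechanically. The checklist is written with Stata in mind; the sketch below uses Python's standard library only to illustrate the logic. The function name `check_import`, the column name `hh_id` and the sample data are hypothetical and not part of the original checklist. It flags rows whose field count differs from the header (a common symptom of broken lines) and lists ID values that appear more than once.

```python
import csv
import io
from collections import Counter


def check_import(csv_text, id_column):
    """Return (broken_rows, duplicate_ids) for a CSV file given as text.

    broken_rows: line numbers of rows whose field count differs from the header.
    duplicate_ids: sorted ID values that occur on more than one intact row.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    id_index = header.index(id_column)

    broken_rows = []
    ids = []
    for row in reader:
        if len(row) != len(header):
            # reader.line_num is the number of source lines read so far
            broken_rows.append(reader.line_num)
            continue
        ids.append(row[id_index])

    duplicate_ids = sorted(v for v, n in Counter(ids).items() if n > 1)
    return broken_rows, duplicate_ids


# Hypothetical sample: line 3 is missing a field and ID 101 appears twice.
sample = "hh_id,age,village\n101,34,A\n102,29\n103,41,B\n101,52,C\n"
broken, duplicates = check_import(sample, "hh_id")
print(broken, duplicates)
```

The function only reports problems and never modifies the input, in keeping with item 1.4: corrections belong in a separate cleaned file, not in the raw data.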
2. Important steps for data cleaning
Initials  No.   Checklist item
[ __ ]    2.1   Label variables; do not use special characters
[ __ ]    2.2   Recode and label missing values: for example, the data set should not contain observations with values such as -777, -88 or -9
[ __ ]    2.3   Encode variables: all categorical variables should be saved as labeled numeric variables, not strings
[ __ ]    2.4   Don’t change variable names from the questionnaire, except for nested repeat groups and reshaped roster data
[ __ ]    2.5   Check sample representativeness by age, gender, urban/rural, region and religion
[ __ ]    2.6   Check that administrative data, such as date, time and interviewer variables, are included
[ __ ]    2.7   Test consistency across variables
[ __ ]    2.8   Identify and document outliers
[ __ ]    2.9   Compress the data set so it is saved in the most efficient format
[ __ ]    2.10  Save the cleaned data set with an informative name. Avoid saving in a very recent Stata version
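
As a rough illustration of items 2.2 and 2.8, the Python sketch below recodes survey missing-value codes before looking for outliers, so that a value such as -777 is not mistaken for a real observation. The meanings attached to -777, -88 and -9 are assumptions made for this example (the checklist only names the codes), and the 1.5 × IQR rule is one common convention for flagging outliers, not a method prescribed by the checklist. All function and variable names are hypothetical.

```python
import statistics

# Assumed meanings of the survey codes; check the questionnaire for the real ones.
MISSING_CODES = {-777: "refused", -88: "don't know", -9: "not applicable"}


def recode_missing(values, codes=MISSING_CODES):
    """Return (cleaned, reasons): None replaces each missing code, and
    reasons records why the value is missing (None for observed values)."""
    cleaned, reasons = [], []
    for v in values:
        if v in codes:
            cleaned.append(None)
            reasons.append(codes[v])
        else:
            cleaned.append(v)
            reasons.append(None)
    return cleaned, reasons


def flag_outliers(values, k=1.5):
    """Return indices of non-missing values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    observed = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(observed, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [i for i, v in enumerate(values)
            if v is not None and not low <= v <= high]


# Hypothetical income variable with three missing codes and one extreme value.
raw_income = [120, -88, 135, 110, -777, 128, 5000, 117, -9, 125]
income, why_missing = recode_missing(raw_income)
print(income)
print(flag_outliers(income))
```

Flagged indices are returned rather than dropped or altered: the checklist asks for outliers to be identified and documented, and whether to correct them is a separate decision.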
3. Optional steps in data cleaning
Initials  No.   Checklist item
[ __ ]    3.1   Order variables: unique ID always first, then same order as questionnaire
[ __ ]    3.2   Drop variables that only make sense for questionnaire review (duration, notes, calculates)
[ __ ]    3.3   Rename roster variables
[ __ ]    3.4   Categorize variables listed as “others”
[ __ ]    3.5   Add metadata as notes: original survey question, relevance, constraints, etc.
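
Items 3.1 and 3.2 amount to a reordering rule, which can be sketched in Python as follows. The function `order_variables` and all column names are hypothetical. It puts the unique ID first, follows the questionnaire order for the remaining variables, and drops variables marked as useful only for questionnaire review.

```python
def order_variables(columns, id_column, questionnaire_order, review_only=()):
    """Return column names with the ID first, then questionnaire order.

    Columns in review_only are dropped. Columns that do not appear in
    questionnaire_order keep their original relative order at the end.
    """
    if id_column not in columns:
        raise ValueError(f"ID column {id_column!r} not found")
    kept = [c for c in columns if c != id_column and c not in review_only]
    rank = {name: i for i, name in enumerate(questionnaire_order)}
    # sorted() is stable, so unranked columns stay in their original order.
    ordered = sorted(kept, key=lambda c: rank.get(c, len(rank)))
    return [id_column] + ordered


# Hypothetical column names from an exported survey.
columns = ["duration", "q2_income", "hh_id", "q1_age", "interviewer_note", "q3_village"]
new_order = order_variables(
    columns,
    id_column="hh_id",
    questionnaire_order=["q1_age", "q2_income", "q3_village"],
    review_only={"duration", "interviewer_note"},
)
print(new_order)
```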
The checklists are edited through GitHub. This checklist corresponds to the file named chk_datacleaning.js. For a simple step-by-step guide on how to edit the checklist, see this documentation:
https://github.com/worldbank/DIMEwiki/tree/master/Topics/Checklists.


This article is part of the topic Check Lists

Additional Resources

  • DIME Analytics’ guidelines on data cleaning 1 and 2
  • The Stata Cheat Sheets on Data Processing and Data Transformation are helpful reminders of relevant Stata code.
  • The Quartz guide to bad data on GitHub has many helpful tips for dealing with the kinds of data problems that often come up in real-world settings.
  • See this data cleaning checklist to ensure that common cleaning actions have been completed. Note that this is not an exhaustive list. Such a list is impossible to create because each dataset and analysis requires different cleaning depending on context.