Checklist: Data Cleaning
A printable version of this checklist is available. For more detailed instructions on how to implement the tasks in this checklist, see Data Cleaning.

Project name: _______________________________________
Country: ___________________________________________
District: ____________________________________________
Year, Month and/or Day: _____________________________

1. Before data cleaning: Importing the data

Initials | No. | Checklist Item | |
---|---|---|---|
[ __ ] | 1.1 | Check for importing issues such as broken lines when importing .csv files | |
[ __ ] | 1.2 | Make sure you have unique IDs | |
[ __ ] | 1.3 | De-identify all data and save in a new .dta file | |
[ __ ] | 1.4 | Never make any changes to the raw data | |

2. Important steps for data cleaning

Initials | No. | Checklist Item | |
---|---|---|---|
[ __ ] | 2.1 | Label variables; don’t use special characters in the labels | |
[ __ ] | 2.2 | Recode and label missing values: your dataset should not contain survey codes such as -777, -88 or -9 as values | |
[ __ ] | 2.3 | Encode variables: all categorical variables should be saved as labeled numeric variables, not strings | |
[ __ ] | 2.4 | Don’t change variable names from the questionnaire, except for nested repeat groups and reshaped roster data | |
[ __ ] | 2.5 | Check sample representativeness of age, gender, urban/rural, region and religion | |
[ __ ] | 2.6 | Check that administrative data such as date, time and interviewer variables are included | |
[ __ ] | 2.7 | Test variables for consistency | |
[ __ ] | 2.8 | Identify and document outliers | |
[ __ ] | 2.9 | Compress dataset so it is saved in the most efficient format | |
[ __ ] | 2.10 | Save the cleaned dataset with an informative name. Avoid saving in the format of a very recent Stata version, since older versions of Stata may not be able to open it | |

3. Optional steps in data cleaning

Initials | No. | Checklist Item | |
---|---|---|---|
[ __ ] | 3.1 | Order variables – unique ID always first, then same order as questionnaire | |
[ __ ] | 3.2 | Drop variables that only make sense for questionnaire review (duration, notes, calculates) | |
[ __ ] | 3.3 | Rename roster variables | |
[ __ ] | 3.4 | Categorize variables listed as “others” | |
[ __ ] | 3.5 | Add metadata as notes: original survey question, relevance, constraints, etc. | |

The checklists are edited through GitHub. This checklist corresponds to the file named chk_datacleaning.js. For a simple step-by-step guide on how to edit the checklist, see this documentation: https://github.com/worldbank/DIMEwiki/tree/master/Topics/Checklists.
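
Items 1.1 and 1.2 can be checked mechanically before any cleaning starts. The sketch below is one possible way to do so, written in Python with only the standard library rather than in Stata. The function name `check_csv_import`, the column name `hhid` and the sample data are hypothetical and only serve as illustration. It treats any record whose field count differs from the header's as a possible broken line, and reports ID values that occur more than once.

```python
import csv
import io
from collections import Counter


def check_csv_import(csv_text, id_column):
    """Return (broken_rows, duplicate_ids) for CSV content given as a string.

    broken_rows: 1-based record numbers (the header is record 1) whose field
        count differs from the header's, a common symptom of broken lines.
    duplicate_ids: sorted ID values that appear on more than one record.
    """
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    header = next(reader)
    id_index = header.index(id_column)

    broken_rows = []
    id_counts = Counter()
    for record_number, row in enumerate(reader, start=2):
        if len(row) != len(header):
            broken_rows.append(record_number)
            continue  # the ID position is unreliable in a broken record
        id_counts[row[id_index]] += 1

    duplicate_ids = sorted(value for value, n in id_counts.items() if n > 1)
    return broken_rows, duplicate_ids


# Hypothetical sample: the district of household 102 contains a stray line
# break, and household 103 was exported twice.
sample = (
    "hhid,district,comment\n"
    "101,North,ok\n"
    "102,Sou\n"
    "th,ok\n"
    "103,East,ok\n"
    "103,East,ok\n"
)
broken, duplicates = check_csv_import(sample, "hhid")
print("Records with unexpected field counts:", broken)
print("Duplicate IDs:", duplicates)
```

Both fragments of the broken line are reported, so the affected record can be located in the export and the export itself corrected or re-run. The raw file is only read here, never modified, which is consistent with item 1.4.
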
This article is part of the topic Check Lists
Additional Resources
- DIME Analytics’ guidelines on data cleaning 1 and 2
- The Stata Cheat Sheets on Data Processing and Data Transformation are helpful reminders of relevant Stata code.
- The Quartz guide to bad data on GitHub has lots of helpful tips for dealing with the kind of data problems that often come up in real-world settings.
- See this data cleaning checklist to ensure that common cleaning actions have been completed. Note that this is not an exhaustive list. Such a list is impossible to create because individual datasets and analyses require different cleaning depending on context.
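
As an illustration of two common cleaning actions from the checklist, the sketch below recodes survey codes to missing values (item 2.2) and then flags outliers without dropping them (item 2.8). It is written in Python with only the standard library rather than in Stata. The meanings assigned to -777, -88 and -9, the function names and the income values are assumptions made for this example; the real meanings come from the questionnaire's codebook. The 1.5 × IQR rule is one common convention for flagging outliers, not a rule prescribed by the checklist.

```python
import statistics

# Hypothetical codebook: survey codes that stand for a kind of missing answer.
MISSING_CODES = {-777: "other", -88: "don't know", -9: "refused"}


def recode_missing(values, missing_codes=MISSING_CODES):
    """Return (cleaned, reasons): None replaces each code, the label is kept."""
    cleaned, reasons = [], []
    for value in values:
        if value in missing_codes:
            cleaned.append(None)
            reasons.append(missing_codes[value])
        else:
            cleaned.append(value)
            reasons.append(None)
    return cleaned, reasons


def flag_outliers(values, k=1.5):
    """Return indices of non-missing values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    observed = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(observed, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [
        i for i, v in enumerate(values)
        if v is not None and not low <= v <= high
    ]


# Hypothetical income variable containing two survey codes and one extreme value.
raw_income = [120, 135, -88, 110, 140, -9, 125, 130, 5000, 115]
income, missing_reason = recode_missing(raw_income)
outlier_rows = flag_outliers(income)

print("Cleaned values:", income)
print("Missing reasons:", missing_reason)
print("Rows to review and document as outliers:", outlier_rows)
```

Recoding before looking for outliers matters: left in place, values such as -88 would be treated as real observations and distort the quartiles. The flagged rows are only reported so that they can be reviewed and documented.
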