Difference between revisions of "Data Cleaning"
(54 intermediate revisions by 6 users not shown) | |||
Line 1: | Line 1: | ||
Data cleaning is an essential step between [[Primary Data Collection | data collection]] and [[Data Analysis | data analysis]]. Raw primary data is always imperfect and needs to be prepared for a high quality '''analysis''' and overall [[Reproducible Research | replicability]]. In extremely rare cases, the only preparation needed is [[Data Documentation | dataset documentation]]. However, in the vast majority of cases, data cleaning requires significant energy and attention, typically on the part of the [[Impact_Evaluation_Team#Research_Assistant|Research Assistant]] (RA). This page outlines the goals of data cleaning, recommends role division, outlines common issues encountered, and presents approaches to resolve them. | |||
Data cleaning is an essential step between data collection and data analysis. Raw primary data is always imperfect and needs to be prepared | |||
== Read First == | == Read First == | ||
* | *The goal of data cleaning is to clean individual data points and to make the [[Master Dataset|dataset]] easily usable and understandable for the [[Impact Evaluation Team|research team]] and external users. | ||
*The quality of the [[Data Analysis|analysis]] will never be better than the quality of data cleaning. | |||
*The quality of the analysis will never be better than the quality of data cleaning. | *There is no such thing as an exhaustive list of what to do during data cleaning: each project will have individual cleaning needs. This article provides a very good place to start. | ||
*There is no such thing as an exhaustive list of what to do during data cleaning | *See this [[Checklist:_Data_Cleaning| data cleaning checklist]] to ensure that common cleaning actions have been completed. | ||
* | |||
== The Goals of Cleaning == | |||
# | The data cleaning process seeks to fulfill two goals: | ||
# | #To ensure valid [[Data Analysis|analysis]] by cleaning individual data points that bias the analysis | ||
#To make the [[Master Dataset|dataset]] easily usable and understandable for researchers both within and outside of the [[Impact Evaluation Team|research team]]. A really good data cleaning process should also result in [[Data Documentation|documented]] insights about the data and [[Primary Data Collection|data collection]] to inform future '''data collection''' – either for a different round of the same project or for other future projects. | |||
[[File: | [[File:Picture2.png|700px|link=|center]] | ||
=== | === Ensuring Valid Analysis === | ||
[[Randomized Control Trials | RCT]] analysis typically rely to regressions to test for statistical differences between the means of the '''control''' and '''treatment groups'''. In essence, one can think of regression analysis as an advanced comparison of means. While this is, of course, an extreme simplification, it may provide a useful framework and perspective to an RA cleaning a [[Master Dataset|dataset]] for the first time. While it may be difficult to have an intuition for the math behind a regression, it easy to have an intuition for the math behind a mean. | |||
Anything that biases a mean will bias a regression: outliers, missing values, typos, erroneous [[Survey Pilot|survey]] codes, illogical values, [[Duplicates and Survey Logs|duplicates]], etc. While many more things can also bias a regression, this conceptualization provides a good starting place for anyone cleaning a '''dataset''' for the first time. The researcher leading the analysis is trained in the other granular details and knowledge necessary for the specific regression models. | |||
=== | ===Making the Dataset Usable and Understandable === | ||
The second goal of the data cleaning is to document the | The second goal of the data cleaning is to code and [[Data Documentation|document ]]the [[Master Dataset|dataset]] to make it as self-explanatory as possible. At the time of the [[Primary Data Collection|data collection]] and data cleaning, you know the '''dataset''' much better than you will at any time in the future. Carefully '''documenting''' this knowledge often makes the difference between a good [[Data Analysis|analysis]] and a great one. A usable and understandable '''dataset''' will not only help you and your [[Impact Evaluation Team|research team]] in the future, but also other researchers who use the '''dataset''' down the road. | ||
== Role Division during Data Cleaning == | == Role Division during Data Cleaning == | ||
[[Impact_Evaluation_Team#Research_Assistant|Research Assistants]] (RAs) and [[Field_Coordinator|Field Coordinators]] (FCs) should prioritize their time on identifying and [[Data Documentation|documenting]] irregularities in the data rather than correcting them. It is never bad to suggest corrections to irregularities. However, many RAs or '''FCs''' spend too much time trying to fix irregularities and, in turn, do not have enough time to identify and '''document''' them completely. This is often inefficient, as different regression models and/or PI preferences may require different corrections. In such cases, time-consuming corrections may not be valid given the regression model used in the analysis. | |||
Eventually, the [[Impact_Evaluation_Team#Principal_Investigator|Principal Investigator]] (PI) and the RA or '''FC''' will have a common understanding on what correction decisions to make without involving the PI. Until then, the RA should focus their time on identifying and '''documenting''' as many issues as possible rather than fixing them. Again, it is no problem to do both so long as the time spent fixing doesn't prevent the RA from identifying and '''documenting''' as many issues as possible. | |||
== Importing Data == | |||
The first step in cleaning the data is to import the data. When working with [[Secondary Data Sources|secondary data]], i.e. [[Administrative and Monitoring Data | administrative data]], this step is often straightforward. However, with primary data, this step is often underestimated. | |||
All modern CAPI survey data | All modern [[Computer-Assisted Personal Interviews (CAPI) | CAPI]] [[Survey Pilot|survey]] [[Primary Data Collection|data collection]] tools provide methods for importing the raw data in a way that drastically reduces cleaning work. These methods typically include a [[Stata Coding Practices|Stata]] '''do-file''' that generates labels and other features from the [[Questionnaire Programming | questionnaire]] code and then applies them to the raw data during the import. | ||
Ensure that any change, no matter how small, is always be made in '''Stata''', [[R Coding Practies|R]], or the scripting language of use. When dealing with incorrect submissions in raw data, for example [[Duplicates and Survey Logs | duplicates]], pilot data mixed with the main data, etc., handle these issues and deletions in such a way that they can be [[Reproducible Research | replicated]] by re-running code: without this information, the [[Data Analysis|analysis]] may no longer be valid. See the article on [[DataWork_Survey_Round#Raw_Folder|raw data folders]] for more details. | |||
== Data Issues and Approaches == | |||
A countless list of irregularities may appear in a primary [[Master Dataset|dataset]], requiring a multitude of data cleaning actions. This section does not provide an exhaustive list but rather a few examples of irregularities and approaches. | |||
=== | ===ID Variables=== | ||
Observations in the [[Master Dataset|dataset]] should be [[ID Variable Properties | uniquely and fully identifiable]] by a single '''ID variable'''. Often, raw [[Primary Data Collection|primary data]] includes [[Duplicates and Survey Logs|duplicate entries]]. Carefully [[Data Documentation | document]] these cases. To ensure accuracy, only correct them after discussing with the [[Field Coordinator]] and field team what caused them. <code>[[ieduplicates]]</code>, a command in [[Stata Coding Practices|Stata]], identifies '''duplicated''' entries, while <code>[[iecompdup]]</code> helps to correct them. Once '''duplicates''' are corrected, the observations can be linked to the [[Master DataSet|master dataset]]. | |||
===Illogical Values=== | |||
In theory, good [[Questionnaire Programming | questionnaire programming]] should include logic checks that prevent illogical values. | |||
For example, if a respondent is male, then the '''questionnaire''' should not allow the respondent to answer that he is pregnant. However, no '''questionnaire''' can ever be pre-programmed to control for every such case. Discuss with the research best approaches to illogical values found in the raw [[Master Dataset|dataset]]. | |||
===Typos=== | |||
If it is obvious beyond any doubt that the response is incorrect due to a simple typo, then correct the typo. Make sure to [[Data Documentation|document]] the change in a [[Reproducible Research|replicable]] way. | |||
=== Survey Codes and Missing Values === | === Survey Codes and Missing Values === | ||
Almost all data collection done through surveys of any sort allows | Almost all [[Primary Data Collection|data collection]] done through [[Survey Pilot|surveys]] of any sort allows respondents to answer something like "Do not know" or "Decline to answer" for individual questions. These answers are usually recorded using '''survey''' codes in the format -999, -88 or something similar. If left as such, these numbers will bias means and regressions. Accordingly, they must be replaced with missing values in [[Stata Coding Practices|Stata]]. | ||
'''Stata''' has several missing values. The most well-known is the regular missing value represented by a single "." but '''Stata''' also offers extended missing values: ".a", ".b", ".c" etc. all the way to ".z". '''Stata''' handles these values the same as "." in commands that expect a numeric value. Conveniently, these extended missing values accept value labels that allow you to distinguish between, for example, "Do not know" and "Decline to answer." You might label ".d", for example, as "Decline to answer", and ".k" as "Do not know." Make sure to always assign value labels to extended missing values so that they can be precisely interpreted. Finally, make sure to consistently use the same letter ".a", ".b" etc. to represent only one response across your project. See [http://www.stata.com/manuals13/u12.pdf#u12.2 Stata Manual Missing Values] for more details on missing values. | |||
Missing values can be used for much more than just '''survey''' codes. Any value that we remove because it is incorrect should be replaced with a missing value. In a [[Master DataSet | master dataset]], there should be no regular missing values. All missing values in a '''master dataset''' should contain an explanation of why there is no information for that value. | |||
=== Strings === | |||
''' | All data should be stored in numeric format because | ||
#[[Stata Coding Practices|Stata]] stores numbers more efficiently than strings | |||
#'''Stata''' commands expect values to be stored numerically. During the data cleaning process, make sure to clean categorical string '''variables''' and convert them into numeric codes. Then assign value labels for clarity. The commands <code>destring</code> and/or <code>encode</code> may be useful during this process. | |||
'''Non-categorical text | There are two exceptions in which string '''variables''' are acceptable: | ||
#If the number cannot be stored correctly numerically. This may occur in two scenarios: | |||
##If the number is more than 15 digits long. For obvious reasons, an [[ID Variable Properties | ID]] cannot be rounded and may remain a string. However, if a continuous '''variable''' has more than 15 digits, round it and convert it to a different scale. After all, a precision of 16 digits is not even possible in natural sciences. | |||
##If the number begins with 0, as is sometimes the case for national IDs and telephone numbers. In this case, continue storing the number as a string, as '''Stata''' would remove any leading zeros when destringing. | |||
#Non-categorical text. It is acceptable to store text answers that cannot be converted into categories as strings. A few examples follow: | |||
##Open-ended questions: open-ended questions should, in general, be avoided, but sometimes the [[Questionnaire Programming|questionnaire]] asks the respondent to answer a question in his or her own words. | |||
##Other specifications: the respondent is asked to specify the answer after answering ''other'' in a multiple choice question. | |||
##Proper names: names of people, etc. Note that not all proper names should be stored as string as some can be made into categories. For example, if you [[Primary Data Collection|collect data]] on respondents and multiple respondents live in the same villages, then convert the '''variable''' with the village names into a categorical numeric '''variable''' and assign a value label. | |||
== | == Applying Labels == | ||
There are several ways to add helpful descriptive text to a | There are several ways to add helpful descriptive text to a [[Master Dataset|dataset]] in [[Stata Coding Practices|Stata]], but the two most common and important ways are '''variable''' labels and value labels. | ||
===Variable Labels=== | |||
All variables in a clean | All variables in a clean [[Master Dataset|dataset]] should have '''variable''' labels that explain what the '''variable''' represents. In addition to a brief explanation of the '''variable''' and perhaps the question number from which it comes, you may also decide to include information such as the unit or currency used in the '''variable'''. The label can be up to 80 characters long. | ||
===Value Labels=== | |||
There are tools in Stata | Always store categorical '''variables''' numerically and use value labels to indicate what the numeric code represents. For example, yes and no questions should be stored as 0 and 1 with the value labels ''No'' for data cells with 0, and the label ''Yes'' for all data cells with 1. This same concept applies to multiple choice '''variables'''. There are tools in [[Stata Coding Practices|Stata]] that convert categorical string '''variables''' into categorical numeric '''variables''' and automatically apply the string as value labels. The most common tool is <code>encode</code>. However, if you use <code>encode</code>, always use the two options <code>label()</code> and <code>noextend</code>. | ||
*<code>label()</code> forces you to manually create the label before using encode. This requires some manual work but it is worth it. | |||
*<code>noextend</code> throws an error if there is a value in the data that does not exist in the pre-defined label. This way you are notified that you need to add the new value to the value label you created manually. Or you can change the string value if there is a typo. | |||
Without these two options, '''Stata''' assigns a code to each string value in alphabetic order. There is no guarantee that the alphabetic order is changed when observations are added or removed, or if someone else makes changes earlier in the code. | |||
== Additional Resources == | == Additional Resources == | ||
* | *DIME Analytics (World Bank), [https://osf.io/ndsk6 Guidelines on Data Cleaning] | ||
* The Stata Cheat Sheets on [http://geocenter.github.io/StataTraining/pdf/StataCheatsheet_processing_15_June_2016_TE-REV.pdf Data processing] and [http://geocenter.github.io/StataTraining/pdf/StataCheatsheet_Transformation15_June_2016_TE-REV.pdf Data Transformation] are helpful reminder of relevant Stata code. | |||
* The [https://github.com/Quartz/bad-data-guide#values-are-missing Quartz guide to bad data] on Github has lots of helpful tips for dealing with the kind of data problems that often come up in real world settings. | |||
*See this [[Checklist:_Data_Cleaning| data cleaning checklist]] to ensure that common cleaning actions have been completed. Note that this is not an exhaustive list. Such a list is impossible to create as the individual datasets and the analysis require different cleaning depending on context. | |||
[[Category: Data Cleaning ]] | [[Category: Data Cleaning ]] |
Latest revision as of 18:09, 14 August 2023
Data cleaning is an essential step between data collection and data analysis. Raw primary data is always imperfect and needs to be prepared for a high quality analysis and overall replicability. In extremely rare cases, the only preparation needed is dataset documentation. However, in the vast majority of cases, data cleaning requires significant energy and attention, typically on the part of the Research Assistant (RA). This page outlines the goals of data cleaning, recommends role division, outlines common issues encountered, and presents approaches to resolve them.
Read First
- The goal of data cleaning is to clean individual data points and to make the dataset easily usable and understandable for the research team and external users.
- The quality of the analysis will never be better than the quality of data cleaning.
- There is no such thing as an exhaustive list of what to do during data cleaning: each project will have individual cleaning needs. This article provides a very good place to start.
- See this data cleaning checklist to ensure that common cleaning actions have been completed.
The Goals of Cleaning
The data cleaning process seeks to fulfill two goals:
- To ensure valid analysis by cleaning individual data points that bias the analysis
- To make the dataset easily usable and understandable for researchers both within and outside of the research team. A really good data cleaning process should also result in documented insights about the data and data collection to inform future data collection – either for a different round of the same project or for other future projects.
Ensuring Valid Analysis
RCT analysis typically rely to regressions to test for statistical differences between the means of the control and treatment groups. In essence, one can think of regression analysis as an advanced comparison of means. While this is, of course, an extreme simplification, it may provide a useful framework and perspective to an RA cleaning a dataset for the first time. While it may be difficult to have an intuition for the math behind a regression, it easy to have an intuition for the math behind a mean.
Anything that biases a mean will bias a regression: outliers, missing values, typos, erroneous survey codes, illogical values, duplicates, etc. While many more things can also bias a regression, this conceptualization provides a good starting place for anyone cleaning a dataset for the first time. The researcher leading the analysis is trained in the other granular details and knowledge necessary for the specific regression models.
Making the Dataset Usable and Understandable
The second goal of the data cleaning is to code and document the dataset to make it as self-explanatory as possible. At the time of the data collection and data cleaning, you know the dataset much better than you will at any time in the future. Carefully documenting this knowledge often makes the difference between a good analysis and a great one. A usable and understandable dataset will not only help you and your research team in the future, but also other researchers who use the dataset down the road.
Role Division during Data Cleaning
Research Assistants (RAs) and Field Coordinators (FCs) should prioritize their time on identifying and documenting irregularities in the data rather than correcting them. It is never bad to suggest corrections to irregularities. However, many RAs or FCs spend too much time trying to fix irregularities and, in turn, do not have enough time to identify and document them completely. This is often inefficient, as different regression models and/or PI preferences may require different corrections. In such cases, time-consuming corrections may not be valid given the regression model used in the analysis.
Eventually, the Principal Investigator (PI) and the RA or FC will have a common understanding on what correction decisions to make without involving the PI. Until then, the RA should focus their time on identifying and documenting as many issues as possible rather than fixing them. Again, it is no problem to do both so long as the time spent fixing doesn't prevent the RA from identifying and documenting as many issues as possible.
Importing Data
The first step in cleaning the data is to import the data. When working with secondary data, i.e. administrative data, this step is often straightforward. However, with primary data, this step is often underestimated.
All modern CAPI survey data collection tools provide methods for importing the raw data in a way that drastically reduces cleaning work. These methods typically include a Stata do-file that generates labels and other features from the questionnaire code and then applies them to the raw data during the import.
Ensure that any change, no matter how small, is always be made in Stata, R, or the scripting language of use. When dealing with incorrect submissions in raw data, for example duplicates, pilot data mixed with the main data, etc., handle these issues and deletions in such a way that they can be replicated by re-running code: without this information, the analysis may no longer be valid. See the article on raw data folders for more details.
Data Issues and Approaches
A countless list of irregularities may appear in a primary dataset, requiring a multitude of data cleaning actions. This section does not provide an exhaustive list but rather a few examples of irregularities and approaches.
ID Variables
Observations in the dataset should be uniquely and fully identifiable by a single ID variable. Often, raw primary data includes duplicate entries. Carefully document these cases. To ensure accuracy, only correct them after discussing with the Field Coordinator and field team what caused them. ieduplicates
, a command in Stata, identifies duplicated entries, while iecompdup
helps to correct them. Once duplicates are corrected, the observations can be linked to the master dataset.
Illogical Values
In theory, good questionnaire programming should include logic checks that prevent illogical values. For example, if a respondent is male, then the questionnaire should not allow the respondent to answer that he is pregnant. However, no questionnaire can ever be pre-programmed to control for every such case. Discuss with the research best approaches to illogical values found in the raw dataset.
Typos
If it is obvious beyond any doubt that the response is incorrect due to a simple typo, then correct the typo. Make sure to document the change in a replicable way.
Survey Codes and Missing Values
Almost all data collection done through surveys of any sort allows respondents to answer something like "Do not know" or "Decline to answer" for individual questions. These answers are usually recorded using survey codes in the format -999, -88 or something similar. If left as such, these numbers will bias means and regressions. Accordingly, they must be replaced with missing values in Stata.
Stata has several missing values. The most well-known is the regular missing value represented by a single "." but Stata also offers extended missing values: ".a", ".b", ".c" etc. all the way to ".z". Stata handles these values the same as "." in commands that expect a numeric value. Conveniently, these extended missing values accept value labels that allow you to distinguish between, for example, "Do not know" and "Decline to answer." You might label ".d", for example, as "Decline to answer", and ".k" as "Do not know." Make sure to always assign value labels to extended missing values so that they can be precisely interpreted. Finally, make sure to consistently use the same letter ".a", ".b" etc. to represent only one response across your project. See Stata Manual Missing Values for more details on missing values.
Missing values can be used for much more than just survey codes. Any value that we remove because it is incorrect should be replaced with a missing value. In a master dataset, there should be no regular missing values. All missing values in a master dataset should contain an explanation of why there is no information for that value.
Strings
All data should be stored in numeric format because
- Stata stores numbers more efficiently than strings
- Stata commands expect values to be stored numerically. During the data cleaning process, make sure to clean categorical string variables and convert them into numeric codes. Then assign value labels for clarity. The commands
destring
and/orencode
may be useful during this process.
There are two exceptions in which string variables are acceptable:
- If the number cannot be stored correctly numerically. This may occur in two scenarios:
- If the number is more than 15 digits long. For obvious reasons, an ID cannot be rounded and may remain a string. However, if a continuous variable has more than 15 digits, round it and convert it to a different scale. After all, a precision of 16 digits is not even possible in natural sciences.
- If the number begins with 0, as is sometimes the case for national IDs and telephone numbers. In this case, continue storing the number as a string, as Stata would remove any leading zeros when destringing.
- Non-categorical text. It is acceptable to store text answers that cannot be converted into categories as strings. A few examples follow:
- Open-ended questions: open-ended questions should, in general, be avoided, but sometimes the questionnaire asks the respondent to answer a question in his or her own words.
- Other specifications: the respondent is asked to specify the answer after answering other in a multiple choice question.
- Proper names: names of people, etc. Note that not all proper names should be stored as string as some can be made into categories. For example, if you collect data on respondents and multiple respondents live in the same villages, then convert the variable with the village names into a categorical numeric variable and assign a value label.
Applying Labels
There are several ways to add helpful descriptive text to a dataset in Stata, but the two most common and important ways are variable labels and value labels.
Variable Labels
All variables in a clean dataset should have variable labels that explain what the variable represents. In addition to a brief explanation of the variable and perhaps the question number from which it comes, you may also decide to include information such as the unit or currency used in the variable. The label can be up to 80 characters long.
Value Labels
Always store categorical variables numerically and use value labels to indicate what the numeric code represents. For example, yes and no questions should be stored as 0 and 1 with the value labels No for data cells with 0, and the label Yes for all data cells with 1. This same concept applies to multiple choice variables. There are tools in Stata that convert categorical string variables into categorical numeric variables and automatically apply the string as value labels. The most common tool is encode
. However, if you use encode
, always use the two options label()
and noextend
.
label()
forces you to manually create the label before using encode. This requires some manual work but it is worth it.noextend
throws an error if there is a value in the data that does not exist in the pre-defined label. This way you are notified that you need to add the new value to the value label you created manually. Or you can change the string value if there is a typo.
Without these two options, Stata assigns a code to each string value in alphabetic order. There is no guarantee that the alphabetic order is changed when observations are added or removed, or if someone else makes changes earlier in the code.
Additional Resources
- DIME Analytics (World Bank), Guidelines on Data Cleaning
- The Stata Cheat Sheets on Data processing and Data Transformation are helpful reminder of relevant Stata code.
- The Quartz guide to bad data on Github has lots of helpful tips for dealing with the kind of data problems that often come up in real world settings.
- See this data cleaning checklist to ensure that common cleaning actions have been completed. Note that this is not an exhaustive list. Such a list is impossible to create as the individual datasets and the analysis require different cleaning depending on context.