Tidying Data
Data is often acquired in various shapes and sizes, but it is most commonly received in the form of data tables. In the context of statistics, the ideal format in which to analyze data tables is called tidy data. A tabular dataset is tidy when each column corresponds to one variable in the dataset, each row corresponds to one observation, and all variables in the dataset have the same unit of observation.
Read First
- In the context of development research, survey data is rarely acquired in a tidy format.
- A variable is a collection of data points that measure the same attribute across units, for example, name, age, income, etc. There are multiple variables represented in a single column.
- An observation is a collection of all values measured on the same unit across attributes, for example, in a survey of seasonal crop patterns for 1000 households in a district, each household is an observation. Each observation is separated into multiple rows.
- Each data point represents one variable and one observation.
- A dataset is a collection of data points and the most common form in which it is organized is a data table.
- Not all data tables are easy to work with, but work in database management has shown that the easiest format to work with these tables in is called tidied data.
The Simple Case: Reshaping
When all observations in a data table have the same unit of observation, reshaping is enough to make untidy data tidy. Reshaping is the transformation of the organization of rows and columns in a data table in such a way that the unit of observation represented by a row changes. This can be done in Stata using the reshape
command. It is important to note that reshaping does not change the content of the data points in the table.
More Complex Cases
A more complex, often encountered problem is that a dataset consists of multiple units of observation stored in the same table. This is resolved in three steps:
- Identify all the variables that were measured at the same level of observation
- Create separate tables for each level of observation
- Reshape the data
At the end of this process, your dataset will consist of multiple data tables, one for each unit of observation. To give a concrete example, say we have a data table consisting of two variables, age and gender, where the variables in the table have two different units of observation: household and household member. Because age and gender are measured at two different levels, they are split into two columns and each row contains multiple observations.
There are two additional components that are crucial to successfully managing a set of tidy tables:
- Each table must be uniquely and fully identified by one or a set of ID variables
- You must be able to use these ID variables to link all data tables to each other
The tidying process gets more complex as the number of nested groups increases: the steps of identifying the unit of observation of each variable and reshaping the separated tables need to be repeated. However, the more nested groups a dataset includes, the more efficient it is to deal with tidy data as compared to untidy.
Relevance for Data Cleaning
When cleaning data, and survey data in particular, it is usually necessary to recode values, label variables and values, and explore variables. Cleaning tidy data is easier because in the tidy dataset, each variable corresponds to one survey question:
- Each question has the same set of options, you can apply them to a single column in a tidy dataset, instead of looping over multiple columns.
- Using a tidy dataset, you can summarize all answers to a question easily
Additional Resources
- DIME Analytics (World Bank) Motivations and Definitions for Tidy Data
- DIME Analytics (World Bank) Tidying Data