Difference between revisions of "Tidying Data"
(23 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
Data is often acquired in various shapes and sizes, but it is most commonly received in the form of data tables. In the context of statistics, the ideal format in which to analyze data tables is called '''tidy data'''. A tabular dataset is '''tidy''' when each column corresponds to one '''variable''' in the dataset, each row corresponds to one '''observation''', and all variables in the dataset have the same [[Units of Observation|unit of observation]]. | Data is often acquired in various shapes and sizes, but it is most commonly received in the form of '''data tables'''. In the context of statistics, the ideal format in which to analyze '''data tables''' is called '''tidy data'''. A tabular '''dataset''' is '''tidy''' when each column corresponds to one '''variable''' in the '''dataset''', each row corresponds to one '''observation''', and all '''variables''' in the '''dataset''' have the same [[Units of Observation|unit of observation]]. | ||
== Read First == | == Read First == | ||
* In the context of development research, [[Primary Data Collection|survey data]] is rarely acquired in a '''tidy''' format. | * In the context of development research, [[Primary Data Collection|survey data]] is rarely acquired in a '''tidy''' format. | ||
* A '''variable''' is a collection of data points that measure the same attribute across units, | * A '''variable''' is a collection of data points that measure the same attribute across units, i.e. name, age, income, etc. One '''variable''' makes up a single column. | ||
* An '''observation''' is a collection of | * An '''observation''' is a collection of values measured on the same unit across attributes. For example, in a '''survey''' of seasonal crop patterns for 1000 households in a district, each household is an '''observation'''. Each '''observation''' is separated into multiple rows. | ||
* Each '''data point''' represents one '''variable''' and one '''observation'''. | * Each '''data point''' represents one '''variable''' and one '''observation'''. | ||
* A '''dataset''' is a collection of data points and the most common form in which it is organized is a '''data table'''. | * A '''dataset''' is a collection of data points and the most common form in which it is organized is a '''data table'''. | ||
*Not all '''data tables''' are easy to work with, but work in '''database management''' has shown that the easiest format to work with these tables in is called '''tidied data'''. | *Not all '''data tables''' are easy to work with, but work in '''database management''' has shown that the easiest format to work with these tables in is called '''tidied data'''. | ||
==Messy Datasets== | |||
This section describes the five most common problems with messy '''datasets''', along with their solutions: | |||
*Column headers are values, not '''variable''' names | |||
*Multiple '''variables''' are stored in one column | |||
*'''Variables''' are stored in both rows and columns | |||
===Column headers are values, not variable names=== | |||
Consider the cropped '''dataset''' below that explores the relationship between income and religion, with the three '''variables''' being "income", "religion", and "frequency". Notice that the values of the "income" '''variable''' are actually the columns of our '''dataset''': | |||
{| class="wikitable" style="width: 45%" | |||
|+ Column Headers as Variable Names | |||
|- | |||
! Religion !! <$10k !! $10-20k | |||
|- | |||
| Agnostic || 27 || 34 | |||
|- | |||
| Atheist || 12 || 27 | |||
|} | |||
The tidied dataset would look like this: | |||
{| class="wikitable" style="width: 45%" | |||
|+ Tidied Example 1 | |||
|- | |||
! Religion !! Income !! Frequency | |||
|- | |||
| Agnostic ||<$10k || 27 | |||
|- | |||
| Agnostic || $10-20k || 34 | |||
|- | |||
| ... || ... || ... | |||
|- | |||
| Atheist || $<10k || 12 | |||
|- | |||
| Atheist || $10-20k || 27 | |||
|} | |||
===Multiple variables are stored in one column=== | |||
Another data problem is when multiple bits of information are encoded into the same cell. Take a look at the cropped tuberculosis '''dataset''' below. Corresponding to each "M" column for males, there is also a "F" column for females. The intervening columns between M1524 and F014 depicting males between the ages of 24 and 65 aren't shown to save space: | |||
{| class="wikitable" style="width: 45%" | |||
|+ Multiple Variables Stored in One Column | |||
|- | |||
! Country !! Year !! M014 !! M1524 !! ... !! F014 | |||
|- | |||
| AD || 2000 || 0 || 0 || ... || MISSING | |||
|- | |||
| AE || 2000 || 2 || 4 || ... || 3 | |||
|} | |||
This section won't detail how to correct the problem but the following table shows the '''tidied''' data where there are now two columns, one for the age '''variable''' and another for the sex '''variable''': | |||
{| class="wikitable" style="width: 45%" | |||
|+ Tidied Example 2 | |||
|- | |||
! Country !! Year !! Sex !! Age !! Cases | |||
|- | |||
| AD || 2000 || M || 0-14 || 0 | |||
|- | |||
| AD || 2000 || M || 15-24 || 0 | |||
|- | |||
| ... || ... || ... || ... || ... | |||
|- | |||
| AE || 2000 || F || 0-14 || 3 | |||
|} | |||
===Variables are stored in both rows and columns=== | |||
Take a look at the following weather '''dataset'''. Notice that it seems strange: weather information is usually reported at the "day" level. That is, every '''observation''' is a day, and we should have a maximum and minimum temperature value for each day. However, in the initial “long” '''dataset''' format, we see that the data is stored by date and element values instead: | |||
{| class="wikitable" style="width: 45%" | |||
|+ Variables are stored in both rows and columns | |||
|- | |||
! ID !! Year !! Month !! Element !! d1 !! ... !! d31 | |||
|- | |||
| MX17004 || 2010 || 1 || tmax || - || ... || ... | |||
|- | |||
| MX17004 || 2010 || 1 || tmin || - || ... || ... | |||
|- | |||
| ... || ... || ... || ... || ... || ... || ... | |||
|- | |||
| MX17004 || 2010 || 5 || tmin || - || ... || ... | |||
|} | |||
Note that days 2-30 have been omitted to save space. Also note that this particular '''dataset''' has been cropped and so many of values in the table above seem like they are missing. The '''tidied dataset''' below shows some of the missing values. Again, this section will not show how the data was '''tidied''', only the result of the '''tidying''' itself: | |||
{| class="wikitable" style="width: 45%" | |||
|+ Tidied Example 3 | |||
|- | |||
! ID !! Date !! tmax !! tmin | |||
|- | |||
| MX17004 || 2010-01-30 || 27.8 || 14.5 | |||
|- | |||
| MX17004 || 2010-02-02 || 27.3 || 14.4 | |||
|- | |||
| MX17004 || 2010-02-03 || 24.1 || 14.4 | |||
|- | |||
| MX17004 || 2010-02-11 || 29.7 || 13.4 || | |||
|} | |||
For additional reading on '''tidying data''', such as the steps for '''tidying''' the data in examples 2 and 3 above as well as each example's complete '''dataset''' which may provide additional context, check out [https://ds4biomed.tech/ Data Science for the Biomedical Sciences], a book by Daniel Chen and Anne Brown, and [https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html this paper] on '''tidying data'''. They are also linked in the Additional Resources section. | |||
==The Simple Case: Reshaping== | ==The Simple Case: Reshaping== | ||
Line 145: | Line 248: | ||
*DIME Analytics (World Bank) [https://osf.io/k6nbz Motivations and Definitions for Tidy Data] | *DIME Analytics (World Bank) [https://osf.io/k6nbz Motivations and Definitions for Tidy Data] | ||
* DIME Analytics (World Bank) [https://osf.io/p4grs Tidying Data] | * DIME Analytics (World Bank) [https://osf.io/p4grs Tidying Data] | ||
* Daniel Chen and Anne Brown [https://ds4biomed.tech/clean-data-tidy.html#common-data-problems Data Science for the Biomedical Sciences ] | |||
*Hadley Wickham's [https://vita.had.co.nz/papers/tidy-data.html Tidy Data] (abbreviated version) | |||
** Full version [https://vita.had.co.nz/papers/tidy-data.html here] | |||
* Chester Ismay's and Albert Kim's [https://github.com/moderndive/ModernDive_book A Modern Dive Into R and the Tidyverse] |
Latest revision as of 18:44, 2 August 2023
Data is often acquired in various shapes and sizes, but it is most commonly received in the form of data tables. In the context of statistics, the ideal format in which to analyze data tables is called tidy data. A tabular dataset is tidy when each column corresponds to one variable in the dataset, each row corresponds to one observation, and all variables in the dataset have the same unit of observation.
Read First
- In the context of development research, survey data is rarely acquired in a tidy format.
- A variable is a collection of data points that measure the same attribute across units, i.e. name, age, income, etc. One variable makes up a single column.
- An observation is a collection of values measured on the same unit across attributes. For example, in a survey of seasonal crop patterns for 1000 households in a district, each household is an observation. Each observation is separated into multiple rows.
- Each data point represents one variable and one observation.
- A dataset is a collection of data points and the most common form in which it is organized is a data table.
- Not all data tables are easy to work with, but work in database management has shown that the easiest format to work with these tables in is called tidied data.
Messy Datasets
This section describes the five most common problems with messy datasets, along with their solutions:
- Column headers are values, not variable names
- Multiple variables are stored in one column
- Variables are stored in both rows and columns
Column headers are values, not variable names
Consider the cropped dataset below that explores the relationship between income and religion, with the three variables being "income", "religion", and "frequency". Notice that the values of the "income" variable are actually the columns of our dataset:
Religion | <$10k | $10-20k |
---|---|---|
Agnostic | 27 | 34 |
Atheist | 12 | 27 |
The tidied dataset would look like this:
Religion | Income | Frequency |
---|---|---|
Agnostic | <$10k | 27 |
Agnostic | $10-20k | 34 |
... | ... | ... |
Atheist | $<10k | 12 |
Atheist | $10-20k | 27 |
Multiple variables are stored in one column
Another data problem is when multiple bits of information are encoded into the same cell. Take a look at the cropped tuberculosis dataset below. Corresponding to each "M" column for males, there is also a "F" column for females. The intervening columns between M1524 and F014 depicting males between the ages of 24 and 65 aren't shown to save space:
Country | Year | M014 | M1524 | ... | F014 |
---|---|---|---|---|---|
AD | 2000 | 0 | 0 | ... | MISSING |
AE | 2000 | 2 | 4 | ... | 3 |
This section won't detail how to correct the problem but the following table shows the tidied data where there are now two columns, one for the age variable and another for the sex variable:
Country | Year | Sex | Age | Cases |
---|---|---|---|---|
AD | 2000 | M | 0-14 | 0 |
AD | 2000 | M | 15-24 | 0 |
... | ... | ... | ... | ... |
AE | 2000 | F | 0-14 | 3 |
Variables are stored in both rows and columns
Take a look at the following weather dataset. Notice that it seems strange: weather information is usually reported at the "day" level. That is, every observation is a day, and we should have a maximum and minimum temperature value for each day. However, in the initial “long” dataset format, we see that the data is stored by date and element values instead:
ID | Year | Month | Element | d1 | ... | d31 |
---|---|---|---|---|---|---|
MX17004 | 2010 | 1 | tmax | - | ... | ... |
MX17004 | 2010 | 1 | tmin | - | ... | ... |
... | ... | ... | ... | ... | ... | ... |
MX17004 | 2010 | 5 | tmin | - | ... | ... |
Note that days 2-30 have been omitted to save space. Also note that this particular dataset has been cropped and so many of values in the table above seem like they are missing. The tidied dataset below shows some of the missing values. Again, this section will not show how the data was tidied, only the result of the tidying itself:
ID | Date | tmax | tmin | |
---|---|---|---|---|
MX17004 | 2010-01-30 | 27.8 | 14.5 | |
MX17004 | 2010-02-02 | 27.3 | 14.4 | |
MX17004 | 2010-02-03 | 24.1 | 14.4 | |
MX17004 | 2010-02-11 | 29.7 | 13.4 |
For additional reading on tidying data, such as the steps for tidying the data in examples 2 and 3 above as well as each example's complete dataset which may provide additional context, check out Data Science for the Biomedical Sciences, a book by Daniel Chen and Anne Brown, and this paper on tidying data. They are also linked in the Additional Resources section.
The Simple Case: Reshaping
When all observations in a data table have the same unit of observation, reshaping is enough to make untidy data tidy. Reshaping is the transformation of the organization of rows and columns in a data table in such a way that the unit of observation represented by a row changes. This can be done in Stata using the reshape
command. It is important to note that reshaping does not change the content of the data points in the table.
Let's take a look at the following example, drawn from the e-book A Modern Dive into R and the Tidyverse. Look at the table below.
Date | Boeing Stock Price | Amazon Stock Price | Google Stock Price |
---|---|---|---|
2009-01-01 | $173.55 | $174.90 | $174.34 |
2009-01-02 | $172.61 | $171.42 | $170.04 |
Note that there are three variables here:
- Date
- Stock Name
- Stock Price
While the data is neatly organized in a spreadsheet-type format, it is not in a tidy format since there are not three columns corresponding to the three variables. Remember, in a tidy format, each variable should be its own column. The table below shows the data reorganized into this tidy format.
Date | Stock Name | Stock Price |
---|---|---|
2009-01-01 | Boeing | $173.55 |
2009-01-02 | Boeing | $172.61 |
2009-01-01 | Amazon | $174.90 |
2009-01-02 | Amazon | $171.42 |
2009-01-01 | $174.34 | |
2009-01-02 | $170.04 |
Note that each variable (Date, Stock Price, and Stock Price) has its own column. The non-tidy data format in the original table is also known as wide format whereas the tidy data format in the second table is also known as long / narrow data format.
More Complex Cases
A more complex, often encountered problem is that a dataset consists of multiple units of observation stored in the same table. This is resolved in three steps:
- Identify all the variables that were measured at the same level of observation
- Create separate tables for each level of observation
- Reshape the data
At the end of this process, your dataset will consist of multiple data tables, one for each unit of observation. To give a concrete example, say we have a data table consisting of two variables, age and gender, where the variables in the table have two different units of observation: household and household member. Because age and gender are measured at two different levels, they are split into two columns and each row contains multiple observations.
There are two additional components that are crucial to successfully managing a set of tidy tables:
- Each table must be uniquely and fully identified by one or a set of ID variables
- You must be able to use these ID variables to link all data tables to each other
The tidying process gets more complex as the number of nested groups increases: the steps of identifying the unit of observation of each variable and reshaping the separated tables need to be repeated. However, the more nested groups a dataset includes, the more efficient it is to deal with tidy data as compared to untidy.
Let's visualize the household example mentioned above. Consider the following table:
hhid | radio | bicycle | stable | hhsize | gender_1 | age_1 | gender_2 | age_2 |
---|---|---|---|---|---|---|---|---|
1001 | 1 | 1 | 0 | 3 | 1 | 74 | 2 | 5 |
1195 | 1 | 0 | 0 | 17 | 2 | 15 | 1 | 55 |
1374 | 0 | 0 | 0 | 2 | 1 | 23 | 1 | 9 |
1077 | 0 | 0 | 0 | 1 | 2 | 5 | 2 | 43 |
1159 | 1 | 0 | 0 | 9 | 1 | 61 | 2 | 35 |
1104 | 0 | 0 | 0 | 15 | 1 | 50 | 2 | 65 |
1455 | 1 | 1 | 1 | 6 | 1 | 91 | 2 | 70 |
This data is not tidy. First, note that the variables have two different units of observation: household and household member. As a result, we have multiple columns for a single variable: two columns for gender and two for age. Second, note that each row contains multiple observations.
How would we tidy this data? We need to make two separate tables, one for each unit of observation and then reshape it.
hhid | radio | bicycle | stable | hhsize |
---|---|---|---|---|
1001 | 1 | 1 | 0 | 3 |
1195 | 1 | 0 | 0 | 17 |
1374 | 0 | 0 | 0 | 2 |
1077 | 0 | 0 | 0 | 1 |
1159 | 1 | 0 | 0 | 9 |
1104 | 0 | 0 | 0 | 15 |
1455 | 1 | 1 | 1 | 6 |
hhid | member | gender | age | |
---|---|---|---|---|
1815 | 1 | 1 | 20 | |
1815 | 2 | 1 | 16 | |
1328 | 1 | 2 | 24 | |
1328 | 2 | 2 | 69 | |
1784 | 1 | 1 | 42 | |
1784 | 2 | 1 | 20 | |
1641 | 1 | 2 | 89 |
Note that now the two tables correspond to the two units of observation, the first table at the household level and the second table at the household member level. There is now one column each for the gender and age variables and each row contains one observation.
Relevance for Data Cleaning
When cleaning data, and survey data in particular, it is usually necessary to recode values, label variables and values, and explore variables. Cleaning tidy data is easier because in the tidy dataset, each variable corresponds to one survey question:
- Each question has the same set of options, you can apply them to a single column in a tidy dataset, instead of looping over multiple columns.
- Using a tidy dataset, you can summarize all answers to a question easily
Additional Resources
- DIME Analytics (World Bank) Motivations and Definitions for Tidy Data
- DIME Analytics (World Bank) Tidying Data
- Daniel Chen and Anne Brown Data Science for the Biomedical Sciences
- Hadley Wickham's Tidy Data (abbreviated version)
- Full version here
- Chester Ismay's and Albert Kim's A Modern Dive Into R and the Tidyverse