Difference between revisions of "Tidying Data"


Revision as of 21:02, 19 October 2021

Data is acquired in many shapes and sizes, but it is most commonly received as data tables. Data tables can organize information in different ways, and not all of them produce datasets that are easy to work with. Fortunately, numerous papers on database management have identified a format that makes data easy to work with. In statistics, this format is called tidy data. Specifically, a tabular dataset is tidy when each column corresponds to one variable, each row corresponds to one observation, and all variables in the dataset have the same unit of observation.
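As a rough illustration of the definition above, the sketch below reshapes a small untidy table into a tidy one using plain Python. The data, the column names (`household_id`, `crop_kharif`, `crop_rabi`), and the helper `tidy_crops` are all hypothetical and chosen only for this example; in practice the same reshaping would usually be done with the reshape tools of a statistical package.

```python
# Hypothetical untidy table: one row per household, with the season
# encoded in the column names instead of being stored as a variable.
wide_rows = [
    {"household_id": 1, "crop_kharif": "rice", "crop_rabi": "wheat"},
    {"household_id": 2, "crop_kharif": "maize", "crop_rabi": "mustard"},
]


def tidy_crops(rows, id_column="household_id", prefix="crop_"):
    """Return one row per household-season with columns id, season, crop."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            # Only the columns that carry a season in their name are reshaped.
            if column.startswith(prefix):
                tidy.append({
                    id_column: row[id_column],
                    "season": column[len(prefix):],  # e.g. "kharif"
                    "crop": value,
                })
    return tidy


tidy_rows = tidy_crops(wide_rows)
for row in tidy_rows:
    print(row)
```

In the reshaped table every column is a single variable (household, season, crop) and every row is a single household-season observation, so all variables share the same unit of observation.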

Read First

  • In the context of development research, survey data is rarely received (or acquired) in a tidy format.
  • A variable is a collection of data points that measure the same attribute across units. Examples include name, age, and income.
  • An observation is a collection of all values measured on the same unit across attributes. For example, in a survey of seasonal crop patterns for 1,000 households in a district, each household is an observation.
  • Each data point is the value of one variable for one observation.
  • A dataset is a collection of data points.
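
The terms above can be mapped onto a small table in code. This is a minimal sketch with made-up values; the names `dataset`, `variable`, `observation`, and `data_point` are illustrative helpers, not part of any library.

```python
# Hypothetical tidy dataset: one row per person (the unit of observation).
dataset = [
    {"name": "Asha", "age": 34, "income": 1200},
    {"name": "Ravi", "age": 41, "income": 950},
    {"name": "Meena", "age": 29, "income": 1430},
]


def variable(rows, name):
    """All values of one attribute across units (a column)."""
    return [row[name] for row in rows]


def observation(rows, index):
    """All values measured on one unit (a row)."""
    return rows[index]


def data_point(rows, index, name):
    """The value of one variable for one observation (a cell)."""
    return rows[index][name]


ages = variable(dataset, "age")                    # [34, 41, 29]
second_person = observation(dataset, 1)            # the full row for "Ravi"
ravi_income = data_point(dataset, 1, "income")     # 950
n_data_points = sum(len(row) for row in dataset)   # 3 rows x 3 variables = 9
```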