Tidying Data

Revision as of 21:02, 19 October 2021 by Avnish95

Data is acquired in many shapes and sizes, but it is most commonly received as data tables. Data tables can organize information in different ways, and not all of them produce datasets that are easy to work with. Fortunately, numerous papers on database management have identified a format that makes data easy to work with. In the context of statistics, this format is called tidy data. Specifically, a tabular dataset is tidy when each column corresponds to one variable, each row corresponds to one observation, and all variables in the dataset have the same unit of observation.

Read First

  • In the context of development research, survey data is rarely received (or acquired) in a tidy format.
  • A variable is a collection of data points that measure the same attribute across units. Examples include name, age, and income.
  • An observation is a collection of all values measured on the same unit across attributes. For example, in a survey of seasonal crop patterns for 1000 households in a district, each household is an observation.
  • Each data point is the value of one variable for one observation.
  • A dataset is a collection of data points.
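
The definitions above can be illustrated with a minimal Python sketch. It is not taken from any particular survey: the table, the column names (hh_id, hh_size, area_maize, area_rice), and both helper functions are hypothetical. The untidy table has one row per household, but its area columns describe household-crop pairs, so it mixes two units of observation. The sketch splits it into two tidy tables, one per unit of observation. It also assumes that a missing area means the household does not grow that crop, which may not hold in real data.

```python
# Hypothetical untidy table: one row per household, but the area_*
# columns describe household-crop pairs, not households.
wide_rows = [
    {"hh_id": 1, "hh_size": 4, "area_maize": 1.5, "area_rice": 0.5},
    {"hh_id": 2, "hh_size": 6, "area_maize": 2.0, "area_rice": None},
]


def tidy_households(rows):
    """Household-level table: one row per household."""
    return [{"hh_id": r["hh_id"], "hh_size": r["hh_size"]} for r in rows]


def tidy_household_crops(rows, prefix="area_"):
    """Household-crop table: one row per crop grown by a household."""
    tidy = []
    for r in rows:
        for column, value in r.items():
            # Assumption: a missing area means the crop is not grown,
            # so no household-crop observation exists for it.
            if column.startswith(prefix) and value is not None:
                tidy.append(
                    {
                        "hh_id": r["hh_id"],
                        "crop": column[len(prefix):],
                        "area": value,
                    }
                )
    return tidy


households = tidy_households(wide_rows)
household_crops = tidy_household_crops(wide_rows)

for row in household_crops:
    print(row)
```

In the result, every column of household_crops is a single variable (hh_id, crop, area), every row is a single household-crop observation, and the household-level variable hh_size lives in its own table. The two tables can still be linked through hh_id.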