Standardization

Latest revision as of 19:28, 14 May 2019

Standardization is not an abstract concept that is difficult to have an intuition for, but bad practices can introduce very large errors. The errors usually arise because the same standardization is done many times on different variables.

Read First

  • The key to reducing the risk of errors is to reduce the number of repeated parts as much as possible. See the Globals and Loops sections below.
  • Make sure that centrally established conversion rates for local units are also followed locally; that is far from always the case.

Globals

Conversion rates are constant across a project and should therefore be defined in only one location. The DIME recommendation is to do this in the master do-file. It is possible to review these conversion rates if they are defined only once. If the rates are set each time they are used across the project, then it is humanly impossible to review them all, and there is a large risk that errors go unnoticed.

*Conversion rate globals
global gram       = 1/1000
global mon40kg    = 40
global mon28kg    = 28
global shortTon   = 907.1847

*Unit globals
global kg_unit    1
global gram_unit  5
global mon40_Unit 12
global mon28_Unit 13
global shTon_Unit 18

See how these globals are used below.
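Because the rates are defined in one place, they can also be checked in one place. The lines below are a minimal sketch of such a check, assuming the global definitions above have already been run in the current Stata session; the amounts 2500 and 3 are made-up test values.

```stata
* Print every conversion rate so that it can be reviewed in one place
foreach rate in gram mon40kg mon28kg shortTon {
    display "`rate' = ${`rate'}"
}

* Spot-check two conversions by hand (made-up amounts)
* 2,500 grams expressed in kilograms
display 2500 * $gram
* 3 mon of 40 kg each expressed in kilograms (3 * 40 = 120)
display 3 * $mon40kg
```

If a printed rate looks wrong, it only needs to be corrected once, in the master do-file.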

Loops

Loops can be used similarly to globals to reduce the amount of code that needs to be reviewed to make sure that there are no errors. To the greatest extent possible, variables should be grouped together. While not always possible, it is a good idea to strive for one loop for all weight units where all weight variables in the data set are standardized, one for all length units etc., regardless of whether the variables are in different sections of the data set.

Here is an example of a loop. This example creates a new variable instead of replacing the original one. See the Deleting The Original Variables section below for an explanation of why.

foreach variable in planted harvest sold {
  forvalues cropNo = 1/4 {

    *Set locals that make the rest of the loop much more readable
    local amount_var  `variable'_c`cropNo'_amount
    local amount_new  `variable'_c`cropNo'_amount_kg
    local unit_var    `variable'_c`cropNo'_unit

    **Generate a new variable for standardized values and
    * order it next to the original variable
    gen   `amount_new' = .z
    order `amount_new' , after(`amount_var')

    *Convert the values to kilograms (amounts already in kilograms are copied as is)
    replace `amount_new' = `amount_var'             if `unit_var' == $kg_unit
    replace `amount_new' = `amount_var' * $gram     if `unit_var' == $gram_unit
    replace `amount_new' = `amount_var' * $mon40kg  if `unit_var' == $mon40_Unit
    replace `amount_new' = `amount_var' * $mon28kg  if `unit_var' == $mon28_Unit
    replace `amount_new' = `amount_var' * $shortTon if `unit_var' == $shTon_Unit

  }
}
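After a loop like the one above, it can be useful to check that no reported amount was left unconverted, for example because its unit code has no conversion rate. The following is a sketch only, assuming the same variable naming pattern as in the loop and that the standardized variables were initialized to .z.

```stata
foreach variable in planted harvest sold {
  forvalues cropNo = 1/4 {

    local amount_var  `variable'_c`cropNo'_amount
    local amount_new  `variable'_c`cropNo'_amount_kg
    local unit_var    `variable'_c`cropNo'_unit

    * Count observations with a reported amount but no standardized value
    count if !missing(`amount_var') & `amount_new' == .z

    * If there are any, show which unit codes they belong to
    if r(N) > 0 {
      display as error "`amount_var': " r(N) " observations were not converted"
      tabulate `unit_var' if !missing(`amount_var') & `amount_new' == .z
    }
  }
}
```

Any unit code listed by this check either needs a conversion rate global or a decision on how to treat it.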

Non-Convertible Units

Sometimes the same item is sold in both weight and volume. For example, rice could be sold both by weight and by volume (for example, scoops at local markets). For something as common as rice we can perhaps find an established conversion rate from volume to weight, but we won't find one for all items that we are standardizing.

One solution to this can sometimes be to use price if we have price data. If we know what the average price for one liter is and we know what the average price for one kilogram is, then we can use that information to convert between volume and weight. This should only be used as a last resort as there are many potential sources of errors. Carefully consider the following points before using this method:

  • Only use the average price if it is based on a substantial number of observations; otherwise there is a risk that a few price data points will drive a large part of your results.
  • Make sure that the variation in price per unit is not very large. We should only use price as an indirect conversion rate if the price per unit is stable. Regional differences might require that different price conversion rates be used.
  • Make sure that different types of units are not used in different circumstances that have different prices. For example, going back to the rice example, let's say that large quantities are measured in kilograms and only small quantities in volume. Unit prices tend to be higher for small quantities, and large quantities are discounted, so the unit price would be different when volume is used. If the unit price is different, then this method is no longer valid. Try to figure out if this applies to your context before using this method.
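The price-based conversion described in this section can be sketched as follows. If one liter costs on average P_L and one kilogram costs on average P_K, then one liter corresponds to roughly P_L / P_K kilograms. Everything specific in this sketch is an assumption made for illustration: the variables rice_amount, rice_unit and rice_price (price per reported unit), the liter unit code 7, and the minimum of 30 observations.

```stata
* Hypothetical unit code for liters (not part of the globals defined above)
global liter_unit 7

* Average price per kilogram and the number of observations behind it
summarize rice_price if rice_unit == $kg_unit
local price_kg = r(mean)
local n_kg     = r(N)

* Average price per liter and the number of observations behind it
summarize rice_price if rice_unit == $liter_unit
local price_liter = r(mean)
local n_liter     = r(N)

* Stop if either average rests on too few observations (threshold is arbitrary)
assert `n_kg' >= 30 & `n_liter' >= 30

* Kilograms per liter implied by the two average prices
local kg_per_liter = `price_liter' / `price_kg'
display "Implied conversion rate: `kg_per_liter' kg per liter"

* Create the standardized variable without replacing the original
gen     rice_amount_kg = .z
replace rice_amount_kg = rice_amount                   if rice_unit == $kg_unit
replace rice_amount_kg = rice_amount * `kg_per_liter'  if rice_unit == $liter_unit
```

The summarize output also shows the standard deviation of the unit prices, which helps judge whether the price per unit is stable enough for this method. Where regional differences matter, the same steps would need to be repeated per region.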

Non-Standardized Local Units

One challenge that often does not even have a good solution is finding conversion rates for non-standardized local units. Food purchases in local markets are often measured in imprecise units such as a pile, a heap or a bunch. Agricultural plots are often described in units whose definition is something similar to the amount of land one person can plow in one day. There could also be regional differences in units.

Note that in many cases there are conversion rates established by the central government for these local units, but it is not certain that the established rates are followed in the region where you are operating.

The easiest solution to this issue is to find a solution used in a different project. Start by doing desk research and reach out to your network, as you are unlikely to be the first one to have this issue. If you do not find any reliable solutions, visit markets and do qualitative research to try to come up with a conversion rate that you trust.

In some cases this issue can be solved using the same method as described in the Non-Convertible Units section above, but be extra careful. Regional differences and great variance in prices are even more likely to be an issue in this case.

Deleting The Original Variables

It is often a good idea to generate new standardized variables instead of replacing the original variables. The reason for this is that the original unit could be very useful information when investigating results that are unexpected. For example, the strange results can be driven by an incorrect conversion rate and we could find this out by looking at the results per unit.
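As a sketch of what looking at the results per unit could mean in practice, the command below summarizes a standardized variable separately for each original unit. It assumes the variables planted_c1_amount_kg and planted_c1_unit exist, following the naming pattern of the loop example.

```stata
* Summary statistics of the standardized amount for each original unit.
* A unit whose mean or median is far from the others may point to an
* incorrect conversion rate.
tabstat planted_c1_amount_kg, by(planted_c1_unit) statistics(n mean median min max)
```

This check is only possible if the original unit variable is still in the data set.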

Sometimes this is not possible or advisable due to memory restrictions, so use your own judgement.

Back to Parent

This article is part of the topic Data Analysis

Additional Resources

  • Read more about macros in DIME Analytics' Coding for Reproducible Research: https://github.com/worldbank/DIME-Resources/blob/master/stata1-2-coding.pdf