Standardization
Latest revision as of 19:28, 14 May 2019
Standardization is not an abstract concept that is hard to build an intuition for, but bad practices can introduce very large errors. The errors usually arise because the same standardization is done many times on different variables.
Read First
- The key to reducing the risk of errors is to reduce the number of repeated parts as much as possible. See both the Globals and Loops sections below.
- Check that conversion rates for local units that have been centrally established are also followed locally. That is far from always the case.
Globals
Conversion rates are constant across a project and should therefore be defined in only one location. The DIME recommendation is to do this in the master do-file. If the conversion rates are defined only once, it is possible to go over and review them. If the rates are set each time they are used across the project, then it is humanly impossible to review them all and there is a large risk that errors go unnoticed.
    *Conversion rate globals
    global gram     = 1/1000
    global mon40kg  = 40
    global mon28kg  = 28
    global shortTon = 907.1847

    *Unit globals
    global kg_unit    1
    global gram_unit  5
    global mon40_Unit 12
    global mon28_Unit 13
    global shTon_Unit 18
See how these globals are used below.
Loops
Loops can be used similarly to globals to reduce the amount of code that needs to be reviewed to make sure that there are no errors. Variables should be grouped together as much as possible. While not always possible, it is a good idea to strive for one loop in which all weight variables in the data set are standardized, one for all length variables, etc., regardless of whether the variables are in different sections of the data set.
Here is an example of a loop. This example creates a new variable instead of replacing the original one. See the Deleting The Original Variables section below for an explanation why.
    foreach variable in planted harvest sold {
        forvalues cropNo = 1/4 {

            *Set locals that make the rest of the loop much more readable
            local amount_var `variable'_c`cropNo'_amount
            local amount_new `variable'_c`cropNo'_amount_kg
            local unit_var   `variable'_c`cropNo'_unit

            *Generate a new variable for standardized values and
            *order it next to the original variable
            gen   `amount_new' = .z
            order `amount_new', after(`amount_var')

            *Convert the values into kilograms. Values already reported
            *in kilograms are copied unchanged.
            replace `amount_new' = `amount_var'             if `unit_var' == $kg_unit
            replace `amount_new' = `amount_var' * $gram     if `unit_var' == $gram_unit
            replace `amount_new' = `amount_var' * $mon40kg  if `unit_var' == $mon40_Unit
            replace `amount_new' = `amount_var' * $mon28kg  if `unit_var' == $mon28_Unit
            replace `amount_new' = `amount_var' * $shortTon if `unit_var' == $shTon_Unit
        }
    }
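After a loop like the one above, it can be worth checking that no observation was left unconverted because its unit code had no conversion rate. The following is a minimal, self-contained sketch of such a check. The data are made up, the unit code 99 is a hypothetical unrecognized code, and only a subset of the globals is redefined so that the sketch runs on its own.

```stata
* Made-up data: one crop, one amount variable, one unit variable
clear
input harvest_c1_amount harvest_c1_unit
500   5
2     12
3     99
10    1
end

* Subset of the globals from the Globals section (repeated here only
* so that this sketch is self-contained)
global gram       = 1/1000
global mon40kg    = 40
global kg_unit    1
global gram_unit  5
global mon40_Unit 12

* Same conversion pattern as in the loop
gen harvest_c1_amount_kg = .z
replace harvest_c1_amount_kg = harvest_c1_amount            if harvest_c1_unit == $kg_unit
replace harvest_c1_amount_kg = harvest_c1_amount * $gram    if harvest_c1_unit == $gram_unit
replace harvest_c1_amount_kg = harvest_c1_amount * $mon40kg if harvest_c1_unit == $mon40_Unit

* Count observations that have an amount but were never converted
count if harvest_c1_amount_kg == .z & !missing(harvest_c1_amount)
local n_unconverted = r(N)
display "Unconverted observations: `n_unconverted'"

* Show which unit codes still lack a conversion rate
tabulate harvest_c1_unit if harvest_c1_amount_kg == .z & !missing(harvest_c1_amount)
```

Initializing the new variable to the extended missing value .z is what makes this check possible, since unconverted observations can be told apart from ordinary missing values.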
Non-Convertible Units
Sometimes the same item is sold in both weight and volume. For example, rice could be sold by weight and by volume (for example, in scoops at local markets). For something as common as rice we can perhaps find an established conversion rate from volume to weight, but we won't find one for all items that we are standardizing.
One solution to this can sometimes be to use price if we have price data. If we know what the average price for one liter is and we know what the average price for one kilogram is, then we can use that information to convert between volume and weight. This should only be used as a last resort as there are many potential sources of errors. Carefully consider the following points before using this method:
- Only use the average price if it is based on a substantial number of observations. Otherwise there is a risk that a few price data points will drive a large part of your results.
- Make sure that the variation in price per unit is not very large. We should only use price as an indirect conversion rate if the price per unit is stable. Regional differences might require that different price conversion rates be used.
- Make sure that different types of units are not used in different circumstances that have different prices. Going back to the rice example, let's say that large quantities are measured in kilograms and only small quantities in volume. Unit prices tend to be higher for small quantities and discounted for large quantities, so the unit price would be different when volume is used. If the unit price is different, then this method is no longer valid. Try to figure out if this applies to your context before using this method.
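The price-based method described above can be sketched as follows. This is only an illustration under stated assumptions: the data are made up, all variable names (price, quantity, unit, unit_price, quantity_kg) are hypothetical, and unit code 1 is assumed to mean kilograms and code 2 liters. If one liter costs on average `price_liter` and one kilogram costs on average `price_kg`, the implied weight of one liter is `price_liter / price_kg` kilograms.

```stata
* Made-up price data; every name here is hypothetical
clear
input price quantity unit
 50  1 1
102  2 1
 49  1 1
 40  1 2
 82  2 2
 39  1 2
end
label define unitlbl 1 "kg" 2 "liter"
label values unit unitlbl

* Price per reported unit
gen unit_price = price / quantity

* Check the number of observations and the spread before trusting the averages
tabstat unit_price, by(unit) statistics(n mean sd min max)

* Average price per kilogram and per liter
summarize unit_price if unit == 1
local price_kg = r(mean)
summarize unit_price if unit == 2
local price_liter = r(mean)

* Implied conversion rate: kilograms per liter
local kg_per_liter = `price_liter' / `price_kg'
display "Implied kg per liter: " %5.3f `kg_per_liter'

* Standardize to kilograms in a new variable
gen quantity_kg = .z
replace quantity_kg = quantity                  if unit == 1
replace quantity_kg = quantity * `kg_per_liter' if unit == 2
```

The `tabstat` step is there to support the checks in the list above: a small number of observations or a large standard deviation within a unit are reasons not to use the implied rate.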
Non-Standardized Local Units
One challenge that often does not have a good solution is finding conversion rates for non-standardized local units. Food purchases in local markets are often measured in imprecise units such as a pile, a heap or a bunch. Agricultural plots are often described in units whose definition is something like the amount of land one person can plow in one day. There could also be regional differences in units.
Note that in many cases the central government has established conversion rates for these local units, but it is not certain that the established rates are followed in the region where you are operating.
The easiest solution to this issue is to reuse a solution from a different project. Start by doing desk research and reach out to your network, as you are unlikely to be the first one to have this issue. If you do not find any reliable solutions, visit markets and do qualitative research to try to come up with a conversion rate that you trust.
In some cases this issue can be solved using the same method as described in the Non-Convertible Units section above, but be extra careful. Regional differences and great variance in prices are even more likely to be an issue in this case.
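When conversion rates for local units differ by region, one possible way to keep them reviewable in a single place is a small lookup table that is merged onto the data. The sketch below assumes such a table exists. The rates, the unit codes (21, 22, 23) and all variable names are made up for illustration and are not taken from any real project.

```stata
* Hypothetical lookup table: kilograms per local unit, by region
clear
input region unit kg_per_unit
1 21 1.5
1 22 4.0
2 21 1.2
2 22 4.5
end
tempfile rates
save `rates'

* Made-up survey data
clear
input hhid region amount unit
1 1 2 21
2 1 1 22
3 2 2 21
4 2 3 23
end

* Attach the region-specific rate to each observation
merge m:1 region unit using `rates', keep(master match)

* Convert into a new variable; units without a rate stay .z for follow-up
gen amount_kg = .z
replace amount_kg = amount * kg_per_unit if _merge == 3

* List the observations whose unit has no conversion rate yet
list hhid region unit amount if _merge == 1
```

Observations that do not match the table are kept rather than dropped, so that the units still lacking a trusted conversion rate are easy to list and investigate.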
Deleting The Original Variables
It is often a good idea to generate new standardized variables instead of replacing the original variables. The reason for this is that the original unit could be very useful information when investigating results that are unexpected. For example, the strange results can be driven by an incorrect conversion rate and we could find this out by looking at the results per unit.
Sometimes this is not possible or advisable due to memory restrictions, so use your own judgement.
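As a sketch of why keeping the original unit variable helps, the standardized values can be summarized separately for each original unit. The data and variable names below are made up, and the cutoff of 10 times the median is an arbitrary choice for illustration only.

```stata
* Made-up data: original amount, original unit code, standardized amount
clear
input harvest_amount harvest_unit harvest_amount_kg
500   5   0.5
800   5   0.8
2     12  80
3     12  120
1     18  907.1847
end

* Summarize the standardized values separately for each original unit
tabstat harvest_amount_kg, by(harvest_unit) statistics(n mean median min max)

* Flag values far above the overall median (the factor 10 is arbitrary)
summarize harvest_amount_kg, detail
local med = r(p50)
gen byte suspicious = harvest_amount_kg > 10 * `med' & !missing(harvest_amount_kg)

* See whether the flagged values cluster in one original unit
tabulate harvest_unit suspicious
```

If the flagged values are concentrated in a single unit, that conversion rate is a natural first place to look for an error.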
Back to Parent
This article is part of the topic Data Analysis
Additional Resources
- Read more about macros in DIME Analytics' Coding for Reproducible Research (https://github.com/worldbank/DIME-Resources/blob/master/stata1-2-coding.pdf).