Difference between revisions of "Iefieldkit"

(35 intermediate revisions by 2 users not shown)
Line 1: Line 1:
[[Primary Data Collection|Primary data collection]] and [[Data Cleaning|cleaning]] involve highly repetitive but extremely important processes that contribute to high quality [[Reproducible Research|reproducible research]]. [https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics] has developed '''<code>iefieldkit</code>''' as a package in [https://www.stata.com/ Stata] to standardize and simplify '''best practices''' involved in '''primary data collection'''. '''<code>Iefieldkit</code>''' consists of commands that automate: [[Ietestform|error-checking]] for electronic '''Open Data Kit (ODK)-based''' survey modules; [[Ieduplicates|duplicate checking]] and [[Iecompdup|resolution]]; [[Iecodebook#Apply|data cleaning]] and [[Iecodebook#Harmonize|survey harmonization]]; and [[Iecodebook#Export|codebook creation]].
[[Primary Data Collection|Primary data collection]] and [[Data Cleaning|cleaning]] involve highly repetitive but extremely important processes that contribute to high quality [[Reproducible Research|reproducible research]]. [https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics] has developed <code>iefieldkit</code> as a package in [https://www.stata.com/ Stata] to standardize and simplify '''best practices''' involved in '''primary data collection'''. <code>iefieldkit</code> consists of commands that automate: [[Ietestform|error-checking]] for electronic '''Open Data Kit (ODK)-based''' survey modules; [[Ieduplicates|duplicate checking]] and [[Iecompdup|resolution]]; [[Iecodebook#Apply|data cleaning]] and [[Iecodebook#Append and Harmonize|survey harmonization]]; and [[Iecodebook#Export|codebook creation]].
==Read First==
==Read First==
* [[Stata Coding Practices|Stata coding practices]].
* [https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics] has conducted a [https://osf.io/csmxz/ bootcamp on reproducible research] which establishes standard best practices in development research.
* [https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics] [https://osf.io/csmxz/ Bootcamp on Reproducible Research].
* [[Stata Coding Practices|Stata coding practices]] lists common best practices for writing reproducible and replicable Stata '''do-files'''.
*'''<code>iefieldkit</code>''' aims to provide Stata-based tools for managing the primary data collection process from start to finish.
* <code>iefieldkit</code> currently consists of four commands: <code>[[ietestform]]</code>, <code>[[ieduplicates]]</code>, <code>[[iecompdup]]</code>, and <code>[[iecodebook]]</code>.
*'''<code>iefieldkit</code>''' currently consists of four commands: <code>[[ietestform]]</code>, <code>[[ieduplicates]]</code>, <code>[[iecompdup]]</code>, and <code>[[iecodebook]]</code>.
* Each of these commands can be used independently in a wide range of contexts.
* Each of these commands can be used independently in a wide range of contexts.
* The [https://github.com/worldbank/iefieldkit open-source code] for '''<code>iefieldkit</code>''' is available on GitHub for public contribution and comment.
* The <code>iefieldkit</code> [https://github.com/worldbank/iefieldkit open-source code] is available on GitHub for public contribution and comments.
* To install the package, type <code>ssc install iefieldkit</code> in the Stata command box.
* To install the package, type <syntaxhighlight lang="Stata" inline>ssc install iefieldkit</syntaxhighlight> in the Stata command box.


==Overview==
== Objective ==
One of the most important developments in economics over the past two decades has been the rise of '''empirical research''', through [[Primary Data Collection|primary]] as well as [[Secondary Data Sources|secondary data collection]]. The authors of <code>iefieldkit</code> have developed the package to support data collection by researchers directly in a wide range of fields like agriculture, health, energy and environment, transport, financial and private sector development, gender, governance, and fragility, conflict and violence (FCV). <code>iefieldkit</code> therefore supports general '''best practices''' in '''primary data collection''' from start to finish:
* '''Before data collection:''' <code>[[ietestform]]</code>
* '''During data collection:''' <code>[[ieduplicates]]</code> and <code>[[iecompdup]]</code>
* '''After data collection:''' <code>[[iecodebook]]</code>


One of the most important developments in economics research over the past two decades has been the rise of empirical data collection, especially with unique primary datasets collected by the researchers themselves. The authors of <code>iefieldkit</code> have supported the implementation of a wide range of primary data collection in fields including agriculture, health, energy and environment, edutainment, financial and private sector development, fragility, conflict, violence, gender, governance, and transport. They have developed workflows to support general best practices for data collection. As a rule, they develop new packages only in order to fill an essential gap in Stata functionality. <code>iefieldkit</code> aims to provide Stata-based tools for managing the primary data collection process from start to finish.
These four commands in this package make sure that inputs and outputs are significantly more human-readable by working with spreadsheets instead of Stata '''do-files'''. In doing so, they allow field personnel who do not specialize in [[Software Tools#Statistical Software|code tools]] to understand and review the tasks involved in '''primary data collection'''. <code>iefieldkit</code> thus recognizes the vital role played by field personnel in supporting [[Data Management|data management]] and [[Data Cleaning|data cleaning]] even if they are not proficient in Stata.


All commands utilize spreadsheet-based workflows so that their inputs and outputs are significantly more human-readable than Stata do-files completing the same tasks would be. These tasks can be supported and reviewed by personnel who specialize in field work rather than code tools. The increasing diversity and specialization of research teams has made accessibility to non-Stata-proficient personnel an essential component of data management workflows, and this package takes this development seriously.
== Before Data Collection ==
In '''Open Data Kit (ODK)-based''' electronic survey kits, including [https://www.surveycto.com/ SurveyCTO], '''survey forms''' (or questionnaires) are typically [[SurveyCTO Programming#Programming in Excel|built in Excel]] using a specialized structured syntax. Before the [[Impact Evaluation Team|research team]] starts with [[Preparing for Field Data Collection|field data collection]], they can use <code>[[ietestform]]</code> to test '''Open Data Kit (ODK)-based''' [[Field Surveys|electronic survey forms]] for common errors, as well as [[SurveyCTO Coding Practices | best practices]] for '''SurveyCTO-based''' forms.  


==Commands==
Most ODK servers, including [[SurveyCTO Server Management|SurveyCTO servers]], have a built-in test feature that tests the '''ODK''' syntax of a form when it is uploaded by the '''research team'''. <code>ietestform</code> complements these built-in tests to ensure that the collected data is in a format that is easily readable in Stata, and warns users about practices that we have learnt are prone to data quality errors.


===Before Data Collection===
== During Data Collection ==
While data collection is ongoing, <code>[[ieduplicates]]</code> and <code>[[iecompdup]]</code> allow researchers to test for and resolve duplicate entries in the dataset. The commands combine four key tasks to deal with duplicate ID values:
* '''Identifying duplicate entries.'''
* '''Comparing observations with the same ID value.'''
* '''Tracking and documenting any changes made to the identifying variable.'''
* '''Applying the necessary corrections to the data.'''
Together these commands ensure that the collected data will be a correct record of the sample, and can be merged with the [[Master Data Set|master database]]. Both these commands were previously part of the <code>ietoolkit</code> package, but have now been moved to <code>iefieldkit</code>.


Before data collection occurs, <code>[[ietestform]]</code> allows for rapid error-checking of ODK-based electronic surveys, including best practices for [[SurveyCTO Coding Practices | SurveyCTO]]-styled forms. This ensures that data, once collected, will import in Stata-friendly formats -- such as avoiding name conflicts and ensuring compliant variable naming and labelling.
== After Data Collection ==
After data collection is complete, <code>[[iecodebook]]</code> allows the '''research team''' to automate the repetitive tasks involved in [[Data Cleaning|cleaning data]] before it can be [[Data Analysis|analyzed]]. As the name suggests, the <code>iecodebook</code> command is structured around Excel-based '''codebooks''', which allow researchers to perform and [[Data Documentation|document]] data cleaning tasks in Excel itself, instead of '''do-files'''. Therefore, '''codebooks''' allow researchers to document the cleaned data in a format that is both human and machine-readable. <code>iecodebook</code> implements this through four subcommands:
* <code>iecodebook apply</code> applies rename, recode, and/or label commands to a large number of variables in the dataset.
* <code>iecodebook append</code> '''harmonizes''' two or more datasets, and '''appends''' them. That is, it allows two or more datasets to have the same variable names, labels, and value labels.
* <code>iecodebook export</code> creates an Excel '''codebook''' that describes the current dataset. It can also produce an exportable version of the dataset which only contains the variables used in a particular '''do-file'''.
* <code>iecodebook template</code> creates an Excel template that describes the current or targeted dataset(s), and prepares the '''codebook''' for the other subcommands in <code>iecodebook</code>.


<code>[[ietestform]]</code> complements the ODK syntax test on the [[SurveyCTO Coding Practices | SurveyCTO]] server. It runs tests that inform researchers how to use ODK programming language features to ensure high data quality. This command is especially useful if the data that will be imported into Stata is subject to restrictions beyond ODK syntax.
== Related Pages ==
 
[[Special:WhatLinksHere/Iefieldkit|Click here to see pages that link to this topic]]. <br>
===During Data Collection===
This page is part of the topic [[Stata Coding Practices|Stata coding practices]].
 
During data collection, <code>[[ieduplicates]]</code> and <code>[[iecompdup]]</code> (both previously released as a part of the package <code>ietoolkit</code> but now moved to this package) provide a workflow for detecting and resolving duplicate entries in the dataset, ensuring that the final survey dataset will be a correct record of the survey sample to merge onto the master sampling database.
===After Data Collection===
 
After data collection, the <code>[[iecodebook]]</code> commands provide a workflow for rapidly [[Data Cleaning | cleaning]], harmonizing, and [[Data Documentation | documenting]] datasets. <code>iecodebook</code> uses input specified in an Excel sheet, which provides a better-structured and easier-to-follow overview, especially for non-technical users, than the same operations written directly in a do-file.


==Additional Resources==
==Additional Resources==
* Visit the <code>iefieldkit</code> GitHub page [https://github.com/worldbank/iefieldkit here]
* DIME Analytics (World Bank), [https://github.com/worldbank/iefieldkit/ The <code>iefieldkit</code> GitHub page]
[[Category: Stata]]
[[Category: Reproducible Research]]
[[Category: Data Cleaning]]

Latest revision as of 15:09, 13 April 2021

Primary data collection and cleaning involve highly repetitive but extremely important processes that contribute to high quality reproducible research. DIME Analytics has developed iefieldkit as a package in Stata to standardize and simplify best practices involved in primary data collection. iefieldkit consists of commands that automate: error-checking for electronic Open Data Kit (ODK)-based survey modules; duplicate checking and resolution; data cleaning and survey harmonization; and codebook creation.

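Read First

  • DIME Analytics Bootcamp on Reproducible Research.
  • Stata coding practices lists common best practices for writing reproducible and replicable Stata do-files.
  • iefieldkit currently consists of four commands: ietestform, ieduplicates, iecompdup, and iecodebook.
  • Each of these commands can be used independently in a wide range of contexts.
  • The iefieldkit open-source code is available on GitHub for public contribution and comments.
  • To install the package, type ssc install iefieldkit in the Stata command box.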

Objective

One of the most important developments in economics over the past two decades has been the rise of empirical research, through primary as well as secondary data collection. The authors of iefieldkit have developed the package to support data collection by researchers directly in a wide range of fields like agriculture, health, energy and environment, transport, financial and private sector development, gender, governance, and fragility, conflict and violence (FCV). iefieldkit therefore supports general best practices in primary data collection from start to finish:

  • Before data collection: ietestform
  • During data collection: ieduplicates and iecompdup
  • After data collection: iecodebook

These four commands in this package make sure that inputs and outputs are significantly more human-readable by working with spreadsheets instead of Stata do-files. In doing so, they allow field personnel who do not specialize in code tools to understand and review the tasks involved in primary data collection. iefieldkit thus recognizes the vital role played by field personnel in supporting data management and data cleaning even if they are not proficient in Stata.

Before Data Collection

In Open Data Kit (ODK)-based electronic survey kits, including SurveyCTO, survey forms (or questionnaires) are typically built in Excel using a specialized structured syntax. Before the research team starts with field data collection, they can use ietestform to test Open Data Kit (ODK)-based electronic survey forms for common errors, as well as best practices for SurveyCTO-based forms.

Most ODK servers, including SurveyCTO servers, have a built-in test feature that tests the ODK syntax of a form when it is uploaded by the research team. ietestform complements these built-in tests to ensure that the collected data is in a format that is easily readable in Stata, and warns users about practices that we have learnt are prone to data quality errors.
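
As an illustration, a form could be checked before fieldwork along the following lines. This is a minimal sketch to be run from a do-file: the file paths are hypothetical, and the using/reportsave() syntax is the one recalled from the command's help file, so it should be confirmed with help ietestform for the installed version.

```stata
* Install the package once from SSC
ssc install iefieldkit

* Hypothetical paths: replace with the project's own form and report location
local form   "questionnaire/baseline_form.xlsx"
local report "questionnaire/baseline_form_report.csv"

* Test the form definition and write the flagged issues to a CSV report
ietestform using "`form'", reportsave("`report'")
```

The report is a spreadsheet-readable file, so field staff can review the flagged issues without opening Stata.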

During Data Collection

While data collection is ongoing, ieduplicates and iecompdup allow researchers to test for and resolve duplicate entries in the dataset. The commands combine four key tasks to deal with duplicate ID values:

  • Identifying duplicate entries.
  • Comparing observations with the same ID value.
  • Tracking and documenting any changes made to the identifying variable.
  • Applying the necessary corrections to the data.

Together these commands ensure that the collected data will be a correct record of the sample, and can be merged with the master database. Both these commands were previously part of the ietoolkit package, but have now been moved to iefieldkit.
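
The workflow described above might look roughly as follows in a do-file. This is a sketch under stated assumptions: the dataset, the variables hhid (the intended unique ID) and key (a submission key that is unique for every record), the report path, and the ID value 1047 are all hypothetical, and the options shown are those recalled from the help files, so help ieduplicates and help iecompdup should be consulted before use.

```stata
* Hypothetical raw data with an ID variable (hhid) and a unique submission key (key)
use "data/raw_survey.dta", clear

* Identify duplicates of hhid and list them in an Excel report.
* force acknowledges that observations with unresolved duplicate IDs are
* removed from the data in memory until corrections are entered in the report.
ieduplicates hhid using "documentation/duplicates_report.xlsx", uniquevars(key) force

* Compare the observations that share one duplicated ID value (made-up value),
* starting again from the raw data because the duplicates were dropped above
use "data/raw_survey.dta", clear
iecompdup hhid, id(1047) didifference keepdifference
```

Corrections are recorded in the Excel report rather than in code, which is what keeps the changes to the identifying variable tracked and documented.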

After Data Collection

After data collection is complete, iecodebook allows the research team to automate the repetitive tasks involved in cleaning data before it can be analyzed. As the name suggests, the iecodebook command is structured around Excel-based codebooks, which allow researchers to perform and document data cleaning tasks in Excel itself, instead of do-files. Therefore, codebooks allow researchers to document the cleaned data in a format that is both human and machine-readable. iecodebook implements this through four subcommands:

  • iecodebook apply applies rename, recode, and/or label commands to a large number of variables in the dataset.
  • iecodebook append harmonizes two or more datasets, and appends them. That is, it allows two or more datasets to have the same variable names, labels, and value labels.
  • iecodebook export creates an Excel codebook that describes the current dataset. It can also produce an exportable version of the dataset which only contains the variables used in a particular do-file.
  • iecodebook template creates an Excel template that describes the current or targeted dataset(s), and prepares the codebook for the other subcommands in iecodebook.
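
The subcommands listed above could be combined as in the sketch below. All file paths are hypothetical, and the syntax is the one recalled from the help file (help iecodebook is the authoritative reference). Note that the template has to be filled in by hand in Excel between the first and second steps.

```stata
* Hypothetical dataset to be cleaned
use "data/survey_deduplicated.dta", clear

* 1. Create an Excel template that lists the variables currently in memory
iecodebook template using "documentation/cleaning_codebook.xlsx"

* 2. After new names, labels and recodes are entered in the spreadsheet,
*    apply them to the dataset
iecodebook apply using "documentation/cleaning_codebook.xlsx"

* 3. Document the cleaned dataset in a separate Excel codebook
iecodebook export using "documentation/final_codebook.xlsx"
```

Harmonizing several survey rounds follows the same pattern with iecodebook append, which, as far as the help file is recalled here, takes the dataset paths before using and a surveys() option naming each round.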

Related Pages

Click here to see pages that link to this topic.
This page is part of the topic Stata coding practices.

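Additional Resources

  • DIME Analytics (World Bank), The iefieldkit GitHub page (https://github.com/worldbank/iefieldkit/)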