Iehfc
High Frequency Checks (HFCs) are repeated quality control procedures conducted during data collection to ensure data integrity, identify errors early, and allow corrective actions before the fieldwork is complete. HFCs should be conducted every time new data is collected to provide timely feedback to both the field team and the research team. To support this process, DIME Analytics developed iehfc as an R package to help standardize and simplify high-frequency checks according to best practices.
Read First
Before you start running checks, we recommend reading the Data Quality Assurance Plan and High Frequency Checks pages, which provide an overview of the best practices and guiding principles for conducting quality checks.
Don’t have any data, but want to try out iehfc? You can use test data and test parameters that are pre-loaded in the package!
Overview
High-frequency checks are an essential part of the data management process, especially when primary data is collected through surveys. However, setting up high-frequency checks from scratch for each round of data collection can be time-consuming and error-prone, and it often requires substantial coding experience, which creates a barrier for less technical teams. iehfc addresses this challenge by streamlining the initial data processing stage. It allows you to upload your raw data and efficiently set up standardized checks (see the Types of Checks section below).
iehfc is a Shiny application that can be installed as an R package from RStudio. It allows users to upload data, explore variable structure, and configure ready-to-run checks for duplicates, outliers, enumerator-level metrics, administrative-unit-level metrics, and unit-of-observation-level metrics, all through a simple interface.
While iehfc is built in R, it is designed to be accessible to both coders and non-coders. Users with no programming background can run the built-in checks directly through the app. More advanced users can download the generated scripts in either R or Stata to customize or rerun checks manually, which allows greater flexibility and integration into existing workflows. Since iehfc runs locally on your machine, no data is uploaded to external servers, maintaining data privacy.
Getting Started
Installation
To install and launch iehfc, follow these steps:
- Open RStudio.
- If you haven’t already, install the devtools package by running:
`install.packages("devtools")`
- Install the iehfc package from GitHub:
`devtools::install_github("dime-worldbank/iehfc")`
- Load the package:
`library(iehfc)`
- Launch the Shiny application:
`iehfc_app()`
When the app opens, it will start on the Introduction tab, which provides an overview of the tool along with useful links. You will also find a link to share feedback or report bugs via the iehfc GitHub repository.
To close the app, either close the application window or click the stop icon in the top-right corner of the RStudio console to interrupt the R process.
Uploading and exploring your dataset
To upload your dataset, go to the Upload Data tab in the app. If you do not have a dataset available but would like to explore the tool’s features, you can click the Use Test Data button in the sidebar.
To upload your own file:
- Click on the Browse button in the left sidebar.
- Select the file from your computer. Note: Only .csv files are supported.
Once uploaded, your data will appear as a searchable, sortable table on the right side of the screen. Below the table, you will see a list of variables along with their data types as interpreted during the upload process. Take a moment to review this list to ensure that all variables were imported correctly—sometimes variables may show unexpected values if they were not formatted correctly in the original .csv file.
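If you would like to preview how variable types might be interpreted before uploading, a minimal sketch in base R is shown below. It uses `read.csv()` on a small made-up file; this is an assumption for illustration and not necessarily how iehfc itself reads your data, and the variable names (`hhid`, `age`, `income`, `submission_date`) are hypothetical.

```r
# Hypothetical example data written to a temporary .csv file
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "hhid,age,income,submission_date",
  "101,34,1200,2024-03-01",
  "102,n/a,950,2024-03-02",
  "103,41,1100,2024-03-02"
), csv_path)

survey <- read.csv(csv_path, stringsAsFactors = FALSE)

# Data type of each column as interpreted by read.csv()
var_types <- vapply(survey, function(x) class(x)[1], character(1))
print(var_types)

# Columns read as text: "age" ends up here because of the stray "n/a" entry,
# and read.csv() also reads dates as text unless they are converted
text_vars <- names(var_types)[var_types == "character"]
print(text_vars)
```

A numeric variable that shows up as text usually points to a stray non-numeric entry in the original file.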
Setting up your checks
To begin setting your checks, navigate to the Check Selection and Setup tab.
- Select your unique ID variable from the sidebar on the left. This variable identifies each unit of observation in your dataset.
- Choose the types of checks you want to include by ticking the boxes next to each option. As you select different checks, additional configuration boxes will appear on the screen.
- Customize your checks using the fields that appear in each box. Hover over each field to see a helper text that explains how the value you enter will be used. For details on each type of check, see the Types of Checks section below.
Once you are satisfied with your setup:
- Save your configuration by clicking the Download Parameters button on the left sidebar. This will generate a .csv with all your selected settings. You can reuse these settings in future sessions by uploading the parameter file using the Upload Parameters button located at the bottom of the same panel.
- Download the code for any check directly from iehfc by clicking the icon next to the corresponding check. The code can be downloaded in both Stata and R formats, allowing advanced users to customize or run the checks outside the app. To use the script, just change the file path in the script to match your dataset location and run it in RStudio or Stata.
Types of checks
iehfc currently supports five main types of checks, each designed to help you identify issues commonly encountered in survey data.
Duplicate checks:
- iehfc automatically checks for duplicates in your ID variable (set in the sidebar).
- You can also Add Display Variables to choose additional variables to be displayed in the summary tables. Ideally, ID variables should be unique. However, if your dataset does not contain unique IDs and duplicates are expected, you can ignore the duplicate summary table.
- You can set up observation-wide duplicate checks by specifying a set of variables in the corresponding field. These variables will be considered jointly, and any observations flagged as duplicates across all these variables will appear in the summary table.
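The two kinds of duplicate checks described above can be sketched in base R as follows. This is an illustrative approximation, not the code iehfc generates; the data and variable names are made up.

```r
# Hypothetical survey data
survey <- data.frame(
  hhid       = c(101, 102, 102, 103, 104),
  enumerator = c("A", "B", "B", "A", "C"),
  village    = c("X", "Y", "Y", "X", "X"),
  income     = c(1200, 950, 950, 1200, 800)
)

# ID duplicates: every row whose ID appears more than once
id_dup <- duplicated(survey$hhid) | duplicated(survey$hhid, fromLast = TRUE)
# "enumerator" plays the role of a display variable here
id_duplicates <- survey[id_dup, c("hhid", "enumerator")]

# Observation-wide duplicates: rows that match on a set of variables considered jointly
dup_vars <- c("enumerator", "village", "income")
obs_dup <- duplicated(survey[dup_vars]) | duplicated(survey[dup_vars], fromLast = TRUE)
obs_duplicates <- survey[obs_dup, c("hhid", dup_vars)]

print(id_duplicates)
print(obs_duplicates)
```

Using `fromLast = TRUE` in addition to the default direction lists every member of a duplicate group, not only the second and later occurrences.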
Outlier checks:
- Outliers can be detected in individual variables or in groups of variables. To define a group, enter a string that the variable names share in the grouped variables field. This will automatically select all variables whose names contain that string.
- You can also choose display variables to be shown in the summary tables.
- You can select the method for outlier calculation: either standard deviation (sd) or interquartile range (iqr).
- You can also specify the multiplier to be used in the calculation (either 3 or 1.5).
- If you need more advanced options, you can download the generated code and modify the parameters directly.
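As a rough illustration of the two methods and the multiplier, the sketch below flags values outside mean ± multiplier × SD, or outside the quartiles extended by multiplier × IQR. These are the conventional definitions and are assumed here; the exact formulas in the scripts iehfc generates may differ. The function name `flag_outliers` and the data are hypothetical.

```r
# Flag outliers with either the standard deviation or the interquartile range rule
flag_outliers <- function(x, method = c("sd", "iqr"), multiplier = 3) {
  method <- match.arg(method)
  if (method == "sd") {
    center <- mean(x, na.rm = TRUE)
    spread <- sd(x, na.rm = TRUE)
    lower <- center - multiplier * spread
    upper <- center + multiplier * spread
  } else {
    q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
    iqr <- q[2] - q[1]
    lower <- q[1] - multiplier * iqr
    upper <- q[2] + multiplier * iqr
  }
  # Missing values are never flagged
  !is.na(x) & (x < lower | x > upper)
}

income <- c(100, 110, 95, 105, 102, 98, 101, 99, 103, 1000)

iqr_flags <- flag_outliers(income, method = "iqr", multiplier = 1.5)
sd_flags  <- flag_outliers(income, method = "sd", multiplier = 3)

print(which(iqr_flags))
print(which(sd_flags))
```

In this ten-value example the IQR rule flags the value 1000, while the SD rule with a multiplier of 3 flags nothing because the extreme value itself inflates the standard deviation. This is one reason to compare both methods on small samples.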
Enumerator-level checks:
- To view submissions by enumerators, select the enumerator variable and any other variables you want to be averaged for each enumerator.
- You can also specify the date when the survey was submitted using the Submission Date Variable (the variable must be of type date to be selected).
- Additionally, you can specify a dummy variable indicating whether the submission was marked as complete in the Submission Complete Variable field.
- This setup allows you to identify any anomalies related to specific enumerators.
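A minimal base R sketch of an enumerator-level summary is shown below: number of submissions, number of complete submissions, and the average of one selected variable per enumerator. The data, the variable names, and the output column names are hypothetical and do not reproduce the tables iehfc builds.

```r
# Hypothetical survey data
survey <- data.frame(
  enumerator = c("A", "A", "B", "B", "B", "C"),
  complete   = c(1, 1, 1, 0, 1, 0),      # dummy: 1 = submission marked complete
  duration   = c(30, 35, 12, 10, 11, 40) # variable to average per enumerator
)

# tapply() returns results in the same (sorted) enumerator order each time
n_sub   <- tapply(survey$complete, survey$enumerator, length)
n_comp  <- tapply(survey$complete, survey$enumerator, sum)
avg_dur <- tapply(survey$duration, survey$enumerator, mean)

enum_summary <- data.frame(
  enumerator      = names(n_sub),
  num_submissions = as.vector(n_sub),
  num_complete    = as.vector(n_comp),
  avg_duration    = as.vector(avg_dur)
)

print(enum_summary)
```

Here enumerator B records much shorter interviews on average than A and C, which is the kind of pattern worth following up with the field team.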
Administrative-unit-level checks:
- If your survey contains administrative unit variables, you can select up to two levels of administrative units. However, specifying at least one Administrative Unit Variable is necessary to run the check.
- You can select the submission date under the Submission Date Variable (which must be of type date) and specify a dummy variable for submission completion in the Submission Complete Variable field.
- This setup helps identify anomalies within specific administrative units.
Unit-of-observation-level checks:
- The Unit-of-Observation check helps you create a one-row-per-unit summary of key metadata for easy tracking and verification. This is particularly useful when reviewing submissions from the field—whether to confirm if an interview was finalized, to identify the enumerator, or to check submission dates and locations.
- Select the unit of observation. If your unit of observation differs from your ID variable, you can specify it in the Unit of Observation field. Ideally, the unit of observation should uniquely identify each row in the dataset.
- Use the Display Variables field to add any other variables you would like to include in the tracking table. This might include enumerator, administrative location, or submission details.
- These variables will appear in the output table, giving you a quick and comprehensive overview of each observation’s key attributes.
Viewing the results
Once you have configured your checks, click the Run HFCs button located in the left sidebar. This will generate your results and take you to the Outputs tab, where you can explore the outputs of each check you selected. Each check has a dedicated section within the Outputs tab, organized by type:
Duplicates
- The Duplicates section shows a summary table of all duplicated IDs detected in your dataset.
- If duplicates are expected in your dataset, you can ignore this output.
- If you configured observation-wide duplicate checks, you will see an additional table listing all observations flagged as duplicates across your selected variables.
Outliers
- The Outliers section presents a summary table listing all observations flagged as outliers across the selected variables. The variable that triggered the flag will be shown in the issue_var column. The summary table also includes basic statistics to help you understand the distribution of the flagged values.
- Below the table, you will find a histogram for each variable you selected for individual outlier checks. These histograms are shown alongside winsorized versions, allowing you to compare the original distribution to one in which extreme values have been capped.
- If you selected variables for grouped outlier checks, you will also see boxplots for those variables. Hovering over the graphs provides additional information.
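To make the idea of a winsorized distribution concrete, the sketch below caps values at chosen percentiles in base R. The 5th and 95th percentile cut-offs and the function name `winsorize` are assumptions for illustration; the cut-offs iehfc uses for its histograms are not stated here.

```r
# Cap values below and above the chosen percentiles (values are kept, not dropped)
winsorize <- function(x, probs = c(0.05, 0.95)) {
  limits <- quantile(x, probs, na.rm = TRUE, names = FALSE)
  pmin(pmax(x, limits[1]), limits[2])
}

income   <- c(100, 110, 95, 105, 102, 98, 101, 99, 103, 1000)
income_w <- winsorize(income)

# The winsorized vector has the same length but a much narrower range
print(range(income))
print(range(income_w))
```

Calling `hist()` on `income` and on `income_w` would give a side-by-side comparison similar in spirit to the one in the Outputs tab.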
Enumerator-level
- The Enumerator section displays outputs summarizing survey submissions by enumerator.
- Depending on your setup:
  - If you only selected the Enumerator Variable, the output table will show the average number of submissions per enumerator.
  - If you included the Submission Complete Variable, the output adds a daily count of completed submissions.
  - If you included the Submission Date Variable, the table will also include columns with submission dates and a daily sum of complete submissions.
- If the Submission Date Variable was specified, the output will include a cumulative submissions graph for each enumerator. This graph makes it easy to spot anomalies in submission patterns.
- If you set up the Enumerator Average Value Variable, you will see a table that shows the average value of the selected variables per enumerator.
Administrative unit level
- The Admin Level section summarizes submissions by geographic units.
- Depending on your setup:
  - If you chose only the Administrative Unit Variable, you will see average numbers of submissions per administrative unit.
  - If you included the Submission Complete Variable, the table will also include a column displaying the number of complete submissions per day.
  - If you added the Submission Date Variable, you will see columns for submission dates and daily totals.
- When the Submission Date Variable is included, a graph will be generated that shows the cumulative number of submissions per administrative unit, allowing you to spot patterns or anomalies over time.
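The cumulative submissions shown in the graph can be approximated in base R as sketched below: count submissions per administrative unit and day, then take a running total within each unit. The data and the names `district`, `num_submissions`, and `cumul_submissions` are hypothetical, and the code is not taken from iehfc.

```r
# Hypothetical survey data with one administrative level
survey <- data.frame(
  district        = c("North", "North", "South", "North", "South", "South"),
  submission_date = as.Date(c("2024-03-01", "2024-03-01", "2024-03-01",
                              "2024-03-02", "2024-03-03", "2024-03-03"))
)

# Daily number of submissions per administrative unit (days without submissions count as 0)
daily <- as.data.frame(
  table(district = survey$district, submission_date = survey$submission_date),
  responseName = "num_submissions",
  stringsAsFactors = FALSE
)
daily$submission_date <- as.Date(daily$submission_date)
daily <- daily[order(daily$district, daily$submission_date), ]

# Running total within each administrative unit
daily$cumul_submissions <- ave(daily$num_submissions, daily$district, FUN = cumsum)

print(daily)
```

A unit whose cumulative line stays flat for several days while others keep rising is a natural candidate for follow-up.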
Tracking
- The Tracking section displays your dataset organized by the Unit of Observation you selected in the setup.
- The table includes one row per observation, showing the variables you selected in the Display Variables field, such as enumerator ID, location, and submission date. This makes it easy to review interview-level metadata at a glance and confirm whether each unit was visited and completed as expected.
- If the Unit of Observation is not unique in your data, iehfc will automatically adjust it to ensure each row is distinct.
- The number of rows in this table will match the number of entries in your original dataset.
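The sketch below shows one way a tracking table with distinct rows could be built in base R. How iehfc adjusts a non-unique Unit of Observation is not described here, so `make.unique()` is used purely as an illustrative stand-in; the data and variable names are hypothetical.

```r
# Hypothetical survey data in which household "102" appears twice
survey <- data.frame(
  hhid            = c("101", "102", "102", "103"),
  enumerator      = c("A", "B", "B", "A"),
  village         = c("X", "Y", "Y", "X"),
  submission_date = as.Date(c("2024-03-01", "2024-03-01", "2024-03-02", "2024-03-02")),
  income          = c(1200, 950, 950, 1100),
  stringsAsFactors = FALSE
)

unit_var     <- "hhid"
display_vars <- c("enumerator", "village", "submission_date")

# Keep only the unit of observation and the display variables
tracking <- survey[, c(unit_var, display_vars)]

# Assumption: repeated units get a numeric suffix so that every row is distinct
tracking[[unit_var]] <- make.unique(tracking[[unit_var]], sep = "_")

print(tracking)
```

The table keeps one row per original entry, so its row count equals that of the input data.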
Exporting your checks
You can export each component of your checks individually or download everything as a consolidated report.
- Export individual tables: Each table has a Download Table button. Clicking this button will download the corresponding table in .csv format.
- Export graphs: To save a graph, hover over it to reveal the menu bar. Click the camera icon to download the graph as a .png file.
- Download the full report: To export all your checks in one file, click the Download Consolidated Report button located at the top left of the Outputs tab. This will create an interactive, well-formatted HTML report. You can open this file offline and share it with others easily.
Additional Resources
- DIME Analytics (World Bank), Primary Data Collection
- DIME Analytics (World Bank), SurveyCTO Server Management
- DIME Analytics (World Bank), Monitoring Data Quality
- DIME Analytics (World Bank), Open Learning Campus
- Oxfam, Planning Survey Research
- SurveyCTO, Data Quality with SurveyCTO