Innovative Data Sources


In addition to traditional data sources, such as information gathered during surveys, data can be collected from a variety of alternative sources.

Read First

  • Primary data is the main type of information that comes to mind when people talk about collecting data. It consists of gathering data through surveys, interviews, or experiments.
  • Occasionally, researchers find that data has already been collected, sometimes by the government and sometimes by a third party. Previously collected information that the field team then makes use of is known as secondary data.
  • Any source of data, such as secondary data, that is not collected first-hand is an innovative data source.
  • Examples of secondary data include administrative and monitoring data and Mobile Big Data.

Acquiring Secondary Data

Some types of secondary data, such as satellite imagery, are publicly available and don't require special agreements with government institutions or private companies. However, most information of interest to researchers, whatever kind of secondary data it may be, must be obtained through a data license agreement (DLA). A DLA formally grants rights to people who do not own the data they will be analyzing. When drafting a DLA, the involved parties must consider logistics, study scope, constraints on publishing personally identifiable information (PII), and other issues.

Types of Secondary Data

Secondary data falls into a variety of categories. Examples include satellite imagery, social media data, and mobile phone data.

Satellite imagery

Satellite imagery can offer evidence of economic activity and city expansion (seen from nighttime lights); true color imagery and vegetation (seen from daytime lights); weather patterns, such as rainfall and temperature; pollution levels of CO2 and NO2; and data on a region's terrain, i.e. whether the area is urban, cropland, forested, or home to bodies of water. The benefits of satellite data are:

  • Broad coverage (often the entire earth)
  • Free
  • High frequency

These advantages come with trade-offs. Free data has broad coverage but is often low resolution, while high-resolution data can be expensive to obtain and incomplete.

Social Media Data

Social media data can be used for a wide variety of purposes, from more prosaic tasks such as tracking traffic patterns to ones of political importance such as conducting sentiment analyses (how people feel about policies), tracing the flow of misinformation, and discovering the prevalence of bias and hate speech. It can also be used to measure economic status, directly and indirectly. For example, one way that researchers have measured poverty is by looking at Facebook users whose accounts show an interest in high-end restaurants, luxury goods, travel, etc. Accounts that show a higher volume of this content are more likely to belong to those who can afford such items. In addition, researchers can sometimes obtain detailed information on an area's education levels (should users choose to report their educational attainment), thus obtaining a rough measure of an area's affluence. Social media data also has drawbacks, two examples being:

  • Coverage: only those who have internet access and interact with the app are represented
  • Social desirability bias: the data is self-reported and subject to community pressures, which may harm its quality

Mobile Phone Data

Mobile phone data consists of two types: call detail records (CDR), which are records of mobile phone activity mapped to cell towers; and GPS data, which is compiled from pings from applications, such as Google Maps, querying GPS. As an example of the information that can be extracted from GPS mobility data, consider the case where researchers queried the travel time for over 1000 origin and destination pairs every hour using Google & Mapbox. The resultant dataset contained information on peak and off-peak travel hours for the months April-October as well as average speeds during those hours.
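As a rough illustration of how hourly travel-time queries can be turned into average speeds and a peak hour, consider the following minimal sketch. The record layout, the sample values, and the function name are hypothetical and are not taken from the study described above; the sketch also assumes that the route distance is known for each origin and destination pair.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical hourly query results:
# (origin, destination, hour_of_day, route_distance_km, travel_minutes)
records = [
    ("A", "B", 8, 10.0, 40.0),
    ("A", "B", 14, 10.0, 20.0),
    ("C", "D", 8, 6.0, 30.0),
    ("C", "D", 14, 6.0, 12.0),
]


def average_speed_by_hour(rows):
    """Return {hour: mean speed in km/h} across all origin-destination pairs."""
    speeds = defaultdict(list)
    for _origin, _destination, hour, distance_km, minutes in rows:
        # km per minute times 60 gives km/h
        speeds[hour].append(distance_km * 60.0 / minutes)
    return {hour: mean(values) for hour, values in speeds.items()}


speed_by_hour = average_speed_by_hour(records)

# Treat the hour with the lowest average speed as the peak (most congested) hour.
peak_hour = min(speed_by_hour, key=speed_by_hour.get)
```

Defining the peak hour as the slowest hour is only one possible convention; a real analysis might instead compare each hour against a free-flow baseline.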

Some major advantages of working with mobile phone data are its automatic generation, low marginal cost, and absence of a response burden. However, disadvantages include difficulty in accessing this data due to proprietary and privacy concerns as well as the fact that not everybody owns a phone.

For other types of secondary data, see the linked page at the beginning of this section. It contains information on secondary data not covered here, including geospatial data, remote sensing, telecom data, and crowd-sourced data.

Administrative Data

Data collected through existing government ministries, programs, and projects is called administrative data. It is so called because data collected and maintained by government agencies are used to "administer" programs and provide services to the public. For example, line ministries, the agencies responsible for a particular economic sector or activity and for delivering government programs to citizens, have access to administrative data in order to carry out their mandate. Similarly, national statistics offices (NSOs), the agencies responsible for producing and disseminating quantitative and qualitative information on major areas of citizens' lives, hold census data, while regulatory agencies hold tax, price, and trade data.

Administrative data is also generated through existing systems or processes as part of normal business procedures (salary, employment, utility billing, cash register scanning, procurement). This kind of data has advantages and disadvantages. The pros are:

  1. Large datasets with broad coverage
  2. Generated regularly (often real-time)
  3. Low cost for M&E
  4. Creates less burden for the population of interest
  5. May make it easier to construct the counterfactual
  6. Can be more objective, avoiding the social desirability or recall biases of survey data

The cons are:

  1. Often not digitized
  2. May not have clear identifier or way to link to other datasets
  3. Difficult to access (usually sensitive, proprietary)
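When two administrative datasets lack a shared identifier, one common workaround is to build a linkage key from fields they do share. The sketch below is a minimal illustration under stated assumptions: the payroll and billing records, their field names, and the choice of name plus district as the key are all hypothetical, and exact matching on a normalized key is far cruder than the probabilistic linkage methods often used in practice.

```python
import unicodedata


def clean(text):
    """Lowercase, strip accents, and collapse repeated whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.lower().split())


def linkage_key(record):
    """Hypothetical key: normalized name plus normalized district."""
    return (clean(record["name"]), clean(record["district"]))


def link(left, right):
    """Join two record lists on the linkage key; return (matched, unmatched)."""
    index = {linkage_key(row): row for row in right}
    matched, unmatched = [], []
    for row in left:
        other = index.get(linkage_key(row))
        if other is None:
            unmatched.append(row)
        else:
            # Fields from the left record win when both sides share a field name.
            matched.append({**other, **row})
    return matched, unmatched


# Hypothetical sample records
payroll = [
    {"name": "María  Lopez", "district": "Norte", "salary": 500},
    {"name": "John Doe", "district": "Sur", "salary": 450},
]
billing = [
    {"name": "maria lopez", "district": "NORTE", "kwh": 120},
    {"name": "Jane Roe", "district": "Sur", "kwh": 90},
]

matched, unmatched = link(payroll, billing)
```

Keeping the unmatched records visible matters, because a low match rate is itself a signal that the datasets may not be linkable without a proper identifier.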

Mobile Big Data

A type of secondary data, mobile big data (MBD) is anonymized, aggregated data generated from personal mobile devices and mobile network operators (MNOs). It consists of information about the phone number of the caller, the phone number of the receiver, the length of the call, the phone towers associated with the call, and the type of mobile phone. There is ongoing research to harness this information to track population trends, augment statistics, and deliver policy insights which can be used to provide targeted services. In response to the COVID-19 pandemic and the push towards sustainable growth, nearly 80% of NSOs have indicated they want to improve their use of MBD. To show how effective this technology can be, consider the estimated impact the adoption of MBD would have in Sub-Saharan Africa:

  • 60 million people could have better access to healthcare due to better positioning of health care services
  • 120 million people saved because of better-informed measures to limit air pollution 
  • Cost effective: $30 for every $1 invested in Integrated National Data Systems

MBD in Policy

The top sources of MBD for policy use are call detail records (CDR) and GPS data. CDR consists of metadata on voice, text, and other data points collected by MNOs. There are two advantages to using CDR:

  1. More representative of the bottom 40% of the population in low-income countries
  2. Event driven (voice/text), medium spatial and temporal resolution mapped to closest cell tower

And four obstacles:

  1. Difficult to access (sensitive or proprietary information)
  2. Needs high performance computing and storage
  3. Requires data sharing arrangements with MNOs
  4. Requires local capacity to analyze
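Because each CDR event is mapped to a cell tower, a typical first processing step is to aggregate events into counts per tower and time window. The following minimal sketch counts distinct active users per tower per hour. The record layout, the pseudonymized user IDs, and the tower names are hypothetical; real CDR schemas vary by operator.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical CDR events: (pseudonymized_user_id, ISO timestamp, tower_id)
cdr_events = [
    ("u1", "2021-03-01T08:15:00", "T1"),
    ("u2", "2021-03-01T08:40:00", "T1"),
    ("u1", "2021-03-01T08:55:00", "T1"),
    ("u3", "2021-03-01T09:05:00", "T2"),
]


def active_users_by_tower_hour(events):
    """Return {(tower_id, hour_start): number of distinct users seen}."""
    users = defaultdict(set)
    for user_id, timestamp, tower_id in events:
        # Truncate the event time to the start of its hour.
        hour_start = datetime.fromisoformat(timestamp).replace(
            minute=0, second=0, microsecond=0
        )
        users[(tower_id, hour_start)].add(user_id)
    return {key: len(ids) for key, ids in users.items()}


counts = active_users_by_tower_hour(cdr_events)
```

Counting distinct users rather than raw events keeps one heavy caller from dominating a tower's total, and releasing only aggregates of this kind is one way to reduce privacy risk.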

GPS data consists of location coordinates generated from usage of location-enabled smartphone applications. Chipsets on smartphones communicate with global navigation satellite systems (GNSS). As always, there is a mix of pros and cons to its usage, the advantages being:

  1. High spatial and temporal resolution (meter-level)
  2. Readily accessible via third party aggregators or big tech products (Cuebiq, Veraset)

And the cons being:

  1. Less representative of the bottom 40% of the population in low-income countries
  2. High-performance computing and technical capacity may be needed to process raw data

Those are the sources of MBD in policy. But what about the goals? Broadly speaking, there are five areas where MBD is envisioned to play a role:

  1. Dynamic Population Mapping: Population dynamics and characteristics can be used to inform a wide range of policy indicators
  2. Migration Statistics: CDR and GPS data can be used to understand and predict human mobility patterns
  3. Displacement and Disaster: CDR data can be useful for producing statistical information to supplement traditional survey data in disaster contexts.
  4. Information Society: Produce internationally agreed information and communication technology (ICT) indicators that are included in the SDG monitoring framework
  5. Tourism: MBD is an alternative source for generating and/or filling the gap in tourism statistics.
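To make the migration use case more concrete, the sketch below derives simple origin-destination flows from location sightings. It relies on a heuristic assumed here for illustration and not described above: each user's "home" region in a month is taken to be the region where they were seen most often. The sightings, region names, and function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical sightings: (pseudonymized_user_id, month, region)
sightings = [
    ("u1", "2021-01", "North"),
    ("u1", "2021-01", "North"),
    ("u1", "2021-01", "South"),
    ("u1", "2021-02", "South"),
    ("u1", "2021-02", "South"),
    ("u2", "2021-01", "North"),
    ("u2", "2021-02", "North"),
]


def modal_region(rows):
    """Return {(user_id, month): region with the most sightings}."""
    tallies = defaultdict(Counter)
    for user_id, month, region in rows:
        tallies[(user_id, month)][region] += 1
    return {key: tally.most_common(1)[0][0] for key, tally in tallies.items()}


def flows(rows, start_month, end_month):
    """Count users by (home region in start_month, home region in end_month)."""
    home = modal_region(rows)
    users = {user_id for user_id, _month in home}
    result = Counter()
    for user_id in users:
        origin = home.get((user_id, start_month))
        destination = home.get((user_id, end_month))
        # Skip users who were not observed in both months.
        if origin is not None and destination is not None:
            result[(origin, destination)] += 1
    return result


od_flows = flows(sightings, "2021-01", "2021-02")
```

Entries with different origin and destination regions are candidate moves, while entries on the diagonal (same region in both months) represent users who stayed. Real analyses typically apply stricter rules, such as minimum observation counts, before treating a change as migration.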


Challenges

As with all evolving technological fields, there are challenges to using MBD. These are

  1. Variation: Complexity in the maturity of the Integrated National Data Systems
  2. Tools: Special software / hardware must be provisioned on MNO network to store / process CDR.
  3. Safeguards: Make sure to have good practices for data security, privacy preservation, and legal protections.
  4. Standards: Guidance for developing measurements and official statistics as well as standard data sharing agreements
  5. Capacity: Low- and middle-income countries (L/MICs) often lack capacity to repurpose MBD into policy data products
  6. Funding: Funding to date has been for one-off projects; programmatic funding is needed
  7. Access: Median ownership of all types of phones that allow for collection of MBD is lower in emerging economies

Additional Resources