Difference between revisions of "Innovative Data Sources"

 
(36 intermediate revisions by 2 users not shown)
Line 5: Line 5:
*Occasionally, researchers find that data has already been collected, sometimes by the government and sometimes by a third party. Previously collected information that the field team then makes use of is known as [[Secondary Data Sources|secondary data]].
*Occasionally, researchers find that data has already been collected, sometimes by the government and sometimes by a third party. Previously collected information that the field team then makes use of is known as [[Secondary Data Sources|secondary data]].
*Any source of data, such as '''secondary data''', that is not collected first-hand is an '''innovative data source'''.  
*Any source of data, such as '''secondary data''', that is not collected first-hand is an '''innovative data source'''.  
* Examples of '''secondary data''' include [[Administrative and Monitoring Data|administrative and monitoring data]], [[Geospatial Data|geospatial data]], and many more discussed in below.
* Examples of '''secondary data''' include [[Administrative and Monitoring Data|administrative and monitoring data]] and '''Mobile Big Data'''.


== Acquiring Secondary Data ==
== Acquiring Secondary Data ==
Some types of [[Secondary Data|secondary data]], such as satellite imagery, are publicly available and don't require special agreements with government institutions or private companies.  However, most information of interest to researchers, whatever kind of '''secondary data''' it may be, must be obtained through a [[Data License Agreement| data license agreement]]. '''Data License Agreements''' formally grants rights to people who do not the own data they will be analyzing. The key elements are
Some types of [[Secondary Data|secondary data]], such as satellite imagery, are publicly available and don't require special agreements with government institutions or private companies.  However, most information of interest to researchers, whatever kind of '''secondary data''' it may be, must be obtained through a [[Data License Agreement| data license agreement]]. '''Data license agreements''' formally grants rights to people who do not the own data they will be analyzing. When drafting a '''DLA''', the involved parties must consider logistics, study scope, constraints on publishing [[Personally Identifiable Information (PII)| PII]], and other issues.
*What data will be received
*Intended use(s)
*How long it will be retained
*Who will have access to it
*Rights to derivative data, metadata, and other outputs
*How to cite the data


==Types of Secondary Data==
==Types of Secondary Data==
There are a variety of categories of [[Secondary Data|secondary data]].  Among others, examples include
There are a variety of categories of [[Secondary Data Sources|secondary data]].  Among others, examples include '''satellite imagery''', '''social media data''', and '''mobile phone data'''.
*Satellite Imagery
*Social Media Data
*Mobile Phone Data


===Satellite imagery===
===Satellite imagery===
Among the information satellite imagery can offer is evidence of economic activity and city expansion (seen from nighttime lights); true color imagery and vegetation (seen from daytime lights); weather patterns, such as rainfall and temperature; pollution levels of CO2 and NO2; and data on a region's terrain, i.e. is the area urban, cropland, forested, home to bodies of water, etc.
Among the information '''satellite imagery''' can offer is evidence of economic activity and city expansion (seen from nighttime lights); true color imagery and vegetation (seen from daytime lights); weather patterns, such as rainfall and temperature; pollution levels of CO2 and NO2; and data on a region's terrain, i.e. is the area urban, cropland, forested, home to bodies of water, etc. The benefits of '''satellite data''' are
*broad coverage (often the entire earth)
*Free
*High frequency
Despite these advantages, each also has a drawback. Free data is high coverage, but often at low-resolution. In addition, high resolution data can be expensive to obtain and incomplete.


===Social Media Data===
===Social Media Data===
Social media can offer information on poverty and education levels. For example, one way that researchers have measured poverty is by looking at Facebook users whose accounts show an interest in restaurants, luxury goods, travel, etc. Educational attainment is also self-reported on Facebook and other social media platforms, so researchers can sometimes obtain detailed information on an area's education levels.
'''Social media data''' can be used for an amazing variety of purposes, from more prosaic tasks such as tracking traffic patterns to ones of political importance such as conducting sentiment analyses (how people feel about policies), tracing the flow of misinformation, and discovering the prevalence of bias and hate speech. It can of course also be used to measure economic status, directly and indirectly. For example, one way that researchers have measured poverty is by looking at Facebook users whose accounts show an interest in high-end restaurants, luxury goods, travel, etc. Accounts that show a higher volume in this content are more likely to belong to those who can afford such items. In addition, researchers can sometimes obtain detailed information on an area's education levels (should users choose to report their educational attainment), thus obtaining a rough measure of an area's affluence. There are of course drawbacks to this '''social media data''', two examples being
*Coverage: only those who have internet access and interact with app
*Social Desirability Bias: Self-reported and subject to community pressures, may harm quality


===Mobile Phone Data===
===Mobile Phone Data===
Mobile phone data consists of two types: call data records (CDR), records of mobile phone activity mapped to cell towers; and GPS data which is compiled from pings from applications, such as Google Maps querying GPS. As an example of the information that can be extracted from GPS mobility data, consider the case where researchers queried the travel time for over 1000 origin and destination pairs every hour using Google & Mapbox. The resultant '''dataset''' contained information on peak and off-peak travel hours for the months April-October as well as average speeds during those hours.
'''Mobile phone data''' consists of two types: call data records (CDR), records of mobile phone activity mapped to cell towers; and GPS data which is compiled from pings from applications, such as Google Maps querying GPS. As an example of the information that can be extracted from GPS mobility data, consider the case where researchers queried the travel time for over 1000 origin and destination pairs every hour using Google & Mapbox. The resultant '''dataset''' contained information on peak and off-peak travel hours for the months April-October as well as average speeds during those hours.  


For other types of '''secondary data''', see the linked page at the beginning of this section. It contains information on [[Geospatial Data|geospatial data]], [[Remote Sensing|remote sensing]], [[Telecom Data|telecom data]], and [[Crowd-Sourced Data|crowd-sourced data]].
Some major advantages of working with '''mobile phone data''' are its automatic generation, low marginal cost, and absence of a response burden. However, disadvantages include difficulty in accessing this data due to proprietary and privacy concerns as well as the fact that not everybody owns a phone.


== Administrative Data ==
For other types of '''secondary data''', see the linked page at the beginning of this section. It contains information on '''secondary data''' not covered here, including [[Geospatial Data|geospatial data]], [[Remote Sensing|remote sensing]], [[Telecom Data|telecom data]], and [[Crowd-Sourced Data|crowd-sourced data]].
Data collected through existing government ministries, programs and projects is called [[Administrative and Monitoring Data|administrative data]]. It is so called because data collected and maintained by agencies or firms are used to "administer" programs and provide services to the public.


For example, line ministries, agencies responsible for a particular economic sector or activity and for delivering government programs to citizens, have access to '''administrative data''' in order to carry out their mandate. Or take national statistics offices (NSOs), agencies responsible for producing and disseminating quantitative and qualitative information on major areas in citizens' lives, that possess, say, census and [[Geospatial Data|geospatial data]] while regulatory agencies have tax, price, and trade data.
== Mobile Big Data ==
A type of [[Secondary Data Sources|secondary data]], '''mobile big data''' (MBD) is anonymized, aggregated data generated from personal mobile devices and mobile network operators (MNOs). It consists of information about the phone number of caller, number of the receiver, length of the call, phone towers associated with the call, and type of mobile phone. There is ongoing research to harness this information to track population trends, augment statistics, and deliver policy insights which can be used to provide targeted services. In response to the Covid-19 Pandemic and the push towards sustainable growth, nearly 80% of NSOs have indicated they want to improve their use of '''MBD'''. To show how effective this technology can be consider, consider the estimated impact the adoption of '''MBD''' would have in Sub-Saharan Africa:
*60 million people could have better access to healthcare due to better positioning of health care services
*120 million people saved because of better-informed measures to limit air pollution 
*Cost effective: $30 for every $1 dollar invested in Integrated National Data Systems
 
===MBD in Policy===
The top sources of '''MBD''' for policy use are call detail records ('''CDR''') and '''GPS data'''. '''CDR''' is metadata of voice, text, and other data points collected by MNOs. There are two advantages to using '''CDR''':
#More representative of bottom 40%​ of population in Low-income countries
#Event driven (voice/text), medium spatial and temporal resolution mapped to closest cell tower
And four obstacles:
#Difficult to access (sensitive or proprietary information)
#Needs high performance computing and storage
#Data sharing arrangements with MNOs
#Local capacity to analyze
 
'''GPS''' data consists of location coordinates generated from usage of location-enabled smartphone applications. Chipsets on smartphones communicate with global navigation satellite systems (GNSS). As always, there is a mix of pros and cons to its usage, the advantages being:
#High spatial and temporal resolution (meter)
#Readily Accessible via third party aggregators or big tech products (Cuebiq, Veraset)
And the cons being:
#Less representative of bottom 40%​ of population in Low-income countries
#High Performance Computing and technical capacity may be needed to process raw data


== Mobile Big Data ==
Those are the sources of '''MBD''' in policy.  But what about the goals? Broadly speaking, there are five areas where '''MBD''' is envisioned to play a role:
A type of [[Secondary Data Sources|secondary data]], '''mobile big data''' is anonymized, aggregated data generated from personal mobile devices (phones) and mobile network operators. There is ongoing research to harness this information to track population trends, augment statistics, and deliver policy insights which can be used to provide targeted services. For example, '''mobile big data''' can be used to predict the spread of infectious diseases which would allow governments to optimize delivery of public health services; or it can be used to track migration patterns in response to climate disasters which could be used to improve government response.
# '''Dynamic Population Mapping''': Population dynamics and characteristics can be used to inform a wide range of policy indicators
#'''Migration Statistics''': CDR and GPS data can be used to understand and predict human mobility patterns
#'''Displacement and Disaster''': CDR data can be useful for producing statistical information to supplement traditional survey data in disaster contexts.
#'''Information Society''': Produce internationally agreed information and communication technology (ICT) indicators that are included in the SDG monitoring framework
#'''Tourism''': MBD is as an alternative source for generating and/or filling the gap in tourism statistics.
 
 
===Challenges===
As with all evolving technological fields, there are challenges to using '''MBD'''. These are
#'''Variation''': Complexity in the maturity of the Integrated National Data Systems
#'''Tools''': Special software / hardware must be provisioned on MNO network to store / process CDR.
#'''Safeguards''': Make sure to have good practices for data security, privacy preservation, and legal protections.
#'''Standards''': Guidance for developing measurements and official statistics as well as standard data sharing agreements
#'''Capacity''': L/MICS often lack capacity to repurpose '''MBD''' into policy data products
#'''Funding''': Has been for one-off projects to date has; programmatic funding needed
#'''Access''': Median ownership of all types of phones that allow for collection of '''MBD''' is lower in emerging economies
 
==Additional Resources==
*DIME Analytics (World Bank), [https://osf.io/rv4h5 Integrated Data Systems for Monitoring & Impact Evaluation]
*DIME Analytics (World Bank), [https://osf.io/36nyq Acquiring Secondary Data]

Latest revision as of 20:06, 28 May 2024

In addition to traditional data sources, such as information gathered during surveys, data can be collected from a variety of alternative sources.

Read First

  • Primary data is what most people have in mind when they talk about collecting data. It is gathered first-hand through surveys, interviews, or experiments.
  • Occasionally, researchers find that data has already been collected, sometimes by the government and sometimes by a third party. Previously collected information that the field team then makes use of is known as secondary data.
  • Any source of data, such as secondary data, that is not collected first-hand is an innovative data source.
  • Examples of secondary data include administrative and monitoring data and mobile big data.

Acquiring Secondary Data

Some types of secondary data, such as satellite imagery, are publicly available and don't require special agreements with government institutions or private companies. However, most information of interest to researchers, whatever kind of secondary data it may be, must be obtained through a data license agreement (DLA). Data license agreements formally grant rights to people who do not own the data they will be analyzing. When drafting a DLA, the involved parties must consider logistics, study scope, constraints on publishing personally identifiable information (PII), and other issues.

Types of Secondary Data

Secondary data falls into many categories. Examples include satellite imagery, social media data, and mobile phone data, among others.

Satellite imagery

Satellite imagery can offer evidence of economic activity and city expansion (seen from nighttime lights); true color imagery and vegetation (seen in daytime imagery); weather patterns, such as rainfall and temperature; pollution levels of CO2 and NO2; and data on a region's terrain, i.e. whether the area is urban, cropland, forested, home to bodies of water, etc. The benefits of satellite data are:

  • Broad coverage (often the entire earth)
  • Free
  • High frequency

These advantages come with trade-offs. Free data offers broad coverage, but often at low resolution, while high-resolution data can be expensive to obtain and incomplete.

Social Media Data

Social media data can be used for a wide variety of purposes, from more prosaic tasks such as tracking traffic patterns to ones of political importance such as conducting sentiment analyses (how people feel about policies), tracing the flow of misinformation, and discovering the prevalence of bias and hate speech. It can also be used to measure economic status, directly and indirectly. For example, one way that researchers have measured poverty is by looking at Facebook users whose accounts show an interest in high-end restaurants, luxury goods, travel, etc. Accounts that show a higher volume of this content are more likely to belong to those who can afford such items. In addition, researchers can sometimes obtain detailed information on an area's education levels (should users choose to report their educational attainment), thus obtaining a rough measure of an area's affluence. Social media data also has drawbacks, two examples being:

  • Coverage: the data includes only those who have internet access and interact with the app
  • Social desirability bias: the information is self-reported and subject to community pressures, which may harm data quality

Mobile Phone Data

Mobile phone data consists of two types: call detail records (CDR), which are records of mobile phone activity mapped to cell towers; and GPS data, which is compiled from pings sent by applications that query GPS, such as Google Maps. As an example of the information that can be extracted from GPS mobility data, consider the case where researchers queried the travel time for over 1,000 origin and destination pairs every hour using Google and Mapbox. The resultant dataset contained information on peak and off-peak travel hours for the months April-October as well as average speeds during those hours.
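
As a rough illustration of how such a travel-time dataset might be summarized, the sketch below computes average speeds for peak and off-peak hours. It is only a minimal sketch under stated assumptions: the record layout, the sample values, and the definition of peak hours (PEAK_HOURS) are hypothetical and do not come from the study described above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (origin, destination, hour_of_day, distance_km, travel_time_min).
# In a real study these would come from hourly routing queries; the values
# below are made up for illustration.
observations = [
    ("A", "B", 8, 10.0, 30.0),
    ("A", "B", 14, 10.0, 20.0),
    ("A", "B", 18, 10.0, 40.0),
    ("C", "D", 8, 6.0, 24.0),
    ("C", "D", 23, 6.0, 9.0),
]

# Assumption: peak hours are 7-9 and 17-19; the source does not define them.
PEAK_HOURS = {7, 8, 9, 17, 18, 19}


def average_speeds(rows):
    """Return the mean speed in km/h for peak and off-peak observations."""
    speeds = defaultdict(list)
    for _origin, _destination, hour, distance_km, minutes in rows:
        period = "peak" if hour in PEAK_HOURS else "off_peak"
        # km per minute times 60 gives km per hour.
        speeds[period].append(distance_km * 60.0 / minutes)
    return {period: mean(values) for period, values in speeds.items()}


result = average_speeds(observations)
print(result)
```

Averaging speeds per observation, rather than dividing total distance by total time, weights every origin-destination query equally; either choice could be reasonable depending on the research question.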

Some major advantages of working with mobile phone data are its automatic generation, low marginal cost, and absence of a response burden. However, disadvantages include difficulty in accessing this data due to proprietary and privacy concerns as well as the fact that not everybody owns a phone.

For other types of secondary data, see the linked page at the beginning of this section. It contains information on secondary data not covered here, including geospatial data, remote sensing, telecom data, and crowd-sourced data.

Mobile Big Data

A type of secondary data, mobile big data (MBD) is anonymized, aggregated data generated from personal mobile devices and mobile network operators (MNOs). It consists of information about the caller's phone number, the receiver's phone number, the length of the call, the phone towers associated with the call, and the type of mobile phone. There is ongoing research to harness this information to track population trends, augment statistics, and deliver policy insights which can be used to provide targeted services. In response to the COVID-19 pandemic and the push towards sustainable growth, nearly 80% of national statistics offices (NSOs) have indicated they want to improve their use of MBD. To show how effective this technology can be, consider the estimated impact the adoption of MBD would have in Sub-Saharan Africa:

  • 60 million people could have better access to healthcare due to better positioning of health care services
  • 120 million people saved because of better-informed measures to limit air pollution 
  • Cost effective: $30 for every $1 invested in Integrated National Data Systems

MBD in Policy

The top sources of MBD for policy use are call detail records (CDR) and GPS data. CDR is metadata of voice, text, and other data points collected by MNOs. There are two advantages to using CDR:

  1. More representative of the bottom 40% of the population in low-income countries
  2. Event-driven (voice/text), with medium spatial and temporal resolution, mapped to the closest cell tower

And four obstacles:

  1. Difficult to access (sensitive or proprietary information)
  2. Needs high-performance computing and storage
  3. Requires data sharing arrangements with MNOs
  4. Requires local capacity to analyze the data
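
To make the CDR structure described above more concrete, the following sketch models a call record with the fields listed earlier and aggregates distinct callers per cell tower and hour. Everything here is illustrative: the CallRecord class, the hour field, the sample numbers, and the salted-hash step are assumptions, not an MNO's actual schema or pipeline. Note also that salted hashing is only pseudonymization and would not by itself satisfy the anonymization requirements of a real data sharing arrangement.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    """Hypothetical CDR row; field names are assumptions for illustration."""
    caller: str      # caller's phone number
    receiver: str    # receiver's phone number
    duration_s: int  # length of the call in seconds
    tower_id: str    # cell tower the call was mapped to
    hour: int        # hour of day (0-23), an assumed extra field


def pseudonymize(number: str, salt: str) -> str:
    """Replace a phone number with a truncated salted SHA-256 digest."""
    return hashlib.sha256((salt + number).encode("utf-8")).hexdigest()[:16]


def subscribers_per_tower(records, salt):
    """Count distinct pseudonymized callers seen at each (tower, hour)."""
    seen = defaultdict(set)
    for record in records:
        key = (record.tower_id, record.hour)
        seen[key].add(pseudonymize(record.caller, salt))
    return {key: len(ids) for key, ids in seen.items()}


# Made-up sample data.
records = [
    CallRecord("555-0001", "555-0002", 60, "T1", 9),
    CallRecord("555-0001", "555-0003", 30, "T1", 9),
    CallRecord("555-0004", "555-0001", 120, "T1", 9),
    CallRecord("555-0004", "555-0002", 45, "T2", 18),
]

counts = subscribers_per_tower(records, salt="example-salt")
print(counts)
```

Aggregating to tower-by-hour counts before analysis keeps individual call histories out of the output, which is one simple way to move toward the aggregated form of MBD described above.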

GPS data consists of location coordinates generated from usage of location-enabled smartphone applications. Chipsets on smartphones communicate with global navigation satellite systems (GNSS). As always, there is a mix of pros and cons to its usage, the advantages being:

  1. High spatial and temporal resolution (meter-level)
  2. Readily accessible via third-party aggregators or big tech products (Cuebiq, Veraset)

And the cons being:

  1. Less representative of the bottom 40% of the population in low-income countries
  2. High-performance computing and technical capacity may be needed to process raw data

Those are the sources of MBD in policy. But what about the goals? Broadly speaking, there are five areas where MBD is envisioned to play a role:

  1. Dynamic Population Mapping: Population dynamics and characteristics can be used to inform a wide range of policy indicators
  2. Migration Statistics: CDR and GPS data can be used to understand and predict human mobility patterns
  3. Displacement and Disaster: CDR data can be useful for producing statistical information to supplement traditional survey data in disaster contexts.
  4. Information Society: MBD can help produce internationally agreed information and communication technology (ICT) indicators that are included in the SDG monitoring framework.
  5. Tourism: MBD is an alternative source for generating tourism statistics and/or filling gaps in them.
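
As a hedged sketch of how mobility patterns might be derived for migration statistics, the example below treats each subscriber's most frequently observed region in a period as a proxy for where they live, then counts moves between two periods. The proxy, the function names, and the sample data are all assumptions made for illustration; they are not a method prescribed by the source.

```python
from collections import Counter


def modal_region(sightings):
    """Return the most frequently observed region for one subscriber."""
    return Counter(sightings).most_common(1)[0][0]


def migration_flows(period_a, period_b):
    """Count subscribers whose modal region differs between two periods.

    Both arguments map a pseudonymized subscriber ID to a list of regions
    in which that subscriber was observed. Subscribers missing from either
    period are skipped.
    """
    flows = Counter()
    for subscriber, sightings_a in period_a.items():
        if subscriber not in period_b:
            continue
        origin = modal_region(sightings_a)
        destination = modal_region(period_b[subscriber])
        if origin != destination:
            flows[(origin, destination)] += 1
    return dict(flows)


# Made-up observations for two periods.
january = {
    "s1": ["North", "North", "South"],
    "s2": ["South", "South"],
    "s3": ["North"],
}
june = {
    "s1": ["South", "South"],
    "s2": ["South"],
    "s4": ["North"],
}

flows = migration_flows(january, june)
print(flows)
```

Subscribers who appear in only one period are dropped here, so churn in phone ownership would bias the counts; a real analysis would need to account for that.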


Challenges

As with all evolving technological fields, there are challenges to using MBD. These are

  1. Variation: Complexity in the maturity of the Integrated National Data Systems
  2. Tools: Special software and hardware must be provisioned on the MNO network to store and process CDR.
  3. Safeguards: Good practices are needed for data security, privacy preservation, and legal protections.
  4. Standards: Guidance is needed for developing measurements and official statistics, as well as standard data sharing agreements.
  5. Capacity: Low- and middle-income countries (L/MICs) often lack the capacity to repurpose MBD into policy data products.
  6. Funding: To date, funding has been for one-off projects; programmatic funding is needed.
  7. Access: Median ownership of all types of phones that allow for collection of MBD is lower in emerging economies

Additional Resources