Innovative Data Sources
Latest revision as of 20:06, 28 May 2024
In addition to traditional data sources, such as information gathered during surveys, data can be collected from a variety of alternative sources.
Read First
- Primary data is the main type of information that comes to mind when people talk about collecting data. It consists of data gathered first-hand through surveys, interviews, or experiments.
- Occasionally, researchers find that data has already been collected, sometimes by the government and sometimes by a third party. Previously collected information that the field team then makes use of is known as secondary data.
- Any source of data, such as secondary data, that is not collected first-hand is an innovative data source.
- Examples of secondary data include administrative and monitoring data and Mobile Big Data.
Acquiring Secondary Data
Some types of secondary data, such as satellite imagery, are publicly available and don't require special agreements with government institutions or private companies. However, most information of interest to researchers, whatever kind of secondary data it may be, must be obtained through a data license agreement (DLA). A data license agreement formally grants rights to people who do not own the data they will be analyzing. When drafting a DLA, the involved parties must consider logistics, study scope, constraints on publishing personally identifiable information (PII), and other issues.
Types of Secondary Data
Secondary data falls into a variety of categories, including satellite imagery, social media data, and mobile phone data.
Satellite imagery
Satellite imagery can offer evidence of economic activity and city expansion (seen from nighttime lights); true color imagery and vegetation (seen in daytime imagery); weather patterns, such as rainfall and temperature; levels of CO2 and NO2 pollution; and data on a region's terrain, i.e. whether the area is urban, cropland, forested, covered by bodies of water, etc. The benefits of satellite data are:
- Broad coverage (often the entire earth)
- Free
- High frequency
Each of these advantages comes with a trade-off, however. Free data offers broad coverage, but often at low resolution, while high-resolution data can be expensive to obtain and incomplete.
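As a rough illustration of how nighttime lights can be turned into a regional indicator, the sketch below averages pixel values within each region of a small raster. The function name, the toy grid, and the region labels are all hypothetical, and the values are made up rather than taken from any satellite product; real work would typically read georeferenced rasters with a dedicated library.

```python
from collections import defaultdict


def mean_radiance_by_region(radiance, region_ids):
    """Average pixel radiance within each region.

    radiance   -- 2D list of nighttime-light values (None marks a missing pixel)
    region_ids -- 2D list of the same shape, labeling each pixel's region
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for rad_row, id_row in zip(radiance, region_ids):
        for value, region in zip(rad_row, id_row):
            if value is None:  # skip missing pixels instead of treating them as dark
                continue
            totals[region] += value
            counts[region] += 1
    return {region: totals[region] / counts[region] for region in totals}


# Toy 3x3 raster: made-up values, not real satellite data.
radiance = [
    [0.0, 2.0, 8.0],
    [1.0, None, 10.0],
    [0.0, 3.0, 12.0],
]
region_ids = [
    ["rural", "rural", "urban"],
    ["rural", "rural", "urban"],
    ["rural", "rural", "urban"],
]
result = mean_radiance_by_region(radiance, region_ids)
```

Skipping missing pixels rather than counting them as zero matters here, because incomplete imagery would otherwise make a region look darker than it is.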
Social Media Data
Social media data can be used for a wide variety of purposes, from more prosaic tasks such as tracking traffic patterns to ones of political importance such as conducting sentiment analyses (how people feel about policies), tracing the flow of misinformation, and discovering the prevalence of bias and hate speech. It can also be used to measure economic status, directly and indirectly. For example, one way that researchers have measured poverty is by looking at Facebook users whose accounts show an interest in high-end restaurants, luxury goods, travel, etc. Accounts that show a higher volume of this content are more likely to belong to those who can afford such items. In addition, researchers can sometimes obtain detailed information on an area's education levels (should users choose to report their educational attainment), thus obtaining a rough measure of an area's affluence. Social media data also has drawbacks, two examples being:
- Coverage: The data captures only those who have internet access and interact with the app
- Social Desirability Bias: The information is self-reported and subject to community pressures, which may harm its quality
Mobile Phone Data
Mobile phone data consists of two types: call detail records (CDR), which are records of mobile phone activity mapped to cell towers; and GPS data, which is compiled from pings from applications that query GPS, such as Google Maps. As an example of the information that can be extracted from GPS mobility data, consider the case where researchers queried the travel time for over 1000 origin and destination pairs every hour using Google and Mapbox. The resultant dataset contained information on peak and off-peak travel hours for the months April-October as well as average speeds during those hours.
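To make the travel-time example concrete, the following minimal sketch derives average speeds by hour of day and picks out the most congested hour. It assumes each query result has already been reduced to a tuple of hour, route distance, and travel time; that layout, the function names, and the sample numbers are illustrative assumptions, not the researchers' actual pipeline or any provider's API.

```python
from statistics import mean


def average_speed_by_hour(observations):
    """Return {hour: mean speed in km/h} from travel-time observations.

    Each observation is (hour_of_day, distance_km, travel_time_minutes).
    """
    speeds = {}
    for hour, distance_km, minutes in observations:
        # km per minute times 60 gives km/h
        speeds.setdefault(hour, []).append(distance_km * 60 / minutes)
    return {hour: mean(values) for hour, values in speeds.items()}


def peak_hour(speed_by_hour):
    """The hour with the lowest average speed, i.e. the most congested."""
    return min(speed_by_hour, key=speed_by_hour.get)


# Made-up observations for two origin-destination pairs at two hours.
observations = [
    (8, 10.0, 40.0),
    (8, 5.0, 15.0),
    (14, 10.0, 20.0),
    (14, 5.0, 7.5),
]
speed_by_hour = average_speed_by_hour(observations)
busiest = peak_hour(speed_by_hour)
```

Averaging speeds per pair, as done here, weights every origin-destination pair equally; weighting by distance would be an equally defensible choice.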
Some major advantages of working with mobile phone data are its automatic generation, low marginal cost, and absence of a response burden. However, disadvantages include difficulty in accessing this data due to proprietary and privacy concerns as well as the fact that not everybody owns a phone.
For other types of secondary data, see the Secondary Data Sources page. It contains information on secondary data not covered here, including geospatial data, remote sensing, telecom data, and crowd-sourced data.
Mobile Big Data
A type of secondary data, mobile big data (MBD) is anonymized, aggregated data generated from personal mobile devices and mobile network operators (MNOs). It consists of information about the phone number of the caller, the number of the receiver, the length of the call, the phone towers associated with the call, and the type of mobile phone. There is ongoing research to harness this information to track population trends, augment statistics, and deliver policy insights which can be used to provide targeted services. In response to the COVID-19 pandemic and the push towards sustainable growth, nearly 80% of national statistical offices (NSOs) have indicated they want to improve their use of MBD. To show how effective this technology can be, consider the estimated impact the adoption of MBD would have in Sub-Saharan Africa:
- 60 million people could have better access to healthcare due to better positioning of health care services
- 120 million people saved because of better-informed measures to limit air pollution
- Cost effective: $30 for every $1 invested in Integrated National Data Systems
MBD in Policy
The top sources of MBD for policy use are call detail records (CDR) and GPS data. CDR is metadata on voice, text, and other data points collected by MNOs. There are two advantages to using CDR:
- More representative of the bottom 40% of the population in low-income countries
- Event driven (voice/text), with medium spatial and temporal resolution mapped to the closest cell tower
And four obstacles:
- Difficult to access (sensitive or proprietary information)
- Needs high performance computing and storage
- Requires data sharing arrangements with MNOs
- Requires local capacity to analyze
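The following minimal sketch shows one way event-level CDR could be aggregated into tower-level counts before analysis. The record fields (`caller`, `tower_id`, `timestamp`), the salt, and the function names are hypothetical, since actual CDR schemas vary by operator. Salted hashing as used here is only pseudonymization; it is not a substitute for the safeguards a data license agreement would specify.

```python
import hashlib
from collections import defaultdict
from datetime import datetime

SALT = "replace-with-secret-salt"  # hypothetical; would be held by the data owner


def pseudonymize(phone_number):
    """Replace a phone number with a salted hash so raw numbers are not stored."""
    digest = hashlib.sha256((SALT + phone_number).encode("utf-8")).hexdigest()
    return digest[:16]


def active_subscribers(records):
    """Count distinct pseudonymized callers per (tower, hour)."""
    seen = defaultdict(set)
    for rec in records:
        stamp = datetime.fromisoformat(rec["timestamp"])
        hour = stamp.replace(minute=0, second=0, microsecond=0)
        seen[(rec["tower_id"], hour)].add(pseudonymize(rec["caller"]))
    return {key: len(callers) for key, callers in seen.items()}


# Made-up records; numbers and tower IDs are placeholders.
records = [
    {"caller": "5550001", "tower_id": "A", "timestamp": "2024-05-01T08:15:00"},
    {"caller": "5550001", "tower_id": "A", "timestamp": "2024-05-01T08:40:00"},
    {"caller": "5550002", "tower_id": "A", "timestamp": "2024-05-01T08:55:00"},
    {"caller": "5550002", "tower_id": "B", "timestamp": "2024-05-01T09:05:00"},
]
counts = active_subscribers(records)
```

Because events are tied to the closest cell tower, the spatial resolution of such counts is bounded by tower density, which is the "medium" resolution noted above.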
GPS data consists of location coordinates generated from usage of location-enabled smartphone applications. Chipsets on smartphones communicate with global navigation satellite systems (GNSS). As always, there is a mix of pros and cons to its usage, the advantages being:
- High spatial and temporal resolution (meter-level)
- Readily accessible via third-party aggregators or big tech products (Cuebiq, Veraset)
And the cons being:
- Less representative of the bottom 40% of the population in low-income countries
- High performance computing and technical capacity may be needed to process raw data
Those are the sources of MBD in policy. But what about the goals? Broadly speaking, there are five areas where MBD is envisioned to play a role:
- Dynamic Population Mapping: Population dynamics and characteristics can be used to inform a wide range of policy indicators
- Migration Statistics: CDR and GPS data can be used to understand and predict human mobility patterns
- Displacement and Disaster: CDR data can be useful for producing statistical information to supplement traditional survey data in disaster contexts.
- Information Society: Produce internationally agreed information and communication technology (ICT) indicators that are included in the SDG monitoring framework
- Tourism: MBD is an alternative source for generating and/or filling the gap in tourism statistics.
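As one illustration of the mobility applications listed above, the sketch below infers a "home" tower for each subscriber as the tower they use most often at night, then flags subscribers whose inferred home changes between two months. This heuristic is a common simplification and is not presented as the method used in any particular study; the night-hour window, the event layout, and the function names are all assumptions.

```python
from collections import Counter, defaultdict

# Assumed definition of "night": 20:00 through 05:59.
NIGHT_HOURS = set(range(20, 24)) | set(range(0, 6))


def home_towers(events):
    """Map (subscriber, month) to the tower used most often at night.

    events: iterable of (subscriber_id, month, hour_of_day, tower_id).
    """
    counts = defaultdict(Counter)
    for subscriber, month, hour, tower in events:
        if hour in NIGHT_HOURS:
            counts[(subscriber, month)][tower] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}


def movers(homes, month_a, month_b):
    """Subscribers whose inferred home tower differs between two months."""
    moved = set()
    for (subscriber, month), tower in homes.items():
        if month != month_a:
            continue
        later = homes.get((subscriber, month_b))
        if later is not None and later != tower:
            moved.add(subscriber)
    return moved


# Made-up events using pseudonymous subscriber IDs.
events = [
    ("s1", "2024-04", 22, "A"),
    ("s1", "2024-04", 23, "A"),
    ("s1", "2024-04", 12, "C"),  # daytime event, ignored for home inference
    ("s1", "2024-05", 21, "B"),
    ("s1", "2024-05", 2, "B"),
    ("s2", "2024-04", 1, "A"),
    ("s2", "2024-05", 3, "A"),
]
homes = home_towers(events)
moved = movers(homes, "2024-04", "2024-05")
```

Subscribers with no nighttime activity in a month get no inferred home and are silently excluded, which is one way the coverage limits of CDR can bias such statistics.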
Challenges
As with all evolving technological fields, there are challenges to using MBD. These are:
- Variation: The maturity of Integrated National Data Systems varies, which adds complexity
- Tools: Special software / hardware must be provisioned on MNO network to store / process CDR.
- Safeguards: Make sure to have good practices for data security, privacy preservation, and legal protections.
- Standards: Guidance for developing measurements and official statistics as well as standard data sharing agreements
- Capacity: Low- and middle-income countries (L/MICs) often lack capacity to repurpose MBD into policy data products
- Funding: Funding to date has been for one-off projects; programmatic funding is needed
- Access: Median ownership of all types of phones that allow for collection of MBD is lower in emerging economies
Additional Resources
- DIME Analytics (World Bank), Integrated Data Systems for Monitoring & Impact Evaluation (https://osf.io/rv4h5)
- DIME Analytics (World Bank), Acquiring Secondary Data (https://osf.io/36nyq)