https://dimewiki.worldbank.org/api.php?action=feedcontributions&user=Kbjarkefur&feedformat=atomDimewiki - User contributions [en]2024-03-29T00:50:32ZUser contributionsMediaWiki 1.37.2https://dimewiki.worldbank.org/index.php?title=Main_Page&diff=9381Main Page2023-10-23T15:59:04Z<p>Kbjarkefur: link to handbook</p>
<hr />
<div>__NOTOC__<br />
<br />
<div class="home_top"><br />
<div class="policy_intro container"><br />
<div class="row"><br />
<div class="col-md-2"></div><br />
<div class="col-md-8"><br />
<h3 class="mega_title text-center"><b>Welcome to the DIME Wiki</b></h3> <br />
<div class="policy_desc"><br />
<p>The <strong>DIME Wiki</strong> is a <strong>public good</strong> developed and maintained by <strong>DIME Analytics</strong>. The <strong>DIME Wiki</strong> is designed for researchers and M&E specialists at the World Bank, as well as clients, donor institutions, universities, NGOs, and governments. The <strong>DIME Wiki</strong> is a collaborative, open-source resource that presents guidelines that are easy to understand and apply for users of varying levels of expertise. </p><br />
<p style="color:black;text-align:center;"><b>This landing page offers links to several curated categories that users may find helpful, as well as our narrative [https://openknowledge.worldbank.org/handle/10986/35594 Handbook].</b></p><br />
</div><br />
</div><br />
<div class="col-md-2"></div><br />
</div><br />
</div><br />
<div class="container bottomspace"><br />
<br />
<div class="row"><br />
<div class="col-sm-1"></div><br />
<div class="col-sm-10" style= "width:85% !important;" ><br />
<div class="col-sm-6 rg_red hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Design"><i class="fa fa-sitemap"></i> Research Design</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Experimental Methods]]</li><br />
<li>[[Quasi-Experimental Methods]]</li><br />
<li>[[Research Ethics]]</li><br />
<li>[[Power Calculations]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
<div class="col-sm-6 rg_purple hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Data"><i class="fa fa-database"></i> Data Collection</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Primary Data Collection]]</li><br />
<li>[[Secondary Data Sources]]</li><br />
<li>[[Field Management]]</li><br />
<li>[[Questionnaire Design]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
<div class="col-sm-6 rg_green hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Analysis"><i class="fa fa-search"></i> Analysis</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Data Management]]</li><br />
<li>[[Data Cleaning]]</li><br />
<li>[[Data Analysis]]</li><br />
<li>[[Software Tools]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
<div class="col-sm-6 rg_blue hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Fieldwork"><i class="fa fa-globe"></i> Publication</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Reproducible Research]]</li><br />
<li>[[Publishing Data]]</li><br />
<li>[[Collaboration Tools]]</li><br />
<li>[[Dissemination]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
</div><br />
</div><br />
</div><br />
<br />
<div class="well"><br />
<div class="container"><br />
<br />
<h3 class="mega_title"><span class="mw-headline" id="Recent_Contributions"><b>Cross-cutting Resources</b></span></h3><br />
<div class="recent-activity"><br />
<br />
<div class="row"> <br />
<div class="col-md-4"> <br />
<div class="rec_contrib_box"><br />
<br />
<div class="rec_title">[[Stata Coding Practices]]</div><br />
</div><br />
</div><br />
<div class="col-md-4"> <br />
<div class="rec_contrib_box"><br />
<br />
<div class="rec_title">[[SurveyCTO Coding Practices]]</div><br />
</div><br />
</div><br />
<div class="col-md-4"> <br />
<div class="rec_contrib_box"><br />
<br />
<div class="rec_title">[[Check Lists]]</div><br />
</div><br />
</div><br />
</div><!--end .row--><br />
</div><br />
<br />
<br />
<br />
<h3 class="mega_title"><span class="mw-headline" id="Recent_Contributions"><b>Download the Dime Analytics Data Handbook </b></span></h3><br />
<div class="rec_contrib_box"><br />
<div class="dk_bar"><br />
[[File:BookBanner.png|700px||center|link=https://worldbank.github.io/dime-data-handbook/]]<br />
</div><br />
</div><br />
<br />
<div class="random"> <span class="hiddentext"><randompages limit="0" namespace="DIME Wiki" levels="1"></randompages></span></div><br />
<br />
</div><br />
</div><br />
<br />
<div class="policy_intro container"><br />
<div class="row"><br />
<div class="col-md-2"></div><br />
<div class="col-md-8"><br />
<h3 class="mega_title text-center"><b>About DIME and DIME Analytics</b></h3><br />
<div class="policy_desc"><br />
<onlyinclude><br />
[https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics] creates tools that improve the quality of impact evaluation research for all. We take advantage of the concentration and scale of research at DIME to develop and test solutions to ensure data work quality across our portfolio, and to make public training and tools available to the larger community of development researchers who might not have the same capabilities.<br />
<br />
[https://www.worldbank.org/en/research/dime DIME] is the World Bank’s impact evaluation department. Part of DIME’s mission is to intensify the production of, and access to, public goods that improve the quantity and quality of global development research, while lowering the costs of performing impact evaluations for the entire research community. The <strong>DIME Wiki</strong> aims to further this initiative, and is funded by the United Kingdom’s Foreign, Commonwealth & Development Office through the i2i Trust Fund.<br />
</div><br />
<br />
</div><br />
<div class="col-md-2"></div><br />
</div><br />
</div><br />
<br />
</div></div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Stata_Coding_Practices&diff=9380Stata Coding Practices2023-08-31T07:36:36Z<p>Kbjarkefur: /* File Paths */</p>
<hr />
<div>Researchers use Stata in all stages of an '''impact evaluation''' (or study), such as [[Sampling & Power Calculations |sampling]], [[Randomization in Stata | randomizing]], [[Monitoring Data Quality | monitoring data quality]], [[Data Cleaning | cleaning]], and [[Data Analysis | analysis]]. Good '''Stata coding practices''', packages, and commands are a critical component of high quality [[Reproducible Research | reproducible research]]. These practices also allow the [[Impact Evaluation Team|impact evaluation team]] (or research team) to save time and energy, and focus on other [[Randomized Evaluations: Principles of Study Design|aspects of study design]]. <br />
==Read First==<br />
* [https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics] and institutions like [https://github.com/PovertyAction Innovations for Poverty Action (IPA)] offer a wide range of resources - tutorials, sample codes, and easy-to-install packages and commands.<br />
* <code>[https://github.com/worldbank/iefieldkit/ iefieldkit]</code> is a Stata package that standardizes '''best practices''' (guidelines) for high quality, [[Reproducible Research | reproducible]] [[Primary Data Collection | primary data collection]].<br />
* <code>[https://worldbank.github.io/ietoolkit/ ietoolkit]</code> is a Stata package that standardizes best practices in [[Data Management|data management]] and [[Data Analysis|data analysis]]. <br />
* As with other Stata packages like [https://www.stata-journal.com/article.html?article=gr0059 <code>coefplot</code>], use <syntaxhighlight lang="Stata" inline>ssc install</syntaxhighlight> to download these packages.<br />
* Other common Stata best practices, for instance, with respect to naming file paths, also contribute to successful impact evaluations.<br />
<br />
== iefieldkit ==<br />
DIME has developed the <code>[[iefieldkit]]</code> package for Stata to simplify the process of [[Primary Data Collection|primary data collection]]. The package currently supports three major components of this '''workflow''' (process) - [[Questionnaire Design|survey design]], [[Iecompdup|survey completion]], and [[Data Cleaning|data cleaning]] and [[Iecodebook#Harmonize| data harmonization]]. <code>[[iefieldkit]]</code> uses four commands to simplify each of these tasks:<br />
* '''Before data collection.''' The <code>[[ietestform]]</code> command tests the collected data to make sure it follows '''best practices''' in naming, coding, and labeling. For instance, it does not let an '''enumerator''' move to the next field until they enter a response, thus ensuring that incomplete forms can not be submitted. <br />
* '''During data collection.''' The <code>[[ieduplicates]]</code> and <code>[[Iecompdup|iecompdup]]</code> commands allow the [[Impact Evaluation Team|research team]] to '''detect''' (identify) and '''resolve''' (deal with) duplicate entries in the data set. These commands were previously a part of the <code>[[Stata Coding Practices#ietoolkit|ietoolkit]]</code> package, but are now part of the <code>[[iefieldkit]]</code> package.<br />
* '''After data collection.''' The <code>[[iecodebook]]</code> command provides a method for rapidly [[Data Cleaning|cleaning]], [[iecodebook#Harmonize|harmonizing]], and [[Data Documentation|documenting]] data sets. <br />
To install the <code>[[iefieldkit]]</code> package, type <syntaxhighlight lang="Stata" inline>ssc install iefieldkit</syntaxhighlight> in your Stata command window. Note that some features of this package might require '''meta data''' (information) that is specific to '''SurveyCTO''', but users can still test them in other cases.<br />
<br />
== ietoolkit ==<br />
DIME has developed the <code>[[Ietoolkit|ietoolkit]]</code> package for Stata to simplify the process of [[Data Management|data management]] and [[Data Analysis|analysis]] in impact evaluations. Given below are the list of commands that are currently part of this package. <br />
* '''Data management.'''<br />
** <code>[[iefolder]]</code> sets up a '''standardized''' (common) structure for all folders that are shared as part of a project, that is the '''project folder'''. It creates [[Master Do-files|master do-files]] that link to all '''sub-folders''' (folders within another folder), so that the project folder is automatically updated every time more data or files are shared from the '''field teams'''. This command helps create [[Reproducible Research|reproducible research]].<br />
** <code>[[iegitaddmd]]</code> allows members of the research team to share a '''template''' (outline) folder for a new project on GitHub even if it is empty. This command creates a '''placeholder''' that can be updated later when a file is added to that folder. For example, templates often include an output folder where the results of [[Data Analysis|data analysis]] will be stored. This folder remains empty until the data set is [[Data Cleaning|cleaned]] to prepare it for analysis. Using this command, two people, say A and B, can still share this folder with each other on GitHub.<br />
** <code>[[ieboilstart]]</code> standardizes the '''version''', '''capacity''' (in terms of the number of observations it can store in memory), and other Stata settings for all users in a project. This command should be '''run''' (typed) at the top of all do-files that are shared between members of the [[Impact Evaluation Team|research team]]. Such a code is called a '''boilerplate code''', since it standardizes the code at the beginning for all do-files. <br />
An example of a code that uses these commands is given below:<br />
<syntaxhighlight lang="stata" line>ieboilstart, version(14.0) //Standardizes the version for everyone.<br />
<br />
global folder "C:/Users/username/DropBox/ProjectABC" <br />
<br />
iefolder new project, projectfolder("$folder") //Sets up the main structure<br />
<br />
iegitaddmd, folder ("$folder") //Makes sure users can share the main folder on GitHub even if it is empty </syntaxhighlight><br />
* '''Data analysis.''' <br />
** <code>[[iematch]]</code> is a command which can be used for matching observations in one group to observations in another group which are the closest in terms of a particular characteristic. <br>For example, consider a study which is designed to evaluate the impact of randomly providing cash transfers to half the workers in a firm. The research team can use <code>[[iematch]]</code> to match and compare wages of women in the '''treatment''' group (which received the cash transfers) with observations in a '''control''' group (which did not receive the cash transfers). <br />
** <code>[[iebaltab]]</code> runs [[Balance tests|balance tests]], and produces '''balance tables''' which show the difference in means for one or more '''treatment''' groups. It can be used to check if there are '''statistically significant''' differences between the '''treatment''' and '''control''' groups. In case there are significant differences in the means, <code>[[iebaltab]]</code> even displays an error message that suggests that results from such data can be wrongly interpreted.<br />
** <code>[[iedropone]]</code> drops only a specific number of observations, and makes sure that no additional observations are dropped.<br />
** <code>[[ieboilsave]]</code> performs checks to ensure that '''best practices''' are followed before saving a data set.<br />
** <code>[[ieddtab]]</code> runs [[Difference-in-Differences | difference-in-difference]] regressions and displays the result in well-formatted tables.<br />
** <code>[[iegraph]]</code> produces graphs of results from regression models that researchers commonly use during impact evaluations.<br />
To install the <code>ietoolkit</code>, type <syntaxhighlight lang="Stata" inline>ssc install ietoolkit</syntaxhighlight> in your Stata command window.<br />
<br />
== File Paths==<br />
DIME Analytics suggests the following guidelines for specifying '''file paths''' in Stata: <br />
* '''Double quotes (<code>"</code>).''' Always enclose file paths in double quotes (<code>"</code>) . For example, <syntaxhighlight lang="Stata" inline>"${maindir}"</syntaxhighlight>.<br />
* '''Forward slashes (<code>/</code>).''' Always use forward slashes (<code>/</code>) to specify folder '''hierarchies''', that is, the exact location of a folder inside another folder, and so on. For example, <code>"C:/Users/username/Documents"</code>. This is important because Mac and Linux computers cannot read file paths with '''back slashes'''(<code>\</code>). <br />
* '''File extension.''' Always include the file extension in the file path, such as <code>.dta</code>, <code>.do</code>, or <code>.csv</code>. This helps to avoid '''ambiguity''' (or doubt) if another file with the same name exists.<br />
<br />
'''''Dynamic and absolute file paths'''''.<br />
<br />
Relative file paths exists in Stata but is implemented differently in Stata compared to many other computer languages. One should therefore use caution when translating practices that builds on relative file paths from other languages into Stata.<br />
<br />
Therefore, it is common to use ''dynamic'' and ''absolute'' file paths in Stata. A file path is '''absolute''' when it begins from the '''root folder''' of the computer, for example, <code>C:/</code> on a PC or <code>/Users/</code> on a Mac. This guarantees that a each file path only can corresponds to a single location in the file system, no matter what the working directory is set to. <br />
<br />
In contrast, relative file path points to a different location each time the working directory is changed. In a collaborative context your file paths might start to point to other locations on your computer if someone in your team introduce code that use <code>cd</code> to change the directory. The types of errors this can lead to are not possible when a team use absolute paths.<br />
<br />
However, in absolute paths, the first part of the file path is almost always unique to each user. To make this work, you need to create a file path that is both '''dynamic''' and absolute. An absolute file path is dynamic if it sets the first part of the path dynamically with code. This means that users set '''globals''' (global macros) located in the [[Master Do-files|main do-files]] to specify the root part of file paths. The root part is the part of the file path that differs between all users. <br />
<br />
There are other ways to solve the same problem, but dynamic absolute file paths is considered a very generalizable method with few and simple steps to learn.<br />
<br />
=== Examples ===<br />
* Dynamic and absolute file path.<br />
<syntaxhighlight lang="stata" line>global root "C:/Users/username/Documents"<br />
global myProject "${root}/MyProject"<br />
use "${myProject}/MyDataset.dta"</syntaxhighlight><br />
* Non-absolute, non-dynamic file path.<br />
<syntaxhighlight lang="stata" line>cd "C:/Users/username/Documents/MyProject"<br />
use MyDataset.dta</syntaxhighlight><br />
* Absolute, but non-dynamic file path.<br />
<syntaxhighlight lang="stata" line>cd "C:/Users/username/Documents/MyProject" <br />
use "C:/Users/username/Documents/MyProject/MyDataset.dta"</syntaxhighlight><br />
<br />
== Exporting Tables ==<br />
Tables play a crucial role in representing the results of a study in an easy-to-understand format. However, it is common to copy-and-paste results from Stata, and format them in a word-processing software, which affects the [[Reproducible Research|reproducibility of research]]. [https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics] has therefore created the following resources for exporting tables in Stata:<br />
* [[Checklist:_Submit_Table|Checklist for submitting tables in development research]]<br />
* [https://osf.io/78nuc/ Nice and fast tables in Stata for LaTex and Excel]<br />
* [https://github.com/worldbank/stata-tables GitHub - Stata tables] is a repository with do-files and output tables. Use these to practice exporting tables using the <code>esttab</code> command. <br />
* [https://blogs.worldbank.org/impactevaluations/nice-and-fast-tables-stata Blog post on Stata tables]<br />
<br />
== Related Pages ==<br />
[[Special:WhatLinksHere/Stata_Coding_Practices|Click here for pages that link to this topic.]]<br />
<br />
== Additional Resources ==<br />
* DIME Analytics (World Bank), [https://osf.io/36hys Basics of Programming in Stata]<br />
* DIME Analytics (World Bank), [https://osf.io/zatqj Statistical Programming 101]<br />
* DIME Analytics (World Bank), [https://github.com/vikjam/mostly-harmless-replication Mostly Harmless Replication]<br />
* DIME Analytics (World Bank, [https://gist.github.com/kbjarkefur/16b63c1fc89ab52c3d4cae9c74288452 Sharing sub-functions between different commands]. Download the <code>.ado</code> files and follow the instructions.<br />
* DIME Analytics (World Bank), [https://worldbank.github.io/Stata-IE-Visual-Library/ Stata visual library]<br />
* DIME Analytics (World Bank), [https://osf.io/mw965 Data Management]<br />
* DIME Analytics (World Bank), [https://osf.io/msh8r ietoolkit and iefieldkit- introduction]<br />
* DIME Analytics (World Bank), [https://osf.io/4tbkr ietoolkit- follow up slides]<br />
* DIME Analytics (World Bank), [https://osf.io/t48ug Data Quality Assurance].<br />
* DIME Analytics (World Bank), [https://osf.io/nzbvu Data Cleaning and Documentation in Stata (Intro)].<br />
* DIME Analytics (World Bank), [https://osf.io/juxcb Data Cleaning in Stata].<br />
* DIME (World Bank), [[Checklist: Submit Table| Checklist on submitting results.]] <br />
* David McKenzie (World Bank), [https://blogs.worldbank.org/impactevaluations/updated-overview-multiple-hypothesis-testing-commands-stata An updated overview of multiple hypothesis testing commands in Stata]<br />
* Gentzkow and Shapiro (Stanford) [http://web.stanford.edu/~gentzkow/research/CodeAndData.pdf Code and Data for the Social Sciences] <br />
* The GeoCenter, [http://geocenter.github.io/StataTraining/portfolio/01_resource/ Stata cheat sheets.]<br />
* Innovations for Poverty Action, [http://www.poverty-action.org/researchers/research-resources/stata-programs Stata modules for data collection and analysis] <br />
* Innovations for Poverty Action, [https://github.com/PovertyAction GitHub repository on impact evaluations] <br />
* Innovations for Poverty Action, [https://github.com/PovertyAction/odkmeta Odkmeta command]. This command writes a do-file to import ODK (Open Data Kit) data to Stata, using metadata from the survey and choices worksheets of the XLSForm.<br />
* J-PAL, [https://www.povertyactionlab.org/sites/default/files/resources/IAPStataWorkshopSlides.pdf Programming with Stata]<br />
* Princeton, [https://www.princeton.edu/~otorres/StataTutorial.pdf Data analysis in Stata for beginners] <br />
* Standford, [https://web.stanford.edu/~leinav/teaching/econ257/STATA.pdf Basics of Stata] <br />
* World Bank, [https://worldbank.github.io/stata/ Stata repository].<br />
[[Category: Coding Practices]]<br />
[[Category: Reproducible Research]]<br />
[[Category: Stata Coding Practices]]<br />
[[Category: Technical Tools]]</div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Stata_Coding_Practices&diff=9379Stata Coding Practices2023-08-29T13:52:06Z<p>Kbjarkefur: Update about relative file paths in Stata</p>
<hr />
<div>Researchers use Stata in all stages of an '''impact evaluation''' (or study), such as [[Sampling & Power Calculations |sampling]], [[Randomization in Stata | randomizing]], [[Monitoring Data Quality | monitoring data quality]], [[Data Cleaning | cleaning]], and [[Data Analysis | analysis]]. Good '''Stata coding practices''', packages, and commands are a critical component of high quality [[Reproducible Research | reproducible research]]. These practices also allow the [[Impact Evaluation Team|impact evaluation team]] (or research team) to save time and energy, and focus on other [[Randomized Evaluations: Principles of Study Design|aspects of study design]]. <br />
==Read First==<br />
* [https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics] and institutions like [https://github.com/PovertyAction Innovations for Poverty Action (IPA)] offer a wide range of resources - tutorials, sample codes, and easy-to-install packages and commands.<br />
* <code>[https://github.com/worldbank/iefieldkit/ iefieldkit]</code> is a Stata package that standardizes '''best practices''' (guidelines) for high quality, [[Reproducible Research | reproducible]] [[Primary Data Collection | primary data collection]].<br />
* <code>[https://worldbank.github.io/ietoolkit/ ietoolkit]</code> is a Stata package that standardizes best practices in [[Data Management|data management]] and [[Data Analysis|data analysis]]. <br />
* As with other Stata packages like [https://www.stata-journal.com/article.html?article=gr0059 <code>coefplot</code>], use <syntaxhighlight lang="Stata" inline>ssc install</syntaxhighlight> to download these packages.<br />
* Other common Stata best practices, for instance, with respect to naming file paths, also contribute to successful impact evaluations.<br />
<br />
== iefieldkit ==<br />
DIME has developed the <code>[[iefieldkit]]</code> package for Stata to simplify the process of [[Primary Data Collection|primary data collection]]. The package currently supports three major components of this '''workflow''' (process) - [[Questionnaire Design|survey design]], [[Iecompdup|survey completion]], and [[Data Cleaning|data cleaning]] and [[Iecodebook#Harmonize| data harmonization]]. <code>[[iefieldkit]]</code> uses four commands to simplify each of these tasks:<br />
* '''Before data collection.''' The <code>[[ietestform]]</code> command tests the collected data to make sure it follows '''best practices''' in naming, coding, and labeling. For instance, it does not let an '''enumerator''' move to the next field until they enter a response, thus ensuring that incomplete forms can not be submitted. <br />
* '''During data collection.''' The <code>[[ieduplicates]]</code> and <code>[[Iecompdup|iecompdup]]</code> commands allow the [[Impact Evaluation Team|research team]] to '''detect''' (identify) and '''resolve''' (deal with) duplicate entries in the data set. These commands were previously a part of the <code>[[Stata Coding Practices#ietoolkit|ietoolkit]]</code> package, but are now part of the <code>[[iefieldkit]]</code> package.<br />
* '''After data collection.''' The <code>[[iecodebook]]</code> command provides a method for rapidly [[Data Cleaning|cleaning]], [[iecodebook#Harmonize|harmonizing]], and [[Data Documentation|documenting]] data sets. <br />
To install the <code>[[iefieldkit]]</code> package, type <syntaxhighlight lang="Stata" inline>ssc install iefieldkit</syntaxhighlight> in your Stata command window. Note that some features of this package might require '''meta data''' (information) that is specific to '''SurveyCTO''', but users can still test them in other cases.<br />
<br />
== ietoolkit ==<br />
DIME has developed the <code>[[Ietoolkit|ietoolkit]]</code> package for Stata to simplify the process of [[Data Management|data management]] and [[Data Analysis|analysis]] in impact evaluations. Given below are the list of commands that are currently part of this package. <br />
* '''Data management.'''<br />
** <code>[[iefolder]]</code> sets up a '''standardized''' (common) structure for all folders that are shared as part of a project, that is the '''project folder'''. It creates [[Master Do-files|master do-files]] that link to all '''sub-folders''' (folders within another folder), so that the project folder is automatically updated every time more data or files are shared from the '''field teams'''. This command helps create [[Reproducible Research|reproducible research]].<br />
** <code>[[iegitaddmd]]</code> allows members of the research team to share a '''template''' (outline) folder for a new project on GitHub even if it is empty. This command creates a '''placeholder''' that can be updated later when a file is added to that folder. For example, templates often include an output folder where the results of [[Data Analysis|data analysis]] will be stored. This folder remains empty until the data set is [[Data Cleaning|cleaned]] to prepare it for analysis. Using this command, two people, say A and B, can still share this folder with each other on GitHub.<br />
** <code>[[ieboilstart]]</code> standardizes the '''version''', '''capacity''' (in terms of the number of observations it can store in memory), and other Stata settings for all users in a project. This command should be '''run''' (typed) at the top of all do-files that are shared between members of the [[Impact Evaluation Team|research team]]. Such a code is called a '''boilerplate code''', since it standardizes the code at the beginning for all do-files. <br />
An example of a code that uses these commands is given below:<br />
<syntaxhighlight lang="stata" line>ieboilstart, version(14.0) //Standardizes the version for everyone.<br />
<br />
global folder "C:/Users/username/DropBox/ProjectABC" <br />
<br />
iefolder new project, projectfolder("$folder") //Sets up the main structure<br />
<br />
iegitaddmd, folder ("$folder") //Makes sure users can share the main folder on GitHub even if it is empty </syntaxhighlight><br />
* '''Data analysis.''' <br />
** <code>[[iematch]]</code> is a command which can be used for matching observations in one group to observations in another group which are the closest in terms of a particular characteristic. <br>For example, consider a study which is designed to evaluate the impact of randomly providing cash transfers to half the workers in a firm. The research team can use <code>[[iematch]]</code> to match and compare wages of women in the '''treatment''' group (which received the cash transfers) with observations in a '''control''' group (which did not receive the cash transfers). <br />
** <code>[[iebaltab]]</code> runs [[Balance tests|balance tests]], and produces '''balance tables''' which show the difference in means for one or more '''treatment''' groups. It can be used to check if there are '''statistically significant''' differences between the '''treatment''' and '''control''' groups. In case there are significant differences in the means, <code>[[iebaltab]]</code> even displays an error message that suggests that results from such data can be wrongly interpreted.<br />
** <code>[[iedropone]]</code> drops only a specific number of observations, and makes sure that no additional observations are dropped.<br />
** <code>[[ieboilsave]]</code> performs checks to ensure that '''best practices''' are followed before saving a data set.<br />
** <code>[[ieddtab]]</code> runs [[Difference-in-Differences | difference-in-difference]] regressions and displays the result in well-formatted tables.<br />
** <code>[[iegraph]]</code> produces graphs of results from regression models that researchers commonly use during impact evaluations.<br />
To install the <code>ietoolkit</code>, type <syntaxhighlight lang="Stata" inline>ssc install ietoolkit</syntaxhighlight> in your Stata command window.<br />
<br />
== File Paths==<br />
DIME Analytics suggests the following guidelines for specifying '''file paths''' in Stata: <br />
* '''Double quotes (<code>"</code>).''' Always enclose file paths in double quotes (<code>"</code>) . For example, <syntaxhighlight lang="Stata" inline>"${maindir}"</syntaxhighlight>.<br />
* '''Forward slashes (<code>/</code>).''' Always use forward slashes (<code>/</code>) to specify folder '''hierarchies''', that is, the exact location of a folder inside another folder, and so on. For example, <code>"C:/Users/username/Documents"</code>. This is important because Mac and Linux computers cannot read file paths with '''back slashes'''(<code>\</code>). <br />
* '''File extension.''' Always include the file extension in the file path, such as <code>.dta</code>, <code>.do</code>, or <code>.csv</code>. This helps to avoid '''ambiguity''' (or doubt) if another file with the same name exists.<br />
<br />
'''''Dynamic and absolute file paths'''''.<br />
<br />
Relative file paths exists in Stata but is implemented differently in Stata compared to many other computer languages. One should therefore use caution when translating practices that builds on relative file paths from other languages into Stata.<br />
<br />
Therefore, it is common to use ''dynamic'' and ''absolute'' file paths in Stata. A file path is '''absolute''' when it begins from the '''root folder''' of the computer, for example, <code>C:/</code> on a PC or <code>/Users/</code> on a Mac. This guarantees that a each file path only can corresponds to a single location in the file system, no matter what the working directory is set to. <br />
<br />
In contrast, relative file path points to a different location each time the working directory is changed. In a collaborative context your file paths might start to point to other locations on your computer if someone in your team introduce code that use <code>cd</code> to change the directory. The types of errors this can lead to are not possible when a team use absolute paths.<br />
<br />
However, in absolute paths, the first part of the file path is almost always unique to each user. To make this work, you need to create a file path that is both '''dynamic''' and absolute. An absolute file path is dynamic if it sets the first part of the path dynamically with code. This means that users set '''globals''' (global macros) located in the [[Master Do-files|main do-files]] to specify the root part of file paths. The root part is the part of the file path that differs between all users. <br />
<br />
=== Examples ===<br />
* Dynamic and absolute file path.<br />
<syntaxhighlight lang="stata" line>global root "C:/Users/username/Documents"<br />
global myProject "${root}/MyProject"<br />
use "${myProject}/MyDataset.dta"</syntaxhighlight><br />
* Non-absolute, non-dynamic file path.<br />
<syntaxhighlight lang="stata" line>cd "C:/Users/username/Documents/MyProject"<br />
use MyDataset.dta</syntaxhighlight><br />
* Absolute, but non-dynamic file path.<br />
<syntaxhighlight lang="stata" line>cd "C:/Users/username/Documents/MyProject" <br />
use "C:/Users/username/Documents/MyProject/MyDataset.dta"</syntaxhighlight><br />
<br />
== Exporting Tables ==<br />
Tables play a crucial role in representing the results of a study in an easy-to-understand format. However, it is common to copy-and-paste results from Stata, and format them in a word-processing software, which affects the [[Reproducible Research|reproducibility of research]]. [https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics] has therefore created the following resources for exporting tables in Stata:<br />
* [[Checklist:_Submit_Table|Checklist for submitting tables in development research]]<br />
* [https://osf.io/78nuc/ Nice and fast tables in Stata for LaTex and Excel]<br />
* [https://github.com/worldbank/stata-tables GitHub - Stata tables] is a repository with do-files and output tables. Use these to practice exporting tables using the <code>esttab</code> command. <br />
* [https://blogs.worldbank.org/impactevaluations/nice-and-fast-tables-stata Blog post on Stata tables]<br />
<br />
== Related Pages ==<br />
[[Special:WhatLinksHere/Stata_Coding_Practices|Click here for pages that link to this topic.]]<br />
<br />
== Additional Resources ==<br />
* DIME Analytics (World Bank), [https://osf.io/36hys Basics of Programming in Stata]<br />
* DIME Analytics (World Bank), [https://osf.io/zatqj Statistical Programming 101]<br />
* DIME Analytics (World Bank), [https://github.com/vikjam/mostly-harmless-replication Mostly Harmless Replication]<br />
* DIME Analytics (World Bank, [https://gist.github.com/kbjarkefur/16b63c1fc89ab52c3d4cae9c74288452 Sharing sub-functions between different commands]. Download the <code>.ado</code> files and follow the instructions.<br />
* DIME Analytics (World Bank), [https://worldbank.github.io/Stata-IE-Visual-Library/ Stata visual library]<br />
* DIME Analytics (World Bank), [https://osf.io/mw965 Data Management]<br />
* DIME Analytics (World Bank), [https://osf.io/msh8r ietoolkit and iefieldkit- introduction]<br />
* DIME Analytics (World Bank), [https://osf.io/4tbkr ietoolkit- follow up slides]<br />
* DIME Analytics (World Bank), [https://osf.io/t48ug Data Quality Assurance].<br />
* DIME Analytics (World Bank), [https://osf.io/nzbvu Data Cleaning and Documentation in Stata (Intro)].<br />
* DIME Analytics (World Bank), [https://osf.io/juxcb Data Cleaning in Stata].<br />
* DIME (World Bank), [[Checklist: Submit Table| Checklist on submitting results.]] <br />
* David McKenzie (World Bank), [https://blogs.worldbank.org/impactevaluations/updated-overview-multiple-hypothesis-testing-commands-stata An updated overview of multiple hypothesis testing commands in Stata]<br />
* Gentzkow and Shapiro (Stanford) [http://web.stanford.edu/~gentzkow/research/CodeAndData.pdf Code and Data for the Social Sciences] <br />
* The GeoCenter, [http://geocenter.github.io/StataTraining/portfolio/01_resource/ Stata cheat sheets.]<br />
* Innovations for Poverty Action, [http://www.poverty-action.org/researchers/research-resources/stata-programs Stata modules for data collection and analysis] <br />
* Innovations for Poverty Action, [https://github.com/PovertyAction GitHub repository on impact evaluations] <br />
* Innovations for Poverty Action, [https://github.com/PovertyAction/odkmeta Odkmeta command]. This command writes a do-file to import ODK (Open Data Kit) data to Stata, using metadata from the survey and choices worksheets of the XLSForm.<br />
* J-PAL, [https://www.povertyactionlab.org/sites/default/files/resources/IAPStataWorkshopSlides.pdf Programming with Stata]<br />
* Princeton, [https://www.princeton.edu/~otorres/StataTutorial.pdf Data analysis in Stata for beginners] <br />
* Standford, [https://web.stanford.edu/~leinav/teaching/econ257/STATA.pdf Basics of Stata] <br />
* World Bank, [https://worldbank.github.io/stata/ Stata repository].<br />
[[Category: Coding Practices]]<br />
[[Category: Reproducible Research]]<br />
[[Category: Stata Coding Practices]]<br />
[[Category: Technical Tools]]</div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Branch-pr-merge_cycle&diff=8399Branch-pr-merge cycle2021-11-30T10:59:56Z<p>Kbjarkefur: </p>
<hr />
<div>The branch-pr-merge cycle is a quality assurance and code review model for when using GitHub as a collaboration and version control tool.<br />
<br />
== Read First ==<br />
* include here key points you want to make sure all readers understand<br />
<br />
<br />
== Guidelines ==<br />
* organize information on the topic into subsections. for each subsection, include a brief description / overview, with links to articles that provide details<br />
===Subsection 1===<br />
===Subsection 2===<br />
===Subsection 3===<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[*topic name, as listed on main page*]]<br />
<br />
<br />
== Additional Resources ==<br />
* list here other articles related to this topic, with a brief description and link<br />
<br />
[[Category: *category name* ]]</div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Branch-pr-merge-cycle&diff=8398Branch-pr-merge-cycle2021-11-30T10:48:07Z<p>Kbjarkefur: Kbjarkefur moved page Branch-pr-merge-cycle to Branch-pr-merge cycle</p>
<hr />
<div>#REDIRECT [[Branch-pr-merge cycle]]</div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Branch-pr-merge_cycle&diff=8397Branch-pr-merge cycle2021-11-30T10:48:07Z<p>Kbjarkefur: Kbjarkefur moved page Branch-pr-merge-cycle to Branch-pr-merge cycle</p>
<hr />
<div><span style="font-size:150%"><br />
<span style="color:#ff0000"> '''NOTE: this article is only a template. Please add content!''' </span><br />
</span><br />
<br />
<br />
add introductory 1-2 sentences here<br />
<br />
<br />
<br />
== Read First ==<br />
* include here key points you want to make sure all readers understand<br />
<br />
<br />
== Guidelines ==<br />
* organize information on the topic into subsections. for each subsection, include a brief description / overview, with links to articles that provide details<br />
===Subsection 1===<br />
===Subsection 2===<br />
===Subsection 3===<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[*topic name, as listed on main page*]]<br />
<br />
<br />
== Additional Resources ==<br />
* list here other articles related to this topic, with a brief description and link<br />
<br />
[[Category: *category name* ]]</div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Branch-pr-merge_cycle&diff=8396Branch-pr-merge cycle2021-11-30T10:47:17Z<p>Kbjarkefur: </p>
<hr />
<div><span style="font-size:150%"><br />
<span style="color:#ff0000"> '''NOTE: this article is only a template. Please add content!''' </span><br />
</span><br />
<br />
<br />
add introductory 1-2 sentences here<br />
<br />
<br />
<br />
== Read First ==<br />
* include here key points you want to make sure all readers understand<br />
<br />
<br />
== Guidelines ==<br />
* organize information on the topic into subsections. for each subsection, include a brief description / overview, with links to articles that provide details<br />
===Subsection 1===<br />
===Subsection 2===<br />
===Subsection 3===<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[*topic name, as listed on main page*]]<br />
<br />
<br />
== Additional Resources ==<br />
* list here other articles related to this topic, with a brief description and link<br />
<br />
[[Category: *category name* ]]</div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Branch-pr-merge_cycle&diff=8395Branch-pr-merge cycle2021-11-30T10:45:54Z<p>Kbjarkefur: Created page with "{{Dime_wiki}}"</p>
<hr />
<div>{{Dime_wiki}}</div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Main_Page&diff=8382Main Page2021-07-30T18:12:29Z<p>Kbjarkefur: </p>
<hr />
<div>__NOTOC__<br />
<br />
<div class="home_top"><br />
<div class="policy_intro container"><br />
<div class="row"><br />
<div class="col-md-2"></div><br />
<div class="col-md-8"><br />
<h3 class="mega_title text-center"><b>Welcome to the DIME Wiki</b></h3> <br />
<div class="policy_desc"><br />
<p>The <strong>DIME Wiki</strong> is a <strong>public good</strong> developed and maintained by <strong>DIME Analytics</strong>. The <strong>DIME Wiki</strong> is designed for researchers and M&E specialists at the World Bank, as well as clients, donor institutions, universities, NGOs, and governments. The <strong>DIME Wiki</strong> is a collaborative, open-source resource that presents guidelines that are easy to understand and apply for users of varying levels of expertise. </p><br />
<p style="color:black;text-align:center;"><b>This landing page offers links to several curated categories that users may find helpful, as well as our narrative Handbook.</b></p><br />
</div><br />
</div><br />
<div class="col-md-2"></div><br />
</div><br />
</div><br />
<div class="container bottomspace"><br />
<br />
<div class="row"><br />
<div class="col-sm-1"></div><br />
<div class="col-sm-10" style= "width:85% !important;" ><br />
<div class="col-sm-6 rg_red hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Design"><i class="fa fa-sitemap"></i> Research Design</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Experimental Methods]]</li><br />
<li>[[Quasi-Experimental Methods]]</li><br />
<li>[[Research Ethics]]</li><br />
<li>[[Sampling & Power Calculations]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
<div class="col-sm-6 rg_purple hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Data"><i class="fa fa-database"></i> Data Collection</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Primary Data Collection]]</li><br />
<li>[[Secondary Data Sources]]</li><br />
<li>[[Field Management]]</li><br />
<li>[[Questionnaire Design]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
<div class="col-sm-6 rg_green hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Analysis"><i class="fa fa-search"></i> Analysis</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Data Management]]</li><br />
<li>[[Data Cleaning]]</li><br />
<li>[[Data Analysis]]</li><br />
<li>[[Software Tools]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
<div class="col-sm-6 rg_blue hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Fieldwork"><i class="fa fa-globe"></i> Publication</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Reproducible Research]]</li><br />
<li>[[Publishing Data]]</li><br />
<li>[[Collaboration Tools]]</li><br />
<li>[[Dissemination]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
</div><br />
</div><br />
</div><br />
<br />
<div class="well"><br />
<div class="container"><br />
<br />
<h3 class="mega_title"><span class="mw-headline" id="Recent_Contributions"><b>Cross-cutting Resources</b></span></h3><br />
<div class="recent-activity"><br />
<br />
<div class="row"> <br />
<div class="col-md-4"> <br />
<div class="rec_contrib_box"><br />
<br />
<div class="rec_title">[[Stata Coding Practices]]</div><br />
</div><br />
</div><br />
<div class="col-md-4"> <br />
<div class="rec_contrib_box"><br />
<br />
<div class="rec_title">[[SurveyCTO Coding Practices]]</div><br />
</div><br />
</div><br />
<div class="col-md-4"> <br />
<div class="rec_contrib_box"><br />
<br />
<div class="rec_title">[[Check Lists]]</div><br />
</div><br />
</div><br />
</div><!--end .row--><br />
</div><br />
<br />
<br />
<br />
<h3 class="mega_title"><span class="mw-headline" id="Recent_Contributions"><b>Download the Dime Analytics Data Handbook </b></span></h3><br />
<div class="rec_contrib_box"><br />
<div class="dk_bar"><br />
[[File:BookBanner.png|700px||center|link=https://worldbank.github.io/dime-data-handbook/]]<br />
</div><br />
</div><br />
<br />
<div class="random"> <span class="hiddentext"><randompages limit="0" namespace="DIME Wiki" levels="1"></randompages></span></div><br />
<br />
</div><br />
</div><br />
<br />
<div class="policy_intro container"><br />
<div class="row"><br />
<div class="col-md-2"></div><br />
<div class="col-md-8"><br />
<h3 class="mega_title text-center"><b>About DIME and DIME Analytics</b></h3><br />
<div class="policy_desc"><br />
<onlyinclude><br />
[https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics] creates tools that improve the quality of impact evaluation research for all. We take advantage of the concentration and scale of research at DIME to develop and test solutions to ensure data work quality across our portfolio, and to make public training and tools available to the larger community of development researchers who might not have the same capabilities.<br />
<br />
[https://www.worldbank.org/en/research/dime DIME] is the World Bank’s impact evaluation department. Part of DIME’s mission is to intensify the production of, and access to, public goods that improve the quantity and quality of global development research, while lowering the costs of performing impact evaluations for the entire research community. The <strong>DIME Wiki</strong> aims to further this initiative, and is funded by the United Kingdom’s Foreign, Commonwealth & Development Office through the i2i Trust Fund.<br />
</div><br />
<br />
</div><br />
<div class="col-md-2"></div><br />
</div><br />
</div><br />
<br />
</div></div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Main_Page&diff=8381Main Page2021-07-30T18:11:15Z<p>Kbjarkefur: </p>
<hr />
<div>__NOTOC__<br />
<br />
<div class="home_top"><br />
<div class="policy_intro container"><br />
<div class="row"><br />
<div class="col-md-2"></div><br />
<div class="col-md-8"><br />
<h2><span class="mega_title text-center">Welcome to the DIME Wiki </span></h2> <br />
<div class="policy_desc"><br />
<p>The <strong>DIME Wiki</strong> is a <strong>public good</strong> developed and maintained by <strong>DIME Analytics</strong>. The <strong>DIME Wiki</strong> is designed for researchers and M&E specialists at the World Bank, as well as clients, donor institutions, universities, NGOs, and governments. The <strong>DIME Wiki</strong> is a collaborative, open-source resource that presents guidelines that are easy to understand and apply for users of varying levels of expertise. </p><br />
<p style="color:black;text-align:center;"><b>This landing page offers links to several curated categories that users may find helpful, as well as our narrative Handbook.</b></p><br />
</div><br />
</div><br />
<div class="col-md-2"></div><br />
</div><br />
</div><br />
<div class="container bottomspace"><br />
<br />
<div class="row"><br />
<div class="col-sm-1"></div><br />
<div class="col-sm-10" style= "width:85% !important;" ><br />
<div class="col-sm-6 rg_red hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Design"><i class="fa fa-sitemap"></i> Research Design</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Experimental Methods]]</li><br />
<li>[[Quasi-Experimental Methods]]</li><br />
<li>[[Research Ethics]]</li><br />
<li>[[Sampling & Power Calculations]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
<div class="col-sm-6 rg_purple hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Data"><i class="fa fa-database"></i> Data Collection</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Primary Data Collection]]</li><br />
<li>[[Secondary Data Sources]]</li><br />
<li>[[Field Management]]</li><br />
<li>[[Questionnaire Design]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
<div class="col-sm-6 rg_green hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Analysis"><i class="fa fa-search"></i> Analysis</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Data Management]]</li><br />
<li>[[Data Cleaning]]</li><br />
<li>[[Data Analysis]]</li><br />
<li>[[Software Tools]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
<div class="col-sm-6 rg_blue hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Fieldwork"><i class="fa fa-globe"></i> Publication</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Reproducible Research]]</li><br />
<li>[[Publishing Data]]</li><br />
<li>[[Collaboration Tools]]</li><br />
<li>[[Dissemination]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
</div><br />
</div><br />
</div><br />
<br />
<div class="well"><br />
<div class="container"><br />
<br />
<h3 class="mega_title"><span class="mw-headline" id="Recent_Contributions"><b>Cross-cutting Resources</b></span></h3><br />
<div class="recent-activity"><br />
<br />
<div class="row"> <br />
<div class="col-md-4"> <br />
<div class="rec_contrib_box"><br />
<br />
<div class="rec_title">[[Stata Coding Practices]]</div><br />
</div><br />
</div><br />
<div class="col-md-4"> <br />
<div class="rec_contrib_box"><br />
<br />
<div class="rec_title">[[SurveyCTO Coding Practices]]</div><br />
</div><br />
</div><br />
<div class="col-md-4"> <br />
<div class="rec_contrib_box"><br />
<br />
<div class="rec_title">[[Check Lists]]</div><br />
</div><br />
</div><br />
</div><!--end .row--><br />
</div><br />
<br />
<br />
<br />
<h3 class="mega_title"><span class="mw-headline" id="Recent_Contributions"><b>Download the Dime Analytics Data Handbook </b></span></h3><br />
<div class="rec_contrib_box"><br />
<div class="dk_bar"><br />
[[File:BookBanner.png|700px||center|link=https://worldbank.github.io/dime-data-handbook/]]<br />
</div><br />
</div><br />
<br />
<div class="random"> <span class="hiddentext"><randompages limit="0" namespace="DIME Wiki" levels="1"></randompages></span></div><br />
<br />
</div><br />
</div><br />
<br />
<div class="policy_intro container"><br />
<div class="row"><br />
<div class="col-md-2"></div><br />
<div class="col-md-8"><br />
<h3 class="mega_title text-center"><b>About DIME and DIME Analytics</b></h3><br />
<div class="policy_desc"><br />
<onlyinclude><br />
[https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics] creates tools that improve the quality of impact evaluation research for all. We take advantage of the concentration and scale of research at DIME to develop and test solutions to ensure data work quality across our portfolio, and to make public training and tools available to the larger community of development researchers who might not have the same capabilities.<br />
<br />
[https://www.worldbank.org/en/research/dime DIME] is the World Bank’s impact evaluation department. Part of DIME’s mission is to intensify the production of, and access to, public goods that improve the quantity and quality of global development research, while lowering the costs of performing impact evaluations for the entire research community. The <strong>DIME Wiki</strong> aims to further this initiative, and is funded by the United Kingdom’s Foreign, Commonwealth & Development Office through the i2i Trust Fund.<br />
</div><br />
<br />
</div><br />
<div class="col-md-2"></div><br />
</div><br />
</div><br />
<br />
</div></div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Main_Page&diff=8380Main Page2021-07-30T18:10:32Z<p>Kbjarkefur: </p>
<hr />
<div>__NOTOC__<br />
<br />
<div class="home_top"><br />
<div class="policy_intro container"><br />
<div class="row"><br />
<div class="col-md-2"></div><br />
<div class="col-md-8"><br />
<h2><span class="mw-headline" style="color:black;">Welcome to the DIME Wiki </span></h2> <br />
<div class="policy_desc"><br />
<p>The <strong>DIME Wiki</strong> is a <strong>public good</strong> developed and maintained by <strong>DIME Analytics</strong>. The <strong>DIME Wiki</strong> is designed for researchers and M&E specialists at the World Bank, as well as clients, donor institutions, universities, NGOs, and governments. The <strong>DIME Wiki</strong> is a collaborative, open-source resource that presents guidelines that are easy to understand and apply for users of varying levels of expertise. </p><br />
<p style="color:black;text-align:center;"><b>This landing page offers links to several curated categories that users may find helpful, as well as our narrative Handbook.</b></p><br />
</div><br />
</div><br />
<div class="col-md-2"></div><br />
</div><br />
</div><br />
<div class="container bottomspace"><br />
<br />
<div class="row"><br />
<div class="col-sm-1"></div><br />
<div class="col-sm-10" style= "width:85% !important;" ><br />
<div class="col-sm-6 rg_red hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Design"><i class="fa fa-sitemap"></i> Research Design</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Experimental Methods]]</li><br />
<li>[[Quasi-Experimental Methods]]</li><br />
<li>[[Research Ethics]]</li><br />
<li>[[Sampling & Power Calculations]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
<div class="col-sm-6 rg_purple hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Data"><i class="fa fa-database"></i> Data Collection</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Primary Data Collection]]</li><br />
<li>[[Secondary Data Sources]]</li><br />
<li>[[Field Management]]</li><br />
<li>[[Questionnaire Design]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
<div class="col-sm-6 rg_green hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Analysis"><i class="fa fa-search"></i> Analysis</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Data Management]]</li><br />
<li>[[Data Cleaning]]</li><br />
<li>[[Data Analysis]]</li><br />
<li>[[Software Tools]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
<div class="col-sm-6 rg_blue hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Fieldwork"><i class="fa fa-globe"></i> Publication</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Reproducible Research]]</li><br />
<li>[[Publishing Data]]</li><br />
<li>[[Collaboration Tools]]</li><br />
<li>[[Dissemination]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
</div><br />
</div><br />
</div><br />
<br />
<div class="well"><br />
<div class="container"><br />
<br />
<h3 class="mega_title"><span class="mw-headline" id="Recent_Contributions"><b>Cross-cutting Resources</b></span></h3><br />
<div class="recent-activity"><br />
<br />
<div class="row"> <br />
<div class="col-md-4"> <br />
<div class="rec_contrib_box"><br />
<br />
<div class="rec_title">[[Stata Coding Practices]]</div><br />
</div><br />
</div><br />
<div class="col-md-4"> <br />
<div class="rec_contrib_box"><br />
<br />
<div class="rec_title">[[SurveyCTO Coding Practices]]</div><br />
</div><br />
</div><br />
<div class="col-md-4"> <br />
<div class="rec_contrib_box"><br />
<br />
<div class="rec_title">[[Check Lists]]</div><br />
</div><br />
</div><br />
</div><!--end .row--><br />
</div><br />
<br />
<br />
<br />
<h3 class="mega_title"><span class="mw-headline" id="Recent_Contributions"><b>Download the Dime Analytics Data Handbook </b></span></h3><br />
<div class="rec_contrib_box"><br />
<div class="dk_bar"><br />
[[File:BookBanner.png|700px||center|link=https://worldbank.github.io/dime-data-handbook/]]<br />
</div><br />
</div><br />
<br />
<div class="random"> <span class="hiddentext"><randompages limit="0" namespace="DIME Wiki" levels="1"></randompages></span></div><br />
<br />
</div><br />
</div><br />
<br />
<div class="policy_intro container"><br />
<div class="row"><br />
<div class="col-md-2"></div><br />
<div class="col-md-8"><br />
<h3 class="mega_title text-center"><b>About DIME and DIME Analytics</b></h3><br />
<div class="policy_desc"><br />
<onlyinclude><br />
[https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics] creates tools that improve the quality of impact evaluation research for all. We take advantage of the concentration and scale of research at DIME to develop and test solutions to ensure data work quality across our portfolio, and to make public training and tools available to the larger community of development researchers who might not have the same capabilities.<br />
<br />
[https://www.worldbank.org/en/research/dime DIME] is the World Bank’s impact evaluation department. Part of DIME’s mission is to intensify the production of, and access to, public goods that improve the quantity and quality of global development research, while lowering the costs of performing impact evaluations for the entire research community. The <strong>DIME Wiki</strong> aims to further this initiative, and is funded by the United Kingdom’s Foreign, Commonwealth & Development Office through the i2i Trust Fund.<br />
</div><br />
<br />
</div><br />
<div class="col-md-2"></div><br />
</div><br />
</div><br />
<br />
</div></div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Main_Page&diff=8379Main Page2021-07-30T18:08:26Z<p>Kbjarkefur: </p>
<hr />
<div>__NOTOC__<br />
<br />
<div class="home_top"><br />
<div class="container bottomspace"><br />
<h2><span class="mw-headline" style="color:black;">Welcome to the DIME Wiki </span></h2> <br />
<br />
<div class="policy_desc"><br />
<p>The <strong>DIME Wiki</strong> is a <strong>public good</strong> developed and maintained by <strong>DIME Analytics</strong>. The <strong>DIME Wiki</strong> is designed for researchers and M&E specialists at the World Bank, as well as clients, donor institutions, universities, NGOs, and governments. The <strong>DIME Wiki</strong> is a collaborative, open-source resource that presents guidelines that are easy to understand and apply for users of varying levels of expertise. </p><br />
<p style="color:black;text-align:center;"><b>This landing page offers links to several curated categories that users may find helpful, as well as our narrative Handbook.</b></p><br />
</div><br />
<div class="row"><br />
<div class="col-sm-1"></div><br />
<div class="col-sm-10" style= "width:85% !important;" ><br />
<div class="col-sm-6 rg_red hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Design"><i class="fa fa-sitemap"></i> Research Design</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Experimental Methods]]</li><br />
<li>[[Quasi-Experimental Methods]]</li><br />
<li>[[Research Ethics]]</li><br />
<li>[[Sampling & Power Calculations]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
<div class="col-sm-6 rg_purple hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Data"><i class="fa fa-database"></i> Data Collection</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Primary Data Collection]]</li><br />
<li>[[Secondary Data Sources]]</li><br />
<li>[[Field Management]]</li><br />
<li>[[Questionnaire Design]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
<div class="col-sm-6 rg_green hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Analysis"><i class="fa fa-search"></i> Analysis</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Data Management]]</li><br />
<li>[[Data Cleaning]]</li><br />
<li>[[Data Analysis]]</li><br />
<li>[[Software Tools]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
<div class="col-sm-6 rg_blue hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Fieldwork"><i class="fa fa-globe"></i> Publication</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Reproducible Research]]</li><br />
<li>[[Publishing Data]]</li><br />
<li>[[Collaboration Tools]]</li><br />
<li>[[Dissemination]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
</div><br />
</div><br />
</div><br />
<br />
<div class="well"><br />
<div class="container"><br />
<br />
<h3 class="mega_title"><span class="mw-headline" id="Recent_Contributions"><b>Cross-cutting Resources</b></span></h3><br />
<div class="recent-activity"><br />
<br />
<div class="row"> <br />
<div class="col-md-4"> <br />
<div class="rec_contrib_box"><br />
<br />
<div class="rec_title">[[Stata Coding Practices]]</div><br />
</div><br />
</div><br />
<div class="col-md-4"> <br />
<div class="rec_contrib_box"><br />
<br />
<div class="rec_title">[[SurveyCTO Coding Practices]]</div><br />
</div><br />
</div><br />
<div class="col-md-4"> <br />
<div class="rec_contrib_box"><br />
<br />
<div class="rec_title">[[Check Lists]]</div><br />
</div><br />
</div><br />
</div><!--end .row--><br />
</div><br />
<br />
<br />
<br />
<h3 class="mega_title"><span class="mw-headline" id="Recent_Contributions"><b>Download the Dime Analytics Data Handbook </b></span></h3><br />
<div class="rec_contrib_box"><br />
<div class="dk_bar"><br />
[[File:BookBanner.png|700px||center|link=https://worldbank.github.io/dime-data-handbook/]]<br />
</div><br />
</div><br />
<br />
<div class="random"> <span class="hiddentext"><randompages limit="0" namespace="DIME Wiki" levels="1"></randompages></span></div><br />
<br />
</div><br />
</div><br />
<br />
<div class="policy_intro container"><br />
<div class="row"><br />
<div class="col-md-2"></div><br />
<div class="col-md-8"><br />
<h3 class="mega_title text-center"><b>About DIME and DIME Analytics</b></h3><br />
<div class="policy_desc"><br />
<onlyinclude><br />
[https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics] creates tools that improve the quality of impact evaluation research for all. We take advantage of the concentration and scale of research at DIME to develop and test solutions to ensure data work quality across our portfolio, and to make public training and tools available to the larger community of development researchers who might not have the same capabilities.<br />
<br />
[https://www.worldbank.org/en/research/dime DIME] is the World Bank’s impact evaluation department. Part of DIME’s mission is to intensify the production of, and access to, public goods that improve the quantity and quality of global development research, while lowering the costs of performing impact evaluations for the entire research community. The <strong>DIME Wiki</strong> aims to further this initiative, and is funded by the United Kingdom’s Foreign, Commonwealth & Development Office through the i2i Trust Fund.<br />
</div><br />
<br />
</div><br />
<div class="col-md-2"></div><br />
</div><br />
</div><br />
<br />
</div></div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Main_Page&diff=8378Main Page2021-07-30T18:04:10Z<p>Kbjarkefur: </p>
<hr />
<div>__NOTOC__<br />
<br />
<div class="home_top"><br />
<div class="container bottomspace"><br />
<h2><span class="mw-headline" style="color:black;">Welcome to the DIME Wiki </span></h2> <br />
<p>The <strong>DIME Wiki</strong> is a <strong>public good</strong> developed and maintained by <strong>DIME Analytics</strong>. The <strong>DIME Wiki</strong> is designed for researchers and M&E specialists at the World Bank, as well as clients, donor institutions, universities, NGOs, and governments. The <strong>DIME Wiki</strong> is a collaborative, open-source resource that presents guidelines that are easy to understand and apply for users of varying levels of expertise. </p><br />
<p style="color:black;text-align:center;"><b>This landing page offers links to several curated categories that users may find helpful, as well as our narrative Handbook.</b></p><br />
<div class="row"><br />
<div class="col-sm-1"></div><br />
<div class="col-sm-10" style= "width:85% !important;" ><br />
<div class="col-sm-6 rg_red hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Design"><i class="fa fa-sitemap"></i> Research Design</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Experimental Methods]]</li><br />
<li>[[Quasi-Experimental Methods]]</li><br />
<li>[[Research Ethics]]</li><br />
<li>[[Sampling & Power Calculations]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
<div class="col-sm-6 rg_purple hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Data"><i class="fa fa-database"></i> Data Collection</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Primary Data Collection]]</li><br />
<li>[[Secondary Data Sources]]</li><br />
<li>[[Field Management]]</li><br />
<li>[[Questionnaire Design]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
<div class="col-sm-6 rg_green hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Analysis"><i class="fa fa-search"></i> Analysis</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Data Management]]</li><br />
<li>[[Data Cleaning]]</li><br />
<li>[[Data Analysis]]</li><br />
<li>[[Software Tools]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
<div class="col-sm-6 rg_blue hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Fieldwork"><i class="fa fa-globe"></i> Publication</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Reproducible Research]]</li><br />
<li>[[Publishing Data]]</li><br />
<li>[[Collaboration Tools]]</li><br />
<li>[[Dissemination]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
</div><br />
</div><br />
</div><br />
<br />
<div class="well"><br />
<div class="container"><br />
<br />
<h3 class="mega_title"><span class="mw-headline" id="Recent_Contributions"><b>Cross-cutting Resources</b></span></h3><br />
<div class="recent-activity"><br />
<br />
<div class="row"> <br />
<div class="col-md-4"> <br />
<div class="rec_contrib_box"><br />
<br />
<div class="rec_title">[[Stata Coding Practices]]</div><br />
</div><br />
</div><br />
<div class="col-md-4"> <br />
<div class="rec_contrib_box"><br />
<br />
<div class="rec_title">[[SurveyCTO Coding Practices]]</div><br />
</div><br />
</div><br />
<div class="col-md-4"> <br />
<div class="rec_contrib_box"><br />
<br />
<div class="rec_title">[[Check Lists]]</div><br />
</div><br />
</div><br />
</div><!--end .row--><br />
</div><br />
<br />
<br />
<br />
<h3 class="mega_title"><span class="mw-headline" id="Recent_Contributions"><b>Download the Dime Analytics Data Handbook </b></span></h3><br />
<div class="rec_contrib_box"><br />
<div class="dk_bar"><br />
[[File:BookBanner.png|700px||center|link=https://worldbank.github.io/dime-data-handbook/]]<br />
</div><br />
</div><br />
<br />
<div class="random"> <span class="hiddentext"><randompages limit="0" namespace="DIME Wiki" levels="1"></randompages></span></div><br />
<br />
</div><br />
</div><br />
<br />
<div class="policy_intro container"><br />
<div class="row"><br />
<div class="col-md-2"></div><br />
<div class="col-md-8"><br />
<h3 class="mega_title text-center"><b>About DIME and DIME Analytics</b></h3><br />
<div class="policy_desc"><br />
<onlyinclude><br />
[https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics] creates tools that improve the quality of impact evaluation research for all. We take advantage of the concentration and scale of research at DIME to develop and test solutions to ensure data work quality across our portfolio, and to make public training and tools available to the larger community of development researchers who might not have the same capabilities.<br />
<br />
[https://www.worldbank.org/en/research/dime DIME] is the World Bank’s impact evaluation department. Part of DIME’s mission is to intensify the production of, and access to, public goods that improve the quantity and quality of global development research, while lowering the costs of performing impact evaluations for the entire research community. The <strong>DIME Wiki</strong> aims to further this initiative, and is funded by the United Kingdom’s Foreign, Commonwealth & Development Office through the i2i Trust Fund.<br />
</div><br />
<br />
</div><br />
<div class="col-md-2"></div><br />
</div><br />
</div><br />
<br />
</div></div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Main_Page&diff=8377Main Page2021-07-30T18:03:41Z<p>Kbjarkefur: </p>
<hr />
<div>__NOTOC__<br />
<br />
<div class="home_top"><br />
<br />
<h2><span class="mw-headline" style="color:black;">Welcome to the DIME Wiki </span></h2> <br />
<br />
<p>The <strong>DIME Wiki</strong> is a <strong>public good</strong> developed and maintained by <strong>DIME Analytics</strong>. The <strong>DIME Wiki</strong> is designed for researchers and M&E specialists at the World Bank, as well as clients, donor institutions, universities, NGOs, and governments. The <strong>DIME Wiki</strong> is a collaborative, open-source resource that presents guidelines that are easy to understand and apply for users of varying levels of expertise. </p><br />
<br />
<p style="color:black;text-align:center;"><b>This landing page offers links to several curated categories that users may find helpful, as well as our narrative Handbook.</b></p><br />
<div class="container bottomspace"><br />
<br />
<br />
<div class="row"><br />
<div class="col-sm-1"></div><br />
<div class="col-sm-10" style= "width:85% !important;" ><br />
<div class="col-sm-6 rg_red hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Design"><i class="fa fa-sitemap"></i> Research Design</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Experimental Methods]]</li><br />
<li>[[Quasi-Experimental Methods]]</li><br />
<li>[[Research Ethics]]</li><br />
<li>[[Sampling & Power Calculations]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
<div class="col-sm-6 rg_purple hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Data"><i class="fa fa-database"></i> Data Collection</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Primary Data Collection]]</li><br />
<li>[[Secondary Data Sources]]</li><br />
<li>[[Field Management]]</li><br />
<li>[[Questionnaire Design]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
<div class="col-sm-6 rg_green hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Analysis"><i class="fa fa-search"></i> Analysis</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Data Management]]</li><br />
<li>[[Data Cleaning]]</li><br />
<li>[[Data Analysis]]</li><br />
<li>[[Software Tools]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
<div class="col-sm-6 rg_blue hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Fieldwork"><i class="fa fa-globe"></i> Publication</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Reproducible Research]]</li><br />
<li>[[Publishing Data]]</li><br />
<li>[[Collaboration Tools]]</li><br />
<li>[[Dissemination]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
</div><br />
</div><br />
</div><br />
<br />
<div class="well"><br />
<div class="container"><br />
<br />
<h3 class="mega_title"><span class="mw-headline" id="Recent_Contributions"><b>Cross-cutting Resources</b></span></h3><br />
<div class="recent-activity"><br />
<br />
<div class="row"> <br />
<div class="col-md-4"> <br />
<div class="rec_contrib_box"><br />
<br />
<div class="rec_title">[[Stata Coding Practices]]</div><br />
</div><br />
</div><br />
<div class="col-md-4"> <br />
<div class="rec_contrib_box"><br />
<br />
<div class="rec_title">[[SurveyCTO Coding Practices]]</div><br />
</div><br />
</div><br />
<div class="col-md-4"> <br />
<div class="rec_contrib_box"><br />
<br />
<div class="rec_title">[[Check Lists]]</div><br />
</div><br />
</div><br />
</div><!--end .row--><br />
</div><br />
<br />
<br />
<br />
<h3 class="mega_title"><span class="mw-headline" id="Recent_Contributions"><b>Download the Dime Analytics Data Handbook </b></span></h3><br />
<div class="rec_contrib_box"><br />
<div class="dk_bar"><br />
[[File:BookBanner.png|700px||center|link=https://worldbank.github.io/dime-data-handbook/]]<br />
</div><br />
</div><br />
<br />
<div class="random"> <span class="hiddentext"><randompages limit="0" namespace="DIME Wiki" levels="1"></randompages></span></div><br />
<br />
</div><br />
</div><br />
<br />
<div class="policy_intro container"><br />
<div class="row"><br />
<div class="col-md-2"></div><br />
<div class="col-md-8"><br />
<h3 class="mega_title text-center"><b>About DIME and DIME Analytics</b></h3><br />
<div class="policy_desc"><br />
<onlyinclude><br />
[https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics] creates tools that improve the quality of impact evaluation research for all. We take advantage of the concentration and scale of research at DIME to develop and test solutions to ensure data work quality across our portfolio, and to make public training and tools available to the larger community of development researchers who might not have the same capabilities.<br />
<br />
[https://www.worldbank.org/en/research/dime DIME] is the World Bank’s impact evaluation department. Part of DIME’s mission is to intensify the production of, and access to, public goods that improve the quantity and quality of global development research, while lowering the costs of performing impact evaluations for the entire research community. The <strong>DIME Wiki</strong> aims to further this initiative, and is funded by the United Kingdom’s Foreign, Commonwealth & Development Office through the i2i Trust Fund.<br />
</div><br />
<br />
</div><br />
<div class="col-md-2"></div><br />
</div><br />
</div><br />
<br />
</div></div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Main_Page&diff=8376Main Page2021-07-30T18:01:34Z<p>Kbjarkefur: </p>
<hr />
<div>__NOTOC__<br />
<br />
<div class="home_top"><br />
<div class="container bottomspace"><br />
<h2><span class="mw-headline" style="color:black;">Welcome to the DIME Wiki </span></h2> <br />
<br />
<p>The <strong>DIME Wiki</strong> is a <strong>public good</strong> developed and maintained by <strong>DIME Analytics</strong>. The <strong>DIME Wiki</strong> is designed for researchers and M&E specialists at the World Bank, as well as clients, donor institutions, universities, NGOs, and governments. The <strong>DIME Wiki</strong> is a collaborative, open-source resource that presents guidelines that are easy to understand and apply for users of varying levels of expertise. </p><br />
<br />
<p style="color:black;text-align:center;"><b>This landing page offers links to several curated categories that users may find helpful, as well as our narrative Handbook.</b></p><br />
<br />
<div class="row"><br />
<div class="col-sm-1"></div><br />
<div class="col-sm-10" style= "width:85% !important;" ><br />
<div class="col-sm-6 rg_red hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Design"><i class="fa fa-sitemap"></i> Research Design</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Experimental Methods]]</li><br />
<li>[[Quasi-Experimental Methods]]</li><br />
<li>[[Research Ethics]]</li><br />
<li>[[Sampling & Power Calculations]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
<div class="col-sm-6 rg_purple hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Data"><i class="fa fa-database"></i> Data Collection</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Primary Data Collection]]</li><br />
<li>[[Secondary Data Sources]]</li><br />
<li>[[Field Management]]</li><br />
<li>[[Questionnaire Design]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
<div class="col-sm-6 rg_green hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Analysis"><i class="fa fa-search"></i> Analysis</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Data Management]]</li><br />
<li>[[Data Cleaning]]</li><br />
<li>[[Data Analysis]]</li><br />
<li>[[Software Tools]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
<div class="col-sm-6 rg_blue hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Fieldwork"><i class="fa fa-globe"></i> Publication</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Reproducible Research]]</li><br />
<li>[[Publishing Data]]</li><br />
<li>[[Collaboration Tools]]</li><br />
<li>[[Dissemination]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
</div><br />
</div><br />
</div><br />
<br />
<div class="well"><br />
<div class="container"><br />
<br />
<h3 class="mega_title"><span class="mw-headline" id="Recent_Contributions"><b>Cross-cutting Resources</b></span></h3><br />
<div class="recent-activity"><br />
<br />
<div class="row"> <br />
<div class="col-md-4"> <br />
<div class="rec_contrib_box"><br />
<br />
<div class="rec_title">[[Stata Coding Practices]]</div><br />
</div><br />
</div><br />
<div class="col-md-4"> <br />
<div class="rec_contrib_box"><br />
<br />
<div class="rec_title">[[SurveyCTO Coding Practices]]</div><br />
</div><br />
</div><br />
<div class="col-md-4"> <br />
<div class="rec_contrib_box"><br />
<br />
<div class="rec_title">[[Check Lists]]</div><br />
</div><br />
</div><br />
</div><!--end .row--><br />
</div><br />
<br />
<br />
<br />
<h3 class="mega_title"><span class="mw-headline" id="Recent_Contributions"><b>Download the Dime Analytics Data Handbook </b></span></h3><br />
<div class="rec_contrib_box"><br />
<div class="dk_bar"><br />
[[File:BookBanner.png|700px||center|link=https://worldbank.github.io/dime-data-handbook/]]<br />
</div><br />
</div><br />
<br />
<div class="random"> <span class="hiddentext"><randompages limit="0" namespace="DIME Wiki" levels="1"></randompages></span></div><br />
<br />
</div><br />
</div><br />
<br />
<div class="policy_intro container"><br />
<div class="row"><br />
<div class="col-md-2"></div><br />
<div class="col-md-8"><br />
<h3 class="mega_title text-center"><b>About DIME and DIME Analytics</b></h3><br />
<div class="policy_desc"><br />
<onlyinclude><br />
[https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics] creates tools that improve the quality of impact evaluation research for all. We take advantage of the concentration and scale of research at DIME to develop and test solutions to ensure data work quality across our portfolio, and to make public training and tools available to the larger community of development researchers who might not have the same capabilities.<br />
<br />
[https://www.worldbank.org/en/research/dime DIME] is the World Bank’s impact evaluation department. Part of DIME’s mission is to intensify the production of, and access to, public goods that improve the quantity and quality of global development research, while lowering the costs of performing impact evaluations for the entire research community. The <strong>DIME Wiki</strong> aims to further this initiative, and is funded by the United Kingdom’s Foreign, Commonwealth & Development Office through the i2i Trust Fund.<br />
</div><br />
<br />
</div><br />
<div class="col-md-2"></div><br />
</div><br />
</div><br />
<br />
</div></div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Main_Page&diff=8375Main Page2021-07-30T18:00:41Z<p>Kbjarkefur: </p>
<hr />
<div>__NOTOC__<br />
<br />
<div class="home_top"><br />
<br />
<div class="container bottomspace"><br />
<br />
<h2><span class="mw-headline" style="color:black;">Welcome to the DIME Wiki!</span></h2> <br />
<br />
The <strong>DIME Wiki</strong> is a <strong>public good</strong> developed and maintained by <strong>DIME Analytics</strong>. The <strong>DIME Wiki</strong> is designed for researchers and M&E specialists at the World Bank, as well as clients, donor institutions, universities, NGOs, and governments. The <strong>DIME Wiki</strong> is a collaborative, open-source resource that presents guidelines that are easy to understand and apply for users of varying levels of expertise. <br />
<br />
<p style="color:black;text-align:center;"><b>This landing page offers links to several curated categories that users may find helpful, as well as our narrative Handbook.</b></p><br />
<br />
<div class="row"><br />
<div class="col-sm-1"></div><br />
<div class="col-sm-10" style= "width:85% !important;" ><br />
<div class="col-sm-6 rg_red hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Design"><i class="fa fa-sitemap"></i> Research Design</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Experimental Methods]]</li><br />
<li>[[Quasi-Experimental Methods]]</li><br />
<li>[[Research Ethics]]</li><br />
<li>[[Sampling & Power Calculations]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
<div class="col-sm-6 rg_purple hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Data"><i class="fa fa-database"></i> Data Collection</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Primary Data Collection]]</li><br />
<li>[[Secondary Data Sources]]</li><br />
<li>[[Field Management]]</li><br />
<li>[[Questionnaire Design]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
<div class="col-sm-6 rg_green hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Analysis"><i class="fa fa-search"></i> Analysis</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Data Management]]</li><br />
<li>[[Data Cleaning]]</li><br />
<li>[[Data Analysis]]</li><br />
<li>[[Software Tools]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
<div class="col-sm-6 rg_blue hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Fieldwork"><i class="fa fa-globe"></i> Publication</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Reproducible Research]]</li><br />
<li>[[Publishing Data]]</li><br />
<li>[[Collaboration Tools]]</li><br />
<li>[[Dissemination]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
</div><br />
</div><br />
</div><br />
<br />
<div class="well"><br />
<div class="container"><br />
<br />
<h3 class="mega_title"><span class="mw-headline" id="Recent_Contributions"><b>Cross-cutting Resources</b></span></h3><br />
<div class="recent-activity"><br />
<br />
<div class="row"> <br />
<div class="col-md-4"> <br />
<div class="rec_contrib_box"><br />
<br />
<div class="rec_title">[[Stata Coding Practices]]</div><br />
</div><br />
</div><br />
<div class="col-md-4"> <br />
<div class="rec_contrib_box"><br />
<br />
<div class="rec_title">[[SurveyCTO Coding Practices]]</div><br />
</div><br />
</div><br />
<div class="col-md-4"> <br />
<div class="rec_contrib_box"><br />
<br />
<div class="rec_title">[[Check Lists]]</div><br />
</div><br />
</div><br />
</div><!--end .row--><br />
</div><br />
<br />
<br />
<br />
<h3 class="mega_title"><span class="mw-headline" id="Recent_Contributions"><b>Download the Dime Analytics Data Handbook </b></span></h3><br />
<div class="rec_contrib_box"><br />
<div class="dk_bar"><br />
[[File:BookBanner.png|700px||center|link=https://worldbank.github.io/dime-data-handbook/]]<br />
</div><br />
</div><br />
<br />
<div class="random"> <span class="hiddentext"><randompages limit="0" namespace="DIME Wiki" levels="1"></randompages></span></div><br />
<br />
</div><br />
</div><br />
<br />
<div class="policy_intro container"><br />
<div class="row"><br />
<div class="col-md-2"></div><br />
<div class="col-md-8"><br />
<h3 class="mega_title text-center"><b>About DIME and DIME Analytics</b></h3><br />
<div class="policy_desc"><br />
<onlyinclude><br />
[https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics] creates tools that improve the quality of impact evaluation research for all. We take advantage of the concentration and scale of research at DIME to develop and test solutions to ensure data work quality across our portfolio, and to make public training and tools available to the larger community of development researchers who might not have the same capabilities.<br />
<br />
[https://www.worldbank.org/en/research/dime DIME] is the World Bank’s impact evaluation department. Part of DIME’s mission is to intensify the production of, and access to, public goods that improve the quantity and quality of global development research, while lowering the costs of performing impact evaluations for the entire research community. The <strong>DIME Wiki</strong> aims to further this initiative, and is funded by the United Kingdom’s Foreign, Commonwealth & Development Office through the i2i Trust Fund.<br />
</div><br />
<br />
</div><br />
<div class="col-md-2"></div><br />
</div><br />
</div><br />
<br />
</div></div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Main_Page&diff=8374Main Page2021-07-30T18:00:23Z<p>Kbjarkefur: </p>
<hr />
<div>__NOTOC__<br />
<br />
<div class="home_top"><br />
<br />
<div class="container bottomspace"><br />
<br />
<h2><span class="mw-headline" style="color:black;">Welcome to the DIME Wiki!</span></h2> <br />
<br />
The <strong>DIME Wiki</strong> is a <strong>public good</strong> developed and maintained by <strong>DIME Analytics</strong>. The <strong>DIME Wiki</strong> is designed for researchers and M&E specialists at the World Bank, as well as clients, donor institutions, universities, NGOs, and governments. The <strong>DIME Wiki</strong> is a collaborative, open-source resource that presents guidelines that are easy to understand and apply for users of varying levels of expertise. <br />
<br />
<p style="color:black;text-align:center;"><b>This landing page offers links to several curated categories that users may find helpful, as well as our narrative Handbook.</b></p><br />
<br />
<div class="row"><br />
<div class="col-sm-1"></div><br />
<div class="col-sm-10" style= "width:85% !important;" ><br />
<div class="col-sm-6 rg_red hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Design"><i class="fa fa-sitemap"></i> Research Design</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Experimental Methods]]</li><br />
<li>[[Quasi-Experimental Methods]]</li><br />
<li>[[Research Ethics]]</li><br />
<li>[[Sampling & Power Calculations]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
<div class="col-sm-6 rg_purple hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Data"><i class="fa fa-database"></i> Data Collection</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Primary Data Collection]]</li><br />
<li>[[Secondary Data Sources]]</li><br />
<li>[[Field Management]]</li><br />
<li>[[Questionnaire Design]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
<div class="col-sm-6 rg_green hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Analysis"><i class="fa fa-search"></i> Analysis</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Data Management]]</li><br />
<li>[[Data Cleaning]]</li><br />
<li>[[Data Analysis]]</li><br />
<li>[[Software Tools]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
<div class="col-sm-6 rg_blue hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Fieldwork"><i class="fa fa-globe"></i> Publication</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Reproducible Research]]</li><br />
<li>[[Publishing Data]]</li><br />
<li>[[Collaboration Tools]]</li><br />
<li>[[Dissemination]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
</div><br />
</div><br />
</div><br />
<br />
<div class="well"><br />
<div class="container"><br />
<br />
<h3 class="mega_title"><span class="mw-headline" id="Recent_Contributions"><b>Cross-cutting Resources</b></span></h3><br />
<div class="recent-activity"><br />
<br />
<div class="row"> <br />
<div class="col-md-4"> <br />
<div class="rec_contrib_box"><br />
<br />
<div class="rec_title">[[Stata Coding Practices]]</div><br />
</div><br />
</div><br />
<div class="col-md-4"> <br />
<div class="rec_contrib_box"><br />
<br />
<div class="rec_title">[[SurveyCTO Coding Practices]]</div><br />
</div><br />
</div><br />
<div class="col-md-4"> <br />
<div class="rec_contrib_box"><br />
<br />
<div class="rec_title">[[Check Lists]]</div><br />
</div><br />
</div><br />
</div><!--end .row--><br />
</div><br />
<br />
<br />
<br />
<h3 class="mega_title"><span class="mw-headline" id="Recent_Contributions"><b>Download the Dime Analytics Data Handbook </b></span></h3><br />
<div class="rec_contrib_box"><br />
<div class="dk_bar"><br />
[[File:BookBanner.png|700px||center|link=https://worldbank.github.io/dime-data-handbook/]]<br />
</div><br />
</div><br />
<br />
<div class="random"> <span class="hiddentext"><randompages limit="0" namespace="DIME Wiki" levels="1"></randompages></span></div><br />
<br />
</div><br />
</div><br />
<br />
<div class="policy_intro container"><br />
<div class="row"><br />
<div class="col-md-2"></div><br />
<div class="col-md-8"><br />
<h3 class="mega_title text-center"><b>About DIME and DIME Analytics</b></h3><br />
<div class="policy_desc"><br />
<onlyinclude><br />
[https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics] creates tools that improve the quality of impact evaluation research for all. We take advantage of the concentration and scale of research at DIME to develop and test solutions to ensure data work quality across our portfolio, and to make public training and tools available to the larger community of development researchers who might not have the same capabilities.<br />
<br />
[https://www.worldbank.org/en/research/dime DIME] is the World Bank’s impact evaluation department. Part of DIME’s mission is to intensify the production of, and access to, public goods that improve the quantity and quality of global development research, while lowering the costs of performing impact evaluations for the entire research community. The <strong>DIME Wiki</strong> aims to further this initiative, and is funded by the United Kingdom’s Foreign, Commonwealth & Development Office through the i2i Trust Fund.<br />
</div><br />
<br />
</div><br />
<div class="col-md-2"></div><br />
</div><br />
</div><br />
<br />
</div></div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Main_Page&diff=8373Main Page2021-07-30T17:59:51Z<p>Kbjarkefur: </p>
<hr />
<div>__NOTOC__<br />
<br />
<div class="home_top"><br />
<br />
<h2><span class="mw-headline" style="color:black;">Welcome to the DIME Wiki!</span></h2> <br />
<br />
The <strong>DIME Wiki</strong> is a <strong>public good</strong> developed and maintained by <strong>DIME Analytics</strong>. The <strong>DIME Wiki</strong> is designed for researchers and M&E specialists at the World Bank, as well as clients, donor institutions, universities, NGOs, and governments. The <strong>DIME Wiki</strong> is a collaborative, open-source resource that presents guidelines that are easy to understand and apply for users of varying levels of expertise. <br />
<br />
<p style="color:black;text-align:center;"><b>This landing page offers links to several curated categories that users may find helpful, as well as our narrative Handbook.</b></p><br />
<br />
<div class="container bottomspace"><br />
<div class="row"><br />
<div class="col-sm-1"></div><br />
<div class="col-sm-10" style= "width:85% !important;" ><br />
<div class="col-sm-6 rg_red hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Design"><i class="fa fa-sitemap"></i> Research Design</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Experimental Methods]]</li><br />
<li>[[Quasi-Experimental Methods]]</li><br />
<li>[[Research Ethics]]</li><br />
<li>[[Sampling & Power Calculations]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
<div class="col-sm-6 rg_purple hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Data"><i class="fa fa-database"></i> Data Collection</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Primary Data Collection]]</li><br />
<li>[[Secondary Data Sources]]</li><br />
<li>[[Field Management]]</li><br />
<li>[[Questionnaire Design]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
<div class="col-sm-6 rg_green hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Analysis"><i class="fa fa-search"></i> Analysis</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Data Management]]</li><br />
<li>[[Data Cleaning]]</li><br />
<li>[[Data Analysis]]</li><br />
<li>[[Software Tools]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
<div class="col-sm-6 rg_blue hmeBox"><br />
<div><br />
<h3><span class="mw-headline" id="Fieldwork"><i class="fa fa-globe"></i> Publication</span></h3><br />
<div class="m_links"><br />
<ul><br />
<li>[[Reproducible Research]]</li><br />
<li>[[Publishing Data]]</li><br />
<li>[[Collaboration Tools]]</li><br />
<li>[[Dissemination]]</li><br />
</ul><br />
</div><br />
</div><br />
</div><br />
</div><br />
</div><br />
</div><br />
<br />
<div class="well"><br />
<div class="container"><br />
<br />
<h3 class="mega_title"><span class="mw-headline" id="Recent_Contributions"><b>Cross-cutting Resources</b></span></h3><br />
<div class="recent-activity"><br />
<br />
<div class="row"> <br />
<div class="col-md-4"> <br />
<div class="rec_contrib_box"><br />
<br />
<div class="rec_title">[[Stata Coding Practices]]</div><br />
</div><br />
</div><br />
<div class="col-md-4"> <br />
<div class="rec_contrib_box"><br />
<br />
<div class="rec_title">[[SurveyCTO Coding Practices]]</div><br />
</div><br />
</div><br />
<div class="col-md-4"> <br />
<div class="rec_contrib_box"><br />
<br />
<div class="rec_title">[[Check Lists]]</div><br />
</div><br />
</div><br />
</div><!--end .row--><br />
</div><br />
<br />
<br />
<br />
<h3 class="mega_title"><span class="mw-headline" id="Recent_Contributions"><b>Download the Dime Analytics Data Handbook </b></span></h3><br />
<div class="rec_contrib_box"><br />
<div class="dk_bar"><br />
[[File:BookBanner.png|700px||center|link=https://worldbank.github.io/dime-data-handbook/]]<br />
</div><br />
</div><br />
<br />
<div class="random"> <span class="hiddentext"><randompages limit="0" namespace="DIME Wiki" levels="1"></randompages></span></div><br />
<br />
</div><br />
</div><br />
<br />
<div class="policy_intro container"><br />
<div class="row"><br />
<div class="col-md-2"></div><br />
<div class="col-md-8"><br />
<h3 class="mega_title text-center"><b>About DIME and DIME Analytics</b></h3><br />
<div class="policy_desc"><br />
<onlyinclude><br />
[https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics] creates tools that improve the quality of impact evaluation research for all. We take advantage of the concentration and scale of research at DIME to develop and test solutions to ensure data work quality across our portfolio, and to make public training and tools available to the larger community of development researchers who might not have the same capabilities.<br />
<br />
[https://www.worldbank.org/en/research/dime DIME] is the World Bank’s impact evaluation department. Part of DIME’s mission is to intensify the production of, and access to, public goods that improve the quantity and quality of global development research, while lowering the costs of performing impact evaluations for the entire research community. The <strong>DIME Wiki</strong> aims to further this initiative, and is funded by the United Kingdom’s Foreign, Commonwealth & Development Office through the i2i Trust Fund.<br />
</div><br />
<br />
</div><br />
<div class="col-md-2"></div><br />
</div><br />
</div><br />
<br />
</div></div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Reproducible_Research&diff=8371Reproducible Research2021-07-16T15:35:06Z<p>Kbjarkefur: /* Code Publication */</p>
<hr />
<div>'''Reproducible research''' is the system of [[Data Documentation|documenting]] and [[Publishing Data|publishing]] results of an '''impact evaluation'''. At the very least, '''reproducibility''' allows other researchers to [[Data Analysis|analyze]] the same data to get the same results as the original study, which strengthens the conclusions of the original study. It is important to push researchers towards publishing '''reproducible research''' because the path to research findings is just as important as the findings themselves. <br />
==Read First==<br />
* [https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics] has created the [https://github.com/worldbank/dime-standards/tree/master/dime-research-standards/pillar-3-research-reproducibility DIME Research Reproducibility Standards].<br />
* [https://osf.io/wzjtk/ DIME Analytics] has also conducted a [https://osf.io/csmxz/ bootcamp on reproducible research], which covers the various aspects of '''reproducibility'''.<br />
* Well-written [[Master Do-files | master do-files]] are critical to transparent, '''reproducible research'''.<br />
* [[Getting started with GitHub | GitHub repositories]] play a major role in making research reproducible.<br />
* Specialized [[Software Tools#Text Editing Software|text editing]] and [[Collaboration Tools#Paper Writing|collaboration tools]] ensure that output is reproducible.<br />
== Replication and Reproducibility ==<br />
'''Replication''' is a process where different researchers conduct the same study independently in different samples and find similar conclusions. It adds more validity to the conclusions of an '''empirical''' study. However, in most field experiments, the [[Impact Evaluation Team|research team]] cannot create the same conditions for replication. Different populations can respond differently to the same '''treatment''', and replication is often too expensive. <br />
In such cases, the researchers should still try to achieve '''reproducibility'''. There are four key elements of '''reproducible research''' - [[Data Documentation|data documentation]], [[Publishing Data#Preparing for Release|data publication]], [[Publishing Data#Preparing for Release|code publication]], and [[Reproducible Research#Output Publication|output publication]].<br />
<br />
==Data Documentation==<br />
[[Data Documentation | Data documentation]] deals with all aspects of an '''impact evaluation''' - [[Sampling | sampling]], [[Primary Data Collection | data collection]], [[Data Cleaning | cleaning]], and [[Data Analysis | analysis]]. Proper documentation not only produces reproducible [[Publishing Data| data for publication]] in the future , but also ensures [[Data Quality Assurance Plan| high quality data]] in the present. For example, a [[Impact Evaluation Team#Field Coordinators (FCs) | field coordinator (FC)]] may notice that some [[Survey Pilot Participants|respondents]] do not understand a questionnaire because of reading difficulties. If the '''field coordinator (FC)''' does not document this issue, the [[Impact Evaluation Team#Research Assistant | research assistant]] will not flag these observations during [[Data Cleaning | data cleaning]]. And if the [[Impact Evaluation Team#Research Assistant | research assistant]] does not document why the observations were flagged, and what the flag means, it will affect the results of the [[Data Analysis | analysis]].<br />
=== Guidelines ===<br />
Accordingly, in the lead up to, and during [[Primary Data Collection | data collection]], the [[Impact Evaluation Team|research team]] should follow these guidelines for '''data documentation'''. <br />
* '''Comments.''' Use comments in your code to document the reasons for a particular line or group of commands. In [[Stata Coding Practices|Stata]], for instance, use <code>*</code> to insert comments. <br />
* '''Folders.''' Create separate folders to store all documentation related to the project in separate files. For example, in [https://github.com/ Github], the research team can store notes about each folder and its contents under [https://guides.github.com/features/wikis/ README.md].<br />
* '''Consult data collection teams.''' Throughout the process of [[Data Cleaning|data cleaning]], take extensive inputs from the people who are responsible for collecting data. This could be a field team, a government ministry responsible for [[Administrative_and_Monitoring_Data#Administrative Data|administrative data]], or a technology firm that handles [[Remote_Sensing|remote sensing]] data.<br />
* '''Exploratory analysis.''' While '''cleaning''' the data set, look for issues such as '''outliers''', and [[Monitoring Data Quality#High Frequency Checks|data entry errors]] like missing or duplicate values. Record these observations for use during the process of [[Data Documentation#What to Document|variable construction]] and [[Data Analysis|analysis]].<br />
* '''Feedback.''' When researchers submit codes for review, or release data on a public platform (such as the [[Microdata Catalog]]), others may provide feedback, either positive or negative. It is important to document these comments as well, as this can improve the quality of the results of the '''impact evaluation'''. <br />
* '''Corrections.''' Include records of any corrections made to the data, as well as to the code. For example, based on feedback, the research team may realize that they forgot to drop duplicated entries. Publish these corrections in the '''documentation folder''', along with the communications where theses issues were reported. <br />
* '''Confidential information.''' The research team must be careful not to include confidential information, or any information that is not securely stored.<br />
=== Documentation tools ===<br />
There are various tools available for '''data documentation'''. [https://github.com/ GitHub] and [https://osf.io/ Open Science Framework (OSF)] are two such tools. <br />
* '''The Open Science Framework (OSF).''' It supports documentation by allowing users to store files and version histories, and collaborate using [https://osf.io/4znzp/wiki/home/ OSF Wiki pages]. <br />
* '''GitHub.''' This is a useful tool for managing tasks and responsibilities across the research team. Like '''OSF''', '''Git''' also stores every version of every file. It supports documentation through [https://guides.github.com/features/wikis/#creating-your-wiki Wiki pages] and [https://guides.github.com/features/wikis/#creating-a-readme README.md].<br />
<br />
== Data Publication ==<br />
[[Publishing Data|Data publication]] is the public release of all data once the process of [[Primary Data Collection | data collection]] and [[Data Analysis | analysis]] is complete. '''Data publication''' must be accompanied by proper [[Data Documentation|data documentation]]. Ideally, the [[Impact Evaluation Team|research team]] should publish all data that is needed for others to reproduce every step of the original code, from [[Data Cleaning| cleaning]] to [[Data Analysis| analysis]]. However, this may not always be feasible, since data often contains [[Personally Identifiable Information (PII)|personally identifiable information (PII)]] and other confidential information.<br />
=== Guidelines === <br />
The '''research team''' must keep the following things in mind to ensure that the data is well-organized before publishing:<br />
* '''Cleaning.''' Ensure that the data has been [[Data Cleaning | cleaned]] and is [[Data_Cleaning#Applying Labels | well-labelled]]. <br />
* '''Missing variables.''' Make sure the data contains all variables used during [[Data Analysis | data analysis]], and includes uniquely [[ID Variable Properties | identifying variables]]. <br />
* '''De-identification.''' Careful [[De-identification | de-identification]] is important to maintain the privacy of respondents and to meet [[Research Ethics|research ethics standards]]. The '''research team''' must carefully de-identify any sensitive or '''personally-identifying information (PII)''' such as names, locations, or financial records before release. <br />
[https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics] has developed the following resources to help researchers store and organize data for public release.<br />
* '''Iefieldkit.''' <code>[[iefieldkit]]</code> is a Stata package which allows the research team to follow '''best practices''' for [[Data Cleaning|data cleaning]].<br />
* '''Ietoolkit.''' [https://worldbank.github.io/ietoolkit/ <code>ietoolkit</code>] is a Stata package which simplifies the process of [[Data Management|data management]] and [[Data Analysis|analysis]] in '''impact evaluations'''. It allows the research team to organize the raw data.<br />
* '''Data management guidelines.''' The [https://osf.io/b7z6h/ data management guidelines] provide steps on how to organize data for [[Data Cleaning|cleaning]] and [[Data Analysis|analysis]].<br />
* '''DataWork folder.''' The [[DataWork Folder|DataWork folder]] is a standardized folder template for organizing data in a project folder. The raw '''de-identified data''' can be stored in the [[DataWork_Survey_Round#DataSets_Folder|DataSets folder]] of the [[DataWork_Survey_Round|DataWork survey round folder]].<br />
* '''Microdata catalog checklist.''' The [[Checklist: Microdata Catalog submission|microdata catalog checklist]] provides instructions on how to prepare data for release using the [[Microdata Catalog|Microdata catalog]] of the [https://www.worldbank.org/ World Bank]. The [https://microdata.worldbank.org/index.php/home Microdata Library] offers free access to '''microdata''' produced not only by the World Bank, but also other international organizations, statistical agencies, and government organizations.<br />
* '''Data publication standards.''' The [https://github.com/worldbank/dime-standards/tree/master/dime-research-standards/pillar-5-data-publication DIME Data Publication Standards] provide detailed guidelines for preparing data for release.<br />
<br />
=== Data publication tools ===<br />
There are several free software tools that allow the [[Impact Evaluation Team|research team]] to publicly release the data and the associated [[Data Documentation|documentation]], including [https://github.com/ GitHub] and [https://osf.io/ Open Science Framework], and [https://www.researchgate.net/Research Research Gate]. <br />
Each of these platforms can handle organized directories and can provide a static '''uniform resource locator (URL)''' which makes it easy to collaborate with other users. <br />
* '''ResearchGate.''' It allows users to assign a '''digital object identifier (DOI)''' to published work, which they can then share with external researchers for review or '''replication'''.<br />
* '''The Open Science Framework (OSF).''' It is an online platform which allows members of a '''research team''' to store all project data, and even publish reports using [https://osf.io/preprints/ OSF preprints].<br />
* '''DIME survey data.''' [https://www.worldbank.org/en/research/dime DIME] also publishes and releases [[DIME_Datasets_on_Microdata_Catalog| survey data]] through the [[Microdata Catalog]]. However, access to the data may be restricted, and some variables are not allowed to be published.<br />
<br />
== Code Publication==<br />
'''Code publication''' is another key element of '''reproducible research'''. Sometimes academic journals ask for '''reproducible code''' (and data) along with the actual academic paper. Even if they don't, it is a good practice to share codes and data with others. The [[Impact Evaluation Team|research team]] should ensure that external researchers have access to, and can execute the same code and data that was used during the original '''impact evaluation'''. This can be made possible through proper [[Data Documentation|documentation]] and [[Data Management|management]] of data.<br />
=== Guidelines ===<br />
With careful coding, use of [[Master Do-files | master do-files]], and adherence to [[Stata Coding Practices|coding best practices]] the same data and code will yield the same results for any given person. Follow these guidelines when publishing the code:<br />
* '''Master do-files.''' The [[Master Do-files|master do-file]] should set the Stata seed and version to allow replicable [[Sampling|sampling]] and [[Randomization in Stata|randomization]]. By nature, the '''master do-file''' will run project do-files in a pre-specified order, which strengthens '''reproducibility'''. The '''master do-file''' can also be used to list assumptions of a study and list all data sets that are used in the study.<br />
* '''Packages and settings.''' Install all necessary commands and packages in your '''master do-file''' itself. Specify all settings and sort observations frequently to minimize errors. [https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics] has created two packages to help researchers in producing '''reproducible research''' - <code>[[Iefieldkit|iefieldkit]]</code> and <code>ietoolkit</code>.<br />
* '''Globals.''' Create '''globals''' (or global macros) for the root folder and all project folders. '''Globals''' should only be specified in the '''master do-file''' and can be used '''standardizing coefficients''' for the data set that will be used for [[Data Analysis|analysis]].<br />
* '''Shell script.''' If you use different languages or software in the same project, consider using a '''shell script''', which ensure that other users run the different languages or software in the correct order. This would mean that you have a script that you run in your command line that first execute a master script in one language and then a master script in another one or several other languages. This way the shell script becomes the "super master script" that execute several other master scripts in the correct order.<br />
* '''Comments.''' Include '''comments''' (using <code>*</code>) in your code frequently to explain what a line of code (or a group of commands) is doing, and why. For example, if the code drops observations or changes values, explain why this was necessary using comments. This ensures that the code is also easy to understand, and that research is [[Research Ethics#Research Transparency|transparent]].<br />
=== Code publication tools ===<br />
There are several free software tools that allow the [[Impact Evaluation Team|research team]] to publicly release the code, including [https://github.com/ GitHub] and [http://jupyter.org/ Jupyter Notebook]. Users can pick any of these depending on how familiar they are with these tools. There are several pre-publication code review facilities as well.<br />
* '''GitHub.''' It is a free '''version-control''' software. It is popular because users can store every version of every component of a project (like data and code) in '''repositories''' which can be accessed by everyone working in a project. [[Getting started with GitHub |With GitHub repositories]], users can track changes to code in different programming languages, and create [[Data Documentation | documentation]] explaining what changes were made and why. The '''research team''' can then simply share '''Git repositories''' with an external audience which allows others to read and replicate the code as well as the results of an '''impact evaluation'''. <br />
* '''Jupyter Notebook.''' This is another platform where researchers can create and share code in different programming languages, including [https://www.python.org/ Python], [https://www.r-project.org/ R], [https://julialang.org/ Julia], and [https://www.scala-lang.org/ Scala].<br />
* [https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics] has also created a [https://osf.io/m36kg/ sample peer code review form] that researchers can refer to before publishing their code.<br />
To learn more about how to use these tools, users can refer to the following resources:<br />
* [https://services.github.com/on-demand/intro-to-github/ GitHub introductory training]<br />
* [https://guides.github.com/ GitHub guides]<br />
* [https://jupyter.org/documentation Jupyter documentation]<br />
* [https://blog.jupyter.org/ Jupyter blogs]<br />
<br />
== Output Publication ==<br />
The research output is not just a paper or report, but also includes the codes, data, and the documentation. '''Output publication''' is the final aspect of '''reproducible research''' after completing [[Data Documentation|documentation]] and [[Publishing Data|publication]] of data and codes. The [[Impact Evaluation Team|research team]] can follow certain guidelines to ensure their research output is '''reproducible''' and transparent.<br />
* '''Checklist.''' DIME Analytics has created a [https://osf.io/cdxnf/ pre-publication reproducibility checklist] for researchers.<br />
* '''GitHub repos.''' [[Getting started with GitHub | GitHub repositories]] (or repos) allow researchers to track changes to the code, create messages explaining the changes, and make code publicly available for others to read and replicate.<br />
* '''Dynamic documents.''' These are documents which allow researchers to write reports that can automatically display results after [[Data Analysis|analysis]]. This reduces the amount of manual work, and there is also less room for error and manipulation of results.<br />
=== Publication tools ===<br />
There are a wide range of tools that are available for '''output publication'''. Each of them allows users to create '''dynamic documents''' and edit the reports using various programming languages like [https://www.r-project.org/ R], [https://www.stata.com/ Stata], and [https://www.python.org/ Python].<br />
* '''R.''' This language has a feature called [https://rmarkdown.rstudio.com/ R Markdown], which allows users to perform [[Data Analysis|analysis]] using different programming languages, and print the results in the final document along with text to explain the results. <br />
* '''Stata.''' New versions of Stata ([https://www.stata.com/stata15/ version 15] onwards) allow users to [https://www.stata.com/manuals/pdyndoc.pdf create dynamic documents]. The output is usually a PDF file, which contains text, tables and graphs. Whenever there are changes to raw data or in the analysis, the research team only needs to execute one '''do-file''' to create a new document. This improves '''reproducibility''' since users do not have to make changes manually every time.<br />
* '''LaTeX.''' [https://www.latex-project.org/ LaTeX] is a widely used publication tool. It is a '''typesetting system''' that allows users to reference lines of code and outputs such as tables and graphs, and easily update them in a text document. Users can export the results into '''.tex''' format after analyzing the data in their preferred software – using [https://cran.r-project.org/web/packages/stargazer/vignettes/stargazer.pdf stargazer] in '''R''', and packages like <code>[http://repec.org/bocode/e/estout/esttab.html esttab]</code> and <code>[http://repec.org/bocode/o/outreg2.html outreg2]</code> in '''Stata'''. Whenever there are new graphs and tables in the analysis, simply recompile the '''LaTeX''' document with the press of a button in order to include the new graphs and tables. <br />
* '''Overleaf.''' [https://www.overleaf.com/ Overleaf] is a web-based platform that allows users to collaborate on '''LaTeX''', and receive feedback from other researchers. <br />
* '''Jupyter Notebook.''' [http://jupyter.org/ Jupyter Notebook] can create '''dynamic documents''' in various formats like HTML and '''LaTeX'''.<br />
<br />
== Related Pages == <br />
[[Special:WhatLinksHere/Reproducible_Research|Click here for pages that link to this topic.]]<br />
<br />
== Additional Resources ==<br />
* Berkeley Initiative for Transparency in the Social Sciences (BITSS), [http://www.bitss.org/education/manual-of-best-practices/ Manual of Best Practices in Transparent Social Science Research]<br />
* Berkeley Initiative for Transparency in the Social Sciences (BITSS), [https://www.bitss.org/wp-content/uploads/2015/12/Pre-Analysis-Plan-Template.pdf Pre-Analysis Plan template]<br />
* Center for Open Science, [https://cos.io/our-services/top-guidelines/ Transparency and Openness Guidelines]<br />
* Coursera, [https://www.coursera.org/learn/reproducible-research Course on Reproducible Research in R]<br />
* Dataverse (Harvard), [https://dataverse.harvard.edu/dataverse/socialsciencercts Randomized Control Trials in the Social Science Dataverse]<br />
* DIME Analytics (World Bank), [https://osf.io/b7z6h/ Data Management and Cleaning]<br />
* DIME Analytics (World Bank), [https://osf.io/u386j/ Coding for Reproducible Research]<br />
* DIME Analytics (World Bank), [https://osf.io/5ugkv/ Intro to GitHub]<br />
* DIME Analytics (World Bank), [https://osf.io/9fu7r/ Using GitHub]<br />
* DIME Analytics (World Bank), [https://osf.io/ea6dz/ GitHub Flows] <br />
* DIME Analytics (World Bank), [https://osf.io/dtf4a/ Management using GitHub Repositories]<br />
* DIME Analytics (World Bank), [https://osf.io/f3kad/ Initializing and Synchronizing a Git Repo with GitHub Desktop]<br />
* DIME Analytics (World Bank), [https://osf.io/szbwq/ Using Git Flow to Manage Code Projects with GitKraken]<br />
* DIME Analytics (World Bank), [https://osf.io/bjvaf/ Stata Coding] <br />
* Data Colada, [http://datacolada.org/69 Tips for making research findable and reproducible]<br />
* Hua Peng (StataCorp), [https://huapeng01016.github.io/reptalk/#/hua-pengstatacorphpeng Incorporating Stata into Reproducible Documents]<br />
* Innovation for Poverty Action (IPA), [http://www.poverty-action.org/sites/default/files/publications/IPA%27s%20Best%20Practices%20for%20Data%20and%20Code%20Management_Nov2015.pdf Reproducible Research: Best Practices for Data and Code Management] <br />
* Innovation for Poverty Action (IPA), [http://www.poverty-action.org/sites/default/files/Guidelines-for-data-publication.pdf Guidelines for data publication]<br />
* J-PAL, [https://www.povertyactionlab.org/research-resources/transparency-and-reproducibility Transparency and reproducibility] <br />
*Matthew Salganik (Princeton), [http://www.princeton.edu/~mjs3/open_and_reproducible_opr_2017.pdf Open and Reproducible Research: Goals, Obstacles and Solutions]<br />
[[Category: Reproducible Research]]</div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Data_Storage&diff=8321Data Storage2021-06-21T14:37:45Z<p>Kbjarkefur: /* Data retention */</p>
<hr />
<div>This article discusses different aspects of data storage (such as different types of storage, data back up and data retention). It is important to make sure you have appropriate data storage solutions before you start receiving data. You should plan your data storage for the full life-cycle of a project and not just for your immediate needs. Changing data storage solution mid-project can be costly and break the code already written for the project making earlier research outputs non-reproducible. <br />
<br />
== Read First ==<br />
* Data should be version controlled. Ideally the version control system used should both be able to detect changes to the data and provide a possibility to revert to old versions<br />
* Data storage depends on how big the data is, if there is confidential information in the data and how the data will be used.<br />
* All original data (the data sets collected or received by the team) should be backed-up. All derivative data (data sets created from original data) should be created by reproducible code and that code should be version controlled in Git. Then derivative data does not need to be backed-up as it can be re-created.<br />
<br />
== Version control for data ==<br />
Version control is used for two purposes. The first is to keep track of changes and modifications to files and the second one is to provide the possibility to revert a file an old version. For code there is an industry standard that efficiently fulfills these two purposes and that system is called Git. Git can be used to both keep track of changes made to code as well as restore older versions of code. Unfortunately there is no system that as elegantly does both of those two things for data over a longer period of time. There is therefore no industry wide one-size-fits-all solutions for version control of data. This article therefore suggests different methods and the project team should pick the method best for their exact use case.<br />
<br />
=== Version control code that generates data ===<br />
<br />
All derivative datasets (datasets that the project team creates from original data they received or collected) should be generated by reproducible code so that it can be re-created by re-running that code. Therefore, if the original data is properly [[Data_Storage#Back-up_protocols | backed-up]] and all code is version controlled using Git, then the derivative datasets are already implicitly version controlled since changes to derivative datasets can only happen through changes to the code tracked in Git. Git can also be used to restore old version of code that in turn can be used to restore old versions of the derivative data.<br />
<br />
While this method often is an excellent option to version control derivative data, it does not work when the original data is updated frequently (ongoing data collection or when data is continuously received) or when the code is not accessible (someone else is generating the data). Derivative data should, in these cases, still be generated using reproducible code tracked in Git, but the effect is no longer that those derivative datasets are implicitly properly version controlled.<br />
<br />
=== Version control using checksums/hashes ===<br />
<br />
One way to keep track of changes done to data is to use ''checksums'', ''hashes'' or ''data signatures''. These three concepts all work slightly differently but all follows the same principle and serves the same purpose in the context of version control of data. The principle they all follow is that a datasets is boiled down to a short string of text or a number. The same data is always boiled down to the same string or number, but even small differences to the data boils down to a different string or number. That string or number can then quickly and effortlessly be compared across datasets to test if they are identical or not.<br />
<br />
Most checksums or hashes boils down to a string or a number that has no interpretable meaning to humans, it is just meant to serve the purpose of answering the yes/no question if the two datasets are identical or not. It is usually impossible to tell from a has or a checksum if the files are almost identical or very different. One exception to this is the Stata command <code>datasignature</code> that generates a signature that combines a hash value with some basic data on the dataset such as number of observations. While this provides some useful human readable information, it can only be used on Stata <code>.dta</code> files.<br />
<br />
The drawback of all these methods is that it says nothing about how the dataset was created. So there is no way to recreate a dataset based only on the checksum or hash signature. If you have the checksum of a dataset used in the past but not the corresponding version of the dataset, then all you can do is to say that dataset was identical to the current version of the same dataset or not. You cannot say what differs if they are different. If you used the Stata command <code>datasignature</code> you can get some clues of what differs, such as number of observations or number of variables, but you will still not be able to recreate the old version of the data.<br />
<br />
However, this method is a good fit for when you want to have a quick way to test that the dataset has not changed, and details of what has changed either does not matter or you have another way to find out (perhaps manually) what those differences are. Examples of this can be if you are accessing someone else's dataset and you want to know if that person makes changes to that dataset. Another way this can be used is to check which if any datasets are updated when the code is updated. Sometimes we want to make changes to the code that should not change the dataset it generates, and then this method is a great way to verify that.<br />
<br />
=== Version control in sync software ===<br />
<br />
[[Data_Storage#File_sync_services | File syncing software]] often have version control systems that allows you to both detect changes made to data files and allows you to restore old versions of those files. However, the way they it is done in these systems is so storage in-efficient that you can only restore files version that are less than a few months old. If your project is completed with that time frame, then this is a great solution, however, typically a project runs for much longer than that.<br />
<br />
== Data storage types ==<br />
There are different types of data storage option and which one that is best for each use case typically depends on how big the data is, if there is confidential information in the data and how the data will be used. Below are some different storage types with their pros and cons. The type of data that is considered in this article are data files, and not, for example, data stored in data bases. <br />
<br />
=== File sync services ===<br />
<br />
Common examples of file sync services are DropBox, OneDrive, Box among several others. This storage type is suitable when the data files required for a project is not bigger than that they can be stored on a regular laptop or desktop computer. In file sync services a copy of the file exists locally on all computers that has the folder synced. This means that accessing the file is quick and that there is never a problem for multiple people to access the file at the same time as they have their own version of the file. <br />
<br />
However, because all files are saved locally this also means that if one person works on many projects there is a risk that there is not space for all files for those projects on that persons computer. Many file syncing services have options to mitigate this. For example you can chose to not sync the parts of the project folder that you know you do not need access to. Some sync services also allow you to have non-synced files ready to be downloaded on demand (sometimes referred to as ''smart sync''). However, this is not great for data work as data files are usually to big to be instantly downloaded on demand when your code trying to access them leading to your code to crash when trying to read the file.<br />
<br />
Another concern with file syncing services are privacy concerns when sharing confidential data. There are enterprise subscriptions that can be installed on your own servers. This makes sharing data much more secure as the data is never stored or transferred on the servers that belong to the company that offers the syncing service. If you do not have an enterprise subscription or are not sure if you do, you should always [[encryption| encrypt ]] any non-public data before saving it in a synced folder. DIME Analytics have published [https://github.com/worldbank/dime-standards/blob/master/dime-research-standards/pillar-4-data-security/data-security-resources/veracrypt-guidelines.md guidelines to encrypt files with Veracrypt] which is a secure and free software often used for this purpose.<br />
<br />
=== Cloud storage ===<br />
<br />
Cloud storage comes in many different kinds and it is out of the scope of this wiki article to describe them all. This article will only cover general points about cloud storage. This article only covers data stored in files, so cloud storage here typically means [https://aws.amazon.com/s3/ S3 storage on AWS] or [https://docs.microsoft.com/en-us/azure/storage/blobs/ Blob storage on Azure], but are not limited to those.<br />
<br />
The benefit of cloud storage is that you do not need to buy or upgrade any hardware. To get started or to increase your capacity you only need to click a button or it might even be upgraded automatically. However, you will be charged more the more you use. <br />
<br />
Cloud storage might not be the best option when the data will frequently be accessed from laptops and desktops. The download time will be too long even on a fast internet connection for most use cases if the data is downloaded each time the script is run. Instead, cloud storage is best used when frequently accessed from another cloud resource. <br />
<br />
However, make sure that the other cloud storage is in the same physical location as the cloud resource that will use it. The same cloud provider have data centers across the world and you might have issues if your storage is on the other side of the world from the resource that will use it. There are ways to address these issues but they are outside of the scope of this article. The main takeaway is that cloud storage is only a great solution if the resources that needs access to that have a very quick connection to them.<br />
<br />
At the same time, this does not mean that cloud storage should never be accessed from a laptop or a desktop. Cloud storage can be a great place to store data that is infrequently accessed by users on regular computers. There will still be a delay, but that can be acceptable as the data is only infrequently accessed.<br />
<br />
=== Network drive storage ===<br />
<br />
Network drive storage is in many way similar to the types of cloud storage that we discussed above. The major difference for the context discussed here is that they can be located in the same location as a user accessing them from a laptop/desktop making it a viable option for such a user to access it frequently. This is a great solution when a regular laptop/desktop frequently needs access to the data and the combined size of all of the data files are too big to all be stored on the laptop/desktop.<br />
<br />
The problem with this solution is that you need to purchase hardware and run your own network drive. This can be complicated if you do not work at a large organization where there is an IT team that can do this for you. Another drawback is that if you work on an organization that is spread out geographically then you will have the same issues as with cloud storage with slow access speed.<br />
<br />
== Back-up protocols ==<br />
<br />
The DIME Analytics recommendation for data back up starts by classifying all data into either original data or derivative data. Original data is data that was received or collected by the project team, and derivative data is data created from the original data using reproducible code. Original data must be backed up, but derivative data does not have to be backed up as long as the code used to create it is backed up (typically through GitHub or another version control system for code). If the original data and the code is backed up, then the derivative data is implicitly backed up as well. By this method, backing up every single derivative data file which would be storage space inefficient is avoided.<br />
<br />
There are many good alternative data back-up protocols but a great place to start for anyone who has not adapted a back-up protocol for their projects yet is the [https://www.backblaze.com/blog/the-3-2-1-backup-strategy 3-2-1 strategy]. This strategy means that the project has 3 copies of the data, where 2 are easily accessible and 1 is stored on an remote location. This means that one copy, called the first copy, is the only copy that the code ever should read the data from. This copy can be shared in a file syncing service. If the original data file has [[Personally_Identifiable_Information_(PII) | PII information]] (which they often do) the file needs to be encrypted at all times. A de-identified version of this data that does not need to be encrypted can be created for easier day-to-day use.<br />
<br />
If the first copy of the data is accidentally deleted or modified then the team will access the second copy and make a new first copy of the second copy. The second copy could, for example, be stored on a external drive. This external drive can be stored in the office of someone in the project team. If the data has PII data it must be encrypted unless the external drive is stored in a locked safe. When the team has access to a secure physical safe, then it is a good idea to keep the data unencrypted on the drive that is stored locked in the safe. This is to reduce the risk of loosing data due to lost encryption keys. If an external hard drive is not an option, then the second copy can be stored in any other file storage that is not the same service as the first copy.<br />
<br />
Finally, the third copy should be stored off-site. In today's world this usually means somewhere in the cloud. The second copy can also be stored in the cloud, but the difference from the third copy is that the second copy may be stored locally. The third copy may never be stored locally. This is meant to make sure that if theft, fire, disaster etc. or anything similar happens, then the third copy should be stored in a way there is no way it is affected by the same event. The other important consideration for the third copy storage is that it is a suitable long term storage. This means that this storage should not be affected by a single person leaving the team or if it does there should be a clear protocol for how access is maintained when that person is leaving the team. If you are using a sync service for the third copy it should ideally be a different sync service than for the first copy. Regardless which sync service you are using, if you are using a sync service you should never have the third copy synced to any computer. This file should live in an online-only folder. Alternatives to sync services for the third copy would be S3 on AWS or Blob storage on Azure. If you are using such services, see if you can make these files read-only so there is no chance that they are accidentally deleted or modified.<br />
<br />
== Data retention ==<br />
<br />
The project team should have a written document that clearly regulates how long identified data should be kept by the research team before it is permanently destroyed. It is a bad, but unfortunately common, practice that research teams keeps identified datasets without any timeline on when to delete them. Note that it is not a bad practice to keep or publish de-identified datasets, but best data privacy practice is to have a clear timeline for when datasets should be deleted. The document regulating this timeline is called a "data retention policy". There is a tradeoff between the length of time that the data will be retained and the right to privacy of the human subject participating in your research. Therefore, a data retention policy should not allow the research team to keep the data longer than what the research team can justify by linking it to a research utility. Keeping identified data just for the sake that it might be good to have one day is bad data privacy practice.<br />
<br />
Here is a [https://kb.mit.edu/confluence/display/istcontrib/Removing+Sensitive+Data good guide] for how to make sure that data is actually deleted:<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[*topic name, as listed on main page*]]<br />
<br />
<br />
== Additional Resources ==<br />
* Pillar 4 in the [https://github.com/worldbank/dime-standards DIME Research Standards] covers data security<br />
<br />
[[Category: Data Management]]</div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Data_Storage&diff=8320Data Storage2021-06-07T19:58:55Z<p>Kbjarkefur: </p>
<hr />
<div>This article discusses different aspects of data storage (such as different types of storage, data back up and data retention). It is important to make sure you have appropriate data storage solutions before you start receiving data. You should plan your data storage for the full life-cycle of a project and not just for your immediate needs. Changing data storage solution mid-project can be costly and break the code already written for the project making earlier research outputs non-reproducible. <br />
<br />
== Read First ==<br />
* Data should be version controlled. Ideally the version control system used should both be able to detect changes to the data and provide a possibility to revert to old versions<br />
* Data storage depends on how big the data is, if there is confidential information in the data and how the data will be used.<br />
* All original data (the data sets collected or received by the team) should be backed-up. All derivative data (data sets created from original data) should be created by reproducible code and that code should be version controlled in Git. Then derivative data does not need to be backed-up as it can be re-created.<br />
<br />
== Version control for data ==<br />
Version control is used for two purposes. The first is to keep track of changes and modifications to files and the second one is to provide the possibility to revert a file an old version. For code there is an industry standard that efficiently fulfills these two purposes and that system is called Git. Git can be used to both keep track of changes made to code as well as restore older versions of code. Unfortunately there is no system that as elegantly does both of those two things for data over a longer period of time. There is therefore no industry wide one-size-fits-all solutions for version control of data. This article therefore suggests different methods and the project team should pick the method best for their exact use case.<br />
<br />
=== Version control code that generates data ===<br />
<br />
All derivative datasets (datasets that the project team creates from original data they received or collected) should be generated by reproducible code so that it can be re-created by re-running that code. Therefore, if the original data is properly [[Data_Storage#Back-up_protocols | backed-up]] and all code is version controlled using Git, then the derivative datasets are already implicitly version controlled since changes to derivative datasets can only happen through changes to the code tracked in Git. Git can also be used to restore old version of code that in turn can be used to restore old versions of the derivative data.<br />
<br />
While this method often is an excellent option to version control derivative data, it does not work when the original data is updated frequently (ongoing data collection or when data is continuously received) or when the code is not accessible (someone else is generating the data). Derivative data should, in these cases, still be generated using reproducible code tracked in Git, but the effect is no longer that those derivative datasets are implicitly properly version controlled.<br />
<br />
=== Version control using checksums/hashes ===<br />
<br />
One way to keep track of changes done to data is to use ''checksums'', ''hashes'' or ''data signatures''. These three concepts all work slightly differently but all follows the same principle and serves the same purpose in the context of version control of data. The principle they all follow is that a datasets is boiled down to a short string of text or a number. The same data is always boiled down to the same string or number, but even small differences to the data boils down to a different string or number. That string or number can then quickly and effortlessly be compared across datasets to test if they are identical or not.<br />
<br />
Most checksums or hashes boils down to a string or a number that has no interpretable meaning to humans, it is just meant to serve the purpose of answering the yes/no question if the two datasets are identical or not. It is usually impossible to tell from a has or a checksum if the files are almost identical or very different. One exception to this is the Stata command <code>datasignature</code> that generates a signature that combines a hash value with some basic data on the dataset such as number of observations. While this provides some useful human readable information, it can only be used on Stata <code>.dta</code> files.<br />
<br />
The drawback of all these methods is that it says nothing about how the dataset was created. So there is no way to recreate a dataset based only on the checksum or hash signature. If you have the checksum of a dataset used in the past but not the corresponding version of the dataset, then all you can do is to say that dataset was identical to the current version of the same dataset or not. You cannot say what differs if they are different. If you used the Stata command <code>datasignature</code> you can get some clues of what differs, such as number of observations or number of variables, but you will still not be able to recreate the old version of the data.<br />
<br />
However, this method is a good fit for when you want to have a quick way to test that the dataset has not changed, and details of what has changed either does not matter or you have another way to find out (perhaps manually) what those differences are. Examples of this can be if you are accessing someone else's dataset and you want to know if that person makes changes to that dataset. Another way this can be used is to check which if any datasets are updated when the code is updated. Sometimes we want to make changes to the code that should not change the dataset it generates, and then this method is a great way to verify that.<br />
<br />
=== Version control in sync software ===<br />
<br />
[[Data_Storage#File_sync_services | File syncing software]] often have version control systems that allows you to both detect changes made to data files and allows you to restore old versions of those files. However, the way they it is done in these systems is so storage in-efficient that you can only restore files version that are less than a few months old. If your project is completed with that time frame, then this is a great solution, however, typically a project runs for much longer than that.<br />
<br />
== Data storage types ==<br />
There are different types of data storage option and which one that is best for each use case typically depends on how big the data is, if there is confidential information in the data and how the data will be used. Below are some different storage types with their pros and cons. The type of data that is considered in this article are data files, and not, for example, data stored in data bases. <br />
<br />
=== File sync services ===<br />
<br />
Common examples of file sync services are DropBox, OneDrive, Box among several others. This storage type is suitable when the data files required for a project is not bigger than that they can be stored on a regular laptop or desktop computer. In file sync services a copy of the file exists locally on all computers that has the folder synced. This means that accessing the file is quick and that there is never a problem for multiple people to access the file at the same time as they have their own version of the file. <br />
<br />
However, because all files are saved locally this also means that if one person works on many projects there is a risk that there is not space for all files for those projects on that persons computer. Many file syncing services have options to mitigate this. For example you can chose to not sync the parts of the project folder that you know you do not need access to. Some sync services also allow you to have non-synced files ready to be downloaded on demand (sometimes referred to as ''smart sync''). However, this is not great for data work as data files are usually to big to be instantly downloaded on demand when your code trying to access them leading to your code to crash when trying to read the file.<br />
<br />
Another concern with file syncing services are privacy concerns when sharing confidential data. There are enterprise subscriptions that can be installed on your own servers. This makes sharing data much more secure as the data is never stored or transferred on the servers that belong to the company that offers the syncing service. If you do not have an enterprise subscription or are not sure if you do, you should always [[encryption| encrypt ]] any non-public data before saving it in a synced folder. DIME Analytics have published [https://github.com/worldbank/dime-standards/blob/master/dime-research-standards/pillar-4-data-security/data-security-resources/veracrypt-guidelines.md guidelines to encrypt files with Veracrypt] which is a secure and free software often used for this purpose.<br />
<br />
=== Cloud storage ===<br />
<br />
Cloud storage comes in many different kinds and it is out of the scope of this wiki article to describe them all. This article will only cover general points about cloud storage. This article only covers data stored in files, so cloud storage here typically means [https://aws.amazon.com/s3/ S3 storage on AWS] or [https://docs.microsoft.com/en-us/azure/storage/blobs/ Blob storage on Azure], but are not limited to those.<br />
<br />
The benefit of cloud storage is that you do not need to buy or upgrade any hardware. To get started or to increase your capacity you only need to click a button or it might even be upgraded automatically. However, you will be charged more the more you use. <br />
<br />
Cloud storage might not be the best option when the data will frequently be accessed from laptops and desktops. The download time will be too long even on a fast internet connection for most use cases if the data is downloaded each time the script is run. Instead, cloud storage is best used when frequently accessed from another cloud resource. <br />
<br />
However, make sure that the other cloud storage is in the same physical location as the cloud resource that will use it. The same cloud provider have data centers across the world and you might have issues if your storage is on the other side of the world from the resource that will use it. There are ways to address these issues but they are outside of the scope of this article. The main takeaway is that cloud storage is only a great solution if the resources that needs access to that have a very quick connection to them.<br />
<br />
At the same time, this does not mean that cloud storage should never be accessed from a laptop or a desktop. Cloud storage can be a great place to store data that is infrequently accessed by users on regular computers. There will still be a delay, but that can be acceptable as the data is only infrequently accessed.<br />
<br />
=== Network drive storage ===<br />
<br />
Network drive storage is in many way similar to the types of cloud storage that we discussed above. The major difference for the context discussed here is that they can be located in the same location as a user accessing them from a laptop/desktop making it a viable option for such a user to access it frequently. This is a great solution when a regular laptop/desktop frequently needs access to the data and the combined size of all of the data files are too big to all be stored on the laptop/desktop.<br />
<br />
The problem with this solution is that you need to purchase hardware and run your own network drive. This can be complicated if you do not work at a large organization where there is an IT team that can do this for you. Another drawback is that if you work on an organization that is spread out geographically then you will have the same issues as with cloud storage with slow access speed.<br />
<br />
== Back-up protocols ==<br />
<br />
The DIME Analytics recommendation for data back up starts by classifying all data into either original data or derivative data. Original data is data that was received or collected by the project team, and derivative data is data created from the original data using reproducible code. Original data must be backed up, but derivative data does not have to be backed up as long as the code used to create it is backed up (typically through GitHub or another version control system for code). If the original data and the code is backed up, then the derivative data is implicitly backed up as well. By this method, backing up every single derivative data file which would be storage space inefficient is avoided.<br />
<br />
There are many good alternative data back-up protocols but a great place to start for anyone who has not adapted a back-up protocol for their projects yet is the [https://www.backblaze.com/blog/the-3-2-1-backup-strategy 3-2-1 strategy]. This strategy means that the project has 3 copies of the data, where 2 are easily accessible and 1 is stored on an remote location. This means that one copy, called the first copy, is the only copy that the code ever should read the data from. This copy can be shared in a file syncing service. If the original data file has [[Personally_Identifiable_Information_(PII) | PII information]] (which they often do) the file needs to be encrypted at all times. A de-identified version of this data that does not need to be encrypted can be created for easier day-to-day use.<br />
<br />
If the first copy of the data is accidentally deleted or modified then the team will access the second copy and make a new first copy of the second copy. The second copy could, for example, be stored on a external drive. This external drive can be stored in the office of someone in the project team. If the data has PII data it must be encrypted unless the external drive is stored in a locked safe. When the team has access to a secure physical safe, then it is a good idea to keep the data unencrypted on the drive that is stored locked in the safe. This is to reduce the risk of loosing data due to lost encryption keys. If an external hard drive is not an option, then the second copy can be stored in any other file storage that is not the same service as the first copy.<br />
<br />
Finally, the third copy should be stored off-site. In today's world this usually means somewhere in the cloud. The second copy can also be stored in the cloud, but the difference from the third copy is that the second copy may be stored locally. The third copy may never be stored locally. This is meant to make sure that if theft, fire, disaster etc. or anything similar happens, then the third copy should be stored in a way there is no way it is affected by the same event. The other important consideration for the third copy storage is that it is a suitable long term storage. This means that this storage should not be affected by a single person leaving the team or if it does there should be a clear protocol for how access is maintained when that person is leaving the team. If you are using a sync service for the third copy it should ideally be a different sync service than for the first copy. Regardless which sync service you are using, if you are using a sync service you should never have the third copy synced to any computer. This file should live in an online-only folder. Alternatives to sync services for the third copy would be S3 on AWS or Blob storage on Azure. If you are using such services, see if you can make these files read-only so there is no chance that they are accidentally deleted or modified.<br />
<br />
== Data retention ==<br />
<br />
The project team should have a written document that clearly regulates how long identified data should be kept by the research team before it is permanently destroyed. It is a bad, but unfortunately common, practice that research teams keeps identified datasets without any timeline on when to delete them. Note that it is not a bad practice to keep or publish de-identified datasets, but best data privacy practice is to have a clear timeline for when datasets should be deleted. The document regulating this timeline is called a "data retention policy". There is a tradeoff between the length of time that the data will be retained and the right to privacy of the human subject participating in your research. Therefore, a data retention policy should not allow the research team to keep the data longer than what the research team can justify by linking it to a research utility. Keeping identified data just for the sake that it might be good to have one day is bad data privacy practice.<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[*topic name, as listed on main page*]]<br />
<br />
<br />
== Additional Resources ==<br />
* Pillar 4 in the [https://github.com/worldbank/dime-standards DIME Research Standards] covers data security<br />
<br />
[[Category: Data Management]]</div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Data_Storage&diff=8319Data Storage2021-06-07T19:53:43Z<p>Kbjarkefur: /* Data retention */</p>
<hr />
<div>This article discuss different aspects of data storage (such as different types of storage, data back up and data retention). While [[Data_Security]] is a very important topic related to data storage, it is not covered in this article as there is a dedicated link for that.<br />
<br />
<br />
<br />
== Read First ==<br />
* Data should be version controlled. Ideally the version control system used should both be able to detect changes to the data and provide a possibility to revert to old versions<br />
* Data storage depends on how big the data is, if there is confidential information in the data and how the data will be used.<br />
* All original data (the data sets collected or received by the team) should be backed-up. All derivative data (data sets created from original data) should be created by reproducible code and that code should be version controlled in Git. Then derivative data does not need to be backed-up as it can be re-created.<br />
<br />
== Version control for data ==<br />
Version control is used for two purposes. The first is to keep track of changes and modifications to files and the second one is to provide the possibility to revert a file an old version. For code there is an industry standard that efficiently fulfills these two purposes and that system is called Git. Git can be used to both keep track of changes made to code as well as restore older versions of code. Unfortunately there is no system that as elegantly does both of those two things for data over a longer period of time. There is therefore no industry wide one-size-fits-all solutions for version control of data. This article therefore suggests different methods and the project team should pick the method best for their exact use case.<br />
<br />
=== Version control code that generates data ===<br />
<br />
All derivative datasets (datasets that the project team creates from original data they received or collected) should be generated by reproducible code so that it can be re-created by re-running that code. Therefore, if the original data is properly [[Data_Storage#Back-up_protocols | backed-up]] and all code is version controlled using Git, then the derivative datasets are already implicitly version controlled since changes to derivative datasets can only happen through changes to the code tracked in Git. Git can also be used to restore old version of code that in turn can be used to restore old versions of the derivative data.<br />
<br />
While this method often is an excellent option to version control derivative data, it does not work when the original data is updated frequently (ongoing data collection or when data is continuously received) or when the code is not accessible (someone else is generating the data). Derivative data should, in these cases, still be generated using reproducible code tracked in Git, but the effect is no longer that those derivative datasets are implicitly properly version controlled.<br />
<br />
=== Version control using checksums/hashes ===<br />
<br />
One way to keep track of changes done to data is to use ''checksums'', ''hashes'' or ''data signatures''. These three concepts all work slightly differently but all follows the same principle and serves the same purpose in the context of version control of data. The principle they all follow is that a datasets is boiled down to a short string of text or a number. The same data is always boiled down to the same string or number, but even small differences to the data boils down to a different string or number. That string or number can then quickly and effortlessly be compared across datasets to test if they are identical or not.<br />
<br />
Most checksums or hashes boils down to a string or a number that has no interpretable meaning to humans, it is just meant to serve the purpose of answering the yes/no question if the two datasets are identical or not. It is usually impossible to tell from a has or a checksum if the files are almost identical or very different. One exception to this is the Stata command <code>datasignature</code> that generates a signature that combines a hash value with some basic data on the dataset such as number of observations. While this provides some useful human readable information, it can only be used on Stata <code>.dta</code> files.<br />
<br />
The drawback of all these methods is that it says nothing about how the dataset was created. So there is no way to recreate a dataset based only on the checksum or hash signature. If you have the checksum of a dataset used in the past but not the corresponding version of the dataset, then all you can do is to say that dataset was identical to the current version of the same dataset or not. You cannot say what differs if they are different. If you used the Stata command <code>datasignature</code> you can get some clues of what differs, such as number of observations or number of variables, but you will still not be able to recreate the old version of the data.<br />
<br />
However, this method is a good fit for when you want to have a quick way to test that the dataset has not changed, and details of what has changed either does not matter or you have another way to find out (perhaps manually) what those differences are. Examples of this can be if you are accessing someone else's dataset and you want to know if that person makes changes to that dataset. Another way this can be used is to check which if any datasets are updated when the code is updated. Sometimes we want to make changes to the code that should not change the dataset it generates, and then this method is a great way to verify that.<br />
<br />
=== Version control in sync software ===<br />
<br />
[[Data_Storage#File_sync_services | File syncing software]] often have version control systems that allows you to both detect changes made to data files and allows you to restore old versions of those files. However, the way they it is done in these systems is so storage in-efficient that you can only restore files version that are less than a few months old. If your project is completed with that time frame, then this is a great solution, however, typically a project runs for much longer than that.<br />
<br />
== Data storage types ==<br />
There are different types of data storage option and which one that is best for each use case typically depends on how big the data is, if there is confidential information in the data and how the data will be used. Below are some different storage types with their pros and cons. The type of data that is considered in this article are data files, and not, for example, data stored in data bases. <br />
<br />
=== File sync services ===<br />
<br />
Common examples of file sync services are DropBox, OneDrive, Box among several others. This storage type is suitable when the data files required for a project is not bigger than that they can be stored on a regular laptop or desktop computer. In file sync services a copy of the file exists locally on all computers that has the folder synced. This means that accessing the file is quick and that there is never a problem for multiple people to access the file at the same time as they have their own version of the file. <br />
<br />
However, because all files are saved locally this also means that if one person works on many projects there is a risk that there is not space for all files for those projects on that persons computer. Many file syncing services have options to mitigate this. For example you can chose to not sync the parts of the project folder that you know you do not need access to. Some sync services also allow you to have non-synced files ready to be downloaded on demand (sometimes referred to as ''smart sync''). However, this is not great for data work as data files are usually to big to be instantly downloaded on demand when your code trying to access them leading to your code to crash when trying to read the file.<br />
<br />
Another concern with file syncing services are privacy concerns when sharing confidential data. There are enterprise subscriptions that can be installed on your own servers. This makes sharing data much more secure as the data is never stored or transferred on the servers that belong to the company that offers the syncing service. If you do not have an enterprise subscription or are not sure if you do, you should always [[encryption| encrypt ]] any non-public data before saving it in a synced folder. DIME Analytics have published [https://github.com/worldbank/dime-standards/blob/master/dime-research-standards/pillar-4-data-security/data-security-resources/veracrypt-guidelines.md guidelines to encrypt files with Veracrypt] which is a secure and free software often used for this purpose.<br />
<br />
=== Cloud storage ===<br />
<br />
Cloud storage comes in many different kinds and it is out of the scope of this wiki article to describe them all. This article will only cover general points about cloud storage. This article only covers data stored in files, so cloud storage here typically means [https://aws.amazon.com/s3/ S3 storage on AWS] or [https://docs.microsoft.com/en-us/azure/storage/blobs/ Blob storage on Azure], but are not limited to those.<br />
<br />
The benefit of cloud storage is that you do not need to buy or upgrade any hardware. To get started or to increase your capacity you only need to click a button or it might even be upgraded automatically. However, you will be charged more the more you use. <br />
<br />
Cloud storage might not be the best option when the data will frequently be accessed from laptops and desktops. The download time will be too long even on a fast internet connection for most use cases if the data is downloaded each time the script is run. Instead, cloud storage is best used when frequently accessed from another cloud resource. <br />
<br />
However, make sure that the other cloud storage is in the same physical location as the cloud resource that will use it. The same cloud provider have data centers across the world and you might have issues if your storage is on the other side of the world from the resource that will use it. There are ways to address these issues but they are outside of the scope of this article. The main takeaway is that cloud storage is only a great solution if the resources that needs access to that have a very quick connection to them.<br />
<br />
At the same time, this does not mean that cloud storage should never be accessed from a laptop or a desktop. Cloud storage can be a great place to store data that is infrequently accessed by users on regular computers. There will still be a delay, but that can be acceptable as the data is only infrequently accessed.<br />
<br />
=== Network drive storage ===<br />
<br />
Network drive storage is in many way similar to the types of cloud storage that we discussed above. The major difference for the context discussed here is that they can be located in the same location as a user accessing them from a laptop/desktop making it a viable option for such a user to access it frequently. This is a great solution when a regular laptop/desktop frequently needs access to the data and the combined size of all of the data files are too big to all be stored on the laptop/desktop.<br />
<br />
The problem with this solution is that you need to purchase hardware and run your own network drive. This can be complicated if you do not work at a large organization where there is an IT team that can do this for you. Another drawback is that if you work on an organization that is spread out geographically then you will have the same issues as with cloud storage with slow access speed.<br />
<br />
== Back-up protocols ==<br />
<br />
The DIME Analytics recommendation for data back up starts by classifying all data into either original data or derivative data. Original data is data that was received or collected by the project team, and derivative data is data created from the original data using reproducible code. Original data must be backed up, but derivative data does not have to be backed up as long as the code used to create it is backed up (typically through GitHub or another version control system for code). If the original data and the code is backed up, then the derivative data is implicitly backed up as well. By this method, backing up every single derivative data file which would be storage space inefficient is avoided.<br />
<br />
There are many good alternative data back-up protocols but a great place to start for anyone who has not adapted a back-up protocol for their projects yet is the [https://www.backblaze.com/blog/the-3-2-1-backup-strategy 3-2-1 strategy]. This strategy means that the project has 3 copies of the data, where 2 are easily accessible and 1 is stored on an remote location. This means that one copy, called the first copy, is the only copy that the code ever should read the data from. This copy can be shared in a file syncing service. If the original data file has [[Personally_Identifiable_Information_(PII) | PII information]] (which they often do) the file needs to be encrypted at all times. A de-identified version of this data that does not need to be encrypted can be created for easier day-to-day use.<br />
<br />
If the first copy of the data is accidentally deleted or modified then the team will access the second copy and make a new first copy of the second copy. The second copy could, for example, be stored on a external drive. This external drive can be stored in the office of someone in the project team. If the data has PII data it must be encrypted unless the external drive is stored in a locked safe. When the team has access to a secure physical safe, then it is a good idea to keep the data unencrypted on the drive that is stored locked in the safe. This is to reduce the risk of loosing data due to lost encryption keys. If an external hard drive is not an option, then the second copy can be stored in any other file storage that is not the same service as the first copy.<br />
<br />
Finally, the third copy should be stored off-site. In today's world this usually means somewhere in the cloud. The second copy can also be stored in the cloud, but the difference from the third copy is that the second copy may be stored locally. The third copy may never be stored locally. This is meant to make sure that if theft, fire, disaster etc. or anything similar happens, then the third copy should be stored in a way there is no way it is affected by the same event. The other important consideration for the third copy storage is that it is a suitable long term storage. This means that this storage should not be affected by a single person leaving the team or if it does there should be a clear protocol for how access is maintained when that person is leaving the team. If you are using a sync service for the third copy it should ideally be a different sync service than for the first copy. Regardless which sync service you are using, if you are using a sync service you should never have the third copy synced to any computer. This file should live in an online-only folder. Alternatives to sync services for the third copy would be S3 on AWS or Blob storage on Azure. If you are using such services, see if you can make these files read-only so there is no chance that they are accidentally deleted or modified.<br />
<br />
== Data retention ==<br />
<br />
The project team should have a written document that clearly regulates how long identified data should be kept by the research team before it is permanently destroyed. It is a bad, but unfortunately common, practice that research teams keeps identified datasets without any timeline on when to delete them. Note that it is not a bad practice to keep or publish de-identified datasets, but best data privacy practice is to have a clear timeline for when datasets should be deleted. The document regulating this timeline is called a "data retention policy". There is a tradeoff between the length of time that the data will be retained and the right to privacy of the human subject participating in your research. Therefore, a data retention policy should not allow the research team to keep the data longer than what the research team can justify by linking it to a research utility. Keeping identified data just for the sake that it might be good to have one day is bad data privacy practice.<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[*topic name, as listed on main page*]]<br />
<br />
<br />
== Additional Resources ==<br />
* Pillar 4 in the [https://github.com/worldbank/dime-standards DIME Research Standards] covers data security<br />
<br />
[[Category: Data Management]]</div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Data_Storage&diff=8318Data Storage2021-06-07T19:20:07Z<p>Kbjarkefur: /* Additional Resources */</p>
<hr />
<div>This article discuss different aspects of data storage (such as different types of storage, data back up and data retention). While [[Data_Security]] is a very important topic related to data storage, it is not covered in this article as there is a dedicated link for that.<br />
<br />
<br />
<br />
== Read First ==<br />
* Data should be version controlled. Ideally the version control system used should both be able to detect changes to the data and provide a possibility to revert to old versions<br />
* Data storage depends on how big the data is, if there is confidential information in the data and how the data will be used.<br />
* All original data (the data sets collected or received by the team) should be backed-up. All derivative data (data sets created from original data) should be created by reproducible code and that code should be version controlled in Git. Then derivative data does not need to be backed-up as it can be re-created.<br />
<br />
== Version control for data ==<br />
Version control is used for two purposes. The first is to keep track of changes and modifications to files and the second one is to provide the possibility to revert a file an old version. For code there is an industry standard that efficiently fulfills these two purposes and that system is called Git. Git can be used to both keep track of changes made to code as well as restore older versions of code. Unfortunately there is no system that as elegantly does both of those two things for data over a longer period of time. There is therefore no industry wide one-size-fits-all solutions for version control of data. This article therefore suggests different methods and the project team should pick the method best for their exact use case.<br />
<br />
=== Version control code that generates data ===<br />
<br />
All derivative datasets (datasets that the project team creates from original data they received or collected) should be generated by reproducible code so that it can be re-created by re-running that code. Therefore, if the original data is properly [[Data_Storage#Back-up_protocols | backed-up]] and all code is version controlled using Git, then the derivative datasets are already implicitly version controlled since changes to derivative datasets can only happen through changes to the code tracked in Git. Git can also be used to restore old version of code that in turn can be used to restore old versions of the derivative data.<br />
<br />
While this method often is an excellent option to version control derivative data, it does not work when the original data is updated frequently (ongoing data collection or when data is continuously received) or when the code is not accessible (someone else is generating the data). Derivative data should, in these cases, still be generated using reproducible code tracked in Git, but the effect is no longer that those derivative datasets are implicitly properly version controlled.<br />
<br />
=== Version control using checksums/hashes ===<br />
<br />
One way to keep track of changes done to data is to use ''checksums'', ''hashes'' or ''data signatures''. These three concepts all work slightly differently but all follows the same principle and serves the same purpose in the context of version control of data. The principle they all follow is that a datasets is boiled down to a short string of text or a number. The same data is always boiled down to the same string or number, but even small differences to the data boils down to a different string or number. That string or number can then quickly and effortlessly be compared across datasets to test if they are identical or not.<br />
<br />
Most checksums or hashes boils down to a string or a number that has no interpretable meaning to humans, it is just meant to serve the purpose of answering the yes/no question if the two datasets are identical or not. It is usually impossible to tell from a has or a checksum if the files are almost identical or very different. One exception to this is the Stata command <code>datasignature</code> that generates a signature that combines a hash value with some basic data on the dataset such as number of observations. While this provides some useful human readable information, it can only be used on Stata <code>.dta</code> files.<br />
<br />
The drawback of all these methods is that it says nothing about how the dataset was created. So there is no way to recreate a dataset based only on the checksum or hash signature. If you have the checksum of a dataset used in the past but not the corresponding version of the dataset, then all you can do is to say that dataset was identical to the current version of the same dataset or not. You cannot say what differs if they are different. If you used the Stata command <code>datasignature</code> you can get some clues of what differs, such as number of observations or number of variables, but you will still not be able to recreate the old version of the data.<br />
<br />
However, this method is a good fit for when you want to have a quick way to test that the dataset has not changed, and details of what has changed either does not matter or you have another way to find out (perhaps manually) what those differences are. Examples of this can be if you are accessing someone else's dataset and you want to know if that person makes changes to that dataset. Another way this can be used is to check which if any datasets are updated when the code is updated. Sometimes we want to make changes to the code that should not change the dataset it generates, and then this method is a great way to verify that.<br />
<br />
=== Version control in sync software ===<br />
<br />
[[Data_Storage#File_sync_services | File syncing software]] often have version control systems that allows you to both detect changes made to data files and allows you to restore old versions of those files. However, the way they it is done in these systems is so storage in-efficient that you can only restore files version that are less than a few months old. If your project is completed with that time frame, then this is a great solution, however, typically a project runs for much longer than that.<br />
<br />
== Data storage types ==<br />
There are different types of data storage option and which one that is best for each use case typically depends on how big the data is, if there is confidential information in the data and how the data will be used. Below are some different storage types with their pros and cons. The type of data that is considered in this article are data files, and not, for example, data stored in data bases. <br />
<br />
=== File sync services ===<br />
<br />
Common examples of file sync services are DropBox, OneDrive, Box among several others. This storage type is suitable when the data files required for a project is not bigger than that they can be stored on a regular laptop or desktop computer. In file sync services a copy of the file exists locally on all computers that has the folder synced. This means that accessing the file is quick and that there is never a problem for multiple people to access the file at the same time as they have their own version of the file. <br />
<br />
However, because all files are saved locally this also means that if one person works on many projects there is a risk that there is not space for all files for those projects on that persons computer. Many file syncing services have options to mitigate this. For example you can chose to not sync the parts of the project folder that you know you do not need access to. Some sync services also allow you to have non-synced files ready to be downloaded on demand (sometimes referred to as ''smart sync''). However, this is not great for data work as data files are usually to big to be instantly downloaded on demand when your code trying to access them leading to your code to crash when trying to read the file.<br />
<br />
Another concern with file syncing services are privacy concerns when sharing confidential data. There are enterprise subscriptions that can be installed on your own servers. This makes sharing data much more secure as the data is never stored or transferred on the servers that belong to the company that offers the syncing service. If you do not have an enterprise subscription or are not sure if you do, you should always [[encryption| encrypt ]] any non-public data before saving it in a synced folder. DIME Analytics have published [https://github.com/worldbank/dime-standards/blob/master/dime-research-standards/pillar-4-data-security/data-security-resources/veracrypt-guidelines.md guidelines to encrypt files with Veracrypt] which is a secure and free software often used for this purpose.<br />
<br />
=== Cloud storage ===<br />
<br />
Cloud storage comes in many different kinds and it is out of the scope of this wiki article to describe them all. This article will only cover general points about cloud storage. This article only covers data stored in files, so cloud storage here typically means [https://aws.amazon.com/s3/ S3 storage on AWS] or [https://docs.microsoft.com/en-us/azure/storage/blobs/ Blob storage on Azure], but are not limited to those.<br />
<br />
The benefit of cloud storage is that you do not need to buy or upgrade any hardware. To get started or to increase your capacity you only need to click a button or it might even be upgraded automatically. However, you will be charged more the more you use. <br />
<br />
Cloud storage might not be the best option when the data will frequently be accessed from laptops and desktops. The download time will be too long even on a fast internet connection for most use cases if the data is downloaded each time the script is run. Instead, cloud storage is best used when frequently accessed from another cloud resource. <br />
<br />
However, make sure that the other cloud storage is in the same physical location as the cloud resource that will use it. The same cloud provider have data centers across the world and you might have issues if your storage is on the other side of the world from the resource that will use it. There are ways to address these issues but they are outside of the scope of this article. The main takeaway is that cloud storage is only a great solution if the resources that needs access to that have a very quick connection to them.<br />
<br />
At the same time, this does not mean that cloud storage should never be accessed from a laptop or a desktop. Cloud storage can be a great place to store data that is infrequently accessed by users on regular computers. There will still be a delay, but that can be acceptable as the data is only infrequently accessed.<br />
<br />
=== Network drive storage ===<br />
<br />
Network drive storage is in many way similar to the types of cloud storage that we discussed above. The major difference for the context discussed here is that they can be located in the same location as a user accessing them from a laptop/desktop making it a viable option for such a user to access it frequently. This is a great solution when a regular laptop/desktop frequently needs access to the data and the combined size of all of the data files are too big to all be stored on the laptop/desktop.<br />
<br />
The problem with this solution is that you need to purchase hardware and run your own network drive. This can be complicated if you do not work at a large organization where there is an IT team that can do this for you. Another drawback is that if you work on an organization that is spread out geographically then you will have the same issues as with cloud storage with slow access speed.<br />
<br />
== Back-up protocols ==<br />
<br />
The DIME Analytics recommendation for data back up starts by classifying all data into either original data or derivative data. Original data is data that was received or collected by the project team, and derivative data is data created from the original data using reproducible code. Original data must be backed up, but derivative data does not have to be backed up as long as the code used to create it is backed up (typically through GitHub or another version control system for code). If the original data and the code is backed up, then the derivative data is implicitly backed up as well. By this method, backing up every single derivative data file which would be storage space inefficient is avoided.<br />
<br />
There are many good alternative data back-up protocols but a great place to start for anyone who has not adapted a back-up protocol for their projects yet is the [https://www.backblaze.com/blog/the-3-2-1-backup-strategy 3-2-1 strategy]. This strategy means that the project has 3 copies of the data, where 2 are easily accessible and 1 is stored on an remote location. This means that one copy, called the first copy, is the only copy that the code ever should read the data from. This copy can be shared in a file syncing service. If the original data file has [[Personally_Identifiable_Information_(PII) | PII information]] (which they often do) the file needs to be encrypted at all times. A de-identified version of this data that does not need to be encrypted can be created for easier day-to-day use.<br />
<br />
If the first copy of the data is accidentally deleted or modified then the team will access the second copy and make a new first copy of the second copy. The second copy could, for example, be stored on a external drive. This external drive can be stored in the office of someone in the project team. If the data has PII data it must be encrypted unless the external drive is stored in a locked safe. When the team has access to a secure physical safe, then it is a good idea to keep the data unencrypted on the drive that is stored locked in the safe. This is to reduce the risk of loosing data due to lost encryption keys. If an external hard drive is not an option, then the second copy can be stored in any other file storage that is not the same service as the first copy.<br />
<br />
Finally, the third copy should be stored off-site. In today's world this usually means somewhere in the cloud. The second copy can also be stored in the cloud, but the difference from the third copy is that the second copy may be stored locally. The third copy may never be stored locally. This is meant to make sure that if theft, fire, disaster etc. or anything similar happens, then the third copy should be stored in a way there is no way it is affected by the same event. The other important consideration for the third copy storage is that it is a suitable long term storage. This means that this storage should not be affected by a single person leaving the team or if it does there should be a clear protocol for how access is maintained when that person is leaving the team. If you are using a sync service for the third copy it should ideally be a different sync service than for the first copy. Regardless which sync service you are using, if you are using a sync service you should never have the third copy synced to any computer. This file should live in an online-only folder. Alternatives to sync services for the third copy would be S3 on AWS or Blob storage on Azure. If you are using such services, see if you can make these files read-only so there is no chance that they are accidentally deleted or modified.<br />
<br />
== Data retention ==<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[*topic name, as listed on main page*]]<br />
<br />
<br />
== Additional Resources ==<br />
* Pillar 4 in the [https://github.com/worldbank/dime-standards DIME Research Standards] covers data security<br />
<br />
[[Category: Data Management]]</div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Data_Storage&diff=8317Data Storage2021-06-07T19:18:37Z<p>Kbjarkefur: /* Additional Resources */</p>
<hr />
<div>This article discuss different aspects of data storage (such as different types of storage, data back up and data retention). While [[Data_Security]] is a very important topic related to data storage, it is not covered in this article as there is a dedicated link for that.<br />
<br />
<br />
<br />
== Read First ==<br />
* Data should be version controlled. Ideally the version control system used should both be able to detect changes to the data and provide a possibility to revert to old versions<br />
* Data storage depends on how big the data is, if there is confidential information in the data and how the data will be used.<br />
* All original data (the data sets collected or received by the team) should be backed-up. All derivative data (data sets created from original data) should be created by reproducible code and that code should be version controlled in Git. Then derivative data does not need to be backed-up as it can be re-created.<br />
<br />
== Version control for data ==<br />
Version control is used for two purposes. The first is to keep track of changes and modifications to files and the second one is to provide the possibility to revert a file an old version. For code there is an industry standard that efficiently fulfills these two purposes and that system is called Git. Git can be used to both keep track of changes made to code as well as restore older versions of code. Unfortunately there is no system that as elegantly does both of those two things for data over a longer period of time. There is therefore no industry wide one-size-fits-all solutions for version control of data. This article therefore suggests different methods and the project team should pick the method best for their exact use case.<br />
<br />
=== Version control code that generates data ===<br />
<br />
All derivative datasets (datasets that the project team creates from original data they received or collected) should be generated by reproducible code so that it can be re-created by re-running that code. Therefore, if the original data is properly [[Data_Storage#Back-up_protocols | backed-up]] and all code is version controlled using Git, then the derivative datasets are already implicitly version controlled since changes to derivative datasets can only happen through changes to the code tracked in Git. Git can also be used to restore old version of code that in turn can be used to restore old versions of the derivative data.<br />
<br />
While this method often is an excellent option to version control derivative data, it does not work when the original data is updated frequently (ongoing data collection or when data is continuously received) or when the code is not accessible (someone else is generating the data). Derivative data should, in these cases, still be generated using reproducible code tracked in Git, but the effect is no longer that those derivative datasets are implicitly properly version controlled.<br />
<br />
=== Version control using checksums/hashes ===<br />
<br />
One way to keep track of changes done to data is to use ''checksums'', ''hashes'' or ''data signatures''. These three concepts all work slightly differently but all follows the same principle and serves the same purpose in the context of version control of data. The principle they all follow is that a datasets is boiled down to a short string of text or a number. The same data is always boiled down to the same string or number, but even small differences to the data boils down to a different string or number. That string or number can then quickly and effortlessly be compared across datasets to test if they are identical or not.<br />
<br />
Most checksums or hashes boils down to a string or a number that has no interpretable meaning to humans, it is just meant to serve the purpose of answering the yes/no question if the two datasets are identical or not. It is usually impossible to tell from a has or a checksum if the files are almost identical or very different. One exception to this is the Stata command <code>datasignature</code> that generates a signature that combines a hash value with some basic data on the dataset such as number of observations. While this provides some useful human readable information, it can only be used on Stata <code>.dta</code> files.<br />
<br />
The drawback of all these methods is that it says nothing about how the dataset was created. So there is no way to recreate a dataset based only on the checksum or hash signature. If you have the checksum of a dataset used in the past but not the corresponding version of the dataset, then all you can do is to say that dataset was identical to the current version of the same dataset or not. You cannot say what differs if they are different. If you used the Stata command <code>datasignature</code> you can get some clues of what differs, such as number of observations or number of variables, but you will still not be able to recreate the old version of the data.<br />
<br />
However, this method is a good fit for when you want to have a quick way to test that the dataset has not changed, and details of what has changed either does not matter or you have another way to find out (perhaps manually) what those differences are. Examples of this can be if you are accessing someone else's dataset and you want to know if that person makes changes to that dataset. Another way this can be used is to check which if any datasets are updated when the code is updated. Sometimes we want to make changes to the code that should not change the dataset it generates, and then this method is a great way to verify that.<br />
<br />
=== Version control in sync software ===<br />
<br />
[[Data_Storage#File_sync_services | File syncing software]] often have version control systems that allows you to both detect changes made to data files and allows you to restore old versions of those files. However, the way they it is done in these systems is so storage in-efficient that you can only restore files version that are less than a few months old. If your project is completed with that time frame, then this is a great solution, however, typically a project runs for much longer than that.<br />
<br />
== Data storage types ==<br />
There are different types of data storage option and which one that is best for each use case typically depends on how big the data is, if there is confidential information in the data and how the data will be used. Below are some different storage types with their pros and cons. The type of data that is considered in this article are data files, and not, for example, data stored in data bases. <br />
<br />
=== File sync services ===<br />
<br />
Common examples of file sync services are DropBox, OneDrive, Box among several others. This storage type is suitable when the data files required for a project is not bigger than that they can be stored on a regular laptop or desktop computer. In file sync services a copy of the file exists locally on all computers that has the folder synced. This means that accessing the file is quick and that there is never a problem for multiple people to access the file at the same time as they have their own version of the file. <br />
<br />
However, because all files are saved locally this also means that if one person works on many projects there is a risk that there is not space for all files for those projects on that persons computer. Many file syncing services have options to mitigate this. For example you can chose to not sync the parts of the project folder that you know you do not need access to. Some sync services also allow you to have non-synced files ready to be downloaded on demand (sometimes referred to as ''smart sync''). However, this is not great for data work as data files are usually to big to be instantly downloaded on demand when your code trying to access them leading to your code to crash when trying to read the file.<br />
<br />
Another concern with file syncing services are privacy concerns when sharing confidential data. There are enterprise subscriptions that can be installed on your own servers. This makes sharing data much more secure as the data is never stored or transferred on the servers that belong to the company that offers the syncing service. If you do not have an enterprise subscription or are not sure if you do, you should always [[encryption| encrypt ]] any non-public data before saving it in a synced folder. DIME Analytics have published [https://github.com/worldbank/dime-standards/blob/master/dime-research-standards/pillar-4-data-security/data-security-resources/veracrypt-guidelines.md guidelines to encrypt files with Veracrypt] which is a secure and free software often used for this purpose.<br />
<br />
=== Cloud storage ===<br />
<br />
Cloud storage comes in many different kinds and it is out of the scope of this wiki article to describe them all. This article will only cover general points about cloud storage. This article only covers data stored in files, so cloud storage here typically means [https://aws.amazon.com/s3/ S3 storage on AWS] or [https://docs.microsoft.com/en-us/azure/storage/blobs/ Blob storage on Azure], but are not limited to those.<br />
<br />
The benefit of cloud storage is that you do not need to buy or upgrade any hardware. To get started or to increase your capacity you only need to click a button or it might even be upgraded automatically. However, you will be charged more the more you use. <br />
<br />
Cloud storage might not be the best option when the data will frequently be accessed from laptops and desktops. The download time will be too long even on a fast internet connection for most use cases if the data is downloaded each time the script is run. Instead, cloud storage is best used when frequently accessed from another cloud resource. <br />
<br />
However, make sure that the other cloud storage is in the same physical location as the cloud resource that will use it. The same cloud provider have data centers across the world and you might have issues if your storage is on the other side of the world from the resource that will use it. There are ways to address these issues but they are outside of the scope of this article. The main takeaway is that cloud storage is only a great solution if the resources that needs access to that have a very quick connection to them.<br />
<br />
At the same time, this does not mean that cloud storage should never be accessed from a laptop or a desktop. Cloud storage can be a great place to store data that is infrequently accessed by users on regular computers. There will still be a delay, but that can be acceptable as the data is only infrequently accessed.<br />
<br />
=== Network drive storage ===<br />
<br />
Network drive storage is in many way similar to the types of cloud storage that we discussed above. The major difference for the context discussed here is that they can be located in the same location as a user accessing them from a laptop/desktop making it a viable option for such a user to access it frequently. This is a great solution when a regular laptop/desktop frequently needs access to the data and the combined size of all of the data files are too big to all be stored on the laptop/desktop.<br />
<br />
The problem with this solution is that you need to purchase hardware and run your own network drive. This can be complicated if you do not work at a large organization where there is an IT team that can do this for you. Another drawback is that if you work on an organization that is spread out geographically then you will have the same issues as with cloud storage with slow access speed.<br />
<br />
== Back-up protocols ==<br />
<br />
The DIME Analytics recommendation for data back up starts by classifying all data into either original data or derivative data. Original data is data that was received or collected by the project team, and derivative data is data created from the original data using reproducible code. Original data must be backed up, but derivative data does not have to be backed up as long as the code used to create it is backed up (typically through GitHub or another version control system for code). If the original data and the code is backed up, then the derivative data is implicitly backed up as well. By this method, backing up every single derivative data file which would be storage space inefficient is avoided.<br />
<br />
There are many good alternative data back-up protocols but a great place to start for anyone who has not adapted a back-up protocol for their projects yet is the [https://www.backblaze.com/blog/the-3-2-1-backup-strategy 3-2-1 strategy]. This strategy means that the project has 3 copies of the data, where 2 are easily accessible and 1 is stored on an remote location. This means that one copy, called the first copy, is the only copy that the code ever should read the data from. This copy can be shared in a file syncing service. If the original data file has [[Personally_Identifiable_Information_(PII) | PII information]] (which they often do) the file needs to be encrypted at all times. A de-identified version of this data that does not need to be encrypted can be created for easier day-to-day use.<br />
<br />
If the first copy of the data is accidentally deleted or modified then the team will access the second copy and make a new first copy of the second copy. The second copy could, for example, be stored on a external drive. This external drive can be stored in the office of someone in the project team. If the data has PII data it must be encrypted unless the external drive is stored in a locked safe. When the team has access to a secure physical safe, then it is a good idea to keep the data unencrypted on the drive that is stored locked in the safe. This is to reduce the risk of loosing data due to lost encryption keys. If an external hard drive is not an option, then the second copy can be stored in any other file storage that is not the same service as the first copy.<br />
<br />
Finally, the third copy should be stored off-site. In today's world this usually means somewhere in the cloud. The second copy can also be stored in the cloud, but the difference from the third copy is that the second copy may be stored locally. The third copy may never be stored locally. This is meant to make sure that if theft, fire, disaster etc. or anything similar happens, then the third copy should be stored in a way there is no way it is affected by the same event. The other important consideration for the third copy storage is that it is a suitable long term storage. This means that this storage should not be affected by a single person leaving the team or if it does there should be a clear protocol for how access is maintained when that person is leaving the team. If you are using a sync service for the third copy it should ideally be a different sync service than for the first copy. Regardless which sync service you are using, if you are using a sync service you should never have the third copy synced to any computer. This file should live in an online-only folder. Alternatives to sync services for the third copy would be S3 on AWS or Blob storage on Azure. If you are using such services, see if you can make these files read-only so there is no chance that they are accidentally deleted or modified.<br />
<br />
== Data retention ==<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[*topic name, as listed on main page*]]<br />
<br />
<br />
== Additional Resources ==<br />
* list here other articles related to this topic, with a brief description and link<br />
<br />
[[Category: Data Management ]]</div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Data_Storage&diff=8316Data Storage2021-06-07T19:18:24Z<p>Kbjarkefur: /* Additional Resources */</p>
<hr />
<div>This article discuss different aspects of data storage (such as different types of storage, data back up and data retention). While [[Data_Security]] is a very important topic related to data storage, it is not covered in this article as there is a dedicated link for that.<br />
<br />
<br />
<br />
== Read First ==<br />
* Data should be version controlled. Ideally the version control system used should both be able to detect changes to the data and provide a possibility to revert to old versions<br />
* Data storage depends on how big the data is, if there is confidential information in the data and how the data will be used.<br />
* All original data (the data sets collected or received by the team) should be backed-up. All derivative data (data sets created from original data) should be created by reproducible code and that code should be version controlled in Git. Then derivative data does not need to be backed-up as it can be re-created.<br />
<br />
== Version control for data ==<br />
Version control is used for two purposes. The first is to keep track of changes and modifications to files and the second one is to provide the possibility to revert a file an old version. For code there is an industry standard that efficiently fulfills these two purposes and that system is called Git. Git can be used to both keep track of changes made to code as well as restore older versions of code. Unfortunately there is no system that as elegantly does both of those two things for data over a longer period of time. There is therefore no industry wide one-size-fits-all solutions for version control of data. This article therefore suggests different methods and the project team should pick the method best for their exact use case.<br />
<br />
=== Version control code that generates data ===<br />
<br />
All derivative datasets (datasets that the project team creates from original data they received or collected) should be generated by reproducible code so that it can be re-created by re-running that code. Therefore, if the original data is properly [[Data_Storage#Back-up_protocols | backed-up]] and all code is version controlled using Git, then the derivative datasets are already implicitly version controlled since changes to derivative datasets can only happen through changes to the code tracked in Git. Git can also be used to restore old version of code that in turn can be used to restore old versions of the derivative data.<br />
<br />
While this method often is an excellent option to version control derivative data, it does not work when the original data is updated frequently (ongoing data collection or when data is continuously received) or when the code is not accessible (someone else is generating the data). Derivative data should, in these cases, still be generated using reproducible code tracked in Git, but the effect is no longer that those derivative datasets are implicitly properly version controlled.<br />
<br />
=== Version control using checksums/hashes ===<br />
<br />
One way to keep track of changes done to data is to use ''checksums'', ''hashes'' or ''data signatures''. These three concepts all work slightly differently but all follows the same principle and serves the same purpose in the context of version control of data. The principle they all follow is that a datasets is boiled down to a short string of text or a number. The same data is always boiled down to the same string or number, but even small differences to the data boils down to a different string or number. That string or number can then quickly and effortlessly be compared across datasets to test if they are identical or not.<br />
<br />
Most checksums or hashes boils down to a string or a number that has no interpretable meaning to humans, it is just meant to serve the purpose of answering the yes/no question if the two datasets are identical or not. It is usually impossible to tell from a has or a checksum if the files are almost identical or very different. One exception to this is the Stata command <code>datasignature</code> that generates a signature that combines a hash value with some basic data on the dataset such as number of observations. While this provides some useful human readable information, it can only be used on Stata <code>.dta</code> files.<br />
<br />
The drawback of all these methods is that it says nothing about how the dataset was created. So there is no way to recreate a dataset based only on the checksum or hash signature. If you have the checksum of a dataset used in the past but not the corresponding version of the dataset, then all you can do is to say that dataset was identical to the current version of the same dataset or not. You cannot say what differs if they are different. If you used the Stata command <code>datasignature</code> you can get some clues of what differs, such as number of observations or number of variables, but you will still not be able to recreate the old version of the data.<br />
<br />
However, this method is a good fit for when you want to have a quick way to test that the dataset has not changed, and details of what has changed either does not matter or you have another way to find out (perhaps manually) what those differences are. Examples of this can be if you are accessing someone else's dataset and you want to know if that person makes changes to that dataset. Another way this can be used is to check which if any datasets are updated when the code is updated. Sometimes we want to make changes to the code that should not change the dataset it generates, and then this method is a great way to verify that.<br />
<br />
=== Version control in sync software ===<br />
<br />
[[Data_Storage#File_sync_services | File syncing software]] often have version control systems that allows you to both detect changes made to data files and allows you to restore old versions of those files. However, the way they it is done in these systems is so storage in-efficient that you can only restore files version that are less than a few months old. If your project is completed with that time frame, then this is a great solution, however, typically a project runs for much longer than that.<br />
<br />
== Data storage types ==<br />
There are different types of data storage option and which one that is best for each use case typically depends on how big the data is, if there is confidential information in the data and how the data will be used. Below are some different storage types with their pros and cons. The type of data that is considered in this article are data files, and not, for example, data stored in data bases. <br />
<br />
=== File sync services ===<br />
<br />
Common examples of file sync services are DropBox, OneDrive, Box among several others. This storage type is suitable when the data files required for a project is not bigger than that they can be stored on a regular laptop or desktop computer. In file sync services a copy of the file exists locally on all computers that has the folder synced. This means that accessing the file is quick and that there is never a problem for multiple people to access the file at the same time as they have their own version of the file. <br />
<br />
However, because all files are saved locally this also means that if one person works on many projects there is a risk that there is not space for all files for those projects on that persons computer. Many file syncing services have options to mitigate this. For example you can chose to not sync the parts of the project folder that you know you do not need access to. Some sync services also allow you to have non-synced files ready to be downloaded on demand (sometimes referred to as ''smart sync''). However, this is not great for data work as data files are usually to big to be instantly downloaded on demand when your code trying to access them leading to your code to crash when trying to read the file.<br />
<br />
Another concern with file syncing services are privacy concerns when sharing confidential data. There are enterprise subscriptions that can be installed on your own servers. This makes sharing data much more secure as the data is never stored or transferred on the servers that belong to the company that offers the syncing service. If you do not have an enterprise subscription or are not sure if you do, you should always [[encryption| encrypt ]] any non-public data before saving it in a synced folder. DIME Analytics have published [https://github.com/worldbank/dime-standards/blob/master/dime-research-standards/pillar-4-data-security/data-security-resources/veracrypt-guidelines.md guidelines to encrypt files with Veracrypt] which is a secure and free software often used for this purpose.<br />
<br />
=== Cloud storage ===<br />
<br />
Cloud storage comes in many different kinds and it is out of the scope of this wiki article to describe them all. This article will only cover general points about cloud storage. This article only covers data stored in files, so cloud storage here typically means [https://aws.amazon.com/s3/ S3 storage on AWS] or [https://docs.microsoft.com/en-us/azure/storage/blobs/ Blob storage on Azure], but are not limited to those.<br />
<br />
The benefit of cloud storage is that you do not need to buy or upgrade any hardware. To get started or to increase your capacity you only need to click a button or it might even be upgraded automatically. However, you will be charged more the more you use. <br />
<br />
Cloud storage might not be the best option when the data will frequently be accessed from laptops and desktops. The download time will be too long even on a fast internet connection for most use cases if the data is downloaded each time the script is run. Instead, cloud storage is best used when frequently accessed from another cloud resource. <br />
<br />
However, make sure that the other cloud storage is in the same physical location as the cloud resource that will use it. The same cloud provider have data centers across the world and you might have issues if your storage is on the other side of the world from the resource that will use it. There are ways to address these issues but they are outside of the scope of this article. The main takeaway is that cloud storage is only a great solution if the resources that needs access to that have a very quick connection to them.<br />
<br />
At the same time, this does not mean that cloud storage should never be accessed from a laptop or a desktop. Cloud storage can be a great place to store data that is infrequently accessed by users on regular computers. There will still be a delay, but that can be acceptable as the data is only infrequently accessed.<br />
<br />
=== Network drive storage ===<br />
<br />
Network drive storage is in many way similar to the types of cloud storage that we discussed above. The major difference for the context discussed here is that they can be located in the same location as a user accessing them from a laptop/desktop making it a viable option for such a user to access it frequently. This is a great solution when a regular laptop/desktop frequently needs access to the data and the combined size of all of the data files are too big to all be stored on the laptop/desktop.<br />
<br />
The problem with this solution is that you need to purchase hardware and run your own network drive. This can be complicated if you do not work at a large organization where there is an IT team that can do this for you. Another drawback is that if you work on an organization that is spread out geographically then you will have the same issues as with cloud storage with slow access speed.<br />
<br />
== Back-up protocols ==<br />
<br />
The DIME Analytics recommendation for data back up starts by classifying all data into either original data or derivative data. Original data is data that was received or collected by the project team, and derivative data is data created from the original data using reproducible code. Original data must be backed up, but derivative data does not have to be backed up as long as the code used to create it is backed up (typically through GitHub or another version control system for code). If the original data and the code is backed up, then the derivative data is implicitly backed up as well. By this method, backing up every single derivative data file which would be storage space inefficient is avoided.<br />
<br />
There are many good alternative data back-up protocols but a great place to start for anyone who has not adapted a back-up protocol for their projects yet is the [https://www.backblaze.com/blog/the-3-2-1-backup-strategy 3-2-1 strategy]. This strategy means that the project has 3 copies of the data, where 2 are easily accessible and 1 is stored on an remote location. This means that one copy, called the first copy, is the only copy that the code ever should read the data from. This copy can be shared in a file syncing service. If the original data file has [[Personally_Identifiable_Information_(PII) | PII information]] (which they often do) the file needs to be encrypted at all times. A de-identified version of this data that does not need to be encrypted can be created for easier day-to-day use.<br />
<br />
If the first copy of the data is accidentally deleted or modified then the team will access the second copy and make a new first copy of the second copy. The second copy could, for example, be stored on a external drive. This external drive can be stored in the office of someone in the project team. If the data has PII data it must be encrypted unless the external drive is stored in a locked safe. When the team has access to a secure physical safe, then it is a good idea to keep the data unencrypted on the drive that is stored locked in the safe. This is to reduce the risk of loosing data due to lost encryption keys. If an external hard drive is not an option, then the second copy can be stored in any other file storage that is not the same service as the first copy.<br />
<br />
Finally, the third copy should be stored off-site. In today's world this usually means somewhere in the cloud. The second copy can also be stored in the cloud, but the difference from the third copy is that the second copy may be stored locally. The third copy may never be stored locally. This is meant to make sure that if theft, fire, disaster etc. or anything similar happens, then the third copy should be stored in a way there is no way it is affected by the same event. The other important consideration for the third copy storage is that it is a suitable long term storage. This means that this storage should not be affected by a single person leaving the team or if it does there should be a clear protocol for how access is maintained when that person is leaving the team. If you are using a sync service for the third copy it should ideally be a different sync service than for the first copy. Regardless which sync service you are using, if you are using a sync service you should never have the third copy synced to any computer. This file should live in an online-only folder. Alternatives to sync services for the third copy would be S3 on AWS or Blob storage on Azure. If you are using such services, see if you can make these files read-only so there is no chance that they are accidentally deleted or modified.<br />
<br />
== Data retention ==<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[*topic name, as listed on main page*]]<br />
<br />
<br />
== Additional Resources ==<br />
* list here other articles related to this topic, with a brief description and link<br />
<br />
[[Category: *Data Management* ]]</div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Data_Storage&diff=8315Data Storage2021-06-07T19:17:41Z<p>Kbjarkefur: /* Back-up protocols */</p>
<hr />
<div>This article discuss different aspects of data storage (such as different types of storage, data back up and data retention). While [[Data_Security]] is a very important topic related to data storage, it is not covered in this article as there is a dedicated link for that.<br />
<br />
<br />
<br />
== Read First ==<br />
* Data should be version controlled. Ideally the version control system used should both be able to detect changes to the data and provide a possibility to revert to old versions<br />
* Data storage depends on how big the data is, if there is confidential information in the data and how the data will be used.<br />
* All original data (the data sets collected or received by the team) should be backed-up. All derivative data (data sets created from original data) should be created by reproducible code and that code should be version controlled in Git. Then derivative data does not need to be backed-up as it can be re-created.<br />
<br />
== Version control for data ==<br />
Version control is used for two purposes. The first is to keep track of changes and modifications to files and the second one is to provide the possibility to revert a file an old version. For code there is an industry standard that efficiently fulfills these two purposes and that system is called Git. Git can be used to both keep track of changes made to code as well as restore older versions of code. Unfortunately there is no system that as elegantly does both of those two things for data over a longer period of time. There is therefore no industry wide one-size-fits-all solutions for version control of data. This article therefore suggests different methods and the project team should pick the method best for their exact use case.<br />
<br />
=== Version control code that generates data ===<br />
<br />
All derivative datasets (datasets that the project team creates from original data they received or collected) should be generated by reproducible code so that it can be re-created by re-running that code. Therefore, if the original data is properly [[Data_Storage#Back-up_protocols | backed-up]] and all code is version controlled using Git, then the derivative datasets are already implicitly version controlled since changes to derivative datasets can only happen through changes to the code tracked in Git. Git can also be used to restore old version of code that in turn can be used to restore old versions of the derivative data.<br />
<br />
While this method often is an excellent option to version control derivative data, it does not work when the original data is updated frequently (ongoing data collection or when data is continuously received) or when the code is not accessible (someone else is generating the data). Derivative data should, in these cases, still be generated using reproducible code tracked in Git, but the effect is no longer that those derivative datasets are implicitly properly version controlled.<br />
<br />
=== Version control using checksums/hashes ===<br />
<br />
One way to keep track of changes done to data is to use ''checksums'', ''hashes'' or ''data signatures''. These three concepts all work slightly differently but all follows the same principle and serves the same purpose in the context of version control of data. The principle they all follow is that a datasets is boiled down to a short string of text or a number. The same data is always boiled down to the same string or number, but even small differences to the data boils down to a different string or number. That string or number can then quickly and effortlessly be compared across datasets to test if they are identical or not.<br />
<br />
Most checksums or hashes boils down to a string or a number that has no interpretable meaning to humans, it is just meant to serve the purpose of answering the yes/no question if the two datasets are identical or not. It is usually impossible to tell from a has or a checksum if the files are almost identical or very different. One exception to this is the Stata command <code>datasignature</code> that generates a signature that combines a hash value with some basic data on the dataset such as number of observations. While this provides some useful human readable information, it can only be used on Stata <code>.dta</code> files.<br />
<br />
The drawback of all these methods is that it says nothing about how the dataset was created. So there is no way to recreate a dataset based only on the checksum or hash signature. If you have the checksum of a dataset used in the past but not the corresponding version of the dataset, then all you can do is to say that dataset was identical to the current version of the same dataset or not. You cannot say what differs if they are different. If you used the Stata command <code>datasignature</code> you can get some clues of what differs, such as number of observations or number of variables, but you will still not be able to recreate the old version of the data.<br />
<br />
However, this method is a good fit for when you want to have a quick way to test that the dataset has not changed, and details of what has changed either does not matter or you have another way to find out (perhaps manually) what those differences are. Examples of this can be if you are accessing someone else's dataset and you want to know if that person makes changes to that dataset. Another way this can be used is to check which if any datasets are updated when the code is updated. Sometimes we want to make changes to the code that should not change the dataset it generates, and then this method is a great way to verify that.<br />
<br />
=== Version control in sync software ===<br />
<br />
[[Data_Storage#File_sync_services | File syncing software]] often have version control systems that allows you to both detect changes made to data files and allows you to restore old versions of those files. However, the way they it is done in these systems is so storage in-efficient that you can only restore files version that are less than a few months old. If your project is completed with that time frame, then this is a great solution, however, typically a project runs for much longer than that.<br />
<br />
== Data storage types ==<br />
There are different types of data storage option and which one that is best for each use case typically depends on how big the data is, if there is confidential information in the data and how the data will be used. Below are some different storage types with their pros and cons. The type of data that is considered in this article are data files, and not, for example, data stored in data bases. <br />
<br />
=== File sync services ===<br />
<br />
Common examples of file sync services are DropBox, OneDrive, Box among several others. This storage type is suitable when the data files required for a project is not bigger than that they can be stored on a regular laptop or desktop computer. In file sync services a copy of the file exists locally on all computers that has the folder synced. This means that accessing the file is quick and that there is never a problem for multiple people to access the file at the same time as they have their own version of the file. <br />
<br />
However, because all files are saved locally this also means that if one person works on many projects there is a risk that there is not space for all files for those projects on that persons computer. Many file syncing services have options to mitigate this. For example you can chose to not sync the parts of the project folder that you know you do not need access to. Some sync services also allow you to have non-synced files ready to be downloaded on demand (sometimes referred to as ''smart sync''). However, this is not great for data work as data files are usually to big to be instantly downloaded on demand when your code trying to access them leading to your code to crash when trying to read the file.<br />
<br />
Another concern with file syncing services are privacy concerns when sharing confidential data. There are enterprise subscriptions that can be installed on your own servers. This makes sharing data much more secure as the data is never stored or transferred on the servers that belong to the company that offers the syncing service. If you do not have an enterprise subscription or are not sure if you do, you should always [[encryption| encrypt ]] any non-public data before saving it in a synced folder. DIME Analytics have published [https://github.com/worldbank/dime-standards/blob/master/dime-research-standards/pillar-4-data-security/data-security-resources/veracrypt-guidelines.md guidelines to encrypt files with Veracrypt] which is a secure and free software often used for this purpose.<br />
<br />
=== Cloud storage ===<br />
<br />
Cloud storage comes in many different kinds and it is out of the scope of this wiki article to describe them all. This article will only cover general points about cloud storage. This article only covers data stored in files, so cloud storage here typically means [https://aws.amazon.com/s3/ S3 storage on AWS] or [https://docs.microsoft.com/en-us/azure/storage/blobs/ Blob storage on Azure], but are not limited to those.<br />
<br />
The benefit of cloud storage is that you do not need to buy or upgrade any hardware. To get started or to increase your capacity you only need to click a button or it might even be upgraded automatically. However, you will be charged more the more you use. <br />
<br />
Cloud storage might not be the best option when the data will frequently be accessed from laptops and desktops. The download time will be too long even on a fast internet connection for most use cases if the data is downloaded each time the script is run. Instead, cloud storage is best used when frequently accessed from another cloud resource. <br />
<br />
However, make sure that the other cloud storage is in the same physical location as the cloud resource that will use it. The same cloud provider have data centers across the world and you might have issues if your storage is on the other side of the world from the resource that will use it. There are ways to address these issues but they are outside of the scope of this article. The main takeaway is that cloud storage is only a great solution if the resources that needs access to that have a very quick connection to them.<br />
<br />
At the same time, this does not mean that cloud storage should never be accessed from a laptop or a desktop. Cloud storage can be a great place to store data that is infrequently accessed by users on regular computers. There will still be a delay, but that can be acceptable as the data is only infrequently accessed.<br />
<br />
=== Network drive storage ===<br />
<br />
Network drive storage is in many way similar to the types of cloud storage that we discussed above. The major difference for the context discussed here is that they can be located in the same location as a user accessing them from a laptop/desktop making it a viable option for such a user to access it frequently. This is a great solution when a regular laptop/desktop frequently needs access to the data and the combined size of all of the data files are too big to all be stored on the laptop/desktop.<br />
<br />
The problem with this solution is that you need to purchase hardware and run your own network drive. This can be complicated if you do not work at a large organization where there is an IT team that can do this for you. Another drawback is that if you work on an organization that is spread out geographically then you will have the same issues as with cloud storage with slow access speed.<br />
<br />
== Back-up protocols ==<br />
<br />
The DIME Analytics recommendation for data back up starts by classifying all data into either original data or derivative data. Original data is data that was received or collected by the project team, and derivative data is data created from the original data using reproducible code. Original data must be backed up, but derivative data does not have to be backed up as long as the code used to create it is backed up (typically through GitHub or another version control system for code). If the original data and the code is backed up, then the derivative data is implicitly backed up as well. By this method, backing up every single derivative data file which would be storage space inefficient is avoided.<br />
<br />
There are many good alternative data back-up protocols but a great place to start for anyone who has not adapted a back-up protocol for their projects yet is the [https://www.backblaze.com/blog/the-3-2-1-backup-strategy 3-2-1 strategy]. This strategy means that the project has 3 copies of the data, where 2 are easily accessible and 1 is stored on an remote location. This means that one copy, called the first copy, is the only copy that the code ever should read the data from. This copy can be shared in a file syncing service. If the original data file has [[Personally_Identifiable_Information_(PII) | PII information]] (which they often do) the file needs to be encrypted at all times. A de-identified version of this data that does not need to be encrypted can be created for easier day-to-day use.<br />
<br />
If the first copy of the data is accidentally deleted or modified then the team will access the second copy and make a new first copy of the second copy. The second copy could, for example, be stored on a external drive. This external drive can be stored in the office of someone in the project team. If the data has PII data it must be encrypted unless the external drive is stored in a locked safe. When the team has access to a secure physical safe, then it is a good idea to keep the data unencrypted on the drive that is stored locked in the safe. This is to reduce the risk of loosing data due to lost encryption keys. If an external hard drive is not an option, then the second copy can be stored in any other file storage that is not the same service as the first copy.<br />
<br />
Finally, the third copy should be stored off-site. In today's world this usually means somewhere in the cloud. The second copy can also be stored in the cloud, but the difference from the third copy is that the second copy may be stored locally. The third copy may never be stored locally. This is meant to make sure that if theft, fire, disaster etc. or anything similar happens, then the third copy should be stored in a way there is no way it is affected by the same event. The other important consideration for the third copy storage is that it is a suitable long term storage. This means that this storage should not be affected by a single person leaving the team or if it does there should be a clear protocol for how access is maintained when that person is leaving the team. If you are using a sync service for the third copy it should ideally be a different sync service than for the first copy. Regardless which sync service you are using, if you are using a sync service you should never have the third copy synced to any computer. This file should live in an online-only folder. Alternatives to sync services for the third copy would be S3 on AWS or Blob storage on Azure. If you are using such services, see if you can make these files read-only so there is no chance that they are accidentally deleted or modified.<br />
<br />
== Data retention ==<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[*topic name, as listed on main page*]]<br />
<br />
<br />
== Additional Resources ==<br />
* list here other articles related to this topic, with a brief description and link<br />
<br />
[[Category: *category name* ]]</div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Data_Storage&diff=8314Data Storage2021-06-07T15:20:39Z<p>Kbjarkefur: /* Back-up storage types */</p>
<hr />
<div>This article discuss different aspects of data storage (such as different types of storage, data back up and data retention). While [[Data_Security]] is a very important topic related to data storage, it is not covered in this article as there is a dedicated link for that.<br />
<br />
<br />
<br />
== Read First ==<br />
* Data should be version controlled. Ideally the version control system used should both be able to detect changes to the data and provide a possibility to revert to old versions<br />
* Data storage depends on how big the data is, if there is confidential information in the data and how the data will be used.<br />
* All original data (the data sets collected or received by the team) should be backed-up. All derivative data (data sets created from original data) should be created by reproducible code and that code should be version controlled in Git. Then derivative data does not need to be backed-up as it can be re-created.<br />
<br />
== Version control for data ==<br />
Version control is used for two purposes. The first is to keep track of changes and modifications to files and the second one is to provide the possibility to revert a file an old version. For code there is an industry standard that efficiently fulfills these two purposes and that system is called Git. Git can be used to both keep track of changes made to code as well as restore older versions of code. Unfortunately there is no system that as elegantly does both of those two things for data over a longer period of time. There is therefore no industry wide one-size-fits-all solutions for version control of data. This article therefore suggests different methods and the project team should pick the method best for their exact use case.<br />
<br />
=== Version control code that generates data ===<br />
<br />
All derivative datasets (datasets that the project team creates from original data they received or collected) should be generated by reproducible code so that it can be re-created by re-running that code. Therefore, if the original data is properly [[Data_Storage#Back-up_protocols | backed-up]] and all code is version controlled using Git, then the derivative datasets are already implicitly version controlled since changes to derivative datasets can only happen through changes to the code tracked in Git. Git can also be used to restore old version of code that in turn can be used to restore old versions of the derivative data.<br />
<br />
While this method often is an excellent option to version control derivative data, it does not work when the original data is updated frequently (ongoing data collection or when data is continuously received) or when the code is not accessible (someone else is generating the data). Derivative data should, in these cases, still be generated using reproducible code tracked in Git, but the effect is no longer that those derivative datasets are implicitly properly version controlled.<br />
<br />
=== Version control using checksums/hashes ===<br />
<br />
One way to keep track of changes done to data is to use ''checksums'', ''hashes'' or ''data signatures''. These three concepts all work slightly differently but all follows the same principle and serves the same purpose in the context of version control of data. The principle they all follow is that a datasets is boiled down to a short string of text or a number. The same data is always boiled down to the same string or number, but even small differences to the data boils down to a different string or number. That string or number can then quickly and effortlessly be compared across datasets to test if they are identical or not.<br />
<br />
Most checksums or hashes boils down to a string or a number that has no interpretable meaning to humans, it is just meant to serve the purpose of answering the yes/no question if the two datasets are identical or not. It is usually impossible to tell from a has or a checksum if the files are almost identical or very different. One exception to this is the Stata command <code>datasignature</code> that generates a signature that combines a hash value with some basic data on the dataset such as number of observations. While this provides some useful human readable information, it can only be used on Stata <code>.dta</code> files.<br />
<br />
The drawback of all these methods is that it says nothing about how the dataset was created. So there is no way to recreate a dataset based only on the checksum or hash signature. If you have the checksum of a dataset used in the past but not the corresponding version of the dataset, then all you can do is to say that dataset was identical to the current version of the same dataset or not. You cannot say what differs if they are different. If you used the Stata command <code>datasignature</code> you can get some clues of what differs, such as number of observations or number of variables, but you will still not be able to recreate the old version of the data.<br />
<br />
However, this method is a good fit for when you want to have a quick way to test that the dataset has not changed, and details of what has changed either does not matter or you have another way to find out (perhaps manually) what those differences are. Examples of this can be if you are accessing someone else's dataset and you want to know if that person makes changes to that dataset. Another way this can be used is to check which if any datasets are updated when the code is updated. Sometimes we want to make changes to the code that should not change the dataset it generates, and then this method is a great way to verify that.<br />
<br />
=== Version control in sync software ===<br />
<br />
[[Data_Storage#File_sync_services | File syncing software]] often have version control systems that allows you to both detect changes made to data files and allows you to restore old versions of those files. However, the way they it is done in these systems is so storage in-efficient that you can only restore files version that are less than a few months old. If your project is completed with that time frame, then this is a great solution, however, typically a project runs for much longer than that.<br />
<br />
== Data storage types ==<br />
There are different types of data storage option and which one that is best for each use case typically depends on how big the data is, if there is confidential information in the data and how the data will be used. Below are some different storage types with their pros and cons. The type of data that is considered in this article are data files, and not, for example, data stored in data bases. <br />
<br />
=== File sync services ===<br />
<br />
Common examples of file sync services are DropBox, OneDrive, Box among several others. This storage type is suitable when the data files required for a project is not bigger than that they can be stored on a regular laptop or desktop computer. In file sync services a copy of the file exists locally on all computers that has the folder synced. This means that accessing the file is quick and that there is never a problem for multiple people to access the file at the same time as they have their own version of the file. <br />
<br />
However, because all files are saved locally this also means that if one person works on many projects there is a risk that there is not space for all files for those projects on that persons computer. Many file syncing services have options to mitigate this. For example you can chose to not sync the parts of the project folder that you know you do not need access to. Some sync services also allow you to have non-synced files ready to be downloaded on demand (sometimes referred to as ''smart sync''). However, this is not great for data work as data files are usually to big to be instantly downloaded on demand when your code trying to access them leading to your code to crash when trying to read the file.<br />
<br />
Another concern with file syncing services are privacy concerns when sharing confidential data. There are enterprise subscriptions that can be installed on your own servers. This makes sharing data much more secure as the data is never stored or transferred on the servers that belong to the company that offers the syncing service. If you do not have an enterprise subscription or are not sure if you do, you should always [[encryption| encrypt ]] any non-public data before saving it in a synced folder. DIME Analytics have published [https://github.com/worldbank/dime-standards/blob/master/dime-research-standards/pillar-4-data-security/data-security-resources/veracrypt-guidelines.md guidelines to encrypt files with Veracrypt] which is a secure and free software often used for this purpose.<br />
<br />
=== Cloud storage ===<br />
<br />
Cloud storage comes in many different kinds and it is out of the scope of this wiki article to describe them all. This article will only cover general points about cloud storage. This article only covers data stored in files, so cloud storage here typically means [https://aws.amazon.com/s3/ S3 storage on AWS] or [https://docs.microsoft.com/en-us/azure/storage/blobs/ Blob storage on Azure], but are not limited to those.<br />
<br />
The benefit of cloud storage is that you do not need to buy or upgrade any hardware. To get started or to increase your capacity you only need to click a button or it might even be upgraded automatically. However, you will be charged more the more you use. <br />
<br />
Cloud storage might not be the best option when the data will frequently be accessed from laptops and desktops. The download time will be too long even on a fast internet connection for most use cases if the data is downloaded each time the script is run. Instead, cloud storage is best used when frequently accessed from another cloud resource. <br />
<br />
However, make sure that the other cloud storage is in the same physical location as the cloud resource that will use it. The same cloud provider have data centers across the world and you might have issues if your storage is on the other side of the world from the resource that will use it. There are ways to address these issues but they are outside of the scope of this article. The main takeaway is that cloud storage is only a great solution if the resources that needs access to that have a very quick connection to them.<br />
<br />
At the same time, this does not mean that cloud storage should never be accessed from a laptop or a desktop. Cloud storage can be a great place to store data that is infrequently accessed by users on regular computers. There will still be a delay, but that can be acceptable as the data is only infrequently accessed.<br />
<br />
=== Network drive storage ===<br />
<br />
Network drive storage is in many way similar to the types of cloud storage that we discussed above. The major difference for the context discussed here is that they can be located in the same location as a user accessing them from a laptop/desktop making it a viable option for such a user to access it frequently. This is a great solution when a regular laptop/desktop frequently needs access to the data and the combined size of all of the data files are too big to all be stored on the laptop/desktop.<br />
<br />
The problem with this solution is that you need to purchase hardware and run your own network drive. This can be complicated if you do not work at a large organization where there is an IT team that can do this for you. Another drawback is that if you work on an organization that is spread out geographically then you will have the same issues as with cloud storage with slow access speed.<br />
<br />
== Back-up protocols ==<br />
<br />
The DIME Analytics recommendation for data back up starts by classifying all data into either original data or derivative data. Original data is data that was received or collected by the project team, and derivative data is data created from the original data using reproducible code. Original data must be backed up, but derivative data does not have to be backed up as long as the code used to create it is backed up (typically through GitHub or another version control system for code). If the original data and the code is backed up, then the derivative data is implicitly backed up as well. By this method, backing up every single derivative data file which would be storage space inefficient is avoided.<br />
<br />
There are many good alternative data back-up protocols but a great place to start for anyone who has not adapted a back-up protocol for their projects yet is the [https://www.backblaze.com/blog/the-3-2-1-backup-strategy 3-2-1 strategy]. This strategy means that the project has 3 copies of the data, where 2 are easily accessible and 1 is stored on an remote location. This means that one copy, called the first copy, is the only copy that the code ever should read the data from. This copy can be shared in a file syncing service. If the original data file has [[Personally_Identifiable_Information_(PII) | PII information]] (which they often do) the file needs to be encrypted at all times. A de-identified version of this data that does not need to be encrypted can be created for easier day-to-day use.<br />
<br />
If the first copy of the data is accidentally deleted or modified then the team will access the second copy and make a new first copy of the second copy. The second copy could, for example, be stored on a external drive. This external drive can be stored in the office of someone in the project team. If the data has PII data it must be encrypted unless the external drive is stored in a locked safe. When the team has access to a secure physical safe, then it is a good idea to keep the data unencrypted on the drive that is stored locked in the safe. This is to reduce the risk of loosing data due to lost encryption keys. If an external hard drive is not an option, then the second copy can be stored in any other file storage that is not the same service as the first copy.<br />
<br />
Finally, the third copy should be stored off-site. In today's world this usually means somewhere in the cloud. The second copy can also be stored in the cloud, but the difference from the third copy is that the second copy may be stored locally. The third copy may never be stored locally. This is meant to make sure that if theft, fire, disaster etc. or anything similar happens, then the third copy should be stored in a way there is no way it is affected by the same event. The other important consideration for the third copy storage is that it is a suitable long term storage. This means that this storage should not be affected by a single person leaving the team or if it does there should be a clear protocol for how access is maintained when that person is leaving the team.<br />
<br />
== Data retention ==<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[*topic name, as listed on main page*]]<br />
<br />
<br />
== Additional Resources ==<br />
* list here other articles related to this topic, with a brief description and link<br />
<br />
[[Category: *category name* ]]</div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Data_Storage&diff=8313Data Storage2021-06-07T15:20:26Z<p>Kbjarkefur: /* Back-up protocols */</p>
<hr />
<div>This article discuss different aspects of data storage (such as different types of storage, data back up and data retention). While [[Data_Security]] is a very important topic related to data storage, it is not covered in this article as there is a dedicated link for that.<br />
<br />
<br />
<br />
== Read First ==<br />
* Data should be version controlled. Ideally the version control system used should both be able to detect changes to the data and provide a possibility to revert to old versions<br />
* Data storage depends on how big the data is, if there is confidential information in the data and how the data will be used.<br />
* All original data (the data sets collected or received by the team) should be backed-up. All derivative data (data sets created from original data) should be created by reproducible code and that code should be version controlled in Git. Then derivative data does not need to be backed-up as it can be re-created.<br />
<br />
== Version control for data ==<br />
Version control is used for two purposes. The first is to keep track of changes and modifications to files and the second one is to provide the possibility to revert a file an old version. For code there is an industry standard that efficiently fulfills these two purposes and that system is called Git. Git can be used to both keep track of changes made to code as well as restore older versions of code. Unfortunately there is no system that as elegantly does both of those two things for data over a longer period of time. There is therefore no industry wide one-size-fits-all solutions for version control of data. This article therefore suggests different methods and the project team should pick the method best for their exact use case.<br />
<br />
=== Version control code that generates data ===<br />
<br />
All derivative datasets (datasets that the project team creates from original data they received or collected) should be generated by reproducible code so that it can be re-created by re-running that code. Therefore, if the original data is properly [[Data_Storage#Back-up_protocols | backed-up]] and all code is version controlled using Git, then the derivative datasets are already implicitly version controlled since changes to derivative datasets can only happen through changes to the code tracked in Git. Git can also be used to restore old version of code that in turn can be used to restore old versions of the derivative data.<br />
<br />
While this method often is an excellent option to version control derivative data, it does not work when the original data is updated frequently (ongoing data collection or when data is continuously received) or when the code is not accessible (someone else is generating the data). Derivative data should, in these cases, still be generated using reproducible code tracked in Git, but the effect is no longer that those derivative datasets are implicitly properly version controlled.<br />
<br />
=== Version control using checksums/hashes ===<br />
<br />
One way to keep track of changes done to data is to use ''checksums'', ''hashes'' or ''data signatures''. These three concepts all work slightly differently but all follows the same principle and serves the same purpose in the context of version control of data. The principle they all follow is that a datasets is boiled down to a short string of text or a number. The same data is always boiled down to the same string or number, but even small differences to the data boils down to a different string or number. That string or number can then quickly and effortlessly be compared across datasets to test if they are identical or not.<br />
<br />
Most checksums or hashes boils down to a string or a number that has no interpretable meaning to humans, it is just meant to serve the purpose of answering the yes/no question if the two datasets are identical or not. It is usually impossible to tell from a has or a checksum if the files are almost identical or very different. One exception to this is the Stata command <code>datasignature</code> that generates a signature that combines a hash value with some basic data on the dataset such as number of observations. While this provides some useful human readable information, it can only be used on Stata <code>.dta</code> files.<br />
<br />
The drawback of all these methods is that it says nothing about how the dataset was created. So there is no way to recreate a dataset based only on the checksum or hash signature. If you have the checksum of a dataset used in the past but not the corresponding version of the dataset, then all you can do is to say that dataset was identical to the current version of the same dataset or not. You cannot say what differs if they are different. If you used the Stata command <code>datasignature</code> you can get some clues of what differs, such as number of observations or number of variables, but you will still not be able to recreate the old version of the data.<br />
<br />
However, this method is a good fit for when you want to have a quick way to test that the dataset has not changed, and details of what has changed either does not matter or you have another way to find out (perhaps manually) what those differences are. Examples of this can be if you are accessing someone else's dataset and you want to know if that person makes changes to that dataset. Another way this can be used is to check which if any datasets are updated when the code is updated. Sometimes we want to make changes to the code that should not change the dataset it generates, and then this method is a great way to verify that.<br />
<br />
=== Version control in sync software ===<br />
<br />
[[Data_Storage#File_sync_services | File syncing software]] often have version control systems that allows you to both detect changes made to data files and allows you to restore old versions of those files. However, the way they it is done in these systems is so storage in-efficient that you can only restore files version that are less than a few months old. If your project is completed with that time frame, then this is a great solution, however, typically a project runs for much longer than that.<br />
<br />
== Data storage types ==<br />
There are different types of data storage option and which one that is best for each use case typically depends on how big the data is, if there is confidential information in the data and how the data will be used. Below are some different storage types with their pros and cons. The type of data that is considered in this article are data files, and not, for example, data stored in data bases. <br />
<br />
=== File sync services ===<br />
<br />
Common examples of file sync services are DropBox, OneDrive, Box among several others. This storage type is suitable when the data files required for a project is not bigger than that they can be stored on a regular laptop or desktop computer. In file sync services a copy of the file exists locally on all computers that has the folder synced. This means that accessing the file is quick and that there is never a problem for multiple people to access the file at the same time as they have their own version of the file. <br />
<br />
However, because all files are saved locally this also means that if one person works on many projects there is a risk that there is not space for all files for those projects on that persons computer. Many file syncing services have options to mitigate this. For example you can chose to not sync the parts of the project folder that you know you do not need access to. Some sync services also allow you to have non-synced files ready to be downloaded on demand (sometimes referred to as ''smart sync''). However, this is not great for data work as data files are usually to big to be instantly downloaded on demand when your code trying to access them leading to your code to crash when trying to read the file.<br />
<br />
Another concern with file syncing services are privacy concerns when sharing confidential data. There are enterprise subscriptions that can be installed on your own servers. This makes sharing data much more secure as the data is never stored or transferred on the servers that belong to the company that offers the syncing service. If you do not have an enterprise subscription or are not sure if you do, you should always [[encryption| encrypt ]] any non-public data before saving it in a synced folder. DIME Analytics have published [https://github.com/worldbank/dime-standards/blob/master/dime-research-standards/pillar-4-data-security/data-security-resources/veracrypt-guidelines.md guidelines to encrypt files with Veracrypt] which is a secure and free software often used for this purpose.<br />
<br />
=== Cloud storage ===<br />
<br />
Cloud storage comes in many different kinds and it is out of the scope of this wiki article to describe them all. This article will only cover general points about cloud storage. This article only covers data stored in files, so cloud storage here typically means [https://aws.amazon.com/s3/ S3 storage on AWS] or [https://docs.microsoft.com/en-us/azure/storage/blobs/ Blob storage on Azure], but are not limited to those.<br />
<br />
The benefit of cloud storage is that you do not need to buy or upgrade any hardware. To get started or to increase your capacity you only need to click a button or it might even be upgraded automatically. However, you will be charged more the more you use. <br />
<br />
Cloud storage might not be the best option when the data will frequently be accessed from laptops and desktops. The download time will be too long even on a fast internet connection for most use cases if the data is downloaded each time the script is run. Instead, cloud storage is best used when frequently accessed from another cloud resource. <br />
<br />
However, make sure that the other cloud storage is in the same physical location as the cloud resource that will use it. The same cloud provider have data centers across the world and you might have issues if your storage is on the other side of the world from the resource that will use it. There are ways to address these issues but they are outside of the scope of this article. The main takeaway is that cloud storage is only a great solution if the resources that needs access to that have a very quick connection to them.<br />
<br />
At the same time, this does not mean that cloud storage should never be accessed from a laptop or a desktop. Cloud storage can be a great place to store data that is infrequently accessed by users on regular computers. There will still be a delay, but that can be acceptable as the data is only infrequently accessed.<br />
<br />
=== Network drive storage ===<br />
<br />
Network drive storage is in many way similar to the types of cloud storage that we discussed above. The major difference for the context discussed here is that they can be located in the same location as a user accessing them from a laptop/desktop making it a viable option for such a user to access it frequently. This is a great solution when a regular laptop/desktop frequently needs access to the data and the combined size of all of the data files are too big to all be stored on the laptop/desktop.<br />
<br />
The problem with this solution is that you need to purchase hardware and run your own network drive. This can be complicated if you do not work at a large organization where there is an IT team that can do this for you. Another drawback is that if you work on an organization that is spread out geographically then you will have the same issues as with cloud storage with slow access speed.<br />
<br />
== Back-up protocols ==<br />
<br />
The DIME Analytics recommendation for data back up starts by classifying all data into either original data or derivative data. Original data is data that was received or collected by the project team, and derivative data is data created from the original data using reproducible code. Original data must be backed up, but derivative data does not have to be backed up as long as the code used to create it is backed up (typically through GitHub or another version control system for code). If the original data and the code is backed up, then the derivative data is implicitly backed up as well. By this method, backing up every single derivative data file which would be storage space inefficient is avoided.<br />
<br />
There are many good alternative data back-up protocols but a great place to start for anyone who has not adapted a back-up protocol for their projects yet is the [https://www.backblaze.com/blog/the-3-2-1-backup-strategy 3-2-1 strategy]. This strategy means that the project has 3 copies of the data, where 2 are easily accessible and 1 is stored on an remote location. This means that one copy, called the first copy, is the only copy that the code ever should read the data from. This copy can be shared in a file syncing service. If the original data file has [[Personally_Identifiable_Information_(PII) | PII information]] (which they often do) the file needs to be encrypted at all times. A de-identified version of this data that does not need to be encrypted can be created for easier day-to-day use.<br />
<br />
If the first copy of the data is accidentally deleted or modified then the team will access the second copy and make a new first copy of the second copy. The second copy could, for example, be stored on a external drive. This external drive can be stored in the office of someone in the project team. If the data has PII data it must be encrypted unless the external drive is stored in a locked safe. When the team has access to a secure physical safe, then it is a good idea to keep the data unencrypted on the drive that is stored locked in the safe. This is to reduce the risk of loosing data due to lost encryption keys. If an external hard drive is not an option, then the second copy can be stored in any other file storage that is not the same service as the first copy.<br />
<br />
Finally, the third copy should be stored off-site. In today's world this usually means somewhere in the cloud. The second copy can also be stored in the cloud, but the difference from the third copy is that the second copy may be stored locally. The third copy may never be stored locally. This is meant to make sure that if theft, fire, disaster etc. or anything similar happens, then the third copy should be stored in a way there is no way it is affected by the same event. The other important consideration for the third copy storage is that it is a suitable long term storage. This means that this storage should not be affected by a single person leaving the team or if it does there should be a clear protocol for how access is maintained when that person is leaving the team.<br />
<br />
== Back-up storage types ==<br />
You should never use the same file or storage solution for your back-up copy of your original or raw data, as the files and storage solution you use for your day-to-day work. While some storage types could be used for both day-to-day and backup, this sections covers the former use case. Backing up of data is covered in a later section below. There are multiple types of storage and what type is best for you is how depends on the following factors: size of your data, where it will be used and cost.<br />
<br />
== Data retention ==<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[*topic name, as listed on main page*]]<br />
<br />
<br />
== Additional Resources ==<br />
* list here other articles related to this topic, with a brief description and link<br />
<br />
[[Category: *category name* ]]</div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Data_Storage&diff=8312Data Storage2021-06-07T13:56:53Z<p>Kbjarkefur: /* Back-up protocols */</p>
<hr />
<div>This article discuss different aspects of data storage (such as different types of storage, data back up and data retention). While [[Data_Security]] is a very important topic related to data storage, it is not covered in this article as there is a dedicated link for that.<br />
<br />
<br />
<br />
== Read First ==<br />
* Data should be version controlled. Ideally the version control system used should both be able to detect changes to the data and provide a possibility to revert to old versions<br />
* Data storage depends on how big the data is, if there is confidential information in the data and how the data will be used.<br />
* All original data (the data sets collected or received by the team) should be backed-up. All derivative data (data sets created from original data) should be created by reproducible code and that code should be version controlled in Git. Then derivative data does not need to be backed-up as it can be re-created.<br />
<br />
== Version control for data ==<br />
Version control is used for two purposes. The first is to keep track of changes and modifications to files and the second one is to provide the possibility to revert a file an old version. For code there is an industry standard that efficiently fulfills these two purposes and that system is called Git. Git can be used to both keep track of changes made to code as well as restore older versions of code. Unfortunately there is no system that as elegantly does both of those two things for data over a longer period of time. There is therefore no industry wide one-size-fits-all solutions for version control of data. This article therefore suggests different methods and the project team should pick the method best for their exact use case.<br />
<br />
=== Version control code that generates data ===<br />
<br />
All derivative datasets (datasets that the project team creates from original data they received or collected) should be generated by reproducible code so that it can be re-created by re-running that code. Therefore, if the original data is properly [[Data_Storage#Back-up_protocols | backed-up]] and all code is version controlled using Git, then the derivative datasets are already implicitly version controlled since changes to derivative datasets can only happen through changes to the code tracked in Git. Git can also be used to restore old version of code that in turn can be used to restore old versions of the derivative data.<br />
<br />
While this method often is an excellent option to version control derivative data, it does not work when the original data is updated frequently (ongoing data collection or when data is continuously received) or when the code is not accessible (someone else is generating the data). Derivative data should, in these cases, still be generated using reproducible code tracked in Git, but the effect is no longer that those derivative datasets are implicitly properly version controlled.<br />
<br />
=== Version control using checksums/hashes ===<br />
<br />
One way to keep track of changes done to data is to use ''checksums'', ''hashes'' or ''data signatures''. These three concepts all work slightly differently but all follows the same principle and serves the same purpose in the context of version control of data. The principle they all follow is that a datasets is boiled down to a short string of text or a number. The same data is always boiled down to the same string or number, but even small differences to the data boils down to a different string or number. That string or number can then quickly and effortlessly be compared across datasets to test if they are identical or not.<br />
<br />
Most checksums or hashes boils down to a string or a number that has no interpretable meaning to humans, it is just meant to serve the purpose of answering the yes/no question if the two datasets are identical or not. It is usually impossible to tell from a has or a checksum if the files are almost identical or very different. One exception to this is the Stata command <code>datasignature</code> that generates a signature that combines a hash value with some basic data on the dataset such as number of observations. While this provides some useful human readable information, it can only be used on Stata <code>.dta</code> files.<br />
<br />
The drawback of all these methods is that it says nothing about how the dataset was created. So there is no way to recreate a dataset based only on the checksum or hash signature. If you have the checksum of a dataset used in the past but not the corresponding version of the dataset, then all you can do is to say that dataset was identical to the current version of the same dataset or not. You cannot say what differs if they are different. If you used the Stata command <code>datasignature</code> you can get some clues of what differs, such as number of observations or number of variables, but you will still not be able to recreate the old version of the data.<br />
<br />
However, this method is a good fit for when you want to have a quick way to test that the dataset has not changed, and details of what has changed either does not matter or you have another way to find out (perhaps manually) what those differences are. Examples of this can be if you are accessing someone else's dataset and you want to know if that person makes changes to that dataset. Another way this can be used is to check which if any datasets are updated when the code is updated. Sometimes we want to make changes to the code that should not change the dataset it generates, and then this method is a great way to verify that.<br />
<br />
=== Version control in sync software ===<br />
<br />
[[Data_Storage#File_sync_services | File syncing software]] often have version control systems that allows you to both detect changes made to data files and allows you to restore old versions of those files. However, the way they it is done in these systems is so storage in-efficient that you can only restore files version that are less than a few months old. If your project is completed with that time frame, then this is a great solution, however, typically a project runs for much longer than that.<br />
<br />
== Data storage types ==<br />
There are different types of data storage option and which one that is best for each use case typically depends on how big the data is, if there is confidential information in the data and how the data will be used. Below are some different storage types with their pros and cons. The type of data that is considered in this article are data files, and not, for example, data stored in data bases. <br />
<br />
=== File sync services ===<br />
<br />
Common examples of file sync services are DropBox, OneDrive, Box among several others. This storage type is suitable when the data files required for a project is not bigger than that they can be stored on a regular laptop or desktop computer. In file sync services a copy of the file exists locally on all computers that has the folder synced. This means that accessing the file is quick and that there is never a problem for multiple people to access the file at the same time as they have their own version of the file. <br />
<br />
However, because all files are saved locally this also means that if one person works on many projects there is a risk that there is not space for all files for those projects on that persons computer. Many file syncing services have options to mitigate this. For example you can chose to not sync the parts of the project folder that you know you do not need access to. Some sync services also allow you to have non-synced files ready to be downloaded on demand (sometimes referred to as ''smart sync''). However, this is not great for data work as data files are usually to big to be instantly downloaded on demand when your code trying to access them leading to your code to crash when trying to read the file.<br />
<br />
Another concern with file syncing services are privacy concerns when sharing confidential data. There are enterprise subscriptions that can be installed on your own servers. This makes sharing data much more secure as the data is never stored or transferred on the servers that belong to the company that offers the syncing service. If you do not have an enterprise subscription or are not sure if you do, you should always [[encryption| encrypt ]] any non-public data before saving it in a synced folder. DIME Analytics have published [https://github.com/worldbank/dime-standards/blob/master/dime-research-standards/pillar-4-data-security/data-security-resources/veracrypt-guidelines.md guidelines to encrypt files with Veracrypt] which is a secure and free software often used for this purpose.<br />
<br />
=== Cloud storage ===<br />
<br />
Cloud storage comes in many different kinds and it is out of the scope of this wiki article to describe them all. This article will only cover general points about cloud storage. This article only covers data stored in files, so cloud storage here typically means [https://aws.amazon.com/s3/ S3 storage on AWS] or [https://docs.microsoft.com/en-us/azure/storage/blobs/ Blob storage on Azure], but are not limited to those.<br />
<br />
The benefit of cloud storage is that you do not need to buy or upgrade any hardware. To get started or to increase your capacity you only need to click a button or it might even be upgraded automatically. However, you will be charged more the more you use. <br />
<br />
Cloud storage might not be the best option when the data will frequently be accessed from laptops and desktops. The download time will be too long even on a fast internet connection for most use cases if the data is downloaded each time the script is run. Instead, cloud storage is best used when frequently accessed from another cloud resource. <br />
<br />
However, make sure that the other cloud storage is in the same physical location as the cloud resource that will use it. The same cloud provider have data centers across the world and you might have issues if your storage is on the other side of the world from the resource that will use it. There are ways to address these issues but they are outside of the scope of this article. The main takeaway is that cloud storage is only a great solution if the resources that needs access to that have a very quick connection to them.<br />
<br />
At the same time, this does not mean that cloud storage should never be accessed from a laptop or a desktop. Cloud storage can be a great place to store data that is infrequently accessed by users on regular computers. There will still be a delay, but that can be acceptable as the data is only infrequently accessed.<br />
<br />
=== Network drive storage ===<br />
<br />
Network drive storage is in many way similar to the types of cloud storage that we discussed above. The major difference for the context discussed here is that they can be located in the same location as a user accessing them from a laptop/desktop making it a viable option for such a user to access it frequently. This is a great solution when a regular laptop/desktop frequently needs access to the data and the combined size of all of the data files are too big to all be stored on the laptop/desktop.<br />
<br />
The problem with this solution is that you need to purchase hardware and run your own network drive. This can be complicated if you do not work at a large organization where there is an IT team that can do this for you. Another drawback is that if you work on an organization that is spread out geographically then you will have the same issues as with cloud storage with slow access speed.<br />
<br />
== Back-up protocols ==<br />
<br />
The DIME Analytics recommendation for data back up starts by classifying all data into original data and derivative data. Original data is data that was received or collected by the project, and derivative data is data created from the original data using reproducible code. Original data must be backed up, but derivative data does not have to be backed up as long as the code used to create it is backed up (typically through GitHub or another version control system for code). If the original data and the code is backed up, then the derivative data is implicitly backed up as well. By this method we avoid backing up every single derivative data file which would be storage space inefficient.<br />
<br />
There are many good alternative data back-up protocols but a great place to start for anyone who has not adapted a back-up protocol for their projects yet is the [https://www.backblaze.com/blog/the-3-2-1-backup-strategy 3-2-1 strategy]. This strategy means that you have 3 copies of the data, where 2 are easily accessible to you and 1 is stored on an remote location. What this means in practice is that you have one version of the data that<br />
<br />
== Back-up storage types ==<br />
You should never use the same file or storage solution for your back-up copy of your original or raw data, as the files and storage solution you use for your day-to-day work. While some storage types could be used for both day-to-day and backup, this sections covers the former use case. Backing up of data is covered in a later section below. There are multiple types of storage and what type is best for you is how depends on the following factors: size of your data, where it will be used and cost.<br />
<br />
== Data retention ==<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[*topic name, as listed on main page*]]<br />
<br />
<br />
== Additional Resources ==<br />
* list here other articles related to this topic, with a brief description and link<br />
<br />
[[Category: *category name* ]]</div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Data_Storage&diff=8310Data Storage2021-06-03T22:41:31Z<p>Kbjarkefur: /* Cloud storage */</p>
<hr />
<div>This article discuss different aspects of data storage (such as different types of storage, data back up and data retention). While [[Data_Security]] is a very important topic related to data storage, it is not covered in this article as there is a dedicated link for that.<br />
<br />
<br />
<br />
== Read First ==<br />
* Data should be version controlled. Ideally the version control system used should both be able to detect changes to the data and provide a possibility to revert to old versions<br />
* Data storage depends on how big the data is, if there is confidential information in the data and how the data will be used.<br />
* All original data (the data sets collected or received by the team) should be backed-up. All derivative data (data sets created from original data) should be created by reproducible code and that code should be version controlled in Git. Then derivative data does not need to be backed-up as it can be re-created.<br />
<br />
== Version control for data ==<br />
Version control is used for two purposes. The first is to keep track of changes and modifications to files and the second one is to provide the possibility to revert a file an old version. For code there is an industry standard that efficiently fulfills these two purposes and that system is called Git. Git can be used to both keep track of changes made to code as well as restore older versions of code. Unfortunately there is no system that as elegantly does both of those two things for data over a longer period of time. There is therefore no industry wide one-size-fits-all solutions for version control of data. This article therefore suggests different methods and the project team should pick the method best for their exact use case.<br />
<br />
=== Version control code that generates data ===<br />
<br />
All derivative datasets (datasets that the project team creates from original data they received or collected) should be generated by reproducible code so that it can be re-created by re-running that code. Therefore, if the original data is properly [[Data_Storage#Back-up_protocols | backed-up]] and all code is version controlled using Git, then the derivative datasets are already implicitly version controlled since changes to derivative datasets can only happen through changes to the code tracked in Git. Git can also be used to restore old version of code that in turn can be used to restore old versions of the derivative data.<br />
<br />
While this method often is an excellent option to version control derivative data, it does not work when the original data is updated frequently (ongoing data collection or when data is continuously received) or when the code is not accessible (someone else is generating the data). Derivative data should, in these cases, still be generated using reproducible code tracked in Git, but the effect is no longer that those derivative datasets are implicitly properly version controlled.<br />
<br />
=== Version control using checksums/hashes ===<br />
<br />
One way to keep track of changes done to data is to use ''checksums'', ''hashes'' or ''data signatures''. These three concepts all work slightly differently but all follows the same principle and serves the same purpose in the context of version control of data. The principle they all follow is that a datasets is boiled down to a short string of text or a number. The same data is always boiled down to the same string or number, but even small differences to the data boils down to a different string or number. That string or number can then quickly and effortlessly be compared across datasets to test if they are identical or not.<br />
<br />
Most checksums or hashes boils down to a string or a number that has no interpretable meaning to humans, it is just meant to serve the purpose of answering the yes/no question if the two datasets are identical or not. It is usually impossible to tell from a has or a checksum if the files are almost identical or very different. One exception to this is the Stata command <code>datasignature</code> that generates a signature that combines a hash value with some basic data on the dataset such as number of observations. While this provides some useful human readable information, it can only be used on Stata <code>.dta</code> files.<br />
<br />
The drawback of all these methods is that it says nothing about how the dataset was created. So there is no way to recreate a dataset based only on the checksum or hash signature. If you have the checksum of a dataset used in the past but not the corresponding version of the dataset, then all you can do is to say that dataset was identical to the current version of the same dataset or not. You cannot say what differs if they are different. If you used the Stata command <code>datasignature</code> you can get some clues of what differs, such as number of observations or number of variables, but you will still not be able to recreate the old version of the data.<br />
<br />
However, this method is a good fit for when you want to have a quick way to test that the dataset has not changed, and details of what has changed either does not matter or you have another way to find out (perhaps manually) what those differences are. Examples of this can be if you are accessing someone else's dataset and you want to know if that person makes changes to that dataset. Another way this can be used is to check which if any datasets are updated when the code is updated. Sometimes we want to make changes to the code that should not change the dataset it generates, and then this method is a great way to verify that.<br />
<br />
=== Version control in sync software ===<br />
<br />
[[Data_Storage#File_sync_services | File syncing software]] often have version control systems that allows you to both detect changes made to data files and allows you to restore old versions of those files. However, the way they it is done in these systems is so storage in-efficient that you can only restore files version that are less than a few months old. If your project is completed with that time frame, then this is a great solution, however, typically a project runs for much longer than that.<br />
<br />
== Data storage types ==<br />
There are different types of data storage option and which one that is best for each use case typically depends on how big the data is, if there is confidential information in the data and how the data will be used. Below are some different storage types with their pros and cons. The type of data that is considered in this article are data files, and not, for example, data stored in data bases. <br />
<br />
=== File sync services ===<br />
<br />
Common examples of file sync services are DropBox, OneDrive, Box among several others. This storage type is suitable when the data files required for a project is not bigger than that they can be stored on a regular laptop or desktop computer. In file sync services a copy of the file exists locally on all computers that has the folder synced. This means that accessing the file is quick and that there is never a problem for multiple people to access the file at the same time as they have their own version of the file. <br />
<br />
However, because all files are saved locally this also means that if one person works on many projects there is a risk that there is not space for all files for those projects on that persons computer. Many file syncing services have options to mitigate this. For example you can chose to not sync the parts of the project folder that you know you do not need access to. Some sync services also allow you to have non-synced files ready to be downloaded on demand (sometimes referred to as ''smart sync''). However, this is not great for data work as data files are usually to big to be instantly downloaded on demand when your code trying to access them leading to your code to crash when trying to read the file.<br />
<br />
Another concern with file syncing services are privacy concerns when sharing confidential data. There are enterprise subscriptions that can be installed on your own servers. This makes sharing data much more secure as the data is never stored or transferred on the servers that belong to the company that offers the syncing service. If you do not have an enterprise subscription or are not sure if you do, you should always [[encryption| encrypt ]] any non-public data before saving it in a synced folder. DIME Analytics have published [https://github.com/worldbank/dime-standards/blob/master/dime-research-standards/pillar-4-data-security/data-security-resources/veracrypt-guidelines.md guidelines to encrypt files with Veracrypt] which is a secure and free software often used for this purpose.<br />
<br />
=== Cloud storage ===<br />
<br />
Cloud storage comes in many different kinds and it is out of the scope of this wiki article to describe them all. This article will only cover general points about cloud storage. This article only covers data stored in files, so cloud storage here typically means [https://aws.amazon.com/s3/ S3 storage on AWS] or [https://docs.microsoft.com/en-us/azure/storage/blobs/ Blob storage on Azure], but are not limited to those.<br />
<br />
The benefit of cloud storage is that you do not need to buy or upgrade any hardware. To get started or to increase your capacity you only need to click a button or it might even be upgraded automatically. However, you will be charged more the more you use. <br />
<br />
Cloud storage might not be the best option when the data will frequently be accessed from laptops and desktops. The download time will be too long even on a fast internet connection for most use cases if the data is downloaded each time the script is run. Instead, cloud storage is best used when frequently accessed from another cloud resource. <br />
<br />
However, make sure that the other cloud storage is in the same physical location as the cloud resource that will use it. The same cloud provider have data centers across the world and you might have issues if your storage is on the other side of the world from the resource that will use it. There are ways to address these issues but they are outside of the scope of this article. The main takeaway is that cloud storage is only a great solution if the resources that needs access to that have a very quick connection to them.<br />
<br />
At the same time, this does not mean that cloud storage should never be accessed from a laptop or a desktop. Cloud storage can be a great place to store data that is infrequently accessed by users on regular computers. There will still be a delay, but that can be acceptable as the data is only infrequently accessed.<br />
<br />
=== Network drive storage ===<br />
<br />
Network drive storage is in many way similar to the types of cloud storage that we discussed above. The major difference for the context discussed here is that they can be located in the same location as a user accessing them from a laptop/desktop making it a viable option for such a user to access it frequently. This is a great solution when a regular laptop/desktop frequently needs access to the data and the combined size of all of the data files are too big to all be stored on the laptop/desktop.<br />
<br />
The problem with this solution is that you need to purchase hardware and run your own network drive. This can be complicated if you do not work at a large organization where there is an IT team that can do this for you. Another drawback is that if you work on an organization that is spread out geographically then you will have the same issues as with cloud storage with slow access speed.<br />
<br />
== Back-up protocols ==<br />
<br />
Use<br />
<br />
== Back-up storage types ==<br />
You should never use the same file or storage solution for your back-up copy of your original or raw data, as the files and storage solution you use for your day-to-day work. While some storage types could be used for both day-to-day and backup, this sections covers the former use case. Backing up of data is covered in a later section below. There are multiple types of storage and what type is best for you is how depends on the following factors: size of your data, where it will be used and cost.<br />
<br />
== Data retention ==<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[*topic name, as listed on main page*]]<br />
<br />
<br />
== Additional Resources ==<br />
* list here other articles related to this topic, with a brief description and link<br />
<br />
[[Category: *category name* ]]</div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Data_Storage&diff=8309Data Storage2021-06-03T22:14:22Z<p>Kbjarkefur: /* File sync services */</p>
<hr />
<div>This article discuss different aspects of data storage (such as different types of storage, data back up and data retention). While [[Data_Security]] is a very important topic related to data storage, it is not covered in this article as there is a dedicated link for that.<br />
<br />
<br />
<br />
== Read First ==<br />
* Data should be version controlled. Ideally the version control system used should both be able to detect changes to the data and provide a possibility to revert to old versions<br />
* Data storage depends on how big the data is, if there is confidential information in the data and how the data will be used.<br />
* All original data (the data sets collected or received by the team) should be backed-up. All derivative data (data sets created from original data) should be created by reproducible code and that code should be version controlled in Git. Then derivative data does not need to be backed-up as it can be re-created.<br />
<br />
== Version control for data ==<br />
Version control is used for two purposes. The first is to keep track of changes and modifications to files and the second one is to provide the possibility to revert a file an old version. For code there is an industry standard that efficiently fulfills these two purposes and that system is called Git. Git can be used to both keep track of changes made to code as well as restore older versions of code. Unfortunately there is no system that as elegantly does both of those two things for data over a longer period of time. There is therefore no industry wide one-size-fits-all solutions for version control of data. This article therefore suggests different methods and the project team should pick the method best for their exact use case.<br />
<br />
=== Version control code that generates data ===<br />
<br />
All derivative datasets (datasets that the project team creates from original data they received or collected) should be generated by reproducible code so that it can be re-created by re-running that code. Therefore, if the original data is properly [[Data_Storage#Back-up_protocols | backed-up]] and all code is version controlled using Git, then the derivative datasets are already implicitly version controlled since changes to derivative datasets can only happen through changes to the code tracked in Git. Git can also be used to restore old version of code that in turn can be used to restore old versions of the derivative data.<br />
<br />
While this method often is an excellent option to version control derivative data, it does not work when the original data is updated frequently (ongoing data collection or when data is continuously received) or when the code is not accessible (someone else is generating the data). Derivative data should, in these cases, still be generated using reproducible code tracked in Git, but the effect is no longer that those derivative datasets are implicitly properly version controlled.<br />
<br />
=== Version control using checksums/hashes ===<br />
<br />
One way to keep track of changes done to data is to use ''checksums'', ''hashes'' or ''data signatures''. These three concepts all work slightly differently but all follows the same principle and serves the same purpose in the context of version control of data. The principle they all follow is that a datasets is boiled down to a short string of text or a number. The same data is always boiled down to the same string or number, but even small differences to the data boils down to a different string or number. That string or number can then quickly and effortlessly be compared across datasets to test if they are identical or not.<br />
<br />
Most checksums or hashes boils down to a string or a number that has no interpretable meaning to humans, it is just meant to serve the purpose of answering the yes/no question if the two datasets are identical or not. It is usually impossible to tell from a has or a checksum if the files are almost identical or very different. One exception to this is the Stata command <code>datasignature</code> that generates a signature that combines a hash value with some basic data on the dataset such as number of observations. While this provides some useful human readable information, it can only be used on Stata <code>.dta</code> files.<br />
<br />
The drawback of all these methods is that it says nothing about how the dataset was created. So there is no way to recreate a dataset based only on the checksum or hash signature. If you have the checksum of a dataset used in the past but not the corresponding version of the dataset, then all you can do is to say that dataset was identical to the current version of the same dataset or not. You cannot say what differs if they are different. If you used the Stata command <code>datasignature</code> you can get some clues of what differs, such as number of observations or number of variables, but you will still not be able to recreate the old version of the data.<br />
<br />
However, this method is a good fit for when you want to have a quick way to test that the dataset has not changed, and details of what has changed either does not matter or you have another way to find out (perhaps manually) what those differences are. Examples of this can be if you are accessing someone else's dataset and you want to know if that person makes changes to that dataset. Another way this can be used is to check which if any datasets are updated when the code is updated. Sometimes we want to make changes to the code that should not change the dataset it generates, and then this method is a great way to verify that.<br />
<br />
=== Version control in sync software ===<br />
<br />
[[Data_Storage#File_sync_services | File syncing software]] often have version control systems that allows you to both detect changes made to data files and allows you to restore old versions of those files. However, the way they it is done in these systems is so storage in-efficient that you can only restore files version that are less than a few months old. If your project is completed with that time frame, then this is a great solution, however, typically a project runs for much longer than that.<br />
<br />
== Data storage types ==<br />
There are different types of data storage option and which one that is best for each use case typically depends on how big the data is, if there is confidential information in the data and how the data will be used. Below are some different storage types with their pros and cons. The type of data that is considered in this article are data files, and not, for example, data stored in data bases. <br />
<br />
=== File sync services ===<br />
<br />
Common examples of file sync services are DropBox, OneDrive, Box among several others. This storage type is suitable when the data files required for a project is not bigger than that they can be stored on a regular laptop or desktop computer. In file sync services a copy of the file exists locally on all computers that has the folder synced. This means that accessing the file is quick and that there is never a problem for multiple people to access the file at the same time as they have their own version of the file. <br />
<br />
However, because all files are saved locally this also means that if one person works on many projects there is a risk that there is not space for all files for those projects on that persons computer. Many file syncing services have options to mitigate this. For example you can chose to not sync the parts of the project folder that you know you do not need access to. Some sync services also allow you to have non-synced files ready to be downloaded on demand (sometimes referred to as ''smart sync''). However, this is not great for data work as data files are usually to big to be instantly downloaded on demand when your code trying to access them leading to your code to crash when trying to read the file.<br />
<br />
Another concern with file syncing services are privacy concerns when sharing confidential data. There are enterprise subscriptions that can be installed on your own servers. This makes sharing data much more secure as the data is never stored or transferred on the servers that belong to the company that offers the syncing service. If you do not have an enterprise subscription or are not sure if you do, you should always [[encryption| encrypt ]] any non-public data before saving it in a synced folder. DIME Analytics have published [https://github.com/worldbank/dime-standards/blob/master/dime-research-standards/pillar-4-data-security/data-security-resources/veracrypt-guidelines.md guidelines to encrypt files with Veracrypt] which is a secure and free software often used for this purpose.<br />
<br />
=== Cloud storage ===<br />
<br />
Cloud storage comes in many different kinds and it is out of the scope of this wiki article to describe them all. This article will only cover general points about cloud storage. This article only covers data stored in files, so cloud storage here typically means [https://aws.amazon.com/s3/ S3 storage on AWS] or [https://docs.microsoft.com/en-us/azure/storage/blobs/ Blob storage on Azure], but are not limited to those.<br />
<br />
The benefit of cloud storage is that you do not need to buy or upgrade any hardware. To get started or to increase your capacity you only need to click a button or it might even be upgraded automatically. However, you will be charged more the more you use. <br />
<br />
Cloud storage might not be the best option when the data will frequently be accessed from laptops and desktops. Even on a fast internet connection will the download time be too long for most use cases if the data is downloaded each time the script is run. Instead, cloud storage is best used when frequently accessed from another cloud resource. However, make sure that the other cloud storage is in the same physical location as the cloud resource that will use it. The same cloud provider have data centers across the world and you might have issues if your storage is on the other side of the world from the resource that will use it. There are ways to address these issues but they are outside of the scope of this article. The main takeaway is that cloud storage is only a great solution if the resources that needs access to that have a very quick connection to them.<br />
<br />
At the same time, this does not mean that cloud storage should never be accessed from a laptop or a desktop. Cloud storage can be a great place to store data that is infrequently accessed by users on regular computers. There will still be a delay, but that can be acceptable as the data is only infrequently accessed.<br />
<br />
=== Network drive storage ===<br />
<br />
Network drive storage is in many way similar to the types of cloud storage that we discussed above. The major difference for the context discussed here is that they can be located in the same location as a user accessing them from a laptop/desktop making it a viable option for such a user to access it frequently. This is a great solution when a regular laptop/desktop frequently needs access to the data and the combined size of all of the data files are too big to all be stored on the laptop/desktop.<br />
<br />
The problem with this solution is that you need to purchase hardware and run your own network drive. This can be complicated if you do not work at a large organization where there is an IT team that can do this for you. Another drawback is that if you work on an organization that is spread out geographically then you will have the same issues as with cloud storage with slow access speed.<br />
<br />
== Back-up protocols ==<br />
<br />
Use<br />
<br />
== Back-up storage types ==<br />
You should never use the same file or storage solution for your back-up copy of your original or raw data, as the files and storage solution you use for your day-to-day work. While some storage types could be used for both day-to-day and backup, this sections covers the former use case. Backing up of data is covered in a later section below. There are multiple types of storage and what type is best for you is how depends on the following factors: size of your data, where it will be used and cost.<br />
<br />
== Data retention ==<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[*topic name, as listed on main page*]]<br />
<br />
<br />
== Additional Resources ==<br />
* list here other articles related to this topic, with a brief description and link<br />
<br />
[[Category: *category name* ]]</div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Data_Storage&diff=8308Data Storage2021-06-03T22:13:06Z<p>Kbjarkefur: /* File sync services */</p>
<hr />
<div>This article discuss different aspects of data storage (such as different types of storage, data back up and data retention). While [[Data_Security]] is a very important topic related to data storage, it is not covered in this article as there is a dedicated link for that.<br />
<br />
<br />
<br />
== Read First ==<br />
* Data should be version controlled. Ideally the version control system used should both be able to detect changes to the data and provide a possibility to revert to old versions<br />
* Data storage depends on how big the data is, if there is confidential information in the data and how the data will be used.<br />
* All original data (the data sets collected or received by the team) should be backed-up. All derivative data (data sets created from original data) should be created by reproducible code and that code should be version controlled in Git. Then derivative data does not need to be backed-up as it can be re-created.<br />
<br />
== Version control for data ==<br />
Version control is used for two purposes. The first is to keep track of changes and modifications to files and the second one is to provide the possibility to revert a file an old version. For code there is an industry standard that efficiently fulfills these two purposes and that system is called Git. Git can be used to both keep track of changes made to code as well as restore older versions of code. Unfortunately there is no system that as elegantly does both of those two things for data over a longer period of time. There is therefore no industry wide one-size-fits-all solutions for version control of data. This article therefore suggests different methods and the project team should pick the method best for their exact use case.<br />
<br />
=== Version control code that generates data ===<br />
<br />
All derivative datasets (datasets that the project team creates from original data they received or collected) should be generated by reproducible code so that it can be re-created by re-running that code. Therefore, if the original data is properly [[Data_Storage#Back-up_protocols | backed-up]] and all code is version controlled using Git, then the derivative datasets are already implicitly version controlled since changes to derivative datasets can only happen through changes to the code tracked in Git. Git can also be used to restore old version of code that in turn can be used to restore old versions of the derivative data.<br />
<br />
While this method often is an excellent option to version control derivative data, it does not work when the original data is updated frequently (ongoing data collection or when data is continuously received) or when the code is not accessible (someone else is generating the data). Derivative data should, in these cases, still be generated using reproducible code tracked in Git, but the effect is no longer that those derivative datasets are implicitly properly version controlled.<br />
<br />
=== Version control using checksums/hashes ===<br />
<br />
One way to keep track of changes done to data is to use ''checksums'', ''hashes'' or ''data signatures''. These three concepts all work slightly differently but all follows the same principle and serves the same purpose in the context of version control of data. The principle they all follow is that a datasets is boiled down to a short string of text or a number. The same data is always boiled down to the same string or number, but even small differences to the data boils down to a different string or number. That string or number can then quickly and effortlessly be compared across datasets to test if they are identical or not.<br />
<br />
Most checksums or hashes boils down to a string or a number that has no interpretable meaning to humans, it is just meant to serve the purpose of answering the yes/no question if the two datasets are identical or not. It is usually impossible to tell from a has or a checksum if the files are almost identical or very different. One exception to this is the Stata command <code>datasignature</code> that generates a signature that combines a hash value with some basic data on the dataset such as number of observations. While this provides some useful human readable information, it can only be used on Stata <code>.dta</code> files.<br />
<br />
The drawback of all these methods is that it says nothing about how the dataset was created. So there is no way to recreate a dataset based only on the checksum or hash signature. If you have the checksum of a dataset used in the past but not the corresponding version of the dataset, then all you can do is to say that dataset was identical to the current version of the same dataset or not. You cannot say what differs if they are different. If you used the Stata command <code>datasignature</code> you can get some clues of what differs, such as number of observations or number of variables, but you will still not be able to recreate the old version of the data.<br />
<br />
However, this method is a good fit for when you want to have a quick way to test that the dataset has not changed, and details of what has changed either does not matter or you have another way to find out (perhaps manually) what those differences are. Examples of this can be if you are accessing someone else's dataset and you want to know if that person makes changes to that dataset. Another way this can be used is to check which if any datasets are updated when the code is updated. Sometimes we want to make changes to the code that should not change the dataset it generates, and then this method is a great way to verify that.<br />
<br />
=== Version control in sync software ===<br />
<br />
[[Data_Storage#File_sync_services | File syncing software]] often have version control systems that allows you to both detect changes made to data files and allows you to restore old versions of those files. However, the way they it is done in these systems is so storage in-efficient that you can only restore files version that are less than a few months old. If your project is completed with that time frame, then this is a great solution, however, typically a project runs for much longer than that.<br />
<br />
== Data storage types ==<br />
There are different types of data storage option and which one that is best for each use case typically depends on how big the data is, if there is confidential information in the data and how the data will be used. Below are some different storage types with their pros and cons. The type of data that is considered in this article are data files, and not, for example, data stored in data bases. <br />
<br />
=== File sync services ===<br />
<br />
Common examples of file sync services are DropBox, OneDrive, Box among several others. This storage type is suitable when the data files required for a project is not bigger than that they can be stored on a regular laptop or desktop computer. In file sync services a copy of the file exists locally on all computers that has the folder synced. This means that accessing the file is quick and that there is never a problem for multiple people to access the file at the same time as they have their own version of the file. <br />
<br />
However, because all files are saved locally this also means that if one person works on many projects there is a risk that there is not space for all files for those projects on that persons computer. Many file syncing services have options to mitigate this. For example you can chose to not sync the parts of the project folder that you know you do not need access to. Some sync services also allow you to have non-synced files ready to be downloaded on demand (sometimes referred to as smart sync). However, this is not great for data work as data files are usually to big to be downloaded on demand when your code trying to access them leading to your programming software to crash when trying to read the file.<br />
<br />
Another concern with file syncing services are privacy concerns when sharing confidential data. There are enterprise subscriptions that can be installed on your own servers. This makes sharing data much more secure as the data is never stored or transferred on the servers that belong to the company that offers the syncing service. If you do not have an enterprise subscription or are not sure if you do, you should always [[encryption| encrypt ]] any non-public data before saving it in a synced folder. DIME Analytics have published [https://github.com/worldbank/dime-standards/blob/master/dime-research-standards/pillar-4-data-security/data-security-resources/veracrypt-guidelines.md guidelines to encrypt files with Veracrypt] which is a secure and free software often used for this purpose.<br />
<br />
=== Cloud storage ===<br />
<br />
Cloud storage comes in many different kinds and it is out of the scope of this wiki article to describe them all. This article will only cover general points about cloud storage. This article only covers data stored in files, so cloud storage here typically means [https://aws.amazon.com/s3/ S3 storage on AWS] or [https://docs.microsoft.com/en-us/azure/storage/blobs/ Blob storage on Azure], but are not limited to those.<br />
<br />
The benefit of cloud storage is that you do not need to buy or upgrade any hardware. To get started or to increase your capacity you only need to click a button or it might even be upgraded automatically. However, you will be charged more the more you use. <br />
<br />
Cloud storage might not be the best option when the data will frequently be accessed from laptops and desktops. Even on a fast internet connection will the download time be too long for most use cases if the data is downloaded each time the script is run. Instead, cloud storage is best used when frequently accessed from another cloud resource. However, make sure that the other cloud storage is in the same physical location as the cloud resource that will use it. The same cloud provider have data centers across the world and you might have issues if your storage is on the other side of the world from the resource that will use it. There are ways to address these issues but they are outside of the scope of this article. The main takeaway is that cloud storage is only a great solution if the resources that needs access to that have a very quick connection to them.<br />
<br />
At the same time, this does not mean that cloud storage should never be accessed from a laptop or a desktop. Cloud storage can be a great place to store data that is infrequently accessed by users on regular computers. There will still be a delay, but that can be acceptable as the data is only infrequently accessed.<br />
<br />
=== Network drive storage ===<br />
<br />
Network drive storage is in many way similar to the types of cloud storage that we discussed above. The major difference for the context discussed here is that they can be located in the same location as a user accessing them from a laptop/desktop making it a viable option for such a user to access it frequently. This is a great solution when a regular laptop/desktop frequently needs access to the data and the combined size of all of the data files are too big to all be stored on the laptop/desktop.<br />
<br />
The problem with this solution is that you need to purchase hardware and run your own network drive. This can be complicated if you do not work at a large organization where there is an IT team that can do this for you. Another drawback is that if you work on an organization that is spread out geographically then you will have the same issues as with cloud storage with slow access speed.<br />
<br />
== Back-up protocols ==<br />
<br />
Use<br />
<br />
== Back-up storage types ==<br />
You should never use the same file or storage solution for your back-up copy of your original or raw data, as the files and storage solution you use for your day-to-day work. While some storage types could be used for both day-to-day and backup, this sections covers the former use case. Backing up of data is covered in a later section below. There are multiple types of storage and what type is best for you is how depends on the following factors: size of your data, where it will be used and cost.<br />
<br />
== Data retention ==<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[*topic name, as listed on main page*]]<br />
<br />
<br />
== Additional Resources ==<br />
* list here other articles related to this topic, with a brief description and link<br />
<br />
[[Category: *category name* ]]</div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Data_Storage&diff=8307Data Storage2021-06-03T22:08:40Z<p>Kbjarkefur: /* Version control in sync software */</p>
<hr />
<div>This article discuss different aspects of data storage (such as different types of storage, data back up and data retention). While [[Data_Security]] is a very important topic related to data storage, it is not covered in this article as there is a dedicated link for that.<br />
<br />
<br />
<br />
== Read First ==<br />
* Data should be version controlled. Ideally the version control system used should both be able to detect changes to the data and provide a possibility to revert to old versions<br />
* Data storage depends on how big the data is, if there is confidential information in the data and how the data will be used.<br />
* All original data (the data sets collected or received by the team) should be backed-up. All derivative data (data sets created from original data) should be created by reproducible code and that code should be version controlled in Git. Then derivative data does not need to be backed-up as it can be re-created.<br />
<br />
== Version control for data ==<br />
Version control is used for two purposes. The first is to keep track of changes and modifications to files and the second one is to provide the possibility to revert a file an old version. For code there is an industry standard that efficiently fulfills these two purposes and that system is called Git. Git can be used to both keep track of changes made to code as well as restore older versions of code. Unfortunately there is no system that as elegantly does both of those two things for data over a longer period of time. There is therefore no industry wide one-size-fits-all solutions for version control of data. This article therefore suggests different methods and the project team should pick the method best for their exact use case.<br />
<br />
=== Version control code that generates data ===<br />
<br />
All derivative datasets (datasets that the project team creates from original data they received or collected) should be generated by reproducible code so that it can be re-created by re-running that code. Therefore, if the original data is properly [[Data_Storage#Back-up_protocols | backed-up]] and all code is version controlled using Git, then the derivative datasets are already implicitly version controlled since changes to derivative datasets can only happen through changes to the code tracked in Git. Git can also be used to restore old version of code that in turn can be used to restore old versions of the derivative data.<br />
<br />
While this method often is an excellent option to version control derivative data, it does not work when the original data is updated frequently (ongoing data collection or when data is continuously received) or when the code is not accessible (someone else is generating the data). Derivative data should, in these cases, still be generated using reproducible code tracked in Git, but the effect is no longer that those derivative datasets are implicitly properly version controlled.<br />
<br />
=== Version control using checksums/hashes ===<br />
<br />
One way to keep track of changes done to data is to use ''checksums'', ''hashes'' or ''data signatures''. These three concepts all work slightly differently but all follows the same principle and serves the same purpose in the context of version control of data. The principle they all follow is that a datasets is boiled down to a short string of text or a number. The same data is always boiled down to the same string or number, but even small differences to the data boils down to a different string or number. That string or number can then quickly and effortlessly be compared across datasets to test if they are identical or not.<br />
<br />
Most checksums or hashes boils down to a string or a number that has no interpretable meaning to humans, it is just meant to serve the purpose of answering the yes/no question if the two datasets are identical or not. It is usually impossible to tell from a has or a checksum if the files are almost identical or very different. One exception to this is the Stata command <code>datasignature</code> that generates a signature that combines a hash value with some basic data on the dataset such as number of observations. While this provides some useful human readable information, it can only be used on Stata <code>.dta</code> files.<br />
<br />
The drawback of all these methods is that it says nothing about how the dataset was created. So there is no way to recreate a dataset based only on the checksum or hash signature. If you have the checksum of a dataset used in the past but not the corresponding version of the dataset, then all you can do is to say that dataset was identical to the current version of the same dataset or not. You cannot say what differs if they are different. If you used the Stata command <code>datasignature</code> you can get some clues of what differs, such as number of observations or number of variables, but you will still not be able to recreate the old version of the data.<br />
<br />
However, this method is a good fit for when you want to have a quick way to test that the dataset has not changed, and details of what has changed either does not matter or you have another way to find out (perhaps manually) what those differences are. Examples of this can be if you are accessing someone else's dataset and you want to know if that person makes changes to that dataset. Another way this can be used is to check which if any datasets are updated when the code is updated. Sometimes we want to make changes to the code that should not change the dataset it generates, and then this method is a great way to verify that.<br />
<br />
=== Version control in sync software ===<br />
<br />
[[Data_Storage#File_sync_services | File syncing software]] often have version control systems that allows you to both detect changes made to data files and allows you to restore old versions of those files. However, the way they it is done in these systems is so storage in-efficient that you can only restore files version that are less than a few months old. If your project is completed with that time frame, then this is a great solution, however, typically a project runs for much longer than that.<br />
<br />
== Data storage types ==<br />
There are different types of data storage option and which one that is best for each use case typically depends on how big the data is, if there is confidential information in the data and how the data will be used. Below are some different storage types with their pros and cons. The type of data that is considered in this article are data files, and not, for example, data stored in data bases. <br />
<br />
=== File sync services ===<br />
<br />
Common examples of file sync services are DropBox, OneDrive, Box among several others. This storage type is suitable when the data files required for a project is not bigger than that they can be stored on a regular laptop or desktop computer. In file sync services a copy of the file exists locally on all computers that has the folder synced. This means that accessing the file is quick and that there is never a problem for multiple people to access the file at the same time as they have their own version of the file. <br />
<br />
However, because all files are saved locally this also means that if one person works on many projects there is a risk that there is not space for all files for those projects on that persons computer. Many file syncing services have options to mitigate this. For example you can chose to not sync the parts of the project folder that you know you do not need access to. Some sync services also allow you to have non-synced files ready to be downloaded on demand (sometimes referred to as smart sync). However, this is not great for data work as data files are usually to big to be downloaded on demand when your code trying to access them leading to your programming software to crash when trying to read the file.<br />
<br />
Another concern with file syncing services are privacy concerns when sharing confidential data. There are enterprise subscriptions that can be installed on your own servers. This makes sharing data much more secure as the data is never stored or transferred on the servers that belong to the company that offers the syncing service. If you do not have an enterprise subscription or are not sure if you do, you should always [[encryption| encrypt ]] the data before saving it in a synced folder. DIME Analytics have published [https://github.com/worldbank/dime-standards/blob/master/dime-research-standards/pillar-4-data-security/data-security-resources/veracrypt-guidelines.md guidelines to encrypt files with Veracrypt] which is a secure and free software often used for this purpose.<br />
<br />
=== Cloud storage ===<br />
<br />
Cloud storage comes in many different kinds and it is out of the scope of this wiki article to describe them all. This article will only cover general points about cloud storage. This article only covers data stored in files, so cloud storage here typically means [https://aws.amazon.com/s3/ S3 storage on AWS] or [https://docs.microsoft.com/en-us/azure/storage/blobs/ Blob storage on Azure], but are not limited to those.<br />
<br />
The benefit of cloud storage is that you do not need to buy or upgrade any hardware. To get started or to increase your capacity you only need to click a button or it might even be upgraded automatically. However, you will be charged more the more you use. <br />
<br />
Cloud storage might not be the best option when the data will frequently be accessed from laptops and desktops. Even on a fast internet connection will the download time be too long for most use cases if the data is downloaded each time the script is run. Instead, cloud storage is best used when frequently accessed from another cloud resource. However, make sure that the other cloud storage is in the same physical location as the cloud resource that will use it. The same cloud provider have data centers across the world and you might have issues if your storage is on the other side of the world from the resource that will use it. There are ways to address these issues but they are outside of the scope of this article. The main takeaway is that cloud storage is only a great solution if the resources that needs access to that have a very quick connection to them.<br />
<br />
At the same time, this does not mean that cloud storage should never be accessed from a laptop or a desktop. Cloud storage can be a great place to store data that is infrequently accessed by users on regular computers. There will still be a delay, but that can be acceptable as the data is only infrequently accessed.<br />
<br />
=== Network drive storage ===<br />
<br />
Network drive storage is in many way similar to the types of cloud storage that we discussed above. The major difference for the context discussed here is that they can be located in the same location as a user accessing them from a laptop/desktop making it a viable option for such a user to access it frequently. This is a great solution when a regular laptop/desktop frequently needs access to the data and the combined size of all of the data files are too big to all be stored on the laptop/desktop.<br />
<br />
The problem with this solution is that you need to purchase hardware and run your own network drive. This can be complicated if you do not work at a large organization where there is an IT team that can do this for you. Another drawback is that if you work on an organization that is spread out geographically then you will have the same issues as with cloud storage with slow access speed.<br />
<br />
== Back-up protocols ==<br />
<br />
Use<br />
<br />
== Back-up storage types ==<br />
You should never use the same file or storage solution for your back-up copy of your original or raw data, as the files and storage solution you use for your day-to-day work. While some storage types could be used for both day-to-day and backup, this sections covers the former use case. Backing up of data is covered in a later section below. There are multiple types of storage and what type is best for you is how depends on the following factors: size of your data, where it will be used and cost.<br />
<br />
== Data retention ==<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[*topic name, as listed on main page*]]<br />
<br />
<br />
== Additional Resources ==<br />
* list here other articles related to this topic, with a brief description and link<br />
<br />
[[Category: *category name* ]]</div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Data_Storage&diff=8306Data Storage2021-06-03T22:08:24Z<p>Kbjarkefur: /* Version control in sync software */</p>
<hr />
<div>This article discuss different aspects of data storage (such as different types of storage, data back up and data retention). While [[Data_Security]] is a very important topic related to data storage, it is not covered in this article as there is a dedicated link for that.<br />
<br />
<br />
<br />
== Read First ==<br />
* Data should be version controlled. Ideally the version control system used should both be able to detect changes to the data and provide a possibility to revert to old versions<br />
* Data storage depends on how big the data is, if there is confidential information in the data and how the data will be used.<br />
* All original data (the data sets collected or received by the team) should be backed-up. All derivative data (data sets created from original data) should be created by reproducible code and that code should be version controlled in Git. Then derivative data does not need to be backed-up as it can be re-created.<br />
<br />
== Version control for data ==<br />
Version control is used for two purposes. The first is to keep track of changes and modifications to files and the second one is to provide the possibility to revert a file an old version. For code there is an industry standard that efficiently fulfills these two purposes and that system is called Git. Git can be used to both keep track of changes made to code as well as restore older versions of code. Unfortunately there is no system that as elegantly does both of those two things for data over a longer period of time. There is therefore no industry wide one-size-fits-all solutions for version control of data. This article therefore suggests different methods and the project team should pick the method best for their exact use case.<br />
<br />
=== Version control code that generates data ===<br />
<br />
All derivative datasets (datasets that the project team creates from original data they received or collected) should be generated by reproducible code so that it can be re-created by re-running that code. Therefore, if the original data is properly [[Data_Storage#Back-up_protocols | backed-up]] and all code is version controlled using Git, then the derivative datasets are already implicitly version controlled since changes to derivative datasets can only happen through changes to the code tracked in Git. Git can also be used to restore old version of code that in turn can be used to restore old versions of the derivative data.<br />
<br />
While this method often is an excellent option to version control derivative data, it does not work when the original data is updated frequently (ongoing data collection or when data is continuously received) or when the code is not accessible (someone else is generating the data). Derivative data should, in these cases, still be generated using reproducible code tracked in Git, but the effect is no longer that those derivative datasets are implicitly properly version controlled.<br />
<br />
=== Version control using checksums/hashes ===<br />
<br />
One way to keep track of changes done to data is to use ''checksums'', ''hashes'' or ''data signatures''. These three concepts all work slightly differently but all follows the same principle and serves the same purpose in the context of version control of data. The principle they all follow is that a datasets is boiled down to a short string of text or a number. The same data is always boiled down to the same string or number, but even small differences to the data boils down to a different string or number. That string or number can then quickly and effortlessly be compared across datasets to test if they are identical or not.<br />
<br />
Most checksums or hashes boils down to a string or a number that has no interpretable meaning to humans, it is just meant to serve the purpose of answering the yes/no question if the two datasets are identical or not. It is usually impossible to tell from a has or a checksum if the files are almost identical or very different. One exception to this is the Stata command <code>datasignature</code> that generates a signature that combines a hash value with some basic data on the dataset such as number of observations. While this provides some useful human readable information, it can only be used on Stata <code>.dta</code> files.<br />
<br />
The drawback of all these methods is that it says nothing about how the dataset was created. So there is no way to recreate a dataset based only on the checksum or hash signature. If you have the checksum of a dataset used in the past but not the corresponding version of the dataset, then all you can do is to say that dataset was identical to the current version of the same dataset or not. You cannot say what differs if they are different. If you used the Stata command <code>datasignature</code> you can get some clues of what differs, such as number of observations or number of variables, but you will still not be able to recreate the old version of the data.<br />
<br />
However, this method is a good fit for when you want to have a quick way to test that the dataset has not changed, and details of what has changed either does not matter or you have another way to find out (perhaps manually) what those differences are. Examples of this can be if you are accessing someone else's dataset and you want to know if that person makes changes to that dataset. Another way this can be used is to check which if any datasets are updated when the code is updated. Sometimes we want to make changes to the code that should not change the dataset it generates, and then this method is a great way to verify that.<br />
<br />
=== Version control in sync software ===<br />
<br />
[[Data_Storage#File_sync_services | File syncing software}} often have version control systems that allows you to both detect changes made to data files and allows you to restore old versions of those files. However, the way they it is done in these systems is so storage in-efficient that you can only restore files version that are less than a few months old. If your project is completed with that time frame, then this is a great solution, however, typically a project runs for much longer than that.<br />
<br />
== Data storage types ==<br />
There are different types of data storage option and which one that is best for each use case typically depends on how big the data is, if there is confidential information in the data and how the data will be used. Below are some different storage types with their pros and cons. The type of data that is considered in this article are data files, and not, for example, data stored in data bases. <br />
<br />
=== File sync services ===<br />
<br />
Common examples of file sync services are DropBox, OneDrive, Box among several others. This storage type is suitable when the data files required for a project is not bigger than that they can be stored on a regular laptop or desktop computer. In file sync services a copy of the file exists locally on all computers that has the folder synced. This means that accessing the file is quick and that there is never a problem for multiple people to access the file at the same time as they have their own version of the file. <br />
<br />
However, because all files are saved locally this also means that if one person works on many projects there is a risk that there is not space for all files for those projects on that persons computer. Many file syncing services have options to mitigate this. For example you can chose to not sync the parts of the project folder that you know you do not need access to. Some sync services also allow you to have non-synced files ready to be downloaded on demand (sometimes referred to as smart sync). However, this is not great for data work as data files are usually to big to be downloaded on demand when your code trying to access them leading to your programming software to crash when trying to read the file.<br />
<br />
Another concern with file syncing services are privacy concerns when sharing confidential data. There are enterprise subscriptions that can be installed on your own servers. This makes sharing data much more secure as the data is never stored or transferred on the servers that belong to the company that offers the syncing service. If you do not have an enterprise subscription or are not sure if you do, you should always [[encryption| encrypt ]] the data before saving it in a synced folder. DIME Analytics have published [https://github.com/worldbank/dime-standards/blob/master/dime-research-standards/pillar-4-data-security/data-security-resources/veracrypt-guidelines.md guidelines to encrypt files with Veracrypt] which is a secure and free software often used for this purpose.<br />
<br />
=== Cloud storage ===<br />
<br />
Cloud storage comes in many different kinds and it is out of the scope of this wiki article to describe them all. This article will only cover general points about cloud storage. This article only covers data stored in files, so cloud storage here typically means [https://aws.amazon.com/s3/ S3 storage on AWS] or [https://docs.microsoft.com/en-us/azure/storage/blobs/ Blob storage on Azure], but are not limited to those.<br />
<br />
The benefit of cloud storage is that you do not need to buy or upgrade any hardware. To get started or to increase your capacity you only need to click a button or it might even be upgraded automatically. However, you will be charged more the more you use. <br />
<br />
Cloud storage might not be the best option when the data will frequently be accessed from laptops and desktops. Even on a fast internet connection will the download time be too long for most use cases if the data is downloaded each time the script is run. Instead, cloud storage is best used when frequently accessed from another cloud resource. However, make sure that the other cloud storage is in the same physical location as the cloud resource that will use it. The same cloud provider have data centers across the world and you might have issues if your storage is on the other side of the world from the resource that will use it. There are ways to address these issues but they are outside of the scope of this article. The main takeaway is that cloud storage is only a great solution if the resources that needs access to that have a very quick connection to them.<br />
<br />
At the same time, this does not mean that cloud storage should never be accessed from a laptop or a desktop. Cloud storage can be a great place to store data that is infrequently accessed by users on regular computers. There will still be a delay, but that can be acceptable as the data is only infrequently accessed.<br />
<br />
=== Network drive storage ===<br />
<br />
Network drive storage is in many way similar to the types of cloud storage that we discussed above. The major difference for the context discussed here is that they can be located in the same location as a user accessing them from a laptop/desktop making it a viable option for such a user to access it frequently. This is a great solution when a regular laptop/desktop frequently needs access to the data and the combined size of all of the data files are too big to all be stored on the laptop/desktop.<br />
<br />
The problem with this solution is that you need to purchase hardware and run your own network drive. This can be complicated if you do not work at a large organization where there is an IT team that can do this for you. Another drawback is that if you work on an organization that is spread out geographically then you will have the same issues as with cloud storage with slow access speed.<br />
<br />
== Back-up protocols ==<br />
<br />
Use<br />
<br />
== Back-up storage types ==<br />
You should never use the same file or storage solution for your back-up copy of your original or raw data, as the files and storage solution you use for your day-to-day work. While some storage types could be used for both day-to-day and backup, this sections covers the former use case. Backing up of data is covered in a later section below. There are multiple types of storage and what type is best for you is how depends on the following factors: size of your data, where it will be used and cost.<br />
<br />
== Data retention ==<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[*topic name, as listed on main page*]]<br />
<br />
<br />
== Additional Resources ==<br />
* list here other articles related to this topic, with a brief description and link<br />
<br />
[[Category: *category name* ]]</div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Data_Storage&diff=8305Data Storage2021-06-03T22:07:36Z<p>Kbjarkefur: /* File sync services */</p>
<hr />
<div>This article discuss different aspects of data storage (such as different types of storage, data back up and data retention). While [[Data_Security]] is a very important topic related to data storage, it is not covered in this article as there is a dedicated link for that.<br />
<br />
<br />
<br />
== Read First ==<br />
* Data should be version controlled. Ideally the version control system used should both be able to detect changes to the data and provide a possibility to revert to old versions<br />
* Data storage depends on how big the data is, if there is confidential information in the data and how the data will be used.<br />
* All original data (the data sets collected or received by the team) should be backed-up. All derivative data (data sets created from original data) should be created by reproducible code and that code should be version controlled in Git. Then derivative data does not need to be backed-up as it can be re-created.<br />
<br />
== Version control for data ==<br />
Version control is used for two purposes. The first is to keep track of changes and modifications to files and the second one is to provide the possibility to revert a file an old version. For code there is an industry standard that efficiently fulfills these two purposes and that system is called Git. Git can be used to both keep track of changes made to code as well as restore older versions of code. Unfortunately there is no system that as elegantly does both of those two things for data over a longer period of time. There is therefore no industry wide one-size-fits-all solutions for version control of data. This article therefore suggests different methods and the project team should pick the method best for their exact use case.<br />
<br />
=== Version control code that generates data ===<br />
<br />
All derivative datasets (datasets that the project team creates from original data they received or collected) should be generated by reproducible code so that it can be re-created by re-running that code. Therefore, if the original data is properly [[Data_Storage#Back-up_protocols | backed-up]] and all code is version controlled using Git, then the derivative datasets are already implicitly version controlled since changes to derivative datasets can only happen through changes to the code tracked in Git. Git can also be used to restore old version of code that in turn can be used to restore old versions of the derivative data.<br />
<br />
While this method often is an excellent option to version control derivative data, it does not work when the original data is updated frequently (ongoing data collection or when data is continuously received) or when the code is not accessible (someone else is generating the data). Derivative data should, in these cases, still be generated using reproducible code tracked in Git, but the effect is no longer that those derivative datasets are implicitly properly version controlled.<br />
<br />
=== Version control using checksums/hashes ===<br />
<br />
One way to keep track of changes done to data is to use ''checksums'', ''hashes'' or ''data signatures''. These three concepts all work slightly differently but all follows the same principle and serves the same purpose in the context of version control of data. The principle they all follow is that a datasets is boiled down to a short string of text or a number. The same data is always boiled down to the same string or number, but even small differences to the data boils down to a different string or number. That string or number can then quickly and effortlessly be compared across datasets to test if they are identical or not.<br />
<br />
Most checksums or hashes boils down to a string or a number that has no interpretable meaning to humans, it is just meant to serve the purpose of answering the yes/no question if the two datasets are identical or not. It is usually impossible to tell from a has or a checksum if the files are almost identical or very different. One exception to this is the Stata command <code>datasignature</code> that generates a signature that combines a hash value with some basic data on the dataset such as number of observations. While this provides some useful human readable information, it can only be used on Stata <code>.dta</code> files.<br />
<br />
The drawback of all these methods is that it says nothing about how the dataset was created. So there is no way to recreate a dataset based only on the checksum or hash signature. If you have the checksum of a dataset used in the past but not the corresponding version of the dataset, then all you can do is to say that dataset was identical to the current version of the same dataset or not. You cannot say what differs if they are different. If you used the Stata command <code>datasignature</code> you can get some clues of what differs, such as number of observations or number of variables, but you will still not be able to recreate the old version of the data.<br />
<br />
However, this method is a good fit for when you want to have a quick way to test that the dataset has not changed, and details of what has changed either does not matter or you have another way to find out (perhaps manually) what those differences are. Examples of this can be if you are accessing someone else's dataset and you want to know if that person makes changes to that dataset. Another way this can be used is to check which if any datasets are updated when the code is updated. Sometimes we want to make changes to the code that should not change the dataset it generates, and then this method is a great way to verify that.<br />
<br />
=== Version control in sync software ===<br />
<br />
File syncing software (read more about them below) often have version control systems that allows you to both detect changes made to data files and allows you to restore old versions of those files. However, the way they it is done in these systems is so storage in-efficient that you can only restore files version that are less than a few months old. If your project is completed with that time frame, then this is a great solution, however, typically a project runs for much longer than that.<br />
<br />
== Data storage types ==<br />
There are different types of data storage option and which one that is best for each use case typically depends on how big the data is, if there is confidential information in the data and how the data will be used. Below are some different storage types with their pros and cons. The type of data that is considered in this article are data files, and not, for example, data stored in data bases. <br />
<br />
=== File sync services ===<br />
<br />
Common examples of file sync services are DropBox, OneDrive, Box among several others. This storage type is suitable when the data files required for a project is not bigger than that they can be stored on a regular laptop or desktop computer. In file sync services a copy of the file exists locally on all computers that has the folder synced. This means that accessing the file is quick and that there is never a problem for multiple people to access the file at the same time as they have their own version of the file. <br />
<br />
However, because all files are saved locally this also means that if one person works on many projects there is a risk that there is not space for all files for those projects on that persons computer. Many file syncing services have options to mitigate this. For example you can chose to not sync the parts of the project folder that you know you do not need access to. Some sync services also allow you to have non-synced files ready to be downloaded on demand (sometimes referred to as smart sync). However, this is not great for data work as data files are usually to big to be downloaded on demand when your code trying to access them leading to your programming software to crash when trying to read the file.<br />
<br />
Another concern with file syncing services are privacy concerns when sharing confidential data. There are enterprise subscriptions that can be installed on your own servers. This makes sharing data much more secure as the data is never stored or transferred on the servers that belong to the company that offers the syncing service. If you do not have an enterprise subscription or are not sure if you do, you should always [[encryption| encrypt ]] the data before saving it in a synced folder. DIME Analytics have published [https://github.com/worldbank/dime-standards/blob/master/dime-research-standards/pillar-4-data-security/data-security-resources/veracrypt-guidelines.md guidelines to encrypt files with Veracrypt] which is a secure and free software often used for this purpose.<br />
<br />
=== Cloud storage ===<br />
<br />
Cloud storage comes in many different kinds and it is out of the scope of this wiki article to describe them all. This article will only cover general points about cloud storage. This article only covers data stored in files, so cloud storage here typically means [https://aws.amazon.com/s3/ S3 storage on AWS] or [https://docs.microsoft.com/en-us/azure/storage/blobs/ Blob storage on Azure], but are not limited to those.<br />
<br />
The benefit of cloud storage is that you do not need to buy or upgrade any hardware. To get started or to increase your capacity you only need to click a button or it might even be upgraded automatically. However, you will be charged more the more you use. <br />
<br />
Cloud storage might not be the best option when the data will frequently be accessed from laptops and desktops. Even on a fast internet connection will the download time be too long for most use cases if the data is downloaded each time the script is run. Instead, cloud storage is best used when frequently accessed from another cloud resource. However, make sure that the other cloud storage is in the same physical location as the cloud resource that will use it. The same cloud provider have data centers across the world and you might have issues if your storage is on the other side of the world from the resource that will use it. There are ways to address these issues but they are outside of the scope of this article. The main takeaway is that cloud storage is only a great solution if the resources that needs access to that have a very quick connection to them.<br />
<br />
At the same time, this does not mean that cloud storage should never be accessed from a laptop or a desktop. Cloud storage can be a great place to store data that is infrequently accessed by users on regular computers. There will still be a delay, but that can be acceptable as the data is only infrequently accessed.<br />
<br />
=== Network drive storage ===<br />
<br />
Network drive storage is in many way similar to the types of cloud storage that we discussed above. The major difference for the context discussed here is that they can be located in the same location as a user accessing them from a laptop/desktop making it a viable option for such a user to access it frequently. This is a great solution when a regular laptop/desktop frequently needs access to the data and the combined size of all of the data files are too big to all be stored on the laptop/desktop.<br />
<br />
The problem with this solution is that you need to purchase hardware and run your own network drive. This can be complicated if you do not work at a large organization where there is an IT team that can do this for you. Another drawback is that if you work on an organization that is spread out geographically then you will have the same issues as with cloud storage with slow access speed.<br />
<br />
== Back-up protocols ==<br />
<br />
Use<br />
<br />
== Back-up storage types ==<br />
You should never use the same file or storage solution for your back-up copy of your original or raw data, as the files and storage solution you use for your day-to-day work. While some storage types could be used for both day-to-day and backup, this sections covers the former use case. Backing up of data is covered in a later section below. There are multiple types of storage and what type is best for you is how depends on the following factors: size of your data, where it will be used and cost.<br />
<br />
== Data retention ==<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[*topic name, as listed on main page*]]<br />
<br />
<br />
== Additional Resources ==<br />
* list here other articles related to this topic, with a brief description and link<br />
<br />
[[Category: *category name* ]]</div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Data_Storage&diff=8304Data Storage2021-06-03T22:07:11Z<p>Kbjarkefur: /* Version control using checksums/hashes */</p>
<hr />
<div>This article discuss different aspects of data storage (such as different types of storage, data back up and data retention). While [[Data_Security]] is a very important topic related to data storage, it is not covered in this article as there is a dedicated link for that.<br />
<br />
<br />
<br />
== Read First ==<br />
* Data should be version controlled. Ideally the version control system used should both be able to detect changes to the data and provide a possibility to revert to old versions<br />
* Data storage depends on how big the data is, if there is confidential information in the data and how the data will be used.<br />
* All original data (the data sets collected or received by the team) should be backed-up. All derivative data (data sets created from original data) should be created by reproducible code and that code should be version controlled in Git. Then derivative data does not need to be backed-up as it can be re-created.<br />
<br />
== Version control for data ==<br />
Version control is used for two purposes. The first is to keep track of changes and modifications to files and the second one is to provide the possibility to revert a file an old version. For code there is an industry standard that efficiently fulfills these two purposes and that system is called Git. Git can be used to both keep track of changes made to code as well as restore older versions of code. Unfortunately there is no system that as elegantly does both of those two things for data over a longer period of time. There is therefore no industry wide one-size-fits-all solutions for version control of data. This article therefore suggests different methods and the project team should pick the method best for their exact use case.<br />
<br />
=== Version control code that generates data ===<br />
<br />
All derivative datasets (datasets that the project team creates from original data they received or collected) should be generated by reproducible code so that it can be re-created by re-running that code. Therefore, if the original data is properly [[Data_Storage#Back-up_protocols | backed-up]] and all code is version controlled using Git, then the derivative datasets are already implicitly version controlled since changes to derivative datasets can only happen through changes to the code tracked in Git. Git can also be used to restore old version of code that in turn can be used to restore old versions of the derivative data.<br />
<br />
While this method often is an excellent option to version control derivative data, it does not work when the original data is updated frequently (ongoing data collection or when data is continuously received) or when the code is not accessible (someone else is generating the data). Derivative data should, in these cases, still be generated using reproducible code tracked in Git, but the effect is no longer that those derivative datasets are implicitly properly version controlled.<br />
<br />
=== Version control using checksums/hashes ===<br />
<br />
One way to keep track of changes done to data is to use ''checksums'', ''hashes'' or ''data signatures''. These three concepts all work slightly differently but all follows the same principle and serves the same purpose in the context of version control of data. The principle they all follow is that a datasets is boiled down to a short string of text or a number. The same data is always boiled down to the same string or number, but even small differences to the data boils down to a different string or number. That string or number can then quickly and effortlessly be compared across datasets to test if they are identical or not.<br />
<br />
Most checksums or hashes boils down to a string or a number that has no interpretable meaning to humans, it is just meant to serve the purpose of answering the yes/no question if the two datasets are identical or not. It is usually impossible to tell from a has or a checksum if the files are almost identical or very different. One exception to this is the Stata command <code>datasignature</code> that generates a signature that combines a hash value with some basic data on the dataset such as number of observations. While this provides some useful human readable information, it can only be used on Stata <code>.dta</code> files.<br />
<br />
The drawback of all these methods is that it says nothing about how the dataset was created. So there is no way to recreate a dataset based only on the checksum or hash signature. If you have the checksum of a dataset used in the past but not the corresponding version of the dataset, then all you can do is to say that dataset was identical to the current version of the same dataset or not. You cannot say what differs if they are different. If you used the Stata command <code>datasignature</code> you can get some clues of what differs, such as number of observations or number of variables, but you will still not be able to recreate the old version of the data.<br />
<br />
However, this method is a good fit for when you want to have a quick way to test that the dataset has not changed, and details of what has changed either does not matter or you have another way to find out (perhaps manually) what those differences are. Examples of this can be if you are accessing someone else's dataset and you want to know if that person makes changes to that dataset. Another way this can be used is to check which if any datasets are updated when the code is updated. Sometimes we want to make changes to the code that should not change the dataset it generates, and then this method is a great way to verify that.<br />
<br />
=== Version control in sync software ===<br />
<br />
File syncing software (read more about them below) often have version control systems that allows you to both detect changes made to data files and allows you to restore old versions of those files. However, the way they it is done in these systems is so storage in-efficient that you can only restore files version that are less than a few months old. If your project is completed with that time frame, then this is a great solution, however, typically a project runs for much longer than that.<br />
<br />
== Data storage types ==<br />
There are different types of data storage option and which one that is best for each use case typically depends on how big the data is, if there is confidential information in the data and how the data will be used. Below are some different storage types with their pros and cons. The type of data that is considered in this article are data files, and not, for example, data stored in data bases. <br />
<br />
=== File sync services ===<br />
<br />
Common examples of file sync services are DropBox, OneDrive, Box among several others. This storage type is suitable when the data files required for a project is not bigger than that they can be stored on a regular laptop or desktop computer. In file sync services a copy of the file exists locally on all computers that has the folder synced. This means that accessing the file is quick and that there is never a problem for multiple people to access the file at the same time as they have their own version of the file. <br />
<br />
However, because all files are saved locally this also means that if one person works on many projects there is a risk that there is not space for all files for those projects on that persons computer. Many file syncing services have options to mitigate this. For example you can chose to not sync the parts of the project folder that you know you do not need access to. Some sync services also allow you to have non-synced files ready to be downloaded on demand (sometimes referred to as smart sync). However, this is not great for data work as data files are usually to big to be downloaded on demand when your code trying to access them leading to your programming software to crash when trying to read the file.<br />
<br />
Another concern with file syncing services are privacy concerns when sharing confidential data. There are enterprise subscriptions that can be installed on your own servers. This makes sharing data much more secure as the data is never stored or transferred on the servers that belong to the company that offers the syncing service. If you do not have an enterprise subscription or are not sure if you do, you should always [[encryption|encrypt]] the data before saving it in a synced folder. DIME Analytics have published [https://github.com/worldbank/dime-standards/blob/master/dime-research-standards/pillar-4-data-security/data-security-resources/veracrypt-guidelines.md guidelines to encrypt files with Veracrypt] which is a secure and free software often used for this purpose.<br />
<br />
=== Cloud storage ===<br />
<br />
Cloud storage comes in many different kinds and it is out of the scope of this wiki article to describe them all. This article will only cover general points about cloud storage. This article only covers data stored in files, so cloud storage here typically means [https://aws.amazon.com/s3/ S3 storage on AWS] or [https://docs.microsoft.com/en-us/azure/storage/blobs/ Blob storage on Azure], but are not limited to those.<br />
<br />
The benefit of cloud storage is that you do not need to buy or upgrade any hardware. To get started or to increase your capacity you only need to click a button or it might even be upgraded automatically. However, you will be charged more the more you use. <br />
<br />
Cloud storage might not be the best option when the data will frequently be accessed from laptops and desktops. Even on a fast internet connection will the download time be too long for most use cases if the data is downloaded each time the script is run. Instead, cloud storage is best used when frequently accessed from another cloud resource. However, make sure that the other cloud storage is in the same physical location as the cloud resource that will use it. The same cloud provider have data centers across the world and you might have issues if your storage is on the other side of the world from the resource that will use it. There are ways to address these issues but they are outside of the scope of this article. The main takeaway is that cloud storage is only a great solution if the resources that needs access to that have a very quick connection to them.<br />
<br />
At the same time, this does not mean that cloud storage should never be accessed from a laptop or a desktop. Cloud storage can be a great place to store data that is infrequently accessed by users on regular computers. There will still be a delay, but that can be acceptable as the data is only infrequently accessed.<br />
<br />
=== Network drive storage ===<br />
<br />
Network drive storage is in many way similar to the types of cloud storage that we discussed above. The major difference for the context discussed here is that they can be located in the same location as a user accessing them from a laptop/desktop making it a viable option for such a user to access it frequently. This is a great solution when a regular laptop/desktop frequently needs access to the data and the combined size of all of the data files are too big to all be stored on the laptop/desktop.<br />
<br />
The problem with this solution is that you need to purchase hardware and run your own network drive. This can be complicated if you do not work at a large organization where there is an IT team that can do this for you. Another drawback is that if you work on an organization that is spread out geographically then you will have the same issues as with cloud storage with slow access speed.<br />
<br />
== Back-up protocols ==<br />
<br />
Use<br />
<br />
== Back-up storage types ==<br />
You should never use the same file or storage solution for your back-up copy of your original or raw data, as the files and storage solution you use for your day-to-day work. While some storage types could be used for both day-to-day and backup, this sections covers the former use case. Backing up of data is covered in a later section below. There are multiple types of storage and what type is best for you is how depends on the following factors: size of your data, where it will be used and cost.<br />
<br />
== Data retention ==<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[*topic name, as listed on main page*]]<br />
<br />
<br />
== Additional Resources ==<br />
* list here other articles related to this topic, with a brief description and link<br />
<br />
[[Category: *category name* ]]</div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Data_Storage&diff=8303Data Storage2021-06-03T21:49:36Z<p>Kbjarkefur: /* Version control code that generates data */</p>
<hr />
<div>This article discuss different aspects of data storage (such as different types of storage, data back up and data retention). While [[Data_Security]] is a very important topic related to data storage, it is not covered in this article as there is a dedicated link for that.<br />
<br />
<br />
<br />
== Read First ==<br />
* Data should be version controlled. Ideally the version control system used should both be able to detect changes to the data and provide a possibility to revert to old versions<br />
* Data storage depends on how big the data is, if there is confidential information in the data and how the data will be used.<br />
* All original data (the data sets collected or received by the team) should be backed-up. All derivative data (data sets created from original data) should be created by reproducible code and that code should be version controlled in Git. Then derivative data does not need to be backed-up as it can be re-created.<br />
<br />
== Version control for data ==<br />
Version control is used for two purposes. The first is to keep track of changes and modifications to files and the second one is to provide the possibility to revert a file an old version. For code there is an industry standard that efficiently fulfills these two purposes and that system is called Git. Git can be used to both keep track of changes made to code as well as restore older versions of code. Unfortunately there is no system that as elegantly does both of those two things for data over a longer period of time. There is therefore no industry wide one-size-fits-all solutions for version control of data. This article therefore suggests different methods and the project team should pick the method best for their exact use case.<br />
<br />
=== Version control code that generates data ===<br />
<br />
All derivative datasets (datasets that the project team creates from original data they received or collected) should be generated by reproducible code so that it can be re-created by re-running that code. Therefore, if the original data is properly [[Data_Storage#Back-up_protocols | backed-up]] and all code is version controlled using Git, then the derivative datasets are already implicitly version controlled since changes to derivative datasets can only happen through changes to the code tracked in Git. Git can also be used to restore old version of code that in turn can be used to restore old versions of the derivative data.<br />
<br />
While this method often is an excellent option to version control derivative data, it does not work when the original data is updated frequently (ongoing data collection or when data is continuously received) or when the code is not accessible (someone else is generating the data). Derivative data should, in these cases, still be generated using reproducible code tracked in Git, but the effect is no longer that those derivative datasets are implicitly properly version controlled.<br />
<br />
=== Version control using checksums/hashes ===<br />
<br />
One way to keep track of changes done to data is to use ''checksums'', ''hashes'' or ''data signatures''. These three concepts all work slightly differently but all follows the same principle and serves the same purpose in the context of version control of data. The principle they all follow is that a datasets is boiled down to a short string of text or a number. Then that string or number can be compared across datasets to test if they are identical or not.<br />
<br />
One very simplified way to explain how this works would be the following example. In this hashing algorithm we start with a word instead of a dataset, but real world hashing algorithms can handle both. Start with a word and then take the corresponding number in the alphabet for each letter, sum those numbers, and then add the digits of each number until you have a single digit. So for "cat" we would get 3+1+20=24 (c=3,a=1,t=20), and in next step 24 would is turned into 6 (2+4=6). So the hash signature for "cat" would be 6. "Horse" would be 8+15+18+19+5=65. 6+5=11 and 1+1=2. So the hash signature would be 2. No matter how big the data is (how long the word is in this simple example) the hash signature will always be the same size and format and we can quickly test if they are the same. The main problem with the very simplified hash algorithm above is that there are only 10 values so many words would share the same signature. However, real world hash algorithms like the <code>checksum</code> command in Stata has signatures on the format "2694850408" which has 10^10 possible signatures.<br />
<br />
With 10^10 possible signatures there is chance that two datasets have the same checksum. This is called "''collisions''". However, checksums are implemented so that two similar dataset are very unlikely to have collisions or even similar checksums, making the risk of two versions of the same dataset having a collision being extremely low. And other algorithms are implemented so that there are many more combinations, but then the hash signature gets longer. Stata also have the command <code>datasignature</code> that has a signature that combines a hash value with some basic data on the dataset such as number of observations. While this provides some useful human readable information, it can only be used on Stata <code>.dta</code> files.<br />
<br />
The drawback of all these methods is that it says nothing about how the dataset was created. So there is no way to recreate a dataset based only on the checksum or hash signature. If you have the checksum of a dataset used in the past but not the corresponding version of the dataset, then all you can do is to say if the current version of the same dataset is identical. You cannot say what differs if they are different. If you used the Stata command <code>datasignature</code> you can get some clues of what differs, such as number of observations or number of variables, but you will still not be able to recreate the old version of the data.<br />
<br />
This method is a good fit for when you want to have a quick way to test that the dataset has not changed, and the details of what has changed does not matter or you have another acceptable and perhaps manual way to find out what those details are. This can be if you are accessing someone else's data set and you want to know if that person makes changes to that dataset. Another way this can be used is to check which if any datasets are updated when the code is updated. Sometimes we want to make changes to the code that should not change the dataset it produces, and then this method is a great way to verify that.<br />
<br />
=== Version control in sync software ===<br />
<br />
File syncing software (read more about them below) often have version control systems that allows you to both detect changes made to data files and allows you to restore old versions of those files. However, the way they it is done in these systems is so storage in-efficient that you can only restore files version that are less than a few months old. If your project is completed with that time frame, then this is a great solution, however, typically a project runs for much longer than that.<br />
<br />
== Data storage types ==<br />
There are different types of data storage option and which one that is best for each use case typically depends on how big the data is, if there is confidential information in the data and how the data will be used. Below are some different storage types with their pros and cons. The type of data that is considered in this article are data files, and not, for example, data stored in data bases. <br />
<br />
=== File sync services ===<br />
<br />
Common examples of file sync services are DropBox, OneDrive, Box among several others. This storage type is suitable when the data files required for a project is not bigger than that they can be stored on a regular laptop or desktop computer. In file sync services a copy of the file exists locally on all computers that has the folder synced. This means that accessing the file is quick and that there is never a problem for multiple people to access the file at the same time as they have their own version of the file. <br />
<br />
However, because all files are saved locally this also means that if one person works on many projects there is a risk that there is not space for all files for those projects on that persons computer. Many file syncing services have options to mitigate this. For example you can chose to not sync the parts of the project folder that you know you do not need access to. Some sync services also allow you to have non-synced files ready to be downloaded on demand (sometimes referred to as smart sync). However, this is not great for data work as data files are usually to big to be downloaded on demand when your code trying to access them leading to your programming software to crash when trying to read the file.<br />
<br />
Another concern with file syncing services are privacy concerns when sharing confidential data. There are enterprise subscriptions that can be installed on your own servers. This makes sharing data much more secure as the data is never stored or transferred on the servers that belong to the company that offers the syncing service. If you do not have an enterprise subscription or are not sure if you do, you should always [[encryption|encrypt]] the data before saving it in a synced folder. DIME Analytics have published [https://github.com/worldbank/dime-standards/blob/master/dime-research-standards/pillar-4-data-security/data-security-resources/veracrypt-guidelines.md guidelines to encrypt files with Veracrypt] which is a secure and free software often used for this purpose.<br />
<br />
=== Cloud storage ===<br />
<br />
Cloud storage comes in many different kinds and it is out of the scope of this wiki article to describe them all. This article will only cover general points about cloud storage. This article only covers data stored in files, so cloud storage here typically means [https://aws.amazon.com/s3/ S3 storage on AWS] or [https://docs.microsoft.com/en-us/azure/storage/blobs/ Blob storage on Azure], but are not limited to those.<br />
<br />
The benefit of cloud storage is that you do not need to buy or upgrade any hardware. To get started or to increase your capacity you only need to click a button or it might even be upgraded automatically. However, you will be charged more the more you use. <br />
<br />
Cloud storage might not be the best option when the data will frequently be accessed from laptops and desktops. Even on a fast internet connection will the download time be too long for most use cases if the data is downloaded each time the script is run. Instead, cloud storage is best used when frequently accessed from another cloud resource. However, make sure that the other cloud storage is in the same physical location as the cloud resource that will use it. The same cloud provider have data centers across the world and you might have issues if your storage is on the other side of the world from the resource that will use it. There are ways to address these issues but they are outside of the scope of this article. The main takeaway is that cloud storage is only a great solution if the resources that needs access to that have a very quick connection to them.<br />
<br />
At the same time, this does not mean that cloud storage should never be accessed from a laptop or a desktop. Cloud storage can be a great place to store data that is infrequently accessed by users on regular computers. There will still be a delay, but that can be acceptable as the data is only infrequently accessed.<br />
<br />
=== Network drive storage ===<br />
<br />
Network drive storage is in many way similar to the types of cloud storage that we discussed above. The major difference for the context discussed here is that they can be located in the same location as a user accessing them from a laptop/desktop making it a viable option for such a user to access it frequently. This is a great solution when a regular laptop/desktop frequently needs access to the data and the combined size of all of the data files are too big to all be stored on the laptop/desktop.<br />
<br />
The problem with this solution is that you need to purchase hardware and run your own network drive. This can be complicated if you do not work at a large organization where there is an IT team that can do this for you. Another drawback is that if you work on an organization that is spread out geographically then you will have the same issues as with cloud storage with slow access speed.<br />
<br />
== Back-up protocols ==<br />
<br />
Use<br />
<br />
== Back-up storage types ==<br />
You should never use the same file or storage solution for your back-up copy of your original or raw data, as the files and storage solution you use for your day-to-day work. While some storage types could be used for both day-to-day and backup, this sections covers the former use case. Backing up of data is covered in a later section below. There are multiple types of storage and what type is best for you is how depends on the following factors: size of your data, where it will be used and cost.<br />
<br />
== Data retention ==<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[*topic name, as listed on main page*]]<br />
<br />
<br />
== Additional Resources ==<br />
* list here other articles related to this topic, with a brief description and link<br />
<br />
[[Category: *category name* ]]</div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Data_Storage&diff=8302Data Storage2021-06-03T18:13:31Z<p>Kbjarkefur: /* Version control code that generates data */</p>
<hr />
<div>This article discuss different aspects of data storage (such as different types of storage, data back up and data retention). While [[Data_Security]] is a very important topic related to data storage, it is not covered in this article as there is a dedicated link for that.<br />
<br />
<br />
<br />
== Read First ==<br />
* Data should be version controlled. Ideally the version control system used should both be able to detect changes to the data and provide a possibility to revert to old versions<br />
* Data storage depends on how big the data is, if there is confidential information in the data and how the data will be used.<br />
* All original data (the data sets collected or received by the team) should be backed-up. All derivative data (data sets created from original data) should be created by reproducible code and that code should be version controlled in Git. Then derivative data does not need to be backed-up as it can be re-created.<br />
<br />
== Version control for data ==<br />
Version control is used for two purposes. The first is to keep track of changes and modifications to files and the second one is to provide the possibility to revert a file an old version. For code there is an industry standard that efficiently fulfills these two purposes and that system is called Git. Git can be used to both keep track of changes made to code as well as restore older versions of code. Unfortunately there is no system that as elegantly does both of those two things for data over a longer period of time. There is therefore no industry wide one-size-fits-all solutions for version control of data. This article therefore suggests different methods and the project team should pick the method best for their exact use case.<br />
<br />
=== Version control code that generates data ===<br />
<br />
All derivative datasets (datasets that the project team creates from original data they received or collected) should be generated by reproducible code so that it can be re-created by re-running that code. Therefore, if the original data is properly [[Data_Storage#Back-up_protocols | backed-up]] and all code is version controlled using Git, then the derivative datasets are implicitly version controlled. This is the case as if the original data is unchanged, changes to derivative datasets can only happen through changes to the code tracked in Git. Git can also be used to restore old version of code that in turn can be used to restore old versions of the derivative data.<br />
<br />
While this method often is an excellent option to version control derivative data, it does not work when the original data is updated frequently (ongoing data collection or when data is continuously received) or when the code is not accessible (someone else is generating the data). Derivative data should, in these cases, still be generated using reproducible code tracked in Git, but the effect is no longer that those derivative datasets are implicitly properly version controlled.<br />
<br />
=== Version control using checksums/hashes ===<br />
<br />
One way to keep track of changes done to data is to use ''checksums'', ''hashes'' or ''data signatures''. These three concepts all work slightly differently but all follows the same principle and serves the same purpose in the context of version control of data. The principle they all follow is that a datasets is boiled down to a short string of text or a number. Then that string or number can be compared across datasets to test if they are identical or not.<br />
<br />
One very simplified way to explain how this works would be the following example. In this hashing algorithm we start with a word instead of a dataset, but real world hashing algorithms can handle both. Start with a word and then take the corresponding number in the alphabet for each letter, sum those numbers, and then add the digits of each number until you have a single digit. So for "cat" we would get 3+1+20=24 (c=3,a=1,t=20), and in next step 24 would is turned into 6 (2+4=6). So the hash signature for "cat" would be 6. "Horse" would be 8+15+18+19+5=65. 6+5=11 and 1+1=2. So the hash signature would be 2. No matter how big the data is (how long the word is in this simple example) the hash signature will always be the same size and format and we can quickly test if they are the same. The main problem with the very simplified hash algorithm above is that there are only 10 values so many words would share the same signature. However, real world hash algorithms like the <code>checksum</code> command in Stata has signatures on the format "2694850408" which has 10^10 possible signatures.<br />
<br />
With 10^10 possible signatures there is chance that two datasets have the same checksum. This is called "''collisions''". However, checksums are implemented so that two similar dataset are very unlikely to have collisions or even similar checksums, making the risk of two versions of the same dataset having a collision being extremely low. And other algorithms are implemented so that there are many more combinations, but then the hash signature gets longer. Stata also have the command <code>datasignature</code> that has a signature that combines a hash value with some basic data on the dataset such as number of observations. While this provides some useful human readable information, it can only be used on Stata <code>.dta</code> files.<br />
<br />
The drawback of all these methods is that it says nothing about how the dataset was created. So there is no way to recreate a dataset based only on the checksum or hash signature. If you have the checksum of a dataset used in the past but not the corresponding version of the dataset, then all you can do is to say if the current version of the same dataset is identical. You cannot say what differs if they are different. If you used the Stata command <code>datasignature</code> you can get some clues of what differs, such as number of observations or number of variables, but you will still not be able to recreate the old version of the data.<br />
<br />
This method is a good fit for when you want to have a quick way to test that the dataset has not changed, and the details of what has changed does not matter or you have another acceptable and perhaps manual way to find out what those details are. This can be if you are accessing someone else's data set and you want to know if that person makes changes to that dataset. Another way this can be used is to check which if any datasets are updated when the code is updated. Sometimes we want to make changes to the code that should not change the dataset it produces, and then this method is a great way to verify that.<br />
<br />
=== Version control in sync software ===<br />
<br />
File syncing software (read more about them below) often have version control systems that allows you to both detect changes made to data files and allows you to restore old versions of those files. However, the way they it is done in these systems is so storage in-efficient that you can only restore files version that are less than a few months old. If your project is completed with that time frame, then this is a great solution, however, typically a project runs for much longer than that.<br />
<br />
== Data storage types ==<br />
There are different types of data storage option and which one that is best for each use case typically depends on how big the data is, if there is confidential information in the data and how the data will be used. Below are some different storage types with their pros and cons. The type of data that is considered in this article are data files, and not, for example, data stored in data bases. <br />
<br />
=== File sync services ===<br />
<br />
Common examples of file sync services are DropBox, OneDrive, Box among several others. This storage type is suitable when the data files required for a project is not bigger than that they can be stored on a regular laptop or desktop computer. In file sync services a copy of the file exists locally on all computers that has the folder synced. This means that accessing the file is quick and that there is never a problem for multiple people to access the file at the same time as they have their own version of the file. <br />
<br />
However, because all files are saved locally this also means that if one person works on many projects there is a risk that there is not space for all files for those projects on that persons computer. Many file syncing services have options to mitigate this. For example you can chose to not sync the parts of the project folder that you know you do not need access to. Some sync services also allow you to have non-synced files ready to be downloaded on demand (sometimes referred to as smart sync). However, this is not great for data work as data files are usually to big to be downloaded on demand when your code trying to access them leading to your programming software to crash when trying to read the file.<br />
<br />
Another concern with file syncing services are privacy concerns when sharing confidential data. There are enterprise subscriptions that can be installed on your own servers. This makes sharing data much more secure as the data is never stored or transferred on the servers that belong to the company that offers the syncing service. If you do not have an enterprise subscription or are not sure if you do, you should always [[encryption|encrypt]] the data before saving it in a synced folder. DIME Analytics have published [https://github.com/worldbank/dime-standards/blob/master/dime-research-standards/pillar-4-data-security/data-security-resources/veracrypt-guidelines.md guidelines to encrypt files with Veracrypt] which is a secure and free software often used for this purpose.<br />
<br />
=== Cloud storage ===<br />
<br />
Cloud storage comes in many different kinds and it is out of the scope of this wiki article to describe them all. This article will only cover general points about cloud storage. This article only covers data stored in files, so cloud storage here typically means [https://aws.amazon.com/s3/ S3 storage on AWS] or [https://docs.microsoft.com/en-us/azure/storage/blobs/ Blob storage on Azure], but are not limited to those.<br />
<br />
The benefit of cloud storage is that you do not need to buy or upgrade any hardware. To get started or to increase your capacity you only need to click a button or it might even be upgraded automatically. However, you will be charged more the more you use. <br />
<br />
Cloud storage might not be the best option when the data will frequently be accessed from laptops and desktops. Even on a fast internet connection will the download time be too long for most use cases if the data is downloaded each time the script is run. Instead, cloud storage is best used when frequently accessed from another cloud resource. However, make sure that the other cloud storage is in the same physical location as the cloud resource that will use it. The same cloud provider have data centers across the world and you might have issues if your storage is on the other side of the world from the resource that will use it. There are ways to address these issues but they are outside of the scope of this article. The main takeaway is that cloud storage is only a great solution if the resources that needs access to that have a very quick connection to them.<br />
<br />
At the same time, this does not mean that cloud storage should never be accessed from a laptop or a desktop. Cloud storage can be a great place to store data that is infrequently accessed by users on regular computers. There will still be a delay, but that can be acceptable as the data is only infrequently accessed.<br />
<br />
=== Network drive storage ===<br />
<br />
Network drive storage is in many way similar to the types of cloud storage that we discussed above. The major difference for the context discussed here is that they can be located in the same location as a user accessing them from a laptop/desktop making it a viable option for such a user to access it frequently. This is a great solution when a regular laptop/desktop frequently needs access to the data and the combined size of all of the data files are too big to all be stored on the laptop/desktop.<br />
<br />
The problem with this solution is that you need to purchase hardware and run your own network drive. This can be complicated if you do not work at a large organization where there is an IT team that can do this for you. Another drawback is that if you work on an organization that is spread out geographically then you will have the same issues as with cloud storage with slow access speed.<br />
<br />
== Back-up protocols ==<br />
<br />
Use<br />
<br />
== Back-up storage types ==<br />
You should never use the same file or storage solution for your back-up copy of your original or raw data, as the files and storage solution you use for your day-to-day work. While some storage types could be used for both day-to-day and backup, this sections covers the former use case. Backing up of data is covered in a later section below. There are multiple types of storage and what type is best for you is how depends on the following factors: size of your data, where it will be used and cost.<br />
<br />
== Data retention ==<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[*topic name, as listed on main page*]]<br />
<br />
<br />
== Additional Resources ==<br />
* list here other articles related to this topic, with a brief description and link<br />
<br />
[[Category: *category name* ]]</div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Data_Storage&diff=8301Data Storage2021-06-03T18:08:02Z<p>Kbjarkefur: /* Version control for data */</p>
<hr />
<div>This article discuss different aspects of data storage (such as different types of storage, data back up and data retention). While [[Data_Security]] is a very important topic related to data storage, it is not covered in this article as there is a dedicated link for that.<br />
<br />
<br />
<br />
== Read First ==<br />
* Data should be version controlled. Ideally the version control system used should both be able to detect changes to the data and provide a possibility to revert to old versions<br />
* Data storage depends on how big the data is, if there is confidential information in the data and how the data will be used.<br />
* All original data (the data sets collected or received by the team) should be backed-up. All derivative data (data sets created from original data) should be created by reproducible code and that code should be version controlled in Git. Then derivative data does not need to be backed-up as it can be re-created.<br />
<br />
== Version control for data ==<br />
Version control is used for two purposes. The first is to keep track of changes and modifications to files and the second one is to provide the possibility to revert a file an old version. For code there is an industry standard that efficiently fulfills these two purposes and that system is called Git. Git can be used to both keep track of changes made to code as well as restore older versions of code. Unfortunately there is no system that as elegantly does both of those two things for data over a longer period of time. There is therefore no industry wide one-size-fits-all solutions for version control of data. This article therefore suggests different methods and the project team should pick the method best for their exact use case.<br />
<br />
=== Version control code that generates data ===<br />
<br />
All derivative datasets (datasets that the project team creates from data they received or collected) should be generated by code and should be reproducible by re-running that code. Therefore, if the original data is properly [[Data_Storage#Back-up_protocols | backed-up]] and all code is version controlled using Git, then the derivative datasets are implicitly version controlled. <br />
<br />
While this method is often an excellent option, it does not work when the original data is updated frequently (ongoing data collection or data streams) or when the code is not accessible (someone else is generating the data). However, when the original data is unchanged, changes to derivative datasets can only happen through changes to the code tracked in Git. Git can also be used to restore old version of code that in turn can be used to restore old versions of the derivative data. <br />
<br />
=== Version control using checksums/hashes ===<br />
<br />
One way to keep track of changes done to data is to use ''checksums'', ''hashes'' or ''data signatures''. These three concepts all work slightly differently but all follows the same principle and serves the same purpose in the context of version control of data. The principle they all follow is that a datasets is boiled down to a short string of text or a number. Then that string or number can be compared across datasets to test if they are identical or not.<br />
<br />
One very simplified way to explain how this works would be the following example. In this hashing algorithm we start with a word instead of a dataset, but real world hashing algorithms can handle both. Start with a word and then take the corresponding number in the alphabet for each letter, sum those numbers, and then add the digits of each number until you have a single digit. So for "cat" we would get 3+1+20=24 (c=3,a=1,t=20), and in next step 24 would is turned into 6 (2+4=6). So the hash signature for "cat" would be 6. "Horse" would be 8+15+18+19+5=65. 6+5=11 and 1+1=2. So the hash signature would be 2. No matter how big the data is (how long the word is in this simple example) the hash signature will always be the same size and format and we can quickly test if they are the same. The main problem with the very simplified hash algorithm above is that there are only 10 values so many words would share the same signature. However, real world hash algorithms like the <code>checksum</code> command in Stata has signatures on the format "2694850408" which has 10^10 possible signatures.<br />
<br />
With 10^10 possible signatures there is chance that two datasets have the same checksum. This is called "''collisions''". However, checksums are implemented so that two similar dataset are very unlikely to have collisions or even similar checksums, making the risk of two versions of the same dataset having a collision being extremely low. And other algorithms are implemented so that there are many more combinations, but then the hash signature gets longer. Stata also have the command <code>datasignature</code> that has a signature that combines a hash value with some basic data on the dataset such as number of observations. While this provides some useful human readable information, it can only be used on Stata <code>.dta</code> files.<br />
<br />
The drawback of all these methods is that it says nothing about how the dataset was created. So there is no way to recreate a dataset based only on the checksum or hash signature. If you have the checksum of a dataset used in the past but not the corresponding version of the dataset, then all you can do is to say if the current version of the same dataset is identical. You cannot say what differs if they are different. If you used the Stata command <code>datasignature</code> you can get some clues of what differs, such as number of observations or number of variables, but you will still not be able to recreate the old version of the data.<br />
<br />
This method is a good fit for when you want to have a quick way to test that the dataset has not changed, and the details of what has changed does not matter or you have another acceptable and perhaps manual way to find out what those details are. This can be if you are accessing someone else's data set and you want to know if that person makes changes to that dataset. Another way this can be used is to check which if any datasets are updated when the code is updated. Sometimes we want to make changes to the code that should not change the dataset it produces, and then this method is a great way to verify that.<br />
<br />
=== Version control in sync software ===<br />
<br />
File syncing software (read more about them below) often have version control systems that allows you to both detect changes made to data files and allows you to restore old versions of those files. However, the way they it is done in these systems is so storage in-efficient that you can only restore files version that are less than a few months old. If your project is completed with that time frame, then this is a great solution, however, typically a project runs for much longer than that.<br />
<br />
== Data storage types ==<br />
There are different types of data storage option and which one that is best for each use case typically depends on how big the data is, if there is confidential information in the data and how the data will be used. Below are some different storage types with their pros and cons. The type of data that is considered in this article are data files, and not, for example, data stored in data bases. <br />
<br />
=== File sync services ===<br />
<br />
Common examples of file sync services are DropBox, OneDrive, Box among several others. This storage type is suitable when the data files required for a project is not bigger than that they can be stored on a regular laptop or desktop computer. In file sync services a copy of the file exists locally on all computers that has the folder synced. This means that accessing the file is quick and that there is never a problem for multiple people to access the file at the same time as they have their own version of the file. <br />
<br />
However, because all files are saved locally this also means that if one person works on many projects there is a risk that there is not space for all files for those projects on that persons computer. Many file syncing services have options to mitigate this. For example you can chose to not sync the parts of the project folder that you know you do not need access to. Some sync services also allow you to have non-synced files ready to be downloaded on demand (sometimes referred to as smart sync). However, this is not great for data work as data files are usually to big to be downloaded on demand when your code trying to access them leading to your programming software to crash when trying to read the file.<br />
<br />
Another concern with file syncing services are privacy concerns when sharing confidential data. There are enterprise subscriptions that can be installed on your own servers. This makes sharing data much more secure as the data is never stored or transferred on the servers that belong to the company that offers the syncing service. If you do not have an enterprise subscription or are not sure if you do, you should always [[encryption|encrypt]] the data before saving it in a synced folder. DIME Analytics have published [https://github.com/worldbank/dime-standards/blob/master/dime-research-standards/pillar-4-data-security/data-security-resources/veracrypt-guidelines.md guidelines to encrypt files with Veracrypt] which is a secure and free software often used for this purpose.<br />
<br />
=== Cloud storage ===<br />
<br />
Cloud storage comes in many different kinds and it is out of the scope of this wiki article to describe them all. This article will only cover general points about cloud storage. This article only covers data stored in files, so cloud storage here typically means [https://aws.amazon.com/s3/ S3 storage on AWS] or [https://docs.microsoft.com/en-us/azure/storage/blobs/ Blob storage on Azure], but are not limited to those.<br />
<br />
The benefit of cloud storage is that you do not need to buy or upgrade any hardware. To get started or to increase your capacity you only need to click a button or it might even be upgraded automatically. However, you will be charged more the more you use. <br />
<br />
Cloud storage might not be the best option when the data will frequently be accessed from laptops and desktops. Even on a fast internet connection will the download time be too long for most use cases if the data is downloaded each time the script is run. Instead, cloud storage is best used when frequently accessed from another cloud resource. However, make sure that the other cloud storage is in the same physical location as the cloud resource that will use it. The same cloud provider have data centers across the world and you might have issues if your storage is on the other side of the world from the resource that will use it. There are ways to address these issues but they are outside of the scope of this article. The main takeaway is that cloud storage is only a great solution if the resources that needs access to that have a very quick connection to them.<br />
<br />
At the same time, this does not mean that cloud storage should never be accessed from a laptop or a desktop. Cloud storage can be a great place to store data that is infrequently accessed by users on regular computers. There will still be a delay, but that can be acceptable as the data is only infrequently accessed.<br />
<br />
=== Network drive storage ===<br />
<br />
Network drive storage is in many way similar to the types of cloud storage that we discussed above. The major difference for the context discussed here is that they can be located in the same location as a user accessing them from a laptop/desktop making it a viable option for such a user to access it frequently. This is a great solution when a regular laptop/desktop frequently needs access to the data and the combined size of all of the data files are too big to all be stored on the laptop/desktop.<br />
<br />
The problem with this solution is that you need to purchase hardware and run your own network drive. This can be complicated if you do not work at a large organization where there is an IT team that can do this for you. Another drawback is that if you work on an organization that is spread out geographically then you will have the same issues as with cloud storage with slow access speed.<br />
<br />
== Back-up protocols ==<br />
<br />
Use<br />
<br />
== Back-up storage types ==<br />
You should never use the same file or storage solution for your back-up copy of your original or raw data, as the files and storage solution you use for your day-to-day work. While some storage types could be used for both day-to-day and backup, this sections covers the former use case. Backing up of data is covered in a later section below. There are multiple types of storage and what type is best for you is how depends on the following factors: size of your data, where it will be used and cost.<br />
<br />
== Data retention ==<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[*topic name, as listed on main page*]]<br />
<br />
<br />
== Additional Resources ==<br />
* list here other articles related to this topic, with a brief description and link<br />
<br />
[[Category: *category name* ]]</div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Data_Storage&diff=8300Data Storage2021-06-03T18:03:27Z<p>Kbjarkefur: /* Back-up protocols */</p>
<hr />
<div>This article discuss different aspects of data storage (such as different types of storage, data back up and data retention). While [[Data_Security]] is a very important topic related to data storage, it is not covered in this article as there is a dedicated link for that.<br />
<br />
<br />
<br />
== Read First ==<br />
* Data should be version controlled. Ideally the version control system used should both be able to detect changes to the data and provide a possibility to revert to old versions<br />
* Data storage depends on how big the data is, if there is confidential information in the data and how the data will be used.<br />
* All original data (the data sets collected or received by the team) should be backed-up. All derivative data (data sets created from original data) should be created by reproducible code and that code should be version controlled in Git. Then derivative data does not need to be backed-up as it can be re-created.<br />
<br />
== Version control for data ==<br />
Version control is used for two purposes. The first is to keep track of changes and modifications and the second one is to provide the possibility to revert a file or a folder to an old version. For code there is an industry standard for code that fulfills these two purposes and that system is called Git. Git can be used to both keep track of changes made to code as well as restore older versions of code. Unfortunately there is no system that as elegantly does both of those two things for data and there is therefore no industry wide one-size-fits-all solutions for data. This article therefore suggests different methods and the project team should pick the method best for their exact use case.<br />
<br />
=== Version control code that generates data ===<br />
<br />
All datasets should be generated by code and should be reproducible by re-running that code. Therefore, if you have a good back-up system for the original data and version control all your code using Git, then you have an implicit version control for your data. As long as the original data is unchanged, changes to datasets you generate can only happen through changes to the code tracked in Git. Git can also be used to restore old version of the data as you simple restore an old version of the code and re-generate the data. <br />
<br />
While this method is often an excellent option, it does not work when the original data is updated frequently (ongoing data collection or data streams) or when the code is not accessible (someone else is generating the data).<br />
<br />
=== Version control using checksums/hashes ===<br />
<br />
One way to keep track of changes done to data is to use ''checksums'', ''hashes'' or ''data signatures''. These three concepts all work slightly differently but all follows the same principle and serves the same purpose in the context of version control of data. The principle they all follow is that a datasets is boiled down to a short string of text or a number. Then that string or number can be compared across datasets to test if they are identical or not.<br />
<br />
One very simplified way to explain how this works would be the following example. In this hashing algorithm we start with a word instead of a dataset, but real world hashing algorithms can handle both. Start with a word and then take the corresponding number in the alphabet for each letter, sum those numbers, and then add the digits of each number until you have a single digit. So for "cat" we would get 3+1+20=24 (c=3,a=1,t=20), and in next step 24 would is turned into 6 (2+4=6). So the hash signature for "cat" would be 6. "Horse" would be 8+15+18+19+5=65. 6+5=11 and 1+1=2. So the hash signature would be 2. No matter how big the data is (how long the word is in this simple example) the hash signature will always be the same size and format and we can quickly test if they are the same. The main problem with the very simplified hash algorithm above is that there are only 10 values so many words would share the same signature. However, real world hash algorithms like the <code>checksum</code> command in Stata has signatures on the format "2694850408" which has 10^10 possible signatures.<br />
<br />
With 10^10 possible signatures there is chance that two datasets have the same checksum. This is called "''collisions''". However, checksums are implemented so that two similar dataset are very unlikely to have collisions or even similar checksums, making the risk of two versions of the same dataset having a collision being extremely low. And other algorithms are implemented so that there are many more combinations, but then the hash signature gets longer. Stata also have the command <code>datasignature</code> that has a signature that combines a hash value with some basic data on the dataset such as number of observations. While this provides some useful human readable information, it can only be used on Stata <code>.dta</code> files.<br />
<br />
The drawback of all these methods is that it says nothing about how the dataset was created. So there is no way to recreate a dataset based only on the checksum or hash signature. If you have the checksum of a dataset used in the past but not the corresponding version of the dataset, then all you can do is to say if the current version of the same dataset is identical. You cannot say what differs if they are different. If you used the Stata command <code>datasignature</code> you can get some clues of what differs, such as number of observations or number of variables, but you will still not be able to recreate the old version of the data.<br />
<br />
This method is a good fit for when you want to have a quick way to test that the dataset has not changed, and the details of what has changed does not matter or you have another acceptable and perhaps manual way to find out what those details are. This can be if you are accessing someone else's data set and you want to know if that person makes changes to that dataset. Another way this can be used is to check which if any datasets are updated when the code is updated. Sometimes we want to make changes to the code that should not change the dataset it produces, and then this method is a great way to verify that.<br />
<br />
=== Version control in sync software ===<br />
<br />
File syncing software (read more about them below) often have version control systems that allows you to both detect changes made to data files and allows you to restore old versions of those files. However, the way they it is done in these systems is so storage in-efficient that you can only restore files version that are less than a few months old. If your project is completed with that time frame, then this is a great solution, however, typically a project runs for much longer than that.<br />
<br />
== Data storage types ==<br />
There are different types of data storage option and which one that is best for each use case typically depends on how big the data is, if there is confidential information in the data and how the data will be used. Below are some different storage types with their pros and cons. The type of data that is considered in this article are data files, and not, for example, data stored in data bases. <br />
<br />
=== File sync services ===<br />
<br />
Common examples of file sync services are DropBox, OneDrive, Box among several others. This storage type is suitable when the data files required for a project is not bigger than that they can be stored on a regular laptop or desktop computer. In file sync services a copy of the file exists locally on all computers that has the folder synced. This means that accessing the file is quick and that there is never a problem for multiple people to access the file at the same time as they have their own version of the file. <br />
<br />
However, because all files are saved locally this also means that if one person works on many projects there is a risk that there is not space for all files for those projects on that persons computer. Many file syncing services have options to mitigate this. For example you can chose to not sync the parts of the project folder that you know you do not need access to. Some sync services also allow you to have non-synced files ready to be downloaded on demand (sometimes referred to as smart sync). However, this is not great for data work as data files are usually to big to be downloaded on demand when your code trying to access them leading to your programming software to crash when trying to read the file.<br />
<br />
Another concern with file syncing services are privacy concerns when sharing confidential data. There are enterprise subscriptions that can be installed on your own servers. This makes sharing data much more secure as the data is never stored or transferred on the servers that belong to the company that offers the syncing service. If you do not have an enterprise subscription or are not sure if you do, you should always [[encryption|encrypt]] the data before saving it in a synced folder. DIME Analytics have published [https://github.com/worldbank/dime-standards/blob/master/dime-research-standards/pillar-4-data-security/data-security-resources/veracrypt-guidelines.md guidelines to encrypt files with Veracrypt] which is a secure and free software often used for this purpose.<br />
<br />
=== Cloud storage ===<br />
<br />
Cloud storage comes in many different kinds and it is out of the scope of this wiki article to describe them all. This article will only cover general points about cloud storage. This article only covers data stored in files, so cloud storage here typically means [https://aws.amazon.com/s3/ S3 storage on AWS] or [https://docs.microsoft.com/en-us/azure/storage/blobs/ Blob storage on Azure], but are not limited to those.<br />
<br />
The benefit of cloud storage is that you do not need to buy or upgrade any hardware. To get started or to increase your capacity you only need to click a button or it might even be upgraded automatically. However, you will be charged more the more you use. <br />
<br />
Cloud storage might not be the best option when the data will frequently be accessed from laptops and desktops. Even on a fast internet connection will the download time be too long for most use cases if the data is downloaded each time the script is run. Instead, cloud storage is best used when frequently accessed from another cloud resource. However, make sure that the other cloud storage is in the same physical location as the cloud resource that will use it. The same cloud provider have data centers across the world and you might have issues if your storage is on the other side of the world from the resource that will use it. There are ways to address these issues but they are outside of the scope of this article. The main takeaway is that cloud storage is only a great solution if the resources that needs access to that have a very quick connection to them.<br />
<br />
At the same time, this does not mean that cloud storage should never be accessed from a laptop or a desktop. Cloud storage can be a great place to store data that is infrequently accessed by users on regular computers. There will still be a delay, but that can be acceptable as the data is only infrequently accessed.<br />
<br />
=== Network drive storage ===<br />
<br />
Network drive storage is in many way similar to the types of cloud storage that we discussed above. The major difference for the context discussed here is that they can be located in the same location as a user accessing them from a laptop/desktop making it a viable option for such a user to access it frequently. This is a great solution when a regular laptop/desktop frequently needs access to the data and the combined size of all of the data files are too big to all be stored on the laptop/desktop.<br />
<br />
The problem with this solution is that you need to purchase hardware and run your own network drive. This can be complicated if you do not work at a large organization where there is an IT team that can do this for you. Another drawback is that if you work on an organization that is spread out geographically then you will have the same issues as with cloud storage with slow access speed.<br />
<br />
== Back-up protocols ==<br />
<br />
Use<br />
<br />
== Back-up storage types ==<br />
You should never use the same file or storage solution for your back-up copy of your original or raw data, as the files and storage solution you use for your day-to-day work. While some storage types could be used for both day-to-day and backup, this sections covers the former use case. Backing up of data is covered in a later section below. There are multiple types of storage and what type is best for you is how depends on the following factors: size of your data, where it will be used and cost.<br />
<br />
== Data retention ==<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[*topic name, as listed on main page*]]<br />
<br />
<br />
== Additional Resources ==<br />
* list here other articles related to this topic, with a brief description and link<br />
<br />
[[Category: *category name* ]]</div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Data_Storage&diff=8299Data Storage2021-06-03T17:53:08Z<p>Kbjarkefur: /* Version control using checksums/hashes */</p>
<hr />
<div>This article discuss different aspects of data storage (such as different types of storage, data back up and data retention). While [[Data_Security]] is a very important topic related to data storage, it is not covered in this article as there is a dedicated link for that.<br />
<br />
<br />
<br />
== Read First ==<br />
* Data should be version controlled. Ideally the version control system used should both be able to detect changes to the data and provide a possibility to revert to old versions<br />
* Data storage depends on how big the data is, if there is confidential information in the data and how the data will be used.<br />
* All original data (the data sets collected or received by the team) should be backed-up. All derivative data (data sets created from original data) should be created by reproducible code and that code should be version controlled in Git. Then derivative data does not need to be backed-up as it can be re-created.<br />
<br />
== Version control for data ==<br />
Version control is used for two purposes. The first is to keep track of changes and modifications and the second one is to provide the possibility to revert a file or a folder to an old version. For code there is an industry standard for code that fulfills these two purposes and that system is called Git. Git can be used to both keep track of changes made to code as well as restore older versions of code. Unfortunately there is no system that as elegantly does both of those two things for data and there is therefore no industry wide one-size-fits-all solutions for data. This article therefore suggests different methods and the project team should pick the method best for their exact use case.<br />
<br />
=== Version control code that generates data ===<br />
<br />
All datasets should be generated by code and should be reproducible by re-running that code. Therefore, if you have a good back-up system for the original data and version control all your code using Git, then you have an implicit version control for your data. As long as the original data is unchanged, changes to datasets you generate can only happen through changes to the code tracked in Git. Git can also be used to restore old version of the data as you simple restore an old version of the code and re-generate the data. <br />
<br />
While this method is often an excellent option, it does not work when the original data is updated frequently (ongoing data collection or data streams) or when the code is not accessible (someone else is generating the data).<br />
<br />
=== Version control using checksums/hashes ===<br />
<br />
One way to keep track of changes done to data is to use ''checksums'', ''hashes'' or ''data signatures''. These three concepts all work slightly differently but all follows the same principle and serves the same purpose in the context of version control of data. The principle they all follow is that a datasets is boiled down to a short string of text or a number. Then that string or number can be compared across datasets to test if they are identical or not.<br />
<br />
One very simplified way to explain how this works would be the following example. In this hashing algorithm we start with a word instead of a dataset, but real world hashing algorithms can handle both. Start with a word and then take the corresponding number in the alphabet for each letter, sum those numbers, and then add the digits of each number until you have a single digit. So for "cat" we would get 3+1+20=24 (c=3,a=1,t=20), and in next step 24 would is turned into 6 (2+4=6). So the hash signature for "cat" would be 6. "Horse" would be 8+15+18+19+5=65. 6+5=11 and 1+1=2. So the hash signature would be 2. No matter how big the data is (how long the word is in this simple example) the hash signature will always be the same size and format and we can quickly test if they are the same. The main problem with the very simplified hash algorithm above is that there are only 10 values so many words would share the same signature. However, real world hash algorithms like the <code>checksum</code> command in Stata has signatures on the format "2694850408" which has 10^10 possible signatures.<br />
<br />
With 10^10 possible signatures there is chance that two datasets have the same checksum. This is called "''collisions''". However, checksums are implemented so that two similar dataset are very unlikely to have collisions or even similar checksums, making the risk of two versions of the same dataset having a collision being extremely low. And other algorithms are implemented so that there are many more combinations, but then the hash signature gets longer. Stata also have the command <code>datasignature</code> that has a signature that combines a hash value with some basic data on the dataset such as number of observations. While this provides some useful human readable information, it can only be used on Stata <code>.dta</code> files.<br />
<br />
The drawback of all these methods is that it says nothing about how the dataset was created. So there is no way to recreate a dataset based only on the checksum or hash signature. If you have the checksum of a dataset used in the past but not the corresponding version of the dataset, then all you can do is to say if the current version of the same dataset is identical. You cannot say what differs if they are different. If you used the Stata command <code>datasignature</code> you can get some clues of what differs, such as number of observations or number of variables, but you will still not be able to recreate the old version of the data.<br />
<br />
This method is a good fit for when you want to have a quick way to test that the dataset has not changed, and the details of what has changed does not matter or you have another acceptable and perhaps manual way to find out what those details are. This can be if you are accessing someone else's data set and you want to know if that person makes changes to that dataset. Another way this can be used is to check which if any datasets are updated when the code is updated. Sometimes we want to make changes to the code that should not change the dataset it produces, and then this method is a great way to verify that.<br />
<br />
=== Version control in sync software ===<br />
<br />
File syncing software (read more about them below) often have version control systems that allows you to both detect changes made to data files and allows you to restore old versions of those files. However, the way they it is done in these systems is so storage in-efficient that you can only restore files version that are less than a few months old. If your project is completed with that time frame, then this is a great solution, however, typically a project runs for much longer than that.<br />
<br />
== Data storage types ==<br />
There are different types of data storage option and which one that is best for each use case typically depends on how big the data is, if there is confidential information in the data and how the data will be used. Below are some different storage types with their pros and cons. The type of data that is considered in this article are data files, and not, for example, data stored in data bases. <br />
<br />
=== File sync services ===<br />
<br />
Common examples of file sync services are DropBox, OneDrive, Box among several others. This storage type is suitable when the data files required for a project is not bigger than that they can be stored on a regular laptop or desktop computer. In file sync services a copy of the file exists locally on all computers that has the folder synced. This means that accessing the file is quick and that there is never a problem for multiple people to access the file at the same time as they have their own version of the file. <br />
<br />
However, because all files are saved locally this also means that if one person works on many projects there is a risk that there is not space for all files for those projects on that persons computer. Many file syncing services have options to mitigate this. For example you can chose to not sync the parts of the project folder that you know you do not need access to. Some sync services also allow you to have non-synced files ready to be downloaded on demand (sometimes referred to as smart sync). However, this is not great for data work as data files are usually to big to be downloaded on demand when your code trying to access them leading to your programming software to crash when trying to read the file.<br />
<br />
Another concern with file syncing services are privacy concerns when sharing confidential data. There are enterprise subscriptions that can be installed on your own servers. This makes sharing data much more secure as the data is never stored or transferred on the servers that belong to the company that offers the syncing service. If you do not have an enterprise subscription or are not sure if you do, you should always [[encryption|encrypt]] the data before saving it in a synced folder. DIME Analytics have published [https://github.com/worldbank/dime-standards/blob/master/dime-research-standards/pillar-4-data-security/data-security-resources/veracrypt-guidelines.md guidelines to encrypt files with Veracrypt] which is a secure and free software often used for this purpose.<br />
<br />
=== Cloud storage ===<br />
<br />
Cloud storage comes in many different kinds and it is out of the scope of this wiki article to describe them all. This article will only cover general points about cloud storage. This article only covers data stored in files, so cloud storage here typically means [https://aws.amazon.com/s3/ S3 storage on AWS] or [https://docs.microsoft.com/en-us/azure/storage/blobs/ Blob storage on Azure], but are not limited to those.<br />
<br />
The benefit of cloud storage is that you do not need to buy or upgrade any hardware. To get started or to increase your capacity you only need to click a button or it might even be upgraded automatically. However, you will be charged more the more you use. <br />
<br />
Cloud storage might not be the best option when the data will frequently be accessed from laptops and desktops. Even on a fast internet connection will the download time be too long for most use cases if the data is downloaded each time the script is run. Instead, cloud storage is best used when frequently accessed from another cloud resource. However, make sure that the other cloud storage is in the same physical location as the cloud resource that will use it. The same cloud provider have data centers across the world and you might have issues if your storage is on the other side of the world from the resource that will use it. There are ways to address these issues but they are outside of the scope of this article. The main takeaway is that cloud storage is only a great solution if the resources that needs access to that have a very quick connection to them.<br />
<br />
At the same time, this does not mean that cloud storage should never be accessed from a laptop or a desktop. Cloud storage can be a great place to store data that is infrequently accessed by users on regular computers. There will still be a delay, but that can be acceptable as the data is only infrequently accessed.<br />
<br />
=== Network drive storage ===<br />
<br />
Network drive storage is in many way similar to the types of cloud storage that we discussed above. The major difference for the context discussed here is that they can be located in the same location as a user accessing them from a laptop/desktop making it a viable option for such a user to access it frequently. This is a great solution when a regular laptop/desktop frequently needs access to the data and the combined size of all of the data files are too big to all be stored on the laptop/desktop.<br />
<br />
The problem with this solution is that you need to purchase hardware and run your own network drive. This can be complicated if you do not work at a large organization where there is an IT team that can do this for you. Another drawback is that if you work on an organization that is spread out geographically then you will have the same issues as with cloud storage with slow access speed.<br />
<br />
== Back-up protocols ==<br />
<br />
== Back-up storage types ==<br />
You should never use the same file or storage solution for your back-up copy of your original or raw data, as the files and storage solution you use for your day-to-day work. While some storage types could be used for both day-to-day and backup, this sections covers the former use case. Backing up of data is covered in a later section below. There are multiple types of storage and what type is best for you is how depends on the following factors: size of your data, where it will be used and cost.<br />
<br />
== Data retention ==<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[*topic name, as listed on main page*]]<br />
<br />
<br />
== Additional Resources ==<br />
* list here other articles related to this topic, with a brief description and link<br />
<br />
[[Category: *category name* ]]</div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Data_Storage&diff=8298Data Storage2021-06-03T17:39:25Z<p>Kbjarkefur: /* Read First */</p>
<hr />
<div>This article discuss different aspects of data storage (such as different types of storage, data back up and data retention). While [[Data_Security]] is a very important topic related to data storage, it is not covered in this article as there is a dedicated link for that.<br />
<br />
<br />
<br />
== Read First ==<br />
* Data should be version controlled. Ideally the version control system used should both be able to detect changes to the data and provide a possibility to revert to old versions<br />
* Data storage depends on how big the data is, if there is confidential information in the data and how the data will be used.<br />
* All original data (the data sets collected or received by the team) should be backed-up. All derivative data (data sets created from original data) should be created by reproducible code and that code should be version controlled in Git. Then derivative data does not need to be backed-up as it can be re-created.<br />
<br />
== Version control for data ==<br />
Version control is used for two purposes. The first is to keep track of changes and modifications and the second one is to provide the possibility to revert a file or a folder to an old version. For code there is an industry standard for code that fulfills these two purposes and that system is called Git. Git can be used to both keep track of changes made to code as well as restore older versions of code. Unfortunately there is no system that as elegantly does both of those two things for data and there is therefore no industry wide one-size-fits-all solutions for data. This article therefore suggests different methods and the project team should pick the method best for their exact use case.<br />
<br />
=== Version control code that generates data ===<br />
<br />
All datasets should be generated by code and should be reproducible by re-running that code. Therefore, if you have a good back-up system for the original data and version control all your code using Git, then you have an implicit version control for your data. As long as the original data is unchanged, changes to datasets you generate can only happen through changes to the code tracked in Git. Git can also be used to restore old version of the data as you simple restore an old version of the code and re-generate the data. <br />
<br />
While this method is often an excellent option, it does not work when the original data is updated frequently (ongoing data collection or data streams) or when the code is not accessible (someone else is generating the data).<br />
<br />
=== Version control using checksums/hashes ===<br />
<br />
One way to keep track of changes done to data is to use ''checksums'', ''hashes'' or ''data signatures''. These three concepts all work slightly differently but all follows the same principle and serves the same purpose in the context of version control of data. The principle they all follow is that a datasets is boiled down to a short string of text or a number. Then that string or number can be compared across datasets to test if they are identical or not.<br />
<br />
One very simplified way to explain how this works would be the following example. In this hashing algorithm we start with a word instead of a dataset, but real world hashing algorithms can handle both. Start with a word and then take the corresponding number in the alphabet for each letter, sum those numbers, and then add the digits of each number until you have a single digit. So for "cat" we would get 3+1+20=24 (c=3,a=1,t=20), and in next step 24 would is turned into 6 (2+4=6). So the hash signature for "cat" would be 6. "Horse" would be 8+15+18+19+5=65. 6+5=11 and 1+1=2. So the hash signature would be 2. No matter how big the data is (how long the word is in this simple example) the hash signature will always be the same size and format and we can quickly test if they are the same. The main problem with the very simplified hash algorithm above is that there are only 10 values so many words would share the same signature. However, real world hash algorithms like the <code>checksum</code> command in Stata has signatures on the format "2694850408" which has 10^10 possible signatures.<br />
<br />
With 10^10 possible signatures there is chance that two datasets have the same checksum. This is called "''collisions''". However, checksums are implemented so that two similar dataset are very unlikely to have collisions or even similar checksums, making the risk of two versions of the same dataset having a collision being extremely low. And other algorithms are implemented so that there are many more combinations, but then the hash signature gets longer. Stata also have the command <code>datasignature</code> that has a signature that combines a hash value with some basic data on the dataset such as number of observations. While this provides some useful human readable information, it can only be used on Stata <code>.dta</code> files.<br />
<br />
The drawback of all these methods is that it says nothing about how the dataset was created. So there is no way to recreate a dataset based only on the checksum or hash signature. If you have the checksum of a dataset used in the past but not the corresponding version of the dataset, then all you can do is to say if the current version of the same dataset is identical. You cannot say what differs if they are different. If you used the Stata command <code>datasignature</code> you can get some clues of what differs, such as number of observations or number of variables, but you will still not be able to recreate the old version of the data.<br />
<br />
This method is a good fit for when you want to have a quick way to test that the dataset has not changed, and the details of what has changed does not matter or you have another acceptable and perhaps manual way to find out what those details are. This can be if you are accessing someone else's data set and you want to know if that person makes changes to that dataset. Another way this can be used is to check which if any datasets are updated when the code is updated. Sometimes we want to make changes to the code that should not change the dataset it produces, and then this method is a great way to verify that.<br />
<br />
== Data storage types ==<br />
There are different types of data storage option and which one that is best for each use case typically depends on how big the data is, if there is confidential information in the data and how the data will be used. Below are some different storage types with their pros and cons. The type of data that is considered in this article are data files, and not, for example, data stored in data bases. <br />
<br />
=== File sync services ===<br />
<br />
Common examples of file sync services are DropBox, OneDrive, Box among several others. This storage type is suitable when the data files required for a project is not bigger than that they can be stored on a regular laptop or desktop computer. In file sync services a copy of the file exists locally on all computers that has the folder synced. This means that accessing the file is quick and that there is never a problem for multiple people to access the file at the same time as they have their own version of the file. <br />
<br />
However, because all files are saved locally this also means that if one person works on many projects there is a risk that there is not space for all files for those projects on that persons computer. Many file syncing services have options to mitigate this. For example you can chose to not sync the parts of the project folder that you know you do not need access to. Some sync services also allow you to have non-synced files ready to be downloaded on demand (sometimes referred to as smart sync). However, this is not great for data work as data files are usually to big to be downloaded on demand when your code trying to access them leading to your programming software to crash when trying to read the file.<br />
<br />
Another concern with file syncing services are privacy concerns when sharing confidential data. There are enterprise subscriptions that can be installed on your own servers. This makes sharing data much more secure as the data is never stored or transferred on the servers that belong to the company that offers the syncing service. If you do not have an enterprise subscription or are not sure if you do, you should always [[encryption|encrypt]] the data before saving it in a synced folder. DIME Analytics have published [https://github.com/worldbank/dime-standards/blob/master/dime-research-standards/pillar-4-data-security/data-security-resources/veracrypt-guidelines.md guidelines to encrypt files with Veracrypt] which is a secure and free software often used for this purpose.<br />
<br />
=== Cloud storage ===<br />
<br />
Cloud storage comes in many different kinds and it is out of the scope of this wiki article to describe them all. This article will only cover general points about cloud storage. This article only covers data stored in files, so cloud storage here typically means [https://aws.amazon.com/s3/ S3 storage on AWS] or [https://docs.microsoft.com/en-us/azure/storage/blobs/ Blob storage on Azure], but are not limited to those.<br />
<br />
The benefit of cloud storage is that you do not need to buy or upgrade any hardware. To get started or to increase your capacity you only need to click a button or it might even be upgraded automatically. However, you will be charged more the more you use. <br />
<br />
Cloud storage might not be the best option when the data will frequently be accessed from laptops and desktops. Even on a fast internet connection will the download time be too long for most use cases if the data is downloaded each time the script is run. Instead, cloud storage is best used when frequently accessed from another cloud resource. However, make sure that the other cloud storage is in the same physical location as the cloud resource that will use it. The same cloud provider have data centers across the world and you might have issues if your storage is on the other side of the world from the resource that will use it. There are ways to address these issues but they are outside of the scope of this article. The main takeaway is that cloud storage is only a great solution if the resources that needs access to that have a very quick connection to them.<br />
<br />
At the same time, this does not mean that cloud storage should never be accessed from a laptop or a desktop. Cloud storage can be a great place to store data that is infrequently accessed by users on regular computers. There will still be a delay, but that can be acceptable as the data is only infrequently accessed.<br />
<br />
=== Network drive storage ===<br />
<br />
Network drive storage is in many way similar to the types of cloud storage that we discussed above. The major difference for the context discussed here is that they can be located in the same location as a user accessing them from a laptop/desktop making it a viable option for such a user to access it frequently. This is a great solution when a regular laptop/desktop frequently needs access to the data and the combined size of all of the data files are too big to all be stored on the laptop/desktop.<br />
<br />
The problem with this solution is that you need to purchase hardware and run your own network drive. This can be complicated if you do not work at a large organization where there is an IT team that can do this for you. Another drawback is that if you work on an organization that is spread out geographically then you will have the same issues as with cloud storage with slow access speed.<br />
<br />
== Back-up protocols ==<br />
<br />
== Back-up storage types ==<br />
You should never use the same file or storage solution for your back-up copy of your original or raw data, as the files and storage solution you use for your day-to-day work. While some storage types could be used for both day-to-day and backup, this sections covers the former use case. Backing up of data is covered in a later section below. There are multiple types of storage and what type is best for you is how depends on the following factors: size of your data, where it will be used and cost.<br />
<br />
== Data retention ==<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[*topic name, as listed on main page*]]<br />
<br />
<br />
== Additional Resources ==<br />
* list here other articles related to this topic, with a brief description and link<br />
<br />
[[Category: *category name* ]]</div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Data_Storage&diff=8297Data Storage2021-06-03T17:36:08Z<p>Kbjarkefur: /* Read First */</p>
<hr />
<div>This article discuss different aspects of data storage (such as different types of storage, data back up and data retention). While [[Data_Security]] is a very important topic related to data storage, it is not covered in this article as there is a dedicated link for that.<br />
<br />
<br />
<br />
== Read First ==<br />
* Data should be version controlled. Ideally the version control system used should both be able to detect changes to the data and provide a possibility to revert to old versions<br />
<br />
== Version control for data ==<br />
Version control is used for two purposes. The first is to keep track of changes and modifications and the second one is to provide the possibility to revert a file or a folder to an old version. For code there is an industry standard for code that fulfills these two purposes and that system is called Git. Git can be used to both keep track of changes made to code as well as restore older versions of code. Unfortunately there is no system that as elegantly does both of those two things for data and there is therefore no industry wide one-size-fits-all solutions for data. This article therefore suggests different methods and the project team should pick the method best for their exact use case.<br />
<br />
=== Version control code that generates data ===<br />
<br />
All datasets should be generated by code and should be reproducible by re-running that code. Therefore, if you have a good back-up system for the original data and version control all your code using Git, then you have an implicit version control for your data. As long as the original data is unchanged, changes to datasets you generate can only happen through changes to the code tracked in Git. Git can also be used to restore old version of the data as you simple restore an old version of the code and re-generate the data. <br />
<br />
While this method is often an excellent option, it does not work when the original data is updated frequently (ongoing data collection or data streams) or when the code is not accessible (someone else is generating the data).<br />
<br />
=== Version control using checksums/hashes ===<br />
<br />
One way to keep track of changes done to data is to use ''checksums'', ''hashes'' or ''data signatures''. These three concepts all work slightly differently but all follows the same principle and serves the same purpose in the context of version control of data. The principle they all follow is that a datasets is boiled down to a short string of text or a number. Then that string or number can be compared across datasets to test if they are identical or not.<br />
<br />
One very simplified way to explain how this works would be the following example. In this hashing algorithm we start with a word instead of a dataset, but real world hashing algorithms can handle both. Start with a word and then take the corresponding number in the alphabet for each letter, sum those numbers, and then add the digits of each number until you have a single digit. So for "cat" we would get 3+1+20=24 (c=3,a=1,t=20), and in next step 24 would is turned into 6 (2+4=6). So the hash signature for "cat" would be 6. "Horse" would be 8+15+18+19+5=65. 6+5=11 and 1+1=2. So the hash signature would be 2. No matter how big the data is (how long the word is in this simple example) the hash signature will always be the same size and format and we can quickly test if they are the same. The main problem with the very simplified hash algorithm above is that there are only 10 values so many words would share the same signature. However, real world hash algorithms like the <code>checksum</code> command in Stata has signatures on the format "2694850408" which has 10^10 possible signatures.<br />
<br />
With 10^10 possible signatures there is chance that two datasets have the same checksum. This is called "''collisions''". However, checksums are implemented so that two similar dataset are very unlikely to have collisions or even similar checksums, making the risk of two versions of the same dataset having a collision being extremely low. And other algorithms are implemented so that there are many more combinations, but then the hash signature gets longer. Stata also have the command <code>datasignature</code> that has a signature that combines a hash value with some basic data on the dataset such as number of observations. While this provides some useful human readable information, it can only be used on Stata <code>.dta</code> files.<br />
<br />
The drawback of all these methods is that it says nothing about how the dataset was created. So there is no way to recreate a dataset based only on the checksum or hash signature. If you have the checksum of a dataset used in the past but not the corresponding version of the dataset, then all you can do is to say if the current version of the same dataset is identical. You cannot say what differs if they are different. If you used the Stata command <code>datasignature</code> you can get some clues of what differs, such as number of observations or number of variables, but you will still not be able to recreate the old version of the data.<br />
<br />
This method is a good fit for when you want to have a quick way to test that the dataset has not changed, and the details of what has changed does not matter or you have another acceptable and perhaps manual way to find out what those details are. This can be if you are accessing someone else's data set and you want to know if that person makes changes to that dataset. Another way this can be used is to check which if any datasets are updated when the code is updated. Sometimes we want to make changes to the code that should not change the dataset it produces, and then this method is a great way to verify that.<br />
<br />
== Data storage types ==<br />
There are different types of data storage option and which one that is best for each use case typically depends on how big the data is, if there is confidential information in the data and how the data will be used. Below are some different storage types with their pros and cons. The type of data that is considered in this article are data files, and not, for example, data stored in data bases. <br />
<br />
=== File sync services ===<br />
<br />
Common examples of file sync services are DropBox, OneDrive, Box among several others. This storage type is suitable when the data files required for a project is not bigger than that they can be stored on a regular laptop or desktop computer. In file sync services a copy of the file exists locally on all computers that has the folder synced. This means that accessing the file is quick and that there is never a problem for multiple people to access the file at the same time as they have their own version of the file. <br />
<br />
However, because all files are saved locally this also means that if one person works on many projects there is a risk that there is not space for all files for those projects on that persons computer. Many file syncing services have options to mitigate this. For example you can chose to not sync the parts of the project folder that you know you do not need access to. Some sync services also allow you to have non-synced files ready to be downloaded on demand (sometimes referred to as smart sync). However, this is not great for data work as data files are usually to big to be downloaded on demand when your code trying to access them leading to your programming software to crash when trying to read the file.<br />
<br />
Another concern with file syncing services are privacy concerns when sharing confidential data. There are enterprise subscriptions that can be installed on your own servers. This makes sharing data much more secure as the data is never stored or transferred on the servers that belong to the company that offers the syncing service. If you do not have an enterprise subscription or are not sure if you do, you should always [[encryption|encrypt]] the data before saving it in a synced folder. DIME Analytics have published [https://github.com/worldbank/dime-standards/blob/master/dime-research-standards/pillar-4-data-security/data-security-resources/veracrypt-guidelines.md guidelines to encrypt files with Veracrypt] which is a secure and free software often used for this purpose.<br />
<br />
=== Cloud storage ===<br />
<br />
Cloud storage comes in many different kinds and it is out of the scope of this wiki article to describe them all. This article will only cover general points about cloud storage. This article only covers data stored in files, so cloud storage here typically means [https://aws.amazon.com/s3/ S3 storage on AWS] or [https://docs.microsoft.com/en-us/azure/storage/blobs/ Blob storage on Azure], but are not limited to those.<br />
<br />
The benefit of cloud storage is that you do not need to buy or upgrade any hardware. To get started or to increase your capacity you only need to click a button or it might even be upgraded automatically. However, you will be charged more the more you use. <br />
<br />
Cloud storage might not be the best option when the data will frequently be accessed from laptops and desktops. Even on a fast internet connection will the download time be too long for most use cases if the data is downloaded each time the script is run. Instead, cloud storage is best used when frequently accessed from another cloud resource. However, make sure that the other cloud storage is in the same physical location as the cloud resource that will use it. The same cloud provider have data centers across the world and you might have issues if your storage is on the other side of the world from the resource that will use it. There are ways to address these issues but they are outside of the scope of this article. The main takeaway is that cloud storage is only a great solution if the resources that needs access to that have a very quick connection to them.<br />
<br />
At the same time, this does not mean that cloud storage should never be accessed from a laptop or a desktop. Cloud storage can be a great place to store data that is infrequently accessed by users on regular computers. There will still be a delay, but that can be acceptable as the data is only infrequently accessed.<br />
<br />
=== Network drive storage ===<br />
<br />
Network drive storage is in many way similar to the types of cloud storage that we discussed above. The major difference for the context discussed here is that they can be located in the same location as a user accessing them from a laptop/desktop making it a viable option for such a user to access it frequently. This is a great solution when a regular laptop/desktop frequently needs access to the data and the combined size of all of the data files are too big to all be stored on the laptop/desktop.<br />
<br />
The problem with this solution is that you need to purchase hardware and run your own network drive. This can be complicated if you do not work at a large organization where there is an IT team that can do this for you. Another drawback is that if you work on an organization that is spread out geographically then you will have the same issues as with cloud storage with slow access speed.<br />
<br />
== Back-up protocols ==<br />
<br />
== Back-up storage types ==<br />
You should never use the same file or storage solution for your back-up copy of your original or raw data, as the files and storage solution you use for your day-to-day work. While some storage types could be used for both day-to-day and backup, this sections covers the former use case. Backing up of data is covered in a later section below. There are multiple types of storage and what type is best for you is how depends on the following factors: size of your data, where it will be used and cost.<br />
<br />
== Data retention ==<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[*topic name, as listed on main page*]]<br />
<br />
<br />
== Additional Resources ==<br />
* list here other articles related to this topic, with a brief description and link<br />
<br />
[[Category: *category name* ]]</div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Data_Storage&diff=8296Data Storage2021-06-03T17:04:25Z<p>Kbjarkefur: /* Version control for data */</p>
<hr />
<div>This article discuss different aspects of data storage (such as different types of storage, data back up and data retention). While [[Data_Security]] is a very important topic related to data storage, it is not covered in this article as there is a dedicated link for that.<br />
<br />
<br />
<br />
== Read First ==<br />
* include here key points you want to make sure all readers understand<br />
<br />
<br />
== Version control for data ==<br />
Version control is used for two purposes. The first is to keep track of changes and modifications and the second one is to provide the possibility to revert a file or a folder to an old version. For code there is an industry standard for code that fulfills these two purposes and that system is called Git. Git can be used to both keep track of changes made to code as well as restore older versions of code. Unfortunately there is no system that as elegantly does both of those two things for data and there is therefore no industry wide one-size-fits-all solutions for data. This article therefore suggests different methods and the project team should pick the method best for their exact use case.<br />
<br />
=== Version control code that generates data ===<br />
<br />
All datasets should be generated by code and should be reproducible by re-running that code. Therefore, if you have a good back-up system for the original data and version control all your code using Git, then you have an implicit version control for your data. As long as the original data is unchanged, changes to datasets you generate can only happen through changes to the code tracked in Git. Git can also be used to restore old version of the data as you simple restore an old version of the code and re-generate the data. <br />
<br />
While this method is often an excellent option, it does not work when the original data is updated frequently (ongoing data collection or data streams) or when the code is not accessible (someone else is generating the data).<br />
<br />
=== Version control using checksums/hashes ===<br />
<br />
One way to keep track of changes done to data is to use ''checksums'', ''hashes'' or ''data signatures''. These three concepts all work slightly differently but all follows the same principle and serves the same purpose in the context of version control of data. The principle they all follow is that a datasets is boiled down to a short string of text or a number. Then that string or number can be compared across datasets to test if they are identical or not.<br />
<br />
One very simplified way to explain how this works would be the following example. In this hashing algorithm we start with a word instead of a dataset, but real world hashing algorithms can handle both. Start with a word and then take the corresponding number in the alphabet for each letter, sum those numbers, and then add the digits of each number until you have a single digit. So for "cat" we would get 3+1+20=24 (c=3,a=1,t=20), and in next step 24 would is turned into 6 (2+4=6). So the hash signature for "cat" would be 6. "Horse" would be 8+15+18+19+5=65. 6+5=11 and 1+1=2. So the hash signature would be 2. No matter how big the data is (how long the word is in this simple example) the hash signature will always be the same size and format and we can quickly test if they are the same. The main problem with the very simplified hash algorithm above is that there are only 10 values so many words would share the same signature. However, real world hash algorithms like the <code>checksum</code> command in Stata has signatures on the format "2694850408" which has 10^10 possible signatures.<br />
<br />
With 10^10 possible signatures there is chance that two datasets have the same checksum. This is called "''collisions''". However, checksums are implemented so that two similar dataset are very unlikely to have collisions or even similar checksums, making the risk of two versions of the same dataset having a collision being extremely low. And other algorithms are implemented so that there are many more combinations, but then the hash signature gets longer. Stata also have the command <code>datasignature</code> that has a signature that combines a hash value with some basic data on the dataset such as number of observations. While this provides some useful human readable information, it can only be used on Stata <code>.dta</code> files.<br />
<br />
The drawback of all these methods is that it says nothing about how the dataset was created. So there is no way to recreate a dataset based only on the checksum or hash signature. If you have the checksum of a dataset used in the past but not the corresponding version of the dataset, then all you can do is to say if the current version of the same dataset is identical. You cannot say what differs if they are different. If you used the Stata command <code>datasignature</code> you can get some clues of what differs, such as number of observations or number of variables, but you will still not be able to recreate the old version of the data.<br />
<br />
This method is a good fit for when you want to have a quick way to test that the dataset has not changed, and the details of what has changed does not matter or you have another acceptable and perhaps manual way to find out what those details are. This can be if you are accessing someone else's data set and you want to know if that person makes changes to that dataset. Another way this can be used is to check which if any datasets are updated when the code is updated. Sometimes we want to make changes to the code that should not change the dataset it produces, and then this method is a great way to verify that.<br />
<br />
== Data storage types ==<br />
There are different types of data storage option and which one that is best for each use case typically depends on how big the data is, if there is confidential information in the data and how the data will be used. Below are some different storage types with their pros and cons. The type of data that is considered in this article are data files, and not, for example, data stored in data bases. <br />
<br />
=== File sync services ===<br />
<br />
Common examples of file sync services are DropBox, OneDrive, Box among several others. This storage type is suitable when the data files required for a project is not bigger than that they can be stored on a regular laptop or desktop computer. In file sync services a copy of the file exists locally on all computers that has the folder synced. This means that accessing the file is quick and that there is never a problem for multiple people to access the file at the same time as they have their own version of the file. <br />
<br />
However, because all files are saved locally this also means that if one person works on many projects there is a risk that there is not space for all files for those projects on that persons computer. Many file syncing services have options to mitigate this. For example you can chose to not sync the parts of the project folder that you know you do not need access to. Some sync services also allow you to have non-synced files ready to be downloaded on demand (sometimes referred to as smart sync). However, this is not great for data work as data files are usually to big to be downloaded on demand when your code trying to access them leading to your programming software to crash when trying to read the file.<br />
<br />
Another concern with file syncing services are privacy concerns when sharing confidential data. There are enterprise subscriptions that can be installed on your own servers. This makes sharing data much more secure as the data is never stored or transferred on the servers that belong to the company that offers the syncing service. If you do not have an enterprise subscription or are not sure if you do, you should always [[encryption|encrypt]] the data before saving it in a synced folder. DIME Analytics have published [https://github.com/worldbank/dime-standards/blob/master/dime-research-standards/pillar-4-data-security/data-security-resources/veracrypt-guidelines.md guidelines to encrypt files with Veracrypt] which is a secure and free software often used for this purpose.<br />
<br />
=== Cloud storage ===<br />
<br />
Cloud storage comes in many different kinds and it is out of the scope of this wiki article to describe them all. This article will only cover general points about cloud storage. This article only covers data stored in files, so cloud storage here typically means [https://aws.amazon.com/s3/ S3 storage on AWS] or [https://docs.microsoft.com/en-us/azure/storage/blobs/ Blob storage on Azure], but are not limited to those.<br />
<br />
The benefit of cloud storage is that you do not need to buy or upgrade any hardware. To get started or to increase your capacity you only need to click a button or it might even be upgraded automatically. However, you will be charged more the more you use. <br />
<br />
Cloud storage might not be the best option when the data will frequently be accessed from laptops and desktops. Even on a fast internet connection will the download time be too long for most use cases if the data is downloaded each time the script is run. Instead, cloud storage is best used when frequently accessed from another cloud resource. However, make sure that the other cloud storage is in the same physical location as the cloud resource that will use it. The same cloud provider have data centers across the world and you might have issues if your storage is on the other side of the world from the resource that will use it. There are ways to address these issues but they are outside of the scope of this article. The main takeaway is that cloud storage is only a great solution if the resources that needs access to that have a very quick connection to them.<br />
<br />
At the same time, this does not mean that cloud storage should never be accessed from a laptop or a desktop. Cloud storage can be a great place to store data that is infrequently accessed by users on regular computers. There will still be a delay, but that can be acceptable as the data is only infrequently accessed.<br />
<br />
=== Network drive storage ===<br />
<br />
Network drive storage is in many way similar to the types of cloud storage that we discussed above. The major difference for the context discussed here is that they can be located in the same location as a user accessing them from a laptop/desktop making it a viable option for such a user to access it frequently. This is a great solution when a regular laptop/desktop frequently needs access to the data and the combined size of all of the data files are too big to all be stored on the laptop/desktop.<br />
<br />
The problem with this solution is that you need to purchase hardware and run your own network drive. This can be complicated if you do not work at a large organization where there is an IT team that can do this for you. Another drawback is that if you work on an organization that is spread out geographically then you will have the same issues as with cloud storage with slow access speed.<br />
<br />
== Back-up protocols ==<br />
<br />
== Back-up storage types ==<br />
You should never use the same file or storage solution for your back-up copy of your original or raw data, as the files and storage solution you use for your day-to-day work. While some storage types could be used for both day-to-day and backup, this sections covers the former use case. Backing up of data is covered in a later section below. There are multiple types of storage and what type is best for you is how depends on the following factors: size of your data, where it will be used and cost.<br />
<br />
== Data retention ==<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[*topic name, as listed on main page*]]<br />
<br />
<br />
== Additional Resources ==<br />
* list here other articles related to this topic, with a brief description and link<br />
<br />
[[Category: *category name* ]]</div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Data_Storage&diff=8295Data Storage2021-06-03T16:57:26Z<p>Kbjarkefur: /* Version control the code the generates the data */</p>
<hr />
<div>This article discuss different aspects of data storage (such as different types of storage, data back up and data retention). While [[Data_Security]] is a very important topic related to data storage, it is not covered in this article as there is a dedicated link for that.<br />
<br />
<br />
<br />
== Read First ==<br />
* include here key points you want to make sure all readers understand<br />
<br />
<br />
== Version control for data ==<br />
There is no dominant standard for version control of data the way Git is the dominant standard version control of code. Git can be used to both keep track of changes made to code as well as restore older versions of code. There is no system that as elegantly does both of those two things for data. This section therefore suggests different methods and the project team should pick the method best for their exact use case.<br />
<br />
=== Version control code that generates data ===<br />
<br />
All datasets should be generated by code and should be reproducible by re-running that code. Therefore, if you have a good back-up system for the original data and version control all your code using Git, then you have an implicit version control for your data. As long as the original data is unchanged, changes to datasets you generate can only happen through changes to the code tracked in Git. Git can also be used to restore old version of the data as you simple restore an old version of the code and re-generate the data. <br />
<br />
While this method is often an excellent option, it does not work when the original data is updated frequently (ongoing data collection or data streams) or when the code is not accessible (someone else is generating the data).<br />
<br />
=== Version control using checksums/hashes ===<br />
<br />
One way to keep track of changes done to data is to use ''checksums'', ''hashes'' or ''data signatures''. These three concepts all work slightly differently but all follows the same principle and serves the same purpose in the context of version control of data. The principle they all follow is that a datasets is boiled down to a short string of text or a number. Then that string or number can be compared across datasets to test if they are identical or not.<br />
<br />
One very simplified way to explain how this works would be the following example. In this hashing algorithm we start with a word instead of a dataset, but real world hashing algorithms can handle both. Start with a word and then take the corresponding number in the alphabet for each letter, sum those numbers, and then add the digits of each number until you have a single digit. So for "cat" we would get 3+1+20=24 (c=3,a=1,t=20), and in next step 24 would is turned into 6 (2+4=6). So the hash signature for "cat" would be 6. "Horse" would be 8+15+18+19+5=65. 6+5=11 and 1+1=2. So the hash signature would be 2. No matter how big the data is (how long the word is in this simple example) the hash signature will always be the same size and format and we can quickly test if they are the same. The main problem with the very simplified hash algorithm above is that there are only 10 values so many words would share the same signature. However, real world hash algorithms like the <code>checksum</code> command in Stata has signatures on the format "2694850408" which has 10^10 possible signatures.<br />
<br />
With 10^10 possible signatures there is chance that two datasets have the same checksum. This is called "''collisions''". However, checksums are implemented so that two similar dataset are very unlikely to have collisions or even similar checksums, making the risk of two versions of the same dataset having a collision being extremely low. And other algorithms are implemented so that there are many more combinations, but then the hash signature gets longer. Stata also have the command <code>datasignature</code> that has a signature that combines a hash value with some basic data on the dataset such as number of observations. While this provides some useful human readable information, it can only be used on Stata <code>.dta</code> files.<br />
<br />
The drawback of all these methods is that it says nothing about how the dataset was created. So there is no way to recreate a dataset based only on the checksum or hash signature. If you have the checksum of a dataset used in the past but not the corresponding version of the dataset, then all you can do is to say if the current version of the same dataset is identical. You cannot say what differs if they are different. If you used the Stata command <code>datasignature</code> you can get some clues of what differs, such as number of observations or number of variables, but you will still not be able to recreate the old version of the data.<br />
<br />
This method is a good fit for when you want to have a quick way to test that the dataset has not changed, and the details of what has changed does not matter or you have another acceptable and perhaps manual way to find out what those details are. This can be if you are accessing someone else's data set and you want to know if that person makes changes to that dataset. Another way this can be used is to check which if any datasets are updated when the code is updated. Sometimes we want to make changes to the code that should not change the dataset it produces, and then this method is a great way to verify that.<br />
<br />
== Data storage types ==<br />
There are different types of data storage option and which one that is best for each use case typically depends on how big the data is, if there is confidential information in the data and how the data will be used. Below are some different storage types with their pros and cons. The type of data that is considered in this article are data files, and not, for example, data stored in data bases. <br />
<br />
=== File sync services ===<br />
<br />
Common examples of file sync services are DropBox, OneDrive, Box among several others. This storage type is suitable when the data files required for a project is not bigger than that they can be stored on a regular laptop or desktop computer. In file sync services a copy of the file exists locally on all computers that has the folder synced. This means that accessing the file is quick and that there is never a problem for multiple people to access the file at the same time as they have their own version of the file. <br />
<br />
However, because all files are saved locally this also means that if one person works on many projects there is a risk that there is not space for all files for those projects on that persons computer. Many file syncing services have options to mitigate this. For example you can chose to not sync the parts of the project folder that you know you do not need access to. Some sync services also allow you to have non-synced files ready to be downloaded on demand (sometimes referred to as smart sync). However, this is not great for data work as data files are usually to big to be downloaded on demand when your code trying to access them leading to your programming software to crash when trying to read the file.<br />
<br />
Another concern with file syncing services are privacy concerns when sharing confidential data. There are enterprise subscriptions that can be installed on your own servers. This makes sharing data much more secure as the data is never stored or transferred on the servers that belong to the company that offers the syncing service. If you do not have an enterprise subscription or are not sure if you do, you should always [[encryption|encrypt]] the data before saving it in a synced folder. DIME Analytics have published [https://github.com/worldbank/dime-standards/blob/master/dime-research-standards/pillar-4-data-security/data-security-resources/veracrypt-guidelines.md guidelines to encrypt files with Veracrypt] which is a secure and free software often used for this purpose.<br />
<br />
=== Cloud storage ===<br />
<br />
Cloud storage comes in many different kinds and it is out of the scope of this wiki article to describe them all. This article will only cover general points about cloud storage. This article only covers data stored in files, so cloud storage here typically means [https://aws.amazon.com/s3/ S3 storage on AWS] or [https://docs.microsoft.com/en-us/azure/storage/blobs/ Blob storage on Azure], but are not limited to those.<br />
<br />
The benefit of cloud storage is that you do not need to buy or upgrade any hardware. To get started or to increase your capacity you only need to click a button or it might even be upgraded automatically. However, you will be charged more the more you use. <br />
<br />
Cloud storage might not be the best option when the data will frequently be accessed from laptops and desktops. Even on a fast internet connection will the download time be too long for most use cases if the data is downloaded each time the script is run. Instead, cloud storage is best used when frequently accessed from another cloud resource. However, make sure that the other cloud storage is in the same physical location as the cloud resource that will use it. The same cloud provider have data centers across the world and you might have issues if your storage is on the other side of the world from the resource that will use it. There are ways to address these issues but they are outside of the scope of this article. The main takeaway is that cloud storage is only a great solution if the resources that needs access to that have a very quick connection to them.<br />
<br />
At the same time, this does not mean that cloud storage should never be accessed from a laptop or a desktop. Cloud storage can be a great place to store data that is infrequently accessed by users on regular computers. There will still be a delay, but that can be acceptable as the data is only infrequently accessed.<br />
<br />
=== Network drive storage ===<br />
<br />
Network drive storage is in many way similar to the types of cloud storage that we discussed above. The major difference for the context discussed here is that they can be located in the same location as a user accessing them from a laptop/desktop making it a viable option for such a user to access it frequently. This is a great solution when a regular laptop/desktop frequently needs access to the data and the combined size of all of the data files are too big to all be stored on the laptop/desktop.<br />
<br />
The problem with this solution is that you need to purchase hardware and run your own network drive. This can be complicated if you do not work at a large organization where there is an IT team that can do this for you. Another drawback is that if you work on an organization that is spread out geographically then you will have the same issues as with cloud storage with slow access speed.<br />
<br />
== Back-up protocols ==<br />
<br />
== Back-up storage types ==<br />
You should never use the same file or storage solution for your back-up copy of your original or raw data, as the files and storage solution you use for your day-to-day work. While some storage types could be used for both day-to-day and backup, this sections covers the former use case. Backing up of data is covered in a later section below. There are multiple types of storage and what type is best for you is how depends on the following factors: size of your data, where it will be used and cost.<br />
<br />
== Data retention ==<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[*topic name, as listed on main page*]]<br />
<br />
<br />
== Additional Resources ==<br />
* list here other articles related to this topic, with a brief description and link<br />
<br />
[[Category: *category name* ]]</div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Data_Storage&diff=8291Data Storage2021-06-01T23:21:57Z<p>Kbjarkefur: /* Data storage types */</p>
<hr />
<div>This article discuss different aspects of data storage (such as different types of storage, data back up and data retention). While [[Data_Security]] is a very important topic related to data storage, it is not covered in this article as there is a dedicated link for that.<br />
<br />
<br />
<br />
== Read First ==<br />
* include here key points you want to make sure all readers understand<br />
<br />
<br />
== Version control for data ==<br />
There is no dominant standard for version control of data the way Git is the dominant standard version control of code. Git can be used to both keep track of changes made to code as well as restore older versions of code. There is no system that as elegantly does both of those two things for data. This section therefore suggests different methods and the project team should pick the method best for their exact use case.<br />
<br />
=== Version control the code the generates the data ===<br />
<br />
All datasets should be generated by code and should be reproducible by re-running that code. Therefore, if you have a good back-up system for the original data and version control all your code using Git, then you have an implicit version control for your data. As long as the original data is unchanged, changes to datasets you generate can only happen through changes to the code tracked in Git. Git can also be used to restore old version of the data as you simple restore an old version of the code and re-generate the data. <br />
<br />
While this method is often an excellent option, it does not work when the original data is updated frequently (ongoing data collection or data streams) or when the code is not accessible (someone else is generating the data).<br />
<br />
=== Version control using checksums/hashes ===<br />
<br />
One way to keep track of changes done to data is to use ''checksums'', ''hashes'' or ''data signatures''. These three concepts all work slightly differently but all follows the same principle and serves the same purpose in the context of version control of data. The principle they all follow is that a datasets is boiled down to a short string of text or a number. Then that string or number can be compared across datasets to test if they are identical or not.<br />
<br />
One very simplified way to explain how this works would be the following example. In this hashing algorithm we start with a word instead of a dataset, but real world hashing algorithms can handle both. Start with a word and then take the corresponding number in the alphabet for each letter, sum those numbers, and then add the digits of each number until you have a single digit. So for "cat" we would get 3+1+20=24 (c=3,a=1,t=20), and in next step 24 would is turned into 6 (2+4=6). So the hash signature for "cat" would be 6. "Horse" would be 8+15+18+19+5=65. 6+5=11 and 1+1=2. So the hash signature would be 2. No matter how big the data is (how long the word is in this simple example) the hash signature will always be the same size and format and we can quickly test if they are the same. The main problem with the very simplified hash algorithm above is that there are only 10 values so many words would share the same signature. However, real world hash algorithms like the <code>checksum</code> command in Stata has signatures on the format "2694850408" which has 10^10 possible signatures.<br />
<br />
With 10^10 possible signatures there is chance that two datasets have the same checksum. This is called "''collisions''". However, checksums are implemented so that two similar dataset are very unlikely to have collisions or even similar checksums, making the risk of two versions of the same dataset having a collision being extremely low. And other algorithms are implemented so that there are many more combinations, but then the hash signature gets longer. Stata also have the command <code>datasignature</code> that has a signature that combines a hash value with some basic data on the dataset such as number of observations. While this provides some useful human readable information, it can only be used on Stata <code>.dta</code> files.<br />
<br />
The drawback of all these methods is that it says nothing about how the dataset was created. So there is no way to recreate a dataset based only on the checksum or hash signature. If you have the checksum of a dataset used in the past but not the corresponding version of the dataset, then all you can do is to say if the current version of the same dataset is identical. You cannot say what differs if they are different. If you used the Stata command <code>datasignature</code> you can get some clues of what differs, such as number of observations or number of variables, but you will still not be able to recreate the old version of the data.<br />
<br />
This method is a good fit for when you want to have a quick way to test that the dataset has not changed, and the details of what has changed does not matter or you have another acceptable and perhaps manual way to find out what those details are. This can be if you are accessing someone else's data set and you want to know if that person makes changes to that dataset. Another way this can be used is to check which if any datasets are updated when the code is updated. Sometimes we want to make changes to the code that should not change the dataset it produces, and then this method is a great way to verify that.<br />
<br />
== Data storage types ==<br />
There are different types of data storage option and which one that is best for each use case typically depends on how big the data is, if there is confidential information in the data and how the data will be used. Below are some different storage types with their pros and cons. The type of data that is considered in this article are data files, and not, for example, data stored in data bases. <br />
<br />
=== File sync services ===<br />
<br />
Common examples of file sync services are DropBox, OneDrive, Box among several others. This storage type is suitable when the data files required for a project is not bigger than that they can be stored on a regular laptop or desktop computer. In file sync services a copy of the file exists locally on all computers that has the folder synced. This means that accessing the file is quick and that there is never a problem for multiple people to access the file at the same time as they have their own version of the file. <br />
<br />
However, because all files are saved locally this also means that if one person works on many projects there is a risk that there is not space for all files for those projects on that persons computer. Many file syncing services have options to mitigate this. For example you can chose to not sync the parts of the project folder that you know you do not need access to. Some sync services also allow you to have non-synced files ready to be downloaded on demand (sometimes referred to as smart sync). However, this is not great for data work as data files are usually to big to be downloaded on demand when your code trying to access them leading to your programming software to crash when trying to read the file.<br />
<br />
Another concern with file syncing services are privacy concerns when sharing confidential data. There are enterprise subscriptions that can be installed on your own servers. This makes sharing data much more secure as the data is never stored or transferred on the servers that belong to the company that offers the syncing service. If you do not have an enterprise subscription or are not sure if you do, you should always [[encryption|encrypt]] the data before saving it in a synced folder. DIME Analytics have published [https://github.com/worldbank/dime-standards/blob/master/dime-research-standards/pillar-4-data-security/data-security-resources/veracrypt-guidelines.md guidelines to encrypt files with Veracrypt] which is a secure and free software often used for this purpose.<br />
<br />
=== Cloud storage ===<br />
<br />
Cloud storage comes in many different kinds and it is out of the scope of this wiki article to describe them all. This article will only cover general points about cloud storage. This article only covers data stored in files, so cloud storage here typically means [https://aws.amazon.com/s3/ S3 storage on AWS] or [https://docs.microsoft.com/en-us/azure/storage/blobs/ Blob storage on Azure], but are not limited to those.<br />
<br />
The benefit of cloud storage is that you do not need to buy or upgrade any hardware. To get started or to increase your capacity you only need to click a button or it might even be upgraded automatically. However, you will be charged more the more you use. <br />
<br />
Cloud storage might not be the best option when the data will frequently be accessed from laptops and desktops. Even on a fast internet connection will the download time be too long for most use cases if the data is downloaded each time the script is run. Instead, cloud storage is best used when frequently accessed from another cloud resource. However, make sure that the other cloud storage is in the same physical location as the cloud resource that will use it. The same cloud provider have data centers across the world and you might have issues if your storage is on the other side of the world from the resource that will use it. There are ways to address these issues but they are outside of the scope of this article. The main takeaway is that cloud storage is only a great solution if the resources that needs access to that have a very quick connection to them.<br />
<br />
At the same time, this does not mean that cloud storage should never be accessed from a laptop or a desktop. Cloud storage can be a great place to store data that is infrequently accessed by users on regular computers. There will still be a delay, but that can be acceptable as the data is only infrequently accessed.<br />
<br />
=== Network drive storage ===<br />
<br />
Network drive storage is in many way similar to the types of cloud storage that we discussed above. The major difference for the context discussed here is that they can be located in the same location as a user accessing them from a laptop/desktop making it a viable option for such a user to access it frequently. This is a great solution when a regular laptop/desktop frequently needs access to the data and the combined size of all of the data files are too big to all be stored on the laptop/desktop.<br />
<br />
The problem with this solution is that you need to purchase hardware and run your own network drive. This can be complicated if you do not work at a large organization where there is an IT team that can do this for you. Another drawback is that if you work on an organization that is spread out geographically then you will have the same issues as with cloud storage with slow access speed.<br />
<br />
== Back-up protocols ==<br />
<br />
== Back-up storage types ==<br />
You should never use the same file or storage solution for your back-up copy of your original or raw data, as the files and storage solution you use for your day-to-day work. While some storage types could be used for both day-to-day and backup, this sections covers the former use case. Backing up of data is covered in a later section below. There are multiple types of storage and what type is best for you is how depends on the following factors: size of your data, where it will be used and cost.<br />
<br />
== Data retention ==<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[*topic name, as listed on main page*]]<br />
<br />
<br />
== Additional Resources ==<br />
* list here other articles related to this topic, with a brief description and link<br />
<br />
[[Category: *category name* ]]</div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Data_Storage&diff=8290Data Storage2021-06-01T23:21:27Z<p>Kbjarkefur: /* Cloud storage */</p>
<hr />
<div>This article discuss different aspects of data storage (such as different types of storage, data back up and data retention). While [[Data_Security]] is a very important topic related to data storage, it is not covered in this article as there is a dedicated link for that.<br />
<br />
<br />
<br />
== Read First ==<br />
* include here key points you want to make sure all readers understand<br />
<br />
<br />
== Version control for data ==<br />
There is no dominant standard for version control of data the way Git is the dominant standard version control of code. Git can be used to both keep track of changes made to code as well as restore older versions of code. There is no system that as elegantly does both of those two things for data. This section therefore suggests different methods and the project team should pick the method best for their exact use case.<br />
<br />
=== Version control the code the generates the data ===<br />
<br />
All datasets should be generated by code and should be reproducible by re-running that code. Therefore, if you have a good back-up system for the original data and version control all your code using Git, then you have an implicit version control for your data. As long as the original data is unchanged, changes to datasets you generate can only happen through changes to the code tracked in Git. Git can also be used to restore old version of the data as you simple restore an old version of the code and re-generate the data. <br />
<br />
While this method is often an excellent option, it does not work when the original data is updated frequently (ongoing data collection or data streams) or when the code is not accessible (someone else is generating the data).<br />
<br />
=== Version control using checksums/hashes ===<br />
<br />
One way to keep track of changes done to data is to use ''checksums'', ''hashes'' or ''data signatures''. These three concepts all work slightly differently but all follows the same principle and serves the same purpose in the context of version control of data. The principle they all follow is that a datasets is boiled down to a short string of text or a number. Then that string or number can be compared across datasets to test if they are identical or not.<br />
<br />
One very simplified way to explain how this works would be the following example. In this hashing algorithm we start with a word instead of a dataset, but real world hashing algorithms can handle both. Start with a word and then take the corresponding number in the alphabet for each letter, sum those numbers, and then add the digits of each number until you have a single digit. So for "cat" we would get 3+1+20=24 (c=3,a=1,t=20), and in next step 24 would is turned into 6 (2+4=6). So the hash signature for "cat" would be 6. "Horse" would be 8+15+18+19+5=65. 6+5=11 and 1+1=2. So the hash signature would be 2. No matter how big the data is (how long the word is in this simple example) the hash signature will always be the same size and format and we can quickly test if they are the same. The main problem with the very simplified hash algorithm above is that there are only 10 values so many words would share the same signature. However, real world hash algorithms like the <code>checksum</code> command in Stata has signatures on the format "2694850408" which has 10^10 possible signatures.<br />
<br />
With 10^10 possible signatures there is chance that two datasets have the same checksum. This is called "''collisions''". However, checksums are implemented so that two similar dataset are very unlikely to have collisions or even similar checksums, making the risk of two versions of the same dataset having a collision being extremely low. And other algorithms are implemented so that there are many more combinations, but then the hash signature gets longer. Stata also have the command <code>datasignature</code> that has a signature that combines a hash value with some basic data on the dataset such as number of observations. While this provides some useful human readable information, it can only be used on Stata <code>.dta</code> files.<br />
<br />
The drawback of all these methods is that it says nothing about how the dataset was created. So there is no way to recreate a dataset based only on the checksum or hash signature. If you have the checksum of a dataset used in the past but not the corresponding version of the dataset, then all you can do is to say if the current version of the same dataset is identical. You cannot say what differs if they are different. If you used the Stata command <code>datasignature</code> you can get some clues of what differs, such as number of observations or number of variables, but you will still not be able to recreate the old version of the data.<br />
<br />
This method is a good fit for when you want to have a quick way to test that the dataset has not changed, and the details of what has changed does not matter or you have another acceptable and perhaps manual way to find out what those details are. This can be if you are accessing someone else's data set and you want to know if that person makes changes to that dataset. Another way this can be used is to check which if any datasets are updated when the code is updated. Sometimes we want to make changes to the code that should not change the dataset it produces, and then this method is a great way to verify that.<br />
<br />
== Data storage types ==<br />
There are different types of data storage option and which one that is best for each use case typically depends on how big the data is, if there is confidential information in the data and how the data will be used. Below are some different storage types with their pros and cons. The type of data that is considered in this article are data files, and not, for example, data stored in data bases. <br />
<br />
=== File sync services ===<br />
<br />
Common examples of file sync services are DropBox, OneDrive, Box among several others. This storage type is suitable when the data files required for a project is not bigger than that they can be stored on a regular laptop or desktop computer. In file sync services a copy of the file exists locally on all computers that has the folder synced. This means that accessing the file is quick and that there is never a problem for multiple people to access the file at the same time as they have their own version of the file. <br />
<br />
However, because all files are saved locally this also means that if one person works on many projects there is a risk that there is not space for all files for those projects on that persons computer. Many file syncing services have options to mitigate this. For example you can chose to not sync the parts of the project folder that you know you do not need access to. Some sync services also allow you to have non-synced files ready to be downloaded on demand (sometimes referred to as smart sync). However, this is not great for data work as data files are usually to big to be downloaded on demand when your code trying to access them leading to your programming software to crash when trying to read the file.<br />
<br />
Another concern with file syncing services are privacy concerns when sharing confidential data. There are enterprise subscriptions that can be installed on your own servers. This makes sharing data much more secure as the data is never stored or transferred on the servers that belong to the company that offers the syncing service. If you do not have an enterprise subscription or are not sure if you do, you should always [[encryption|encrypt]] the data before saving it in a synced folder. DIME Analytics have published [https://github.com/worldbank/dime-standards/blob/master/dime-research-standards/pillar-4-data-security/data-security-resources/veracrypt-guidelines.md guidelines to encrypt files with Veracrypt] which is a secure and free software often used for this purpose.<br />
<br />
=== Cloud storage ===<br />
<br />
Cloud storage comes in many different kinds and it is out of the scope of this wiki article to describe them all. This article will only cover general points about cloud storage. This article only covers data stored in files, so cloud storage here typically means [https://aws.amazon.com/s3/ S3 storage on AWS] or [https://docs.microsoft.com/en-us/azure/storage/blobs/ Blob storage on Azure], but are not limited to those.<br />
<br />
The benefit of cloud storage is that you do not need to buy or upgrade any hardware. To get started or to increase your capacity you only need to click a button or it might even be upgraded automatically. However, you will be charged more the more you use. <br />
<br />
Cloud storage might not be the best option when the data will frequently be accessed from laptops and desktops. Even on a fast internet connection will the download time be too long for most use cases if the data is downloaded each time the script is run. Instead, cloud storage is best used when frequently accessed from another cloud resource. However, make sure that the other cloud storage is in the same physical location as the cloud resource that will use it. The same cloud provider have data centers across the world and you might have issues if your storage is on the other side of the world from the resource that will use it. There are ways to address these issues but they are outside of the scope of this article. The main takeaway is that cloud storage is only a great solution if the resources that needs access to that have a very quick connection to them.<br />
<br />
At the same time, this does not mean that cloud storage should never be accessed from a laptop or a desktop. Cloud storage can be a great place to store data that is infrequently accessed by users on regular computers. There will still be a delay, but that can be acceptable as the data is only infrequently accessed.<br />
<br />
=== Network drive storage ===<br />
<br />
Network drive storage is in many way similar to the types of cloud storage that we discussed above. The major difference for the context discussed here is that they can be located in the same location as a user accessing them from a laptop/desktop making it a viable option for such a user to access it frequently. This is a great solution when a regular laptop/desktop frequently needs access to the data and the combined size of all of the data files are too big to all be stored on the laptop/desktop.<br />
<br />
The problem with this solution is that you need to purchase hardware and run your own network drive. This can be complicated if you do not work at a large organization where there is an IT team that can do this for you. Another drawback is that if you work on an organization that is spread out geographically then you will have the same issues as with cloud storage with slow access speed.<br />
<br />
=== Storage for different size of data ===<br />
<br />
A useful rule of thumb when deciding which type of storage is suitable given the size of your data is if the total size of all data in your project can fit the space usually available on a typical user's laptop. If the data is small enough to fit on a regular laptop then synced storages (such as World Bank OneDrive or DropBox) becomes available. In synced storages each user has their own copy saved on their computer, and the sync software makes sure that all users have identical files. This is different from storing data on network drives of cloud storage where it is the same file each user uses, not just an identical file. When each user has their own version of the file access to that file tend to be faster. <br />
<br />
If the total size of all data in the project folder is too big to fit on a regular but the size of the files relevant to each users is not too big, then synced storages can still be used if the syncing service allows you to sync only specific folders or files in your project folder.<br />
<br />
If the data in the project folder is too big to be synced to a typical laptop, then a the data can be stored in a network drive or a cloud storage. However, in these solution there is no copy of the file stored on each users hard drive, and depending on the exact service used and connectivity speed, this can be slow. And even if network or cloud storage has next to unlimited storage capacity<br />
<br />
== Back-up protocols ==<br />
<br />
== Back-up storage types ==<br />
You should never use the same file or storage solution for your back-up copy of your original or raw data, as the files and storage solution you use for your day-to-day work. While some storage types could be used for both day-to-day and backup, this sections covers the former use case. Backing up of data is covered in a later section below. There are multiple types of storage and what type is best for you is how depends on the following factors: size of your data, where it will be used and cost.<br />
<br />
== Data retention ==<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[*topic name, as listed on main page*]]<br />
<br />
<br />
== Additional Resources ==<br />
* list here other articles related to this topic, with a brief description and link<br />
<br />
[[Category: *category name* ]]</div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Data_Storage&diff=8289Data Storage2021-06-01T19:36:38Z<p>Kbjarkefur: /* Network drives */</p>
<hr />
<div>This article discuss different aspects of data storage (such as different types of storage, data back up and data retention). While [[Data_Security]] is a very important topic related to data storage, it is not covered in this article as there is a dedicated link for that.<br />
<br />
<br />
<br />
== Read First ==<br />
* include here key points you want to make sure all readers understand<br />
<br />
<br />
== Version control for data ==<br />
There is no dominant standard for version control of data the way Git is the dominant standard version control of code. Git can be used to both keep track of changes made to code as well as restore older versions of code. There is no system that as elegantly does both of those two things for data. This section therefore suggests different methods and the project team should pick the method best for their exact use case.<br />
<br />
=== Version control the code the generates the data ===<br />
<br />
All datasets should be generated by code and should be reproducible by re-running that code. Therefore, if you have a good back-up system for the original data and version control all your code using Git, then you have an implicit version control for your data. As long as the original data is unchanged, changes to datasets you generate can only happen through changes to the code tracked in Git. Git can also be used to restore old version of the data as you simple restore an old version of the code and re-generate the data. <br />
<br />
While this method is often an excellent option, it does not work when the original data is updated frequently (ongoing data collection or data streams) or when the code is not accessible (someone else is generating the data).<br />
<br />
=== Version control using checksums/hashes ===<br />
<br />
One way to keep track of changes done to data is to use ''checksums'', ''hashes'' or ''data signatures''. These three concepts all work slightly differently but all follows the same principle and serves the same purpose in the context of version control of data. The principle they all follow is that a datasets is boiled down to a short string of text or a number. Then that string or number can be compared across datasets to test if they are identical or not.<br />
<br />
One very simplified way to explain how this works would be the following example. In this hashing algorithm we start with a word instead of a dataset, but real world hashing algorithms can handle both. Start with a word and then take the corresponding number in the alphabet for each letter, sum those numbers, and then add the digits of each number until you have a single digit. So for "cat" we would get 3+1+20=24 (c=3,a=1,t=20), and in next step 24 would is turned into 6 (2+4=6). So the hash signature for "cat" would be 6. "Horse" would be 8+15+18+19+5=65. 6+5=11 and 1+1=2. So the hash signature would be 2. No matter how big the data is (how long the word is in this simple example) the hash signature will always be the same size and format and we can quickly test if they are the same. The main problem with the very simplified hash algorithm above is that there are only 10 values so many words would share the same signature. However, real world hash algorithms like the <code>checksum</code> command in Stata has signatures on the format "2694850408" which has 10^10 possible signatures.<br />
<br />
With 10^10 possible signatures there is chance that two datasets have the same checksum. This is called "''collisions''". However, checksums are implemented so that two similar dataset are very unlikely to have collisions or even similar checksums, making the risk of two versions of the same dataset having a collision being extremely low. And other algorithms are implemented so that there are many more combinations, but then the hash signature gets longer. Stata also have the command <code>datasignature</code> that has a signature that combines a hash value with some basic data on the dataset such as number of observations. While this provides some useful human readable information, it can only be used on Stata <code>.dta</code> files.<br />
<br />
The drawback of all these methods is that it says nothing about how the dataset was created. So there is no way to recreate a dataset based only on the checksum or hash signature. If you have the checksum of a dataset used in the past but not the corresponding version of the dataset, then all you can do is to say if the current version of the same dataset is identical. You cannot say what differs if they are different. If you used the Stata command <code>datasignature</code> you can get some clues of what differs, such as number of observations or number of variables, but you will still not be able to recreate the old version of the data.<br />
<br />
This method is a good fit for when you want to have a quick way to test that the dataset has not changed, and the details of what has changed does not matter or you have another acceptable and perhaps manual way to find out what those details are. This can be if you are accessing someone else's data set and you want to know if that person makes changes to that dataset. Another way this can be used is to check which if any datasets are updated when the code is updated. Sometimes we want to make changes to the code that should not change the dataset it produces, and then this method is a great way to verify that.<br />
<br />
== Data storage types ==<br />
There are different types of data storage option and which one that is best for each use case typically depends on how big the data is, if there is confidential information in the data and how the data will be used. Below are some different storage types with their pros and cons. The type of data that is considered in this article are data files, and not, for example, data stored in data bases. <br />
<br />
=== File sync services ===<br />
<br />
Common examples of file sync services are DropBox, OneDrive, Box among several others. This storage type is suitable when the data files required for a project is not bigger than that they can be stored on a regular laptop or desktop computer. In file sync services a copy of the file exists locally on all computers that has the folder synced. This means that accessing the file is quick and that there is never a problem for multiple people to access the file at the same time as they have their own version of the file. <br />
<br />
However, because all files are saved locally this also means that if one person works on many projects there is a risk that there is not space for all files for those projects on that persons computer. Many file syncing services have options to mitigate this. For example you can chose to not sync the parts of the project folder that you know you do not need access to. Some sync services also allow you to have non-synced files ready to be downloaded on demand (sometimes referred to as smart sync). However, this is not great for data work as data files are usually to big to be downloaded on demand when your code trying to access them leading to your programming software to crash when trying to read the file.<br />
<br />
Another concern with file syncing services are privacy concerns when sharing confidential data. There are enterprise subscriptions that can be installed on your own servers. This makes sharing data much more secure as the data is never stored or transferred on the servers that belong to the company that offers the syncing service. If you do not have an enterprise subscription or are not sure if you do, you should always [[encryption|encrypt]] the data before saving it in a synced folder. DIME Analytics have published [https://github.com/worldbank/dime-standards/blob/master/dime-research-standards/pillar-4-data-security/data-security-resources/veracrypt-guidelines.md guidelines to encrypt files with Veracrypt] which is a secure and free software often used for this purpose.<br />
<br />
=== Cloud storage ===<br />
<br />
Cloud storage comes in many different kinds and it is out of the scope of this wiki article to describe them all. This article will only cover general points about cloud storage. This article only covers data stored in files, so cloud storage here typically means [https://aws.amazon.com/s3/ S3 storage on AWS] or [https://docs.microsoft.com/en-us/azure/storage/blobs/ Blob storage on Azure], but are not limited to those.<br />
<br />
The benefit of cloud storage is that you do not need to buy or upgrade any hardware. To get started or to increase your capacity you only need to click a button or it might even be upgraded automatically. However, you will be charged more the more you use. <br />
<br />
Cloud storage might not be the best option when the data will frequently be accessed from laptops and desktops. Even on a fast internet connection will the download time be too long for most use cases if the data is downloaded each time the script is run. Instead, cloud storage is best used when frequently accessed from another cloud resource. However, make sure that the other cloud storage is in the same physical location as the cloud resource that will use it. The same cloud provider have data centers across the world and you might have issues if your storage is on the other side of the world from the resource that will use it. There are ways to address these issues but they are outside of the scope of this article. The main takeaway is that cloud storage is only a great solution if the resources that needs access to that have a very quick connection to them.<br />
<br />
At the same time, this does not mean that cloud storage should never be accessed from a laptop or a desktop. Cloud storage can be a great place to store data that is infrequently accessed by users on regular computers. There will still be a delay, but that can be acceptable as the data is only infrequently accessed.<br />
<br />
=== Cloud storage ===<br />
<br />
=== Storage for different size of data ===<br />
<br />
A useful rule of thumb when deciding which type of storage is suitable given the size of your data is if the total size of all data in your project can fit the space usually available on a typical user's laptop. If the data is small enough to fit on a regular laptop then synced storages (such as World Bank OneDrive or DropBox) becomes available. In synced storages each user has their own copy saved on their computer, and the sync software makes sure that all users have identical files. This is different from storing data on network drives of cloud storage where it is the same file each user uses, not just an identical file. When each user has their own version of the file access to that file tend to be faster. <br />
<br />
If the total size of all data in the project folder is too big to fit on a regular but the size of the files relevant to each users is not too big, then synced storages can still be used if the syncing service allows you to sync only specific folders or files in your project folder.<br />
<br />
If the data in the project folder is too big to be synced to a typical laptop, then a the data can be stored in a network drive or a cloud storage. However, in these solution there is no copy of the file stored on each users hard drive, and depending on the exact service used and connectivity speed, this can be slow. And even if network or cloud storage has next to unlimited storage capacity<br />
<br />
== Back-up protocols ==<br />
<br />
== Back-up storage types ==<br />
You should never use the same file or storage solution for your back-up copy of your original or raw data, as the files and storage solution you use for your day-to-day work. While some storage types could be used for both day-to-day and backup, this sections covers the former use case. Backing up of data is covered in a later section below. There are multiple types of storage and what type is best for you is how depends on the following factors: size of your data, where it will be used and cost.<br />
<br />
== Data retention ==<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[*topic name, as listed on main page*]]<br />
<br />
<br />
== Additional Resources ==<br />
* list here other articles related to this topic, with a brief description and link<br />
<br />
[[Category: *category name* ]]</div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Data_Storage&diff=8288Data Storage2021-06-01T18:19:33Z<p>Kbjarkefur: /* File sync services */</p>
<hr />
<div>This article discuss different aspects of data storage (such as different types of storage, data back up and data retention). While [[Data_Security]] is a very important topic related to data storage, it is not covered in this article as there is a dedicated link for that.<br />
<br />
<br />
<br />
== Read First ==<br />
* include here key points you want to make sure all readers understand<br />
<br />
<br />
== Version control for data ==<br />
There is no dominant standard for version control of data the way Git is the dominant standard version control of code. Git can be used to both keep track of changes made to code as well as restore older versions of code. There is no system that as elegantly does both of those two things for data. This section therefore suggests different methods and the project team should pick the method best for their exact use case.<br />
<br />
=== Version control the code the generates the data ===<br />
<br />
All datasets should be generated by code and should be reproducible by re-running that code. Therefore, if you have a good back-up system for the original data and version control all your code using Git, then you have an implicit version control for your data. As long as the original data is unchanged, changes to datasets you generate can only happen through changes to the code tracked in Git. Git can also be used to restore old version of the data as you simple restore an old version of the code and re-generate the data. <br />
<br />
While this method is often an excellent option, it does not work when the original data is updated frequently (ongoing data collection or data streams) or when the code is not accessible (someone else is generating the data).<br />
<br />
=== Version control using checksums/hashes ===<br />
<br />
One way to keep track of changes done to data is to use ''checksums'', ''hashes'' or ''data signatures''. These three concepts all work slightly differently but all follows the same principle and serves the same purpose in the context of version control of data. The principle they all follow is that a datasets is boiled down to a short string of text or a number. Then that string or number can be compared across datasets to test if they are identical or not.<br />
<br />
One very simplified way to explain how this works would be the following example. In this hashing algorithm we start with a word instead of a dataset, but real world hashing algorithms can handle both. Start with a word and then take the corresponding number in the alphabet for each letter, sum those numbers, and then add the digits of each number until you have a single digit. So for "cat" we would get 3+1+20=24 (c=3,a=1,t=20), and in next step 24 would is turned into 6 (2+4=6). So the hash signature for "cat" would be 6. "Horse" would be 8+15+18+19+5=65. 6+5=11 and 1+1=2. So the hash signature would be 2. No matter how big the data is (how long the word is in this simple example) the hash signature will always be the same size and format and we can quickly test if they are the same. The main problem with the very simplified hash algorithm above is that there are only 10 values so many words would share the same signature. However, real world hash algorithms like the <code>checksum</code> command in Stata has signatures on the format "2694850408" which has 10^10 possible signatures.<br />
<br />
With 10^10 possible signatures there is chance that two datasets have the same checksum. This is called "''collisions''". However, checksums are implemented so that two similar dataset are very unlikely to have collisions or even similar checksums, making the risk of two versions of the same dataset having a collision being extremely low. And other algorithms are implemented so that there are many more combinations, but then the hash signature gets longer. Stata also have the command <code>datasignature</code> that has a signature that combines a hash value with some basic data on the dataset such as number of observations. While this provides some useful human readable information, it can only be used on Stata <code>.dta</code> files.<br />
<br />
The drawback of all these methods is that it says nothing about how the dataset was created. So there is no way to recreate a dataset based only on the checksum or hash signature. If you have the checksum of a dataset used in the past but not the corresponding version of the dataset, then all you can do is to say if the current version of the same dataset is identical. You cannot say what differs if they are different. If you used the Stata command <code>datasignature</code> you can get some clues of what differs, such as number of observations or number of variables, but you will still not be able to recreate the old version of the data.<br />
<br />
This method is a good fit for when you want to have a quick way to test that the dataset has not changed, and the details of what has changed does not matter or you have another acceptable and perhaps manual way to find out what those details are. This can be if you are accessing someone else's data set and you want to know if that person makes changes to that dataset. Another way this can be used is to check which if any datasets are updated when the code is updated. Sometimes we want to make changes to the code that should not change the dataset it produces, and then this method is a great way to verify that.<br />
<br />
== Data storage types ==<br />
There are different types of data storage option and which one that is best for each use case typically depends on how big the data is, if there is confidential information in the data and how the data will be used. Below are some different storage types with their pros and cons. The type of data that is considered in this article are data files, and not, for example, data stored in data bases. <br />
<br />
=== File sync services ===<br />
<br />
Common examples of file sync services are DropBox, OneDrive, Box among several others. This storage type is suitable when the data files required for a project is not bigger than that they can be stored on a regular laptop or desktop computer. In file sync services a copy of the file exists locally on all computers that has the folder synced. This means that accessing the file is quick and that there is never a problem for multiple people to access the file at the same time as they have their own version of the file. <br />
<br />
However, because all files are saved locally this also means that if one person works on many projects there is a risk that there is not space for all files for those projects on that persons computer. Many file syncing services have options to mitigate this. For example you can chose to not sync the parts of the project folder that you know you do not need access to. Some sync services also allow you to have non-synced files ready to be downloaded on demand (sometimes referred to as smart sync). However, this is not great for data work as data files are usually to big to be downloaded on demand when your code trying to access them leading to your programming software to crash when trying to read the file.<br />
<br />
Another concern with file syncing services are privacy concerns when sharing confidential data. There are enterprise subscriptions that can be installed on your own servers. This makes sharing data much more secure as the data is never stored or transferred on the servers that belong to the company that offers the syncing service. If you do not have an enterprise subscription or are not sure if you do, you should always [[encryption|encrypt]] the data before saving it in a synced folder. DIME Analytics have published [https://github.com/worldbank/dime-standards/blob/master/dime-research-standards/pillar-4-data-security/data-security-resources/veracrypt-guidelines.md guidelines to encrypt files with Veracrypt] which is a secure and free software often used for this purpose.<br />
<br />
=== Network drives ===<br />
<br />
=== Cloud storage ===<br />
<br />
=== Storage for different size of data ===<br />
<br />
A useful rule of thumb when deciding which type of storage is suitable given the size of your data is if the total size of all data in your project can fit the space usually available on a typical user's laptop. If the data is small enough to fit on a regular laptop then synced storages (such as World Bank OneDrive or DropBox) becomes available. In synced storages each user has their own copy saved on their computer, and the sync software makes sure that all users have identical files. This is different from storing data on network drives of cloud storage where it is the same file each user uses, not just an identical file. When each user has their own version of the file access to that file tend to be faster. <br />
<br />
If the total size of all data in the project folder is too big to fit on a regular but the size of the files relevant to each users is not too big, then synced storages can still be used if the syncing service allows you to sync only specific folders or files in your project folder.<br />
<br />
If the data in the project folder is too big to be synced to a typical laptop, then a the data can be stored in a network drive or a cloud storage. However, in these solution there is no copy of the file stored on each users hard drive, and depending on the exact service used and connectivity speed, this can be slow. And even if network or cloud storage has next to unlimited storage capacity<br />
<br />
== Back-up protocols ==<br />
<br />
== Back-up storage types ==<br />
You should never use the same file or storage solution for your back-up copy of your original or raw data, as the files and storage solution you use for your day-to-day work. While some storage types could be used for both day-to-day and backup, this sections covers the former use case. Backing up of data is covered in a later section below. There are multiple types of storage and what type is best for you is how depends on the following factors: size of your data, where it will be used and cost.<br />
<br />
== Data retention ==<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[*topic name, as listed on main page*]]<br />
<br />
<br />
== Additional Resources ==<br />
* list here other articles related to this topic, with a brief description and link<br />
<br />
[[Category: *category name* ]]</div>Kbjarkefurhttps://dimewiki.worldbank.org/index.php?title=Data_Storage&diff=8287Data Storage2021-06-01T18:17:14Z<p>Kbjarkefur: /* File sync services */</p>
<hr />
<div>This article discuss different aspects of data storage (such as different types of storage, data back up and data retention). While [[Data_Security]] is a very important topic related to data storage, it is not covered in this article as there is a dedicated link for that.<br />
<br />
<br />
<br />
== Read First ==<br />
* include here key points you want to make sure all readers understand<br />
<br />
<br />
== Version control for data ==<br />
There is no dominant standard for version control of data the way Git is the dominant standard version control of code. Git can be used to both keep track of changes made to code as well as restore older versions of code. There is no system that as elegantly does both of those two things for data. This section therefore suggests different methods and the project team should pick the method best for their exact use case.<br />
<br />
=== Version control the code the generates the data ===<br />
<br />
All datasets should be generated by code and should be reproducible by re-running that code. Therefore, if you have a good back-up system for the original data and version control all your code using Git, then you have an implicit version control for your data. As long as the original data is unchanged, changes to datasets you generate can only happen through changes to the code tracked in Git. Git can also be used to restore old version of the data as you simple restore an old version of the code and re-generate the data. <br />
<br />
While this method is often an excellent option, it does not work when the original data is updated frequently (ongoing data collection or data streams) or when the code is not accessible (someone else is generating the data).<br />
<br />
=== Version control using checksums/hashes ===<br />
<br />
One way to keep track of changes done to data is to use ''checksums'', ''hashes'' or ''data signatures''. These three concepts all work slightly differently but all follows the same principle and serves the same purpose in the context of version control of data. The principle they all follow is that a datasets is boiled down to a short string of text or a number. Then that string or number can be compared across datasets to test if they are identical or not.<br />
<br />
One very simplified way to explain how this works would be the following example. In this hashing algorithm we start with a word instead of a dataset, but real world hashing algorithms can handle both. Start with a word and then take the corresponding number in the alphabet for each letter, sum those numbers, and then add the digits of each number until you have a single digit. So for "cat" we would get 3+1+20=24 (c=3,a=1,t=20), and in next step 24 would is turned into 6 (2+4=6). So the hash signature for "cat" would be 6. "Horse" would be 8+15+18+19+5=65. 6+5=11 and 1+1=2. So the hash signature would be 2. No matter how big the data is (how long the word is in this simple example) the hash signature will always be the same size and format and we can quickly test if they are the same. The main problem with the very simplified hash algorithm above is that there are only 10 values so many words would share the same signature. However, real world hash algorithms like the <code>checksum</code> command in Stata has signatures on the format "2694850408" which has 10^10 possible signatures.<br />
<br />
With 10^10 possible signatures there is chance that two datasets have the same checksum. This is called "''collisions''". However, checksums are implemented so that two similar dataset are very unlikely to have collisions or even similar checksums, making the risk of two versions of the same dataset having a collision being extremely low. And other algorithms are implemented so that there are many more combinations, but then the hash signature gets longer. Stata also have the command <code>datasignature</code> that has a signature that combines a hash value with some basic data on the dataset such as number of observations. While this provides some useful human readable information, it can only be used on Stata <code>.dta</code> files.<br />
<br />
The drawback of all these methods is that it says nothing about how the dataset was created. So there is no way to recreate a dataset based only on the checksum or hash signature. If you have the checksum of a dataset used in the past but not the corresponding version of the dataset, then all you can do is to say if the current version of the same dataset is identical. You cannot say what differs if they are different. If you used the Stata command <code>datasignature</code> you can get some clues of what differs, such as number of observations or number of variables, but you will still not be able to recreate the old version of the data.<br />
<br />
This method is a good fit for when you want to have a quick way to test that the dataset has not changed, and the details of what has changed does not matter or you have another acceptable and perhaps manual way to find out what those details are. This can be if you are accessing someone else's data set and you want to know if that person makes changes to that dataset. Another way this can be used is to check which if any datasets are updated when the code is updated. Sometimes we want to make changes to the code that should not change the dataset it produces, and then this method is a great way to verify that.<br />
<br />
== Data storage types ==<br />
There are different types of data storage option and which one that is best for each use case typically depends on how big the data is, if there is confidential information in the data and how the data will be used. Below are some different storage types with their pros and cons. The type of data that is considered in this article are data files, and not, for example, data stored in data bases. <br />
<br />
=== File sync services ===<br />
<br />
Common examples of file sync services are DropBox, OneDrive, Box among several others. This storage type is suitable when the data files required for a project is not bigger than that they can be stored on a regular laptop or desktop computer. In file sync services a copy of the file exists locally on all computers that has the folder synced. This means that accessing the file is quick and that there is never a problem for multiple people to access the file at the same time as they have their own version of the file. <br />
<br />
However, because all files are saved locally this also means that if one person works on many projects there is a risk that there is not space for all files for those projects on that persons computer. Many file syncing services have options to mitigate this. For example you can chose to not sync the parts of the project folder that you know you do not need access to. Some sync services also allow you to have non-synced files ready to be downloaded on demand (sometimes referred to as smart sync). However, this is not great for data work as data files are usually to big to be downloaded on demand when your code trying to access them leading to your programming software to crash when trying to read the file.<br />
<br />
Another concern with file syncing services are privacy concerns when sharing confidential data. There are enterprise subscriptions that can be installed on your own servers. This makes sharing data much more secure as the data is never stored or transferred on the servers that belong to the company that offers the syncing service. If you do not have an enterprise subscription or are not sure if you do, you should always [[encryption|encrypt]] the data before saving it in a synced folder.<br />
<br />
=== Network drives ===<br />
<br />
=== Cloud storage ===<br />
<br />
=== Storage for different size of data ===<br />
<br />
A useful rule of thumb when deciding which type of storage is suitable given the size of your data is if the total size of all data in your project can fit the space usually available on a typical user's laptop. If the data is small enough to fit on a regular laptop then synced storages (such as World Bank OneDrive or DropBox) becomes available. In synced storages each user has their own copy saved on their computer, and the sync software makes sure that all users have identical files. This is different from storing data on network drives of cloud storage where it is the same file each user uses, not just an identical file. When each user has their own version of the file access to that file tend to be faster. <br />
<br />
If the total size of all data in the project folder is too big to fit on a regular but the size of the files relevant to each users is not too big, then synced storages can still be used if the syncing service allows you to sync only specific folders or files in your project folder.<br />
<br />
If the data in the project folder is too big to be synced to a typical laptop, then a the data can be stored in a network drive or a cloud storage. However, in these solution there is no copy of the file stored on each users hard drive, and depending on the exact service used and connectivity speed, this can be slow. And even if network or cloud storage has next to unlimited storage capacity<br />
<br />
== Back-up protocols ==<br />
<br />
== Back-up storage types ==<br />
You should never use the same file or storage solution for your back-up copy of your original or raw data, as the files and storage solution you use for your day-to-day work. While some storage types could be used for both day-to-day and backup, this sections covers the former use case. Backing up of data is covered in a later section below. There are multiple types of storage and what type is best for you is how depends on the following factors: size of your data, where it will be used and cost.<br />
<br />
== Data retention ==<br />
<br />
== Back to Parent ==<br />
This article is part of the topic [[*topic name, as listed on main page*]]<br />
<br />
<br />
== Additional Resources ==<br />
* list here other articles related to this topic, with a brief description and link<br />
<br />
[[Category: *category name* ]]</div>Kbjarkefur