Getting started with GitHub
This page collects resources on how to get started with GitHub. GitHub is not the only hosting service built on Git; GitLab and Bitbucket are two alternatives, and most of the resources listed here apply to them as well.
The World Bank's GitHub repositories can be found at https://github.com/worldbank.
Read First
- Note that GitHub is meant to be used only for code files and other raw text files. This limitation, and ways to work around it, are discussed in more detail below.
What GitHub is good at and what it is less good at
Git was built to manage code, and it does so by tracking changes to code in great detail. This is why Git is an excellent tool for collaborating on code, but the drawback is that Git is only efficient at tracking changes to raw text files. All code files in any programming language are raw text files, and so are .tex, .txt and .csv files. Files such as .doc/.docx, .xls/.xlsx and .pdf files, as well as images, are examples of binary files, which are not raw text files. Binary files are stored very efficiently, but Git does not have direct access to the text and numbers in those files and therefore cannot track changes to them in detail. Git therefore stores one full version of a binary file for each change made to it, which quickly becomes very inefficient. See the sections on ignore files and combining GitHub and DropBox below for how to deal with this.
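To make the idea of tracking changes in raw text files concrete, here is a minimal Python sketch. It does not use Git itself; it uses the standard library module difflib to mimic the kind of line-by-line comparison that works on raw text. The two script versions and the function name changed_lines are made up for this illustration.

```python
import difflib

# Two hypothetical versions of a small analysis script (raw text).
old_script = "use data.csv\nregress income age\nsave results\n"
new_script = "use data.csv\nregress income age education\nsave results\n"


def changed_lines(old: str, new: str) -> list[str]:
    """Return only the removed ("- ") and added ("+ ") lines of a line diff."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Only the edited line is reported; the two unchanged lines are left out.
for line in changed_lines(old_script, new_script):
    print(line)
```

A comparison like this only makes sense when a file is made up of readable lines of text. Binary formats do not expose their content as lines of text, so a small edit cannot be described in this compact way, which is why a full copy of the file ends up being stored for every change.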
Resources for absolute beginners
Since GitHub is used extensively outside the research community, there are many resources online on how to get started with GitHub. Some of those resources assume technical skills, but the list below links to resources that do not:
- https://guides.github.com/ - GitHub's own guide on how to get started
Recommended GitHub Guide reading
Some topics discussed in the GitHub guide are not relevant to research, but we recommend that researchers read the topics described in the following sections and use them frequently.
Best practices for managing a research project using GitHub
gitignore files
A gitignore file is a very important tool for controlling which parts of your data work folder are shared in the cloud. Files that match the rules in the gitignore file are ignored by Git: they stay in your local copy of the repository and are not synced with the repository in the cloud. This is a great way to make sure that you do not share data files with private data in the GitHub cloud, and to avoid sharing binary files that would otherwise make your GitHub repository big and slow to work with.
See GitHub's own documentation on ignore files here; that page has links to more detailed reading. The World Bank's DIME team has developed a template gitignore file with the needs of researchers specifically in mind. In most cases you can use it as it is. In some contexts you might have to make some edits, but it is still a great starting point. You can find the template here.
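As a rough illustration of what such a file can look like, here is a minimal sketch of a gitignore file. It is not the DIME template; the file extensions are examples chosen for this illustration, and the right list depends on your project. Each line is a pattern, and lines starting with # are comments.

```gitignore
# Data files that may contain private data (example extensions)
*.csv
*.dta

# Binary document formats that Git cannot track efficiently
*.doc
*.docx
*.xls
*.xlsx
*.pdf

# Images
*.png
*.jpg
```

The file is saved with the name .gitignore in the top folder of the repository. A pattern only affects files that Git is not already tracking, so it is easiest to add the gitignore file before any data or binary files are committed.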
Combining GitHub and DropBox
In research we often want to use a syncing service such as DropBox or OneDrive in combination with GitHub. This requires a specific setup, since GitHub is also a syncing service, although it works very differently from DropBox, OneDrive and similar services.
Combining GitHub and DropBox is a great way to share data and binary files across team members without leaking private data to the GitHub cloud, and to avoid the disk space problem that arises when Git tracks binary files. See this guide for how to combine GitHub and DropBox. The guide includes some slightly more technical steps, but it solves a big problem and the setup is easy to maintain once it is in place.
This article is part of the topic Data Management