GitHub.com is an online platform that is commonly used in industry and academia for collaboration. You will use GitHub in your statistics / data science class to complete individual and team assignments. This is the first of two tutorials to walk you through the steps you will need to successfully use GitHub in your classes.

By the end of this tutorial, you will be able to

  • Understand commonly used GitHub terms
  • Clone a GitHub repo
  • Start a new project in an RStudio docker container
  • Connect the RStudio project to your GitHub repo

Terminology

Below are terms that are commonly used when talking about GitHub.1

Term Definition
Git An open source version control software system
GitHub A remote commercial hosting service for Git repositories
Git repository (or repo) Similar to a project directory or folder in Google Drive, Dropbox, etc. It tracks changes to files.
commit Save changes to a local repo
pull Update a local repo
push Upload local files to the remote repo
merge conflict Contradictory changes that cannot be integrated until they are reconciled by a user


As you begin to use GitHub more, you may also come across terms describing more advanced GitHub actions.

Term Definition
forking Create a copy of a repository in your local profile
pull request Submit changes to a remote repo
branching Keeping multiple snapshots of a repo
gh-pages Special branch which allows creation of a webpage from within GitHub
GitHub actions Mechanism for continuous integration

Register for a GitHub account

If you do not have a GitHub account, you can register for a new account at http://github.com. Your GitHub repo will contain examples of your work that can be shared as you apply for internships and jobs, so choose a username that you are willing to share with future employers. Also consider a username that will still be relevant after your statistics class (e.g. don’t use Stat101Student2020 as your username 🙃). See Username Advice in Happy Git and GitHub for the R User for more tips on choosing a GitHub username.

Clone a GitHub repo

Most of your assignments will be done using GitHub and RStudio. You’ll begin with a starter repo on GitHub (most likely provided by your instructor) that contains templates and other materials needed to complete the assignment. You’ll conduct your analysis and type your responses in RStudio and “push” the updates back to the repo on GitHub. In this tutorial, we will focus on the steps to get started on an assignment - specifically cloning a repo and starting a new project in RStudio.

We’ll use the example-assignment repo in the DukeStatSci GitHub organization for this tutorial. You can use this repo to practice the steps in this tutorial; however, note that you cannot push to the repo (we’ll talk more about pushing in the next tutorial).

Clone a repo + create new project

  1. Go to your course organization on GitHub. The URL for the course organization is provided by your instructor. In this tutorial, the organization is DukeStatSci at www.github.com/DukeStatSci.

  2. Click on the relevant assignment repo. Ours is the example-assignment repo. This contains the starter documents and other materials required for the assignment.

  1. Click on the green Code button and make sure it says “Clone with HTTPS” (this is the default option). Click on the clipboard icon to copy the repo URL.

  1. Go to RStudio.

    If you are using RStudio through a Docker container

    • Go to the Docker containers at https://vm-manage.oit.duke.edu/containers and log in with your Duke NetId and Password.
    • Click to log into the Docker container for your course. You should now see the RStudio environment.


  1. In RStudio, go to File ➡️ New Project ➡️ Version Control ➡️ Git.

  2. Copy and paste the URL of your assignment repo into the dialog box Repository URL. You can leave Project directory name empty. It will default to the name of the GitHub repo.

  1. Click Create Project, and the files from your GitHub repo will be displayed in the Files pane in RStudio.

Configure git

There is one thing to take care of before you start completing the assignment. You need to configure git so that RStudio can communicate with GitHub. To do so, you will use the use_git_config() function from the usethis package.

Type the following lines of code in the console in RStudio filling in your GitHub username and the email address tied to your GitHub account.

library(usethis)
use_git_config(user.name = "your GitHub username", user.email="your email")

If you get the error message

Error in library(usethis) : there is no package called ‘usethis’

then you need to install the usethis package by running the code below in the console:

install.package("usethis")
library(usethis)

Then, rerun the use_git_config function with your GitHub username and email address.


That’s it! You’ve cloned a repo, started a new project in RStudio, and configured git. You’re now ready to start your analysis in RStudio!

Cache password

If you are asked to reenter your password each time you push to GitHub, you can cache your GitHub password so it is saved over a specified period of time. This most likely occurs if you’re using RStudio through RStudio Cloud or RStudio Pro servers provided by the Statistical Science department.

To cache your password, run the following in the Terminal:

This will cache your password for 604800 seconds, i.e. 7 days (60 * 60 * 24 * 7 = 604800). This is generally enough time to complete a lab or homework assignment. You can increase the number of seconds for longer assignments, such as a project that lasts several weeks.

Resources

Happy Git and GitHub for the R User by Jenny Bryan, the STAT 545 TAs, and Jim Hester

Acknowledgements

The instructions from this tutorial were adapted from labs in Data Science in a Box by Mine Çetinkaya-Rundel.


  1. Definitions from Beckman, M., Çetinkaya-Rundel, M., Horton, N., Rundel, C., Sullivan, A., & Tackett, M. Implementing version control with Git as a learning objective in statistics courses. arxiv.org/abs/2001.01988.↩︎