Reproducible Data Analysis and Publishing

This module is part of MOLB5700 (Beyond the Bench: Fundamental Skills for Biomedical Researchers).

Table of Contents

1. The importance of reproducible data analysis

2. What is git and why should you care?

3. What is github and how does it work with git

4. SSH Authentication with keys

4. Create Your first git repository

5. Generate some content

6. Configure the repository for github

7. Commit and push your changes to github

1. The importance of reproducible data analysis

You are likely already familiar with the concept of reproducible research. In scientific research, we place great value on being able to repeat the work of others and confirm their results. This reproducibility is essential to ensure that empirical observations are rooted in fact and are not a fluke or the product of chance alone. Peer review also helps ensure, at least in part, that results are reproducible.

The same principle applies to statistical analysis. As you are aware, we are in the midst of a data revolution. Take genomics as an example: the amount of data scientists generate through genomic sequencing is rising exponentially. More data is leading to the development of increasingly complex statistical models and routines. A whole generation of scientists are devoting their careers to data analysis alone.

As graduate students, no matter what your subdiscipline, you will be dealing with a lot of data. You will generate, collect, and analyze data. You will write scripts and save them in some obscure location on your computer, write a paper based on them, and send it off for peer review. A few months later, you will hear from the reviewers, who will ask some tough questions about the analysis and interpretation and, more likely than not, ask you to repeat your analysis with a different set of parameters (among other things).

Off you go, then, looking for your raw data, transformed data, and scripts. If you are lucky, you find all of them. But did you annotate your scripts properly? If not, you will have to relearn what exactly each script is doing (or do some time travel to consult your younger self). Then you try to rerun the script, and of course it doesn’t work. Some R functions have changed in the newer version, and the old routine produces lots of error messages. You are under a deadline to respond to the reviewers, and now you have to retrace every single step of your earlier analysis. You get the gist.

Cue reproducible data analysis. How can you achieve the Goldilocks zone?


1.1 Steps for making your analysis reproducible

To get there, you will need to invest some time in developing a toolkit.


1.2 Tools for reproducible analysis and publishing



2. What is git and why should you care?

Before we get to git, let’s familiarize ourselves with a BASH terminal. In Spotlight search (the magnifying glass icon in the top right corner), type terminal and hit enter.

On a Mac, this window is called a terminal. On other systems it can go by different names (e.g. xterm, shell), but in each case it runs a command interpreter, and BASH is one of the most common. At the core of a computer is a set of programs (called the kernel) that controls basic functions such as identifying all the hardware and allowing its components to communicate. When you plug a USB key into a USB port, it is the kernel that recognizes the new piece of hardware and connects it to the rest of the system.

For us to make the computer do what we want, other programs act as an interface between the computer and us. BASH is one such program. You can also refer to it as the shell.

As an example, let’s run a couple of commands in the shell:

ls -l
pwd
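
To get a feel for these two commands, here is a minimal sketch that can be pasted into the terminal. It works inside a throwaway directory created by mktemp, so nothing in your home folder is touched. The names practice_dir and notes.txt are made up for illustration.

```shell
# Create a temporary scratch directory (mktemp chooses its name).
scratch=$(mktemp -d)

# Make a practice folder inside it and move into that folder.
mkdir "$scratch/practice_dir"
cd "$scratch/practice_dir"

# pwd prints the directory you are currently in.
pwd

# Create an empty file, then list the folder contents in long format.
touch notes.txt
ls -l
```

The long listing shows one line per file, including its permissions, owner, size, and modification time.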


2.1 What is git?

- Git is a version control program

- Git runs inside the BASH terminal
git --version

git version 2.29.1

man git
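
If you are not sure whether git is installed at all, a small check like the following may help. This is only a sketch: the variable name git_status and the wording of the message are illustrative choices, not part of git.

```shell
# command -v succeeds only if the shell can find a program called git.
if command -v git >/dev/null 2>&1; then
    git_status="installed"
    # Print the installed version, as in the example above.
    git --version
else
    git_status="missing"
    echo "git was not found; install it before continuing"
fi
```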



3. What is github and how does it work with git


3.1 Log in to your account and make a new repository

4. SSH Authentication with keys


4.1 Generate SSH Key Pair

ssh-keygen -t rsa -b 4096 -C "name@host.edu"

Generating public/private rsa key pair.
Enter file in which to save the key (/Users/wyoibc/.ssh/id_rsa): /Users/wyoibc/.ssh/t_rex
Enter passphrase (empty for no passphrase): 

Your identification has been saved in /Users/wyoibc/.ssh/t_rex.
Your public key has been saved in /Users/wyoibc/.ssh/t_rex.pub.
The key fingerprint is:
SHA256:4ps5bMhN7293Grtr0v+dCsUI4Ji02WQL9v+Wut7uOu4 name@host.edu
The key's randomart image is:
+---[RSA 4096]----+
|     + +         |
|    o % o        |
|     = = .       |
|        . . o    |
|      . S. . o   |
|     ...  . o    |
|   . =..   *.    |
|    o =+..* =o. o|
|     .+o=EBB=Booo|
+----[SHA256]-----+
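
The same key generation can be tried without the interactive prompts. The sketch below is a practice run under stated assumptions: it writes to a scratch directory rather than ~/.ssh, the file name demo_key is hypothetical, and the empty passphrase given with -N is for demonstration only (a real key should normally have a passphrase).

```shell
# Work in a scratch directory so real keys in ~/.ssh are never overwritten.
keydir=$(mktemp -d)

# -f names the key file up front instead of answering the prompt;
# -N "" sets an empty passphrase (demo only); -q suppresses the randomart.
ssh-keygen -q -t rsa -b 4096 -C "name@host.edu" -f "$keydir/demo_key" -N ""

# Two files are produced: the private key and the matching .pub public key.
ls -l "$keydir"

# Print the fingerprint of the public key.
ssh-keygen -l -f "$keydir/demo_key.pub"
```

Only the .pub file is ever shared; the private key stays on your computer.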


4.2 Associate Public Key with GitHub

cd /Users/wyoibc/.ssh/

pbcopy < t_rex.pub

4.3 Configure Your System to Use SSH Keys

eval "$(ssh-agent -s)"

ssh-add /Users/wyoibc/.ssh/t_rex
ssh-add -l

4096 SHA256:uMFoUIUC7osmgDQaNowsOnqEt3XmPPZe4RnR2+Q1KB0 name@host.edu (RSA)

cd /Users/wyoibc/.ssh

vim config
Host *
    AddKeysToAgent yes
    UseKeychain yes

Host wyoibc.github.com
    HostName github.com
    User git
    PreferredAuthentications publickey
    IdentityFile ~/.ssh/t_rex
:wq
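
If you would rather not use vim, the same file can be written straight from the shell. The following is a sketch only: it writes into a scratch directory (on a real system the file would be ~/.ssh/config), and the chmod step is an extra precaution added here, not something the steps above require.

```shell
# Use a scratch directory here; on a real system the file is ~/.ssh/config.
sshdir=$(mktemp -d)

# A quoted heredoc ('EOF') writes the lines exactly as typed.
cat > "$sshdir/config" <<'EOF'
Host *
    AddKeysToAgent yes
    UseKeychain yes

Host wyoibc.github.com
    HostName github.com
    User git
    PreferredAuthentications publickey
    IdentityFile ~/.ssh/t_rex
EOF

# Make the file readable and writable by its owner only.
chmod 600 "$sshdir/config"
ls -l "$sshdir/config"
```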


4.4 Test Your SSH Key

ssh -T git@github.com
> The authenticity of host 'github.com (IP ADDRESS)' can't be established.
> RSA key fingerprint is SHA256:nThbg6kXUpJWGl7E1IGOCspRomTxdCARLviKw6E5SY8.
> Are you sure you want to continue connecting (yes/no)?
> Hi username! You've successfully authenticated, but GitHub does not
> provide shell access.

ssh -vT git@github.com



4. Create Your first git repository

cd ~

mkdir Github && cd Github
mkdir test && cd test
git init

Initialized empty Git repository in /Users/wyoibc/Github/test/.git/

git status

On branch master

No commits yet

nothing to commit (create/copy files and use "git add" to track)
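
To see what git init actually creates, the sketch below repeats the step in a scratch directory and looks inside. The scratch location is an assumption made so that the example cannot interfere with the Github/test folder used above.

```shell
# Use a scratch directory so no existing folder is affected.
repo=$(mktemp -d)
cd "$repo"

# -q keeps git init quiet; the repository is created all the same.
git init -q

# The hidden .git folder holds the entire history of the repository.
ls -a

# Ask git where its data lives; at the top of a fresh repository this is .git
gitdir=$(git rev-parse --git-dir)
echo "$gitdir"
```

Deleting the .git folder would turn the directory back into an ordinary folder with no version history.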



5. Generate some content

vim README.md
## test.git

This is my first git repository.

The date is February 9, 2022.
:wq

git status
On branch master

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)
    README.md

nothing added to commit but untracked files present (use "git add" to track)

git add README.md

git status

On branch master

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
    new file:   README.md
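
The move from untracked to staged is easier to see with the compact form of git status. The sketch below repeats the steps in a scratch repository (an assumption, so that your real test repository is left alone) and creates the README with printf instead of vim.

```shell
# Start a fresh scratch repository.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Create the README without opening an editor.
printf '## test.git\n\nThis is my first git repository.\n' > README.md

# "??" marks a file that git sees but does not track yet.
before=$(git status --short)
echo "$before"

# Stage the file.
git add README.md

# "A" marks a new file that is staged for the next commit.
after=$(git status --short)
echo "$after"
```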



6. Configure the repository for github

git config user.name "wyoibc"
git config user.email "wyoinbre@gmail.com"
git remote add origin git@github.com:wyoibc/test
git remote -v

origin  git@github.com:wyoibc/test (fetch)
origin  git@github.com:wyoibc/test (push)

git config --list

user.name=wyoibc
user.email=wyoinbre@gmail.com
remote.origin.url=git@github.com:wyoibc/test
remote.origin.fetch=+refs/heads/*:refs/remotes/origin/*
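
Individual settings can also be read back one at a time, which is handy for checking a single value. This sketch repeats the configuration in a scratch repository (an assumption made for safety) and then queries it. Because the commands are run without --global, the values are stored in this repository only.

```shell
# Start a fresh scratch repository.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Same settings as above, stored in this repository's .git/config.
git config user.name "wyoibc"
git config user.email "wyoinbre@gmail.com"
git remote add origin git@github.com:wyoibc/test

# Read back single values instead of the full list.
name=$(git config --get user.name)
url=$(git remote get-url origin)
echo "$name"
echo "$url"

# --local restricts the listing to this repository's own settings.
git config --local --list
```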



7. Commit and push your changes to github

git commit -m "my first commit"

[master (root-commit) 79f60ea] my first commit
 1 file changed, 5 insertions(+)
 create mode 100644 README.md

ssh-add -l

4096 SHA256:uMFoNMZK7osmgTMaNowsOnqEt3XmUXAe4RnR2+Q1KB0 wyoinbre@gmail.com (RSA)

git push -u origin master
Enumerating objects: 3, done.
Counting objects: 100% (3/3), done.
Delta compression using up to 8 threads
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 279 bytes | 279.00 KiB/s, done.
Total 3 (delta 0), reused 0 (delta 0), pack-reused 0
To github.com:wyoibc/test
 * [new branch]      master -> master
Branch 'master' set up to track remote branch 'master' from 'origin'.
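
The whole commit-and-push cycle can be rehearsed offline before touching GitHub. In the sketch below a local bare repository stands in for the GitHub remote; that substitution, the scratch directories, and pushing HEAD instead of master are all assumptions made for this example. Pushing HEAD means "the branch I am on", so the sketch should work whatever the default branch is called on your installation.

```shell
# Two scratch directories: one working copy, one stand-in for GitHub.
work=$(mktemp -d)
remote=$(mktemp -d)

# A bare repository has no working files; it only stores history.
git init -q --bare "$remote"

# Set up the working repository as in the earlier sections.
cd "$work"
git init -q
git config user.name "wyoibc"
git config user.email "wyoinbre@gmail.com"

# Create, stage, and commit one file.
echo "This is my first git repository." > README.md
git add README.md
git commit -q -m "my first commit"

# Point origin at the local stand-in and push the current branch.
git remote add origin "$remote"
git push -q -u origin HEAD

# The remote should now list the same commit that git log shows locally.
git log --oneline
git ls-remote --heads origin
```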