Creating your first R package

Video recording for this session

1. Anatomy of an R Package
2. Ways to publish your package
3. Create an example package
4. Distributing your package
- 4.1 How others will install your package
5. Exercise

1. Anatomy of an R Package

Every R package has some essential components:
- Documentation, stored as .Rd files inside the man folder
- Functions, stored as .R files inside the R folder
- Package metadata stored inside a flat file named DESCRIPTION
- A short file named NAMESPACE that declares which functions the package exports and which functions it imports from other packages
- Various optional files and folders, including CITATION, help, doc and data. In an installed package, help holds the compiled form of the man documentation and doc holds vignettes.
Let’s examine the contents of some of the packages we have already installed.
To find where the files for a given package are located, use the following syntax:

find.package("package_name")

For example, let’s find the location of the ggplot2 package:

find.package("ggplot2")

[1] "/Library/Frameworks/R.framework/Versions/4.0/Resources/library/ggplot2"

This is the location on my computer; yours might be in a different place. Copy the location, including the quotes, and let’s go there.

setwd("/Library/Frameworks/R.framework/Versions/4.0/Resources/library/ggplot2")

list.files()

 [1] "CITATION"    "data"        "DESCRIPTION" "doc"         "help"       
 [6] "html"        "INDEX"       "LICENSE"     "Meta"        "NAMESPACE"  
[11] "NEWS.md"     "R"
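
The same inspection can be done without changing the working directory. The sketch below is one possible variation: it uses the stats package (which ships with R) instead of ggplot2 so that it runs on any installation, and the variable names pkg_dir, contents and essential are illustrative only.

```r
# Locate an installed package and list its top-level contents
# without calling setwd().
pkg_dir <- find.package("stats")   # "stats" ships with every R installation
contents <- list.files(pkg_dir)
print(contents)

# Check that the essential components described earlier are present.
essential <- c("DESCRIPTION", "NAMESPACE", "R")
print(essential %in% contents)
```

Replacing "stats" with "ggplot2" should give the same listing as above, provided ggplot2 is installed.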

Let’s explore the R folder:

setwd("R")

list.files()

[1] "ggplot2"     "ggplot2.rdb" "ggplot2.rdx"

You can check the contents of the ggplot2 file using system("cat FILENAME"), but it does not contain any function definitions. The other two files are binary, meaning that they are only for R’s internal use.
To check which functions are available within ggplot2, you can print the NAMESPACE file with cat:

setwd("../")

system("cat NAMESPACE")

# Generated by roxygen2: do not edit by hand

S3method("$",ggproto)
S3method("$",ggproto_parent)
S3method("$<-",uneval)
S3method("+",gg)
...
importFrom(stats,setNames)
importFrom(tibble,tibble)
importFrom(utils,.DollarNames)

As you can see, it’s a very long list of exported functions, S3 methods and imports. We have truncated the view above.
Feel free to explore what is contained within each folder of the package. Some files are binary and cannot be viewed as text.
Try another package of your choice to learn where and how its files are stored (3 min).
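
system("cat ...") relies on a cat command being available, which may not be the case on a default Windows setup. As a hedged alternative, the sketch below reads the same information from within R. It again uses the stats package so that it runs anywhere; the variable names are illustrative.

```r
# List what a package exports without shelling out to `cat`.
exports <- getNamespaceExports("stats")
print(head(sort(exports)))
print("median" %in% exports)

# Read the raw NAMESPACE file in a portable way.
ns_lines <- readLines(file.path(find.package("stats"), "NAMESPACE"))
print(head(ns_lines))
```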

2. Ways to publish your package

There are three main ways to publish your package, i.e. make it widely available for anyone to use:
- CRAN: The Comprehensive R Archive Network is the most widely used repository for R packages. Anyone can submit a package for review and publishing on CRAN. The review process may take some time. At the time of writing, the repository has over 17,000 packages available.
- Bioconductor: A specialized repository that contains packages of relevance to biologists. Many genomic data analysis packages are now distributed through Bioconductor.
- GitHub: Cutting-edge versions of packages can be obtained from GitHub. Often these packages have stable versions on CRAN, while more recent versions that are not yet on CRAN can be accessed on GitHub. We have already covered how to install R packages from GitHub on several occasions.

3. Let’s create a package

One of the motivations behind creating a package is to be able to use a given function with ease, instead of having to hunt down the relevant code to repeat the analysis. Let’s walk through a few functions to see how they work:

3.1 Function to print numbers

Functions allow you to store code you use frequently so that it can be invoked any time you need it. Here is a very simple example:

printnum <- function(){
    # readline() returns text, so convert it to a whole number first
    n <- as.integer(readline("How many numbers do you want to print? "))
    print(seq_len(n))
}

printnum()

How many numbers do you want to print? 25

[1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25

3.2 Function to convert temperature

f2c <- function(tempF){
    tempC <- (tempF - 32) * 5/9
    return(tempC)
}

f2c(32)

[1] 0


f2c(88)

[1] 31.11111

3.3 Function to generate genetic data

Imagine you want to create population-level genetic data, with each individual coded by its genotype.

popdata <- function(allele1, allele2, popsize){
    allele1 <- readline("Provide name of the first allele - one alphabet only: ")
    allele2 <- readline("Provide name of the second allele: ")
    popsize <- readline("How many individuals in your population?: ")

    print("Possible diploid genotypes are:")
    print(paste0("First Homozygote: ", allele1, allele1))
    print(paste0("Heterozygote: ", allele1, allele2))
    print(paste0("Second Homozygote: ", allele2, allele2))

    homoz1 <- paste(allele1, allele1, sep="")
    hetz <- paste(allele1, allele2, sep="")
    homoz2 <- paste(allele2, allele2, sep="")

    pop <- sample(c(homoz1, hetz, homoz2), popsize, replace=TRUE)

    print("Your sampled population is stored in object 'pop' and also printed below:")
    print(pop)
}

When you execute this code, the output will look like this:

Provide name of the first allele - one alphabet only: A
Provide name of the second allele: D
How many individuals in your population?: 100
[1] "Possible diploid genotypes are:"
[1] "First Homozygote: AA"
[1] "Heterozygote: AD"
[1] "Second Homozygote: DD"
[1] "Your sampled population is stored in object 'pop' and also printed below:"
  [1] "AD" "DD" "DD" "DD" "DD" "DD" "AD" "AA" "AD" "AA" "DD" "AA" "AD" "DD" "AD"
 [16] "DD" "AA" "AA" "AA" "AA" "AD" "DD" "AD" "AD" "AD" "DD" "AA" "DD" "DD" "AA"
 [31] "AD" "DD" "AD" "DD" "AD" "DD" "AD" "AD" "AD" "AA" "DD" "DD" "AA" "AA" "DD"
 [46] "DD" "DD" "AA" "DD" "DD" "DD" "AA" "DD" "AD" "AA" "AA" "AD" "AA" "DD" "AD"
 [61] "DD" "AD" "DD" "DD" "AA" "DD" "DD" "AD" "AD" "AA" "AA" "DD" "DD" "AA" "AA"
 [76] "AD" "AD" "AA" "AD" "AA" "AA" "AD" "AA" "AD" "AD" "AD" "AA" "AA" "AA" "AA"
 [91] "DD" "AD" "AD" "AD" "DD" "AA" "AA" "DD" "AA" "AD"

A function like this is handy for quickly creating population-level genetic data. It would be useful to turn it into a package.

3.4 Package creation setup

You will need the following packages to build your own package. Go ahead and install them unless you have them already.
- devtools
- roxygen2
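
One way to check which of these packages still need installing is sketched below. The helper name missing_packages is hypothetical (it is not part of devtools or roxygen2), and the install.packages() call is left commented out because it needs an internet connection.

```r
# Return the subset of `pkgs` that is not yet installed.
# `missing_packages` is a hypothetical helper name used for illustration.
# Note: requireNamespace() loads the namespace of any package it finds.
missing_packages <- function(pkgs) {
    installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
    pkgs[!installed]
}

to_install <- missing_packages(c("devtools", "roxygen2"))
print(to_install)

# Uncomment to install whatever is missing (needs an internet connection):
# if (length(to_install) > 0) install.packages(to_install)
```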

3.5 Framework for your package

We will now use the devtools package to start the process of creating the package.

setwd("~/Github")

devtools::create("popdata")

✔ Creating 'popdata/'
✔ Setting active project to '/Users/vikram/Dropbox/Github/popdata'
✔ Creating 'R/'
✔ Writing 'DESCRIPTION'
Package: popdata
Title: What the Package Does (One Line, Title Case)
Version: 0.0.0.9000
Authors@R (parsed):
    * First Last <first.last@example.com> [aut, cre] (YOUR-ORCID-ID)
Description: What the package does (one paragraph).
License: `use_mit_license()`, `use_gpl3_license()` or friends to
    pick a license
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.1.1
✔ Writing 'NAMESPACE'

This will generate a folder named popdata in your current location. Let’s go inside the folder and check what’s there.

list.files()

[1] "DESCRIPTION" "NAMESPACE"   "R"

These file names should now be familiar to you based on our earlier discussion.
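
DESCRIPTION is a plain "Field: value" file, so base R can read it directly. The sketch below writes a few of the fields from the devtools output above to a temporary file and parses them with read.dcf(). The temporary file stands in for the real DESCRIPTION so that the example runs anywhere; the variable names are illustrative.

```r
# Write a small DESCRIPTION-style file to a temporary location.
# The field values are copied from the devtools output above.
desc_file <- tempfile()
writeLines(c(
    "Package: popdata",
    "Title: What the Package Does (One Line, Title Case)",
    "Version: 0.0.0.9000",
    "Encoding: UTF-8"
), desc_file)

# read.dcf() parses "Field: value" files into a character matrix
# with one column per field.
desc <- read.dcf(desc_file)
print(desc[1, "Package"])
print(desc[1, "Version"])
```

Pointing read.dcf() at the DESCRIPTION file inside your popdata folder should work the same way.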

3.6 Add your R functions

In your bash terminal, navigate to the package folder and then to the script folder R. Create a new file there named popdata.R.

cd ~/Github/popdata/R

vim popdata.R

Add the contents of the script from above to the file.
Because this function will be exposed to users (some functions are not; they work behind the scenes), we need to add an export tag directly above the function definition, as follows. This line should appear at the top of the file, immediately before the function.

#' @export

Your popdata.R script should look like this now:

#' @export

popdata <- function(allele1, allele2, popsize){
    allele1 <- readline("Provide name of the first allele - one alphabet only: ")
    allele2 <- readline("Provide name of the second allele: ")
    popsize <- readline("How many individuals in your population?: ")

    print("Possible diploid genotypes are:")
    print(paste0("First Homozygote: ", allele1, allele1))
    print(paste0("Heterozygote: ", allele1, allele2))
    print(paste0("Second Homozygote: ", allele2, allele2))

    homoz1 <- paste(allele1, allele1, sep="")
    hetz <- paste(allele1, allele2, sep="")
    homoz2 <- paste(allele2, allele2, sep="")

    pop <- sample(c(homoz1, hetz, homoz2), popsize, replace=TRUE)

    print("Your sampled population is stored in object 'pop' and also printed below:")
    print(pop)
}

3.7 Document your function

We need to add some useful information for the end user so they know exactly what this function does. When you run the help for this function with ?popdata, this information will be printed to the screen.
The following lines should go at the top of the file, above the #' @export line:


#' popdata() function by FirstName LastName
#' This function takes input from the user on allele names and population size
#' It then prints out the genetic variation data
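
roxygen2 also supports more structured documentation: the first paragraph becomes the title, the next paragraph the description, and tags such as @param, @return and @examples fill the corresponding help sections. The sketch below illustrates this layout on the f2c() function from section 3.2. It is an illustration only and is not part of the popdata package.

```r
#' Convert Fahrenheit to Celsius
#'
#' Converts one or more temperatures from degrees Fahrenheit to degrees
#' Celsius.
#'
#' @param tempF Numeric vector of temperatures in degrees Fahrenheit.
#' @return Numeric vector of temperatures in degrees Celsius.
#' @examples
#' f2c(32)
#' @export
f2c <- function(tempF) {
    (tempF - 32) * 5 / 9
}
```

The blank #' line after the first line is what separates the title from the description.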

Once you have added this information to popdata.R, save and close the file.
Now we will ask devtools to generate documentation based on our input.

devtools::document()

Updating popdata documentation
ℹ Loading popdata
Writing NAMESPACE
Writing NAMESPACE

If you look inside the ~/Github/popdata/man folder, you will now see a popdata.Rd file, which contains the help documentation you just wrote.

3.8 Install your package

The minimum steps are now complete, and you can install the package into your local R library:

devtools::install()

✔  checking for file ‘/Users/vikram/Dropbox/Github/popdata/DESCRIPTION’ ...
─  preparing ‘popdata’:
✔  checking DESCRIPTION meta-information
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  building ‘popdata_0.0.0.9000.tar.gz’
   
Running /Library/Frameworks/R.framework/Resources/bin/R CMD INSTALL \
  /var/folders/8z/8vr45rz94t95_gn426z64f5w0000gn/T//Rtmpqd3Prl/popdata_0.0.0.9000.tar.gz \
  --install-tests 
* installing to library ‘/Library/Frameworks/R.framework/Versions/4.0/Resources/library’
* installing *source* package ‘popdata’ ...
** using staged installation
** R
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (popdata)

That’s it. Your package is now ready to be used.
Check its help menu first:

?popdata

popdata                package:popdata                 R Documentation

popdata() function by FirstName LastName This function takes input from
the user on allele names and population size It then prints out the
genetic variation data

Description:

     popdata() function by FirstName LastName This function takes input
     from the user on allele names and population size It then prints
     out the genetic variation data

Usage:

     popdata(allele1, allele2, popsize)
     

(END)

Nice! Now let’s try running the function:

popdata()

Provide name of the first allele - one alphabet only: M
Provide name of the second allele: N
How many individuals in your population?: 100
[1] "Possible diploid genotypes are:"
[1] "First Homozygote: MM"
[1] "Heterozygote: MN"
[1] "Second Homozygote: NN"
[1] "Your sampled population is stored in object 'pop' and also printed below:"
  [1] "NN" "NN" "MN" "MN" "MN" "MM" "NN" "NN" "MM" "NN" "MM" "MM" "MM" "MM" "MM"
 [16] "MM" "MN" "MM" "MN" "MN" "MN" "MM" "NN" "MN" "MN" "NN" "NN" "NN" "MM" "MN"
 [31] "MN" "MM" "NN" "MN" "MN" "MN" "NN" "MM" "MM" "MM" "MM" "NN" "MM" "NN" "NN"
 [46] "MM" "NN" "MN" "NN" "NN" "NN" "MN" "MM" "NN" "MN" "MM" "MN" "MN" "NN" "MN"
 [61] "NN" "NN" "NN" "NN" "MN" "MM" "NN" "MM" "MM" "MN" "MN" "MN" "MN" "MN" "MM"
 [76] "MN" "MM" "MN" "MM" "MN" "MM" "MN" "NN" "MN" "MN" "MN" "NN" "MM" "MM" "NN"
 [91] "MN" "MM" "MM" "MN" "NN" "MN" "MM" "MM" "NN" "MM"

You may have noticed that our function is designed to be interactive. So even though the Usage section above indicates that you can pass arguments to the function in parentheses, any values you pass are ignored because the readline() prompts overwrite them.
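
For comparison, here is a minimal sketch of a non-interactive variant that actually uses its arguments. The name make_pop is hypothetical and the function is not part of the package built above. It returns the sampled population instead of printing it, and set.seed() is added only to make the example reproducible.

```r
# Hypothetical non-interactive variant of popdata(); `make_pop` is an
# illustrative name and is not part of the package built above.
make_pop <- function(allele1, allele2, popsize) {
    popsize <- as.integer(popsize)
    stopifnot(popsize > 0)

    # The three possible diploid genotypes.
    genotypes <- c(
        paste0(allele1, allele1),
        paste0(allele1, allele2),
        paste0(allele2, allele2)
    )

    # Draw genotypes with equal probability, as popdata() does.
    sample(genotypes, popsize, replace = TRUE)
}

set.seed(1)  # make the random draw reproducible
pop <- make_pop("M", "N", 100)
print(table(pop))
```

Because the result is returned, it can be stored in a variable and reused, which the interactive version does not make obvious.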

4. Distributing your package

This part should now be a cakewalk for you, having gone through git/GitHub multiple times.
- Turn ~/Github/popdata into a git repository
- Create a new popdata repository on GitHub.com
- Configure your local repo, then add, commit and push files as usual

4.1 How others will install your package

Share your GitHub user name on the Slack channel so others can access your package.
Another person who wishes to install your package will do the following:

library(devtools)

devtools::install_github("YOUR_USER_NAME/popdata")

5. Exercise

- Create a new package called allfreq
- Modify the code from our popdata package so that it also calculates the frequencies of both alleles in the population
- Follow all the steps in section 3 above and post your package’s GitHub URL on Slack
The instructor will run through this exercise after your first attempt.