Biological analysis entails writing code, which changes over time as you develop it, gain insight from your data, and identify new questions and hypotheses. A common pattern when developing scripts is to make copies of older code files to preserve them before making new changes to them. While it is a good idea to maintain a record of all the code you previously ran, over time this practice often leads to disorganized, cluttered, untidy analysis directories.
For example, say you are working on a script named my_R_script.R and decide you want to add a new analysis that substantially changes the code. You might be tempted to make a copy of the current version of the code into a new file named my_R_script_v2.R that you then make changes to, leaving your original script intact and untouched going forward. You make your changes to your new script, produce some stunning and fascinating plots, present the analysis at a group meeting, only to discover later there was a critical bug in your code that made the plots misleading and requires substantial redevelopment.
Bugs happen. There are two types of bugs:
Bugs are normal. The scenario described above, where you present results only to discover your code wasn’t doing what you thought it was doing, is extremely common and it will happen to you. This is normal, and finding a bug in your code does not mean you are a bad programmer.
Rather than edit version 2 of your script directly, you decide it is sensible to copy the file to my_R_script_v2_BAD.R and edit the version 2 script to fix the bug. You are satisfied with your new version 2 script, and so make a new copy, my_R_script_v2_final.R. Upon review of your analysis, you are asked to implement new changes to the script based on reviewer feedback. You make a new copy of your script, my_R_script_v2_final_revision.R, and make the requested changes. Perhaps now your script is final, but in your directory you now have five different versions of your analysis:
my_R_script.R
my_R_script_v2.R
my_R_script_v2_BAD.R
my_R_script_v2_final.R
my_R_script_v2_final_revision.R
When you write your code, you may know which scripts are which, and if you follow good programming practice and carefully comment your code, you or your successors may be able to sleuth out what was done. However, as time passes, the intimate knowledge you thought you had about your code will be replaced by other more immediately important things; eventually you may not understand or even recognize your own code, let alone expect someone else to understand it. This is not an ideal situation in any case. A better solution involves recording changes to code over time in such a way that you can recover old code if needed, without cluttering your analytical workspace with unneeded files. git provides an efficient solution to this problem.
git is a free, open source version control software program. Version control software is used to track and record changes to code over time, potentially by many developers working on the same software project concurrently from different parts of the world. The base git software can be used on the command line, or with graphical user interface applications for popular operating systems.
There are many excellent tutorials online (some linked below) that teach how to use git, but the basic concepts are described below. The command line commands are listed, but the same operations apply in the graphical clients.
Create a repo (git init in a new directory)
Start tracking a file (git add <filename> from within the git repo directory)
Check the state of the repo (git status)
Stage modified files (git add <filename> to record changes)
Commit the staged changes (git commit -m "<commit message>" with a concise commit message that describes what was done)
For the vast majority of use cases, the git init, git status, git add, and git commit operations are all you will need to use git effectively. Two more commands, git push and git pull, are needed when sharing your code with others as described in the next section.
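The operations listed above can be sketched as a short shell session. This is a minimal illustration, not a prescribed workflow: the scratch directory, the placeholder user name and email, and the file contents are all assumptions made for the example. The final command shows one way to recover the previous version of the script from the repo history instead of keeping a copy such as my_R_script_v2.R in the directory.

```shell
set -e

# Hypothetical scratch directory for the example repo
repo="$(mktemp -d)/my_analysis"
mkdir -p "$repo"
cd "$repo"

# Create the repo; the identity below is a placeholder set for this repo only
git init --quiet
git config user.name "Example User"
git config user.email "user@example.com"

# Add a first version of the script and commit it
echo 'print("version 1")' > my_R_script.R
git add my_R_script.R
git status --short          # lists my_R_script.R as a newly added file
git commit --quiet -m "Add first version of analysis script"

# Change the script in place, then record the change
echo 'print("version 2")' > my_R_script.R
git status --short          # lists my_R_script.R as modified
git add my_R_script.R
git commit --quiet -m "Update analysis script"

# The directory holds one file, but both versions are in the history
git log --oneline
git show HEAD~1:my_R_script.R   # prints the previous version of the file
```

Only one copy of my_R_script.R ever exists in the working directory; earlier versions stay available through the commit history.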
The basic git software only works on your local computer with local repositories. To share this code with others, and receive others’ contributions, a copy of the repo must be made available in a centralized location that everyone can access. One such place is github.com, which is a free web application that hosts git repos. bitbucket.org is another popular free git repo hosting service. These two services are practically the same, so we will focus on GitHub.
There is no formal relationship between git and GitHub. git is an open source software project maintained by hundreds of developers around the world (and is hosted on GitHub). GitHub is an independently provided web service and application. The only connection between GitHub and git is that GitHub hosts git repos.
As with git, there are many excellent tutorials on how to use GitHub, but the basic concepts are described below.
The next step depends on whether you have an existing local repo or not:
If you do not yet have a local repo, clone your GitHub repo to create a local copy that is connected to the one on GitHub
Now, your local repo is connected to the same repo on GitHub, and the changes you make to your local files can be sent, or pushed, to the repo on GitHub:
git add and git commit your changes as above
git push
git status will indicate whether your local repo is up to date with your remote GitHub repo
When you are working on a team of contributors to a GitHub repo, your local files will become out of date as others push their changes. To ensure your local repo is up to date with the GitHub repo, you must pull their changes from GitHub with git pull.
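The clone, push, and pull cycle described above can be tried out without a GitHub account. In the sketch below, a bare repo on the local disk stands in for the repo hosted on GitHub; that substitution, the directory names (you, teammate), and the placeholder identities are assumptions made so the example is self-contained. With a real GitHub repo, the path passed to git clone would be the repo's URL instead.

```shell
set -e

work="$(mktemp -d)"

# Assumption: a local bare repo plays the role of the repo hosted on GitHub
git init --quiet --bare "$work/remote.git"

# Clone the (still empty) remote to get a connected local copy
git clone --quiet "$work/remote.git" "$work/you"
cd "$work/you"
git config user.name "Example User"
git config user.email "user@example.com"

# Commit locally, then push; -u records the remote branch for later pushes
echo 'print("version 1")' > my_R_script.R
git add my_R_script.R
git commit --quiet -m "Add analysis script"
git push --quiet -u origin HEAD

# A hypothetical teammate clones the same remote
git clone --quiet "$work/remote.git" "$work/teammate"
git -C "$work/teammate" config user.name "Example Teammate"
git -C "$work/teammate" config user.email "teammate@example.com"

# You push another change, so the teammate's copy is now out of date
echo 'print("version 2")' > my_R_script.R
git add my_R_script.R
git commit --quiet -m "Update analysis script"
git push --quiet

# The teammate brings their local repo up to date
cd "$work/teammate"
git pull --quiet
git log --oneline
```

Because the teammate made no changes of their own, the pull simply moves their copy forward to the latest commit and no merging is needed.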
git was designed to automatically combine changes made to a code base by different developers whenever possible. However, if two people make changes to the same parts of the same file, git may not be able to resolve those changes on its own and the developers must communicate and decide what the code should be. These instances are called merge conflicts and can be challenging to resolve. Dealing with merge conflicts is beyond the scope of this book, but some resources are linked below for further reading.
All the content and code for this book are stored and available on GitHub, as are the assignment code templates.