R version 4.3.2 (2023-10-31)
Platform: aarch64-unknown-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS
Matrix products: default
BLAS: /usr/lib/aarch64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/aarch64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] htmlwidgets_1.6.4 compiler_4.3.2 fastmap_1.1.1 cli_3.6.2
[5] tools_4.3.2 htmltools_0.5.7 rstudioapi_0.15.0 yaml_2.3.8
[9] rmarkdown_2.25 knitr_1.45 jsonlite_1.8.8 xfun_0.41
[13] digest_0.6.33 rlang_1.1.2 evaluate_0.23
1 If it’s not in source control, it doesn’t exist.
2 Collaborative research
Data scientists and statisticians, as opposed to closet mathematicians, rarely work in a vacuum.
We talk to scientists/clients about their data and questions.
We write code (a lot!) together with team members or coauthors.
We run code/programs on different platforms.
We write manuscripts/reports with co-authors.
We distribute software so potential users have access to our methods.
In every project you have at least one other collaborator, future-you. You don’t want future-you to curse past-you.
Hadley Wickham
3 Why version control?
A centralized repository helps coordinate multi-person projects.
Time machine. Keep track of all changes and revert to any earlier version easily (reproducibility).
Storage efficiency.
Synchronize files across multiple computers and platforms.
GitHub is becoming a de facto central repository for open source development. E.g., all packages in Julia are distributed through GitHub; Hadley Wickham also recommends Git/GitHub as best practice for R package development.
What should an employer look for when they see a certification on a résumé?
For our program, and likely data science in general, they should look at the applicant’s GitHub page. They should see interesting project and code contributions.
4 Available version control tools
Open source: Git, Apache Subversion (aka svn), CVS, Mercurial.
Proprietary: Visual SourceSafe (VSS), etc.
Dropbox? Mostly for file backup and sharing, limited version control (1 month?).
We use Git in this course.
5 Git
Currently the most popular version control system according to Google Trends.
Initially designed and developed by Linus Torvalds in 2005 for Linux kernel development. “Git” is British English slang for an unpleasant person.
I’m an egotistical bastard, and I name all my projects after myself. First ‘Linux’, now ‘git’.
Linus Torvalds
6 Centralized vs distributed version control
Svn is a centralized version control system:
Git is a distributed version control system:
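One practical consequence of the distributed model is that every clone carries the full project history, so it stays usable even when the original repository is unreachable. The sketch below illustrates this in a throwaway directory; the repository names (central, laptop), the file name, and the user identity are made up for the demonstration.

```bash
set -eu

# Throwaway sandbox; all names below are hypothetical.
sandbox=$(mktemp -d)
cd "$sandbox"

# A "central" repository with two commits.
git init -q central
cd central
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "first draft" > notes.txt
git add notes.txt
git commit -q -m "Add notes"
echo "second draft" >> notes.txt
git commit -q -am "Revise notes"
cd ..

# Cloning copies the whole history, not just the latest snapshot.
git clone -q central laptop

# Delete the "central" copy: the clone can still browse all commits offline.
rm -rf central
history_count=$(git -C laptop rev-list --count HEAD)
echo "commits available in the clone: $history_count"   # prints 2
```

With a centralized system such as svn, by contrast, browsing history generally requires contacting the server.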
7 What do I need to use Git?
A Git server enabling multi-person collaboration through a centralized repository.
github.com: unlimited public repositories; private repositories cost money, but the Student Developer Pack provides unlimited private repositories for free.
bitbucket.org: unlimited public repositories, unlimited private repositories for academic accounts (register for free using your edu email).
We use github.com in this course for developing and submitting homework.
A Git client on your own machine.
Linux: A Git client is shipped with many Linux distributions, e.g., Ubuntu and CentOS. If not, install it using a package manager, e.g., yum install git on CentOS.
Do not rely entirely on a GUI or IDE. Learn to use Git on the command line, which is needed for cluster and cloud computing. RStudio has basic Git integration, but still cannot do tasks such as tagging, committing selected files within a folder, and so on.
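The two tasks mentioned above, committing only selected files within a folder and tagging, are short on the command line. This is a minimal sketch in a scratch repository; the folder, file names, tag name v0.1, and user identity are all invented for illustration.

```bash
set -eu

# Scratch repository; every name here is hypothetical.
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q demo
cd demo
git config user.name "Demo User"
git config user.email "demo@example.com"

# Two new files in the same folder...
mkdir analysis
echo "x <- 1" > analysis/clean.R
echo "y <- 2" > analysis/scratch.R

# ...but only one of them is staged and committed.
git add analysis/clean.R
git commit -q -m "Add data cleaning script"

# Annotated tag marking this commit.
git tag -a v0.1 -m "First tagged version"

committed_files=$(git ls-files)
tags=$(git tag --list)
echo "committed: $committed_files"   # prints analysis/clean.R
echo "tags: $tags"                   # prints v0.1
```

The file analysis/scratch.R stays in the working directory as an untracked file until it is explicitly added.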
8 Git workflow
9 Git survival commands
Synchronize local Git directory with remote repository:
```{bash}
#| eval: false
git pull
```
This is the same as git fetch followed by git merge.
Modify files in local working directory.
Add snapshots to staging area:
```{bash}
#| eval: false
git add FILES
```
Commit: store snapshots permanently in the (local) Git repository:
```{bash}
#| eval: false
git commit -m "MESSAGE"
```
Push commits to remote repository:
```{bash}
#| eval: false
git push
```
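The survival commands can be tried end to end without a GitHub account by letting a local bare repository stand in for the remote server. The following is a sketch under that assumption; the directory names (remote.git, desktop, laptop), the file report.Rmd, and the user identity are hypothetical.

```bash
set -eu

sandbox=$(mktemp -d)
cd "$sandbox"

# A bare repository plays the role of the remote server.
git init -q --bare remote.git

# Machine 1: modify, add, commit, push.
git clone -q remote.git desktop
cd desktop
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "draft 1" > report.Rmd
git add report.Rmd
git commit -q -m "Add first draft of report"
git push -q -u origin HEAD
cd ..

# Machine 2 gets its own full copy.
git clone -q remote.git laptop

# Machine 1 pushes more work.
cd desktop
echo "draft 2" >> report.Rmd
git add report.Rmd
git commit -q -m "Extend report with second draft"
git push -q
cd ..

# Machine 2 synchronizes with a single git pull...
cd laptop
git pull -q
after_pull=$(git rev-list --count HEAD)
cd ..

# ...or, after another push, with the two steps that pull combines.
cd desktop
echo "draft 3" >> report.Rmd
git commit -q -am "Add third draft"
git push -q
cd ../laptop
git fetch -q                      # download new commits from the remote
git merge -q --ff-only '@{u}'     # merge the upstream branch into the local one
after_fetch_merge=$(git rev-list --count HEAD)

echo "commits after pull: $after_pull"                 # prints 2
echo "commits after fetch+merge: $after_fetch_merge"   # prints 3
```

The --ff-only flag is a cautious choice for the demonstration: it makes the merge refuse to proceed unless the local branch can simply be moved forward.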
10 Git basic usage
Register for an account on a Git server, e.g., github.com.
Not too little: Make sure collaborators or yourself can reproduce everything on other machines.
Not too much: No need to put all intermediate files in repository. Make good use of the .gitignore file.
Strictly speaking, a version control system is for source files only. E.g., only xxx.Rmd, xxx.bib, and figure files are needed to produce a PDF file. The PDF file doesn’t need to be version controlled or, if version controlled, doesn’t need to be committed frequently.
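As a rough illustration of the "not too much" rule, a .gitignore file can keep generated output out of the repository while the source files are tracked. The ignore patterns and file names below are assumptions for a typical R Markdown project, not a prescribed list.

```bash
set -eu

sandbox=$(mktemp -d)
cd "$sandbox"
git init -q paper
cd paper

# Source files that should be tracked, plus generated files that should not.
touch paper.Rmd refs.bib paper.pdf paper.log
mkdir paper_cache
touch paper_cache/chunk1.rdb

# Hypothetical ignore rules: rendered output, logs, and cache folders.
cat > .gitignore <<'EOF'
*.pdf
*.log
*_cache/
EOF

# "git add ." now stages only the files that are not ignored.
git add .
git ls-files                       # lists .gitignore, paper.Rmd, refs.bib
tracked_count=$(git ls-files | wc -l | tr -d ' ')

# check-ignore exits with status 0 when a path matches an ignore rule.
git check-ignore -q paper.pdf && echo "paper.pdf is ignored"
```

Committing the .gitignore file itself means collaborators get the same rules.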
Commit early, commit often and don’t spare the horses.
Adding an informative message when you commit is not optional. Spending one minute on a commit message saves hours later for your collaborators and yourself. Read the following sentence to yourself 3 times:
Write every commit message like the next person who reads it is an axe-wielding maniac who knows where you live.
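One possible shape for an informative message is a short summary line followed by a body that explains why the change was made. Passing -m twice to git commit produces these two paragraphs. The file, the message text, and the user identity in this sketch are invented for illustration.

```bash
set -eu

sandbox=$(mktemp -d)
cd "$sandbox"
git init -q demo
cd demo
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "alpha <- 0.05" > analysis.R
git add analysis.R

# First -m: short summary line. Second -m: body explaining the reason.
git commit -q \
  -m "Set significance level to 0.05" \
  -m "Use the conventional threshold so results are comparable with earlier reports."

subject=$(git log -1 --format=%s)
body=$(git log -1 --format=%b)
echo "subject: $subject"
echo "body: $body"
```

Tools such as git log --oneline show only the summary line, so it should make sense on its own.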