Literate programming with R Markdown

Computation Skills Workshop

A judge’s desk labeled

Source: @AllisonHorst

Data science notebooks

A substantial debate

  • R vs. Python
  • Data analysis vs. software engineering

Jupyter Notebooks

A screenshot of the Jupyter Notebook interface, depicting code cells, executed output, and Markdown formatted text.

Source: Dataquest

Critiques about notebooks

  • Hidden state and out-of-order execution
  • Notebooks discourage modularity and testing
  • Hard to copy/paste into Slack/GitHub issues
  • Notebooks hinder reproducible and extensible science
  • Notebooks cannot easily be tracked under version control

Source: The first notebook war

Data science is not the same as software engineering

A glam rock band comprised of 3 fuzzy round monsters labeled as

Source: @AllisonHorst

R Markdown basics

---
title: "Gun deaths"
date: "`r lubridate::today()`"
output: html_document
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = FALSE)
```
```{r packages}
library(tidyverse)
library(rcfss)
theme_set(theme_minimal())
```
```{r youths}
youth <- gun_deaths %>%
  filter(age <= 65)
```
# Gun deaths by age
We have data about `r nrow(gun_deaths)` individuals killed by guns. Only `r nrow(gun_deaths) - nrow(youth)` are older than 65. The distribution of the remainder is shown below:
```{r youth-dist, echo = FALSE}
youth %>%
  ggplot(mapping = aes(age)) +
  geom_freqpoly(binwidth = 1)
```
# Gun deaths by race
```{r race-dist}
youth %>%
  ggplot(mapping = aes(fct_infreq(race) %>% fct_rev())) +
  geom_bar() +
  coord_flip() +
  labs(x = "Victim race")
```

Major components

  1. A YAML header surrounded by ---s
  2. Chunks of R code surrounded by ```
  3. Text mixed with simple text formatting using the Markdown syntax

Knitting process

Source: R for Data Science
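
Knitting can be thought of as two stages: knitr runs the code and writes Markdown, then pandoc converts that Markdown into the final format. The sketch below exercises only the first stage, using `knitr::knit(text = ...)` on a made-up in-memory document, so it does not need pandoc. The document content and the chunk name `sum-chunk` are illustrative assumptions, not part of the workshop materials.

```r
library(knitr)

# Three backticks, built with strrep() so the example stays easy to read.
fence <- strrep("`", 3)

# A made-up R Markdown document held in memory (illustrative only).
rmd <- c(
  "The answer is `r 1 + 1`.",
  "",
  paste0(fence, "{r sum-chunk}"),
  "sum(1:10)",
  fence
)

# Stage 1: knitr executes the chunk and the inline code,
# returning plain Markdown as a single string.
md <- knit(text = rmd, quiet = TRUE)
cat(md)

# Stage 2 (not run here): pandoc converts the Markdown to HTML, PDF, or Word.
# rmarkdown::render() performs both stages in one call.
```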

Text formatting with Markdown

  • Lightweight set of conventions for formatting plain text files
  • LaTeX, simplified
  • Demonstration Markdown file

Write a simple Markdown file

  • Edit my-cv.md to create a brief CV
  • Title should be your name
  • Create sections for education and employment (use headers)
  • Include bulleted list of jobs/degrees
  • Highlight the year in bold
08:00

Render and edit an R Markdown document

  • Render gun-deaths.Rmd as an HTML document
  • Add text describing the frequency polygon
03:00

Code chunks

```{r youth-dist, echo = FALSE, message = FALSE, warning = FALSE}
# code goes here
```
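  • Naming code chunks
  • Code chunk options
    • echo = FALSE: run the code and show results, but hide the source code
    • message = FALSE or warning = FALSE: suppress messages or warnings in the output
    • eval = FALSE: show the code without running it
    • include = FALSE: run the code, but omit both the code and its results from the output
    • error = TRUE: keep rendering when a chunk throws an error, and show the error
    • cache = TRUE: save the chunk's results and reuse them until the chunk's code changes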
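
To see what a few of these options do, the sketch below knits the same one-line chunk under different chunk headers and compares the Markdown that knitr produces. The helper `knit_chunk()` is hypothetical and exists only for this illustration; the expression `6 * 7` is an arbitrary stand-in for real analysis code.

```r
library(knitr)

# Three backticks, built with strrep() so the example stays easy to read.
fence <- strrep("`", 3)

# Hypothetical helper: knit a one-chunk document with the given
# chunk header and return the resulting Markdown as a string.
knit_chunk <- function(header) {
  knit(
    text = c(paste0(fence, "{", header, "}"), "6 * 7", fence),
    quiet = TRUE
  )
}

shown   <- knit_chunk("r")                # default: code and result both appear
no_echo <- knit_chunk("r, echo = FALSE")  # result appears, code is hidden
no_eval <- knit_chunk("r, eval = FALSE")  # code appears, nothing is run

cat(shown, no_echo, no_eval, sep = "\n---\n")
```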

Global options

knitr::opts_chunk$set(
  echo = FALSE
)
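
Several defaults can be set in one call, and `opts_chunk$get()` can confirm what is currently in effect. In this minimal sketch, adding `message` and `warning` is an assumption made for illustration; the slide itself only sets `echo`.

```r
library(knitr)

# Defaults set here apply to every later chunk unless that chunk overrides them.
opts_chunk$set(
  echo = FALSE,
  message = FALSE,  # assumption: also hide package startup messages
  warning = FALSE   # assumption: also hide warnings
)

# Inspect a current default to confirm the change took effect.
opts_chunk$get("echo")
```

A single chunk can still opt back in, for example with `echo = TRUE` in its own header.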

Inline code

We have data about `r nrow(gun_deaths)` individuals killed by guns. Only `r nrow(gun_deaths) - nrow(youth)` are older than 65.

We have data about 100798 individuals killed by guns. Only 15687 are older than 65.
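
Large counts are easier to read with thousands separators. One possible approach, sketched here with the two numbers from the rendered sentence above, is to wrap the inline expression in a small formatting helper. The name `comma()` is a hypothetical helper, not something the workshop defines.

```r
# Counts copied from the rendered sentence on the slide.
n_total <- 100798
n_older <- 15687

# format() with big.mark inserts thousands separators.
format(n_total, big.mark = ",")

# Hypothetical helper intended for use inside inline code.
comma <- function(x) format(x, big.mark = ",", scientific = FALSE)
comma(n_older)
```

In the document this could be written as `` `r comma(nrow(gun_deaths))` ``.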


Customize code chunk options

  • Set echo = FALSE as a global option
  • Enable caching as a global option and render the document. Look at the file structure for the cache. Now render the document again. Does it run faster?
07:00

YAML header

---
author: Benjamin Soltoff
date: '2021-11-12'
title: Gun deaths
output: github_document
---
  • YAML Ain't Markup Language (originally "Yet Another Markup Language")
  • Standardized format for storing hierarchical data in a human-readable syntax
  • Defines how rmarkdown renders your .Rmd file

HTML document

---
author: Benjamin Soltoff
date: '2021-11-12'
title: Gun deaths
output: html_document
---

Table of contents

---
author: Benjamin Soltoff
date: '2021-11-12'
title: Gun deaths
output:
  html_document:
    toc: true
    toc_depth: 2
---
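
Because YAML is hierarchical, indentation determines which options belong to which output format. As a rough illustration, the sketch below parses the header with the `yaml` package (an assumption: the slides do not use this package directly) and shows that `toc` and `toc_depth` end up nested under `html_document`.

```r
library(yaml)

# The header from the slide, with nesting expressed by two-space indentation.
header <- "
author: Benjamin Soltoff
date: '2021-11-12'
title: Gun deaths
output:
  html_document:
    toc: true
    toc_depth: 2
"

# yaml.load() turns the text into a nested R list.
meta <- yaml.load(header)

meta$title
meta$output$html_document$toc_depth

# The structure mirrors the indentation of the header.
str(meta$output)
```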

Appearance and style

---
author: Benjamin Soltoff
date: '2021-11-12'
title: Gun deaths
output:
  html_document:
    theme: readable
    highlight: pygments
---

Customize YAML header

  • Add a table of contents
  • Use the "cerulean" theme
  • Modify the figures so they are 8x6 inches (width x height)
07:00

PDF document

---
author: Benjamin Soltoff
date: '2021-11-12'
title: Gun deaths
output: pdf_document
---
  • Requires installation of LaTeX
  • tinytex::install_tinytex()

Core R Markdown formats

Output format      Creates
html_document      .html
pdf_document       .pdf
word_document      Microsoft Word (.docx)
md_document        Markdown
github_document    Markdown for GitHub

Extensions of R Markdown

Slide presentations

render()

rmarkdown::render("my-document.Rmd", output_format = "html_document")
rmarkdown::render("my-document.Rmd", output_format = "all")
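
`output_format` also accepts a character vector of format names, which renders each format in turn. The following is a minimal sketch under stated assumptions: the document content is made up, it is written to a temporary directory, and rendering is skipped when pandoc is not available on the machine.

```r
library(rmarkdown)

# Write a throwaway document to a temporary directory (illustrative content).
dir <- tempfile("render-demo")
dir.create(dir)
input <- file.path(dir, "my-document.Rmd")
writeLines(
  c("---", "title: Demo", "---", "", "Two plus two is `r 2 + 2`."),
  input
)

# Rendering needs pandoc, so only attempt it when pandoc can be found.
if (pandoc_available()) {
  # One call, two formats; render() returns the paths of the files it wrote.
  outputs <- render(
    input,
    output_format = c("html_document", "md_document"),
    quiet = TRUE
  )
  print(basename(outputs))
}
```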

Rendering different document formats

  • Render gun-deaths.Rmd as
    • HTML document
    • PDF document
    • Word document
05:00

R Markdown notebooks

  • Interactive version of an R Markdown document
  • *.nb.html
  • More similar to Jupyter Notebook
  • Still plain-text files for version control

R scripts

# gun-deaths.R
# 2017-02-01
# Examine the distribution of age of victims in gun_deaths

# load packages
library(tidyverse)
library(rcfss)

# keep victims aged 65 or younger
youth <- gun_deaths %>%
  filter(age <= 65)

# number of victims older than 65 (those not kept in youth)
nrow(gun_deaths) - nrow(youth)

# graph the age distribution of youth
youth %>%
  ggplot(aes(age)) +
  geom_freqpoly(binwidth = 1)

# graph the number of youth victims by race
youth %>%
  ggplot(aes(fct_infreq(race) %>% fct_rev())) +
  geom_bar() +
  coord_flip() +
  labs(x = "Victim race")

When to use a script

  • Software development
  • For troubleshooting
  • Initial stages of project
  • Building a reproducible pipeline
  • It depends

Running scripts

  • Interactively
  • Programmatically using source()
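
A small sketch of running a script programmatically with `source()`. The script here is a made-up stand-in written to a temporary file, and evaluating it in a separate environment via `local =` is one possible design choice that keeps its objects out of the global workspace.

```r
# Create a small stand-in script (hypothetical content) in a temporary file.
script <- tempfile(fileext = ".R")
writeLines(
  c("ages <- c(12, 34, 70, 81)", "n_older <- sum(ages > 65)"),
  script
)

# source() runs the file from top to bottom; local = env keeps the
# objects it creates in a separate environment.
env <- new.env()
source(script, local = env)
env$n_older

# From a terminal, the same file could be run non-interactively with:
#   Rscript path/to/script.R
```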