AIM RSF R series: Data wrangling with dplyr and tidyr

class: center, middle, inverse, title-slide

.title[
# AIM RSF R series: Data wrangling with dplyr and tidyr
]
.subtitle[
## Based on Data Carpentry: R for Social Scientists
]
.author[
### Eirini Zormpa
]
.date[
### 15 November 2022 (last updated 2022-11-14)
]

---

# Summary of session 2: Starting with data in R

- ✅ Understand what tidy data are and why the tidy format is useful.
- ✅ Read data into R.
- ✅ Understand and manipulate `data frames`.
- ✅ Understand and manipulate `factors`.

---

# Learning objectives: Data wrangling with `dplyr` and `tidyr`

- ✅ Subset columns with `select` and rows with `filter`, and create new columns with `mutate`.
- ✅ Link the output of one function to the input of another function with the ‘pipe’ operator `%>%`.
- ✅ Combine datasets using the `dplyr` join functions, such as `left_join`.
- ✅ Reshape a dataframe from long to wide format with the `pivot_wider` function.
- ✅ Export a dataframe to a .csv or .tsv file.
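
The last three objectives can be sketched on toy data. This is a minimal illustration, not the session's own code: `toy_long` and `toy_codes` are hypothetical tibbles with made-up counts, and the use of `left_join`, `write_csv` and `write_tsv` is an assumption about which functions fit here.

```r
library(dplyr)
library(tidyr)
library(readr)

# Hypothetical long-format data: one row per country and indicator.
# The counts are made up for illustration.
toy_long <- tibble(
  country      = c("Austria", "Austria", "Belgium", "Belgium"),
  indicator    = c("cases", "deaths", "cases", "deaths"),
  weekly_count = c(100, 5, 200, 8)
)

# Hypothetical lookup table that shares the `country` column.
toy_codes <- tibble(
  country = c("Austria", "Belgium"),
  code    = c("AT", "BE")
)

# Join: keep every row of toy_long and add the matching `code`.
toy_joined <- toy_long %>%
  left_join(toy_codes, by = "country")

# Reshape from long to wide: one column per indicator.
toy_wide <- toy_joined %>%
  pivot_wider(names_from = indicator, values_from = weekly_count)

# Export to temporary files so nothing is written to the project folder.
csv_path <- tempfile(fileext = ".csv")
tsv_path <- tempfile(fileext = ".tsv")
write_csv(toy_wide, csv_path)
write_tsv(toy_wide, tsv_path)
```

After the reshape, `toy_wide` has one row per country, with `cases` and `deaths` as separate columns.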

---
class: center, middle, inverse

# Exercise 1

🕟 **5 mins**

Subset the `covid_data` tibble such that you keep only observations from the TESSy COVID-19 `source` and retain only the variables `country`, `indicator`, `year_week` and `weekly_count`.

---
class: center, middle, inverse

# Exercise 1 solution

```r
covid_data %>% 
  filter(source == "TESSy COVID-19") %>% 
  select(country, indicator, year_week, weekly_count)
```

Note that if you `select` before you `filter`, your code won't run.
That's because `select` drops the `source` column, which `filter` then needs.
When piping, **order matters**!
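
To see this for yourself, here is a minimal sketch on toy data. `toy_covid` is a hypothetical stand-in for `covid_data` with made-up values, and `tryCatch` is only there so that the failing pipeline does not stop the script.

```r
library(dplyr)

# Hypothetical stand-in for `covid_data`; the values are made up.
toy_covid <- tibble(
  country      = c("Austria", "Austria", "Belgium"),
  indicator    = c("cases", "deaths", "cases"),
  year_week    = c("2021-01", "2021-01", "2021-01"),
  weekly_count = c(100, 5, 200),
  source       = c("TESSy COVID-19", "Other source", "TESSy COVID-19")
)

# filter() first, then select(): works, because the `source` column
# still exists at the point where we filter on it.
filter_first <- toy_covid %>%
  filter(source == "TESSy COVID-19") %>%
  select(country, indicator, year_week, weekly_count)

# select() first drops the `source` column, so the later filter() fails.
# tryCatch() turns the error into a message instead of stopping the script.
select_first <- tryCatch(
  toy_covid %>%
    select(country, indicator, year_week, weekly_count) %>%
    filter(source == "TESSy COVID-19"),
  error = function(e) "filter() failed: the `source` column was dropped"
)
```

`filter_first` is a tibble with two rows, while `select_first` ends up holding the message from the error handler.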

---
class: center, middle, inverse

# Exercise 2

🕣 **10 mins**

Create a new tibble `deaths_2021` that contains the total deaths for each country in 2021, arranged such that the country with the most deaths is at the top.

---
class: center, middle, inverse

# Exercise 2 solution

```r
deaths_2021 <- covid_data_dates %>% 
  drop_na(weekly_count) %>% 
  mutate(year = year(from_date)) %>% 
  filter(year == 2021,
         indicator == "deaths") %>% 
  group_by(country) %>% 
  summarise(yearly_deaths = sum(weekly_count)) %>% 
  arrange(desc(yearly_deaths))
```
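
Why `drop_na` first? A minimal sketch on toy data, assuming `from_date` is a date column as in the solution above. `toy_dates` is a hypothetical stand-in for `covid_data_dates` with made-up counts.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Hypothetical stand-in for `covid_data_dates`; the counts are made up.
toy_dates <- tibble(
  country      = c("Austria", "Austria", "Belgium", "Belgium", "Belgium"),
  indicator    = c("deaths", "deaths", "deaths", "deaths", "cases"),
  from_date    = as.Date(c("2021-01-04", "2021-01-11", "2021-01-04",
                           "2020-12-28", "2021-01-04")),
  weekly_count = c(10, NA, 30, 99, 500)
)

# Same steps as the solution: missing counts are removed before summing.
toy_deaths_2021 <- toy_dates %>%
  drop_na(weekly_count) %>%
  mutate(year = year(from_date)) %>%
  filter(year == 2021,
         indicator == "deaths") %>%
  group_by(country) %>%
  summarise(yearly_deaths = sum(weekly_count)) %>%
  arrange(desc(yearly_deaths))

# Without drop_na(), a single NA makes the whole sum for that country NA.
without_drop <- toy_dates %>%
  filter(year(from_date) == 2021,
         indicator == "deaths") %>%
  group_by(country) %>%
  summarise(yearly_deaths = sum(weekly_count))
```

An alternative that avoids `drop_na` is `sum(weekly_count, na.rm = TRUE)` inside `summarise`.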

---

# Summary of packages we used today

- ✅ `readr` to read data into R and export it
- ✅ `dplyr` for a grammar of data manipulation
- ✅ `tidyr` to get data where variables are columns, observations are rows, and cells contain single values
- ✅ `magrittr` to get access to the `tidyverse` pipe `%>%`
- ✅ `stringr` to manipulate strings/characters
- ✅ `lubridate` to manipulate dates

---
class: center, middle

# Thank you for your attention ✨ 🙏

## See you next week for data visualisation with `ggplot2` 🎨