class: center, middle, inverse, title-slide

.title[
# AIM RSF R series: Data wrangling with dplyr and tidyr
]
.subtitle[
## Based on Data Carpentry: R for Social Scientists
]
.author[
### Eirini Zormpa
]
.date[
### 15 November 2022 (last updated 2022-11-14)
]

---

# Summary of session 2: Starting with data in R

- ✅ Understand what tidy data are and why the format is useful.
- ✅ Read data into R.
- ✅ Understand and manipulate `data frames`.
- ✅ Understand and manipulate `factors`.

---

# Learning objectives: Data wrangling with `dplyr` and `tidyr`

- ✅ Subset columns or rows with `select` or `filter` and create new columns with `mutate`.
- ✅ Link the output of one function to the input of another function with the ‘pipe’ operator `%>%`.
- ✅ Combine datasets using `dplyr`'s join functions.
- ✅ Reshape a data frame from long to wide format with the `pivot_wider` function.
- ✅ Export a data frame to a .csv or .tsv file.

---
class: center, middle, inverse

# Exercise 1

🕟 **5 mins**

Subset the `covid_data` tibble such that you keep only observations from the TESSy COVID-19 `source` and retain only the variables `country`, `indicator`, `year_week` and `weekly_count`.
---
class: center, middle, inverse

# Exercise 1 solution

```r
covid_data %>%
  filter(source == "TESSy COVID-19") %>%
  select(country, indicator, year_week, weekly_count)
```

--

Note that if you `select` before you `filter`, your code won't run: `select` drops the `source` column, so `filter` no longer has the variable it needs. When piping, **order matters**!

---
class: center, middle, inverse

# Exercise 2

🕣 **10 mins**

Create a new tibble `deaths_2021` that contains the total deaths for each country in 2021, arranged such that the country with the most deaths is at the top.
---
class: center, middle, inverse

# Exercise 2 solution

```r
deaths_2021 <- covid_data_dates %>%
  drop_na(weekly_count) %>%                        # remove weeks with no reported count
  mutate(year = year(from_date)) %>%               # year() comes from lubridate
  filter(year == 2021, indicator == "deaths") %>%
  group_by(country) %>%
  summarise(yearly_deaths = sum(weekly_count)) %>% # one total per country
  arrange(desc(yearly_deaths))                     # largest total first
```

---

# Summary of packages we used today

- ✅ `readr` to read data into R and export it
- ✅ `dplyr` for a grammar of data manipulation
- ✅ `tidyr` to get data where variables are columns, observations are rows, and cells contain single values
- ✅ `magrittr` to get access to the `tidyverse` pipe `%>%`
- ✅ `stringr` to manipulate strings/characters
- ✅ `lubridate` to manipulate dates

---
class: center, middle

# Thank you for your attention ✨ 🙏

## See you next week for data visualisation with `ggplot2` 🎨
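---

# Appendix: trying the exercise pipelines on toy data

Both exercise solutions can be tried without the course files. The sketch below is only an illustration: `covid_toy` and every value in it are made up, and only the column names are taken from the exercises. It also shows what happens when `select` comes before `filter`.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Hypothetical toy data: column names follow the exercises,
# the values are invented for illustration only.
covid_toy <- tibble(
  country      = c("A", "A", "B", "B", "B"),
  indicator    = c("deaths", "deaths", "deaths", "cases", "deaths"),
  source       = c(rep("TESSy COVID-19", 4), "Other"),
  year_week    = c("2021-01", "2021-02", "2021-01", "2021-01", "2020-53"),
  from_date    = as.Date(c("2021-01-04", "2021-01-11", "2021-01-04",
                           "2021-01-04", "2020-12-28")),
  weekly_count = c(10, NA, 30, 500, 7)
)

# Exercise 1 pattern: filter rows first, then keep four columns.
tessy <- covid_toy %>%
  filter(source == "TESSy COVID-19") %>%
  select(country, indicator, year_week, weekly_count)

# Reversed order: select() drops `source`, so filter() raises an error.
# tryCatch() turns that error into a message so the script keeps running.
wrong_order <- tryCatch(
  covid_toy %>%
    select(country, indicator, year_week, weekly_count) %>%
    filter(source == "TESSy COVID-19"),
  error = function(e) "filter() failed: `source` was dropped by select()"
)

# Exercise 2 pattern: total 2021 deaths per country, largest first.
deaths_toy <- covid_toy %>%
  drop_na(weekly_count) %>%
  mutate(year = year(from_date)) %>%
  filter(year == 2021, indicator == "deaths") %>%
  group_by(country) %>%
  summarise(yearly_deaths = sum(weekly_count)) %>%
  arrange(desc(yearly_deaths))

deaths_toy
```

With these made-up values, country B (30 deaths) is listed above country A (10 deaths). The week with a missing count is removed by `drop_na` before the totals are computed.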