AIM RSF R series: Starting with data in R

class: center, middle, inverse, title-slide

.title[
# AIM RSF R series: Starting with data in R
]
.subtitle[
## Based on Data Carpentry: R for Social Scientists
]
.author[
### Eirini Zormpa
]
.date[
### 8 November 2022 (last updated 2022-11-08)
]

---

# Summary of session 1: Introduction to R

- ✅ Navigate the RStudio Graphical User Interface (GUI).
- ✅ Install `packages` to access additional functionality.
- ✅ Perform simple arithmetic calculations in R.
- ✅ Understand programming terms, like `objects`, `functions`, `arguments` and `vectors`.
- ✅ Create and manipulate vectors.
- ✅ Learn basic ways to work with missing data.

---

# Learning objectives: Starting with data in R

- ✅ Read data into R.
- ✅ Understand and manipulate `data frames`.
- ✅ Understand and manipulate `factors`.
- ✅ Alternate between date formats.

---

# Data frames

**Data frames** are the standard data structure for tabular data in `R`. 
--
They look very similar to spreadsheets (like in Excel) but each column is, in fact, a vector.
Each vector needs to be of the same length, for a perfectly rectangular shape ◽ ◾ ⬛.

Note that because each column is a vector, all the values within a column must be of the *same type* (different columns can have different types).
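
As a minimal sketch of this idea, a small data frame can be built from vectors of equal length. The name `toy_data` and all the values below are invented for illustration and are not part of the workshop data.

```r
# a toy data frame built from three vectors of the same length
# (names and values are invented for illustration)
toy_data <- data.frame(
  country = c("A", "B", "C"),
  weekly_count = c(10, 25, 7),
  reported = c(TRUE, TRUE, FALSE)
)

# each column is a vector holding a single type of data
column_types <- sapply(toy_data, class)
column_types

# pulling out one column with $ gives back a plain vector
is.vector(toy_data$weekly_count)

# vectors of different lengths cannot be combined into a data frame
mismatch <- try(data.frame(x = 1:3, y = 1:2), silent = TRUE)
inherits(mismatch, "try-error")
```

The last two lines use `try()` only so that the deliberate error does not stop the script.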

## A note on terminology

Technically, what we will be working with in these workshops aren't `data.frames` but `tibbles`.
`tibbles` are basically data frames for the `tidyverse`. They have some subtle differences, but nothing you need to be aware of at this point.

---

# Tabular data: What is tidy data?

.footnote[Illustrations from the [Openscapes](https://www.openscapes.org/) blog [Tidy Data for efficiency, reproducibility, and collaboration](https://www.openscapes.org/blog/2020/10/12/tidy-data/) by Julia Lowndes and Allison Horst.]

---

# Tabular data: Why tidy data?

---

# Tabular data: File formats

.pull-left[

### Comma delimited

Comma-separated values files (.csv) are plain text files in which the columns are separated by commas.

👍🏼 commonly used

👎🏼 annoying when data itself contains commas

]

.pull-right[

### Tab delimited

Tab-separated values files (.tsv) are plain text files in which the columns are separated by tabs (\t).

👍🏼 no confusion when data contains commas or semicolons

👎🏼 not very commonly used (at least not yet)

]
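
To see the difference between the two formats, the sketch below writes the same toy data to a temporary `.csv` and a temporary `.tsv` file using base R. The data frame `toy_data` and its values are invented for illustration.

```r
# toy data in which one value contains a comma (invented for illustration)
toy_data <- data.frame(
  country = c("Korea, South", "France"),
  weekly_count = c(10, 25)
)

# write the same data as comma- and as tab-delimited text in temporary files
csv_file <- tempfile(fileext = ".csv")
tsv_file <- tempfile(fileext = ".tsv")
write.csv(toy_data, csv_file, row.names = FALSE)
write.table(toy_data, tsv_file, sep = "\t", row.names = FALSE)

# look at the raw text: in the .csv file the value containing a comma
# is only readable because it is wrapped in quotes
readLines(csv_file)
readLines(tsv_file)

# both files read back into the same two-column data frame
from_csv <- read.csv(csv_file)
from_tsv <- read.delim(tsv_file)
all.equal(from_csv, from_tsv)
```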

---

# The data

The dataset contains historical data on COVID-19 positive cases and deaths worldwide.
It was made available by the European Centre for Disease Prevention and Control (ECDC).

It covers the period from 1 January 2020 to 20 June 2022.

.footnote[Source: https://www.ecdc.europa.eu/en/publications-data/download-historical-data-20-june-2022-weekly-number-new-reported-covid-19-cases]

---

# The data: variables

| variable | description |
|----------|-------------|
| country  | which country the data come from |
| country_code | a three-letter code for the country the data come from |
| continent | the continent in which the reporting country is located |
| population | the population of the reporting country according to Eurostat for Europe and the World Bank for the rest of the world |
| indicator | whether the observation is a positive case or a death |
| weekly_count | the number of positive cases or deaths in the week of reporting |
| year_week | the year and week when the observations occurred |
| rate_14_day | the rate of positive cases or deaths in the preceding 14 days |
| cumulative_count | the total number of cases or deaths from the start of data collection |
| source | what data source the data come from |

---
class: inverse

# Importing data: Folders

1. Double-click on the R Project you created for the workshop to open RStudio.
2. Check that the files you see in your `Files` tab are the right ones (you should only see the `scripts` folder and the `.Rproj` file).
3. Go to the console and type the following commands:

```r
# create separate folders for the raw and clean data
dir.create("data_raw")
dir.create("data_clean")

# only if you don't have one already, create a folder for the scripts
dir.create("scripts")
```

---
class: inverse

# Importing data: Download

Then we need to 1) download the data from the ECDC website and 2) save it in the `data_raw` folder we just created.

We can do both in one go in R by typing the following command in the console:

```r
# download the data
download.file(url = "https://opendata.ecdc.europa.eu/covid19/nationalcasedeath_archive/csv/data.csv",
              destfile = "data_raw/covid-data.csv")
```

After you have run this command, open the `data_raw` folder and check that there is a file called `covid-data.csv`.
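
Once the file is in place, it can be read into R. The sketch below is one possible way to do this, assuming the object name `covid_data` that the exercises use. It relies on base R's `read.csv()`, which returns a `data.frame`; `read_csv()` from the `readr` package (part of the `tidyverse`) is the equivalent that returns a tibble. The `file.exists()` check is only there so that the code does not fail if the download step was skipped.

```r
# path to the file saved by download.file()
covid_file <- "data_raw/covid-data.csv"

if (file.exists(covid_file)) {
  # read the csv file into a data frame
  # ("UTF-8-BOM" also copes with a byte-order mark, should the file start with one)
  covid_data <- read.csv(covid_file, fileEncoding = "UTF-8-BOM")

  # quick checks: dimensions, column types and the first rows
  print(dim(covid_data))
  str(covid_data)
  print(head(covid_data))
} else {
  message("File not found: run the download step first.")
}
```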

---
class: center, middle, inverse

# Exercise 1

🕕 **5 mins**

Create a tibble containing only the last 250 observations of `covid_data` (all the columns but only the last 250 rows).

Hint: you can get the number of rows in a dataset with `nrow`, e.g. `number_rows <- nrow(covid_data)`

---
class: center, middle, inverse

# Exercise 1 solution

There are multiple ways to solve this exercise.

#### Solution 1

```r
e1_s1_data_end <- nrow(covid_data)
e1_s1_data_start <- e1_s1_data_end - 249

e1_s1_data <- covid_data[e1_s1_data_start:e1_s1_data_end, ]
```

#### Solution 2

```r
e1_s2_data <- tail(covid_data, n = 250)
```
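
A quick way to convince yourself that the two solutions agree is to try them on a small stand-in dataset. The sketch below uses a made-up data frame called `toy_data` with 300 rows, not the real `covid_data`.

```r
# stand-in data with 300 rows (invented for illustration)
toy_data <- data.frame(id = 1:300, value = (1:300) * 2)

# solution 1: index the rows from (number of rows - 249) to the last row
toy_end <- nrow(toy_data)
toy_start <- toy_end - 249
toy_s1 <- toy_data[toy_start:toy_end, ]

# solution 2: let tail() do the counting
toy_s2 <- tail(toy_data, n = 250)

# both contain 250 rows, starting at row 51 of the original data
nrow(toy_s1)
toy_s1$id[1]
all.equal(toy_s1, toy_s2)
```

Subtracting 249 rather than 250 matters here: the range `start:end` includes both ends, so `end - 249` gives exactly 250 rows.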

---
class: center, middle, inverse

# Exercise 2

⏱ **3 mins**

Calculate the minimum and maximum values for the countries' populations.

Hint: the functions you need here are `min` and `max`.

---
class: center, middle, inverse

# Exercise 2 solution

```r
country_populations <- covid_data$population

min_population <- min(country_populations)
max_population <- max(country_populations)
```
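
If a column contains missing values, `min` and `max` return `NA` unless told otherwise. Whether that applies to the `population` column depends on the data you downloaded, so the sketch below uses a made-up vector, `toy_populations`, to show the behaviour.

```r
# a toy vector of populations with one missing value (invented numbers)
toy_populations <- c(5000000, NA, 67000000, 340000)

# with a missing value present, min() returns NA
min(toy_populations)

# na.rm = TRUE drops the missing values before calculating
toy_min <- min(toy_populations, na.rm = TRUE)
toy_max <- max(toy_populations, na.rm = TRUE)

# range() returns the minimum and the maximum in one go
toy_range <- range(toy_populations, na.rm = TRUE)
toy_range
```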

---

# Factors

R has a special data class, called **factor**, to deal with *categorical data*.
Factors are very useful and contribute to making R particularly well suited to working with data.

Factors are stored as **integers** associated with labels. 
--
They can be ordered (ordinal) or unordered (nominal). 
--
Factors create a structured relation between the different levels (values) of a categorical variable, such as days of the week or responses to a question in a survey. 
--
Once created, factors can only contain a pre-defined set of values, known as **levels**.

While factors look (and often behave) like character vectors, they are actually treated as integer vectors by R. **So you need to be very careful when treating them as strings.**
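
The sketch below illustrates these points with made-up values (`responses` and `year_factor` are invented for illustration). It shows how levels are set, how the underlying integers can be inspected, and why converting a factor directly to numbers can give surprising results.

```r
# a character vector of survey-style responses (invented for illustration)
responses <- c("low", "high", "medium", "low", "high")

# convert to a factor; by default the levels are sorted alphabetically
responses_factor <- factor(responses)
levels(responses_factor)

# set the order of the levels explicitly and mark the factor as ordered
responses_ordered <- factor(responses,
                            levels = c("low", "medium", "high"),
                            ordered = TRUE)
levels(responses_ordered)

# under the hood, each value is stored as the integer position of its level
as.integer(responses_ordered)

# pitfall: a factor with number-like labels converts to level positions,
# not to the numbers shown in the labels
year_factor <- factor(c("2020", "2022", "2021", "2020"))
as.numeric(year_factor)                 # positions of the levels
as.numeric(as.character(year_factor))   # the numbers in the labels
```

Converting to character first, as in the last line, is a common way to recover the numbers safely.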

---

# Dates

To avoid ambiguity, use the [RFC3339](https://datatracker.ietf.org/doc/html/rfc3339) date format: **YYYY-MM-DD** (ISO 8601 also allows the compact form YYYYMMDD).
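
As a minimal sketch of switching between date formats in base R, the example below reads a date written as day/month/year and writes it back out in other formats. The variable names are invented for illustration.

```r
# a date written day/month/year
date_text <- "08/11/2022"

# tell R how to read the text, using format codes:
# %d = day, %m = month, %Y = four-digit year
workshop_date <- as.Date(date_text, format = "%d/%m/%Y")
class(workshop_date)

# write the same date back out in other formats
iso_extended <- format(workshop_date, "%Y-%m-%d")   # "2022-11-08"
iso_basic <- format(workshop_date, "%Y%m%d")        # "20221108"
us_style <- format(workshop_date, "%m/%d/%Y")       # "11/08/2022"

# Date objects print in YYYY-MM-DD form by default
workshop_date
```

Without the `format` argument, R would have to guess which number is the day and which is the month, which is exactly the ambiguity a standard format avoids.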

.footnote[This [image](https://en.m.wikipedia.org/wiki/File:Date_format_by_country_revised.svg) was created by cmglee, Canuckguy and many others for [Wikimedia Commons](https://commons.wikimedia.org/wiki/Main_Page) and is used under a [CC-BY-SA licence](https://creativecommons.org/licenses/by-sa/4.0/)]

---

# Summary

- ✅ Read data into R.
- ✅ Understand and manipulate `data frames`.
- ✅ Understand and manipulate `factors`.
- ✅ Alternate between date formats.

---
class: center, middle

# Thank you for your attention ✨ 🙏

## See you next week for data wrangling with `dplyr` and `tidyr` 👋

---
# References

Lowndes, J. and A. Horst (2020). _Tidy data for efficiency,
reproducibility and collaboration_. URL:
[https://www.openscapes.org/blog/2020/10/12/tidy-data/](https://www.openscapes.org/blog/2020/10/12/tidy-data/).

Wickham, H. (2014). _Tidy Data_. In: _Journal of Statistical Software_ 59.10, pp. 1-23. DOI:
[10.18637/jss.v059.i10](https://doi.org/10.18637%2Fjss.v059.i10).