2. Data management and data wrangling

Author

Matthew Ivory

Three important things to remember:

  1. As you complete the handout, please don’t just read the commands, please type every single one of them. This is really important: Learning to program is like practicing a conversation in a new language. You will improve gradually, but only if you practice.

  2. If you’re stuck with something, please write down your questions (to share later in class) and try to solve the problem. Please ask your group members for support and, conversely, if another student is stuck, please try to help them out, too. This way, we develop a supportive learning environment that benefits everybody. In addition, get used to the Help pages in RStudio and start finding solutions online (discussion forums, online textbooks, etc.). This is really important, too. You will only really know how to do quantitative research and statistical analyses when you are doing your own research and dealing with your own data. At that point, you need to be sufficiently autonomous to solve problems, otherwise you will end up making very slow progress in your PhD.

  3. Finally, if you do not complete the handout in class, please complete the handout at home. This is important as we will assume that you know the material covered in this handout. And again, the more you practice the better, so completing these handouts at home is important.

Step 0: Installing tidyverse

I mentioned tidyverse in the lecture, and now we will install and load it before using it (mainly for pipes!).

The first command is a one-off (on a per-machine basis): it only needs to be run when we first want a package. As noted before, there are thousands of functions available to R; having every package pre-installed would overwhelm your computer, and you don’t need every single one.

install.packages("tidyverse")

There is little to know about this at this stage, as the function does a lot of the legwork for us. It looks on CRAN (the Comprehensive R Archive Network), which contains the approved R packages, and downloads the package so you can use all the functions it contains.

As stated before, tidyverse is a collection of packages, and we will need to understand that later on, but for now: no need.
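If you are not sure whether a package is already installed, you can check from the console before (re)installing. Here is a small sketch using base R only:

```r
# installed.packages() lists every package in your library folders;
# we check whether "tidyverse" appears among the row names.
"tidyverse" %in% rownames(installed.packages())

# A common console pattern: install only if the package is missing.
if (!"tidyverse" %in% rownames(installed.packages())) {
  install.packages("tidyverse")
}
```

Like install.packages() itself, this is a console check, not something to keep in a script.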

Now to load the package so we can use the functions:

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.4     ✔ purrr   1.0.2
✔ tibble  3.2.1     ✔ dplyr   1.1.2
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

As above, you will get a bunch of messages in the console; we can reasonably ignore these for now.

Note

As a general point, we run install.packages() in the console and place library() at the top of a script. The next section should make the difference a bit clearer.

Step 1: Scripts

A script is essentially a sequence of commands that we want R to execute. As Winter (2019) points out, we can think of our R script as the recipe and the R console as the kitchen that cooks according to this recipe. Let’s try out the script editor and write our first script. Typing commands in the console is good for one-off commands (maybe to check the class() or to install.packages()), but the script is better for keeping the steps in order.

When working in R, try to work as much as possible in the script. This will be a summary of all of your analyses, which can then be shared with other researchers, together with your data. This way, others can reproduce your analyses.

Thus far, you have typed your command lines in the console. This was useful to illustrate how R works, but in most of your analyses you won’t type much in the console. Instead, we will use the script editor.

The script editor is the pane on the top left of your window. If you don’t see it, you need to open a new script first. For this, press Cmd+Shift+N (Mac) or Ctrl+Shift+N (Windows). Alternatively, in the menu, click File > New File > R Script.

In the script editor (not the console), type the following command on line 1 and press Return (Mac) / Enter (Windows).

2 + 3

As you can see, nothing happened. There is no output in the Console pane; the cursor just moved to the next line in the script editor (line 2). This is because you did not execute the script.

To execute a command in the script editor, you need to place your cursor anywhere on the line you wish to execute and then click the Run icon in the Script editor pane. If you do this, then the following output will appear in your Console.

You can also run the current command line or selection in the script by pressing Cmd+Return (Mac) or Ctrl+Enter (Windows). This will also send your command from the script editor to the console. (I suggest using the shortcut, it’s much more efficient.)

In the script, you can have as many lines of code as you wish. For example, you can add the following three commands to your script.

scores <- c(145, 234, 653, 876, 456) 

mean(scores)

sd(scores)

To execute each one separately, just go to the line in question and click the Run icon or, even better, press the keyboard shortcut.

You can also run multiple commands in one go. For this, highlight several lines and then press the Run icon (or the keyboard shortcut). Try it with the above three lines.

To execute all commands in the script, you click the Source icon (next to the Run icon) in the Script editor pane. Or just use the shortcut Cmd+Option+R (Mac) or Ctrl+Alt+R (Windows).

Multiline commands

Using the script editor is particularly useful when we write long and complex commands. The example below illustrates this nicely.

This is a fairly long command, written in the console in one line.

df <- data.frame(name = c('jane', 'michaela', 'laurel', 'jaques'), age = c(23, 25, 46, 19), occupation = c('doctor', 'director', 'student', 'spy'))

but in a multiline format:

df <- data.frame(name = c('jane', 'michaela', 'laurel', 'jaques'), 
                 age = c(23, 25, 46, 19), 
                 occupation = c('doctor', 'director', 'student', 'spy'))

Note the indentation; RStudio adds it automatically, as it recognises what is grouped by the parentheses.

Comments

An important feature of R (and other programming languages) is the option to write comments in code files. Comments are notes, written around the code, that are ignored when the script is executed. In R, anything following a # symbol on a line is treated as a comment. This means that a line starting with # is ignored when the code is run. And if we place a # at any point in a line, anything after the hash is also ignored. The following code illustrates this.

Comments are really useful for writing explanatory notes to ourselves or others.

# Here is data frame with three variables.
# The variables refer to the names, ages, and occupations of the participants.
df <- data.frame(name = c('jane', 'michaela', 'laurel', 'jaques'), 
                 age = c(23, 25, 46, 19),
                 occupation = c('doctor', 'director', 'student', 'spy'))

or

2 + 3 #This is addition in R.

Code sections

To make your script even clearer, you can use code sections. These divide up your script into sections as in the example below. To create a code section, go the line in the script editor where you would like to create the new section, then press Cmd+Shift+R (Mac) or Ctrl+Shift+R (Windows). Alternatively, in the Menu, select Code > Insert Section.

The lines with the many hyphens create the sections.

# Create vectors ---------------------------------------------------

scores_test1 <- c(1, 5, 6, 8, 10) # These are the scores on the pre-test.
scores_test2 <- c(25, 23, 52, 63) # These are the scores on the post-test.

# A few calculations -----------------------------------------------

mean_test1 <- mean(scores_test1)
mean_test2 <- mean(scores_test2)

round(mean_test1 - mean_test2) # The difference between pre and post-tests.

Once you have created a section, you can ask R to run only the code in a specific region. This is because R recognizes script sections as distinct regions of code.

To run the code in a specific section, first go to the section in question (e.g., the section called # A few calculations ----) and then press Cmd+Option+T (Mac) or Ctrl+Alt+T (Windows). You can also use the menu: Code > Run Region > Run Section. Have a go to see if this works out well.

Saving scripts

Finally, you can also save your script. To do this, just click the Save icon in the Script editor pane or press Cmd+S (Mac) or Ctrl+S (Windows). The script can be named anything, but it is often recommended to use lowercase letters, numbers and underscores only. (That is, no spaces, hyphens, dots, etc.)

The script is saved in the .R format in your directory. If you later double click it, the file will open in RStudio by default, but you can also view and edit the file in any plain text editor.

Step 2: A bit more on packages

It’s important to acknowledge the important work done by the developers who make R packages freely available as open source. When you use a package for your analyses (e.g., tidyverse or lme4), you should acknowledge their work by citing them in your output (dissertation, presentation, articles, etc.). You can find the reference for each package via the citation() function, as in the examples below.

citation("tidyverse")

citation("lme4")

You can also install packages by using the Packages tab in the Files, Plots, Packages, etc. pane. As you see in the figure below, the base package is already installed. You can install more packages by scrolling through the list (or using the search option to narrow down the choices) and then selecting the tick box to the left of the package. If you do this, you will see that clicking runs the install.packages() command in the console.

As I mentioned above, run install.packages() in the console as a one-off command; you do not need to run it every time you want to use a package. Every time we want to use a package in a given session, however, we do need to tell R to load it, which is why we put library() at the top of the script, so we can use the functions.

Step 3: Working directories and clean workspaces

Every R session has a working directory. This is essentially the directory or folder from which files are read and to which files are written.

You can find out your working directory by typing the following command. Your output will obviously look different from the one below, which refers to my machine.

getwd()
[1] "/Users/ivorym/Documents/PhD/Teaching/23_24/FASS512/Worksheets"

You can also use a command to list the contents of the working directory. (Alternatively, you can see your directory by using the Files tab in the Files, Packages, Plots, etc. pane.)

list.files()

I suggest you create a new working directory on your computer and then use it for the entire course. Important files related to your R tasks (scripts, data, etc.) should later be downloaded to this folder.

The first step is for you to create a folder called FASS512 (or similar) in a sensible place on your computer. You can do this by going to the Files tab (in the Files, Packages, etc. pane) and clicking the “Create a new folder” icon. Place each weekly set of files in its own weekly folder.

Once you have created the FASS512 folder, go to the menu to set the default working directory to this new folder. The easiest way is to go to the menu, RStudio > Preferences. This should call up the following window.

In the window, click the Browse button and set the default working directory to the folder you just created.
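You can also change the working directory with a command, setwd(). The sketch below uses a temporary folder so it runs on any machine; in practice you would pass setwd() the path to your own course folder (a path like "~/Desktop/FASS512" is just an illustration, not a requirement):

```r
old_wd <- getwd()   # remember where we started

# Change the working directory for this session.
# Here we use a temporary folder so the sketch runs anywhere;
# normally you would pass the path to your own course folder,
# e.g. setwd("~/Desktop/FASS512").
setwd(tempdir())
getwd()             # confirm the change took effect

setwd(old_wd)       # change back to where we were
```

Note that setwd() only affects the current session; the Preferences window sets the default for every new session.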

Step 4: Loading data

When we are dealing with data in our analyses, we usually begin by importing a data file. R allows you to import data files in many different formats, but the most likely ones are .csv and .xlsx.

I have uploaded several data files to our Moodle page. Please go to the folder called “Data sets to download for this session” in the section for today’s session, then download the files in the folder and place them in your working directory (the folder you just created). The files are from Winter (2019) and Fogarty (2019).

Let’s try out loading data files. In the examples below, you will import three types of files: .csv, .txt, and .xlsx. Remember: You need to download the data files from our Moodle page and place them in your working directory first. Otherwise, you cannot import the files from your directory into R.

CSV

We can use the read_csv() function from readr (part of the tidyverse) to load data that is in .csv format. The command below will load the data set (‘nettle_1999_climate.csv’) and create a new label for this data set (languages). There is also a read.csv() function in base R, but it is slower and not as ‘smart’ as read_csv().

languages <- read_csv('nettle_1999_climate.csv')

Alternatively, you can load data files by clicking File > Import Dataset > From Text (readr). In the dialogue window, then click browse and select the file nettle_1999_climate.csv. You can change the name of the data set in the text box at the bottom left, below Import Options, where it says Name.

Note

I am giving you these alternative GUI-based methods for carrying out the same steps as those written in the script. I offer them to highlight how things can be done in many ways, but preferably you will use the script for pretty much everything. This creates a record of the commands needed to reproduce your analysis, which is better for future researchers (including you in a week’s time).

TXT

The data file you just imported is in the .csv format. You can import data from files in other formats, too. If the data is in .txt format, you can simply use the following command.

text_file <- read_table('example_file.txt') #(Note: Ignore warning message in the console.)

The command creates a new data set called text_file.

xlsx

If the data is an Excel spreadsheet (e.g., .xlsx format), you can proceed as follows. Ideally it shouldn’t be, as .csv is a universal file format that can be read across many machines. As a general rule, it is important to use these universal file types (csv, txt, pdf, html, …) for better reproducibility and data management (Towse et al., 2021).1

library(readxl) #you may need to run install.packages("readxl") first

spreadsheet_exl <- read_excel('simd.xlsx', sheet = 'simd')

First, you need to install and load the readxl package. Then, you create a new data set called spreadsheet_exl by using the read_excel() function.

Note: Since spreadsheets have multiple sheets, you need to specify the name of the sheet you would like to import by using the sheet argument. In our case, the sheet is called simd, hence sheet = ‘simd’.
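If you don’t know what a spreadsheet’s sheets are called, readxl provides excel_sheets() to list them. The sketch below uses a small demo spreadsheet that ships with readxl itself (via readxl_example()), so it runs without downloading anything:

```r
library(readxl)

# readxl bundles a few demo files; readxl_example() returns the full path.
demo_xlsx <- readxl_example("datasets.xlsx")

# List the sheet names in the file...
excel_sheets(demo_xlsx)

# ...then read one of them by name with the sheet argument.
demo_sheet <- read_excel(demo_xlsx, sheet = excel_sheets(demo_xlsx)[1])
head(demo_sheet)
```

With your own file, you would run excel_sheets('simd.xlsx') to see the available sheet names before choosing one.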

RStudio can handle many other file extensions to import datasets. You can find out information on how to import other file types by using the R help function (or by searching on Google).

Step 5: Examining datasets

If you have followed the steps above, you will have imported three data sets, languages, spreadsheet_exl, and text_file. You can now start exploring the data. We will focus on languages as an example.

Every time you import data, it’s good to check the content, just to make sure you imported the correct file.

The easiest way to do this is by using the View() function, which lets you inspect the data set in the script editor. Note: The function requires a capital V. If you have tidyverse loaded (which we do), there is a view() function as well; the two are functionally equivalent. Use whichever you prefer, but View() will always work.

If you run the commands below, you will see that the data (a table) is shown in a tab of the script editor, rather than being printed in the console.

Note

Remember what I have said previously about some content being better off in the console rather than the script? This is another example of what to put in the console instead (like class() or install.packages()).

Why? Great question: because it’s a one-off command that we don’t need in our script. It’s a sanity check, like class(), and it doesn’t add anything of value to the script. The script should be the minimum series of commands required to go from one stage to another. Taking a visual look at a dataframe is superfluous to the actual analysis.

View(spreadsheet_exl) 
View(languages)

You can also inspect your data by visiting the Environment tab in the Environment, History, Connections, etc. pane. As you can see in the figure below, this will tell you that languages has 74 observations (rows) and five variables (columns).

If you would like to examine variables, you can start by using the str() function (str for structure), as in the example below.

str(languages)

As you can see above, the str() function will tell you many useful things about your dataset. For example, it will reveal the number of observations (rows, 74) and variables (columns, 5), and then list the variables (Country, Population, Area, MGS, Langs). For each variable, it will also indicate the variable type (chr = character strings, num = numeric, int = integer). The str() function will also display the first observations of each variable (Algeria, Angola, Australia, Bangladesh, etc.).

You can also check the names of variables separately by using the names() function, or check the variable type with the class() function, but it’s easier to just use the str() function as in the example above.

If you prefer, you can restrict your inspection to the first or final rows of the data set. You can do this by using the head() and tail() functions. This is helpful if your table has lots of rows. It complements str(), as it shows you a sample of the actual data, not just the structure.

head(languages) #default is six rows to display
tail(languages, n = 5) #show last five rows

How could you show the first 10 rows?

There is also a very helpful function called summary(). As you can see in the example below, this function will provide you with summary information for each of your variables.

For numeric/integer variables such as Population, Area, MGS, and Langs, this command will calculate the minimum and maximum values, quartiles, median and mean. (We will discuss summary statistics in more detail later.)

For character variables, as in Country, the command will simply provide you with the number of observations (length) for this variable.

summary(languages)

In large datasets, you might want to examine only a specific variable. You can do this by using the $ as an index. For example, if you would just like to examine the variable Population in the languages dataset, you could proceed as follows.

str(languages)

str(languages$Population)

class(languages$Population)

head(languages$Population)

tail(languages$Population)

summary(languages$Population)

Which of the above six commands are best placed in the script or console?

Ultimately, there is no right or wrong answer. Personally,

str() belongs in the console because it should just be a quick check that the data is the expected shape. It could go in the script if it were part of a more formal test. A sanity check is something that makes you go “oh, I should just make sure”, whereas a test is more in line with thoughts of “if it doesn’t have an identical shape to dataframe2, none of this works” - a nuanced difference that we may perhaps explore in later sessions.

class() goes in the console - it is very much a sanity check. If it transpires the class isn’t what you wanted, we can coerce it into a different class, which we would include as a step in the script, but we don’t need to run the check every time in the script if we are just going to coerce it anyway…

head() and tail() depends. If you’re just having a little look, then console. If it is something you are then using in the analysis, the script. Most likely the console though. If you can exit RStudio, reopen the script, and it runs without errors, then it’s fine to leave in the console. If it fails, maybe you need things in the script?

summary() is one I usually keep in the script - particularly if I am reporting the summary of statistics (see a later session) because it is meaningful content that I need.
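Putting that advice together, a bare-bones script might look like the sketch below. To keep it runnable anywhere, it writes a tiny stand-in CSV to a temporary file first; in your own scripts you would read the real data file from your working directory instead:

```r
library(tidyverse)

# Stand-in data so the sketch runs anywhere; in a real script you would
# skip this step and read your actual file, e.g. 'nettle_1999_climate.csv'.
demo_path <- file.path(tempdir(), "demo_scores.csv")
write_csv(tibble(score = c(145, 234, 653, 876, 456)), demo_path)

# Load data ----------------------------------------------------------
scores_df <- read_csv(demo_path)

# Summary statistics (kept in the script: meaningful, reported output) -
summary(scores_df)

# One-off sanity checks such as str(scores_df) or View(scores_df)
# belong in the console, not here.
```

The script stays minimal: library() at the top, the data import, and only the commands whose output you actually need.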

Step 6: Closing your R session

The last step is to close your R session. When you quit RStudio, a prompt will ask whether you want to save the content of your workspace. It is better to NOT save the workspace. When you start RStudio again, you will have a clean workspace. You then just re-run your scripts.

If you have written your scripts well, upon reopening you should be able to reproduce the exact same steps without error and without odd additional windows opening (because we put View() in a script…).

So, I would save R scripts (especially if these are very long and relevant to your analyses), but I would not save the workspace contents.

Take home task

To complete this homework task, you will need to download the language_exams data file from our Moodle page into your working directory.

In the file, you will find the (fictional) scores and ages of 475 students who took an intermediate Portuguese language course at university. Students were tested three times: first in September to check their Portuguese proficiency at the beginning of the course, then again in January as part of their mid-term examination, and finally in June as part of their final examination. On each occasion, students had to complete three subtests to respectively assess their Portuguese vocabulary, grammar and pronunciation. The scores for exams 1, 2 and 3 are composite scores, i.e. each combines the results of the three subtests.

Your task is to run a basic analysis of the exam data using an R script.

In your script, please include all the steps, including the command that loaded the data.

Please also include sections to make your script very clear, as well as comments.

  1. How many observations and columns does the datafile contain?
  1. Run commands to display the first and the last five lines of the table.
  1. What is the average age of participants? Report this as a whole number.
  1. What type of variable is student_id?
  1. What is the mean score on exam 3, rounded to 2 decimal places?

Not sure how? Type ?round() into the console and read the help page. Specifically, look under the Arguments section and at the examples (the second to last is the best one).

  1. What is the difference between the mean scores on exams 1 and 2?

Please save the script to discuss at the next session.


Footnotes

  1. Towse, A. S., Ellis, D. A., & Towse, J. (2021). Making data meaningful: Guidelines for good quality open data. The Journal of Social Psychology, 161(4), 395–402. https://doi.org/10.1080/00224545.2021.1938811↩︎