--- title: "Describing data: R code for Chapter 3 examples" author: "Michael Whitlock and Dolph Schluter" date: "February 1, 2017" output: html_document: toc: yes toc_depth: 3 --- _Note: This document was converted to R-Markdown from [this page](http://whitlockschluter.zoology.ubc.ca/r-code/rcode03), with modifications, by M. Drew LaMar. You can download the R-Markdown [here](https://qubeshub.org/collections/post/1022/download/chap03.Rmd)._ Download the original R code on this page as a single file [here](http://whitlockschluter.zoology.ubc.ca/wp-content/rcode/chap03.r) ## New methods New methods on this page Hover over a function argument for a short description of its meaning. The variable names are plucked from the examples further below. **Sample mean**, assuming that there are no missing (NA) entries: > mean(snakeData$undulationRate) **Standard deviation**, assuming that there are no missing (NA) entries: > sd(snakeData$undulationRate) **Calculate a statistic by group**, assuming that there are no missing (NA) entries: > tapply(sticklebackData$$$plates, sticklebackData$genotype, FUN = median) **Obtain a subset of data from a data frame**: > subset(spiderData, treatment == "before") **Other new methods**: * Variance and coefficient of variation. * Median and interquartile range. * Round to a preferred number of decimals. * Cumulative frequency distribution. * Table of descriptive statistics by group. * Mean and standard deviation from a frequency table. ----- ## Example 3.1. [Gliding snakes](http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter03/chap03e1GlidingSnakes.csv) *__Sample mean__, __standard deviation__, __variance__ and __coefficient of variation__ of undulation rate of 8 gliding paradise tree snakes, in Hz.* Read and inspect the data. ```{r} snakeData <- read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter03/chap03e1GlidingSnakes.csv") head(snakeData) ``` **Draw a histogram** of the data. ```{r} hist(snakeData$undulationRateHz, right = FALSE) ``` Commands for a fancier histogram are shown here. ```{r} hist(snakeData$undulationRateHz, right = FALSE, las = 1, col = "firebrick", breaks = seq(0.8,2.2,by=0.2), xlab = "Undulation rate (Hz)", ylab = "Frequency", main = "") ``` Calculate the **mean**, **standard deviation** and **variance** of the undulation rates. ```{r} m <- mean(snakeData$undulationRate) s <- sd(snakeData$undulationRate) var(snakeData$undulationRate) ``` Calculate the **coefficient of variation**. ```{r} 100 * s/m ``` ----- ## Table 3.1-2. [Number sof convictions](http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter03/chap03t1_2ConvictionsFreq.csv) *__Mean and standard deviation from a frequency table__. The data are from Chapter 2.* Read and inspect the data. It is a frequency table of the number of convictions. ```{r} convictionsFreq <- read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter03/chap03t1_2ConvictionsFreq.csv") head(convictionsFreq) ``` Calculate the **mean and standard deviation from the frequency table.** First, use the `rep` command to "repeat" each value in the table, the number of times according to its frequency. Here, we store the result in `convictions`. ```{r} convictions <- rep(convictionsFreq$convictions, convictionsFreq$frequency) ``` Then, calculate mean and standard deviation on the result. ```{r} mean(convictions) sd(convictions) ``` ----- ## Example 3.2. [Spider running speed](http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter03/chap03e2SpiderAmputation.csv) *__Median__, __interquartile range__, and __box plot__ of running speed (cm/s) of male Tidarren spiders. We also include below the __cumulative frequency distribution__ of running speed before amputation.* Read and inspect the data. The data are in "long" format. One variable indicates running speed, and a second variable gives treatment (before vs after amputation). Therefore, every individual spider is on two rows, once for its before-amputation measurement and one for its after-amputation measurement. ```{r} spiderData <- read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter03/chap03e2SpiderAmputation.csv") head(spiderData) ``` **Box plot** of the data. Begin by ordering the treatment levels so that the "before" amputation measurements come before the "after" measurements in the plot. ```{r} spiderData$treatment <- factor(spiderData$treatment, levels = c("before", "after")) boxplot(speed ~ treatment, data = spiderData) ``` Instructions for a fancier box plot, made with additional options, is shown here. ```{r} par(bty = "l") # Makes the axes L-shaped boxplot(speed ~ treatment, data = spiderData, ylim = c(0,max(spiderData$speed)), col = "goldenrod1", boxwex = 0.5, # Scale width of boxplots by 0.5 whisklty = 1, # Make whiskers linetype 1 (i.e. solid) xlab = "Amputation treatment", ylab = "Running speed (cm/s)") ``` **Extract the before-amputation data** using `subset`. Note the double equal sign "==" needed in the logical statement, which indicates "is equal to" in R. Save the result in a new data frame named `speedBefore`. ```{r} speedBefore <- subset(spiderData, treatment == "before") speedBefore ``` Calculate the **sample median** of before-amputation running speed. ```{r} median(speedBefore$speed) ``` **Calculate the first and third quartiles** of before-amputation running speed (0.25 and 0.75 quantiles). Type 5 reproduces the method we use in the book to calculate quartiles. ```{r} quantile(speedBefore$speed, probs = c(0.25, 0.75), type = 5) ``` Determine the **interquartile range** of before-amputation running speed. Type 5 reproduces the method we use in the book to calculate quartiles. ```{r} IQR(speedBefore$speed, type = 5) ``` **Draw a cumulative frequency distribution** of running speed before amputation (Figure 3.4-1). ```{r} plot(ecdf(speedBefore$speed), verticals = TRUE, ylab = "Cumulative relative frequency", xlab = "Running speed before amputation (cm/s)") ``` Instructions for a slightly more polished cumulative frequency distribution are here. ```{r} par(bty = "l") plot(ecdf(speedBefore$speed), verticals = TRUE, las = 1, main = "", do.points = FALSE, ylab = "Cumulative relative frequency", xlab = "Running speed before amputation (cm/s)" ) ``` ----- ## Example 3.3. [Stickleback lateral plates](http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter03/chap03e3SticklebackPlates.csv) *Draw __multiple histograms__ and __produce a table of descriptive statistics by group__ for plate numbers of three stickleback genotypes. We also include a __table of frequencies and proportions__ of stickleback genotypes. Download and inspect the data. ```{r} sticklebackData <- read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter03/chap03e3SticklebackPlates.csv") head(sticklebackData) ``` **Draw multiple histograms** of number of plates, one histogram per genotype. Begin by setting the preferred order of the three genotypes in the figure. Here, we use the `lattice` package to draw the histograms, so it must be loaded first. ```{r} sticklebackData$genotype <- factor(sticklebackData$genotype, levels = c("MM","Mm","mm")) library(lattice) histogram(~ plates | genotype, data = sticklebackData, breaks = seq(0,70,by=2), layout = c(1, 3), col = "firebrick") ``` Commands to draw multiple histograms in base R, without using the `lattice` package, are here. This method is more tedious, but allows for easier addition of options. ```{r} oldpar = par(no.readonly = TRUE) # make backup of default graph settings par(mfrow = c(3,1), las = 1, oma = c(4, 6, 2, 6), mar = c(2, 5, 4, 2)) # adjust margins hist(sticklebackData$plates[sticklebackData$genotype == "MM"], right = FALSE, breaks = seq(0,70,by=2), main = "MM", col = "firebrick", las = 1, ylab = "Frequency") hist(sticklebackData$plates[sticklebackData$genotype == "Mm"], right = FALSE, breaks = seq(0,70,by=2), main = "Mm", col = "firebrick", las = 1, ylab = "Frequency") hist(sticklebackData$plates[sticklebackData$genotype == "mm"], right = FALSE, breaks = seq(0,70,by=2), main = "mm", col = "firebrick", las = 1, ylab = "Frequency") mtext("Number of lateral plates", side = 1, outer = TRUE, padj = 1.5) par(oldpar) # revert to default graph settings ``` **Make a table of descriptive statistics by group** using `tapply`. The commands below assume that the variables contain no missing (NA) elements. We need to calculate all the statistics, one at a time. Save them so that we can put them all together in a table afterward. To begin, get the sample sizes by group (genotype): ```{r} n <- tapply(sticklebackData$plates, INDEX = sticklebackData$genotype, FUN = length) n ``` Next, calculate the mean number of plates by group. Let's round the results to 1 decimal place to make it easier to read them when we place into a table. To accomplish this, set `digits = 1` in the round function. ```{r} meanPlates <- tapply(sticklebackData$plates, INDEX = sticklebackData$genotype, FUN = mean) meanPlates <- round(meanPlates, digits = 1) meanPlates ``` Repeat for medians. ```{r} medianPlates <- tapply(sticklebackData$plates, INDEX = sticklebackData$genotype, FUN = median) medianPlates ``` Repeat for standard deviations, including rounding. ```{r} sdPlates <- tapply(sticklebackData$plates, INDEX = sticklebackData$genotype, FUN = sd) sdPlates <- round(sdPlates, 1) sdPlates ``` Finally, the interquartile range by group. ```{r} iqrPlates <- tapply(sticklebackData$plates, INDEX = sticklebackData$genotype, FUN = IQR, type = 5) iqrPlates ``` **Make the table by assembling all the results in a new data frame** (data frames can be used as tables for display purposes). ```{r} sticklebackTable <- data.frame(genotype = names(n), n = n, mean = meanPlates, median = medianPlates, sd = sdPlates, iqrange = iqrPlates) sticklebackTable ``` We can **make a table of frequencies and proportions** of the stickleback genotypes (Table 3.5-1). Below, we first generate a frequency table of genotypes (the dnn argument names the variable in the table). ```{r} sticklebackFreq <- table(sticklebackData$genotype, dnn = "genotype") sticklebackFreq ``` We then convert the table to a data frame so that we can include the proportions in the table too. ```{r} sticklebackFreq <- data.frame(sticklebackFreq) sticklebackFreq ``` Finally, we **calculate the proportions** and put them into the data frame. To convert frequencies to proportions, divide the frequencies by the sum of the frequencies. ```{r} sticklebackFreq$proportion <- sticklebackFreq$Freq / sum(sticklebackFreq$Freq) sticklebackFreq ``` The table would look even nicer if you round the proportions before including them in the table (give this a try).