
Statistics with epidemiology of COVID-19

Author(s): Maria Shumskaya, Shakira Benjamin, Matthew G Niepielko, Nicholas Lorusso

Kean University


Introduction to heat maps, the Wilcoxon-Mann-Whitney test (a non-parametric alternative to the t-test), and optional GIF creation in R, using an original dataset on COVID-19 infections from different counties of New Jersey, USA. Suitable for students who have basic experience in R.

Licensed under CC Attribution-ShareAlike 4.0 International

Version 1.0 - published on 20 Jun 2021 - doi:10.25334/H1HE-5Z05



The coronavirus disease 2019 (COVID-19) pandemic is a global challenge caused by the rapid emergence of the SARS-CoV-2 virus. Over the course of 2020, the number of infections reached 144,220,516 cases and 3,065,612 deaths worldwide. The most current information for cases and deaths can be found at: According to the United States Centers for Disease Control and Prevention (CDC), as of April 21, 2021, there were at least 31,602,676 reported cases and 565,613 total deaths in the United States, with 983,875 reported cases and 283,000 total deaths in the state of New Jersey alone. You can find the most recent data at Several factors have been highlighted as particularly important for predicting the likelihood of infection and the severity of disease, including patient age, sex, and population density.

In this exercise, students will analyze an original dataset prepared from data generated by Genesis Laboratory Management, a COVID-testing laboratory located in Monmouth County, New Jersey, USA. The laboratory collects patient specimens from counties across New Jersey, though more samples are obtained from Monmouth and the surrounding counties. This relationship between distance from the laboratory and the number of collected samples should be noted, as it biases the dataset toward samples originating closer to Monmouth County. All samples were tested for SARS-CoV-2 RNA sequences using PCR, and the results were recorded in the database. The data included in the exercise were collected between March and December 2020. Students completing this exercise will build heat maps to visualize the number of positive SARS-CoV-2 tests in each county of New Jersey, then perform a Wilcoxon-Mann-Whitney test to compare the number of infections in females and males and determine whether one sex is statistically more likely to contract the virus.
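The heat-map step described above might look roughly like the following sketch. It is not the module's script: the counts are made-up numbers, the column name `n_positive` is hypothetical, and shading county polygons (rather than, say, plotting points at county centers) is an assumption about how the map is drawn. It relies on `map_data()` from ggplot2, which needs the maps package installed.

```r
library(ggplot2)  # map_data() requires the 'maps' package to be installed

# County outlines for New Jersey; 'subregion' holds lowercase county names
nj <- map_data("county", region = "new jersey")

# Hypothetical positive-test counts for three counties (illustrative only)
counts <- data.frame(
  subregion  = c("monmouth", "ocean", "essex"),
  n_positive = c(5200, 3100, 1400)
)

# Left join keeps every county polygon; counties without data get NA
nj_counts <- merge(nj, counts, by = "subregion", all.x = TRUE)
# merge() reorders rows, so restore the polygon drawing order
nj_counts <- nj_counts[order(nj_counts$group, nj_counts$order), ]

heat_map <- ggplot(nj_counts,
                   aes(x = long, y = lat, group = group, fill = n_positive)) +
  geom_polygon(colour = "grey40") +
  scale_fill_gradient(low = "lightyellow", high = "red", na.value = "grey90") +
  coord_quickmap() +
  labs(title = "Positive SARS-CoV-2 tests by county (illustrative data)",
       fill = "Positive tests") +
  theme_void()

print(heat_map)
```

Counties with no entry in `counts` are drawn in grey, which makes gaps in coverage visible instead of hiding them as zeros.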

The included module is suitable for undergraduate or graduate students who have basic knowledge of R. The module takes one lab period (2.5 hours) to complete. An optional additional section, in which students create a GIF of their maps, can be assigned as homework or for extra credit.

Activity software: RStudio

R packages: maps, mapdata, ggplot2, gifski and dplyr.

Dataset: an adapted version of the original data from Genesis Laboratories, shared with permission from the testing facility (approximately 380,000 records). All personally identifying information was removed from the dataset. The dataset is a CSV (comma-delimited) file with individual test result entries as rows, and information on patient county, age, sex, and test result as columns. The dataset also contains the central geographic location (longitude and latitude) of each county, the county populations, and county ranks by population density.
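As a minimal sketch of how a table with this layout could be summarized per county, the example below builds a tiny stand-in data frame instead of reading the real file. The column names (`county`, `age`, `sex`, `result`) and the value labels (`"Positive"`, `"Negative"`) are assumptions for illustration; the actual CSV may name them differently.

```r
# Tiny stand-in for the CSV: one row per test (made-up records)
tests <- data.frame(
  county = c("Monmouth", "Monmouth", "Ocean", "Ocean", "Essex", "Monmouth"),
  age    = c(34, 61, 45, 29, 52, 70),
  sex    = c("F", "M", "F", "M", "F", "M"),
  result = c("Positive", "Negative", "Positive", "Positive", "Negative", "Positive"),
  stringsAsFactors = FALSE
)

# Flag positive results so they can be summed
tests$positive <- tests$result == "Positive"

# Number of positive tests per county (rows come back sorted by county)
by_county <- aggregate(positive ~ county, data = tests, FUN = sum)
names(by_county)[2] <- "n_positive"

# Total tests per county, looked up by county name, and the share that were positive
by_county$n_tests    <- as.vector(table(tests$county)[by_county$county])
by_county$positivity <- by_county$n_positive / by_county$n_tests

print(by_county)
```

Reporting the share of positive tests next to the raw count helps separate counties with many infections from counties that simply submitted many samples.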


Learning objectives:

In this activity, students will:

  1. Identify appropriate computational approaches to address questions on epidemiology of COVID-19
  2. Use the statistical software R to analyze data on the spread of SARS-CoV-2 infections
  3. Evaluate data using graphical representation
  4. Apply statistical methods such as the Wilcoxon-Mann-Whitney test (a non-parametric alternative to the t-test) to analyze original data
  5. Test hypotheses on factors that affect viral spread
  6. Draw conclusions based on data analysis

Questions that students will be able to answer after completing this module: Do you think the number of positive cases of COVID-19 changed from March to December 2020 in each county? Do you think it depends on the number of tests conducted or on the population of the county? Is there a difference in the number of positive tests between males and females? Was the virus spreading from one location within the state or from multiple locations at the same time?
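The question about males and females can be approached with base R's `wilcox.test()`. The sketch below uses made-up per-county counts, and treating the two sexes as independent samples of county-level counts is an assumption about how the comparison is set up, not the module's own script.

```r
# Hypothetical per-county counts of positive tests (made-up numbers)
positives_female <- c(120, 85, 240, 60, 310, 150, 95, 200)
positives_male   <- c(110, 90, 225, 70, 290, 160, 80, 185)

# Two-sided Wilcoxon-Mann-Whitney (rank-sum) test.
# It compares ranks, so it does not assume normally distributed counts.
# exact = FALSE uses the normal approximation, which also copes with ties.
result <- wilcox.test(positives_female, positives_male,
                      alternative = "two.sided", exact = FALSE)

print(result$statistic)  # W statistic
print(result$p.value)    # a small p-value suggests the two groups differ
```

If the counts are instead treated as matched pairs (female and male counts from the same county), passing `paired = TRUE` switches to the Wilcoxon signed-rank test, which is a different test with a different hypothesis.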


What is included in this module:

1. A tutorial that guides students through the module in class. The tutorial contains introductory information and questions that students should answer after working with the R script.

2. R script for students (this is an incomplete script that must be completed by the students in order for it to work).

3. PowerPoint presentation

4. Instructor's resources: a file with instructor's notes and an instructor's R script.


The authors thank Genesis Laboratory Management, New Jersey, USA, for the permission to use the provided data.

