Introductory Data Science Pipeline Activity – Yellow Fever and Global Precipitation
Author(s): Mary Mulcahy
University of Pittsburgh at Bradford
- Mulcahy_Pipeline_Activity_Student_Handout_2024_05_Ver001.docx (DOCX | 182 KB)
- Mulcahy_Pipeline_Activity_Student_Handout_2024_05_Ver001.pdf (PDF | 265 KB)
- Mulcahy_Pipeline_Teaching_Notes_Ver001.docx (DOCX | 30 KB)
- Mulcahy_KEY_Pipeline_Activity_2024_05_Ver001.docx (DOCX | 184 KB)
- Mulcahy Pipeline Activity Merged Data.csv (CSV | 775 B)
- precipitation_fig-2.csv (CSV | 2 KB)
- Yellow Fever YF reported cases and incidence 2024-08-05 00-40 UTC.xlsx (XLSX | 9 KB)
- Yellow Fever (YF) reported cases and incidence
- https://www.epa.gov/climate-indicators/climate-change-indicators-us-and-global-precipitation
- Navigating Codap for the Biomes Module Video Tutorial
- CODAP
Description
This activity introduces one version of the data science pipeline, with the aim of inspiring students to consider research questions of their own that could be answered using this process. Here, the data science pipeline is defined as a series of seven steps, from start to finish, for using existing data, especially publicly available data, to answer new research questions.
This activity introduces one narrow view of data science. Although the steps are given an order, students are reminded that data scientists may circle back to previous steps or skip a step entirely depending on their research goals. Much like the scientific method, data science is approached in many ways, and the pipeline path described here introduces some of the common terminology and a frequent approach to answering questions.
After completing this activity, students will be able to:
- Define the terms: clean, pull, verify, wrangle, merge, interoperable, file extension, and long and wide formats.
- Recognize the difference between long and wide data formats.
- Review metadata associated with a downloaded data file.
- Describe the ordered steps of the data pipeline process, while recognizing that the phrase “data science pipeline” means different things to different scientists and that these steps don’t always occur in order.
- Create, under guidance, a graph with appropriately labeled axes that include units.
- Use an r-squared value to assess the strength of a relationship between two variables.
- Explain why statistical analysis may be needed to interpret a data pattern.
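Several of the terms above (wide and long format, merge, r-squared) can be made concrete with a short sketch. The Python example below is only an illustration and is not part of the activity, which uses CODAP. All values, column names, and helper functions (`wide_to_long`, `merge_on`, `r_squared`) are hypothetical and are not taken from the activity's data files. It reshapes a small wide table into long format, merges it with a second table on a shared year column, and computes r-squared as the squared Pearson correlation.

```python
# Illustrative values only: these numbers are made up and are NOT
# taken from the activity's yellow fever or precipitation files.

# Wide format: one row per year, one column per region (hypothetical names).
wide_cases = [
    {"year": 2001, "region_a": 10, "region_b": 4},
    {"year": 2002, "region_a": 14, "region_b": 6},
    {"year": 2003, "region_a": 9, "region_b": 5},
    {"year": 2004, "region_a": 20, "region_b": 8},
]


def wide_to_long(rows, id_key, var_name, value_name):
    """Reshape wide rows into long rows: one row per (id, variable) pair."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key != id_key:
                long_rows.append(
                    {id_key: row[id_key], var_name: key, value_name: value}
                )
    return long_rows


def merge_on(left, right, key):
    """Inner join: keep only left rows whose key also appears in right."""
    lookup = {row[key]: row for row in right}
    return [{**l, **lookup[l[key]]} for l in left if l[key] in lookup]


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)


# Long format: each row holds a single observation.
long_cases = wide_to_long(wide_cases, "year", "region", "cases")

# A second hypothetical table that shares the "year" column.
precipitation = [
    {"year": 2001, "anomaly": 0.5},
    {"year": 2002, "anomaly": 1.0},
    {"year": 2003, "anomaly": 0.2},
    {"year": 2004, "anomaly": 1.6},
]

# Merge one region's long-format rows with precipitation by year.
region_a_rows = [r for r in long_cases if r["region"] == "region_a"]
merged = merge_on(region_a_rows, precipitation, "year")

r2 = r_squared(
    [m["anomaly"] for m in merged],
    [m["cases"] for m in merged],
)
print(f"R-squared: {r2:.2f}")
```

An r-squared value near 1 indicates a strong linear relationship and a value near 0 a weak one. With only a handful of points, as in this sketch, a high value can still arise by chance, which is one reason statistical analysis may be needed to interpret a pattern.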
Cite this work
Researchers should cite this work as follows:
- Mulcahy, M. (2024). Introductory Data Science Pipeline Activity – Yellow Fever and Global Precipitation. Biological and Environmental Data Education (BEDE) Network, QUBES Educational Resources. doi:10.25334/SP15-ZK40