Introductory Data Science Pipeline Activity – Yellow Fever and Global Precipitation

Author(s): Mary Mulcahy

University of Pittsburgh at Bradford

Summary:

Students follow the steps of a tiny data science project from start to finish. They are given a research question: "Is the number of cases of yellow fever associated with global average precipitation?" The students locate the data from the World Health Organization and the Environmental Protection Agency, download it, and use the merged and cleaned data to see whether the evidence supports the hypothesis that yellow fever cases are higher in wetter years than in drier years. The activity is intended for use early in a course to prepare introductory students to eventually explore their own questions.
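
As a rough illustration of what "merged and cleaned" can mean here, the sketch below joins two small tables on year and drops any year with a missing value. It is not the activity's own procedure, which may well be carried out in a spreadsheet. The names `cases_by_year`, `precip_by_year`, and `merge_on_year` are hypothetical, and every number is an invented placeholder rather than a real WHO or EPA figure.

```python
# Invented placeholder values, NOT real WHO or EPA data.
# None marks a year in which the case count is missing.
cases_by_year = {2001: 120, 2002: 150, 2003: None, 2004: 200, 2005: 170}
precip_by_year = {2001: 1.0, 2002: 2.0, 2003: 0.5, 2004: 3.0, 2006: 1.5}


def merge_on_year(cases, precip):
    """Keep only years present in both tables with no missing values."""
    merged = []
    for year in sorted(set(cases) & set(precip)):
        if cases[year] is None or precip[year] is None:
            continue  # cleaning step: skip incomplete rows
        merged.append(
            {"year": year, "precip": precip[year], "cases": cases[year]}
        )
    return merged


merged = merge_on_year(cases_by_year, precip_by_year)
for row in merged:
    print(row)
```

Years that appear in only one table (2005 and 2006 in this made-up example) fall out of the merge, which is one reason a merged data set can be shorter than either source.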

Description

This activity is an introduction to one version of the data science pipeline with the intent of inspiring students to consider their own research questions that could be answered using this process. In this activity, the data science pipeline is defined as a series of seven steps (start to finish) for using existing data, especially publicly available data, to answer new research questions. 

This activity introduces one narrow view of data science.  Although the steps are given an order, students are reminded that data scientists may circle back to previous steps or skip a step entirely depending on their research goals.  Much like the scientific method, data science is approached in many ways, and the pipeline path described here introduces some of the common terminology and a frequent approach to answering questions.

After completing this activity, students should be able to:

  • Define the terms clean, pull, verify, wrangle, merge, interoperable, file extension, long format, and wide format.
  • Recognize the difference between long and wide data formats.
  • Review the metadata associated with a downloaded data file.
  • Describe the ordered steps of the data pipeline process, while recognizing that the phrase “data science pipeline” means different things to different scientists and that these steps don’t always occur in order.
  • With guidance, create a graph with appropriately labeled axes, including units.
  • Use an r-squared value to assess the strength of a relationship between two variables.
  • Explain why statistical analysis may be needed to interpret a pattern in data.
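
Two items in the list above, long versus wide format and the r-squared value, can be sketched in a few lines of code. The example below is only an illustration under stated assumptions: the helper names `wide_to_long` and `r_squared` are hypothetical, the column names are made up, and the values are invented placeholders rather than real WHO or EPA data. Here r-squared is computed as the square of the Pearson correlation, which matches the r-squared of a straight-line fit between two variables.

```python
# Wide format: one row per year, one column per variable.
# Values are invented placeholders, NOT real WHO or EPA data.
wide = [
    {"year": 2001, "precip": 1.0, "cases": 120},
    {"year": 2002, "precip": 2.0, "cases": 150},
    {"year": 2003, "precip": 0.5, "cases": 90},
    {"year": 2004, "precip": 3.0, "cases": 200},
]


def wide_to_long(rows, id_key):
    """Long format: one row per (id, variable, value) triple."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key != id_key:
                long_rows.append(
                    {id_key: row[id_key], "variable": key, "value": value}
                )
    return long_rows


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r-squared is undefined when a variable is constant")
    return sxy ** 2 / (sxx * syy)


long_rows = wide_to_long(wide, "year")
r2 = r_squared([r["precip"] for r in wide], [r["cases"] for r in wide])
print(len(wide), "wide rows ->", len(long_rows), "long rows")
print("r-squared:", round(r2, 3))
```

An r-squared near 1 describes a tight linear pattern and a value near 0 a weak one, but with only a handful of points the number alone does not show whether the pattern could have arisen by chance, which is where statistical analysis comes in.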