  • Created 20 Nov 2020


“The temptation to form premature theories upon insufficient data is the bane of our profession.” - Sherlock Holmes

We live in an era of “Big Data”: many of the important recent advances in basic science and engineering have come from the ability to collect vast amounts of data and find patterns within it.

  • CERN collects over 600 million collision events per second.
  • Storing 1000 human genomes requires 320 terabytes of storage.
  • The Human Connectome Project will contain the fMRI scans of 1200 participants and require close to one petabyte of storage.

Moreover, the internet has become an important source of information:

  • The 3.6 billion words in the English-language Wikipedia have been used to train voice assistants.
  • The 500 million tweets posted to Twitter each day have been used to predict voting patterns and the spread of disease.
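
To get a feel for the scales quoted above, here is a minimal back-of-the-envelope sketch in Python. It uses only the figures given in the lists; the decimal unit conversions (1 TB = 1000 GB, 1 PB = 1000 TB) and the variable names are our own assumptions, and "close to one petabyte" is treated as exactly one petabyte, so the per-participant number is an upper estimate.

```python
# Back-of-the-envelope arithmetic on the figures quoted above.
# Assumption: decimal units (1 TB = 1000 GB, 1 PB = 1000 TB).
GB_PER_TB = 1000
TB_PER_PB = 1000
SECONDS_PER_DAY = 24 * 60 * 60

# 1000 human genomes stored in 320 TB
gb_per_genome = 320 * GB_PER_TB / 1000

# "Close to one petabyte" for 1200 participants, treated here as exactly 1 PB
gb_per_participant = 1 * TB_PER_PB * GB_PER_TB / 1200

# 500 million tweets per day, averaged evenly over the day
tweets_per_second = 500_000_000 / SECONDS_PER_DAY

print(f"Storage per genome: {gb_per_genome:.0f} GB")
print(f"Storage per fMRI participant: about {gb_per_participant:.0f} GB")
print(f"Average tweets per second: about {tweets_per_second:.0f}")
```

Under these assumptions, a single genome takes roughly 320 GB, which is already more than many laptops can hold.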

Although the skills needed to work with large data sets have never been more important, most colleges and universities do not have a computer science requirement, so many graduates lack these key skills. The goal of this course is to provide you with the foundational skills you need to begin working in this area. No course can teach you every important skill in one week, but we hope the most important thing you take away is how to learn more.


Learning Objectives

By the end of Big Data Summer School, you should be able to:

  • Explain the concept of a computer’s path, and use this knowledge to navigate a local or remote computer.
  • Use your computer’s command line to accomplish basic computing operations such as file navigation and script execution.
  • Explain the concept of users and permissions, and determine and change permissions for a given user.
  • Find and understand documentation in order to learn more.
  • Use Vim to open, edit, and save scripts.
  • Set up an environment for scientific computing using Anaconda.
  • Find and install packages and their dependencies.
  • Use data visualization techniques including histograms, scatterplots, and heat maps to understand data visually.
  • Explain the concept of version control, and why it is useful.
  • Set up a local git repository and use git to manage file changes.
  • Create, clone, and manage git repositories, and push to and pull from remote repositories on GitHub.
  • Use git to download, install, and implement other people’s code.
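
As a small taste of the path and permissions objectives above, here is a hedged sketch in Python rather than at the command line. It is not course material: the file name `hello.sh` is hypothetical, and the permission strings it prints assume a Unix-like system (Linux or macOS). It creates a throwaway script, prints its absolute path, and then adds execute permission for the owner, much as `chmod u+x` would.

```python
import os
import stat
import tempfile
from pathlib import Path

# Work in a throwaway directory so nothing else on your machine is touched.
with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "hello.sh"  # hypothetical file name
    script.write_text("#!/bin/sh\necho hello\n")

    # An absolute path locates the file starting from the filesystem root.
    print("Absolute path:", script.resolve())

    # Start with read/write for the owner only, then read the mode back.
    os.chmod(script, 0o600)
    before = stat.filemode(script.stat().st_mode)

    # Add execute permission for the owner, keeping the existing bits.
    current = stat.S_IMODE(script.stat().st_mode)
    os.chmod(script, current | stat.S_IXUSR)
    after = stat.filemode(script.stat().st_mode)

    print("Before:", before)
    print("After: ", after)
```

The strings printed for `before` and `after` use the same ten-character layout that `ls -l` shows, so reading them is good practice for the command-line work in the course.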



We assume no prior knowledge. Come with a desire to learn!