Working with data in Python

Lesson preamble

This 4-hour workshop takes learners through the basics of programming in Python via the Jupyter Lab interface and culminates with exploration and visualization of real-world bicycle count data from the City of Toronto. This material is based on workshops hosted by UofT Coders, inspired by the Data Carpentry Ecology Python lesson.

Lesson objectives

Part 1: introduction to programming in Python

  • Overview of the capabilities of Python and how to use JupyterLab for exploratory data analyses.
  • Learn about some differences between Python and Excel.
  • Learn basic Python commands.
  • Learn about the Markdown syntax and how to use it within the Jupyter Notebook.

Part 2: working with data in Python

  • Describe what a data frame is
  • Load external data from a .csv file into a data frame with pandas
  • Summarize the contents of a data frame with pandas.
  • Learn to use data frame attributes loc[], head(), info(), describe(), shape, columns, index.
  • Understand the split-apply-combine concept for data analysis.
  • Use groupby(), sum(), agg() and size() to apply this technique.

Part 3: visualizing data

  • Produce scatter plots, line plots, and histograms using seaborn and matplotlib.
  • Understand how to graphically explore relationships between variables.
  • Apply grids for faceting in seaborn.
  • Set universal plot settings.
  • Use seaborn grids with matplotlib functions

Lesson outline

  • Communicating with computers (5 min)
    • Advantages of text-based communication (5 min)
    • Speaking Python (5 min)
    • Natural and formal languages (5 min)
  • The Jupyter Notebook (10 min)
  • Data analysis in Python (5 min)
    • Packages (5 min)
    • How to get help (5 min)
  • Manipulating and analyzing data with pandas
    • Data set background (10 min)
    • What are data frames (15 min)
    • Data wrangling with pandas (40 min)
  • Split-apply-combine techniques in pandas
    • Using sum() and mean() to summarize categorical data (20 min)
    • Using size() to summarize categorical data (10 min)
  • Data visualization with matplotlib and seaborn (10 min)
    • Visualizing one quantitative variable with multiple categorical variables (40 min)
    • Visualizing the relationship of two quantitative variables with multiple categorical variables (40 min)
    • Using any plotting function with seaborn grids (10 min)

Setup

  • Install Python 3 via Anaconda. Anaconda is a distribution of Python that also includes the most commonly used packages as well as some nice tools for working with code. The one we will use today is called JupyterLab and comes with newer downloads of Anaconda. (Before February 2018, Anaconda came with the Jupyter Notebook, an older interface.)
  • Important: If you downloaded Anaconda before July 2018, make sure to update the package seaborn before beginning:
    • Open a terminal window or anaconda prompt and type conda update seaborn and hit enter. If that doesn't work, try conda install seaborn.
  • Open JupyterLab:
    • Open Anaconda Navigator and select JupyterLab, or
    • Open a terminal window or anaconda prompt and type jupyter-lab

Introduction

The aim of this workshop is to teach you basic concepts, skills, and tools for working with data so that you can get more done in less time, while having more fun. You will learn how to use the programming language Python to replace many of the tasks you would normally do in spreadsheet software such as Excel, and also do more advanced analysis.

Communicating with computers

  • Computing is about humans communicating with the computer to modulate flows of current in the hardware, in order to get the computer to carry out advanced calculations that we are unable to efficiently compute ourselves.
  • Early examples of human-computer communication included actually disconnecting a wire and connecting it again in a different spot.
  • Today we usually use graphical user interfaces with menus and buttons. These graphical interfaces can be thought of as a layer or shell around the internal components of your operating system making it easier for us to express our thoughts and for computers to interpret them.
  • An example of a graphical user interface is spreadsheet software such as Microsoft Excel or LibreOffice Calc. Here, all the functionality of the program is accessible via hierarchical menus, and clicking buttons sends instructions to the computer, which then responds and sends the results back to your screen.
  • Spreadsheet software is great for viewing and entering small data sets and creating simple visualizations fast. However, it can be tricky to design publication-ready figures, create automatic reproducible analysis workflows, perform advanced calculations, and reliably clean data sets. Even when using a spreadsheet program to record data, it is often beneficial to have some basic programming skills to facilitate the analyses of those data.

Advantages of text-based communication

Today we will communicate with our computers via text rather than graphical point-and-click. Typing instructions to the computer might at first seem counterintuitive and unnecessarily difficult. While graphical user interfaces can be nice when you are new to something, text-based interfaces are more powerful, faster, and actually easier to use once you get comfortable with them.

Think about learning a language: in the beginning it's nice to look things up in a dictionary (or a menu in a graphical program), and slowly string together sentences one word at a time. But once we become more proficient in the language and know what we want to say, it is easier to say or type it directly, instead of having to look up every word in the dictionary first.

Text interfaces are also less resource-intensive than their graphical counterparts and easier to develop programs for, since you don't have to code the graphical components. It is also much easier to automate and repeat any task once you have all the instructions written down. This facilitates reproducibility of analysis, not only between studies from different labs, but also between researchers in the same lab: compare being shown how to perform a certain analysis in spreadsheet software, where the instruction will essentially be "first you click here, then here, then here...", with being handed the same workflow written down in several lines of code which you can analyse and understand at your own pace.

Since text is the easiest way for people who are fluent in computer languages to interact with computers, many powerful programs are written without a graphical user interface (which makes it faster to create these programs), and to use these programs you often need to know how to use a text interface. For example, many of the best data analysis and machine learning packages are written in Python or R, and you need to know these languages to use them. Even if the program or package you want to use is not written in Python, much of the knowledge you gain from understanding one programming language can be transferred to others. In addition, many powerful computers that you can log into remotely only give you a text interface to work with, and there is no way to launch a graphical user interface.

Speaking Python

To communicate with the computer via Python, we first need an environment that knows how to interpret Python code. We want commands we type to be interpreted into machine language so that the computer can understand it. For the entire workshop today, we will work in the JupyterLab environment.

Launch JupyterLab

Method 1 On Windows open the Anaconda Prompt, on MacOS open terminal.app, and on Linux open whichever terminal you prefer (e.g. gnome-terminal or konsole). Then type in jupyter-lab and hit Enter.

Method 2 Open Anaconda Navigator, then select JupyterLab from the menu.

You should see something like this:

The Jupyter Notebook

Jupyter originates from a project called IPython, an effort to make Python development more interactive. Since its inception, the scope of the project expanded to include additional programming languages, such as Julia, Python, and R, so the name was changed to "Jupyter" as a reference to these core languages. Today, Jupyter supports many more languages, but we will be using it only for Python code. Specifically, we will be using the notebook from Jupyter, which allows us to easily take notes about our analysis and view plots within the same document where we code. This facilitates sharing and reproducibility of analyses, and the notebook interface is easily accessible through any web browser as well as exportable as a PDF or HTML page.

In the new browser tab, click the plus sign to the left and select a new notebook in the Python language (also available via File --> New --> Notebook).

Initially the notebook has no name other than "Untitled". If you click on "Untitled" you will be given the option of changing the name to whatever you want.

The notebook is divided into cells. Initially there will be a single input cell. You can type Python code directly into the cell. To run the cell, press Shift + Enter or click the play button in the toolbar.

In [1]:
4 + 5
Out[1]:
9

The sparse input 4 + 5 is much more efficient than typing "Hello computer, could you please add 4 and 5 for me?". Formal computer languages also avoid the ambiguity present in natural languages such as English. You can think of Python as a combination of math and a formal, succinct version of English. Since it is designed to reduce ambiguity, Python lacks the edge cases and special rules that can make English so difficult to learn, and there is almost always a logical reason for how the Python language is designed, not only a historical one.

The syntax for assigning a value to a variable is also similar to how this is written in math.

In [2]:
a = 4
In [3]:
a * 2
Out[3]:
8

Learning programming really is similar to learning another language - you will often learn the most from just trying to do something and receiving feedback (from the computer or another person)! When there is something you can't wrap your head around, or if you are actively trying to find a new way of expressing a thought, then look it up, just as you would with a natural language.

In [4]:
4 + 5
Out[4]:
9

By default, when a cell is run its code is interpreted and the next existing cell is selected (or a new empty one is created); you can press Ctrl + Enter instead to stay on the current cell. You can split the code across several lines within a cell as needed.
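For example, a single cell can hold several statements; only the value of the last line is displayed automatically. This is a minimal sketch with made-up values:

```python
# several lines of code in one cell
length = 5
width = 3
area = length * width
area  # only the value of the last line is shown as the cell's output
```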

The little counter on the left of each cell keeps track of the order in which the cells were executed, and changes to an * while the computer is processing the computation (only noticeable for computations that take a longer time).

The notebook is saved automatically, but it can also be done manually from the toolbar or by hitting Ctrl + s. Both the input and the output cells are saved, so any plots that you make will be present in the notebook next time you open it up without the need to rerun any code. This allows you to create complete documents with both your code and the output of the code in a single place, instead of spread across text files for your code and separate image files for each of your graphs.

You can also change the cell type from Python code to Markdown using the Cell | Cell Type option. Markdown is a simple formatting system which allows you to create documentation for your code, again all within the same notebook structure. You might already be familiar with Markdown if you have typed comments in online forums or use a chat app like Slack or WhatsApp. A short example of the syntax:

```markdown
# Heading level one

- A bullet point
- *Emphasis in italics*
- **Strong emphasis in bold**

This is a [link to learn more about markdown](https://guides.github.com/features/mastering-markdown/)
```

The Notebook itself is stored as a JSON file with an .ipynb extension. These are specially formatted text files, which can be exported and imported into another Jupyter system. This allows you to share your code, results, and documentation with others. You can also export the notebook to HTML, PDF, and many other formats to make sharing even easier! This is done via File --> Export Notebook As...

The data analysis environment provided by the Jupyter Notebook is very powerful and facilitates reproducible analysis. It is possible to write an entire paper in this environment, and it is very handy for reports such as progress updates since you can share your comments on the analysis together with the analysis itself.

It is also possible to open up other document types in the JupyterLab interface, e.g. text documents and terminals. These can be placed side by side with the notebook through drag and drop, and all running programs can be viewed in the "Running" tab to the left. To search among all available commands for the notebook, the "Commands" tab can be used. Existing documents can be opened from the "Files" tab.

Does JupyterLab require the internet?

Jupyter runs in your browser, but you can use it without internet access: it's all running offline, on your computer. Notice that the address bar will say something like localhost:8888/lab, which means it's being "served" from your computer.

Browsers are programs that display text files that are formatted with HTML. It just so happens that most of the files you want to view are on other computers and so you need internet, but you can also open any old HTML file from your computer in your browser, without the internet.

Data analysis in Python

To access additional functionality in a spreadsheet program, you need to click the menu and select the tool you want to use. All charts are in one menu, text layout tools in another, data analysis tools in a third, and so on. Programming languages such as Python have so many tools and functions that they would not fit in a menu. Instead of clicking File -> Open and choosing a file, you would type something similar to file.open('<filename>') in a programming language. Don't worry if you forget the exact expression; it is often enough to type the first few letters and then hit Tab to show the available options. More on that later.

Packages

Since there are so many functions available in Python, it is unnecessary to include all of them with the default installation of the programming language (it would be as if your new phone came with every single app preinstalled). Instead, more advanced functionality is grouped into separate packages, which can be accessed by typing import <package_name> in Python. You can think of this as telling the program which menu items you want to activate (similar to how Excel hides the Developer menu by default since most people rarely use it, and you need to activate it in the settings if you want to access its functionality). The Anaconda Python distribution essentially bundles the core Python language with many of the most effective Python packages for data analysis, but some packages need to be downloaded before they can be used, just like downloading an add-on for a browser or an app for a mobile phone.

Just like in spreadsheet software menus, there are lots of different tools within each Python package. For example, if I want to use numerical Python functions, I can import the numerical Python package, numpy. I can then access any function by writing numpy.<function_name>.

In [5]:
import numpy

numpy.mean([1, 2, 3, 4, 5])
Out[5]:
3.0

How to get help

Once you start out using Python, you don't know what functions are available within each package. Luckily, in the Jupyter Notebook, you can type numpy.<Tab> (that is, numpy + period + the Tab key) and a small menu will pop up that shows you all the available functions in that module. This is analogous to clicking a 'numpy-menu' and then going through the list of functions. As I mentioned earlier, there are plenty of available functions and it can be helpful to filter the menu by typing the initial letters of the function name.

To get more info on the function you want to use, you can type out the full name and then press Shift + Tab once to bring up a help dialogue and again to expand that dialogue. We can see that to use this function, we need to supply it with the argument a, which should be 'array-like'. An array is essentially just a sequence of numbers. We just saw that one way of doing this was to enclose numbers in brackets [], which in Python means that these numbers are in a list, something you will hear more about later. Instead of manually activating the menu every time, JupyterLab offers a tool called the "Inspector" which displays help information automatically. I find this very useful and always have it open next to my Notebook. More help is available via the "Help" menu, which links to useful online resources (for example Help --> Numpy Reference).

When you start getting familiar with typing function names, you will notice that this is often faster than looking for functions in menus. However, sometimes you forget and it is useful to get hints via the help system described above.

It is common to give packages nicknames, so that it is faster to type. This is not necessary, but can save some work in long files and make code less verbose so that it is easier to read:
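For example, numpy is conventionally shortened to np (any valid name would work, but this is the nickname you will see in most documentation):

```python
# import numpy under the conventional nickname `np`
import numpy as np

np.mean([1, 2, 3, 4, 5])  # the same function as numpy.mean, just shorter to type
```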

Introduction to programming in Python

Operators

Python can be used as a calculator and mathematical calculations use familiar operators such as +, -, /, and *.

In [6]:
2 + 2 
Out[6]:
4
In [7]:
6 * 7
Out[7]:
42
In [8]:
4 / 3
Out[8]:
1.3333333333333333

Text prefaced with a # is called a "comment". These are notes to people reading the code, so they will be ignored by the Python interpreter.

In [9]:
# `**` means "to the power of"
2 ** 3
Out[9]:
8

Values can be given a nickname; this is called assigning values to variables and is handy when the same value will be used multiple times. The assignment operator in Python is =.

In [10]:
a = 5
a * 2
Out[10]:
10

A variable can be named almost anything. It is recommended to separate multiple words with underscores and start the variable name with a letter, not a number or symbol.

In [11]:
new_variable = 4
a - new_variable
Out[11]:
1

Variables can hold different types of data, not just numbers. For example, a sequence of characters surrounded by single or double quotation marks is called a string. In Python, strings can intuitively be concatenated by adding them together:

In [12]:
b = 'Hello'
c = 'universe'
b + c
Out[12]:
'Hellouniverse'

A space can be added to separate the words.

In [13]:
b + ' ' + c
Out[13]:
'Hello universe'

To find out what type a variable is, the built-in function type() can be used. In essence, a function can be passed input values, follows a set of instructions for how to operate on the input, and then outputs the result. This is analogous to following a recipe: the ingredients are the input, the recipe specifies the set of instructions, and the output is the finished dish.

In [14]:
type(a)
Out[14]:
int

int stands for "integer", which is the type of any number without a decimal component.

To be reminded of the value of a, the variable name can be typed into an empty code cell.

In [15]:
a
Out[15]:
5

A code cell will only output its last value. To see more than one value per code cell, the built-in function print() can be used. When using Python outside an interactive interface like the JupyterLab Notebook, for example when executing a set of Python instructions together as a script, print() is often the preferred way of displaying output.

In [16]:
print(a)
type(a)
5
Out[16]:
int

Numbers with a decimal component are referred to as floats

In [17]:
type(3.14)
Out[17]:
float

Text is of the type str, which stands for "string". Strings hold sequences of characters, which can be letters, numbers, punctuation or more exotic forms of text (even emoji!).

In [18]:
print(type(b))
b
<class 'str'>
Out[18]:
'Hello'

The output from type() is formatted slightly differently when it is printed.

Python also allows comparison and logic operators (<, >, ==, !=, <=, >=, and, or, not), which will return either True or False.

In [19]:
3 > 4
Out[19]:
False

not reverses the outcome from a comparison.

In [20]:
not 3 > 4
Out[20]:
True

and checks if both comparisons are True.

In [21]:
3 > 4 and 5 > 1
Out[21]:
False

or checks if at least one of the comparisons is True.

In [22]:
3 > 4 or 5 > 1
Out[22]:
True

The type of the resulting True or False value is called "boolean".

In [23]:
type(True)
Out[23]:
bool

Boolean comparisons like these are important when extracting specific values from a larger set of values. This use case will be explored in detail later in this material.

Another common use of boolean comparison is with a conditional statement, where the code after the comparison is only executed if the comparison is True.

In [24]:
if a == 4: # if a is equal to 4
    print('a is 4')
else: # if a is not equal to 4
    print('a is not 4')
a is not 4
In [25]:
a
Out[25]:
5

Note that the second line in the example above is indented. Indentation is very important in Python, and the Python interpreter uses it to understand that the code in the indented block will only be executed if the conditional statement above is True.

Challenge 1

  1. Assign a*2 to the variable name two_a.
  2. Change the value of a to 3. What is the value of two_a now, 6 or 10?

Array-like Python types

Lists

Lists are a common data structure to hold an ordered sequence of elements. Each element can be accessed by an index. Note that Python indexes start with 0 instead of 1.

In [26]:
planets = ['Earth', 'Mars', 'Venus']
planets[0]
Out[26]:
'Earth'

You can index from the end of the list by prefixing with a minus sign

In [27]:
planets[-1]
Out[27]:
'Venus'

Multiple elements can be selected via slicing.

In [28]:
planets[0:2]
Out[28]:
['Earth', 'Mars']

Slicing is inclusive of the start of the range and exclusive of the end, so 0:2 returns list elements 0 and 1.

Either the start or the end number of the range can be omitted to include all items from the beginning or to the end of the list, respectively.

In [29]:
planets[:2]
Out[29]:
['Earth', 'Mars']

To add items to the list, the addition operator can be used together with a list of the items to be added.

In [30]:
planets = planets + ['Neptune']
planets
Out[30]:
['Earth', 'Mars', 'Venus', 'Neptune']

A loop can be used to access the elements in a list or other Python data structure one at a time, and then do something with each element. Loops are an important part of programming - they make automation of repetitive tasks possible.

In [31]:
for planet in planets:
    print(planet)
Earth
Mars
Venus
Neptune

The variable planet is reassigned for every iteration of the loop until the list planets has been exhausted.

Operations can be performed on elements inside loops.

In [32]:
for planet in planets:
    print('I live on ' + planet)
I live on Earth
I live on Mars
I live on Venus
I live on Neptune

Tuples

A tuple is similar to a list in that it's an ordered sequence of elements. However, tuples can't be changed once created (they are "immutable"). Tuples are created by separating values with a comma (and for clarity these are commonly surrounded by parentheses).

In [33]:
a_tuple = (1, 2, 3)
another_tuple = ('blue', 'green', 'red')

Challenge - Tuples

  1. Type type(a_tuple) into Python - what is the object type?
  2. What happens when you type a_tuple[2] = 5 vs planets[1] = 5 ?

Dictionaries

A dictionary is a container that holds pairs of objects: keys and values.

In [34]:
fruit_colors = {'banana': 'yellow', 'strawberry': 'red'}
fruit_colors
Out[34]:
{'banana': 'yellow', 'strawberry': 'red'}

Dictionaries work a lot like lists - except that they are indexed with keys. Think about a key as a unique identifier for a set of values in the dictionary. Keys can only have particular types - they have to be "hashable". Strings and numeric types are acceptable, but lists aren't.

In [35]:
fruit_colors['banana']
Out[35]:
'yellow'

To add an item to the dictionary, a value is assigned to a new dictionary key.

In [36]:
fruit_colors['apple'] = 'green'
fruit_colors
Out[36]:
{'banana': 'yellow', 'strawberry': 'red', 'apple': 'green'}

Using loops with dictionaries iterates over the keys by default.

In [37]:
for fruit in fruit_colors:
    print(fruit, fruit_colors[fruit])
banana yellow
strawberry red
apple green

Trying to use a non-existing key, e.g. because of a typo, throws an error message.

fruit_colors['bannana']

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-37-84b86acf3267> in <module>()
----> 1 fruit_colors['bannana']

KeyError: 'bannana'

This kind of error message is commonly referred to as a "traceback". It pinpoints which line in the code cell resulted in an error when it was executed, by pointing at it with an arrow (---->). This is helpful in figuring out what went wrong, especially when many lines of code are executed simultaneously.

Challenge - Can you do reassignment in a dictionary?

  1. In the fruit_colors dictionary, change the color of apple to 'red'.
  2. Loop through the fruit_colors dictionary and print a key only if its value in the dictionary is 'red'.

Functions

Defining a section of code as a function in Python is done using the def keyword. For example, let's create a function that takes two arguments and returns their sum:

In [38]:
def add_function(a, b):
    """This function adds two values together"""
    result = a + b
    return result

z = add_function(20, 22)
z
Out[38]:
42

There's already a built-in sum() function in Python so add_function isn't all that useful, but functions can be created to do many custom things. In general, if you find yourself doing the same steps over and over, it might be a good idea to make a function.

The ? prefix can also be used to get help for the function.

In [39]:
?add_function
Signature: add_function(a, b)
Docstring: This function adds two values together
File:      ~/Documents/GitHub/python-workshop/<ipython-input-38-d410f2449e2e>
Type:      function

The string between the """ is called the docstring and is shown in the help message, so it is important to write a clear description of the function here. It is possible to see the entire source code of the function by using double ? (this can be quite complex for complicated functions).

In [40]:
??add_function
Signature: add_function(a, b)
Source:   
def add_function(a, b):
    """This function adds two values together"""
    result = a + b
    return result
File:      ~/Documents/GitHub/python-workshop/<ipython-input-38-d410f2449e2e>
Type:      function

Much of the power of languages such as Python and R comes from community-contributed functions written by talented people and shared openly, so that anyone can use them for their own research instead of reinventing the wheel. These community contributions are often bundled together in so-called packages, libraries, or modules, which consist of a set of related functions that are helpful for carrying out a particular task. Packages will be covered in more detail later.

Part 2: working with data

Now that we've seen some Python commands and syntax, let's work with some real data. For the rest of the workshop, we'll be going through an example analysis and visualization pipeline with some data on bicycle counts from the City of Toronto.

We won't talk too much about spreadsheets specifically, but Python isn't necessarily a replacement for spreadsheets - they are still useful, especially for entering data. If you'd like more information and guidelines about best practices for using spreadsheets, there's a short Data Carpentry module that covers this: https://datacarpentry.org/spreadsheet-ecology-lesson/.

There are a few data organization principles for spreadsheets that make it much easier to use data programmatically with a tool like Python. From the Data Carpentry spreadsheets lesson, here are the cardinal rules of using spreadsheet programs for data:

  1. Put all your variables in columns - the thing you’re measuring, like ‘weight’ or ‘temperature’.
  2. Put each observation in its own row. Columns = variables, rows = observations, cells = data values.
  3. Don’t combine multiple pieces of information in one cell. Sometimes it just seems like one thing, but think if that’s the only way you’ll want to be able to use or sort that data.
  4. Leave the raw data raw - don’t change it!
  5. Export the cleaned data to a text-based format like CSV (comma-separated values) format. This ensures that anyone can use the data, and is required by most data repositories.

Dataset background

We will be using data from the City of Toronto Open Data Catalogue, a great resource with lots of publicly available data. Specifically, we will be analyzing counts of bicycles from the College St. bikelanes in September 2010 and September 2017.

Here's what the data looked like when I downloaded it from the City. Can anyone see any violations of the principles we just discussed? How should the data have been formatted differently when it was entered and saved?

Because the original data looked like this and would have been tricky to analyze in an automated way, I made a new file in which each count observation had its own row.

The data can be downloaded directly at this link: https://bit.ly/2Cs1Mq1 or https://gist.githubusercontent.com/mbonsma/be7482639d7a2d5cfc52505aadb9b53e/raw/1f68fce4a127fdd3b2313728dd84cf21e86e7df3/college_spadina_2010_2017.csv

Data format

We are studying the number of cyclists counted each hour on a series of dates in September 2010 and September 2017. The dataset is stored as a comma separated value (CSV) file. Each row holds information for a single hour of counts, and the columns are described below.

  • date: date of the count
  • day_of_week: day of the week, e.g. Saturday
  • temperature: temperature (°C)
  • weather: amount of rain
  • direction: direction of travel of the bicycle (Eastbound or Westbound)
  • position: recording position, if given
  • hour: start time of the hour-long counting block
  • bikes: number of bicycles counted

To read the data into Python, we are going to use a function called read_csv. This function is contained in a Python package called pandas. As mentioned previously, Python packages are a bit like browser extensions: they are not essential, but can provide nifty functionality. To use a package, it first needs to be imported.

In [41]:
# pandas is commonly given the nickname `pd`
import pandas as pd

pandas can read CSV files saved on the computer or directly from a URL.

In [42]:
bike_counts = pd.read_csv("https://bit.ly/2Cs1Mq1")
# you can also read a file that's already on your computer: 
# bike_counts = pd.read_csv("college_spadina_2010_2017.csv")

To view the result, type bike_counts in a cell and run it, just as when viewing the content of any variable in Python.

In [43]:
bike_counts
Out[43]:
date day_of_week temperature weather direction position hour bikes
0 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 12:00:00 AM 70
1 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 1:00:00 AM 27
2 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 2:00:00 AM 23
3 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 3:00:00 AM 8
4 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 4:00:00 AM 10
5 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 5:00:00 AM 3
6 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 6:00:00 AM 3
7 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 7:00:00 AM 16
8 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 8:00:00 AM 28
9 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 9:00:00 AM 28
10 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 10:00:00 AM 43
11 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 11:00:00 AM 75
12 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 12:00:00 PM 77
13 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 1:00:00 PM 105
14 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 2:00:00 PM 85
15 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 3:00:00 PM 109
16 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 4:00:00 PM 106
17 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 5:00:00 PM 123
18 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 6:00:00 PM 103
19 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 7:00:00 PM 74
20 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 8:00:00 PM 78
21 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 9:00:00 PM 43
22 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 10:00:00 PM 37
23 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 11:00:00 PM 64
24 9/18/10 Saturday 21 No Rain Eastbound Bike lane just west of Spadina Ave. 12:00:00 AM 34
25 9/18/10 Saturday 21 No Rain Eastbound Bike lane just west of Spadina Ave. 1:00:00 AM 38
26 9/18/10 Saturday 21 No Rain Eastbound Bike lane just west of Spadina Ave. 2:00:00 AM 16
27 9/18/10 Saturday 21 No Rain Eastbound Bike lane just west of Spadina Ave. 3:00:00 AM 19
28 9/18/10 Saturday 21 No Rain Eastbound Bike lane just west of Spadina Ave. 4:00:00 AM 7
29 9/18/10 Saturday 21 No Rain Eastbound Bike lane just west of Spadina Ave. 5:00:00 AM 6
... ... ... ... ... ... ... ... ...
594 2017-09-30 Saturday 15 No Rain Eastbound NaN 6:00 PM 90
595 2017-09-30 Saturday 15 No Rain Eastbound NaN 7:00 PM 90
596 2017-09-30 Saturday 15 No Rain Eastbound NaN 8:00 PM 77
597 2017-09-30 Saturday 15 No Rain Eastbound NaN 9:00 PM 79
598 2017-09-30 Saturday 15 No Rain Eastbound NaN 10:00 PM 85
599 2017-09-30 Saturday 15 No Rain Eastbound NaN 11:00 PM 58
600 2017-09-30 Saturday 15 No Rain Westbound NaN 12:00 AM 33
601 2017-09-30 Saturday 15 No Rain Westbound NaN 1:00 AM 13
602 2017-09-30 Saturday 15 No Rain Westbound NaN 2:00 AM 11
603 2017-09-30 Saturday 15 No Rain Westbound NaN 3:00 AM 6
604 2017-09-30 Saturday 15 No Rain Westbound NaN 4:00 AM 4
605 2017-09-30 Saturday 15 No Rain Westbound NaN 5:00 AM 7
606 2017-09-30 Saturday 15 No Rain Westbound NaN 6:00 AM 6
607 2017-09-30 Saturday 15 No Rain Westbound NaN 7:00 AM 14
608 2017-09-30 Saturday 15 No Rain Westbound NaN 8:00 AM 22
609 2017-09-30 Saturday 15 No Rain Westbound NaN 9:00 AM 37
610 2017-09-30 Saturday 15 No Rain Westbound NaN 10:00 AM 48
611 2017-09-30 Saturday 15 No Rain Westbound NaN 11:00 AM 89
612 2017-09-30 Saturday 15 No Rain Westbound NaN 12:00 PM 100
613 2017-09-30 Saturday 15 No Rain Westbound NaN 1:00 PM 96
614 2017-09-30 Saturday 15 No Rain Westbound NaN 2:00 PM 109
615 2017-09-30 Saturday 15 No Rain Westbound NaN 3:00 PM 114
616 2017-09-30 Saturday 15 No Rain Westbound NaN 4:00 PM 132
617 2017-09-30 Saturday 15 No Rain Westbound NaN 5:00 PM 136
618 2017-09-30 Saturday 15 No Rain Westbound NaN 6:00 PM 120
619 2017-09-30 Saturday 15 No Rain Westbound NaN 7:00 PM 98
620 2017-09-30 Saturday 15 No Rain Westbound NaN 8:00 PM 79
621 2017-09-30 Saturday 15 No Rain Westbound NaN 9:00 PM 62
622 2017-09-30 Saturday 15 No Rain Westbound NaN 10:00 PM 80
623 2017-09-30 Saturday 15 No Rain Westbound NaN 11:00 PM 82

624 rows × 8 columns

This is how a data frame is displayed in the JupyterLab Notebook. Although the data frame itself just consists of the values, the Notebook knows that this is a data frame and displays it in a nice tabular format (by adding HTML decorators), with some cosmetic conveniences such as bold font for the column and row names, alternating grey and white zebra stripes for the rows, and highlighting of the row the mouse pointer moves over. The increasing numbers on the far left are the data frame's index, which was added by pandas to easily distinguish between the rows.

What are data frames?

A data frame is the representation of data in a tabular format, similar to how data is often arranged in spreadsheets. The data is rectangular, meaning that all rows have the same number of columns and all columns have the same number of rows. Data frames are the de facto data structure for most tabular data, and what we use for statistics and plotting. A data frame can be created by hand, but most commonly one is generated by an input function such as read_csv(), in other words when importing spreadsheets from your hard drive (or the web).
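As a quick illustration (with made-up values, not part of the bike data analysis), a small data frame can be built by hand from a dictionary where each key becomes a column:

```python
import pandas as pd

# a tiny hand-made data frame: each key becomes a column name,
# and each list holds the values of that column
small_df = pd.DataFrame({
    'planet': ['Earth', 'Mars', 'Venus'],
    'moons': [1, 2, 0]
})
small_df
```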

As can be seen above, the default is to display the first and last 30 rows and truncate everything in between, as indicated by the ellipsis (...). Although it is truncated, this output is still quite space consuming. To glance at how the data frame looks, it is sufficient to display only the top (the first 5 lines) using the head() method.

In [44]:
bike_counts.head()
Out[44]:
date day_of_week temperature weather direction position hour bikes
0 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 12:00:00 AM 70
1 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 1:00:00 AM 27
2 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 2:00:00 AM 23
3 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 3:00:00 AM 8
4 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 4:00:00 AM 10

Methods are very similar to functions; the main difference is that they belong to an object (above, the method head() belongs to the data frame bike_counts). Methods operate on the object they belong to, which is why we can call the method with empty parentheses and no arguments. Compare this with the function type() that was introduced previously.

In [45]:
type(bike_counts)
Out[45]:
pandas.core.frame.DataFrame

Here, the bike_counts variable is explicitly passed as an argument to type(). An immediately tangible advantage of methods is that they simplify tab completion. Just type the name of the data frame, a period, and then hit Tab to see all the relevant methods for that data frame, instead of fumbling around with all the available functions in Python (there are quite a few!) and figuring out which ones operate on data frames and which do not. Methods also facilitate readability when chaining many operations together, which will be shown in detail later.

The columns in a data frame can contain data of different types, e.g. integers, floats, and objects (which include strings, lists, dictionaries, and more). General information about the data frame (including the column data types) can be obtained with the info() method.

In [46]:
bike_counts.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 624 entries, 0 to 623
Data columns (total 8 columns):
date           624 non-null object
day_of_week    624 non-null object
temperature    624 non-null int64
weather        624 non-null object
direction      624 non-null object
position       432 non-null object
hour           624 non-null object
bikes          624 non-null int64
dtypes: int64(2), object(6)
memory usage: 39.1+ KB

The information includes the total number of rows and columns, the number of non-null observations, the column data types, and the memory (RAM) usage. The number of non-null observations is not the same for all columns, which means that some columns contain null (or NA) values representing missing information. The column data type is often indicative of which type of data is stored in that column, and approximately corresponds to the following:

  • Qualitative/Categorical
    • Nominal (labels, e.g. 'red', 'green', 'blue')
      • object, category
    • Ordinal (labels with order, e.g. 'Jan', 'Feb', 'Mar')
      • object, category, int
    • Binary (only two outcomes, e.g. True or False)
      • bool
  • Quantitative/Numerical
    • Discrete (whole numbers, often counting, e.g. number of children)
      • int
    • Continuous (measured values with decimals, e.g. weight)
      • float

Note that an object could contain different types, e.g. str or list. Also note that there can be exceptions to the schema above, but it is still a useful rough guide.
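To check a single column rather than the whole table, the dtype attribute of that column can be inspected; a minimal sketch (the results match the info() output above):

```python
# inspect the data type of individual columns
print(bike_counts['bikes'].dtype)    # int64: quantitative, discrete counts
print(bike_counts['weather'].dtype)  # object: qualitative labels (strings)
```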

After reading in the data into a data frame, head() and info() are two of the most useful methods to get an idea of the structure of this data frame. There are many additional methods that can facilitate the understanding of what a data frame contains:

  • Size:

    • bike_counts.shape - a tuple with the number of rows in the first element and the number of columns as the second element
    • bike_counts.shape[0] - the number of rows
    • bike_counts.shape[1] - the number of columns
  • Content:

    • bike_counts.head() - shows the first 5 rows
    • bike_counts.tail() - shows the last 5 rows
  • Names:

    • bike_counts.columns - returns the names of the columns (also called variable names)
    • bike_counts.index - returns the names of the rows (referred to as the index in pandas)
  • Summary:

    • bike_counts.info() - column names and data types, number of non-null observations, and memory usage
    • bike_counts.describe() - summary statistics for each column

These belong to a data frame and are commonly referred to as attributes of the data frame. All attributes are accessed with the dot-syntax (.), which returns the attribute's value. If the attribute is a method, parentheses can be appended to the name to carry out the method's operation on the data frame. Attributes that are not methods often hold a value that has been precomputed because it is commonly accessed and it saves time to store the value in an attribute instead of recomputing it every time it is needed. For example, every time pandas creates a data frame, the number of rows and columns is computed and stored in the shape attribute.
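As a quick check, a few of these can be inspected directly; the output should agree with the info() summary above:

```python
# attributes are accessed with the dot-syntax; methods additionally get parentheses
print(bike_counts.shape)     # (624, 8): number of rows and columns
print(bike_counts.shape[0])  # 624: just the number of rows
print(bike_counts.columns)   # the column names
```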

Challenge

Based on the output of bike_counts.info(), can you answer the following questions?

  • What is the class of the object bike_counts?
  • How many rows and how many columns are in this object?
  • Why is there not the same number of rows (observations) for each column?

Saving data frames locally

It is good practice to keep a copy of the data stored locally on your computer in case you want to do offline analyses, the online version of the file changes, or the file is taken down. For this, the data could be downloaded manually or the current bike_counts data frame could be saved to disk as a CSV-file with to_csv().

In [47]:
bike_counts.to_csv('bike_counts.csv', index=False)
# `index=False` because the index (the row names) was generated automatically when pandas opened
# the file, so this information does not need to be saved

Since the data is now saved locally, the next time this Notebook is opened, it could be loaded from the local path instead of downloading it from the URL.

In [48]:
bike_counts = pd.read_csv('bike_counts.csv')
bike_counts.head()
Out[48]:
date day_of_week temperature weather direction position hour bikes
0 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 12:00:00 AM 70
1 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 1:00:00 AM 27
2 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 2:00:00 AM 23
3 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 3:00:00 AM 8
4 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 4:00:00 AM 10

Indexing and subsetting data frames

The bike counts data frame has rows and columns (it has 2 dimensions). To extract specific data from it (also referred to as "subsetting"), columns can be selected by their name. The JupyterLab Notebook (technically, the underlying IPython interpreter) knows about the columns in the data frame, so tab autocompletion can be used to get the correct column name.

In [49]:
bike_counts['weather'].head()
Out[49]:
0     No Rain
1     No Rain
2     No Rain
3     No Rain
4     No Rain
Name: weather, dtype: object

The column header is not shown, since there is only one column; its name appears at the bottom of the output instead. Remember that the numbers on the left are just the index of the data frame, which was added by pandas upon importing the data.

Another syntax that is often used to specify column names is .<column_name>.

In [50]:
bike_counts.weather.head()
Out[50]:
0     No Rain
1     No Rain
2     No Rain
3     No Rain
4     No Rain
Name: weather, dtype: object

Using brackets is clearer and also allows for passing multiple columns as a list, so this tutorial will stick to that.

In [51]:
bike_counts[['date', 'weather']].head()
Out[51]:
date weather
0 9/18/10 No Rain
1 9/18/10 No Rain
2 9/18/10 No Rain
3 9/18/10 No Rain
4 9/18/10 No Rain

The output is displayed a bit differently this time. The reason is that in the earlier cell, where the returned object had only one column ("weather"), pandas technically returned a Series, not a DataFrame. This can be confirmed by using type() as previously.

In [52]:
type(bike_counts['weather'].head())
Out[52]:
pandas.core.series.Series
In [53]:
type(bike_counts[['date', 'weather']].head())
Out[53]:
pandas.core.frame.DataFrame

So, every individual column is actually a Series, and together they constitute a DataFrame. This introductory tutorial will not make any further distinction between a Series and a DataFrame, and many of the analysis techniques used here apply to both.

Selecting with single brackets ([]) is a shortcut to common operations, such as selecting columns by labels as above. For more flexible and robust row and column selection the more verbose loc[<rows>, <columns>] (location) syntax is used.

In [54]:
bike_counts.loc[[0, 2, 4], ['date', 'weather']]
# Although methods usually have trailing parentheses, square brackets are used with `loc[]` to stay
# consistent with indexing with square brackets in general in Python (e.g. lists and NumPy arrays)
Out[54]:
date weather
0 9/18/10 No Rain
2 9/18/10 No Rain
4 9/18/10 No Rain

A single number can be selected, which returns that value (here, an integer) rather than a data frame or series with just one value.

In [55]:
bike_counts.loc[4, 'day_of_week']
Out[55]:
'Saturday'

If the column argument is left out, all columns are selected.

In [56]:
bike_counts.loc[[3, 4]]
Out[56]:
date day_of_week temperature weather direction position hour bikes
3 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 3:00:00 AM 8
4 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 4:00:00 AM 10

To select all rows, but only a subset of columns, the colon character (:) can be used.

In [57]:
bike_counts.loc[:, ['date', 'day_of_week']].shape # show the size of the data frame
Out[57]:
(624, 2)

It is also possible to select slices of rows and column labels.

In [58]:
bike_counts.loc[2:4, 'day_of_week':'position']
Out[58]:
day_of_week temperature weather direction position
2 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave.
3 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave.
4 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave.

It is important to realize that loc[] selects rows and columns by their labels. To instead select by row or column position, use iloc[] (integer location).

In [59]:
bike_counts.iloc[[2, 3, 4], [0, 1, 2]]
Out[59]:
date day_of_week temperature
2 9/18/10 Saturday 21
3 9/18/10 Saturday 21
4 9/18/10 Saturday 21

The index of bike_counts consists of consecutive integers, so in this case selecting from the index by labels or by positions looks the same. An index could also consist of text names, just like the columns.
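For example, one of the text columns could be promoted to the index with the set_index() method; this is just a sketch and is not needed for the rest of the lesson:

```python
# use the day_of_week column as the index instead of the default integers
by_day = bike_counts.set_index('day_of_week')
by_day.loc['Saturday'].head()  # rows can now be selected by that label
```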

While selecting a slice by label is inclusive of the start and end, selecting a slice by position is inclusive of the start but exclusive of the end position, just like when slicing lists.

In [60]:
bike_counts.iloc[2:5, :3]
Out[60]:
date day_of_week temperature
2 9/18/10 Saturday 21
3 9/18/10 Saturday 21
4 9/18/10 Saturday 21

Selecting slices of row positions is a common operation, and has thus been given a shortcut syntax with single brackets.

In [61]:
bike_counts[2:5]
Out[61]:
date day_of_week temperature weather direction position hour bikes
2 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 2:00:00 AM 23
3 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 3:00:00 AM 8
4 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 4:00:00 AM 10

Challenge

  1. Extract the 200th and 201st row of the bike_counts dataset and assign the resulting data frame to a new variable name (bike_counts_200_201). Remember that Python indexing starts at 0!

  2. How can you get the same result as from bike_counts.head() by using row slices instead of the head() method?

  3. There are at least three distinct ways to extract the last row of the data frame. How many can you come up with?

The describe() method was mentioned above as a way of retrieving summary statistics of a data frame. Together with info() and head(), this is often a good place to start exploratory data analysis, as it gives a nice overview of the numeric variables in the data set. It's a good way to check things such as whether the max and min values make sense for a particular column.

In [62]:
bike_counts.describe()
Out[62]:
temperature bikes
count 624.000000 624.000000
mean 21.076923 81.134615
std 4.517899 78.900314
min 15.000000 0.000000
25% 18.000000 21.000000
50% 19.000000 66.000000
75% 23.000000 113.000000
max 30.000000 511.000000

A common next step would be to plot the data to explore relationships between different variables, but before getting into plotting, it is beneficial to elaborate on the data frame object and several of its common operations.

An often desired outcome is to select a subset of rows matching some criterion, e.g. the observations with a bike count under 5. To do this, the "less than" comparison operator that was introduced previously can be used.

In [63]:
bike_counts['bikes'] < 5
Out[63]:
0      False
1      False
2      False
3      False
4      False
5       True
6       True
7      False
8      False
9      False
10     False
11     False
12     False
13     False
14     False
15     False
16     False
17     False
18     False
19     False
20     False
21     False
22     False
23     False
24     False
25     False
26     False
27     False
28     False
29     False
       ...  
594    False
595    False
596    False
597    False
598    False
599    False
600    False
601    False
602    False
603    False
604     True
605    False
606    False
607    False
608    False
609    False
610    False
611    False
612    False
613    False
614    False
615    False
616    False
617    False
618    False
619    False
620    False
621    False
622    False
623    False
Name: bikes, Length: 624, dtype: bool

The result is a boolean array with one value for every row in the data frame indicating whether it is True or False that this row has a value below 5 in the bikes column. This boolean array can be used to select only those rows from the data frame that meet the specified condition.

In [64]:
bike_counts[bike_counts['bikes'] < 5]
Out[64]:
date day_of_week temperature weather direction position hour bikes
5 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 5:00:00 AM 3
6 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 6:00:00 AM 3
53 9/25/10 Saturday 18 No Rain Westbound Bike lane just east of Spadina Ave. 5:00:00 AM 3
54 9/25/10 Saturday 18 No Rain Westbound Bike lane just east of Spadina Ave. 6:00:00 AM 3
101 9/19/10 Sunday 20 No Rain Westbound Bike lane just east of Spadina Ave. 5:00:00 AM 4
125 9/19/10 Sunday 20 No Rain Eastbound Bike lane just west of Spadina Ave. 5:00:00 AM 3
173 9/26/10 Sunday 17 No Rain Eastbound Bike lane just west of Spadina Ave. 5:00:00 AM 2
191 9/26/10 Sunday 17 No Rain Eastbound Bike lane just west of Spadina Ave. 11:00:00 PM 4
194 9/20/10 Monday 19 No Rain Westbound Bike lane just east of Spadina Ave. 2:00:00 AM 3
195 9/20/10 Monday 19 No Rain Westbound Bike lane just east of Spadina Ave. 3:00:00 AM 2
196 9/20/10 Monday 19 No Rain Westbound Bike lane just east of Spadina Ave. 4:00:00 AM 3
197 9/20/10 Monday 19 No Rain Westbound Bike lane just east of Spadina Ave. 5:00:00 AM 4
218 9/20/10 Monday 19 No Rain Eastbound Bike lane just west of Spadina Ave. 2:00:00 AM 1
219 9/20/10 Monday 19 No Rain Eastbound Bike lane just west of Spadina Ave. 3:00:00 AM 3
220 9/20/10 Monday 19 No Rain Eastbound Bike lane just west of Spadina Ave. 4:00:00 AM 4
242 9/21/10 Tuesday 25 No Rain Westbound Bike lane just east of Spadina Ave. 2:00:00 AM 3
243 9/21/10 Tuesday 25 No Rain Westbound Bike lane just east of Spadina Ave. 3:00:00 AM 2
244 9/21/10 Tuesday 25 No Rain Westbound Bike lane just east of Spadina Ave. 4:00:00 AM 0
266 9/21/10 Tuesday 25 No Rain Eastbound Bike lane just west of Spadina Ave. 2:00:00 AM 3
267 9/21/10 Tuesday 25 No Rain Eastbound Bike lane just west of Spadina Ave. 3:00:00 AM 2
268 9/21/10 Tuesday 25 No Rain Eastbound Bike lane just west of Spadina Ave. 4:00:00 AM 4
290 9/22/10 Wednesday 23 Rain 6mm Westbound Bike lane just east of Spadina Ave. 2:00:00 AM 3
291 9/22/10 Wednesday 23 Rain 6mm Westbound Bike lane just east of Spadina Ave. 3:00:00 AM 3
292 9/22/10 Wednesday 23 Rain 6mm Westbound Bike lane just east of Spadina Ave. 4:00:00 AM 0
293 9/22/10 Wednesday 23 Rain 6mm Westbound Bike lane just east of Spadina Ave. 5:00:00 AM 4
315 9/22/10 Wednesday 23 Rain 6mm Eastbound Bike lane just west of Spadina Ave. 3:00:00 AM 4
316 9/22/10 Wednesday 23 Rain 6mm Eastbound Bike lane just west of Spadina Ave. 4:00:00 AM 4
340 9/23/10 Thursday 19 No Rain Westbound Bike lane just east of Spadina Ave. 4:00:00 AM 0
341 9/23/10 Thursday 19 No Rain Westbound Bike lane just east of Spadina Ave. 5:00:00 AM 3
364 9/23/10 Thursday 19 No Rain Eastbound Bike lane just west of Spadina Ave. 4:00:00 AM 4
387 9/24/10 Friday 30 No Rain Westbound Bike lane just east of Spadina Ave. 3:00:00 AM 2
388 9/24/10 Friday 30 No Rain Westbound Bike lane just east of Spadina Ave. 4:00:00 AM 1
435 2017-09-27 Wednesday 30 No Rain Eastbound NaN 3:00 AM 4
459 2017-09-27 Wednesday 30 No Rain Westbound NaN 3:00 AM 2
483 2017-09-28 Thursday 19 No Rain Eastbound NaN 3:00 AM 3
508 2017-09-28 Thursday 19 No Rain Westbound NaN 4:00 AM 4
509 2017-09-28 Thursday 19 No Rain Westbound NaN 5:00 AM 2
531 2017-09-29 Friday 18 Rain 4mm Eastbound NaN 3:00 AM 3
556 2017-09-29 Friday 18 Rain 4mm Westbound NaN 4:00 AM 4
580 2017-09-30 Saturday 15 No Rain Eastbound NaN 4:00 AM 2
604 2017-09-30 Saturday 15 No Rain Westbound NaN 4:00 AM 4

As before, this can be combined with selection of a particular set of columns.

In [65]:
bike_counts.loc[bike_counts['bikes'] < 5, ['hour', 'bikes']]
Out[65]:
hour bikes
5 5:00:00 AM 3
6 6:00:00 AM 3
53 5:00:00 AM 3
54 6:00:00 AM 3
101 5:00:00 AM 4
125 5:00:00 AM 3
173 5:00:00 AM 2
191 11:00:00 PM 4
194 2:00:00 AM 3
195 3:00:00 AM 2
196 4:00:00 AM 3
197 5:00:00 AM 4
218 2:00:00 AM 1
219 3:00:00 AM 3
220 4:00:00 AM 4
242 2:00:00 AM 3
243 3:00:00 AM 2
244 4:00:00 AM 0
266 2:00:00 AM 3
267 3:00:00 AM 2
268 4:00:00 AM 4
290 2:00:00 AM 3
291 3:00:00 AM 3
292 4:00:00 AM 0
293 5:00:00 AM 4
315 3:00:00 AM 4
316 4:00:00 AM 4
340 4:00:00 AM 0
341 5:00:00 AM 3
364 4:00:00 AM 4
387 3:00:00 AM 2
388 4:00:00 AM 1
435 3:00 AM 4
459 3:00 AM 2
483 3:00 AM 3
508 4:00 AM 4
509 5:00 AM 2
531 3:00 AM 3
556 4:00 AM 4
580 4:00 AM 2
604 4:00 AM 4

All of the bike counts that are less than 5 happened at night, between 11 PM and 6 AM.
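This can be double-checked by listing the unique hours that occur among those rows (a quick sanity check, not a required step):

```python
# which hours appear among the rows with fewer than 5 bikes?
bike_counts.loc[bike_counts['bikes'] < 5, 'hour'].unique()
```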

A single expression can also be used to filter for several criteria, either matching all criteria (&) or any criteria (|). These special operators are used instead of and and or to make sure that the comparison occurs for each row in the data frame. Parentheses are added to indicate the priority of the comparisons.

In [66]:
# AND = &
bike_counts.loc[(bike_counts['day_of_week'] == 'Saturday') & 
                (bike_counts['direction'] == 'Eastbound'), ['day_of_week', 'direction']].head()
Out[66]:
day_of_week direction
24 Saturday Eastbound
25 Saturday Eastbound
26 Saturday Eastbound
27 Saturday Eastbound
28 Saturday Eastbound

To increase readability, these statements can be split across multiple lines. Anything inside parentheses or brackets in Python can be continued on the next line. When inside a bracket or parenthesis, the indentation is not significant to the Python interpreter, but it is still recommended to use it to make the code more readable.

In [67]:
bike_counts.loc[(bike_counts['day_of_week'] == 'Saturday') & 
                (bike_counts['direction'] == 'Eastbound'), 
                ['day_of_week', 'direction']].head()
Out[67]:
day_of_week direction
24 Saturday Eastbound
25 Saturday Eastbound
26 Saturday Eastbound
27 Saturday Eastbound
28 Saturday Eastbound

With the | operator, rows matching either of the supplied criteria are returned.

In [68]:
# OR = |
bike_counts.loc[(bike_counts['day_of_week'] == 'Saturday') |
            (bike_counts['day_of_week'] == 'Sunday'),
            'day_of_week']
Out[68]:
0      Saturday
1      Saturday
2      Saturday
3      Saturday
4      Saturday
5      Saturday
6      Saturday
7      Saturday
8      Saturday
9      Saturday
10     Saturday
11     Saturday
12     Saturday
13     Saturday
14     Saturday
15     Saturday
16     Saturday
17     Saturday
18     Saturday
19     Saturday
20     Saturday
21     Saturday
22     Saturday
23     Saturday
24     Saturday
25     Saturday
26     Saturday
27     Saturday
28     Saturday
29     Saturday
         ...   
594    Saturday
595    Saturday
596    Saturday
597    Saturday
598    Saturday
599    Saturday
600    Saturday
601    Saturday
602    Saturday
603    Saturday
604    Saturday
605    Saturday
606    Saturday
607    Saturday
608    Saturday
609    Saturday
610    Saturday
611    Saturday
612    Saturday
613    Saturday
614    Saturday
615    Saturday
616    Saturday
617    Saturday
618    Saturday
619    Saturday
620    Saturday
621    Saturday
622    Saturday
623    Saturday
Name: day_of_week, Length: 240, dtype: object

Challenge

Subset the bike_counts data to include counts collected only on weekdays and retain only the columns 'date', 'day_of_week', and 'bikes'. There are multiple ways this could be done.
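
One possible approach (a sketch only; there are other equally valid solutions) is to exclude Saturday and Sunday with boolean comparisons and select the desired columns at the same time:

weekday_counts = bike_counts.loc[(bike_counts['day_of_week'] != 'Saturday') &
                                 (bike_counts['day_of_week'] != 'Sunday'),
                                 ['date', 'day_of_week', 'bikes']]
weekday_counts.head()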

Creating new columns and the pandas datetime object

A frequent operation when working with data is to create new columns based on the values in existing columns, for example to do unit conversions or find the ratio of values in two columns. You might have noticed that the hour column had two different formats; this happened because the data was entered a bit differently in 2010 and 2017. The same thing happened with the date column. To clean this up, let's make new columns for the date and hour that are formatted better.

In [69]:
bike_counts[['date', 'hour']].head()
Out[69]:
date hour
0 9/18/10 12:00:00 AM
1 9/18/10 1:00:00 AM
2 9/18/10 2:00:00 AM
3 9/18/10 3:00:00 AM
4 9/18/10 4:00:00 AM
In [70]:
bike_counts[['date', 'hour']].tail()
Out[70]:
date hour
619 2017-09-30 7:00 PM
620 2017-09-30 8:00 PM
621 2017-09-30 9:00 PM
622 2017-09-30 10:00 PM
623 2017-09-30 11:00 PM

pandas has a function called to_datetime which takes things that are time-like and creates a very flexible object that stores the date and time. We'll use this on both the 'date' and 'hour' columns to make new columns.

First, let's see what to_datetime does:

In [71]:
pd.to_datetime(bike_counts['date'])
Out[71]:
0     2010-09-18
1     2010-09-18
2     2010-09-18
3     2010-09-18
4     2010-09-18
5     2010-09-18
6     2010-09-18
7     2010-09-18
8     2010-09-18
9     2010-09-18
10    2010-09-18
11    2010-09-18
12    2010-09-18
13    2010-09-18
14    2010-09-18
15    2010-09-18
16    2010-09-18
17    2010-09-18
18    2010-09-18
19    2010-09-18
20    2010-09-18
21    2010-09-18
22    2010-09-18
23    2010-09-18
24    2010-09-18
25    2010-09-18
26    2010-09-18
27    2010-09-18
28    2010-09-18
29    2010-09-18
         ...    
594   2017-09-30
595   2017-09-30
596   2017-09-30
597   2017-09-30
598   2017-09-30
599   2017-09-30
600   2017-09-30
601   2017-09-30
602   2017-09-30
603   2017-09-30
604   2017-09-30
605   2017-09-30
606   2017-09-30
607   2017-09-30
608   2017-09-30
609   2017-09-30
610   2017-09-30
611   2017-09-30
612   2017-09-30
613   2017-09-30
614   2017-09-30
615   2017-09-30
616   2017-09-30
617   2017-09-30
618   2017-09-30
619   2017-09-30
620   2017-09-30
621   2017-09-30
622   2017-09-30
623   2017-09-30
Name: date, Length: 624, dtype: datetime64[ns]

That looks good: now all the dates are in the same YYYY-MM-DD format. Let's save that result in a new column called 'date_dt' for "date as datetime".

In [72]:
bike_counts['date_dt'] = pd.to_datetime(bike_counts['date'])
In [73]:
# check to see that our new column was created
bike_counts.columns
Out[73]:
Index(['date', 'day_of_week', 'temperature', 'weather', 'direction',
       'position', 'hour', 'bikes', 'date_dt'],
      dtype='object')

The really handy thing about datetime is that it allows you to access certain parts of the date on their own. Let's say you only want the year:

In [74]:
bike_counts['date_dt'].dt.year.tail()
Out[74]:
619    2017
620    2017
621    2017
622    2017
623    2017
Name: date_dt, dtype: int64

.year is accessed through the .dt accessor, and there are lots of these properties, including .month, .day, .hour, .minute, .second, etc. Try them out yourself!
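
For example, the month and the day of the month can be pulled out in the same way (a quick sketch):

bike_counts['date_dt'].dt.month.tail()  # month as an integer, 1-12
bike_counts['date_dt'].dt.day.tail()    # day of the month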

Next, let's make a column for the hours that just stores the hours as an integer value between 0 and 23.

In [75]:
bike_counts['hour_dt'] = pd.to_datetime(bike_counts['hour']).dt.hour

Here we combined a .dt accessor and the to_datetime function to go straight to the integer value of the hour.

In [76]:
bike_counts.head()
Out[76]:
date day_of_week temperature weather direction position hour bikes date_dt hour_dt
0 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 12:00:00 AM 70 2010-09-18 0
1 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 1:00:00 AM 27 2010-09-18 1
2 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 2:00:00 AM 23 2010-09-18 2
3 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 3:00:00 AM 8 2010-09-18 3
4 9/18/10 Saturday 21 No Rain Westbound Bike lane just east of Spadina Ave. 4:00:00 AM 10 2010-09-18 4

Earlier, bike_counts.info() showed that the 'position' column was missing some values. The dropna() method can be used to remove all records with missing data. The missing values occurred because the 2017 data didn't have position information.

In [77]:
bike_counts.tail(10)
Out[77]:
date day_of_week temperature weather direction position hour bikes date_dt hour_dt
614 2017-09-30 Saturday 15 No Rain Westbound NaN 2:00 PM 109 2017-09-30 14
615 2017-09-30 Saturday 15 No Rain Westbound NaN 3:00 PM 114 2017-09-30 15
616 2017-09-30 Saturday 15 No Rain Westbound NaN 4:00 PM 132 2017-09-30 16
617 2017-09-30 Saturday 15 No Rain Westbound NaN 5:00 PM 136 2017-09-30 17
618 2017-09-30 Saturday 15 No Rain Westbound NaN 6:00 PM 120 2017-09-30 18
619 2017-09-30 Saturday 15 No Rain Westbound NaN 7:00 PM 98 2017-09-30 19
620 2017-09-30 Saturday 15 No Rain Westbound NaN 8:00 PM 79 2017-09-30 20
621 2017-09-30 Saturday 15 No Rain Westbound NaN 9:00 PM 62 2017-09-30 21
622 2017-09-30 Saturday 15 No Rain Westbound NaN 10:00 PM 80 2017-09-30 22
623 2017-09-30 Saturday 15 No Rain Westbound NaN 11:00 PM 82 2017-09-30 23
In [78]:
bike_counts.dropna().tail(10)
Out[78]:
date day_of_week temperature weather direction position hour bikes date_dt hour_dt
422 9/24/10 Friday 30 No Rain Eastbound Bike lane just west of Spadina Ave. 2:00:00 PM 102 2010-09-24 14
423 9/24/10 Friday 30 No Rain Eastbound Bike lane just west of Spadina Ave. 3:00:00 PM 126 2010-09-24 15
424 9/24/10 Friday 30 No Rain Eastbound Bike lane just west of Spadina Ave. 4:00:00 PM 72 2010-09-24 16
425 9/24/10 Friday 30 No Rain Eastbound Bike lane just west of Spadina Ave. 5:00:00 PM 89 2010-09-24 17
426 9/24/10 Friday 30 No Rain Eastbound Bike lane just west of Spadina Ave. 6:00:00 PM 61 2010-09-24 18
427 9/24/10 Friday 30 No Rain Eastbound Bike lane just west of Spadina Ave. 7:00:00 PM 81 2010-09-24 19
428 9/24/10 Friday 30 No Rain Eastbound Bike lane just west of Spadina Ave. 8:00:00 PM 52 2010-09-24 20
429 9/24/10 Friday 30 No Rain Eastbound Bike lane just west of Spadina Ave. 9:00:00 PM 51 2010-09-24 21
430 9/24/10 Friday 30 No Rain Eastbound Bike lane just west of Spadina Ave. 10:00:00 PM 41 2010-09-24 22
431 9/24/10 Friday 30 No Rain Eastbound Bike lane just west of Spadina Ave. 11:00:00 PM 33 2010-09-24 23

By default, .dropna() removes all rows that have an NA value in any of the columns. There are parameters for dropna() that can specify how the rows are dropped and which columns should be searched for NAs.
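
For example, the subset parameter restricts the search for NAs to particular columns, and how='all' drops a row only if every value in it is missing (a sketch of the syntax):

bike_counts.dropna(subset=['position'])  # drop rows where 'position' is NA
bike_counts.dropna(how='all')            # drop rows only if every column is NA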

A common alternative to removing rows containing NA values is to fill in the missing values with something else, for example the mean of all observations or the previous non-NA value. This can be done with the fillna() method.

In [79]:
bike_counts['position'].tail()
Out[79]:
619    NaN
620    NaN
621    NaN
622    NaN
623    NaN
Name: position, dtype: object
In [80]:
# Fill a particular value
fill_value = "Bike lane"
bike_counts['position'].fillna(fill_value).tail()
Out[80]:
619    Bike lane
620    Bike lane
621    Bike lane
622    Bike lane
623    Bike lane
Name: position, dtype: object
In [81]:
# Fill with previous non-null value - ffill stands for "forward fill"
bike_counts['position'].fillna(method='ffill').tail()
Out[81]:
619    Bike lane just west of Spadina Ave.
620    Bike lane just west of Spadina Ave.
621    Bike lane just west of Spadina Ave.
622    Bike lane just west of Spadina Ave.
623    Bike lane just west of Spadina Ave.
Name: position, dtype: object

Note that in both of these examples we haven't changed the data frame permanently. Unless we reassign the result to that column, the result is returned instead of being stored.
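
To keep the filled values, the result would be assigned back to the column, for example (not run here, so the rest of the lesson uses the original data):

bike_counts['position'] = bike_counts['position'].fillna(fill_value)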

Whether to use dropna() or fillna() depends on the data set and the purpose of the analysis.

Challenge

  1. Create a new data frame from the bike_counts data that contains only the date and bikes columns.
  2. Add a column to this new data frame called year, which contains just the year from the date column.
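
One way to approach this challenge (a sketch; other solutions are equally valid):

dates_bikes = bike_counts[['date', 'bikes']].copy()  # copy() avoids modifying bike_counts
dates_bikes['year'] = pd.to_datetime(dates_bikes['date']).dt.year
dates_bikes.head()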

Split-apply-combine techniques in pandas

Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results.

pandas facilitates this workflow through the use of groupby() to split data, and summary/aggregation functions such as mean(), which collapses each group into a single-row summary of that group. The arguments to groupby() are the column names that contain the categorical variables by which summary statistics should be calculated. To start, compute the total number of bikes counted, grouped by date.

(Figure: illustration of the split-apply-combine workflow; image credit Jake VanderPlas.)

The .sum() method can be used to calculate the sum of each group. When the sum is computed, the default behavior is to ignore NA values, so they only need to be dropped if they are to be excluded from the visual output.

In [82]:
bike_counts.groupby(['date_dt'])['bikes'].sum()
Out[82]:
date_dt
2010-09-18    2531
2010-09-19    2669
2010-09-20    4734
2010-09-21    4814
2010-09-22    3812
2010-09-23    4746
2010-09-24    4373
2010-09-25    2515
2010-09-26    2575
2017-09-27    5667
2017-09-28    5426
2017-09-29    3778
2017-09-30    2988
Name: bikes, dtype: int64

The output here is a series that is indexed by the grouping variable (the date), and its values contain the result of the aggregation (the sum of counts).

Individual entries can be selected from the resulting series using loc[], just as previously.

In [83]:
bike_counts_summed = bike_counts.groupby(['date_dt'])['bikes'].sum()
bike_counts_summed.loc['2010-09-18':'2010-09-26']
Out[83]:
date_dt
2010-09-18    2531
2010-09-19    2669
2010-09-20    4734
2010-09-21    4814
2010-09-22    3812
2010-09-23    4746
2010-09-24    4373
2010-09-25    2515
2010-09-26    2575
Name: bikes, dtype: int64

Groups can also be created from multiple columns; for example, it could be interesting to compare counts for the same day of the week across the two years.

In [84]:
bike_counts.columns
Out[84]:
Index(['date', 'day_of_week', 'temperature', 'weather', 'direction',
       'position', 'hour', 'bikes', 'date_dt', 'hour_dt'],
      dtype='object')
In [85]:
bike_counts_date_weekday = bike_counts.groupby(['day_of_week', 'date_dt'])['bikes'].sum()
In [86]:
bike_counts_date_weekday
Out[86]:
day_of_week  date_dt   
Friday       2010-09-24    4373
             2017-09-29    3778
Monday       2010-09-20    4734
Saturday     2010-09-18    2531
             2010-09-25    2515
             2017-09-30    2988
Sunday       2010-09-19    2669
             2010-09-26    2575
Thursday     2010-09-23    4746
             2017-09-28    5426
Tuesday      2010-09-21    4814
Wednesday    2010-09-22    3812
             2017-09-27    5667
Name: bikes, dtype: int64

The returned series has an index that is a combination of the columns day_of_week and date_dt, which is referred to as a MultiIndex. The same syntax as before can be used to select rows at the day-of-week level.

In [87]:
bike_counts_date_weekday.loc[['Sunday', 'Thursday']]
Out[87]:
day_of_week  date_dt   
Sunday       2010-09-19    2669
             2010-09-26    2575
Thursday     2010-09-23    4746
             2017-09-28    5426
Name: bikes, dtype: int64

To select specific values from both levels of the MultiIndex, a tuple or list of tuples can be passed to loc[]. Datetime objects are a bit different, and passing a list might not work.

In [88]:
bike_counts_date_weekday.loc[('Saturday', '2017-09-30')]
Out[88]:
2988

The names and values of the index levels can be seen by inspecting the index object.

In [89]:
bike_counts_date_weekday.index
Out[89]:
MultiIndex(levels=[['Friday', 'Monday', 'Saturday', 'Sunday', 'Thursday', 'Tuesday', 'Wednesday'], [2010-09-18 00:00:00, 2010-09-19 00:00:00, 2010-09-20 00:00:00, 2010-09-21 00:00:00, 2010-09-22 00:00:00, 2010-09-23 00:00:00, 2010-09-24 00:00:00, 2010-09-25 00:00:00, 2010-09-26 00:00:00, 2017-09-27 00:00:00, 2017-09-28 00:00:00, 2017-09-29 00:00:00, 2017-09-30 00:00:00]],
           labels=[[0, 0, 1, 2, 2, 2, 3, 3, 4, 4, 5, 6, 6], [6, 11, 2, 0, 7, 12, 1, 8, 5, 10, 3, 4, 9]],
           names=['day_of_week', 'date_dt'])

Although MultiIndexes offer succinct and fast ways to access data, they also require memorizing additional syntax and are, strictly speaking, not essential unless speed is of particular concern. It can therefore be easier to reset the index, so that all values are stored in columns.

In [90]:
bike_counts_date_weekday_res = bike_counts_date_weekday.reset_index()
bike_counts_date_weekday_res
Out[90]:
day_of_week date_dt bikes
0 Friday 2010-09-24 4373
1 Friday 2017-09-29 3778
2 Monday 2010-09-20 4734
3 Saturday 2010-09-18 2531
4 Saturday 2010-09-25 2515
5 Saturday 2017-09-30 2988
6 Sunday 2010-09-19 2669
7 Sunday 2010-09-26 2575
8 Thursday 2010-09-23 4746
9 Thursday 2017-09-28 5426
10 Tuesday 2010-09-21 4814
11 Wednesday 2010-09-22 3812
12 Wednesday 2017-09-27 5667

After resetting the index, the same comparison syntax introduced earlier can be used instead of passing lists of tuples to loc[].

In [91]:
saturday_counts = bike_counts_date_weekday_res.loc[bike_counts_date_weekday_res['day_of_week'] == 'Saturday']
saturday_counts
Out[91]:
day_of_week date_dt bikes
3 Saturday 2010-09-18 2531
4 Saturday 2010-09-25 2515
5 Saturday 2017-09-30 2988

reset_index() grants the freedom of not having to work with indexes, but it is still worth keeping in mind that selecting on an index level can be orders of magnitude faster than using boolean comparisons (on large data frames).

The opposite operation (creating an index) can be performed with set_index() on any column or combination of columns, ideally one that produces an index with unique values.

In [92]:
saturday_counts.set_index(['day_of_week', 'date_dt'])
Out[92]:
bikes
day_of_week date_dt
Saturday 2010-09-18 2531
2010-09-25 2515
2017-09-30 2988

Multiple aggregations on grouped data

Since the bike counts are split into two directions, let's use groupby to create a new dataframe with the total summed across eastbound and westbound directions. Each column we want to keep is included in the list passed to groupby, even though some of them don't actually result in new groups. For example, if hour_dt wasn't included in the list, the sum at the end would add up all the counts across the entire day, when what we want is to keep the counts separate for each hour and only sum over the two directions. The only columns not kept are "position", "date", and "hour".

In [93]:
bike_counts_total = (bike_counts.groupby(['date_dt', 'day_of_week', 'temperature', 'weather', 'hour_dt'])['bikes']
                    .sum()
                    .reset_index())
bike_counts_total
Out[93]:
date_dt day_of_week temperature weather hour_dt bikes
0 2010-09-18 Saturday 21 No Rain 0 104
1 2010-09-18 Saturday 21 No Rain 1 65
2 2010-09-18 Saturday 21 No Rain 2 39
3 2010-09-18 Saturday 21 No Rain 3 27
4 2010-09-18 Saturday 21 No Rain 4 17
5 2010-09-18 Saturday 21 No Rain 5 9
6 2010-09-18 Saturday 21 No Rain 6 14
7 2010-09-18 Saturday 21 No Rain 7 46
8 2010-09-18 Saturday 21 No Rain 8 68
9 2010-09-18 Saturday 21 No Rain 9 85
10 2010-09-18 Saturday 21 No Rain 10 114
11 2010-09-18 Saturday 21 No Rain 11 171
12 2010-09-18 Saturday 21 No Rain 12 163
13 2010-09-18 Saturday 21 No Rain 13 190
14 2010-09-18 Saturday 21 No Rain 14 198
15 2010-09-18 Saturday 21 No Rain 15 149
16 2010-09-18 Saturday 21 No Rain 16 211
17 2010-09-18 Saturday 21 No Rain 17 206
18 2010-09-18 Saturday 21 No Rain 18 170
19 2010-09-18 Saturday 21 No Rain 19 125
20 2010-09-18 Saturday 21 No Rain 20 132
21 2010-09-18 Saturday 21 No Rain 21 68
22 2010-09-18 Saturday 21 No Rain 22 70
23 2010-09-18 Saturday 21 No Rain 23 90
24 2010-09-19 Sunday 20 No Rain 0 73
25 2010-09-19 Sunday 20 No Rain 1 42
26 2010-09-19 Sunday 20 No Rain 2 25
27 2010-09-19 Sunday 20 No Rain 3 24
28 2010-09-19 Sunday 20 No Rain 4 13
29 2010-09-19 Sunday 20 No Rain 5 7
... ... ... ... ... ... ...
282 2017-09-29 Friday 18 Rain 4mm 18 322
283 2017-09-29 Friday 18 Rain 4mm 19 197
284 2017-09-29 Friday 18 Rain 4mm 20 131
285 2017-09-29 Friday 18 Rain 4mm 21 104
286 2017-09-29 Friday 18 Rain 4mm 22 72
287 2017-09-29 Friday 18 Rain 4mm 23 63
288 2017-09-30 Saturday 15 No Rain 0 63
289 2017-09-30 Saturday 15 No Rain 1 31
290 2017-09-30 Saturday 15 No Rain 2 16
291 2017-09-30 Saturday 15 No Rain 3 15
292 2017-09-30 Saturday 15 No Rain 4 6
293 2017-09-30 Saturday 15 No Rain 5 14
294 2017-09-30 Saturday 15 No Rain 6 25
295 2017-09-30 Saturday 15 No Rain 7 35
296 2017-09-30 Saturday 15 No Rain 8 81
297 2017-09-30 Saturday 15 No Rain 9 102
298 2017-09-30 Saturday 15 No Rain 10 108
299 2017-09-30 Saturday 15 No Rain 11 177
300 2017-09-30 Saturday 15 No Rain 12 202
301 2017-09-30 Saturday 15 No Rain 13 200
302 2017-09-30 Saturday 15 No Rain 14 209
303 2017-09-30 Saturday 15 No Rain 15 229
304 2017-09-30 Saturday 15 No Rain 16 233
305 2017-09-30 Saturday 15 No Rain 17 242
306 2017-09-30 Saturday 15 No Rain 18 210
307 2017-09-30 Saturday 15 No Rain 19 188
308 2017-09-30 Saturday 15 No Rain 20 156
309 2017-09-30 Saturday 15 No Rain 21 141
310 2017-09-30 Saturday 15 No Rain 22 165
311 2017-09-30 Saturday 15 No Rain 23 140

312 rows × 6 columns

Since the same grouped data frame will be used in multiple code chunks below, this can be assigned to a new variable instead of typing out the grouping expression each time.

In [94]:
grouped_bike_counts = bike_counts_total.groupby(['date_dt', 'day_of_week', 'weather'])
grouped_bike_counts['bikes'].sum()
Out[94]:
date_dt     day_of_week  weather  
2010-09-18  Saturday      No Rain     2531
2010-09-19  Sunday        No Rain     2669
2010-09-20  Monday        No Rain     4734
2010-09-21  Tuesday       No Rain     4814
2010-09-22  Wednesday     Rain 6mm    3812
2010-09-23  Thursday      No Rain     4746
2010-09-24  Friday        No Rain     4373
2010-09-25  Saturday      No Rain     2515
2010-09-26  Sunday        No Rain     2575
2017-09-27  Wednesday     No Rain     5667
2017-09-28  Thursday      No Rain     5426
2017-09-29  Friday       Rain 4mm     3778
2017-09-30  Saturday      No Rain     2988
Name: bikes, dtype: int64

Other aggregation methods, such as the standard deviation, are called with the same syntax.

In [95]:
grouped_bike_counts['bikes'].std()
Out[95]:
date_dt     day_of_week  weather  
2010-09-18  Saturday      No Rain      65.191343
2010-09-19  Sunday        No Rain      79.652495
2010-09-20  Monday        No Rain     159.863597
2010-09-21  Tuesday       No Rain     149.924885
2010-09-22  Wednesday     Rain 6mm    110.727895
2010-09-23  Thursday      No Rain     143.297274
2010-09-24  Friday        No Rain     122.429387
2010-09-25  Saturday      No Rain      58.832836
2010-09-26  Sunday        No Rain      77.870282
2017-09-27  Wednesday     No Rain     182.103724
2017-09-28  Thursday      No Rain     170.290523
2017-09-29  Friday       Rain 4mm     137.304459
2017-09-30  Saturday      No Rain      82.280010
Name: bikes, dtype: float64

Note that it is important to remember the structure of the data to make sure that the computations you're doing make sense. For example, be careful with an operation like bike_counts_total.groupby('day_of_week')['bikes'].mean(): the result is the mean hourly count for each day of the week, not the mean total count per day.

In [96]:
bike_counts_total.groupby('day_of_week')['bikes'].mean()
Out[96]:
day_of_week
Friday       169.812500
Monday       197.250000
Saturday     111.583333
Sunday       109.250000
Thursday     211.916667
Tuesday      200.583333
Wednesday    197.479167
Name: bikes, dtype: float64
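
If the goal is the mean total count per day of the week, one approach (sketched here) is to first sum the counts for each date and then average those daily totals:

# Total bikes per calendar date, keeping day_of_week for the second grouping
daily_totals = (bike_counts_total
                    .groupby(['date_dt', 'day_of_week'])['bikes']
                    .sum()
                    .reset_index())
daily_totals.groupby('day_of_week')['bikes'].mean()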

Instead of using the sum() method, the more general agg() method could be called to aggregate (or summarize) by any existing aggregation functions. The equivalent to the sum() method would be to call agg() and specify 'sum'.

In [97]:
grouped_bike_counts['bikes'].agg('sum')
Out[97]:
date_dt     day_of_week  weather  
2010-09-18  Saturday      No Rain     2531
2010-09-19  Sunday        No Rain     2669
2010-09-20  Monday        No Rain     4734
2010-09-21  Tuesday       No Rain     4814
2010-09-22  Wednesday     Rain 6mm    3812
2010-09-23  Thursday      No Rain     4746
2010-09-24  Friday        No Rain     4373
2010-09-25  Saturday      No Rain     2515
2010-09-26  Sunday        No Rain     2575
2017-09-27  Wednesday     No Rain     5667
2017-09-28  Thursday      No Rain     5426
2017-09-29  Friday       Rain 4mm     3778
2017-09-30  Saturday      No Rain     2988
Name: bikes, dtype: int64

This general approach is more flexible and powerful, since multiple aggregation functions can be applied in the same line of code by passing them as a list to agg(). For instance, the maximum and the sum can be computed in the same call.

In [98]:
grouped_bike_counts['bikes'].agg(['max', 'sum'])
Out[98]:
max sum
date_dt day_of_week weather
2010-09-18 Saturday No Rain 211 2531
2010-09-19 Sunday No Rain 235 2669
2010-09-20 Monday No Rain 518 4734
2010-09-21 Tuesday No Rain 525 4814
2010-09-22 Wednesday Rain 6mm 383 3812
2010-09-23 Thursday No Rain 493 4746
2010-09-24 Friday No Rain 379 4373
2010-09-25 Saturday No Rain 186 2515
2010-09-26 Sunday No Rain 236 2575
2017-09-27 Wednesday No Rain 611 5667
2017-09-28 Thursday No Rain 562 5426
2017-09-29 Friday Rain 4mm 466 3778
2017-09-30 Saturday No Rain 242 2988

The returned output is in this case a data frame and the MultiIndex is indicated in bold font.

By passing a dictionary to .agg() it is possible to apply different aggregations to the different columns. Long code statements can be broken down into multiple lines if they are enclosed by parentheses, brackets or braces, something that will be described in detail later.

In [99]:
grouped_bike_counts[['temperature', 'bikes']].agg(
    {'temperature': 'mean',
     'bikes': ['min', 'max', 'sum']
    }
)
Out[99]:
temperature bikes
mean min max sum
date_dt day_of_week weather
2010-09-18 Saturday No Rain 21 9 211 2531
2010-09-19 Sunday No Rain 20 7 235 2669
2010-09-20 Monday No Rain 19 4 518 4734
2010-09-21 Tuesday No Rain 25 4 525 4814
2010-09-22 Wednesday Rain 6mm 23 4 383 3812
2010-09-23 Thursday No Rain 19 4 493 4746
2010-09-24 Friday No Rain 30 6 379 4373
2010-09-25 Saturday No Rain 18 11 186 2515
2010-09-26 Sunday No Rain 17 9 236 2575
2017-09-27 Wednesday No Rain 30 6 611 5667
2017-09-28 Thursday No Rain 19 11 562 5426
2017-09-29 Friday Rain 4mm 18 8 466 3778
2017-09-30 Saturday No Rain 15 6 242 2988

There are plenty of aggregation methods available in pandas (e.g. sem, mad, sum), all of which can be found using tab completion on the grouped data frame.

In [100]:
# Tab completion might only work like this:
# find_agg_methods = grouped_bike_counts['temperature']
# find_agg_methods.<tab>

Even if a function is not part of the pandas library, it can be passed to agg().

In [101]:
import numpy as np

grouped_bike_counts['temperature'].agg(np.mean)
Out[101]:
date_dt     day_of_week  weather  
2010-09-18  Saturday      No Rain     21
2010-09-19  Sunday        No Rain     20
2010-09-20  Monday        No Rain     19
2010-09-21  Tuesday       No Rain     25
2010-09-22  Wednesday     Rain 6mm    23
2010-09-23  Thursday      No Rain     19
2010-09-24  Friday        No Rain     30
2010-09-25  Saturday      No Rain     18
2010-09-26  Sunday        No Rain     17
2017-09-27  Wednesday     No Rain     30
2017-09-28  Thursday      No Rain     19
2017-09-29  Friday       Rain 4mm     18
2017-09-30  Saturday      No Rain     15
Name: temperature, dtype: int64

Any function can be passed like this, including user-created functions.
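
For example, a small user-defined function can compute the range (max minus min) of the counts in each group; the function name below is just illustrative:

def count_range(values):
    # Difference between the largest and smallest value in the group
    return values.max() - values.min()

grouped_bike_counts['bikes'].agg(count_range)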

Challenge

  1. Use groupby() and agg() with the bike_counts_total data frame to find the mean, min, and max bike counts per hourly time interval across all the measurement days.

  2. What was the largest bike count for each day? Return the columns date_dt, day_of_week, hour_dt, weather, and bikes. Hint: Look into the idxmax() method.
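
A sketch of one possible approach to the first part:

bike_counts_total.groupby('hour_dt')['bikes'].agg(['mean', 'min', 'max'])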

Using size() to summarize categorical data

When working with data, it is common to want to know the number of observations present for each categorical variable. For this, pandas provides the size() method. For example, to group by 'weather' and find the number of observations for each weather condition:

In [102]:
bike_counts_total.groupby('weather').size()
Out[102]:
weather
 No Rain     264
 Rain 6mm     24
Rain 4mm      24
dtype: int64

size() can also be used when grouping on multiple variables.

In [103]:
bike_counts.groupby(['temperature', 'weather']).size()
Out[103]:
temperature  weather  
15            No Rain      48
17            No Rain      48
18            No Rain      48
             Rain 4mm      48
19            No Rain     144
20            No Rain      48
21            No Rain      48
23            Rain 6mm     48
25            No Rain      48
30            No Rain      96
dtype: int64

If there are many groups, size() is not that useful on its own. For example, it is difficult to quickly find the most commonly observed temperature among the observations.

In [104]:
bike_counts_total.groupby('temperature').size()
Out[104]:
temperature
15    24
17    24
18    48
19    72
20    24
21    24
23    24
25    24
30    48
dtype: int64

It would be beneficial to sort the table values by how often each temperature was observed. This is easy to do with the sort_values() method.

In [105]:
bike_counts_total.groupby('temperature').size().sort_values()
Out[105]:
temperature
15    24
17    24
20    24
21    24
23    24
25    24
18    48
30    48
19    72
dtype: int64

That's better, but it could be helpful to display the most common temperature at the top of the list. In other words, the output should be arranged in descending order.

In [106]:
bike_counts_total.groupby('temperature').size().sort_values(ascending=False).head(5)
Out[106]:
temperature
19    72
30    48
18    48
25    24
23    24
dtype: int64

Looks good! By now, the code statement has grown quite long because many methods have been chained together. It can be tricky to keep track of what is going on in long method chains. To make the code more readable, it can be broken up into multiple lines by adding surrounding parentheses.

In [107]:
(bike_counts_total
     .groupby('temperature')
     .size()
     .sort_values(ascending=False)
     .head(5)
)
Out[107]:
temperature
19    72
30    48
18    48
25    24
23    24
dtype: int64

This looks neater and makes long method chains easier to read. There is no absolute rule for when to break code into multiple lines, but always try to write code that is easy for collaborators to understand (your most common collaborator is a future version of yourself!).

pandas actually has a convenience function for returning the top five results, so the values don't need to be sorted explicitly.

In [108]:
(bike_counts_total
     .groupby('temperature')
     .size()
     .nlargest() # the default is 5
)
Out[108]:
temperature
19    72
18    48
30    48
15    24
17    24
dtype: int64

Using nunique() to count the number of unique observations

The .size() method gives the number of observations for each unique value of a particular column or set of columns. The nunique() method is similar, but it counts just the number of unique entries. It can be used on an entire data frame or on a subset of columns.

In [109]:
bike_counts_total.nunique()
Out[109]:
date_dt         13
day_of_week      7
temperature      9
weather          3
hour_dt         24
bikes          212
dtype: int64

Challenge

  1. How many different count days are in the bike_counts_total dataset?

  2. Create a new dataframe called weather_df by grouping by date_dt and weather and calculating the size. Use the new dataframe to calculate how many of the count days had no rain.

Optional challenge: Which day of the week was counted on the most separate occasions? Hint: modify weather_df to include the day_of_week column.
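
Sketches of possible approaches (note that the size column produced by reset_index() is named 0 by default):

# 1. Number of distinct count days
bike_counts_total['date_dt'].nunique()

# 2. Count days per weather condition
weather_df = bike_counts_total.groupby(['date_dt', 'weather']).size().reset_index()
weather_df.groupby('weather').size()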

Part 3 - Visualizing Data

Data visualization in matplotlib and seaborn

There are many plotting packages in Python, making it possible to create diverse visualizations such as interactive web graphics, 3D animations, statistical visualizations, and map-based plots. A Google search for "Python graph gallery" or "Seaborn graph gallery" will turn up lots of examples of the diversity of plots that can be made.

Here, we will focus on two of the most useful for researchers: matplotlib, a robust, detail-oriented, low-level plotting interface, and seaborn, which provides high-level functions on top of matplotlib and allows plotting calls to be expressed more in terms of what is being explored in the underlying data rather than which graphical elements to add to the plot.

Instead of instructing the computer to "go through a data frame and plot any observations of eastbound cyclists in blue, any observations of westbound cyclists in red, etc.", the seaborn syntax is closer to saying "color the data by direction". Thanks to this functional way of interfacing with data, only minimal changes are required if the underlying data change or if a different type of plot is needed for the visualization. It provides a language that facilitates thinking about data in ways that are conducive to exploratory analysis, and allows for the creation of publication-quality plots with minimal adjustments and tweaking.

Before the first plot is created, the line %matplotlib inline is used to specify that all plots should show up in the notebook instead of in a separate window.

In [110]:
%matplotlib inline
In [111]:
import seaborn as sns

Let's make a dataframe with just weekday counts to look at weekdays and weekends separately.

In [112]:
bike_counts_weekdays = bike_counts.loc[(bike_counts['day_of_week'] != "Saturday") & (bike_counts['day_of_week'] != "Sunday")]

The relationship between one quantitative and one categorical variable will be explored while stratifying the data based on its remaining categorical variables. To start, let's visualize summary statistics of the bicycle counts across the day with a boxplot.

In [113]:
sns.boxplot(x='hour_dt', y='bikes', data=bike_counts_weekdays)
Out[113]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff35d6089e8>

Plot parameters can be modified to change the appearance.

In [114]:
sns.boxplot(x='hour_dt', y='bikes', data=bike_counts_weekdays, width = 0.4)
Out[114]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff359effcf8>

In a boxplot, the x and y variables are the categorical groups (the hour) and the measurement of interest (the bicycle counts).

The aim of a box plot is to display a few statistics from the underlying distribution of one quantitative variable across the values of one categorical variable. The inclusion of multiple distribution statistics facilitates comparing more than just the mean + standard deviation (or another single measure of central tendency and variation). The seaborn box plots are so-called Tukey box plots by default, which means that the graphical elements correspond to the following statistics:

  • The lines of the box represent the 25th, 50th (median), and 75th percentiles of the data. These divide the data into four quartiles (0-25, 25-50, 50-75, 75-100).
  • The whiskers extend to the most extreme observations that lie within 1.5 * the interquartile range (the distance between the 25th and 75th percentiles) of the box.
  • The flyers mark all individual observations that fall outside the whiskers, which could be referred to as "outliers" (although there are many definitions of what constitutes an outlier).

Most of these plot elements are configurable and could be set to represent different distribution statistics.

Why would the distribution be so broad around 7-9 am and 4-6 pm? A boxplot can hide some features of the underlying data, such as whether it is bimodal. To delve deeper, we can start adding more categorical variables to the plot.

A very effective approach for exploring multiple categorical variables in a data set is to plot so-called "small multiples" of the data, where the same type of plot is used for different subsets of the data. These plots are drawn in rows and columns forming a grid pattern, and can be referred to as a "lattice", "facet", or "trellis" plot.

Visualizing categorical variables in this manner is a key step in exploratory data analysis, so seaborn has a dedicated plot function for it, called catplot() (short for "categorical plot"). catplot() can be used to create the same box plot as before and easily spread the variables across the rows and columns of a grid, e.g. for the variable direction. Note: if you have an older version of seaborn, replace catplot with factorplot.

In [115]:
sns.catplot(x='hour_dt', y='bikes', data = bike_counts_weekdays, col = 'direction', kind = 'box')
Out[115]:
<seaborn.axisgrid.FacetGrid at 0x7ff359eff8d0>

By splitting the data by the direction of travel, we can see that the broadness of the distribution during those two time periods was actually concealing two peaks, split by direction: the eastbound count is much higher during the morning rush hour, and the westbound count during the afternoon. People are heading into downtown in the morning and leaving it in the afternoon.

We can achieve a similar effect by colouring the data by direction instead of splitting it into two plots.

In [116]:
sns.boxplot(x='hour_dt', y='bikes', data=bike_counts_weekdays, hue = 'direction')
Out[116]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff35947fb70>

The text is a little small, which can be changed with the set_context() function from seaborn, using a number above 1 for the font_scale parameter. The context parameter changes the size of objects in the plots, such as the line widths, and will be left at the default 'notebook' for now.

These option changes will apply to all plots made from now on. Think of it as changing a value in the options menu of graphical software.

In [117]:
sns.set_context(context='notebook', font_scale=1.2)

seaborn allows you to choose which data you will explore and easily change things like the plot type and other plot characteristics. Above, all we had to do was add the hue option. We could change the boxplot to a violinplot simply by changing the function call.

In [118]:
sns.violinplot(x='hour_dt', y='bikes', data=bike_counts_weekdays)
Out[118]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff358e64fd0>

This doesn't look very good because the width of each violin is scaled so that they each cover the same area. We can change how the violin width is calculated using the scale parameter. Let's scale the width by the number of observations in each group.

In [119]:
sns.violinplot(x='hour_dt', y='bikes', data=bike_counts_weekdays, scale = "count")
Out[119]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff358c19630>

Or we could plot only a few of the hours by subsetting the data frame:

In [120]:
sns.violinplot(x='hour_dt', y='bikes', 
               data=bike_counts_weekdays[(bike_counts_weekdays['hour_dt'] == 7) | 
                                         (bike_counts_weekdays['hour_dt'] == 8) | 
                                         (bike_counts_weekdays['hour_dt'] == 9)])
Out[120]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff3589e4b70>

If we wanted to include the afternoon rush hour as well, we could keep adding lines to the subsetting code above. But there's an easier way! pandas has a special isin() method for comparing a data frame column to an array-like object of values, so we can use isin() to get all the counts for a list of hours we define.

In [121]:
rush_hours = [7, 8, 9, 10, 16, 17, 18, 19]
bike_counts_weekdays_rush = bike_counts_weekdays.loc[bike_counts_weekdays['hour_dt'].isin(rush_hours)]
In [122]:
sns.violinplot(x='hour_dt', y='bikes', 
               data=bike_counts_weekdays_rush)
Out[122]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff3588fe7b8>

The colours can be changed with the palette keyword.

In [123]:
sns.violinplot(x='hour_dt', y='bikes', 
               data=bike_counts_weekdays_rush, palette = "Blues")
Out[123]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff358805c18>

We can also split the colours by direction on this smaller dataset.

In [124]:
sns.violinplot(x='hour_dt', y='bikes', 
               data=bike_counts_weekdays_rush,
              hue = 'direction',
              palette = "Blues")
Out[124]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff358718fd0>

It would be nice to change the x and y-axis labels to something a little prettier. To do this, assign the plotting command to an object, and then use that axes object to change more plot elements.

In [125]:
g = sns.violinplot(x='hour_dt', y='bikes', 
               data=bike_counts_weekdays_rush,
              hue = 'direction',
              palette = "Blues")
g.set_xlabel("Time (h)")
g.set_ylabel("Bicycle counts")
Out[125]:
<matplotlib.text.Text at 0x7ff358583080>

Challenge

  1. Create a violin plot of bikes vs. day_of_week using the bike_counts dataframe. Colour by direction.
  2. (Optional) Change the plot orientation so that day_of_week is on the y-axis and bikes are on the x-axis.
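
A sketch of one possible solution (for the optional part, swap the x and y arguments):

sns.violinplot(x='day_of_week', y='bikes', data=bike_counts, hue='direction')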

Let's explore a few other categorical variables. Using hue to colour data by different categorical variables works well if there are only a few levels, but it can get busy if there are many. For example, let's create a pointplot of counts vs. hour, coloured by the day of the week.

In [126]:
sns.pointplot(x='hour_dt', y = 'bikes', data = bike_counts_total, hue = 'day_of_week')
Out[126]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff358491400>

It's not too bad, but it'll be cleaner if we show each line on its own axes with catplot.

In [127]:
sns.catplot(x = 'hour_dt', y = 'bikes', col = 'day_of_week', col_wrap = 3, data = bike_counts_total, 
              kind = 'point')
Out[127]:
<seaborn.axisgrid.FacetGrid at 0x7ff3580c7908>

We can go further by colouring the lines by another categorical variable, "direction".

In [128]:
sns.catplot(x = 'hour_dt', y = 'bikes', col = 'day_of_week', col_wrap = 3, data = bike_counts_total, 
              kind = 'point', hue = 'direction')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-128-f4cc90e90cbb> in <module>()
      1 sns.catplot(x = 'hour_dt', y = 'bikes', col = 'day_of_week', col_wrap = 3, data = bike_counts_total, 
----> 2               kind = 'point', hue = 'direction')

~/anaconda2/envs/py3/lib/python3.6/site-packages/seaborn/categorical.py in catplot(x, y, hue, data, row, col, col_wrap, estimator, ci, n_boot, units, order, hue_order, row_order, col_order, kind, height, aspect, orient, color, palette, legend, legend_out, sharex, sharey, margin_titles, facet_kws, **kwargs)
   3714     # facets to ensure representation of all data in the final plot
   3715     p = _CategoricalPlotter()
-> 3716     p.establish_variables(x_, y_, hue, data, orient, order, hue_order)
   3717     order = p.group_names
   3718     hue_order = p.hue_names

~/anaconda2/envs/py3/lib/python3.6/site-packages/seaborn/categorical.py in establish_variables(self, x, y, hue, data, orient, order, hue_order, units)
    153                 if isinstance(input, string_types):
    154                     err = "Could not interpret input '{}'".format(input)
--> 155                     raise ValueError(err)
    156 
    157             # Figure out the plotting orientation

ValueError: Could not interpret input 'direction'

Oops - there's no 'direction' column in the bike_counts_total dataframe. We need to go back to the original!

In [129]:
sns.catplot(x = 'hour_dt', y = 'bikes', col = 'day_of_week', col_wrap = 3, data = bike_counts, 
              kind = 'point', hue = 'direction')
Out[129]:
<seaborn.axisgrid.FacetGrid at 0x7ff3574790b8>