Women in Data Science - ATX Meetup¶

Ch. 2: A Crash Course in Python¶

Notes on "Data Science from Scratch" by Joel Grus

Download the Jupyter notebook for these slides here

The Basics¶

Getting Python
The Zen of Python
Whitespace Formatting
Modules
Arithmetic
Functions
Strings
Exceptions
Lists
Tuples
Dictionaries
- defaultdict
- Counter
Sets
Control Flow
Truthiness

Getting Python¶

We recommend installing Anaconda
Otherwise, you can install the following to get started:
- Python
- Python package manager: pip
- Nicer Python shell: IPython
- Web app for interactive data sci: Jupyter Notebook
  - pip install jupyter (Python 2)
  - pip3 install jupyter (Python 3)
- etc...

The Zen of Python¶

Python design principles: The Zen of Python (PEP 20)
Python code readability: Style Guide for Python Code (PEP 8)

Whitespace Formatting¶

In Python, whitespace is important. Other languages, like C++, use curly braces {} to delimit blocks of code:

for (int i = 0; i < 5; i++) { cout << i << endl; }

In [1]:

# Python uses indentation to delimit blocks of code.
for i in range(5):
    print i

Modules¶

Use import to load modules (= libraries = packages); this applies to standard library modules as well as to third-party packages that aren't in the Python 2.7 standard library
A huge number of extra packages can be installed with pip or conda (if they aren't installed already) and then imported:
- PyPI (Python Package Index)
- Anaconda 2.5.0 Package List
  - Also, check out the "Packages available in Anaconda" and "Managing Packages in Anaconda" docs

Let's try importing the Matplotlib pyplot module, which "provides a MATLAB-like plotting framework."

Note:

Sometimes you'll want to use IPython magic commands in your Jupyter notebooks.
Try not to do the following since you may inadvertently overwrite variables you've already defined: from matplotlib.pyplot import *

In [2]:

# Tip: use IPython magic command "%matplotlib inline" to display plots in notebook
%matplotlib inline 
import matplotlib.pyplot # import module and reference by its super long name

matplotlib.pyplot.plot([1,2,3], [1,2,3]) # must type the whole name to access its functions, etc.

Out[2]:

[<matplotlib.lines.Line2D at 0x10e551e10>]

In [3]:

import matplotlib.pyplot as plt # OR, alias it as a shorter, more fun-to-type name

plt.plot([1,2,3], [1,2,3])

Out[3]:

[<matplotlib.lines.Line2D at 0x10eb03090>]

Arithmetic¶

Python 2.7 (like Fortran) uses integer division by default when both operands are integers.

In [4]:

print 5 / 2

To force floating point division, make at least one operand a float:

In [5]:

print type(5), type(2.0)
print 5 / 2.0

<type 'int'> <type 'float'>
2.5

Or, make floating point division default:

In [6]:

from __future__ import division
print 5 / 2   # now floating point division is default
print 5 // 2  # use double slash for integer division

2.5
2
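
Because the result of `/` depends on the Python version and on whether `from __future__ import division` is active, it can help to spell out the intended operation. The sketch below is not part of the original notebook; the variable names are illustrative. It relies only on `//`, `%`, `divmod`, and an explicit float operand, which behave the same way in Python 2.7 and Python 3.

```python
# Floor division and remainder behave the same in Python 2.7 and Python 3.
quotient = 7 // 2            # 3
remainder = 7 % 2            # 1
both = divmod(7, 2)          # (3, 1)

# An explicit float operand forces floating point division in either version.
true_quotient = 7 / 2.0      # 3.5

# Floor division rounds toward negative infinity, not toward zero.
negative_quotient = -7 // 2  # -4

print(quotient)
print(true_quotient)
print(negative_quotient)
```

Writing `//` whenever an integer result is wanted keeps the code correct even if it is later run under Python 3.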

Functions¶

Define functions with def

In [7]:

def double(x):
    """This is where you put an optional docstring that 
    explains what the function does.
    For example, this function multiplies its input by 2."""
    return x * 2

In [8]:

def apply_to_one(f):
    """Calls the function f with 1 as its argument"""
    return f(1)

Python functions are first-class

In [9]:

my_double = double          # refers to the previously defined function
x = apply_to_one(my_double)
print x

`lambda` functions¶

These are short, anonymous functions

In [10]:

y = apply_to_one(lambda x: x + 4)
print y

In [11]:

# but use `def` instead of assigning a lambda function to a variable
another_double = lambda x: 2 * x     # don't do this
def another_double(x): return 2 * x  # do this instead

default function arguments¶

Define a default value for an argument; pass the argument only when you want a different value

In [12]:

def my_print(message="my default message"):
    print message
    
my_print("hello")

hello

In [13]:

my_print()

my default message
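
Default values combine naturally with keyword arguments, which let the caller name the argument being passed. The following is a minimal sketch with a hypothetical `subtract` function; it uses `print()` with a single argument so that it runs on both Python 2.7 and Python 3.

```python
def subtract(a=0, b=0):
    """Return a - b; both arguments default to 0."""
    return a - b

positional = subtract(10, 5)   # a=10, b=5
only_b = subtract(b=5)         # a keeps its default of 0
no_args = subtract()           # both defaults are used

print(positional)  # 5
print(only_b)      # -5
print(no_args)     # 0
```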

Strings¶

In [14]:

# Delimit strings with single OR double quotation marks
single_quoted_string = 'data science'
double_quoted_string = "data science"
print single_quoted_string, "is the same as", double_quoted_string

data science is the same as data science

In [15]:

# Use backslashes to encode special characters
tab_string = "\t"  # represents the tab character
len(tab_string)    # string length is 1 (not 2)

Out[15]:

In [16]:

# Use raw strings to represent backslashes
not_tab_string = r"\t"  # represents the characters '\' and 't'
len(not_tab_string)     # now length is 2

Out[16]:

In [17]:

# Create multiline strings using triple-quotes
multi_line_string = """This is the first line.
and this is the second line
and this is the third line"""
print multi_line_string

This is the first line.
and this is the second line
and this is the third line

Exceptions¶

In [18]:

# This makes your code crash
print 0 / 0

---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-18-ccf175f7fa0e> in <module>()
      1 # This makes your code crash
----> 2 print 0 / 0

ZeroDivisionError: division by zero

Here's a list of built-in exceptions

In [19]:

# This will handle the exception by printing an error message
try:
    print 0 / 0
except ZeroDivisionError:
    print "Cannot divide by zero"

Cannot divide by zero
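
The same try/except pattern can be wrapped in a function so that callers get a fallback value instead of a crash. This is only a sketch: `safe_divide` and its `default` parameter are hypothetical names, not something defined in the notebook.

```python
def safe_divide(numerator, denominator, default=None):
    """Return numerator / denominator, or `default` when dividing by zero."""
    try:
        return numerator / denominator
    except ZeroDivisionError:
        # Only the error we expect is caught; anything else still propagates.
        return default

print(safe_divide(6, 3))
print(safe_divide(1, 0))             # None
print(safe_divide(1, 0, default=0))  # 0
```

Catching the specific exception class, rather than using a bare `except`, avoids hiding unrelated bugs.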

Lists¶

Ordered collections
Similar to an array in other languages, but holds heterogeneous data (e.g., floats + ints + strings)
Note: Use NumPy arrays (here's a tutorial) for larger amounts of homogeneous data (e.g., just floats)
Specify with brackets []

In [20]:

integer_list = [1, 2, 3]
heterogeneous_list = ["string", 0.1, True]
list_of_lists = [ integer_list, heterogeneous_list, [] ]

print len(integer_list)  # get the length of a list
print sum(integer_list)  # get the sum of the elements in a list (if addition is defined for those elements)

3
6

In [21]:

# Use square brackets to get the nth element of a list
x = range(10)
print x
print x[0]
print x[1]
print x[-1]
print x[-2]

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
0
1
9
8

In [22]:

# Use square brackets to "slice" lists
print x[:3]   # up to but not including 3
print x[3:]   # 3 and up
print x[1:4]  # 1 up to but not including 4
print x[-3:]  # last 3
print x[1:-1] # without the first and last elements (0 and 9)
print x[:]    # all elements of the list

[0, 1, 2]
[3, 4, 5, 6, 7, 8, 9]
[1, 2, 3]
[7, 8, 9]
[1, 2, 3, 4, 5, 6, 7, 8]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [23]:

# Use the `in` operator to check for list membership; use only for small lists or if run time is not a concern
print 1 in [1, 2, 3]
print 0 in [1, 2, 3]

True
False

In [24]:

# Concatenate lists like this
x = [1, 2, 3]

y = x + [4, 5, 6]   # creates a new list leaving "x" unchanged
print x
print y, "\n"

x.extend([4, 5, 6]) # changes "x"
print x

[1, 2, 3]
[1, 2, 3, 4, 5, 6] 

[1, 2, 3, 4, 5, 6]

In [25]:

# Append to lists like this
x = [1, 2, 3]
print x, "\n"

x.append(0)
print x

[1, 2, 3] 

[1, 2, 3, 0]

In [26]:

# It's convenient (and common) to unpack lists like this
z = [1, 2]
x, y = z
print type(x), "x =", x 
print type(y), "y =", y 
print type(z), "z =", z

<type 'int'> x = 1
<type 'int'> y = 2
<type 'list'> z = [1, 2]

In [27]:

# It's also common to use an underscore for a value you're going to throw away
z = [3, 4]
_, y = z
print x
print y
print z

1
4
[3, 4]

Tuples¶

Similar to list, but immutable (can't be modified)
Specify with parentheses () or with nothing but commas

In [28]:

my_list = [1, 2]
my_tuple = (1, 2)
other_tuple = 3, 4
my_list[1] = 3

print my_list

[1, 3]

In [29]:

try:
    my_tuple[1] = 3
except TypeError:
    print "Cannot modify a tuple"

Cannot modify a tuple

In [30]:

# Use tuples to return multiple values from functions
def sum_and_product(x, y):
    return (x + y), (x * y)

sp = sum_and_product(2, 3)
s, p = sum_and_product(5, 10)
print sp
print s 
print p

(5, 6)
15
50

In [31]:

# Use tuples (and lists) for multiple assignments
x, y = 1, 2
print "x =", x
print "y =", y

x = 1
y = 2

In [32]:

x, y = y, x  # Pythonic way to swap variables
print "x =", x
print "y =", y

x = 2
y = 1

Dictionaries¶

Dictionaries associate values with keys
Allow quick retrieval of value given a key
Specify with curly braces {} or dict()

In [33]:

empty_dict = {}                    # Pythonic
empty_dict2 = dict()               # less Pythonic
grades = { "Joel": 80, "Tim": 95}

# Use square brackets to look up value(s) for a key
print grades["Joel"]

In [34]:

# KeyError exception raised if key not found
try: 
    kates_grade = grades["Kate"]
except KeyError:
    print "No grade for Kate!"

No grade for Kate!

In [35]:

# Use `in` to check for existence of a key
joel_has_grade = "Joel" in grades
kate_has_grade = "Kate" in grades

print joel_has_grade
print kate_has_grade

True
False

In [36]:

# Use `get` method of dictionaries when you want to return a default value (rather than raise exception)
print grades.get("Joel", 0)
print grades.get("Kate", 0)
print grades.get("No One")   # the default default is None

80
0
None

In [37]:

# Use square brackets to assign key-value pairs, e.g., dict_name[key] = value
grades["Tim"] = 99
grades["Kate"] = 100

print "Number of students:", len(grades)

Number of students: 3

In [38]:

# Use dictionaries to represent structured data, such as in a tweet
tweet = {
    "user": "joelgrus",
    "text": "Data Science is Awesome",
    "retweet_count": 100,
    "hashtags": ["#data", "#science", "#datascience", "#awesome"]
}

In [39]:

print tweet.keys()    # list of keys
print tweet.values()  # list of values
print tweet.items()   # list of (key, value) tuples

['text', 'retweet_count', 'hashtags', 'user']
['Data Science is Awesome', 100, ['#data', '#science', '#datascience', '#awesome'], 'joelgrus']
[('text', 'Data Science is Awesome'), ('retweet_count', 100), ('hashtags', ['#data', '#science', '#datascience', '#awesome']), ('user', 'joelgrus')]

In [40]:

print "user" in tweet.keys()       # list `in` is slow
print "user" in tweet              # dict `in` is fast (and more Pythonic)
print "joelgrus" in tweet.values()

True
True
True

You cannot use lists as keys, because dictionary keys must be hashable and lists are mutable. If you need a multipart key, then:

- use a tuple, or
- represent the key as a string
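
A short sketch of the three cases follows. The `visits` dictionary and its grid-coordinate keys are made up for illustration.

```python
# Hypothetical example: grid coordinates as dictionary keys.
visits = {}
visits[(0, 1)] = 3       # a tuple is hashable, so it works as a key
visits["0,1"] = 3        # a string representation of the key also works

try:
    visits[[0, 1]] = 3   # a list is mutable and therefore unhashable
except TypeError as error:
    list_key_error = str(error)

print(visits[(0, 1)])    # 3
print(list_key_error)
```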

Let's grab some data to use in the upcoming examples on `defaultdict`¶

Clone the UW Intro to Data Science course materials repository:

git clone https://github.com/uwescience/datasci_course_materials.git

Look in datasci_course_materials/assignment1/ for a file named three_minutes_tweets.json

In [41]:

# Load the data into a list called `tweets`
import json
# Substitute the local path to your three_minutes_tweets.json file within the quotes
tweet_file = open("/Users/mepa/repos/datasci_course_materials/assignment1/three_minutes_tweets.json")

tweets = []
for line in tweet_file:
    tweets.append(json.loads(line)) # each list element contains all data pertaining to a single tweet
print "Total # of tweets:", len(tweets)
print "type(tweets[0]):", type(tweets[0])

Total # of tweets: 8299
type(tweets[0]): <type 'dict'>

For tweets that contain a "text" key we can print someone's tweet:

In [42]:

print tweets[7]['text']  # text of 8th tweet

@1voodoochild thanks for the follow💯

In [43]:

print tweets[3]['text']  # text of 4th tweet

إنّ العرب إذا تغلبوا على أوطان أسرع إليها الخراب والسبب في ذلك أنها أمة وحشية بإستحكام عوائد التوحش وأسبابه فصار لهم خلقة وجبلة

- ابن خلدون

Aside: The text is attributed to Ibn Khaldun who lived 1332 - 1406. Google Translate gives the following:

"The Arabs overcame their homelands if faster to ruin and why they are a nation and brutal savagery Bastgam returns and causes them became habitus and protoplasm"

In [44]:

print tweets[54]['text']

@morgancollins42 I don't like you

In [45]:

print tweets[7000]['text']

The Newf opened up and ate a whole bar of Ivory soap. Some things about the canine diet I will never understand.

In [47]:

# Print all key-value pairs in the 54th tweet
for key in tweets[53].keys():
    print key, ": \t", tweets[53][key]

contributors : 	None
truncated : 	False
text : 	I'm starving I can't wait to leave work and eat Wing Barn 😋🍗
in_reply_to_status_id : 	None
id : 	633030783796514816
favorite_count : 	0
source : 	<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
retweeted : 	False
coordinates : 	None
timestamp_ms : 	1439761274657
entities : 	{u'user_mentions': [], u'symbols': [], u'trends': [], u'hashtags': [], u'urls': []}
in_reply_to_screen_name : 	None
id_str : 	633030783796514816
retweet_count : 	0
in_reply_to_user_id : 	None
favorited : 	False
user : 	{u'follow_request_sent': None, u'profile_use_background_image': True, u'default_profile_image': False, u'id': 509749471, u'verified': False, u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/603406649433268224/wfMu2EpQ_normal.jpg', u'profile_sidebar_fill_color': u'DDEEF6', u'profile_text_color': u'333333', u'followers_count': 149, u'profile_sidebar_border_color': u'C0DEED', u'id_str': u'509749471', u'profile_background_color': u'C0DEED', u'listed_count': 2, u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme1/bg.png', u'utc_offset': None, u'statuses_count': 6211, u'description': u'21 years old, Aquarius \u2652\ufe0f, Instagram _BENNYAGUIRRE, taken', u'friends_count': 206, u'location': u'harlingen, tx', u'profile_link_color': u'0084B4', u'profile_image_url': u'http://pbs.twimg.com/profile_images/603406649433268224/wfMu2EpQ_normal.jpg', u'following': None, u'geo_enabled': False, u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/509749471/1423998507', u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme1/bg.png', u'name': u'Ben\u270c', u'lang': u'en', u'profile_background_tile': False, u'favourites_count': 3447, u'screen_name': u'BenitoAguirre1', u'notifications': None, u'url': None, u'created_at': u'Thu Mar 01 05:16:54 +0000 2012', u'contributors_enabled': False, u'time_zone': None, u'protected': False, u'default_profile': True, u'is_translator': False}
geo : 	None
in_reply_to_user_id_str : 	None
possibly_sensitive : 	False
lang : 	en
created_at : 	Sun Aug 16 21:41:14 +0000 2015
filter_level : 	low
in_reply_to_status_id_str : 	None
place : 	None

`defaultdict`¶

Like a regular dictionary, except that when you try to look up a key that is not present in the dictionary, it first adds a value for it using a no-argument function that you provide when creating it.
Imagine creating a dictionary to count the words in a document. Here are 3 approaches for doing that with a regular dictionary:

In [ ]:

# Approach 1 - if/else statement
word_counts = {}
for word in document:
    if word in word_counts:
        word_counts[word] += 1
    else:
        word_counts[word] = 1
        
# Approach 2 - handle the exception
word_counts = {}
for word in document:
    try:
        word_counts[word] += 1
    except KeyError:
        word_counts[word] = 1
        
# Approach 3 - use `get`
word_counts = {}
for word in document:
    previous_count = word_counts.get(word, 0)
    word_counts[word] = previous_count + 1

Instead we can use defaultdict (needs to be imported from collections):

In [ ]:

from collections import defaultdict

word_counts = defaultdict(int)  # int() produces 0
for word in document:
    word_counts[word] += 1
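
The cells above assume that a list of words named `document` already exists. As a self-contained sketch, the version below uses a made-up word list and checks that the `get` approach and `defaultdict` agree. It also shows that merely looking up a missing key in a `defaultdict` inserts it.

```python
from collections import defaultdict

# Made-up stand-in for `document`.
document = ["data", "science", "is", "data", "plus", "science", "plus", "data"]

# Approach 3 from above, using a regular dictionary.
counts_with_get = {}
for word in document:
    counts_with_get[word] = counts_with_get.get(word, 0) + 1

# The defaultdict version.
counts_with_defaultdict = defaultdict(int)
for word in document:
    counts_with_defaultdict[word] += 1

same_counts = dict(counts_with_defaultdict) == counts_with_get
print(same_counts)                            # True
print(counts_with_defaultdict["data"])        # 3

# Side effect to be aware of: a lookup adds the missing key with value 0.
present_before = "python" in counts_with_defaultdict
looked_up = counts_with_defaultdict["python"]
present_after = "python" in counts_with_defaultdict
print(present_before)   # False
print(present_after)    # True
```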

In [73]:

# Let's implement the example above, but substitute a tweet for "document"
from collections import defaultdict

tweet_text = tweets[2736]['text'].encode('utf-8') # encodes the unicode text as one long UTF-8 byte string
tweet_words = tweet_text.split()                  # split() returns a list of words

word_counts = defaultdict(int)
for word in tweet_words:
    word_counts[word] += 1
        
print tweet_text, "\n"
print word_counts

RT @HCeretto: I don't love you I'm just passing the time
You could love me if I knew how to lie
But who could love me? I am out of my mind 

defaultdict(<type 'int'>, {'love': 3, 'just': 1, "don't": 1, 'am': 1, 'me?': 1, 'You': 1, 'if': 1, 'RT': 1, 'lie': 1, 'how': 1, 'But': 1, 'to': 1, 'you': 1, 'out': 1, 'knew': 1, 'I': 3, 'mind': 1, 'who': 1, "I'm": 1, 'passing': 1, 'me': 1, '@HCeretto:': 1, 'of': 1, 'could': 2, 'time': 1, 'the': 1, 'my': 1})

Optional Exercises¶

Read over the UW Intro to Data Science - Assignment 1 description.

Skip

Problem 1: Get the Twitter Data

and just use "three_minutes_tweets.json" for the following problems:

Problem 2: Derive the sentiment of each tweet
Problem 3: Derive the sentiment of new terms
Problem 4: Compute term frequency

Can skip the last two problems:

Problem 5: Which state is happiest?
Problem 6: Top ten hash tags

defaultdict can also be useful with list or dict or your own functions:

In [49]:

dd_list = defaultdict(list)          # list() produces an empty list
dd_list[2].append(1) 
print dd_list, "\n"

dd_dict = defaultdict(dict)          # dict() produces an empty dict
dd_dict["Joel"]["City"] = "Seattle"
print dd_dict, "\n"

dd_pair = defaultdict(lambda: [0,0]) # use a lambda function
dd_pair[2][1] = 1
print dd_pair

defaultdict(<type 'list'>, {2: [1]}) 

defaultdict(<type 'dict'>, {'Joel': {'City': 'Seattle'}}) 

defaultdict(<function <lambda> at 0x11c4e5410>, {2: [0, 1]})

`Counter`¶

Instead of using any of the approaches above to compute word counts, we could have used Counter from the collections module. A Counter turns a sequence of values into a defaultdict(int)-like object that maps each key to its count. This gives a very simple way to solve our word-count problem.
A Counter instance has a most_common method to find most common keys and their counts.

In [53]:

from collections import Counter
c = Counter([0, 1, 2, 0])
print c

Counter({0: 2, 1: 1, 2: 1})

In [ ]:

word_counts = Counter(document)

# Print the 10 most common words and their counts
for word, count in word_counts.most_common(10):
    print word, count
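
Since `document` is not defined in the notebook, here is a self-contained sketch with a made-up word list. It also illustrates one difference from `defaultdict`: looking up a missing key in a `Counter` returns 0 without adding the key.

```python
from collections import Counter

# Made-up stand-in for `document`.
document = ["data", "science", "is", "data", "plus", "science", "data"]
word_counts = Counter(document)

top_two = word_counts.most_common(2)
print(top_two)                      # [('data', 3), ('science', 2)]

missing_count = word_counts["python"]
print(missing_count)                # 0, and no KeyError
print("python" in word_counts)      # False, the lookup did not insert the key
```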

In [54]:

print tweet_text, "\n"

word_counts = Counter(tweet_words)

# Print the 5 most common words and their counts
for word, count in word_counts.most_common(5):
    print word, count

RT @HCeretto: I don't love you I'm just passing the time
You could love me if I knew how to lie
But who could love me? I am out of my mind 

love 3
I 3
could 2
just 1
don't 1

Sets¶

Represents a collection of distinct elements
Used for two main reasons:
1. the in operation is much faster on sets than on lists, for example
2. sometimes we want to find distinct items in a list

In [55]:

s = set()
s.add(1)
print "s is", s
s.add(2)
print "s is now", s
s.add(2)
print "s is still", s
print "There are", len(s), "elements in s."
print 2 in s
print 3 in s

s is set([1])
s is now set([1, 2])
s is still set([1, 2])
There are 2 elements in s.
True
False

In [62]:

# Let's use the tweets we read in earlier from "three_minutes_tweets.json"
all_tweets_words = []
for tweet in tweets:
    if 'text' in tweet: # since not all "tweets" contain text
        tweet_text = tweet['text'].encode('utf-8')
        tweet_words_list = tweet_text.split()
        all_tweets_words.extend(tweet_words_list)
print len(all_tweets_words)

In [63]:

%time "love" in all_tweets_words # another IPython magic command!

all_tweets_words_set = set(all_tweets_words)
%time "love" in all_tweets_words_set

CPU times: user 49 µs, sys: 11 µs, total: 60 µs
Wall time: 62 µs
CPU times: user 5 µs, sys: 2 µs, total: 7 µs
Wall time: 7.87 µs

Out[63]:

True
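
`%time` only works inside IPython or Jupyter. In a plain Python script, the standard library's `timeit` module can make a similar comparison. The sketch below uses made-up data (100,000 distinct strings) rather than the tweet words, and looks up a value that is absent, which is the worst case for a list. The exact timings depend on the machine.

```python
import timeit

words = [str(n) for n in range(100000)]  # made-up data, not the tweet words
words_set = set(words)
target = "not present"                   # absent, so the list scans every element

# Time 100 membership tests against each container.
list_seconds = timeit.timeit(lambda: target in words, number=100)
set_seconds = timeit.timeit(lambda: target in words_set, number=100)

print("list: %.6f s for 100 lookups" % list_seconds)
print("set:  %.6f s for 100 lookups" % set_seconds)
```

The list lookup checks elements one by one, while the set lookup hashes the value, so the gap grows with the size of the collection.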

In [64]:

item_list = [1, 2, 3, 1, 2, 3]
print len(item_list)
print set(item_list)
print len(set(item_list))
print len(list(set(item_list)))

6
set([1, 2, 3])
3
3

Control Flow¶

Conditional statements: if-elif-else¶

In [65]:

x = - 4

if x < 0:
    x = 0
    print 'Negative changed to zero'
elif x == 0:
    print 'Zero'
elif x == 1:
    print 'Single'
else:
    print 'More'

Negative changed to zero

In [66]:

# Can also write if-else on one line as a conditional expression
x = 5
parity = "even" if x % 2 == 0 else "odd" 
print parity

odd

While loops¶

Repeatedly executes a target statement as long as a given condition is true.

In [67]:

count = 0
while count < 9:
    print 'The count is:', count
    count = count + 1

print "Good bye!"

The count is: 0
The count is: 1
The count is: 2
The count is: 3
The count is: 4
The count is: 5
The count is: 6
The count is: 7
The count is: 8
Good bye!

For loops¶

Iterates over the items of any sequence, such as a list or a string.

In [68]:

animal_kingdom = ['dogs', 'cats', 'elephant', 'tiger', 'lion']

for animal in animal_kingdom:
    print animal, len(animal)

dogs 4
cats 4
elephant 8
tiger 5
lion 4

`continue` and `break`¶

In [69]:

for x in range(10):
    if x == 3:
        continue   # go immediately to the next iteration
    if x == 5:
        break      # quit the loop entirely
    print x
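
To see which values survive the loop above, this sketch collects them in a list instead of printing them one by one. The `visited` name is illustrative.

```python
visited = []
for x in range(10):
    if x == 3:
        continue   # skip 3 and move on to 4
    if x == 5:
        break      # stop before 5 is recorded; 6 through 9 are never reached
    visited.append(x)

print(visited)  # [0, 1, 2, 4]
```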

Truthiness¶

Booleans work in Python as in most other languages, except that True and False are capitalized
Python uses the value None to indicate a non-existent value. It is similar to null in other languages.

In [70]:

one_is_less_than_two = 1 < 2
print 'one_is_less_than_two:', one_is_less_than_two

true_equals_false = True == False
print 'true_equals_false:', true_equals_false

x = None
print 'x_equals_none:', x == None # non-Pythonic way
print 'x is none?', x is None     # Pythonic way

one_is_less_than_two: True
true_equals_false: False
x_equals_none: True
x is none? True

Python lets you use any value where it expects a boolean. These all count as False:
- False
- None
- []
- {}
- ""
- set()
- 0
- 0.0
Almost everything else gets treated as True.
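
A quick way to check these rules is to pass values to `bool()`. The lists below are illustrative; the truthy examples were chosen to show that a non-empty container counts as True even when its contents are falsy.

```python
falsy_values = [False, None, [], {}, "", set(), 0, 0.0]
truthy_values = [True, [0], {"key": None}, " ", {0}, 1, -0.5]

falsy_results = [bool(value) for value in falsy_values]
truthy_results = [bool(value) for value in truthy_values]

print(falsy_results)   # every entry is False
print(truthy_results)  # every entry is True
```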

In [71]:

# Use `if` statements to check for empty lists, strings, dicts, etc.
s = "foo"
if s:
    first_char = s[0]
else:
    first_char = ""
print first_char

Python also has 'all' and 'any' functions that take a list. 'all' returns True precisely when every element of the list is truthy, and 'any' returns True precisely when at least one element is truthy.

In [72]:

print all([True, 1, {3}])
print all([True, "", {3}])
print any([False, 1, [2]])
print all([])
print any([])

True
False
True
True
False

The Not-So-Basics¶

Sorting
List Comprehensions
Generators and Iterators
Randomness
Regular Expressions
Object-Oriented Programming
Functional Tools
enumerate
zip and Argument Unpacking
args and kwargs

Sorting¶

You can sort a list in two ways:

sort(): a list method that sorts the list in place
sorted(): a built-in function that returns a new sorted list, so the original list remains unchanged

By default, the elements are sorted in ascending order. You can change this behaviour by passing reverse=True.

You can also pass a key function, in which case the elements are sorted by the value that the function returns for each of them.

In [74]:

x = [4,1,2,3]
print sorted(x)
print x         # x list is still the same
x.sort()        
print x         # x list is now changed

[1, 2, 3, 4]
[4, 1, 2, 3]
[1, 2, 3, 4]

In [75]:

print sorted(x, reverse=True)
x.sort(reverse=True)
print x

[4, 3, 2, 1]
[4, 3, 2, 1]

In [76]:

#sort a list by abs value in descending order
x = sorted([-4, 1, -2, 3], key = abs, reverse = True)
print x

[-4, 3, -2, 1]
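
A common use of `key` is sorting (key, value) pairs by the value. The sketch below uses a made-up `word_counts` dictionary with distinct counts, so the resulting order is unambiguous.

```python
word_counts = {"data": 3, "science": 2, "python": 5}  # made-up counts

# Each item is a (word, count) tuple; sort by the count, largest first.
by_count = sorted(word_counts.items(), key=lambda pair: pair[1], reverse=True)

print(by_count)  # [('python', 5), ('data', 3), ('science', 2)]
```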

List Comprehensions¶

A list comprehension is a very powerful tool that creates a new list based on another list in a single, readable line.
You can also turn lists into dictionaries or sets using the analogous dict and set comprehensions.
A list comprehension can also contain multiple for clauses.

In [77]:

even_numbers = [x for x in range(5) if x%2 == 0]
print 'list of even numbers below 5:', even_numbers

squares = [x**2 for x in range(5)]
print 'list of squares of numbers less than 5:', squares

even_squares = [x**2 for x in even_numbers]
print 'list of squares of even numbers less than 5:', even_squares

print
#turning list into dict or sets
square_dict = {x : x**2 for x in range(5)}
print 'dictionary of squares of numbers less than 5:', square_dict

square_set =  set(x**2 for x in range(5))
print 'set of squares of numbers less than 5:', square_set

print
#list comprehension with multiple for statements
pairs = [(x, y) 
         for x in range(5)
         for y in range(5)
        ]
print pairs

list of even numbers below 5: [0, 2, 4]
list of squares of numbers less than 5: [0, 1, 4, 9, 16]
list of squares of even numbers less than 5: [0, 4, 16]

dictionary of squares of numbers less than 5: {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
set of squares of numbers less than 5: set([0, 1, 4, 16, 9])

[(0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (1, 0), (1, 1), (1, 2), (1, 3), (1, 4), (2, 0), (2, 1), (2, 2), (2, 3), (2, 4), (3, 0), (3, 1), (3, 2), (3, 3), (3, 4), (4, 0), (4, 1), (4, 2), (4, 3), (4, 4)]

Generators and Iterators¶

In [78]:

def squares(n):
    return n**2

def lazy_range(n):  # this is `xrange` in Python 2 and `range` in Python 3
    """a lazy version of range"""
    i = 0
    while i < n:
        print i
        yield i
        i += 1
        
squares_list = []

for i in lazy_range(10):
    squares_list.append(squares(i))

print 'squares_list:', squares_list

0
1
2
3
4
5
6
7
8
9
squares_list: [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

In [ ]:

# Create an infinite sequence (but don't use without some kind of `break` statement)
def natural_numbers():
    n = 1
    while True:
        yield n
        n += 1

Recall that every dict has an items() method that returns a list of its key-value pairs. In Python 2, dictionaries also have an iteritems() method that lazily yields the key-value pairs one at a time as we iterate over it.
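
An infinite generator is only safe to use when something limits how much of it is consumed. The sketch below shows two ways of doing that: an explicit `break`, and `itertools.islice` from the standard library. It repeats `natural_numbers` so that it runs on its own; the other names are illustrative.

```python
from itertools import islice

def natural_numbers():
    """Yield 1, 2, 3, ... without end."""
    n = 1
    while True:
        yield n
        n += 1

# Option 1: stop the loop with an explicit break.
first_squares = []
for n in natural_numbers():
    if n > 5:
        break
    first_squares.append(n ** 2)

# Option 2: let islice take a fixed number of items.
first_five = list(islice(natural_numbers(), 5))

# A generator expression is lazy too: nothing is computed until it is consumed.
lazy_evens = (n for n in natural_numbers() if n % 2 == 0)
first_three_evens = list(islice(lazy_evens, 3))

print(first_squares)      # [1, 4, 9, 16, 25]
print(first_five)         # [1, 2, 3, 4, 5]
print(first_three_evens)  # [2, 4, 6]
```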

Randomness¶

We can generate random numbers using Python's random module. Some of the important functions from the random module that we will often use are as follows:

random.seed() - sets the random seed in case you want reproducible results with random numbers
randrange() - returns a random integer from the given range (the upper bound is excluded)
shuffle() - randomly reorders the elements of a list in place (it returns None, not a new list)
choice() - randomly selects one element from a collection
sample() - randomly chooses a sample of elements without replacement (no element is picked twice)

In [79]:

import random 

four_uniform_randoms = [random.random() for _ in range(4)]
print 'four_uniform_random numbers:', four_uniform_randoms
print

#set seed for reproducible results
random.seed(10)
print 'reproducible random number:', random.random()
random.seed(10) #reset the seed to 10
print 'reproducible random number again:',random.random()
print

#draw a random integer from a range
print 'random number between 0 and 10:', random.randrange(10)  # one of 0, 1, ..., 9 (10 is excluded)
print 'random number between 3 and 6:',random.randrange(3, 6)  # one of 3, 4, 5 (6 is excluded)
print

#shuffle a list in order to get random order of its elements
up_to_10 = range(10)
print 'ordered list:', up_to_10
random.shuffle(up_to_10)
print 'shuffled list:', up_to_10
print

#randomly pick one element from list
my_best_friend = random.choice(['Heisenberg', 'Saul', 'Jesse', 'Skinny Pete'])
print 'my_best_friend:', my_best_friend
print 

#sample numbers without replacement(without dups)
lottery_numbers = range(100)
winning_numbers = random.sample(lottery_numbers, 6)
print 'winning lottery numbers:', winning_numbers
print

#sample with replacement(with dups)
four_with_replacement = [random.choice(range(10)) for _ in range(4)]
print 'four_with_replacement', four_with_replacement

four_uniform_random numbers: [0.4506916641123372, 0.5200461255446185, 0.39018952392156003, 0.5616870973196527]

reproducible random number: 0.57140259469
reproducible random number again: 0.57140259469

random number between 0 and 10: 4
random number between 3 and 6: 4

ordered list: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
shuffled list: [8, 3, 5, 1, 9, 0, 4, 6, 7, 2]

my_best_friend: Skinny Pete

winning lottery numbers: [4, 86, 60, 38, 28, 67]

four_with_replacement [4, 6, 6, 1]
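
Two details of the cell above are easy to miss: reseeding reproduces the same number, and `shuffle()` changes the list in place rather than returning a new one. The sketch below checks both without depending on any particular random values. The `deck` and `hand` names are illustrative.

```python
import random

# Reseeding with the same value reproduces the same number.
random.seed(10)
first = random.random()
random.seed(10)
second = random.random()
print(first == second)                  # True

# shuffle() reorders the list in place and returns None.
deck = list(range(10))
result = random.shuffle(deck)
print(result)                           # None
print(sorted(deck) == list(range(10)))  # True, the same elements in a new order

# sample() never picks the same element twice.
hand = random.sample(deck, 4)
print(len(set(hand)) == 4)              # True
```

Writing `deck = random.shuffle(deck)` is a common mistake, since it replaces the list with None.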

Regular Expressions¶

Regular expressions provide a way of searching text and are used extensively in NLP. They range from easy to extremely complicated. Some of the important functions from the regular expressions (re) module in Python are as follows:

re.match() - checks whether the beginning of the string matches the pattern and returns a match object, or None if it does not
re.search() - looks for the pattern anywhere in the string and returns a match object for the first occurrence, or None if there is none
re.split() - splits the string wherever the pattern matches
re.sub() - returns a new string in which every match of the pattern is replaced by the replacement value (the original string is unchanged)

In [80]:

import re

print 'does a match cat?', re.match('a', 'cat')
print 'does cat have an a?', re.search('a', 'cat')
print 'does dog have an a?', re.search('a', 'dog')
print 'split carbs on a and b', re.split('[ab]', 'carbs')
print 'replace any numbers with to', re.sub('[0-9]+', 'to', 'from here 2 there')

does a match cat? None
does cat have an a? <_sre.SRE_Match object at 0x11c6b2370>
does dog have an a? None
split carbs on a and b ['c', 'r', 's']
replace any numbers with to from here to there
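
The output above shows that `re.match()` and `re.search()` return either None or a match object, not True or False. A match object is truthy, so it can be used directly in an `if`, and it carries details about the match. The following sketch uses made-up strings.

```python
import re

at_start = re.match("a", "cat")   # None: "cat" does not begin with "a"
anywhere = re.search("a", "cat")  # a match object for the "a" at index 1

print(at_start is None)           # True
print(anywhere.group())           # a
print(anywhere.start())           # 1

# Match objects are truthy, so they work directly in conditions.
has_digits = bool(re.search("[0-9]+", "room 42"))
print(has_digits)                 # True

# re.sub() returns a new string and leaves the original unchanged.
original = "from here 2 there"
replaced = re.sub("[0-9]+", "to", original)
print(original)                   # from here 2 there
print(replaced)                   # from here to there
```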

Object-Oriented Programming¶

Python allows you to create classes that encapsulate data and the functions that operate on them.

For example, suppose Python had no built-in implementation of sets and we wanted to build one. We can start by writing a Set class.

In our Set class, we would like to have the following methods:

add: to add items to the set
remove: to remove items from the set
contains: to check if a given element is present in the set

In [81]:

class Set:
    
    def __init__(self, values = None):
        self.dict = {}
        if values is not None:
            for value in values:
                self.add(value)
    
    def __repr__(self):
        return 'Set: ' + str(self.dict.keys())  # __repr__ must return a string, not a tuple
    
    def add(self, value):
        self.dict[value] = True
    
    def remove(self, value):
        del self.dict[value]
    
    def contains(self, value):
        return value in self.dict

In [82]:

s = Set([1, 2, 3])
s.add(4)
print 'set contains 3?', s.contains(3)
s.remove(3)
print 'set contains 3?', s.contains(3)

set contains 3? True
set contains 3? False
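
Python's special methods let a class like this work with built-in syntax. The sketch below is a hypothetical variant named `SimpleSet`, not the notebook's class: `__repr__` controls how the object prints, `__contains__` enables the `in` operator, and `__len__` enables `len()`. It sorts the keys for display, which assumes the stored values can be compared with each other.

```python
class SimpleSet(object):
    """Hypothetical variant of the Set class above, for illustration only."""

    def __init__(self, values=None):
        self.dict = {}
        if values is not None:
            for value in values:
                self.dict[value] = True   # duplicates collapse onto one key

    def __repr__(self):
        # __repr__ must return a single string.
        return "SimpleSet: " + str(sorted(self.dict.keys()))

    def __contains__(self, value):
        # Lets callers write `value in s`.
        return value in self.dict

    def __len__(self):
        # Lets callers write `len(s)`.
        return len(self.dict)

s = SimpleSet([1, 2, 3, 2])
print(s)        # SimpleSet: [1, 2, 3]
print(3 in s)   # True
print(len(s))   # 3
```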

Functional Tools¶

When passing functions around, we sometimes want to partially apply a function in order to create a new function. Python offers several functional tools for this and for related tasks. The functions that we will be using are as follows:

partial(): fixes some of a function's arguments and returns a new function
map(): applies (or maps) a function to every element of a collection
filter(): returns the elements of a list that satisfy a given condition
reduce(): combines all elements of a collection from left to right by repeatedly applying a two-argument function

In [83]:

#use of partial() function:
def exp(base, power):
    return base ** power

#compute two to the power without using partial
def two_to_the_power(power):
    return exp(2, power)

#use partial() function to compute results of 2 raised to a power
from functools import partial
two_to_the_power = partial(exp, 2) #two_to_the_power is now a function of just one variable
print 'two to the power 3:', two_to_the_power(3)

#use partial() with a keyword argument to fix the power, so that any base can be squared
square_of = partial(exp, power = 2)
print 'square of 3:', square_of(3)

two to the power 3: 8
square of 3: 9

In [84]:

#use map() function
def double(x):
    return 2 * x

xs = [1, 2, 3, 4]
twice_xs = [double(x) for x in xs] #double every element of list using list comprehension
print 'twice_xs created using list comprehension method:', twice_xs
twice_xs = map(double, xs) #double every element of list by using map() function
print 'twice_xs created using map method:', twice_xs

twice_xs created using list comprehension method: [2, 4, 6, 8]
twice_xs created using map method: [2, 4, 6, 8]

In [85]:

#use filter() function
def is_even(n):
    return n%2 == 0

x_evens = [x for x in xs if is_even(x)] #find even numbers in the list using list-comprehension method
print 'x_evens created using list comprehension method:', x_evens
x_evens = filter(is_even, xs) #keep only the even elements of the list by using filter() function
print 'x_evens created using filter method:', x_evens

x_evens created using list comprehension method: [2, 4]
x_evens created using filter method: [2, 4]

In [86]:

#use reduce() function
def multiply(x, y): return x*y

x_product = reduce(multiply, xs) #computes 1 * 2 * 3 * 4
print 'product of all elements of list:', x_product

product of all elements of list: 24
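The cells above were run under Python 2, where map() and filter() return lists and reduce() is a builtin. Under Python 3, map() and filter() return lazy iterators and reduce() lives in functools. A minimal sketch that should behave the same under either version follows; the names products and empty_product are illustrative.

```python
from functools import reduce  # needed in Python 3, harmless in Python 2.6+

def double(x):
    return 2 * x

def is_even(n):
    return n % 2 == 0

def multiply(x, y):
    return x * y

xs = [1, 2, 3, 4]

# Wrapping in list() forces the lazy Python 3 iterators into lists.
twice_xs = list(map(double, xs))
x_evens = list(filter(is_even, xs))
x_product = reduce(multiply, xs)

# map() also accepts several collections and a multi-argument function.
products = list(map(multiply, [1, 2], [4, 5]))   # pairs 1*4 and 2*5

# An optional third argument to reduce() supplies a starting value,
# which also makes reduce() safe on an empty collection.
empty_product = reduce(multiply, [], 1)

print(twice_xs)
print(x_evens)
print(x_product)
print(products)
print(empty_product)
```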

`enumerate`¶

The enumerate() function can be used to iterate over the indices and items of a list at the same time.

In [87]:

a = ["a", "b", "c"]

#non-pythonic way
for i in range(len(a)):
    print i, a[i]

#pythonic way
for index, value in enumerate(a):
    print index, value

0 a
1 b
2 c
0 a
1 b
2 c
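Two small variations on the enumerate() loop above, as a sketch: the optional start argument changes the first index, and an underscore is a common convention for a value you do not need. The names numbered and indexes are illustrative.

```python
a = ["a", "b", "c"]

# start=1 begins counting at 1 instead of 0.
numbered = list(enumerate(a, start=1))

# If only the indexes are needed, the item can be ignored with "_".
indexes = [i for i, _ in enumerate(a)]

print(numbered)
print(indexes)
```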

`zip` and Argument Unpacking¶

The zip() function can be used to iterate over two or more lists in parallel. zip() transforms multiple lists into a single list of tuples of corresponding elements.

In [88]:

#example1:
list1 = ['a', 'b', 'c']
list2 = [1, 2, 3]
print 'list1 and list2 zipped together:', zip(list1, list2)

#example2:
a = [1, 2, 3]
b = [3, 4, 5]
c = [6, 7, 8]

for i, j, k in zip(a, b, c):
    print i, j, k


#example3: use zip() and enumerate() together
alist = ['a1', 'a2', 'a3']
blist = ['b1', 'b2', 'b3']

for i, (a_item, b_item) in enumerate(zip(alist, blist)): #distinct names avoid overwriting lists a and b above
    print i, a_item, b_item

list1 and list2 zipped together: [('a', 1), ('b', 2), ('c', 3)]
1 3 6
2 4 7
3 5 8
0 a1 b1
1 a2 b2
2 a3 b3
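One detail the examples above do not show is what happens when the lists have different lengths: zip() stops as soon as the shortest input runs out. The sketch below also shows a common use of zip(), building a dictionary from a list of keys and a list of values. The list() call is there because zip() returns a lazy iterator under Python 3; the names short_list, long_list, truncated, and lookup are illustrative.

```python
short_list = ['a', 'b']
long_list = [1, 2, 3]

# The unmatched 3 is silently dropped.
truncated = list(zip(short_list, long_list))

# dict() accepts an iterable of (key, value) pairs, which is what zip() yields.
lookup = dict(zip(['a', 'b', 'c'], [1, 2, 3]))

print(truncated)
print(lookup['c'])
```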

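You can also "unzip" a list of tuples with argument unpacking: the * operator passes each element of the list to zip() as a separate argument.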

In [89]:

pairs = [(1, 'one'), (2, 'two'), (3, 'three')]
numbers, letters = zip(*pairs)
print 'numbers in list:', numbers
print 'letters in list:', letters

#using argument unpacking with any function. 
def add(a, b): return a + b

print 'add 1 and 2 by using simple function call:', add(1,2)
# print 'add 1 and 2 by passing list [1,2] to add() method:', add([1, 2]) #raises a TypeError: add() receives one argument but expects two
print 'add 1 and 2 by passing unpacked list [1,2] to add() method:', add(*[1, 2])

numbers in list: (1, 2, 3)
letters in list: ('one', 'two', 'three')
add 1 and 2 by using simple function call: 3
add 1 and 2 by passing unpacked list [1,2] to add() method: 3
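To see why the unzip trick works, it can help to write out the call that * produces. In this sketch (list() is used so the results can be compared under Python 3, where zip() is lazy), unpacked and spelled_out are illustrative names.

```python
pairs = [(1, 'one'), (2, 'two'), (3, 'three')]

# zip(*pairs) passes each tuple in pairs as its own argument...
unpacked = list(zip(*pairs))

# ...so it is the same call as this one, written out by hand.
spelled_out = list(zip((1, 'one'), (2, 'two'), (3, 'three')))

# Zipping the result again gets back to the original pairs.
round_trip = list(zip(*unpacked))

print(unpacked)
print(unpacked == spelled_out)
print(round_trip == pairs)
```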

args and kwargs¶

*args lets a function accept a variable number of positional arguments; inside the function they are collected into a tuple. Let's take an example to make this clear.

In the following example, add_fixed() accepts exactly two numbers. To pass more than two arguments, we need *args, as in add_variable().

In [90]:

#function to add numbers:
def add_fixed(a, b):
    return a + b

def add_variable(*args):
    total = 0 #named total so the builtin sum() is not shadowed
    for num in args:
        total += num
    return total

# print 'add 3 numbers using add_fixed():', add_fixed(3, 2, 1) #gives an error
print 'add 3 numbers using add_variable()', add_variable(3, 2, 1)
print 'add 6 numbers using add_variable()', add_variable(3, 2, 1, 5, 6, 7)

add 3 numbers using add_variable() 6
add 6 numbers using add_variable() 24

Note: the name args is just a convention; any valid identifier works. For example, *myargs is perfectly valid.

**kwargs lets a function accept a variable number of keyword arguments, like this: func_name(name='tim', team='school'). Inside the function they are collected into a dictionary.

In [91]:

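def my_func(**kwargs):
    #kwargs is a dict of the keyword arguments; in Python 2 its iteration order is arbitrary
    for key, value in kwargs.items():
        print(key, value) #with Python 2's print statement, this prints the tuple (key, value)

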
my_func(name='tim', sport='football', roll=19)

('sport', 'football')
('name', 'tim')
('roll', 19)
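*args and **kwargs can be combined in one function, and the same * and ** operators work in the other direction when calling a function. The sketch below, with hypothetical names describe, add_three, and doubler, shows both, plus one common use: a wrapper that forwards whatever arguments it receives.

```python
def describe(*args, **kwargs):
    # args is a tuple of the unnamed arguments, kwargs a dict of the named ones.
    return args, kwargs

positional, keyword = describe(1, 2, key="word")

def add_three(x, y, z):
    return x + y + z

# * unpacks a list into positional arguments, ** unpacks a dict into keyword arguments.
xy = [1, 2]
z_dict = {"z": 3}
total = add_three(*xy, **z_dict)

def doubler(f):
    """Return a function that calls f with any arguments and doubles the result."""
    def g(*args, **kwargs):
        return 2 * f(*args, **kwargs)
    return g

double_add_three = doubler(add_three)

print(positional)
print(keyword)
print(total)
print(double_add_three(1, 2, z=3))
```

Forwarding with *args and **kwargs means doubler() does not need to know how many arguments the wrapped function takes.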

For Further Exploration¶

Python
- Official Python.org tutorial (good)
IPython
- Official IPython.org tutorial (not quite as good)
- IPython.org videos (better)
- Python for Data Analysis by Wes McKinney (original author of pandas)

Note: Get the code and examples from the book for Chapters 3 - 24 here

Finally, thanks for coming, and reach out to Meghann or Chai if you have any questions about these notes!
