Women in Data Science - ATX Meetup

Ch. 2: A Crash Course in Python

Notes on "Data Science from Scratch" by Joel Grus

  • Download the Jupyter notebook for these slides here

The Basics

  • Getting Python
  • The Zen of Python
  • Whitespace Formatting
  • Modules
  • Arithmetic
  • Functions
  • Strings
  • Exceptions
  • Lists
  • Tuples
  • Dictionaries
    • defaultdict
    • Counter
  • Sets
  • Control Flow
  • Truthiness

Getting Python

  • We recommend installing Anaconda
  • Otherwise, you can install the following to get started:
    • Python
    • Python package manager: pip
    • Nicer Python shell: IPython
    • Web app for interactive data sci: Jupyter Notebook
      • pip install jupyter (Python 2)
      • pip3 install jupyter (Python 3)
    • etc...

The Zen of Python

Whitespace Formatting

In Python, whitespace is important. Other languages, like C++, use curly braces {} to delimit blocks of code:

for( int i = 0; i < 5; i++ ) { cout << i << endl; }

In [1]:
# Python uses indentation to delimit blocks of code.
for i in range(5):
    print i
0
1
2
3
4
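A related detail: Python ignores line breaks and indentation inside parentheses and brackets, which helps with long expressions. A minimal sketch (the variable names are illustrative only, and the single-argument print() calls run on Python 2.7 and Python 3 alike):

```python
# Inside parentheses, the expression may continue on the next line.
long_winded_computation = (1 + 2 + 3 + 4 + 5 +
                           6 + 7 + 8 + 9 + 10)

# The same holds inside brackets, which makes nested lists easier to read.
easier_to_read = [[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]]

print(long_winded_computation)
print(easier_to_read)
```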

Modules

Let's try importing the Matplotlib pyplot module, which "provides a MATLAB-like plotting framework."

Note:

  • Sometimes you'll want to use IPython magic commands in your Jupyter notebooks.
  • Try not to do the following since you may inadvertently overwrite variables you've already defined: from matplotlib.pyplot import *
In [2]:
# Tip: use IPython magic command "%matplotlib inline" to display plots in notebook
%matplotlib inline 
import matplotlib.pyplot # import module and reference by its super long name

matplotlib.pyplot.plot([1,2,3], [1,2,3]) # must type the whole name to access methods, etc.
Out[2]:
[<matplotlib.lines.Line2D at 0x10e551e10>]
In [3]:
import matplotlib.pyplot as plt # OR, alias it as a shorter, more fun-to-type name

plt.plot([1,2,3], [1,2,3])
Out[3]:
[<matplotlib.lines.Line2D at 0x10eb03090>]

Arithmetic

  • Python 2.7 (like Fortran) uses integer division by default when both operands are integers.
In [4]:
print 5 / 2
2
  • To force floating point division, make at least one operand a float:
In [5]:
print type(5), type(2.0)
print 5 / 2.0
<type 'int'> <type 'float'>
2.5
  • Or, make floating point division default:
In [6]:
from __future__ import division
print 5 / 2   # now floating point division is default
print 5 // 2  # use double slash for integer division
2.5
2
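One thing the cells above do not show is how the double slash treats negative numbers. The sketch below (illustrative variable names) suggests that `//` rounds down toward negative infinity rather than truncating toward zero, and that the built-in divmod() returns the quotient and remainder together:

```python
# Floor division rounds down, so the negative case may be surprising.
positive_floor = 5 // 2       # 2
negative_floor = -5 // 2      # -3, not -2

# divmod returns (floor quotient, remainder) in one call.
quotient, remainder = divmod(7, 2)

# Converting one operand to float forces floating point division.
true_quotient = float(5) / 2

print(positive_floor)
print(negative_floor)
print(quotient)
print(remainder)
print(true_quotient)
```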

Functions

  • Define functions with def
In [7]:
def double(x):
    """This is where you put an optional docstring that 
    explains what the function does.
    For example, this function multiplies its input by 2."""
    return x * 2
In [8]:
def apply_to_one(f):
    """Calls the function f with 1 as its argument"""
    return f(1)
In [9]:
my_double = double          # refers to the previously defined function
x = apply_to_one(my_double)
print x
2

lambda functions

In [10]:
y = apply_to_one(lambda x: x + 4)
print y
5
In [11]:
# but use `def` instead of assigning a lambda function to a variable
another_double = lambda x: 2 * x     # don't do this
def another_double(x): return 2 * x  # do this instead

default function arguments

  • define a default arg; specify arg only if you want different value
In [12]:
def my_print(message="my default message"):
    print message
    
my_print("hello")
hello
In [13]:
my_print()
my default message
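Default arguments combine naturally with passing arguments by name. A small sketch, where `subtract` is a hypothetical function used only for illustration:

```python
def subtract(a=0, b=0):
    """Return a minus b; both arguments default to 0."""
    return a - b

both_given = subtract(10, 5)   # positional arguments: 10 - 5
only_a = subtract(10)          # b falls back to its default: 10 - 0
only_b = subtract(b=5)         # a falls back to its default: 0 - 5

print(both_given)
print(only_a)
print(only_b)
```

Naming the argument (b=5) is what lets you skip `a` while still setting `b`.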

Strings

In [14]:
# Delimit strings with single OR double quotation marks
single_quoted_string = 'data science'
double_quoted_string = "data science"
print single_quoted_string, "is the same as", double_quoted_string
data science is the same as data science
In [15]:
# Use backslashes to encode special characters
tab_string = "\t"  # represents the tab character
len(tab_string)    # string length is 1 (not 2)
Out[15]:
1
In [16]:
# Use raw strings to represent backslashes
not_tab_string = r"\t"  # represents the characters '\' and 't'
len(not_tab_string)     # now length is 2
Out[16]:
2
In [17]:
# Create multiline strings using triple-quotes
multi_line_string = """This is the first line.
and this is the second line
and this is the third line"""
print multi_line_string
This is the first line.
and this is the second line
and this is the third line

Exceptions

In [18]:
# This makes your code crash
print 0 / 0
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-18-ccf175f7fa0e> in <module>()
      1 # This makes your code crash
----> 2 print 0 / 0

ZeroDivisionError: division by zero

Here's a list of built-in exceptions

In [19]:
# This will handle the exception by printing an error message
try:
    print 0 / 0
except ZeroDivisionError:
    print "Cannot divide by zero"
Cannot divide by zero
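A try block can also carry `else` and `finally` clauses. The following is a sketch only; `safe_divide` is a hypothetical helper, not something from the book:

```python
def safe_divide(numerator, denominator):
    """Return numerator / denominator, or None if the denominator is zero."""
    try:
        result = numerator / denominator
    except ZeroDivisionError:
        return None                       # the division failed
    else:
        return result                     # runs only when no exception was raised
    finally:
        print("safe_divide was called")   # runs in both cases

good = safe_divide(6, 3)
bad = safe_divide(1, 0)
print(good)
print(bad)
```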

Lists

  • Ordered collections
  • Similar to an array in other languages, but holds heterogeneous data (e.g., floats + ints + strings)
  • Note: Use NumPy arrays (here's a tutorial) for larger amounts of homogeneous data (e.g., just floats)
  • Specify with brackets []
In [20]:
integer_list = [1, 2, 3]
heterogeneous_list = ["string", 0.1, True]
list_of_lists = [ integer_list, heterogeneous_list, [] ]

print len(integer_list)  # get the length of a list
print sum(integer_list)  # get the sum of the elements in a list (if addition is defined for those elements)
3
6
In [21]:
# Use square brackets to get the nth element of a list
x = range(10)
print x
print x[0]
print x[1]
print x[-1]
print x[-2]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
0
1
9
8
In [22]:
# Use square brackets to "slice" lists
print x[:3]   # up to but not including 3
print x[3:]   # 3 and up
print x[1:4]  # 1 up to but not including 4
print x[-3:]  # last 3
print x[1:-1] # without the first and last elements (0 and 9)
print x[:]    # all elements of the list
[0, 1, 2]
[3, 4, 5, 6, 7, 8, 9]
[1, 2, 3]
[7, 8, 9]
[1, 2, 3, 4, 5, 6, 7, 8]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
In [23]:
# Use the `in` operator to check for list membership; use only for small lists or if run time is not a concern
print 1 in [1, 2, 3]
print 0 in [1, 2, 3]
True
False
In [24]:
# Concatenate lists like this
x = [1, 2, 3]

y = x + [4, 5, 6]   # creates a new list leaving "x" unchanged
print x
print y, "\n"

x.extend([4, 5, 6]) # changes "x"
print x
[1, 2, 3]
[1, 2, 3, 4, 5, 6] 

[1, 2, 3, 4, 5, 6]
In [25]:
# Append to lists like this
x = [1, 2, 3]
print x, "\n"

x.append(0)
print x
[1, 2, 3] 

[1, 2, 3, 0]
In [26]:
# It's convenient (and common) to unpack lists like this
z = [1, 2]
x, y = z
print type(x), "x =", x 
print type(y), "y =", y 
print type(z), "z =", z 
<type 'int'> x = 1
<type 'int'> y = 2
<type 'list'> z = [1, 2]
In [27]:
# It's also common to use an underscore for a value you're going to throw away
z = [3, 4]
_, y = z
print x
print y
print z
1
4
[3, 4]
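The full slice `x[:]` shown earlier is more than a way to print everything: it creates a new list. A short sketch of the difference between a second name for a list and a copy of it (variable names are illustrative). It also wraps range() in list(), which is needed on Python 3, where range() no longer returns a list:

```python
digits = list(range(10))       # on Python 3, range() is lazy; list() materializes it

original = [1, 2, 3]
alias = original               # a second name for the same list object
shallow_copy = original[:]     # a slice of everything creates a new list

original.append(4)             # visible through the alias, not through the copy

print(digits)
print(alias)
print(shallow_copy)
```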

Tuples

  • Similar to list, but immutable (can't be modified)
  • Specify with parentheses () or with no delimiters at all (just commas)
In [28]:
my_list = [1, 2]
my_tuple = (1, 2)
other_tuple = 3, 4
my_list[1] = 3

print my_list
[1, 3]
In [29]:
try:
    my_tuple[1] = 3
except TypeError:
    print "Cannot modify a tuple"
Cannot modify a tuple
In [30]:
# Use tuples to return multiple values from functions
def sum_and_product(x, y):
    return (x + y), (x * y)

sp = sum_and_product(2, 3)
s, p = sum_and_product(5, 10)
print sp
print s 
print p
(5, 6)
15
50
In [31]:
# Use tuples (and lists) for multiple assignments
x, y = 1, 2
print "x =", x
print "y =", y
x = 1
y = 2
In [32]:
x, y = y, x  # Pythonic way to swap variables
print "x =", x
print "y =", y
x = 2
y = 1

Dictionaries

  • Dictionaries associate values with keys
  • Allow quick retrieval of value given a key
  • Specify with curly braces {} or dict()
In [33]:
empty_dict = {}                    # Pythonic
empty_dict2 = dict()               # less Pythonic
grades = { "Joel": 80, "Tim": 95}

# Use square brackets to look up value(s) for a key
print grades["Joel"]
80
In [34]:
# KeyError exception raised if key not found
try: 
    kates_grade = grades["Kate"]
except KeyError:
    print "No grade for Kate!"
No grade for Kate!
In [35]:
# Use `in` to check for existence of a key
joel_has_grade = "Joel" in grades
kate_has_grade = "Kate" in grades

print joel_has_grade
print kate_has_grade
True
False
In [36]:
# Use `get` method of dictionaries when you want to return a default value (rather than raise exception)
print grades.get("Joel", 0)
print grades.get("Kate", 0)
print grades.get("No One")   # with no default given, `get` returns None
80
0
None
In [37]:
# Use square brackets to assign key-value pairs, e.g., dict_name[key] = value
grades["Tim"] = 99
grades["Kate"] = 100

print "Number of students:", len(grades)
Number of students: 3
In [38]:
# Use dictionaries to represent structured data, such as in a tweet
tweet = {
    "user": "joelgrus",
    "text": "Data Science is Awesome",
    "retweet_count": 100,
    "hashtags": ["#data", "#science", "#datascience", "#awesome"]
}
In [39]:
print tweet.keys()    # list of keys
print tweet.values()  # list of values
print tweet.items()   # list of (key, value) tuples
['text', 'retweet_count', 'hashtags', 'user']
['Data Science is Awesome', 100, ['#data', '#science', '#datascience', '#awesome'], 'joelgrus']
[('text', 'Data Science is Awesome'), ('retweet_count', 100), ('hashtags', ['#data', '#science', '#datascience', '#awesome']), ('user', 'joelgrus')]
In [40]:
print "user" in tweet.keys()       # list `in` is slow
print "user" in tweet              # dict `in` is fast (and more Pythonic)
print "joelgrus" in tweet.values() 
True
True
True

You cannot use lists as keys. If that's needed, then:

  • use a tuple, or
  • represent the key as a string
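The two workarounds above can be sketched as follows. The `board` dictionary and its contents are hypothetical, chosen only to show which key types are accepted:

```python
board = {}
board[(0, 1)] = "X"            # a tuple is hashable, so it works as a key

try:
    board[[1, 2]] = "O"        # a list is not hashable
except TypeError:
    list_key_failed = True
else:
    list_key_failed = False

board["1,2"] = "O"             # alternative: represent the key as a string

print(list_key_failed)
print(board[(0, 1)])
print(board["1,2"])
```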

Let's grab some data to use in the upcoming examples on defaultdict

  • Clone the UW Intro to Data Science course materials repository:

git clone https://github.com/uwescience/datasci_course_materials.git

  • Look in datasci_course_materials/assignment1/ for a file named three_minutes_tweets.json
In [41]:
# Load the data into a list called `tweets`
import json
# Substitute the local path to your three_minutes_tweets.json file within the quotes
tweet_file = open("/Users/mepa/repos/datasci_course_materials/assignment1/three_minutes_tweets.json")

tweets = []
for line in tweet_file:
    tweets.append(json.loads(line)) # each list element contains all data pertaining to a single tweet
print "Total # of tweets:", len(tweets)
print "type(tweets[0]):", type(tweets[0])
Total # of tweets: 8299
type(tweets[0]): <type 'dict'>
  • For tweets that contain a "text" key we can print someone's tweet:
In [42]:
print tweets[7]['text']  # text of 8th tweet
@1voodoochild thanks for the follow💯
In [43]:
print tweets[3]['text']  # text of 4th tweet
إنّ العرب إذا تغلبوا على أوطان أسرع إليها الخراب والسبب في ذلك أنها أمة وحشية بإستحكام عوائد التوحش وأسبابه فصار لهم خلقة وجبلة

- ابن خلدون

Aside: The text is attributed to Ibn Khaldun who lived 1332 - 1406. Google Translate gives the following:

"The Arabs overcame their homelands if faster to ruin and why they are a nation and brutal savagery Bastgam returns and causes them became habitus and protoplasm"

In [44]:
print tweets[54]['text']
@morgancollins42 I don't like you
In [45]:
print tweets[7000]['text']
The Newf opened up and ate a whole bar of Ivory soap. Some things about the canine diet I will never understand.
In [47]:
# Print all key-value pairs in the 54th tweet
for key in tweets[53].keys():
    print key, ": \t", tweets[53][key]
contributors : 	None
truncated : 	False
text : 	I'm starving I can't wait to leave work and eat Wing Barn 😋🍗
in_reply_to_status_id : 	None
id : 	633030783796514816
favorite_count : 	0
source : 	<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
retweeted : 	False
coordinates : 	None
timestamp_ms : 	1439761274657
entities : 	{u'user_mentions': [], u'symbols': [], u'trends': [], u'hashtags': [], u'urls': []}
in_reply_to_screen_name : 	None
id_str : 	633030783796514816
retweet_count : 	0
in_reply_to_user_id : 	None
favorited : 	False
user : 	{u'follow_request_sent': None, u'profile_use_background_image': True, u'default_profile_image': False, u'id': 509749471, u'verified': False, u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/603406649433268224/wfMu2EpQ_normal.jpg', u'profile_sidebar_fill_color': u'DDEEF6', u'profile_text_color': u'333333', u'followers_count': 149, u'profile_sidebar_border_color': u'C0DEED', u'id_str': u'509749471', u'profile_background_color': u'C0DEED', u'listed_count': 2, u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme1/bg.png', u'utc_offset': None, u'statuses_count': 6211, u'description': u'21 years old, Aquarius \u2652\ufe0f, Instagram _BENNYAGUIRRE, taken', u'friends_count': 206, u'location': u'harlingen, tx', u'profile_link_color': u'0084B4', u'profile_image_url': u'http://pbs.twimg.com/profile_images/603406649433268224/wfMu2EpQ_normal.jpg', u'following': None, u'geo_enabled': False, u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/509749471/1423998507', u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme1/bg.png', u'name': u'Ben\u270c', u'lang': u'en', u'profile_background_tile': False, u'favourites_count': 3447, u'screen_name': u'BenitoAguirre1', u'notifications': None, u'url': None, u'created_at': u'Thu Mar 01 05:16:54 +0000 2012', u'contributors_enabled': False, u'time_zone': None, u'protected': False, u'default_profile': True, u'is_translator': False}
geo : 	None
in_reply_to_user_id_str : 	None
possibly_sensitive : 	False
lang : 	en
created_at : 	Sun Aug 16 21:41:14 +0000 2015
filter_level : 	low
in_reply_to_status_id_str : 	None
place : 	None

defaultdict

  • Like a regular dictionary, except that when you try to look up a key that is not present in the dictionary, it first adds a value for it using a no-argument function that you provide when creating it.
  • Imagine creating a dictionary to count the words in a document. Here are 3 approaches for doing that with a regular dictionary:
In [ ]:
# Approach 1 - if/else statement
word_counts = {}
for word in document:
    if word in word_counts:
        word_counts[word] += 1
    else:
        word_counts[word] = 1
        
# Approach 2 - handle the exception
word_counts = {}
for word in document:
    try:
        word_counts[word] += 1
    except KeyError:
        word_counts[word] = 1
        
# Approach 3 - use `get`
word_counts = {}
for word in document:
    previous_count = word_counts.get(word, 0)
    word_counts[word] = previous_count + 1

Instead we can use defaultdict (needs to be imported from collections):

In [ ]:
from collections import defaultdict

word_counts = defaultdict(int)  # int() produces 0
for word in document:
    word_counts[word] += 1
In [73]:
# Let's implement the example above, but substitute a tweet for "document"
from collections import defaultdict

tweet_text = tweets[2736]['text'].encode('utf-8') # one long string; encodes the unicode text as UTF-8 bytes
tweet_words = tweet_text.split()                  # split() returns a list of words

word_counts = defaultdict(int)
for word in tweet_words:
    word_counts[word] += 1
        
print tweet_text, "\n"
print word_counts
RT @HCeretto: I don't love you I'm just passing the time
You could love me if I knew how to lie
But who could love me? I am out of my mind 

defaultdict(<type 'int'>, {'love': 3, 'just': 1, "don't": 1, 'am': 1, 'me?': 1, 'You': 1, 'if': 1, 'RT': 1, 'lie': 1, 'how': 1, 'But': 1, 'to': 1, 'you': 1, 'out': 1, 'knew': 1, 'I': 3, 'mind': 1, 'who': 1, "I'm": 1, 'passing': 1, 'me': 1, '@HCeretto:': 1, 'of': 1, 'could': 2, 'time': 1, 'the': 1, 'my': 1})

Optional Exercises

Read over the UW Intro to Data Science - Assignment 1 description.

Skip

  • Problem 1: Get the Twitter Data

and just use "three_minutes_tweets.json" for the following problems:

  • Problem 2: Derive the sentiment of each tweet
  • Problem 3: Derive the sentiment of new terms
  • Problem 4: Compute term frequency

Can skip the last two problems:

  • Problem 5: Which state is happiest?
  • Problem 6: Top ten hash tags

defaultdict can also be useful with list or dict or your own functions:

In [49]:
dd_list = defaultdict(list)          # list() produces an empty list
dd_list[2].append(1) 
print dd_list, "\n"

dd_dict = defaultdict(dict)          # dict() produces an empty dict
dd_dict["Joel"]["City"] = "Seattle"
print dd_dict, "\n"

dd_pair = defaultdict(lambda: [0,0]) # use a lambda function
dd_pair[2][1] = 1
print dd_pair
defaultdict(<type 'list'>, {2: [1]}) 

defaultdict(<type 'dict'>, {'Joel': {'City': 'Seattle'}}) 

defaultdict(<function <lambda> at 0x11c4e5410>, {2: [0, 1]})
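A common use of defaultdict(list) is grouping. The sketch below assumes a made-up word list and groups the words by first letter; it also illustrates that merely looking up a missing key inserts it:

```python
from collections import defaultdict

# Hypothetical word list, used only for illustration.
words = ["data", "science", "dict", "set", "sort", "tuple"]

words_by_letter = defaultdict(list)
for word in words:
    words_by_letter[word[0]].append(word)   # missing keys start as []

grouped = dict(words_by_letter)             # plain-dict snapshot of the groups

had_z_before = "z" in words_by_letter
empty = words_by_letter["z"]                # the lookup alone creates the key
has_z_after = "z" in words_by_letter

print(grouped)
print(had_z_before)
print(has_z_after)
```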

Counter

  • Instead of any of the approaches above, we could compute word counts with Counter from the collections module. A Counter turns a sequence of values into a defaultdict(int)-like object that maps each key to its count, which gives a very simple way to solve our word-count problem.
  • A Counter instance has a most_common method to find most common keys and their counts.
In [53]:
from collections import Counter
c = Counter([0, 1, 2, 0])
print c
Counter({0: 2, 1: 1, 2: 1})
In [ ]:
word_counts = Counter(document)

# Print the 10 most common words and their counts
for word, count in word_counts.most_common(10):
    print word, count
In [54]:
print tweet_text, "\n"

word_counts = Counter(tweet_words)

# Print the 5 most common words and their counts
for word, count in word_counts.most_common(5):
    print word, count
RT @HCeretto: I don't love you I'm just passing the time
You could love me if I knew how to lie
But who could love me? I am out of my mind 

love 3
I 3
could 2
just 1
don't 1
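For readers without the tweet data, here is a self-contained sketch using a made-up sentence as the document. It also suggests one difference from defaultdict: looking up a missing key in a Counter returns 0 without adding the key.

```python
from collections import Counter

# Hypothetical document, already split into words.
document = "the cat and the hat and the bat".split()
word_counts = Counter(document)

the_count = word_counts["the"]
dog_count = word_counts["dog"]           # missing keys count as zero
dog_was_added = "dog" in word_counts     # the lookup did not insert the key
top_two = word_counts.most_common(2)

print(the_count)
print(dog_count)
print(dog_was_added)
print(top_two)
```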

Sets

  • Represents a collection of distinct elements
  • Used for two main reasons:
    1. the in operation is much faster on sets than on lists, for example
    2. sometimes we want to find distinct items in a list
In [55]:
s = set()
s.add(1)
print "s is", s
s.add(2)
print "s is now", s
s.add(2)
print "s is still", s
print "There are", len(s), "elements in s."
print 2 in s
print 3 in s
s is set([1])
s is now set([1, 2])
s is still set([1, 2])
There are 2 elements in s.
True
False
In [62]:
# Let's use the tweets we read in earlier from "three_minutes_tweets.json"
all_tweets_words = []
for tweet in tweets:
    if 'text' in tweet: # since not all "tweets" contain text
        tweet_text = tweet['text'].encode('utf-8')
        tweet_words_list = tweet_text.split()
        all_tweets_words.extend(tweet_words_list)
print len(all_tweets_words)
78668
In [63]:
%time "love" in all_tweets_words # another IPython magic command!

all_tweets_words_set = set(all_tweets_words)
%time "love" in all_tweets_words_set
CPU times: user 49 µs, sys: 11 µs, total: 60 µs
Wall time: 62 µs
CPU times: user 5 µs, sys: 2 µs, total: 7 µs
Wall time: 7.87 µs
Out[63]:
True
In [64]:
item_list = [1, 2, 3, 1, 2, 3]
print len(item_list)
print set(item_list)
print len(set(item_list))
print len(list(set(item_list)))
6
set([1, 2, 3])
3
3
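Outside of a notebook, where the %time magic is not available, the standard timeit module can make the same comparison. This is a sketch with synthetic data (100,000 integers, an arbitrary choice); the actual timings depend on the machine, so none are shown here:

```python
import timeit

# Hypothetical data: the same integers stored as a list and as a set.
numbers_list = list(range(100000))
numbers_set = set(numbers_list)
target = 99999    # worst case for the list: the last element

# Time 100 membership checks against each container.
list_seconds = timeit.timeit(lambda: target in numbers_list, number=100)
set_seconds = timeit.timeit(lambda: target in numbers_set, number=100)

print(list_seconds)
print(set_seconds)
```

The list check scans elements one by one, while the set check is a hash lookup, so the second number is typically far smaller.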

Control Flow

Conditional branching: if-elif-else statements

In [65]:
x = - 4

if x < 0:
    x = 0
    print 'Negative changed to zero'
elif x == 0:
    print 'Zero'
elif x == 1:
    print 'Single'
else:
    print 'More'
Negative changed to zero
In [66]:
# Can also write if-then-else statements on one line
x = 5
parity = "even" if x % 2 == 0 else "odd" 
print parity
odd

While loops

  • Repeatedly executes a target statement as long as a given condition is true.
In [67]:
count = 0
while (count < 9):
    print 'The count is:', count
    count = count + 1

print "Good bye!"
The count is: 0
The count is: 1
The count is: 2
The count is: 3
The count is: 4
The count is: 5
The count is: 6
The count is: 7
The count is: 8
Good bye!

For loops

  • Iterates over the items of any sequence, such as a list or a string.
In [68]:
animal_kingdom = ['dogs', 'cats', 'elephant', 'tiger', 'lion']

for animal in animal_kingdom:
    print animal, len(animal)
dogs 4
cats 4
elephant 8
tiger 5
lion 4

continue and break

In [69]:
for x in range(10):
    if x == 3:
        continue   # go immediately to the next iteration
    if x == 5:
        break      # quit the loop entirely
    print x
0
1
2
4

Truthiness

  • Booleans work in Python as in most other languages, except that they are capitalized
  • Python uses the value None to indicate a nonexistent value. It is similar to null in other languages.
In [70]:
one_is_less_than_two = 1 < 2
print 'one_is_less_than_two:', one_is_less_than_two

true_equals_false = True == False
print 'true_equals_false:', true_equals_false

x = None
print 'x_equals_none:', x == None # non-Pythonic way
print 'x is none?', x is None     # Pythonic way
one_is_less_than_two: True
true_equals_false: False
x_equals_none: True
x is none? True
  • Python lets you use any value where it expects a boolean. These all count as False:
    • False
    • None
    • []
    • {}
    • ""
    • set()
    • 0
    • 0.0
  • Almost everything else gets treated as True.
In [71]:
# Use `if` statements to check for empty lists, strings, dicts, etc.
s = "foo"
if s:
    first_char = s[0]
else:
    first_char = ""
print first_char
f

Python also has 'all' and 'any' functions that take a list (or any iterable): 'all' returns True precisely when every element is truthy, and 'any' returns True precisely when at least one element is truthy.

In [72]:
print all([True, 1, {3}])
print all([True, "", {3}])
print any([False, 1, [2]])
print all([])
print any([])
True
False
True
True
False
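Truthiness also allows a shorter form of the if/else pattern used for first_char above, because `and` and `or` return one of their operands rather than a strict boolean. A sketch with illustrative names; note the last case, where a legitimate 0 is replaced by the fallback:

```python
s = ""
first_char = s and s[0]      # s is falsy, so `and` returns s itself ("")

t = "foo"
first_of_t = t and t[0]      # t is truthy, so `and` returns t[0]

x = None
safe_x = x or 0              # x is falsy, so `or` returns the fallback 0

zero = 0
surprising = zero or 10      # 0 is falsy too, so the fallback 10 wins

print(repr(first_char))
print(first_of_t)
print(safe_x)
print(surprising)
```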

The Not-So-Basics

  • Sorting
  • List Comprehensions
  • Generators and Iterators
  • Randomness
  • Regular Expressions
  • Object-Oriented Programming
  • Functional Tools
  • enumerate
  • zip and Argument Unpacking
  • args and kwargs

Sorting

You can sort a list in two ways:

  • sort(): a list method that sorts the list in place
  • sorted(): a built-in function that creates a new sorted list, so the original list remains unchanged

By default, elements are sorted by value in ascending order. You can reverse the order by passing reverse=True to either one.

You can also specify a key function; the elements are then sorted by the results of that function.

In [74]:
x = [4,1,2,3]
print sorted(x)
print x         # x list is still the same
x.sort()        
print x         # x list is now changed
[1, 2, 3, 4]
[4, 1, 2, 3]
[1, 2, 3, 4]
In [75]:
print sorted(x, reverse=True)
x.sort(reverse=True)
print x
[4, 3, 2, 1]
[4, 3, 2, 1]
In [76]:
#sort a list by abs value in descending order
x = sorted([-4, 1, -2, 3], key = abs, reverse = True)
print x
[-4, 3, -2, 1]
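The key argument also accepts a lambda. A sketch, assuming a hypothetical dictionary of word counts (with distinct counts, so the order is unambiguous), that sorts (word, count) pairs from highest count to lowest:

```python
# Hypothetical counts, used only for illustration.
word_counts = {"just": 1, "love": 3, "could": 2, "I": 4}

# Each item is a (word, count) pair; sort on the count, largest first.
by_count = sorted(word_counts.items(),
                  key=lambda pair: pair[1],
                  reverse=True)

print(by_count)
```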

List Comprehensions

  • A list comprehension is a powerful tool that creates a new list from another list (or any iterable) in a single, readable line.
  • You can also build dictionaries or sets with the analogous dictionary and set comprehensions.
  • A list comprehension can also use multiple for loops.
In [77]:
even_numbers = [x for x in range(5) if x%2 == 0]
print 'list of even numbers below 5:', even_numbers

squares = [x**2 for x in range(5)]
print 'list of squares of numbers less than 5:', squares

even_squares = [x**2 for x in even_numbers]
print 'list of squares of even numbers less than 5:', even_squares

print
#turning list into dict or sets
square_dict = {x : x**2 for x in range(5)}
print 'dictionary of squares of numbers less than 5:', square_dict

square_set =  set(x**2 for x in range(5))
print 'set of squares of numbers less than 5:', square_set

print
#list comprehension with multiple for statements
pairs = [(x, y) 
         for x in range(5)
         for y in range(5)
        ]
print pairs
list of even numbers below 5: [0, 2, 4]
list of squares of numbers less than 5: [0, 1, 4, 9, 16]
list of squares of even numbers less than 5: [0, 4, 16]

dictionary of squares of numbers less than 5: {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
set of squares of numbers less than 5: set([0, 1, 4, 16, 9])

[(0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (1, 0), (1, 1), (1, 2), (1, 3), (1, 4), (2, 0), (2, 1), (2, 2), (2, 3), (2, 4), (3, 0), (3, 1), (3, 2), (3, 3), (3, 4), (4, 0), (4, 1), (4, 2), (4, 3), (4, 4)]
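When a comprehension has multiple for clauses, a later clause can use the variable of an earlier one. A brief sketch (names are illustrative) that keeps only the pairs with x < y, plus a set comprehension written with curly braces:

```python
# The inner range starts just above x, so only pairs with x < y are produced.
increasing_pairs = [(x, y)
                    for x in range(4)
                    for y in range(x + 1, 4)]

# Curly braces without a colon build a set; duplicates collapse.
remainders = {x % 3 for x in range(10)}

print(increasing_pairs)
print(sorted(remainders))
```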

Generators and Iterators

In [78]:
def squares(n):
    return n**2

def lazy_range(n):  # this is `xrange` in Python 2 and `range` in Python 3
    """a lazy version of range"""
    i = 0
    while i < n:
        print i
        yield i
        i += 1
        
squares_list = []

for i in lazy_range(10):
    squares_list.append(squares(i))

print 'squares_list:', squares_list
0
1
2
3
4
5
6
7
8
9
squares_list: [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
In [ ]:
# Create an infinite sequence (but don't use without some kind of `break` statement)
def natural_numbers():
    n = 1
    while True:
        yield n
        n += 1

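Recall that every dict has an items() method that (in Python 2) returns a list of its key-value pairs. Python 2 dictionaries also have an iteritems() method that lazily yields the key-value pairs one at a time as we iterate over it. In Python 3, items() itself is lazy and iteritems() no longer exists.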
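One way to consume an infinite generator safely, without a `break`, is itertools.islice from the standard library. The sketch below repeats natural_numbers() so that it is self-contained, and also shows a generator expression, which can be iterated only once:

```python
from itertools import islice

def natural_numbers():
    """Yield 1, 2, 3, ... without end."""
    n = 1
    while True:
        yield n
        n += 1

# islice stops after 5 values, so the infinite loop is never exhausted.
first_five = list(islice(natural_numbers(), 5))

# A generator expression is lazy: nothing is computed until it is consumed.
lazy_evens_below_20 = (i for i in range(20) if i % 2 == 0)
total = sum(lazy_evens_below_20)        # consumes the generator
leftover = list(lazy_evens_below_20)    # already exhausted, so this is empty

print(first_five)
print(total)
print(leftover)
```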

Randomness

We can generate random numbers using Python's random module. Some of its functions that we will use often are as follows:

  • random.seed - Set random seed in case you want to have reproducible results with random numbers
  • randrange() - allows you to produce random number from within given range
  • shuffle() - randomly reorders the elements of a list in place (it returns None rather than a new list)
  • choice() - in case you want to randomly select an element from a collection
  • sample() - if you want to randomly choose a sample of elements without replacement (sampling without any duplicates)
In [79]:
import random 

four_uniform_randoms = [random.random() for _ in range(4)]
print 'four_uniform_random numbers:', four_uniform_randoms
print

#set seed for reproducible results
random.seed(10)
print 'reproducible random number:', random.random()
random.seed(10) #reset the seed to 10
print 'reproducible random number again:',random.random()
print

#create range of random numbers 
print 'random number between 0 and 10:', random.randrange(10)
print 'random number between 3 and 6:',random.randrange(3, 6)
print

#shuffle a list in order to get random order of its elements
up_to_10 = range(10)
print 'ordered list:', up_to_10
random.shuffle(up_to_10)
print 'shuffled list:', up_to_10
print

#randomly pick one element from list
my_best_friend = random.choice(['Heisenberg', 'Saul', 'Jesse', 'Skinny Pete'])
print 'my_best_friend:', my_best_friend
print 

#sample numbers without replacement(without dups)
lottery_numbers = range(100)
winning_numbers = random.sample(lottery_numbers, 6)
print 'winning lottery numbers:', winning_numbers
print

#sample with replacement(with dups)
four_with_replacement = [random.choice(range(10)) for _ in range(4)]
print 'four_with_replacement', four_with_replacement
four_uniform_random numbers: [0.4506916641123372, 0.5200461255446185, 0.39018952392156003, 0.5616870973196527]

reproducible random number: 0.57140259469
reproducible random number again: 0.57140259469

random number between 0 and 10: 4
random number between 3 and 6: 4

ordered list: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
shuffled list: [8, 3, 5, 1, 9, 0, 4, 6, 7, 2]

my_best_friend: Skinny Pete

winning lottery numbers: [4, 86, 60, 38, 28, 67]

four_with_replacement [4, 6, 6, 1]
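If you want reproducibility without reseeding the module-wide generator, one option is a separate random.Random instance with its own seed. A sketch (the seed value 10 and the names are arbitrary); it also checks that shuffle() works in place and returns None:

```python
import random

# Two independent generators with the same seed produce the same sequence.
rng_a = random.Random(10)
rng_b = random.Random(10)
draws_a = [rng_a.random() for _ in range(3)]
draws_b = [rng_b.random() for _ in range(3)]

deck = list(range(10))
shuffle_result = rng_a.shuffle(deck)    # reorders deck in place

print(draws_a == draws_b)
print(shuffle_result)
print(sorted(deck) == list(range(10)))
```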

Regular Expressions

Regular expressions provide a way of searching text and are used extensively in NLP. They range from easy to extremely complicated. Some of the important functions from Python's regular expressions (re) module are as follows:

  • re.match() - checks whether the beginning of the string matches the pattern and returns a match object, or None if it does not
  • re.search() - looks for the pattern anywhere in the string and returns a match object for the first occurrence, or None if there is none
  • re.split() - splits the string based on pattern
  • re.sub() - replaces / substitutes the pattern with a replacement value and returns a new string (the original string is unchanged)
In [80]:
import re

print 'does a match cat?', re.match('a', 'cat')
print 'does cat have an a?', re.search('a', 'cat')
print 'does dog have an a?', re.search('a', 'dog')
print 'split carbs on a and b', re.split('[ab]', 'carbs')
print 'replace any numbers with to', re.sub('[0-9]+', 'to', 'from here 2 there')
does a match cat? None
does cat have an a? <_sre.SRE_Match object at 0x11c6b2370>
does dog have an a? None
split carbs on a and b ['c', 'r', 's']
replace any numbers with to from here to there
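Because re.search() and re.match() return a match object or None, the result can be tested for truthiness and then queried. A short sketch using the same example strings as above (variable names are illustrative):

```python
import re

text = "from here 2 there"
match = re.search("[0-9]+", text)
if match:                          # a match object is truthy; None is falsy
    found = match.group()          # the text that matched
    position = match.start()       # index where the match begins
else:
    found, position = None, None

starts_with_c = re.match("c", "cat") is not None   # match() anchors at the start
starts_with_a = re.match("a", "cat") is not None
replaced = re.sub("[0-9]+", "to", text)            # returns a new string

print(found)
print(position)
print(starts_with_c)
print(starts_with_a)
print(replaced)
print(text)
```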

Object-Oriented Programming

Python allows you to create classes that encapsulate data and functions that operate on them.

For example, suppose Python had no built-in implementation of sets and we wanted to build one. We can start by defining a Set class.

In our set class, we would like to have following functions:

  • add : to add items to set
  • remove: to remove items from set
  • contains: to check if a given element is present in the set
In [81]:
class Set:
    
    def __init__(self, values = None):
        self.dict = {}
        if values is not None:
            for value in values:
                self.add(value)
    
    def __repr__(self):
        return 'Set: ' + str(self.dict.keys())  # __repr__ must return a string, not a tuple
    
    def add(self, value):
        self.dict[value] = True
    
    def remove(self, value):
        del self.dict[value]
    
    def contains(self, value):
        return value in self.dict
In [82]:
s = Set([1, 2, 3])
s.add(4)
print 'set contains 3?', s.contains(3)
s.remove(3)
print 'set contains 3?', s.contains(3)
set contains 3? True
set contains 3? False

Functional Tools

When passing functions around, sometimes we want to apply a function only partially to create new functions. For this purpose we can use various functional tools offered by Python. Some of the functions that we would be using are as follows:

  • partial(): allows you to partially fill function with default values and create new functions
  • map(): allows you to apply (or map) a function to every element of a collection
  • filter(): returns elements of a list that satisfy a pre-defined condition or filter
  • reduce(): combines all elements of a collection from left to right
In [83]:
#use of partial() function:
def exp(base, power):
    return base ** power

#compute two to the power without using partial
def two_to_the_power(power):
    return exp(2, power)

#use partial() function to compute results of 2 raised to a power
from functools import partial
two_to_the_power = partial(exp, 2) #two_to_the_power is now a function of just one variable
print 'two to the power 3:', two_to_the_power(3)

#use partial() with a keyword argument to fix the power at 2, so any base can be squared
square_of = partial(exp, power = 2)
print 'square of 3:', square_of(3)
two to the power 3: 8
square of 3: 9
In [84]:
#use map() function
def double(x):
    return 2 * x

xs = [1, 2, 3, 4]
twice_xs = [double(x) for x in xs] #double every element of list using list comprehension
print 'twice_xs created using list comprehension method:', twice_xs
twice_xs = map(double, xs) #double every element of list by using map() function
print 'twice_xs created using map method:', twice_xs
twice_xs created using list comprehension method: [2, 4, 6, 8]
twice_xs created using map method: [2, 4, 6, 8]
In [85]:
#use filter() function
def is_even(n):
    return n%2 == 0

x_evens = [x for x in xs if is_even(x)] #find even numbers in the list using list-comprehension method
print 'x_evens created using list comprehension method:', x_evens
x_evens = filter(is_even, xs) #find even numbers in the list by using filter() function
print 'x_evens created using filter method:', x_evens
x_evens created using list comprehension method: [2, 4]
x_evens created using filter method: [2, 4]
In [86]:
#use reduce() function
from functools import reduce # built in on Python 2; this import is required on Python 3
def multiply(x, y): return x*y

x_product = reduce(multiply, xs) #computes 1 * 2 * 3 * 4
print 'product of all elements of list:', x_product
product of all elements of list: 24
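On Python 3 these tools behave a little differently: map() and filter() return lazy iterators rather than lists, and reduce() lives in functools. The sketch below repeats the helper functions so that it is self-contained, wraps the results in list(), and also passes two lists to map() with a two-argument function:

```python
from functools import reduce   # also available from functools on Python 2.6+

def double(x):
    return 2 * x

def is_even(n):
    return n % 2 == 0

def multiply(x, y):
    return x * y

xs = [1, 2, 3, 4]
twice_xs = list(map(double, xs))        # list() forces the lazy iterator
x_evens = list(filter(is_even, xs))
x_product = reduce(multiply, xs)        # ((1 * 2) * 3) * 4

# map() can walk several lists in parallel with a multi-argument function.
products = list(map(multiply, [1, 2], [4, 5]))

print(twice_xs)
print(x_evens)
print(x_product)
print(products)
```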

enumerate

The enumerate() function can be used to iterate over the indices and items of a list.

In [87]:
a = ["a", "b", "c"]

#non-pythonic way
for i in range(len(a)):
    print i, a[i]

#pythonic way
for index, value in enumerate(a):
    print index, value
0 a
1 b
2 c
0 a
1 b
2 c
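enumerate() also takes an optional start value, which is handy for 1-based numbering. A tiny sketch with illustrative names:

```python
names = ["a", "b", "c"]

# Count from 1 instead of the default 0.
numbered = list(enumerate(names, start=1))

# If only the indexes are needed, discard the value with an underscore.
indexes = [i for i, _ in enumerate(names)]

print(numbered)
print(indexes)
```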

zip and Argument Unpacking

The zip() function can be used to iterate over two or more lists in parallel. zip() transforms multiple lists into a single list of tuples of corresponding elements.

In [88]:
#example1:
list1 = ['a', 'b', 'c']
list2 = [1, 2, 3]
print 'list1 and list2 zipped together:', zip(list1, list2)

#example2:
a = [1, 2, 3]
b = [3, 4, 5]
c = [6, 7, 8]

for i, j, k in zip(a, b, c):
    print i, j, k


### Use zip() and enumerate() together
alist = ['a1', 'a2', 'a3']
blist = ['b1', 'b2', 'b3']

for i, (a,b) in enumerate(zip(alist, blist)):
    print i, a, b
list1 and list2 zipped together: [('a', 1), ('b', 2), ('c', 3)]
1 3 6
2 4 7
3 5 8
0 a1 b1
1 a2 b2
2 a3 b3

You can also unzip a list using *.

In [89]:
pairs = [(1, 'one'), (2, 'two'), (3, 'three')]
numbers, letters = zip(*pairs)
print 'numbers in list:', numbers
print 'letters in list:', letters

#using argument unpacking with any function. 
def add(a, b): return a + b

print 'add 1 and 2 by using simple function call:', add(1,2)
# print 'add 1 and 2 by passing list [1,2] to add() method:', add([1, 2]) #produces error
print 'add 1 and 2 by passing unpacked list [1,2] to add() method:', add(*[1, 2])
numbers in list: (1, 2, 3)
letters in list: ('one', 'two', 'three')
add 1 and 2 by using simple function call: 3
add 1 and 2 by passing unpacked list [1,2] to add() method: 3
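Two further behaviors of zip() are worth knowing: it stops at the shortest input, and its pairs can be fed straight to dict(). A sketch with made-up lists (list() is needed on Python 3, where zip() is lazy):

```python
letters = ["a", "b", "c"]
numbers = [1, 2]

# zip stops as soon as the shortest input runs out, so "c" is dropped.
pairs = list(zip(letters, numbers))

# dict() accepts an iterable of (key, value) pairs.
lookup = dict(zip(["x", "y"], [10, 20]))

print(pairs)
print(lookup)
```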

args and kwargs

*args allows us to pass a variable number of arguments to a function. Let's take an example to make this clear.

In the following example, we first build a function, add_fixed(), that adds two numbers. It accepts exactly two arguments. If you want to pass more than two, this is where *args comes into play.

In [90]:
#function to add numbers:
def add_fixed(a, b):
    return a + b

def add_variable(*args):
    total = 0            # avoid naming this `sum`, which would shadow the built-in
    for num in args:
        total += num
    return total

# print 'add 3 numbers using add_fixed():', add_fixed(3, 2, 1) #gives an error
print 'add 3 numbers using add_variable()', add_variable(3, 2, 1)
print 'add 6 numbers using add_variable()', add_variable(3, 2, 1, 5, 6, 7)
add 3 numbers using add_variable() 6
add 6 numbers using add_variable() 24

Note: the name args is just a convention; you can use any valid identifier. For example, *myargs is perfectly valid.

**kwargs allows us to pass a variable number of keyword arguments, like this: func_name(name='tim', team='school')

In [91]:
def my_func(**kwargs):
    for i, j in kwargs.items():
        print(i, j)
        
        
my_func(name='tim', sport='football', roll=19)
('sport', 'football')
('name', 'tim')
('roll', 19)
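A typical use of *args and **kwargs together is a wrapper that forwards whatever it receives to another function. This is a sketch only; `doubler`, `add_two`, and `scaled_sum` are hypothetical names:

```python
def doubler(f):
    """Return a new function that calls f and doubles its result."""
    def g(*args, **kwargs):
        # Forward every positional and keyword argument unchanged.
        return 2 * f(*args, **kwargs)
    return g

def add_two(a, b):
    return a + b

def scaled_sum(a, b, scale=1):
    return scale * (a + b)

double_add = doubler(add_two)
double_scaled = doubler(scaled_sum)

print(double_add(1, 2))
print(double_scaled(1, 2, scale=3))
```

Because g accepts *args and **kwargs, doubler works for functions with any signature.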

For Further Exploration

Note: Get the code and examples from the book for Chapters 3 - 24 here

Finally, thanks for coming and reach out to Meghann or Chai if you have any questions on these notes!