Chapter 6

Learning Python

Python is one of the most popular programming languages. It’s broadly used for programming web applications, writing automation scripts, accessing data, processing text, and analyzing data. Many software packages that are useful for data analysis (like NumPy, SciPy, and pandas) and machine learning (scikit-learn, TensorFlow, Keras, and PyTorch) can be integrated into a Python application in a few lines of code. In this chapter, we explore the programming language using an approach similar to the one we took for C++ and Java. In addition, we explore tools and packages that help accelerate the development of data-driven applications using Python.

Note: The examples below are abridged; the book contains more details.

Getting Started
Objects
Importing Modules
Executing Shell Commands
Scalar Data Types
Strings
Duck Typing
Tuples
Lists
Ranges
Slicing
Sets
Dictionaries
Counters
Dictionaries with Default Values
Hashable Objects
List Comprehensions
Set Comprehensions
Dictionary Comprehensions
Nested Comprehensions
Control Flow
The Empty Statement
Functions - Part I
Functions - Part II
Functions - Part III
Classes
Inheritance
The Empty Class
The NumPy Package
Linear Algebra
Random Number Generation
The SciPy Package
The pandas Package
The scikit-learn Package
Clustering
Classification
Regression
Reading and Writing Data in Text Format
Reading and Writing Ndarrays
Reading and Writing Dataframes
Material Differences between Python 3 and 2

Getting Started

The Python programs in this book target Python 3.5.1 (the latest version available at the time of writing). Python 2.x is legacy and Python 3.x offers several new features but it’s not entirely backward-compatible. The code examples in this chapter should run as expected using either version (unless otherwise stated). The code segment below demonstrates a short Python program:

# This is a comment
print("hello world")

Objects

Variable names in Python are untyped and refer to typed objects in memory:

# variables are dynamically typed
x = 3
x = "three" # ok in Python

x = [1, 2, 3]  # x is a list
y = x  # both x and y refer to the same list
y.append(4)  # appending to y impacts x as well
print(x)

s = "three"
# call upper, a member method of the string object
t = s.upper()
print(t)
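
The aliasing shown above can be checked with the is operator, which compares object identity rather than value. The following is a minimal sketch; the names alias and clone are illustrative and do not appear in the original example:

```python
x = [1, 2, 3]
alias = x          # a second name for the same list object
clone = list(x)    # a new list with the same elements (a shallow copy)

alias.append(4)    # visible through x, because both names share one object
clone.append(5)    # affects only the copy

print(x is alias)  # identity check: same object
print(x is clone)  # identity check: different objects
print(x)
print(clone)
```

Copying with list(x) is one way to avoid unintended changes through an alias.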

Importing Modules

A Python module is a code object loaded into the Python program by the process of importing. Suppose we have a module file my_vars.py; the definitions in this file can be accessed in a different file using the appropriate import command:

importing-modules.py

import my_vars
# accessing variables in my_vars module
area = my_vars.LENGTH * my_vars.WIDTH

# the following imports the module using an alias
import my_vars as mv
area = mv.LENGTH * mv.WIDTH 

# the following makes the definitions LENGTH and WIDTH accessible
# in the current module without the need for the module prefix
from my_vars import LENGTH, WIDTH
area = LENGTH * WIDTH  

from my_vars import *
area = LENGTH * WIDTH

my_vars.py

LENGTH = 24
WIDTH = 48

Executing Shell Commands

Shell commands can be executed from within Python using the function system("shell command"), which is defined in the module os. A newer alternative to os.system that provides much more flexible functionality is the call function, which is defined in the subprocess module:

import os
os.system("ls -al")  # call shell command ls -al

import subprocess as sp
sp.call(["ls", "-al"])
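
When the output of a command is needed inside the program, the run function of the subprocess module (available since Python 3.5) can capture it. The sketch below is an illustration only: it launches the current Python interpreter instead of ls so that it does not depend on a particular shell, and the command string is an assumption made for the example:

```python
import subprocess
import sys

# run the current Python interpreter as a child process
result = subprocess.run(
  [sys.executable, "-c", "print('hello from a child process')"],
  stdout=subprocess.PIPE,    # capture standard output
  universal_newlines=True,   # decode the captured bytes into a str
)
print(result.returncode)      # 0 indicates success
print(result.stdout.strip())  # the text printed by the child process
```

Passing the command as a list of arguments, as in sp.call above, avoids having a shell parse the command string.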

Scalar Data Types

Python’s scalar data types and operators are similar to those in C++ and Java. That said, Python’s boolean operators are expressed using keywords: and, or, and not:

a = 42
b = float(a)
c = str(a)
d = bool(a)
# print the type and value for each object above
for x in a, b, c, d:
  print("type: {}, value: {}".format(type(x), x))

x = True
y = False
print(x and y)
print(x or y)
print(not x)

int("a")
## Traceback (most recent call last):
##   File "<stdin>", line 1, in <module>
## ValueError: invalid literal for int() with base 10: 'a'
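
A failed conversion like the one above can be handled with a try/except block. The helper to_int below is a hypothetical name introduced for this sketch; it is not part of the standard library:

```python
def to_int(text, default=None):
  '''Convert text to an int, or return default if that is not possible.'''
  try:
    return int(text)
  except ValueError:
    return default

print(to_int("42"))
print(to_int("a"))             # conversion fails, so the default is returned
print(to_int("a", default=0))
```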

Strings

Python strings are immutable sequences of characters that can be written with single, double, or triple quotes. The examples below cover escape sequences, raw strings, indexing, concatenation, and several ways of formatting:

s = ''' this is a string
that spans
multiple lines'''
print('\n line')  # newline followed by "line"
print(r'\n line') # avoid escape character substitution

s = "Python"
print(s[2])  # print third character of string s
print(s + str(123))  # concatenate s with "123"
print(s.replace('P', 'p'))  # replace 'P' in s with 'p'

# Example: formatting using the format method
import math
print('Formatting in {} is simple and powerful'.format('Python'))
print('Refer to {1} by {0}'.format('index', 'fields'))
print('Use {name} too'.format(name='keyword arguments'))
print('Rounding: pi = {:.3}'.format(math.pi))
print('Attributes: pi = {0.pi:.3}'.format(math))
print('And {0[1][0]} more!'.format(['so', ['much', ['more']]]))

# Example: formatting using the % operator
value = '%'
print('Formatting using the %s operator is error-prone.' % value)
print('Values must be specified %s %s.' % ('in', 'order'))
value = ('a', 'tuple')
# Wrap value as (value,); otherwise we get a TypeError
print('If a value is %s, things get complicated!' % (value,))
value = {'data': 'dictionary'}
print('Using a %(data)s key works though.' % value)
print('Rounding: pi = %.3f and e = %.3f.' % (math.pi, math.e))
print('%% has to be escaped when formatting.' % ())

# Example: formatting using the Template class
from math import pi
from string import Template as T # aliased for brevity

print(T('$id work').substitute(id='Keyword arguments'))
print(T('$id works too').substitute({'id': 'A dictionary'}))
print(T('Note the ${id}').substitute(id='differences'))
print(T('pi = ${id}').substitute(id=pi))
print(T('$$ has to be escaped').substitute())

# Example: formatting using literal string interpolation
# (f-strings; requires Python 3.6 or later)
value = 'string interpolation'
print(f'The use of {value} is awesome!') # more readable
print(f"Another PI example: pi = {pi:.3f}")

Duck Typing

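Python relies on duck typing: a function accepts any object that supports the operations it uses, regardless of the object’s class. The interpreter tries to execute the code, and if it cannot, a runtime error occurs: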

class Duck:
  def quack(self):
    print("Quack")
  def walk(self):
    print("Shuffle")

class Fox:
  def quack(self):
    print("Quackkkkk")
  def walk(self):
    print("Shuffffle")
  def kill(self):
    print("Yum!")

def foo(x):
  x.quack()
  x.walk()

donald = Duck()
swiper = Fox()
foo(donald)
foo(swiper)
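
To see the runtime error mentioned above, we can pass an object that lacks one of the expected methods. This is a minimal sketch; the Robot class is hypothetical and foo is repeated so that the snippet is self-contained:

```python
class Robot:
  def walk(self):
    print("Clank")

def foo(x):
  x.quack()
  x.walk()

try:
  foo(Robot())  # Robot has no quack method
  outcome = "no error"
except AttributeError as error:
  outcome = "AttributeError"
  print("Not a duck:", error)
```

The error is raised only when the missing method is actually called, not when foo is defined.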

Tuples

A tuple is an immutable sequence of objects of potentially different types. Each object can be referred to using the square bracket index notation. Tuples are defined using parentheses (or the built-in function tuple):

a = (1, 2) # tuple of two integers
b = 1, 2 # alternative way to define a tuple
a[0] = 3 # error: modifying an immutable tuple after its creation

a = (1, 2)  # tuple of two integers
b = (1, "one")  # tuple of an integer and a string
c = (a, b)  # tuple containing two tuples
print(b[1])  # prints the second element of b
print(c[0])  # prints the first element of c

# Example: unpacking
t = (1, 2, 3)  # tuple of three integers
(a, b, c) = t  # copy first element of t to a, second to b, etc.
a, b, c = t  # alternative syntax to accomplish the same thing
print(a)
print(b)
print(c)

# Example: nested tuples
t = (1, (2, 3))  # tuple containing an integer and a tuple
# copy the first element of t to a, second to (b, c)
(a, (b, c)) = t
a, (b, c) = t
print(a)
print(b)
print(c)

# Example: erroneous unpacking
a, b, c = t  # error: not enough values to unpack
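
When the number of elements is not known in advance, Python 3 allows one starred name in the unpacking target to collect the leftover elements. A short sketch, with variable names chosen only for illustration:

```python
t = (1, 2, 3, 4, 5)
# middle collects the elements between the first and the last as a list
first, *middle, last = t
print(first)
print(middle)
print(last)
```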

Lists

Lists are similar to tuples, but they’re mutable, and they’re defined using square brackets (or the built-in function list):

a = [1, 2, 3]  # list containing three integers
b = [a, "hello"]  # list containing a list and a string
b.append(4)  # modify existing list by appending an element
print(a)
print(b)

# Example: inserting, removing, and sorting elements
a = [1, 2, 3] 
a.insert(1, 100)  # insert 100 before index 1
print(a)
a.remove(2)  # remove first occurrence of 2
print(a)
a.sort()  # sort list
print(a)
a.extend([20, 21, 23])  # extend list with an additional list
print(a)
b = sorted(a)
print(b)

# Example: deleting an element
a = ['hello', ', ', 'world']
print(len(a))
del a[1]
print(len(a))
print(a)

# Example: the + operator
print([1, 2, 3] + [4, 5] + [6])

Ranges

The built-in function range is often used to generate a sequence of evenly spaced integers:

print(list(range(0, 10, 1)))
print(tuple(range(0, 10))) # same as tuple(range(0, 10, 1))
print(tuple(range(10))) # same as tuple(range(0, 10))
print(tuple(range(0, 10, 2))) # step size of 2

Slicing

Slicing refers to accessing a slice of a list, tuple, or string using a range of integers representing the corresponding indices:

a = list(range(10))
print(a[:])  # starting index (0) to end index (9)
print(a[3:])  # index 3 to end (9)
print(a[:3])  # index 0 to 2
print(a[0:10:2])  # index 0 to 9, skipping every other element

# Example: negative values
a = list(range(10))
print(a[::-1])  # start index (0) to end (9) - backwards
print(a[-3:])  # index 3 from the end (7) to the end (9)

# Example: zipping
a = [1, 2, 3]
b = ["one", "two", "three"]
print(list(zip(a, b)))

a = [1, 2, 3]
b = ["one", "two", "three"]
c = ["uno", "dos", "tres"]
print(list(zip()))
print(list(zip(a)))
print(list(zip(a, b, c)))

Sets

A set is an unordered collection of unique immutable objects (yet the set itself is mutable). A set can be defined using curly braces, with objects separated by commas:

s = {1, 2, 3, 2}  # duplicates in sets are ignored
print(s)

# Example: set operations
a = set([1, 2, 3]) # make a set from a list
b = set((2, 3, 4)) # make a set from a tuple
print(a | b)  # union
print(a & b)  # intersection
print(a - b)  # set-difference
print(a.isdisjoint(b), a.issubset(b), a.issuperset(b))
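
Because a set is mutable while its elements must be immutable, a set cannot contain another set, but it can contain a frozenset. The following sketch illustrates in-place modification, the symmetric difference operator, and frozenset; the variable names are illustrative:

```python
s = {1, 2, 3}
s.add(4)        # add an element in place
s.discard(1)    # remove an element
s.discard(42)   # discarding an absent element is not an error
print(sorted(s))

a = {1, 2, 3}
b = {2, 3, 4}
print(sorted(a ^ b))  # symmetric difference: elements in exactly one set

f = frozenset(a)  # an immutable (and therefore hashable) set
nested = {f}      # a frozenset can be an element of another set
print(f in nested)
```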

Dictionaries

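A dictionary is a mutable compound data type representing a set of (key, value) pairs, where each key may appear at most one time. Dictionaries are unordered in the Python version targeted here (since Python 3.7 they preserve insertion order):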

# Example: create 3 dicts with three key-value pairs:
# ('Na', 11), ('Mg', 12), and ('Al', 13)
d = dict((('Na', 11), ('Mg', 12), ('Al', 13)))
e = {'Na': 11, 'Mg': 12, 'Al': 13}
f = dict(Na=11, Mg=12, Al=13)
print(d == e == f)  # check equivalence
print(d.keys())  # print all keys (unordered)
print(d.values())  # print all respective values

# Example: create a dict with three key-value pairs
d = {'Na': 11, 'Mg': 12, 'Al': 13}
print(d['Na'])  # retrieve value corresponding to key 'Na'
print('Li' in d)  # check if 'Li' is a key in dict d

# Example: create a dictionary with three key-value pairs
d = {'Na': 11, 'Mg': 12, 'Al': 13}
d['Li'] = 3 # add the key value pair ('Li', 3)
del d['Na']  # remove key-value pair corresponding to key 'Na'
print(d)  # print dictionary

# Example: create a dictionary from two lists: keys and values
keys = ['Na', 'Mg', 'Al']
values = [11, 12, 13]
d = dict(zip(keys, values))
print(d)

# Example: iterate over all items in a dictionary
d = dict(Na=11, Mg=12, Al=13)
# iterate over all items in d
for key, value in d.items():
  print(key, value)

# Example: a double lookup (an anti-pattern)
d = dict(Na=11, Mg=12, Al=13)
for key in d: # similar to for key in iter(d):
  print(key, d[key])

# Example: a double lookup (again)
counts = dict(apples=3, oranges=5, bananas=1)
key = 'apples'
if key in counts: # first lookup
  print(counts[key]) # second lookup
else:
  print(0)

# Example: avoiding double lookups
counts = dict(apples=3, oranges=5, bananas=1)
key = 'apples'
print(counts.get(key, 0)) # single lookup

Counters

Python provides a dict subclass called Counter, which is mainly used for frequency-counting of its keys:

from collections import Counter
counts = Counter(apples=3, oranges=5, bananas=1)
print(counts)

# Example: split the text into a list of words then count them
counts = Counter('to be or not to be'.split())
print(counts)
print(counts['question'])  # counters can be used as sparse vectors

# Example: repeat the keys as many times as their respective counts
print(tuple(counts.elements())) # unordered

# Example: get the n most common elements
print(tuple(counts.most_common(2)))  # most common first

# Example: addition and subtraction
other = Counter('love or war'.split())
# accumulate two counters together
print(dict(counts + other))
# subtract counts (and only keep positive counts)
print(dict(counts - other))

Dictionaries with Default Values

An alternative to the dict class for use as a sparse vector is defaultdict, which provides default values for nonexistent keys using a factory function that creates said default value:

from collections import defaultdict

# the default value is what int() returns: 0
d = defaultdict(int)
print(d[13])

# example: grouping elements by their respective keys
books = [  # documents to index: a list of (title, text) tuples
  ('hamlet', 'to be or not to be'),
  ('twelfth night', 'if music be the food of love'),
]
index = defaultdict(set)  # the default value is what set() returns: an empty set
for title, text in books:
  for word in text.split():
    # get or create the respective set
    # then add the title to it
    index[word].add(title)

from pprint import pprint  # pretty-printer (recommended for nested data)
pprint(index)
print(index['music'])  # query for documents that have the word 'music'

# example: building a trie (aka prefix tree)
words = { 'apple', 'ban', 'banana', 'bar', 'buzz' }
# a factory function/lambda expression that returns
# a defaultdict which uses it as a factory function
factory = lambda: defaultdict(factory)
trie = defaultdict(factory)
DELIMITER = '--END--'
LEAF = dict()
for word in words:
  level = trie
  for letter in word:
    level = level[letter]
  level[DELIMITER] = LEAF

def autocomplete(root, prefix, suffix=''):
  if DELIMITER in root:
    print(prefix + suffix)
  for key, value in root.items():
    autocomplete(value, prefix, suffix + key)

autocomplete(trie['b']['a'], 'ba')  # query for words that start with 'ba'

Hashable Objects

In addition to the requirement that each key may appear at most once, dictionary keys (and set elements) are required to be hashable:

print(hash("abc"))  
print(hash((1, 2, 3)))
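
The requirement becomes visible when a mutable object such as a list is used as a key. A minimal sketch (the dictionary d and the flag failed are illustrative names):

```python
d = {}
d[(1, 2)] = "a tuple key works"  # a tuple of hashable objects is hashable

try:
  d[[1, 2]] = "a list key fails"  # lists are mutable and not hashable
  failed = False
except TypeError as error:
  failed = True
  print("TypeError:", error)

print(d)
```

Converting the list to a tuple first is a common way to use its contents as a key.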

List Comprehensions

For many constructs in Python, there’s a dull way of doing things, and then there’s the Pythonic way:

from collections import Counter

# Example: a simple, mundane transformation to a list
words = ['apple', 'banana', 'carrot']
modes = []
for word in words:
  counter = Counter(word)
  # most_common(n) returns a list and the *
  # operator expands it before calling append(x)
  modes.append(*counter.most_common(1))
print(modes)

# Example: the Pythonic way (using a list comprehension)
print([Counter(w).most_common(1)[0] for w in words])

# Example: select and transform
print([Counter(w).most_common(1)[0] \
       for w in words if len(w) > 5])

# Example: mimicking nested for loops and multiple conditionals
x = (0, 1, 2, 7)
y = (9, 0, 5)
print([(a, b) for a in x for b in y if a != 0 and b != 0])
# alternatively
print([(a, b) for a in x if a != 0 for b in y if b != 0])

Set Comprehensions

Set comprehensions work almost exactly the same way as list comprehensions, except that the result is a set (where repeated elements are not allowed) and it’s expressed using curly braces:

champions = [
  (2014, 'San Antonio Spurs'),
  (2015, 'Golden State Warriors'),
  (2016, 'The Cleveland Cavaliers'),
  (2017, 'Golden State Warriors'),
  (2018, 'Golden State Warriors'),
]
for team in {c[1] for c in champions}: # unordered
    print(team)

Dictionary Comprehensions

Similarly, dictionary comprehensions work almost exactly the same way as set comprehensions, except that the result is a dictionary (where a unique key is mapped to a value):

from pprint import pprint

champions = [
  (2014, 'San Antonio Spurs'),
  (2015, 'Golden State Warriors'),
  (2016, 'The Cleveland Cavaliers'),
  (2017, 'Golden State Warriors'),
  (2018, 'Golden State Warriors'),
]
pprint({c[0]: c[1] for c in champions})

# Example: nonunique keys
pprint({c[1]: c[0] for c in champions})

Nested Comprehensions

Comprehensions can be nested; the expression that evaluates to an element in the result can be a comprehension:

# create a 3x5 matrix of 1's
matrix = [[1 for j in range(5)] for i in range(3)]
print(matrix)
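
Nesting a comprehension inside the element expression differs from writing two for clauses in a single comprehension: the former builds a list of lists, while the latter produces a flat list. The sketch below contrasts the two using a small matrix chosen for illustration:

```python
matrix = [[1, 2, 3], [4, 5, 6]]  # a 2x3 matrix as a list of rows

# nested comprehension: one inner list per column index j (the transpose)
transposed = [[row[j] for row in matrix] for j in range(3)]
print(transposed)

# a single comprehension with two for clauses flattens the matrix
flat = [value for row in matrix for value in row]
print(flat)
```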

Control Flow

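Python delimits blocks by indentation rather than braces. The examples below cover conditionals, for-loops, and the else clause that runs when a loop completes without a break: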

# Example: if-else
x = 3
if x > 0:
  # if block
  print("x is positive")
  sign = 1
elif x == 0:
  # elif (else-if) block
  print("x equals zero")
  sign = 0
else:
  # else block
  print("x is negative")
  sign = -1
print(sign)  # new block

# Example: for-loops
a = [1, 2, 3, -4]
for x in a:
  # start of the for block
  print(x)

print(x)  # new block, but x is still in scope!

# Example: the enumerate function
a = [1, 2, 3, -4]
for i, x in enumerate(a):
  print('a at', i, '=', x)

# Example: a loop combined with a conditional
a = [1, 2, 3, -4]
abs_sum = 0
for x in a:
  if x > 0:
    abs_sum += x
  else:
    abs_sum += -x
print(abs_sum)

# Example: alternatively, the Pythonic way to replace the above
print(sum(abs(x) for x in a))

# Example: else as a completion clause
s = 'Jane is 42 years old'
for c in s:
  if c.isdigit():
    print('Found:', c)
    break
else:
  print('No digit was found!')

# Example: when the else completion clause is executed
budget = 13
costs = [5, 3, 2]
for i, cost in enumerate(costs):
  print('Item', i, 'for', cost, 'unit(s) of cost: ', end='')
  budget -= cost
  if budget > 0:
    print('acquired')
  else:
    print('insufficient funds')
    break
else:
  print('Remaining budget:', budget)

The Empty Statement

The empty statement, pass, is another example of Python syntax that one might find confusing at first glance:

budget = 13
costs = [5, 3, 2]
for i, cost in enumerate(costs):
  pass # TODO: put off until tomorrow
else:
  print('Remaining budget:', budget)

Functions - Part I

Here are a few examples of user-defined functions in Python:

def scale(vector, factor):
  return [x * factor for x in vector]

print(scale([1, 2, 3], 2))
print(scale({1, 2, 3}, 2)) # unordered
print(scale(factor=2, vector=(1, 2, 3)))
# the * operator works with lists and tuples as well,
# but it has different semantics:
print(scale([(1, 2, 3)], 2))

# Example: recursion
from collections.abc import Sequence

def scale(vector, factor, recursively=False):
  if recursively:
    # call scale recursively for sequence elements
    return [scale(x, factor, True) if isinstance(x, Sequence)
            else (x * factor) for x in vector]
  else:
    return [x * factor for x in vector]

print(scale([(1, 2, 3)], 2))  # recursively is False by default
print(scale([(1, 2, 3)], 2, recursively=True))

# Example: recursion (two functions)
def scale_recursively(vector, factor):
  return [scale_element(x, factor) for x in vector]

def scale_element(element, factor):
  if isinstance(element, Sequence):
    return scale_recursively(element, factor)
  else:
    return element * factor

print(scale_recursively([1, [2, 3]], 2))

# Example: nested functions
def scale_recursively(vector, factor):
  def scale_element(element):
    if isinstance(element, Sequence):
      return scale_recursively(element, factor)
    else:
      return element * factor
  return [scale_element(x) for x in vector]

print(scale_recursively([1, [2, 3]], 2))

# Example: currying
def create_line_printer(prefix):
  def print_suffix(suffix):
    print(prefix, suffix)
  return print_suffix

write = create_line_printer('[currying example]')
write('this is useful for behavior reuse')
write('and for testing multiple behaviors')
write = create_line_printer('[a prefix to test]')
write('the call of write(x) did not change')
write('even though the prefix had changed')

# Example: passing a function as a parameter
def sort_by_second(pairs):
  def get_second(pair):
    return pair[1]
  return sorted(pairs, key=get_second)

pairs = ((1, 3), (4, 2))
print(sort_by_second(pairs))

# Example: passing a callable object as a parameter
pairs = ((1, 3), (4, 2))
from operator import itemgetter
print(sorted(pairs, key=itemgetter(1)))
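
The create_line_printer example above fixes one argument by hand using a nested function. The standard library offers functools.partial for the same purpose. The sketch below is an alternative illustration, not the book's approach, and print_line is a hypothetical helper:

```python
from functools import partial

def print_line(prefix, suffix):
  print(prefix, suffix)
  return prefix + ' ' + suffix

# partial returns a new callable with the first argument already supplied
write = partial(print_line, '[partial example]')
line = write('only the suffix is supplied here')
```

As with the nested-function version, callers of write do not need to know which prefix was chosen.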

Functions - Part II

Here are a few more examples of functions in Python:

# Example: nested functions - variable scope (erroneous)
# create_accumulator itself succeeds, but calling the function it
# returns fails with an UnboundLocalError such as:
# "local variable 'tally' referenced before assignment"
def create_accumulator(seed=0):
  tally = seed
  def accumulate(x):
    tally += x # local to accumulate(x)
    return tally
  return accumulate

# Example: nested functions - variable scope (the nonlocal statement)
def create_accumulator(seed=0):
  tally = seed
  def accumulate(x):
    nonlocal tally # the fix
    tally += x
    return tally
  return accumulate

accumulate = create_accumulator()
print(accumulate(2))
print(accumulate(3))
print(accumulate(5))

# Example: the global statement
tally = 0
def accumulate(x):
  global tally
  tally += x
  return tally

print(accumulate(2))
print(accumulate(3))
print(accumulate(5))

# Example: returning multiple objects
def foo(x, y):
  return x + y, x - y

a, b = foo(1, 2)  # unpack the returned tuple into a and b
print(a)
print(b)

# Example: a variadic function
# the syntax *lines makes this function variadic
def print_lines(prefix, *lines):
  for line in lines:
    print(prefix, line)

# call the function with varargs
print_lines('[varargs example]', 'hello', 'world')

# Example: an equivalent to the example above
# no * operator, just an ordinary tuple
def print_lines(prefix, lines):
  for line in lines:
    print(prefix, line)

# call the function with a tuple parameter
print_lines('[tuple example]', ('hello', 'world'))

# Example: unpacking the argument list
def power(base, exponent):
  return base ** exponent

data = [2, 3] # or (2, 3)
print(power(*data))

# Example: unpacking the argument list from a dictionary
def power(base, exponent):
  return base ** exponent

data = {'base': 2, 'exponent': 3}
print(power(**data))

# Example: named parameters packed into a dictionary
def print_latest_championship(**data):
  for team, year in data.items():
    print(team + ':', year)  # dictionaries are unordered

print_latest_championship(cavaliers=2016, warriors=2018)

# Example: combining regular parameters, *args, and **kwargs
def print_latest_championship(prefix, *args, **kwargs):
  for line in args:
    print(prefix, line)
  for team, year in kwargs.items():
    print(prefix, team + ':', year)  # dictionaries are unordered

print_latest_championship(
  '[NBA]',
  'Conference Finals',
  'Team: Year',
  cavaliers=2016,
  warriors=2018)

Functions - Part III

In this part, we explore anonymous functions:

# Example: a lambda expression is a very convenient way to create
# a single-expression anonymous function
pairs = ((1, 3), (4, 2))
print(sorted(pairs, key=lambda p: p[1]))

# Example: a lambda expression can be named and reused
pairs = ((1, 3), (4, 2))
get_second = lambda p: p[1]
print(sorted(pairs, key=get_second))
print('max:', max(pairs, key=get_second))

# Example: a recursive lambda
factorial = lambda x: factorial(x - 1) * x if x else 1
print(factorial(5))

# Example: a lambda with multiple arguments
power = lambda x, y: x ** y
print(power(5, 2))

# Example: a variadic lambda
from math import sqrt
norm = lambda *args: sqrt(sum(x * x for x in args))
print('norm:', norm(1, -2, 2))

# Example: a trick to print a lambda's arguments
norm = lambda *a: 0 if print(a) else sqrt(sum(x * x for x in a))
print('norm:', norm(1, -2, 2))

Classes

As in Java and C++, Python classes contain fields (variables) and methods (functions) and allow inheritance:

# Example: a point in a two-dimensional coordinate space
class Point:
  def __init__(self, x, y):
    self.x = x
    self.y = y
  def __del__(self):
    print("destructing a Point object")

p1 = Point(3, 4)
p2 = Point(1, 2)
print("p1.x = {0.x}, p1.y = {0.y}".format(p1))
print("p2.x = {0.x}, p2.y = {0.y}".format(p2))

# Example: a class field (num_points) and instance fields (x and y)
class Point:
  '''Represents a point in a two-dimensional coordinate space.'''
  num_points = 0

  def __init__(self, x, y):
    Point.num_points += 1 
    self.x = x
    self.y = y

  def __del__(self):
    Point.num_points -= 1
    print("destructing a Point object")
    print("{} Point objects left".format(Point.num_points))


print(Point.__doc__)
p1 = Point(3, 4)
p2 = Point(1, 2)
print("p1.x = {0.x}, p1.y = {0.y}".format(p1))
print("p2.x = {0.x}, p2.y = {0.y}".format(p2))
print("number of objects:", Point.num_points)

Inheritance

Class inheritance in Python is indicated by including the base class in parentheses after the class name in the class definition:

class Point:
  '''Represents a point in a 2-D coordinate space.'''
  num_points = 0

  def __init__(self, x, y):
    Point.num_points += 1
    self.x = x
    self.y = y

  def __del__(self):
    Point.num_points -= 1
    print("destructing a Point object")
    print("{} points left".format(Point.num_points))


class NamedPoint(Point):
  '''Represents a named point in a 2-D coordinate space.'''
  num_points = 0

  def __init__(self, x, y, name):
    # call superclass constructor
    super().__init__(x, y)
    NamedPoint.num_points += 1
    self.name = name

  def __del__(self):
    super().__del__()
    NamedPoint.num_points -= 1
    print("destructing a NamedPoint object")
    print("{} named points left".format(NamedPoint.num_points))


np = NamedPoint(0, 0, "origin point")
print("number of named points:", NamedPoint.num_points)
print("x = {0.x}, y = {0.y}, name = {0.name}".format(np))

The Empty Class

When working with records or data transfer objects (DTOs), all that you need is a bunch of data members grouped together into a class. In Python, that can be easily achieved using an empty class definition:

class Element:
  '''Represents an element in the periodic table.'''
  pass


na = Element()
na.atomic_number = 11
na.name = 'Sodium'

print(f'Atomic Number: {na.atomic_number}; Name: {na.name}')
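
The standard library also provides types.SimpleNamespace (Python 3.3 and later), which behaves much like the empty class above but accepts the attributes as keyword arguments and compares by content. The sketch below is offered as an alternative under that assumption; the symbol attribute is added purely for illustration:

```python
from types import SimpleNamespace

na = SimpleNamespace(atomic_number=11, name='Sodium')
na.symbol = 'Na'  # attributes can still be added after creation
print(na)
# two namespaces with the same attributes compare equal
print(na == SimpleNamespace(atomic_number=11, name='Sodium', symbol='Na'))
```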

The NumPy Package

NumPy is a Python package that facilitates working with multidimensional arrays, vectors, matrices, linear algebra, etc.

import numpy as np

# Example: ndarray objects
# create 2x3 ndarray of floats from a list of lists
m = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32)
print("m.shape =", m.shape)
print("m.dtype =", m.dtype)
print("m =\n" + str(m))
m = m.astype(np.int32)  # cast the ndarray from float to int
print("m =\n" + str(m))

# Example: other ways to create ndarray objects
m1 = np.zeros((2, 3)) # create a 2x3 ndarray of zeros
m2 = np.identity(3) # the 3x3 identity matrix
m3 = np.ones((2, 3, 2)) # create a 2x3x2 ndarray of ones
print("m1 =\n" + str(m1))
print("m2 =\n" + str(m2))
print("m3 =\n" + str(m3))

# Example: ndarray operations
m1 = np.identity(3)
m2 = np.ones((3, 3))
print("2 * m1 =\n" + str(2 * m1))
print("m1 + m2 =\n" + str(m1 + m2))

# Example: the sublist notation in the one-dimensional case
a = np.array(range(9))
print("a[:] =", a[:])
print("a[3] =", a[3])
print("a[3:6] =", a[3:6])
a[3:6] = -1
print("a =", a)

# Example: the logical condition in the one-dimensional case
a = np.array(range(7))
# create a bool (mask) ndarray of True/False
print("a < 3 =", a < 3)
# refer to all elements that are lower than 3
print("a[a < 3] =", a[a < 3])
a[a < 3] = 0  # replace elements lower than 3 by 0
print("a =", a)

# Example: the sublist notation in the two-dimensional case
m = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
print("m =\n" + str(m))
print("m[:, 1] =\n", m[:, 1]) # second column
# first two rows of the 2nd column
print("m[0:2, 1] =\n", m[0:2, 1])
# first 2 rows of the first two columns
print("m[0:2, 0:2] =\n" + str(m[0:2, 0:2]))

# Example: the logical condition in the two-dimensional case
m = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
print("m =\n" + str(m))
print("m < 3 =\n" + str(m < 3))
print("m[m < 3] =\n", m[m < 3])
m[m < 3] = 0
print("m =\n" + str(m))

# Example: another way to refer to a subset of a two-dimensional ndarray
m = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
print("m =\n" + str(m))
# first row
print("m[0] =\n", m[0])
# first and third rows
print("m[[0, 2]] =\n" + str(m[[0, 2]]))
# third and first rows
print("m[[2, 0]] =\n" + str(m[[2, 0]]))
# refer to the (0, 2) and (1, 0) elements
print("m[[0, 1], [2, 0]] =\n", m[[0, 1], [2, 0]])

# Example: assignment
a = np.array(range(9))
b = a[0:4] # b refers to elements 0-3
b[1] = 42 # modify second element of b
print(a) # a is modified as well

Linear Algebra

NumPy implements many arithmetic operators and linear algebraic concepts for ndarrays:

import numpy as np

# Example: the mean function for ndarrays, columns, and rows
m = np.array([[0, 1], [2, 3]])
# global average
print("np.mean(m) =", np.mean(m))
# average of columns
print("np.mean(m, axis=0) =", np.mean(m, axis=0))
# average of rows
print("np.mean(m, axis=1) =", np.mean(m, axis=1))

# Example: matrix inverse and multiplication
m = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]]) + np.identity(3)
print("m =\n" + str(m))
m_inv = np.linalg.inv(m)
print("inverse of m =\n" + str(m_inv))
print("m times its inverse =\n" + str(np.dot(m, m_inv)))

# Example: the matrix object
m = np.matrix([[0, 1], [2, 3]])
print("m =\n" +  str(m))
print("m[:, 0] =\n" + str(m[:, 0])) # column vector
print("m[0, :] =\n" + str(m[0, :])) # row vector
# compute bilinear form v^T * A * v (matrix multiplication)
print("m[0, :] * m * m[:, 0] =")
print(m[0, :] * m * m[:, 0])

# Example: set operations
a = np.array(['a', 'b', 'a', 'b'])
b = np.array(['c', 'd'])

print("a =", a)
print("np.unique(a) =", np.unique(a))
print("np.union1d(a, b) =", np.union1d(a, b))
print("np.intersect1d(a, b) =", np.intersect1d(a, b))

Random Number Generation

Here’s an example of generating ndarrays containing pseudo-random numbers:

import numpy as np

# N(0, 1) Gaussian
print("np.random.normal(size=(2, 2)) =")
print(np.random.normal(size=(2, 2)))
# uniform over [0, 1)
print("np.random.uniform(size=(2, 2)) =")
print(np.random.uniform(size=(2, 2)))
# uniform over the integers {0..9}
print("np.random.randint(10, size=(2, 2)) =")
print(np.random.randint(10, size=(2, 2)))
# random permutation of 0..5
print("np.random.permutation(6) =")
print(np.random.permutation(6))

The SciPy Package

Here’s an example of a sparse matrix using SciPy:

from scipy import sparse

# create a sparse matrix using LIL format
m = sparse.lil_matrix((5,5))
m[0, 0] = 1
m[0, 1] = 2
m[1, 1] = 3
m[2, 2] = 4
print("m =\n" + str(m))
# convert m to CSR format
b = sparse.csr_matrix(m)
print("b + b =\n" + str(b + b)) # matrix addition

The pandas Package

Pandas is a package that provides an implementation of a dataframe object that can be used to store datasets (like the ones in R):

import pandas as pd
import numpy as np

# Example: how to create a dataframe
data = {
  "names": ["John", "Jane", "George"], 
  "age": [25, 35, 52],
  "height": [68.1, 62.5, 60.5],
}
df = pd.DataFrame(data)
print("dataframe content =\n" + str(df))
print("dataframe types =\n" + str(df.dtypes))

# Example: accessing data in a dataframe
df["age"] = 35  # assign 35 to all age values
print("age column =\n" + str(df["age"]))
print("height column =\n" + str(df.height))
print("second row =\n" + str(df.iloc[1]))

# Example: adding a new column
df = pd.DataFrame(data)
df["weight"] = [170.2, 160.7, 185.5]
print(df)

# Example: the median function
df = pd.DataFrame(data)
print("medians of columns =\n" + str(df.median(numeric_only=True)))
print("medians of rows =\n" + str(df.median(axis=1, numeric_only=True)))

# Example: apply f(x) = x + 1 to all columns
data = {
  "age": [25.2, 35.4, 52.1],
  "height": [68.1, 62.5, 60.5],
  "weight": [170.2, 160.7, 185.5],
}
df = pd.DataFrame(data)
print(df.apply(lambda z: z + 1))

# Example: working with missing data
data = {
  "age" : [25.2, np.nan, np.nan],
  "height" : [68.1, 62.5, 60.5],
  "weight" : [170.2, np.nan, 185.5],
}
df = pd.DataFrame(data)
# NA stands for Not Available
print("column means (NA skipped):")
print(str(df.mean()))
print("column means (NA not skipped):")
print(str(df.mean(skipna=False)))
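
Besides skipping missing values in aggregates, it is common to drop or fill them. The following sketch uses the same data as the missing-data example; replacing NA with the column mean is only one possible strategy, picked here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
  "age": [25.2, np.nan, np.nan],
  "height": [68.1, 62.5, 60.5],
  "weight": [170.2, np.nan, 185.5],
})
print(df.isna().sum())  # number of missing values per column
complete = df.dropna()  # keep only rows without missing values
print(complete)
filled = df.fillna(df.mean())  # replace NA with the column mean
print(filled)
```

Both dropna and fillna return new dataframes and leave df unchanged.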

The scikit-learn Package

scikit-learn is a package that provides machine learning, data mining, and data analysis tools for Python:

from sklearn import datasets

iris = datasets.load_iris()

print(iris.feature_names) # columns
print(iris.data[:5]) # first 5 rows of the ndarray
print('...')
print(iris.DESCR) # description

Clustering

Using the iris dataset, we try to fit the 150 samples into 3 clusters using the k-means algorithm:

import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import (
  accuracy_score, completeness_score, homogeneity_score)

iris = datasets.load_iris()
# instead of running the algorithm many times with random
# initial values for centroids, we picked one that works well;
# the values below were obtained from one of the random runs:
centroids = np.array([
  [5.006, 3.418, 1.464, 0.244],
  [5.9016129, 2.7483871, 4.39354839, 1.43387097],
  [6.85, 3.07368421, 5.74210526, 2.07105263],
])
predictor = KMeans(n_clusters=3, init=centroids, n_init=1)
predictor.fit(iris.data)
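# range of scores is [0.0, 1.0]; the higher, the better
completeness = completeness_score(iris.target, predictor.labels_)
homogeneity = homogeneity_score(iris.target, predictor.labels_)
# accuracy is meaningful here only because the initial centroids
# are listed in the same order as the class labels 0, 1, 2
accuracy = accuracy_score(iris.target, predictor.labels_)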
print("Completeness:", completeness)
print("Homogeneity:", homogeneity)
print("Accuracy:", accuracy)
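
Cluster ids are arbitrary, so a clustering can group the samples perfectly and still disagree with the class labels id by id. The toy example below (the lists truth and clusters are made up for illustration) shows that homogeneity and completeness ignore how clusters are numbered, whereas accuracy_score compares the ids literally.

```python
from sklearn.metrics import (
  accuracy_score, completeness_score, homogeneity_score)

truth = [0, 0, 1, 1, 2, 2]
# same grouping as truth, but with cluster ids 0 and 1 swapped
clusters = [1, 1, 0, 0, 2, 2]
homogeneity = homogeneity_score(truth, clusters)
completeness = completeness_score(truth, clusters)
accuracy = accuracy_score(truth, clusters)
print("Homogeneity:", homogeneity)
print("Completeness:", completeness)
print("Accuracy:", accuracy)
```

Only the last two samples keep a matching id, so the accuracy drops to one third even though the grouping itself is perfect.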

Classification

Using the iris dataset, we implement a binary classifier that predicts whether a sample is an Iris-Versicolor (denoted by the label 1) or not:

import numpy as np
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

# constant seed to reproduce the same results every time
np.random.seed(28) 

iris = datasets.load_iris()
# prepare labels for binary classification task
# 1 iff original target is Iris-Versicolor
labels = iris.target == 1 
labels = labels.reshape((len(labels), 1))
data = np.append(iris.data, labels, axis=1) 
# randomly shuffle data and split to train and test sets
data = np.random.permutation(data)
split = 4 * len(data) // 5
train_data, test_data = data[:split], data[split:]
train_features = train_data[:, :-1]
train_labels = train_data[:, -1]
# max_iter replaces the n_iter parameter of older scikit-learn releases
predictor = SGDClassifier(max_iter=500)
predictor.fit(train_features, train_labels)
test_features = test_data[:, :-1]
test_labels = test_data[:, -1]
test_error = 1 - accuracy_score(test_labels, 
	predictor.predict(test_features))
print("Test Error: {:.3%}".format(test_error))

Regression

Using the Boston house prices dataset, we implement a linear regression model that predicts the price y of a house from a set of features X:

import numpy as np
from sklearn import datasets
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# note: load_boston was removed in scikit-learn 1.2, so this
# example requires an older release of the package
houses = datasets.load_boston()
print(houses.DESCR)

np.random.seed(42)  # constant seed for reproducibility
split = 4 * len(houses.data) // 5
X_train, X_test = houses.data[:split], houses.data[split:]
y_train, y_test = houses.target[:split], houses.target[split:]
# linear regression works better with normalized features
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# the default loss of SGDRegressor is the squared error
predictor = SGDRegressor()
predictor.fit(X_train, y_train)
mse = mean_squared_error(y_test, predictor.predict(X_test))
# prices are in units of $1000, so the MSE is in squared units;
# its square root (RMSE) converts back to dollars
rmse = np.sqrt(mse)
print("Test Root Mean Squared Error: ${:,.2f}".format(rmse * 1000))

Reading and Writing Data in Text Format

Below are a few examples of how to read from a text file or write to a text file in Python:

reading-and-writing-data-in-text-format.py

# Example: read a text file line by line using a for-loop
f = open("mobydick.txt", "r")  # open file for reading
words = []
for line in f: # iterate over all lines in file
  words += line.split()  # append the list of words in line
f.close()

# Example: the with statement
with open("mobydick.txt") as f:  # "rt" is the default mode
  words = [word for line in f for word in line.split()]

# Example: writing to a file
with open("output.txt", "w") as f:
  f.write("first line\n")
  f.writelines(["second line\n", "third line\n"])

with open("output.txt") as f:
  print(f.read())

mobydick.txt

Call me Ishmael. Some years ago—never mind how long precisely—having
little or no money in my purse, and nothing particular to interest me
on shore, I thought I would sail about a little and see the watery part
of the world. It is a way I have of driving off the spleen and
regulating the circulation. Whenever I find myself growing grim about
the mouth; whenever it is a damp, drizzly November in my soul; whenever
I find myself involuntarily pausing before coffin warehouses, and
bringing up the rear of every funeral I meet; and especially whenever
my hypos get such an upper hand of me, that it requires a strong moral
principle to prevent me from deliberately stepping into the street, and
methodically knocking people’s hats off—then, I account it high time to
get to sea as soon as I can. This is my substitute for pistol and ball.
With a philosophical flourish Cato throws himself upon his sword; I
quietly take to the ship. There is nothing surprising in this. If they
but knew it, almost all men in their degree, some time or other,
cherish very nearly the same feelings towards the ocean with me.
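
A word list read this way is often passed straight to a Counter. The sketch below is self-contained: instead of assuming that mobydick.txt exists, it writes a short sample (a shortened paraphrase of the opening, made up for this example) to a temporary file and then counts the words in it.

```python
import os
import tempfile
from collections import Counter

text = ("Call me Ishmael. Some years ago I thought I would sail about\n"
        "a little and see the watery part of the world.\n")
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as f:
  f.write(text)

# count words while iterating over the file line by line
with open(path) as f:
  counts = Counter(word for line in f for word in line.split())
print(counts.most_common(3))
```

Since split() separates on whitespace only, punctuation stays attached to the words ("world." and "world" would be counted separately).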

Reading and Writing Ndarrays

The NumPy package offers several functions for reading and writing ndarrays:

import numpy as np
import os

# Example: the save and load functions
m = np.array([[1, 2, 3], [4, 5, 6]])
file_name = "matrix.npy"
np.save(file_name, m)
print("File Size in Bytes:", os.stat(file_name).st_size)
loaded = np.load(file_name)
print(loaded)

# Example: the savez and savez_compressed functions
file_name = "output.npz"
for save in np.savez, np.savez_compressed:
  save(file_name, foo=np.array([1, 2]), bar=np.array([3, 4]))
  arrays = np.load(file_name)
  print("Using {}, {} bytes:".format(
    save.__name__,
    os.stat(file_name).st_size))
  for key, value in arrays.items(): # unordered
    print("\t{}: {}".format(key, value))
  print() # empty line

# Example: the savetxt and loadtxt functions
def print_file(file_name):
  with open(file_name) as f:
    for line in f:
      print(line, end='')

file_name = "array.txt" 
np.savetxt(file_name, np.array([[1, 3], [2, 4]]))
loaded = np.loadtxt(file_name)
print("Using numpy.loadtxt:\n{}".format(loaded))
print("Text File Content:")
print_file(file_name)
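
By default, savetxt writes floating-point values in scientific notation, separated by spaces. As a hedged sketch, the fmt and delimiter arguments below produce a compact comma-separated file of integers instead; the temporary directory and the file name array.csv are arbitrary choices for this example.

```python
import os
import tempfile
import numpy as np

file_name = os.path.join(tempfile.mkdtemp(), "array.csv")
m = np.array([[1, 3], [2, 4]])
np.savetxt(file_name, m, fmt="%d", delimiter=",")
with open(file_name) as f:
  content = f.read()
print(content, end="")
# read the values back as integers rather than floats
loaded = np.loadtxt(file_name, delimiter=",", dtype=int)
print(loaded)
```

The same delimiter has to be passed to loadtxt, which otherwise splits on whitespace.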

Reading and Writing Dataframes

The pandas package offers several functions for reading and writing dataframes:

import pandas as pd

def print_file(file_name):
  with open(file_name) as f:
    for line in f:
      print(line, end='')

data_frame = pd.DataFrame({
  "age": [25.2, 35.4, 52.1],
  "height": [68.1, 62.5, 60.5],
  "weight": [170.2, 160.7, 185.5],
})
file_name = "dataframe.csv"
data_frame.to_csv(file_name, index=False)
loaded = pd.read_csv(file_name)
print("Data Frame:\n{}".format(loaded))
print("Text File Content:")
print_file(file_name)
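
When the row labels carry information, they can be written to the file and restored on load. The sketch below uses an in-memory io.StringIO buffer in place of a file; the labels "john" and "jane" are made up for illustration.

```python
import io
import pandas as pd

df = pd.DataFrame(
  {"age": [25.2, 35.4], "height": [68.1, 62.5]},
  index=["john", "jane"])
buffer = io.StringIO()
df.to_csv(buffer)  # index=True by default: labels go in the first column
buffer.seek(0)
loaded = pd.read_csv(buffer, index_col=0)  # restore labels as the index
print(loaded)
```

Without index_col=0, read_csv would treat the labels as an ordinary unnamed column and create a fresh integer index.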

Material Differences between Python 3 and 2

In this section, we describe material differences between the two versions; see [this article](https://wiki.python.org/moin/Python2orPython3) for more details about the version change. Lines prefixed with ## are output lines.

# Example: unicode support

# Python 2.x
print(type(''))  # str
print(type(u''))  # unicode
## <type 'str'>
## <type 'unicode'>

# Python 3.x
print(type(''))  # str
print(type(u''))  # also str
## <class 'str'>
## <class 'str'>


# Example: the print statement

# Python 2.x
print(13, 42)
## (13, 42)

# Python 3.x
# print 13, 42
##  File "<stdin>", line 1
##    print 13, 42
##           ^
## SyntaxError: Missing parentheses in call to 'print'


# Example: the print function

# Python 3.x (or 2.x with the import from 3.x)
# imports from the future must come first in the file
from __future__ import print_function
print('hello from the future')
print(print)  # the print function is an object
print(13, 42)
## hello from the future
## <built-in function print>
## 13 42


# Example: division

# Python 2.x
print(1 / 3)  # floor (integer) division
print(1.0 / 3)  # true division
## 0
## 0.333333333333

# Python 3.x (or 2.x with the import from 3.x)
from __future__ import division
print(1 // 3)  # floor (integer) division
print(1 / 3)  # true division
## 0
## 0.3333333333333333
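
One detail worth keeping in mind when porting code is that floor division rounds toward negative infinity rather than toward zero. The short Python 3 sketch below illustrates this; the variable names are chosen here for illustration.

```python
floor_pos = 7 // 2    # floor division
floor_neg = -7 // 2   # floors toward negative infinity, not toward zero
true_div = 7 / 2      # true division always returns a float
pair = divmod(-7, 2)  # quotient and remainder together
print(floor_pos)
print(floor_neg)
print(true_div)
print(pair)
## 3
## -4
## 3.5
## (-4, 1)
```

The remainder takes the sign of the divisor, so that -7 == 2 * (-4) + 1 holds.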