4. Representations, Complexity and Algorithms

09 Oct 2015

How does a Computer represent things

You might have heard that a computer works with something called binary. That means that everything you do with a computer boils down to a string of ones and zeros. As an example $00000100$ represent the number 4 in 8 bits. That also means that all operations (addition, subtraction, multiplication ...) is the manipulation of these binary strings. 8 bits are called one byte and there are $2^8 = 256$ different bit patterns for a byte. So with a byte we can represent 256 different things.

Sometimes instead of using binary string you will see people using hexadecimal numbers. Hexadecimal numbers from 00 to FF are exactly the binary patterns from 00000000 to 11111111. (Decimal, Hexadecimal, Octal, Binary).

Binary logic

One of the most fundamental things you can do with two bits, is binary logic or as it is also called boolean algebra.

`and` and `or`

x	y	`x and y`	`x or y`
0	0	0	0
0	1	0	1
1	0	0	1
1	1	1	1

`not`

x	`not x`
0	1
1	0

`xor`

There is also the exclusive or (xor):

x	y	`x xor y`
0	0	0
0	1	1
1	0	1
1	1	0

Text

Text in a computer is encoded either in ASCII or nowadays in one of the Unicode standards if you are wondering why your phone is able to send emoji and why we ASCII is not enough see these videos: Internationalis(z)ing Code and Emoji and the Levitating Businessman.

Python 3 uses UTF8 as a default encoding. ASCII is a strict subset of the UTF-8 standard so as long as we only talk about the English 26 characters we will ignore unicode.

Words and sentences are then stored as arrays of characters. So if you work with strings you have to be as careful as when you work with arrays.

Integers

We are back at the beginning of building computers and somebody asks you to create a computer that is able to do calculation with positive (or unsigned) numbers. Since no body told us how many numbers they need we decide to use 8 bit to store a numbers. We decide on the notion of storing the least significant number on the right (as in decimal) and we define $00000000$ to be zero. To convert a binary number into decimal we calculate $b_7 * 2^7 + b_6 * 2^6 + b_5 * 2^5 + b_4 * 2^4 + b_3 * 2^3 + b_2 * 2^2 + b_1 * 2^1 + b_0 * 2^0$, where $b_n$ corresponds to the $n$th bit from the right.

Addition

0001001
0000101
-------
0001110

Overflow

1101001
0110101
-------
0011110

How could we prevent this? Is this an intrinsic problem to how a computer handles math?

Binary Addition & Overflow - Computerphile

Negative numbers

For negative numbers the first bit is reserved as the sign so with 8 bits we can represent numbers from -128 to 127. The particular representation that is used is called 2-complement.

Python and Integers

Python takes most of the worries away from you and when you add to very big numbers that could overflow it changes the representation from 32 to 64 bit. This allows you not to worry to much about stuff like this but it carries a performance penalty.

But be careful if you use tools like numpy. They are implemented in C and you might get overflow behavior.

Real numbers

For representing real numbers computer use floating point (defined in [IEEE754]). Floating point resembles scientific notation, but to the base of 2 instead of 10. One part of stores the significant and the other the exponent. For 64 bits [IEEE754] defines that 52 bits + 1 sign bit are reserved for the significant and 11 bits are reserved for the exponent.

Floating point numbers try to solve the problem of how to represent a set of infinite numbers with only a finite set of representations (bit patterns).

Noteworthy features

Most numbers can't be exactly represented and errors accumulate.

 print(100*0.1)
sum = 0.0
for i in range(0,100):
    sum += 0.1

print(sum)

Gaps between numbers grow the bigger the numbers get.
Adding a small number to a big number might have no effect. This is called annihilation.

$(a+b)+c != a+(b+c)$ the order of addition matters!

a = 1e17
b = -1e17
c = 1

(a + b) + c = ???
a + (b + c) = ???

$a+b>a$ this is no longer true if the distance between $a$ and the next largest floating-point number is bigger than b $⇒ eps(a) > b$.
Instead of testing for equality you should test for approximaly equal.

Further links

Complexity

The complexity of an algorithm is defined as the number of elementary steps an algorithm takes depending on the number of inputs.

The big O notation gives us the asymptotic runtime behavior of a function. As an example the search algorithm for a sorted list most of you implemented in the last homework runs in $O(n)$ steps. Constant cost factors are not of interest. The most optimal algorithm would run in $O(log(n))$. The sorting algorithm most of you implemented takes $O(n^2)$ and the most efficient algorithm takes $O(n*log(n))$.

If you notice that your program runs slowly for large datasets, there are several potential reasons. Either your implementation is inefficient or the algorithm you chose has a bad asymptotic runtime.

MIT 16.070

Algorithms

Sorting algorithms

Bubble sort and merge sort Getting Sorted - Computerphile
Quick Sort - Computerphile

Searching in a sorted list

Search for 13 and 9
In: [0, 3, 5, 7, 8, 10, 13, 17]
                  8
                /    \
              5       13
            /   \    /   \
          3      7  10    17
          |
          0

The obvious solution is to go through each element in the list and check if it is the element we are looking for it. This is going to take $n = length(Inpu)$ number of operations in the average case. Another solution is to realize that a sorted list is a tree (with the root being length(Input)/2) and we can search a tree with log(n) operations.

Linked-List

A linked list is a list that is build from elements that contain a pointer to the next element and the value they are storing.

class LinkedList:
    next = None
    def __init__(self, value):
        self.value = value
    def append(self, value):
      if self.next == None:
          self.next = LinkedList(value)
      else:
          self.next.append(value)

Operation costs in Python

The way we choose to represent data-structure leads to different cost of operations. As an example append takes $O(n)$ steps with your LinkedList. A more efficient implementation would take $O(1)$.

Take a look at operation costs for common python containers TimeComplexity and a detailed analysis of the python list.

Profiling in Python

Analyzing the performance of a program is called profiling. Profiling will determine hot spots in your code that are costly and those are the places that rewriting them might be beneficial.

Notes from last weeks homework

return and function arguments
markdown in Jupyter

Homework

The homework is due on October 15 2015 at 13:00pm (eg. noon). Hand in your homework either as a IPython notebook or a pdf + the source code you wrote.

Implement a function that computes xor, while only using and, or and not.
What happens when you sort characters. What decides their order?
Write a program that takes 16 bit positive numbers (as list of boolean values) and converts them to decimal.
Implement bubble sort.
Implement merge sort.
Why is merge sort more efficient for large n?
Extend the defintion of LinkedList given in the lecture to support the operations:
- get(i) get the value of the list element at position i.
- set(i, value) set the value of the list element at position i.
- insert(i, value) insert a new value at position i.
- delete(i) delete the value at position i

Describe the following code using big-O notation, and justify your answer. Hint: if you are lost, profile this code.

# http://code.activestate.com/recipes/579093-converting-numeric-strings-to-integers/?in=lang-python
def str_to_int(s):
    ctr = i = 0
    for c in reversed(s):
        i += (ord(c) - 48) * (10 ** ctr)
        ctr += 1
    return i

print
for s in ('0', '1', '2', '3', '12', '123', '234', '456', '567'):
    i = str_to_int(s)
    print "s = {}, i = {} |".format(s, i),

print
print

for i in range(50):
    s = str(i)
    j = str_to_int(s)
    print "s = {}, j = {} |".format(s, j),

To run this example you may need to install pip install bipartite_recipe

# http://code.activestate.com/recipes/579088-edge-coloring-of-a-bipartite-graph/?in=lang-python
from bipartite_recipe import bipartite_match
def max_degree(graph):
    max_degree_u=max(len(v) for v in graph.values() )
    count={}
    for vlist in graph.values():
        for v in vlist:
            try:
                count[v]+=1
            except KeyError:
                count[v]=1
    max_degree_v=max(count.values())
    return max(max_degree_v,max_degree_u)
def best_match(graph):
    b_m=bipartiteMatch(graph)[0]
    return dict((value,key) for (key,value) in b_m.items())
def edge_coloring(graph):
    g=graph.copy()
    number_colors=max_degree(g)
    for color in range(number_colors):
        print 'NEXT COLOR'
        bm=best_match(g)
        print bm
        for k in g.copy():
            try:
                g[k].remove(bm[k])
            except KeyError:
                pass

graph={ 'p1':['c1','c3'], 'p2':['c1','c2'], 'p3':['c2','c3'], 'p4':['c2'] }
edge_coloring(graph)

# http://code.activestate.com/recipes/578959-draw-a-diamond-with-asterisks-using-recursion/?in=lang-python
def triangles(n):
    if not n & 1:
        raise ValueError('n must be odd')
    print_diamond(0, n, n >> 1)

def print_diamond(start, stop, midpoint):
    if start < stop:
        if start <= midpoint:
            print('  ' * (midpoint - start) + '* ' * ((start << 1) + 1))
        else:
            print('  ' * (start - midpoint) + '* ' * ((stop - start << 1) - 1))
        print_diamond(start + 1, stop, midpoint)

Consider the following code in Python:
```
a = {'foo': 'bar'}
b = a

b['new'] = 'key'
print(a)
```
Is this behavior what you expected? Why does Python do this? What would you do if you wanted to change just b and not a?
What cost is associated with calling append repeatedly on a python list. Use the resources given in the lecture to study the implementation and its consequences on performance. How could you prevent this behavior in your own code, without reimplementing lists?