Arrow of time

On constants in algorithmic complexity, and comparing apples to oranges

Can an O(n!) algorithm be faster than an O(1) algorithm? Yes, of course it can.

So I came across an interesting and almost trivial programming puzzle: given an arbitrary word and a dictionary of words, find all anagrams of the word present in the dictionary. At first I didn't care to think about the best solution for the problem and I just wrote a trivial, brute force solution as a proof of concept:

from itertools import permutations
from collections import Counter
from math import factorial

def anagrams(w):
    # load the dictionary into a set for cheap membership tests
    words = set()
    for line in open('words.txt').readlines():
        words.add(line.strip())
    # generate every permutation of the target word and keep the ones
    # that are actual dictionary words
    result = set()
    for perm in permutations(w):
        perm_str = ''.join(perm)
        if perm_str in words:
            result.add(perm_str)
    return list(result)

Let's call this "solution 1."

Neat and straightforward... and with the obvious complexity of O(n!), where n is the length of the given word. This complexity is a natural consequence of finding all permutations of the word and checking each one for its presence in the dictionary. It also has an O(m) memory space complexity, where m is the size of the dictionary. I'm using the set type to hold the result here since permutations may be non-unique (e.g. transposing the first and second letter 'a' in the word 'anagram' yields the same string).
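To make the duplication concrete: 'anagram' has three 'a's, so most of its 7! permutations are repeats of each other. A quick toy check (illustrative, reusing the same approach as above):

from itertools import permutations

all_perms = [''.join(p) for p in permutations('anagram')]
print(len(all_perms))       # 5040 == 7!
print(len(set(all_perms)))  # 840  == 5040 / 3!, because 'a' occurs three times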

To illustrate what this means: a 5-letter word has 5! = 120 permutations to check, an 8-letter word has 40,320, and a 10-letter word has 3,628,800.

Ok... so this is basically bad. What can we do about it? How about turning the problem upside-down: for each word in the dictionary, check whether it's an anagram of the target word. This is surely an O(m) operation, where m is the number of words in the dictionary:

def ana2(w):
    c = Counter(w)
    result = []
    for line in open('words.txt').readlines():
        word = line.strip()
        # two words are anagrams iff they have identical letter counts
        if Counter(word) == c:
            result.append(word)
    return result

Call this "solution 2."

Again, a clear and obvious solution. It makes a single pass through the dictionary, and if the size of the dictionary is constant, we may even stretch our estimation of the complexity and call this an O(1) algorithm, as the number of operations in the loop doesn't depend significantly on the target word length. By the way, this is an obvious candidate for a list comprehension, so I can write it more Pythonically as:

def ana3(w):
    c = Counter(w)
    return [ line.strip() for line in open('words.txt').readlines() if Counter(line.strip()) == c ]

Call this "solution 3."

Both of these solutions are also essentially O(1) in memory space complexity, apart from the readlines() call, which does pull the whole file into memory at once; iterating over the file object directly would avoid even that. But... does any of this actually help? Maybe.

The thing is: if our target word is short, the total number of operations we execute in the O(n!) case is actually pretty low for a modern CPU's capabilities. Since in most cases the words we use actually are short, this should be taken into account when solving this particular problem. To illustrate the point, here's a comparison of the number of operations depending on the target word length, with an assumed dictionary size of 22,000 words:

Word length     O(1) ops        O(n!) ops
4               22000                  24
5               22000                 120
6               22000                 720
7               22000                5040
8               22000               40320
9               22000              362880
10              22000             3628800
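The crossover is easy to compute: the brute-force approach only starts losing once n! exceeds the dictionary size, which for roughly 22,000 words happens at 8 letters. A quick sketch (the dictionary size is the assumed figure from the table):

from math import factorial

DICT_SIZE = 22000  # assumed dictionary size, as in the table above

for n in range(4, 11):
    cheaper = 'permutations' if factorial(n) < DICT_SIZE else 'dictionary scan'
    print(n, factorial(n), cheaper)
# n! overtakes the dictionary size at n = 8 (40320 > 22000)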

The O(1) solution actually has a huge constant attached to it: in every single case we pass through all of the entries in the dictionary, no matter what. That fixed cost is the "constant" in the complexity estimation, and writing just "O(1)" doesn't tell the whole story. In practice, we're interested in the fact that this algorithm is simply slow, even if it behaves nicely and executes in constant-bounded time. It's more honest to write C + O(1), where C (some constant cost per iteration, and this algorithm has only a "single" iteration) is very large in this case. Here is the table of timings for a 5-letter target word and 100 iterations:

solution 1               1.7 s
solution 2              31.0 s
solution 3              30.7 s

Ouch. Our clever O(1) algorithm seems to be rather sluggish. Also note that in this case, using list comprehensions doesn't help us at all. On the other hand, our O(n!) algorithm is actually K + O(n!), but its constant cost per iteration K is so low that we don't care: it basically stands for whatever operations are involved in building a permutation string and checking it against the dictionary set.
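For the record, a minimal harness along these lines (100 timeit calls per solution, with 'abode' standing in for the 5-letter target) produces numbers of this order; the exact benchmark code isn't the point, so treat it as illustrative:

import timeit

for fn in (anagrams, ana2, ana3):
    # time 100 calls of each solution with the same 5-letter word
    t = timeit.timeit(lambda: fn('abode'), number=100)
    print(fn.__name__, round(t, 1), 's')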

Ok, so let's be a little less dumb and realize that for two words to be anagrams, they need to have the same length, and adjust the algorithm accordingly:

def ana4(w):
    c = Counter(w)
    result = []
    for line in open('words.txt').readlines():
        word = line.strip()
        # anagrams must have the same length, so skip the expensive
        # Counter comparison for words that can't possibly match
        if len(word) != len(w):
            continue
        c2 = Counter(word)
        if c2 == c:
            result.append(word)
    return result

Call this "solution 4."

Actually, the performance of this solution is surprisingly good:

solution 1               1.7 s
solution 2              31.0 s
solution 3              30.7 s
solution 4               4.8 s

In this case, we have significantly reduced the constant cost of the average iteration, but the nature of the algorithm has remained the same.
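To get a feel for how much that cheap length check saves, here's a rough count (illustrative; it assumes the same words.txt) of how many dictionary entries even reach the Counter comparison for a 5-letter target:

# only words of matching length ever reach the Counter comparison;
# the rest are skipped after a cheap len() call
words = [line.strip() for line in open('words.txt').readlines()]
target = 'abode'
same_length = [w for w in words if len(w) == len(target)]
print(len(words), len(same_length))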

Clearly there's a cutoff point at which the number of operations performed by the O(n!) solution is significantly lower than for the other ones, so we can pick and mix our algorithms for the respective cases. To keep the solution correct, we need to adapt to both the word length and the dictionary size:

def ana5(target):

    result = set()
    words = set([ line.strip() for line in open('words.txt').readlines() ])

    if factorial(len(target)) > len(words):
        # use the O(m) algorithm WRT the size of the dictionary
        c = Counter(target)
        for word in words:
            if len(word) != len(target):
                continue
            if Counter(word) == c:
                result.add(word)
    else:
        # use the O(n!) algorithm WRT the word length
        for perm in permutations(target):
            perm_str = ''.join(perm)
            if perm_str in words:
                result.add(perm_str)

    return list(result)

Call this "solution 5."

This gives us a nicely bounded worst-case performance, which is best illustrated by running the algorithm on words of varying lengths, each of which has exactly two anagram forms present in the dictionary (the words in my example were chosen randomly):

Length  Words                   Time, s (100 iterations)
 5      ['abode', 'adobe']                      1.4
 6      ['aboard', 'abroad']                    1.5
 7      ['abridge', 'brigade']                  1.8
 8      ['alarming', 'marginal']                6.4
 9      ['lamenting', 'alignment']              5.6
10      ['domination', 'admonition']            4.8
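As a quick sanity check that each of these test words really does have exactly two anagram forms in the dictionary (itself included), something like this will do:

# illustrative check using solution 5; each word should print a two-element list
for word in ['abode', 'aboard', 'abridge', 'alarming', 'lamenting', 'domination']:
    print(word, sorted(ana5(word)))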

So, yeah, programming puzzles are even more interesting to solve when performance is a consideration.

Update: By popular request, here's one way to do it "properly". The algorithms above are toys for illustrating a point about the importance of constants in algorithm complexity, not actually something that should be used in production.

Here's how to do it better: index all the words in the dictionary (put them in a hash map) by a "signature" that captures which letters they are made of and how many times each letter occurs. All searches for anagrams of the target word then boil down to a single lookup in the hash map:

def ana6(target):
    cmap = {}
    for line in open('words.txt').readlines():
        word = line.strip()
        # the signature must be hashable and independent of letter order,
        # so use a frozenset of (letter, count) pairs
        sig = frozenset(Counter(word).items())
        if sig in cmap:
            cmap[sig].append(word)
        else:
            cmap[sig] = [ word ]
    return cmap.get(frozenset(Counter(target).items()), [])

Call this "solution 6."

In the above code, I'm using frozenset(Counter(...).items()) as the key because Counter objects themselves are not hashable (they cannot be keys in a dict), while a frozenset of their (letter, count) pairs is, and unlike a Counter's string representation it doesn't depend on the order in which the letters were first encountered. HOWEVER: this particular code has a monstrously expensive constant part while the map (cmap in the above code) is being built. It takes around 76 seconds to do 100 iterations of the above solution! In real, production code, if at all possible, this word index would be created only once:

# index all the words in the dictionary only once
cmap = {}
for line in open('words.txt').readlines():
    word = line.strip()
    sig = frozenset(Counter(word).items())
    try:
        cmap[sig].append(word)
    except KeyError:
        cmap[sig] = [ word ]

# use the index
def ana7(target):
    try:
        return cmap[frozenset(Counter(target).items())]
    except KeyError:
        return []

We'll call this "solution 7" and, fittingly, it has a perfect O(1) performance with a trivial cost: 100 calls of ana7() with absolutely any arguments take around 0.003 s on this machine. It has an O(m) memory space complexity, where m is the size of the dictionary, and can be considered an example of the typical trade-off between performance and memory. If you actually need to find anagrams, use this algorithm, not the toy ones.
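For completeness: an equally common signature is simply the word's letters sorted into a canonical string. Here's a sketch of the same index built that way (the signature, cmap2 and ana8 names are mine, purely for illustration):

# alternative signature: the sorted letters form a canonical string,
# e.g. both 'abode' and 'adobe' map to 'abdeo'
def signature(word):
    return ''.join(sorted(word))

cmap2 = {}
for line in open('words.txt').readlines():
    word = line.strip()
    cmap2.setdefault(signature(word), []).append(word)

def ana8(target):
    return cmap2.get(signature(target), [])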

