Excessively Adequate

How to Write a Spelling Corrector (in Haskell)

Spelling correction might seem like a daunting thing to implement, but a very basic variant can be written in a couple dozen lines of code in most languages – and that’s without using any fancy tricks to make things as terse[1] as possible.

This article is a walk-through of my Haskell implementation[2] of Peter Norvig’s toy spelling corrector, which I recommend taking a look at first – especially if you’re interested in the underlying theory. After detailing my implementation and how it differs from the original Python code, I’ll briefly analyze its performance.

Implementation

The ugly part first: To emulate the first part of the Python code…

def words(text): return re.findall(r'\w+', text.lower())

WORDS = Counter(words(open('big.txt').read()))

…we’ll swiftly abandon[3] any notion of functional purity by using unsafePerformIO to read the file big.txt:

-- | Note: To simulate the global variable @WORDS@ from the original Python
-- implementation, we'll perform some unsafe IO here. Please put on your safety
-- goggles and take a few steps back.
{-# NOINLINE words #-}
words :: Map.Map String Int
words = Map.fromList [ (head l, length l) | l <- group (sort words') ]
  where
    text   = unsafePerformIO $ getDataFileName "big.txt" >>= readFile
    words' = filter (not . null) . splitWhen (not . isAlpha) $ map toLower text

The helper function words' first converts all characters to lower case, then splits the resulting String on non-alphabetic characters. This yields a [String] containing the words we want (runs of alphabetic characters) along with some empty lists – splitWhen from the split package leaves those behind wherever two adjacent characters are both non-alphabetic, or where the text begins or ends with one – which are subsequently filtered out. This emulates the \w+ regular expression[4] closely (though not exactly: \w also matches digits and underscores) and reasonably idiomatically. Finally, we sort the resulting list of words and group duplicate occurrences, which makes it straightforward to build a mapping from each word to its number of occurrences.
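
To see exactly what splitWhen leaves behind, here’s a quick GHCi session (since the Spell module imports splitWhen, isAlpha and toLower, they’re all in scope once it’s loaded):

*Spell> splitWhen (not . isAlpha) (map toLower "Don't panic!")
["don","t","panic",""]
*Spell> filter (not . null) it
["don","t","panic"]

The apostrophe and the trailing exclamation mark each produce an empty list, which the filter pass then discards.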

Moving on: Given a word and a text, we need a way of computing the probability of the word. In Python, the function

def P(word, N=sum(WORDS.values())):
    "Probability of `word`."
    return WORDS[word] / N

solves this quite neatly. Here, WORDS is a Counter instance providing a mapping from words to the number of occurrences in the text – which I’ve implemented using Data.Map above. As a result, we can implement this function as follows:

-- | Probability of @word@.
p :: String -> Double
p word = (/ n) $ fromIntegral $ fromMaybe 0 (Map.lookup word words :: Maybe Int)
  where
    n = fromIntegral $ Map.foldl' (+) 0 words

What’s happening here? In the where block, we compute the total number n of words in the text by summing up the number of occurrences of each word. For the given word, we then perform a lookup, which yields its number of occurrences or – thanks to fromMaybe – 0 if it’s not in the map. Finally, we divide the result of this lookup by the total number of words we’ve just computed.
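
As an aside, the fromMaybe-plus-lookup combination can be fused into a single call to Map.findWithDefault. A minimal equivalent sketch (the name p' is mine, not part of the original code):

-- | Probability of @word@, this time via 'Map.findWithDefault' – a sketch
-- that should behave identically to @p@ above.
p' :: String -> Double
p' word = fromIntegral (Map.findWithDefault 0 word words) / n
  where
    n = fromIntegral $ Map.foldl' (+) 0 words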

Next up is the user-facing correction function…

def correction(word):
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

…the Haskell translation of which is fairly[5] straightforward – it just computes the argmax of p applied to the candidate corrections for the given word, which means that it returns the most probable correction for word:

-- | Most probable spelling correction for @word@.
correction :: String -> String
correction word = argmax p $ candidates word
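
Incidentally, if you’d rather not depend on the list-extras package that provides argmax (see footnote 5), an equivalent is easily built from maximumBy and comparing, both of which ship with base. A sketch, identical up to which element wins a tie:

-- | A stand-in for @argmax@ from list-extras, built from base functions
-- (requires @import Data.List (maximumBy)@ and @import Data.Ord (comparing)@).
argmax' :: Ord b => (a -> b) -> [a] -> a
argmax' f = maximumBy (comparing f)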

Well, how can we come up with possible corrections for a word? The candidates function takes care of that:

def candidates(word):
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

The above Python implementation relies on empty containers evaluating to False in Python: The short-circuiting or operator first evaluates its left operand. If this is non-empty (i.e. truthy), it is returned directly; otherwise (if the left operand is empty, i.e. falsy), the right operand is evaluated and returned. As a result, the candidates function returns the first non-empty one of the or-linked sets.

In Haskell, we can achieve the same behavior by first computing the different lists of candidate corrections (more on that below) and then picking the first non-empty one – except, thanks to laziness, we only[6] compute up to the one we pick.

-- | Generate possible spelling corrections for @word@.
candidates :: String -> [String]
candidates word = head $ filter (not . null) s
  where
    s = [ known [word]
        , known $ edits1 word
        , known $ edits2 word
        , [word]
        ]

The known function takes a set or list of words and filters out everything that’s not in the previously described WORDS constant:

def known(words):
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

This can be solved neatly using a list comprehension and a Map.member check:

-- | The subset of @words'@ that appear in the dictionary of @words@.
known :: [String] -> [String]
known words' = [ w | w <- words', Map.member w words]
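
We can sanity-check this against the demo below: since correction "speling" turns out to be "spelling", the corpus must contain "spelling" but not "speling" (otherwise known ["speling"] would have returned the word unchanged). So a quick check in GHCi should come back with:

*Spell> known ["speling", "spelling"]
["spelling"]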

The central edits1 function, which computes all possible corrections with an edit distance of 1, looks like this in the original Python implementation:

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

We can translate this to Haskell without any significant modifications, merely replacing Python constructs with their Haskell equivalents:

-- | All edits that are one edit away from @word@.
edits1 :: String -> [String]
edits1 word = deletes ++ transposes ++ replaces ++ inserts
  where
    letters    = "abcdefghijklmnopqrstuvwxyz"
    splits     = [ splitAt i word                  | i <- [0 .. length word] ]
    deletes    = [ l ++ tail r                     | (l,r) <- splits, (not . null) r ]
    transposes = [ l ++ r !! 1 : head r : drop 2 r | (l,r) <- splits, length r > 1 ]
    replaces   = [ l ++ c : tail r                 | (l,r) <- splits, (not . null) r, c <- letters ]
    inserts    = [ l ++ c : r                      | (l,r) <- splits, c <- letters]
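
As a quick sanity check: for a word of length n, this produces exactly n deletions, n-1 transpositions, 26n replacements and 26(n+1) insertions – 54n+25 candidate strings in total, duplicates included, since we’re not deduplicating. GHCi should agree for an 8-letter word:

*Spell> length (edits1 "somthing")
457
*Spell> 54 * length "somthing" + 25
457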

Applying this function twice – first on the original word, then on each of its possible edits – is exactly what the edits2 function is for:

def edits2(word):
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

As before, since this is a no-frills generator expression, it can be written as a list comprehension in Haskell:

-- | All edits that are two edits away from @word@.
edits2 :: String -> [String]
edits2 word = [ e2 | e1 <- edits1 word, e2 <- edits1 e1 ]
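
Note how quickly this blows up: each of the 457 first-generation edits of an 8-letter word spawns 54m+25 second-generation edits of its own, where m is its length (between 7 and 9). Unless I’ve slipped up in the arithmetic, that works out to 221053 strings, duplicates included:

*Spell> length (edits2 "somthing")
221053

This is precisely why the laziness in candidates pays off: for most inputs, this list is never built at all.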

That’s it. Let’s take it for a spin!

Demo

After cloning the repository and following the enclosed setup instructions, run cabal repl or stack ghci to start correcting your spelling:

$ cabal repl
Preprocessing library spell-0.1.0.0...
GHCi, version 8.0.2: http://www.haskell.org/ghc/  :? for help
Loaded GHCi configuration from /Users/noah/.ghci
[1 of 2] Compiling Paths_spell      ( dist/build/autogen/Paths_spell.hs, interpreted )
[2 of 2] Compiling Spell            ( src/Spell.hs, interpreted )
Ok, modules loaded: Paths_spell, Spell.
*Spell> correction "speling"
"spelling"
it :: String
*Spell> correction "cdoe"
"code"
it :: String
*Spell> correction "haskell"
"haskell"
it :: String
*Spell>
Leaving GHCi.

Analysis

As you will notice when running it, compared to the original Python implementation, the Haskell version performs significantly worse in one respect: parsing the file big.txt and populating the mapping from each word to its number of occurrences. On my machine, Python manages this in just under a second while my Haskell implementation takes about 10 seconds.

Luckily, this is only required during the first call to the correction function – any subsequent calls can skip this step and will run much faster, at which point the largely unoptimized Haskell translation “only” takes about 100 ms per word, about twice as long as Peter Norvig’s Python original (although this varies depending on the input).
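
If you’d rather pay that one-time cost up front, you can force the map before correcting anything. Data.Map is spine-strict, so merely demanding the map’s size makes GHC read and parse big.txt in full – a minimal sketch:

*Spell> Map.size words `seq` "ready"  -- blocks for ~10 seconds on first use
"ready"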

Another thing worth noting is that I have decided against using a set-like data type like Data.Set or equivalent third-party libraries here, opting for lists instead (even though the Python version employs sets). This is mainly due to performance and code clarity[7] considerations. The resulting duplicate entries don’t impact the result’s correctness – duplicates can’t change which candidate maximizes p – and passing them up the call stack instead of performing a deduplication pass at each stage[8] turned out to be faster. An old version using Data.Set was significantly slower than the implementation described in this article.
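
For reference, a deduplicating variant of known in the spirit of that old version might have looked something like the following sketch (not the exact code I benchmarked; it additionally requires import qualified Data.Set as Set):

-- | Like 'known', but deduplicates its result by round-tripping through a
-- 'Set.Set' – a sketch of the slower approach discussed above.
knownSet :: [String] -> [String]
knownSet words' = Set.toList $ Set.fromList [ w | w <- words', Map.member w words ]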

  1. Like terseness? Take a look at K. If you come back with your sanity intact, you can make good money.

  2. Of course, I’m not the first to take this on. Other reimplementations of Peter Norvig’s algorithm can be found here, here, here and here.

  3. This is less than ideal, but it allows us to use the words function as a global constant, similar to the WORDS variable in Python. Any alternative, while more idiomatic, would make the program harder to read due to the need to pass an extra parameter to almost every function. This would impact usage too: instead of simply calling the correction function, the user would first have to read the text file and then pass its contents (or, depending on the implementation, the result of applying a function to its contents) to correction. Of course, you could use the Reader monad to circumvent all this, but that would add complexity and it’s a bit more advanced than I’d like this article to be.

  4. Using a regex library here was significantly slower in my tests. 

  5. Note that the argmax function is the Haskell equivalent to Python’s max(..., key=...). It comes as part of the list-extras package.

  6. Suppose the word is already spelled correctly. Then known [word] returns a singleton list containing word, which is non-empty, so it’s directly returned to the caller without prior computation of known $ edits1 word or known $ edits2 word.

  7. In Python, sets are first-class citizens and can interoperate with lists easily thanks to dynamic typing. In Haskell, conversion between sets and lists is a bit more verbose. 

  8. Which would run in O(n^2) if using nub, or O(n log n + n) if doing a sort prior to applying the nub function from the data-ordlist package.