How to Write a Spelling Corrector (in Haskell)
Spelling correction might seem like a huge thing to implement, but a very basic variant can be implemented in a couple dozen lines of code in most languages – and that’s without using any fancy tricks to make things as terse1 as possible.
This article is a walk-through of my Haskell implementation2 of Peter Norvig’s toy spelling corrector, which I recommend taking a look at first – especially if you’re interested in the underlying theory. After detailing my implementation and how it differs from the original Python code, I’ll briefly analyze its performance.
Implementation
The ugly part first: To emulate the first part of the Python code…
def words(text): return re.findall(r'\w+', text.lower())
WORDS = Counter(words(open('big.txt').read()))
…we’ll swiftly abandon3 any notion of functional purity by using unsafePerformIO
to read the file big.txt
:
-- | Note: To simulate the global variable @WORDS@ from the original Python
-- implementation, we'll perform some unsafe IO here. Please put on your safety
-- goggles and take a few steps back.
{-# NOINLINE words #-}
words :: Map.Map String Int
words = Map.fromList [ (head l, length l) | l <- group (sort words') ]
where
text = unsafePerformIO $ getDataFileName "big.txt" >>= readFile
words' = filter (not . null) . splitWhen (not . isAlpha) $ map toLower text
The helper function words'
first converts all characters to lower-case and splits the resulting String
on non-alphabetic characters, returning a [String]
which contains only the words we want (because they are composed of alphabetic characters) and empty lists (this is what splitWhen
from the split package leaves behind when applied to two subsequent non-alphabetic characters), the latter of which are subsequently filter
ed out. This emulates the \w+
regular expression4 closely and reasonably idiomatically. Finally, we sort the resulting list of words and group duplicate occurrences, which facilitates creating a mapping from words to number of occurrences.
Moving on: Given a word and a text, we need a way of computing the probability of the word. In Python, the function
def P(word, N=sum(WORDS.values())):
"Probability of `word`."
return WORDS[word] / N
solves this quite neatly. Here, WORDS
is a Counter
instance providing a mapping from words to the number of occurences in the text – which I’ve implemented using Data.Map
above. As a result, we can implement this function as follows:
-- | Probability of @word@.
p :: String -> Double
p word = (/ n) $ fromIntegral $ fromMaybe 0 (Map.lookup word words :: Maybe Int)
where
n = fromIntegral $ Map.foldl' (+) 0 words
What’s happening here? In the where
block, we compute the total number n
of words in the text by summing up the number of occurences for each word. For the given word
, we then perform a lookup
, which yields its number of occurrences or – thanks to fromMaybe
– 0 if it’s not in the map. Finally, we divide the result of this operation by the total number of words we’ve just computed.
Next up is the user-facing correction
function…
def correction(word):
"Most probable spelling correction for word."
return max(candidates(word), key=P)
…the Haskell translation of which is fairly5 straightforward – it just computes the argmax
of p
applied to the candidate corrections for the given word
, which means that it returns the most probable correction for word
:
-- | Most probable spelling correction for @word@.
correction :: String -> String
correction word = argmax p $ candidates word
Well, how can we come up with possible corrections for a word? The candidates
function takes care of that:
def candidates(word):
"Generate possible spelling corrections for word."
return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])
The above Python implementation is based on empty containers evaluating to False
in Python: The left-associative or
operator first evaluates its left operand. If this is not empty (which means the associated boolean is True
), it simply returns it. Otherwise (if the left operant is empty, i.e. False
), the right operand is returned. As a result, the candidates
function returns the first non-empty one of the or
-linked sets.
In Haskell, we can achieve the same behavior by first computing the different lists of candidate corrections (more on that below) and then picking the first non-empty one – except, thanks to laziness, we only6 compute up to the one we pick.
-- | Generate possible spelling corrections for @word@.
candidates :: String -> [String]
candidates word = head $ filter (not . null) s
where
s = [ known [word]
, known $ edits1 word
, known $ edits2 word
, [word]
]
The known
function takes a set or list of words and filters out everything that’s not in the previously described WORDS
constant:
def known(words):
"The subset of `words` that appear in the dictionary of WORDS."
return set(w for w in words if w in WORDS)
This can be solved neatly using a list comprehension and a Map.member
check:
-- | The subset of @words'@ that appear in the dictionary of @words@.
known :: [String] -> [String]
known words' = [ w | w <- words', Map.member w words]
The central edits1
function, which computes all possible corrections with an edit distance of 1, looks like this in the original Python implementation:
def edits1(word):
"All edits that are one edit away from `word`."
letters = 'abcdefghijklmnopqrstuvwxyz'
splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
deletes = [L + R[1:] for L, R in splits if R]
transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
inserts = [L + c + R for L, R in splits for c in letters]
return set(deletes + transposes + replaces + inserts)
We can translate this to Haskell without any significant modifications, merely replacing Python constructs with their Haskell equivalents:
-- | All edits that are one edit away from @word@.
edits1 :: String -> [String]
edits1 word = deletes ++ transposes ++ replaces ++ inserts
where
letters = "abcdefghijklmnopqrstuvwxyz"
splits = [ splitAt i word | i <- [1 .. length word] ]
deletes = [ l ++ tail r | (l,r) <- splits, (not . null) r ]
transposes = [ l ++ r !! 1 : head r : drop 2 r | (l,r) <- splits, length r > 1 ]
replaces = [ l ++ c : tail r | (l,r) <- splits, (not . null) r, c <- letters ]
inserts = [ l ++ c : r | (l,r) <- splits, c <- letters]
Applying this function twice – first on the original word, then on each of its possible edits – is exactly what the edits2
function is for:
def edits2(word):
"All edits that are two edits away from `word`."
return (e2 for e1 in edits1(word) for e2 in edits1(e1))
As before, since this is a no-frills set comprehension, it can be written as a list comprehension in Haskell:
-- | All edits that are two edits away from @word@.
edits2 :: String -> [String]
edits2 word = [ e2 | e1 <- edits1 word, e2 <- edits1 e1 ]
That’s it. Let’s take it for a spin!
Demo
After cloning the repository and following the enclosed setup instructions, run cabal repl
or stack ghci
to start correcting your spelling:
$ cabal repl
Preprocessing library spell-0.1.0.0...
GHCi, version 8.0.2: http://www.haskell.org/ghc/ :? for help
Loaded GHCi configuration from /Users/noah/.ghci
[1 of 2] Compiling Paths_spell ( dist/build/autogen/Paths_spell.hs, interpreted )
[2 of 2] Compiling Spell ( src/Spell.hs, interpreted )
Ok, modules loaded: Paths_spell, Spell.
*Spell> correction "speling"
"spelling"
it :: String
*Spell> correction "cdoe"
"code"
it :: String
*Spell> correction "haskell"
"haskell"
it :: String
*Spell>
Leaving GHCi.
Analysis
As you will notice when running it, compared to the original Python implementation, the Haskell version performs significantly worse in one respect: parsing the file big.txt
and populating the mapping from each word to its number of occurrences. On my machine, Python manages this in just under a second while my Haskell implementation takes about 10 seconds.
Luckily, this is only required during the first call to the correction
function – any subsequent calls can skip this step and will run much faster, at which point the largely unoptimized Haskell translation “only” takes about 100 ms per word, about twice as long as Peter Norvig’s Python original (although this varies depending on the input).
Another thing worth noting is that I have decided against using a set-like data type like Data.Set
or equivalent third-party libraries here, opting for lists instead (even though the Python version employs sets). This is mainly due to performance and code clarity7 considerations. The resulting duplicate entries don’t seem to actually impact the result’s correctness and passing them up the call stack instead of performing a filtering pass at each stage8 turned out to be faster. An old version using Data.Set
was significantly slower that the implementation described in this article.
-
Like terseness? Take a look at K. If you come back with your sanity intact, you can make good money. ↩
-
Of course, I’m not the first to take this on. Other reimplementations of Peter Norvig’s algorithm can be found here, here, here and here. ↩
-
This is less than ideal, but it allows us to use the
words
function as a global constant, similar to theWORDS
variable in Python. Any alternatives, while being more idiomatic, would make the program harder to read due to the need of passing an extra parameter to almost every function. This would impact usage too: instead of simply calling thecorrection
function, the user would first have to read the text file and then pass its contents (or, depending on the implementation, the result of applying a function to its contents) tocorrection
. Of course, you could use the Reader monad to circumvent all this, but that would add complexity and it’s a bit more advanced than I’d like this article to be. ↩ -
Using a regex library here was significantly slower in my tests. ↩
-
Note that the
argmax
function is the Haskell equivalent to Python’smax(..., key=...)
. It comes as part of the list-extras package. ↩ -
Suppose the
word
is already spelled correctly. Thenknown [word]
is going to return a singleton list containingword
, which is non-empty, so it’s directly returned to the caller without prior computation ofknown $ edits1 word
orknown $ edits2 word
. ↩ -
In Python, sets are first-class citizens and can interoperate with lists easily thanks to dynamic typing. In Haskell, conversion between sets and lists is a bit more verbose. ↩
-
Which would run in if using
nub
, or if doing asort
prior to applying thenub
function from the data-ordlist package. ↩