Inside F#

Brian's thoughts on F# and .NET

How to _really_ be an evil genius (with F#)

Posted by Brian on February 14, 2010

(Starting aside: There’s a new F# release out, check it out.)

My coworker Chris claims to be an evil genius.  He recently posted a blog about how to automatically solve the Bookworm word search.  I was looking at his code, and thought, “There’s no way he’s ever going to take over the world with that implementation…” 

So I wrote my own version.  Which is three times shorter.  And uses less memory.  And is much simpler to read.  Oh, and it runs way faster too.  Which begs the question…

Who’s the evil genius now?

All trash talk and no code makes for a bad blog entry, so let’s get to it.

The Problem

There’s a game called Bookworm you can play on the web (or also on some crazy pocket-gadget you might have heard of).  You have a grid of letter tiles and you’re trying to find words like “BREAK”:

bookworm

There’s a dictionary of all the legal words you can use (details below).  The goal is, given as input one of these letter grids, find all the words.  You can only move to hexagonally adjacent tiles, and you can’t use the same letter tile more than once in a word. 

Simple, right?  Classic search problem.  I immediately started thinking of the optimal data structures to use to make this simple and efficient to code. 

The Strategy – the dictionary

There is a data structure that’s especially well-suited to representing a word dictionary for this type of problem.  It’s called a “prefix tree” or “trie”, and it is a very useful data structure to have in your back pocket.

(Chris also chose a trie for his implementation – awwww, snap! – Chris actually did something right!  Of course, his implementation is memory-heavy, creating a lot of garbage along the way.)

So let’s explain this trie thing.  It’s simple.  We represent a dictionary of words as a bunch of TrieNodes.  Each TrieNode contains two things.  First, an array of 26 references to other TrieNodes (26 letters in the alphabet – get it?).  Second, a boolean, which repesents if a word just ended here.  A picture is worth a thousand words.  Rather than just copy a diagram off wikipedia like some people (*cough*), I’ll draw my own to show off my mad skillz with MS Paint:

TrieDiagram

In the diagram, each big rectangle is a TrieNode, and each node has got a 26-wide array of “pointers” to other nodes, and a boolean (represented by a red or green box – I’ll let you guess which color means ‘true’).  If you don’t see a squiggle-line coming out of the bottom of one of the little boxes in the array, it means that array cell is null, dig?  So to find out of “BET” is a word in the trie, we start at the root node and look at cell.[1] in the array (A=0, B=1, C=2, … and we want the “B”, the first letter of “BET”).  It’s got a pointer to another TrieNode, so we follow that pointer and then look up the next letter in cell.[4] (because? … E=4, right), and that’s a pointer to another TrieNode, so we go there and look up cell.[19] (T=19), and that’s a pointer to another node, so we go that that node, but now we have no more letters, and thus we have traversed “B-E-T” and check the boolean in the final TrieNode we landed at to see if a word ends here.  Yes, the boolean is true (green), so we determine that “BET” was in this trie.  You can follow similar logic for “BE” or “A”.

What if we try to look up “BAT”?  We follow cell.[1] down to the ‘B’ node, but then when we look there in cell.[0] for the ‘A’, we see it’s null (no squiggly line coming out of there).  That means we stop searching, “BAT” is not stored in this trie; “BA” is not the prefix of any word in the trie.

Is just “B” a word in this trie?  Again we start by following cell.[1] down to the ‘B’ node.  But now there are no more letters in the word we are trying to look up, so we check the boolean here.  It’s false (red), which means “B” is not in the trie.

If you want to add a word to an existing trie, you just have to walk down each letter of the word you’re adding (creating fresh TrieNodes along the way if you encounter nulls), and then once you’ve finished all the letters of the word you’re adding, flip the boolean in the final TrieNode to true, to mark it as a word-ender.

Great, you now know everything there is to know about how tries work.  Let’s code the sucker up.

Trie implementation

Simple is good.  Let’s keep things simple.  All the Bookworm words contain only UPPERCASE LETTERS.  Pipe down, I’m not shouting at you, I’m illustrating.  So here’s a useful function we’ll be calling from a few places:

let IsLetter(c)   =    c>='A' && c<='Z'

Get it? Good, because if you don’t, you may want to quit reading right now.  You might prefer IsUppercaseLetter or IsLegalBookwormLetter or some such name for this function.  But I like this name.  The whole program is about Bookworm, and as far as Bookworm is concerned, there are only 26 “letters”, and these are they.  End of story.  Moving on!

(In point of fact, Bookworm treats “Qu” as one letter, but Chris ignores this, and so will I.  (But at least I’m explicitly pointing it out.))

What’s the ideal public API for a TrieNode?  Well, think back to our diagram.  At any given node, we want to know if it’s an “end of word”, so we need to surface that boolean as a property, like “node.IsEndOfWord”.  At a given node, we want to be able to “move forward” to some letter, e.g. go to ‘E’ from here.  So we would like an indexer that lets us do node.[‘E’] to get a reference to the next node down.  And finally we need a way to populate the tree, so let’s have a method that lets us say rootNode.Add(“BET”) to add that word to the trie.  (Yes, this means the trie is mutable.  It is much simpler to build the trie using mutation.  Just because you’re an F# programmer and all into “functional” doesn’t mean you can’t use arrays and mutation when they’re the simplest tool for the job!  Be pragmatic.) 

Hmm, is that everything?  Let’s try to write the client code we’ll need to build a trie and do some ad-hoc testing, in order to test drive the API we’re picturing.  Assume we have a list of words already and we want to make a trie of all those words:

let trie = new TrieNode()
for w in words do
    trie.Add(w)

Yeah, that looks good.  And now let’s hack a quick function to check if words are in the trie:

let rec IsWord(s:string, t:TrieNode) =
    if t=null then false
    elif s.Length=0 then t.IsEndOfWord
    else IsWord(s.Substring(1), t.[ s.[0] ])

We’ll be walking down the trie, and if we “fall off the end” (reach null), then there’s no word with that prefix, else if we walked all the way to the end of the word we’re looking up, when we just check the boolean in the current TrieNode to see if it’s a word, else if we still have more letters, we recurse.  The Substring() call at each step is inefficient (allocating a new string each time), but that’s ok, because I’m only using this function for this quickie test suite:

assert(IsWord("ABSTRACTIONISMS", trie))
assert(IsWord("MONSOON", trie))
assert(not(IsWord("QZXWWK", trie)))

to verify that things appear to maybe be ok.  Great.  We have a public API design that seems ok, and a tiny set of tests, let’s try to write our TrieNode class.  In four tiny steps:

[<AllowNullLiteral>]
type TrieNode() =
    let nextChar = Array.create 26 null
    let mutable isEndOfWord = false

We define a TrieNode class.  By default, instances of classes in F# can’t be null, but we want to use nulls to signify non-entries in the trie, so we add an attribute to allow it.  The class has two fields – the array of size 26, which we initialize to nulls, and the boolean, which we start as false.

We want the boolean exposed as a public property, so, shazam:

    member this.IsEndOfWord = isEndOfWord

We want to be able to use a character to index the next node down, so we write that indexer property.  Only the getter needs to be public in the API, but as we’ll see in a moment, we’ll want a private setter too (to use in the Add() implementation).  We need to convert ‘A’ to 0, ‘B’ to 1, etc., so we write

    member this.Item with get(c:char) = nextChar.[int c - int 'A']
                     and private set(c:char) v = nextChar.[int c - int 'A'] <- v

(Note that in F#, if you write an indexer property named “Item”, then you can use the object with array syntax, e.g. node.[‘E’].) 

Finally we write our Add() method, the only bit of code with any meat:

    member this.Add(word) = 
        assert(word |> Seq.forall IsLetter)
        this.Add(word,0)
    member private this.Add(word:string,i) =
        if i = word.Length then
            isEndOfWord <- true
        else 
            match this.[word.[i]] with
            | null -> let newNode = new TrieNode()
                      newNode.Add(word,i+1)
                      this.[word.[i]] <- newNode
            | node -> node.Add(word,i+1)

Add asserts that the word contains nothing but LETTERS and then calls a little recursive helper method, which takes as arguments the word we’re adding, and which character index we’re at (equivalently, how deep into the trie we’ve gone).  If we have traversed all the way down to the end of the word, we’re at the right node in the trie to just flip the ‘end of word’ bit to true to mark that there’s a word that ends here.  Otherwise, we still need to go deeper, so we look in this node’s array at the current character.  If it’s null, we need to create a fresh new node at this letter, and add the rest of the word to that.  Otherwise if there’s already a node there, then just recurse on down.

Feel free to trace through this on a sheet of paper with a fresh trie and add “A”, “BE”, and “BET” and see if you get a pretty picture like my earlier diagram.

Ok, that’s it.  The whole TrieNode class is less than 20 lines of code.

Now we need our dictionary of English words.  It turns out there’s some huge dictionary file full of

WORDS
LIKE
THIS

one per line, as well as a small file of words that aren’t allowed.  I dunno, I just did basically what Chris did:

let words = seq {
    let toIgnore = System.IO.File.ReadAllLines("IgnoredWords.txt") |> Set.ofArray
    for word in System.IO.File.ReadAllLines("TWL06.txt") do
        assert(word |> Seq.forall IsLetter)
        if not (toIgnore.Contains word) && word.Length >= 3 then
            yield word }

(I’ve linked the text files containing the words at the end of this blog entry, right above the full source code.)  Oh yeah, Bookworm only allows words that are at least three letters long, hence that “Length>=3” check.

Ok, I hit F5 to run my code so far in VS, and before you can say “blueberry pie” the thing is done, no asserts.  We have a big English dictionary stored in a trie, now we are go for search!

The Strategy – word search

One of the thing that makes Bookworm interesting is the hexagonal grid.  (This ain’t your run-of-the-mill square-grid game like Boggle!)  So we need a good board representation that enables easy hexagonal search.

Chris, predictably, goes nuts here, with some crazy “ITile” interface with a Neighbors property and Locations and all kinds of garbage like that.  Classic overengineering.  Hey Chris, it’s no wonder your volcano lair got foreclosed on!  Sure, blame the economy, I’m sure it wasn’t because of all the wasteful–

<deep breath>

Ok, I’m digressing.  Back on task.

If you want hexagonal search, there’s a good data structure for this.  It’s not rocket science.  You’ve heard of this data structure before.  They even have it in C.  It’s called a 2-D array.  Check it out:

        O U B N O V S 
         J A P A L O M
        G T E J T S H 
         U A E O F S A
        N O A E L E P 
         O S O U G N P
        A T F G A I I 

It’s a 2-D array of characters, where every other character is a space (kinda like a checkerboard, with letters as black squares and spaces as red ones, or whatnot).  We can index every cell with a pair of (x,y) coordinates.  And if, say, the green ‘A’ above happens to be at coordinate (x,y), then its six neighbors are like (x-1,y-1), (x+1,y-1), (x-2,y), (x+2,y), (x+1,y-1), (x+1,y+1).  See?  Simple.  It even looks like the board, once you tilt your head to the left (each ‘row’ is actually a ‘column’).

Of course, say we were at the ‘O’ in the upper left corner at coordinate (0,0).  If we try to get the neighbor at (x-1,y-1), we’ll look into cell [-1,-1] of the array and get an IndexOutOfRangeException.  Now we could be like Chris, and have an elaborate ‘validIndex’ checking function at every array lookup, to ensure we don’t go off the side of the board.  Or we can play it smart and simple, and just surround the array with some extra padding of spaces, so it won’t matter if we look outside the word grid.  Yeah, let’s do that.

Let’s also be flexible and allow any number of columns and rows. 

Board&search implementation

Consider the screenshot again:

bookworm

We can type up this input easily as

let COLS = 7
let ROWS = 8
let data = [|  // columns
        "  K M E M T O O "
        " E F K I M C A D"
        "  A A E R E A F "
        " T D A I B N D J"
        "  V F F A T E E "
        " O J T R F I R R"
        "  I R D R E E O "
    |]    

where we say there are 7 columns (COLS), each with at most 8 letters (ROWS), and an array of strings makes it easy to lay out the data like the screenshot (don’t forget to tilt your head to the left when looking at the code – this is your exercise for the day).  Ok, let’s convert this to a 2-D array with extra space-padding around the outside.  I’ll call this structure a “board”:

// surround entire grid with border of spaces (one extra column, two extra rows)
let BOARDCOLS = COLS + 2
let BOARDROWS = 2*ROWS + 4
let MakeBoard(columns:string[]) =
    assert(columns<>null)
    assert(columns.Length = COLS)
    assert(columns |> Seq.forall (fun s -> s.Length = ROWS*2))
    for i in 0..COLS-1 do
        for j in 0..ROWS*2-1 do
            let c = columns.[i].[j]
            if (i+j)%2 = 0 then assert(c=' ' || IsLetter c)
            else                assert(c=' ')
    Array2D.init BOARDCOLS BOARDROWS (fun c r ->
        if c=0 || c=BOARDCOLS-1 || 
           r=0 || r=1 || r=BOARDROWS-2 || r=BOARDROWS-1 then ' '
        else columns.[c-1].[r-2])

The first 8 lines of the MakeBoard function are just assertions to help ensure you didn’t screw up when you typed in the data.  All the real work is in the last 4 lines, which convert the array of strings into a 2-D array of characters with extra space padding around the edge.  The first and last columns are all space characters, and the first two and last two rows are all space characters, and the interior is populated with data from our string array.  Now we have a nice 2-D array where we can easily do hexagonal-neighbor-indexing from any letter without falling off the edge of the data structure.  We’re ready to rock.  No, wait:

let board = MakeBoard(data)    

Ok, now we’re ready to rock.

It’s time to search.  I want to write a function along the lines of “ok, I have a dictionary of English words encoded in a trie, and I have this 2-D array, and I want to find, say, all the words starting with the letter ‘M’ located at [5,2]” or something.  Think recursively.  At any given point, we’re at some TrieNode inside the trie, and at some [col,row] in the array.  Maybe we have gone off the edge of the letters into the ‘space’ border – no more words here!  But if we are on a letter, we can look it up in the trie.  If we get to a TrieNode where node.IsEndOfWord is true, then great, we’re at the last letter of a word, score one for the good guys.  Regardless, we need to recursively keep searching all the nearby neighbors with our new position in the tree.  And to ensure we don’t use the same letter tile twice in one word, we can “blot it out” by overwriting it with a space when we use it (and then “write it back” as we backtrack out of the recursive search). 

If you’ve written recursive searches before, that may all seem old hat; if not, welcome to the party.  It’s simpler just to show the code before talking about it any more:

let rec Find(col, row, t:TrieNode) = seq {
    assert((col+row)%2 = 1) // we should only be searching LETTER positions
    let c = board.[col,row]
    if c <> ' ' then
        board.[col,row] <- ' '
        match t.[c] with
        | null -> ()
        | node -> 
            if node.IsEndOfWord then
                yield [c]
            for suf in Find(col-1, row-1, node) do yield c::suf
            for suf in Find(col-1, row+1, node) do yield c::suf        
            for suf in Find(col  , row-2, node) do yield c::suf        
            for suf in Find(col  , row+2, node) do yield c::suf        
            for suf in Find(col+1, row-1, node) do yield c::suf
            for suf in Find(col+1, row+1, node) do yield c::suf        
        board.[col,row] <- c }

Just 17 lines of code.  I assert to ensure I’m correctly only looking at the “letter” portion of the checkerboard.  Grab the current character.  If it’s not a space, then it’s a letter we’ll try to use.  Temporarily blot it out with a space so we don’t use it again within “this word”.  (Yeah, that’s right, it’s our pragmatic pal, “mutation”.)  Now look up that letter in the trie.  If null, we’re out of luck, no words with this prefix.  Otherwise we got a trie node.  If this is the end of a word, yield a list containing this character (the last letter of the word).  Recursively search in all 6 neighbor directions for all word suffixes; for each one we find, cons this character onto the front to yield the longer suffix including the character from this recursive step.  That’s all, we’re done with the search on this character, so undo the “blotting out” before we exit, stage left.

That’s it.  With this search function in hand, it’s easy to generate a sorted list of all the words in the whole grid:

let allWordsSorted = 
    seq {    
        for col in 0..BOARDCOLS-1 do
            for row in 0..BOARDROWS-1 do
                if board.[col,row] <> ' ' then
                    yield! Find(col, row, trie) 
    } 
    |> Seq.map (fun w -> new string(w |> Seq.toArray))
    |> Set.ofSeq    // remove duplicates
    |> Seq.toArray 
    |> Array.sortWith (fun a b -> match compare a.Length b.Length with 
                                  | 0 -> compare a b
                                  | x -> x)

We loop over the whole grid and call Find() on every starting letter, resulting in one big sequence of all the words starting from every position.  The “words” yielded from our Find function have type “list<char>”, so we convert these to strings.  Then we remove any duplicate words by making a Set of the whole sequence.  We then convert this to an array, and sort it, first by word length, then by the string itself (alphabetically). 

We can print out the results

for w in allWordsSorted do
    printfn "%s" w
printfn "%d words" allWordsSorted.Length 

and verify we have the same output as Chris.  Except of course that our program is way faster.  (By about a factor of 6 on my box, to create the trie and find all the words.  But who’s counting?  Oh, right, the guys at the bank with the deed to the volcano lair.)

Here’s a data set that’s more fun:

let COLS = 21
let ROWS = 7
let data = [|  // columns
        "O U B N O V S "
        " J A P A L O M"
        "G T E J T S H "
        " U A E O F S A"
        "N O A E L E P "
        " O S O U G N P"
        "A T F G A I I "
        " W U B N O V S"
        "J E P A L O M "
        " G S E J T Y C"
        "U A O O F Y R "
        " N O M E L U T"
        "O S O E G E I "
        " H T F N A I Y"
        "O U B N E V S "
        " J A P A S O M"
        "G T E J T S C "
        " U A E O F Y R"
        "N O A E L U T "
        " O S O U G E I"
        "H T F G A I Y "
    |]

The output for that guy ends with


HAPPINESS
LANGUAGES
AWESOMENESS
732 words

Awesomeness indeed.

The Smackdown

Let’s face it: Chris may have wrote the book on F#, but as far as the F# blogosphere goes… this is my house.  Some may think Chris deserves to be on the receiving end of trash talk like “Yo’ momma’s so fat, she sat on a binary tree and turned it into a linked list in constant time”.  But not me; I’ll stay classy, and let my code do the trash-talking for me.

Chris Smith – Evil genius?  Riiiiight… half right, to be exact.  It’s left as an exercise to the reader to figure out which half.  (BOOM!  “Oh no he didn’t!”)

(Yeah, Chris, I know just how you’re feeling right now.)

The Challenge

Think you can do better?  I’d like to see you try!  Word up, bookworm!

The Source Code

Unlike Chris, who had to link to his code because it was too long to all fit on his blog, my code fits nicely right below.  And you can find links to the TWL06 and IgnoredWords text files here.

let IsLetter(c)   =    c>='A' && c<='Z'

[<AllowNullLiteral>]
type TrieNode() =
    let nextChar = Array.create 26 null
    let mutable isEndOfWord = false
    member this.IsEndOfWord = isEndOfWord
    member this.Item with get(c:char) = nextChar.[int c - int 'A']
                     and private set(c:char) v = nextChar.[int c - int 'A'] <- v
    member this.Add(word) = 
        assert(word |> Seq.forall IsLetter)
        this.Add(word,0)
    member private this.Add(word:string,i) =
        if i = word.Length then
            isEndOfWord <- true
        else 
            match this.[word.[i]] with
            | null -> let newNode = new TrieNode()
                      newNode.Add(word,i+1)
                      this.[word.[i]] <- newNode
            | node -> node.Add(word,i+1)

let words = seq {
    let toIgnore = System.IO.File.ReadAllLines("IgnoredWords.txt") |> Set.ofArray
    for word in System.IO.File.ReadAllLines("TWL06.txt") do
        assert(word |> Seq.forall IsLetter)
        if not (toIgnore.Contains word) && word.Length >= 3 then
            yield word }

let trie = new TrieNode()
for w in words do
    trie.Add(w)

let rec IsWord(s:string, t:TrieNode) =
    if t=null then false
    elif s.Length=0 then t.IsEndOfWord
    else IsWord(s.Substring(1), t.[ s.[0] ])
    
assert(IsWord("ABSTRACTIONISMS", trie))
assert(IsWord("MONSOON", trie))
assert(not(IsWord("QZXWWK", trie)))

/////////////////////////////////////////////////////////

let COLS = 7
let ROWS = 8
let data = [|  // columns
        "  K M E M T O O "
        " E F K I M C A D"
        "  A A E R E A F "
        " T D A I B N D J"
        "  V F F A T E E "
        " O J T R F I R R"
        "  I R D R E E O "
    |]    
    
// surround entire grid with border of spaces (one extra column, two extra rows)
let BOARDCOLS = COLS + 2
let BOARDROWS = 2*ROWS + 4
let MakeBoard(columns:string[]) =
    assert(columns<>null)
    assert(columns.Length = COLS)
    assert(columns |> Seq.forall (fun s -> s.Length = ROWS*2))
    for i in 0..COLS-1 do
        for j in 0..ROWS*2-1 do
            let c = columns.[i].[j]
            if (i+j)%2 = 0 then assert(c=' ' || IsLetter c)
            else                assert(c=' ')
    Array2D.init BOARDCOLS BOARDROWS (fun c r ->
        if c=0 || c=BOARDCOLS-1 || 
           r=0 || r=1 || r=BOARDROWS-2 || r=BOARDROWS-1 then ' '
        else columns.[c-1].[r-2])

let board = MakeBoard(data)    

let rec Find(col, row, t:TrieNode) = seq {
    assert((col+row)%2 = 1) // we should only be searching LETTER positions
    let c = board.[col,row]
    if c <> ' ' then
        board.[col,row] <- ' '
        match t.[c] with
        | null -> ()
        | node -> 
            if node.IsEndOfWord then
                yield [c]
            for suf in Find(col-1, row-1, node) do yield c::suf
            for suf in Find(col-1, row+1, node) do yield c::suf        
            for suf in Find(col  , row-2, node) do yield c::suf        
            for suf in Find(col  , row+2, node) do yield c::suf        
            for suf in Find(col+1, row-1, node) do yield c::suf
            for suf in Find(col+1, row+1, node) do yield c::suf        
        board.[col,row] <- c }

let allWordsSorted = 
    seq {    
        for col in 0..BOARDCOLS-1 do
            for row in 0..BOARDROWS-1 do
                if board.[col,row] <> ' ' then
                    yield! Find(col, row, trie) 
    } 
    |> Seq.map (fun w -> new string(w |> Seq.toArray))
    |> Set.ofSeq    // remove duplicates
    |> Seq.toArray 
    |> Array.sortWith (fun a b -> match compare a.Length b.Length with 
                                  | 0 -> compare a b
                                  | x -> x)

printfn ""
for w in allWordsSorted do
    printfn "%s" w
printfn "%d words" allWordsSorted.Length 

P.S. You may have noted that the tenor of this blog post is markedly different from others I’ve posted.  No, I haven’t lost my mind (at least, not recently).  Rather, Chris and I are friends, and we had discussed different implementation strategies before any of the blogging started, and decided that, rather than a dry, boring comparison of different techniques, it would be fun to do it this way with… “Drama!” (Or as Chris calls it, “drama-lama-ding-dong”.  Seriously.  I’m not making this up.)  Hence this post; hope you enjoyed it.  By the way, regarding “The Challenge”, if you have a simpler/faster/better algorithm for this problem, we’ll be interested to see it!  Feel free to post it on your own blog and call us out!  After all, that’s how we do it in the FP.  Word.

About these ads

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
Follow

Get every new post delivered to your Inbox.

Join 30 other followers

%d bloggers like this: