Inside F#

Brian's thoughts on F# and .NET

Dear Proggit: graphs are cool, but I prefer F#, so I graphed the subreddit interconnections with F# and DGML

Posted by Brian on May 26, 2010

Today I saw this and I thought, hey, that’s pretty cool!

So of course I had to code my own version with F#.  First off, the results, when starting from the “music” subreddit:

SubRedditConnections

(To see the full size version, click here.)

Each node contains the name of the subreddit and the number of readers.  The font size is proportional to log(numReaders).  The color is determined by community age, with young ones being red and rainbowing across to old ones being blue.  And of course the graph arrows are the links to other related subreddits.

All the info is scraped from the reddit sidebars using some hackish Regexes and XML parsing.

Of course I used F# async for the non-blocking I/O to grab multiple web pages concurrently so that the program runs fast.  I used an agent and .NET 4.0 concurrent collections to manage mutable updates to state without creating data races.

All told, just about 70 lines of code to scrape reddit for the info and 40 more to generate the DGML that VS2010 then renders into the pretty graph.  I just hacked it together tonight, so the code is maybe not awesome, but it is good enough to share.  Code is below, have fun with it.  F# and DGML are cool!

// reddit uses XHTML, hurrah
open System.Xml.Linq
let XN name = XName.Get(name, "")

// state used by the workers (multi-threaded)
let results = new System.Collections.Concurrent.ConcurrentBag<_>()

// state managed by the agent (which is logically-single-threaded)
let mutable visited = new System.Collections.Generic.HashSet<string>()
let mutable started = new System.Collections.Generic.HashSet<string>()
let allDone = new System.Threading.ManualResetEvent(false)

type Message =
    | EnsureVisited of string
    | FinishedVisiting of string

printfn "Periodically showing number of known remaining links to follow (as progress indicator)..."
let rec agent = MailboxProcessor.Start(fun mbox ->
    let numIters = ref 0
    let rec Loop() =
        async {
            let! msg = mbox.Receive()
            match msg with
            | EnsureVisited url -> 
                if not(visited.Contains(url)) && not(started.Contains(url)) then
                    started.Add(url) |> ignore
                    visit url |> Async.Start 
            |  FinishedVisiting url ->
                started.Remove(url) |> ignore
                visited.Add(url) |> ignore
            incr numIters
            if !numIters % 10 = 0 then
                printf "%d " started.Count 
            if started.Count <> 0 then
                return! Loop()
            else
                allDone.Set() |> ignore
        }
    Loop())

and visit reddit = async {
    use wc = new System.Net.WebClient()
    let! xhtml = wc.AsyncDownloadString(new System.Uri("http://www.reddit.com/r/"+reddit))
    let y = System.Text.RegularExpressions.Regex.Match(xhtml, "<span class=\"age\">a community for (\d+) years?</span>")
    let m = System.Text.RegularExpressions.Regex.Match(xhtml, "<span class=\"age\">a community for (\d+) months?</span>")
    let months = if y.Success then 12*int(y.Groups.[1].Value) elif m.Success then int(m.Groups.[1].Value) else 0
    let getTitleBox (xhtml:string) =
        // extremely quick and dirty way to parse out just the bit I want
        let start = xhtml.IndexOf("<div class=\"titlebox\">")
        let mutable finish = xhtml.IndexOf("</div>", start)
        while(finish <> -1 && try XElement.Parse(xhtml.Substring(start,finish-start+6)); false with e-> true) do
            finish <- xhtml.IndexOf("</div>", finish+1)
        if finish = -1 then failwith "could not parse"
        xhtml.Substring(start,finish-start+6)
    let getAttrVal attrName (e:XElement) =
        let s = e.Attributes(XN attrName)
        if Seq.length s = 1 then Some((Seq.head s).Value) else None
    let xe = XElement.Parse(getTitleBox xhtml)
    let numReaders = xe.Descendants(XN "span") 
                  |> Seq.filter (fun e -> match getAttrVal "class" e with None -> false | Some s -> s="number") |> Seq.head 
    let urls = 
        xe.Descendants(XN "a") |> Seq.choose (getAttrVal "href") |> Seq.choose (fun u -> 
            let m = System.Text.RegularExpressions.Regex.Match(u, @"^(?:http://(?:www.)?reddit.com)?/r/(\w+)/?$")
            if m.Success then Some(m.Groups.[1].Value.ToLowerInvariant()) else None) |> set
    results.Add(reddit, numReaders.Value, urls, months)
    for u in urls do
        agent.Post(EnsureVisited u)
    agent.Post(FinishedVisiting reddit)
}

// kick it off with a starting reddit
agent.Post(EnsureVisited "music")
allDone.WaitOne() |> ignore

// make DGML of the results
let sb = new System.Text.StringBuilder()
let add (s:string) = sb.AppendLine(s) |> ignore
add @"<?xml version=""1.0"" encoding=""utf-8""?>"
add @"<DirectedGraph GraphDirection=""BottomToTop"" Layout=""Sugiyama"" xmlns=""http://schemas.microsoft.com/vs/2009/dgml"">"
add @"  <Nodes>"
for (reddit,numReaders,links,months) in results do
    add <| sprintf "    <Node Id=\"%s\" Label=\"%s
%s\" NumReaders=\"%d\" Months=\"%d\" />" 
            reddit reddit numReaders (System.Int32.Parse(numReaders, System.Globalization.NumberStyles.AllowThousands)) months
add @"  </Nodes>"
add @"  <Links>"
for (reddit,numReaders,links, months) in results do
    for l in links do
        add <| sprintf "    <Link Source=\"%s\" Target=\"%s\" />" reddit l
add @"  </Links>"
add @"  <Styles>"
add @"    
    <Style TargetType=""Node"" GroupLabel=""NumReaders"" ValueLabel=""Function"">
      <Condition Expression=""NumReaders &gt; 0"" />
      <Setter Property=""FontSize"" Expression=""Math.Max(12,3*Math.Log(NumReaders))"" />
    </Style>
    <Style TargetType=""Node"" GroupLabel=""Months"" ValueLabel=""Function"">
      <Condition Expression=""Months &gt; 12"" />
      <Setter Property=""Background"" Value=""#FF6666FF"" />
    </Style>
    <Style TargetType=""Node"" GroupLabel=""Months"" ValueLabel=""Function"">
      <Condition Expression=""Months = 12"" />
      <Setter Property=""Background"" Value=""#FF66FF66"" />
    </Style>
    <Style TargetType=""Node"" GroupLabel=""Months"" ValueLabel=""Function"">
      <Condition Expression=""Months &gt;= 9"" />
      <Condition Expression=""Months &lt; 12"" />
      <Setter Property=""Background"" Value=""#FFFFFF44"" />
    </Style>
    <Style TargetType=""Node"" GroupLabel=""Months"" ValueLabel=""Function"">
      <Condition Expression=""Months &gt;= 5"" />
      <Condition Expression=""Months &lt; 9"" />
      <Setter Property=""Background"" Value=""#FFDDBB66"" />
    </Style>
    <Style TargetType=""Node"" GroupLabel=""Months"" ValueLabel=""Function"">
      <Condition Expression=""Months &lt; 5"" />
      <Setter Property=""Background"" Value=""#FFFF6666"" />
    </Style>"
add @"  </Styles>"
add @"</DirectedGraph>"
System.IO.File.WriteAllText(@"graph.dgml", sb.ToString())

Advertisements

One Response to “Dear Proggit: graphs are cool, but I prefer F#, so I graphed the subreddit interconnections with F# and DGML”

  1. Brian said

    After I posted it, I realized one small error. For example, http://www.reddit.com/r/electronicmusic has a link to http://idm.reddit.com/ , but I didn\’t realize that was a legal subreddit URL, and I fail to take it into account. As a result, my graph misses e.g. the link from electronicmusic back to idm. Would probably need to add another URL Regex to account for this.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: