Wednesday, February 20, 2008

Deferred Execution, the Elegance of LINQ

One of the things I love about LINQ is its deferred execution model. It's the type of thing that makes sense academically when you first read about it (e.g. in Part Three of Scott Gunthrie's LINQ to SQL series), but for me anyway, it took some time to understand enough to use effectively.

For instance the Daily RSS Download open source application that I wrote about last week needs to download entries (posts) that are newly published since the last download. While it isn't a complicated problem, my first attempt at a solution didn't use the power of LINQ correctly. I'll explain my naïve solution in this post, describe how LINQ's deferred execution works (i.e. Lambda expressions), explain the problems with my solution, then give an the elegant solution that is only possible because of LINQ's deferred execution model. See if you can spot my error along the way.

Downloading the Latest Entries

Downloading the latest entries would be a ridiculously simple problem if there weren't multiple formats for RSS. But since the solution needs to support Atom and RSS 2.0 and 1.0 and potentially other future formats, the class structure should be set up appropriately:

The newspaper class primarily exists to enumerate feeds:

public class Newspaper {
    public void DownloadNow() {
      foreach
(Feed objFeed in Settings.Feeds) {
        objFeed.DownloadRecentEntries(...);
      }
    }
}

The Feed class is abstract and during runtime is either an RssFeed or an AtomFeed. The relevant function Feed.DownloadRecentEntries() calls the abstract Feed.GetEntries() method, which returns a group of Entry objects.

public abstract class Feed {
    public
abstract IEnumerable<Entry> GetEntries(XDocument rssfeed);

    public
void DownloadRecentEntries(...) {
        XDocument xdocFeed = XDocument.Load(Url);
        IEnumerable<Entry> lstRecentPosts = GetEntries(xdocFeed);

        foreach (Entry objEntry in lstRecentPosts) {
            objEntry.Download(...)
        }
    }
}

The Feed classes, RssFeed and AtomFeed then implement GetEntries as follows:

publpublic class RssFeed : Feed {
    public override IEnumerable<Entry> GetEntries(XDocument rssfeed) {
        return from item in rssfeed.Descendants("item")
               where (DateParser.ParseDateTime(item.Element("pubDate").Value)
                   >= this.LastDownloaded)
                   || this.LastDownloaded == null
               select
(Entry)new RssEntry(item, this);
    }
}

public class AtomFeed : Feed {
    public override IEnumerable<Entry> GetEntries(XDocument rssfeed) {
        return from item in rssfeed.Descendants(_atomNamespace + "entry")
               where (DateParser.ParseDateTime(
                   item.Element(_atomNamespace + "published").Value)
                    >= this.LastDownloaded)
                    || this.LastDownloaded == null
               select (Entry)new AtomEntry(item, this);
    }
}

Yes, that's all LINQ to XML in there. It looks a lot like SQL, but as you'll see in a second it's really just glorified syntactic sugar. Expressive though, isn't it? While the astute reader may have already spotted the inelegance of my solution, for those unfamiliar with LINQ, let's first describe what AtomFeed.GetEntries() does.

What is this Deferred Execution Stuff?

If you already understand LINQ and how delayed execution works feel free to skip this section. For everyone else it's important to understand that the following line:

from item in rssfeed.Descendants("item")
               where (DateParser.ParseDateTime(item.Element("pubDate").Value)
                   >= this.LastDownloaded)
                   || this.LastDownloaded == null
               select
(Entry)new RssEntry(item, this);

Is actually just syntactic sugar for the following set of statements:

rssfeed
    .Descendants(_atomNamespace + "entry")
    .Where( item => (DateParser.ParseDateTime(
        item.Element(_atomNamespace + "published").Value)
        >= this.LastDownloaded)
        || this.LastDownloaded == null)
    .Select( item => (Entry)new AtomEntry(item, this));

Now XDocument.Descendants() returns IEnumerable which most definitely does not have a Where() function on it. And if you look at the return type of Where() in this context, it returns an IEnumerable which definetly does not have a Select() method on it. That's because Where() and Select() are extension methods, meaning you can attach them on to just about anything. They're new to C# 3.0 and are beyond the scope of this article.

But more important for the topic of deferred execution is the => operator, which is a Lambda expression and is also new to C# 3.0. The best way to understand them is that they are essentially syntactic sugar for an anonymous method (e.g. a type safe function pointer to code). So we could again rewrite our code as follows:

rssfeed
    .Descendants(_atomNamespace + "entry")
    .Where(delegate(XElement item) {
        return (DateParser.ParseDateTime(
            item.Element(_atomNamespace + "published").Value)
            >= this.LastDownloaded) || this.LastDownloaded == null; }
)
    .Select(delegate(XElement item) {
        return (Entry)new AtomEntry(item, this); }
);

Back in familiar territory yet? If not you probably aren't familiar with C# 2.0. In the background the compiler takes the anonymous methods above and turns them into methods on the current class and instantiates new delegates of the correct type that points to them and passes them to the Select() and Where() methods.

The The key thing to note is that the arguments for select and where are delegates, and so when those delegates are executed is beyond our control. In fact if you put a Console.WriteLine or a breakpoint inside of the AtomEntry constructor, it won't get called until the resulting IEnumerable is enumerated, specifically the following line in the first code sample:

foreach (Entry objEntry in lstRecentPosts) {

So that's delayed execution. But understanding how it works and how to use it are completely different things.

The Inelegant Solution

Getting back to my code sample you may have picked up that my where clause is the mistake. I implemented it like this because RSS and Atom have different field names for the published date. But the way I wrote it I'd have to make two changes if I wanted to change which entries to download. Ok, big deal, I'm extremely unlikely to make changes to that where clause right? Or I wasn't until I wanted functionality to set some defaults based on the average length of posts prior to downloading posts. Basically:

public static Feed CreateFeed(string strUrl, int intDisplayOrder) {
    IEnumerable<Entry> lstRecentEntries = feed.GetEntries(rssfeed);
    double intAveragePostSize = lstRecentEntries.Average(
        i => i.Description.Length);
    // if the feeds posts are typically small then include the
    // description field in the summary and download the content
    // for the main article from the link
   
if (intAveragePostSize < 1000) {
        ...
    } else {
        ...
    }
}

Except this now ties me to the were clause, when what I'd really like to do is just get the average post size for the last couple of posts. The problem is that GetEntries() isn't generic enough.

The Elegant Solution

The The solution is then to normalize out (excuse the database terminology) the where clause into the two methods that use GetEntries(). So GetEntries() becomes simple:

public override IEnumerable<Entry> GetEntries(XDocument rssfeed) {
      return from item in rssfeed.Descendants(_atomNamespace + "entry")
               select (Entry)new AtomEntry(item, this);
}

And then Feed.CreateFeed() and Feed.DownloadRecentEntries() become more complicated

public abstract class Feed {
    public
abstract IEnumerable<Entry> GetEntries(XDocument rssfeed);

    public
static Feed CreateFeed(string strUrl, int intDisplayOrder) {
        IEnumerable<Entry> lstEntries = feed.GetEntries(rssfeed);
        // get the five most recent posts
        IEnumerable<Entry> lstRecentEntries =
            from entry in lstEntries.Take(5)
            select entry;

        double
intAveragePostSize = lstRecentEntries.Average(
            i => i.Description.Length);
        if (intAveragePostSize < 1000) {
            ...
        } else {
            ...
        }
    }

    public
void DownloadRecentEntries(...) {
        XDocument xdocFeed = XDocument.Load(Url);
        IEnumerable<Entry> lstEntries = GetEntries(xdocFeed);
        // get newly published posts
        IEnumerable<Entry> lstRecentPosts = from entry in lstEntries
            where (entry.Published >= this.LastDownloaded)
                || this.LastDownloaded == null
            select entry;


        foreach (Entry objEntry in lstRecentPosts) {
            objEntry.Download(...)
        }
    }
}

Note that we now have a second LINQ statement that runs against the results of the LINQ statement in GetEntries(). But since nothing's been executed yet we're just building out the statement that we will eventually run when the resulting IEnumerable if enumerated. So we've now spread our LINQ statements across an inheriting and a base class, and in process we've made GetEntries() extremely generic.

Conclusion

So what's the big deal? The big deal is that we can spread our data access statements across multple classes and because of deferred execution we don't need to worry about the performance of generic methods that are closer to the data that don't contain a "where" clause. This may not be a huge deal in this example, but it becomes extremely powerful when the user interface tier can tack on "order by" statements or "filters" BEFORE anything is executed against your data store. And that, for me, is at the heart of the beauty of LINQ.

No comments: