Friday, Oct 8, 2010

Scraping the NHL 2010-2011 Schedule with C#, LINQ, and the HTML Agility Pack

 

Back in 2007 I first explained how to do this in a blog post: Scraping The NHL 2007-2008 Schedule Using C# 3.0, LINQ, extension methods, and the Html Agility Pack

Things have changed since then. There’s a new .NET runtime, a new Visual Studio, a new HTML Agility Pack, and a new format for the NHL schedule. In this post I’ll explain how to use the latest versions of these components to scrape the NHL schedule. It’s a simple but useful example of LINQ and the HTML Agility Pack working together.

GETTING THE NHL SCHEDULE

First, let’s find the schedule. We start here: http://www.nhl.com/ice/schedulebyseason.htm

Make sure you have selected the season you want and press the GO button.

The result is a rather large web page, which you should save to disk.

If you get a warning message at this point, just press Yes.

The saved file is large: almost 3 MB.

THE STRUCTURE OF THE NHL SCHEDULE

When I first did this in 2007, the NHL schedule was simply one big HTML <table>. Things are very different in the new schedule. Curiously, no table is used; instead, everything lives in a massive collection of <div> elements.

Each row in the schedule corresponds to a <div> element with a class attribute set to “skedDataRow”.

Inside each row is a set of <div> elements that correspond to the columns. Their class attributes (for example “skedDataRow date”, “skedDataRow team”, and “skedDataRow time”) make the mapping clear. The first team is the visiting team, and the second team is the home team.

Some fields are more complex. The time field has two components: one for Eastern Standard Time and one for the local time zone.

At this point we know enough to begin scraping.
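
To make the row layout concrete, here is a minimal sketch that parses a hand-written fragment with the HTML Agility Pack and lists the class of each column. The fragment is an assumption reconstructed from the class and id names mentioned in this post; it is not copied from the real NHL page, and the placeholder cell text, the `RowStructureSketch` class, and the `ColumnClasses` method are hypothetical names used only for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // external library, not part of the .NET base class library

public static class RowStructureSketch
{
    // Hypothetical fragment: one schedule row with placeholder cell text.
    // Only the class names and the skedStartTimeEST id come from the post.
    public const string SampleRow =
        "<div class=\"skedDataRow\">" +
          "<div class=\"skedDataRow date\">DATE</div>" +
          "<div class=\"skedDataRow team\">VISITOR</div>" +
          "<div class=\"skedDataRow team\">HOME</div>" +
          "<div class=\"skedDataRow time\">" +
            "<div id=\"skedStartTimeEST\">TIME EST</div>" +
            // local-time element; its real id is not given in the post
            "<div>TIME LOCAL</div>" +
          "</div>" +
        "</div>";

    // Returns the class attribute of every direct child <div> of the first row.
    public static List<string> ColumnClasses(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // the row is the <div> whose class is exactly "skedDataRow"
        var row = doc.DocumentNode.Descendants("div")
            .First(n => n.GetAttributeValue("class", null) == "skedDataRow");

        // Elements("div") yields direct children only, i.e. the columns
        return row.Elements("div")
            .Select(n => n.GetAttributeValue("class", ""))
            .ToList();
    }
}
```

Calling `RowStructureSketch.ColumnClasses(RowStructureSketch.SampleRow)` should return the date, the two team columns, and the time column, in that order. The nested time `<div>` elements are not listed because they are not direct children of the row.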

USING THE HTML AGILITY PACK

The first thing you should be aware of is that the HTML Agility Pack (HAP) has been updated since I first used it. It is now very LINQ-friendly, which means I no longer have to rely on my own extension methods to simplify the API. In fact, with the new HAP, the code is very simple and obvious.

The entire code is here:

        // Requires "using System.Linq;" and a reference to the HTML Agility Pack.
        // clean_text is a helper whose definition is not listed in this post.
        public void get_schedule()
        {
            string local_fname = "nhl-2010-2011.htm";

            var schedule_doc = new HtmlAgilityPack.HtmlDocument();
            schedule_doc.Load(local_fname);

            // select every <div> whose class is exactly "skedDataRow" (one per game)
            var row_nodes = schedule_doc.DocumentNode.Descendants("div")
                .Where(n => n.GetAttributeValue("class", null) == "skedDataRow");

            foreach (var row_node in row_nodes)
            {
                // the direct child <div> elements are the columns of the row
                var div_nodes = row_node.Elements("div").ToList();

                var date_node = div_nodes.FirstOrDefault(n => n.GetAttributeValue("class", null) == "skedDataRow date");
                var team_nodes = div_nodes.Where(n => n.GetAttributeValue("class", null) == "skedDataRow team").ToList();
                var starttime_node = div_nodes.FirstOrDefault(n => n.GetAttributeValue("class", null) == "skedDataRow time");

                // guard against a missing time column before looking for the EST value
                var starttimeest_node = starttime_node == null ? null
                    : starttime_node.Elements("div").FirstOrDefault(n => n.GetAttributeValue("id", null) == "skedStartTimeEST");

                string date = clean_text(date_node != null ? date_node.InnerText : "NO DATE");

                // the first team is the visitor, the second is the home team
                var teams = team_nodes.Select(n => clean_text(n.InnerText)).ToList();
                if (teams.Count < 2) continue; // skip rows that do not describe a game

                string starttime_est = clean_text(starttimeest_node != null ? starttimeest_node.InnerText : "NO START TIME");

                System.Console.WriteLine(" {0} | {1} @ {2} | {3}", date, teams[0], teams[1], starttime_est);
            }
        }
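
The listing calls clean_text, but its definition does not appear in this post. The following is a minimal sketch of what such a helper might look like, assuming its job is to decode HTML entities and collapse runs of whitespace; the actual helper may behave differently. The `ScheduleText` class name is hypothetical.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class ScheduleText
{
    // Hypothetical stand-in for the clean_text helper used in get_schedule().
    public static string clean_text(string s)
    {
        if (s == null) return string.Empty;

        // decode entities such as &nbsp; and &amp; (WebUtility is available from .NET 4.0)
        string decoded = WebUtility.HtmlDecode(s);

        // &nbsp; decodes to U+00A0; turn it into an ordinary space
        decoded = decoded.Replace('\u00A0', ' ');

        // collapse newlines, tabs, and repeated spaces into one space, then trim the ends
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

With this sketch, `ScheduleText.clean_text("  Fri&nbsp;Oct 8,\n 2010 ")` returns `"Fri Oct 8, 2010"`. InnerText from the HAP keeps the source page's line breaks and entities, which is why some cleanup of this kind is needed before printing.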

Running this code prints one line per game to the console, in the form: date | visiting team @ home team | start time (EST).


GET THE SOURCE CODE

The zip file at the link below includes the full VS2010 project and the raw HTM file containing the NHL schedule for the 2010-2011 season.

http://cid-1ff099edb1c7ebfa.office.live.com/self.aspx/Public/Blog%20Posts/MSDN%20Saveenr%20Blog/2010/2010-10-08%20NHL%20Schedule%20Scraping
