Scraping the NHL 2010-2011 Schedule with C#, LINQ, and the HTML Agility Pack
Friday, October 8, 2010 at 5:08PM
Back in 2007 I first explained how to do this in a blog post: Scraping The NHL 2007-2008 Schedule Using C# 3.0, LINQ, extension methods, and the Html Agility Pack
Things have changed since then. There’s a new .NET Runtime, a new Visual Studio, a new HTML Agility Pack, and a new format for the NHL schedule. In this post I’ll explain how to use the latest versions of these components to get the NHL calendar. It’s a simple and useful example of LINQ and the HTML Agility Pack working together.
GETTING THE NHL SCHEDULE
First, let’s find the schedule. We start here: http://www.nhl.com/ice/schedulebyseason.htm
Make sure you have selected the season you want and press the GO button.
What you’ll get is a rather large web page, which you should save to disk.
If you get a warning message while saving, just press Yes.
The saved file is large: almost 3 MB.
THE STRUCTURE OF THE NHL SCHEDULE
When I first did this in 2007 the NHL schedule was simply a big HTML <table>. Things are very different in the new schedule. Curiously, a table is not used; instead everything is in a massive collection of <div> elements.
Each row in the schedule corresponds to a <div> element with a class attribute set to “skedDataRow”.
Inside a row structure is a set of <div> elements that correspond to the columns. The class attributes make it clear how these are mapped. The first team is the visiting team, and the second team is the home team.
Some fields are more complex. The time field has two components: one for the Eastern Standard timezone and one for the local timezone.
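To make the layout concrete, here is a small sketch that loads a hand-written fragment and lists the column classes of one row. The fragment is an assumption reconstructed from the class and id names used later in this post, not a copy of the NHL markup; the text values are placeholders, and the id "skedStartTimeLocal" is a made-up name for the local-time element.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class RowStructureSketch
{
    // Hypothetical fragment: one schedule row with placeholder values.
    // "skedStartTimeLocal" is an invented id used only for illustration.
    const string SampleRow = @"
<div class=""skedDataRow"">
  <div class=""skedDataRow date"">SAMPLE DATE</div>
  <div class=""skedDataRow team"">VISITING TEAM</div>
  <div class=""skedDataRow team"">HOME TEAM</div>
  <div class=""skedDataRow time"">
    <div id=""skedStartTimeEST"">SAMPLE TIME ET</div>
    <div id=""skedStartTimeLocal"">SAMPLE TIME LOCAL</div>
  </div>
</div>";

    // Returns the class attribute of each column <div> in the sample row.
    public static List<string> ColumnClasses()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(SampleRow);

        // The row is the first <div> whose class is exactly "skedDataRow".
        var row = doc.DocumentNode.Descendants("div")
            .First(n => n.GetAttributeValue("class", "") == "skedDataRow");

        // Only direct child <div> elements count as columns.
        return row.Elements("div")
            .Select(n => n.GetAttributeValue("class", ""))
            .ToList();
    }

    public static void Main()
    {
        foreach (var cls in ColumnClasses())
            Console.WriteLine(cls);
    }
}
```

Because the column classes all begin with "skedDataRow", matching the row on the exact class value is what keeps the columns from being mistaken for rows.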
At this point we know enough to begin scraping.
USING THE HTML AGILITY PACK
The first thing you should be aware of is that the HTML Agility Pack (HAP) has been updated since I first used it. It is now very LINQ-friendly, which means I no longer have to rely on extension methods to simplify the API. In fact, with the new HAP, the code is very simple and obvious.
The entire code is here:
// Requires "using System.Linq;" and a reference to the HTML Agility Pack.
// clean_text is a helper that is not shown in this listing.
public void get_schedule()
{
    string local_fname = "nhl-2010-2011.htm";
    var schedule_doc = new HtmlAgilityPack.HtmlDocument();
    schedule_doc.Load(local_fname);

    // find every <div> whose class is exactly "skedDataRow" (one per game)
    var row_nodes = schedule_doc.DocumentNode.DescendantNodes()
        .Where(n => n.Name == "div")
        .Where(n => n.GetAttributeValue("class", null) == "skedDataRow");

    foreach (var row_node in row_nodes)
    {
        // the direct child <div> elements are the columns of the row
        var div_nodes = row_node.Elements("div").ToList();
        var date_node = div_nodes.FirstOrDefault(n => n.GetAttributeValue("class", null) == "skedDataRow date");
        var team_nodes = div_nodes.Where(n => n.GetAttributeValue("class", null) == "skedDataRow team").ToList();
        var starttime_node = div_nodes.FirstOrDefault(n => n.GetAttributeValue("class", null) == "skedDataRow time");

        // the time column may be missing, so guard before looking inside it
        var starttimeest_node = starttime_node == null
            ? null
            : starttime_node.Elements("div").FirstOrDefault(n => n.GetAttributeValue("id", null) == "skedStartTimeEST");

        string date = clean_text(date_node != null ? date_node.InnerText : "NO DATE");
        var teams = team_nodes.Select(n => clean_text(n.InnerText)).ToList();
        string starttimeest = clean_text(starttimeest_node != null ? starttimeest_node.InnerText : "NO START TIME");

        // skip rows that do not list both a visiting and a home team
        if (teams.Count < 2) continue;

        System.Console.WriteLine(" {0} | {1} @ {2} | {3}", date, teams[0], teams[1], starttimeest);
    }
}
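The listing calls clean_text, whose body does not appear in this post. The following is only a guess at what such a helper might do, assuming its job is to decode HTML entities and collapse runs of whitespace; the class name ScheduleText is hypothetical, and this is not the code from the downloadable project.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class ScheduleText
{
    // Hypothetical stand-in for the clean_text helper used in get_schedule().
    // Assumption: it should turn raw InnerText into a single tidy line.
    public static string clean_text(string text)
    {
        if (text == null) return string.Empty;

        // Decode entities such as &nbsp; and &amp; into real characters.
        string decoded = WebUtility.HtmlDecode(text);

        // Collapse every run of whitespace (including non-breaking spaces
        // and line breaks) into one space, then trim both ends.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        // Prints "Visiting Team"
        Console.WriteLine(clean_text("  Visiting&nbsp;Team \n"));
    }
}
```

To use a helper like this from get_schedule() as written, it would need to live in the same class, or the calls would need the class name in front.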
Running get_schedule() prints one line per game to the console, in the form: date | visiting team @ home team | start time (Eastern).
GET THE SOURCE CODE
This zip file includes the full VS2010 project and raw HTM file with the NHL Schedule for the 2010-2011 Season.