A Simple Example of Web Scraping With the Html Agility Pack  
Monday, March 18, 2013 at 7:55PM
saveenr

A coworker wanted to learn how to do basic web scraping, for example finding all the <A HREF> links on a webpage. Naturally, I directed him to the Html Agility Pack. He wanted a .NET Framework solution; otherwise I would have recommended Python's Beautiful Soup library.

To get him started and to show how simple it is, I provided the following demo code. It downloads a page, parses it, and prints the target and text of every link. It doesn't do much, but it will give anyone a quick start.

using System;
using System.Linq;
using HAP = HtmlAgilityPack;
namespace DemoHtmlAgilityPack
{
    class Program
    {
        private static void Main(string[] args)
        {
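            using (var client = new System.Net.WebClient())
            {
                // Download the page to a temporary file so it can be parsed from disk.
                var filename = System.IO.Path.GetTempFileName();
                try
                {
                    client.DownloadFile("http://python.org", filename);
                    // Parse the saved HTML.
                    var doc = new HAP.HtmlDocument();
                    doc.Load(filename);
                    // Collect every <a> element in the document.
                    var root = doc.DocumentNode;
                    var a_nodes = root.Descendants("a").ToList();
                    foreach (var a_node in a_nodes)
                    {
                        Console.WriteLine();
                        // An anchor without an href prints an empty LINK value.
                        Console.WriteLine("LINK: {0}", a_node.GetAttributeValue("href", ""));
                        Console.WriteLine("TEXT: {0}", a_node.InnerText.Trim());
                    }
                }
                finally
                {
                    // Remove the temporary file once parsing is done.
                    System.IO.File.Delete(filename);
                }
            }
            // Keep the console window open until a key is pressed.
            Console.ReadKey();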
        }
    }
}
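
One possible variation, sketched here rather than taken from the original demo, is to skip the temporary file and parse the HTML straight from a string. The LinkExtractor class and its ExtractLinks method are hypothetical names introduced for illustration, and the choice to skip anchors without an href is an assumption, not something the demo does. It is meant to be built as its own console project, since it has its own Main.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HAP = HtmlAgilityPack;

namespace DemoHtmlAgilityPack
{
    static class LinkExtractor
    {
        // Hypothetical helper (not part of the original demo): parses HTML held in
        // memory and returns the href and trimmed text of each <a> that has an href.
        public static List<KeyValuePair<string, string>> ExtractLinks(string html)
        {
            var doc = new HAP.HtmlDocument();
            doc.LoadHtml(html); // parse from a string; no temporary file involved

            return doc.DocumentNode
                .Descendants("a")
                // Assumption: anchors with a missing or empty href are skipped.
                .Where(a => a.GetAttributeValue("href", "") != "")
                .Select(a => new KeyValuePair<string, string>(
                    a.GetAttributeValue("href", ""),
                    a.InnerText.Trim()))
                .ToList();
        }
    }

    class InMemoryProgram
    {
        private static void Main(string[] args)
        {
            using (var client = new System.Net.WebClient())
            {
                // Same URL as the demo; DownloadString returns the page as text.
                var html = client.DownloadString("http://python.org");
                foreach (var link in LinkExtractor.ExtractLinks(html))
                {
                    Console.WriteLine();
                    Console.WriteLine("LINK: {0}", link.Key);
                    Console.WriteLine("TEXT: {0}", link.Value);
                }
            }
        }
    }
}
```

Keeping the parsing in a method that takes a string also makes it easy to try against a small hand-written HTML snippet without touching the network.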

Not hard at all!

Article originally appeared on viziblr (http://viziblr.com/).
See website for complete article licensing information.