Monday, March 18, 2013

A Simple Example of Web Scraping With the Html Agility Pack  

A coworker wanted to learn how to do basic web scraping, for example finding all the <A HREF> links on a webpage. Naturally, I directed him to the Html Agility Pack. He wanted a .NET Framework solution; otherwise I would have recommended Python's Beautiful Soup library.

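To get him started and to show how simple it is, I provided the following demo code. It doesn't do much (it downloads a page to a temporary file, parses it, and prints the target and text of every link), but it will give anyone a quick start.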

using System;
using System.Linq;
using HAP=HtmlAgilityPack;

namespace DemoHtmlAgilityPack
{
    class Program
    {
        private static void Main(string[] args)
        {
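            using (var client = new System.Net.WebClient())
            {
                // Download the page to a temporary file on disk.
                var filename = System.IO.Path.GetTempFileName();

                client.DownloadFile("http://python.org", filename);

                // Parse the saved HTML, then remove the temporary file.
                var doc = new HAP.HtmlDocument();
                doc.Load(filename);
                System.IO.File.Delete(filename);

                var root = doc.DocumentNode;

                // Collect every <a> element anywhere in the document.
                var a_nodes = root.Descendants("a").ToList();

                foreach (var a_node in a_nodes)
                {
                    // Print the link target (empty if there is no href) and its visible text.
                    Console.WriteLine();
                    Console.WriteLine("LINK: {0}", a_node.GetAttributeValue("href",""));
                    Console.WriteLine("TEXT: {0}", a_node.InnerText.Trim());
                }
            }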

            Console.ReadKey();
        }
    }
}

Not hard at all!
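
If you already have the HTML in a string, a variation on the demo skips the temporary file and uses an XPath query instead of Descendants. This is only a sketch: the LinkExtractor class, the ExtractLinks method, and the sample HTML are names I made up for illustration, not part of the Html Agility Pack. It relies on HtmlDocument.LoadHtml and HtmlNode.SelectNodes, and it assumes a build of the library with XPath support, which the .NET Framework build has.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HAP = HtmlAgilityPack;

namespace DemoHtmlAgilityPackXPath
{
    public static class LinkExtractor
    {
        // Hypothetical helper: returns (href, trimmed text) for every <a> that has an href.
        public static List<KeyValuePair<string, string>> ExtractLinks(string html)
        {
            var doc = new HAP.HtmlDocument();
            doc.LoadHtml(html); // parse directly from a string, no temp file

            // SelectNodes returns null (not an empty collection) when nothing matches.
            var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
            if (nodes == null)
            {
                return new List<KeyValuePair<string, string>>();
            }

            return nodes
                .Select(n => new KeyValuePair<string, string>(
                    n.GetAttributeValue("href", ""),
                    n.InnerText.Trim()))
                .ToList();
        }
    }

    class Program
    {
        private static void Main(string[] args)
        {
            // Made-up sample input: one real link and one anchor without an href.
            const string html =
                "<html><body><a href=\"/about\"> About </a><a name=\"top\">Top</a></body></html>";

            foreach (var link in LinkExtractor.ExtractLinks(html))
            {
                Console.WriteLine("LINK: {0}", link.Key);
                Console.WriteLine("TEXT: {0}", link.Value);
            }
        }
    }
}
```

The XPath filter //a[@href] skips anchors that have no href attribute, whereas Descendants("a") returns them too and the demo above prints an empty LINK for each one.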

Reader Comments (3)

Thank you for getting me started with this simple example code. It was a great help. Programmers these days often neglect adequate documentation for end-user/peers. The HTML Agility Pack is a perfect example. Thanks again for your help.

October 27, 2013 | Unregistered Commenter Martin

This post is awesome. Thank you. God bless you for simplifying things.

February 6, 2014 | Unregistered Commenter Jeson

Great post. Your example is exactly what I need. Would you know why "http://python.org" url will work when downloading to a temp file but a url such as "http://192.168.1.9/logs/TestLog/" would not? I get a "WebException was unhandled" exception when using "http://192.168.1.9/logs/TestLog/". I know that "http://192.168.1.9/logs/TestLog/" is valid as I can get to it with my browser.

Thanks.

May 18, 2014 | Unregistered Commenter Ryan
