
A Simple Example of Web Scraping With the Html Agility Pack  

A coworker wanted to learn how to do basic web scraping, for example finding all the <A HREF> links on a webpage. Naturally, I directed him to the Html Agility Pack. He wanted a .NET Framework solution; otherwise I would have recommended Python's Beautiful Soup library.

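To get him started and to show how simple it is, I provided the following demo code. It downloads a page to a temporary file, parses it, and prints the target and text of every link. It doesn't do much, but it will give anyone a quick start.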

using System;
using System.Linq;
using HAP = HtmlAgilityPack;

namespace DemoHtmlAgilityPack
{
    class Program
    {
        private static void Main(string[] args)
        {
            using (var client = new System.Net.WebClient())
            {
                // Download the page to a temporary file
                var filename = System.IO.Path.GetTempFileName();
                client.DownloadFile("http://python.org", filename);

                // Parse the downloaded HTML
                var doc = new HAP.HtmlDocument();
                doc.Load(filename);

                var root = doc.DocumentNode;

                // Collect every <a> element in the document
                var a_nodes = root.Descendants("a").ToList();

                foreach (var a_node in a_nodes)
                {
                    Console.WriteLine("LINK: {0}", a_node.GetAttributeValue("href", ""));
                    Console.WriteLine("TEXT: {0}", a_node.InnerText.Trim());
                }
            }
        }
    }
}


Not hard at all!
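
If the HTML is already in memory, the same idea can be sketched without the temporary file. The variant below is only an illustration, not part of the original demo: the LinkExtractor class, its ExtractLinks method, and the sample HTML string are hypothetical names made up for this example. It assumes the HtmlAgilityPack package is referenced, and it skips <a> elements that have no href attribute, such as named anchors.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HAP = HtmlAgilityPack;

namespace DemoHtmlAgilityPack
{
    // Hypothetical helper, not part of the original demo.
    public static class LinkExtractor
    {
        // Returns (href, text) pairs for every <a> element that has an href attribute.
        public static List<KeyValuePair<string, string>> ExtractLinks(string html)
        {
            var doc = new HAP.HtmlDocument();
            doc.LoadHtml(html); // parse from a string instead of a file

            return doc.DocumentNode
                .Descendants("a")
                .Where(a => a.Attributes["href"] != null) // skip anchors without href
                .Select(a => new KeyValuePair<string, string>(
                    a.GetAttributeValue("href", ""),
                    a.InnerText.Trim()))
                .ToList();
        }
    }

    class Program
    {
        private static void Main(string[] args)
        {
            // Sample markup made up for this sketch: one real link, one named anchor.
            const string html =
                "<p><a href=\"/about\"> About </a> <a name=\"top\">anchor</a></p>";

            foreach (var link in LinkExtractor.ExtractLinks(html))
            {
                Console.WriteLine("LINK: {0}", link.Key);
                Console.WriteLine("TEXT: {0}", link.Value);
            }
        }
    }
}
```

Keeping the parsing in a method that takes a string makes it easy to try against small hand-written snippets before pointing it at a live page.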


Reader Comments (3)

Thank you for getting me started with this simple example code. It was a great help. Programmers these days often neglect adequate documentation for end-user/peers. The HTML Agility Pack is a perfect example. Thanks again for your help.

October 27, 2013 | Unregistered Commenter Martin

This post is awesome. Thank you. God bless you for simplifying things.

February 6, 2014 | Unregistered Commenter Jeson

Great post. Your example is exactly what I need. Would you know why "http://python.org" url will work when downloading to a temp file but a url such as "" would not? I get a "WebException was unhandled" exception when using "". I know that "" is valid as I can get to it with my browser.


May 18, 2014 | Unregistered Commenter Ryan
