viziblr

A Simple Example of Web Scraping With the Html Agility Pack

Monday, March 18, 2013 at 7:55PM

A coworker wanted to learn how to do basic web scraping, for example finding all the <A HREF> links on a webpage. Naturally, I directed him to the Html Agility Pack. He wanted a .NET Framework solution; otherwise I would have recommended Python's Beautiful Soup library.

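To get him started and to show how simple it was, I provided the following demo code. It doesn't do much: it downloads a page to a temporary file, parses it, and prints the target and text of every link. Still, it will give anyone a quick start.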

using System;
using System.Linq;
using HAP = HtmlAgilityPack;

namespace DemoHtmlAgilityPack
{
    class Program
    {
        private static void Main(string[] args)
        {
            // Download the page to a temporary file.
            var filename = System.IO.Path.GetTempFileName();
            try
            {
                using (var client = new System.Net.WebClient())
                {
                    client.DownloadFile("http://python.org", filename);
                }

                // Parse the saved HTML.
                var doc = new HAP.HtmlDocument();
                doc.Load(filename);
                var root = doc.DocumentNode;

                // Print the href and text of every <a> element.
                var a_nodes = root.Descendants("a").ToList();
                foreach (var a_node in a_nodes)
                {
                    Console.WriteLine();
                    Console.WriteLine("LINK: {0}", a_node.GetAttributeValue("href", ""));
                    Console.WriteLine("TEXT: {0}", a_node.InnerText.Trim());
                }
            }
            finally
            {
                // Remove the temporary file once it has been parsed.
                System.IO.File.Delete(filename);
            }

            // Keep the console window open until a key is pressed.
            Console.ReadKey();
        }
    }
}
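
If the temporary file isn't needed, the same idea can be sketched by downloading the page into a string and parsing it in memory. This is only a rough variation, not a definitive implementation: LinkScraper and ExtractLinks are hypothetical names I made up for illustration, and the catch block simply reports a failed download instead of letting the WebException go unhandled.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using HAP = HtmlAgilityPack;

namespace DemoHtmlAgilityPackInMemory
{
    public static class LinkScraper
    {
        // Hypothetical helper (not part of the Html Agility Pack):
        // parses an HTML string and returns (href, trimmed text) for each <a> element.
        public static List<KeyValuePair<string, string>> ExtractLinks(string html)
        {
            var doc = new HAP.HtmlDocument();
            doc.LoadHtml(html);
            return doc.DocumentNode
                .Descendants("a")
                .Select(a => new KeyValuePair<string, string>(
                    a.GetAttributeValue("href", ""),
                    a.InnerText.Trim()))
                .ToList();
        }

        public static void Main(string[] args)
        {
            try
            {
                using (var client = new WebClient())
                {
                    // Fetch the page straight into memory; no temp file involved.
                    var html = client.DownloadString("http://python.org");
                    foreach (var link in ExtractLinks(html))
                    {
                        Console.WriteLine();
                        Console.WriteLine("LINK: {0}", link.Key);
                        Console.WriteLine("TEXT: {0}", link.Value);
                    }
                }
            }
            catch (WebException ex)
            {
                // ex.Status distinguishes connection failures, timeouts,
                // protocol errors, and so on.
                Console.Error.WriteLine("Download failed ({0}): {1}", ex.Status, ex.Message);
            }
        }
    }
}
```

Keeping the parsing in its own method means it can be tried on any HTML string, for example ExtractLinks("<a href=\"/about\"> About </a>"), without touching the network.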

Not hard at all!

saveenr

Reader Comments (3)

Thank you for getting me started with this simple example code. It was a great help. Programmers these days often neglect adequate documentation for end users and peers. The HTML Agility Pack is a perfect example. Thanks again for your help.

October 27, 2013 | Martin

This post is awesome. Thank you. God bless you for simplifying things.

February 6, 2014 | Jeson

Great post. Your example is exactly what I need. Would you know why the URL "http://python.org" works when downloading to a temp file but a URL such as "http://192.168.1.9/logs/TestLog/" does not? I get a "WebException was unhandled" exception when using "http://192.168.1.9/logs/TestLog/". I know that "http://192.168.1.9/logs/TestLog/" is valid because I can reach it with my browser.

Thanks.

May 18, 2014 | Ryan
