A Simple Example of Web Scraping With the Html Agility Pack
Monday, March 18, 2013 at 7:55PM A coworker wanted to learn how to do basic web scraping – for example, finding all the <a href> links on a web page. Naturally, I directed him to the Html Agility Pack. He wanted a .NET Framework solution; otherwise, I would have recommended Python's Beautiful Soup library.
To get him started and to show how simple it was, I provided the following demo code. It doesn't do much, but will give anyone a quick start.
using System;
using System.Linq;
using HAP = HtmlAgilityPack;

namespace DemoHtmlAgilityPack
{
    class Program
    {
        private static void Main(string[] args)
        {
            using (var client = new System.Net.WebClient())
            {
                // Download the page to a temporary file
                var filename = System.IO.Path.GetTempFileName();
                client.DownloadFile("http://python.org", filename);

                // Load the downloaded file into an Html Agility Pack document
                var doc = new HAP.HtmlDocument();
                doc.Load(filename);
                var root = doc.DocumentNode;

                // Find every <a> element anywhere in the document
                var a_nodes = root.Descendants("a").ToList();
                foreach (var a_node in a_nodes)
                {
                    Console.WriteLine();
                    Console.WriteLine("LINK: {0}", a_node.GetAttributeValue("href", ""));
                    Console.WriteLine("TEXT: {0}", a_node.InnerText.Trim());
                }
            }
            Console.ReadKey();
        }
    }
}
Not hard at all!
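One small variation worth knowing: if you already have the markup in memory, you can skip the temp file entirely and hand the string to HtmlDocument.LoadHtml. A minimal sketch – the HTML string and the LoadHtmlDemo class name here are made up for illustration:

```csharp
using System;
using HAP = HtmlAgilityPack;

class LoadHtmlDemo
{
    static void Main()
    {
        // Parse an in-memory HTML string instead of a downloaded file
        var html = "<html><body>" +
                   "<a href=\"http://python.org\">Python</a>" +
                   "<a href=\"http://example.com\">Example</a>" +
                   "</body></html>";
        var doc = new HAP.HtmlDocument();
        doc.LoadHtml(html);

        // Same traversal as the demo above
        foreach (var a in doc.DocumentNode.Descendants("a"))
        {
            Console.WriteLine("LINK: {0}", a.GetAttributeValue("href", ""));
        }
    }
}
```

This is handy in unit tests, where you don't want each run hitting the network.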
saveenr
Reader Comments (3)
Thank you for getting me started with this simple example code. It was a great help. Programmers these days often neglect adequate documentation for end users and peers. The HTML Agility Pack is a perfect example. Thanks again for your help.
This post is awesome. Thank you. God bless you for simplifying things.
Great post. Your example is exactly what I need. Would you know why the "http://python.org" url works when downloading to a temp file, but a url such as "http://192.168.1.9/logs/TestLog/" does not? I get a "WebException was unhandled" exception when using it. I know the url is valid, as I can get to it with my browser.
Thanks.
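Regarding the WebException above: I can't say for certain without seeing the server, but a WebException carries diagnostic detail that usually points at the cause (a non-success HTTP status such as 401/403 on the directory listing, a connect failure, etc.). A hedged sketch of how to surface it – the URL is the commenter's, and the DiagnoseWebException class name is made up:

```csharp
using System;
using System.Net;

class DiagnoseWebException
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            try
            {
                client.DownloadFile("http://192.168.1.9/logs/TestLog/", "out.html");
            }
            catch (WebException ex)
            {
                // Status distinguishes DNS/connect failures, timeouts, protocol errors, etc.
                Console.WriteLine("Status: {0}", ex.Status);

                // If the server actually answered, the HTTP status code is on the response
                var response = ex.Response as HttpWebResponse;
                if (response != null)
                {
                    Console.WriteLine("HTTP: {0} {1}",
                        (int)response.StatusCode, response.StatusDescription);
                }
            }
        }
    }
}
```

Whatever it prints should narrow down why the browser succeeds (browsers often hold credentials or cookies) while WebClient fails.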