Finding all the A HREF Urls in an HTML document (even in malformed HTML)
Monday, October 16, 2006 at 11:04PM
The Need
Given an HTML file, you want to extract all the HREF URLs.
You could use a Regex
I've done this before, but haven't found it entirely reliable. I also use regexes so infrequently that it is painful to relearn the syntax every time.
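For comparison, here is a minimal sketch of what the regex approach might look like. The pattern and the sample input are my own illustration, not a recommended or complete solution: it assumes the href value is always quoted, and it shows one way such a pattern quietly misses links.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical pattern: captures a quoted href value inside an <a ...> tag.
// Unquoted values (href=foo.htm) are deliberately not handled.
Regex hrefPattern = new Regex(
    @"<a\b[^>]*?\bhref\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)')",
    RegexOptions.IgnoreCase);

// Made-up input: two quoted links and one unquoted link.
string html = "<a href=\"one.htm\">1</a> <A class='x' HREF='two.htm'>2</A> <a href=three.htm>3</a>";

List<string> urls = new List<string>();
foreach (Match m in hrefPattern.Matches(html))
{
    urls.Add(m.Groups["url"].Value);
}

// The unquoted third link is silently missed.
Console.WriteLine(string.Join(", ", urls)); // prints "one.htm, two.htm"
```

Every case like the unquoted attribute needs another tweak to the pattern, which is where the reliability problems tend to come from.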
What I recommend
Use the HTML Agility Pack. If you are familiar with the XML DOM, using the HTML Agility Pack will come naturally.
HTML Agility Pack URL
http://www.codeplex.com/Wiki/View.aspx?ProjectName=htmlagilitypack
Two things that make HTML Agility Pack interesting
- It doesn't depend on Internet Explorer
- It works on malformed HTML. See this post for a little context: .NET Html Agility Pack: How to use malformed HTML just like it was well-formed XML
Sample code
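// This isn't a full sample, but it is enough to see the value of using the HTML Agility Pack.
// Requires: using HtmlAgilityPack;
HtmlDocument input_doc = new HtmlDocument();
input_doc.Load("foo.htm");
// SelectNodes returns null (not an empty collection) when nothing matches.
HtmlNodeCollection links = input_doc.DocumentNode.SelectNodes("//a");
if (links != null)
{
    foreach (HtmlNode node in links)
    {
        // Anchors without an href attribute yield the default value "".
        string href_url = node.GetAttributeValue("href", "");
    }
}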