Finding all the A HREF URLs in an HTML document (even in malformed HTML)

The Need

Given an HTML file, you want to extract all the HREF URLs.


You could use a Regex

I've done this before, but haven't found it entirely reliable. I also use regexes so infrequently that it is painful to relearn the syntax every time.
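For reference, here is a minimal sketch of what the regex approach can look like. The class name, method name, and the pattern itself are my own illustration, not a recommended or complete solution. It only handles quoted attribute values, so an unquoted link like `<a href=c.html>` is skipped, which is exactly the kind of gap that makes this approach unreliable.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, written only to illustrate the regex approach.
public static class HrefRegexSketch
{
    // Matches href="..." or href='...' inside an <a ...> tag.
    // Group 1 is the opening quote character, group 2 is the URL.
    private static readonly Regex HrefPattern = new Regex(
        @"<a\b[^>]*?\bhref\s*=\s*([""'])(.*?)\1",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static List<string> ExtractHrefs(string html)
    {
        var urls = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            urls.Add(m.Groups[2].Value);
        }
        return urls;
    }

    public static void Main()
    {
        // The third link has an unquoted href, so this pattern does not find it.
        string html = "<p><a href=\"http://example.com/a\">A</a> "
                    + "<A HREF='b.html'>B</A> <a href=c.html>C</a></p>";

        foreach (string url in ExtractHrefs(html))
        {
            Console.WriteLine(url);
        }
    }
}
```

You can keep extending the pattern to cover more cases, but every extension makes it harder to read the next time you come back to it.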


What I recommend

Use the HTML Agility Pack. If you are familiar with the XML DOM, using the HTML Agility Pack will come naturally.


HTML Agility Pack URL



Two things that make HTML Agility Pack interesting

- It doesn't depend on Internet Explorer

- It works on malformed HTML. See this post for a little context: .NET Html Agility Pack: How to use malformed HTML just like it was well-formed XML


Sample code

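// This isn't a full sample, but it's enough to see the value of using the HTML Agility Pack.
// Requires: using System; using HtmlAgilityPack;

HtmlDocument input_doc = new HtmlDocument();
input_doc.Load("input.html"); // placeholder path; point this at your own HTML file

// SelectNodes returns null (not an empty collection) when nothing matches
HtmlNodeCollection nodes = input_doc.DocumentNode.SelectNodes("//a");
if (nodes != null)
{
    foreach (HtmlNode node in nodes)
    {
        // Returns "" when the <a> element has no href attribute
        string href_url = node.GetAttributeValue("href", "");
        Console.WriteLine(href_url);
    }
}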

