Finding all the A HREF Urls in an HTML document (even in malformed HTML)

The Need

Given an HTML file, you want to extract all the HREF URLs.


You could use a Regex

I've done this before, but haven't found it entirely reliable. I also use regexes so infrequently that it is painful to relearn the syntax every time.


What I recommend

Use the HTML Agility Pack. If you are familiar with the XML DOM, using the HTML Agility Pack will come naturally.


HTML Agility Pack URL



Two things that make HTML Agility Pack interesting

- It doesn't depend on Internet Explorer

- It works on malformed HTML. See this post for a little context: .NET Html Agility Pack: How to use malformed HTML just like it was well-formed XML


Sample code

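// This isn't a full sample, but it is enough to see the value of using the HTML Agility Pack.
// Requires: using HtmlAgilityPack;

HtmlDocument input_doc = new HtmlDocument();
input_doc.Load("input.html"); // placeholder path; point this at your own HTML file

HtmlNodeCollection links = input_doc.DocumentNode.SelectNodes("//a");
if ( links != null ) // SelectNodes returns null when nothing matches
{
    foreach ( HtmlNode node in links )
    {
        // returns "" when the <a> tag has no href attribute
        string href_url = node.GetAttributeValue("href", "");
    }
}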
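For anyone who wants something to paste into a console project, here is a minimal sketch of the same idea as a complete program. It assumes the HtmlAgilityPack NuGet package is referenced. The HrefExtractor class, the ExtractHrefs helper, and the sample HTML string are made up for illustration and are not part of the library. The input is deliberately malformed (unclosed tags, no html or body wrapper) to exercise the parser on the kind of markup this post is about.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

static class HrefExtractor
{
    // Hypothetical helper (not part of the library): collects the href value
    // of every <a> element that has an href attribute.
    public static List<string> ExtractHrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from a string instead of a file

        var urls = new List<string>();

        // "//a[@href]" skips anchors without an href (e.g. named anchors)
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
        {
            // SelectNodes returns null, not an empty collection, when nothing matches
            return urls;
        }

        foreach (HtmlNode node in links)
        {
            urls.Add(node.GetAttributeValue("href", ""));
        }
        return urls;
    }

    static void Main()
    {
        // Deliberately malformed sample input: unclosed <p> and <a> tags
        string html = "<p>See <a href=\"http://example.com/one\">one<p><a href='/two'>two</a>";

        foreach (string url in ExtractHrefs(html))
        {
            Console.WriteLine(url);
        }
    }
}
```

Returning an empty list instead of null keeps the calling code free of null checks. If you need absolute URLs, you would still have to resolve relative values such as /two against the page's base address yourself.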


