Finding all the A HREF Urls in an HTML document (even in malformed HTML)

The Need

Given an HTML file, you want to extract all the HREF URLs.


You could use a Regex

I've done this before, but haven't found it entirely reliable. I also use regexes so infrequently that it is painful to relearn the syntax every time.
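For comparison, here is a minimal sketch of what the regex approach tends to look like in C#. The class name, method name, and the pattern itself are my own illustration, not taken from any library, and the pattern deliberately handles only quoted attribute values.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class HrefRegexSketch
{
    // Matches href="..." or href='...' inside an <a ...> tag.
    // Assumption: attribute values are quoted. Unquoted values are not matched.
    private static readonly Regex HrefPattern = new Regex(
        @"<a\b[^>]*?\shref\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)')",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static List<string> ExtractHrefs(string html)
    {
        var urls = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            // Both alternatives capture into the same named group.
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }
}
```

Given `<a href="a.html">x</a><A HREF='b.html'>y</A><a href=c.html>z</a>`, this returns a.html and b.html but silently misses c.html because its value is unquoted. It will also match anchors that sit inside HTML comments or script blocks. Gaps like these are why I stopped trusting the approach.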


What I recommend

Use the HTML Agility Pack. If you are familiar with the XML DOM, using the HTML Agility Pack will come naturally.


HTML Agility Pack URL


Two things that make HTML Agility Pack interesting

- It doesn't depend on Internet Explorer

- It works on malformed HTML. See this post for a little context: .NET Html Agility Pack: How to use malformed HTML just like it was well-formed XML


Sample code

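// This isn't a full sample, but it is enough to see the value of using the HTML Agility Pack.
// Requires: using System; using HtmlAgilityPack;

HtmlDocument input_doc = new HtmlDocument();
input_doc.Load("input.htm"); // hypothetical file name; use LoadHtml(string) for markup held in a string

// SelectNodes returns null, not an empty collection, when nothing matches
HtmlNodeCollection links = input_doc.DocumentNode.SelectNodes("//a[@href]");
if (links != null)
{
    foreach (HtmlNode node in links)
    {
        string href_url = node.GetAttributeValue("href", "");
        Console.WriteLine(href_url);
    }
}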



It's Helvetica's world, we just live in it.

A film about a typeface to be released in 2007:

Microsoft uses a typeface called "Arial" which, as I discovered only recently, is NOT the same as the "Helvetica" font found on Macs. The differences are subtle to my untrained eye, but those with a background in typography don't take the discrepancies lightly.


On the battle between Helvetica and Arial

Helvetica vs. Arial:

How to spot Arial:

Arial or Helvetica?:


Wikipedia's article on Helvetica


The real evil: Comic Sans

Arial's existence is tolerated by those in the know. Comic Sans is hated. That is a shame, because its designer, Vincent Connare, also designed one of my favorite fonts: Trebuchet. Read what Vincent says about Comic Sans.

Trivia: I first knew Vincent as a teammate on an ice hockey team. I remember him as a skilled player and an excellent sportsman. Only later did I discover his contributions to Microsoft's typography.

Wikipedia's article on Comic Sans:


Check out NodeBox - a playground for graphics

NodeBox is a fun tool for experimenting with 2D graphics. NodeBox version 1.0 only runs on Mac OS X, but a beta version of NodeBox V2, found on the NodeBox research wiki, runs on both Windows and Mac OS X (via Jython).

Before I knew NodeBox version 2 existed, I used Visual Studio 2005 to create a demo WinForms app that used GDI+ (System.Drawing) and IronPython. I plan on using this tool to play with some ideas on visualizing data.

A screenshot of NodeBox V1



I always wondered what Anisotropic Filtering did

And thanks to the following page (pretty pictures!) I know I should turn it on when I play games:

From the article:

"In my opinion, anisotropic filtering is the most important single image quality setting available on today's 3D hardware"

And a link to Wikipedia for some more information



Use the built-in Batch Rename in Windows Explorer to give files consistent and readable names

The Scenario

One has a lot of oddly named files, such as those produced by a digital camera.

Example of what a digital camera might produce:

  • pic383718.jpg
  • pic120453.jpg
  • pic938889.jpg
  • pic109012.jpg
  • pic433590.jpg
  • pic219093.jpg

The Pain Experienced

  • The names are meaningless
  • The ordering is unclear
  • One has to open them up to determine their content and order

What is Desired

  • Give the files nice names that make sense without having to open them up.
  • A way to do this without having to download or purchase a tool

Windows XP comes with a built-in solution to batch rename files

  • Browse in Windows Explorer to the files you want to fix
  • Switch to Details view (recommended, but not required)
  • Order the view by time or name or size as desired
  • Multi-select all the files with names you don't like
  • While the files are selected, right click on the file you want to consider as the *first* one. The context menu will appear.
  • Select Rename
  • The file upon which you right-clicked will now show that you can edit its name
  • Change the name as desired, add a space and a starting number in parentheses, and leave the extension alone
    • Example:
      • Rename this file: "pic383718.jpg"
      • To this: "My Vacation in Italy (1).jpg"
  • The rest of the files will be renamed to match that pattern.
    • Final Output
      • My Vacation in Italy (1).jpg
      • My Vacation in Italy (2).jpg
      • My Vacation in Italy (3).jpg
      • My Vacation in Italy (4).jpg
      • My Vacation in Italy (5).jpg
      • My Vacation in Italy (6).jpg

What happens if the number in parentheses is left out? For example, what if one named the first file "foo.jpg"?

The results will look like this:

  • foo.jpg
  • foo (1).jpg
  • foo (2).jpg
  • foo (3).jpg
  • foo (4).jpg
  • foo (5).jpg
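
To preview the names before committing to a rename, the two patterns above can be mimicked in a few lines of C#. This is only a sketch of the numbering as described in this post, not Explorer's actual implementation, and the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class BatchRenameSketch
{
    // Returns the names that 'count' selected files would receive when the
    // first file is renamed to 'firstName', following the pattern described above.
    public static List<string> PreviewNames(string firstName, int count)
    {
        var names = new List<string>();
        if (count <= 0) return names;

        string ext = Path.GetExtension(firstName);                 // e.g. ".jpg"
        string stem = Path.GetFileNameWithoutExtension(firstName); // e.g. "My Vacation in Italy (1)"

        // The first file keeps exactly the name that was typed.
        names.Add(firstName);

        // If the typed name ends in " (N)", continue counting from N + 1;
        // otherwise the remaining files are numbered starting at 1.
        Match m = Regex.Match(stem, @"^(.*) \((\d+)\)$");
        string baseName = m.Success ? m.Groups[1].Value : stem;
        int next = m.Success ? int.Parse(m.Groups[2].Value) + 1 : 1;

        for (int i = 1; i < count; i++)
        {
            names.Add(baseName + " (" + next + ")" + ext);
            next++;
        }
        return names;
    }
}
```

Calling `BatchRenameSketch.PreviewNames("My Vacation in Italy (1).jpg", 6)` yields the list numbered (1) through (6), while `PreviewNames("foo.jpg", 6)` yields foo.jpg followed by foo (1).jpg through foo (5).jpg.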