Monday
Oct 16, 2006

Finding all the A HREF Urls in an HTML document (even in malformed HTML)

The Need

Given an HTML file, you want to extract all the HREF URLs.

 

You could use a Regex

I've done this before, but haven't found it entirely reliable. I use regexes so infrequently that it is painful to relearn the syntax every time.
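
To make the reliability problem concrete, here is a minimal sketch of what a regex-based attempt might look like. It is written in Python purely for illustration; the pattern and the find_hrefs name are assumptions, not code from this post. It finds quoted href values but silently skips an unquoted one, which is the kind of gap that makes this approach hard to trust.

```python
import re

# Hypothetical pattern (an assumption for illustration): matches
# href="..." or href='...' inside an <a ...> start tag.
HREF_RE = re.compile(
    r'<a\b[^>]*?\bhref\s*=\s*(["\'])(.*?)\1',
    re.IGNORECASE | re.DOTALL,
)

def find_hrefs(html):
    """Return the href values the pattern can see, in document order."""
    return [m.group(2) for m in HREF_RE.finditer(html)]

sample = (
    '<p><a href="http://example.com/a">A</a> '
    "<A HREF='b.htm'>B</A> "
    '<a href=c.htm>C</a></p>'  # unquoted value: legal HTML, missed by the pattern
)
print(find_hrefs(sample))
```

The third link is dropped without any error, so a missed URL only shows up if you already know it should be there.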

 

What I recommend

Use the HTML Agility Pack. If you are familiar with the XML DOM, using the HTML Agility Pack will come naturally.

 

HTML Agility Pack URL

http://www.codeplex.com/Wiki/View.aspx?ProjectName=htmlagilitypack

 

Two things that make HTML Agility Pack interesting

- It doesn't depend on Internet Explorer

- It works on malformed HTML. For a little context, see this post: .NET Html Agility Pack: How to use malformed HTML just like it was well-formed XML

 

Sample code

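// This isn't a full sample, but it is enough to show the value of using the HTML Agility Pack.
// Requires: using HtmlAgilityPack;

HtmlDocument input_doc = new HtmlDocument();
input_doc.Load("foo.htm");

// SelectNodes returns null when nothing matches, so check before iterating.
HtmlNodeCollection links = input_doc.DocumentNode.SelectNodes("//a");
if (links != null)
{
    foreach (HtmlNode node in links)
    {
        // The second argument is the default returned when the attribute is missing.
        string href_url = node.GetAttributeValue("href", "");
    }
}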
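
For comparison, the same parser-based idea can be sketched outside .NET. The example below uses Python's standard-library html.parser rather than the HTML Agility Pack, so it is an analogue of the sample above and not its API; the HrefCollector and extract_hrefs names are made up for illustration. The input is deliberately malformed (unclosed tags, an unquoted attribute, mixed case) to show that a tolerant parser still finds the links.

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)

def extract_hrefs(html):
    """Return every href found on an <a> tag, in document order."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs

# Malformed on purpose: unclosed <p> and <a>, unquoted value, uppercase names.
malformed = '<p><a href="one.htm">one<p><A HREF=two.htm>two</a><a name="x">no href</a>'
print(extract_hrefs(malformed))
```

Anchors without an href (such as the named anchor at the end) are simply skipped here, whereas the sample above would report an empty string for them.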

Friday
Oct 13, 2006

It's Helvetica's world, we just live in it.

A film about a typeface to be released in 2007: http://helveticafilm.com/

Microsoft uses a typeface called "Arial" that, as I discovered only recently, is NOT the same as the "Helvetica" font found on Macs. The differences are subtle to my untrained eye, but those with a background in typography don't take the discrepancies lightly.

 

On the battle between Helvetica and Arial

Helvetica vs. Arial: http://www.engagestudio.com/helvetica/

How to spot Arial: http://www.ms-studio.com/articlesarialsid.html

Arial or Helvetica?: http://www.iliveonyourvisits.com/helvetica/

 

Wikipedia's article on Helvetica

http://en.wikipedia.org/wiki/Helvetica

 

The real evil: Comic Sans

Arial's existence is tolerated by those in the know. Comic Sans is hated. That is a shame, because its designer, Vincent Connare, also designed one of my favorite fonts: Trebuchet. Read what Vincent says about Comic Sans.

Trivia: I first knew Vincent as a teammate on an ice hockey team. I remember him as a skilled player and an excellent sportsman. Only later did I discover his contributions to Microsoft's typography.

Wikipedia's article on Comic Sans: http://en.wikipedia.org/wiki/Comic_Sans

Tuesday
Jun 20, 2006

Check out NodeBox - a playground for graphics

NodeBox (http://nodebox.net/) is a fun tool for experimenting with 2D graphics. NodeBox version 1.0 only runs on Mac OS X, but a beta version of NodeBox V2, found on the NodeBox research wiki (http://research.nodebox.net/Home), runs on both Windows and Mac OS X (via Jython).

Before I knew NodeBox version 2 existed, I used Visual Studio 2005 to create a demo WinForms app that used GDI+ (System.Drawing) and IronPython. I plan on using this tool to play with some ideas on visualizing data.

A screenshot of NodeBox V1

 

Monday
Jan 9, 2006

I always wondered what Anisotropic Filtering did

And thanks to the following page (pretty pictures!) I know I should turn it on when I play games:

http://www.codinghorror.com/blog/archives/000484.html

From the article:

"In my opinion, anisotropic filtering is the most important single image quality setting available on today's 3D hardware"

And a link to Wikipedia for some more information:

http://en.wikipedia.org/wiki/Anisotropic_filtering

 

Thursday
Nov 24, 2005

Use the built-in Batch Rename in Windows Explorer to give files consistent and readable names

The Scenario

One has a lot of oddly named files, such as those produced by a digital camera.

Example of what a digital camera might produce:

  • pic383718.jpg
  • pic120453.jpg
  • pic938889.jpg
  • pic109012.jpg
  • pic433590.jpg
  • pic219093.jpg

The Pain Experienced

  • The names are meaningless
  • The ordering is unclear
  • One has to open them up to determine their content and order

What is Desired

  • Give the files nice names that make sense without having to open them up.
  • A way to do this without having to download or purchase a tool

Windows XP comes with a built-in solution to batch rename files

  • Browse in Windows Explorer to the files you want to fix
  • Switch to Details view (recommended, but not required)
  • Order the view by time or name or size as desired
  • Multi-select all the files with the names you don't like
  • While the files are selected, right click on the file you want to consider as the *first* one. The context menu will appear.
  • Select Rename
  • The file upon which you right-clicked will now show that you can edit its name
  • Change the name as desired, add a space and a starting number in parentheses, and leave the extension alone
    • Example:
      • Rename this file: "pic383718.jpg"
      • To this: "My Vacation in Italy (1).jpg"
  • The rest of the files will be renamed to match that pattern.
    • Final Output
      • My Vacation in Italy (1).jpg
      • My Vacation in Italy (2).jpg
      • My Vacation in Italy (3).jpg
      • My Vacation in Italy (4).jpg
      • My Vacation in Italy (5).jpg
      • My Vacation in Italy (6).jpg

What happens if the number in parentheses is left out? For example, what if one named the first file "foo.jpg"?

The results will look like this:

  • foo.jpg
  • foo (1).jpg
  • foo (2).jpg
  • foo (3).jpg
  • foo (4).jpg
  • foo (5).jpg
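
The two naming patterns above can be modeled in a few lines. This is only a sketch of the behavior as described in this post, written in Python for illustration; it does not call Windows Explorer or touch any files, and the batch_names function is a hypothetical helper.

```python
import os
import re

def batch_names(count, first_name):
    """Return the names `count` selected files would get, per the pattern described above.

    `first_name` is the new name typed for the file that was right-clicked.
    """
    stem, ext = os.path.splitext(first_name)
    # Does the typed name end with a space and a number in parentheses, e.g. "Trip (1)"?
    match = re.fullmatch(r"(.*) \((\d+)\)", stem)
    if match:
        base, start = match.group(1), int(match.group(2))
        # Numbering continues from the starting number that was typed.
        return [f"{base} ({start + i}){ext}" for i in range(count)]
    # No number typed: the first file keeps the plain name, the rest count up from (1).
    return [first_name] + [f"{stem} ({i}){ext}" for i in range(1, count)]

for name in batch_names(6, "My Vacation in Italy (1).jpg"):
    print(name)
for name in batch_names(6, "foo.jpg"):
    print(name)
```

Running it prints the same two lists shown above, which is a quick way to preview the result before renaming real files.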