Regex Cowboys vs Doing It Right
John Graham‐Cumming recently set off a firestorm by observing that Eric Raymond writes bad code. What Raymond was doing in this particular instance was trying to extract data from HTML by using regular expression matching.
The proper solution, as Mr. Graham‐Cumming explains, is to use an SGML parser to create a document tree, and then extract the necessary elements from the tree. This takes advantage of the fact that HTML (which is an application of SGML) is a structured data format.
In the ensuing discussion, as demonstrated by the Coding Horror post linked above, many argued that the pattern Raymond used was perfectly acceptable for “quick and dirty” projects. They further argued that HTML is often poorly formed, which can trip up the strict XML parsers many try to use. The drama has consumed the programmer blog world, as well as Hacker News and Reddit.
To sum up this whole drama, we have one group saying “you should use a parser” and the other group saying “but sometimes regexes aren’t totally horrible, and this is just a quick job. I don’t want to take the effort to use the proper tool.”
The latter group sees themselves as being pretty reasonable. And perhaps they are. But I find myself in the former group because it is generally not quicker or easier to use regexps—even for one‐off jobs—than a nice SGML parser. And while strict XML parsers will fail on loose SGML‐based formats like HTML, SGML parsers exist that are made flexible enough to avoid the problem.
Those tools give you an object view of the document which, as Mr. Graham‐Cumming demonstrated, allows you to express what you want to do concisely and easily, even when doing a “quick” job plucking a few tags out of the document.
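To make that concrete, here is a minimal sketch of the idea in Python. It is not Mr. Graham‐Cumming's code, and the names LinkCollector, extract_links, and SLOPPY are made up for illustration. It assumes only the standard library's html.parser module, which is event-driven rather than a full tree builder but is tolerant of sloppy markup. The input has an unclosed paragraph, an uppercase tag, mismatched nesting, an unquoted attribute, and a tag split across lines.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return the links found in an HTML string, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical input, deliberately malformed.
SLOPPY = """
<html><body>
<p>Unclosed paragraph
<A HREF='/one'>First <b>link</A>
<a class=x href="/two"
   title="spans lines">Second</a>
</body>
"""

print(extract_links(SLOPPY))
```

None of the quirks in the input needed special handling, which is the point: a regex would have to anticipate each of them. A tree-building parser would make the extraction shorter still, at the cost of a third-party dependency.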
The reason regex hacking on HTML documents remains popular isn’t that it’s ‘easy’ but that it doesn’t require people to know how to use the proper tools, nor to make the conceptual leap from seeing the document as a long string of unstructured text to seeing it as a tree structure of objects.
It’s a classic “The Wrong Way” choice made by people who don’t have a mechanical or conceptual understanding of the right way.
It’s like this. Imagine you come upon someone banging in screws with a hammer. And lying next to them is a nice power drill with a Phillips bit. You say to them, “Why aren’t you using the drill for that? It’d do a better job and be a hell of a lot easier too.”
And the guy responds “This is a quick job. These screws don’t need to hold very much weight, or for very long. If this was a serious piece of construction I would totally use the drill, but for my needs this is just easier.”
Now, you might look at him with some incredulity. No way is his way “easier.” It’s much harder, on top of doing a worse job. The real reason why he’s not using the drill is, quite obviously, he doesn’t know how!
So you say to him, “Seriously, try the drill,” and he keeps on insisting that his approach is “totally sufficient” (which it may well be; that’s not the point) and that using the drill would be “too much work.”