I gave a talk at the last ChiPy on the relevance of Unicode to the typical programmer. The point I was attempting to make was that Unicode is a tool for writing culture‐independent software the same way high‐level languages were tools for writing machine‐independent software.
There’s a lot of interest in the industry, and particularly among socially‐minded nerds, in the role computers have in improving the lot of humanity. In the talk, I start out with a little historical review of electronic communication. I talk a lot about the evolution of the telegraph network into a system that allowed global instantaneous communication, but one that was very expensive and low fidelity.
Computers changed the equation, at least in the West, making communication cheap and universally available. With the computerization of the network in the 1960s came a new character set that was freed from the electromechanical limitations of the old network: ASCII. Unlike the old telegraph character sets, ASCII allows us to write high‐fidelity English. This is what allowed the computerized network to shift from being just a cheaper form of the telegraph network into an online always‐available compendium of knowledge.
After ASCII, there was an explosion of character sets that, between them, tried to provide the ability to write high‐fidelity text in every language on Earth. The problem is that the sets were incompatible and many were very poor from a technical perspective. What’s more, programmers dealing in text tended to use idioms that made sense in their own language, but perhaps not for others. The result is that there is now a lot of software that works great in English, adequately for other Latin‐based languages, and very poorly for everyone else.
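To make the incompatibility concrete, here is a minimal sketch in Python 3. It decodes one and the same byte under four legacy code pages that ship with Python's standard codecs. The helper name `decode_everywhere` and the choice of byte are mine, purely for illustration.

```python
# One byte, four legacy character sets, four different characters.
RAW = bytes([0xE9])

# Codec names as registered in Python's standard library.
LEGACY_ENCODINGS = ["latin-1", "cp1251", "iso8859_7", "cp437"]


def decode_everywhere(raw, encodings):
    """Decode the same bytes under each encoding and return the results by name."""
    return {name: raw.decode(name) for name in encodings}


if __name__ == "__main__":
    for name, text in decode_everywhere(RAW, LEGACY_ENCODINGS).items():
        print(f"{name:10} -> {text!r}")
```

Without out-of-band knowledge of which character set the sender used, the receiver cannot tell which of these readings was intended.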
Unicode tries to solve the problem by
- superseding all of the (often horrible) legacy character sets, and
- providing programmers with a set of tools to refer to language structures (such as words) and common tasks (such as sorting) in a language-independent way.
After explaining some of the features Unicode has for writing culture‐independent code, I explore the Unicode features of Python and demonstrate that many of the most important features aren’t yet available.
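As a rough illustration of the kind of language‐independent tooling meant here, the sketch below uses only what the Python 3 standard library offers: normalization via `unicodedata.normalize` and case folding via `str.casefold`. The function name `caseless_key` is hypothetical. Language‐aware collation and word segmentation are not part of the standard library and are not shown.

```python
import unicodedata


def caseless_key(text):
    """Build a comparison key: decompose, case fold, then decompose again.

    This follows the usual recipe for canonical caseless matching; it is a
    sketch, not a substitute for full language-aware collation.
    """
    folded = unicodedata.normalize("NFD", text).casefold()
    return unicodedata.normalize("NFD", folded)


composed = "caf\u00e9"      # 'é' stored as a single code point
decomposed = "cafe\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

if __name__ == "__main__":
    # The two strings look identical on screen but differ code point by code point.
    print(composed == decomposed)
    print(caseless_key(composed) == caseless_key(decomposed))
    # Case folding maps the German sharp s to "ss", which lower() does not.
    print(caseless_key("STRASSE") == caseless_key("stra\u00dfe"))
```

Comparing keys rather than raw strings avoids baking English‐only assumptions (one character per code point, simple upper/lower pairs) into the code.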
At any rate, here's the talk:
I really enjoyed giving the talk, and I would like to thank ChiPy—and in particular Brian Ray and Carl Karsten—for allowing me to give it.
In the future, I need to remember to repeat the questions from the audience into the microphone, as it’s impossible to hear anything the audience is saying. The last third is probably not worth watching as it’s me answering questions you can’t hear.
I often find clients with horrendous legacy technology stacks—systems that are bug‐ridden, poorly‐written, undocumented, and that don’t even work for what they need to do.
It makes sense that people will end up with some technical debt. But in many cases, people in the organization who are trying to improve the situation are stymied by political blocks from other people who would rather protect a fief than try to improve the business process.
I’ve begun to wonder if it’s possible for such companies to actually have nice software, or if the organizational dysfunction is so fundamental that it will poison any development or acquisition process.
I was speaking to a friend of mine a bit ago, who is the IT manager of a hospital. We got on the subject of Electronic Medical Record systems, and he said that there’s little chance for anything good to come of them.
He said that the problem is that American private hospitals are already highly computerized; it’s just that the computer systems are focused entirely on billing and not at all on medical records management. The hospital considers it critical that you get charged for every tongue depressor used, but they’re significantly less enthusiastic about making sure your x‐ray gets to the right person.
He said that his hospital will end up turning to their billing software vendor to purchase their EMR system. This is a decision that has pretty much already been made, and one over which he has no control.
While EMR systems like the Veterans Administration’s VistA program exist and are well‐liked by nurses and doctors, they provide no functionality to ensure that you get billed for that tongue depressor. Software that doesn’t integrate with the hospitals’ existing billing system has no chance of being chosen. This gives hospital billing system vendors the contract by default. And they have no incentive to actually make EMR software that works.
John Graham‐Cumming recently set off a firestorm by observing that Eric Raymond writes bad code. What Raymond was doing in this particular instance was trying to extract data from HTML by using regular expression matching.
The proper solution, as Mr. Graham‐Cumming explains, is to use an SGML parser to create a document tree, and then extract the necessary elements from the tree. This takes advantage of the fact that HTML (which is an application of SGML) is a structured data format.
In the ensuing discussion, as demonstrated by the Coding Horror post linked above, many argued that the pattern Raymond used was perfectly acceptable for “quick and dirty” projects. They further argued that HTML is often poorly‐formed, which can trick the strict XML parsers many try to use. The drama has consumed the programmer blog world, as well as Hacker News and Reddit.
To sum up this whole drama, we have one group saying “you should use a parser” and the other group saying “but sometimes regexes aren’t totally horrible, and this is just a quick job. I don’t want to take the effort to use the proper tool.”
The latter group sees themselves as being pretty reasonable. And perhaps they are. But I find myself in the former group because it is generally not quicker or easier to use regexps—even for one‐off jobs—than a nice SGML parser. And while strict XML parsers will fail on loose SGML‐based formats like HTML, SGML parsers exist that are made flexible enough to avoid the problem.
Those tools give you an object view of the document which, as Mr. Graham‐Cumming demonstrated, allows you to express what you want to do concisely and easily, even when doing a “quick” job plucking a few tags out of the document.
The reason why regex hacking on HTML documents remains popular isn’t because it’s ‘easy’ but because it doesn’t require people to know how to use the proper tools, nor does it require people to make the conceptual leap from the document as a long string of unstructured text to the document being a tree‐structure of objects.
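For a sense of how little effort a forgiving parser takes, here is a minimal sketch using `html.parser.HTMLParser` from the Python 3 standard library. This is my own illustration, not the code from Mr. Graham‐Cumming's post, and the names `LinkCollector`, `extract_links`, and `SLOPPY_HTML` are made up for the example. Note that this particular parser is event‐driven rather than tree‐building; third‐party libraries such as Beautiful Soup or lxml provide the full tree view.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag the parser encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased, whatever the source used.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def extract_links(html):
    """Return the href values of all anchors in the document, in order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Deliberately loose markup: unclosed tags, mixed case, mixed quoting,
# and an anchor inside a comment that should not be reported.
SLOPPY_HTML = """
<html><body>
<p>Unclosed paragraph with <A HREF='one.html'>a link
<p>Another <a class="x" href="two.html" >link</a>
<!-- <a href="commented-out.html">ignored</a> -->
</body>
"""

if __name__ == "__main__":
    print(extract_links(SLOPPY_HTML))
```

The parser copes with the mixed case, the inconsistent quoting, and the commented‐out anchor, all of which a hand‐rolled regular expression would have to anticipate one by one.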
It’s a classic “The Wrong Way” choice made by people who don’t have a mechanical or conceptual understanding of the right way.
It’s like this. Imagine you come upon someone banging in screws with a hammer. And lying next to them is a nice power drill with a Phillips bit. You say to them “Why aren’t you using the drill for that? It’d do a better job and be a hell of a lot easier too.”
And the guy responds “This is a quick job. These screws don’t need to hold very much weight, or for very long. If this was a serious piece of construction I would totally use the drill, but for my needs this is just easier.”
Now, you might look at him with some incredulity. No way is his way “easier.” It’s much harder, on top of doing a worse job. The real reason why he’s not using the drill is, quite obviously, he doesn’t know how!
So you say to him “seriously, try the drill” and he keeps on insisting that his approach is “totally sufficient” (which it may well be, that’s not the point) and that using the drill would be “too much work.”
Are Programmers Professionals or Tradesmen?
Joel Spolsky wrote an article complaining that people who study computers at school aren’t taught “practical” skills like using source control or bug tracking tools. His solution is that universities should create a class where they teach students to use his FogBugz product. If that sounds self‐serving to you, you’re not alone. Mark Dennehy thinks that Spolsky has become a snake-oil salesman.
Setting aside Spolsky’s product marketing, I think his attitude about skills and tooling is very common among managers in our industry. Several times in my career I’ve seen communications from management calling me a “Python Resource” or a “PHP Resource.” Ignoring the completely dehumanizing concept of people as “resources” (which is par for the course in Corporate America), the fact that managers seem so absorbed with skills with particular tools betrays a refusal to treat us like professionals and a preference towards viewing us as skilled labor.
A graduating structural engineer has no idea how to marshal a design through the building inspector’s approval process. A graduating lawyer has no idea how to actually engage in litigation. Yet those fields aren’t dominated by managers whining that students should be learning “real world skills” instead of “theoretical stuff” like advanced physics or constitutional law.
They understand that a professional isn’t an automaton that gets precision machined by a training program to slide frictionlessly into their workflow. They know that when they hire a graduate, he isn’t trained in the mechanics of his job (nor is he even licensed yet), and that the hire represents a commitment on the firm’s part to take the theoretical knowledge he got in school and show him how to leverage it for the real‐world practice of his profession. They are fine with that because they have a culture – as professions and as firms – of respecting and investing in their practitioners.
The computing industry doesn’t operate that way. All the lip service it gives about ‘professionalism’ is, as far as I can tell, entirely driven by a desire to ensure that programmers remain exempt. Why is it that the manager class at development firms is dominated by non‐technical MBAs? Why are development firms not set up so that programmers are partner‐tracked associates? I can’t think of any real profession where that’s not the default configuration for a firm. And most apropos here: why do they expect their supposedly professional workforce to receive trade skills from their university education programs?
I think people like Spolsky need to quit attacking universities until they can get some consistency in their own views of their employees. Either programmers are the skilled tradesmen we’re currently treated as, in which case the exempt status should be removed and the industry should come up with a tradesman’s curriculum, or the industry should accept us as professionals and start treating us that way.