Re: Parsing XML with anything but
Hey, Liam!

On Mon, 09 Dec 2013 22:07:09 -0500, Liam R E Quin wrote:
> The "desperate perl hacker" was a significant and much-discussed use
> case during XML development, and was part of why we chose a self-evident
> empty element syntax.

Mmmmm. I suggest that you didn't succeed.

XML, in the general case, cannot be reliably handled with regular expressions. This is unsurprising; the problem of parity is literally a textbook case for the limitations of regular expressions (regular languages, regular grammars, finite state automata) in parsing. XML's reliance on parity both for tag delimiters (<>) and for start/end semantics (<></>) is fairly unquestionable.

Developing a library of regular expressions that handles a series of special cases in XML is a good way of falling prey to the classic Perl programmer's virtue of hubris. That code may be safe in your own (desperate hackish) hands; it isn't safe in someone else's.

One of my earliest experiences of this, around 2001, had to do with a processor for handling SOAP (probably 1.1). The designer, a developer who is *significantly* smarter, better-trained (in computer sciences in general, though not in XML or markup in particular), and more experienced than I am (or was), decided that a namespace declaration binding the default prefix *necessarily* changed the prefix of attributes-without-prefix. Gentle (and less-gentle) remonstration based on specifications failed to change his mind. Since SAX wasn't doing the right thing, he implemented code that caught the events, changed the prefixes appropriately, and passed it on. And on output, it did-the-right-thing for generating attributes.

Since this blew up in ways that those reading this list can probably easily imagine, the XML geeks were required to make it work for all those situations.
Even deprecating this enormous pile of pigs' lips as our first activity did not save us from the succeeding *two infinite years* of writing increasingly baroque and fragile code to catch the output from this ... desperate hack ... and turn it into something that was both well-formed and valid. It had shipped as production code. Our later ships of the production code could *not* say "we [expletive deleted] up; we can't handle this horse pucky," whatever our competitors did with it.

We were finally able to drop support for the versions of shipping products that used this nightmare, and instead rely on well-vetted parsing code (like, the original SAX before it got filtered) that Did the Right Thing, and to throw out something over 20K lines of specialized "fix the problem that we generated by failing to actually train up on the real problem rather than our desperate-hackish conception of what it ought to be" code.

I haven't any patience for it. XML 1.0 namespaces are a disaster, XML Schema a living nightmare. Trying to cope with incoming XML that *could* contain these things *without understanding those specifications*, even if the plan is that the incoming stuff *won't* contain them, is asking for problems. Because then you find you have to cope with them. And you can't throw out all that beautiful work you've done! And when you've moved on, and someone else is trying to deal with the new inputs for the code that you wrote that worked so well ... perhaps that's brilliant, rather than stupid, but it's not something that's going to make your successor bless your name. Or the name of XML.

And that's a problem of training. Like the developer/designer/architect who simply *could not believe* that the specification required that elements and attributes respond differently to the declaration of a binding to the default prefix: insufficient willingness to believe that the specification writers could specify something boneheaded.
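
The element/attribute asymmetry described above is easy to check against a namespace-aware parser. What follows is only a minimal sketch using Python's standard xml.etree.ElementTree; the sample document, the urn:example:* namespace names, and the describe helper are all made up for illustration and have nothing to do with the SOAP processor in the anecdote.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a default namespace, one unprefixed attribute (id),
# and one explicitly prefixed attribute (x:flag).
DOC = (
    '<order xmlns="urn:example:shop" xmlns:x="urn:example:ext" '
    'id="42" x:flag="y"/>'
)

def describe(xml_text):
    """Return the expanded element name and the sorted expanded attribute names."""
    root = ET.fromstring(xml_text)
    return root.tag, sorted(root.attrib)

tag, attrs = describe(DOC)
print(tag)    # the element name carries the default namespace
print(attrs)  # the unprefixed attribute carries no namespace at all
```

ElementTree reports expanded names in `{uri}local` form, so the element shows up under the default namespace while the unprefixed `id` attribute stays a bare name. Only the explicitly prefixed attribute picks up a namespace.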
Like the DPH-s who wrote piles of regexes because the spec writers said "we're making it work for you!" without looking at the specification and discovering it's type-2 in the Chomsky hierarchy (context-free), not type-3 (regular).

> Use of regular expressions does not need to be evidence of stupidity,
> nor of poor training.

In general? Absolutely not. In dealing with a grammar that is context-free, but not regular? It's a sign of poor training at least. If the expressions operate over something that's known to safely conform to a regular grammar (necessarily a special case in XML processing), then it's fine.

Alas, anyone who succeeds with this is going to keep going with it until the [^>] bites. That's an absolute certainty if the code is used by more than one person, especially if it's hand-me-down.

> I admit to using regular expressions to process
> XML at times myself, although I also suppose that since I haven't
> received a whole lot of introductory XML training I'm poorly trained in
> XML...

I'm probably supposed to be intimidated, considering history and authorship and such. Sorry. I think that if you turn over your aggregation of regexes to someone else, then Bad Things Will Happen. I think that if you don't expect that, then perhaps it's an indication of poor training or experience. Naivete? Something. Perhaps you'd be one of the ones offering strong, understandable, and written (so that they can be passed on) warnings on the limitations of the bits that you're turning over to others, and none of this applies.

> Absence of carefulness is a problem, but that can be a problem with any
> tool.

Hammers and screws are an inappropriate combination, as a general rule. It has nothing to do with how careful one is, pounding the damned things in.

However, let me provide another anecdote, on why this particular analogy occurred to me. When I was young (and ... still not pretty, alas), I was heavily involved in theater. Community theater, college theater.
Since I was notably *terrible* on stage, I ended up as part of the supporting staff. We did things like building the sets.

Our director (who, in this environment, is probably better described as BossAndGod), handed out lumber, fabric, screws, and ... yes, hammers. To build the scrims for the backdrops. On purpose. Because they could be hammered in, quickly, and later, when we tore it all down, a screwdriver generally got the things back out. We weren't *allowed* to use screwdrivers (no power tools in that era, mind; circumstances have certainly changed since then) because it *took too long*. We were always short on time when a show was coming up.

In other words, this was sensible behavior, for the circumstances. Not that we could convince anyone involved with carpentry of it, mind. We generally ended up with at least one person each year who had been a carpenter's assistant, or who did carpentry of some sort for fun, who *insisted* that we could be just as fast doing it the right way. They may have even been right. Our way worked, though, and we knew how to use our regexen^Whammers.

Mind you, when I tried to build my loft in my first dorm room at college, I decided that perhaps I'd been misled. YMMV.

Amy!
--
Amelia A. Lewis    amyzing {at} talsever.com
About the use of language: it is impossible to sharpen a pencil with a blunt axe. It is equally vain to try to do it with ten blunt axes instead.
    -- Edsger Dijkstra