Re: The triples datamodel -- was Re: Semantic Web p
Elliotte Rusty Harold wrote:

> But rigid fixed schemas fail when we're talking about thousands or tens
> of thousands or even millions of disconnected developers who do not have
> prior agreements, who do not know each other, and who are doing very
> different things with the same data. This is the world of the Internet.
> This is the world I work in. This is the world more and more developers
> are working in more and more of the time, and the old practices that
> worked in small, closed systems behind the firewall are failing. It's
> time to learn how to design systems that are flexible and loosely
> coupled enough to work in this new environment. XML is a critical
> component in making this work. Maybe RDF is too, though I'm still not
> convinced (to bring this thread back on topic.) Schemas really aren't.
> At best schemas are a useful diagnostic tool for deciding what kind of
> document you've got so you can dispatch it to the appropriate local
> process. At worst, however, schemas encourage a mindset and assumptions
> that are actively harmful when trying to produce scalable, robust,
> interoperable systems.

What Rusty said. Here are two vignettes from my own experience to
underline his point.

- We will be getting XML messages (via JMS) from a state agency - the
state of California, in fact. Their contractor tells us the messages
conform to such-and-such a schema. The schema happens to be one that we
ourselves wrote; it is a draft version of a to-be standard. But the
first documents we get do not validate against the schema, and
unfortunately they are not just simple extensions. In a few places new
structures have made their way into the document. It seems pretty clear
what has happened. Probably the messages originally validated, but then
the contractor made some changes and forgot that the changes might not
be schema-valid. Or maybe they never tried validating in the first
place. Anyway, no problem - XSLT to the rescue!
- I need to screen-scrape certain data from a web page that is updated
from time to time. The page is put up by a US government agency. The
data is critical medically-related information. The results of the data
extraction go into the front end of a long and complex automated
workflow. I write the front-end parser (this was before John Cowan's
tag-soup parser came out). It turns out that the page is hand-authored
by someone who is not very expert with HTML. With every update the
internal structure changes. It always looks the same in the browser,
but certain key internal parts are actually invalid HTML, and the
nature of the invalidity changes each time. Unfortunately we have to
use those parts to extract indexes that point to the actual data we
want to collect from other parts of the page. We cannot outguess all
the changes, and so from time to time we get parse failures. We cannot
influence the page design. Finally, we give up and use the text-only
version that the agency also hosts. This has no markup, but the visual
structure blocks out the information we need in a consistent way, and
the visual structure matches the actual text format. I write a parser
that emits SAX-like events to feed into the downstream process.
Everything works nicely and robustly after this change.

As Rusty says, that is the world of the internet.

Cheers,

Tom P

-- 
Thomas B. Passin
Explorer's Guide to the Semantic Web (Manning Books)
http://www.manning.com/catalog/view.php?book=passin
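The second vignette's approach - a text-layout parser driving a SAX-style event interface - can be sketched roughly as below. The agency's real text format is not shown in the post, so this assumes a hypothetical layout of "Label: value" lines grouped into blank-line-separated records; the handler interface mirrors SAX's start/characters/end callbacks.

```python
# Minimal sketch of a parser emitting SAX-like events from a plain-text
# report. The record layout ("Label: value" lines, blank-line separated)
# is an assumption standing in for the agency's actual format.
class Handler:
    """SAX-style callback interface for the downstream process."""
    def start(self, name): pass
    def characters(self, text): pass
    def end(self, name): pass

class CollectingHandler(Handler):
    """Example handler that just records the event stream."""
    def __init__(self):
        self.events = []
    def start(self, name): self.events.append(("start", name))
    def characters(self, text): self.events.append(("chars", text))
    def end(self, name): self.events.append(("end", name))

def parse_report(text, handler):
    """Walk the visual structure of the text and fire events for each block."""
    handler.start("report")
    for block in filter(None, (b.strip() for b in text.split("\n\n"))):
        handler.start("record")
        for line in block.splitlines():
            label, _, value = line.partition(":")
            handler.start(label.strip())
            handler.characters(value.strip())
            handler.end(label.strip())
        handler.end("record")
    handler.end("report")

h = CollectingHandler()
parse_report("Drug: Foo\nLot: 123\n\nDrug: Bar\nLot: 456", h)
```

Because the downstream process only sees the event stream, it cannot tell whether the events came from an XML parser or from this text scraper, which is what makes the swap robust.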