Re: Re: English sentences, was: Re: Announce: XML Sc
Jonathan Borden scripsit:

> Actually this _is_ the original point, isn't it? You are saying that using a
> specific character set isn't a reliable way to detect a human language
> (because other characters might be correctly present)

This is not my point at all, though I do agree with it, as do you.

My point is that determining THE alphabet of English is a wild goose chase, because different definitions exist for different uses. For children's books, the alphabet is unquestionably a-zA-Z and nothing else. For more complex prose, some rarer letters are required. Foreign words may retain their accents or not, and quotations can be in any script at all.

This has absolutely nothing to do with detection as such. It has to do with *validation* that the text can be handled by some kind of mechanization or other.

FWIW, Harald Alvestrand has done some work on the subject which can be found at http://www.alvestrand.no/ietf/lang-chars.txt . This work is explicitly incomplete, most likely contains errors, and is to be used at your own risk.

--
John Cowan <jcowan@r...>     http://www.reutershealth.com
I amar prestar aen, han mathon ne nen,    http://www.ccil.org/~cowan
han mathon ne chae, a han noston ne 'wilith.  --Galadriel, _LOTR:FOTR_
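To make the validation-versus-detection distinction concrete, here is a minimal Python sketch. The two repertoires and the function names are assumptions made for illustration; they are not taken from the message or from Alvestrand's list. The same text passes or fails depending on which "alphabet" the downstream mechanism is declared to handle, and nothing here tries to decide whether the text is English.

```python
import string
import unicodedata

# Hypothetical repertoires, chosen only for illustration.
BASIC = set(string.ascii_letters)                    # a-zA-Z and nothing else
EXTENDED = BASIC | set("àâçèéêëîïñôöüæœÀÇÈÉÆŒ")      # plus some rarer letters


def letters_outside(text, repertoire):
    """Return the distinct letters in text that the repertoire lacks."""
    # Normalize to NFC so accented letters are single code points.
    text = unicodedata.normalize("NFC", text)
    return sorted({ch for ch in text if ch.isalpha() and ch not in repertoire})


def validates(text, repertoire):
    """True if every letter in text belongs to the repertoire."""
    return not letters_outside(text, repertoire)
```

Under these assumptions, `validates("a naïve café", BASIC)` is false while `validates("a naïve café", EXTENDED)` is true, and a quotation in Greek script fails both: the answer depends on the repertoire chosen, not on the language of the text.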