Subject: Re: Regular expression /s whitespace : Which whitespace?
From: Wendell Piez <wapiez@xxxxxxxxxxxxxxxx>
Date: Thu, 20 Apr 2006 18:40:22 -0400
Karen,

The regular expression syntax used in XPath 2.0 is defined by the XML Schema Recommendation, as amended by the XQuery 1.0 and XPath 2.0 Functions and Operators specification. See

http://www.w3.org/TR/xmlschema-2/
especially Appendix F, Regular Expressions

and

http://www.w3.org/TR/xpath-functions/
7.6.1 Regular Expression Syntax

If you dig into these (especially the first), you'll find that \s is equivalent to [#x20\t\n\r], which is to say the space character, the tab, the newline (line feed), or the carriage return. This is consistent with XML's general notion of what constitutes whitespace, for example as used inside tags or declarations (see the XML Rec). Note that the non-breaking space character (#xA0) is not in this set.
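
To make the four-character definition concrete, here is a minimal sketch in Python rather than XPath. It is an emulation only: Python's own \s is broader for Unicode strings (it does match the non-breaking space), so the sketch spells out the XML Schema set explicitly. The names XML_WHITESPACE and split_on_xml_whitespace are made up for this illustration.

```python
import re

# Assumption: this explicit class stands in for XML Schema's \s,
# i.e. space (#x20), tab, newline and carriage return, and nothing else.
XML_WHITESPACE = "[ \t\n\r]"


def split_on_xml_whitespace(text):
    """Split text on runs of the four XML whitespace characters,
    dropping any empty tokens at the edges."""
    return [tok for tok in re.split(XML_WHITESPACE + "+", text) if tok]


# "four" and "five" are joined by a non-breaking space (U+00A0),
# which is not in the set, so they stay together as one token.
sample = "one two\tthree\nfour\u00a0five"
tokens = split_on_xml_whitespace(sample)
print(len(tokens))
```

With this input the non-breaking space does not act as a separator, so a reader who sees five words gets four tokens. If the content really does contain such characters, they would have to be added to the pattern explicitly.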

Defining what a "word" should be is tricky. Whether a word count is properly derivable from an analysis of whitespace (or whitespace plus punctuation) is arguable, but for most purposes it's considered good enough, at any rate for English, especially considering the alternatives.

(For example, out on the edge, if you ever have em-dashes or even "--" hyphen pairs, without extra whitespace--like this--as is sometimes seen--you'll count "words" like "whitespace--like".)
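
The double-hyphen edge case can be sketched the same way. This is a rough Python emulation of the tokenize-and-count expression quoted below, not the XPath itself: the separator pattern copies its punctuation set, with \s replaced by the explicit four-character class. The second pattern, which also treats "--" and the em-dash (U+2014) as separators, is only one possible variation and an assumption of mine. All names here are hypothetical.

```python
import re

# Emulates '(\s|[,.!:;])+' from the quoted expression, with \s written
# out as the four XML whitespace characters.
SEPARATORS = re.compile("(?:[ \t\n\r]|[,.!:;])+")

# Illustrative variation (an assumption, not part of the original
# expression): also split on "--" and on the em-dash, U+2014.
SEPARATORS_WITH_DASHES = re.compile("(?:[ \t\n\r]|[,.!:;]|--|\u2014)+")


def count_words(text, separators=SEPARATORS):
    """Lower-case the text, split on separators, count non-empty tokens."""
    return len([tok for tok in separators.split(text.lower()) if tok])


phrase = "without extra whitespace--like this--as is sometimes seen"
plain_count = count_words(phrase)
dash_count = count_words(phrase, SEPARATORS_WITH_DASHES)
print(plain_count, dash_count)
```

On this phrase the first pattern yields 7 tokens, because "whitespace--like" and "this--as" each count once, while the variation yields 9. Whether the second number is the "right" one depends on what you decide a word is.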

I hope that helps,
Wendell

At 05:57 PM 4/20/2006, you wrote:
I am using count(tokenize(lower-case(.),'(\s|[,.!:;])+')[string(.)]), a technique I retrieved from the list for counting words. I have been questioned about the regular expression that is being used to find whitespace. The content can contain many kinds of whitespace, and I am being asked to defend using this expression to find words. Does the Saxon 8b interpretation of this regular expression covers as whitespaces

--------------------Karen McAdams


======================================================================
Wendell Piez                            mailto:wapiez@xxxxxxxxxxxxxxxx
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9635
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
----------------------------------------------------------------------
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================
