
RE: regex in csv2xml

Subject: RE: regex in csv2xml
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Mon, 27 Mar 2006 09:57:08 +0100
I would do something like this:

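<!-- Assumes $in holds the CSV text and the xs prefix is bound to the XML Schema namespace. -->
<xsl:variable name="regex1">".*?"</xsl:variable>
<!-- Any 'spare' character will do; U+E000 is in the private-use area. -->
<xsl:variable name="pua1" as="xs:string" select="'&#xE000;'"/>
<xsl:variable name="s1" as="xs:string*">
  <!-- flags="s" is needed so that "." also matches a newline inside quotes -->
  <xsl:analyze-string select="$in" regex="{$regex1}" flags="s">
    <xsl:matching-substring>
      <xsl:sequence select="replace(., '\n', $pua1)"/>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <xsl:sequence select="."/>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:variable>

<xsl:variable name="s2" as="xs:string" select="string-join($s1, '')"/>

<!-- The element names record and field are placeholders. -->
<xsl:for-each select="tokenize($s2, '\n')">
  <record>
    <xsl:for-each select="tokenize(., ',')">
      <!-- '\n' is not allowed in a replacement string, so use a character reference -->
      <field><xsl:value-of select="replace(., $pua1, '&#10;')"/></field>
    </xsl:for-each>
  </record>
</xsl:for-each>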

That is: first take the total string and identify substrings in quotes. The
fact that this treats "He said ""don't""" as three strings ("He said",
"don't", "") doesn't matter. Replace a newline appearing between quotes by a
private-use-area character (or any other 'spare' character). Then put the
strings back together again.
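
To see why the three-way split is harmless, here is a small sketch in Python rather than XSLT, only because it is easy to run and inspect. The sample strings are made up for illustration. Quotes in a well-formed field come in pairs (opening, doubled, closing), so pairing each quote with the next one never reaches outside the field, and every newline in the field ends up inside one of the matches.

```python
import re

# Same pattern as regex1; re.S lets '.' match a newline (like the 's' flag).
QUOTED = re.compile(r'".*?"', re.S)

# The field from the text above: it is matched as three pieces, not one.
field = '"He said ""don\'t"""'
pieces = QUOTED.findall(field)
print(len(pieces))  # 3

# A hypothetical multi-line field: count the newlines that fall inside a match.
multiline = '"line one\n""quoted""\nline three"'
inside = sum(p.count("\n") for p in QUOTED.findall(multiline))
print(inside, multiline.count("\n"))  # 2 2
```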

Now take the reassembled string and split it first at newlines, then at
commas, and within each identified token, convert the private character back
to a newline.
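
The whole round trip can be sketched as follows, again in Python purely as an illustration of the steps and not as a replacement for the stylesheet. The function name split_records and the choice of U+E000 as the spare character are assumptions; the sample is the one from the original question.

```python
import re

PUA = "\ue000"  # assumed spare character; any character absent from the data works

def split_records(text: str) -> list[list[str]]:
    # Step 1: hide newlines that fall between a pair of quotes.
    protected = re.sub(
        r'".*?"',
        lambda m: m.group(0).replace("\n", PUA),
        text,
        flags=re.S,  # let '.' match a newline
    )
    records = []
    # Step 2: the newlines that remain are record separators.
    for line in protected.split("\n"):
        if not line:
            continue  # empty token after the final newline
        # Step 3: split at commas, then restore the hidden newlines.
        records.append([tok.replace(PUA, "\n") for tok in line.split(",")])
    return records

SAMPLE = "\n".join([
    '34,"""yes"", I said",46',
    '25,"I said:',
    '""Hello"", and I added: ""nice day, stranger""',
    'and, ""look at the sun"" , and: ',
    '""bye for now.""",33',
    '47,,35',
]) + "\n"

records = split_records(SAMPLE)
print(len(records))  # 3
print(records[2])    # ['47', '', '35']
```

Six physical lines come out as three records, and the three linefeeds of the second record are back in its tokens. Note that this sketch only protects newlines: a comma inside a quoted field still splits the field, so the same trick with a second spare character would be needed for those.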

Michael Kay
http://www.saxonica.com/ 


> -----Original Message-----
> From: Jesper Tverskov [mailto:jesper@xxxxxxxxxxx] 
> Sent: 27 March 2006 08:51
> To: Xsl-List@Lists. Mulberrytech. Com
> Subject:  regex in csv2xml
> 
> Hi list,
> 
> I am trying to make a csv2xml XSLT 2.0 stylesheet using the Excel csv format
> as example:
> If delimiter, newline or quotes are part of data the data is quoted, quotes
> are doubled.
> 
> My last problem is that the newline character can be part of data. I would
> like to detect these newline characters and replace them temporarily with
> some unique code.
> But how can I detect them in the first place?
> 
> Look at the sample below, we have 3 records and 3 fields:
> 
> 34,"""yes"", I said",46
> 25,"I said:
> ""Hello"", and I added: ""nice day, stranger""
> and, ""look at the sun"" , and: 
> ""bye for now.""",33
> 47,,35
> 
> Line 1 and 6 are records. We have an empty field in line 6.
> But line 2, 3, 4, 5 are one record with three linefeeds and several commas
> as part of data.
> 
> How can I detect with a regex, that the linefeeds at the end of line 2, 3
> and 4 are part of data?
> As I see it line 2 and 5 are the easy part, they will always have an uneven
> number of quotes.
> But the linefeeds in line 3 and 4 can only be detected as part of data if we
> compare all the lines being part of a record?
> 
> Best regards,
> Jesper Tverskov
