

Subject: RE: Converting CSV to XML without hardcoding schema details in xsl
From: "Pantvaidya, Vishwajit" <vpantvai@xxxxxxxxxxxxx>
Date: Tue, 27 Jun 2006 15:11:10 -0700
Hi Michael,

This is what I got with the following regex:

([^&quot;]*,|&quot;[^&quot;]*&quot;,)+(.*)

The ending (.*) was needed to match the last field which ends with neither a
comma nor quote.

Input CSV (3 lines):

ID,ParentID,Group,User,Title,Description,GroupBelong,EffectiveDate,EffectiveMonth,EffectiveDay,EffectiveYear,Months,EndDate,Name,AssumedName,Address,Type,Status,Amount,AmountAggregate
1,,,,A BC - A B.
Cloud,Individual,VP,2/13/2006,February,13th,2006,36,2/12/2009,"A B C,
Inc.",D E,"38th Street, MyCity, MyState 12345",TypeA,Active,"$442,000.00
",$1.62 
2,,,,ABC- Judge
ABC,Internal,VP,3/1/2006,March,1st,2006,36,2/28/2009,"Charity Services
(""CS"")",MyCity,"ABC Blvd., MyCity, MyState
12345",TypeB,Active,"$1,442,000.00 ",$1.35


Output XML:

<doc xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <row>
      <ID>"$442,000.00 ",</ID>
      <ParentID>$1.62 </ParentID>
      <Group/>
      <User/>
      <Title/>
      <Description/>
      <GroupBelong/>
      <EffectiveDate/>
      <EffectiveMonth/>
      <EffectiveDay/>
      <EffectiveYear/>
      <Months/>
      <EndDate/>
      <Name/>
      <AssumedName/>
      <Address/>
      <Type/>
      <Status/>
      <Amount/>
      <AmountAggregate/>
   </row>
   <row>
      <ID>2,,,,ABC- Judge
ABC,Internal,VP,3/1/2006,March,1st,2006,36,2/28/2009,</ID>
      <ParentID>"Charity Services (""CS"")",MyCity,"ABC Blvd., MyCity,
MyState 12345",TypeB,Active,"$1,442,000.00 ",$1.35 </ParentID>
      <Group/>
      <User/>
      <Title/>
      <Description/>
      <GroupBelong/>
      <EffectiveDate/>
      <EffectiveMonth/>
      <EffectiveDay/>
      <EffectiveYear/>
      <Months/>
      <EndDate/>
      <Name/>
      <AssumedName/>
      <Address/>
      <Type/>
      <Status/>
      <Amount/>
      <AmountAggregate/>
   </row>
   <row/>
</doc>
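
A small sketch, in Python rather than XSLT, of one possible reading of the output above. It assumes the XPath regex engine treats a repeated capturing group the way Python's re module does, keeping only the text of the last repetition. The sample line is the tail of the first data record; the names WHOLE_LINE and tail are made up for the illustration.

```python
import re

# The regex from the message, with the quotes written literally.
WHOLE_LINE = re.compile(r'([^"]*,|"[^"]*",)+(.*)')

# Tail of the first data record from the input CSV.
tail = 'TypeA,Active,"$442,000.00 ",$1.62 '

m = WHOLE_LINE.fullmatch(tail)

# Group 1 sits under "+", so only its final repetition is retained.
last_repetition = m.group(1)
# Group 2 is whatever follows the last comma-terminated field.
remainder = m.group(2)

print(repr(last_repetition))  # '"$442,000.00 ",'
print(repr(remainder))        # '$1.62 '
```

Those two values are what ended up in <ID> and <ParentID> of the first row, which suggests the regex is matching the whole record once instead of matching one field at a time.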



Thanks,

Vish.


>-----Original Message-----
>From: Pantvaidya, Vishwajit
>Sent: Tuesday, June 27, 2006 2:48 PM
>To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
>Subject: RE:  Converting CSV to XML without hardcoding schema details
>in xsl
>
>>From: Michael Kay [mailto:mike@xxxxxxxxxxxx]
>>Sent: Saturday, June 24, 2006 12:41 AM
>>> >
>>> >There's a lot of potential backtracking here: it might be better to
>>> >replace each "(.*)," with "[^,]*" or with "(.*?),".
>>>
>>> [Pantvaidya, Vishwajit] Does "[^,]*" work the same as "(.*),"
>>> - I understand that ^ is start of line metachar. How does the
>>> former match the alphabet chars?
>>
>>No, within square brackets, ^ means "not". So [^,]* matches a sequence of
>>any characters except comma.
>>
>>The problem with your expression is that (.*) matches as many characters as
>>it can. Then it sees ",", so it backtracks to find the last comma. Then it
>>sees the next (.*), and has to backtrack again; and so on.
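
The difference between the three forms can be seen on a three-field line. This is only a Python sketch (Python's re module, not the XPath regex engine), and the sample line is invented for the illustration:

```python
import re

line = "a,b,c"

# Greedy: (.*) first swallows the whole line, then backs up to the LAST comma.
greedy = re.match(r"(.*),", line).group(1)

# Reluctant: (.*?) grows one character at a time and stops at the FIRST comma.
reluctant = re.match(r"(.*?),", line).group(1)

# Negated class: [^,]* can never cross a comma, so no backtracking is needed.
negated = re.match(r"([^,]*),", line).group(1)

print(greedy)     # a,b
print(reluctant)  # a
print(negated)    # a
```

Inside square brackets the leading ^ negates the class; outside them it anchors the match to the start of the line.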
>>>
>>> >
>>> >My own instinct would be to use something like:
>>> >
>>> >([^"]*,|"[^"]*",)*
>>> >
>>>
>>> [Pantvaidya, Vishwajit] Oxygen would not accept this regex as
>>> "it matches a zero-length string".
>>
>>Perhaps then you want to change the final "*" to a "+".
>>
>[Pantvaidya, Vishwajit] That is the first thing I tried when the * did not
>work - but even then it does not seem to be working.
>
>>> Anyway, how does this regex work - it does not seem to have
>>> anything that matches the alphabet chars.
>>
>>See above: [^"] matches everything except quotes.
>>
>>> And does the ,|" match comma or double quotes - because
>>> actually some field will have both.
>>
>>The first alternative, [^"]*, matches any field that ends with a comma and
>>doesn't contain a quotation mark. The second alternative, "[^"]*", matches
>>any field that begins and ends with quotes (followed by a comma), and might
>>contain a comma between the quotes.
>>
>>It's very hard to find out what the exact rules for CSV files used by a
>>particular product are: for example, how it represents a field that contains
>>quotation marks as well as commas. (That's one of the great advantages of
>>XML: you can find a specification!) If you know the exact rules for your
>>particular flavour of CSV, you can adapt the regex to match (well, you can
>>if you study a bit more about regular expressions).
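
As an illustration of adapting the regex to one particular flavour, here is a hedged Python sketch. It assumes the flavour in which a quotation mark inside a quoted field is written as two quotation marks, which is what the (""CS"") value in the sample data suggests but does not prove. FIELD and split_record are hypothetical names, and the pattern uses Python's re syntax rather than XPath's.

```python
import re

# One field: either a quoted field (with "" standing for a literal quote)
# or a bare field that contains neither a comma nor a quote.
FIELD = re.compile(r'"((?:[^"]|"")*)"|([^,"]*)')

def split_record(record):
    """Split one CSV record into a list of field values."""
    fields = []
    pos = 0
    while True:
        m = FIELD.match(record, pos)  # always succeeds: a bare field may be empty
        quoted, bare = m.group(1), m.group(2)
        if quoted is not None:
            fields.append(quoted.replace('""', '"'))  # undo the quote doubling
        else:
            fields.append(bare)
        pos = m.end()
        if pos >= len(record):
            break
        if record[pos] != ",":
            raise ValueError(f"unexpected character at offset {pos}")
        pos += 1  # step over the separating comma
    return fields

print(split_record('TypeB,"Charity Services (""CS"")","$1,442,000.00 ",$1.35'))
```

Matching one field per step, instead of the whole record at once, means each field is captured separately and the last field needs no special case.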
>>>
>>>
>>> Maybe this conversion is easier done with some Java code.
>>>
>>I'm sure it can be done using regular expressions but it looks as if you
>>need to do some learning in this area.
>>
>[Pantvaidya, Vishwajit] Thanks a lot for all the clarifications and help.
>Actually I did look at the regex documentation in the XSLT2 spec, but not
>very exhaustively - the info on back-references I found there made me feel
>that they could be potentially useful here, e.g. to tell the regex that if a
>starting quote is found, look for an ending one. But the more I look into
>it, the more it seems like I may not be able to use them.
>
>Thanks and regards,
>
>Vish.
