Subject: RE: Converting CSV to XML without hardcoding schema details in xsl
From: "Pantvaidya, Vishwajit" <vpantvai@xxxxxxxxxxxxx>
Date: Fri, 23 Jun 2006 15:25:08 -0700
>-----Original Message-----
>From: Michael Kay [mailto:mike@xxxxxxxxxxxx]
>
>> My CSV has some commas in some cells - in those cases the
>> entire cell value is itself enclosed in quotes. So a simple
>> tokenize that splits at comma boundaries would not work - so
>> I replaced the tokenize for the cells with a regex that took
>> care of the quotes (is there any alternative here other than
>> using regex?). I had to specify the quotes in the regex as
>> &quot; After this, it started taking 45 minutes to transform
>> a 20 columns-35 rows CSV.
>
>Are you using Saxon? Performance information is only interesting if we know
>what processor you are using.
[Pantvaidya, Vishwajit] Yes, I am using oXygen as the editor, which uses
Saxon 8B.

>>
>> Next problem I found was that for columns that contain commas
>> in the value, all cells in that column are not enclosed in
>> quotes - only those cells that actually have commas are
>> enclosed in quotes. So I changed the regex to account for
>> 0/more quotes. Now it transformed in 45 secs - surprise?
>> But even now, I see that the 0/more quotes regex throws it
>> off and the csv gets incorrectly parsed resulting in the
>> wrong xml content.
>>
>> So I made some changes and the current xsl has the regex as:
>> <xsl:analyze-string select="."
>> regex="(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),
>> (.*),(.*),&quot;*(.*)&quot;*,(.*),&quot;*(.*)&quot;*,(.*),(.*),
>> &quot;*($.*)&quot;*,(.*)">
>
>There's a lot of potential backtracking here: it might be better to replace
>each "(.*)," with "[^,]*" or with "(.*?),".

[Pantvaidya, Vishwajit] Does "[^,]*" work the same as "(.*),"? I understood
^ to be the start-of-line metacharacter, so how does the former match the
alphabetic characters?
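
For reference on the question above: inside square brackets a leading ^ negates the character class instead of anchoring to the start of the line, so [^,]* reads as "zero or more characters that are not a comma", which includes letters. The Java sketch below compares the three forms on one row. It assumes the suggestion was meant as "([^,]*)," (with the capturing group and trailing comma kept), and it uses Java's regex engine, which is close to but not identical to the XPath 2.0 flavour that Saxon implements. The class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegatedClassDemo {
    // Returns what group 1 captures when `regex` is matched at the start
    // of `input`, or null if the regex does not match there.
    static String firstField(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String row = "abc,def,ghi";
        // Negated class: stops at the first comma, no backtracking needed.
        System.out.println(firstField("([^,]*),", row)); // abc
        // Greedy dot: runs to the end of the line, then backs up to the LAST comma.
        System.out.println(firstField("(.*),", row));    // abc,def
        // Reluctant dot: grows one character at a time until a comma follows.
        System.out.println(firstField("(.*?),", row));   // abc
    }
}
```

With twenty greedy "(.*)," groups in a row, each group can be retried at many different lengths before the overall match succeeds or fails, which is a plausible source of the long run times reported above.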

>
>My own instinct would be to use something like:
>
>([^"]*,|"[^"]*",)*
>

[Pantvaidya, Vishwajit] oXygen would not accept this regex because "it matches a
zero-length string".
Anyway, how does this regex work? It does not seem to have anything that
matches the alphabetic characters.
And does the ,|" part match a comma or a double quote? Some fields will
actually have both.
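
For reference on the questions above: the | separates two whole alternatives, [^"]*, (any run of non-quote characters, letters included, up to a comma) and "[^"]*", (a quoted run followed by a comma). The outer ( ... )* allows zero repetitions, so the pattern also matches the empty string, which xsl:analyze-string does not allow. The Java sketch below is one possible per-field variant and not necessarily what was intended in the suggestion: each alternative ends in a comma, so no match can be zero-length. It assumes a comma is appended to the line first, that quoted fields contain no embedded quotes, and that the input is well formed (anything that does not fit the pattern is silently skipped). All names are hypothetical, and in an XSLT attribute each " would have to be written as &quot;.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedFieldRegexDemo {
    // One field per match: a quoted field (group 1) or an unquoted field
    // (group 2), each followed by a comma. Both alternatives consume at
    // least the comma, so a match is never zero-length.
    static final Pattern FIELD = Pattern.compile("\"([^\"]*)\",|([^\",]*),");

    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        // Append a comma so the last field is terminated like the others.
        Matcher m = FIELD.matcher(line + ",");
        while (m.find()) {
            out.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return out;
    }

    // True if the regex can match the empty string.
    static boolean matchesEmpty(String regex) {
        return Pattern.matches(regex, "");
    }

    public static void main(String[] args) {
        System.out.println(fields("a,\"b,c\",d"));                  // [a, b,c, d]
        System.out.println(matchesEmpty("([^\"]*,|\"[^\"]*\",)*")); // true
        System.out.println(matchesEmpty(FIELD.pattern()));          // false
    }
}
```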

Generally, it seems that the problems with transforming such CSVs, where the
field values may themselves contain commas, may be due to there being no way to
- remember the current state (e.g. an opening double quote) and match the
remaining string based on knowledge of that state, i.e. something like "if
an opening double quote is encountered, then continue matching chars till the
closing double quote, else match till the next comma", or
- assign priority to specific matches over others, e.g. give preference to
matching quotes, if found, over commas.

Maybe this conversion is easier done with some Java code.
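
A minimal sketch of what such Java code might look like, following the "remember the current state" idea above: a single boolean records whether the scan is inside double quotes, and a comma only ends a field when it is not. This is an illustration under simplifying assumptions, not a full CSV parser: it handles one line at a time, drops the quote characters, and does not handle doubled quotes ("") or line breaks inside quoted cells. The class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;

class CsvLineSplitter {
    // Splits one CSV line into fields, treating commas inside double
    // quotes as part of the field value.
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false; // the remembered state

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                // Opening or closing quote: flip the state, do not copy it.
                inQuotes = !inQuotes;
            } else if (c == ',' && !inQuotes) {
                // Unquoted comma: the current field is complete.
                fields.add(current.toString());
                current.setLength(0);
            } else {
                // Ordinary character, or a comma inside quotes.
                current.append(c);
            }
        }
        fields.add(current.toString()); // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(split("a,\"b,c\",d")); // [a, b,c, d]
    }
}
```

Each returned list could then be written out as one row element with one child element per cell, taking the element names from the header line so that nothing about the schema is hardcoded.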


Thanks a lot Michael for all your help...


Vish.
