Subject: RE: Converting CSV to XML without hardcoding schema details in xsl
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Fri, 23 Jun 2006 09:39:32 +0100
> My CSV has some commas in some cells - in those cases the
> entire cell value is itself enclosed in quotes. So a simple
> tokenize that splits at comma boundaries would not work - so
> I replaced the tokenize for the cells with a regex that took
> care of the quotes (is there any alternative here other than
> using regex?). I had to specify the quotes in the regex as
> &quot;. After this, it started taking 45 minutes to transform
> a 20-column, 35-row CSV.
Are you using Saxon? Performance information is only interesting if we know
what processor you are using.
>
> The next problem I found was that, for columns that contain commas
> in the value, not all cells in that column are enclosed in
> quotes - only those cells that actually have commas are
> enclosed in quotes. So I changed the regex to account for
> zero or more quotes. Now it transformed in 45 secs - surprise?
> But even now, I see that the 0/more quotes regex throws it
> off and the csv gets incorrectly parsed resulting in the
> wrong xml content.
>
> So I made some changes and the current xsl has the regex as:
> <xsl:analyze-string select="."
> regex="(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),
> (.*),(.*),&quot;*(.*)&quot;*,(.*),&quot;*(.*)&quot;*,(.*),(.*),
> &quot;*($.*)&quot;*,(.*)">
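There's a lot of potential backtracking here: each "(.*)" can match across
commas, so the processor may have to try many ways of dividing the line among
the groups. It might be better to replace each "(.*)," with "([^,]*)," or
with "(.*?),".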
My own instinct would be to use something like:
([^"]*,|"[^"]*",)*
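A minimal sketch of how a pattern along those lines might be used in an
XSLT 2.0 stylesheet follows. It is an illustration under stated assumptions,
not a definitive implementation: the parameter name "line", the sample row,
the template name "main" and the element names "row" and "cell" are all
hypothetical. It also varies the pattern above in two ways: the unquoted
branch excludes commas as well as quotes, so that each match consumes exactly
one field, and a comma is appended to the line so that the last field is
terminated like the others (which also means the regex can never match a
zero-length string, something xsl:analyze-string does not allow).

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical sample row: a,"b,c",,d
       Real input would come from the source document or unparsed-text(). -->
  <xsl:param name="line" select="'a,&quot;b,c&quot;,,d'"/>

  <!-- Hypothetical entry point: run with "main" as the initial template. -->
  <xsl:template name="main">
    <row>
      <!-- Append a comma so that every field, including the last, ends
           with one. Each match is either a quoted field or an unquoted
           field, followed by its terminating comma. -->
      <xsl:analyze-string select="concat($line, ',')"
                          regex='("([^"]*)"|([^",]*)),'>
        <xsl:matching-substring>
          <cell>
            <!-- Group 2 holds the content of a quoted field, group 3 the
                 content of an unquoted one; the group that did not take
                 part in the match is a zero-length string. -->
            <xsl:value-of
                select="concat(regex-group(2), regex-group(3))"/>
          </cell>
        </xsl:matching-substring>
      </xsl:analyze-string>
    </row>
  </xsl:template>

</xsl:stylesheet>
```

For the sample row this should produce four cells: "a", "b,c", an empty cell
and "d". Because neither branch can cross a comma or a quote it does not
own, there is only one way to divide a well-formed line into fields, so
there is little for the processor to backtrack over. The sketch does not
handle doubled quotes ("") inside a quoted field, and any text that does not
fit the pattern (an unbalanced quote, for example) is silently skipped
because there is no xsl:non-matching-substring.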
Michael Kay
Saxonica