I will also look into what you have suggested.

Thanks,
Vish.

>-----Original Message-----
>From: Nathan Young -X (natyoung - Artizen at Cisco)
>[mailto:natyoung@xxxxxxxxx]
>Sent: Monday, June 26, 2006 11:02 AM
>To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
>Subject: RE: Converting CSV to XML without hardcoding schema details
>in xsl
>
>Hi.
>
>I don't know how you need to treat performance but regular expressions
>are going to be a lot slower than the low level csv parsing routines you
>can get by using a perl, java or c library someone wrote to parse csv.
>These are cleverly written and perform very well; a quick web search for
>your language will turn up useful links if you go this route.
>
>If "good enough" is good enough for you performance-wise, regular
>expressions probably can work for you. If you do pursue this I strongly
>recommend an application called "regex coach" for troubleshooting and
>learning regular expressions. It really makes the effects of your
>expression visible to you and lets you quickly adjust and try
>variations.
>
>----->Nathan
>
>> > >There's a lot of potential backtracking here: it might be better to
>> > >replace each "(.*)," with "[^,]*" or with "(.*?),".
>> >
>> > [Pantvaidya, Vishwajit] Does "[^,]*" work the same as "(.*),"
>> > - I understand that ^ is start of line metachar. How does the
>> > former match the alphabet chars?
>>
>> No, within square brackets, ^ means "not". So [^,]* matches a sequence of
>> any characters except comma.
>>
>> The problem with your expression is that (.*) matches as many characters as
>> it can. Then it sees ",", so it backtracks to find the last comma. Then it
>> sees the next (.*), and has to backtrack again; and so on.
>>
>> > >My own instinct would be to use something like:
>> > >
>> > >([^"]*,|"[^"]*",)*
>> >
>> > [Pantvaidya, Vishwajit] Oxygen would not accept this regex as
>> > "it matches a zero-length string".
>>
>> Perhaps then you want to change the final "*" to a "+".
>>
>> > Anyway, how does this regex work - it does not seem to have
>> > anything that matches the alphabet chars.
>>
>> See above: [^"] matches everything except quotes.
>>
>> > And does the ,|" match comma or double quotes - because
>> > actually some field will have both.
>>
>> The first alternative, [^"]*, matches any field that ends with a comma, and
>> doesn't contain a quotation mark. The second alternative, "[^"]*", matches
>> any field that begins and ends with quotes (followed by a comma), and might
>> contain a comma between the quotes.
>>
>> It's very hard to find out what the exact rules for CSV files used by a
>> particular product are: for example, how it represents a field that contains
>> quotation marks as well as commas. (That's one of the great advantages of
>> XML: you can find a specification!) If you know the exact rules for your
>> particular flavour of CSV, you can adapt the regex to match (well, you can
>> if you study a bit more about regular expressions).
>>
>> > Maybe this conversion is easier done with some Java code.
>>
>> I'm sure it can be done using regular expressions but it looks as if you
>> need to do some learning in this area.
>>
>> Michael Kay
>> http://www.saxonica.com/
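
For anyone following the thread, here is a rough sketch of the two-alternative idea in Python, which is not a language used in the original discussion. It rests on some assumptions: the unquoted alternative is tightened from [^"]* to [^",]* so that each match covers exactly one field, a trailing comma is appended to the line so the last field also ends with a comma, and quoted fields never contain an embedded quotation mark. The names FIELD and split_csv_line are made up for illustration.

```python
import re

# Adapted from the pattern in the thread: either a quoted field followed by
# a comma, or an unquoted field followed by a comma. The unquoted branch
# excludes commas as well as quotes (an assumption added here) so that one
# match never swallows several fields.
FIELD = re.compile(r'"([^"]*)",|([^",]*),')


def split_csv_line(line):
    """Split one CSV line into fields (escaped quotes are not supported)."""
    text = line + ","  # assumption: give the last field a trailing comma
    fields = []
    pos = 0
    while pos < len(text):
        m = FIELD.match(text, pos)
        if m is None:
            raise ValueError(f"cannot parse field at offset {pos}")
        quoted, plain = m.group(1), m.group(2)
        # Exactly one of the two groups takes part in any match.
        fields.append(quoted if quoted is not None else plain)
        pos = m.end()
    return fields


print(split_csv_line('a,b,"c,d",e'))
```

Because every alternative must consume at least the trailing comma, no single match is zero-length, which sidesteps the kind of complaint Oxygen raised about the starred group. For real data with its own quoting rules, a dedicated parser such as Python's csv module is probably the safer choice, in line with Nathan's advice above.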