Subject:Strange 'invisible' characters Author:Richard Potts Date:13 Jun 2008 06:18 AM
Hi guys, I'm receiving an XML extract from a database. I decode the extract and import it into MS Excel for downstream users.
In Excel the formatting is getting screwed up, i.e. unexpected newlines are appearing. I've traced this to some entries in the XML. See attached example.
I notice that the SS text editor viewer puts the CDATA closing brackets on a new line for my 'strange' entries.
I believe the database is exporting invisible characters in the XML, and because the XML is very large I want to identify all 'strange' entries automatically so I can ask the database team to correct them. Can I do this in SS?
If it's not possible, perhaps it could be a new feature, as I'm sure other XML guys are fed 'rubbish' by their upstream suppliers and need to identify and eliminate such issues.
Subject:Strange 'invisible' characters Author:Tony Lavinio Date:13 Jun 2008 08:20 AM
There are no strange or invisible characters in the file you sent.
There are 22 tabs, 18 linefeeds, 18 carriage returns, 29 spaces, and
everything else is a printable character. There are no ampersands,
and therefore no other characters expressed as &#nnn; or &#xnnn; or
as character entities.
So what does this mean? It's possible the receiving side is just
expecting linefeeds and doesn't like the carriage returns.
But it's more likely that since the CR+LF pairs are part of the
content of the DESCRIPTION element in the CDATA wrapper, they are
getting imported, and they are the source of your extra lines.
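A tally like the one above can be reproduced outside the editor. The following is only a minimal Python sketch; the function name, the category labels and the sample string are made up for illustration and are not taken from the attached file.

```python
from collections import Counter

def count_character_classes(text: str) -> Counter:
    """Tally tabs, linefeeds, carriage returns, spaces, printable ASCII and anything else."""
    names = {"\t": "tab", "\n": "linefeed", "\r": "carriage return", " ": "space"}
    counts = Counter()
    for ch in text:
        if ch in names:
            counts[names[ch]] += 1
        elif "!" <= ch <= "~":
            # Visible ASCII characters, U+0021 through U+007E.
            counts["printable"] += 1
        else:
            # Control characters and everything beyond ASCII.
            counts["other"] += 1
    return counts

# Hypothetical sample: a CDATA section that ends with a CR+LF pair.
sample = "<DESCRIPTION><![CDATA[Line one\r\n]]></DESCRIPTION>\r\n"
print(count_character_classes(sample))
```

When reading a real file for this, open it with open(path, newline="") so that Python does not translate CR+LF pairs into bare linefeeds before they are counted.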
Subject:Strange 'invisible' characters Author:(Deleted User) Date:13 Jun 2008 08:23 AM
Hi Richard,
the XML you posted doesn't contain invalid characters. You can look for them by pressing Ctrl-F, checking the 'use regular expression' check box and entering the search pattern "[^\x09-\x7E]" (without the quotes). The end of the collapsible region is on the line following the end of the CDATA section because the region belongs to the DESCRIPTION element; the CDATA doesn't get a region of its own because it doesn't span at least 3 lines.
Given this, it could be that the extra new line you see in Excel is an artifact of the transformation you perform, maybe caused by that extra new line located between the end of the CDATA and the end of the DESCRIPTION element.
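For a file too large to check by hand, the same character class can be run in a script. This is a rough Python sketch; the function name and the reporting format are assumptions for illustration only.

```python
import re

# Same character class as the search pattern quoted above.
OUTSIDE_RANGE = re.compile(r"[^\x09-\x7E]")

def find_outside_range(text: str):
    """Return (line, column, code point) for every character outside U+0009..U+007E."""
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for m in OUTSIDE_RANGE.finditer(line):
            hits.append((line_no, m.start() + 1, f"U+{ord(m.group()):04X}"))
    return hits

# Hypothetical sample with a non-breaking space on the second line.
print(find_outside_range("ok\nbad\u00a0x"))
```

Note that the range U+0009..U+007E still includes carriage returns, linefeeds and a few other control characters, so this search does not flag line breaks inside CDATA sections.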
Subject:Strange 'invisible' characters Author:Richard Potts Date:16 Jun 2008 04:56 AM
Thanks guys. Yes, I'm not expecting newlines in the CDATA sections, and that was what caused the issue.
So is there a regular expression or other mechanism I can use to look for the CR LF pairs that are part of a CDATA section (e.g. to find the 2nd and 3rd entries in the example file)?
To learn more about regular expressions I clicked on the link in the SS help, http://www.boost.org/libs/regex/doc/syntax.html (in the section "Moving Around in XML Documents"), and it's a broken link.
I found one: the regular expression "\n]" (without the quotes), i.e. a newline immediately followed by a closing bracket.
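A search for "\n]" only catches a line break that sits directly before the closing brackets. As a broader check, the following Python sketch lists every CDATA section that contains a CR or LF anywhere in its content. The function name and the sample document are hypothetical, and the regex-based scan is a shortcut for illustration rather than a full XML parse.

```python
import re

# Non-greedy match of one CDATA section; DOTALL lets "." cross line breaks.
CDATA = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)

def cdata_with_line_breaks(xml_text: str):
    """Return (line_number, content) for each CDATA section containing a CR or LF."""
    results = []
    for m in CDATA.finditer(xml_text):
        content = m.group(1)
        if "\r" in content or "\n" in content:
            # Line number on which the CDATA section starts (1-based).
            line_no = xml_text.count("\n", 0, m.start()) + 1
            results.append((line_no, content))
    return results

# Hypothetical sample: the second entry carries a trailing CR+LF.
xml = "<R>\n<D><![CDATA[ok]]></D>\n<D><![CDATA[bad\r\n]]></D>\n</R>"
print(cdata_with_line_breaks(xml))
```

Counting newlines from the start of the text for every match is slow on very large files; a running line counter would be the obvious refinement if that matters.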
Using this expression I found that there are hundreds of such entries in my source data, and it will probably take a long time for them to get fixed. So I'll have to get 'defensive' in my XSL: my next task is to figure out whether there is a newline in the string resulting from my <xsl:select...> and, if so, strip it off.
It looks like normalize-space() is the way to go.
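For reference, here is what normalize-space() does to a string, written as a small Python sketch that mimics the XPath 1.0 definition (strip leading and trailing whitespace, collapse each internal run of whitespace to a single space). The Python function name is just an illustration, not part of any XSLT library.

```python
import re

# XPath 1.0 whitespace characters: space, tab, carriage return, linefeed.
_XML_WS = re.compile(r"[ \t\r\n]+")

def normalize_space(value: str) -> str:
    """Collapse whitespace runs to one space, then drop leading/trailing spaces."""
    return _XML_WS.sub(" ", value).strip(" ")

# Hypothetical description value with a doubled space and a trailing CR+LF.
print(repr(normalize_space("Pump  failure\r\n")))
```

Be aware that this also collapses whitespace inside the text, so any line breaks or doubled spaces that were intentional within a description are lost as well.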