[XML-DEV Mailing List Archive Home] [By Thread] [By Date] [Recent Entries] [Reply To This Message] Re: Internal entities removed from XML?
Tim Bray wrote, > Rich Salz wrote: > >> Only benchmarking/profiling can tell > > > > It's hard to beat this: > > Not that hard I'd say given that an XML processor has to recognize > 'quot' and 'apos' and 'gt'. [snip: linear time algorithm] Ooh ... I think we can do a little better than that ;-) How about a hand-coded multi-pattern backward DAWG automaton, http://citeseer.nj.nec.com/246061.html The trick is similar to Boyer-Moore (search backwards from the end of the pattern, skip forward as far as possible on a mismatch) but modified to handle multiple patterns. It should use roughly 1/4 of the memory accesses and comparisons of a linear search. Nb. as is traditional, this might not even compile, never mind execute correctly. char *p = src+3; char *limit = p+size; char c; while(p < limit) { switch(*p) { case ';': if(*(p-1) == 't') { char c = *(p-2); if(c == 'g') { if(*(p-3) == '&') // found ">" } else if(c == 'l') { if(*(p-3) == '&') // found "<" } } p += 4; break; case 'o': if(p+2 >= limit) goto done; if(*(p+2) == ';') { char c = *(p+1); if(c == 's') { if(*(p-1) == 'p' && *(p-2) == 'a' && *(p-3) == '&') // found "'" } else if(c == 't') { if(*(p-1) == 'u' && *(p-2) == 'q' && *(p-3) == '&') // found """ } } p += 6; break; case 'p': if(p+1 >= limit) goto done; if(*(p+1) != ';') { // watch out for collision with "'" base += 2; break; } if(*(p-1) == 'm' && *(p-2) == 'a' && *(p-3) == '&') // found "&" base += 5; break; default: base += 4; } } done: The state transition diagram is a little easier to follow, but there's no way I'm going to try and do the ascii art here. Now, what was it Knuth said? ;-) Cheers, Miles
|
PURCHASE STYLUS STUDIO ONLINE TODAY!Purchasing Stylus Studio from our online shop is Easy, Secure and Value Priced! Download The World's Best XML IDE!Accelerate XML development with our award-winning XML IDE - Download a free trial today! Subscribe in XML format
|