Re: parsing XML using regular expressions

  • From: Steve Muench <smuench@u...>
  • To: "Simon St.Laurent" <simonstl@s...>
  • Date: Wed, 09 Aug 2000 09:58:01 -0700

Simon,

| Has anyone written a generic XML parser, even a somewhat broken one, that's
| built on regular expressions?  I remember hearing of something a long while
| ago, but I can't find it.
|
| I'm not concerned with the efficiency/viability/profitability/wisdom of
| such a solution, just whether or not it's been done - especially if it's
| available open source.

Regarding the technique, this is the note I bookmarked from a long time ago:

http://www.cs.sfu.ca/~cameron/REX.html

There's an interactive demo at the end of the page.

When I was playing with it, I wrote the Java class
below (no guarantee how well it works) to see if I could
apply the technique using the OROMatcher regular expression library
(http://www.savarese.org/oro/software/OROMatcher.html).

If nothing else, it should save you the time of
typing in the regular expressions.

Have fun.

______________________________________________________________
Steve Muench, Lead XML Evangelist & Consulting Product Manager
BC4J & XSQL Servlet Development Teams, Oracle Rep to XSL WG
Author "Building Oracle XML Applications", O'Reilly
http://www.oreilly.com/catalog/orxmlapp/


import java.io.*;
import com.oroinc.text.regex.*;

public final class XMLParser {

  public static final void main(String args[]) {

    // XML_SPE Regular Expressions from http://www.cs.sfu.ca/~cameron/REX.html

    String TextSE = "[^<]+";
    String UntilHyphen = "[^-]*-";
    String Until2Hyphens = UntilHyphen + "([^-]" + UntilHyphen + ")*-";
    String CommentCE = Until2Hyphens + ">?";
    String UntilRSBs = "[^]]*]([^]]+])*]+";
    String CDATA_CE = UntilRSBs + "([^]>]" + UntilRSBs + ")*>";
    String S = "[ \\n\\t\\r]+";
    String NameStrt = "[A-Za-z_:]|[^\\x00-\\x7F]";
    String NameChar = "[A-Za-z0-9_:.-]|[^\\x00-\\x7F]";
    String Name = "(" + NameStrt + ")(" + NameChar + ")*";
    String QuoteSE = "\"[^\"]*\"|'[^']*'";
    String DT_IdentSE = S + Name + "(" + S + "(" + Name + "|" + QuoteSE + "))*";
    String MarkupDeclCE = "([^]\"'><]+|" + QuoteSE + ")*>";
    String S1 = "[\\n\\r\\t ]";
    String UntilQMs = "[^?]*\\?+";
    String PI_Tail = "\\?>|" + S1 + UntilQMs + "([^>?]" + UntilQMs + ")*>";
    String DT_ItemSE = "<(!(--" + Until2Hyphens + ">|[^-]" + MarkupDeclCE + ")|\\?" + Name
        + "(" + PI_Tail + "))|%" + Name + ";|" + S;
    String DocTypeCE = DT_IdentSE + "(" + S + ")?(\\[(" + DT_ItemSE + ")*](" + S + ")?)?>?";
    String DeclCE = "--(" + CommentCE + ")?|\\[CDATA\\[(" + CDATA_CE + ")?|DOCTYPE("
        + DocTypeCE + ")?";
    String PI_CE = Name + "(" + PI_Tail + ")?";
    String EndTagCE = Name + "(" + S + ")?>?";
    String AttValSE = "\"[^<\"]*\"|'[^<']*'";
    String ElemTagCE = Name + "(" + S + Name + "(" + S + ")?=(" + S + ")?(" + AttValSE
        + "))*(" + S + ")?/?>?";
    String MarkupSPE = "<(!(" + DeclCE + ")?|\\?(" + PI_CE + ")?|/(" + EndTagCE + ")?|("
        + ElemTagCE + ")?)";
    String XML_SPE = TextSE + "|" + MarkupSPE;


    Perl5Matcher matcher;
    Perl5Compiler compiler;
    Perl5Pattern pattern = null;
    Perl5StreamInput input;
    MatchResult result;
    InputStream file = null;

    // Create Perl5Compiler and Perl5Matcher instances.
    compiler = new Perl5Compiler();
    matcher  = new Perl5Matcher();

    // Attempt to compile the pattern.  If the pattern is not valid,
    // report the error and exit.
    try {
      pattern = (Perl5Pattern)compiler.compile(XML_SPE);
    } catch(MalformedPatternException e) {
      System.err.println("Bad pattern.");
      System.err.println(e.getMessage());
      System.exit(1);
    }


    // Open input file.
    try {
      file = new FileInputStream("C:\\javadev\\OROMatcher-1.0.7\\examples\\oracle.xml");
    } catch(IOException e) {
      System.err.println("Error opening oracle.xml.");
      System.err.println(e.getMessage());
      System.exit(1);
    }

    // Create a Perl5StreamInput instance to search the input stream.
    input   = new Perl5StreamInput(file);

    // Record the start time so we can report how long the matching takes.
    long time = System.currentTimeMillis();

    // We need to put the search loop in a try block because when searching
    // a Perl5StreamInput instance, an IOException may occur, and it must be
    // caught.
    try {
      // Loop until there are no more matches left.
      while(matcher.contains(input, pattern)) {
        // Since we're still in the loop, a match was found. It is not used
        // here; matcher.getMatch() would return it as a MatchResult.
      }
    } catch(IOException e) {
      System.err.println("Error occurred while reading file.");
      System.err.println(e.getMessage());
      System.exit(1);
    }
    time = System.currentTimeMillis() - time;
    System.out.println("Parsed the file in " + time + " milliseconds.");
  }
}
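
The class above only times the matching and never looks at the tokens. As a rough sketch of what the shallow-parsing technique produces, the following uses the JDK's java.util.regex package in place of OROMatcher; that substitution is an assumption, not part of the original class. The class name RexShallowParser and the method tokenize are hypothetical names chosen for illustration. The expressions are transcribed from the listing above, with each literal ']' escaped so that java.util.regex reads it unambiguously.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: REX shallow parsing with java.util.regex (not OROMatcher).
final class RexShallowParser {

    // Same expressions as in the listing above; only ']' has been escaped.
    static final String TextSE = "[^<]+";
    static final String UntilHyphen = "[^-]*-";
    static final String Until2Hyphens = UntilHyphen + "([^-]" + UntilHyphen + ")*-";
    static final String CommentCE = Until2Hyphens + ">?";
    static final String UntilRSBs = "[^\\]]*\\]([^\\]]+\\])*\\]+";
    static final String CDATA_CE = UntilRSBs + "([^\\]>]" + UntilRSBs + ")*>";
    static final String S = "[ \\n\\t\\r]+";
    static final String NameStrt = "[A-Za-z_:]|[^\\x00-\\x7F]";
    static final String NameChar = "[A-Za-z0-9_:.-]|[^\\x00-\\x7F]";
    static final String Name = "(" + NameStrt + ")(" + NameChar + ")*";
    static final String QuoteSE = "\"[^\"]*\"|'[^']*'";
    static final String DT_IdentSE =
        S + Name + "(" + S + "(" + Name + "|" + QuoteSE + "))*";
    static final String MarkupDeclCE = "([^\\]\"'><]+|" + QuoteSE + ")*>";
    static final String S1 = "[\\n\\r\\t ]";
    static final String UntilQMs = "[^?]*\\?+";
    static final String PI_Tail =
        "\\?>|" + S1 + UntilQMs + "([^>?]" + UntilQMs + ")*>";
    static final String DT_ItemSE =
        "<(!(--" + Until2Hyphens + ">|[^-]" + MarkupDeclCE + ")|\\?" + Name
        + "(" + PI_Tail + "))|%" + Name + ";|" + S;
    static final String DocTypeCE =
        DT_IdentSE + "(" + S + ")?(\\[(" + DT_ItemSE + ")*\\](" + S + ")?)?>?";
    static final String DeclCE =
        "--(" + CommentCE + ")?|\\[CDATA\\[(" + CDATA_CE + ")?|DOCTYPE("
        + DocTypeCE + ")?";
    static final String PI_CE = Name + "(" + PI_Tail + ")?";
    static final String EndTagCE = Name + "(" + S + ")?>?";
    static final String AttValSE = "\"[^<\"]*\"|'[^<']*'";
    static final String ElemTagCE =
        Name + "(" + S + Name + "(" + S + ")?=(" + S + ")?(" + AttValSE
        + "))*(" + S + ")?/?>?";
    static final String MarkupSPE =
        "<(!(" + DeclCE + ")?|\\?(" + PI_CE + ")?|/(" + EndTagCE + ")?|("
        + ElemTagCE + ")?)";
    static final Pattern XML_SPE = Pattern.compile(TextSE + "|" + MarkupSPE);

    // Splits the input into text runs and markup items, in document order.
    static List<String> tokenize(String xml) {
        List<String> tokens = new ArrayList<>();
        Matcher m = XML_SPE.matcher(xml);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Print one token per line for a small sample document.
        for (String token : tokenize("<a x=\"1\">hi<!-- c --><b/></a>")) {
            System.out.println(token);
        }
    }
}
```

Each token is either a run of text or a single markup item (tag, comment, processing instruction, CDATA section, or DOCTYPE declaration). Concatenating the tokens gives back the input, so this is tokenizing only: nothing here checks well-formedness or tag nesting.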
