Perhaps of interest to you Perl programmers out there.
I asked one of our programmers (Gabe Schaeffer) to write a function to
parse a malformed HTML file, prior to converting it to XHTML. Here is
what he produced!
I've never seen an HTML file parsed with a single line of Perl RegEx
before!
sub ParseHTML
{
# pass in an HTML string to be parsed and a boolean indicating if
# whitespace between elements should be trimmed;
# returns a dictionary with the elements in the string
my ($html, $trim) = @_;
my ($element, $dict);
my $i = 0; # next dictionary key
$dict = $Server->CreateObject("Scripting.Dictionary");
foreach $element ($html =~
/(.*?)(<(?:(?:!--.*?--)|(?:\/?[a-z0-9_:.-]+(?:\s+[a-z0-9_:.-]+(?:=(?:[^> '"\t\n]+|(?:'.*?')|(?:".*?")))?)*))\s*\/?\s*>)/isg)
{
$element = TrimWS($element) if $trim;
$dict->Add($i++, ParseTag($element, $trim)) if length $element;
}
return $dict;
}
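For anyone who wants to try the regex outside of ASP, here is a minimal standalone sketch. It is not Gabe's code: tokenize_html is a hypothetical name, the inline substitution is only my guess at what TrimWS does, and ParseTag is left out entirely, so the function just returns the raw text runs and tags as a plain list instead of a Scripting.Dictionary.

```perl
use strict;
use warnings;

# Hypothetical stand-in for ParseHTML: same regex as above, but it returns a
# plain list of text runs and tags rather than a dictionary of ParseTag results.
sub tokenize_html {
    my ($html, $trim) = @_;
    my @tokens;
    # In list context with /g, the match returns both capture groups of every
    # match in order: the text before a tag, then the tag itself.
    foreach my $element ($html =~ /(.*?)(<(?:(?:!--.*?--)|(?:\/?[a-z0-9_:.-]+(?:\s+[a-z0-9_:.-]+(?:=(?:[^> '"\t\n]+|(?:'.*?')|(?:".*?")))?)*))\s*\/?\s*>)/isg) {
        my $piece = $element;                  # work on a copy
        $piece =~ s/^\s+|\s+$//g if $trim;     # assumed equivalent of TrimWS
        push @tokens, $piece if length $piece; # skip empty text runs
    }
    return @tokens;
}

my @tokens = tokenize_html('<p class="x">Hi <b>there</b></p>', 1);
print "$_\n" for @tokens;
```

With trimming on, that sample prints six lines: the opening p tag, Hi, the opening b tag, there, and the two closing tags. One thing to be aware of: the pattern only captures text that is followed by a tag, so any text after the last tag in the string is silently dropped.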