
  • To: xml-dev@l...
  • Subject: Re: Copying text from a source, then converting to XML
  • From: Daniel Gresh <dgresh@l...>
  • Date: Fri, 14 Jul 2006 08:39:27 -0400
  • In-reply-to: <73B9C8D87DA7654D804B36A7EDF4C5A601C9B331@x...>
  • References: <73B9C8D87DA7654D804B36A7EDF4C5A601C9B331@x...>
  • User-agent: Mozilla Thunderbird 1.0.7 (Windows/20050923)

Mark Novembrino (novembri) wrote:

>Hi, Daniel.
>
>There are probably many ways to do this.
>
>One way, perhaps a little crude, would be to use a text/macro editor to
>process the files in batch mode first. I've often used Vedit for these
>sorts of things. (http://www.vedit.com) The program doesn't do these
>kinds of batch things out of the box. You'd have to write a macro, but
>the macro language is easy to work with. Of course, you could also do
>the same thing in Perl or another scripting language.
>
>Once you've extracted the text you want to each file, the conversion to
>XML is another matter. That would depend on *which* XML you mean, i.e.,
>what DTD, what sort of text, what are the mapping rules you want to use
>and how do you want to tag the resulting XML output. You could continue
>to use the text editor for this sort of thing, or if you want a more
>"official" method, use XSLT to do the transform.
>
>Hope this helps.
>
>Not sure of your level of programming expertise. If you need any more info
>(and nobody else on the list comes up with any better answers), I'd be
>glad to help with any small scripts/macros. I don't know Perl very well,
>but I probably have some Vedit and/or VBScripts floating around
>somewhere that could do the job.
>
>- Mark Novembrino
>
>
>  
>
>>-----Original Message-----
>>From: Daniel Gresh [mailto:dgresh@l...] 
>>Sent: Thursday, July 13, 2006 1:12 PM
>>To: xml-dev@l...
>>Subject:  Copying text from a source, then converting to XML
>>
>>I have a question about this. Some of the question may not 
>>pertain to XML, but if anyone knows a method, that'd be great.
>>
>>So, I basically want to automatically search a large number 
>>of documents for certain keywords. When I find that keyword, 
>>I want the paragraph the keyword is in, not the page, to be 
>>copied and pasted somewhere. After that, I want to convert 
>>the pasted text to XML.
>>
>>Does anyone know a method for doing either of these tasks? 
>>Copying certain paragraphs or substrings of text that have 
>>certain phrases in them, then converting to XML? Perhaps 
>>there is a script of some sort? Or a free program?
>>
>>Any help would be appreciated.
>>
>>-----------------------------------------------------------------
>>The xml-dev list is sponsored by XML.org 
>><http://www.xml.org>, an initiative of OASIS 
>><http://www.oasis-open.org>
>>
>>The list archives are at http://lists.xml.org/archives/xml-dev/
>>
>>To subscribe or unsubscribe from this list use the subscription
>>manager: <http://www.oasis-open.org/mlmanage/index.php>
>>
>>    
>>
>
>  
>
You're going to have to forgive my lack of knowledge regarding the 
subject, but I am not all that familiar with XSLT. As for extracting the 
text, I've looked around a bit, and it does look like a script of some 
sort would be useful; I'll look around for an example before I try to 
make one from scratch.
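
A minimal sketch of the kind of script discussed above, in Python rather than Vedit or Perl. It assumes the documents are plain text and that paragraphs are separated by blank lines. The function names (extract_paragraphs, to_xml) and the element names (extracts, paragraph) are made up for illustration and do not follow any particular DTD.

```python
import re
import xml.etree.ElementTree as ET


def extract_paragraphs(text, keywords):
    """Return (keyword, paragraph) pairs for paragraphs containing a keyword."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    hits = []
    for para in paragraphs:
        for kw in keywords:
            # Whole-word, case-insensitive match; assumes keywords are plain words.
            if re.search(r"\b" + re.escape(kw) + r"\b", para, re.IGNORECASE):
                # Collapse hard line wraps so the paragraph is one line of text.
                hits.append((kw, " ".join(para.split())))
    return hits


def to_xml(hits, source_name):
    """Wrap the extracted paragraphs in a simple, hypothetical XML vocabulary."""
    root = ET.Element("extracts", {"source": source_name})
    for kw, para in hits:
        el = ET.SubElement(root, "paragraph", {"keyword": kw})
        el.text = para  # ElementTree escapes &, < and > on output
    return ET.tostring(root, encoding="unicode")


sample = (
    "XML is a markup language.\nIt is widely used.\n\n"
    "This paragraph is about cats.\n\n"
    "OWL builds on RDF."
)
print(to_xml(extract_paragraphs(sample, ["XML", "RDF"]), "sample.txt"))
```

To run this over a large number of documents, the same two functions could be called once per file, for example by looping over pathlib.Path(folder).glob("*.txt") and reading each file's text. Documents in other formats (Word, PDF, HTML) would need to be converted to text first.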

As for what type of XML I'm converting to, I guess I should have been a 
little more specific. I'm not even sure if this is possible, but I'm 
really crossing my fingers and hoping it is, because it will make this 
task a whole lot easier: I want to somehow extract the text and use it 
with an ontology built in RDF/OWL. Is that ... possible? Even if it's 
not possible to convert it directly to RDF/OWL format, which I would 
guess is impossible, because in OWL and RDF one needs to predefine the 
classes and such, I figured converting to an XML format would be the 
first step in the right direction.

I'm sort of digressing here, and I apologize, but I simply don't know 
where else to ask this: is there some way to extract large amounts of 
text from a large number of documents, then access it in some way by 
applying metadata to it and using RDF/OWL? Extracting the text can be 
accomplished with scripts, as mentioned earlier, or by using XSLT, 
although I am not familiar with that method, but putting it into an 
ontology is a different matter. I was thinking of organizing the text 
according to the keywords and areas I extract, and then using something 
to search through it, but that's not really what I need, and I could 
just use XQuery for that, or something similar. Does anyone have any 
thoughts? Again, I apologize for the off-topic subject; I just haven't 
found any other places to ask this.
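
One possible first step, sketched here under stated assumptions: plain RDF statements can be written without declaring any classes beforehand, so each extracted paragraph can be described as a resource with a few metadata properties and serialized as RDF/XML. The sketch uses only Python's standard library and the Dublin Core properties dc:subject, dc:source, and dc:description. The function name to_rdf_xml and the http://example.org/extract/ base URI are placeholders, and linking the resources to classes in an actual OWL ontology is not shown.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Ask ElementTree to use the conventional prefixes when serializing.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dc", DC_NS)


def to_rdf_xml(extracts, base_uri):
    """Serialize (keyword, source, text) tuples as RDF/XML descriptions."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    for i, (keyword, source, text) in enumerate(extracts, start=1):
        # Each extract becomes one resource; the URI scheme is a placeholder.
        desc = ET.SubElement(
            root,
            f"{{{RDF_NS}}}Description",
            {f"{{{RDF_NS}}}about": f"{base_uri}{i}"},
        )
        ET.SubElement(desc, f"{{{DC_NS}}}subject").text = keyword
        ET.SubElement(desc, f"{{{DC_NS}}}source").text = source
        ET.SubElement(desc, f"{{{DC_NS}}}description").text = text
    return ET.tostring(root, encoding="unicode")


extracts = [("XML", "sample.txt", "XML is a markup language. It is widely used.")]
print(to_rdf_xml(extracts, "http://example.org/extract/"))
```

The resulting file could then be loaded into an RDF toolkit and queried by keyword or source. Whether Dublin Core is the right vocabulary depends on the ontology the extracts are meant to feed.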

Thanks for all the help,
Dan
