RE: Beyond Ontologies
Hi Len,

Didier said:
>For instance, why some sites have more traffic than others?
>Simply, because they are better positioned in search engines like
>Google.

Len replied:
Usually because they have information of interest to some community. I rank that by my experiment with allowing my song "Sam (for Liz)" to be offered for free at www.bewitched.net. This has dimensions in that the name of the site is obvious, but also that the people who own it are extremely well connected to the producer and cast of the show. So the content they offer is timely and is rare. In other words, the dimensions that determine the quality of the site are both dimensions of the web as a system and what it wants, and of the users and what they want. In combination, one gets a compelling site.

What metric do I get? A continuous stream of mail about the song from that site despite the fact that it is also posted at mp3.com. One might think that mp3.com being a music site would engender more mail, but it doesn't. The majority of mail I get from there is from other members asking me to cross link or to sell me services. I know the site gets a lot of traffic because it feeds back to me at a higher rate despite the fact that a song on that site is simply just another piece of content.

Didier replies:
So in that case we can say that:

a) People organize a personal web in their "favorites" section. This is a personal ontology where keyphrases are associated with URLs. Topic map people would recognize here a topic map without topic associations or topic facets. So, in the previous example, it seems that a lot of people are using their personal ontology to access the mentioned web site.

b) New people are accessing the mentioned web site either by word of mouth or through a search engine. In either case, they are using keyphrases to qualify this site or to look for it.
In that case, people will find the site if their personal ontology matches the search engine classification or if they trust the "maven" or the "connector" in their relationship web. Again, a keyphrase positioned in a personal ontology will trigger a certain behavior: go to that site and find your song.

In all these cases, the ontology is either tacit or controlled by the search engine algorithms. Yes, there is a semantic web, but it is not based on RDF or any formal ontology; most of it is not yet explicit, it is still tacit. A lot of commercial interest is involved in keeping it tacit.

Didier said:
>Why are they better positioned because their site is structured
>in some ways that search engines like and moreover a lot of other well
>ranked sites cite them (Google is a citation based classification
>system). However, this brings some interesting perspective on the
>semantic web.

Len replied:
On the other hand, I wrote a paper on Information Ecosystems that was posted two years ago by a company in New York. When I google that term, I get back approximately a half million hits and my paper is at the top. I have trouble believing that the paper in pdf format gets that many citations. So other metrics besides citation are in play.

Didier replies:
Maybe not, if you say so. However, the site in question makes heavy use of the keyphrases "ecosystem" and "information ecosystem". Therefore the site's theme is better correlated with this keyphrase. Your document is also parsed and classified by Google, since Google can parse and classify PDF documents. Since this site is associated with a vector in the theme space related to the keyphrase "information ecosystem", your page's vector is positioned closer to the "information ecosystem" locus. Remember that I said "structured in some ways that search engines like". This means that the page content makes it more related to the "information ecosystem" vector.
However, if there were another page on the web with about the same weight but with more links pointing to it, then that page would be classified as closer to the "information ecosystem" locus on the basis of the votes/citations from other pages located in other domains. Just take some more popular keyphrases and you will notice that PageRank can make a difference when two pages are equally weighted in terms of keyphrase relevance strictly from their content/structure. I mention the structure here because if a keyphrase is included in a header it doesn't weigh as much as if it is contained in a paragraph.

Yes, other metrics are in play; citations are votes used to discriminate between pages of equal weight. However, in certain cases the votes weigh more and the entire classification is broken. This is what is happening with blogs. That said, other classification schemes, like Teoma's, use the notion of a community cluster around a certain keyphrase to reduce the influence of free-form votes. Google is slowly adapting its own algorithm to this kind of scheme. This implies that the tacit web ontology is translated into community clusters. Said differently, community cluster sets and keyphrase sets are related by an "is_part_of" relation with a certain weight. We now speak of the physical incarnation of tacit ontologies with fuzzy set ownership. If a page associated with a particular theme is referred to by a community cluster, then its vote as "is_part_of" this keyphrase carries more weight.

Simply said:

a) Actually the web is based on an implicit or tacit ontology.

b) This ontology finds its physical incarnation in community clusters and their link structure. There is a relationship between words and sites.

c) Social networks and their related economics are also based on implicit or tacit ontologies. Call them brands, constructs or whatever but, nonetheless, they exist in people's minds as tacit ontologies, and we refer to them by constructs, brands or URLs.
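As a rough illustration of the ranking idea described above, here is a minimal Python sketch in which content relevance decides the order first and votes only separate pages of equal relevance, with links from the keyphrase's community cluster counting for more. Everything in it is an assumption made for illustration: the `Page` class, the score scale, the example domains and the `community_weight` value of 2.0 are hypothetical, and this is not how Google, Teoma or any other engine actually computes its ranking.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    url: str
    content_score: float  # keyphrase relevance from content/structure (made-up scale)
    inbound: list = field(default_factory=list)  # domains that link to this page


def vote_total(page, community, community_weight=2.0):
    """Sum inbound votes; a link from the keyphrase's community cluster
    counts community_weight times, any other link counts once."""
    return sum(community_weight if d in community else 1.0 for d in page.inbound)


def rank(pages, community):
    """Order by content relevance first; votes only separate equal scores."""
    return sorted(
        pages,
        key=lambda p: (p.content_score, vote_total(p, community)),
        reverse=True,
    )


# Hypothetical data: two pages with equal content relevance, one stronger page.
community = {"tea-forum.example", "tea-blog.example"}
pages = [
    Page("a.example", 0.8, ["x.example", "y.example", "z.example"]),
    Page("b.example", 0.8, ["tea-forum.example", "tea-blog.example"]),
    Page("c.example", 0.9),
]
for page in rank(pages, community):
    print(page.url, page.content_score, vote_total(page, community))
```

In this toy data, c.example comes first on content alone, and b.example is placed ahead of a.example even though it has fewer inbound links, because both of its links come from the community cluster.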
A real semantic web revolution may happen if:

a) Search engines publish their results in RDF, OWL or any other format that a knowledge engine can process. It could be done today simply by using a Perl script to translate from HTML into one of these formats and then process that. I am not aware of such a script (if someone knows of one, please let us know; it may be useful).

b) There exists a corpus of relations between keyphrases/topics/themes/concepts. In that case we can make some inferences. This is precisely what AdSense is doing (sometimes with glitches, at other times with brio). Just look at my site, where I unwittingly ran an experiment: http://dsssl.netfolder.com. You'll notice that the ads are about XML. Google related the main themes, DSSSL and OpenJade, to XML. It can achieve this kind of relationship through DMOZ and its explicit ontology, or through some database coming from its newly acquired companies.

Didier said:
>a) Attractors (site having a lot of traffic) are associated to some
>keyphrases (a main theme and some related concept - see how adsense is
>working). Thus we can model the attractor in relation to ontologies by
>associating to a topic/class/object/keyphrase a set of sites.

Len replied:
Yes. Emergent topic maps.

Didier replies:
Precisely. A keyphrase can be considered an attractor. A community cluster can also be considered as a network representing proximity to this attractor locus. A keyphrase like "green tea weight loss" can be decomposed into two themes, "green tea" and "weight loss", and can therefore potentially be owned by two sets or two attractors: a) "green tea" and b) "weight loss". The internal structure and content of pages determine their proximity to an attractor. Votes/links amplify or reduce the fuzzy set function value.
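The attractor model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `adjust_membership`, the `gain` parameter and the linear update rule are invented here, and the numbers are made up. The sketch takes a page's content-based degree of membership in each attractor and shifts it up or down according to the share of community votes each attractor receives.

```python
def adjust_membership(content, votes, gain=0.2):
    """Return fuzzy membership degrees adjusted by community votes.

    content: {attractor: degree in [0, 1]} derived from page content/structure.
    votes:   {attractor: number of inbound links from that attractor's community}.
    gain:    hypothetical strength of the vote effect (illustrative value).
    """
    total = sum(votes.values())
    if not content or total == 0:
        return dict(content)  # no votes: content alone decides

    even_share = 1 / len(content)  # share each attractor would get if votes were even
    adjusted = {}
    for attractor, degree in content.items():
        share = votes.get(attractor, 0) / total
        # An above-even share amplifies membership, a below-even share reduces it.
        shifted = degree + gain * (share - even_share)
        adjusted[attractor] = min(1.0, max(0.0, shifted))  # keep within [0, 1]
    return adjusted


# A page whose content places it equally close to both themes.
content = {"green tea": 0.6, "weight loss": 0.6}
votes = {"green tea": 10, "weight loss": 30}
print(adjust_membership(content, votes))
```

With these made-up inputs the "weight loss" degree rises above 0.6 and the "green tea" degree falls below it, which is the push toward one attractor that community votes are meant to produce.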
For instance, taking the previous example, if a "weight loss" community links heavily to a page (from the internal structure or content), then even if its internal structure and content position its vector equally close to "weight loss" and to "green tea", the community's votes will make it closer to "weight loss". The community cluster will simply push the page toward the "weight loss" attractor.

Didier said:
>b) some people connected on the web propagate
>keyphrases/brands/concepts. These connectors act as gate keepers or as
>amplifiers.

Len replied:
They are opinion leaders in some cases and that is one of the dangers of the system. It propagates opinion which can take on a life of its own. In the Enterprise Engineering papers, I warned about 'superstitious acquisition', the danger of using citation because it can be only rumor backed up by a cult of personality.

Still, let's take a simpler example. Because we know that XML-Dev is reasonably well read, it would be interesting to see stats on how many hits on the search engines the terminology of chaos and complexity theory recorded this week. We see the bottom up driving of ontological creation, and if automated, these are what Costello should be looking at.

Didier replies:
Interesting experiment to do.

Didier said:
>d) Search engine are the real semantic web and they connect URI with
>words. More and more as demonstrated with the "~" operator (in Google)
>or with Adsense, they possess the concept of association or related
>concept to a theme. Search engines own the semantic web and ontologies.

Len replied:
To some extent, yes, but the search engine is just an engine. It is the feedback loop that creates the ontologies bottom up and then the direction those ontologies give to the direction of a search that is the nonlinear dynamic power. Look for the intelligent selector. One can do this with agents, yes, but so far, we are doing it with our own gray matter.
The web is indeed an amplifier, and its signal processing clearly demonstrates the effects of controlled feedback, and that is directed evolution if not a top down hierarchy as such. In fact, a top down directed evolution is precisely what I fear about a so-called semantic web.

Didier replies:
I took my time to think seriously and hard about your statement that it is the feedback loop that creates the ontologies bottom up. I disagree. The ontology is already there, whether implicit or tacit, or explicit as in Yahoo or DMOZ. The aggregation of URLs around an attractor/concept/theme/keyphrase is simply based on "what this page is saying about itself" and "what the others are saying about this page".

The former classification is based on the page's content and structure and the algorithms we use to classify it. Actually, the algorithms are mostly statistical, but more and more the search engines are going beyond stemming and starting to dig into named phrases for relevancy (we are not there yet, since it requires some advances in computational linguistics, but we have tremendously improved the state of the art in the last 10 years).

The latter simply confirms that we got the right classification. Actually, the latter carries a lot of weight because statistical methods are not as good as they should be at classifying documents. The better the linguistic methods and explicit knowledge used to classify become, the less important the votes will be. Until then, votes or opinions are what is used to know "what this page is all about". If a lot of sites related to "green tea" link to a page with anchors containing "green tea", then an inference can be made that the target page is about "green tea". Gather a lot of these and you reinforce your opinion. Thus, the actual classification is based on a certain "social consensus". That game can be corrupted, as we know from blogs.
Objectively, I cannot say whether it is corrupted or whether the social web, represented by the documents posted and associated with keyphrases, reflects what is more or less important. Same problem with democracy :-) we don't necessarily have the best or the most relevant, we have what the majority voted for :-)

Ouff, enough for today, let's go back to work. I am working on an interesting project: the fourth-generation web. No more fat servers and thin clients (it makes me sick to see how we returned to the mainframe paradigm with different hardware). The project I am working on uses an XML-based language that we call PDML, used to transfer from the server to the client a set of objects defined with an ontology (class hierarchy). The objects are encoded in XML and reconstructed in the client. They live for a while on the client and respond to user interactions, then come back to the server to modify the database. It is REST based, with GET (object set) and PUT (object set), and works very well with JavaScript (a prototype/instance based language) and Python (a mixed object-oriented and prototype/instance based language). It is no longer fat server, thin client; it is now object storage and an object instantiation/interaction environment.

When you go through Alice's mirror it's fun to see again that the dark ages of the last years can be overcome and progress can start again where we left it in the 80s :-)

That was long, but hey, I was silent for a while on this list :-). I was thinking...

Cheers
Didier PH Martin