
Re: XML to graph

  • From: Ihe Onwuka <ihe.onwuka@gmail.com>
  • To: daniela florescu <dflorescu@me.com>, "xml-dev@l..." <xml-dev@l...>
  • Date: Thu, 2 Jul 2015 12:11:06 -0400

The Zorba cleaning library is the most immediately interesting. Since I am not starting from a blank slate (see below) and am applying some domain knowledge, I think the problem I am solving fits into too small a subset of the scope of the paper and Lise Getoor's tutorial. There is definitely commonality in some of the heuristics, but all I am doing is matching. The Zorba library might have changed my approach had I known about it earlier, but it would not have helped with cleaning movie ratings. There the problem was how to turn a completely free-format movie rating, where anything could be entered, such as a number on an unknown scale, a letter grade, or things like

2 out of -4..+4

and 

**1/2

to a -2 to +2 scale.
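
As a rough illustration of that rating clean-up (written in Python, not the code used for the project), the sketch below handles only the two example inputs shown above. The function names, the regular expressions and the choice of a 0 to 4 star scale for star ratings are all assumptions made for the example; letter grades and bare numbers on an unknown scale are deliberately left unhandled.

```python
import re

TARGET_LO, TARGET_HI = -2.0, 2.0  # the target scale named in the post


def rescale(value, lo, hi):
    """Map value linearly from the range [lo, hi] onto [-2, +2]."""
    return TARGET_LO + (TARGET_HI - TARGET_LO) * (value - lo) / (hi - lo)


# Matches ratings that carry their own scale, e.g. "2 out of -4..+4".
EXPLICIT = re.compile(
    r"^\s*([+-]?\d+(?:\.\d+)?)\s+out of\s+([+-]?\d+)\s*\.\.\s*([+-]?\d+)\s*$"
)
# Matches star ratings with an optional half star, e.g. "**1/2".
STARS = re.compile(r"^\s*(\*+)\s*(1/2)?\s*$")

# ASSUMPTION: star ratings run from 0 to 4 stars; the post does not say.
ASSUMED_MAX_STARS = 4


def normalize_rating(raw):
    """Return the rating on the -2..+2 scale, or None if unrecognised."""
    m = EXPLICIT.match(raw)
    if m:
        value, lo, hi = (float(g) for g in m.groups())
        return rescale(value, lo, hi)
    m = STARS.match(raw)
    if m:
        stars = len(m.group(1)) + (0.5 if m.group(2) else 0.0)
        return rescale(stars, 0, ASSUMED_MAX_STARS)
    return None  # letter grades, unknown scales, etc. need their own rules
```

Under these assumptions "2 out of -4..+4" lands at 1.0 and "**1/2" at 0.5; a different assumed star maximum would move the second value.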

What I have is a bunch of heuristics based on common metadata from different movie silos. The heuristics are prioritised based on my own domain knowledge (not very satisfactory, is it, but easy to change). I do a very superficial form of stemming (lower case, get rid of non-alphanumerics), after which the highest-ranked heuristic is: if the titles match and they have a director in common, they are the same movie. The lowest-priority heuristic is: if the movie release dates are the same and they have an actor in common, they are the same movie (many movies don't have release date information). Some of these heuristics entail lookups of information culled from other repositories.
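
A minimal Python sketch of the normalisation step and of the highest and lowest priority heuristics described above. It is an illustration only, not the XSLT in question; the record layout (dictionaries with `title`, `directors`, `actors` and `release_date` keys) and all function names are hypothetical.

```python
import re


def normalize(text):
    """Lower-case the text and drop every character that is not a-z or 0-9."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def shares(names_a, names_b):
    """True if the two name lists have at least one normalised name in common."""
    return bool({normalize(n) for n in names_a} & {normalize(n) for n in names_b})


def same_by_title_and_director(m1, m2):
    """Top heuristic: titles match and there is a director in common."""
    return (normalize(m1["title"]) == normalize(m2["title"])
            and shares(m1["directors"], m2["directors"]))


def same_by_release_date_and_actor(m1, m2):
    """Lowest heuristic: same release date and an actor in common.

    A missing release date never counts as a match.
    """
    date = m1.get("release_date")
    return (bool(date)
            and date == m2.get("release_date")
            and shares(m1["actors"], m2["actors"]))
```

Note that this ASCII-only normalisation also strips accented letters, which may or may not be what the real rule does.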

Using the Freebase Search API gives me a list of candidate matches, so I am not starting from a blank slate, and it lets me circumvent concepts like edit distance. Below is a good example of the scope of the problem. We are trying to find the correct movie match for a 1998 release of Treasure Island. These are the ranked matches for that search term from the Freebase Search API. No, the right match isn't the top-ranked one, and yes, it is there, even though 1998 is not the year of any of the candidate matches.

 <movie term="Treasure Island" year="1998" rtLink="/m/1116410-treasure_island/">
    <match mid="/m/0fw837" score="375.183075" year="1950" imdb_id="tt0043067">Treasure Island</match>
    <match mid="/m/0gyk56x" score="362.390839" year="2012" imdb_id="tt1820723">Treasure Island</match>
    <match mid="/m/05351g" score="357.812256" year="1996" imdb_id="tt0117110">Muppet Treasure Island</match>
    <match mid="/m/027hq_7" score="312.396545" year="1990" imdb_id="tt0100813">Treasure Island</match>
    <match mid="/m/0d6_3x" score="303.274017" year="1972" imdb_id="tt0069229">Treasure Island</match>
    <match mid="/m/0dnv98" score="298.398956" year="1934" imdb_id="tt0025907">Treasure Island</match>
    <match mid="/m/02vr1mt" score="291.193634" year="1988" imdb_id="tt0465041">Treasure Island</match>
    <match mid="/m/0glqxkk" score="256.178223" year="1982" imdb_id="tt0084452">Treasure Island</match>
    <match mid="/m/02_fm2" score="242.696091" year="2002" imdb_id="tt0133240">Treasure Planet</match>
    <match mid="/m/076wc3r" score="234.503937" year="1985" imdb_id="tt0090199">Treasure Island</match>
    <match mid="/m/04gsb_p" score="232.238983" year="1999" imdb_id="tt0248568">Treasure Island</match>
    <match mid="/m/03d8xy4" score="232.115433" year="1972" imdb_id="tt0280371">Treasure Island</match>
    <match mid="/m/02vr8jc" score="222.359039" year="1971" imdb_id="tt0067002">Animal Treasure Island</match>
    <match mid="/m/05c2_7k" score="213.867325" year="2006" imdb_id="tt0811011">Pirates of Treasure Island</match>
    <match mid="/m/04csrh1" score="210.738998" year="1920" imdb_id="tt0011785">Treasure Island</match>
    <match mid="/m/0crsd_v" score="191.344147" year="1999" imdb_id="tt0181868">Treasure Island</match>
    <match mid="/m/0ztj51t" score="175.443268" year="1987" imdb_id="tt0787225">Treasure Island</match>
    <match mid="/m/0dlmcwr" score="166.556198" year="1954" imdb_id="tt0047406">Return to Treasure Island</match>
    <match mid="/m/04j09gc" score="164.241943" year="1939" imdb_id="tt0031147">Charlie Chan at Treasure Island</match>
    <match mid="/m/0crrmyz" score="161.611359" year="" imdb_id="">Treasure Island</match>
  </movie>
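
For readers who want to poke at a candidate list like this one, here is a small Python sketch that loads it with the standard library. The embedded fragment repeats four of the `match` elements above verbatim; the function name and the dictionary layout are assumptions for the example. It confirms the point made above: a plain year filter finds nothing, because no candidate carries the year 1998.

```python
import xml.etree.ElementTree as ET

# Four of the candidates from the listing above, copied verbatim.
FRAGMENT = """<movie term="Treasure Island" year="1998" rtLink="/m/1116410-treasure_island/">
  <match mid="/m/0fw837" score="375.183075" year="1950" imdb_id="tt0043067">Treasure Island</match>
  <match mid="/m/05351g" score="357.812256" year="1996" imdb_id="tt0117110">Muppet Treasure Island</match>
  <match mid="/m/0crsd_v" score="191.344147" year="1999" imdb_id="tt0181868">Treasure Island</match>
  <match mid="/m/0crrmyz" score="161.611359" year="" imdb_id="">Treasure Island</match>
</movie>"""


def load_candidates(xml_text):
    """Return (wanted_year, candidates) in document (ranked) order."""
    movie = ET.fromstring(xml_text)
    candidates = [
        {
            "mid": m.get("mid"),
            "score": float(m.get("score")),
            "year": m.get("year") or None,       # empty attribute -> None
            "imdb_id": m.get("imdb_id") or None,
            "title": m.text,
        }
        for m in movie.findall("match")
    ]
    return movie.get("year"), candidates


wanted_year, candidates = load_candidates(FRAGMENT)
same_year = [c for c in candidates if c["year"] == wanted_year]
```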

The code itself is very compact - just over 100 lines of XSLT - and exploits what I hope is the lazy evaluation of a sequence expression:

MatchedMovie=(xpath expression for top heuristic, xpath expression for next heuristic, .... xpath expression for last heuristic)[1]
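
The same "first hit in priority order" idea can be sketched in Python with a generator, which is lazy by construction. This is an analogue of the sequence expression above, not a translation of the actual stylesheet, and `first_match` and its arguments are hypothetical names. Whether an XSLT processor evaluates the later operands of the sequence lazily is up to the processor; the Python version only shows the intended behaviour.

```python
def first_match(movie, candidates, heuristics):
    """Return the first candidate accepted by the highest-priority heuristic.

    `heuristics` is a list of functions (movie, candidate) -> bool, ordered
    from highest to lowest priority. The generator is consumed only up to
    the first hit, so lower-priority heuristics are never called once a
    higher-priority one has matched.
    """
    hits = (
        candidate
        for heuristic in heuristics      # priority order
        for candidate in candidates      # ranked order within a heuristic
        if heuristic(movie, candidate)
    )
    return next(hits, None)              # None when no heuristic fires
```

Plugging a rule in or out is then just a matter of editing the `heuristics` list, which mirrors adding or removing one operand of the sequence expression.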

This ability to plug and unplug heuristic rules makes me believe this can be the basis of a framework. I could certainly see it being applied to music data. 

As you can see, it relies more on harvesting semantic metadata than on algorithms, and yes, it does solve the Treasure Island problem correctly.

I think it's tidier than the GroupLens project approach.


On Wed, Jul 1, 2015 at 7:04 PM, daniela florescu <dflorescu@me.com> wrote:
XQuery needs some serious extensions if you want to do what Helena did in her PhD….
(BTW, I was working with her when I wrote Quilt with Don Chamberlin… so I can see some similarities.)

Two major extensions would be:
1. FLWOR doesn’t stop when there is an exception, but just logs the exception and moves on
2. Group by has to be extended from a simple hash to a more general clustering algorithm


Dana


On Jul 1, 2015, at 3:35 PM, Ihe Onwuka <ihe.onwuka@g...> wrote:



On Wed, Jul 1, 2015 at 2:59 PM, daniela florescu <dflorescu@me.com> wrote:
Ihe,

transforming XQuery to be able to do data cleaning has been a LONG desire of mine.


The problem articulated in the paper with Citeseer publications is similar to the issues I face. For movies there are additional weapons that can be brought to bear, because actors, directors and movie titles all have several aliases documented on various sites. That said, the problem with movies may be harder, because the incidence of two different papers sharing the same title is probably relatively low.

Reading on.....





