Re: XML to graph
The Zorba cleaning library is the most immediately interesting. Since I am not starting from a blank slate (see below) and am applying some domain knowledge, I think the problem I am solving covers too small a subset of the scope of the paper and Lise Getoor's tutorial. There is definitely commonality in some of the heuristics, but all I am doing is matching.

The Zorba library might have changed my approach had I known about it earlier, but it would not have helped to clean movie ratings. There I had to turn a completely free-format rating, where anything could be entered (a number on an unknown scale, a letter grade, or things like "2" on a -4..+4 scale and "**1/2"), into a value on a -2 to +2 scale.

What I have is a bunch of heuristics based on common metadata from different movie silos. The heuristics are prioritised based on my own domain knowledge (not very satisfactory, is it, but easy to change). I do a very superficial form of stemming (lower-case the title and get rid of non-alphanumerics), after which the highest-ranked heuristic is: if the titles match and the movies have a director in common, they are the same movie. The lowest-priority heuristic is: if the release dates are the same and the movies have an actor in common, they are the same movie (many movies don't have release date information). Some of these heuristics entail lookups of information culled from other repositories.

The Freebase Search API gives me a list of candidate solutions, so I am not starting from a blank slate, and it allows me to circumvent concepts like edit distance.

Below is a good example of the scope of the problem. We are trying to find the correct movie match for a 1998 release of Treasure Island. These are the ranked matches for that search term from the Freebase Search API. No, the right match isn't the top-ranked one, and yes, it is there, even though 1998 is not the year of any of the candidate matches.
<movie term="Treasure Island" year="1998" rtLink="/m/1116410-treasure_island/">
  <match mid="/m/0fw837" score="375.183075" year="1950" imdb_id="tt0043067">Treasure Island</match>
  <match mid="/m/0gyk56x" score="362.390839" year="2012" imdb_id="tt1820723">Treasure Island</match>
  <match mid="/m/05351g" score="357.812256" year="1996" imdb_id="tt0117110">Muppet Treasure Island</match>
  <match mid="/m/027hq_7" score="312.396545" year="1990" imdb_id="tt0100813">Treasure Island</match>
  <match mid="/m/0d6_3x" score="303.274017" year="1972" imdb_id="tt0069229">Treasure Island</match>
  <match mid="/m/0dnv98" score="298.398956" year="1934" imdb_id="tt0025907">Treasure Island</match>
  <match mid="/m/02vr1mt" score="291.193634" year="1988" imdb_id="tt0465041">Treasure Island</match>
  <match mid="/m/0glqxkk" score="256.178223" year="1982" imdb_id="tt0084452">Treasure Island</match>
  <match mid="/m/02_fm2" score="242.696091" year="2002" imdb_id="tt0133240">Treasure Planet</match>
  <match mid="/m/076wc3r" score="234.503937" year="1985" imdb_id="tt0090199">Treasure Island</match>
  <match mid="/m/04gsb_p" score="232.238983" year="1999" imdb_id="tt0248568">Treasure Island</match>
  <match mid="/m/03d8xy4" score="232.115433" year="1972" imdb_id="tt0280371">Treasure Island</match>
  <match mid="/m/02vr8jc" score="222.359039" year="1971" imdb_id="tt0067002">Animal Treasure Island</match>
  <match mid="/m/05c2_7k" score="213.867325" year="2006" imdb_id="tt0811011">Pirates of Treasure Island</match>
  <match mid="/m/04csrh1" score="210.738998" year="1920" imdb_id="tt0011785">Treasure Island</match>
  <match mid="/m/0crsd_v" score="191.344147" year="1999" imdb_id="tt0181868">Treasure Island</match>
  <match mid="/m/0ztj51t" score="175.443268" year="1987" imdb_id="tt0787225">Treasure Island</match>
  <match mid="/m/0dlmcwr" score="166.556198" year="1954" imdb_id="tt0047406">Return to Treasure Island</match>
  <match mid="/m/04j09gc" score="164.241943" year="1939" imdb_id="tt0031147">Charlie Chan at Treasure Island</match>
  <match mid="/m/0crrmyz" score="161.611359" year="" imdb_id="">Treasure Island</match>
</movie>

The code itself is very compact, just over 100 lines of XSLT, and exploits what I hope is the lazy evaluation of a sequence expression:

MatchedMovie = (XPath expression for top heuristic, XPath expression for next heuristic, ..., XPath expression for last heuristic)[1]

Heuristics that find nothing contribute an empty sequence, so [1] picks the result of the highest-priority heuristic that matches anything.

This ability to plug and unplug heuristic rules makes me believe this can be the basis of a framework. I could certainly see it being applied to music data. As you can see, it relies more on harvesting semantic metadata than on algorithms, and yes, it does solve the Treasure Island problem correctly. I think it's tidier than the GroupLens project approach.

On Wed, Jul 1, 2015 at 7:04 PM, daniela florescu <dflorescu@me.com> wrote:
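To illustrate the first-match-wins idea behind the MatchedMovie expression in the message above, here is a minimal sketch in Python rather than XSLT. It is not the code from this post: the function names, the dictionary fields (title, directors, actors, released) and the sample records are all hypothetical, and only the highest- and lowest-priority heuristics described in the message are modelled. A generator expression stands in for the lazily evaluated XPath sequence.

```python
import re


def normalise(title):
    """Superficial normalisation: lower-case, then drop everything
    except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", title.lower())


def share(a, b):
    """True if the two collections have at least one item in common."""
    return bool(set(a) & set(b))


# Each heuristic takes the movie to match plus the candidate list and
# returns the first candidate it accepts, or None. (Hypothetical shape.)
def same_title_and_director(movie, candidates):
    for c in candidates:
        if (normalise(c.get("title", "")) == normalise(movie.get("title", ""))
                and share(c.get("directors", []), movie.get("directors", []))):
            return c
    return None


def same_release_date_and_actor(movie, candidates):
    released = movie.get("released")
    if not released:  # many movies have no release date information
        return None
    for c in candidates:
        if (c.get("released") == released
                and share(c.get("actors", []), movie.get("actors", []))):
            return c
    return None


# Ordered from highest to lowest priority; rules can be added or removed here.
HEURISTICS = [same_title_and_director, same_release_date_and_actor]


def match_movie(movie, candidates, heuristics=HEURISTICS):
    """Return the result of the first heuristic that finds a match, else None."""
    # The generator is lazy: a later heuristic only runs if every earlier
    # one returned None, mirroring (h1, h2, ..., hn)[1].
    hits = (h(movie, candidates) for h in heuristics)
    return next((hit for hit in hits if hit is not None), None)


# Made-up sample data, not taken from Freebase.
movie = {"title": "Treasure Island!", "directors": ["Director A"]}
candidates = [
    {"id": "cand-1", "title": "Treasure Island", "directors": ["Director B"]},
    {"id": "cand-2", "title": "treasure island", "directors": ["Director A"]},
]
print(match_movie(movie, candidates)["id"])  # prints cand-2
```

Keeping the rules in an ordered list is what makes them pluggable in this sketch: changing priorities means reordering the list, and the candidate ranking from the search API is only used as a tiebreak within a single heuristic.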