dsnotify presentation at www2010

Presentation of DSNotify (http://dsnotify.org) at the WWW 2010 conference in Raleigh/NC/USA
Published on: Mar 4, 2016
Source: www.slideshare.net


Transcripts - dsnotify presentation at www2010

  • 1. DSNotify: Handling Broken Links in the Web of Data. Niko Popitsch, University of Vienna, Austria (niko.popitsch@univie.ac.at). Joint work with Bernhard Haslhofer (bernhard.haslhofer@univie.ac.at). April 30, 2010, WWW 2010 Conference, Raleigh, North Carolina, USA
  • 2. Outline: Introduction and problem definition; Related work and solution strategies; DSNotify; Usage scenarios and design; Core algorithm; Evaluation; Summary & Discussion; References
  • 3. Linked Data Principles (short version): (1) use HTTP URIs to identify resources, (2) deliver meaningful representations (e.g., RDF, XHTML) when these are dereferenced, (3) link to other resources. Image by TBL / Hans Rosling.
  • 4. A Linked Data Example
  • 5. $ curl -H "Accept: application/rdf+xml" http://www.bbc.co.uk/music/artists/084308bd-1654-436f-ba03-df6697104e19
      Links within the data source:
        [...]
        <mo:member rdf:resource="/music/artists/5d06fe54-485a-4a07-b506-5f6f719448cb#artist" />
        <mo:member rdf:resource="/music/artists/f332a312-e95b-4413-b6cc-1762a5a6a083#artist" />
        <mo:member rdf:resource="/music/artists/0dcee02c-5d2c-4f5c-9d60-d58a4df32d9e#artist" />
        [...]
      RDF links between data sources:
        [...]
        <owl:sameAs rdf:resource="http://dbpedia.org/resource/Green_Day" />
        <mo:musicbrainz rdf:resource="http://musicbrainz.org/artist/084308bd-1654-436f-ba03-df6697104e19.html" />
        [...]
      Resource description:
        [...]
        <mo:MusicArtist rdf:about="/music/artists/084308bd-1654-436f-ba03-df6697104e19#artist">
        <rdf:type rdf:resource="http://purl.org/ontology/mo/MusicGroup" />
        <foaf:name>Green Day</foaf:name>
        [...]
  • 6. Problem: links can break
  • 7. Ignore broken links? Not a good idea! Broken links on the Web are annoying for humans, but alternative paths may be used: search engines, URL manipulation, alternative information providers, etc. Much harder for machines in a Web of Data: reduced data accessibility, data inconsistencies.
  • 8. Avoid broken links? Great! But hard to achieve in the Web environment… Solution strategies that solve the problem only partially: relative references, embedded links, redundancy. Solution strategies that are not commonly applicable: versioned/static collections, regular (predictable) updates, dynamic links, indirection services (PURLs, DOIs).
  • 9. Solve the problem (1/2): Notification. Notification strategy: the data source “knows” about the events that are taking place and notifies clients; clients may then check their links and fix the broken ones. Current activities: WOD-LMP [Volz et al. 2009]; Triplify Linked Data Update Log [Auer et al. 2009]; PubSubHubbub / sparqlPuSH; http://groups.google.com/group/dataset-dynamics; …
  • 10. Solve the problem (2/2): Detect and correct. If notification is not applicable, clients detect broken links and try to fix them. Current activities: Robust hyperlinks [Phelps & Wilensky 2000] (Web documents); PageChaser [Morishima et al. 2009] (Web documents); DSNotify, which aims at becoming a general framework for fixing broken links; …
  • 11. What events cause the problem?
  • 12. Events that potentially lead to broken links. Broken links due to deletion events: a deletion event takes place at time t when a resource had (dereferenceable) representations at t-Δ but has none at time t. Vice versa: create event. Easy to detect.
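The definition of deletion and create events above can be sketched as a set comparison. This is a minimal illustration, assuming that each monitoring pass yields the set of URIs that currently have dereferenceable representations; the function name and the example URIs are hypothetical and not taken from DSNotify.

```python
def detect_create_delete(before, after):
    """Compare two monitoring passes.

    before: set of URIs with dereferenceable representations at time t-Δ
    after:  set of URIs with dereferenceable representations at time t
    Returns (created, deleted) as sets of URIs.
    """
    deleted = before - after   # had representations at t-Δ, none at t
    created = after - before   # the reverse case
    return created, deleted


# Hypothetical example URIs
before = {"http://example.org/a", "http://example.org/b"}
after = {"http://example.org/b", "http://example.org/c"}
created, deleted = detect_create_delete(before, after)
print(sorted(created))  # ['http://example.org/c']
print(sorted(deleted))  # ['http://example.org/a']
```

The sketch only covers the "easy to detect" cases; update and move events need a comparison of the representations themselves.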
  • 13. Events that potentially lead to broken links. Broken links due to update events: an update event takes place at time t when a resource had different representations at t-Δ compared to the ones at time t. Resource updates resulting in representations with different meaning (semantic drift) may lead to semantically broken links. Hard to detect, open problem.
  • 14. Events that potentially lead to broken links. What about move events?
  • 15. Events that potentially lead to broken links. What about move events? A move event from a to b takes place at time t when: there were no representations of b at time t-Δ; there are no representations of a at time t; the representations of a at t-Δ are more similar to the ones of b at time t than to the ones of any other considered resource at time t; the calculated similarity between them is greater than a threshold. Instance matching problem!
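The four conditions of the move event definition can be written down as a predicate. The following is a minimal sketch, not the DSNotify implementation: representations are assumed to be sets of tokens, Jaccard similarity is a stand-in similarity measure, and the threshold of 0.5 as well as all names are hypothetical.

```python
def jaccard(x, y):
    """Stand-in similarity: overlap of two token sets, in [0, 1]."""
    x, y = set(x), set(y)
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)


def detect_move(a, old, new, threshold=0.5, similarity=jaccard):
    """Return the URI b for which a move event a -> b holds, else None.

    old: dict URI -> representation at time t-Δ
    new: dict URI -> representation at time t
    """
    # a must have had representations at t-Δ and must have none at t
    if a not in old or a in new or not new:
        return None
    # most similar resource among all resources considered at time t
    # (ties are resolved by iteration order, an assumption of this sketch)
    best = max(new, key=lambda uri: similarity(old[a], new[uri]))
    # b must not have had representations at t-Δ
    if best in old:
        return None
    # the similarity must exceed the threshold
    if similarity(old[a], new[best]) <= threshold:
        return None
    return best


old = {"ex:a": {"green", "day", "band"}, "ex:c": {"vienna", "city"}}
new = {"ex:b": {"green", "day", "band", "rock"}, "ex:c": {"vienna", "city"}}
print(detect_move("ex:a", old, new))  # ex:b (similarity 0.75)
print(detect_move("ex:c", old, new))  # None, ex:c still has representations
```

Because the predicate compares a resource against every candidate, it is an instance of the instance matching problem named on the slide.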
  • 16. Events that potentially lead to broken links. The core algorithm of DSNotify detects move events based on resource similarities.
  • 17. Changes in DBpedia: resources that were moved/removed/created between the DBpedia snapshots 3.2 (October 2008) and 3.3 (May 2009)
      Class         Snapshot 3.2  Snapshot 3.3  Moved  Removed  Created
      Person             213,016       244,621  2,841   20,561   49,325
      Place              247,508       318,017  2,209    2,430   70,730
      Organisation        76,343       105,827  2,020    1,242   28,706
      Work               189,725       213,231  4,097    6,558   25,967
  • 18. PART 3: DSNotify
  • 19. Usage Scenario
  • 20. Usage Scenario: an application that consumes various LD sources and may update a “source dataset”.
  • 21. Usage Scenario: DSNotify is an add-on for applications that want to preserve high link integrity in their data.
  • 22. Usage Scenario: other actors (applications) might also be interested in these events.
  • 23. General Approach: Periodically access linked data sources. Extract features from resource representations. Combine them into comparable feature vectors (FVs). Store them in 3 indices: the 1st index represents the current state of the monitored data, the 2nd index stores items that recently became unavailable, the 3rd index stores archived feature vectors. Periodically access indices 1 and 2 and log detected events. Periodically update indices 1-3.
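The three indices described above can be pictured with a small data structure. This is only a sketch under stated assumptions: the slide does not say how the indices are updated, so the rule that housekeeping archives the recently removed items is an assumption, move detection is left out, and the class and method names are hypothetical.

```python
class ThreeIndices:
    """Toy model of the three indices; not the DSNotify implementation."""

    def __init__(self):
        self.current = {}            # index 1: URI -> feature vector
        self.recently_removed = {}   # index 2: items that became unavailable
        self.archive = []            # index 3: archived (URI, feature vector)
        self.event_log = []

    def monitor(self, snapshot):
        """snapshot: URI -> feature vector of resources available right now."""
        for uri in list(self.current):
            if uri not in snapshot:
                self.recently_removed[uri] = self.current.pop(uri)
        for uri, fv in snapshot.items():
            if uri not in self.current:
                self.event_log.append(("create", uri))
            self.current[uri] = fv

    def housekeep(self):
        """Log remove events and archive the vectors (assumed behaviour)."""
        for uri, fv in self.recently_removed.items():
            self.event_log.append(("remove", uri))
            self.archive.append((uri, fv))
        self.recently_removed.clear()


idx = ThreeIndices()
idx.monitor({"ex:a": ("Green Day",)})
idx.monitor({"ex:b": ("Green Day",)})
idx.housekeep()
print(idx.event_log)
# [('create', 'ex:a'), ('create', 'ex:b'), ('remove', 'ex:a')]
```

A real housekeeping pass would first compare the vectors in index 2 with newly added vectors in index 1, so that the pair above could be reported as one move event instead of a create and a remove.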
  • 24. From Resource to Feature Vector: Both data type and object properties are supported. Feature influence is weighted. Some features are used in plausibility checks. RDFHash over all features.
  • 25. Move Event Detection: Pairwise comparison using a vector space model. Feature comparison, e.g., using Levenshtein similarity. It is sufficient to compare recently added and recently removed feature vectors! Two thresholds for comparing the similarity between FVs representing created and removed items: the lower threshold selects predecessor candidates (consider the URI of the added FV as a possible new URI of the resource represented by the removed FV); the upper threshold answers "decidable by DSNotify?" (decide whether such a candidate can be automatically selected or whether a human user has to be asked for assistance).
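The two-threshold decision can be illustrated for a single string feature. This is a minimal sketch: normalising the Levenshtein distance by the longer string's length is one common way to obtain a similarity in [0, 1] and is an assumption here, as are the threshold values 0.3 and 0.8 and all function names.

```python
def levenshtein(s, t):
    """Edit distance between two strings (dynamic programming, two rows)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]


def similarity(s, t):
    """Levenshtein distance mapped to [0, 1]; 1.0 means identical."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest


def classify(sim, lower=0.3, upper=0.8):
    """Apply the lower and upper thresholds to one similarity value."""
    if sim < lower:
        return "no candidate"   # not a predecessor candidate
    if sim >= upper:
        return "automatic"      # candidate can be selected automatically
    return "ask user"           # candidate, but a human has to decide


print(levenshtein("kitten", "sitting"))            # 3
print(classify(similarity("kitten", "sitting")))   # ask user
print(classify(similarity("Green Day", "Green Day")))  # automatic
```

In the approach described on the slide this comparison only runs between recently added and recently removed feature vectors, which keeps the number of pairs small.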
  • 26. Core Housekeeping Algorithm: C_i, R_i and M_{i,j} denote create, remove and move events of items i and j; m_x and h_x denote monitoring and housekeeping operations, respectively.
  • 27. Resulting Data Structures: DSNotify constructs three data structures: an event log containing all events detected by the system, a log containing all “event choices” DSNotify cannot decide on, and a linked structure of feature vectors constituting a history of the respective items. Accessible via: Linked data interface, Java interface, XML-RPC.
  • 28. Evaluation: Core questions: Does DSNotify work with real data? How does housekeeping frequency affect its effectiveness? Used data: data from DBpedia (8380 events) and IIMB (10 x 222 events) were used; hand-picked features based on coverage and entropy in the data sets. Results: housekeeping frequency and data source dynamics determine the number of FV pairs that have to be compared (scalability); the number of FV comparisons as well as coverage and entropy of indexed features influence the accuracy of the method.
  • 29. Evaluation - Results: influence of data source agility and housekeeping frequency on the accuracy of the DSNotify algorithm.
  • 30. Discussion: Broken links are a considerable problem in a Web of Data. The broken link problem is partly a special case of the instance matching problem. DSNotify is an event-based approach to this problem: DSNotify can be used as an add-on for data sources that want to preserve link integrity in their data. We cannot “cure” the Web of Data from broken links (but at least alleviate the pain a bit :)
  • 31. Current and Future Work: Scalability issues, evaluation. Automatic feature selection (parameter estimation). Event algebra: high-level composite events. Evaluation with other data sources (e.g., file system). Dataset dynamics: vocabularies, protocols, formats, … Thank You! niko.popitsch@univie.ac.at, http://www.dsnotify.org. Images: NASA / NSSDC.
  • 32. References and Related Work
      H. Ashman. Electronic document addressing: dealing with change. ACM Comput. Surv., 32(3), 2000.
      F. Kappe. A scalable architecture for maintaining referential integrity in distributed information systems. Journal of Universal Computer Science, 1(2):84–104, 1995.
      A. Morishima, A. Nakamizo, T. Iida, S. Sugimoto, and H. Kitagawa. Bringing your dead links back to life: a comprehensive approach and lessons learned. In HT ’09: Proceedings of the 20th ACM conference on Hypertext and hypermedia, pages 15–24, 2009.
      T. A. Phelps and R. Wilensky. Robust hyperlinks cost just five words each. Technical Report UCB/CSD-00-1091, EECS Department, University of California, Berkeley, 2000.
      J. Volz, C. Bizer, M. Gaedke, and G. Kobilarov. Discovering and maintaining links on the web of data. In 8th International Semantic Web Conference, 2009.
      A. Hogan, A. Harth, and S. Decker. Performing object consolidation on the semantic web data graph. In Proceedings of the 1st I3: Identity, Identifiers, Identification Workshop, 2007.
      A. Ferrara, D. Lorusso, S. Montanelli, and G. Varese. Towards a benchmark for instance matching. In Ontology Matching (OM 2008), volume 431 of CEUR Workshop Proceedings. CEUR-WS.org, 2008.
      C. Bizer, T. Heath, and T. Berners-Lee. Linked data - the story so far. International Journal on Semantic Web and Information Systems (IJSWIS), 5(3), 2009.
      S. Auer, S. Dietzold, J. Lehmann, S. Hellmann, and D. Aumüller. Triplify: light-weight linked data publication from relational databases. In WWW ’09, New York, NY, USA, 2009. ACM.
      W. Y. Arms. Uniform resource names: handles, purls, and digital object identifiers. Commun. ACM, 44(5):68, 2001.
