Saturday, March 23, 2013

SPARQL queries for iRODS Data

This is cool:

PREFIX irods:       <http://www.irods.org/ontologies/2013/2/iRODS.owl#> 
PREFIX skos:    <http://www.w3.org/2004/02/skos/core#>
SELECT ?x ?y
WHERE { 

?x  irods:correspondingConcept ?y .
?y skos:related <http://www.fao.org/aos/agrovoc#c_28638>
}



That's a SPARQL query running on Jena Fuseki, and it is related to the work we're doing with HIVE integration, as discussed in this previous blog entry.  SPARQL is a query language that can be used to search semantic metadata: in our case, metadata that describes the iRODS catalog, SKOS controlled vocabularies, and 'serialized' RDF statements, saved as iRODS AVUs, that apply controlled vocabulary terms to iRODS files and collections.  This improves on the normal iRODS AVUs by giving them structure and meaning via SKOS.
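As a rough sketch of how a client might send the query above to Fuseki over the standard SPARQL protocol, the Python below builds the HTTP request using only the standard library.  The endpoint URL (a local Fuseki on port 3030 with a dataset named irods) is an assumption for illustration, not the setup used for this post.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: a local Fuseki server exposing a dataset named "irods".
FUSEKI_QUERY_URL = "http://localhost:3030/irods/query"

# The query from the top of this post.
QUERY = """\
PREFIX irods: <http://www.irods.org/ontologies/2013/2/iRODS.owl#>
PREFIX skos:  <http://www.w3.org/2004/02/skos/core#>
SELECT ?x ?y
WHERE {
  ?x irods:correspondingConcept ?y .
  ?y skos:related <http://www.fao.org/aos/agrovoc#c_28638>
}
"""


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build a SPARQL-protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(FUSEKI_QUERY_URL, QUERY)

# To run it against a live server (not executed here):
#   from urllib.request import urlopen
#   with urlopen(request) as response:
#       body = response.read().decode("utf-8")
```

The request is only constructed, not sent, so the sketch can be tried without a running server.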

In the case above, we have a term defined in the Agrovoc vocabulary that looks something like this snippet, shown in Turtle syntax.

<http://www.fao.org/aos/agrovoc#c_1669>
      a       skos:Concept ;
      skos:narrower <http://www.fao.org/aos/agrovoc#c_7979> , <http://www.fao.org/aos/agrovoc#c_1745> , <http://www.fao.org/aos/agrovoc#c_7656> , <http://www.fao.org/aos/agrovoc#c_3688> , <http://www.fao.org/aos/agrovoc#c_6963> , <http://www.fao.org/aos/agrovoc#c_14658> , <http://www.fao.org/aos/agrovoc#c_16099> , <http://www.fao.org/aos/agrovoc#c_29563> , <http://www.fao.org/aos/agrovoc#c_613> ;
      skos:prefLabel "Climatic zones"@en ;
      skos:related <http://www.fao.org/aos/agrovoc#c_7213> , <http://www.fao.org/aos/agrovoc#c_28638> , <http://www.fao.org/aos/agrovoc#c_29554> ;
      skos:scopeNote "Use for areas having identical climates; for the physical phenomenon use Climate (1665)"@en .


Note that SKOS defines broader, narrower, and related terms, along with other data.  This means that a user may tag an iRODS file or collection with a term like c_1669, and then search for it by the related term c_28638.
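To make the idea concrete, here is a minimal in-memory sketch of "tag with one term, find by a related term".  The skos:related links are copied from the c_1669 snippet; the tagged iRODS path and the find_by_related helper are hypothetical, and a real deployment would ask the triple store rather than Python dictionaries.

```python
AGROVOC = "http://www.fao.org/aos/agrovoc#"

# skos:related links copied from the c_1669 ("Climatic zones") snippet.
related = {
    AGROVOC + "c_1669": {
        AGROVOC + "c_7213",
        AGROVOC + "c_28638",
        AGROVOC + "c_29554",
    },
}

# Hypothetical tagging: a made-up iRODS path marked with the term c_1669.
tags = {
    "irods://localhost:1247/test1/home/test1/climate-data.csv": {AGROVOC + "c_1669"},
}


def find_by_related(concept, tags, related):
    """Return sorted paths tagged with any term that lists `concept` as skos:related."""
    matching_terms = {term for term, others in related.items() if concept in others}
    return sorted(path for path, terms in tags.items() if terms & matching_terms)


# Searching on c_28638 finds the file even though it was tagged with c_1669.
hits = find_by_related(AGROVOC + "c_28638", tags, related)
```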

That's what the SPARQL query above shows: you are looking for any iRODS files or collections that have an AVU with a SKOS vocabulary term from Agrovoc that is related to a given concept.  An excerpt of the result of this query, in JSON, looks like this:

{
        "x": { "type": "uri" , "value": "irods://localhost:1247/test1/trash/home/test1/jargon-scratch.1256888938/JenaHiveIndexerServiceImplWithDerbyTest/testExecuteOnt/subdirectory2/hivefile7" } ,
        "y": { "type": "uri" , "value": "http://www.fao.org/aos/agrovoc#c_1669" }
      } ,
      {
        "x": { "type": "uri" , "value": "irods://localhost:1247/test1/trash/home/test1/jargon-scratch.1256888938/JenaHiveIndexerServiceImplWithDerbyTest/testExecuteOnt/subdirectory1/hivefile7" } ,
        "y": { "type": "uri" , "value": "http://www.fao.org/aos/agrovoc#c_1669" }
      } ,
      {
        "x": { "type": "uri" , "value": "irods://localhost:1247/test1/trash/home/test1/jargon-scratch.705362199/JenaHiveIndexerServiceImplWithOntTest/testExecuteOnt/subdirectory1/hivefile6" } ,
        "y": { "type": "uri" , "value": "http://www.fao.org/aos/agrovoc#c_1669" }
      } ,


As you can see (or at least trust me on this), you are finding iRODS data based on a related concept.  With Fuseki, we could add such SPARQL queries in short order to the iDrop apps, or even to iCommands.  Note that we've done this by marking up iRODS data with SKOS terms, storing these as special AVUs, indexing them with a spider, and then putting them into a Jena triple store for SPARQL queries.  The same sorts of things can also be pretty easily done using Lucene for text search, and adding these new methods of finding data is going to be an interesting area for Jargon and iRODS development.  You can see some of the HIVE work in the GForge project at DICE and RENCI here!
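For anyone wiring this into a client, the JSON bindings shown earlier can be unpacked with a few lines of Python.  This is a sketch that assumes the full response follows the W3C SPARQL JSON results layout (a "head" object plus "results.bindings"); the excerpt above shows only the bindings, so the surrounding envelope in the sample is filled in from that standard rather than from the actual Fuseki output.

```python
import json

# One binding copied from the output above, wrapped in the standard
# SPARQL JSON results envelope (the envelope is an assumption).
sample = json.dumps({
    "head": {"vars": ["x", "y"]},
    "results": {"bindings": [
        {
            "x": {"type": "uri", "value": "irods://localhost:1247/test1/trash/home/test1/jargon-scratch.1256888938/JenaHiveIndexerServiceImplWithDerbyTest/testExecuteOnt/subdirectory2/hivefile7"},
            "y": {"type": "uri", "value": "http://www.fao.org/aos/agrovoc#c_1669"},
        },
    ]},
})


def extract_pairs(results_json: str):
    """Return (iRODS URI, concept URI) pairs from a SPARQL JSON result document."""
    doc = json.loads(results_json)
    return [(b["x"]["value"], b["y"]["value"]) for b in doc["results"]["bindings"]]


pairs = extract_pairs(sample)
for irods_uri, concept_uri in pairs:
    print(irods_uri, "->", concept_uri)
```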

Tuesday, March 12, 2013

Some work in progress integrating HIVE with iRODS

iRODS has a powerful facility, through the iCAT master catalog, to manage user-supplied metadata on different parts of the catalog domain, such as files and collections.  These are 'AVU' triples, which are just attribute-value-unit slots that can hold free-format data.

We're using AVUs by adding conventions and metaphors on top of them, such as free tags, starred folders, and shares, as shown in this previous video demo.  One weakness of AVUs is that they are totally unstructured.  This does not mean that we cannot apply structure at a higher level, and that's exactly what the interest in HIVE integration is about.

HIVE is an acronym for Helping Interdisciplinary Vocabulary Engineering, and HIVE is a project from the Metadata Research Center at the School of Information and Library Science at UNC Chapel Hill.  (Did I mention we were just voted the #2 best program in the country by US News and World Report?)

HIVE is a tool that allows browsing and searching across controlled vocabularies defined in SKOS, a simple RDF schema for defining dictionaries, thesauri, and other structured metadata.  A key aspect is the integration of RDF with Lucene to allow searching across selected vocabularies, a helpful approach since much of the focus of iRODS and DICE is in multi-disciplinary research collaboration, as in the Datanet Federation Consortium.  HIVE solves a lot of problems we were facing, so it is a happy circumstance that the MRC is just around the corner from us, and we're busy looking at integration.

In a nutshell, HIVE allows us to:


  • Keep multiple controlled vocabularies
  • Allow users to easily search and navigate across vocabularies to find appropriate terms
  • Make AVU metadata meaningful by providing structure and consistency
  • Power rich metadata queries using tools such as SPARQL to find iRODS files and collections

A short video demo follows that shows the first level of integration between iDrop (the iRODS cloud browser) and HIVE.  We've added a HIVE tab to contain a concept browser, allowing markup of iRODS files and collections with controlled vocabulary terms.




Note that we've yet to add search across vocabularies and automatic keyword extraction with MAUI and KEA.  These are available in HIVE, and we intend to add them in this project.

The next step is to build the capability to extract iRODS data and vocabulary terms and populate a triple store (Sesame or Jena), allowing queries on the triple store and processing of the results so that users can access the referenced data in iRODS.  We're seeking a generalized approach so that we can have a standard practice to store RDF statements about iRODS data, and we can index and manage real-time updates.  This aspect is next for the project, and should have a wide application for iRODS users!
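As a purely illustrative sketch of what such a convention could look like, the Python below turns an AVU into an N-Triples statement ready for loading into a triple store.  The encoding is an assumption, not the project's actual format: the attribute holds the concept URI, the value holds the vocabulary name, and a made-up unit marker flags the AVU as a vocabulary term.  The Avu class, the HIVE_UNIT marker, and the sample iRODS path are all hypothetical; the predicate is the irods:correspondingConcept property from the iRODS ontology namespace.

```python
from dataclasses import dataclass
from typing import Optional

# Predicate from the iRODS ontology namespace.
CORRESPONDING_CONCEPT = (
    "http://www.irods.org/ontologies/2013/2/iRODS.owl#correspondingConcept"
)

# Hypothetical unit marker that flags an AVU as a controlled vocabulary term.
HIVE_UNIT = "hive:vocabulary-term"


@dataclass(frozen=True)
class Avu:
    """An iRODS attribute-value-unit triple."""
    attribute: str
    value: str
    unit: str


def avu_to_ntriple(irods_uri: str, avu: Avu) -> Optional[str]:
    """Return an N-Triples line for a vocabulary-term AVU, or None for any other AVU."""
    if avu.unit != HIVE_UNIT:
        return None
    return f"<{irods_uri}> <{CORRESPONDING_CONCEPT}> <{avu.attribute}> ."


# Hypothetical file tagged with the Agrovoc term c_1669.
term_avu = Avu("http://www.fao.org/aos/agrovoc#c_1669", "agrovoc", HIVE_UNIT)
free_tag = Avu("tag", "climate", "")

triple = avu_to_ntriple("irods://localhost:1247/test1/home/test1/hivefile7", term_avu)
skipped = avu_to_ntriple("irods://localhost:1247/test1/home/test1/hivefile7", free_tag)
```

An indexer along these lines would walk the catalog, emit one line per vocabulary-term AVU, and skip ordinary free-format AVUs.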