data.lincoln.ac.uk

Recently, I posted on the LNCD blog about our work on data.lincoln.ac.uk. You might find it interesting.

One of the by-products of our recent ‘proper’ projects is data.lincoln.ac.uk. This is simply a site that documents the data we are warehousing in our MongoDB datastore (called ‘Nucleus’), and the programmatic methods by which we (and the public) can access that data. Most of the data is licensed for public use, but where appropriate (e.g. personal data), a secure access token must be requested. Currently, outside of our own projects, the only people who need or want secure access tokens are some third-year computer science students who are using data.lincoln.ac.uk as the basis for their dissertation projects and require access to their own personal event data.

Our approach to publishing open data at the University of Lincoln has been to do so in a way that was immediately useful to the work we were doing…

Read more about ‘an open platform for development’.

Open Data at Lincoln: What have we got?

Tony Hirst recently blogged about the Open Data scene in UK HE, mentioning Lincoln as one of the few universities that are currently contributing HEI-related #opendata to the web. Sooner or later, I’ll write a more reflective post, but here I just wanted to document the current situation (that I’m aware of) at Lincoln. There are two groups that take an interest in furthering open data at Lincoln: LiSC, led by Prof. Shaun Lawson, and LNCD, the new cross-university group I co-ordinate which consolidates a lot of the previous and current work listed below. (For a broader overview of recent work, see this post).

Derek Foster in LiSC recently released energy data from our main campus buildings, updated every two hours on Pachube. I was just speaking to Nick and Alex and I think they plan to pull this data into our Nucleus datastore, combine it with the campus location-based work we’ve done and generate dynamic heat maps (assuming Derek isn’t already working on something similar?).

LiSC are also mashing open data from the UK Police Crime Statistics database to create a social application called FearSquare and last week put together MashMyGov, a site that randomly suggests mashups using data sourced from Data.Gov.UK.

In the past couple of years, LNCD have worked on:

JISCPress, a 2009/10 project we worked on that didn’t release any data but developed a prototype WordPress platform that atomises documents for publication and comment on the web and spits out lots of data in open formats. It also uses OpenCalais, Triplify and can push RDF Linked Data to the Talis Platform. JISC now use it to publish documents for comment.

Total Recal, a JISC-funded project we completed recently and will roll out across the university this September. As well as providing a fairly comprehensive and flexible calendaring service at the university, it allowed us to work on our space-time data and develop a number of APIs on top of…

Nucleus, the epicentre of our open data efforts. This is a data store, using MongoDB, which aggregates data from a number of disparate university databases and makes that data available over secure APIs. Through a lot of hard work over the last year, Alex and Nick have compiled the single largest data store that we have at the university. Currently, it offers APIs to university events, calendars, locations and people. We’ll also be adding APIs to over 250,000 CC0 licensed bibliographic records held in Nucleus, too (see Jerome below). It also uses the OAuth-based authentication that Alex has developed.

Linking You, a JISC-funded project we delivered to JISC last week, which looked at our use of URIs, undertook a comparative study of 40 HEI websites (more to come), proposed a high-level data model for use by the HEI sector and made some recommendations for further work. What we’ve learned on this project will have a lasting effect on the way we present our data and on our wider advocacy of open data to the university sector. I really hope that our recommendations will lead us to more discussion and collaboration with people interested in opening university data.

lncn.eu, a URL shortener that Alex and Nick developed in their spare time and which has since been formally adopted by the university. Naturally, lncn.eu has an API and can be used (e.g. by Jerome) as a proxy for other services, collecting real-time analytics.

Jerome, a current JISC-funded project that will release over 250,000 bibliographic records under a CC0 license. The data is stored in Nucleus and documented APIs will be available by the end of July. This is a very cool project managed by Paul Stainthorp in the Library (who’s also a member of LNCD).

We’re currently using data.online.lincoln.ac.uk to document the data that is accessible over our APIs. At some point, I can see us moving to data.lincoln.ac.uk – we just need to find time to discuss this with the right people. So far, we haven’t really gone down the RDF/Linked Data route, preferring to offer data that is linked (e.g. locations and events data are linked) and publicly accessible over APIs that are authenticated where necessary and open whenever possible. We are keen to engage in the RDF/Linked Data discussion – it’s just a matter of finding time. Please invite us to your discussions, if you think we might have something to contribute!

Triplify: Make your blog mashable

Last week, I wrote about how it is relatively simple to ‘pimp your ride on the semantic web’. Over the weekend, I stumbled upon Triplify, a small ‘plugin’ for pretty much any web publishing platform, that “reveals the semantic structures encoded in relational databases by making database content available as RDF, JSON or Linked Data.” What is so appealing about Triplify is how easy it is to implement, especially alongside a WordPress site.

I can confirm that the three-step installation process is all it takes, although I wouldn’t implement this blindly as you are, literally, exposing a semantic representation of your database content. In other words, you should look at the configuration file you’re using and check that it’s going to expose the right data and not clear-text passwords or unpublished posts and comments. Before I implemented it, I realised that it would expose comments on a bunch of posts that I have since made private (they were imported from an old, private blog), so I had to ‘unapprove’ those comments so the script didn’t expose them to the public. A five-minute job. Alternatively, the script could probably be modified to work around my problem by, for example, only exposing comments after a certain date.

The end result is that, with a WordPress site, you expose a semantic representation of your users, posts, pages, tags, categories, comments and attachments in RDF (N-Triples) and JSON formatted data (for JSON, just add ‘?t-output=json’ to the end of the URI). Like I said though, it could be used on any database driven web application. Here’s what you get when you expose the high level links to your content:


<http://blog.josswinn.org/triplify/> <http://www.w3.org/2000/01/rdf-schema#comment> "Generated by Triplify V0.5 (http://Triplify.org)" .
<http://blog.josswinn.org/triplify/> <http://creativecommons.org/ns#license> <http://creativecommons.org/licenses/by/2.0/uk/> .
<http://blog.josswinn.org/triplify/post> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Class> .
<http://blog.josswinn.org/triplify/attachment> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Class> .
<http://blog.josswinn.org/triplify/tag> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Class> .
<http://blog.josswinn.org/triplify/category> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Class> .
<http://blog.josswinn.org/triplify/user> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Class> .
<http://blog.josswinn.org/triplify/comment> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Class> .
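As a rough illustration of the ‘?t-output=json’ option mentioned above, here is a minimal Python sketch that adds the parameter to a Triplify URI while keeping any query string that is already there. The helper name with_json_output is made up for this example and is not part of Triplify; only the ‘t-output=json’ parameter comes from the text above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_json_output(uri: str) -> str:
    """Return the URI with t-output=json set, preserving other query parameters.

    with_json_output is a hypothetical helper name used for illustration only.
    """
    parts = urlsplit(uri)
    # Keep existing parameters, then set (or overwrite) t-output.
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query["t-output"] = "json"
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    print(with_json_output("http://blog.josswinn.org/triplify/post/154"))
    # http://blog.josswinn.org/triplify/post/154?t-output=json
```

The resulting URI could then be fetched with any HTTP client. The shape of the JSON that comes back is not shown here, so check it against your own installation.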

Here’s an example of what you get when you expose the full content:


<http://blog.josswinn.org/triplify/post/154> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://rdfs.org/sioc/ns#Post> .
<http://blog.josswinn.org/triplify/post/154> <http://rdfs.org/sioc/ns#has_creator> <http://blog.josswinn.org/triplify/user/1> .
<http://blog.josswinn.org/triplify/post/154> <http://purl.org/dc/terms/created> "2008-10-06T05:55:25"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
<http://blog.josswinn.org/triplify/post/154> <http://rdfs.org/sioc/ns#content> "Up early to go to Sheffield for LPI exams. The last week has left me underprepared. Never mind." .
<http://blog.josswinn.org/triplify/post/154> <http://purl.org/dc/terms/modified> "2008-10-06T20:12:15"^^<http://www.w3.org/2001/XMLSchema#dateTime> .

...

<http://blog.josswinn.org/triplify/post/154> <http://www.holygoat.co.uk/owl/redwood/0.1/tags/taggedWithTag> <http://blog.josswinn.org/triplify/tag/27> .

...

<http://blog.josswinn.org/triplify/post/154> <http://www.holygoat.co.uk/owl/redwood/0.1/tags/taggedWithTag> <http://blog.josswinn.org/triplify/tag/41> .
<http://blog.josswinn.org/triplify/post/154> <http://www.holygoat.co.uk/owl/redwood/0.1/tags/taggedWithTag> <http://blog.josswinn.org/triplify/tag/42> .

...

<http://blog.josswinn.org/triplify/post/154> <http://sdp.iasi.rdsnet.ro/semantic-wordpress/vocabulary/belongsToCategory> <http://blog.josswinn.org/triplify/category/22> .

...

<http://blog.josswinn.org/triplify/tag/154> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.holygoat.co.uk/owl/redwood/0.1/tags/Tag> .
<http://blog.josswinn.org/triplify/tag/154> <http://www.holygoat.co.uk/owl/redwood/0.1/tags/tagName> "valentine" .

You can choose to expose different levels of information in your HTML source. If you have more than a moderate amount of content, you’ll probably want to just expose the top level links as in the first example and let the users of your data dig deeper. You’ll also note that you can (and should) attach a license to your data.
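To give a feel for how a consumer of this data might read the N-Triples shown above, here is a small Python sketch that splits each line into subject, predicate and object and groups the statements by subject. It is deliberately simplified: it assumes one triple per line with URI subjects and predicates, and it ignores blank nodes and language tags, so a proper RDF library would be the safer choice for real use. The function name parse_ntriples and the sample text are illustrative assumptions.

```python
import re
from collections import defaultdict

# Matches simple lines of the form:
#   <subject> <predicate> <object-uri> .
#   <subject> <predicate> "literal"^^<datatype> .
TRIPLE = re.compile(
    r'^<(?P<s>[^>]*)>\s+<(?P<p>[^>]*)>\s+'
    r'(?:<(?P<o_uri>[^>]*)>|"(?P<o_lit>(?:[^"\\]|\\.)*)"(?:\^\^<(?P<dt>[^>]*)>)?)'
    r'\s*\.\s*$'
)


def parse_ntriples(text):
    """Yield (subject, predicate, object) tuples, skipping lines that do not match."""
    for line in text.splitlines():
        match = TRIPLE.match(line.strip())
        if match:
            uri = match.group("o_uri")
            obj = uri if uri is not None else match.group("o_lit")
            yield match.group("s"), match.group("p"), obj


# Two lines copied from the example output above, plus an elision marker
# that the parser should skip.
sample = r'''
<http://blog.josswinn.org/triplify/post/154> <http://rdfs.org/sioc/ns#has_creator> <http://blog.josswinn.org/triplify/user/1> .
<http://blog.josswinn.org/triplify/post/154> <http://purl.org/dc/terms/created> "2008-10-06T05:55:25"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
...
'''

if __name__ == "__main__":
    by_subject = defaultdict(list)
    for s, p, o in parse_ntriples(sample):
        by_subject[s].append((p, o))
    for subject, statements in by_subject.items():
        print(subject)
        for predicate, obj in statements:
            print("   ", predicate, "->", obj)
```

Statements are kept in a list per subject rather than a dictionary, because a post can repeat the same predicate (for example, one taggedWithTag statement per tag).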

A number of namespaces are recognised as well as a WordPress vocabulary.


$triplify['namespaces']=array(
'vocabulary'=>'http://sdp.iasi.rdsnet.ro/semantic-wordpress/vocabulary/',
'rdf'=>'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
'rdfs'=>'http://www.w3.org/2000/01/rdf-schema#',
'owl'=>'http://www.w3.org/2002/07/owl#',
'foaf'=>'http://xmlns.com/foaf/0.1/',
'sioc'=>'http://rdfs.org/sioc/ns#',
'sioctypes'=>'http://rdfs.org/sioc/types#',
'dc'=>'http://purl.org/dc/elements/1.1/',
'dcterms'=>'http://purl.org/dc/terms/',
'skos'=>'http://www.w3.org/2004/02/skos/core#',
'tag'=>'http://www.holygoat.co.uk/owl/redwood/0.1/tags/',
'xsd'=>'http://www.w3.org/2001/XMLSchema#',
'update'=>'http://triplify.org/vocabulary/update#',
);
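One practical use of a prefix map like this is to shorten the long predicate URIs in the output into readable prefix:name pairs, and to expand them back again. The Python sketch below copies a few of the prefixes from the configuration above. The compact and expand helpers are hypothetical names for illustration, not part of Triplify.

```python
# A subset of the prefixes from the Triplify configuration above.
NAMESPACES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "sioc": "http://rdfs.org/sioc/ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "tag": "http://www.holygoat.co.uk/owl/redwood/0.1/tags/",
}


def compact(uri: str) -> str:
    """Shorten a full URI to prefix:name; return it unchanged if no prefix applies."""
    matches = [(prefix, base) for prefix, base in NAMESPACES.items() if uri.startswith(base)]
    if not matches:
        return uri
    # Prefer the longest matching namespace in case one is a prefix of another.
    prefix, base = max(matches, key=lambda item: len(item[1]))
    return f"{prefix}:{uri[len(base):]}"


def expand(name: str) -> str:
    """Expand prefix:name to a full URI; return it unchanged if the prefix is unknown."""
    prefix, sep, local = name.partition(":")
    if sep and prefix in NAMESPACES:
        return NAMESPACES[prefix] + local
    return name


if __name__ == "__main__":
    print(compact("http://rdfs.org/sioc/ns#Post"))  # sioc:Post
    print(expand("dcterms:created"))                # http://purl.org/dc/terms/created
```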

So, what’s the point in doing this? Well, it’s fairly trivial and if you think that structured, linked, machine-readable licensed data is a Good Thing, why not? The Triplify website lists a number of advantages:

Such a triplification of your Web application has tremendous advantages:

  • The installations of the Web application are better found and search engines can better evaluate the content.
  • Different installations of the Web application can easily syndicate arbitrary content without the need to adopt interfaces, content representations or protocols, even when the content structures change.
  • It is possible to create custom tailored search engines targeted at a certain niche. Imagine a search engine for products, which can be queried for digital cameras with high resolution and large zoom.

Ultimately, a triplification will counteract the centralization we faced through Google, YouTube and Facebook and lead to an increased democratization of the Web.

The vision of the semantic web and semantic publishing is one of meaningfully identifying objects (and people) on the Internet and showing their relationships. This should improve searches for things on the web, but also improve how we exchange knowledge, re-use information and help clarify our identity on the web, too. It’s an ambitious task, but made easier with tools like Triplify. The semantic web also raises questions over individual privacy: if data is well formed and accessible, it may also be easier to control and therefore censor. The creator of Triplify recently gave a technical presentation on Triplify and how it is being used to publish data collected by the OpenStreetMap project. It shows how geodata exposed in this way can result in mashup applications that directly benefit you and me.