Metadata …arrghhh!

In my previous job as Audiovisual Archivist, I spent a lot of time examining various metadata standards in detail: hours spent poring over PBCore, METS, MODS, MIX, EXIF and IPTC/XMP, because we were designing a content model for an in-house Digital Asset Management system. I thought I had put it all behind me, yet here I am staring at Phil Barker’s informative post about ‘metadata and resource description’ and it’s all coming back to me… Arrghhh 🙂

Workpackage six of the Chemistry.fm project aims to:

Plan the storage, delivery and marketing of the course.
Choose a metadata standard
Evaluate third-party hosting such as Flickr, Slideshare and YouTube as well as JORUM and the IR.

Ah, if only life were as simple as a series of bullet points!

As I was creating the project poster yesterday, I was reminded of the various ways that our project OERs could be ‘broadcast’. Although collaboration with our community radio station, SirenFM, is core to the approach of our project, we all know that there are many ways for anyone to be a broadcaster on the web. Part of the fun of this project for me is being able to explore the different ways that educational content can be pulled and pushed between subscribing students and members of the public.

My plan at the moment is to use our Institutional Repository as the ‘canonical reference’ for the OERs. During our JISC-funded LIROLEM project, we developed EPrints to better accommodate multimedia resources and it makes sense to use a versioned digital archive that supports embedded media enriched by copious amounts of metadata. (I know it’s a requirement to use JORUM, too, but at the first Programme Meeting, it became clear that JORUM can be used simply as a directory where we can register URIs of existing OERs, so that’s what I’ll be doing).

Anyway, Archivists, have you ever feasted your eyes on the source code of an EPrint? Of course you have. Here’s a reminder.

Looking at the (draft) Metadata Guidelines for the OER Programme, you can see that the following are covered by EPrints:

programme tag [there is no “DC.keyword” term, so EPrints uses name=”eprints.keywords”]
title [name=”DC.title”]
author [name=”DC.creator”]
date [name=”DC.date”]
url [name=”DC.identifier”]
technical information [name=”DC.format”]
language [hmmm, nowhere to be seen. Can we add that?]
subject classification [name=”DC.subject”]
keywords/tags [there is no “DC.keyword” term, so EPrints uses name=”eprints.keywords”]
comments [We use the SNEEP plugins but the comments are not showing in the source code – do we need to make sure they are crawlable? Some people aren’t keen…]
description [name=”DC.description”]
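A checklist like this is easy to automate. What follows is only a rough sketch, not anything EPrints ships: it reads the meta tags out of a page's HTML and reports which of the expected names are missing. The sample HTML and its values are made up for illustration, and "DC.language" is my assumption about what a language field would be called if it were added.

```python
from html.parser import HTMLParser

# Names taken from the list above; "DC.language" is an assumption,
# since no language field currently appears in the page source.
EXPECTED = [
    "DC.title", "DC.creator", "DC.date", "DC.identifier", "DC.format",
    "DC.language", "DC.subject", "eprints.keywords", "DC.description",
]


class MetaCollector(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attrs = dict(attrs)
        name, content = attrs.get("name"), attrs.get("content")
        if name and content is not None:
            self.fields.setdefault(name, []).append(content)


def collect_meta(html):
    """Return a dict mapping each meta name to its list of content values."""
    collector = MetaCollector()
    collector.feed(html)
    return collector.fields


def missing_fields(html, expected=EXPECTED):
    """List the expected meta names that do not appear in the page."""
    found = collect_meta(html)
    return [name for name in expected if name not in found]


# Hypothetical page source, invented for this example.
SAMPLE = """<html><head>
<meta name="DC.title" content="An example title" />
<meta name="DC.creator" content="Example, Author" />
<meta name="eprints.keywords" content="example-tag, chemistry" />
</head></html>"""

if __name__ == "__main__":
    print(missing_fields(SAMPLE))
```

Running the same check against a real EPrint page would mean fetching its HTML first and passing that string to missing_fields().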

I’ve highlighted the Dublin Core terms above, but happily, the data is available in several other alternate formats:

<link rel="alternate" href="http://eprints.lincoln.ac.uk/cgi/export/1543/HTML/lirolem-eprint-1543.html" title="HTML Citation" type="text/html; charset=utf-8" />
<link rel="alternate" href="http://eprints.lincoln.ac.uk/cgi/export/1543/Text/lirolem-eprint-1543.txt" title="ASCII Citation" type="text/plain; charset=utf-8" />
<link rel="alternate" href="http://eprints.lincoln.ac.uk/cgi/export/1543/ContextObject/lirolem-eprint-1543.xml" title="OpenURL ContextObject" type="text/xml" />

<link rel="alternate" href="http://eprints.lincoln.ac.uk/cgi/export/1543/EndNote/lirolem-eprint-1543.enw" title="EndNote" type="text/plain" />
<link rel="alternate" href="http://eprints.lincoln.ac.uk/cgi/export/1543/BibTeX/lirolem-eprint-1543.bib" title="BibTeX" type="text/plain" />
<link rel="alternate" href="http://eprints.lincoln.ac.uk/cgi/export/1543/MODS/lirolem-eprint-1543.xml" title="MODS" type="text/xml" />
<link rel="alternate" href="http://eprints.lincoln.ac.uk/cgi/export/1543/COinS/lirolem-eprint-1543.txt" title="OpenURL ContextObject in Span" type="text/plain" />
<link rel="alternate" href="http://eprints.lincoln.ac.uk/cgi/export/1543/DIDL/lirolem-eprint-1543.xml" title="DIDL" type="text/xml" />
<link rel="alternate" href="http://eprints.lincoln.ac.uk/cgi/export/1543/XML/lirolem-eprint-1543.xml" title="EP3 XML" type="text/xml" />
<link rel="alternate" href="http://eprints.lincoln.ac.uk/cgi/export/1543/JSON/lirolem-eprint-1543.js" title="JSON" type="text/javascript; charset=utf-8" />
<link rel="alternate" href="http://eprints.lincoln.ac.uk/cgi/export/1543/DC/lirolem-eprint-1543.txt" title="Dublin Core" type="text/plain" />
<link rel="alternate" href="http://eprints.lincoln.ac.uk/cgi/export/1543/RIS/lirolem-eprint-1543.ris" title="Reference Manager" type="text/plain" />
<link rel="alternate" href="http://eprints.lincoln.ac.uk/cgi/export/1543/EAP/lirolem-eprint-1543.xml" title="Eprints Application Profile" type="text/xml" />
<link rel="alternate" href="http://eprints.lincoln.ac.uk/cgi/export/1543/Simple/lirolem-eprint-1543.txt" title="Simple Metadata" type="text/plain" />
<link rel="alternate" href="http://eprints.lincoln.ac.uk/cgi/export/1543/Refer/lirolem-eprint-1543.refer" title="Refer" type="text/plain" />
<link rel="alternate" href="http://eprints.lincoln.ac.uk/cgi/export/1543/METS/lirolem-eprint-1543.xml" title="METS" type="text/xml" />
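Because these are ordinary link elements, a script can discover the export formats for an EPrint without knowing anything about EPrints itself. Here is a minimal sketch using only the Python standard library. The two sample lines are copied from the list above; the function name is just something I made up for the example.

```python
from html.parser import HTMLParser


class AlternateLinks(HTMLParser):
    """Collect the title and href of every <link rel="alternate"> element."""

    def __init__(self):
        super().__init__()
        self.formats = {}

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attrs = dict(attrs)
        if attrs.get("rel") == "alternate" and attrs.get("title") and attrs.get("href"):
            self.formats[attrs["title"]] = attrs["href"]


def export_formats(html):
    """Return a dict mapping each export format title to its URL."""
    parser = AlternateLinks()
    parser.feed(html)
    return parser.formats


# Two of the link elements shown above, used as sample input.
SAMPLE_LINKS = """
<link rel="alternate" href="http://eprints.lincoln.ac.uk/cgi/export/1543/MODS/lirolem-eprint-1543.xml" title="MODS" type="text/xml" />
<link rel="alternate" href="http://eprints.lincoln.ac.uk/cgi/export/1543/JSON/lirolem-eprint-1543.js" title="JSON" type="text/javascript; charset=utf-8" />
"""

if __name__ == "__main__":
    for title, href in export_formats(SAMPLE_LINKS).items():
        print(title, href)
```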

Now, we could choose to lump all the OERs that we create into one single EPrint, but that doesn’t give us much flexibility. Remember that EPrints is serving as the canonical reference for the OERs, not necessarily the final presentation layer that people will actually use to browse, download and use the resources. If we were instead to group the OERs into sets of items, each set constituting an EPrint, and then relate those EPrints to each other using the “DC.isPartOf” property, then from the point of view of metadata we’d be creating a consistent whole while giving ourselves some flexibility in how we ‘broadcast’ the content of the course.

EPrints DC.relation — Dublin Core relationships

If we consider the course MindMap that we knocked up a while back, we might decide to create a single EPrint for each of the five major ‘nodes’ of the course. Doing this would then give us RSS 1.0 (RDF), RSS 2.0 and Atom feeds for the course, where each node is an item.

Introductory Chemistry Mindmap — Course MindMap

Before I move on with this, look at the export formats that EPrints offers for a query. Imagine that the course could be exported in each of these ways:

EPrints export formats — Exporting from EPrints

The zip export allows you to download the entire query and all its resources at once. The HTML citation format allows you to produce some HTML you could copy and paste into any web page. It could just as easily be dropped into Blackboard as it could on any other (and anybody’s) web page. BibTeX would allow you to browse the course via your preferred reference management software and JSON… I still don’t completely get it, but it’s pretty fancy, I know that much.

Anyway, if each of the mindmap nodes is an ‘item’ in the RSS feed, then perhaps we can use that to feed a WordPress site, using the FeedWordPress plugin? Nope. It doesn’t seem to work. FeedWordPress recognises the feed but doesn’t import anything. Testing it with another feed based on keywords does work, but the information included in that feed is sparse, so that’s no good. By the way, the EPrints RSS 2.0 feed does include the xmlns:media=”http://search.yahoo.com/mrss” namespace and marks up the preview thumbnails accordingly:


<media:thumbnail url="http://eprints.lincoln.ac.uk/1543/thumbnails/15/small.png" type="image/png"></media:thumbnail><media:content url="http://eprints.lincoln.ac.uk/1543/thumbnails/15/preview.png" type="image/png"></media:content>
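To see what a feed consumer could do with that markup, here is a small sketch that pulls the thumbnail and content URLs out of each item using Python's standard XML parser. The surrounding feed (channel and item titles) is invented for the example; only the two media elements are taken from the snippet above. The namespace URI is a parameter because it has to match whatever the feed declares character for character, and the Media RSS specification itself writes it with a trailing slash.

```python
import xml.etree.ElementTree as ET

# Namespace URI as quoted above; pass a different value if the feed
# declares it differently (for example with a trailing slash).
MEDIA_NS = "http://search.yahoo.com/mrss"

# Hypothetical feed wrapper around the two media elements shown above.
SAMPLE_FEED = """<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss">
<channel><title>Hypothetical feed</title>
<item><title>Hypothetical item</title>
<media:thumbnail url="http://eprints.lincoln.ac.uk/1543/thumbnails/15/small.png" type="image/png"></media:thumbnail>
<media:content url="http://eprints.lincoln.ac.uk/1543/thumbnails/15/preview.png" type="image/png"></media:content>
</item></channel></rss>"""


def media_for_items(feed_xml, media_ns=MEDIA_NS):
    """Return one dict per RSS item with its title and media URLs."""
    root = ET.fromstring(feed_xml)
    ns = {"media": media_ns}
    results = []
    for item in root.iter("item"):
        results.append({
            "title": item.findtext("title", default=""),
            "thumbnails": [e.get("url") for e in item.findall("media:thumbnail", ns)],
            "content": [e.get("url") for e in item.findall("media:content", ns)],
        })
    return results


if __name__ == "__main__":
    for entry in media_for_items(SAMPLE_FEED):
        print(entry)
```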

(Another way to tackle this might be using our newly developed ‘EPrints2Blog’ plugin, which allows a depositor to post information about their new EPrint to a blog of their choice (using XML-RPC). As we deposit the course EPrints, each could be posted to a WordPress site. The resulting feed from the WordPress site does include some embedded media, but it’s still a bit of a hack. No, scrap this idea).

Post2Blog: An XML-RPC plugin for EPrints

Podcasting from Eprints in WordPress

Right, how about this…?

Using EPrints as the canonical source for each of the files for the course, we could create a WordPress site with the addition of the Dublin Core and OAI-ORE plugins for WordPress.

For each WordPress post, this gives us the following metadata:


<meta name="DC.publisher" content="../learninglab/joss" />

<meta name="DC.publisher.url" content="https://joss.blogs.lincoln.ac.uk/" />

<meta name="DC.title" content="Thinking the unthinkable" />

<meta name="DC.identifier" content="https://joss.blogs.lincoln.ac.uk/2009/10/08/thinking-the-unthinkable/" />

<meta name="DC.date.created" scheme="WTN8601" content="2009-10-08T16:14:54" />

<meta name="DC.creator" content="Joss" />

<meta name="DC.rights.rightsHolder" content="Joss" />

<meta name="DC.subject" content="Funding" />

<meta name="DC.rights.license" content="http://creativecommons.org/licenses/by-nc-sa/2.0/uk/" />

<link rel="alternate" type="application/rss+xml" title="Comments: Thinking the unthinkable" href="https://joss.blogs.lincoln.ac.uk/2009/10/08/thinking-the-unthinkable/feed/" />

<!-- OAI-ORE -->

<link rel="resourcemap" type="application/atom+xml" href="https://joss.blogs.lincoln.ac.uk/wp-content/plugins/oai-ore/rem.php"/>

This is more like it. Click on the oai-ore link and look at the source code. It’s too big to display here, but it does what you’d expect and produces an OAI-ORE 1.0 compliant Atom/XML file. Contained within the file is a ‘resource map’ of all the WordPress posts and pages marked up with Dublin Core and FOAF terms. Thinking about how the course site might be represented in this way, it makes sense to atomise the course even further so that each of the sub-nodes of the Mind Map is a WordPress post. Using the current course structure, that would result in about 20 separate posts to represent the course. Each post would contain one or more resources such as a PDF, video, audio, slides, etc. Is it worth atomising it even further and creating a post for each of these resources, too, I wonder? Quite possibly.
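For anyone wanting to poke at a resource map programmatically, here is a tentative sketch. It assumes the ORE 1.0 Atom convention in which the aggregation is an Atom entry and each aggregated resource is a link whose rel is the ORE ‘aggregates’ term. I haven’t checked that the WordPress plugin’s output follows exactly this shape, and the sample document and its example.org URLs are entirely made up.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
ORE_AGGREGATES = "http://www.openarchives.org/ore/terms/aggregates"

# Hypothetical resource map, invented for this example.
SAMPLE_REM = """<entry xmlns="http://www.w3.org/2005/Atom">
  <id>http://example.org/course/aggregation</id>
  <title>Hypothetical course aggregation</title>
  <link rel="http://www.openarchives.org/ore/terms/aggregates"
        href="http://example.org/course/node-1/" title="Node 1" />
  <link rel="http://www.openarchives.org/ore/terms/aggregates"
        href="http://example.org/course/node-2/" title="Node 2" />
  <link rel="alternate" href="http://example.org/course/" />
</entry>"""


def aggregated_resources(rem_xml):
    """Return the href of every Atom link marked as an ORE aggregated resource."""
    root = ET.fromstring(rem_xml)
    links = root.iter("{%s}link" % ATOM_NS)
    return [link.get("href") for link in links if link.get("rel") == ORE_AGGREGATES]


if __name__ == "__main__":
    for href in aggregated_resources(SAMPLE_REM):
        print(href)
```

With the course atomised into about 20 posts, a listing like this would be a quick way to check that every post made it into the map.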

Unfortunately, the resource map does not include media that are included in each post or page – apparently it’s on the developer’s list of things to do. Maybe we could use some of the project budget to ask Alex, who’s working on the JISCPress project with me, to extend the plugin in this way…

Finally, there’s also a MediaRSS plugin for WordPress, which could enhance the RSS feeds to include all the media used in the course. Here’s an example that’s including images by default. I’ve already written about the various feeds that are available for WordPress. With some careful categorisation and tagging, media-rich feeds would be available for different points (‘nodes’) of entry into the course.

Once we are at this point, I guess we’re ready to think about broadcasting the course via Boxee and DeliTV (no time to dig into that now. Sorry!)

Metadata… arrghhh!

p.s. you’ve probably noticed that I’m a bit weak on the EPrints and OAI-ORE stuff, to say the least. Please do pick me up on where I’m going wrong with this. Thanks 🙂