Open Calais + site-wide tags = semantic site architecture

Preamble about people

Over the last month, I’ve started to grow an embryonic social web publishing platform that can be many things but fundamentally offers a personalised and collaborative environment for research, teaching and learning. (Where? You’re looking at it!) There are a few active blogs (currently fewer than on the pilot Learning Lab blogs), nearly 70 users and the word is starting to get out at a pace that I can manage. So, now it’s time to look to the future…

By running BuddyPress, the connections between people are pretty much taken care of. Sign in to http://blogs.lincoln.ac.uk with a Lincoln username and password and you’ve joined a community that, as it grows, will increasingly and effortlessly connect people through the information they choose to add to their profile. Staff and students can click on a link and find other people who have similarly tagged their profile.

Notice the comma-separated hyper-linked data

What is of equal interest to me, and potentially very useful to the university community, is how we link the content that is being generated by staff and students and make those links accessible. It is not difficult to appreciate what the potential is when you have a revolving community of 10,000 people who, over time, document their work, their research, teaching and learning using cutting edge web publishing tools, but I’m writing this post to try and understand and sketch out how I might evolve what I have begun.

Put simply, WordPress Multi-User (WPMU) allows one person (me) to provide and manage multiple web sites which other people (staff and students) take ownership of. Typically, every action, every new user and every new page and post on every site, is recorded and held in a shared database(s). Although at this low level, the data is relational, on the surface, when you look at one of the sites, they pretty much stand alone and so they should. We’re not talking about a single website with lots of users, we’re talking about lots of websites with lots of users. They might be working collaboratively with others, but they’re working as individuals or in distinct groups that benefit from a distinct online identity. BuddyPress helps bring things together by aggregating people’s actions (i.e. posting blog updates, making friends, joining groups, posting messages) but the visibility of those connections is transient. Social networks display our actions along a timeline and the connections between people are, for the most part, buried until the next time person A interacts with person Y.

Enough about connecting people.

Site-wide content aggregation

Site content is a mixture of text, multimedia and metadata. The last thing I’ll do when completing this blog post is to categorise and tag it. Each time I write, I publish text, (sometimes images) and metadata which summarises and categorises the full text. Why am I telling you this? You know it already. What you may not know is that each post created on our university WPMU installation, by any person, providing their blog is public, is aggregated into a single site and re-published a second time. So this post exists here on this site and there, on the Community Posts site. Notice how the Community Posts version links back to the original post. We’re not creating a whole new resource, we’re creating a powerful linked resource that allows others to search, filter, browse and discover content held across multiple sites. With only a few sites up and running here at the moment, the opportunity to discover varied content is limited, but over time that will change. Look at wordpress.com, where there are 5 million sites:

Browse by user-generated metadata

Search over 5 million sites

On the university blogs, this is made possible through the use of the site-wide-tags plugin, which was developed by @donncha, the same person that develops WPMU and the wordpress.com site. By using this plugin, a WPMU installation can share similar functionality to what you see on wordpress.com. I say ‘similar’ because, as I’ll mention later, designing how people discover content is key to all of this and something I, or we as a community, would benefit from thinking about and acting on collectively.

Community Posts

On the Community Posts site, you can search the full-text of every post, filter resources by category and tag, and subscribe to feeds from any combination of tag or category. Any search can be turned into a feed by appending ‘&feed=rss’ to the end of the resulting URL.

e.g. http://tags.blogs.lincoln.ac.uk/?s=gaming&feed=rss

To create a feed from a tag or category, just click on a tag or category and append ‘/feed’ to the end of the URL.

e.g. http://tags.blogs.lincoln.ac.uk/tag/games/feed/

You can combine tags with ‘+’, too:

http://tags.blogs.lincoln.ac.uk/tag/games+development/

You can also specify the type of feed you want by appending:

/feed/rss/
/feed/rss2/
/feed/rdf/
/feed/atom/
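
The URL patterns above can also be put together in code. Here is a minimal Python sketch, assuming the Community Posts site stays at http://tags.blogs.lincoln.ac.uk as in the examples; the function names `search_feed` and `tag_feed` are my own, for illustration only, and are not part of WordPress or the site-wide-tags plugin.

```python
from urllib.parse import urlencode, quote

BASE = "http://tags.blogs.lincoln.ac.uk"  # Community Posts site, as in the examples above
FEED_TYPES = ("rss", "rss2", "rdf", "atom")


def search_feed(term):
    """Feed for a full-text search: ?s=<term>&feed=rss (hypothetical helper)."""
    return f"{BASE}/?{urlencode({'s': term, 'feed': 'rss'})}"


def tag_feed(*tags, feed_type=None):
    """Feed for one tag, or for several tags combined with '+' (hypothetical helper)."""
    if not tags:
        raise ValueError("at least one tag is required")
    if feed_type is not None and feed_type not in FEED_TYPES:
        raise ValueError(f"unknown feed type: {feed_type}")
    # Escape each tag on its own so the '+' separator stays literal.
    path = "+".join(quote(tag, safe="") for tag in tags)
    url = f"{BASE}/tag/{path}/feed/"
    if feed_type:
        url += f"{feed_type}/"
    return url
```

Calling `tag_feed("games", "development", feed_type="atom")` builds the combined-tag URL with the Atom suffix, which is handy if you want to generate a list of feeds for a course page rather than typing each one out.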

Mixing categories and tags is currently broken by a bug but is due to be fixed in the next version of WordPress.

So it’s not difficult to imagine, over time, an active community of thousands of university web publishers, having their content aggregated into a site-wide resource that allows full text searching, browsing and filtering with a choice of feeds to syndicate that content elsewhere. See how it’s happening at the University of Mary Washington, where over 2400 sites have been created in under three years.

Semantic technology

Yesterday, I discovered OpenCalais. It’s a semantic technology that’s been around since January 2008, so you might be tired of hearing about it, but if not, ‘Welcome to Web 3.0!’

The Calais Web Service automatically creates rich semantic metadata for the content you submit – in well under a second. Using natural language processing, machine learning and other methods, Calais analyzes your document and finds the entities within it. But, Calais goes well beyond classic entity identification and returns the facts and events hidden within your text as well.

Nice. And it’s installed on this site. There are three Calais plugins available for WordPress. This one allows writers to submit their blog posts to the OpenCalais web service API and fetch back a number of auto-generated tags based on the content of their post. The longer the post, the more tags are returned, and they come back in just seconds. Those tags can be added to the post in their entirety or used selectively (actually, you have to add them all and then remove those you don’t want to include, which is a minor irritation). This next plugin automatically goes through every post you’ve written and tags each one using the Calais web service. It’s all or nothing, but following the auto-tagging of archive content, you can then go to the ‘tags’ menu and delete any tags you don’t want to use. I’ve done that to this site and to the Community Posts site. Calais looks for names, facts and events, and the API allows for up to 40,000 transactions a day and up to four per second. It returns some predictable tags and a few odd ones, but on the whole it is fast and works like magic.

The third plugin also allows blog authors to fetch tags for the post they are writing and, in addition, it also suggests Creative Commons licensed images based on a dynamic evaluation of the chosen or suggested tags.

The tagaroo interface

Image suggestion is a nice idea, but tends to return some fairly generic images.

Having used OpenCalais to auto-tag the Community Posts site, a whole new and richer set of semantic metadata has been added with barely any effort. The challenge now is to figure out how to 1) automate this as a scheduled process, so that the Calais plugin looks for new content every hour, say, and tags whatever has been recently introduced (perhaps a cron job that calls the plugin, plus a modification to the plugin so that it looks at the timestamp of each post and ignores anything older than the last run?); and 2) present the semantic data in an accessible way, which mostly, I think, comes down to appropriate site design. The wordpress.com screenshots above show one way of doing it. A del.icio.us style approach is a more powerful and versatile model of tag filtering. Until then, it’s a matter of constructing filters, searches and feeds in the way I’ve outlined above.
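
To make the scheduled idea a bit more concrete, here is a rough Python sketch of the "ignore anything older than the last run" logic. It is not how any of the Calais plugins work: `tag_new_posts` and `suggest_tags` are hypothetical names, the posts are plain dictionaries rather than WordPress records, and `suggest_tags` merely stands in for whatever actually sends text to Calais. The only details taken from above are the timestamp check and the limit of four requests per second.

```python
import time


def tag_new_posts(posts, last_run, suggest_tags, max_per_second=4):
    """Tag only posts published after last_run; return (tagged, new_last_run).

    posts        -- iterable of dicts with 'id', 'published' and 'content';
                    'published' is any comparable timestamp (Unix time, datetime)
    suggest_tags -- hypothetical callable standing in for the Calais request;
                    takes the post text and returns a list of tags
    """
    min_gap = 1.0 / max_per_second  # stay under the requests-per-second limit
    tagged = {}
    newest = last_run
    for post in sorted(posts, key=lambda p: p["published"]):
        if post["published"] <= last_run:
            continue  # already handled by an earlier run
        started = time.monotonic()
        tagged[post["id"]] = suggest_tags(post["content"])
        newest = max(newest, post["published"])
        # Sleep off whatever remains of this request's time slot.
        remaining = min_gap - (time.monotonic() - started)
        if remaining > 0:
            time.sleep(remaining)
    return tagged, newest
```

An hourly cron job would call this with the timestamp saved from the previous run and then store the returned `new_last_run` for next time, so that each post is only ever sent to the tagging service once.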

So how might all of this semantically structured data be used? It seems to me that most of the advantages are proportional to the quantity of information available. For teaching and learning, it could be used by students and staff who want to find and re-use material that has been posted in the past for a specific course or subject area. Great for new students who want to measure the type and quality of work produced by students in previous years. In a similar way, it could be used by staff looking for posts by colleagues on subjects they might be teaching, and because searches and tags can be turned into feeds, past content could be aggregated into a new course site. A widely adopted, semantically tagged WPMU installation could also reveal trends in the type of work occurring at the university and, by tagging names of people, queries against references to Prof. X’s work could be made (I also wonder whether through the use of feeds, content from the institutional repository could be joined up with all of this, too – but it’s late in the day and I can’t think straight).
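
As an illustration of pulling past content from several feeds into one place, here is a small Python sketch that merges items from RSS 2.0 documents, drops duplicates and sorts them newest first. It is only a sketch under assumptions: `merge_feeds` is a name I have made up, the feeds are passed in as strings that have already been downloaded, and only the plain RSS 2.0 `item`, `title`, `link` and `pubDate` elements are read (the RDF and Atom variants use namespaces and would need extra handling).

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def merge_feeds(*rss_documents):
    """Combine items from several RSS 2.0 documents, newest first, no duplicates."""
    seen = set()
    items = []
    for document in rss_documents:
        root = ET.fromstring(document)
        for item in root.iter("item"):
            link = (item.findtext("link") or "").strip()
            published = item.findtext("pubDate")
            if not link or not published or link in seen:
                continue  # incomplete item, or the same post seen via another feed
            seen.add(link)
            items.append({
                "title": (item.findtext("title") or "").strip(),
                "link": link,
                "published": parsedate_to_datetime(published.strip()),
            })
    return sorted(items, key=lambda entry: entry["published"], reverse=True)
```

Because the same post can turn up under more than one tag or search, the link is used as the identity of a post; a course site built this way would show each post once, however many feeds it arrived through.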

You’ll see from the image below that using Calais on the Community Posts site resulted in a much richer variety of tags than would have appeared if we relied on user-generated tagging alone (136 posts now have 558 tags). Some people don’t even bother to tag their work… Shame on them! Notice, too, that with the Firefox Operator plugin, you can take a tag on the site and use it to find related resources elsewhere. So if you’re looking at work tagged ‘client-applications’ on WPMU, you can conveniently hop over to delicious and find further web resources or, on a whim, look at what books on this subject are available on Amazon.

Operator provides a way to use tags on one site to discover related resources on another site

Anyway, if you’re still reading, you might remember from the title of this post that my overriding interest in all of this is how it can be understood as and developed into a site-wide ‘architecture’. Again, I’m thinking how user-generated tags have determined the way delicious is designed for navigation and searching of resources. I need to learn more about how WordPress themes are constructed and consider how available functions can be best exploited and usefully presented on this type of site. If you have any ideas or want to work on a specific theme to get the most out of the site-wide-tags plugin, please do leave a comment or get in touch on Twitter @josswinn

OpenSim, OpenID and Open Microblogging

My second day in Leeds for the ALT Conference 2008 and I’m really excited about three open source applications that I’d like us to evaluate when I return to Lincoln.

Yesterday’s OpenSim – A pre Second Life taster workshop demonstrated the potential of having our own OpenSim virtual environment, either as a way to orientate new users to Second Life or actually develop a Virtual World, confined to the university network.

Today’s Hood 2.0, it’s a Web 2 world out there, introduced Laconica to me. This is an open source microblogging service that would allow us to effectively reproduce a Twitter-like service, but within the confines of the university. I think for us, this has an advantage over Twitter because of the privacy issues surrounding the use of public microblogging services. It also has the ability to hook into Twitter (and soon, Facebook), if desired.

During the F-ALT08 Edublogger session this evening, I met David, who works on identity systems at Eduserv. We talked about OpenID and how it can easily be set up to serve as an identity provider for one person or an entire organisation. I’ve taken his advice and now run an OpenID server on my personal website (it took less than 30 minutes to install and test). I’ve also been looking at OpenID plugins for WordPress and indeed the Learning Lab blogs could act as OpenID providers for anyone with a blog. I need to speak to ICT Services about the issues surrounding this on an institutional scale, but for me personally, I’m really impressed with how simple the process was to regain more control over my own identity online.

Finally, on a different note, the Keynote this morning was by Hans Rosling of Gapminder. He gave a very similar presentation to the one on TED, which I encourage you to watch for beautiful visualisations of statistical data relating to social, economic and environmental development.

The Student as Producer

We were recently unsuccessful in an application to JISC for a Learning and Teaching Innovation Grant. Nevertheless, the project is one that we’re keen on pursuing in some shape or form, so I thought I’d post the details here and invite comment.


EPrints Session and OR08 Reflections

Back in the office, following a week away at the Open Repositories conference.

The last couple of days were spent in EPrints sessions, as that is the repository software we use here at Lincoln. I found the first session most interesting as the new features in EPrints 3.1 were discussed. The linked page explains in detail the changes in v3.1, but in summary they provide much more control for repository managers through a web interface, rather than editing config files directly. Les’ slides give a nice overview.

The following session on EPrints and the RAE generally reflected the experience we’ve had using EPrints 2 for the RAE last year.

A session on repository analytics gave a very useful overview of using Google Analytics, AWStats and IRStats to measure the various uses of an EPrints repository. IRStats in particular, which has been developed at Southampton for EPrints, looks promising. I look forward to installing it.

The final sessions were mainly aimed at developers with a knowledge of Perl. I found the session on how to write plugins for EPrints 3 clear and interesting, but not especially useful as I don’t understand Perl. Still, it was obvious, even to me, that with a basic knowledge of programming, plugins could be written quite easily. I think it’s important for repository managers to immerse themselves in the technicalities of repository development even if they don’t understand much of the detail. Just by sharing ideas and questions with developers, you get a better understanding of what is involved in rolling out new features and a sense of what can be achieved within given resources.

On the whole, the conference leaned towards the technical rather than the strategic and managerial aspects of institutional repositories. There were a lot of developers present and the number of technical projects discussed seemed high. Personally, I appreciated this and came away with a good sense of where the development of repositories is going. It would have been good to have had an event which explicitly aimed at bringing both developers and repository staff together.

Finally, I do wonder whether the open access repository community would benefit from engaging with developments in Enterprise Content Management, as there is a great deal of overlap, having to face similar issues around workflow, IPR and technical standards. Perhaps there are universities evaluating the open source Alfresco ECMS as a repository platform. If so, I’d like to hear about them.

Next year, the conference is in Atlanta, USA.

Session 4: National & International Perspectives

Arjan Hogenaar & Wilko Steinhoff, from KNAW, gave a presentation on AID, a Dutch Academic Information Domain. I’ll be honest and admit I didn’t pay much attention to this as I was writing up my blog notes for Session 3. Follow the hyperlinks for more information.

I was able to concentrate on the next two presentations which were both interesting and relevant to our work at Lincoln. The first was by Chris Awre, from the University of Hull, who is working on the EThOS project, a joint project between several HE institutions and the BL. It’s a project to provide a central repository service for e-theses produced in the UK. The idea is that the BL will harvest e-thesis specific UK ETD metadata provided by University repositories to create a single point of access to this type of academic output. Interestingly, the business model for this is a subscription service, whereby universities are expected to pay for the harvesting of metadata and digitisation of hard copy theses when they are requested. The content is Open Access (search, download), financially supported by a paid-for harvesting and digitisation service. It’s always interesting to see how people are creating new business models based on freely giving a product away. I hope it’s a success.

The third presentation was by Vanessa Proudman, from Tilburg University and the DRIVER Project. This was excellent, not least because of the rare clarity of presentation but also because the research findings are directly relevant and useful to us at Lincoln as we embark on establishing a repository service in the University. Vanessa looked at the challenges we face in populating our repositories and suggested key methods of increasing the number of deposits, noting that even with a Mandate, the deposit rate is only 40-60%. This work is published as part of a new book (chapter 3), which, naturally, can be downloaded here. Upon return to work, I intend to look at this in detail and begin drafting a plan for the next phase of our repository project, which is to establish an Open Access Mandate at the University and begin the important advocacy work within the Faculties.

Session 1: Web 2.0

Ian Mulvaney, from Nature Publishing, gave a presentation on Connotea. He discussed how their earlier ‘Tagging Tool’ EPrints plugin required repository users to register and sign in to Connotea in order to use the service from participating repositories. This, they found, created a barrier to entry which he thinks the use of OpenID and OAuth may overcome.

Richard Davis, from the University of London Computer Centre, gave a presentation on SNEEP, the JISC project to develop Web 2.0 plugins for EPrints. They are developing Comments, Bookmarks and Tags (CBT) plugins, which we’re actually going to be using in one form or another in our own repository at Lincoln. He raised the question of whether or not we really need this functionality in our repositories, and I’d argue that the functionality should be there, or else they remain read-only alternatives to publishing. With a ‘user space’ for commenting, bookmarking and tagging, an informal method of peer-review is introduced that could mature into something very valuable.

Daniel Smith, from The University of Southampton, presented Rich Tags, a web application for cross-browsing repositories. It uses the mSpace faceted browser for exploration of multiple repositories in an interface similar to iTunes. It’s a nice interface, a bit heavy on resources when I loaded it in my browser, but provides a more enjoyable interface than the default EPrints UI, with the addition of searching more than one repository.