Open Calais + site-wide tags = semantic site architecture

Preamble about people

Over the last month, we’ve I’ve started to grow an embryonic social web publishing platform that can be many things but fundamentally offers a personalised and collaborative environment for research, teaching and learning. (Where? You’re looking at it!). There are a few active blogs (currently fewer than on the pilot Learning Lab blogs), nearly 70 users and the word is starting to get out at a pace that I can manage. So, now it’s time to look to the future…

By running BuddyPress, the connections between people are pretty much taken care of. Sign in to http://blogs.lincoln.ac.uk with a Lincoln username and password and you’ve joined a community that, as it grows, will increasingly and effortlessly connect people through the information they choose to add to their profile. Staff and students can click on a link and find other people who have similarly tagged their profile.

Notice the comma seprated hyper-linked data — Notice the comma-separated hyper-linked data

What is of equal interest to me, and potentially very useful to the university community, is how we link the content that is being generated by staff and students and make those links accessible. It is not difficult to appreciate what the potential is when you have a revolving community of 10,000 people who, over time, document their work, their research, teaching and learning using cutting edge web publishing tools, but I’m writing this post to try and understand and sketch out how I might evolve what I have begun.

Put simply, WordPress Multi-User (WPMU) allows one person (me) to provide and manage multiple web sites which other people (staff and students) take ownership of. Typically, every action, every new user and every new page and post on every site, is recorded and held in a shared database(s). Although at this low level, the data is relational, on the surface, when you look at one of the sites, they pretty much stand alone and so they should. We’re not talking about a single website with lots of users, we’re talking about lots of websites with lots of users. They might be working collaboratively with others, but they’re working as individuals or in distinct groups that benefit from a distinct online identity. BuddyPress helps bring things together by aggregating people’s actions (i.e. posting blog updates, making friends, joining groups, posting messages) but the visibility of those connections is transient. Social networks display our actions along a timeline and the connections between people are, for the most part, buried until the next time person A interacts with person Y.

Enough about connecting people.

Site-wide content aggregation

Site content is a mixture of text, multimedia and metadata. The last thing I’ll do when completing this blog post is to categorise and tag it. Each time I write, I publish text, (sometimes images) and metadata which summarises and categorises the full text. Why am I telling you this? You know it already. What you may not know is that each post created on our university WPMU installation, by any person, providing their blog is public, is aggregated into a single site and re-published a second time. So this post exists here on this site and there, on the Community Posts site. Notice how the Community Posts version links back to the original post. We’re not creating a whole new resource, we’re creating a powerful linked resource that allows others to search, filter, browse and discover content held across multiple sites. With only a few sites up and running here at the moment, the opportunity to discover varied content is limited, but over time that will change. Look at wordpress.com, where there are 5 million sites:

wordpress.com site wide tags — Browse by user-generated metadata

On the university blogs, this is made possible through the use of the site-wide-tags plugin, which was developed by @donncha, the same person that develops WPMU and the wordpress.com site. By using this plugin, a WPMU installation can share similar functionality to what you see on wordpress.com. I say ‘similar’ because, as I’ll mention later, designing how people discover content is key to all of this and something I, or we as a community, would benefit from thinking about and acting on collectively.

On the Community Posts site, you can search the full-text of every post, filter resources by category and tag, and subscribe to feeds from any combination of tag or category. Any search can be turned into a feed by appending ‘&feed=rss’ to the end of the resulting URL.

i.e. http://tags.blogs.lincoln.ac.uk/?s=gaming&feed=rss

To create a feed from a tag or category, just click on a tag or category and append ‘/feed’ to the end of the URL.

i.e. http://tags.blogs.lincoln.ac.uk/tag/games/

You can combine tags with ‘+’, too:

http://tags.blogs.lincoln.ac.uk/tag/games+development/

You can also specify the type of feed you want by appending:

/feed/rss/
/feed/rss2/
/feed/rdf/
/feed/atom/

Mixing categories and tags is currently broken by a bug but is due to be fixed in the next version of WordPress.

So it’s not difficult to imagine, over time, an active community of thousands of university web publishers, having their content aggregated into a site-wide resource that allows full text searching, browsing and filtering with a choice of feeds to syndicate that content elsewhere. See how it’s happening at the University of Mary Washington, where over 2400 sites have been created in under three years.

Semantic technology

Yesterday, I discovered OpenCalais. It’s a semantic technology that’s been around since January 2008, so you might be tired of hearing about it, but if not, ‘Welcome to Web 3.0!’

The Calais Web Service automatically creates rich semantic metadata for the content you submit – in well under a second. Using natural language processing, machine learning and other methods, Calais analyzes your document and finds the entities within it. But, Calais goes well beyond classic entity identification and returns the facts and events hidden within your text as well.

Nice. And it’s installed on this site. There are three Calais plugins available for WordPress. This one, allows writers to submit their blog posts to the OpenCalais web service API and fetch back a number of auto-generated tags based on the content of their post. The longer the post, the more tags are returned. Tags are returned in just seconds. Those tags can be added to the post in their entirety or used selectively (actually, you have to add them all and then remove those you don’t want to include – a minor irritation). This next plugin, allows you to automatically go through every post you’ve written and tags them using the Calais web service. It’s all or nothing, but following the auto-tagging of archive content, you can then go to the ‘tags’ menu and delete any tags you don’t want to use. I’ve done that to this site and to the Community Posts site. Calais looks for names, facts and events and the API allows for up to 40,000 transactions a day and up to four per second. It returns some predictable tags and a few odd ones, but on the whole is fast and works like magic.

The third plugin also allows blog authors to fetch tags for the post they are writing and, in addition, it also suggests Creative Commons licensed images based on a dynamic evaluation of the chosen or suggested tags.

Image suggestion is a nice idea, but tends to return some fairly generic images.

Having used OpenCalais to auto-tag the Community Posts site, a whole new and richer set of semantic metadata has been added with barely any effort. The challenge now is to figure out how to 1) automate this as a scheduled process, so that the Calais plugin looks for new content every hour, say, and tags whatever has been recently introduced (a cron job that calls the plugin and a modification to the plugin to look at the timestamp of the post and ignore anything older than when it was last run?); 2) present the semantic data in an accessible way and this mostly, I think, comes down to appropriate site design. The wordpress.com screenshots above show one way of doing it. A del.icio.us style approach is a more powerful and versatile model of tag filtering. Until then, it’s a matter of constructing filters, searches and feeds in the way I’ve outlined above.

So how might all of this semantically structured data be used? It seems to me that most of the advantages are proportional to the quantity of information available. For teaching and learning, it could be used by students and staff who want to find and re-use material that has been posted in the past for a specific course or subject area. Great for new students who want to measure the type and quality of work produced by students in previous years. In a similar way, it could be used by staff looking for posts by colleagues on subjects they might be teaching, and because searches and tags can be turned into feeds, past content could be aggregated into a new course site. A widely adopted, semantically tagged WPMU installation could also reveal trends in the type of work occurring at the university and, by tagging names of people, queries against references to Prof. X’s work could be made (I also wonder whether through the use of feeds, content from the institutional repository could be joined up with all of this, too – but it’s late in the day and I can’t think straight).

You’ll see from the image below that using Calais on the Community Posts site, resulted in a much richer variety of tags than would have appeared if we relied on user-generated tagging alone (136 posts now have 558 tags). Some people don’t even bother to tag their work… Shame on them! Notice too, that with the Firefox Operator plugin, you can take a tag on the site and use it to find related resources elsewhere. So if you’re looking at work tagged ‘client-applications’ on WPMU, you can conveniently hop over to delicious and find further web resources or, on a whim, look at what books on this subject are available on Amazon.

Operator provides a way to use tags on one site to discover related resources on another site — Use tags on one site to discover related resources on another site

Anyway, if you’re still reading, you might remember from the title of this post that my overriding interest in all of this is how it can be understood as and developed into a site-wide ‘architecture’. Again, I’m thinking how user-generated tags have determined the way delicious is designed for navigation and searching of resources. I need to learn more about how WordPress themes are constructed and consider how available functions can be best exploited and usefully presented on this type of site. If you have any ideas or want to work on a specific theme to get the most out of the site-wide-tags plugin, please do leave a comment or get in touch on Twitter @josswinn

Preamble about people

Site-wide content aggregation

Semantic technology

4 Replies to “Open Calais + site-wide tags = semantic site architecture”