Categories
Developer tools

Powerhouse releases a Python HTML Sanitiser for developers to use (BSD license)

As you’ve heard, we’ve been working on a whole lot of new projects. And with new projects comes new code. I can’t say a lot more about these projects right now, but we’ve been using Python and the Django framework to develop them. So here’s the first of the spinoff products that we’re putting out under a BSD license for everyone to benefit from.

Over to Dan MacKinlay, one of our Python gurus, to tell you all about the HTML Sanitiser and why it matters.

“So the idea with the Python HTML Sanitizer is that we are consuming data from a wide variety of client websites, and we need to get their HTML data in a form that’s useful to us. This means –

1) standards-compliant XHTML
2) … bereft of formatting quirks which break our site …
3) … and free from exploits for cross-site scripting and other browser bugs that can compromise user security.

Normally, you can sidestep the HTML sanitization process by writing your own content, or using a special markup language (say, Markdown) – but when you are consuming HTML from clients’ websites this is not an option. They simply aren’t written in Markdown.

Stripping ALL HTML tags out would be another common option. That’s not reasonable for us, however, since we are supposed to be extracting rich information from our clients’ sites, and some of it is really useful and semantic – links, citations and definitions – things we don’t want to filter out, or punish them for using.

Rather, we’d probably like to reward them by keeping that markup and indexing on it.

By the same token, many clients use old markup (think HTML 3), invalid or badly-formed markup, or simply use types of markup which are inconvenient for us to display (br – or even td tags – instead of p). Moreover, when a site is old enough to have such ancient markup in it, it’s reasonable to think that other types of maintenance may have lapsed too — such as security maintenance.

We can’t blithely assume that every client site is free from malicious Javascript or whatever – that’s a one way ticket to weakest-link security hell. Already we’ve noticed that two partner sites have been hacked in the course of the project so far (these days we’d assume that a fair proportion of traffic to most dynamic websites is malicious).

Solution – the HTML Sanitiser.

This is a flexible, adaptable HTML sanitising module (both parsing and cleaning) that can be tweaked to let through rich markup from good client sites, and salvage what it can from bad client sites. This is the approach taken by things like PHP5’s HTML Purifier and Ruby’s HTML:Sanitizer, but since our scraping code is in Python, we’ve had to build our own, leveraging the power of the awesome BeautifulSoup HTML parser.
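To give a flavour of the whitelist approach described above, here is a minimal sketch of tag and attribute whitelisting with BeautifulSoup – written against the current bs4 API rather than the BeautifulSoup release the module was built on, and with purely illustrative tag and attribute lists rather than the module’s own configuration.

```python
# Illustrative whitelist sanitising with bs4; the allowed tags/attributes
# below are examples only, not the Powerhouse module's defaults.
from bs4 import BeautifulSoup

ALLOWED_TAGS = {"a", "p", "em", "strong", "blockquote", "cite", "dfn", "ul", "ol", "li"}
ALLOWED_ATTRS = {"a": {"href", "title"}}

def sanitise(html):
    soup = BeautifulSoup(html, "html.parser")
    # Drop dangerous elements and everything inside them
    for tag in soup.find_all(["script", "style", "iframe", "object", "embed"]):
        tag.decompose()
    for tag in soup.find_all(True):
        if tag.name not in ALLOWED_TAGS:
            tag.unwrap()                      # keep the text, lose the tag
        else:
            # Strip any attribute that isn't explicitly whitelisted
            allowed = ALLOWED_ATTRS.get(tag.name, set())
            tag.attrs = {k: v for k, v in tag.attrs.items() if k in allowed}
    return str(soup)

print(sanitise('<p onclick="evil()">Hello <script>alert(1)</script><a href="/x" onmouseover="bad()">link</a></p>'))
# -> <p>Hello <a href="/x">link</a></p>
```

The real module is more forgiving (salvaging what it can from badly-formed markup) and configurable per client site, but the basic parse-then-clean loop is the same idea.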

Since a lot of people need to solve similar problems to this, and many eyes make for more secure code, we’ve open-sourced it.

Go and download it, make changes and update the codebase.”

Categories
Developer tools Tools User experience

Usability and IA testing tools – OptimalSort, ClickDensity, Silverback

As the team has been working on a large array of new projects and sites of late, we’ve been exploring some of the newer tools that have emerged for usability testing and for ensuring good information architectures. Here’s some of what we’ve been exploring and using –

We’ve started using OptimalSort for site architecture – especially the naming and content of menus. OptimalSort is a lovely Australian-made web product that offers an online ‘card sorting’ exercise. In our case we’ve been using it as a way of ensuring we get a good diversity of opinions on how different types of content (‘cards’) should be stacked together (in groups) under titles (menus). OptimalSort lets you invite people to come and order your content in ways that make sense to them and then presents you with an overall table of results, from which you can deduce the best possible solution.
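For a sense of what can be deduced from that table of results, here is a toy sketch (not OptimalSort’s actual method) of reducing a set of individual card sorts to pair co-occurrence counts – the pairs that participants group together most often are good candidates for sharing a menu. The card names and sorts are made up.

```python
# Count how often each pair of cards ends up in the same group across participants.
from collections import Counter
from itertools import combinations

def co_occurrence(sorts):
    """sorts: one entry per participant, each a list of groups (lists of card names)."""
    counts = Counter()
    for sort in sorts:
        for group in sort:
            for a, b in combinations(sorted(group), 2):
                counts[(a, b)] += 1
    return counts

participant_sorts = [
    [["Opening hours", "Getting here"], ["Collection search", "Image licensing"]],
    [["Opening hours", "Getting here", "Venue hire"], ["Collection search"]],
]
for pair, n in co_occurrence(participant_sorts).most_common():
    print(pair, n)   # pairs grouped together most often suggest a shared menu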

We’re also back using Clickdensity, which is great for tracking down user interface problems on live sites. We used this when it was first released by Box UK and it revealed some holes we quickly fixed on a number of our sites. Whilst it still has issues working properly in Safari and, surprisingly, sometimes in Firefox, Clickdensity lets you generate heatmaps of your visitors’ clicks and mouse hovers. Armed with this you can quickly discover whether your site visitors are trying to click on images thinking that they are buttons or links, or choosing certain navigation items over others.

Silverback is another UK product, this time from Clearleft. We’re gearing up to use this with some focus groups to record their interactions (and facial expressions!) as they use some of our new projects and products. Silverback is Mac only (which suits us fine) and records a user’s interactions with your application whilst using the Mac’s built-in camera and microphone to record the participant (hopefully not swearing, cursing and looking frustrated). This should be perfectly geared for small focus groups with targeted testing.

Categories
Collection databases Developer tools Metadata

OPAC2.0 – OpenCalais meets our museum collection / auto-tagging and semantic parsing of collection data

Today we went live with another one of the new experimental features of our collection database – auto-generation of tags based on semantic parsing.

Throughout the Museum’s collection database you will now find, in the right hand column of the more recently acquired objects (see a quick sample list), a new cluster of content titled “Auto-generated tags”.

We have been experimenting with Reuters’ OpenCalais web service since it launched in January. Now we have made a basic implementation of it applied to records in our collection database, initially as a way of generating extra structured metadata for our objects. We can extract proper names, places (by continent, country, region, state and city), company names, technologies and specialist terms, from object records all without requiring cataloguers to catalogue in this way. Having this data extracted makes it much easier for us to connect objects by manufacturers, people, and places within our own collection as well as to external resources.

Here’s a brief description of what OpenCalais is, taken from their FAQ –

From a user perspective it’s pretty simple: You hand the web service unstructured text (like news articles, blog postings, your term paper, etc) and it returns semantic metadata in RDF format. What’s happening in the background is a little more complicated.

Using natural language processing and machine learning techniques, the Calais web service looks inside your text and locates the entities (people, places, products, etc), facts (John Doe works for Acme Corp) and events (Jane Doe was appointed as a Board member of Acme Corp) in the text. Calais then processes the entities, facts and events extracted from the text and returns them to the caller in RDF format.
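For a rough idea of what that looks like from the caller’s side, here is a hedged Python sketch of handing an object record’s free text to such a web service and pulling entity names back out as candidate auto-tags. The endpoint, header names and response fields below are placeholders rather than the actual OpenCalais API – check their documentation for the real interface.

```python
import requests

CALAIS_ENDPOINT = "https://api.example.com/calais/enrich"   # placeholder, not the real URL
API_KEY = "your-licence-key-here"                           # placeholder

def auto_tag(record_text):
    """Send an object record's free text off for entity extraction."""
    response = requests.post(
        CALAIS_ENDPOINT,
        data=record_text.encode("utf-8"),
        headers={
            "x-api-key": API_KEY,                 # assumed header name
            "Content-Type": "text/raw",
            "outputFormat": "application/json",   # assumed; the service also returns RDF
        },
        timeout=30,
    )
    response.raise_for_status()
    payload = response.json()
    # Keep anything the service flagged as an entity (person, place, company,
    # technology term, etc.) as a candidate auto-generated tag.
    return [
        item["name"]
        for item in payload.values()
        if isinstance(item, dict) and item.get("_typeGroup") == "entities" and "name" in item
    ]

# tags = auto_tag("Amateur radio equipment including an oscilloscope, Sydney, 1970s.")
# (requires a real endpoint and licence key to run)
```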

Whilst we store the RDF triples and unique hash, we are not making use of these beyond display right now. There is a fair bit of ‘cleaning up’ we have to do first, and we’d like to enlist your help so read on.

Obviously the type of content that we are asking OpenCalais to parse is complex. Whilst it is ideally suited to the more technical objects in our collection as well as our many examples of product design, it struggles with differentiating between content on some object records.

Here is a good example from a recent acquisition of amateur radio equipment used in the 1970s and 1980s.

The OpenCalais tags generated are as follows –

The bad:

The obvious errors which need deleting are the classification of “Ray Oscilloscope” as a person (although that might be a good name for my next avatar!); “Amateur Microprocessor Teleprinter Over Radio” as a company; the rather sinister “Terminal Unit” as an organisation; and the meaningless “metal” as an industry term.

We have included a simple ‘X’ to allow users to delete the ones that are obviously incorrect and will be tracking its use.

These errors, and others like them, reveal OpenCalais’ history as ClearForest in the business world. The rules it applies when parsing text, as well as the entities that it is ‘aware’ of, are rooted in the language of enterprise, finance and commerce.

The good:

On the other hand, by making all this new ‘auto-generated’ tag data available, users can now traverse our collection in new ways, discovering connections between objects that previously remained hidden deep in blocks of text.

Currently, clicking any tag will return a search result for that term in the rest of our collection. In a few hours of demonstrations to registrars and cataloguers today, many new connections between objects were discovered, and people we didn’t expect to be mentioned in our collection documentation were revealed.

Help us:

Have a play with the auto-tags and see what you can find. Feel free to delete incorrect auto-tags.

We will be improving their operation over the coming weeks, but hope that this is a useful demonstration of some of the potential lying dormant in rich collection records, and a real-world demonstration of what the ‘semantic web’ might begin to mean for museums. It is important to remember that there is no way this structured data could have been generated manually – the volume of legacy data is too large, and the burden on curatorial and cataloguing staff would be too great.

Categories
Developer tools Web metrics

Google Teleportation / Google’s ‘search within search’

Google’s ‘search within search’ – or, as they call it, ‘teleporting’ – has hit the Powerhouse Museum.

I’m not sure whether this is a compliment or not, but as the New York Times reports, this is a very interesting development which raises many issues for content-rich sites with vested interests in their own internal search.

As you can see in the screenshot below, a search for ‘powerhouse museum‘ now shows not only the main home page link and the ‘selected’ 8 results (automatically picked by Google – probably a mix of popular pages and ‘relevant’ pages by title), but also a secondary search box.

Searching in this second box returns a site-specific search result, but still on Google, and depending upon the search term, filled with term-sensitive search advertising. Here’s an example of the effect of entering a term like ‘travel‘ into the secondary search box.

Worse still, try this one – ‘venue hire’.

It is going to be interesting to watch the effect of this on user behaviour. For Google it keeps users on their search site for a longer period of time (and tempts them with advertising), and, if I look at this with a positive spin, it also hopefully delivers users to exactly what they want on our site by the time they get to it.

Either way though, this is another nail in the coffin of traditional web metrics and measurement. Where previously visitors wanting to find your organisation by a brand-name search would start their visit to your site at the home page (after being delivered there by Google), now they are more likely to exhibit behaviour similar to content-seekers and start their visit deep in your site. This has significant implications for site design and navigation if users do actually start using the ‘search within search’.

Have any other museums found their site is now affected this way? (I notice that the Australian Broadcasting Corporation (ABC) is another Australian site that is.)

Categories
Developer tools Web 2.0

OpenSocial, social networking and museums

Google’s OpenSocial has finally gone live.

What it provides for the museum sector is a much easier way to seed content to social networks, where apparently our younger online audiences like to spend a lot of their time. OpenSocial, as opposed to a Facebook application, promises to work across multiple social networking services – meaning the development effort expended results in an application that can theoretically be deployed on MySpace, Ning, Bebo, LinkedIn and all the other OpenSocial partners. It remains to be seen just how portable these applications are.

The benefit for developers, especially those in the museum world, is that the risk of ‘choosing the right’ social networking service is greatly reduced. Museums have been experimenting a lot with seeding content to social networks – mainly as a marketing and promotional tool; and to a lesser extent building professional communities either within existing social networks (the multitude of Facebook groups especially) or discrete services like Exhibit Files.

Where applications from museums sit in the mix is more complicated, especially when development needs to be outsourced or requires significant investment. Our sector is often slow to respond, and by the time we do, the audiences we were targeting have sometimes moved on.

Looking at the professional networks for museum staff on Facebook, they are currently thriving because many museum staff have private accounts – and usage from work is generally not blocked. However, once non-museum personal friends move on to the next site, I do wonder how long those professional networks on Facebook will remain sustainable.

As Fred Stutzman points out,

Ego-centric social network sites all suffer from the “what’s next” problem. You log in, you find your friends, you connect, and then…what? Social networks solve this problem by being situationally relevant. On a college campus, where student real-world social networks are in unprecedented flux, Facebook is a social utility; the sheer amount of social information a student needs to manage as they mature their social networks makes Facebook invaluable. For the consultant or job seeker, LinkedIn maintains situational relevance by allowing one to activate weak ties in periods of need.

What happens when a social network is no longer situationally relevant? Use drops off. Social networks can combat this problem on a number of levels. Myspace dumped tons of exclusive media content into the site, so users would keep coming back once they negotiated their social networks. For non-SR users, Facebook developed the application platform, betting that third party developers could make tools that would answer the varied needs of their userbase. Unfortunately, the gimmicky nature of the platform tools has undercut this approach somewhat, but this could very well change over time.

Try as they might, once ego-centric social networks lose situational relevance, it’s pretty much impossible for them to retain their status. Myspace users have exhausted the Myspace experience; they’ve done all they can do, they’ve found all the people they can find, so now it’s time to find a new context. We naturally migrate – we don’t hang out in the same bar or restaurant forever, so why would we assume behavior would be any different online?

Categories
Developer tools Imaging Web 2.0

Comparing a site across browsers

One of the biggest problems when designing and developing a new website or rolling out a new look and feel is cross-browser compatibility. Usually the solution has been to have a series of machines, real or virtual, with different versions of the various different browsers out there installed, and then go through each one laboriously.

Fortunately there is now Browsershots, a web-based browser farm you can use for browser checking. You simply submit a URL to Browsershots, tick the various flavours and versions of browser you want to check against, and then wait . . .

Your request is queued and, once processed, you can view and download screenshots of your site as it looks in each of the browsers selected. Because it runs on actual machines these screenshots aren’t kludges – they are the real thing.
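If you wanted to script that submit-and-wait loop (Browsershots has historically offered a programmatic API), a sketch might look like the following – note that the endpoint, parameters and response fields here are purely illustrative placeholders, not Browsershots’ real interface.

```python
import time
import requests

API = "https://browsershots.example.org/api"            # placeholder base URL, not the real service
BROWSERS = ["firefox-2.0", "msie-6.0", "safari-3.0"]    # illustrative identifiers

def screenshots_for(url):
    """Queue a screenshot request, then poll until the rendered images are ready."""
    job = requests.post(
        f"{API}/submit",
        data={"url": url, "browsers": ",".join(BROWSERS)},
        timeout=30,
    ).json()
    while True:
        status = requests.get(f"{API}/status/{job['id']}", timeout=30).json()
        if status["state"] == "done":
            return status["screenshots"]   # URLs of the rendered screenshots
        time.sleep(60)                     # requests sit in a queue, so be patient

# shots = screenshots_for("http://www.powerhousemuseum.com")  # needs a real API to run
```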

Categories
Collection databases Developer tools Folksonomies Web 2.0

OPAC2.0 – Go bulk taggers!

Thank you to everyone who has been tagging the collection with our bulk tagging mini-application.

Since announcing it 2 weeks ago we’ve had 515 new tags added to previously untagged objects. That’s a lot.

If you are one of the many who have added some tags – thank you. If you haven’t tried it yet, then what are you waiting for?

Thank you also to everyone who emailed in or left suggestions in the comments.

Categories
Developer tools Imaging Web 2.0

Visualising a metasearch with SearchCrystal

SearchCrystal is a very nifty search visualisation tool. Above are the results of an image search for ‘Sydney’ across multiple engines – you can see clearly in the visualisation where results cross over and where there is similarity. I really like the different types of search that can be done in this way – web searches, image searches, video, news, blogs, tags . . . Below is a web search for ‘Powerhouse Museum’.
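The underlying idea – spotting where several engines agree – is simple enough to sketch. This toy Python example (nothing to do with SearchCrystal’s actual code, and with made-up engine names and results) just collects which results turn up in more than one engine’s list.

```python
def overlaps(results_by_engine):
    """Map each result to the engines (and ranks) that returned it."""
    seen = {}
    for engine, results in results_by_engine.items():
        for rank, url in enumerate(results, start=1):
            seen.setdefault(url, []).append((engine, rank))
    # Results returned by two or more engines sit at the 'crossover' of the crystal
    return {url: hits for url, hits in seen.items() if len(hits) > 1}

sample = {
    "engine_a": ["example.org/sydney-opera-house", "example.org/harbour-bridge"],
    "engine_b": ["example.org/harbour-bridge", "example.org/bondi-beach"],
}
print(overlaps(sample))   # {'example.org/harbour-bridge': [('engine_a', 2), ('engine_b', 1)]}
```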

Categories
Collection databases Copyright/OCL Developer tools Interactive Media Metadata Social networking UKMW07 Web 2.0

UK Museums on the Web 2007 full report (Leicester)

Museums on the Web UK 2007 was held at the slightly rainy and chilly summer venue of the University of Leicester. Organised by the 24 Hour Museum and Dr Ross Parry with the Museums Computer Group, the event was attended by about 100 museum web techies, content creators and policy makers.

As a one-day conference (preceded by a day-long ‘museum mashup’ workshop) it was very affordable, fun and entertaining (yes, in the lobby they had a demo of one of those new Philips 3D televisions . . . disconcerting and very strange).

Here’s an overview of the day’s proceedings (warning: long . . . you may wish to print this or save to your new iPhone)

The conference opened with Michael Twidale and myself presenting the two conference keynote addresses. I presented a rather ‘sugar-rush, no-holds barred view from the colonies’ of why museums should be thinking about their social tagging strategies. (I’ll probably post my slides a little later). I had been quite stressed about the presentation coming off very little sleep and a long flight from Ottawa to London the night before. But I’ve been talking about these and related topics almost non-stop for the past two weeks so it was actually a good feeling to get it done right at the beginning.

After my presentation, Michael Twidale from the University of Illinois reprised the joint presentation about museums making tentative steps into Second Life that his colleague and co-author Richard Urban had presented at MW07 in San Francisco. Michael (like Richard before him) certainly piqued the interest of some in the room who, I had the feeling, had barely thought about Second Life before – although I notice that the extremely minimally staffed Design Museum in London has just been running an architecture event and competition in Second Life (see Stephen Doesinger’s ‘Bastard Spaces’).

Mike Ellis from the Science Museum followed the tea break with a presentation that looked at the outcomes of letting a small group of museum web nerds loose for a day without the pressures of a corporate inbox. Using a variety of public feeds the outcomes of such a short period of open-ended collaborative R&D were quite amazing. In many ways Mike’s presentation ended up challenging the audience to think about new ways of injecting innovation and R&D into their museum’s web practices. Amongst the mashups were a quick implementation of the MIT Simile Timeline for an existing project at the Cambridge University Museum tracking dates; a GoogleMaps mashup of all known museum locations and websites in the UK (something that revealed that current RSS feeds of this data are missing the crucial UK postcode information); a date cleaning API to allow cross-organisational date comparison built by Dan Z from Box UK; and an exciting mashup using Spinvox‘s voice to text service to allow museum visitors to call a phone number and be SMSed back information about locations, services or objects.

These were all really exciting prototypes that had come out of a very small amount of collaborative R&D time – something every museum web team should have. Apart from this, a couple of problems facing museum mashups were revealed – stability issues and reliance on other people’s data – but, as Mike pointed out, how does this really compare to the actual stability of your existing services?

Nick Poole from the MDA presented Naomi Korn’s slides on rights issues (moral, ethical and Copyright) involving museums implementing Web 2.0 applications. Nick’s presentation was excellent and had two main points to make. The first was that the museum sector has been moving towards increased audience focus and interaction in real-world policy for at least the past decade, so why should the web be any different? Further, the recent political climate in which museums in the UK exist has focussed on the cultural sector taking a lead in enhancing social cohesion and the sharing of cultural capital. Secondly, Nick emphasised that as museums “we have a social responsibility to the population to exploit any and all methodologies which makes it easier for them to engage with and learn from their (cultural) property”, concluding that despite the potential legal issues, Web 2.0 offers a “set of mechanisms by which we can enhance accountability and effectiveness in a public service industry”. Excellent stuff.

Alex Whitfield from the British Library then presented an interesting look at an admittedly extreme example of the tensions involved in implementing Web 2.0 technologies around certain exhibition content. Alex demonstrated some of the website for the Sacred exhibition, which shows some of the key religious manuscripts of the faiths Christianity, Islam and Judaism. The online exhibition shows 66 of the 152 texts and includes a GoogleMaps interface, expert blogs, podcasts and some nice Flash interactives (yes, I did ask why Flash – apparently because it was a technology choice encouraged by the IT team). Alex then proceeded to look at a few examples of where tagging and digital reproduction can cause community offence, or at the very least controversy, before closing with a reference to Susan Sontag’s ‘On Photography’, where Sontag claims that there is a reduction of ‘the subject’ (see an interview with Sontag where she explains this concept). Alex’s example was certainly provocative and reminded me, again, that the static web and the participatory web both carry their own particular set of implicit politics (individualistic, pro-globalisation and pro-democracy, although to differing depths of democracy).

After a light lunch Frances Lloyd-Baynes from the V&A gave an overview of some of the work they have been doing and some of the challenges ahead. She reported that the V&A has 28% of their collection online but that the figure reduces to 3% once bibliographic content is excluded. Of course they have been working on other ‘collections’ – those held by the community – for quite a while as evidenced by their Every Object Tells A Story and the new Families Online project.

She also mentioned the influence of the MDA’s ‘Revisiting Collections‘ methodology which focuses on making a concerted effort to engage audiences and bring user/public experiences to museum collections content. This and other concepts have become a key part of the V&A’s strategic policy.

In terms of user-generated content she highlighted problems that many of us are starting to face. What UGC gets ‘kept’? For how long, and how much? What should be brought into the collection record? Should it be acknowledged? How? How should museums respond to, mediate and transform content? Or should it remain unmediated? And how do we ensure that there is clarity and a distinction between the voice of the museum and the voice of the user?

Mia Ridge – a fellow Australian, now an ex-pat working as a database developer at the Museum of London – gave a practical overview of how Web 2.0 can be implemented in museums. She covered topics like participation inequality, RSS and mashups, and the need to be transparent with acceptable use and moderation policies. It was a very practical set of recommendations.

Paul Shabajee from HP Labs then gave a very cerebral presentation on the design of the “digital content exchange prototype” (DCX) for the Singapore education sector. The DCX allows for the combination of multiple data and metadata sources spread across multiple locations, as well as faceted browsing and searching for teachers and students, allowing for dynamic filtering by type, curriculum subject area, format, education level, availability, text search, etc. It was a great example of the potential of the Semantic Web. He then went on to explain the CEMS thesaurus model of the curriculum and the taxonomies of the collection, and how actual users wanted to do things in more complex ways – such as finding a topic for a class, then finding real-world events and mapping them against topics. And because everything had been semantically connected, building new views in line with user needs did not mean massive re-coding. More information on the project can be gleaned from Shabajee’s publications.

Then, after some very tasty micro-tarts (chocolate and raspberry, of which I must have had five or six . . ), we moved on to the closing session from Brian Kelly of UKOLN. Brian is a great presenter, although his slides always seem so lo-fi because of his typographic choices. Brian managed to make web accessibility for Web 2.0 a compelling topic, and his passion for reforming the way we generally approach ‘accessibility’ is infectious.

Brian is a firm believer that ‘accessibility is not about control, rules, universal solutions, and an IT problem’. Instead he asks what accessibility really means for your users, and, rather cheekily, ‘how can you make surrealist art accessible?’ Accessibility, for Brian, is about empowering people, contextual solutions, widening participation and blended solutions – all the things that Nick Poole and Frances Lloyd-Baynes (and the rest of us) were pushing for earlier in the day.

Brian has come up with a model for approaching accessibility that uses as its metaphor the tangram puzzle (for which there is no single ‘correct’ solution) rather than a jigsaw. He advised that we should focus on content accessibility, because a mechanistic approach doesn’t work. How do you make a 3D-model e-learning resource accessible? It is just not possible; instead we should be focussing on making the learning objectives/outcomes accessible. If we see things in this way then there is no technical barrier to doing museum projects in, say, Second Life, despite the objection that it isn’t ‘accessible’ to some disabled users – rather, we should focus on providing alternatives as well that achieve or demonstrate similar outcomes for other users. Michael Twidale also offered the example of the paralysed Second Life user who can, in his virtual world, fly when in the real world he cannot walk.

Brian closed by advising that at a policy level we should be saying things like “museum services will seek to engage its audiences, attract new and diverse audiences. The museum will take reasonable steps to maximise access to its services”. By applying principles of accessible access across the whole portfolio of what the museum offers (real and virtual) we can still implement experimental services, rather than using accessibility as a preventative tool. After all, as he points out, the BBC has a portfolio of services for impaired users rather than ensuring access on every single service.

Categories
Developer tools Interactive Media Social networking Web 2.0

Museums on the Web UK 2007 – Friday June 22 – register now

If you happen to be one of our UK or European readers then you may be interested in Museums on the Web UK 2007 which happens on Friday June 22. It is organised by the Museum Computer Group, 24hr Museum and the University of Leicester.

The Web is changing – faster, smarter, more personal, more social. The software that drives it and the usage that shapes it are evolving at a rapid pace. Is the museum sector responding to this evolution? And as visible and trusted providers of rich and unique content might museums have, in fact, an opportunity to influence the future Web?

Is it time to become more ‘Web adept’?

From Web ethics, to user-generated content, and from the implications and possibilities of mashed-up content, to the need for new values and holistic approaches to accessible design…this year’s conference will explore the many ways the Web is being transformed around us, and how museums can respond to – and perhaps lead – this change.

UKMW will, as in previous years, be an accessible and affordable event welcoming around 100 delegates. It will aim to bring together a programme of high-quality speakers with a national and international perspective, from inside and outside the sector, offering creative, leading-edge thinking relevant to anyone working with museums and the Web today.

I am giving one of the keynotes on social tagging and the future of collections online. The other keynote is Michael Twidale speaking about Second Life. Other speakers include Mike Ellis, Naomi Korn, Jon Pratty, Jeremy Keith (Clearleft), Paul Shabajee (HP Labs) and Brian Kelly. It is a low cost single day event and should be excellent.

Register online over at the UK Museums Computer Group.

I hope to see you there.