Moving out in to the cloud – reverse proxying our website

For a fair while we’ve been thinking about how we can improve our web hosting. At the Powerhouse we host everything in-house and our IT team does a great job of keeping things up and running. However as traffic to our websites has grown exponentially along with an explosion in the volume of data we make available, scalability has become a huge issue.

So when I came back from Museums and the Web in April I dropped Rob Stein, Charles Moad and Edward Bachta’s paper on how the Indianapolis Museum of Art was using Amazon Web Services (AWS) to run Art Babble on Dan, our IT manager’s desk.

A few months ago a new staff member started in IT – Chris Bell. Chris had a background in running commercial web hosting services and his knowledge and skills in the area have been invaluable. In a few short months our hosting set up has been overhauled. With a move to virtualisation inside the Museum as a whole, Chris started working with one of our developers, Luke, thinking about how we might try AWS ourselves.

Today we started our trial of AWS beginning with the content in the Hedda Morrison microsite. Now when you visit that site all the image content, including the zoomable images, are served from AWS.

We’re keeping an eye on how that goes and then will switch over the entirety of our OPAC.

I asked Chris to explain how it works and what is going on – the solution he has implemented is elegantly simple.

Q: How have you changed our web-hosting arrangements so that we make use of Amazon Web Services?

We haven’t changed anything actually. The priorities in this case were to reduce load on our existing infrastructure and improve performance without re-inventing our current model. That’s why we decided on a system that would achieve our goals of outsourcing the hosting of a massive number of files (several million) without ever actually having to upload them to a third-party service. We went with Amazon Web Services (AWS) because it offers an exciting opportunity to deliver content from a growing number of geographical points that will suit our users. [Our web traffic over the last three months has been split 47% Oceania, 24% North America, 21% Europe]

Our current web servers deliver a massive volume and diversity of content. By identifying areas where we could out-source this content delivery to external servers we both reduce demand on our equipment – increasing performance – and reduce load on our connection.

The Museum does not currently have a connection intended for high-end hosting applications (despite the demand we receive), so moving content out of the network promises to not only deliver better performance for our website users but also for other applications within our corporate network.

Q: Reverse-proxy? Can you explain that for the non-technical? What problem does it solve?

We went with Squid, which is a cache server. Squid is basically a proxy server, usually used to cache inbound Internet traffic and spy on your employees or customers – but also optimise traffic-flow. For instance, if one user within your network accesses a web page from a popular web site, it’s retained for the next user so that it needn’t be downloaded again. That’s called caching – it saves traffic and improves performance.

Squid is a proven, open-source and robust platform, which in this case allows us to do this in reverse – a reverse-proxy. When users access specified content on our web site, if a copy already exists in the cache it is downloaded from Amazon Web Services instead of from our own network, which has limited bandwith that is more appropriately allocated to internal applications such as security monitoring, WAN applications and – naturally – in-house YouTube users (you know who you are!).

Q: What parts of AWS are you using?

At this stage we’re using a combination. S3 (Simple Storage) is where we store our virtual machine images – that’s the back-end stuff, where we build virtual machines and create AMIs (Amazon Machine Images) to fire up the virtual server that does the hard work. We’re using EC2 (Elastic Cloud Compute) to load these virtual machines into running processes that implement the solution.

Within EC2 we also use Elastic IPs to forward services to our virtual machines, which in the first instance are web servers and our proxy server, but also allows us to enforce security protocols and implement management tools for assessing the performance of our cache server, such as SNMP monitoring. We also use EBS (Elastic Block Store) to create virtual hard drives which maintain the cache, can be backed up to S3 and can be re-attached to a running instance should we ever need to re-configure the virtual machine. All critical data, including logs, are maintained on EBS.

We’re also about to implement a solution for another project called About NSW where we will be outsourcing high bandwidth static content (roughly 17GB of digitised archives in PDFs) to Amazon CloudFront.

Q: If an image is updated on the Powerhouse site how does AWS know to also update?

It happens transparently, and that’s the beauty of the design of the solution.

We have several million files that we’re trying to distribute and are virtually unmanageable in a normal Windows environment. trying to push this content to the cloud would be a nightmare. By using the reverse proxy method we effectively pick and choose – and thereby pull the most popular content and it automatically gets copied to the cloud for re-use.

Amazon have recently announced an import/export service, which would effectively allow us to send them a physical hard-drive of content to upload to a storage unit that they call a “bucket”. However, this is still not a viable solution for us because it’s not available in Australia and our content keeps getting added to – every day. By using a reverse proxy we effectively ensure that the first time that content is accessed it becomes rapidly available to any future users. And we can still do our work locally.

Q: How scalable is this solution? Can we apply it to the whole site?

I think it would be undesirable to apply it to dynamic content in particular, so no – things such as blogs which get changed frequently or search results which are always going to be slightly different depending on the changes that are effected to the underlying databases at the back end. In any case, once the entire site is fed via a virtual machine in another country you’ll actually experience a reduction in performance.

The solution we’ve implemented is aimed at re-distributing traffic in order to improve performance. It is an experiment, and the measurement techniques that we’ve implemented will gauge its effectiveness over the next few months. We’re trying to improve performance and save money, and we can only measure that through statistics, lies and invoices.

We’ll report back shortly once we know how it goes, but go on – take a look at the site we’ve got running in the cloud. Can you notice the difference?