Elasticness and Clouds

Amazon pretty much claimed the word "elastic" in computing when they delivered their Elastic Compute Cloud (EC2) years ago. One of its key features is, unsurprisingly, that it is elastic: you use an API (ideally) or a web interface to provision resources on demand.

This works nicely - except when it doesn't. After years of experience with this, I have noticed that (in general) at the first sign of trouble the API starts failing: requests get dropped or time out. (Sometimes that can even be an early warning sign of impending doom.)

It is worth noting that most clouds seem to have similar issues with their APIs - the APIs often don't get the same quality of service (QoS) that your servers get. This wouldn't normally be a problem, but a common strategy with infrastructure clouds is to make use of this elasticness (duh!) day to day for your operations, as well as for recovery. Frustratingly, this behaviour means you have to either accept the QoS limitation or plan around it by provisioning extra resources ahead of time. The latter approach somewhat undoes the benefit of having a highly elastic API - but here we are anyway.
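
One way to live with an API that drops requests or times out is to wrap each call in a retry with exponential backoff. What follows is only a minimal shell sketch, not something any cloud provider prescribes: `retry_with_backoff` and `provision_server` are hypothetical names made up for illustration, and `provision_server` stands in for whatever command actually calls your cloud's API.

```shell
# retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or MAX_ATTEMPTS
# is reached, doubling the sleep between attempts.
retry_with_backoff() {
  local max_attempts=$1
  local delay=$2
  shift 2
  local attempt=1
  while true; do
    # Run the command; stop as soon as it succeeds.
    if "$@"; then
      return 0
    fi
    # Out of attempts: report and fail.
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

For example, `retry_with_backoff 5 2 provision_server` would try up to five times, sleeping 2, 4, 8 and 16 seconds between attempts. Retrying only papers over brief blips; during a longer outage, capacity provisioned ahead of time is still the fallback.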

Somewhere there is a balance, but at the moment the big users of the public clouds are increasingly treating them as a less-than-elastic resource pool (look up Netflix and their use of Amazon for an example of this). I can't help but wonder if this means APIs will fall out of favour for highly elastic workloads, or if the QoS of these APIs will improve over time...

How to waste a Friday

An unusually verbose writeup at status.aws.amazon.com on a network issue and what they will do about it:

Repeated here for posterity (E-BS is a story for another day, but for now): 

From 7:28pm PDT to 9:56pm PDT, a networking issue affected connectivity to a significant number of instances in the US-EAST-1 region. Affected instances experienced degraded network connectivity to the Internet and to instances in other availability zones.

The root cause of last night's issue was when a core network routing device experienced a partial failure. While the router was causing packet loss, the failure was not detected by surrounding network devices and therefore they did not automatically fail traffic over to redundant network paths as intended.

Additionally, our network monitoring tools failed to help our network operators locate the specific source of the connectivity issues. Once our networking team determined the location of the impact, they were able to identify the failing router and manually failed traffic routes away from it. At this point, all affected instances regained full connectivity.

We will be completely replacing the failed network device and our team will work on the failed device to understand the source of the failure. More importantly, we will be working to understanding why our network monitoring did not allow our team to quickly isolate the problem and force the manual failover to redundant network routes. We rely on this monitoring to help us deal with partial failures which defeat the normal redundancy built into high availability network architectures. We understand the impact this event had on some of our users, this just took us too long to figure out, and will be intensely focused on improving our monitoring and addressing the root cause of this failure.


CloudBees runtime

So one of the things I have been working on a bit lately is now finally live: 

The short version: very much like GAE but without the Google. On top of this, it works with the so-called "dev@cloud" (which is Hudson - now renamed to Jenkins - still with me?) so you can have a code push -> test -> run cycle all in one place (if you like).

Zero proprietary APIs, and the data remains yours of course. There will always be a free runtime in the cloud; I think it is necessary to "keep the dream alive" so to speak. Signup is free of course (you may need to put in your phone number so it can check you are human).

It has been nice building something that I want to use. It is a nice place to be. 

Enjoy !

Getting running with Postgres (on OS X)

So I spent the best part of a Friday morning "yak shaving" - getting Postgres to run on OS X in a development-friendly way. I consulted The Google and found thousands of ways one could install PostgreSQL for development. I think the core of my problem is that a lot of the guides want you to have a hardened production installation.

So what I recommend (thanks to Simon Harris): 

Install Homebrew! Then:

    brew install postgresql

(have a coffee)

After it is installed, it spits out instructions on how to start. You may miss them, so here they are repeated:


If this is your first install, create a database with:

    initdb /usr/local/var/postgres


If this is your first install, automatically load on login with:

    cp /usr/local/Cellar/postgresql/9.0.1/org.postgresql.postgres.plist ~/Library/LaunchAgents

    launchctl load -w ~/Library/LaunchAgents/org.postgresql.postgres.plist


If this is an upgrade and you already have the org.postgresql.postgres.plist loaded:

    launchctl unload -w ~/Library/LaunchAgents/org.postgresql.postgres.plist

    cp /usr/local/Cellar/postgresql/9.0.1/org.postgresql.postgres.plist ~/Library/LaunchAgents

    launchctl load -w ~/Library/LaunchAgents/org.postgresql.postgres.plist


Or start manually with:

    pg_ctl -D /usr/local/var/postgres -l /usr/local/var/postgres/server.log start


And stop with:

    pg_ctl -D /usr/local/var/postgres stop -s -m fast


If you want to install the postgres gem, including ARCHFLAGS is recommended:

    env ARCHFLAGS="-arch x86_64" gem install pg


Finally - this is all running as your current user - no superuser, no postgres user. You can simply run:

    createdb yourdbname-here

and likewise dropdb, etc.

It will always use your current user - which is perfect for development. Much easier than trying to shoehorn in server-grade configurations.

Ephemeral - the word of the week (cloud)

So part of what I have been working on has touched on two fairly popular "infrastructure clouds": Amazon EC2 and Rackspace CloudServers. Infrastructure as a service == IaaS == servers on demand (via some API, with the illusion of infinite numbers of these resources available - lots more is written elsewhere on the web so I won't repeat it). I pronounce IaaS "yass" but a good friend of mine says "I Ass", which I may adopt, except in polite company.

In any case, there is a bit of a misunderstanding as to what "ephemeral" means for these virtual cloud machines. With Amazon's "instance storage", people still seem to think that if you restart (reboot), you lose your storage - heck, you even lose your IP. But this is NOT the case! The only time you lose that storage is if the machine dies in some catastrophic way - or you terminate it. Restarting is just fine. In this sense Rackspace and EC2 are very similar (Rackspace has a higher QoS on their instance stores, which is at least equaled by using an EBS store on EC2).

I am surprised that this confusion lives on, but rereading the above paragraph with all its jargon, it's probably not surprising. As always, experience cures all this.