November 19, 2009

QCon 2009: Software Architecture for Cloud Applications

Michael Nygard

Cloud: any rapidly-provisioned, on-demand computing or application services.

Grid: large-scale computing services, typically heterogeneous, distributed, and parallel.

Utility: "pay for what you use"

IaaS/PaaS/SaaS -- even though the names are similar, they're not the same thing

Most of this talk will be discussing Cloud + IaaS

Recent on-demand models:

  • LHC, due to the amount of data generated by every collision
  • Utility computing: just a billing method; variable demand; ceiling usually capped
    • Good for holiday volumes: to handle 2 months of the year they're overspending the other 10

Trends leading to clouds:

  • Virtualization
  • Commoditization
  • Horizontally scalable architecture
  • Rapid provisioning


Virtualization:

  • Be able to factor out devices and drivers
  • Applications can be as messy as they like (ISOLATION)
  • Take advantage of idle processing


Commoditization:

  • 43 Dell M600s for the cost of 1 HP 7410
  • Nothing in the former is hot-swappable, and it WILL break
  • ...but you'll need to be able to fire up a new server really quickly

How is virtualization + cloud different? It comes down to who does what (admin or user) and how quick and automatic provisioning is (whether it's under programmatic control).

Cloud: specialized hardware is abstracted away -- load balancers, firewalls, SANs, etc. These get turned into declarative aspects rather than actual boxes. ("Environment: I need to allow a network connection from X to Y, and I need X GB of storage...")
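As a sketch of that declarative idea (the field names and tier names here are invented for illustration, not any real provider's API), an environment becomes plain data plus a check, rather than actual boxes:

```python
# Hypothetical declarative environment spec: instead of racking a firewall
# and a SAN, you state the connections and storage the application needs.
environment = {
    "name": "example-app",
    "network_rules": [
        # "I need to allow a network connection from X to Y"
        {"from": "web-tier", "to": "app-tier", "port": 8080},
        {"from": "app-tier", "to": "db-tier", "port": 5432},
    ],
    "storage_gb": 100,  # "...and I need X GB of storage"
}

def allowed(spec, src, dst, port):
    """True if the declared rules permit this connection."""
    return any(
        r["from"] == src and r["to"] == dst and r["port"] == port
        for r in spec["network_rules"]
    )
```

The provider's job is then to realize the declaration with whatever hardware it likes.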

Choosing a provider:

  • like corn farming: small margin, large fixed costs, scale essential
  • small providers won't be around long

Cloud near you (a "private cloud" or a "fog"): some people don't call this a cloud, but MN does, since it fits a number of the characteristics above.

Easiest way to get ready for cloud: do nothing! It'll work, but you'll miss out on some items.

  • Advantages:
    • Scalability
    • Bundling: be able to paper over operational failures
    • Ephemerality
  • Risks:
    • Availability: there are more moving pieces, which means risk to availability
    • Geography
    • Ephemerality

EC2 quirks:

  • "clean boot" is REALLY clean
  • local storage not persistent
  • IP address not predictable

Controlling the flock of EC2s: talk to your instances via an API. You can also reach a single instance via SSH or rdesktop.

Bundling: start with an Amazon machine image (AMI); do stuff (install, configure); then store the result as an S3 resource.

  • Can automate this and have your build server churning out fully bootable images.
  • ~half of all failures come within 24 hours of a change; this is because of bad change control and lack of automation
  • Plus side of ephemerality -- can fire up instance for user acceptance testing, then once it's good it's just a matter of assigning IPs. Once that's working for a while, you can just blow away the old machines.
    • This is like trucking in a bunch of machines with every upgrade.

Handling concurrent versions:

  • web assets: old version in URLs
  • integration points: version all protocols and encodings; client has to specify which one it's using
  • DB: automate migrations; split them into phases; expand first, contract later ("Refactoring Databases" has info on this)
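The integration-point rule in miniature -- the client names the protocol version it speaks, and the server dispatches on it. Handler names and the payload shape are made up for illustration:

```python
# Sketch of a versioned integration point: every request carries a version,
# and old clients keep working while new encodings are rolled out.
def handle_v1(payload):
    return {"total": payload["amount"]}

def handle_v2(payload):
    # v2 expanded the encoding: amounts carry an explicit currency.
    return {"total": payload["amount"], "currency": payload.get("currency", "USD")}

HANDLERS = {"v1": handle_v1, "v2": handle_v2}

def dispatch(version, payload):
    """Route a request to the handler for the version the client specified."""
    if version not in HANDLERS:
        raise ValueError("unsupported protocol version: " + version)
    return HANDLERS[version](payload)
```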

Scalability: master/worker pattern

  • Workers pull work off a queue; the master watches the workers and the time to complete jobs, and will fire up new EC2 instances if the work is taking too long (GitHub recently talked about this)
  • NYTimes: converting 11 million articles, firing up several hundred servers for a weekend
  • One shop was able to go from 50 to 5,000 processing nodes in 1 week when they got popular (Facebook, etc.)
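A toy version of the master/worker pattern: workers drain a shared queue, and the master's sizing decision is a pure function. The names and the simple throughput model are assumptions for illustration, not GitHub's or the NYTimes's actual code:

```python
import math
import queue
import threading

def worker(jobs, results):
    """Pull work off the shared queue until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            break
        results.append(job * job)  # stand-in for real processing
        jobs.task_done()

def instances_to_add(backlog, jobs_per_instance_per_min, deadline_min, running):
    """Master's decision: fire up new instances if work would miss the deadline."""
    needed = math.ceil(backlog / (jobs_per_instance_per_min * deadline_min))
    return max(needed - running, 0)
```

In a cloud setting the sentinel/shutdown step is where old instances get blown away instead of joined.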

Map/reduce, Hadoop: many workers in parallel; resilience through re-execution; move the computation to the data. Amazon has 'Elastic Map/Reduce', which hides the network topology requirements from you.
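The idea in miniature -- a word count where the map phase emits (word, 1) pairs and the reduce phase sums them; single-process Python stands in for Hadoop here:

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in doc.split()]

def reduce_phase(pairs):
    """Sum the counts for each word across all mappers' output."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["the cloud", "the grid and the cloud"]
pairs = chain.from_iterable(map_phase(d) for d in docs)
counts = reduce_phase(pairs)
```

Because each map call depends only on its input document, a failed worker's share can simply be re-executed elsewhere -- that's the resilience-through-re-execution point.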

Scalability: horizontal replication -- make read-only snapshots of database, let them serve up data that changes very rarely.

  • Now have the ability to use declarative load balancing and automatic scaling. Every load balancer out there requires you to configure the workers in the pool, and must restart when that configuration changes -- a problem when new nodes can pop up automatically. Instead, specify some tolerance of use; if you go outside it, another node is added to the auto load-balanced group. (Takes about 5 minutes to react, "So if Oprah mentions you on Twitter you're still dead.")
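The tolerance band can be sketched as a pure decision function (the thresholds and names are invented; real auto-scaling also adds cooldowns and the ~5-minute reaction lag noted above):

```python
def scale_decision(utilization, pool_size, low=0.3, high=0.7, min_size=1):
    """Declarative tolerance band: outside it, grow or shrink the pool.

    utilization is the pool's current load as a fraction of capacity.
    Returns the new desired pool size.
    """
    if utilization > high:
        return pool_size + 1          # over tolerance: add a node
    if utilization < low and pool_size > min_size:
        return pool_size - 1          # under tolerance: retire a node
    return pool_size                  # inside the band: do nothing
```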

  • Natural affinity between cloud computing and open source.

  • Cache servers and in-memory data grids
    • essential components; they turn the things you like doing least in the cloud into the things you like doing most.
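A minimal cache-aside sketch of that idea: expensive reads against slow or ephemeral backends become dictionary lookups. The class and method names are invented for illustration:

```python
class CacheAside:
    """Cache-aside: check the cache first, fall back to the backing store."""

    def __init__(self, load_fn):
        self._cache = {}       # stand-in for a cache server / data grid
        self._load = load_fn   # the expensive read you'd rather avoid
        self.misses = 0

    def get(self, key):
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._load(key)
        return self._cache[key]
```

A real in-memory data grid adds distribution and eviction, but the read path is the same shape.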

Mitigating the downside of ephemerality: anything can disappear at any moment.

  • Failure of block storage: snapshot data to S3, so a parallel system can be set up and read from the snapshot.
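A toy model of that mitigation, with a dict standing in for the block store and a deep copy standing in for the S3 snapshot:

```python
import copy

class BlockStore:
    """Toy ephemeral volume: anything on it can disappear at any moment."""

    def __init__(self):
        self.blocks = {}

def snapshot(volume):
    """Copy the volume's state to durable storage (standing in for S3)."""
    return copy.deepcopy(volume.blocks)

def restore(snap):
    """Set up a parallel volume that reads from the snapshot."""
    vol = BlockStore()
    vol.blocks = copy.deepcopy(snap)
    return vol
```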

Another pattern:

  • Decentralized adoption of cloud computing (he's paying for EC2 instances with the secretary's credit card)


Security:

  • Perception of risk is just that -- perception; it stems from the idea that security is either perfect or worthless. Security professionals don't think this way; they think in terms of risk data. WIBHI reaction ("Wouldn't it be horrible if...")
  • Cloud security alliance working on this
  • Control plane threats (the set of APIs that determine who can access what); any managed hosting provider has this issue
  • Patent shutdowns: be sure you have strategies for extracting yourself (VM + data) from your provider
  • Lack of risk management info

Facilities to apply:

  • Data in motion: transport level encryption
  • Data at rest: storage level encryption
  • Access control: security groups (~VLANs and firewalls)
  • Control plane: not under your control
  • Multitenancy concerns: not under your control