Flapjack: rethinking monitoring for the cloud
Time: 10:30 - 11:15
Day: Thursday 21 January 2010
Location: Ilott Theatre (Town Hall)
Whether you're deploying a single website or thousands of networked servers, monitoring is an essential part of any production setup. Nagios is by far the most popular and well-established open source monitoring system available, used by tens of thousands of companies worldwide.
But as applications start moving towards the cloud, being able to scale your monitoring system to cope with 100 times the number of services and systems becomes extremely important. There are numerous projects that provide options for clustering Nagios instances together to squeeze out a bit of extra performance, but they all work within the limits of Nagios's architecture.
What's really required is a new architecture that's inherently scalable, that breaks the problem into discrete chunks, and focuses on doing one thing and doing it well. Enter Flapjack.
Flapjack is a scalable and distributed monitoring system. It natively speaks the Nagios plugin format (so you can reuse all your existing Nagios checks), and scales easily from one server to 1,000.
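To make "the Nagios plugin format" concrete: by convention a plugin is just an executable that prints a status line and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN), with optional performance data after a `|`. The sketch below is not Flapjack's code; `run_check` and `STATUS_BY_EXIT_CODE` are hypothetical names, and the "plugin" in the usage section is a Python one-liner standing in for a real check.

```python
import subprocess
import sys

# Exit codes defined by the Nagios plugin convention; anything else
# (including 3) is treated as UNKNOWN.
STATUS_BY_EXIT_CODE = {0: "OK", 1: "WARNING", 2: "CRITICAL"}


def run_check(command, timeout=10):
    """Run a plugin command and return (status, first line of its output)."""
    try:
        proc = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # The plugin could not be started or did not finish in time.
        return "UNKNOWN", str(exc)
    status = STATUS_BY_EXIT_CODE.get(proc.returncode, "UNKNOWN")
    lines = proc.stdout.strip().splitlines()
    # Text after "|" is performance data; keep only the readable part.
    message = lines[0].split("|", 1)[0].strip() if lines else ""
    return status, message


if __name__ == "__main__":
    # Stand-in plugin (an assumption for illustration): reports a warning.
    fake_plugin = [
        sys.executable,
        "-c",
        "print('DISK WARNING - 85% used | used=85%'); raise SystemExit(1)",
    ]
    print(run_check(fake_plugin))
```

Because the contract is only "exit code plus one line of text", any existing Nagios check can be run this way without modification.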
Flapjack tries to adhere to the following tenets:
- it should be simple to set up, configure, and maintain
- it should easily scale from a single host to multiple hosts
- writing and sharing checks should be quick and obvious
Flapjack breaks the monitoring process up into several components that communicate over the beanstalkd queueing system:
- flapjack-worker executes the checks and reports back results.
- flapjack-notifier notifies people if the check results are bad.
- flapjack-admin configures checks and views reports.
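The division of labour above can be sketched as a small pipeline. This is an illustration only: Python's in-memory `queue.Queue` stands in for beanstalkd, and the queue names, the job dictionaries and the three functions are all assumptions rather than Flapjack's actual message format or API.

```python
import queue

# In-memory stand-ins for two beanstalkd queues (hypothetical names).
checks = queue.Queue()
results = queue.Queue()


def enqueue_check(check_id, command):
    """flapjack-admin role: describe a check and put it on the queue."""
    checks.put({"id": check_id, "command": command})


def worker_step(run_check):
    """flapjack-worker role: take one check, execute it, report the result."""
    check = checks.get()
    results.put({"id": check["id"], "status": run_check(check)})


def notifier_step(notify):
    """flapjack-notifier role: take one result, notify only if it is bad."""
    result = results.get()
    if result["status"] != "OK":
        notify(result)


if __name__ == "__main__":
    enqueue_check("web-1", "check_http")
    enqueue_check("db-1", "check_disk")

    # A fake executor so the sketch runs without real plugins installed.
    def fake_run(check):
        return "CRITICAL" if check["id"] == "db-1" else "OK"

    for _ in range(2):
        worker_step(fake_run)
    for _ in range(2):
        notifier_step(print)
```

Each component only knows about the queue it reads from and the queue it writes to, which is what lets them run on different machines.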
Flapjack's architecture provides several distinct advantages when scaling the number of checks you're performing.
Firstly, it focuses on executing checks and notifying as required rather than collecting data. Most monitoring systems attempt to retrieve data and metrics directly from the machines they're monitoring, as well as running tests against that data when deciding to notify. This causes all sorts of problems when the data retrieval phase blocks and holds up the rest of the monitoring system.
Flapjack doesn't care about how the data is collected, only whether a check returns a result. This makes it easy to pair Flapjack with something like collectd, which solves all the hard data collection problems for you, and has networking baked in.
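As a rough illustration of that separation, a check can be a pure decision over a value that something else already gathered. In the sketch below the `collected` dictionary is a made-up snapshot standing in for whatever your collector stores; `check_threshold` and the metric names are hypothetical, and nothing here is collectd's or Flapjack's API.

```python
def check_threshold(value, warning, critical):
    """Return a status for a metric that something else already collected."""
    if value is None:
        return "UNKNOWN"  # the collector has not delivered data yet
    if value >= critical:
        return "CRITICAL"
    if value >= warning:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # Hypothetical snapshot, as a collector might have stored it.
    collected = {"web-1/load": 0.7, "db-1/load": 9.2}
    for name in ("web-1/load", "db-1/load", "cache-1/load"):
        status = check_threshold(collected.get(name), warning=4.0, critical=8.0)
        print(name, status)
```

The check never touches the network, so a slow or unreachable host delays the collector rather than the notification path.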
Secondly, Flapjack is asynchronous and distributed, allowing you to execute the checks on as many machines as you want. If you spin up 100 new EC2 instances that all need monitoring, and your existing flapjack-workers aren't keeping up, just bring up a new cluster of flapjack-workers and point them at your beanstalkd instance. Need to take your monitoring machines down for maintenance? Bring up a new cluster on another machine, then take down your old cluster.
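The reason adding workers is this cheap is that they share nothing except the queue. A minimal sketch of the idea, assuming threads in one process as stand-ins for worker machines and `queue.Queue` as a stand-in for beanstalkd (`run_workers` is a hypothetical helper, not part of Flapjack):

```python
import queue
import threading


def run_workers(check_ids, worker_count):
    """Drain a shared queue with worker_count workers; return who ran what."""
    pending = queue.Queue()
    for check_id in check_ids:
        pending.put(check_id)

    done = []  # (worker name, check id) pairs
    lock = threading.Lock()

    def worker(name):
        while True:
            try:
                check_id = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            with lock:
                done.append((name, check_id))

    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}",))
        for i in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return done


if __name__ == "__main__":
    finished = run_workers(range(100), worker_count=4)
    print(len(finished), "checks executed")
```

Changing `worker_count` changes throughput but not correctness: the queue hands each check to exactly one worker, so workers can be added or retired without any coordination between them.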
Lindsay Holmwood, Flapjack's creator, will be talking about why you should be using it, getting up and running with it, strategies for scaling to thousands of checks, and why separating data collection and notification is important.
Lindsay Holmwood is a freelance sysadmin/developer from Sydney, Australia. He was the sole architect of Australia's largest desktop Linux deployment, which he completed in 2007. More recently he has released gotgastro.com, a Google Maps mashup of the NSW Food Authority name-and-shame lists; cucumber-nagios, which lets you write checks for your monitoring system in plain English; and Visage, a web interface for viewing collectd statistics.
Lindsay was on the organising committee of linux.conf.au 2007, and served as president of the Sydney Linux Users Group from 2006 to 2008. He regularly speaks at open source conferences in Australia and abroad, presenting on system administration, the Ruby programming language, and deployment best practices.