Life in the Cloud: Making it Easier to Install, Configure and Leverage Analytics

Overview

Here at Digital Reasoning we spend a lot of time living in the cloud, thinking about how we can leverage cloud infrastructure and how our clients can experience our products in new and innovative ways. Cloud computing offers a variety of features and benefits, but for us one of the biggest is the flexibility it provides.

Delivering our products within a cloud environment meant that our products and processes had to mature as the complexity of what we needed to develop and test for the cloud increased. Over the past six months we've been working on something that will not only make it a lot easier for us to manage that complexity, but will also be something the rest of the community can leverage.

First: How Do We Use the Cloud?

Our software, Synthesys, uses Hadoop to take advantage of large clusters.  In addition to Hadoop/MapReduce itself, Synthesys uses a variety of tools from the Hadoop and big data ecosystem:

  • Oozie
  • HBase
  • Hive
  • Pig
  • Impala
  • Cassandra
  • And more!

Several years ago we released PyStratus, an open-source command-line tool for managing EC2 instances running Hadoop and Cassandra. With it properly installed and configured, launching a 10-node cluster from a cluster definition I've named centos-hbase-cdh4 is a single command:

> stratus exec centos-hbase-cdh4 launch-cluster 10

Several minutes later I see something like:

Using hbase as the backend datastore
Launching cluster with 10 instance(s) - starting master...please wait.
Master now running at EC2_HOSTNAME - starting slaves
Launching 10 slave instance(s)...please wait.
Finished - browse the cluster at http://EC2_HOSTNAME

At that point I have a fully functioning cluster with Hadoop, HBase, and all the other pieces needed to run Synthesys. PyStratus has been a mainstay of our automated testing framework for the past several years, but the aforementioned complexity in the product and our testing needs have pushed it to its limit.

How Can We Improve This?

One of our developers, Steve Brownlee, recently blogged about something called stackd.io, which is what we've been developing to better manage all this complexity. Take a minute to check out his post for a great in-depth look at the various aspects of the problem space we've been trying to address with stackd.io. While some of our developers gave us invaluable feedback about their infrastructure needs, I've had the exciting experience of being the first “heavy lifting” user of stackd.io over the past few weeks. All I can say about the hype and promise that Steve wrote about is that it's all that and more!

As I’ve been digging in and revamping our automated QA platform to use stackd.io instead of PyStratus, there are 3 main things about stackd.io that have risen to the top of my “Reasons Why stackd.io Is Awesome” list:

Repeatable Provisioning

PyStratus configured a cluster with a combination of two things: Python code in PyStratus itself, and user-data scripts. Modifying core software to apply a cluster configuration change is far from ideal, and the user-data scripts quickly become unwieldy. Sharing my configuration with other developers can also be a complicated process: they might have a different Python setup, be missing some environment variables the user-data scripts depend on, and so on.

With stackd.io, all that is a thing of the past. I've been able to create a wide variety of cluster configurations and haven't had to touch a line of stackd.io code. It's been purposely designed to avoid lock-in to any particular technology stack, so I've been creating “blueprints” specific to our testing needs. Multiple versions of Hadoop, different databases, varying cluster sizes, and spot instances for significant cost savings are all just a different API call away. Sharing my blueprints with another developer is as easy as giving them an account and telling them the name of the blueprint. They'll have the exact same configuration that I did, or they can tweak a few key variables specific to their own use.
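
To make the “just a different API call away” point concrete, here's a rough sketch of what launching a stack from a shared blueprint looks like from my test harness. The server URL, endpoint path, payload fields, and credentials below are illustrative assumptions for this post, not necessarily the stackd.io API as it will ship:

# Illustrative sketch only: the endpoint path, payload fields, and credentials
# here are assumptions for this post, not necessarily the stackd.io API as shipped.
import requests

STACKDIO_URL = "https://stackdio.example.com/api"  # hypothetical server
AUTH = ("qa-user", "s3cret")                       # hypothetical credentials

def launch_stack(blueprint_name, title):
    """Ask stackd.io to spin up a new stack from a named blueprint."""
    payload = {
        "blueprint": blueprint_name,  # e.g. "centos-hbase-cdh4"
        "title": title,               # e.g. "nightly-qa-run"
    }
    resp = requests.post(STACKDIO_URL + "/stacks/", json=payload, auth=AUTH)
    resp.raise_for_status()
    return resp.json()

stack = launch_stack("centos-hbase-cdh4", "nightly-qa-run")
print("Launched stack:", stack.get("id"))

The blueprint name is the only thing another developer needs to know to reproduce the same environment I'm testing against.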

Declarative Configuration

As fun as it is to hack together a complicated solution in a bash script (I've done some pretty gnarly stuff over the years), there are a lot of weaknesses to using bash scripts to manage your cloud infrastructure. I've started many of them with the best of intentions, but they have quickly devolved into too-long, overly complicated nests of logic, all just to set some hostname variable in a handful of configuration files.

Rescuing us from this predicament is the core of stackd.io: SaltStack. SaltStack provides a declarative approach to defining how your infrastructure should be configured. For example, to install the popular Apache web server and have it started automatically, I'd add this to my configuration:

apache:
  pkg:
    - installed
  service:
    - running
    - require:
      - pkg: apache

In the user-data solution, I'd end up making clever use of some bash syntax to craft a compact one-liner to do the same thing. On a RHEL/CentOS system it'd be something like:

yum install -y httpd && service httpd start

It's nice and compact at this point, but what if the service dies for some reason? Or what if I wanted to define a set of configuration files to be included with my Apache installation? Or have a Python WSGI server depend on Apache? My simple bash script would quickly become complicated as it grew to handle all of those situations, and I'd have to document my solution extensively if I wanted anyone else to be able to understand or modify what I did. With a declarative approach, I simply update the configuration file (which uses YAML syntax) and re-provision my cluster.

It’s All About the API

As I mentioned earlier, stackd.io has specifically avoided lock-in to our exact technology stack. Similarly, the API has been designed in a flexible way to accommodate various use cases. As its first heavy user, I've been bumping into a number of use cases for our automated QA environment that the API doesn't directly support. For those we think are of general use, we're updating the API, but a lot of them are specific to what we're doing with our automated QA. In those cases it's been very powerful to leverage data available from the API directly, and at no point have I been limited in what I've been able to do.

I've been consuming the API directly through REST calls and avoiding the user interface, but it's great to know that any of our developers who end up using this will be invoking the same APIs that I am. This ties into the sharing aspect: individual developers and our automated QA integration will both be able to leverage each other's work, and the API is the common link that makes that possible.
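
As one example of leveraging API data directly, here's the kind of call my QA harness makes to discover the hosts in a running stack and hand their hostnames to the test runner. As with the earlier sketch, the endpoint and response fields are assumptions for illustration, not necessarily the stackd.io API as it will ship:

# Illustrative sketch only: the endpoint and response fields are assumptions
# for this post, not necessarily the stackd.io API as shipped.
import requests

STACKDIO_URL = "https://stackdio.example.com/api"  # hypothetical server
AUTH = ("qa-user", "s3cret")                       # hypothetical credentials

def stack_hostnames(stack_id):
    """Return the fully qualified hostnames of every host in a stack."""
    resp = requests.get(STACKDIO_URL + "/stacks/%d/hosts/" % stack_id, auth=AUTH)
    resp.raise_for_status()
    return [host["fqdn"] for host in resp.json()["results"]]

for fqdn in stack_hostnames(42):
    # Point the automated QA run (and Synthesys itself) at each host.
    print(fqdn)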

So…  When Will We Share This?

I know Abe and Steve (the primary developers on stackd.io) have been excited to get this out into your hands. Now that I've been using it for the past few weeks, I share that excitement. We have a few key things to tie up in the next month or two before it'll be ready for public consumption, but hopefully it'll be worth the wait. Stay tuned for additional posts with more specifics on how we're using stackd.io. Hit us up at info@digitalreasoning.com if you have any questions about what we're doing with this.