Sunday, July 07, 2013

Best Practices: It's Freezing in the Cloud

Production.  

It's a term many people have heard of, but what does it mean?   A lot of people have been asking me about this lately, so I'm happy to give an overview of some best practices for solution management.


Production Rule #1: A production environment is something you don't touch.  

Its configuration is frozen - it's not up for experimentation.  It does exactly what it's been configured to do, nothing more, nothing less.  A production environment is made up of a set of production servers handling various functions.  Whether they're web servers, app servers, compute servers, db servers, or comms/queueing servers, each production server is a simple combination of three things:

  • a vetted version of your application/website/service/database/whatever
  • a vetted copy of each and every non-stock (tuned) configuration file
  • a base OS - hopefully one that's stock, but definitely one that's been proven stable

The sum total of those three things makes a production server.   

Example Production Environment "Cook Book":
db application + db_master_config + stock RHEL6 server = master-database-server
db application + db_tapebackup_slave_config + stock RHEL6 server = backup-slave-database-server
db application + db_failover_slave_config + stock RHEL6 server = failover-slave-database-server
payment_gateway_if + comm_server_tuning + stock RHEL6 server = payment_gateway
payment_gateway_if + failover_comm_server_tuning + stock RHEL6 server = failover_payment_gateway
website content + webhead_config + stock RHEL6 server = web1
website content + webhead_config + stock RHEL6 server = web2
website content + webhead_config + stock RHEL6 server = web3
load_balancer_config + stock cloud loadbalancer (LBaaS) = production_loadbalancer
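One way to make the "nothing more, nothing less" idea concrete is to treat the cook book itself as data. Here's a minimal sketch in Python - the recipe names are taken from the example above, and the `ServerRecipe` structure is my own illustration, not a real provisioning tool:

```python
# Sketch: each production server is nothing more than a vetted application,
# a vetted config, and a proven base OS. Names mirror the cook book above.
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: a production recipe is not up for experimentation
class ServerRecipe:
    application: str
    config: str
    base_os: str

COOKBOOK = {
    "master-database-server": ServerRecipe("db_application", "db_master_config", "stock RHEL6"),
    "web1": ServerRecipe("website_content", "webhead_config", "stock RHEL6"),
    "web2": ServerRecipe("website_content", "webhead_config", "stock RHEL6"),
}

# web1 and web2 differ only in name: identical recipe, identical server.
assert COOKBOOK["web1"] == COOKBOOK["web2"]
```

The payoff of recipes-as-data is that "what is running in production?" has exactly one answer: whatever the cook book says.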


Production Rule #2:  Never login on your production servers.

Notice there's no mention of "custom tweaking" or "post-install tuning" or the like in the recipes listed in the example above.  That's because there really must be none. Human beings NEVER LOG IN on production servers.  If they're doing that, they're almost certainly breaking rule #1 and they're definitely breaking rule #2.  Early on, as you're putting the solution in place, it may be handy to use ssh to remotely run a command or two - but you must ensure you're only running non-intrusive monitoring - "read only" operations - if you wish to preserve the correctness of the production environment.
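If you do allow those early read-only ssh checks, it helps to build the command line in exactly one audited place. This is a hypothetical sketch - the whitelist and function name are my own illustration:

```python
# Hypothetical sketch: the only remote access anyone performs is a
# whitelisted read-only command over ssh. Centralizing the argv makes it
# easy to audit that nothing intrusive ever runs on a production box.
READ_ONLY_COMMANDS = {"uptime", "df -h", "free -m"}  # illustrative whitelist

def build_readonly_check(host: str, command: str) -> list:
    """Return the ssh argv for a vetted read-only check, or refuse."""
    if command not in READ_ONLY_COMMANDS:
        raise ValueError("refusing non-whitelisted command: %r" % command)
    # BatchMode=yes: fail rather than prompt - no interactive sessions.
    return ["ssh", "-o", "BatchMode=yes", host, command]

print(build_readonly_check("web1", "uptime"))
# → ['ssh', '-o', 'BatchMode=yes', 'web1', 'uptime']
```

Anything not on the whitelist raises instead of running, which keeps "just this once" shell sessions from creeping in.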

The moment you break rule #2, you'll be setting yourself up for a conundrum if/when there is a problem with the production environment.  You'll then need to answer the question:  "was it what I did in there, or was it something in the production push that's the root cause of the problem?"

If you've never logged in on the production servers, you then KNOW it was something in the production push that caused the problem.

How, then, do you arrive at a reasonable solution - one that doesn't over-spend on servers, memory, storage, licenses, etc. - if you can't tune your production environment?   You tune your staging environment instead.  

Staging.

There's a term fewer people have heard, but it's just as important as "production".  Every good production environment has at least one staging environment.

Ideally, a staging environment duplicates the production environment.  If you're hesitant to jump straight to that, you can introduce less redundancy than the production environment has - but you're opening up the possibility of mis-deployment if you do.

Example "barely shortened" staging environment:
db application + db_master_config + stock RHEL6 server = master-database-server
db application + db_tapebackup_slave_config + stock RHEL6 server = backup-slave-database-server
payment_gateway_if + comm_server_tuning + stock RHEL6 server = payment_gateway
website content + webhead_config + stock RHEL6 server = web1
website content + webhead_config + stock RHEL6 server = web2
load_balancer_config + stock cloud loadbalancer (LBaaS) = staging_loadbalancer

The idea with a staging environment is that it's the destination for changes to applications, website content, and configurations before they go into production.


Staging Rule #1: A staging environment is something you don't touch.  

(well... after it's been setup and debugged and is working properly, anyway)

It's a "devops" world now - sysadmin config changes need to be versioned and managed just as carefully as code changes.  All of the changes should be committed to a source code repository - ideally something like git.  

Once a week, or more often if needed, the entire list of changes being made for all components and configurations is reviewed and vetted/approved.  Then, all of those changes are applied to the staging environment, backing things up first if needed.

With that, you've achieved a "staging push" - combining all of the changes to all of the functionality and configuration for all of the various solution components and applying them to the staging environment.  At that point automated testing begins against the solution that you've just put in place in the staging environment.

Real-world traffic to the solution is either simulated or exactly reproduced, and the performance and resource utilization of all servers implementing staging is logged.  After some days of testing (yes, multiple days - ideally simulating a full week of operations), summaries and statistics can be generated from the resource utilization logs.

If there are any ill side-effects of the most recent push, they'll be evident because the resource utilization statistics will show that things got worse.  For example, if there's a badly coded webpage introduced which is causing apache processes to balloon up in size, the memory statistics on the webheads will be notably worse than they were for the previous staging push.
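That comparison step is mechanical enough to sketch. Here's an illustrative Python version - the 10% threshold and the per-host peak-memory numbers are arbitrary choices of mine, not from the original post:

```python
# Sketch: given per-host peak memory from the previous and current staging
# runs, flag any host whose usage got notably worse after the push.
def find_regressions(previous, current, threshold=0.10):
    """Return {host: (prev_mb, cur_mb)} where usage grew by more than `threshold`."""
    regressions = {}
    for host, prev_mb in previous.items():
        cur_mb = current.get(host, prev_mb)
        if cur_mb > prev_mb * (1 + threshold):
            regressions[host] = (prev_mb, cur_mb)
    return regressions

prev = {"web1": 800, "web2": 810, "web3": 790}   # peak apache RSS, MB
cur  = {"web1": 1400, "web2": 820, "web3": 795}  # web1 ballooned after the push
assert find_regressions(prev, cur) == {"web1": (800, 1400)}
```

The badly coded webpage scenario shows up immediately: web1's memory jumped well past the threshold while web2 and web3 stayed flat.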

Staging Rule #2:  Never login on your staging servers.

If it's done right by suitably lazy programmers, your staging environment will be running all of this testing automatically, monitoring resources automatically, comparing the previous and current statistics resulting from testing at the end of the test run, and emailing you with the results.
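The "email you with the results" step can stay hands-off too. A hypothetical sketch of building that report - the address, run id, and regression format are all illustrative:

```python
# Hypothetical sketch of the hands-off report: the test harness summarizes
# the run and composes the mail itself; no human logs in to gather numbers.
from email.message import EmailMessage

def build_report(run_id, regressions):
    """Compose the staging-run summary mail from the comparison results."""
    msg = EmailMessage()
    verdict = "REGRESSIONS" if regressions else "clean"
    msg["Subject"] = "staging run %s: %s" % (run_id, verdict)
    msg["To"] = "ops@example.com"  # illustrative address
    lines = ["%s: %d MB -> %d MB" % (host, prev, cur)
             for host, (prev, cur) in sorted(regressions.items())]
    msg.set_content("\n".join(lines) or "all metrics within bounds")
    return msg

report = build_report("2013-07-07", {"web1": (800, 1400)})
assert "REGRESSIONS" in report["Subject"]
```

Actually sending it (via smtplib or your mail relay) is the only part left out; the point is that the summary exists before any human touches a server.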

You can only be 100% sure of the results of the staging test if it was entirely "hands off".  Otherwise if/when something goes wrong (either in production or in staging) you'll be left wondering if it was due to the push, or due to whatever bespoke steps you took in staging.  That's not a good feeling, and it's not a fun discussion with your board of directors either.

More Twenty-first century devops best practices

If you'd like to learn more, I can recommend Allspaw and Robbins, Web Operations: Keeping the Data On Time.  Now it's your turn - what's your favorite "devops" runbook/rule-book? 

