Friday, July 26, 2013

Exploring Variants of Linux

Rackspace Cloud Linux

One of the very cool aspects of the Rackspace Cloud is the number of Linux/GNU variants it offers.  As I'm already intimately familiar with Red Hat / CentOS, I decided to take a look at the other seven.

Dealing with an array of cloud servers can be a little tedious, so I wrote a small program to run commands on all of them for me:  github_gist:/PaulReiber/run

This article documents my exploration of Ubuntu, Arch, Debian, Gentoo, Fedora, openSUSE, and FreeBSD (which isn't Linux at all, but is offered right alongside the Linux images).

$ run 'hostname'|egrep ^pbr\|stdout
pbr_ubuntu12.10_512
  @stdout="pbr-ubuntu12.10-512\r\n">]
pbr_freebsd9_512
  @stdout="pbr-freebsd9-512\r\n">]
pbr_opensuse12.1_512
  @stdout="pbr-opensuse12.1-512\r\n">]
pbr_fedora17_512
  @stdout="pbr-fedora17-512\r\n">]
pbr_gentoo12.3_512
  @stdout="pbr-gentoo12.3-512\r\n">]
pbr_debian6_512
  @stdout="pbr-debian6-512\r\n">]
pbr_arch2012.08_512
  @stdout="bash: hostname: command not found\r\n">]


Grepping JSON output isn't pretty, but you can see from the above how my program "run" works: it prints the name of each cloud server, then the JSON output of running a command via the Ruby fog "ssh" API.
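
For the curious, here's the general shape of such a tool.  This is a hedged sketch in Python rather than the actual Ruby/fog program from the gist - the hostnames are placeholders, and it simply shells out to the system ssh client:

#!/usr/bin/env python3
# Rough sketch of a "run"-style tool: execute one command on many servers.
# NOT the Ruby/fog program from the gist - just an illustration of the idea.
import subprocess
import sys

SERVERS = [            # hypothetical cloud server hostnames
    "pbr-ubuntu12.10-512",
    "pbr-freebsd9-512",
    "pbr-debian6-512",
]

def run_everywhere(command):
    for host in SERVERS:
        result = subprocess.run(["ssh", host, command],
                                capture_output=True, text=True, timeout=30)
        print(host)
        print("  stdout=%r" % result.stdout)

if __name__ == "__main__":
    run_everywhere(sys.argv[1] if len(sys.argv) > 1 else "hostname")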

The output above also highlights something I ran into a LOT - various "standard" commands are simply not present on some distros!  Let's take a look - is "vim" available on all of the distros?

$ run 'which vim'|egrep ^pbr\|stdout
pbr_ubuntu12.10_512
  @stdout="/usr/bin/vim\r\n">]
pbr_freebsd9_512
  @stdout="vim: Command not found.\r\n">]
pbr_opensuse12.1_512
  @stdout="/usr/bin/vim\r\n">]
pbr_fedora17_512
  @stdout="/bin/vim\r\n">]
pbr_gentoo12.3_512
  @stdout="which: no vim in (/usr/bin:/bin:/usr/sbin:/sbin)\r\n">]
pbr_debian6_512
  @stdout="">]
pbr_arch2012.08_512
  @stdout="/usr/bin/vim\r\n">]


From this it seems that FreeBSD, Gentoo, and Debian don't come with vim pre-installed.   How about 'make'?

$ run 'which make'|egrep ^pbr\|stdout
pbr_ubuntu12.10_512
  @stdout="/usr/bin/make\r\n">]
pbr_freebsd9_512
  @stdout="/usr/bin/make\r\n">]
pbr_opensuse12.1_512
  @stdout="/usr/bin/make\r\n">]
pbr_fedora17_512
  @stdout="/bin/make\r\n">]
pbr_gentoo12.3_512
  @stdout="/usr/bin/make\r\n">]
pbr_debian6_512
  @stdout="/usr/bin/make\r\n">]
pbr_arch2012.08_512
  @stdout="which: no make in (/usr/bin:/bin:/usr/sbin:/sbin)\r\n">]

No make command is available on Arch.  Arch is not for the faint of heart, I guess.

I'll be updating this blog post, and adding additional posts, as I continue exploring these various Linux distros.

Wednesday, July 24, 2013

Nameless Temporary Files

Linux 3.11 rc2 


Here's an interesting snippet from Linus's announcement post regarding Linux 3.11 rc2:

 (a) the O_TMPFILE flag that is new to 3.11 has been going through a
few ABI/API cleanups (and a few fixes to the implementation too), but
I think we're done now. So if you're interested in the concept of
unnamed temporary files, go ahead and test it out. The lack of name
not only gets rid of races/complications with filename generation, it
can make the whole thing more efficient since you don't have the
directory operations that can cause serializing IO etc.

Interesting idea!  Temporary files that aren't burdened with filenames.
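
In case you're wondering what that looks like from a program's point of view, here's a hedged sketch in Python.  (os.O_TMPFILE only shows up in later Python releases, and the flag itself requires Linux 3.11+, so treat this as illustration rather than something to run today.)

import os

# Open an unnamed temporary file in /var/tmp - there is no directory entry,
# so there is nothing to name, clean up, or race against.
fd = os.open("/var/tmp", os.O_TMPFILE | os.O_RDWR, 0o600)
os.write(fd, b"scratch data that never gets a filename\n")
os.lseek(fd, 0, os.SEEK_SET)
print(os.read(fd, 64))
os.close(fd)   # the disk space is reclaimed here - nothing to unlink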

It will be some time before sysadmins see this feature in production, but in this case I think it's best to get the word out early, especially since this new feature could cause "mystery drive space exhaustion".

Right now, the classic discrepancy between "df" and "du" numbers comes from deleted files that still have open file descriptors.  With this new feature, it appears that nameless temporary files will join the ranks of hard-to-spot possible root causes of space exhaustion.

It's not clear how these new files will be identifiable / distinguished, for example, in the output of "lsof".
As I learn more about this new feature, I'll be sure to write about it.

Sunday, July 07, 2013

Best Practices: It's Freezing in the Cloud

Production.  

It's a term many people have heard of, but what does it mean?   A lot of people have been asking me about this lately, so I'm happy to give an overview of some best practices for solution management.


Production Rule #1: A production environment is something you don't touch.  

Its configuration is frozen - it's not up for experimentation.  It does exactly what it's been configured to do, nothing more, nothing less.  A production environment is made up of a set of production servers handling various functions.  Whether they're web servers, app servers, compute servers, db servers, or comms/queueing servers, each production server is a simple combination of three things:

  • a vetted version of your application/website/service/database/whatever
  • a vetted copy of each and every non-stock (tuned) configuration file
  • a base OS - hopefully one that's stock, but definitely one that's been proven stable

The sum total of those three things makes a production server.   

Example Production Environment "Cook Book":
db application + db_master_config + stock RHEL6 server = master-database-server
db application + db_tapebackup_slave_config + stock RHEL6 server = backup-slave-database-server
db application + db_failover_slave_config + stock RHEL6 server = failover-slave-database-server
payment_gateway_if + comm_server_tuning + stock RHEL6 server = payment_gateway
payment_gateway_if + failover_comm_server_tuning + stock RHEL6 server = payment_gateway
website content + webhead_config + stock RHEL6 server = web1
website content + webhead_config + stock RHEL6 server = web2
website content + webhead_config + stock RHEL6 server = web3
load_balancer_config + stock cloud loadbalancer (LBaaS) = production_loadbalancer
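
One way to keep a cookbook like that honest is to write it down as data rather than prose.  Here's a hedged sketch in Python - the names are simply the ones from the table above, and nothing here is a real deployment tool:

# Hypothetical rendering of the production cookbook as data:
# (role, application, config overlay, base image)
PRODUCTION = [
    ("master-database-server",          "db_application",     "db_master_config",            "stock RHEL6"),
    ("backup-slave-database-server",    "db_application",     "db_tapebackup_slave_config",  "stock RHEL6"),
    ("failover-slave-database-server",  "db_application",     "db_failover_slave_config",    "stock RHEL6"),
    ("payment_gateway",                 "payment_gateway_if", "comm_server_tuning",          "stock RHEL6"),
    ("payment_gateway",                 "payment_gateway_if", "failover_comm_server_tuning", "stock RHEL6"),
    ("web1",                            "website_content",    "webhead_config",              "stock RHEL6"),
    ("web2",                            "website_content",    "webhead_config",              "stock RHEL6"),
    ("web3",                            "website_content",    "webhead_config",              "stock RHEL6"),
]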


Production Rule #2:  Never log in on your production servers.

Notice there's no mention of "custom tweaking" or "post-install tuning" or the like in the recipes I listed in the example above.  That's because there really must be none.  Human beings NEVER LOG IN on production servers.  If they are, they're almost certainly breaking rule #1 and they're definitely breaking rule #2.  Early on, as you're putting the solution in place, it may be handy to use ssh to remotely run a command or two - but if you want to ensure the correctness of the production environment, make sure those are non-intrusive, "read only" monitoring operations.

The moment you break rule #2, you'll be setting yourself up for a conundrum if/when there is a problem with the production environment.  You'll then need to answer the question:  "was it what I did in there, or was it something in the production push that's the root cause of the problem?"

If you've never logged in on the production servers, you then KNOW it was something in the production push that caused the problem.

How, then, do you arrive at a reasonable solution - without over-spending on servers, memory, storage, licenses, etc. - if you don't tune your production environment?  You tune your staging environment instead.

Staging.

There's a term fewer people have heard, but it's just as important as "production".  Every good production environment has at least one staging environment.

Ideally, a staging environment duplicates the production environment.  If you're hesitant to jump straight to that, you can introduce less redundancy than the production environment has - but you're opening up the possibility of mis-deployment if you do.

Example "barely shorted" staging environment:
db application + db_master_config + stock RHEL6 server = master-database-server
db application + db_tapebackup_slave_config + stock RHEL6 server = backup-slave-database-server
payment_gateway_if + comm_server_tuning + stock RHEL6 server = payment_gateway
website content + webhead_config + stock RHEL6 server = web1
website content + webhead_config + stock RHEL6 server = web2
load_balancer_config + stock cloud loadbalancer (LBaaS) = staging_loadbalancer

The idea with a staging environment is that it's the destination for changes to applications, website content, and configuration before they go into production.


Staging Rule #1: A staging environment is something you don't touch.  

(well... after it's been set up, debugged, and is working properly, anyway)

It's a "devops" world now - sysadmin config changes need to be versioned and managed just as carefully as code changes.  Ideally all of the changes are committed to a source code repository - ideally something like git.  

Once a week, or more often if needed, the entire list of changes being made for all components and configurations is reviewed and vetted/approved.  Then, all of those changes are applied to the staging environment, backing things up first if needed.

With that, you've achieved a "staging push" - combining all of the changes to all of the functionality and configuration for all of the various solution components and applying them to the staging environment.  At that point automated testing begins against the solution that you've just put in place in the staging environment.

Real-world traffic to the solution is either simulated or exactly reproduced, and the performance and resource utilization of all servers implementing staging is logged.  After a period of some days of testing (yes, multiple days - ideally simulating a full week of operations), summaries and statistics can be generated from the resource utilization logs.

If there are any ill side-effects of the most recent push, they'll be evident because the resource utilization statistics will show that things got worse.  For example, if there's a badly coded webpage introduced which is causing apache processes to balloon up in size, the memory statistics on the webheads will be notably worse than they were for the previous staging push.

Staging Rule #2:  Never log in on your staging servers.

If it's done right by suitably lazy programmers, your staging environment will run all of this testing automatically, monitor resources automatically, compare the previous and current statistics at the end of the test run, and email you the results.
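
There's nothing exotic about that comparison step.  Here's a hedged sketch of the idea in Python - the metric names and file layout are invented, and a real setup would pull numbers from whatever monitoring system you already have:

import json

def load_stats(path):
    # Hypothetical per-push summary, e.g. {"web1_rss_mb": 512, "db_iops": 140}
    with open(path) as f:
        return json.load(f)

previous = load_stats("staging-push-41.json")
current  = load_stats("staging-push-42.json")

for metric, old in sorted(previous.items()):
    new = current.get(metric)
    if new is None or old == 0:
        continue
    change = (new - old) / old * 100
    flag = "  <-- worse than last push?" if change > 10 else ""
    print("%-16s %10.1f -> %10.1f (%+.1f%%)%s" % (metric, old, new, change, flag))

Wire the output of something like that into an email and you have the "hands off" report described above.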

You can only be 100% sure of the results of the staging test if it was entirely "hands off".  Otherwise if/when something goes wrong (either in production or in staging) you'll be left wondering if it was due to the push, or due to whatever bespoke steps you took in staging.  That's not a good feeling, and it's not a fun discussion with your board of directors either.

More twenty-first-century devops best practices

If you'd like to learn more, I can recommend Allspaw and Robbins' Web Operations: Keeping the Data On Time.  Now it's your turn - what's your favorite "devops" runbook/rule-book?


Tuesday, July 02, 2013

Don't Fear the Mongo

NOSQL is a term that strikes fear into the hearts of many people with traditional relational database skills.

How can a database not use SQL?  How could that possibly perform well?  It does!  And it's not hard to learn, either.  Don't worry about performance - just dive in.  http://education.10gen.com is offering free classes in Mongo - and they're totally worth your time.

I'm partway through "M101P MongoDB for Developers" and I now feel relatively comfortable addressing NOSQL-related concerns.  I'm also enrolled in an upcoming "MongoDB for DBAs" class.

Similar to MySQL, MongoDB runs as a service process.  You connect using a client program, "mongo", or by using a MongoDB driver library and making calls from your favorite programming language.  The class I'm in right now uses Python, which is pretty straightforward to learn - but they give you most of the Python code for the homework exercises already, and you only really need to write a few lines of calls that use the MongoDB API for the various assignments.
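
For example, a collection copy like the one in the shell demo below takes only a few lines with the pymongo driver.  This is a hedged sketch - it assumes a local mongod and a reasonably current pymongo:

from pymongo import MongoClient

# Connect to a local mongod and pick the same database used in the shell demo.
client = MongoClient("mongodb://localhost:27017/")
db = client["students"]

# Copy every document from grades into gradesCopy, one insert per document.
for doc in db.grades.find():
    db.gradesCopy.insert_one(doc)

print(db.grades.count_documents({}))       # expect 600
print(db.gradesCopy.count_documents({}))   # expect 600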

If you know JavaScript and JSON notation, you're 80% of the way to knowing MongoDB already.  Here's a quick demo of using the mongo shell:

bash-3.2$ mongo
MongoDB shell version: 2.4.4
connecting to: test
> show dbs
blog 0.203125GB
local 0.078125GB
m101 0.203125GB
students 0.203125GB
test 0.203125GB
> use students
switched to db students
> db.grades.find().forEach(  function(one){db.gradesCopy.insert(one)});
> db.grades.count()
600
> db.gradesCopy.count()
600
> quit()
bash-3.2$ 
Pretty straightforward, huh?   Don't fear the mongo!