Friday, August 09, 2013

Simple Counting Sort, in Python

For fun, I coded this up a while ago - both to review the details of implementing a counting sort (which is super-fast, since it does no comparisons!) and to review the details of implementing a Linux "pipeline" program in Python.

Enjoy!
-pbr


https://gist.github.com/PaulReiber/6193485

https://gist.githubusercontent.com/PaulReiber/6193485/raw/7678f484648ad89608166c158ab8b00b98735394/countsort.py
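
The gist linked above has the real thing. Purely as an illustration of the idea (this is not the code from the gist), here's a minimal sketch of a counting sort for non-negative integers, plus a filter-style function in the "pipeline" spirit. The names counting_sort and sort_stream are made up for this example.

```python
def counting_sort(values, max_value=None):
    """Sort non-negative integers by tallying them instead of comparing pairs."""
    values = list(values)
    if not values:
        return []
    if any(v < 0 for v in values):
        raise ValueError("this sketch only handles non-negative integers")
    if max_value is None:
        # Finding the range is the only place the values get compared.
        max_value = max(values)
    counts = [0] * (max_value + 1)      # one bucket per possible value
    for v in values:
        counts[v] += 1
    result = []
    for value, count in enumerate(counts):
        result.extend([value] * count)  # emit each value as often as it was seen
    return result


def sort_stream(infile, outfile):
    """Pipeline-style filter: one integer per input line in, sorted lines out."""
    numbers = [int(line) for line in infile if line.strip()]
    for n in counting_sort(numbers):
        outfile.write(f"{n}\n")
```

Calling sort_stream(sys.stdin, sys.stdout) from a script would let it sit in the middle of a shell pipeline. The trade-off is memory: the counts list needs one slot per possible value, so this only makes sense when the range of values is modest.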

Topping 100 posts

It's an arbitrary number, 100.  

Ten tens.  Only pertinent to us humans due to the count of digits we have on our hands and on our toes.  Mathematically, a base is arbitrary, and its square is just as arbitrary.  However, it seems reasonable to highlight that there are indeed over 100 relevant, hopefully helpful Linux-related posts on this blog at this point!

My posts of my notes/impressions of various Linux distros have pushed the count of posts I've made on this blog up over 100, in an otherwise unceremonious fashion, but it feels rather good to have accomplished that, and I look forward to continuing posting helpful stuff about Linux.
(The above confirms... people who abuse punctuation deserve a long sentence.)

If you have questions - anything you've "always wondered about" regarding Linux, anything perplexing, incomprehensible, or impenetrable... please don't hesitate to reach out and ask me.  Most of the posts I've written have resulted from simple questions about how to best use Linux.

Thanks for reading!
-Paul

Exploring ArchLinux 2012

Sensibly, Arch includes vim.  At least it's not forcing us to use 'nano'.

However, package installs were problematic.

# pacman -S mlocate
:: The following packages should be upgraded first :
    pacman
:: Do you want to cancel the current operation
:: and upgrade these packages now? [Y/n] Y

resolving dependencies...
looking for inter-conflicts...

Targets (6): bash-4.2.045-1  filesystem-2013.03-2  glibc-2.17-5  libarchive-3.1.2-1  linux-api-headers-3.8.4-1  pacman-4.1.0-2

Total Installed Size:   51.14 MiB
Net Upgrade Size:       -0.99 MiB

Proceed with installation? [Y/n] Y
(6/6) checking package integrity                                                                                           [##########################################################################] 100%
(6/6) loading package files                                                                                                [##########################################################################] 100%
(6/6) checking for file conflicts                                                                                          [##########################################################################] 100%
error: failed to commit transaction (conflicting files)
filesystem: /etc/profile.d/locale.sh exists in filesystem
filesystem: /usr/share/man/man7/archlinux.7.gz exists in filesystem
Errors occurred, no packages were upgraded.
#

These are not the sorts of problems I enjoy - it's a relatively pointless challenge to figure out how to use something that doesn't seem to want to be used.

Searching about on the internet for answers was no more enjoyable, and just as fruitless.  Evidently, people who are expert with arch wish to remain an exclusive club, and have little interest in communicating HOW to use the package manager or otherwise become proficient with the distro.

I can't say that I would recommend Arch, based on my experiences with it to date.

Exploring Debian 6 - Squeeze

Debian's a reasonable distro.  apt-get installed python, gcc, make, and vim quite handily.

I was a little disappointed to find it doesn't have pstree - and further to that:

# aptitude search pstree
#

I'm perplexed.  No match on a search for pstree?  Doesn't seem reasonable.  Am I missing something? Or are they?  It's a bit frustrating.

ps x --forest

...it's just not the same.

Otherwise, a very reasonable distro.

Exploring Gentoo 12


What do they have against Vim?  
Vim is the default Linux/UNIX editor.  Excluding it on a distro is bordering on criminal.

emerge sys-apps/mlocate

Nothing can be standard, Gentoo must differentiate.  One cannot simply install, or update, or "get" a package, one must emerge it.

However, emerge worked, at least for "locate"... almost as straightforwardly as with RedHat/CentOS/Fedora/Ubuntu.

I found myself in "nano" having issued "visudo".  That's just wrong.   Let's see - can I install vim?

# emerge vim
 * Last emerge --sync was 348d 11h 31m 40s ago.
Calculating dependencies... done!

>>> Verifying ebuild manifests

>>> Starting parallel fetch

>>> Emerging (1 of 6) app-admin/eselect-vi-1.1.7-r1
 * Fetching files in the background. To view fetch progress, run
 * `tail -f /var/log/emerge-fetch.log` in another terminal.
 * vi.eselect-1.1.7.bz2 SHA256 SHA512 WHIRLPOOL size ;-) ...                                                                                                                                         [ ok ]
>>> Unpacking source...
>>> Unpacking vi.eselect-1.1.7.bz2 to /var/tmp/portage/app-admin/eselect-vi-1.1.7-r1/work
>>> Source unpacked in /var/tmp/portage/app-admin/eselect-vi-1.1.7-r1/work
>>> Preparing source in /var/tmp/portage/app-admin/eselect-vi-1.1.7-r1/work ...
 * Applying eselect-vi-1.1.7-prefix.patch ...                                                                                                                                                        [ ok ]
>>> Source prepared.
>>> Configuring source in /var/tmp/portage/app-admin/eselect-vi-1.1.7-r1/work ...
>>> Source configured.
>>> Compiling source in /var/tmp/portage/app-admin/eselect-vi-1.1.7-r1/work ...
>>> Source compiled.
>>> Test phase [not enabled]: app-admin/eselect-vi-1.1.7-r1

>>> Install eselect-vi-1.1.7-r1 into /var/tmp/portage/app-admin/eselect-vi-1.1.7-r1/image/ category app-admin
>>> Completed installing eselect-vi-1.1.7-r1 into /var/tmp/portage/app-admin/eselect-vi-1.1.7-r1/image/


>>> Installing (1 of 6) app-admin/eselect-vi-1.1.7-r1

>>> Emerging (2 of 6) app-admin/eselect-ctags-1.13
>>> Downloading 'http://mirror.usu.edu/mirrors/gentoo/distfiles/eselect-emacs-1.13.tar.bz2'
--2013-07-28 03:06:55--  http://mirror.usu.edu/mirrors/gentoo/distfiles/eselect-emacs-1.13.tar.bz2
Resolving mirror.usu.edu... 129.123.104.64
Connecting to mirror.usu.edu|129.123.104.64|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2013-07-28 03:06:55 ERROR 404: Not Found.

>>> Downloading 'http://mirror.mcs.anl.gov/pub/gentoo/distfiles/eselect-emacs-1.13.tar.bz2'
--2013-07-28 03:06:55--  http://mirror.mcs.anl.gov/pub/gentoo/distfiles/eselect-emacs-1.13.tar.bz2
Resolving mirror.mcs.anl.gov... 2620:0:dc0:1800:214:4fff:fe7d:1b9, 146.137.96.7
Connecting to mirror.mcs.anl.gov|2620:0:dc0:1800:214:4fff:fe7d:1b9|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2013-07-28 03:06:55 ERROR 404: Not Found.

>>> Downloading 'http://gentoo.cities.uiuc.edu/pub/gentoo/distfiles/eselect-emacs-1.13.tar.bz2'
--2013-07-28 03:06:55--  http://gentoo.cities.uiuc.edu/pub/gentoo/distfiles/eselect-emacs-1.13.tar.bz2
Resolving gentoo.cities.uiuc.edu... failed: Name or service not known.
wget: unable to resolve host address `gentoo.cities.uiuc.edu'
>>> Downloading 'http://gentoo.osuosl.org/distfiles/eselect-emacs-1.13.tar.bz2'
--2013-07-28 03:06:56--  http://gentoo.osuosl.org/distfiles/eselect-emacs-1.13.tar.bz2
Resolving gentoo.osuosl.org... 140.211.166.134
Connecting to gentoo.osuosl.org|140.211.166.134|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2013-07-28 03:06:56 ERROR 404: Not Found.

>>> Downloading 'http://ftp.halifax.rwth-aachen.de/gentoo/distfiles/eselect-emacs-1.13.tar.bz2'
--2013-07-28 03:06:56--  http://ftp.halifax.rwth-aachen.de/gentoo/distfiles/eselect-emacs-1.13.tar.bz2
Resolving ftp.halifax.rwth-aachen.de... 137.226.34.42
Connecting to ftp.halifax.rwth-aachen.de|137.226.34.42|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2013-07-28 03:06:56 ERROR 404: Not Found.

>>> Downloading 'http://gentoo.ussg.indiana.edu/distfiles/eselect-emacs-1.13.tar.bz2'
--2013-07-28 03:06:56--  http://gentoo.ussg.indiana.edu/distfiles/eselect-emacs-1.13.tar.bz2
Resolving gentoo.ussg.indiana.edu... 156.56.247.195
Connecting to gentoo.ussg.indiana.edu|156.56.247.195|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2013-07-28 03:06:57 ERROR 404: Not Found.

>>> Downloading 'http://gentoo-distfiles.mirrors.tds.net/distfiles/eselect-emacs-1.13.tar.bz2'
--2013-07-28 03:06:57--  http://gentoo-distfiles.mirrors.tds.net/distfiles/eselect-emacs-1.13.tar.bz2
Resolving gentoo-distfiles.mirrors.tds.net... 216.165.129.135
Connecting to gentoo-distfiles.mirrors.tds.net|216.165.129.135|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2013-07-28 03:06:57 ERROR 404: Not Found.

!!! Couldn't download 'eselect-emacs-1.13.tar.bz2'. Aborting.
 * Fetch failed for 'app-admin/eselect-ctags-1.13', Log file:
 *  '/var/tmp/portage/app-admin/eselect-ctags-1.13/temp/build.log'

>>> Failed to emerge app-admin/eselect-ctags-1.13, Log file:

>>>  '/var/tmp/portage/app-admin/eselect-ctags-1.13/temp/build.log'

 * Messages for package app-admin/eselect-ctags-1.13:

 * Fetch failed for 'app-admin/eselect-ctags-1.13', Log file:
 *  '/var/tmp/portage/app-admin/eselect-ctags-1.13/temp/build.log'

 * GNU info directory index is up-to-date.

 * IMPORTANT: 2 config files in '/etc' need updating.
 * See the CONFIGURATION FILES section of the emerge
 * man page to learn how to update config files.

The above experience gave me ZERO faith in gentoo.  I had asked to install Vim, yet the errors are about it being unable to install emacs.  Poignant, yet so totally inappropriate!

I can't say I've had a reasonable experience with Gentoo 12 so far.  It hasn't been totally unwieldy, but it's been far from malleable.

Exploring Fedora 17

For anyone who is already familiar with RedHat/CentOS, this is a painless distro to adopt - you'll find few if any surprises.

As with the other Rackspace Cloud distros, this one's lean - but not TOO lean.  I found myself needing to install "locate" and "gcc":

yum install mlocate
updatedb
yum install gcc

It makes for a relatively boring blog post, but... I had so few problems with Fedora 17 that I really have nothing more to report.  It just works.

Exploring OpenSUSE 12

Yet more documentation of my Exploring Variants of Linux in the Rackspace Cloud

OpenSUSE package management is via YaST.  YaST wants to be interactive.  It might be possible to do things from the command line but it seems to work best interactively.

This makes it harder to review what you've done since there's no record of exact command line options.  My root history simply says "yast", with no indication of what I installed.

As such, I don't have a good audit trail for what I've done on the OpenSUSE server.

However, it gave me very few problems and had few if any issues.

OpenSUSE will surprise you if/when you run "pstree -paul"

You'll find init has been replaced by something called "systemd".  As with Ubuntu's "upstart", OpenSUSE's "systemd" replaces the tried-and-true system init scripts with something new and wonderful.

man systemd.special systemd.unit systemd.service systemd.socket systemd.target

...there's quite a bit of learning and reading to be done.  Another day.   As with Ubuntu Server, were I forced to work with and manage a Linux other than RedHat/CentOS, I'd be quite happy with OpenSUSE.

Exploring Ubuntu 12 Server

As I promised in Exploring Variants of Linux, I'm following up with my impressions of, and notes on, various Linux distros.  This is the first of those "general impressions".

If you're a RedHat/CentOS centric Linux user, hopefully these posts will help you over the first few hurdles you might find on the various other distros.

As with all of the Rackspace Cloud distros, the Ubuntu 12 Server distro is very lean.  I noticed a few relatively important tools were not installed.  They're not "required" but they're really handy, so my first step was to install them:

apt-get install mlocate
updatedb
apt-get install make
apt-get install gcc

I noticed only one service running which is not needed - "whoopsie" - so I turned it off.

root@pbr-ubuntu12:~# cat /etc/default/whoopsie 
[General]
report_crashes=true
root@pbr-ubuntu12:~# sed -i 's/report_crashes=true/report_crashes=false/' /etc/default/whoopsie 
root@pbr-ubuntu12:~# cat /etc/default/whoopsie 
[General]
report_crashes=false
root@pbr-ubuntu12:~# sudo service whoopsie stop
root@pbr-ubuntu12:~# 

Upgrading to the latest Ubuntu was very lengthy and verbose, including a full-screen interaction with a pink background... but it was functional:

apt-get update
apt-get upgrade
do-release-upgrade

Upstart's quite a bit different from the standard SysV init script approach, but easy enough to get accustomed to.

man upstart-events

...neat.  Upstart's pretty darn powerful, in fact.

General impression?  If I were forced to use something other than a RedHat/CentOS distro, I'd be quite happy with Ubuntu Server.

Friday, July 26, 2013

Exploring Variants of Linux

Rackspace Cloud Linux

One of the very cool aspects of the Rackspace Cloud is the number of Linux/GNU variants.  As I'm already very intimately familiar with Red Hat / CentOS, I decided to take a look at the other seven.

Dealing with an array of cloud servers can be a little tedious, so I wrote a little program to run commands on all of them for me:  github_gist:/PaulReiber/run

This article documents my exploration of Ubuntu, Arch, FreeBSD, Debian, Gentoo, Fedora, and openSUSE.
Ubuntu
CentOS
Red Hat
Arch
Debian
Fedora
FreeBSD
Gentoo
openSUSE

$ run 'hostname'|egrep ^pbr\|stdout
pbr_ubuntu12.10_512
  @stdout="pbr-ubuntu12.10-512\r\n">]
pbr_freebsd9_512
  @stdout="pbr-freebsd9-512\r\n">]
pbr_opensuse12.1_512
  @stdout="pbr-opensuse12.1-512\r\n">]
pbr_fedora17_512
  @stdout="pbr-fedora17-512\r\n">]
pbr_gentoo12.3_512
  @stdout="pbr-gentoo12.3-512\r\n">]
pbr_debian6_512
  @stdout="pbr-debian6-512\r\n">]
pbr_arch2012.08_512
  @stdout="bash: hostname: command not found\r\n">]


Grepping JSON output isn't pretty, but you can see from the above how my program "run" works - it prints out the name of each cloud server then the JSON output of running a command via the ruby fog "ssh" API.

The output above also highlights something I ran into a LOT - various "standard" commands are simply not present on some distros!  Let's take a look - is "vim" available on all of the distros?

$ run 'which vim'|egrep ^pbr\|stdout
pbr_ubuntu12.10_512
  @stdout="/usr/bin/vim\r\n">]
pbr_freebsd9_512
  @stdout="vim: Command not found.\r\n">]
pbr_opensuse12.1_512
  @stdout="/usr/bin/vim\r\n">]
pbr_fedora17_512
  @stdout="/bin/vim\r\n">]
pbr_gentoo12.3_512
  @stdout="which: no vim in (/usr/bin:/bin:/usr/sbin:/sbin)\r\n">]
pbr_debian6_512
  @stdout="">]
pbr_arch2012.08_512
  @stdout="/usr/bin/vim\r\n">]


From this it seems that FreeBSD, Gentoo, and Debian don't come with vim pre-installed.   How about 'make'?

$ run 'which make'|egrep ^pbr\|stdout
pbr_ubuntu12.10_512
  @stdout="/usr/bin/make\r\n">]
pbr_freebsd9_512
  @stdout="/usr/bin/make\r\n">]
pbr_opensuse12.1_512
  @stdout="/usr/bin/make\r\n">]
pbr_fedora17_512
  @stdout="/bin/make\r\n">]
pbr_gentoo12.3_512
  @stdout="/usr/bin/make\r\n">]
pbr_debian6_512
  @stdout="/usr/bin/make\r\n">]
pbr_arch2012.08_512
  @stdout="which: no make in (/usr/bin:/bin:/usr/sbin:/sbin)\r\n">]

No make command is available on Arch.  Arch is not for the faint of heart, I guess.

I'll be updating this blog post, and adding additional posts as I continue exploration of these various Linux distros.

Wednesday, July 24, 2013

Nameless Temporary Files

Linux 3.11 rc2 


Here's an interesting snippet from Linus's announcement post regarding Linux 3.11 rc2:

 (a) the O_TMPFILE flag that is new to 3.11 has been going through a
few ABI/API cleanups (and a few fixes to the implementation too), but
I think we're done now. So if you're interested in the concept of
unnamed temporary files, go ahead and test it out. The lack of name
not only gets rid of races/complications with filename generation, it
can make the whole thing more efficient since you don't have the
directory operations that can cause serializing IO etc.

Interesting idea!  Temporary files that aren't burdened with having to have filenames.
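
For the curious, here's a hedged sketch of what asking for one of these files could look like from Python. It assumes a Linux build of Python that exposes os.O_TMPFILE, and it falls back to the standard library's tempfile.TemporaryFile when the flag is missing or the kernel/filesystem refuses it. The function name open_nameless_tempfile is made up for this example.

```python
import os
import tempfile


def open_nameless_tempfile(directory):
    """Return a read/write file object that has no name inside `directory`."""
    flag = getattr(os, "O_TMPFILE", None)   # only defined on Linux builds of Python
    if flag is not None:
        try:
            # With O_TMPFILE the path names a directory, not the file itself.
            fd = os.open(directory, flag | os.O_RDWR, 0o600)
            return os.fdopen(fd, "w+b")
        except OSError:
            pass   # older kernel, or a filesystem without O_TMPFILE support
    # Fallback: let the standard library pick its own strategy.
    return tempfile.TemporaryFile(dir=directory)
```

Writing to the returned file and then listing the directory should show no new entry, which is exactly why this kind of file is hard to spot later.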

It will be some time before sysadmins see this feature in production, but in this case I think it's best to get the word out early, especially since this new feature could cause "mystery drive space exhaustion".

Right now, the classic cause of a discrepancy between "df" and "du" numbers is deleted files with still-open file descriptors.  With this new feature, it appears that nameless temporary files will join the ranks of hard-to-spot possible root causes of space exhaustion.
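
As an aside, the deleted-but-still-open case can be hunted down without lsof by walking /proc. This is a minimal sketch, relying on Linux marking the link target of such a descriptor with a trailing " (deleted)". The function name and the proc_root parameter are made up for this example, and whether nameless temporary files show up the same way is something to verify on a 3.11 or later kernel.

```python
import os


def find_deleted_open_files(proc_root="/proc"):
    """Yield (pid, fd, target) for descriptors whose target is marked deleted."""
    for pid in sorted(os.listdir(proc_root)):
        if not pid.isdigit():
            continue                      # skip /proc/self, /proc/meminfo, ...
        fd_dir = os.path.join(proc_root, pid, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue                      # process exited, or no permission
        for fd in sorted(fds, key=int):
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue                  # descriptor closed while we looked
            if target.endswith(" (deleted)"):
                yield int(pid), int(fd), target
```

Run as root to see every process; as a regular user it only reports on processes you own.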

It's not clear how these new files will be identifiable / distinguished, for example, in the output of "lsof".  
As I learn more about this new feature, I'll be sure to write about it.

Sunday, July 07, 2013

Best Practices: It's Freezing in the Cloud

Production.  

It's a term many people have heard of, but what does it mean?   A lot of people have been asking me about this lately, so I'm happy to give an overview of some best practices for solution management.


Production Rule #1: A production environment is something you don't touch.  

Its configuration is frozen - it's not up for experimentation.  It does exactly what it's been configured to do, nothing more, nothing less.  A production environment is composed of a set of production servers handling various functions.  Whether they're web servers, app servers, compute servers, db servers, or comms/queueing servers, each production server is a simple combination of three things:

  • a vetted version of your application/website/service/database/whatever
  • a vetted copy of each and every non-stock (tuned) configuration file
  • a base OS - hopefully one that's stock, but definitely one that's been proven stable

The sum total of those three things makes a production server.   

Example Production Environment "Cook Book":
db application + db_master_config + stock RHEL6 server = master-database-server
db application + db_tapebackup_slave_config + stock RHEL6 server = backup-slave-database-server
db application + db_failover_slave_config + stock RHEL6 server = failover-slave-database-server
payment_gateway_if + comm_server_tuning + stock RHEL6 server = payment_gateway
payment_gateway_if + failover_comm_server_tuning + stock RHEL6 server = payment_gateway
website content + webhead_config + stock RHEL6 server = web1
website content + webhead_config + stock RHEL6 server = web2
website content + webhead_config + stock RHEL6 server = web3
load_balancer_config + stock cloud loadbalancer (LBaaS) = production_loadbalancer


Production Rule #2:  Never log in on your production servers.

Notice there's no mention of "custom tweaking" or "post-install tuning" or the like in the recipes I listed in the example above.  That's because there really must be none. Human beings NEVER LOG IN on production servers.  If they are doing that, they're almost certainly breaking rule #1 and they're definitely breaking rule #2.  Early on, as you're putting the solution in place, it may be handy to use ssh to remotely run a command or two - but you must ensure you're only running non-intrusive monitoring - "read only" operations - if you wish to ensure the correctness of the production environment.

The moment you break rule #2, you'll be setting yourself up for a conundrum if/when there is a problem with the production environment.  You'll then need to answer the question:  "was it what I did in there, or was it something in the production push that's the root cause of the problem?"

If you've never logged in on the production servers, you then KNOW it was something in the production push that caused the problem.

How then do you arrive at a reasonable solution, not over-spending on servers, memory, storage, licenses, etc. if you don't tune your production environment?   You tune your staging environment instead.  

Staging.

There's a term fewer people have heard, but it's equally as important as "production".  Every good production environment has at least one staging environment.

Ideally, a staging environment duplicates the production environment.  If you're hesitant to jump straight to that, you can introduce less redundancy than the production environment has - but you're opening up the possibility of mis-deployment if you do.

Example "barely shorted" staging environment:
db application + db_master_config + stock RHEL6 server = master-database-server
db application + db_tapebackup_slave_config + stock RHEL6 server = backup-slave-database-server
payment_gateway_if + comm_server_tuning + stock RHEL6 server = payment_gateway
website content + webhead_config + stock RHEL6 server = web1
website content + webhead_config + stock RHEL6 server = web2
load_balancer_config + stock cloud loadbalancer (LBaaS) = production_loadbalancer
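
To see exactly where this staging environment is "shorted", the two cook books can be written down as data and diffed. This is only a sketch: the recipe tuples are copied from the two lists above, while the names PRODUCTION, STAGING, and missing_from_staging are made up for this example.

```python
from collections import Counter

RHEL = "stock RHEL6 server"

PRODUCTION = [
    ("db application", "db_master_config", RHEL),
    ("db application", "db_tapebackup_slave_config", RHEL),
    ("db application", "db_failover_slave_config", RHEL),
    ("payment_gateway_if", "comm_server_tuning", RHEL),
    ("payment_gateway_if", "failover_comm_server_tuning", RHEL),
    ("website content", "webhead_config", RHEL),
    ("website content", "webhead_config", RHEL),
    ("website content", "webhead_config", RHEL),
    ("load_balancer_config", "stock cloud loadbalancer (LBaaS)"),
]

STAGING = [
    ("db application", "db_master_config", RHEL),
    ("db application", "db_tapebackup_slave_config", RHEL),
    ("payment_gateway_if", "comm_server_tuning", RHEL),
    ("website content", "webhead_config", RHEL),
    ("website content", "webhead_config", RHEL),
    ("load_balancer_config", "stock cloud loadbalancer (LBaaS)"),
]


def missing_from_staging(production, staging):
    """Return the recipes (with counts) that production has and staging lacks."""
    # Counter subtraction keeps only the positive differences.
    return Counter(production) - Counter(staging)
```

For the lists above, the difference is the failover database slave, the failover payment gateway, and one webhead - the redundancy that staging gives up, and therefore the part of production that a staging push never exercises.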

The idea with a staging environment is that it's a destination for changes to applications, website content, and configuration prior to their going into production.


Staging Rule #1: A staging environment is something you don't touch.  

(well... after it's been set up and debugged and is working properly, anyway)

It's a "devops" world now - sysadmin config changes need to be versioned and managed just as carefully as code changes.  Ideally all of the changes are committed to a source code repository, using something like git.  

Once a week, or more often if needed, the entire list of changes being made for all components and configurations is reviewed and vetted/approved.  Then, all of those changes are applied to the staging environment, backing things up first if needed.

With that, you've achieved a "staging push" - combining all of the changes to all of the functionality and configuration for all of the various solution components and applying them to the staging environment.  At that point automated testing begins against the solution that you've just put in place in the staging environment.

Real-world traffic to the solution is either simulated or exactly reproduced, and the performance and resource utilization of all servers implementing staging is logged.  After some days of testing (yes, multiple days - ideally simulating a full week of operations), summaries and statistics can be generated from the resource utilization logs.

If there are any ill side-effects of the most recent push, they'll be evident because the resource utilization statistics will show that things got worse.  For example, if there's a badly coded webpage introduced which is causing apache processes to balloon up in size, the memory statistics on the webheads will be notably worse than they were for the previous staging push.

Staging Rule #2:  Never log in on your staging servers.

If it's done right by suitably lazy programmers, your staging environment will be running all of this testing automatically, monitoring resources automatically, comparing the previous and current statistics resulting from testing at the end of the test run, and emailing you with the results.
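
The "compare the previous and current statistics" step can be very small. Here's a minimal sketch, assuming each test run has already been boiled down to a dict of per-host samples (say, apache memory use in MB). The function name, the data layout, and the 10% tolerance are all assumptions made for this example.

```python
from statistics import mean


def find_regressions(previous, current, tolerance=0.10):
    """Return {host: (old_mean, new_mean)} where the mean grew past tolerance."""
    regressions = {}
    for host, samples in current.items():
        if not samples or not previous.get(host):
            continue                      # nothing to compare against
        old, new = mean(previous[host]), mean(samples)
        if new > old * (1 + tolerance):   # e.g. more than 10% worse than last push
            regressions[host] = (old, new)
    return regressions
```

Anything this returns is what goes into the email at the end of the test run; an empty dict means the push didn't make resource usage noticeably worse.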

You can only be 100% sure of the results of the staging test if it was entirely "hands off".  Otherwise if/when something goes wrong (either in production or in staging) you'll be left wondering if it was due to the push, or due to whatever bespoke steps you took in staging.  That's not a good feeling, and it's not a fun discussion with your board of directors either.

More Twenty-first century devops best practices

If you'd like to learn more, I can recommend Allspaw and Robbins: Web Operations - Keeping the Data on Time.  Now it's your turn - what's your favorite "devops" runbook/rule-book? 


Tuesday, July 02, 2013

Don't Fear the Mongo

NOSQL is a term that strikes fear in the heart of many with traditional relational database skills.

How can a database not use SQL?  How could that possibly perform well?  It does!  And it's not hard to learn, either. Don't worry about performance - just dive in.  http://education.10gen.com is offering free classes in Mongo - and they're totally worth your time.

I'm partially through "M101P MongoDB for Developers" and I now feel relatively comfortable addressing NOSQL related concerns.  I'm also enrolled in an upcoming "MongoDB for DBAs" class.

Similar to MySQL, MongoDB is a service process.  You connect using a client program, "mongo", or by using a MongoDB library and making calls from your favorite programming language.  The class I'm in right now uses Python, which is pretty straightforward to learn - but they give you most of the Python code for the various homework exercises already, and you only really need to write a few lines of calls that use the MongoDB API for the various assignments.

If you know javascript and JSON notation, you're 80% of the way to knowing MongoDB already.  Here's a quick demo of using mongo:

bash-3.2$ mongo
MongoDB shell version: 2.4.4
connecting to: test
> show dbs
blog 0.203125GB
local 0.078125GB
m101 0.203125GB
students 0.203125GB
test 0.203125GB
> use students
switched to db students
> db.grades.find().forEach(  function(one){db.gradesCopy.insert(one)});
> db.grades.count()
600
> db.gradesCopy.count()
600
> quit()
bash-3.2$ 
Pretty straightforward, huh?   Don't fear the mongo!  
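
The same copy can be done from Python. Here's a minimal sketch, assuming PyMongo-style collection objects that offer find() and insert_one(); the function name copy_collection is made up for this example, and the commented usage assumes a mongod running locally with the "students" database from the demo above.

```python
def copy_collection(source, target):
    """Copy every document from one collection to another; return how many."""
    copied = 0
    for document in source.find():     # a bare find() iterates all documents
        target.insert_one(document)    # existing _id values are carried over
        copied += 1
    return copied


# Intended use with PyMongo against a local mongod (not executed here):
#   from pymongo import MongoClient
#   db = MongoClient()["students"]
#   copy_collection(db["grades"], db["gradesCopy"])
#   db["gradesCopy"].count_documents({})
```

Because the function only relies on find() and insert_one(), it can be tried out against simple stand-in objects before pointing it at a real database.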

Tuesday, June 25, 2013

Running commands on all of your cloud servers

I consider my cloud servers to be one big array of servers.

I decided to use "fog" - the Ruby API for the Rackspace Cloud - to build something to let me, in one step, run commands on all of the servers.  It turned out to be pretty straightforward.


You might have a different model - an array of web servers, another array of db servers, and another array of compute servers, for example.  If so, you can easily extend the code to work with your different groups of servers by querying for whatever differentiates them.
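
My program uses fog and Ruby. For anyone who'd rather not touch Ruby, here's a rough sketch of the same "one command, every server" idea in Python. It is not a port of my program: it assumes you already have a list of hostnames and key-based ssh access to each, and the names ssh_runner and run_everywhere are made up for this example.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ssh_runner(host, command):
    """Run `command` on `host` with the system ssh client; return its stdout."""
    done = subprocess.run(
        ["ssh", host, command],
        capture_output=True, text=True, timeout=60,
    )
    return done.stdout


def run_everywhere(hosts, command, runner=ssh_runner):
    """Run one command on every host in parallel; return {host: output}."""
    hosts = list(hosts)
    if not hosts:
        return {}
    with ThreadPoolExecutor(max_workers=min(8, len(hosts))) as pool:
        outputs = pool.map(lambda h: runner(h, command), hosts)
        return dict(zip(hosts, outputs))
```

Working with groups of servers is then just a matter of filtering the host list first, for example keeping only the names that start with "web" before calling run_everywhere.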

Monday, June 24, 2013

Migrating a website - sporadic performance

I had the opportunity today to help someone who had done an outstanding job of migrating a website into the cloud - but their new site's performance was sporadic and unpredictable.

I'll share both the technique I used to debug that, and the lessons learned.

The site was implemented over two webservers, with lsync handling content synchronization.  The web servers were behind a cloud load balancer.  They were sized right.  They weren't pegging the CPU, swapping, or I/O-intensive.  But... they weren't working right.  The site would load sometimes, and time out other times.

To see the website, I had to put the domain name and its new IP address in my local /etc/hosts file.  When I put the IP for the load balancer in my /etc/hosts file, I couldn't tell which apache child process was handling my request, because all of the connections were coming from the load balancer.  So, I picked one of the web servers - web1 - and changed my /etc/hosts details for the domain to point my browser straight to that web server - bypassing the load balancer.  That way, I could see which apache child process my browser had been hooked up to.

Lesson #1: bypass the load balancer for testing.

I used 'strace' to see what apache was doing.  It took a few tries, but I soon had a good idea of what was going on.  By the time you've got output from netstat, with the process ID, the work's already done - so how to strace that process super-fast?
alias s="strace -s 999 -p"
This way when netstat shows apache process 11245 is serving your IP, you can bust out:
"s 11245" and hit enter.  Voilà!   (it's possible to go even further with this, but let's keep it simple)

Lesson #2: don't give up - figure out how to not have to type a lot.
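
One way to go further is to let a script pick the process ID out of the netstat output for you. This is a minimal sketch, assuming the usual Linux "netstat -tnp" column layout (proto, recv-q, send-q, local address, foreign address, state, PID/program) and a process named "httpd". The function name pids_serving and the sample addresses are made up for this example.

```python
def pids_serving(netstat_output, client_ip, program="httpd"):
    """Return PIDs of `program` with an established connection from client_ip."""
    pids = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: proto recv-q send-q local foreign state pid/program
        if len(fields) < 7 or fields[5] != "ESTABLISHED":
            continue
        foreign_ip = fields[4].rsplit(":", 1)[0]   # drop the port number
        pid, _, name = fields[6].partition("/")
        if foreign_ip == client_ip and name == program and pid.isdigit():
            pids.append(int(pid))
    return pids


SAMPLE = """\
Proto Recv-Q Send-Q Local Address   Foreign Address     State       PID/Program name
tcp        0      0 10.0.0.5:80     203.0.113.7:51234   ESTABLISHED 11245/httpd
tcp        0      0 10.0.0.5:80     198.51.100.9:40022  ESTABLISHED 11250/httpd
tcp        0      0 10.0.0.5:22     203.0.113.7:50000   ESTABLISHED 900/sshd
"""
```

Feeding it the real netstat output and your own IP gives you the PID to hand to strace. On distros where the apache binary is called something else, pass that name as program.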

I saw apache contacting some IPs I was familiar with - the caching nameservers for the datacenter where the server lives.   Then I saw apache reach out and connect to an unfamiliar IP address.

What that means is that apache was looking up a domain name in DNS, then using the resulting IP.

I asked about that IP... and it turned out, it was the IP of where that website is CURRENTLY hosted.

I was helping to debug the NEW version of this site - but for whatever reason, the new code was reaching out to the OLD implementation of the website.

So, the "root cause" had been found.  However, what to do next?  I could have simply advised that the best solution would be to revise the code to use relative references.  Or, mentioned that it could use IP addresses instead of domain names.

Instead, I fixed the problem, right then and there.  

On both web servers, I added the domain name in /etc/hosts:
127.0.0.1 localhost localhost.localdomain thewebsite.com
That way, each machine considered that 127.0.0.1 was the proper IP address for the domain.  This had the added benefit that references to the domain from either web server wouldn't cause traffic through the load balancer.  I think it's an all-around good idea.

Lesson #3: servers that implement domains should consider themselves that domain.

By the way... the moment I edited /etc/hosts and fixed this, the site started to render super-fast, and the sporadic performance problem was gone.

My customer was so happy, he told me to tell my boss he said I could take the rest of the day off. (I didn't... but I loved the sentiment!)

Thursday, June 06, 2013

Nightly Maintenance and "Sorry Sites"

Servers need backups.  And, sometimes, there are nightly maintenance scripts that need to be run, for example dumping out all transactions, or importing orders or products.  Usually these maintenance tasks will be run from a cron job.

Often, these tasks impact the "production" website, or conversely, the "production" website often impacts these tasks.  Either way, sometimes it's best to get the site offline for a minute or two, to let the maintenance task run quickly and to completion, without competition.

I thought the approach below was totally obvious, but I've learned that a lot of people are really happy to learn how to do this sort of thing.

It's really straightforward for a cron job to also put a "Sorry Site" in place - a website that states "We're down for maintenance - please reload in a few minutes" or similar.   Here's a strategy for doing this.

Say your website document root is:
/var/www/website
And say your "sorry" website is:
/var/www/sorry
We'll make a script called /root/switch.
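#!/bin/sh
site=/var/www/website
sorry=/var/www/sorry
hold=/var/www/hold
# If the sorry site is still parked in its own directory, swap it in;
# otherwise the sorry site is live, so swap the real site back.
if [ -d "$sorry" ]; then
    mv "$site" "$hold"
    mv "$sorry" "$site"
else
    mv "$site" "$sorry"
    mv "$hold" "$site"
fi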
Say your existing cron job is:
0 0 * * * /do/my/maintenance >/dev/null 2>&1
To put the sorry site in place while the maintenance is running, just change that to:
0 0 * * * ( /root/switch; /do/my/maintenance; /root/switch ) >/dev/null 2>&1
The above simply calls the "switch" script twice - once before, and once after, the maintenance script.  It keeps all of the details of what "switch" actually does hidden away from the cron job, as a good programming practice.
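
If the maintenance task is itself driven from a script, the same swap can be wrapped so that the real site always comes back, even when the maintenance step blows up partway through. Here's a minimal Python sketch of that idea. It mirrors the directory shuffle above, but the name sorry_site is made up for this example and it is an alternative to the shell script, not a description of it.

```python
import os
from contextlib import contextmanager


@contextmanager
def sorry_site(site, sorry, hold):
    """Serve the sorry site from `site` for the duration of the with-block."""
    os.rename(site, hold)       # park the real site
    os.rename(sorry, site)      # sorry site now answers at the document root
    try:
        yield
    finally:
        os.rename(site, sorry)  # put the sorry site back where it lives
        os.rename(hold, site)   # restore the real site, even after a failure
```

Usage would look like "with sorry_site('/var/www/website', '/var/www/sorry', '/var/www/hold'):" followed by whatever runs the maintenance. All three paths need to be on the same filesystem for the renames to be quick and atomic.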

The above approach lets you customize your "sorry site" - some of the pages can say "We're down for maintenance" (say, the main page) and other pages can still work (say... for example... the pages that let people check out :-)

If you just want to take ALL pages offline, there's a simpler way - set up a variant .htaccess file and swap that in place, instead of moving the directories around.

Tuesday, June 04, 2013

Fighting SPAM: Identifying compromised email accounts

A compromised email account is one where spammers have determined someone's email password, and they're using the email account to send out spam email.

Various email servers have better and worse logging.  Depending on the server (qmail, postfix, sendmail) the logs may or may not let you directly correlate an outgoing spam email with the actual account that sent the email.

So, the following can be pretty useful.  It collects up all the IP addresses ($13 - the thirteenth field in the logfile, in this particular case) that each user has connected from, and prints out the accounts that are connecting from more than one IP.

awk '/LOGIN,/ {if (index(i[$12], $13) == 0) i[$12]=i[$12] " " $13} END {for(p in i) {print split(i[p], a, " ") " " p " " i[p]}}' maillog|sort -n|grep -v '^1 '

If you see an account for an individual, which is getting connections from dozens or hundreds of IP addresses, that's very possibly a compromised email account.

Note that an end-user with a smartphone will end up with a big bank of IPs connecting to check email.  They'll all have similar IP addresses in most cases.
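
If the one-liner is hard to read, here is the same idea spread out as a small shell function.  Treat it as a sketch: the function name accounts_with_many_ips is made up, and the field numbers are passed in as arguments because they differ from one mail server's log format to the next.  It also tracks each (account, IP) pair exactly, where the index() test above is a substring match and could treat 10.0.0.1 as already seen once 10.0.0.11 has been stored.

```shell
#!/bin/sh
# Sketch: list accounts seen from more than one IP address.
# Assumes LOGIN lines carry the account in one whitespace-separated
# field and the IP in another.
accounts_with_many_ips() {
    # $1 = field number of the account, $2 = field number of the IP
    awk -v ucol="$1" -v icol="$2" '
        /LOGIN,/ {
            user = $ucol; ip = $icol
            # Count each (account, IP) pair once, using an exact match.
            if (!((user, ip) in seen)) {
                seen[user, ip] = 1
                count[user]++
                ips[user] = ips[user] " " ip
            }
        }
        END {
            # Print: number of IPs, account, then the IPs themselves.
            for (u in count)
                if (count[u] > 1)
                    print count[u] " " u ips[u]
        }
    ' | sort -n
}
```

For the logfile layout described above, that would be run as: accounts_with_many_ips 12 13 < maillog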

Friday, May 31, 2013

Track Apache's calls to PHP

Customers often ask how to find out what PHP code is being called.  Sometimes, they're looking to find abusers of PHP email forms - and other times, they're interested in learning which routines are being called the most often.

The following monitoring command will run until you interrupt it with a control-C.

lsof +r 1 -p "$(ps axww | grep '[h]ttpd' | awk '{ str = str sep $1; sep = "," } END { print str }')" | grep vhosts | grep php

It takes the process IDs of all of the Apache processes and strings them together with commas in between.  Then it calls "lsof", asking it to repeat every second.

"lsof" lists all of the open file descriptors for the processes listed after the "-p" argument.

At the end of the command, we select only those lines that have "vhosts" and "php".  If your website document roots aren't under /var/www/vhosts you will want to look for some other string indicating "a file within a website".
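
To get from a scrolling list to "which scripts show up the most", the captured output can be tallied.  This is only a sketch: tally_php_files is a made-up name, it assumes the filename is the last whitespace-separated field of each lsof line (so paths containing spaces won't be counted correctly), and the numbers are samples of how often a file was seen open, not exact call counts.

```shell
#!/bin/sh
# Sketch: count how many times each .php file appears in lsof output.
tally_php_files() {
    awk '
        # The NAME column is the last field on each lsof line.
        $NF ~ /\.php$/ { hits[$NF]++ }
        END { for (f in hits) print hits[f] " " f }
    ' | sort -rn
}
```

One way to use it: redirect the monitoring command's output to a file, interrupt it after a minute or so, then run tally_php_files on that file.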

Wednesday, May 29, 2013

As a software developer, how can I ensure I remain employable after age 50?

I used to think the same way.  I've been programming UNIX/Linux for around 30 years.  I liked writing code.  I wanted my job to be writing code, and I wanted some company to pay me to do that.

I absolutely LOVE writing code now - because I only write WHAT I want to write, WHEN I want to, and HOW I want to.  (I.e. it's no longer part of my job.  I write code as a hobby now.) 

I absolutely LOVE my job now - it's WAY better than any job I've ever had before - including when I was a consultant, and including when I worked for myself (I was CTO of my own startup some years ago).

My day job is: HELP PEOPLE.  I found a very good fit in customer service.  

I'm now a top shelf systems administrator, and I leverage my coding skills to solve problems that would make many sysadmins' heads spin. For example, I was asked to action a db import the other day.  Mid-import, the load on the server went almost to zero, and memory usage started to climb. 

The import had dead-locked with the customer's runtime application logic.  

Because of how apache works, and because most customers over-commit apache in terms of how they set MaxClients (they allow Apache's worst-case memory footprint to be larger than their total available memory)... in this sort of a case, it's imperative to act QUICKLY to correct the situation, or the server will very probably crash.

Most sysadmins in that case would immediately stop apache, which I did.  They would then abort the import, probably restart mysql to clear the deadlock, and restart the import.  That, I did not do - it's overkill.

Instead, I stopped apache, ran "mysqladmin processlist > queries", edited the file "queries" in vim and... 
-> deleted the header, the footer and the specific db import query I did NOT want to kill, 
-> issued :1,$s/^|/kill /
-> issued :1,$s/|.*/;/ 
-> wrote the file and exited.  

That gave me a file full of lines like this: 

kill 12345 ; 
kill 67890 ;  

...then I ran "mysql

It was a 4.5G import, so that was a good thing; restarting it would have added hours to the downtime.  
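
For the curious, the vim edits in the story above can also be scripted.  The following is a minimal sketch, not what was run that day: it assumes the usual table layout of "mysqladmin processlist" output (Id in the first column, columns separated by "|"), and the function name processlist_to_kills and its "Id to spare" argument are made up for illustration.

```shell
#!/bin/sh
# Sketch: turn "mysqladmin processlist" output into kill statements,
# skipping one Id (for example, the import that should keep running).
processlist_to_kills() {
    # $1 = Id of the query to spare
    awk -F'|' -v keep="$1" '
        {
            id = $2              # first column of the table
            gsub(/ /, "", id)    # strip the padding
        }
        # Header and border lines have no numeric Id, so they drop out here.
        id ~ /^[0-9]+$/ && id != keep { print "kill " id " ;" }
    '
}
```

Run as "mysqladmin processlist | processlist_to_kills 777 > queries" (with 777 standing in for the Id to keep), it leaves a file you can review before doing anything with it.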

This isn't something your typical dev knows how to do correctly.  It's not something your typical admin knows how to do correctly.  And it's not even something your typical DBA knows how to do correctly.  It's something I knew how to do correctly, leveraging my years of experience.  

I'm sharing this because it shows there's still a need for people who can solve difficult computing problems, accurately and quickly, but outside of the programming domain.  Your experience level may well make you IDEAL for this sort of position, so if you find it at all compelling, I recommend that you:
  • review all of your past positions to see how each and every one of them had "customer service" as some aspect of what they were about
  • rework your resume to exude that aspect of what you did
  • apply for an entry-level position in customer service at a hosting company

Learning Vim From the Inside

Vim improves on vi in countless ways.  As a curious vi expert, I wanted to know exactly what those were, so I dove into the source code.  In doing so, I was compelled to create this online class a few years back: http://curiousreef.com/class/learning-vim-from-the-inside/

It's still going strong.  New students join every month.  For the most part, it runs itself now... but if you hit any hurdles while working your way through the content, please reach out to me and let me know.

The ethos of slicing and dicing logfiles

When a logfile is of reasonable size, you can review it using "view" - a read-only version of "vim".  This gives you flexible searching, and all of the power of vim as you review the logfile.  However, for viewing huge files, instead of editing them in vim directly, try this:

tail -100000 logfile | vim -

That way you're only looking at the last 100,000 lines not the whole file.  On a server with 4GB of RAM, looking at a 6GB logfile in vim without something like the above can be, well... a semi-fatal mistake.

For logfile analysis, I use awk a lot, along with grep and the other usual tools.  Awk's over the top - totally worth learning. You can do WAY cool things with it.   For example, I once used grep on an apache access log to find all the SQL injections an attacker had attempted, and wrote that to a tempfile.

Then I used awk to figure out (a) which .php files had been called and how many times each, and (b) what parameters had been used to do the injections.

awk -F\" tells awk to use " as the field separator, so anything to the left of the first " is '$1' and whatever's between the first and second quote is $2, etc.

So awk -F\" '{print $2}' shows me what was inside the first set of quotes on each line.

Using other characters for the field separator let me slice out just the filename from the GET request, then another pass over the file with slightly different code let me slice out just the parameter names.  
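
Here's roughly what those two passes could look like.  This is a reconstruction under assumptions, not the code from that day: it assumes the common/combined Apache log format, where the text inside the first pair of quotes is something like GET /path/file.php?a=1&b=2 HTTP/1.1, and the function names requests_per_script and param_names are made up.

```shell
#!/bin/sh
# Sketch: which scripts were requested, and how many times each.
requests_per_script() {
    awk -F'"' '{
        split($2, req, " ")          # req[2] is the requested URL
        q = index(req[2], "?")       # position of the query string, if any
        path = (q ? substr(req[2], 1, q - 1) : req[2])
        if (path != "") hits[path]++
    }
    END { for (p in hits) print hits[p] " " p }' | sort -rn
}

# Sketch: which query-string parameter names were used, and how often.
param_names() {
    awk -F'"' '{
        split($2, req, " ")
        q = index(req[2], "?")
        if (q == 0) next             # no parameters on this request
        m = split(substr(req[2], q + 1), pairs, "&")
        for (k = 1; k <= m; k++) {
            split(pairs[k], kv, "=") # kv[1] is the parameter name
            names[kv[1]]++
        }
    }
    END { for (name in names) print names[name] " " name }' | sort -rn
}
```

Both read the tempfile on standard input, e.g. requests_per_script < tempfile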

Here again, as you might feel is a resounding theme in my blog, the Linux commandline tools have proven to be immensely useful.

Log Dissector

If you want to see some of awk's more awesome features being leveraged for logfile analysis, take a look at this little program I threw together:
http://paulreiber.github.com/Log-Dissector/

How to learn UNIX/Linux

There are a lot of books... however, I recommend reading the manuals.
I know... it's a lot... but, I did it (3 decades ago) and it works really well.

Here are a few sites with unix manpages:

http://www.ma.utexas.edu/cgi-bin/man-cgi 
http://unixhelp.ed.ac.uk
http://bama.ua.edu/cgi-bin/man-cgi

I'm sure there are other good sites with manpages as well.

I recommend you read the entire manual, end-to-end - but unless you want to do that twice or thrice you'll want to skip around, and learn these commands first.

-> learn bash really well.  Read every single line of the manual, and figure out what the heck they're talking about.  Learn every variation of how to tweak a variable, how to do filename expansion (globbing, as opposed to regular expressions), what all the builtin functions are and how they work.

-> learn vi (or vim, if on Linux) - exceedingly awesome editor.  Sure, use a tutorial, and don't bother to dive down the ratholes of macros or settings tweaking unless you're really into those things, 'cause you'll wake up months later wondering what happened.  If you're in a hurry, learn nano instead.  You won't be as happy, but you'll be able to edit files.

-> learn awk.  Awk's associative arrays can help you solve some serious problems.  Once you know how to use awk, you'll never look at a flat file or a "unix pipeline" quite the same.  You can't learn too much awk.

-> learn sed.  Also awesome, both for modifying files on disk and for applying mods to every line of text fed through a "pipeline" to it.

With those four as "cornerstones", you'll be a UNIX power-user in no time.  They don't all work the same, or have the same conventions... but you'll get over the hurdles and be all the better UNIX expert in the end.

ALSO - once logged in - the 'man' and 'apropos' commands are your friends.  Read, learn, experiment, and build up your commandline/pipelines piece by piece.  Start small, and build.

Example of "building up" a command line:

find . -type f -print
...show me all the files in the current directory and down

find . -type f -print0 | xargs -0 file
...tell me what kind of file each of them is (and, handle filenames with spaces in them properly)

find . -type f -print0 | xargs -0 file | egrep -v 'GIF|JPEG|PNG|PDF'
...show me just the files that are NOT gifs, jpegs, pngs, or pdfs.

UNIX commands are kind of like LEGO elements - you can plug them together in cool ways.  So, get used to building up pipelines that do what you want.

I guess, most of all... don't expect to be given a guided tour.  


Instead, treat UNIX like a huge machine, where each part of the machine does have a manual and can be understood, and make it a habit, every day, to spend time wandering around in that machine, exploring it end-to-end, corner-to-corner, until you know it well enough to call yourself a UNIX power-user.

analyzing memory exhaustion

It's really kind of random which processes will be pushed into swap space by the kernel, when it realizes it's low on memory.  Just because a process is swapped out or using a lot of swap, doesn't mean that it is necessarily a problem - in fact, quite often, a process using a lot of swap space is an "innocent bystander".

To find the root cause of memory exhaustion issues, it's helpful to look at how processes are using virtual memory - both physical memory and swap. That way you can see which processes have what "footprint" across the board - not just which are dipping into swap.  

Here are a couple of useful commands for that, and their output on a local machine.

Show me the users using >= 1% of physical memory:

$ ps aux|awk '$1 != "USER" {t[$1]+=$4} END {for (i in t) {print t[i]" "i}}'|sort -n|grep -v ^0
2.4 root
35.6 pbr

Show me what programs are using >= 1% of physical memory:

$ ps aux|awk '$1 != "USER" {t[$11]+=$4} END {for (i in t) {print t[i]" "i}}'|sort -n|grep -v ^0
1 /usr/bin/knotify4
1.1 nautilus
1.3 mono
1.6 /usr/bin/yakuake
2.3 /usr/bin/cli
2.5 kdeinit4:
5.8 /usr/lib/firefox/firefox
7.6 gnome-power-manager

(Really?!? gnome-power-manager? ...that's gotta go away!  Glad I ran this!)

Note the only difference between the two commands above is the column used as the "index" into the t array - i.e. t[$1]+=$4 vs t[$11]+=$4

Of course you can aggregate other columns similarly.  For example:

Show me how many megabytes of virtual memory each non-trivial user is using:

$ ps aux|awk '$1 != "USER" {t[$1]+=$5} END {for (i in t) {print t[i]/1024" "i}}'|sort -n|grep -v ^0
2.19141 daemon
3.26172 103
3.30078 gdm
5.82031 avahi
11.4258 postfix
19.9141 105
33.6055 syslog
365.281 root
2653.86 pbr

Note the differences there are (a) we aggregate column 5 instead of 4, (b) we divide the result by 1024 so we're working with MB instead of KB.

Show me all programs cumulatively using >= 100MB of virtual memory

$ ps aux|awk '$1 != "USER" {t[$11]+=$5} END {for (i in t) {print t[i]/1024" "i}}'|sort -n|egrep ^[0-9]{3}
108.477 /usr/bin/cli
123.414 /usr/lib/indicator-applet/indicator-applet-session
149.359 udevd
154.168 /usr/bin/yakuake
162.402 nautilus
185.516 gnome-power-manager
226.586 kdeinit4:
499.598 /usr/lib/firefox/firefox

If your head is spinning trying to understand the command lines, I'll try to help.

Awk has an awesome feature called "associative arrays".  You can use a string as an index into an array.  No need to initialize it - awk does that for you automagically.  

Let's dissect the awk program I provide on the final commandline above - the one for "Show me all programs cumulatively using >= 100MB of virtual memory"

for each line of input (which happens to be the output of "ps aux")
  if field-1 isn't the string "USER" then
      add the value in field-5 (process-size) to 
      whatever is in the array t at index field-11 (program-name)

at the END of the file
  for each item i in array t
    print the value (t[i] divided by 1024), then a space (" "), then the item itself (i)

All of that output is fed to "sort" with the -n (for numeric) option, then that sorted output is fed to "egrep" which has been told to only print lines that start with at least three numerals. (remember, the goal is to only list programs cumulatively using ">=100MB" ... and 99MB has only two numerals.)
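
Since all four commands above differ only in which column is the key, which column is summed, and whether the total gets divided, they can be folded into one small function.  A sketch, with the name sum_by and its argument order being my own invention here:

```shell
#!/bin/sh
# Sketch: sum_by KEYCOL VALCOL [DIVISOR]
# Reads "ps aux" style output on standard input and prints, for each
# distinct value in column KEYCOL, the total of column VALCOL divided
# by DIVISOR (default 1), sorted numerically.
sum_by() {
    awk -v k="$1" -v v="$2" -v d="${3:-1}" '
        # Skip the header line, then add this value to its running total.
        $1 != "USER" { total[$k] += $v }
        END { for (i in total) print total[i] / d " " i }
    ' | sort -n
}
```

For example, "ps aux | sum_by 1 4" totals %MEM per user, and "ps aux | sum_by 11 5 1024" totals virtual memory per program in MB.  The grep or egrep filter can be tacked on the end exactly as before.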

With the basic Linux tools, you can do some pretty amazing things with the output of various commands.  This is an example of what is meant when people speak about the "power of the UNIX shell".

Back to swap space.  Once you have an idea of which processes on your system are using how much virtual memory, and how much physical memory, you'll be in a much better position to assess the actual root cause for any disconcerting swap usage.  As I mentioned, quite often, the processes that get swapped out are NOT the ones that are the real problem.

Very often, Apache or some other process which increases its footprint in response to increased demand will be the root cause of your memory problems.