Tuesday, June 25, 2013

Running commands on all of your cloud servers

I consider my cloud servers to be one big array of servers.

I decided to use "fog" - a Ruby cloud-services library that supports the Rackspace Cloud - to build something that lets me, in one step, run a command on all of the servers.  It turned out to be pretty straightforward.
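Here's a minimal sketch of what that can look like.  The credentials are placeholders, and it assumes the fog gem is installed and that root's SSH key is already on each server:

#!/usr/bin/env ruby
require 'fog'

command = ARGV.join(' ')

# Connect to the Rackspace Cloud compute API
compute = Fog::Compute.new(
  :provider           => 'Rackspace',
  :rackspace_username => 'YOUR_USERNAME',   # placeholder
  :rackspace_api_key  => 'YOUR_API_KEY'     # placeholder
)

# Run the command on every server over SSH, showing each server's output
compute.servers.each do |server|
  puts "=== #{server.name} (#{server.public_ip_address})"
  server.ssh(command).each { |result| puts result.stdout }
end

Save it as something like /root/runall, and "runall uptime" walks the whole array.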


You might have a different model - an array of web servers, another array of db servers, and another array of compute servers, for example.  If so, you can easily extend the code to work with your different groups of servers by querying for whatever differentiates them.
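With fog, and assuming a naming convention like web1/web2/db1, selecting just one group could look like:

compute.servers.select { |s| s.name =~ /^web/ }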

Monday, June 24, 2013

Migrating a website - sporadic performance

I had the opportunity today to help someone who had done an outstanding job of migrating a website into the cloud - but their new site's performance was sporadic and unpredictable.

I'll share both the technique I used to debug that, and the lessons learned.

The site was implemented over two web servers, with lsyncd handling content synchronization.  The web servers were behind a cloud load balancer.  They were sized right.  They weren't pegging the CPU, swapping, or saturating I/O.  But... they weren't working right.  The site would load sometimes, and time out other times.

To see the website, I had to put the domain name and its new IP address in my local /etc/hosts file.  When I put the IP for the load balancer in my /etc/hosts file, I couldn't tell which apache child process was handling my request, because all of the connections were coming from the load balancer.  So, I picked one of the web servers - web1 - and changed my /etc/hosts details for the domain to point my browser straight to that web server - bypassing the load balancer.  That way, I could see which apache child process my browser had been hooked up to.
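With web1's address (a placeholder here), that hosts entry looked something like:

203.0.113.21    thewebsite.com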

Lesson #1: bypass the load balancer for testing.

I used 'strace' to see what apache was doing.  It took a few tries, but I soon had a good idea of what was going on.  The catch: by the time netstat has shown you the process ID, the request may already be over - so how do you strace that process super-fast?
alias s="strace -s 999 -p"
This way, when netstat shows that apache process 11245 is serving your IP, you can bust out:
"s 11245" and hit enter.  Voilà!   (it's possible to go even further with this, but let's keep it simple)

Lesson #2: don't give up - figure out how to not have to type a lot.

I saw apache contacting some IPs I was familiar with - the caching nameservers for the datacenter where the server lives.   Then I saw apache reach out and connect to an unfamiliar IP address.

What that means is that apache was looking up a domain name in DNS, then connecting to the resulting IP.

I asked about that IP... and it turned out to be the IP where that website was CURRENTLY hosted.

I was helping to debug the NEW version of this site - but for whatever reason, the new code was reaching out to the OLD implementation of the website.

So, the "root cause" had been found.  However, what to do next?  I could have simply advised that the best solution would be to revise the code to use relative references.  Or, mentioned that it could use IP addresses instead of domain names.

Instead, I fixed the problem, right then and there.  

On both web servers, I added the domain name to /etc/hosts:
127.0.0.1 localhost localhost.localdomain thewebsite.com
That way, each machine considered 127.0.0.1 to be the proper IP address for the domain.  This had the added benefit that references to the domain from either web server wouldn't cause traffic through the load balancer.  I think it's an all-around good idea.
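To confirm the change took, you can ask the resolver directly:

getent hosts thewebsite.com

If that prints 127.0.0.1, the machine now considers itself the home of the domain.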

Lesson #3: servers that implement domains should consider themselves that domain.

By the way... the moment I edited /etc/hosts and fixed this, the site started to render super-fast, and the sporadic performance problem was gone.

My customer was so happy, he told me to tell my boss he said I could take the rest of the day off. (I didn't... but I loved the sentiment!)

Thursday, June 06, 2013

Nightly Maintenance and "Sorry Sites"

Servers need backups.  And, sometimes, there are nightly maintenance scripts that need to be run, for example dumping out all transactions, or importing orders or products.  Usually these maintenance tasks will be run from a cron job.

Often, these tasks impact the "production" website - or, conversely, the "production" website impacts these tasks.  Either way, sometimes it's best to take the site offline for a minute or two, to let the maintenance task run quickly and to completion, without competition.

I thought the approach below was totally obvious, but I've learned that a lot of people are really happy to learn how to do this sort of thing.

It's really straightforward for a cron job to also put a "Sorry Site" in place - a website that states "We're down for maintenance - please reload in a few minutes" or similar.   Here's a strategy for doing this.

Say your website document root is:
/var/www/website
And say your "sorry" website is:
/var/www/sorry
We'll make a script called /root/switch.
#!/bin/sh
# Swap the live site with the "sorry" site; run again to swap back.
site=/var/www/website
sorry=/var/www/sorry
hold=/var/www/hold
if [ -d "$sorry" ]; then
    # The sorry site is waiting in the wings: stash the live site, put sorry in its place.
    mv "$site" "$hold"
    mv "$sorry" "$site"
else
    # The sorry site is currently live: put it away and restore the real site.
    mv "$site" "$sorry"
    mv "$hold" "$site"
fi
Say your existing cron job is:
0 0 * * * /do/my/maintenance >/dev/null 2>&1
To put the sorry site in place while the maintenance is running, just change that to:
0 0 * * * (/root/switch; /do/my/maintenance; /root/switch) >/dev/null 2>&1
Note the parentheses: they group the three commands so the redirection applies to all of them, not just the last one.  This simply calls the "switch" script twice - once before, and once after, the maintenance script.  It keeps all of the details of what "switch" actually does hidden away from the cron job, as a good programming practice.

The above approach lets you customize your "sorry site" - some of the pages can say "We're down for maintenance" (say, the main page) while other pages can still work (for example... the pages that let people check out :-)

If you just want to take ALL pages offline, there's a simpler way - set up a variant .htaccess file and swap that in place, instead of moving the directories around.
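As a sketch, a maintenance .htaccess along these lines (assuming mod_rewrite is enabled, with "maintenance.html" standing in for whatever your page is actually named) answers every request with a 503 and your maintenance page:

ErrorDocument 503 /maintenance.html
RewriteEngine On
RewriteCond %{REQUEST_URI} !=/maintenance.html
RewriteRule ^ - [R=503,L]

A 503 has the nice side effect of telling well-behaved search engine crawlers that the outage is temporary.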

Tuesday, June 04, 2013

Fighting SPAM: Identifying compromised email accounts

A compromised email account is one where spammers have determined someone's email password, and they're using the email account to send out spam email.

Email servers vary in the quality of their logging.  Depending on the server (qmail, postfix, sendmail) the logs may or may not let you directly correlate an outgoing spam email with the actual account that sent it.

So, the following can be pretty useful.  It collects up all the IP addresses that each account has connected from ($12 is the account and $13 is the IP - that's where those fields land in this particular logfile), and prints out the accounts that are connecting from more than one IP.

awk '/LOGIN,/ {if (index(i[$12], $13) == 0) i[$12]=i[$12] " " $13} END {for(p in i) {print split(i[p], a, " ") " " p " " i[p]}}' maillog | sort -n | grep -v '^1 '

If you see an account for an individual, which is getting connections from dozens or hundreds of IP addresses, that's very possibly a compromised email account.
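Each output line is the count of distinct IPs, the account, and then the IPs themselves - so (with made-up values) a suspicious line might look like:

214 someone@example.com 203.0.113.4 198.51.100.7 ...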

Note that an end user with a smartphone will legitimately end up with a big bank of IPs connecting to check email - but in most cases, those will be similar addresses from the same carrier's ranges.