Monday, June 24, 2013

Migrating a website - sporadic performance

I had the opportunity today to help someone who had done an outstanding job of migrating a website into the cloud - but their new site's performance was sporadic and unpredictable.

I'll share both the technique I used to debug that, and the lessons learned.

The site was implemented over two web servers, with lsync handling content synchronization.  The web servers were behind a cloud load balancer.  They were sized right.  They weren't pegging the CPU, swapping, or doing heavy I/O.  But... they weren't working right.  The site would load sometimes, and time out other times.

To see the website, I had to put the domain name and its new IP address in my local /etc/hosts file.  When I put the IP for the load balancer in my /etc/hosts file, I couldn't tell which apache child process was handling my request, because all of the connections were coming from the load balancer.  So, I picked one of the web servers - web1 - and changed my /etc/hosts details for the domain to point my browser straight to that web server - bypassing the load balancer.  That way, I could see which apache child process my browser had been hooked up to.
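Here's a minimal sketch of that bypass. The IP address is a placeholder from the documentation range (not the real web1), and hosts_entry is just a hypothetical helper for illustration. If you only need a quick check rather than a full browser session, curl's --resolve option can pin the name to an address for a single request without touching /etc/hosts.

```shell
# Hypothetical values: substitute the real domain and web1's address.
DOMAIN="thewebsite.com"
WEB1_IP="203.0.113.11"   # documentation-range placeholder, not a real server

# Print the line that would go in the local /etc/hosts to pin DOMAIN to web1.
hosts_entry() {
  printf '%s %s\n' "$1" "$2"
}
hosts_entry "$WEB1_IP" "$DOMAIN"

# One-off alternative that leaves /etc/hosts alone (uncomment to run):
# curl -sS -o /dev/null -w '%{http_code} %{time_total}s\n' \
#      --resolve "$DOMAIN:80:$WEB1_IP" "http://$DOMAIN/"
```

Either way, remember to undo the /etc/hosts change when you're done, or you'll keep talking to web1 long after the test is over.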

Lesson #1: bypass the load balancer for testing.

I used 'strace' to see what apache was doing.  It took a few tries, but I soon had a good idea of what was going on.  By the time you've read the process ID out of the netstat output and typed the strace command, the request has usually already been served - so how do you strace that process super-fast?
alias s="strace -s 999 -p"
This way, when netstat shows that apache process 11245 is serving your IP, you can bust out:
"s 11245" and hit enter.  Voilà!   (it's possible to go even further with this, but let's keep it simple)

Lesson #2: don't give up - figure out how to not have to type a lot.
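Going one step further than the alias, here is a rough sketch that looks up the PID for you. It assumes the Linux net-tools netstat, run as root (otherwise the PID column just shows "-"), and the function names pids_for_client and st are made up for this post.

```shell
# Read `netstat -tnp` output on stdin and print the PID of every
# ESTABLISHED connection whose foreign address is the given client IP.
pids_for_client() {
  awk -v ip="$1" '
    $6 == "ESTABLISHED" {
      addr = $5
      sub(/:[0-9]+$/, "", addr)      # strip the trailing :port
      sub(/^::ffff:/, "", addr)      # unwrap IPv4-mapped IPv6 addresses
      if (addr == ip) {
        split($7, p, "/")            # "11245/apache2" -> "11245"
        if (p[1] ~ /^[0-9]+$/) print p[1]
      }
    }
  '
}

# Attach strace to the first process found serving the given client IP.
st() {
  pid=$(netstat -tnp 2>/dev/null | pids_for_client "$1" | head -n 1)
  if [ -z "$pid" ]; then
    echo "no established connection from $1" >&2
    return 1
  fi
  strace -s 999 -p "$pid"
}

# Usage: st 198.51.100.7    (your own client IP goes here)
```

With keep-alive enabled, reloading the page right before running st makes it more likely that the connection is still open when netstat looks.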

I saw apache contacting some IPs I was familiar with - the caching nameservers for the datacenter where the server lives.   Then I saw apache reach out and connect to an unfamiliar IP address.

What that means is that apache was looking up a domain name in DNS, then connecting to the resulting IP.
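To make that pattern easier to spot, strace can be limited to connect() calls and the target addresses pulled out of its output. This is a rough sketch, assuming IPv4 connections and strace's usual sockaddr formatting; connect_targets is a hypothetical helper name, and 11245 is just the example PID from earlier.

```shell
# Reduce strace output (on stdin) to "IP PORT" pairs for IPv4 connect() calls.
# Lines that are not IPv4 connects (files, unix sockets, ...) are dropped.
connect_targets() {
  sed -n 's/.*sin_port=htons(\([0-9]*\)), sin_addr=inet_addr("\([0-9.]*\)").*/\2 \1/p'
}

# Usage (strace writes to stderr, hence the 2>&1):
#   strace -e trace=connect -s 999 -p 11245 2>&1 | connect_targets
#
# A connect to port 53 is a DNS lookup; the connect that follows it is
# usually to the address that lookup returned.
```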

I asked about that IP... and it turned out, it was the IP of where that website is CURRENTLY hosted.

I was helping to debug the NEW version of this site - but for whatever reason, the new code was reaching out to the OLD implementation of the website.

So, the "root cause" had been found.  But what to do next?  I could have simply advised that the best solution would be to revise the code to use relative references, or mentioned that it could use IP addresses instead of domain names.

Instead, I fixed the problem, right then and there.  

On both web servers, I added the domain name in /etc/hosts:
127.0.0.1 localhost localhost.localdomain thewebsite.com
That way, each machine treats 127.0.0.1 as the proper IP address for the domain.  This has the added benefit that requests to the domain from either web server never go out through the load balancer.  I think it's an all-around good idea.
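A quick way to double-check the edit on each web server is sketched below. hosts_lookup is a hypothetical helper that reads the hosts file directly and does an exact, case-sensitive match on the name; getent, by contrast, goes through the system resolver order, so it shows what applications on the box will normally see, assuming the usual setup where the hosts file is consulted before DNS.

```shell
# Print the address(es) a hosts file maps NAME to.
# Usage: hosts_lookup NAME [FILE]     (FILE defaults to /etc/hosts)
hosts_lookup() {
  awk -v name="$1" '
    { sub(/#.*/, "") }                 # drop comments
    {
      for (i = 2; i <= NF; i++)        # field 1 is the address, the rest are names
        if ($i == name) { print $1; break }
    }
  ' "${2:-/etc/hosts}"
}

# On each web server, after the edit:
#   hosts_lookup thewebsite.com      # reads /etc/hosts directly
#   getent hosts thewebsite.com      # asks the system resolver
```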

Lesson #3: servers that implement domains should consider themselves that domain.

By the way... the moment I edited /etc/hosts and fixed this, the site started to render super-fast, and the sporadic performance problem was gone.

My customer was so happy, he told me to tell my boss he said I could take the rest of the day off. (I didn't... but I loved the sentiment!)