Showing posts with label Administration.

Friday, August 09, 2013

Exploring ArchLinux 2012

Sensibly, Arch includes vim.  At least it's not forcing us to use 'nano'.

However, package installs were problematic.

# pacman -S mlocate
:: The following packages should be upgraded first :
    pacman
:: Do you want to cancel the current operation
:: and upgrade these packages now? [Y/n] Y

resolving dependencies...
looking for inter-conflicts...

Targets (6): bash-4.2.045-1  filesystem-2013.03-2  glibc-2.17-5  libarchive-3.1.2-1  linux-api-headers-3.8.4-1  pacman-4.1.0-2

Total Installed Size:   51.14 MiB
Net Upgrade Size:       -0.99 MiB

Proceed with installation? [Y/n] Y
(6/6) checking package integrity                                                                                           [##########################################################################] 100%
(6/6) loading package files                                                                                                [##########################################################################] 100%
(6/6) checking for file conflicts                                                                                          [##########################################################################] 100%
error: failed to commit transaction (conflicting files)
filesystem: /etc/profile.d/locale.sh exists in filesystem
filesystem: /usr/share/man/man7/archlinux.7.gz exists in filesystem
Errors occurred, no packages were upgraded.
#

These are not the sorts of problems I enjoy - it's a relatively pointless challenge to figure out how to use something that doesn't seem to want to be used.

Searching about on the internet for answers was no more enjoyable, and just as fruitless.  Evidently, people who are expert with arch wish to remain an exclusive club, and have little interest in communicating HOW to use the package manager or otherwise become proficient with the distro.

I can't say that I would recommend Arch, based on my experiences with it to date.

Exploring Debian 6 - Squeeze

Debian's a reasonable distro.  apt-get installed python, gcc, make, and vim quite handily.

I was a little disappointed to find it doesn't have pstree - and further to that:

# aptitude search pstree
#

I'm perplexed.  No match on a search for pstree?  Doesn't seem reasonable.  Am I missing something? Or are they?  It's a bit frustrating.

ps x --forest

...it's just not the same.
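
In fairness, the command may simply live in a package with a different name - psmisc would be my guess - and apt-file (which searches package contents rather than package names) should be able to confirm that:

apt-get install apt-file
apt-file update
apt-file search bin/pstree      # should point at the psmisc package
apt-get install psmisc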

Otherwise, a very reasonable distro.

Exploring Gentoo 12


What do they have against Vim?  
Vim is the default Linux/UNIX editor.  Excluding it on a distro is bordering on criminal.

emerge sys-apps/mlocate

Nothing can be standard, Gentoo must differentiate.  One cannot simply install, or update, or "get" a package, one must emerge it.

However, emerge worked, at least for "locate"... almost as straightforwardly as with RedHat/CentOS/Fedora/Ubuntu.

I found myself in "nano" having issued "visudo".  That's just wrong.   Let's see - can I install vim?

# emerge vim
 * Last emerge --sync was 348d 11h 31m 40s ago.
Calculating dependencies... done!

>>> Verifying ebuild manifests

>>> Starting parallel fetch

>>> Emerging (1 of 6) app-admin/eselect-vi-1.1.7-r1
 * Fetching files in the background. To view fetch progress, run
 * `tail -f /var/log/emerge-fetch.log` in another terminal.
 * vi.eselect-1.1.7.bz2 SHA256 SHA512 WHIRLPOOL size ;-) ...                                                                                                                                         [ ok ]
>>> Unpacking source...
>>> Unpacking vi.eselect-1.1.7.bz2 to /var/tmp/portage/app-admin/eselect-vi-1.1.7-r1/work
>>> Source unpacked in /var/tmp/portage/app-admin/eselect-vi-1.1.7-r1/work
>>> Preparing source in /var/tmp/portage/app-admin/eselect-vi-1.1.7-r1/work ...
 * Applying eselect-vi-1.1.7-prefix.patch ...                                                                                                                                                        [ ok ]
>>> Source prepared.
>>> Configuring source in /var/tmp/portage/app-admin/eselect-vi-1.1.7-r1/work ...
>>> Source configured.
>>> Compiling source in /var/tmp/portage/app-admin/eselect-vi-1.1.7-r1/work ...
>>> Source compiled.
>>> Test phase [not enabled]: app-admin/eselect-vi-1.1.7-r1

>>> Install eselect-vi-1.1.7-r1 into /var/tmp/portage/app-admin/eselect-vi-1.1.7-r1/image/ category app-admin
>>> Completed installing eselect-vi-1.1.7-r1 into /var/tmp/portage/app-admin/eselect-vi-1.1.7-r1/image/


>>> Installing (1 of 6) app-admin/eselect-vi-1.1.7-r1

>>> Emerging (2 of 6) app-admin/eselect-ctags-1.13
>>> Downloading 'http://mirror.usu.edu/mirrors/gentoo/distfiles/eselect-emacs-1.13.tar.bz2'
--2013-07-28 03:06:55--  http://mirror.usu.edu/mirrors/gentoo/distfiles/eselect-emacs-1.13.tar.bz2
Resolving mirror.usu.edu... 129.123.104.64
Connecting to mirror.usu.edu|129.123.104.64|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2013-07-28 03:06:55 ERROR 404: Not Found.

>>> Downloading 'http://mirror.mcs.anl.gov/pub/gentoo/distfiles/eselect-emacs-1.13.tar.bz2'
--2013-07-28 03:06:55--  http://mirror.mcs.anl.gov/pub/gentoo/distfiles/eselect-emacs-1.13.tar.bz2
Resolving mirror.mcs.anl.gov... 2620:0:dc0:1800:214:4fff:fe7d:1b9, 146.137.96.7
Connecting to mirror.mcs.anl.gov|2620:0:dc0:1800:214:4fff:fe7d:1b9|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2013-07-28 03:06:55 ERROR 404: Not Found.

>>> Downloading 'http://gentoo.cities.uiuc.edu/pub/gentoo/distfiles/eselect-emacs-1.13.tar.bz2'
--2013-07-28 03:06:55--  http://gentoo.cities.uiuc.edu/pub/gentoo/distfiles/eselect-emacs-1.13.tar.bz2
Resolving gentoo.cities.uiuc.edu... failed: Name or service not known.
wget: unable to resolve host address `gentoo.cities.uiuc.edu'
>>> Downloading 'http://gentoo.osuosl.org/distfiles/eselect-emacs-1.13.tar.bz2'
--2013-07-28 03:06:56--  http://gentoo.osuosl.org/distfiles/eselect-emacs-1.13.tar.bz2
Resolving gentoo.osuosl.org... 140.211.166.134
Connecting to gentoo.osuosl.org|140.211.166.134|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2013-07-28 03:06:56 ERROR 404: Not Found.

>>> Downloading 'http://ftp.halifax.rwth-aachen.de/gentoo/distfiles/eselect-emacs-1.13.tar.bz2'
--2013-07-28 03:06:56--  http://ftp.halifax.rwth-aachen.de/gentoo/distfiles/eselect-emacs-1.13.tar.bz2
Resolving ftp.halifax.rwth-aachen.de... 137.226.34.42
Connecting to ftp.halifax.rwth-aachen.de|137.226.34.42|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2013-07-28 03:06:56 ERROR 404: Not Found.

>>> Downloading 'http://gentoo.ussg.indiana.edu/distfiles/eselect-emacs-1.13.tar.bz2'
--2013-07-28 03:06:56--  http://gentoo.ussg.indiana.edu/distfiles/eselect-emacs-1.13.tar.bz2
Resolving gentoo.ussg.indiana.edu... 156.56.247.195
Connecting to gentoo.ussg.indiana.edu|156.56.247.195|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2013-07-28 03:06:57 ERROR 404: Not Found.

>>> Downloading 'http://gentoo-distfiles.mirrors.tds.net/distfiles/eselect-emacs-1.13.tar.bz2'
--2013-07-28 03:06:57--  http://gentoo-distfiles.mirrors.tds.net/distfiles/eselect-emacs-1.13.tar.bz2
Resolving gentoo-distfiles.mirrors.tds.net... 216.165.129.135
Connecting to gentoo-distfiles.mirrors.tds.net|216.165.129.135|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2013-07-28 03:06:57 ERROR 404: Not Found.

!!! Couldn't download 'eselect-emacs-1.13.tar.bz2'. Aborting.
 * Fetch failed for 'app-admin/eselect-ctags-1.13', Log file:
 *  '/var/tmp/portage/app-admin/eselect-ctags-1.13/temp/build.log'

>>> Failed to emerge app-admin/eselect-ctags-1.13, Log file:

>>>  '/var/tmp/portage/app-admin/eselect-ctags-1.13/temp/build.log'

 * Messages for package app-admin/eselect-ctags-1.13:

 * Fetch failed for 'app-admin/eselect-ctags-1.13', Log file:
 *  '/var/tmp/portage/app-admin/eselect-ctags-1.13/temp/build.log'

 * GNU info directory index is up-to-date.

 * IMPORTANT: 2 config files in '/etc' need updating.
 * See the CONFIGURATION FILES section of the emerge
 * man page to learn how to update config files.

The above experience gave me ZERO faith in Gentoo.  I had asked to install Vim, yet the errors are all about failing to fetch emacs-related files.  Poignant, yet so totally inappropriate!
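
In hindsight, that "Last emerge --sync was 348d 11h 31m 40s ago" line was probably the real culprit - a year-old portage tree pointing at distfiles the mirrors no longer carry.  My guess is that syncing the tree first would have gone better:

emerge --sync
emerge vim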

I can't say I've had a reasonable experience with Gentoo 12 so far.  It hasn't been totally unwieldy, but it's been far from malleable.

Exploring Fedora 17

For anyone who is already familiar with RedHat/Centos, this is a painless distro to adopt - you'll find few if any surprises.

As with the other Rackspace Cloud distros, this one's lean - but not TOO lean.  I found myself needing to install "locate" and "gcc":

yum install mlocate
updatedb
yum install gcc

It makes for a relatively boring blog post, but... I had so few problems with Fedora 17 that I really have nothing more to report.  It just works.

Exploring OpenSUSE 12

Yet more documentation of my Exploring Variants of Linux in the Rackspace Cloud

OpenSUSE package management is via YaST, and YaST wants to be interactive.  It's possible to do some things from the command line, but it seems to work best interactively.

This makes it harder to review what you've done since there's no record of exact command line options.  My root history simply says "yast", with no indication of what I installed.

As such, I don't have a good audit trail for what I've done on the OpenSUSE server.
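
If I want a command-line record next time, zypper appears to be the non-interactive route, and /var/log/zypp/history keeps a log of what was installed - something along these lines, though I did my installs through yast this time around:

zypper install gcc make
grep install /var/log/zypp/history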

However, it gave me very few problems and had few if any issues.

OpenSUSE will surprise you if/when you run "pstree -paul"

You'll find init has been replaced by something called "systemd".  As with Ubuntu's "upstart", OpenSUSE's "systemd" replaces the tried-and-true system init scripts with something new and wonderful.
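
The day-to-day commands map over roughly like this (sshd is just an example service here):

systemctl start sshd.service       # instead of /etc/init.d/sshd start
systemctl status sshd.service      # instead of service sshd status
systemctl enable sshd.service      # instead of chkconfig sshd on
systemctl list-units --type=service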

man systemd.special systemd.unit systemd.service systemd.socket systemd.target

...there's quite a bit of learning and reading to be done.  Another day.   As with Ubuntu Server, were I forced to work with and manage a Linux other than RedHat/CentOS, I'd be quite happy with OpenSUSE.

Exploring Ubuntu 12 Server

As I promised in Exploring Variants of Linux I'm following up with my impressions of, and notes on, various Linux distros.  This is the first of those "general impressions".

If you're a RedHat/CentOS centric Linux user, hopefully these posts will help you over the first few hurdles you might find on the various other distros.

As with all of the Rackspace Cloud distros, the Ubuntu 12 Server distro is very lean.  I noticed a few relatively important tools were not installed.  They're not "required" but they're really handy, so my first step was to install them:

apt-get install mlocate
updatedb
apt-get install make
apt-get install gcc

I noticed only one service running which is not needed - "whoopsie" - so I turned it off.

root@pbr-ubuntu12:~# cat /etc/default/whoopsie 
[General]
report_crashes=true
root@pbr-ubuntu12:~# sed -i 's/report_crashes=true/report_crashes=false/' /etc/default/whoopsie 
root@pbr-ubuntu12:~# cat /etc/default/whoopsie 
[General]
report_crashes=false
root@pbr-ubuntu12:~# sudo service whoopsie stop
root@pbr-ubuntu12:~# 

Upgrading to the latest Ubuntu was very lengthy and verbose, including a full-screen interaction with a pink background... but it was functional:

apt-get update
apt-get upgrade
do-release-upgrade

Upstart's quite a bit different from the standard SysV init script approach, but easy enough to get accustomed to.

man upstart-events

...neat.  Upstart's pretty darn powerful, in fact.
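
As an illustration - a made-up job, not one from the stock install - a complete upstart job can be as small as this, dropped into /etc/init/myjob.conf:

description "hypothetical example daemon"
start on runlevel [2345]
stop on runlevel [!2345]
respawn
exec /usr/local/bin/mydaemon

After that, "start myjob", "stop myjob", and "status myjob" just work.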

 General impression?  If I was forced to use something other than a RedHat/CentOS distro, I'd be quite happy with Ubuntu Server.

Wednesday, July 24, 2013

Nameless Temporary Files

Linux 3.11 rc2 


Here's an interesting snippet from Linus's announcement post regarding Linux 3.11 rc2:

 (a) the O_TMPFILE flag that is new to 3.11 has been going through a
few ABI/API cleanups (and a few fixes to the implementation too), but
I think we're done now. So if you're interested in the concept of
unnamed temporary files, go ahead and test it out. The lack of name
not only gets rid of races/complications with filename generation, it
can make the whole thing more efficient since you don't have the
directory operations that can cause serializing IO etc.

Interesting idea!  Temporary files that aren't burdened with having to have filenames.

It will be some time before sysadmins see this feature in production, but in this case I think it's best to get the word out early, especially since this new feature could cause "mystery drive space exhaustion".

Right now, the usual cause of a discrepancy between "df" and "du" numbers is deleted files with still-open file descriptors.  With this new feature, it appears that nameless temporary files will join the ranks of hard-to-spot possible root causes of space exhaustion.
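
Those deleted-but-still-open files are at least easy to spot, since lsof can list open files whose link count has dropped to zero:

lsof +L1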

It's not clear how these new files will be identifiable / distinguished, for example, in the output of "lsof".
As I learn more about this new feature, I'll be sure to write about it.

Sunday, July 07, 2013

Best Practices: It's Freezing in the Cloud

Production.  

It's a term many people have heard of, but what does it mean?   A lot of people have been asking me about this lately, so I'm happy to give an overview of some best practices for solution management.


Production Rule #1: A production environment is something you don't touch.  

Its configuration is frozen - it's not up for experimentation.  It does exactly what it's been configured to do, nothing more, nothing less.  A production environment is composed of a set of production servers handling various functions.  Whether they're web servers, app servers, compute servers, db servers, or comms/queueing servers, each production server is a simple combination of three things:

  • a vetted version of your application/website/service/database/whatever
  • a vetted copy of each and every non-stock (tuned) configuration file
  • a base OS - hopefully one that's stock, but definitely one that's been proven stable

The sum total of those three things makes a production server.   

Example Production Environment "Cook Book":
db application + db_master_config + stock RHEL6 server = master-database-server
db application + db_tapebackup_slave_config + stock RHEL6 server = backup-slave-database-server
db application + db_failover_slave_config + stock RHEL6 server = failover-slave-database-server
payment_gateway_if + comm_server_tuning + stock RHEL6 server = payment_gateway
payment_gateway_if + failover_comm_server_tuning + stock RHEL6 server = payment_gateway
website content + webhead_config + stock RHEL6 server = web1
website content + webhead_config + stock RHEL6 server = web2
website content + webhead_config + stock RHEL6 server = web3
load_balancer_config + stock cloud loadbalancer (LBaaS) = production_loadbalancer


Production Rule #2:  Never login on your production servers.

Notice there's no mention of "custom tweaking" or "post-install tuning" or the like in the recipes I listed in the example above.  That's because there really must be none. Human beings NEVER LOG IN on production servers.  If they are doing that, they're almost certainly breaking rule #1 and they're definitely breaking rule #2.  Early on, as you're putting the solution in place, it may be handy to use ssh to remotely run a command or two - but you must ensure you're only running non-intrusive monitoring - "read only" operations - if you wish to ensure the correctness of the production environment.

The moment you break rule #2, you'll be setting yourself up for a conundrum if/when there is a problem with the production environment.  You'll then need to answer the question:  "was it what I did in there, or was it something in the production push that's the root cause of the problem?"

If you've never logged in on the production servers, you then KNOW it was something in the production push that caused the problem.

How then do you arrive at a reasonable solution, not over-spending on servers, memory, storage, licenses, etc. if you don't tune your production environment?   You tune your staging environment instead.  

Staging.

There's a term fewer people have heard, but it's just as important as "production".  Every good production environment has at least one staging environment.

Ideally, a staging environment duplicates the production environment.  If you're hesitant to jump straight to that, you can introduce less redundancy than the production environment has - but you're opening up the possibility of mis-deployment if you do.

Example "barely shorted" staging environment:
db application + db_master_config + stock RHEL6 server = master-database-server
db application + db_tapebackup_slave_config + stock RHEL6 server = backup-slave-database-server
payment_gateway_if + comm_server_tuning + stock RHEL6 server = payment_gateway
website content + webhead_config + stock RHEL6 server = web1
website content + webhead_config + stock RHEL6 server = web2
load_balancer_config + stock cloud loadbalancer (LBaaS) = staging_loadbalancer

The idea with a staging environment is that it's the destination for changes to applications, website content, and configuration, prior to their going into production.


Staging Rule #1: A staging environment is something you don't touch.  

(well... after it's been setup and debugged and is working properly, anyway)

It's a "devops" world now - sysadmin config changes need to be versioned and managed just as carefully as code changes.  Ideally all of the changes are committed to a source code repository - ideally something like git.  

Once a week, or more often if needed, the entire list of changes being made for all components and configurations is reviewed and vetted/approved.  Then, all of those changes are applied to the staging environment, backing things up first if needed.

With that, you've achieved a "staging push" - combining all of the changes to all of the functionality and configuration for all of the various solution components and applying them to the staging environment.  At that point automated testing begins against the solution that you've just put in place in the staging environment.

Real-world traffic to the solution is either simulated or exactly reproduced, and the performance and resource utilization of all servers implementing staging is logged.  After a period of some days of testing (yes, multiple days - ideally simulating a full week of operations), summarization and statistics can be generated from the resource utilization logs.

If there are any ill side-effects of the most recent push, they'll be evident because the resource utilization statistics will show that things got worse.  For example, if there's a badly coded webpage introduced which is causing apache processes to balloon up in size, the memory statistics on the webheads will be notably worse than they were for the previous staging push.

Staging Rule #2:  Never login on your staging servers.

If it's done right by suitably lazy programmers, your staging environment will be running all of this testing automatically, monitoring resources automatically, comparing the previous and current statistics resulting from testing at the end of the test run, and emailing you with the results.

You can only be 100% sure of the results of the staging test if it was entirely "hands off".  Otherwise if/when something goes wrong (either in production or in staging) you'll be left wondering if it was due to the push, or due to whatever bespoke steps you took in staging.  That's not a good feeling, and it's not a fun discussion with your board of directors either.

More Twenty-first century devops best practices

If you'd like to learn more, I can recommend Allspaw and Robbins' "Web Operations: Keeping the Data On Time".  Now it's your turn - what's your favorite "devops" runbook/rule-book?


Tuesday, June 25, 2013

Running commands on all of your cloud servers

I consider my cloud servers to be one big array of servers.

I decided to use "fog" - the Ruby API for the Rackspace Cloud - to build something to let me, in one step, run commands on all of the servers.  It turned out to be pretty straightforward.


You might have a different model - an array of web servers, another array of db servers, and another array of compute servers, for example.  If so, you can easily extend the code to work with your different groups of servers by querying for whatever differentiates them.

Tuesday, June 04, 2013

Fighting SPAM: Identifying compromised email accounts

A compromised email account is one where spammers have determined someone's email password, and they're using the email account to send out spam email.

Various email servers have better and worse logging.  Depending on the server (qmail, postfix, sendmail) the logs may or may not let you directly correlate an outgoing spam email with the actual account that sent the email.

So, the following can be pretty useful.  It collects up all the IP addresses ($13 - the thirteenth field in the logfile, in this particular case) that each user has connected from, and prints out the accounts that are connecting from more than one IP.

awk '/LOGIN,/ {if (index(i[$12], $13) == 0) i[$12]=i[$12] " " $13} END {for(p in i) {print split(i[p], a, " ") " " p " " i[p]}}' maillog|sort -n|grep -v '^1 '

If you see an account for an individual, which is getting connections from dozens or hundreds of IP addresses, that's very possibly a compromised email account.

Note that an end-user with a smartphone will end up with a big bank of IPs connecting to check email.  They'll all have similar IP addresses in most cases.

Friday, May 31, 2013

Track Apache's calls to PHP

Customers often ask how to find out what PHP code is being called.  Sometimes, they're looking to find abusers of PHP email forms - and other times, they're interested in learning which routines are being called the most often.

The following monitoring command will run until you interrupt it with a control-C.

lsof +r 1 -p `ps axww | grep [h]ttpd | awk '{ str=str","$1} END {print str}'`|grep vhosts|grep php

It takes the process IDs of all of the Apache processes and strings them together with commas in between.  Then it calls "lsof", asking it to repeat every second.

"lsof" lists all of the open file descriptors for the processes listed after the "-p" argument.

At the end of the command, we select only those lines that have "vhosts" and "php".  If your website document roots aren't under /var/www/vhosts, you'll want to look for some other string indicating "a file within a website".

Wednesday, May 29, 2013

As a software developer, how can I ensure I remain employable after age 50?

I used to think the same way.  I've been programming UNIX/Linux for around 30 years.  I liked writing code.  I wanted my job to be writing code, and I wanted some company to pay me to do that.

I absolutely LOVE writing code now - because I only write WHAT I want to write, WHEN I want to, and HOW I want to.  (I.e. it's no longer part of my job.  I write code as a hobby now.) 

I absolutely LOVE my job now - it's WAY better than any job I've ever had before - including when I was a consultant, and including when I worked for myself (I was CTO of my own startup some years ago).

My day job is: HELP PEOPLE.  I found a very good fit in customer service.  

I'm now a top-shelf systems administrator, and I leverage my coding skills to solve problems that would make many sysadmins' heads spin.  For example, I was asked to action a db import the other day.  Mid-import, the load on the server went almost to zero, and memory usage started to climb.

The import had dead-locked with the customer's runtime application logic.  

Because of how apache works, and because most customers over-commit apache in terms of how they set MaxClients (they allow Apache's worst-case memory footprint to be larger than their total available memory)... in this sort of a case, it's imperative to act QUICKLY to correct the situation, or the server will very probably crash.

Most sysadmins in that case would immediately stop apache, which I did.  They would then abort the import, probably restart mysql to clear the deadlock, and restart the import.  That, I did not do - it's overkill.

Instead, I stopped apache, ran "mysqladmin processlist > queries", edited the file "queries" in vim and... 
-> deleted the header, the footer and the specific db import query I did NOT want to kill, 
-> issued :1,$s/^|/kill /
-> issued :1,$s/|.*/;/ 
-> wrote the file and exited.  

That gave me a file full of lines like this: 

kill 12345 ; 
kill 67890 ;  

...then I ran "mysql

It was a 4.5G import, so that was a good thing; restarting it would have added hours to the downtime.  
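
For the curious, the same surgery can be done as a one-liner.  This is a sketch of the idea rather than what I typed that day - 99999 stands in for the thread id of the import itself, which you'd read off the processlist first:

mysqladmin processlist | awk -F'|' '$2+0 > 0 && $2+0 != 99999 {print "kill " $2 ";"}' | mysql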

This isn't something your typical dev knows how to do correctly.  It's not something your typical admin knows how to do correctly.  And it's not even something your typical DBA knows how to do correctly.  It's something I knew how to do correctly, leveraging my years of experience.  

I'm sharing this because it shows there's still a need for people who can solve difficult computing problems, accurately and quickly, but outside of the programming domain.  Your experience level may well make you IDEAL for this sort of position, so if you find it at all compelling, I recommend that you:
  • review all of your past positions to see how each and every one of them had "customer service" as some aspect of what they were about
  • rework your resume to exude that aspect of what you did
  • apply for an entry-level position in customer service at a hosting company

The ethos of slicing and dicing logfiles

When a logfile is of reasonable size, you can review it using "view" - a read-only version of "vim".  This gives you flexible searching, and all of the power of vim as you review the logfile.  However, for viewing huge files, instead of editing them in vim directly, try this:

tail -100000 logfile | vim -

That way you're only looking at the last 100,000 lines not the whole file.  On a server with 4GB of RAM, looking at a 6GB logfile in vim without something like the above can be, well... a semi-fatal mistake.

For logfile analysis, I use awk a lot, along with the other usual tools - grep, sed, and so on.  Awk's over the top - totally worth learning.  You can do WAY cool things with it.  For example, I once used grep on an apache access log to find all the SQL injections an attacker had attempted, and wrote that to a tempfile.

Then I used awk to figure out (a) which .php files had been called and how many times each, and (b) what parameters had been used to do the injections.

awk -F\" tells awk to use " as the field separator, so anything to the left of the first " is '$1' and whatever's between the first and second quote is $2, etc.

So awk -F\" '{print $2}' shows me what was inside the first set of quotes on each line.

Using other characters for the field separator let me slice out just the filename from the GET request, then another pass over the file with slightly different code let me slice out just the parameter names.  
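
Roughly, those two passes looked something like this - "injections" being the grep tempfile mentioned above, and the exact field positions depending on your log format:

awk -F\" '{print $2}' injections | awk '{print $2}' | awk -F'?' '{print $1}' | sort | uniq -c | sort -rn
awk -F\" '{print $2}' injections | awk -F'[?&]' '{for (i = 2; i <= NF; i++) {split($i, kv, "="); print kv[1]}}' | sort | uniq -c | sort -rn

The first reports which files were hit and how many times each; the second tallies the parameter names used in the attempts.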

Here again, as you might feel is a resounding theme in my blog, the Linux commandline tools have proven to be immensely useful.

Log Dissector

If you want to see some of awk's more awesome features being leveraged for logfile analysis, take a look at this little program I threw together:
http://paulreiber.github.com/Log-Dissector/

analyzing memory exhaustion

It's really kind of random which processes will be pushed into swap space by the kernel, when it realizes it's low on memory.  Just because a process is swapped out or using a lot of swap, doesn't mean that it is necessarily a problem - in fact, quite often, a process using a lot of swap space is an "innocent bystander".

To find the root cause of memory exhaustion issues, it's helpful to look at how processes are using virtual memory - both physical memory and swap. That way you can see which processes have what "footprint" across the board - not just which are dipping into swap.  

Here are a couple of useful commands for that, and their output on a local machine.

Show me the users using >= 1% of physical memory:

$ ps aux|awk '$1 != "USER" {t[$1]+=$4} END {for (i in t) {print t[i]" "i}}'|sort -n|grep -v ^0
2.4 root
35.6 pbr

Show me what programs are using >= 1% of physical memory:

$ ps aux|awk '$1 != "USER" {t[$11]+=$4} END {for (i in t) {print t[i]" "i}}'|sort -n|grep -v ^0
1 /usr/bin/knotify4
1.1 nautilus
1.3 mono
1.6 /usr/bin/yakuake
2.3 /usr/bin/cli
2.5 kdeinit4:
5.8 /usr/lib/firefox/firefox
7.6 gnome-power-manager

(Really?!? gnome-power-manager? ...that's gotta go away!  Glad I ran this!)

Note the only difference between the two commands above is the column used as the "index" into the t array - i.e. t[$1]+=$4 vs t[$11]+=$4

Of course you can aggregate other columns similarly.  For example:

Show me how many megabytes of virtual memory each non-trivial user is using:

$ ps aux|awk '$1 != "USER" {t[$1]+=$5} END {for (i in t) {print t[i]/1024" "i}}'|sort -n|grep -v ^0
2.19141 daemon
3.26172 103
3.30078 gdm
5.82031 avahi
11.4258 postfix
19.9141 105
33.6055 syslog
365.281 root
2653.86 pbr

Note the differences there are (a) we aggregate column 5 instead of 4, (b) we divide the result by 1024 so we're working with MB instead of KB.

Show me all programs cumulatively using >= 100MB of virtual memory

$ ps aux|awk '$1 != "USER" {t[$11]+=$5} END {for (i in t) {print t[i]/1024" "i}}'|sort -n|egrep ^[0-9]{3}
108.477 /usr/bin/cli
123.414 /usr/lib/indicator-applet/indicator-applet-session
149.359 udevd
154.168 /usr/bin/yakuake
162.402 nautilus
185.516 gnome-power-manager
226.586 kdeinit4:
499.598 /usr/lib/firefox/firefox

If your head is spinning trying to understand the command lines, I'll try to help.

Awk has an awesome feature called "associative arrays".  You can use a string as an index into an array.  No need to initialize it - awk does that for you automagically.  

Let's dissect the awk program I provide on the final commandline above - the one for "Show me all programs cumulatively using >= 100MB of virtual memory":

for each line of input (which happens to be the output of "ps aux")
  if field-1 isn't the string "USER" then
      add the value in field-5 (process-size) to 
      whatever is in the array t at index field-11 (program-name)

at the END of the file
  for each item i in array t
    print the value (t[i] divided by 1024), then a space (" "), then the item itself (i)

All of that output is fed to "sort" with the -n (for numeric) option, then that sorted output is fed to "egrep" which has been told to only print lines that start with at least three numerals. (remember, the goal is to only list programs cumulatively using ">=100MB" ... and 99MB has only two numerals.)

With the basic Linux tools, you can do some pretty amazing things with the output of various commands.  This is an example of what is meant when people speak about the "power of the UNIX shell".

Back to swap space.  Once you have an idea of which processes on your system are using how much virtual memory, and how much physical memory, you'll be in a much better position to assess the actual root cause for any disconcerting swap usage.  As I mentioned, quite often, the processes that get swapped out are NOT the ones that are the real problem.

Very often, Apache or some other process which increases its footprint in response to increased demand will be the root cause of your memory problems.

Which user sends and receives the largest volume of email?

Although awk's associative arrays are nowhere near as intricate or graphically stunning as some other data models, they're over-the-top cool because of how immensely useful they are for basic text transformation.

You can code whatever sort of transformation you want to apply to the "stdout" of any unix/linux command using awk's associative arrays.

For example... here's a command that'll work with ALL of the maillog files - rotated or not, compressed or not, and tell you which users send/receive the largest volumes of email:

zgrep -h "sent=" maillog*| \
sed 's/^.*user=//'| \
sed -e 's/rcvd=//' -e  's/sent=//'| \
awk -F, '{t[$1]=t[$1]+$5+$6; r[$1]=r[$1]+$5; s[$1]=s[$1]+$6}  END {for (i in t) { print t[i]" "s[i]" "r[i]" "i}}' \
|sort -n

Output format is:  

combined-total sent-total received-total email-address.  

Sample output:

11635906 11530222 105684 boss@somecompany.com
33077188 32995397 81791 biggerboss@somecompany.com
41524794 41225163 299631 ceo@somecompany.com
82771501 81433867 1337634 guywhodoesrealwork@somecompany.com

You could have it give you the totals in K or M by simply appending  /1024  or /1048576 to the arguments to the "print" function.

How to be almost-root

Today, most of us own our own Linux computers, or at least, our employers do, but they're signed out to us and dedicated for our use.  

If you have 'root' access on the computer, and consider it basically 'yours' - quite possibly you'll want to be able to look around at ALL of the files on the system, without first having to escalate to 'root'.

It's safer this way, by the way.  You should be able to look at ALL the files without having to escalate to root privilege.  How to do that?

If your filesystem(s) support ACLs, any regular user can be given this level of access.   For linux, the command 'setfacl' can be used to do this:

setfacl -R -m u:whoever:r /

The above recursively modifies access for the user whoever, to include "r".  It applies to ALL files and ALL directories.

setfacl -d -R -m u:whoever:r /

The above recursively modifies the DEFAULT acls for all directories such that they'll give the user whoever read access on any NEW files created in the future. (that's a REALLY REALLY REALLY cool feature!)

Now... the issue gets more complex.  "execute access" means different things for directories than it does for files.  Execute permission on a directory allows the user to list what files are in the directory.  Most people would lump that in with "reading" it.

find / -type d -exec setfacl -m u:whoever:rx {} \;

The above gives both read and execute permission for the user whoever to all directories.

Note, together these aren't perfect regarding NEW content.  The DEFAULT acls concept doesn't differentiate between new files in a directory and new subdirectories in that directory.  So, with the above, any NEW directories created after the "find" is run will have "r" permissions, not "rx" permissions, for the user whoever.


You might set up a nightly cron job to repeat the "find" command above - that'll take care of new directories and ensure you have "x" on them the next day.
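
Something like this in root's crontab would do - 3am is arbitrary, and "whoever" is still our placeholder user:

0 3 * * * find / -type d -exec setfacl -m u:whoever:rx {} \; 2>/dev/null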

If you have questions or concerns about ACLs just let me know and I'll be happy to help as best I can.



What Every Beginning Linux Sysadmin Needs to Know

Many new sysadmins focus on process and technology issues, devouring books and guides.  That's awesome, and important, but it's only the tip of the iceberg.

Let's talk about attitude.  The difference between a good systems administrator and an outstanding systems administrator is attitude.

A sysadmin has to be extremely intelligent.  That goes without saying, pretty much.  If you confuse "affect" and "effect", "server" and "service", and "is it possible to" vs "please do this for me", you'll want to find another profession.  But intelligence alone is FAR from sufficient.  The five H's cover other attributes you'll need to have to be truly outstanding as a sysadmin.

humble

Humble is good.  It means you understand yourself.  You neither over-estimate your abilities nor do you have poor self-esteem.  You know your limits, your strengths and weaknesses, and you're sober-minded.

A humble person assumes the position of a learner during conversation.  You would never ever consider yourself "the smartest person in the room".  You're interested in understanding the other person's perspective, and you make the others in the conversation feel smart and competent.

No job that needs to be done is beneath you.  You're not focused on your title, position, or status relative to others in the organization - instead you're focused on the success of the organization.

You value "the little people" in the organization, and treat them as peers.

honest

An honest person delivers bad news first.  Exaggerating, bragging, or misrepresenting facts are TOTALLY foreign concepts to an honest person.

My employer values honesty extremely highly.  So highly, in fact, that if a co-worker told me, with a straight face, that they had gone over Niagara Falls in a wooden barrel, I would believe them.

Wrapped up with honesty is integrity.  If you make a promise, keep it - no matter how inconvenient, difficult, or personally expensive it is to do so.  

holistic

Don't think "inside the box" - think holistically.  Let's look at an example.  A server with 6 drive bays, currently using only 2 of them, is out of free drive space.  How much time to do you spend finding files/directories that can be deleted?  If the filesystem is using LVM, think instead about adding drives and expanding the filesystem.  Even without LVM, the cost of NOT adding additional drives could well exceed the cost of adding them.

Ticketing systems, used religiously, are awesome for documenting trouble as it happens.  One often-overlooked feature is the timestamping.  

Expensive alternatives to fixing recurring problems are hard to justify, but the timestamping makes it possible.  You can assess how much sysadmin time was spent on a problem by subtracting the start-time from the end-time.  Get an "average cost per hour" for sysadmin time in your organization, and multiply.  You can extrapolate out an average cost per month of not solving the problem in a more permanent way.
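
For example, if the tickets show the same failure eating six hours of sysadmin time every month, at $100 per hour that's $600 a month - so a $3,000 permanent fix pays for itself in five months.  (The numbers are made up; the arithmetic is the point.)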

Hopefully you can identify recurring problems to management somewhere below the "50%" mark - that is, before the accumulated cost has reached half the cost of fixing the problem for good.  You'll be able to say:  "This is an ongoing problem.  So far it has cost us X.  By [some date] it will have cost us Y, which is the cost of solving the problem permanently.  If we spend Y to fix the problem now, then by [that date] we'll have broken even, and we'll be SAVING money every month after that."  This is exactly what your boss needs to be able to justify the expense up the ladder.

The reason this H is so important is that, even when there are people in the organization charged with implementing these sorts of cost-saving measures, they often don't have the information, or the holistic perspective into what YOU are doing, that's needed to identify what should be done in a particular situation.

hungry

Don't dwell on your past success.  Set your standards higher and higher.  Have an insatiable appetite for new information.  Be an active listener in meetings - ask questions, take notes.  Follow up on open action items.  Put your shoulder into everything you do.  You're being paid for 8 hours per day... deliver at LEAST 8 hours' worth of value.  And if that took you 6 hours... deliver just as hard for the next two, because you've set your bar too low.

helpful

Having a helpful attitude is crucial.  ALWAYS be helpful.  ALWAYS be part of "the solution", not part of "the problem".  ALWAYS ensure that everyone in the loop is helped sufficiently.

It's useful here to have the ability to say no using the letters "y", "e", and "s".

For example, say a customer, or manager, or end-user, wants to solve a problem in a way which simply won't work.  YOU know it won't work, all of your co-workers know it won't work, and the whole world except that one person knows... it won't work.

How to say "no"?  Find three alternatives that WILL work, and present those.

Depending on the dynamic, you might take the following stance.  "The approach you're recommending isn't viable.  We can go into why, if necessary, but it might be more effective to look at alternatives.  Here are three good ones: ..."

So, instead of showing them why what they're asking for won't work (and making them feel less competent in the process) you're putting them in a position where they can CHOOSE a workable solution from the options.

They'll feel better about that, and the problem will get solved in a workable manner.

Another way to be helpful is to share not only data, but a quick statement of how you generated the data.  Consider the tremendous difference:

Dear Customer,

You'll need to clear some drive space - your drives are full.  Here is a list of files >100MB on your server for your review, and possible deletion, compression, or relocation to another device.

9999MB /really/big/file
1234MB /other/big/file
123MB /somewhat/big/file

Regards,
-your helpful sysadmin

...compared to...

Dear Customer,

You'll need to clear some drive space - your drives are full.  Here is a list of files >100MB on your server for your review, and possible deletion, compression, or relocation to another device.

[root ~]% find / -xdev -type f -size +102400k -exec stat -c'%s|%n' {} \; | awk -F\| '{ print $1/1024/1024 "MB " ": " $2 }' | sort -nr

9999MB /really/big/file
1234MB /other/big/file
123MB /somewhat/big/file

Regards,
-your helpful sysadmin

In the first version, you've given the customer some information to work with.  However, they have no idea how you got that information.  You've left them powerless to search, for example, for files >50MB as well.

In the second version, you've provided just enough information that, if they're at all capable on the commandline, they'll be able to run another scan themselves.

THAT is helpful.  Forcing them to come back to you for another listing of files, this time >50MB, might on the surface seem like you're helping them more - but what you're really doing is forcing them to be dependent upon you.

     ~~*~~

So... to recap:  humble, honest, holistic, hungry, and helpful.  Integrate these 5 H's into your very being... and you're well on your way to becoming an outstanding systems administrator.