Friday, May 31, 2013

Track Apache's calls to PHP

Customers often ask how to find out what PHP code is being called.  Sometimes they're looking to find abusers of PHP email forms - and other times, they're interested in learning which routines are called most often.

The following monitoring command will run until you interrupt it with a control-C.

lsof +r 1 -p `ps axww | grep [h]ttpd | awk '{ str=str","$1} END {print str}'`|grep vhosts|grep php

It takes the process IDs of all of the Apache processes and strings them together with commas in between.  Then it calls "lsof", asking it to repeat every second.

"lsof" lists all of the open file descriptors for the processes listed after the "-p" argument.

At the end of the command, we select only those lines that have "vhosts" and "php".  If your website document roots aren't under /var/www/vhosts, you'll want to grep for some other string indicating "a file within a website".
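For example, if your document roots happened to live under /home instead (a hypothetical layout), you'd just swap the filter:

lsof +r 1 -p `ps axww | grep [h]ttpd | awk '{ str=str","$1} END {print str}'`|grep home|grep php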

Wednesday, May 29, 2013

As a software developer, how can I ensure I remain employable after age 50?

I used to think the same way.  I've been programming UNIX/Linux for around 30 years.  I liked writing code.  I wanted my job to be writing code, and I wanted some company to pay me to do that.

I absolutely LOVE writing code now - because I only write WHAT I want to write, WHEN I want to, and HOW I want to.  (I.e. it's no longer part of my job.  I write code as a hobby now.) 

I absolutely LOVE my job now - it's WAY better than any job I've ever had before - including when I was a consultant, and including when I worked for myself (I was CTO of my own startup some years ago).

My day job is: HELP PEOPLE.  I found a very good fit in customer service.  

I'm now a top-shelf systems administrator, and I leverage my coding skills to solve problems that would make many sysadmins' heads spin. For example, I was asked to run a db import the other day.  Mid-import, the load on the server went almost to zero, and memory usage started to climb. 

The import had dead-locked with the customer's runtime application logic.  

Because of how Apache works, and because most customers over-commit Apache when they set MaxClients (they allow Apache's worst-case memory footprint to be larger than their total available memory)... in this sort of case, it's imperative to act QUICKLY to correct the situation, or the server will very probably crash.

Most sysadmins in that case would immediately stop Apache, which I did.  They would then abort the import, probably restart MySQL to clear the deadlock, and restart the import.  That, I did not do - it's overkill.

Instead, I stopped Apache, ran "mysqladmin processlist > queries", edited the file "queries" in vim and... 
-> deleted the header, the footer and the specific db import query I did NOT want to kill, 
-> issued :1,$s/^|/kill /
-> issued :1,$s/|.*/;/ 
-> wrote the file and exited.  

That gave me a file full of lines like this: 

kill 12345 ; 
kill 67890 ;  

...then I ran "mysql

It was a 4.5G import, so that was a good thing; restarting it would have added hours to the downtime.  
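For what it's worth, roughly the same surgery can be scripted rather than done in vim - a sketch, not what I ran, with 424242 standing in for the import's thread ID:

mysql -N -e "SHOW PROCESSLIST" | awk -v keep=424242 '$1 != keep {print "KILL "$1";"}' | mysql -f
...generates a KILL statement for every thread except the import and feeds them straight back to mysql (the -f keeps it going if a thread has already exited)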

This isn't something your typical dev knows how to do correctly.  It's not something your typical admin knows how to do correctly.  And it's not even something your typical DBA knows how to do correctly.  It's something I knew how to do correctly, leveraging my years of experience.  

I'm sharing this because it shows there's still a need for people who can solve difficult computing problems, accurately and quickly, but outside of the programming domain.  Your experience level may well make you IDEAL for this sort of position, so if you find it at all compelling, I recommend that you:
  • review all of your past positions to see how each and every one of them had "customer service" as some aspect of what they were about
  • rework your resume to exude that aspect of what you did
  • apply for an entry-level position in customer service at a hosting company

Learning Vim From the Inside

Vim improves on vi in countless ways.  As a curious vi expert, I wanted to know exactly what those were, so I dove into the source code.  In doing so, I was compelled to create this online class a few years back: http://curiousreef.com/class/learning-vim-from-the-inside/

It's still going strong.  New students join every month.  For the most part, it runs itself now... but if you hit any hurdles while working your way through the content, please reach out to me and let me know.

The ethos of slicing and dicing logfiles

When a logfile is of reasonable size, you can review it using "view" - a read-only version of "vim".  This gives you flexible searching, and all of the power of vim as you review the logfile.  However, for viewing huge files, instead of editing them in vim directly, try this:

tail -100000 logfile | vim -

That way you're only looking at the last 100,000 lines, not the whole file.  On a server with 4GB of RAM, looking at a 6GB logfile in vim without something like the above can be, well... a semi-fatal mistake.

For logfile analysis, I use awk a lot, along with the usual tools - grep, sed, and so on.  Awk's over the top - totally worth learning. You can do WAY cool things with it.   For example, I once used grep on an Apache access log to find all the SQL injections an attacker had attempted, and wrote that to a tempfile.

Then I used awk to figure out (a) which .php files had been called and how many times each, and (b) what parameters had been used to do the injections.

awk -F\" tells awk to use " as the field separator, so anything to the left of the first " is '$1' and whatever's between the first and second quote is $2, etc.

So awk -F\" '{print $2}' shows me what was inside the first set of quotes on each line.

Using other characters for the field separator let me slice out just the filename from the GET request, then another pass over the file with slightly different code let me slice out just the parameter names.  
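To make that concrete, here's roughly what those two passes might look like - a sketch, assuming the grep results landed in a tempfile called sqli.log (a made-up name) and that the injections arrived as GET query strings:

awk -F\" '{print $2}' sqli.log | awk '{print $2}' | awk -F'?' '{print $1}' | sort | uniq -c | sort -rn
...which .php files were hit, and how many times each

awk -F\" '{print $2}' sqli.log | awk '{print $2}' | awk -F'?' '{print $2}' | tr '&' '\n' | cut -d= -f1 | sort | uniq -c | sort -rn
...which parameter names carried the injection attempts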

Here again, as you might feel is a resounding theme in my blog, the Linux commandline tools have proven to be immensely useful.

Log Dissector

If you want to see some of awk's more awesome features being leveraged for logfile analysis, take a look at this little program I threw together:
http://paulreiber.github.com/Log-Dissector/

How to learn UNIX/Linux

There are a lot of books... however, I recommend reading the manuals.
I know... it's a lot... but, I did it (3 decades ago) and it works really well.

Here are a few sites with unix manpages:

http://www.ma.utexas.edu/cgi-bin/man-cgi 
http://unixhelp.ed.ac.uk
http://bama.ua.edu/cgi-bin/man-cgi

I'm sure there are other good sites with manpages as well.

I recommend you read the entire manual, end-to-end - but unless you want to do that twice or thrice, you'll want to skip around and learn these commands first.

-> learn bash really well.  Read every single line of the manual, and figure out what the heck they're talking about.  Learn every variation of how to tweak a variable, how to do filename expansion (globbing, as opposed to regular expressions), what all the builtin functions are and how they work.  (There's a quick taste of this just after this list.)

-> learn vi (or vim, if on Linux) - exceedingly awesome editor.  Sure, use a tutorial, and don't bother to dive down the ratholes of macros or settings tweaking unless you're really into those things, 'cause you'll wake up months later wondering what happened.  If you're in a hurry, learn nano instead.  You won't be as happy, but you'll be able to edit files.

-> learn awk.  Awk's associative arrays can help you solve some serious problems.  Once you know how to use awk, you'll never look at a flat file or a "unix pipeline" quite the same.  You can't learn too much awk.

-> learn sed.  Also awesome, both for modifying files on disk and for applying mods to every line of text fed through a "pipeline" to it.
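
Here's the kind of thing I mean about bash - a quick taste of variable tweaking and globbing, using a made-up filename:

f=access.log.2013-05-31.gz

echo ${f%.gz}
...strips the .gz suffix, printing: access.log.2013-05-31

echo ${f##*.}
...strips everything up to the last dot, printing: gz

ls /var/log/messages*
...globbing - filename expansion done by the shell, which is not the same thing as a regular expression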

With those four as "cornerstones", you'll be a UNIX power-user in no time.  They don't all work the same way, or share the same conventions... but you'll get over the hurdles and be all the better a UNIX expert in the end.

ALSO - once logged in - the 'man' and 'apropos' commands are your friends.  Read, learn, experiment, and build up your commandline/pipelines piece by piece.  Start small, and build.

Example of "building up" a command line:

find . -type f -print
...show me all the files in the current directory and down

find . -type f -print0 | xargs -0 file
...tell me what kind of file each of them is (and, with -print0 and -0, handle filenames with spaces in them properly)

find . -type f -print0 | xargs -0 file | egrep -v 'GIF|JPEG|PNG|PDF'
...show me just the files that are NOT gifs, jpegs, pngs, or pdfs.

UNIX commands are kind of like LEGO elements - you can plug them together in cool ways.  So, get used to building up pipelines that do what you want.

I guess, most of all... don't expect to be given a guided tour.  


Instead, treat UNIX like a huge machine, where each part of the machine does have a manual and can be understood, and make it a habit, every day, to spend time wandering around in that machine, exploring it end-to-end, corner-to-corner, until you know it well enough to call yourself a UNIX power-user.

analyzing memory exhaustion

It's really kind of random which processes will be pushed into swap space by the kernel when it realizes it's low on memory.  Just because a process is swapped out, or is using a lot of swap, doesn't mean it's necessarily the problem - in fact, quite often, a process using a lot of swap space is an "innocent bystander".

To find the root cause of memory exhaustion issues, it's helpful to look at how processes are using virtual memory - both physical memory and swap. That way you can see which processes have what "footprint" across the board - not just which are dipping into swap.  

Here are a couple of useful commands for that, and their output on a local machine.

Show me the users using >= 1% of physical memory:

$ ps aux|awk '$1 != "USER" {t[$1]+=$4} END {for (i in t) {print t[i]" "i}}'|sort -n|grep -v ^0
2.4 root
35.6 pbr

Show me what programs are using >= 1% of physical memory:

$ ps aux|awk '$1 != "USER" {t[$11]+=$4} END {for (i in t) {print t[i]" "i}}'|sort -n|grep -v ^0
1 /usr/bin/knotify4
1.1 nautilus
1.3 mono
1.6 /usr/bin/yakuake
2.3 /usr/bin/cli
2.5 kdeinit4:
5.8 /usr/lib/firefox/firefox
7.6 gnome-power-manager

(Really?!? gnome-power-manager? ...that's gotta go away!  Glad I ran this!)

Note the only difference between the two commands above is the column used as the "index" into the t array - i.e. t[$1]+=$4 vs t[$11]+=$4

Of course you can aggregate other columns similarly.  For example:

Show me how many megabytes of virtual memory each non-trivial user is using:

$ ps aux|awk '$1 != "USER" {t[$1]+=$5} END {for (i in t) {print t[i]/1024" "i}}'|sort -n|grep -v ^0
2.19141 daemon
3.26172 103
3.30078 gdm
5.82031 avahi
11.4258 postfix
19.9141 105
33.6055 syslog
365.281 root
2653.86 pbr

Note the differences there are (a) we aggregate column 5 instead of 4, (b) we divide the result by 1024 so we're working with MB instead of KB.

Show me all programs cumulatively using >= 100MB of virtual memory

$ ps aux|awk '$1 != "USER" {t[$11]+=$5} END {for (i in t) {print t[i]/1024" "i}}'|sort -n|egrep ^[0-9]{3}
108.477 /usr/bin/cli
123.414 /usr/lib/indicator-applet/indicator-applet-session
149.359 udevd
154.168 /usr/bin/yakuake
162.402 nautilus
185.516 gnome-power-manager
226.586 kdeinit4:
499.598 /usr/lib/firefox/firefox

If your head is spinning trying to understand the command lines, I'll try to help.

Awk has an awesome feature called "associative arrays".  You can use a string as an index into an array.  No need to initialize it - awk does that for you automagically.  

Let's dissect the awk program I provide on the final commandline above - the one for "Show me all programs cumulatively using >= 100MB of virtual memory"

for each line of input (which happens to be the output of "ps aux")
  if field-1 isn't the string "USER" then
      add the value in field-5 (process-size) to 
      whatever is in the array t at index field-11 (program-name)

at the END of the file
  for each item i in array t
    print the value (t[i] divided by 1024), then a space (" "), then the item itself (i)

All of that output is fed to "sort" with the -n (for numeric) option, then that sorted output is fed to "egrep" which has been told to only print lines that start with at least three numerals. (remember, the goal is to only list programs cumulatively using ">=100MB" ... and 99MB has only two numerals.)

With the basic Linux tools, you can do some pretty amazing things with the output of various commands.  This is an example of what is meant when people speak about the "power of the UNIX shell".

Back to swap space.  Once you have an idea of which processes on your system are using how much virtual memory, and how much physical memory, you'll be in a much better position to assess the actual root cause for any disconcerting swap usage.  As I mentioned, quite often, the processes that get swapped out are NOT the ones that are the real problem.

Very often, Apache or some other process which increases its footprint in response to increased demand will be the root cause of your memory problems.
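
If Apache is the suspect, a quick way to gauge its aggregate footprint - assuming the worker processes are named apache2 or httpd - is something like:

ps aux | awk '/[a]pache2|[h]ttpd/ {rss+=$6; n++} END {print n" workers using "rss/1024" MB of resident memory"}'
...sums the RSS column across every Apache worker and reports the total in MB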

Which user sends and receives the largest volume of email?

Although awk's associative arrays are nowhere near as intricate or graphically stunning as some other data models, they're over-the-top cool, because of how immensely useful they are for basic text transformation.

You can code whatever sort of transformation you want to apply to the "stdout" of any unix/linux command using awk's associative arrays.

For example... here's a command that'll work with ALL of the maillog files - rotated or not, compressed or not, and tell you which users send/receive the largest volumes of email:

zgrep -h "sent=" maillog*| \
sed 's/^.*user=//'| \
sed -e 's/rcvd=//' -e  's/sent=//'| \
awk -F, '{t[$1]=t[$1]+$5+$6; r[$1]=r[$1]+$5; s[$1]=s[$1]+$6}  END {for (i in t) { print t[i]" "s[i]" "r[i]" "i}}' \
|sort -n

Output format is:  

combined-total sent-total received-total email-address.  

Sample output:

11635906 11530222 105684 boss@somecompany.com
33077188 32995397 81791 biggerboss@somecompany.com
41524794 41225163 299631 ceo@somecompany.com
82771501 81433867 1337634 guywhodoesrealwork@somecompany.com

You could have it give you the totals in K or M by simply appending  /1024  or /1048576 to the arguments to the "print" function.
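
For example, to report everything in megabytes, the END block above would become:

END {for (i in t) { print t[i]/1048576" "s[i]/1048576" "r[i]/1048576" "i}}
...the same report, with all three totals divided down to MB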

How to make a web API from a random open source project

People ask the craziest questions sometimes.   Here were the steps I came up with in answering this one.

1) analyze the open source package
Your goal is to lay it out as a small set of arrays of functions.  In typical API fashion, you'll have a small set of initialization functions (think "open"), a large set of functions that can be called once the service is lashed up (think "seek", "read", "write", etc), and a small set of finalization functions (think "close").

There might be more states than just initialize-use-finalize - but keep that state transition diagram as simplistic as possible.  Complexity is your enemy.

You'll want to figure out how you can do as minor of a reorg to the current codebase as possible, to keep any changes you ask to have pushed upstream to a minimum.  

Also, you'll want to work initially with a COPY of all the functions that has each one "stubbed out" - implementations, with all the arguments and types and such, but functions that simply print out the fact that they've been called. For integration simplicity, you may want to have your array of copies of the functions eventually just call the actual functions.

2) add your interface(s) to the mix
What standards/protocols do you want your API to be layered over?  Build new  code that initializes and interfaces with those protocols, and calls the array of functions.  For now, just have the interfaces call "stub" implementations of the functions that print out the fact that they've been called.  

3) build a full array of automated tests for the API
You might want to do this BEFORE #2.  But you need to do it.  Let the computer do the work - running your API through a rigorous set of tests, automatically, with every build/release.

4) hook your interfaces to the actual open source package
Once you have the API working right over the protocol(s) in question, and the tests are all working flawlessly and printing out that they're calling your stubbed functions, then you'll want to either link to the REAL functions instead of the stubs, or as I mentioned above, have the stubs start calling the real functions (one more level of indirection).  Whether to eliminate the stubs or keep them will depend on how much "glue" is needed in between your API and the original functions.  If a bunch of simplification is needed - where you have one function calling 3 or 4 from the open source system to get some job done - convert the stubs into "callers" rather than eliminating them.
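
To illustrate the stub-and-dispatch idea from steps 1, 2 and 4 - language aside (your real code will be in whatever the package is written in) - here's a toy sketch in shell:

# each stub just announces it was called; later, have them call (or be replaced by) the real implementations
api_open()  { echo "stub: api_open $*";  }
api_read()  { echo "stub: api_read $*";  }
api_close() { echo "stub: api_close $*"; }

# the "interface" layer: route an incoming operation name to the matching function
dispatch() {
    local op=$1; shift
    "api_$op" "$@"
}

dispatch open /tmp/example.dat
dispatch read 0 4096
dispatch close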

5) debug things
Inevitably, code that wasn't designed for a particular usage model will hiccup.  Debugging someone else's code is never fun... but, you didn't have to write it in the first place, so consider the time savings.

6) publish
The team of people who developed the original code, and their power users, are an awesome initial audience for your new API.  Write a concise but informative introduction to what the API does, and share it on mailing lists related to the various technologies of the domain.

7) seek feedback
Don't think you're done - look for ways to improve and extend the API.   Let it grow and flourish.

8) let me know how it turns out
I'm sure I've missed something or other in the above, so let me know where I made a molehill out of a mountain.

What's the purpose of a Linux daemon?


Many people confuse services and daemons.  Services listen on ports.  Daemons are a kind of process.  Services can be daemons.  Daemons don't need to be services.

http://www.steve.org.uk/Referenc...

The above URL details the steps a program should take to become a daemon.

Reasons you're doing those things:
  • disassociate from the parent process
  • disassociate from the controlling terminal
  • chdir to / to disassociate from the directory the process was started in
  • umask 0 to ignore whatever umask you may have inherited
  • close your filedescriptors and reopen specific ones to your liking

How to Daemonize in Linux provides code examples in C.
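
If you just want the practical effect from a shell script, rather than doing it properly in C, something in this spirit covers most of the list above (a rough sketch - /usr/local/bin/monitor.sh and the log path are made-up names):

cd / && umask 0 && setsid /usr/local/bin/monitor.sh </dev/null >>/var/log/monitor.log 2>&1 &
...chdir to /, reset the umask, detach from the controlling terminal with setsid, and re-point stdin/stdout/stderr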

So... why do this?  

There are a couple of good reasons.  A daemon can be a service as I mentioned above.  Daemonizing a service is a great idea, so it can stay running as long as is desired.

Another good reason for making a program a daemon is that it'll keep running even when you log out.  You can disassociate functionality from whether you're logged in or not.  Once you run it, it'll stay running until it's explicitly killed, or a bug causes it to crash.

Monitoring a system is a good reason to use a daemon.  Cron can run processes every minute - but if you need tighter granularity than that, cron can't help.  A daemon can.  With a daemon, you can set up whatever timing you want in your "main loop".
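
A minimal sketch of such a main loop, in shell - the 5-second interval, the 90% threshold, and the log path are all arbitrary choices:

while true; do
    used=$(df -P / | awk 'NR==2 {print $5+0}')
    if [ "$used" -gt 90 ]; then
        echo "$(date '+%F %T') root filesystem at ${used}% - time to investigate" >> /var/log/diskwatch.log
    fi
    sleep 5
done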

You might watch for files to exist, or not exist, or drives to be mounted or unmounted, or any number of other things, using inotify or other means of checking what's going on.

Daemons can be pretty darn useful!

Why don't we have a way in Linux to know when a particular file was created?

Linux DOES have a way!  

The various filesystems have what they have, no more, no less - pondering why they are the way they are isn't productive.  They don't track creation-time metadata properly, and that's that.  The GREAT news is that the design of a workable solution isn't complex at all.  It's pretty straightforward.

The inotify kernel subsystem has been part of Linux since 2005 but it's still relatively unknown.  You can use it to learn when new files are added to directories, among other things.

So, if you choose to solve this problem, you'll build a daemon to monitor new files as they're created, and put their create times into a dataset you can later query.

Start with logic which recurses over whatever directory trees you wish to track, creating "inotify watches" on each directory.

Use a loop which calls "select" across that large array of file descriptors, one-per-directory, and reads the inotify events from the individual fds as they happen.  IN_CREATE events are the ones you'll be looking for - those indicate new files were created.  

Capture the ctime of the file as soon as you receive the IN_CREATE event indicating it was created, and, voila, you have its "cr_time".

Next, implement, in whatever way you prefer, a persistent associative array mapping filenames to creation timestamps.  

You might also implement the inverse, mapping creation timestamps to the file or files which were created at that time, to whatever granularity you prefer.

You can then query the creation time for a given file quite straightforwardly, and if you've implemented the inverse as I mentioned, you can query which files were created between two timestamps as well.
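
If you'd rather not write the watch/select loop yourself, the inotifywait tool from the inotify-tools package does the heavy lifting - a minimal sketch, with the watched directory and log path chosen arbitrarily:

inotifywait -m -r -e create --timefmt '%s' --format '%T %w%f' /var/www \
    | while read ts path; do
          echo "$ts $path" >> /var/log/file-creation.log
      done
...every file created under /var/www gets logged with its creation timestamp, ready to be queried later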

If you named it "pfcmd", short for "Paul's File Creation Monitor Daemon", I wouldn't mind one bit.  :-)

How to be almost-root

Today, most of us own our own Linux computers, or at least, our employers do, but they're signed out to us and dedicated for our use.  

If you have 'root' access on the computer, and consider it basically 'yours' - quite possibly you'll want to be able to look around at ALL of the files on the system, without first having to escalate to 'root'.

It's safer this way, by the way.  You should be able to look at ALL the files without having to escalate to root privilege.  How to do that?

If your filesystem(s) support ACLs, any regular user can be given this level of access.   For Linux, the command 'setfacl' can be used to do this:

setfacl -R -m u:whoever:r /

The above recursively modifies access for the user whoever, to include "r".  It applies to ALL files and ALL directories.

setfacl -d -R -m u:whoever:r /

The above recursively modifies the DEFAULT acls for all directories such that they'll give the user whoever read access on any NEW files created in the future. (that's a REALLY REALLY REALLY cool feature!)

Now... the issue gets more complex.  "Execute" access means different things for directories than it does for files.  Execute permission on a directory allows the user to list what files are in the directory.  Most people would lump that in with "reading" it.

find / -type d -exec setfacl -m u:whoever:rx {} \;

The above gives both read and execute permission for the user whoever to all directories.

Note, together these aren't perfect regarding NEW content.  The DEFAULT acls concept doesn't differentiate between new files in a directory and new subdirectories in that directory.  So, with the above, any NEW directories created after the "find" is run will have "r" permissions, not "rx" permissions, for the user whoever.


You might set up a nightly cron job to repeat the "find" command above - that'll take care of new directories and ensure you have "x" on them the next day.
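
In root's crontab, that nightly job might look like this (3 AM chosen arbitrarily):

0 3 * * * find / -type d -exec setfacl -m u:whoever:rx {} \;
...re-applies read and execute for the user whoever to any directories created since the last run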

If you have questions or concerns about ACLs just let me know and I'll be happy to help as best I can.



What Every Beginning Linux Sysadmin Needs to Know

Many new sysadmins focus on process and technology issues, devouring books and guides.  That's awesome, and important, but it's only the tip of the iceberg.

Let's talk about attitude.  The difference between a good systems administrator and an outstanding systems administrator is attitude.

A sysadmin has to be extremely intelligent.  That goes without saying, pretty much.  If you confuse "affect" and "effect", "server" and "service", and "is it possible to" vs "please do this for me", you'll want to find another profession.  But intelligence alone is FAR from sufficient.  The five H's cover other attributes you'll need to have to be truly outstanding as a sysadmin.

humble

Humble is good.  It means you understand yourself.  You neither over-estimate your abilities nor do you have poor self-esteem.  You know your limits, your strengths and weaknesses, and you're sober-minded.

A humble person assumes the position of a learner during conversation.  You would never ever consider yourself "the smartest person in the room".  You're interested in understanding the other person's perspective, and you make the others in the conversation feel smart and competent.

No job that needs to be done is beneath you.  You're not focused on your title, position, or status relative to others in the organization - instead you're focused on the success of the organization.

You value "the little people" in the organization, and treat them as peers.

honest

An honest person delivers bad news first.  Exaggerating, bragging, or misrepresenting facts are TOTALLY foreign concepts to an honest person.

My employer values honesty extremely highly.  So highly, in fact, that if a co-worker told me, with a straight face, that they had gone over Niagara Falls in a wooden barrel, I would believe them.

Wrapped up with honesty is integrity.  If you make a promise, keep it - no matter how inconvenient, difficult, or personally expensive it is to do so.  

holistic

Don't think "inside the box" - think holistically.  Let's look at an example.  A server with 6 drive bays, currently using only 2 of them, is out of free drive space.  How much time to do you spend finding files/directories that can be deleted?  If the filesystem is using LVM, think instead about adding drives and expanding the filesystem.  Even without LVM, the cost of NOT adding additional drives could well exceed the cost of adding them.

Ticketing systems, used religiously, are awesome for documenting trouble as it happens.  One often-overlooked feature is the timestamping.  

Expensive, permanent fixes for recurring problems are hard to justify - but the timestamping makes it possible.  You can assess how much sysadmin time was spent on a problem by subtracting the start-time from the end-time.  Get an "average cost per hour" for sysadmin time in your organization, and multiply.  From there you can extrapolate an average cost per month of not solving the problem in a more permanent way.

Hopefully you can bring recurring problems to management's attention somewhere below the "50%" mark - before the ongoing cost has reached half the cost of a permanent fix.   You'll be able to say:  "This is an ongoing problem.  So far it has cost us X.  By [some future date] it will have cost us Y, which is the cost of solving the problem.  If we spend Y to fix the problem now, then by [that date] we'll have broken even, and we'll be SAVING money every month after that."  This is exactly what your boss needs to be able to justify the expense up the ladder.  
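To put purely illustrative numbers on it: if a recurring problem eats 10 sysadmin-hours a month at $80/hour, it's costing $800 a month; a $4,800 permanent fix breaks even in six months and saves $800 every month after that.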

The reason this H is so important is that quite often, even though there may be people in the organization charged with implementing these sorts of cost-saving measures, they don't have the necessary information or holistic perspective into what YOU are doing, to be able to identify what needs to be done for particular situations.

hungry

Don't dwell on your past success.  Set your standards higher and higher.  Have an insatiable appetite for new information.  Be an active listener in meetings - ask questions, take notes.  Follow up on open action items.  Put your shoulder into everything you do.  You're being paid for 8 hours per day... deliver at LEAST 8 hours' worth of value.  And if that took you 6 hours... deliver just as hard for the next two, because you've set your bar too low.  

helpful

Having a helpful attitude is crucial.  ALWAYS be helpful.  ALWAYS be part of "the solution", not part of "the problem".  ALWAYS ensure that everyone in the loop is helped sufficiently.

It's useful here to have the ability to say no using the letters "y", "e", and "s".

For example, say a customer, or manager, or end-user, wants to solve a problem in a way which simply won't work.  YOU know it won't work, all of your co-workers know it won't work, and the whole world except that one person knows... it won't work.

How to say "no"?  Find three alternatives that WILL work, and present those.

Depending on the dynamic, you might take the following stance.  "The approach you're recommending isn't viable.  We can go into why, if necessary, but it might be more effective to look at alternatives.  Here are three good ones: ..."

So, instead of showing them why what they're asking for won't work (and making them feel less competent in the process) you're putting them in a position where they can CHOOSE a workable solution from the options.

They'll feel better about that, and the problem will get solved in a workable manner.

Another way to be helpful is to share not only data, but a quick statement of how you generated the data.  Consider the tremendous difference:

Dear Customer,

You'll need to clear some drive space - your drives are full.  Here is a list of files >100MB on your server for your review, and possible deletion, compression, or relocation to another device.

9999MB /really/big/file
1234MB /other/big/file
123MB /somewhat/big/file

Regards,
-your helpful sysadmin

...compared to...

Dear Customer,

You'll need to clear some drive space - your drives are full.  Here is a list of files >100MB on your server for your review, and possible deletion, compression, or relocation to another device.

[root ~]% find / -xdev -type f -size +102400k -exec stat -c'%s|%n' {} \; | awk -F\| '{ print $1/1024/1024 "MB " ": " $2 }' | sort -nr

9999MB /really/big/file
1234MB /other/big/file
123MB /somewhat/big/file

Regards,
-your helpful sysadmin

In the first version, you've given the customer some information to work with.  However, they have no idea how you got that information.  You've left them powerless to search, for example, for files >50MB as well.

In the second version, you've provided just enough information that if they're at all capable on the commandline, they'll be able to run another scan themselves.

THAT is helpful.  Forcing them to come back to you for another listing of files, this time >50MB, might on the surface seem like you're helping them more - but what you're really doing is forcing them to be dependent upon you.

     ~~*~~

So... to recap:  humble, honest, holistic, hungry, and helpful.  Integrate these 5 H's into your very being... and you're well on your way to becoming an outstanding systems administrator.