Sunday, May 26, 2013

The UNIX Swiss Army Knife

I'm a huge fan of awk's associative array handling.  Here's an example that leverages this feature - summarizing one output field against another.

ps aux | awk \
'NR>1 {s[$11]=s[$11]+$4} END {for (i in s) {print s[i] " " i}}' \
| sort -n | grep -v '^0 '

This takes the output of 'ps aux' and sums the percentage of memory used ($4) for each process name ($11).  The process name is used as the index into an associative array named 's'.  The array iteration within the END clause (and thus the output) is in no particular order, so sorting the output is helpful.  There are other approaches - see the link at the end of this post for one alternative.  The grep at the very end of the pipeline omits processes whose summed memory usage is reported as 0.

The end result will look something like this:

2.8 bash
18.3 mysqld
70.5 httpd

To sum the CPU being used instead of memory, just use $3 instead of $4.
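Spelled out, the CPU version of the pipeline looks like this:

ps aux | awk \
'NR>1 {s[$11]=s[$11]+$3} END {for (i in s) {print s[i] " " i}}' \
| sort -n | grep -v '^0 '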

To summarize by user ID instead of by program name, just use $1 instead of $11 (in both places it's mentioned, of course).
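And the per-user memory summary becomes:

ps aux | awk \
'NR>1 {s[$1]=s[$1]+$4} END {for (i in s) {print s[i] " " i}}' \
| sort -n | grep -v '^0 '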

The same technique can be used on log files - for example, for most common apache access_log formats, you can quickly sum how many bytes have been transferred to each IP address, or figure out which IPs have been requesting the same page over and over.
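Here's a sketch of the byte-counting version, assuming the common/combined log format where the client IP is field 1 and the response size in bytes is field 10, reading from a file named access_log:

awk '{bytes[$1] += $10} END {for (ip in bytes) print bytes[ip] " " ip}' access_log | sort -rn | head

Sorting in reverse numeric order puts the heaviest consumers at the top, and head keeps the list short.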

(The trick for figuring out which IPs are getting the same pages over and over is to concatenate the IP and the page name into a single string, use THAT as the index into the array, and simply increment a counter at that index.)
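Again assuming the IP is field 1 and the requested path is field 7 in your access_log, a minimal version of that counter might look like this:

awk '{count[$1 " " $7]++} END {for (key in count) print count[key] " " key}' access_log | sort -rn | head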

The following is FAR from a one-liner - but it does show some of the cool stuff that can be done with awk's associative arrays: https://github.com/PaulReiber/Log-Dissector

Here's another, somewhat simpler example that uses two associative arrays with the same key, giving us both a counter and a list of entries at a given "index": Paul Reiber's answer to "Which Linux or Windows utility application helps to find duplicated folders?"
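That answer isn't reproduced here, but the two-array idea looks roughly like this when applied to duplicate files rather than folders (just a sketch, assuming md5sum output and filenames without spaces): count[] tracks how many files share a checksum, and names[] accumulates the matching filenames under the same key.

find . -type f -exec md5sum {} + | awk '
{
    count[$1]++                      # how many files share this checksum
    names[$1] = names[$1] "  " $2    # running list of filenames for that checksum
}
END {
    for (h in count)
        if (count[h] > 1)            # report only checksums seen more than once
            print count[h] " copies:" names[h]
}'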
