Monday, January 18, 2021

Ubuntu 20.04 with Panda PAU09 N600 Wireless USB dongle

Greetings fellow Linux hackers!

I ran into a rather hard-to-diagnose issue the other day, so I am sharing the details here to help anyone else who runs into the same or a similar problem.

I have a GNU/Linux workstation in my lab that's running Ubuntu 20.04, configured to download and install updates from Canonical as soon as they're available.  The workstation uses a Panda PAU09 N600 wireless USB dongle for Internet access.  The dongle has been working flawlessly for quite some time.

One recent update included kernel/BIOS-level changes, so a reboot was needed.  After rebooting: BOOM! My internet connection stopped working.  I narrowed the problem down to the USB dongle.

Panda PAU09 Wireless Adaptor


Reviewing dmesg output, I determined there was a driver conflict: two separate drivers were competing for the same hardware.  It appears a second driver was provided by Canonical as part of the update, or perhaps a configuration file changed; I'm not sure.

The symptom: dmesg showed an authentication attempt to the wireless router, which was successful; then, a bit later, it showed a SECOND authentication attempt, which failed.

Looking carefully, I saw that the first authentication was done by driver rt2800usb, whereas the second attempt was done by rt2x00usb. Two different drivers controlling the same hardware doesn't seem right.


I decided to disable one of them, to see if that would resolve things.  (it did!)

Specifically, I added the following lines to the end of /etc/modprobe.d/blacklist.conf:

blacklist rt2x00usb 

blacklist rt2x00lib

One reboot later, my workstation was back on the Internet.
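If you hit something similar and want to confirm which drivers are involved before blacklisting anything, a couple of quick checks look roughly like this (treat these as a sketch - your module names may differ):

# see whether both modules are currently loaded
lsmod | egrep 'rt2800usb|rt2x00usb'

# see which driver logged each authentication attempt
dmesg | egrep -i 'rt2800usb|rt2x00usb' | tail -20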


I didn't do a deep dive on why this occurred, or even on whether the above is the ideal resolution.

What I do know is, it worked.  Hope this helps.


Happy hacking!

-Paul

Tuesday, July 02, 2019

i make the internet work.


i make the internet work.

you're welcome


I look back upon the nearly half-decade I spent with Rackspace, training and coaching and working with the support floor, surrounded by hordes of techs providing fanatical support to Rackspace's customers, just about every day.

Helping, at Rackspace, didn't end with just the customers.  

Rackers support other rackers, and other local San Antonio businesses, and local cultural organizations, and local social groups, and anyone who needs their support.  Rackspace is the kind of place that doesn't turn away a hungry passer-by, when they're having their corporate BBQ events.  They welcome the opportunity to assist.  

Rackers are like that. That's why I love them all as much as I do.  I'm proud to have been a Racker.  I'm proud to STILL call myself a Racker, today.  That's how much one good company helps.


I'm not on the support floor with Rackspace anymore, but I'm still helping their customers.  I help anyone, really, who needs my help, if I can possibly manage it.  That's how I roll.

I get calls occasionally from my old customers, now, asking how I'm doing.  Wow.  They remembered me.  Me.  Just some guy who helped them fix their website.

That's why I do customer support.  Because people reach out, a decade later, just to thank me for having helped them, way-back-when.  Because it mattered.  And I helped. When it mattered.

That's why I do customer support.



Thursday, February 07, 2019

God's Paintbrush

Looks like God went a little overboard with a white paintbrush this morning.



Wednesday, February 06, 2019

Where did January go?


Whazzup!


It's been, well, FOREVER since I posted here on pbrs.blogspot.com

Last year, after nearly 2 decades in operation, I turned down the domain reiber.org and its associated public-facing TWiki, due to identity theft and security concerns.  

I expect that I'll be writing posts here somewhat more frequently, since I no longer have the TWiki as a place to write.

If you receive anything from paul@reiber.org (or, anyone at reiber.org, really) it did not come from me - that domain's no longer mine.

I had IFTTT rules in place to share here on this blog any new pages I added on that TWiki.  Since those pages are no longer online, I've removed the associated blog postings.  Sorry if you were counting on them.

OK, I think that's it for the housekeeping.

Where did January go?!?


2019 finds me in San Jose, CA, in the heart of Silicon Valley, troubleshooting clouds and supporting developers.  Helping people, basically, which is awesome 'cause that's my personal mission.

So, I get to go to work and do what makes my heart happy.  Pretty cool.


On workdays I motor over to a parking lot near my home, hop into a big white bus, flip open my notebook computer, and begin my business day before I'm even on campus, checking on the status of the various cloud components I've been asked to care for.


Maybe these pics help explain why people love living and working in Silicon Valley?



Don't be a stranger!

I have a ton of friends I haven't heard from in forever.  Go ahead and drop me a line to let me know what's going on in your world.
- reiber@gmail.com

Tuesday, May 10, 2016

It only took 11 years...

BitKeeper bk-7.2ce is now open source - that's the community edition of the BitKeeper source-management product. This is a significant move; the technology's pretty awesome and it's great that it's finally being open sourced.

http://www.phoronix.com/scan.php?page=news_item&px=BitKeeper-Open-Source indicates the software has been released under the Apache 2.0 license. Not only does BitKeeper live... Tcl/Tk lives, too!  Who knew? :-)

https://www.bitkeeper.org/

Tuesday, March 15, 2016

Quora helped me help out a quarter million new Linux users

How can I not humblebrag a little about this?

On Quora.com my stats page shows I've topped 2m views on the various answers I've written to questions.    2 million reads.  That's pretty cool.

One answer has nearly a quarter of a million views on it alone:  https://www.quora.com/I-am-shifting-from-Windows-to-Linux-What-are-some-explanations-for-Linux-as-I-am-new-to-it/answer/Paul-Reiber

It's so very awesome that I can help out that many people with figuring out Linux and UNIX, all at once.

Quora really has it going on, in terms of platform solutions.  Nice work!  Thank you, Quora.


Friday, August 09, 2013

Simple Counting Sort, in Python

For fun, I coded this up a while ago - both to review the details of implementing a counting sort (which is super-fast, since it does no comparisons!) and to review the details of implementing a Linux "pipeline" program in python.

Enjoy!
-pbr


https://gist.github.com/PaulReiber/6193485

https://gist.githubusercontent.com/PaulReiber/6193485/raw/7678f484648ad89608166c158ab8b00b98735394/countsort.py
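If you want to try it as a pipeline program, usage looks something like this (assuming you've saved the gist as countsort.py and that it reads whitespace-separated integers on stdin - check the gist for the exact details):

chmod +x countsort.py
# feed it a shuffled stream of integers; it should print them back out in sorted order
shuf -i 1-1000 -n 100 | ./countsort.py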

Topping 100 posts

It's an arbitrary number, 100.  

Ten tens.  Only pertinent to us humans due to the count of digits we have on our hands.  Mathematically, a base is arbitrary, and its square is just as arbitrary.  However, it seems reasonable to highlight that there are indeed over 100 relevant, hopefully helpful Linux-related posts on this blog at this point!

My posts of my notes/impressions of various Linux distros have pushed the count of posts I've made on this blog up over 100, in an otherwise unceremonious fashion, but it feels rather good to have accomplished that, and I look forward to continuing posting helpful stuff about Linux.
(The above confirms... people who abuse punctuation deserve a long sentence.)

If you have questions - anything you've "always wondered about" regarding Linux, anything perplexing, incomprehensible, or impenetrable... please don't hesitate to reach out and ask me.  Most of the posts I've written have resulted from simple questions about how to best use Linux.

Thanks for reading!
-Paul

Exploring ArchLinux 2012

Sensibly, Arch includes vim.  At least it's not forcing us to use 'nano'.

However, package installs were problematic.

# pacman -S mlocate
:: The following packages should be upgraded first :
    pacman
:: Do you want to cancel the current operation
:: and upgrade these packages now? [Y/n] Y

resolving dependencies...
looking for inter-conflicts...

Targets (6): bash-4.2.045-1  filesystem-2013.03-2  glibc-2.17-5  libarchive-3.1.2-1  linux-api-headers-3.8.4-1  pacman-4.1.0-2

Total Installed Size:   51.14 MiB
Net Upgrade Size:       -0.99 MiB

Proceed with installation? [Y/n] Y
(6/6) checking package integrity                                                                                           [##########################################################################] 100%
(6/6) loading package files                                                                                                [##########################################################################] 100%
(6/6) checking for file conflicts                                                                                          [##########################################################################] 100%
error: failed to commit transaction (conflicting files)
filesystem: /etc/profile.d/locale.sh exists in filesystem
filesystem: /usr/share/man/man7/archlinux.7.gz exists in filesystem
Errors occurred, no packages were upgraded.
#

These are not the sorts of problems I enjoy - it's a relatively pointless challenge to figure out how to use something that doesn't seem to want to be used.

Searching about on the internet for answers was no more enjoyable, and just as fruitless.  Evidently, people who are expert with arch wish to remain an exclusive club, and have little interest in communicating HOW to use the package manager or otherwise become proficient with the distro.

I can't say that I would recommend Arch, based on my experiences with it to date.

Exploring Debian 6 - Squeeze

Debian's a reasonable distro.  apt-get installed python, gcc, make, and vim quite handily.

I was a little disappointed to find it doesn't have pstree - and further to that:

# aptitude search pstree
#

I'm perplexed.  No match on a search for pstree?  Doesn't seem reasonable.  Am I missing something? Or are they?  It's a bit frustrating.

ps x --forest

...it's just not the same.

Otherwise, a very reasonable distro.

Exploring Gentoo 12


What do they have against Vim?  
Vim is the default Linux/UNIX editor.  Excluding it on a distro is bordering on criminal.

emerge sys-apps/mlocate

Nothing can be standard, Gentoo must differentiate.  One cannot simply install, or update, or "get" a package, one must emerge it.

However, emerge worked, at least for "locate"... almost as straightforwardly as with RedHat/CentOS/Fedora/Ubuntu.

I found myself in "nano" having issued "visudo".  That's just wrong.   Let's see - can I install vim?

# emerge vim
 * Last emerge --sync was 348d 11h 31m 40s ago.
Calculating dependencies... done!

>>> Verifying ebuild manifests

>>> Starting parallel fetch

>>> Emerging (1 of 6) app-admin/eselect-vi-1.1.7-r1
 * Fetching files in the background. To view fetch progress, run
 * `tail -f /var/log/emerge-fetch.log` in another terminal.
 * vi.eselect-1.1.7.bz2 SHA256 SHA512 WHIRLPOOL size ;-) ...                                                                                                                                         [ ok ]
>>> Unpacking source...
>>> Unpacking vi.eselect-1.1.7.bz2 to /var/tmp/portage/app-admin/eselect-vi-1.1.7-r1/work
>>> Source unpacked in /var/tmp/portage/app-admin/eselect-vi-1.1.7-r1/work
>>> Preparing source in /var/tmp/portage/app-admin/eselect-vi-1.1.7-r1/work ...
 * Applying eselect-vi-1.1.7-prefix.patch ...                                                                                                                                                        [ ok ]
>>> Source prepared.
>>> Configuring source in /var/tmp/portage/app-admin/eselect-vi-1.1.7-r1/work ...
>>> Source configured.
>>> Compiling source in /var/tmp/portage/app-admin/eselect-vi-1.1.7-r1/work ...
>>> Source compiled.
>>> Test phase [not enabled]: app-admin/eselect-vi-1.1.7-r1

>>> Install eselect-vi-1.1.7-r1 into /var/tmp/portage/app-admin/eselect-vi-1.1.7-r1/image/ category app-admin
>>> Completed installing eselect-vi-1.1.7-r1 into /var/tmp/portage/app-admin/eselect-vi-1.1.7-r1/image/


>>> Installing (1 of 6) app-admin/eselect-vi-1.1.7-r1

>>> Emerging (2 of 6) app-admin/eselect-ctags-1.13
>>> Downloading 'http://mirror.usu.edu/mirrors/gentoo/distfiles/eselect-emacs-1.13.tar.bz2'
--2013-07-28 03:06:55--  http://mirror.usu.edu/mirrors/gentoo/distfiles/eselect-emacs-1.13.tar.bz2
Resolving mirror.usu.edu... 129.123.104.64
Connecting to mirror.usu.edu|129.123.104.64|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2013-07-28 03:06:55 ERROR 404: Not Found.

>>> Downloading 'http://mirror.mcs.anl.gov/pub/gentoo/distfiles/eselect-emacs-1.13.tar.bz2'
--2013-07-28 03:06:55--  http://mirror.mcs.anl.gov/pub/gentoo/distfiles/eselect-emacs-1.13.tar.bz2
Resolving mirror.mcs.anl.gov... 2620:0:dc0:1800:214:4fff:fe7d:1b9, 146.137.96.7
Connecting to mirror.mcs.anl.gov|2620:0:dc0:1800:214:4fff:fe7d:1b9|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2013-07-28 03:06:55 ERROR 404: Not Found.

>>> Downloading 'http://gentoo.cities.uiuc.edu/pub/gentoo/distfiles/eselect-emacs-1.13.tar.bz2'
--2013-07-28 03:06:55--  http://gentoo.cities.uiuc.edu/pub/gentoo/distfiles/eselect-emacs-1.13.tar.bz2
Resolving gentoo.cities.uiuc.edu... failed: Name or service not known.
wget: unable to resolve host address `gentoo.cities.uiuc.edu'
>>> Downloading 'http://gentoo.osuosl.org/distfiles/eselect-emacs-1.13.tar.bz2'
--2013-07-28 03:06:56--  http://gentoo.osuosl.org/distfiles/eselect-emacs-1.13.tar.bz2
Resolving gentoo.osuosl.org... 140.211.166.134
Connecting to gentoo.osuosl.org|140.211.166.134|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2013-07-28 03:06:56 ERROR 404: Not Found.

>>> Downloading 'http://ftp.halifax.rwth-aachen.de/gentoo/distfiles/eselect-emacs-1.13.tar.bz2'
--2013-07-28 03:06:56--  http://ftp.halifax.rwth-aachen.de/gentoo/distfiles/eselect-emacs-1.13.tar.bz2
Resolving ftp.halifax.rwth-aachen.de... 137.226.34.42
Connecting to ftp.halifax.rwth-aachen.de|137.226.34.42|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2013-07-28 03:06:56 ERROR 404: Not Found.

>>> Downloading 'http://gentoo.ussg.indiana.edu/distfiles/eselect-emacs-1.13.tar.bz2'
--2013-07-28 03:06:56--  http://gentoo.ussg.indiana.edu/distfiles/eselect-emacs-1.13.tar.bz2
Resolving gentoo.ussg.indiana.edu... 156.56.247.195
Connecting to gentoo.ussg.indiana.edu|156.56.247.195|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2013-07-28 03:06:57 ERROR 404: Not Found.

>>> Downloading 'http://gentoo-distfiles.mirrors.tds.net/distfiles/eselect-emacs-1.13.tar.bz2'
--2013-07-28 03:06:57--  http://gentoo-distfiles.mirrors.tds.net/distfiles/eselect-emacs-1.13.tar.bz2
Resolving gentoo-distfiles.mirrors.tds.net... 216.165.129.135
Connecting to gentoo-distfiles.mirrors.tds.net|216.165.129.135|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2013-07-28 03:06:57 ERROR 404: Not Found.

!!! Couldn't download 'eselect-emacs-1.13.tar.bz2'. Aborting.
 * Fetch failed for 'app-admin/eselect-ctags-1.13', Log file:
 *  '/var/tmp/portage/app-admin/eselect-ctags-1.13/temp/build.log'

>>> Failed to emerge app-admin/eselect-ctags-1.13, Log file:

>>>  '/var/tmp/portage/app-admin/eselect-ctags-1.13/temp/build.log'

 * Messages for package app-admin/eselect-ctags-1.13:

 * Fetch failed for 'app-admin/eselect-ctags-1.13', Log file:
 *  '/var/tmp/portage/app-admin/eselect-ctags-1.13/temp/build.log'

 * GNU info directory index is up-to-date.

 * IMPORTANT: 2 config files in '/etc' need updating.
 * See the CONFIGURATION FILES section of the emerge
 * man page to learn how to update config files.

The above experience gave me ZERO faith in Gentoo.  I had asked to install Vim, yet the errors are about it being unable to install emacs.  Poignant, yet so totally inappropriate!

I can't say I've had a reasonable experience with Gentoo 12 so far.  It hasn't been totally unwieldy, but it's been far from malleable.

Exploring Fedora 17

For anyone who is already familiar with RedHat/Centos, this is a painless distro to adopt - you'll find few if any surprises.

As with the other Rackspace Cloud distros, this one's lean - but not TOO lean.  I found myself needing to install "locate" and "gcc":

yum install mlocate
updatedb
yum install gcc

It makes for a relatively boring blog post, but... I had so few problems with Fedora 17 that I really have nothing more to report.  It just works.

Exploring OpenSUSE 12

Yet more documentation of my Exploring Variants of Linux in the Rackspace Cloud

OpenSUSE package management is via Yast.  Yast wants to be interactive.  It might be possible to do things from the commandline but it seems to work best interactively.

This makes it harder to review what you've done since there's no record of exact command line options.  My root history simply says "yast", with no indication of what I installed.

As such, I don't have a good audit trail for what I've done on the OpenSUSE server.

However, it gave me very few problems and had few if any issues.

OpenSUSE will surprise you if/when you run "pstree -paul"

You'll find init has been replaced by something called "systemd".  As with Ubuntu's "upstart", OpenSUSE's "systemd" replaces the tried-and-true system init scripts with something new and wonderful.

man systemd.special systemd.unit systemd.service systemd.socket systemd.target

...there's quite a bit of learning and reading to be done.  Another day.   As with Ubuntu Server, were I forced to work with and manage a Linux other than RedHat/CentOS, I'd be quite happy with OpenSUSE.

Exploring Ubuntu 12 Server

As I promised in Exploring Variants of Linux I'm following up with my impressions of, and notes on, various Linux distros.  This is the first of those "general impressions".

If you're a RedHat/CentOS centric Linux user, hopefully these posts will help you over the first few hurdles you might find on the various other distros.

As with all of the Rackspace Cloud distros, the Ubuntu 12 Server distro is very lean.  I noticed a few relatively important tools were not installed.  They're not "required" but they're really handy, so my first step was to install them:

apt-get install mlocate
updatedb
apt-get install make
apt-get install gcc

I noticed only one service running which is not needed - "whoopsie" - so I turned it off.

root@pbr-ubuntu12:~# cat /etc/default/whoopsie 
[General]
report_crashes=true
root@pbr-ubuntu12:~# sed -i 's/report_crashes=true/report_crashes=false/' /etc/default/whoopsie 
root@pbr-ubuntu12:~# cat /etc/default/whoopsie 
[General]
report_crashes=false
root@pbr-ubuntu12:~# sudo service whoopsie stop
root@pbr-ubuntu12:~# 

Upgrading to the latest Ubuntu was lengthy and verbose, including a full-screen interaction with a pink background... but it was functional:

apt-get update
apt-get upgrade
do-release-upgrade

Upstart's quite a bit different from the standard SysV init script approach, but easy enough to get accustomed to.

man upstart-events

...neat.  Upstart's pretty darn powerful, in fact.

 General impression?  If I was forced to use something other than a RedHat/CentOS distro, I'd be quite happy with Ubuntu Server.

Friday, July 26, 2013

Exploring Variants of Linux

Rackspace Cloud Linux

One of the very cool aspects of the Rackspace Cloud is the number of Linux/GNU variants.  As I'm already very intimately familiar with Red Hat / CentOS, I decided to take a look at the other seven.

Dealing with an array of cloud servers can be a little tedious, so I wrote a little program to run commands on all of them for me:  github_gist:/PaulReiber/run

This article documents my exploration of Ubuntu, Arch, FreeBSD, Debian, Gentoo, Fedora, and openSUSE.

$ run 'hostname'|egrep ^pbr\|stdout
pbr_ubuntu12.10_512
  @stdout="pbr-ubuntu12.10-512\r\n">]
pbr_freebsd9_512
  @stdout="pbr-freebsd9-512\r\n">]
pbr_opensuse12.1_512
  @stdout="pbr-opensuse12.1-512\r\n">]
pbr_fedora17_512
  @stdout="pbr-fedora17-512\r\n">]
pbr_gentoo12.3_512
  @stdout="pbr-gentoo12.3-512\r\n">]
pbr_debian6_512
  @stdout="pbr-debian6-512\r\n">]
pbr_arch2012.08_512
  @stdout="bash: hostname: command not found\r\n">]


Grepping JSON output isn't pretty, but you can see from the above how my program "run" works - it prints out the name of each cloud server then the JSON output of running a command via the ruby fog "ssh" API.

The output above also highlights something I ran into a LOT - various "standard" commands are simply not present on some distros!  Let's take a look - is "vim" available on all of the distros?

$ run 'which vim'|egrep ^pbr\|stdout
pbr_ubuntu12.10_512
  @stdout="/usr/bin/vim\r\n">]
pbr_freebsd9_512
  @stdout="vim: Command not found.\r\n">]
pbr_opensuse12.1_512
  @stdout="/usr/bin/vim\r\n">]
pbr_fedora17_512
  @stdout="/bin/vim\r\n">]
pbr_gentoo12.3_512
  @stdout="which: no vim in (/usr/bin:/bin:/usr/sbin:/sbin)\r\n">]
pbr_debian6_512
  @stdout="">]
pbr_arch2012.08_512
  @stdout="/usr/bin/vim\r\n">]


From this it seems that FreeBSD, Gentoo, and Debian don't come with vim pre-installed.   How about 'make'?

$ run 'which make'|egrep ^pbr\|stdout
pbr_ubuntu12.10_512
  @stdout="/usr/bin/make\r\n">]
pbr_freebsd9_512
  @stdout="/usr/bin/make\r\n">]
pbr_opensuse12.1_512
  @stdout="/usr/bin/make\r\n">]
pbr_fedora17_512
  @stdout="/bin/make\r\n">]
pbr_gentoo12.3_512
  @stdout="/usr/bin/make\r\n">]
pbr_debian6_512
  @stdout="/usr/bin/make\r\n">]
pbr_arch2012.08_512
  @stdout="which: no make in (/usr/bin:/bin:/usr/sbin:/sbin)\r\n">]

No make command is available on arch.  Arch is not for the faint of heart, I guess.

I'll be updating this blog post, and adding additional posts as I continue exploration of these various Linux distros.

Wednesday, July 24, 2013

Nameless Temporary Files

Linux 3.11 rc2 


Here's an interesting snippet from Linus's announcement post regarding Linux 3.11 rc2:

 (a) the O_TMPFILE flag that is new to 3.11 has been going through a
few ABI/API cleanups (and a few fixes to the implementation too), but
I think we're done now. So if you're interested in the concept of
unnamed temporary files, go ahead and test it out. The lack of name
not only gets rid of races/complications with filename generation, it
can make the whole thing more efficient since you don't have the
directory operations that can cause serializing IO etc.

Interesting idea!  Temporary files that aren't burdened with having to have filenames.

It will be some time before sysadmins see this feature in production, but in this case I think it's best to get the word out early, especially since this new feature could cause "mystery drive space exhaustion".

Right now, the only discrepancy between "df" and "du" numbers is due to deleted files with still-opened file descriptors.  With this new feature, it appears that nameless temporary files will join the ranks of hard-to-spot possible root causes of space exhaustion.

It's not clear how these new files will be identifiable / distinguished, for example, in the output of "lsof".
As I learn more about this new feature, I'll be sure to write about it.

Sunday, July 07, 2013

Best Practices: It's Freezing in the Cloud

Production.  

It's a term many people have heard of, but what does it mean?   A lot of people have been asking me about this lately, so I'm happy to give an overview of some best practices for solution management.


Production Rule #1: A production environment is something you don't touch.  

Its configuration is frozen - it's not up for experimentation.  It does exactly what it's been configured to do, nothing more, nothing less.  A production environment is composed of a set of production servers handling various functions.  Regardless of whether they're web servers, app servers, compute servers, db servers, or comms/queueing servers, each production server is a simple combination of three things:

  • a vetted version of your application/website/service/database/whatever
  • a vetted copy of each and every non-stock (tuned) configuration file
  • a base OS - hopefully one that's stock, but definitely one that's been proven stable

The sum total of those three things makes a production server.   

Example Production Environment "Cook Book":
db application + db_master_config + stock RHEL6 server = master-database-server
db application + db_tapebackup_slave_config + stock RHEL6 server = backup-slave-database-server
db application + db_failover_slave_config + stock RHEL6 server = failover-slave-database-server
payment_gateway_if + comm_server_tuning + stock RHEL6 server = payment_gateway
payment_gateway_if + failover_comm_server_tuning + stock RHEL6 server = payment_gateway
website content + webhead_config + stock RHEL6 server = web1
website content + webhead_config + stock RHEL6 server = web2
website content + webhead_config + stock RHEL6 server = web3
load_balancer_config + stock cloud loadbalancer (LBaaS) = production_loadbalancer


Production Rule #2:  Never login on your production servers.

Notice there's no mention of "custom tweaking" or "post-install tuning" or the like in the recipes I listed in the example above.  That's because there really must be none. Human beings NEVER LOG IN on production servers.  If they are doing that, they're almost certainly breaking rule #1 and they're definitely breaking rule #2.  Early on, as you're putting the solution in place, it may be handy to use ssh to remotely run a command or two - but you must ensure you're only running non-intrusive monitoring - "read only" operations - if you wish to ensure the correctness of the production environment.

The moment you break rule #2, you'll be setting yourself up for a conundrum if/when there is a problem with the production environment.  You'll then need to answer the question:  "was it what I did in there, or was it something in the production push that's the root cause of the problem?"

If you've never logged in on the production servers, you then KNOW it was something in the production push that caused the problem.

How then do you arrive at a reasonable solution, not over-spending on servers, memory, storage, licenses, etc. if you don't tune your production environment?   You tune your staging environment instead.  

Staging.

There's a term fewer people have heard, but it's just as important as "production".  Every good production environment has at least one staging environment.

Ideally, a staging environment duplicates the production environment.  If you're hesitant to jump straight to that, you can introduce less redundancy than the production environment has - but you're opening up the possibility of mis-deployment if you do.

Example "barely shorted" staging environment:
db application + db_master_config + stock RHEL6 server = master-database-server
db application + db_tapebackup_slave_config + stock RHEL6 server = backup-slave-database-server
payment_gateway_if + comm_server_tuning + stock RHEL6 server = payment_gateway
website content + webhead_config + stock RHEL6 server = web1
website content + webhead_config + stock RHEL6 server = web2
load_balancer_config + stock cloud loadbalancer (LBaaS) = production_loadbalancer

The idea with a staging environment is that it's a destination for changes to applications, website content, and configuration, prior to their going into production.


Staging Rule #1: A staging environment is something you don't touch.  

(well... after it's been setup and debugged and is working properly, anyway)

It's a "devops" world now - sysadmin config changes need to be versioned and managed just as carefully as code changes.  Ideally all of the changes are committed to a source code repository - ideally something like git.  

Once a week, or more often if needed, the entire list of changes being made for all components and configurations is reviewed and vetted/approved.  Then, all of those changes are applied to the staging environment, backing things up first if needed.

With that, you've achieved a "staging push" - combining all of the changes to all of the functionality and configuration for all of the various solution components and applying them to the staging environment.  At that point automated testing begins against the solution that you've just put in place in the staging environment.

Real-world traffic to the solution is either simulated or exactly reproduced, and the performance and resource utilization of all servers implementing staging are logged.  After a period of some days of testing (yes, multiple days - ideally simulating a full week of operations), summarization and statistics can be generated from the resource utilization logs.

If there are any ill side-effects of the most recent push, they'll be evident because the resource utilization statistics will show that things got worse.  For example, if there's a badly coded webpage introduced which is causing apache processes to balloon up in size, the memory statistics on the webheads will be notably worse than they were for the previous staging push.

Staging Rule #2:  Never login on your staging servers.

If it's done right by suitably lazy programmers, your staging environment will be running all of this testing automatically, monitoring resources automatically, comparing the previous and current statistics resulting from testing at the end of the test run, and emailing you with the results.

You can only be 100% sure of the results of the staging test if it was entirely "hands off".  Otherwise if/when something goes wrong (either in production or in staging) you'll be left wondering if it was due to the push, or due to whatever bespoke steps you took in staging.  That's not a good feeling, and it's not a fun discussion with your board of directors either.

More Twenty-first century devops best practices

If you'd like to learn more, I can recommend Allspaw and Robbins' "Web Operations: Keeping the Data On Time".  Now it's your turn - what's your favorite "devops" runbook/rule-book?


Tuesday, July 02, 2013

Don't Fear the Mongo

NOSQL is a term that strikes fear in the heart of many with traditional relational database skills.

How can a database not use SQL?  How could that possibly perform well?  It does!  And it's not hard to learn, either. Don't worry about performance - just dive in.  http://education.10gen.com is offering free classes in Mongo - and they're totally worth your time.

I'm partially through "M101P MongoDB for Developers" and I now feel relatively comfortable addressing NOSQL related concerns.  I'm also enrolled in an upcoming "MongoDB for DBAs" class.

Similar to MySQL, MongoDB is a service process.  You connect using a client program, "mongo", or by using a MongoDB library and making calls from your favorite programming language.  The class I'm in right now uses Python, which is pretty straightforward to learn - but they give you most of the Python code for the various homework exercises already, and you only really need to write a few lines of calls that use the MongoDB API for the various assignments.

If you know javascript and JSON notation, you're 80% of the way to knowing MongoDB already.  Here's a quick demo of using mongo:

bash-3.2$ mongo
MongoDB shell version: 2.4.4
connecting to: test
> show dbs
blog 0.203125GB
local 0.078125GB
m101 0.203125GB
students 0.203125GB
test 0.203125GB
> use students
switched to db students
> db.grades.find().forEach(  function(one){db.gradesCopy.insert(one)});
> db.grades.count()
600
> db.gradesCopy.count()
600
> quit()
bash-3.2$ 
Pretty straightforward, huh?   Don't fear the mongo!  

Tuesday, June 25, 2013

Running commands on all of your cloud servers

I consider my cloud servers to be one big array of servers.

I decided to use "fog" - the Ruby API for the Rackspace Cloud - to build something to let me, in one step, run commands on all of the servers.  It turned out to be pretty straightforward.


You might have a different model - an array of web servers, another array of db servers, and another array of compute servers, for example.  If so, you can easily extend the code to work with your different groups of servers by querying for whatever differentiates them.
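The actual "run" program uses the Ruby fog API, but the core idea is simple enough to sketch in plain shell - the hosts file and key path below are hypothetical, purely for illustration:

#!/bin/sh
# run the given command on every server listed (one hostname per line) in ~/cloud-hosts
cmd="$1"
while read host; do
    echo "== $host =="
    ssh -i ~/.ssh/cloud_key "root@$host" "$cmd"
done < ~/cloud-hosts

Invoked as something like: ./runall 'uptime'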

Monday, June 24, 2013

Migrating a website - sporadic performance

I had the opportunity today to help someone who had done an outstanding job of migrating a website into the cloud - but their new site's performance was sporadic and unpredictable.

I'll share both the technique I used to debug that, and the lessons learned.

The site was implemented over two webservers, with lsync handling content synchronization.  The web servers were behind a cloud load balancer.  They were sized right.  They weren't pegging the CPU, swapping, or I/O bound.  But... they weren't working right.  The site would load sometimes, and time out other times.

To see the website, I had to put the domain name and its new IP address in my local /etc/hosts file.  When I put the IP for the load balancer in my /etc/hosts file, I couldn't tell which apache child process was handling my request, because all of the connections were coming from the load balancer.  So, I picked one of the web servers - web1 - and changed my /etc/hosts details for the domain to point my browser straight to that web server - bypassing the load balancer.  That way, I could see which apache child process my browser had been hooked up to.

Lesson #1: bypass the load balancer for testing.

I used 'strace' to see what apache was doing.  It took a few tries, but I soon had a good idea of what was going on.  By the time you've got output from netstat, with the process ID, the work's already done - so how to strace that process super-fast?
alias s="strace -s 999 -p"
This way when netstat shows apache process 11245 is serving your IP, you can bust out:
"s 11245" and hit enter.  Viola!   (it's possible to go even further with this, but let's keep it simple)

Lesson #2: don't give up - figure out how to not have to type a lot.

I saw apache contacting some IPs I was familiar with - the caching nameservers for the datacenter where the server lives.   Then I saw apache reach out and connect to an unfamiliar IP address.

What that means is that apache was looking up a domainname in DNS, then using the resulting IP.

I asked about that IP... and it turned out, it was the IP of where that website is CURRENTLY hosted.

I was helping to debug the NEW version of this site - but for whatever reason, the new code was reaching out to the OLD implementation of the website.

So, the "root cause" had been found.  However, what to do next?  I could have simply advised that the best solution would be to revise the code to use relative references.  Or, mentioned that it could use IP addresses instead of domain names.

Instead, I fixed the problem, right then and there.  

On both web servers, I added the domain name in /etc/hosts:
127.0.0.1 localhost localhost.localdomain thewebsite.com
That way, each machine considered that 127.0.0.1 was the proper IP address for the domain.  This had the added benefit that references to the domain from either web server wouldn't cause traffic through the load balancer.  I think it's an all-around good idea.

Lesson #3: servers that implement domains should consider themselves that domain.

By the way... the moment I edited /etc/hosts and fixed this, the site started to render super-fast, and the sporadic performance problem was gone.

My customer was so happy, he told me to tell my boss he said I could take the rest of the day off. (I didn't... but I loved the sentiment!)

Thursday, June 06, 2013

Nightly Maintenance and "Sorry Sites"

Servers need backups.  And, sometimes, there are nightly maintenance scripts that need to be run, for example dumping out all transactions, or importing orders or products.  Usually these maintenance tasks will be run from a cron job.

Often, these tasks impact the "production" website, or conversely, the "production" website often impacts these tasks.  Either way, sometimes it's best to get the site offline for a minute or two, to let the maintenance task run quickly and to completion, without competition.

I thought the approach below was totally obvious, but I've learned that a lot of people are really happy to learn how to do this sort of thing.

It's really straightforward for a cron job to also put a "Sorry Site" in place - a website that states "We're down for maintenance - please reload in a few minutes" or similar.   Here's a strategy for doing this.

Say your website document root is:
/var/www/website
And say your "sorry" website is:
/var/www/sorry
We'll make a script called /root/switch.
#!/bin/sh
site=/var/www/website
sorry=/var/www/sorry
hold=/var/www/hold
if [ -d "$sorry" ]; then
    mv "$site" "$hold"
    mv "$sorry" "$site"
else
    mv "$site" "$sorry"
    mv "$hold" "$site"
fi
Say your existing cron job is:
0 0 * * * /do/my/maintenance >/dev/null 2>&1
To put the sorry site in place while the maintenance is running, just change that to:
0 0 * * * /root/switch; /do/my/maintenance; /root/switch >/dev/null 2>&1
The above simply calls the "switch" script twice - once before, and once after, the maintenance script.  It keeps all of the details of what "switch" actually does hidden away from the cron job, as a good programming practice.

The above approach lets you customize your "sorry site" - some of the pages can say "We're down for maintenance" (say, the main page) and other pages can still work (say... for example... the pages that let people check out :-)

If you just want to take ALL pages offline, there's a simpler way - setup a variant .htaccess file and swap that in place, instead of moving the directories around.
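One possible shape for that swap - the filenames here are just an example; adjust them to match your docroot and whatever your maintenance .htaccess contains:

#!/bin/sh
# toggle between the live .htaccess and a maintenance-mode .htaccess
cd /var/www/website || exit 1
if [ -f .htaccess.maintenance ]; then
    mv .htaccess .htaccess.live
    mv .htaccess.maintenance .htaccess
else
    mv .htaccess .htaccess.maintenance
    mv .htaccess.live .htaccess
fi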

Tuesday, June 04, 2013

Fighting SPAM: Identifying compromised email accounts

A compromised email account is one where spammers have determined someone's email password, and they're using the email account to send out spam email.

Various email servers have better and worse logging.  Depending on the server (qmail, postfix, sendmail) the logs may or may not let you directly correlate an outgoing spam email with the actual account that sent the email.

So, the following can be pretty useful.  It collects up all the IP addresses ($13 - the thirteenth field in the logfile, in this particular case) that each user has connected from, and prints out the accounts that are connecting from more than one IP.

awk '/LOGIN,/ {if (index(i[$12], $13) == 0) i[$12]=i[$12] " " $13} END {for(p in i) {print split(i[p], a, " ") " " p " " i[p]}}' maillog|sort -n|grep -v '^1 '

If you see an account for an individual, which is getting connections from dozens or hundreds of IP addresses, that's very possibly a compromised email account.

Note that an end-user with a smartphone will end up with a big bank of IPs connecting to check email.  They'll all have similar IP addresses in most cases.

Friday, May 31, 2013

Track Apache's calls to PHP

Customers often ask how to find out what PHP code is being called.  Sometimes, they're looking to find abusers of PHP email forms - and other times, they're interested in learning which routines are being called the most often.

The following monitoring command will run until you interrupt it with a control-C.

lsof +r 1 -p `ps axww | grep [h]ttpd | awk '{ str=str","$1} END {print str}'`|grep vhosts|grep php

It takes the process IDs of all of the Apache processes and strings them together with commas in between.  Then it calls "lsof", asking it to repeat every second.

"lsof" lists all of the open file descriptors for the processes listed after the "-p" argument.

At the end of the command, we select only those lines that have "vhosts" and "php".  If your website document roots aren't under /var/www/vhosts, you will want to look for some other string indicating "a file within a website".

Wednesday, May 29, 2013

As a software developer, how can I ensure I remain employable after age 50?

I used to think the same way.  I've been programming UNIX/Linux for around 30 years.  I liked writing code.  I wanted my job to be writing code, and I wanted some company to pay me to do that.

I absolutely LOVE writing code now - because I only write WHAT I want to write, WHEN I want to, and HOW I want to.  (I.e. it's no longer part of my job.  I write code as a hobby now.) 

I absolutely LOVE my job now - it's WAY better than any job I've ever had before - including when I was a consultant, and including when I worked for myself (I was CTO of my own startup some years ago).

My day job is: HELP PEOPLE.  I found a very good fit in customer service.  

I'm now a top-shelf systems administrator, and I leverage my coding skills to solve problems that would make many sysadmins' heads spin. For example, I was asked to action a db import the other day.  Mid-import, the load on the server went almost to zero, and memory usage started to climb.

The import had dead-locked with the customer's runtime application logic.  

Because of how apache works, and because most customers over-commit apache in terms of how they set MaxClients (they allow Apache's worst-case memory footprint to be larger than their total available memory)... in this sort of a case, it's imperative to act QUICKLY to correct the situation, or the server will very probably crash.

Most sysadmins in that case would immediately stop apache, which I did.  They would then abort the import, probably restart mysql to clear the deadlock, and restart the import.  That, I did not do - it's overkill.

Instead, I stopped apache, ran "mysqladmin processlist > queries", edited the file "queries" in vim and... 
-> deleted the header, the footer and the specific db import query I did NOT want to kill, 
-> issued :1,$s/^|/kill /
-> issued :1,$s/|.*/;/ 
-> wrote the file and exited.  

That gave me a file full of lines like this: 

kill 12345 ; 
kill 67890 ;  

...then I ran "mysql

It was a 4.5G import, so that was a good thing; restarting it would have added hours to the downtime.  
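(For the curious: a similar kill file can be generated without the interactive vim step.  Something along these lines should work, though the exact field positions depend on your mysqladmin output, so treat it as a sketch and review the file before feeding it back in:)

# pull the numeric connection IDs out of the processlist table, skipping border/header lines,
# and turn each one into a kill statement
mysqladmin processlist | awk -F'|' '$2 ~ /^[[:space:]]*[0-9]+[[:space:]]*$/ {print "kill", $2, ";"}' > queries
# remove the line for the import itself from "queries", then: mysql < queries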

This isn't something your typical dev knows how to do correctly.  It's not something your typical admin knows how to do correctly.  And it's not even something your typical DBA knows how to do correctly.  It's something I knew how to do correctly, leveraging my years of experience.  

I'm sharing this because it shows there's still a need for people who can solve difficult computing problems, accurately and quickly, but outside of the programming domain.  Your experience level may well make you IDEAL for this sort of position, so if you find it at all compelling, I recommend that you:
  • review all of your past positions to see how each and every one of them had "customer service" as some aspect of what they were about
  • rework your resume to exude that aspect of what you did
  • apply for an entry-level position in customer service at a hosting company

Learning Vim From the Inside

Vim improves on vi in countless ways.  As a curious vi expert, I wanted to know exactly what those were, so I dove into the source code.  In doing so, I was compelled to create this online class a few years back: http://curiousreef.com/class/learning-vim-from-the-inside/

It's still going strong.  New students join every month.  For the most part, it runs itself now... but if you hit any hurdles while working your way through the content, please reach out to me and let me know.

The ethos of slicing and dicing logfiles

When a logfile is of reasonable size, you can review it using "view" - a read-only version of "vim".  This gives you flexible searching, and all of the power of vim as you review the logfile.  However, for viewing huge files, instead of editing them in vim directly, try this:

tail -100000 logfile | vim -

That way you're only looking at the last 100,000 lines not the whole file.  On a server with 4GB of RAM, looking at a 6GB logfile in vim without something like the above can be, well... a semi-fatal mistake.

For logfile analysis, I use awk a lot, along with the other usual tools - grep, etc.  Awk's over the top - totally worth learning. You can do WAY cool things with it.   For example, I once used grep on an apache access log to find all the SQL injections an attacker had attempted, and wrote that to a tempfile.

Then I used awk to figure out (a) which .php files had been called and how many times each, and (b) what parameters had been used to do the injections.

awk -F\" tells awk to use " as the field separator, so anything to the left of the first " is '$1' and whatever's between the first and second quote is $2, etc.

So awk -F\" '{print $2}' shows me what was inside the first set of quotes on each line.

Using other characters for the field separator let me slice out just the filename from the GET request, then another pass over the file with slightly different code let me slice out just the parameter names.  
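Putting those pieces together, a pass over an Apache access log to count which files were requested might look something like this (access.log is a placeholder filename):

# the request line ("GET /foo.php?x=1 HTTP/1.1") is $2 when " is the field separator;
# the second awk pulls out the URL, cut strips the query string, then we count occurrences
awk -F\" '{print $2}' access.log | awk '{print $2}' | cut -d'?' -f1 | sort | uniq -c | sort -rn | head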

Here again, as you might feel is a resounding theme in my blog, the Linux commandline tools have proven to be immensely useful.

Log Dissector

If you want to see some of awk's more awesome features being leveraged for logfile analysis, take a look at this little program I threw together:
http://paulreiber.github.com/Log-Dissector/