Friday, November 26, 2021

Sysadmin Horror Story with a Moral

Of all of my horror stories from working at a NYC financial organization, the worst one was also the most memorable, not because of how horrible it was, but because of something I discovered along the way that has served me well in the decades since.

Back in the days before server clouds and virtual machines, we had real computing hardware.

Sometimes, as in this financial institution, the servers, all 1,000 or so of them, were housed right inside the business. This firm owned an entire 24-floor building not far from Rockefeller Center.

We were informed by NYC's power company, Con Edison, that in order to keep up with our growing power demand they needed to turn off the power to the building for a while. It might have just been to the two or three floors where we housed all of the computers. This would be done on a Saturday night, when all of the financial markets were closed.

The plan was that Saturday evening we'd power down all of the servers and Sunday morning at 6AM the servers would be powered back up. Because every server had to be checked, we divided the power-on phase into two groups: one starting at 6AM, and another starting at noon. We had the more experienced people doing the noon shift, because we expected that some servers might be tricky, so the early group could focus on quantity and leave the more difficult machines for the more advanced group.

This seemed pretty reasonable and clever. I was on the second shift.

Things went pretty well and we got most of the servers up and running.

Except there was one. And it was one of the more important real-time trading servers.

So now we flash back six weeks earlier...

Back in the days before the widespread use of server configuration management and orchestration tools like Kubernetes, Ansible, Chef, or Puppet, we had to write our own scripts to do mundane things.

One time I was assigned the task of writing a script to update root passwords everywhere.

Actually, the assignment wasn't really that. A batch of sysadmins had burned out and quit, and upper management wanted to just make sure ASAP that they didn't inflict damage on us. Financial institutions weren't friendly places to work.

Taking the longer view of things, I wrote a general program to change root passwords. You weren't going to change the attitude of the financial institution. They treated everyone like a disposable wipe. So you adjusted: this was an activity that was clearly going to happen over and over again.

Part of the task was, of course, to write the program to update the root password. The other part of the task, doing it everywhere, was actually more difficult: getting and maintaining an inventory of servers. A number of the servers could be found through a service developed at Sun Microsystems called NIS.

However, not all servers used this, so the NIS information had to be melded in with DNS. The two lists were largely distinct because the servers had been set up by two different and somewhat competing groups. However, the groups got merged sometime after I was hired.
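
If you're curious, assembling the inventory looked roughly like the sketch below. This is a reconstruction, not the original script; the zone name, nameserver, and file paths are made up. Note that a plain sort -u only removes hosts listed under the same name, which matters for what comes next.

    #!/bin/sh
    # Hypothetical sketch of how the server inventory was assembled.
    # Hostnames, zone names, and file paths are illustrative only.

    # Hosts known to NIS (Sun's "yellow pages" service).
    ypcat hosts | awk '{print $2}' > /tmp/nis-hosts.txt

    # Hosts known to DNS, pulled via a zone transfer (assumes the
    # nameserver allows AXFR for this zone).
    dig @ns1.example.com example.com AXFR \
        | awk '$4 == "A" {print $1}' | sed 's/\.$//' > /tmp/dns-hosts.txt

    # Merge the two lists and drop exact duplicates.  A machine that
    # appears in NIS under one interface name and in DNS under another
    # still shows up twice -- which is exactly what bit us.
    sort -u /tmp/nis-hosts.txt /tmp/dns-hosts.txt > /tmp/all-servers.txt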

When a server had more than one network interface, it was important to catch that. I had written a blast script that would take a list of servers, copy some code to each server, and run it. The code here was the thing that changed the root password.

One of these servers had more than one interface and was listed in both DNS and NIS, so it appeared in the inventory twice. My blast script updated servers independently rather than serially.
When you have 1,000 or so servers to update, you need to update servers in parallel if you want to get any task done in a single work shift. (More about that aspect perhaps some other time; that story has a moral, too.)
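
A minimal sketch of that parallel blast pattern, not the original script: it is shown with ssh/scp for readability (that era would have used rsh/rcp), assumes root access to each host, and skips the batching the real run had.

    #!/bin/sh
    # Minimal sketch of a parallel "blast" script -- not the original.
    # Assumes root access to each host and a payload script to run there;
    # no throttling is shown, though the real run was batched.
    PAYLOAD=./change-root-password.sh    # hypothetical payload name

    while read host; do
        (
            scp -q "$PAYLOAD" "root@$host:/tmp/payload.sh" &&
            ssh -n "root@$host" 'sh /tmp/payload.sh; rm -f /tmp/payload.sh' ||
            echo "FAILED: $host" >> blast-failures.log
        ) &        # each host runs in the background, i.e. in parallel
    done < /tmp/all-servers.txt
    wait           # block until every background update has finished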

There is a funny thing about the Unix file system (and by extension Linux as well): consistency is not guaranteed over independent file writes. In other words, if two clients each open the same file and write to it in parallel, the result will probably be garbage. And that's what happened when I ran my password update against the same server, in parallel, through two different network interfaces.

The result was that we had a shadow password file which was garbage.

Much later when I learned about "idempotency" in configuration management, a light bulb went off in my head. See definition 3 in https://foldoc.org/idempotent .
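
Here is what idempotency buys you in this situation, as a minimal sketch of the payload (the hash is a placeholder and the real thing was cruder): running it a second time on a box that is already in the desired state changes nothing, so a host that sneaks into the inventory twice is no longer a problem in itself. Truly simultaneous runs can still race, which is where the file locks mentioned below come in.

    #!/bin/sh
    # Sketch of an idempotent payload: if root already has the desired
    # password hash, a repeat run is a no-op.  The hash is a placeholder
    # and error handling is minimal.
    NEWHASH='$1$example$xxxxxxxxxxxxxxxxxxxxxx'
    SHADOW=/etc/shadow

    current=$(awk -F: '$1 == "root" {print $2}' "$SHADOW")
    [ "$current" = "$NEWHASH" ] && exit 0    # already converged; nothing to do

    # Rewrite root's entry into a temp file, then swap it into place.
    awk -F: -v OFS=: -v h="$NEWHASH" '$1 == "root" {$2 = h} {print}' \
        "$SHADOW" > "$SHADOW.new"
    chmod 400 "$SHADOW.new"
    mv "$SHADOW.new" "$SHADOW"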

Now, pretty soon we discovered the mistake. I went to upper management to report the problem. I figured that, since we were low on sysadmins (which is why I was doing this in the first place), my job was safe for a little while, at least until I got the program working reliably by adding a couple of file locks to it and correcting the duplicate entry in the inventory.
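
The shape of that locking fix was something like this. It uses the Linux flock(1) utility purely for illustration; the original Solaris-era fix was more homegrown. The point is that the lock lives on the target machine, so two blast connections reaching the same box through different interfaces run the rewrite one at a time instead of interleaving their writes.

    #!/bin/sh
    # Illustrative per-host wrapper for the blast script.  flock(1) here
    # stands in for the original, homegrown locking.  The lock file is on
    # the *target* machine, serializing concurrent password rewrites.
    host="$1"
    scp -q ./change-root-password.sh "root@$host:/tmp/payload.sh"
    ssh -n "root@$host" \
        'flock /var/run/update-shadow.lock sh /tmp/payload.sh; rm -f /tmp/payload.sh'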

Then I was asked what the current situation of the server was. Was the important trading system still trading?

Interestingly the answer was, well, actually yes. It was. Any program that started out with root privileges and that was running was still functioning normally. It was only new requests that couldn't log in as root.

So upper management said, don't worry about it then: just don't reboot the box.

This is exactly the highly skilled 2-steps-ahead kind of answer typical of those minds that work at places making decisions about your financial future.

So now we come back to the problem: Con Edison had decided that we were going to reboot. Most of the important servers were hooked up to battery backup with specially monitored power regulators. However, this one server was in the trading area, on some guy's desk, plugged into a wall outlet.

And as we learned that day, it also had disk clustering of the root file system, and that clustering was slightly dysfunctional. Rebooting it properly involved using a magical command as root, which we were only to find out about later. I hope you see where we are going with this...

If you are still with me, now I get to the main part of the story.

We spent hours trying all sorts of things and the usual tricks to bring up a server in repair mode. But because the root partition was part of the messed-up disk cluster, we couldn't. All the other folks were told to go home because it was just this one server, leaving just me and the sysadmin team lead.

We had a contract with the manufacturer/vendor for support of the important servers, but not for this particular server, which wasn't racked with the others. Finally, the sysadmin team lead, Sam, decided to put a half hour of live support, about $800, on his own personal credit card. (That was in 1999 dollars.) I was impressed: it was resourceful and took guts.

We were told that we would be dealing with Sun's expert in Veritas Cluster Management, our clustering software.

So we were hopeful.

He had us try a few things, but none of them worked. And then, after 3 minutes of elapsed time, he said, "I hope you have good backups to restore from." Well, predictably, this important but rogue trading server, which happened to be sitting on some dude's desk and had neither fault-tolerant power nor a support contract, wasn't on the list of servers to be backed up either.

But Sam just said: I want a second opinion.

The guy said: Okay, I hope your résumés are up to date.

I was very agitated. Using a low-level disk reading command, dd, I was able to see that all the clustering info was still there. It was just a matter of reconstructing whatever information was needed at boot so we could then get to a single-user root shell with /etc/ mounted. To spend $800 to be told in 3 minutes to basically blow it all away and restore everything from backup (which would have taken several hours and lost some amount of trading data) really irked me.
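
That sanity check was along these lines, reconstructed from memory, with an illustrative device name (on that Solaris box it would have been something under /dev/rdsk/): read the start of the raw disk and look for recognizable Veritas Volume Manager strings to confirm the clustering metadata was still physically on the platters.

    #!/bin/sh
    # Roughly the "is the clustering metadata still on disk?" check.
    # The device name is illustrative only.
    DISK=/dev/rdsk/c0t0d0s2

    # Read the first megabyte of the raw device and scan the printable
    # strings for Veritas markers.
    dd if="$DISK" bs=512 count=2048 2>/dev/null \
        | strings | grep -iE 'vx|veritas' | head -n 20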


But Sam kept cool. We talked to the manager, who explained that this guy really was the top-of-the-line expert in Veritas Clustering that they had.

Sam patiently said this wasn't acceptable and that we still had 15 minutes of paid time to figure something out. The manager said, well, in 5 minutes, at midnight, the shift changes and we have another guy on duty. He's just the guy we use to fill in off hours when there is low activity, and he doesn't know that much about Veritas Clustering. Sam said, okay, let's try him.

We waited the 5 minutes and talked to the other guy. He affirmed that the person we had talked to really was the expert on Veritas Clustering. He himself didn't know much about it, but he would look up whatever resources he had available and call us back. Okay. And in another 5 minutes he called back to say that, by googling, he had found that you could get direct support for Veritas Clustering from the Veritas company itself. Sun Microsystems was just an OEM provider.

So we called up Veritas and gave them the dd information I had gotten earlier. With that, someone at Veritas was able to write a custom shell script with which we were able to recreate the volume manager information with all of the current data intact, so that we could boot the server to at least a root shell. YAY!


Moral: It is sometimes more important to find someone willing to help and work with you than someone who has the most knowledge about something. With someone trying to be helpful, you might be able to figure out the things you don't know, things that an expert might overlook.
 

Aftermath:

While we now had root access, we still had a lot of work to do to recreate the shadow passwords and go over the /etc/ filesystem. We had both been there over 12 hours. Sam had to get some rest before doing this important, unforgiving and custom bit of work. And this was wise, too, because when you are tired it is too easy to make a mistake. So Sam was going to go home and sleep until 5AM.

For my part, I just slept on the wall-to-wall carpeted floor in the sysadmin room. The carpeting was there because this was the floor with the trading room, not because the company cared about the comfort of sysadmins sleeping on the floor.

At about 3AM I hear the locked door open. It was the night guard who said he was suspicious because he heard some heavy snoring. I assured him I heard it, too, and was certain it was coming from the next room over.