Early in my career I was group engineering manager of the high-end storage division at Digital. The team comprised a couple of hundred people, and it was my first large management role. The technology had been built over a decade and was, at the time, the most sophisticated storage array technology in the world. It was also a foundational element of VMS clustering. If a customer wanted a VAX cluster they also had to have an "HSC" (Hierarchical Storage Controller).

The product was wildly successful and garnered many large customers. Our largest customer was a bank. A really big bank. And they had, as you would suspect, built the largest VMS cluster in the world. This came to be my misfortune. 

The bank's massive VMS cluster helped run its ATM network. At that time, ATMs were brand new (the equivalent of mobile banking apps today). Unfortunately, every 30-60 days, the system would crash and reboot. We discovered that the problem was in the HSC. Bummer. As you can imagine, the bank wasn't very happy, but the technology was new so, initially at least, they were understanding.

It seemed like overnight I went from cog in the machine to a sort of (undesirable) infamy within the company. The team became more and more consumed with trying to solve the problem. I became more and more consumed with writing status updates and flying back and forth to NYC to appease the customer.

The problem was every software engineer's worst nightmare. It happened infrequently, was not consistently reproducible and was on a system larger than any system we had ever had in our lab! It took months to get detailed trace logs because we had to set up the customer's systems and then wait a month or two for a crash.

Over the next three months more of the team became involved. The simple fact remained -- we had absolutely no idea what the problem was. I am an engineer but this stuff was way past my skill set. All I could do was to facilitate and be the front man for the arrows. In desperation, I called the original architect of the system (now the division CTO) and begged for help.

Richard Lary (Richie) is one of the smartest people I know, if not the smartest. He reminds me of Kramer from Seinfeld (but much smarter). Asking for Richie's time was a huge imposition but we were desperate. Richie met with the software team one Friday afternoon. When I returned Monday morning Richie came by (it looked like he had slept there all weekend) and declared the problem fixed. Needless to say I was stunned. How could one guy, who hadn't written or even been involved with the code set for years, come in and fix something in a weekend that had eluded and perplexed 60 people for three months?

We tested the code and, sure enough, it worked. The customer was happy, my job saved. I was desperate to know how he solved it. I went to Richie and asked, "What was the problem?" His answer was "I have no idea." Now totally lost, I asked, "How did you fix the problem if you didn't even understand what it was?" His answer: "I just looked at the set of conditions that needed to exist for the crash to happen and changed the code so those conditions would never exist."

He changed the rules, instead of trying to change the outcome of the game.

I apply this life lesson almost every day now when thinking about how to solve a problem. Look at every problem from every perspective. Some problems you solve head-on. Others might require you to change the rules...

 
