
26 January 2008

Asking Why in IT

Thanks to reader Alex for sending me a detailed article on how one IT system administrator used the five whys to solve a network connectivity problem. 

At 3:30 in the morning of January 10th, 2008, a shrill chirping woke up our system administrator, Michael Gorsuch, asleep at home in Brooklyn. It was a text message from Nagios, our network monitoring software, warning him that something was wrong. Michael logged onto his computer in the other room and discovered that one of the three data centers he runs, in downtown Manhattan, was unreachable from the Internet.

After a couple more occurrences, the culprit was identified.

The problem was something with the network switch. Michael temporarily took the switch out of the loop, connecting our router directly to Peer 1's router, and lo and behold, we were back on the Internet.  Michael spent some time doing a post-mortem, and discovered that the problem was a simple configuration problem on the switch. The switch that failed had been set to autonegotiate. This usually works, but not always, and on the morning of January 10th, it didn't.

After this experience he got to thinking about "uptime" in general, and the problems of outlier events.

Internet providers like Peer 1 like to guarantee the uptime of their services in terms of a Service Level Agreement, otherwise known as an SLA. A typical SLA might state something like "99.99% uptime." When you do the math, let's see, there are 525,949 minutes in a year, so that allows them 52.59 minutes of downtime per year. If they have any more downtime than that, the SLA usually provides for some kind of penalty.
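The arithmetic in that passage is easy to check. Here is a minimal Python sketch; the function name and the list of SLA levels are illustrative assumptions, and only the 525,949 minutes-per-year figure and the 99.99% example come from the quoted text.

```python
# Minutes in a year, using the figure from the quoted passage
# (roughly 365.24 days * 24 hours * 60 minutes).
MINUTES_PER_YEAR = 525_949


def allowed_downtime_minutes(uptime_percent, minutes_in_period=MINUTES_PER_YEAR):
    """Return the downtime budget, in minutes, implied by an uptime percentage."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return minutes_in_period * (100 - uptime_percent) / 100


# Illustrative SLA levels; only 99.99% appears in the quoted article.
for sla in (99.0, 99.9, 99.99, 99.999):
    print(f"{sla}% uptime allows {allowed_downtime_minutes(sla):.2f} minutes of downtime per year")
```

For 99.99% this gives about 52.59 minutes, which matches the number quoted above. Each additional "nine" cuts the budget by a factor of ten.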

Keeping internet services online suffers from the problem of black swans. Nassim Taleb, who invented the term, defines it thus: "A black swan is an outlier, an event that lies beyond the realm of normal expectations." Almost all internet outages are unexpected unexpecteds: extremely low-probability outlying surprises. They're the kind of things that happen so rarely it doesn't even make sense to use normal statistical methods like "mean time between failure."

There must be a better way to deal with such events... and he discovered the five whys.

Somewhere between the "extremely unreliable" level of service, where it feels like stupid outages occur again and again and again, and the "extremely reliable" level of service, where you spend millions and millions of dollars getting an extra minute of uptime a year, there's a sweet spot, where all the expected unexpecteds have been taken care of.

To reach this sweet spot, we borrowed an idea from Sakichi Toyoda, the founder of Toyota. He calls it Five Whys. When something goes wrong, you ask why, again and again, until you ferret out the root cause. Then you fix the root cause, not the symptoms.

Applying that methodology, he identified a preventive approach.

  • Our link to Peer1 NY went down
  • Why? – Our switch appears to have put the port in a failed state
  • Why? – After some discussion with the Peer1 NOC, we speculate that it was quite possibly caused by an Ethernet speed / duplex mismatch
  • Why? – The switch interface was set to auto-negotiate instead of being manually configured
  • Why? – We were fully aware of problems like this, and have been for many years.  But - we do not have a written standard and verification process for production switch configurations.
  • Why? – Documentation is often thought of as an aid for when the sysadmin isn’t around or for other members of the operations team, whereas, it should really be thought of as a checklist.
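The fourth why points at a missing written standard and verification process. As a rough idea of what a checklist-style verification could look like, here is a small Python sketch. Everything in it is hypothetical: the port names, the settings dictionary, and the required values (100 Mb/s, full duplex) are assumptions for illustration, not details from Michael's write-up, and it does not talk to any real switch.

```python
# Hypothetical written standard: production uplink ports must have speed and
# duplex set explicitly instead of being left to auto-negotiation.
# The specific values are assumptions for this sketch.
STANDARD = {"speed": "100", "duplex": "full"}


def check_port(name, settings, standard=STANDARD):
    """Return a list of deviations from the standard for one port."""
    problems = []
    for key, required in standard.items():
        # A setting that is absent is treated as "auto" (an assumption).
        actual = settings.get(key, "auto")
        if actual != required:
            problems.append(f"{name}: {key} is {actual!r}, standard requires {required!r}")
    return problems


def check_switch(ports, standard=STANDARD):
    """Check every port on a switch and collect all deviations."""
    problems = []
    for name, settings in ports.items():
        problems.extend(check_port(name, settings, standard))
    return problems


# Made-up configuration data standing in for a real switch export.
ports = {
    "uplink-1": {"speed": "auto", "duplex": "auto"},
    "uplink-2": {"speed": "100", "duplex": "full"},
}

for line in check_switch(ports):
    print(line)
```

The point is less the code than the habit: the standard lives in one written place, and every deployment is reviewed against it before it goes into production.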

"Had we produced a written standard prior to deploying the switch and subsequently reviewed our work to match the standard, this outage would not have occurred," Michael wrote. "Or, it would occur once, and the standard would get updated as appropriate."

Not only are they fixing the root cause, they are telling their customers about the problem and solutions.  That creates value through increased confidence.

Instead of setting up a SLA for our customers, we set up a blog where we would document every outage in real time, provide complete post-mortems, ask the five whys, get to the root cause, and tell our customers what we're doing to prevent that problem in the future.

Wouldn't you appreciate a supplier that did this instead of simply filling out corrective action forms, probably documenting the corrective action for a problem that has occurred over and over again?

  • Copyright © 2004 - 2008
    Superfactory Ventures LLC.
    All rights reserved.