Brain dump, may make this more coherent later
Andon is a manufacturing management process for quickly dealing with critical issues. When a quality or process issue is found, such as defective or missing parts, incomplete or incorrect installation, or anything that blocks a process, a worker can immediately escalate it to one or more senior staff, who must respond immediately. If a fault is not rectified within a certain time, or an automated system is tripped, the manufacturing line is halted until the issue has been dealt with.
The key benefit of Andon is that it allows for critical issues to be escalated and examined within minutes, and brings in the ‘experts’ who have the knowledge to establish how to best respond to the issue. Electrical power system manufacturer Ensto has dedicated mobile phones for the response team, and all members must respond to an escalation within 5 minutes.
For software, relying on an issue to be escalated through usual channels can be too slow for dealing with truly critical problems.
Common issues in software development
- A defect is found by a tester or reported to first-level support. They suspect that the defect is serious, but don’t know who is responsible. The defect is added to an escalation queue and gets passed around until someone understands the impact.
- During the final meeting before a build, it is announced that the build must be postponed due to a critical issue that only a couple of managers were aware of.
- Work done by one team has broken functionality in another area. The team is ordered to fix the problem they caused.
- A build is delayed because a developer working on a critical issue is sick.
Using Andon
It’s unlikely that you would want to completely drop Andon into your development process (although flashing alert lights and completely stopping production would probably get critical issues resolved quickly).
Some of its concepts, though, can be used to improve how you react to and deal with critical issues.
Build a response team
Unless it is a small product, you really need a small team of experts drawn from a number of areas. The team is notified of critical issues and expected to investigate them quickly. Between them they should be able to establish whether the issue is truly critical, and ideally get it fixed right now.
Notifying all response team members keeps everyone up to date on any current critical issues.
Multiple team members bring multiple views on an issue. Too often a critical issue is discounted because an individual doesn’t fully understand its impact. Having a number of different viewpoints can prevent very serious problems being dumped back in the standard queues to be investigated.
Having a team means that a single staff member being away does not prevent a critical issue being fixed.
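As a rough illustration of the notify-everyone idea, here is a minimal Python sketch that records who on the response team has acknowledged an escalation and who is overdue. All names here (Escalation, RESPONSE_WINDOW, the team members) are made up for the example, and the 5-minute window is simply borrowed from the Ensto example above; your own tooling and limits will differ.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Assumption: a 5-minute response window, borrowed from the Ensto example.
RESPONSE_WINDOW = timedelta(minutes=5)


@dataclass
class Escalation:
    """A critical issue that has been raised to the whole response team."""
    summary: str
    raised_at: datetime
    team: tuple[str, ...]
    acknowledged: dict[str, datetime] = field(default_factory=dict)

    def acknowledge(self, member: str, at: datetime) -> None:
        """Record the first time a team member responds."""
        if member not in self.team:
            raise ValueError(f"{member} is not on the response team")
        self.acknowledged.setdefault(member, at)

    def overdue(self, now: datetime) -> list[str]:
        """Members who have still not responded once the window has passed."""
        if now - self.raised_at <= RESPONSE_WINDOW:
            return []
        return [m for m in self.team if m not in self.acknowledged]


# Hypothetical usage: two team members, only one responds in time.
raised = datetime(2024, 1, 1, 9, 0)
esc = Escalation("Checkout is failing", raised, ("ana", "ben"))
esc.acknowledge("ana", raised + timedelta(minutes=2))
late = esc.overdue(raised + timedelta(minutes=6))  # ben has not responded
```

Because every member is on the same escalation, one person being away shows up as a single overdue name rather than a stalled issue.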
Drop everything
Critical means that the issue needs to be resolved now. Realistically, truly critical issues will be infrequent, so it should be expected that the response team literally stops everything they are doing to look at the raised issue and get it fixed.
Define critical
For manufacturing, an issue can be critical because all the outputs are going to be defective. For software, this depends on the company and can be anything from the system being completely dead, to important functionality not working, to a top-level client being unable to perform a required process. Care needs to be taken not to make the scope too broad, as each critical event is disruptive to everyone on the response team. Any changes to scope need to be clearly and widely communicated to ensure that all staff are aware of what they need to escalate immediately.
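One way to keep the scope narrow and easy to communicate is to write the definition down as an explicit checklist. The sketch below is only an illustration in Python: the three flags mirror the software examples above, and the names (Issue, is_critical) are invented for the example rather than taken from any real tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    """Hypothetical flags a reporter can tick when raising an issue."""
    system_down: bool = False
    key_feature_broken: bool = False
    top_client_blocked: bool = False


def is_critical(issue: Issue) -> bool:
    """An issue is critical if it meets any one of the agreed criteria."""
    return (
        issue.system_down
        or issue.key_feature_broken
        or issue.top_client_blocked
    )


# Hypothetical usage: a cosmetic bug ticks none of the boxes.
cosmetic = Issue()
outage = Issue(system_down=True)
```

Keeping the criteria in one short list like this makes any change of scope visible, which helps when it has to be announced to all staff.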
Stop wasting time
Most staff are not experts on the intricacies of your software. If someone finds an issue that they suspect is critical, it makes sense for it to be immediately passed to someone who can confirm if it is critical AND who can get the issue fixed. Worst case is that the issue is not critical and the escalation team spends a short period confirming this. Excessive false alerts from new staff can be mitigated by having them check with a colleague when possible.
Stop the blame game
When Team A does work that catastrophically breaks Team B’s area, this is now everyone’s problem to fix. A small team of experts is likely to investigate and repair a critical issue more quickly than if the issue is first sent to Team B to look into, who then blame Team A and expect them to fix it.
Review the causes
While not specifically part of Andon, every critical issue should be examined to determine how it occurred and whether, and how, it can be avoided. You don’t want the same problem to occur again. Whether you make your system more resilient, implement automated tests or restrict access to key codebases, it’s good to be able to demonstrate that you have done everything you can to improve the reliability of your system.