A press release by Toyota recently stated:
Here are some notes for the lawyers suing Toyota. Here is what your testing experts should be telling you:
- Whoever wrote this, even if he is being perfectly honest, is not in a position to know the status of the testing of Toyota’s acceleration, braking, or fault handling systems. The press release was certainly not written by the lead tester on the project. Toyota would be crazy to let the lead tester anywhere near a keyboard or a microphone.
- Complete testing of complex hardware/software systems is not possible. But it is possible to do a thorough and responsible job of testing, in conjunction with hazard analysis, risk mitigation, and post-market surveillance. It is also quite expensive, difficult, and time-consuming. So it is normal for management in large companies to put terrible pressure on the technical staff to cut corners. The more management levels between the testers and the CEO, the more likely this is to occur.
- “Extensive testing” has no fixed meaning. To management, and to anyone not versed in testing, ALL testing LOOKS extensive. This is because testing bores the hell out of most people, and even a little of it seems like a lot. You need to find out exactly what the testing was. Look at the production-time testing but focus on the design-time testing. That’s where you’ll be most likely to find the trouble.
- Even if testing is extensive in general, you need to find out the design history of the software and hardware, because the testing that was done may have been limited to older versions of the product. Inadequate retesting is a common problem in the industry.
- If Toyota is found to have used an automated “regression suite” of tests, then you need to look for the problem of inadequate sampling. What happens is that the tests are only covering a tiny fraction of the operational space of the product (a fraction of the states it can be in), and then they just run those over and over. It looks like a lot of testing, but it’s really just the same test again and again. Excellent testing requires active inquiry at all times, not just recycling old actions.
- If Toyota is found not to have used test automation at all, look for a different kind of sampling problem: limited human resources not being able to retest very extensively.
- Most testers are not very ambitious and not well trained in testing. No university teaches a comprehensive testing curriculum. Testing is an intellectually demanding craft. In some respects it is an art. Examine the training and background of the testing staff.
- Examine the culture of testing, too. If the corporate environment is one in which initiative is discouraged or all actions are expected to be explicitly justified (especially using metrics such as test case counts, pass/fail rates, cyclomatic complexity, or anything numerical), then testing will suffer. During discovery, subpoena the actual test reports and test documentation and evaluate them.
- Any argument Toyota makes about extensiveness of testing that is based on numbers can be easily refuted. Numbers are a smoke-screen.
- Examine the internal defect tracking systems and specifically look to see how intermittent bugs were handled. A lack of intermittent bug reports certainly would indicate something fishy going on.
- Examine how the design team handled reports from the field of unintended acceleration. Were they systematically reviewed and researched?
- Depositions of the testers will be critical (especially testers who left the company). It is typical in large organizations for testers to feel intimidated into silence on critical quality matters. It is typical for them to be cut off from the development team. You want to specifically look for the “normalization of risk” problem that was identified in both the Columbia and Challenger shuttle disasters.
- If the depositions or documentation show that no one raised any concerns about the acceleration or braking systems, that is a potential smoking gun. What you expect in a healthy organization is a lot of concerns being raised and then dealt with forthrightly.
- Find out what specific organizational mechanisms were used for “bug triage”, which is the process of examining reported problems and deciding what to do about them. If Toyota says there was no triage process, that is either a lie or a gross form of negligence.
- If Toyota claims to have used “proofs of correctness” in their development of the software controllers, that means nothing. First, obviously they would have to have correctly used proofs of correctness. But secondly, proofs of correctness are simply the modern Maginot Line of software safety: defects drive right around them. Imagine that the makers of the Titanic provided “proof” that water cannot penetrate steel plates, and therefore the Titanic cannot sink. Yes, steel isn’t porous, but so what? It’s the same with proofs of correctness. They rely on confusing a very specific kind of correctness with the general notion of “things done right.”
- The anecdotal evidence surrounding unintended acceleration is that it does not only involve acceleration, but also a failure of braking. Furthermore, it’s a very rare occurrence, so it’s probably a combination of factors that work together to cause the problem. It’s not surprising at ALL that internal testing under controlled conditions would not reproduce the problem. Look at the history of the crash of USAir Flight 427, which for years went unsolved until the transient mechanism of thermal shock was discovered.
- You need to get hold of their code and have it independently inspected. Look at the comments in the code, and examine any associated design documentation.
- Look at how the engineering team was constituted. Were there dedicated full-time testers? Were they co-located with the development team or stuffed off in another location? How often did the testers and developers speak?
- What were the change control and configuration management processes? How was the code and design modified over time? Were components of it outsourced? Is it possible that no one was responsible for testing all the systems as a whole?
- What about testability? Was the system designed with testing in mind? Because, if it wasn’t, the expense and difficulty of comprehensive testing would have been much, much higher. Ask if simulators, log files, or any other testability interfaces were used.
- How did their testing process relate to applicable standards? Was the technical team aware of any such standards?
- In medical device development, manufacturers are required to do “single-fault condition” testing, where specific individual faults are introduced into the product, and then the product is tested. Did Toyota do this?
- What specific test techniques and tools did Toyota employ? Compare that to the corpus of commonly known techniques.
- Toyota cars have “black box” logs that record crucial information. Find out what those logs contain, how to read them, and then subpoena the logs from all cars that may have experienced this problem. Compare with logs from similar unaffected cars.
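To make the regression-suite sampling point above concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the state variables, their value ranges, and the three-test suite are invented and have nothing to do with Toyota's actual design. It only shows how the count of tests executed can grow without bound while the fraction of distinct states touched stays exactly the same.

```python
from itertools import product

# Hypothetical, heavily simplified state variables for a throttle controller.
# The names and value ranges are illustrative assumptions, not a real design.
STATE_SPACE = {
    "pedal_position": range(0, 101, 10),   # percent, in steps of 10
    "brake_applied": (False, True),
    "gear": ("P", "R", "N", "D"),
    "cruise_control": ("off", "set", "resuming"),
    "battery_level": ("low", "normal", "high"),
}


def all_states(space):
    """Enumerate every combination of the state variables as tuples."""
    return set(product(*space.values()))


def coverage(executed_states, space):
    """Fraction of distinct states in the space touched at least once."""
    universe = all_states(space)
    return len(set(executed_states) & universe) / len(universe)


# A fixed regression suite: the same three states, checked on every run.
REGRESSION_SUITE = [
    (0, False, "P", "off", "normal"),
    (50, False, "D", "off", "normal"),
    (0, True, "D", "set", "normal"),
]


def run_suite(times):
    """Return the states exercised by running the fixed suite `times` times."""
    executed = []
    for _ in range(times):
        executed.extend(REGRESSION_SUITE)
    return executed


if __name__ == "__main__":
    for runs in (1, 1000):
        executed = run_suite(runs)
        print(
            f"runs={runs} tests_executed={len(executed)} "
            f"state_coverage={coverage(executed, STATE_SPACE):.2%}"
        )
```

Even this toy space has 792 states, and the suite touches 3 of them whether it runs once or a thousand times. A real controller also has timing, history, and analog inputs, so the true fraction would be far smaller. A test count like "3,000 tests executed" says nothing about that fraction, which is why the numbers deserve scrutiny.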
The best thing would be to reproduce the problem in an unmodified Toyota vehicle, of course. In order to do that, you not only need an automotive engineer and an electrical engineer and a software engineer, you need someone who thinks like a tester.
The unfortunate fact of technological progress is that companies are gleefully plunging ahead with technologies that they can’t possibly understand or fully control. They hope they understand them, of course, but only a few people in the whole company are even competent to decide if that understanding is adequate for the task at hand. Look at the crash of Swissair Flight 111, for instance: a modern aircraft brought down by its onboard entertainment system, killing all aboard. The pilots had no idea it was even possible for an electrical fire to occur in the entertainment system. Nothing on their checklists warned them of it, and they had no way in the cockpit to disable it even if they’d had the notion to. This was a failure of design; a failure of imagination.
Toyota’s future depends on how they take seriously the possibility of novel, multivariate failure modes, and aggressively update their ideas of safe design and good testing. Sue them. Sue their pants off. This is how they will take these problems seriously. Let’s hope other companies learn from no-pants Toyota.
David Gilbert says
Great post James. Lots of good stuff to think about. But as someone who wears both the hat of a tester and of a producer of software, I have some sympathy for Toyota, especially as of late.
[James’ Reply: I’ll have sympathy for Toyota when their management stops pooh-poohing reports of trouble. The testimony I saw before Congress was classic dismissive management behavior.
They should have gone up there and simply said: “we’re not satisfied with our testing and we won’t be until these reports simply stop coming in. Here are the twenty-six things we are doing to investigate and fix the problem. Here is the panel of independent experts we’ve brought in to help us. We’re not making any excuses. We’re just working the problem.”]
Last week, the big story was some guy hurtling down the freeway at 80 mph in a runaway Prius who had to call 911 to get help shutting the car down. Seriously?!?!
[James’ Reply: It’s possible that the guy was crazy. It’s also possible that the software in the car went crazy. I read that the car’s brakes had been worn down to nothing. That doesn’t sound consistent with a fellow who was NOT trying to stop his vehicle.
I once had trouble with my Saturn’s brakes. I took the car in, and they told me they couldn’t reproduce the problem. I had to take a mechanic out and show him. It turned out they had ignored one of the conditions I reported: that the problem only happened at very low speeds (less than 5 MPH). It turned out that I had a bad sensor in one of the wheels that was reporting a lock up of that wheel at low speeds, causing the antilock brake system to activate on a dry road, which obviously reduces braking power and caused me to slide through stop signs. They were pretty confident that the problem was all in my own head, the bastards, until I proved them wrong.]
Notice to all Prius drivers: You know that big stick looking thing you pull down to “D” to drive around? The one you push to “R” to back up? And “P” to park? It also has another setting…”N”, for neutral. If your Prius takes off on you, push the big stick thing to “N” and it will stop running away.
[James’ Reply: Is that mechanically or logically coupled to the transmission? My understanding (I haven’t researched any of this, so maybe I’m totally wrong) is that there is very little or nothing that is mechanically yoked. Isn’t it all fly by wire? Is there a mechanically guaranteed way to stop the car, or does it depend on software and transistors that do the right things?
If there’s a mechanically guaranteed way to stop the car, Toyota should really be telling people about it!]
Yeah, the engine will likely blow up, but Toyota is gonna buy you a new one, trust me. Better that than to be on the national news talking to 911 about your runaway Prius…that can’t be duplicated…that neither Toyota nor the Federal Government can find anything wrong with, as of this morning…hmmm…or is it?
[James’ Reply: What do you mean CAN’T be duplicated? How do YOU know it can’t be duplicated? All you know is that they haven’t yet duplicated it. But how did they try to duplicate it? Have they recreated all possible forms of electromagnetic interference? Have they recreated the physical condition of the brakes at the time? (No, because they replaced the brakes.) Have they recreated the heat, the battery charge, the state of the engine and the computers? How COULD they have done so?
I’d like to know what the log file in the car had recorded, though. Maybe Toyota needs to put a lot more diagnostic logging into their vehicles.]
There are two points here…first, users do have some responsibility. If you are gonna hurl 2000 pounds of steel, plastic and glass down the freeway, you really should understand a little more than “As long as I keep the gas tank full it’s all good”. As a pilot, James, you know exactly what I am talking about. Yet, in our culture, we are all too happy to abdicate all responsibility to someone else.
[James’ Reply: Now this is a bit weird. Are you saying that people should pre-flight their cars, and study the gear linkages? The guy was an older fellow wasn’t he? A veteran driver, probably. Unless he was crazy, which is definitely possible, it seems to me he behaved reasonably. We don’t know what he tried to do, but apparently he tried the brakes. He did a reasonable thing by calling 911 if his car was hurtling out of control– so that his situation would be documented in case he ran off a cliff or something.]
Which brings me to my second point…when a big company suffers a major, public failure, a class of people known as sharks appear, smelling the deep pockets of blood in the water, and move in for their own little personal killing. And at that point, what hope does Toyota, or any other company, have of making sense of the totally bogus data the field is going to start providing to them. So I DO pity Toyota, at least a little. They may have missed many of the things you point out, but at the same time, they may have tried hard to get it right and just made a mistake; and then they went into damage control mode; and now, they are the target of a feeding frenzy, being driven at least in part by a media, culture, and government who is all too happy to serve them up as the next great piece of entertainment fodder.
[James’ Reply: You can feel sorry for them a little, but not too much, until they show they are serious. The testimony I saw was not encouraging.]
So while this may be a lesson in all the ways testing should have been done but wasn’t, it should also be a cautionary tale of how we expect everyone else to get everything right, while we ourselves don’t.
David
[James’ Reply: Look David, are you saying that Toyota really has a problem with its cars, or not? If this is all bogus data from sharkies, then Toyota has great cars and they did get it right. If, however, they did not get it right, and their cars are actually occasionally careening out of control, then that is a very very bad thing, and we should see them acting aggressively not to deny the problem, but to embrace it and resolve it.
Toyota can spend $50 million to research this problem and it still will be cheap. In fact, if I were Toyota, I would be trying to use this situation to innovate and leapfrog the competition: create a new design and testing process with state-of-the-art EMI testing facilities, EMI shielding on all components, long-period reliability testing through simulations, mathematical proofs of correctness (yes, I said it), all kinds of innovation.]
David Gilbert says
James — Yeah, I should have known better than to start this conversation on line…the real time back and forth gets crippled by the asynchronicity. Anyway…
I am not saying that Toyota does not have a problem; and I am not saying they did everything they should have. What I AM saying is that at this point in time, it becomes difficult for outside observers to tell the fact from the fiction, and Toyota is directly in the line of fire for anyone who wants to try and sue them for this, exactly because it is, as of yet, seemingly impossible to replicate on demand, and relatively easy to “simulate”. Many bugs get this way, and stay this way a long time, not because of incompetent testing, but because they are just damned difficult to diagnose. But since the public, just like management oftentimes, does not understand or accept that fact, Toyota will get hung up as the poster child for incompetence, no matter what they do.
As to the point of people pre-flighting their cars, to some degree even that is not a terrible idea, but that is more than I intended. My intent simply was that they should understand them to the point of being able to control them as much as practical, even (especially?) in emergency situations. Personally, I am not positive if the shifter is mechanical or fly by wire…however, it would be good to know if anyone has tried it, because if it is fly by wire, and it fails to disengage the engine when the thing runs wild, it points to an even larger system failure than currently imagined, since presumptively the ability to shift gears should be independent of the throttle or brake. (I know, big and arguable presumption)
David
Todd Bradley says
Pre-flighting your car is always a good idea, whether you drive a Prius or a Porsche. They taught us that in driver’s ed in high school, and I got a recent reminder when I went through motorcycle safety to get my motorcycle license.
James, I liked this article. I may be crazy, but I went to the dealer and bought a new 2010 Prius today. I spent a lot of time last weekend researching every aspect I could read about the quality issues. After that, I’m pretty sure Toyota’s problems are exaggerated by about a factor of 10. Yeah, they’ve had some issues, but from what I read their failure rate is on par with, or even lower than, that of other car manufacturers. In other words, every make and model has issues. We’re just hearing about this one because it’s the trendy thing for the news to talk about. Next month it’ll be the brassiere bomber, Lindsay Lohan’s suicide, or how bread is evil because if you eat too much of it you’ll get fat.
And yes the shifter is fly-by-wire.
[James’ Reply: I would buy a Prius, too, and here’s why: those guys have to be sweating blood and bullets to analyze and fix this thing. If it turns out to be nothing, then they’ll end up with a very well tested vehicle design. If it turns out to be a big problem, they’ll do another recall. Either way, the problem, if it exists, is extremely rare.]
Mike Bonar says
Great stuff, James.
I would like to point out that Toyota is the #1 automobile manufacturer in the world now because of their relentless commitment to Lean Manufacturing techniques. Lean Manufacturing is all about eliminating waste in processes, and you have to wonder if they applied Lean principles to their software testing process. One of the tools of Lean Manufacturing is the Value Stream Map. The Value Stream Map identifies each unit of a process and measures how long each unit takes. In a manufacturing context it helps to predict how time/resources could and should be consumed during a given manufacturing cycle (order to customer). Software development can be viewed as a manufacturing process if we take Lean Manufacturing as a heuristic to solving the problem of inefficient software development projects. It would be interesting to apply the same techniques to the software testing craft to see if we learn anything new.
[James’ Reply: Hi Mike, software development and testing is not a manufacturing process at all. It’s a design process. Has Lean Manufacturing been applied to design? If so, how?
See Herbert Simon’s Sciences of the Artificial. He defines what the elements of a science of design would have to be. One thing we have to deal with is bounded rationality and satisficing. How does that relate to eliminating waste?
For instance, is a test that doesn’t discover a bug just a “waste”? I don’t think so. But what would eliminating waste mean in a test project?]
Eric says
Something I witnessed first hand.
While testing software for a flight system I, being new to the engineering dept., failed to run the test “properly”. Instead of setting up a scenario of a plane descending 600’/minute, I set in 6000 feet all at once. The test actually passed because the expected failure criteria did not match what was written into the present operating system. The warning light did not light and the autopilot did not disengage.
A version of this software is currently in use. There have been related incidents and the experts can’t find anything wrong.
Why didn’t I speak up? Because I had been recently written up for an incident in which a new supervisor was establishing his new position. So I said to myself “F’em”.
If a life had been lost, I would’ve spoken up pronto.
This very same scenario is what I believe is causing the Toyota problem.
Software engineers will see that the present software operates just fine – it does, BUT it doesn’t take into account exceptional circumstances. Engineers generally hate exceptional circumstances, considering them to be ridiculous.
The recent NASA report looked at electronics integrity, which I’m sure is AOK. It’s a misleading report. Everything is not OK, the software was written with holes in it.
The company I work for will fire me for speaking up, I prefer to remain anonymous.
A truly good software engineer will spot the flaws, but he/she will be one in a thousand.
Good luck all you ambulance chasers.
NICK ZARRAS says
I had an uncommanded acceleration in my 2003 Toyota Avalon. I was in a parking space and it went full throttle into a brick wall. It was on Feb 8, 2010, the day the news stories hit the TV. I have four college degrees, one in mechanical engineering. I flew jet fighters for the military and I was a flight test for jet fighter maintenance. I was also chief of safety, and a prior LEO. I currently am a feature editor and road test editor for a motorcycle magazine.
So I have the technical and tester background. I did my homework, and when Toyota corporate assigned me an engineer to test the vehicle after it was totaled, I briefed him on what I wanted him to test, i.e., the ECU for software corruption and the air idle valve. Both areas were heavily mentioned on the NHTSA site for most models from 2000 on. My vehicle was not on the recall list. He showed up and put no power on the vehicle, and did not test what I prescribed. The accident happened in a high-EM environment. I feel the ECU is not shielded properly, leading to software corruption.
When they sent me the test report, they stated it was in a location that it was not in. I had a TV news team do the story, so it is verified.
When my insurance company wanted to test the ECU and air idle valve, Toyota would not release the code.
So that tells you that Toyota is at fault and paying the fines to DC to get off. If they did not have a problem they would allow scrutiny which would lead to vindication.