Barr Code

Monday, February 08, 2010

Embedded Software is the Future of Product Quality and Safety

Last year a friend had a St. Jude pacemaker attached to his heart. When he reported an unexpected low battery reading (displayed on an associated digital watch) to his doctor a month later, he learned this was the result of a firmware bug known to the manufacturer. The battery was fine and would last on the order of a decade more. His new-model pacemaker's firmware didn't include a bug fix that had been provided the year before to wearers of the old model.

Another friend owns a Land Rover LR2 SUV with back-up sensors. When the car is in reverse and nearing an obstacle or another car, the driver is alerted via a beeping sound. Except that the back-up sensors don't always work. Some "reboots" of the SUV don't seem to have this feature enabled. He suspects there is a "race condition" during the software startup sequence.

Yet another friend has driven a Toyota Prius hybrid over 100,000 miles. He reports that the brakes very occasionally have an odd/different feel. But his older model Prius is not expected to be subject to the 2010 model year recall.

These are just a few of the personal anecdotes behind the headlines. Embedded software is everywhere now, with over 4 billion new devices manufactured each year. Increasingly, the quality and safety of products are a side effect of the quality and safety of the software embedded inside.

One important question is, can we trust future software updates any more than we can trust the existing firmware? How do we know that the Toyota Prius hybrids with upgraded braking firmware will be safer than those with the factory firmware?

Thursday, January 28, 2010

Is Toyota's Accelerator Problem Caused by Embedded Software Bugs?

Last month I received an interesting e-mail in response to a column I wrote for Embedded Systems Design called The Lawyers are Coming! My column was partly about the poor state of embedded software quality across all industries, and my correspondent was writing to say my observations were accurate from his perch within the automotive industry. Included in his e-mail was this interesting tidbit:

I read something about the big Toyota recall being related to floor mats interfering with the accelerator, but I was told that the problem appears to be software (firmware) for the control-by-wire pedal.  Me thinks somebody probably forgot to check ranges, overflows, or stability properly when implementing the "algorithm".

As background for those of you who have been working in SCIFs or other labs, the "big Toyota recall" was first announced in September 2009. It was said to concern removable floor mats causing the accelerator pedal to be pressed down. Some 3.8 million Toyota and Lexus vehicles were involved and owners were told to remove floor mats immediately.

This week several related major news events have transpired, including:

But none of the articles I've read have talked about software being a cause. And it's not clear if the affected models are drive-by-wire. However, at least one article I read yesterday suggested that one fix being worked on is a software interlock to ensure that if both the brake and the gas pedal are depressed, the brake will override the accelerator. On the one hand, that seems to mean that software is already in the middle; on the other, I would be extremely surprised to learn that such an interlock wasn't already present in a drive-by-wire system.
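
To make the interlock idea concrete, here is a minimal sketch in C of a brake-override check combined with the kind of range checking my correspondent mentioned. It is purely illustrative and says nothing about how Toyota's firmware actually works: the function names, the ADC limits, and the percent scaling are all assumptions of mine.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical raw ADC limits for a healthy pedal sensor (assumed values). */
#define PEDAL_ADC_MIN   100u
#define PEDAL_ADC_MAX   900u

/*
 * Convert a raw pedal reading to a throttle request in percent (0..100).
 * Returns false if the reading is outside the plausible range, in which
 * case the request is forced to zero.
 */
bool pedal_to_percent(uint16_t raw, uint8_t *percent)
{
    if (raw < PEDAL_ADC_MIN || raw > PEDAL_ADC_MAX) {
        *percent = 0u;
        return false;
    }
    /* Widen before multiplying so the intermediate product cannot overflow. */
    uint32_t span = (uint32_t)PEDAL_ADC_MAX - PEDAL_ADC_MIN;
    uint32_t scaled = ((uint32_t)raw - PEDAL_ADC_MIN) * 100u / span;
    *percent = (uint8_t)scaled;
    return true;
}

/* Brake wins: any brake application or sensor fault commands zero throttle. */
uint8_t throttle_command(uint16_t raw_pedal, bool brake_pressed)
{
    uint8_t percent;
    if (!pedal_to_percent(raw_pedal, &percent) || brake_pressed) {
        return 0u;
    }
    return percent;
}
```

With these assumed limits, a mid-travel reading of 500 and no brake yields a 50% request, while the same reading with the brake pressed yields 0. A real system would also need redundant sensors, debouncing, and a defined fault reaction, none of which is shown here.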

So what's the story? Are embedded software bugs to blame for this massive recall? Do you know? Have you found any helpful articles pointing at software problems? Please share what you know in the comments below, or e-mail me privately.

Tuesday, January 26, 2010

Firmware Update - A Free Newsletter for Firmware Engineers

I've been writing about the practice of embedded software development--in the form of books, articles, columns, conference papers, and blog posts--for nearly 15 years.  (How time flies...)  I also served as editor-in-chief of Embedded Systems Design magazine for about 3-1/2 years in the middle.  But it wasn't until August of last year that it occurred to me to write an e-mail newsletter.

My newsletter is called Firmware Update, and it is published about every 3 weeks.  The stated mission of Firmware Update is to spread the word about the firmware development best practices I have learned in my career as an engineer, consultant, and trainer.  In addition to connecting my past, present, and future writings into a coherent storyline, I am using the newsletter to link to articles and papers by others who influence my thinking.

In less than six months, the newsletter is up to more than 11,000 subscribers.  We've placed a helpful archive of all past issues at FirmwareUpdate.net.  And I'm working hard to make the format as easy and fun to read as it is informative.  If you develop embedded software, I'm certain you will find it valuable.   If you're not already a subscriber, you can join the mailing list at http://visitor.constantcontact.com/email.jsp?m=1101728959593.

Note that each issue of Firmware Update is Copyright 2009-2010 by Netrino, LLC.  However, you may reprint the newsletter for non-commercial purposes. Indeed, I encourage you to forward it to colleagues who may benefit from the information.

Friday, January 22, 2010

Rate Monotonic Analysis and Round Robin Scheduling

Rate Monotonic Analysis (RMA) is a way of proving a priori via mathematics (rather than post-implementation via testing) that a set of tasks and interrupt service routines (ISRs) will always meet their deadlines--even under worst-case timing.  In this post, I address the issue of what to do if two or more tasks or ISRs have equal priority and whether round robin scheduling is necessary in an RTOS to deal with that special case.

First a little background.  In order for the schedulability analysis portion of the RMA mathematics to provide meaningful results, the following assumptions must hold:

Under RMA, the relative priorities are assigned according to a simple rule: "The more often a task or ISR runs (in the worst case), the higher its priority."  Put another way, the task or ISR with the longest period between iterations (interarrival time, if you prefer) is least important.  This is because an infrequent but high-priority task could cause a more frequent task to miss an entire iteration.
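
The priority rule is simple enough to sketch in a few lines of C. The type and function names below (rm_task_t, rm_assign_priorities) are hypothetical, and I assume that priority 0 is the highest and that there are no more than 256 tasks; your RTOS may number priorities the other way around.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    const char *name;       /* label for illustration only */
    uint32_t    period_us;  /* worst-case (shortest) interarrival time */
    uint8_t     priority;   /* filled in below; 0 = highest priority */
} rm_task_t;

/* qsort comparator: shorter period sorts first. */
static int by_period(const void *a, const void *b)
{
    const rm_task_t *ta = a;
    const rm_task_t *tb = b;
    if (ta->period_us < tb->period_us) return -1;
    if (ta->period_us > tb->period_us) return 1;
    return 0;
}

/* Sort tasks by period and number them so that index 0 is most urgent. */
void rm_assign_priorities(rm_task_t *tasks, size_t count)
{
    qsort(tasks, count, sizeof tasks[0], by_period);
    for (size_t i = 0; i < count; i++) {
        tasks[i].priority = (uint8_t)i;
    }
}
```

Note that qsort() is not guaranteed to be stable, so two tasks with the same period end up with adjacent priorities in an unspecified order.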

So what happens if you are using RMA to assign priorities and you wind up with two (or more) tasks or ISRs assigned equal priority?  (Translation: they have the same worst-case interarrival times).  Must they be assigned equal priority in the real system?  What if the RTOS (in the case of tasks) or hardware (in the case of interrupts) doesn't support round-robin scheduling--or even equal priorities with run-to-completion?

Interestingly, it turns out not to matter a bit whether you:
  1. Merge the two tasks into one (i.e., execute the code for Task A, then Task B).
  2. Give them equal priority
    • with round robin behavior, or
    • with run-to-completion behavior.
  3. Give them adjacent unequal priorities (in either relative order).
If you run through the timing diagrams for each of the above scenarios, you'll see that all three are equivalent, except that equal priority with round robin potentially suffers a performance impact from unnecessary additional context switches.
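
If you would rather check the adjacent-priority case numerically than draw timing diagrams, the standard worst-case response-time iteration can be sketched as follows. This is a simplified model under my own assumptions: tasks are independent, deadlines equal periods, context switch overhead is ignored, execution times and periods are nonzero, and the array is already in priority order. It does not model round robin time slicing, and the names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t wcet_us;    /* worst-case execution time (assumed nonzero) */
    uint32_t period_us;  /* worst-case interarrival time; deadline assumed equal */
} rta_task_t;

/*
 * Worst-case response time of tasks[i], where tasks[0..i-1] have higher
 * priority. Returns 0 if the response time would exceed the period.
 */
uint32_t rta_response_us(const rta_task_t *tasks, size_t i)
{
    uint32_t r = tasks[i].wcet_us;
    for (;;) {
        uint32_t next = tasks[i].wcet_us;
        for (size_t j = 0; j < i; j++) {
            /* ceil(r / T_j) releases of each higher-priority task preempt us */
            uint32_t releases = (r + tasks[j].period_us - 1u) / tasks[j].period_us;
            next += releases * tasks[j].wcet_us;
        }
        if (next > tasks[i].period_us) return 0u;  /* deadline missed */
        if (next == r) return r;                   /* converged */
        r = next;
    }
}

/* True if every task in the priority-ordered array meets its deadline. */
bool rta_all_schedulable(const rta_task_t *tasks, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (rta_response_us(tasks, i) == 0u) return false;
    }
    return true;
}
```

As an example, take Task A (2 us of work every 10 us), Task B (3 us every 10 us), and a slower Task C (4 us every 20 us). Whether the array is ordered A, B, C or B, A, C, the second of the equal-period pair finishes within 5 us and Task C within 9 us, so the schedulability verdict is the same in either order. Merging A and B into a single 5 us task gives the same 9 us for Task C.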

Monday, January 11, 2010

Firmware Wall of Shame: Kenmore Elite Electric Range

A couple of years back, my wife and I remodeled our kitchen. In the process, we replaced our oven and range with a Kenmore Elite slide-in unit, similar to the one in the picture below. My wife was pretty nervous about buying an oven with a display and a keyboard--because she understood that meant embedded software with its all-too-frequent crashes and upgrades. At the time, I assured her that oven controller firmware was the sort of thing anyone who could spell 'C' could write.




But now my day of reckoning has come. Our 3-year-old oven isn't working properly. It even failed my wife on Christmas Eve, as she prepared a meal for about 20 family and friends. (Praise be for a full tank of gas and a 3-burner outdoor grill!) But still I felt vindicated. Our oven problem was with the electronics, not the firmware, I assured her--as if that were some great thing in itself! The problem only occurred when the oven was hot. And a power-cycle didn't cure it. We have learned that the buttons and display will work again, eventually, after the heat has dissipated.

Today the repairman is here. (I didn't dare void the warranty by peeking at the electronics inside before he came.) "What error code does it give when it fails?" he wanted to know. "F-1-?" I reported quickly. "We can't read the last digit, because that's a part of the display that doesn't work when the oven fails in this way." "Hmm," he muttered, turning to his repair manual, "the fix for F10 is as different from the fix for F19 as for every error code in between." "Can't you hook up your laptop to the oven's diagnostic serial port?" I wanted to know. "Nope," he replied. "The display is the diagnostic port."

Crap. My wife was right. Writing the embedded software for an oven controller is harder than I thought. The designers of the Kenmore Elite slide-in electric range's firmware forgot to account for the fact that they only had one diagnostic port and that it itself might break. Or they knew it and cheated their customers (including us), to reduce the BOM cost, out of a serial port we wouldn't know we didn't have until it was too late. Either way, shame on them.

Wednesday, January 06, 2010

Worst-Case Context Switch Times by RTOS

This morning I received an e-mail from an embedded software developer. It read in part:

We are trying to find the best case, average, and worst-case context switch times for the ThreadX and eCOS real-time operating systems. I have searched the Internet extensively. I found one source stating that the ThreadX context switch time can be under 1 microsecond, but it was unclear if that was the best-case, average, or worst-case timing. Can you help us?

As questions like this keep coming, I shall respond via this blog.

None of the timings sought (even the 1 microsecond timing found online) can be calculated without knowledge of the specific processor family, clock speed, and memory architecture. Context switch code is generally written in assembly language and mostly consists of pushing a number of CPU register contents to RAM and popping older data from RAM into registers. The primary factors in context switch timing are the number of opcodes involved, the speed of their execution, and RAM access speeds.

Note, though, that even for a given hardware platform, I am unaware of any analytical use for any context switch timing other than the worst case. ThreadX purveyor Express Logic should, like any RTOS vendor, be willing and able to provide a prospective customer an estimate of the worst-case context switch timing on their planned hardware. But you will want to validate that number on your final hardware before performing Rate Monotonic Analysis (RMA) to prove that all critical deadlines will be met.
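
When you do validate on your own hardware, the bookkeeping might look something like the following C sketch. It assumes you can read a free-running 32-bit cycle counter just before and just after each switch; how you do that is processor-specific and not shown. All of the names here are hypothetical. Keep in mind that the largest value you happen to observe is only a lower bound on the true worst case.

```c
#include <stdint.h>

typedef struct {
    uint32_t min_cycles;
    uint32_t max_cycles;   /* the number that matters for RMA */
    uint64_t sum_cycles;   /* for computing an average, if you must */
    uint32_t samples;
} switch_stats_t;

void switch_stats_init(switch_stats_t *s)
{
    s->min_cycles = UINT32_MAX;
    s->max_cycles = 0u;
    s->sum_cycles = 0u;
    s->samples = 0u;
}

/*
 * Record one measured switch. start and end are readings of a free-running
 * 32-bit cycle counter; unsigned subtraction stays correct across one rollover.
 */
void switch_stats_record(switch_stats_t *s, uint32_t start, uint32_t end)
{
    uint32_t elapsed = end - start;
    if (elapsed < s->min_cycles) s->min_cycles = elapsed;
    if (elapsed > s->max_cycles) s->max_cycles = elapsed;
    s->sum_cycles += elapsed;
    s->samples++;
}

/* Convert cycles to nanoseconds for a given CPU clock, rounding up. */
uint64_t cycles_to_ns(uint32_t cycles, uint32_t cpu_hz)
{
    return ((uint64_t)cycles * 1000000000u + cpu_hz - 1u) / cpu_hz;
}
```

For instance, a 100-cycle switch on a 100 MHz processor works out to 1,000 ns, which shows why a "1 microsecond" claim means little without the clock speed attached.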

Related Article: How to Choose a Real-Time Operating System.

Wednesday, December 23, 2009

Is Reliable Multithreaded Software Possible?

Until earlier this month, I'd overlooked a most interesting May 2006 article in Embedded Systems Design magazine by Mark Bereit titled "Escape the Software Development Paradigm Trap". The article opines that the methods we use to design embedded software, particularly multitasked software with interrupt service routines and/or real-time operating systems, are fundamentally incompatible with reliability.

Here's the critical analogy:

Imagine for a minute that I've invented the Universal Bolt. This is a metal object for joining threaded holes that can extend or collapse to fit a variety of lengths. It can expand or contract to fit holes of different diameters. The really cool feature is that I have replaced the bolt's spiral ridge with a series of extendable probes that can accommodate different thread pitches. You no longer need to stock a variety of bolts of different sizes and lengths and thread spacings because my Universal Bolt can be used in place of any of them.

Because it's able to change configurations extremely quickly, a single Universal Bolt can take the place of many conventional bolts simultaneously. What we do is rig up a clever and very fast dispatcher device that quickly moves the [Universal Bolt] from hole to hole. If the dispatcher is fast enough, my Universal Bolt can spend a moment in each hole in turn and get the whole way through your [mechanical] product so fast that it returns to each hole before the joint has had a chance to separate.

You'd have to be crazy to fly in an airplane designed this way. "If anything caused the dispatcher to derail, the entire product would collapse in a second." Yet this analogy describes the design of most products powered by embedded computers.

A fast and complex thread dispatcher keeps moving one simple and stupid integer-computation unit all over a big system tending to tasks [and ISRs] rapidly enough that they all get done. And if that dispatcher ever once leads the CPU into an invalid memory address the whole thing crashes to a halt.

Clearly, we need a new paradigm for reliable embedded software architecture. My thoughts on that are coming to this space in 2010.

Wednesday, December 16, 2009

Embedded Java Lives!

Reading the latest embedded software market survey highlights from VDC Research, I was surprised to note two data points indicating new upward momentum for Java as an embedded software development language.

First, of those survey respondents using an operating system on their current project, 11% indicated that a Java Virtual Machine is required in their product.  Second, Java was selected as the fifth most used language for firmware development at 14% of respondents (behind C, assembly, C++, and MATLAB, in that order).

This is an interesting trend.  My regular readers will note that I have written and spoken about Java in embedded systems since 1997 and that I declared Java "dead" in the embedded realm about 18 months ago.

Tuesday, December 15, 2009

Verification vs. Validation

The FDA 510(k) guidelines for medical device software leave something to be desired in the poor differentiation of two important and distinct software development practices: verification and validation.  In particular, the FDA often uses the word 'validation' to describe both types of activities.  (See, for example, the General Principles of Software Validation; Final Guidance for Industry and FDA Staff.)

Put simply, software validation is a set of activities that together demonstrate that you "made the correct product" (or, as others have put it, "built the right thing") for the customer's needs.  Validation tests that the product's behavior is consistent with the requirements, safe, and efficacious.

By contrast, software verification is a set of activities that together demonstrate that the implementation matches the design.  That is, verification tests that you "made the product correctly" ("built it right").

In the larger context, verification should come before validation.  It doesn't make sense to check that the product does what it is supposed to unless you first confirm that it does what you programmed it to.  If only the many engineers and organizations that talk about software verification and validation (a.k.a. V&V) could grasp this simple concept.  It wouldn't hurt, of course, if the FDA rewrote the above document.

Tuesday, November 24, 2009

Embedded Programmers Worldwide Earn Failing Grades in C and C++

In industry surveys, over 80% of embedded software developers report using C or C++ as their primary programming language. Yet as a group, these programmers earned a failing grade on a multiple-choice quiz testing firmware-related C programming skills. A scary result, considering that embedded software inside medical devices, industrial controls, anti-lock brakes, and cockpits places human lives at risk every day.

In a February 2008 blog post, I examined the first few hundred results from the "Embedded C Quiz" on the Netrino website. That analysis compared the performance of programmers in the U.S. and India with the rest of the world (the only three data sets large enough for meaningful analysis). I concluded that the average embedded programmers in the U.S. and India don't know C very well, but do know it better than programmers in the rest of the world.

Now, two years after launching the quiz, we have collected thousands of data points, so it's time for an update on programmer performance. In total, 3,870 programmers have taken the short 10-question multiple-choice C skills test. A few (a bit less than 3%) didn't answer all of the questions; the analysis below is based on just the 3,755 completed quizzes. (Note that each website user can only take the quiz once.)

Across all countries, the mean result was 60.8%--a grade of 'D-' at best. That is to say that the average embedded programmer answered just 6 out of 10 multiple-choice questions correctly. A rather scary fact, given that C is the language of choice for most embedded projects and that C++ is even harder to master.

Programmers in the United States scored slightly above average. But they still earned a failing grade of 61.8%. Programmers in India scored slightly below the worldwide average, at 58.9%. Together, programmers from these two large English-speaking countries accounted for the majority of all quiz takers.

The number of completed quizzes, mean scores, and standard deviations for all countries with more than 20 completed quizzes are shown in the table below, sorted by average score. In general, programmers from European countries scored best.

Country          Completed   Mean (%)   Std Dev
Poland                  23       68.7      19.2
Sweden                  26       67.7      15.8
Australia               45       67.3      22.3
Germany                 57       67.2      17.2
France                  35       66.9      24.0
United Kingdom         109       66.1      22.8
Spain                   24       65.0      18.3
Canada                 114       64.5      19.3
China                   51       64.1      23.4
Israel                  22       62.3      21.7
United States         1346       61.8      20.4
Egypt                   28       59.3      22.8
India                 1288       58.9      22.4
Romania                 45       58.9      23.0
Singapore               24       58.3      20.1
Italy                   44       56.4      20.8
Turkey                  57       55.6      23.3
Brazil                  47       55.1      24.1
Pakistan                25       44.0      21.7

How are your embedded C programming skills? Test them by taking the Embedded C Quiz yourself now at http://www.netrino.com/Embedded-Systems/Embedded-C-Quiz?

P.S.  We recently launched an Embedded C++ Quiz and the results so far look downright abysmal.  I'll write something about that in a future post.  Do you have a few minutes to take that one too?
