Every programmer knows that a stack overflow is a Very Bad Thing™. The effect of each stack overflow varies, though. The nature of the damage and the timing of the misbehavior depend entirely on which data or instructions are clobbered and how they are used. Importantly, the length of time between a stack overflow and its negative effects on the system depends on how long it is before the clobbered bits are used.
Unfortunately, stack overflow afflicts embedded systems far more often than it does desktop computers. This is for several reasons, including:
- embedded systems usually have to get by on a smaller amount of RAM;
- there is typically no virtual memory to fall back on (because there is no disk);
- firmware designs based on RTOS tasks utilize multiple stacks (one per task), each of which must be sized sufficiently to ensure against unique worst-case stack depth;
- and interrupt handlers may try to use those same stacks.
Further complicating this issue, there is no amount of testing that can ensure that a particular stack is sufficiently large. You can test your system under all sorts of loading conditions but you can only test it for so long. A stack overflow that only occurs “once in a blue moon” may not be witnessed by tests that run for only “half a blue moon.” Demonstrating that a stack overflow will never occur can, under algorithmic limitations (such as no recursion), be done with a top down analysis of the control flow of the code. But a top down analysis will need to be redone every time the code is changed.
Best Practice: On startup, paint an unlikely memory pattern throughout the stack(s). (I like to use hex
23 3D 3D 23, which looks like a fence ‘
#==#’ in an ASCII memory dump.) At runtime, have a supervisor task periodically check that none of the paint above some pre-established high water mark has been changed. If something is found to be amiss with a stack, log the specific error (e.g., which stack and how high the flood) in non-volatile memory and do something safe for users of the product (e.g., controlled shut down or reset) before a true overflow can occur. This is a nice additional safety feature to add to the watchdog task.