Quality Lessons from Beyond: The Mars Rover Curiosity’s Ability to Cope
Test and Monitor | Posted March 20, 2013

Think you’ve got it bad because your data centers are located in another country, forcing you to do all your troubleshooting remotely? Try having to troubleshoot on another planet!

While the world waited impatiently for the Mars Rover “Curiosity” to come back out of safe mode, I found myself sympathizing with the troubleshooting team as they struggled to find the source and resolution of multiple issues plaguing the poor robot. I was sympathizing, and also marveling at how cleverly the system is constructed to handle situations like this. Too often we bypass the all-important failover and diagnostics part of our product in favor of features (and too often we live to regret it). So, what can we learn from NASA?

 

1. Build in diagnostics from the beginning

    Here was a little tidbit I loved: when this week’s software event happened (a checksum failed due to an incorrect command file), Curiosity put itself into safe mode instead of waiting for NASA to do it. It turned out to be a minor issue and easily resolved, but it’s reassuring to know the rover is constantly checking its own integrity and reacting cautiously to situations that seem incorrect. If only people could do this. Kudos to the engineers who not only put the appropriate checks in place, such as checksums on key files, but also allowed the rover to monitor itself and enter safe mode until the error could be analyzed and corrected.
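To make the idea concrete, here is a minimal sketch of a checksum gate that drops into safe mode on a mismatch. It is not NASA's flight software: the `Controller` class, the `Mode` enum, the sample commands, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    SAFE = "safe"


class Controller:
    """Hypothetical controller that refuses to run unverified command files."""

    def __init__(self):
        self.mode = Mode.NORMAL
        self.diagnostics = []  # record kept for later analysis by humans

    def load_commands(self, payload: bytes, expected_sha256: str) -> bool:
        actual = hashlib.sha256(payload).hexdigest()
        if actual != expected_sha256:
            # Do not guess: record what happened and stop doing risky work.
            self.diagnostics.append(
                f"checksum mismatch: expected {expected_sha256[:8]}, got {actual[:8]}"
            )
            self.mode = Mode.SAFE
            return False
        return True


# Illustrative command file (made-up contents) and its checksum.
payload = b"DRIVE 10\nTURN 90\n"
good_checksum = hashlib.sha256(payload).hexdigest()

controller = Controller()
accepted = controller.load_commands(payload, good_checksum)
rejected = controller.load_commands(payload + b"x", good_checksum)
```

The point of the design is that the system makes the cautious choice on its own and leaves a diagnostic trail, rather than waiting for someone to notice.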

    As jlouis points out in his 2012 article, “It is not an easy task to be a computer on board the Curiosity.” But while it’s extraordinarily true of the Curiosity, it is also true to a lesser extent of all critical systems. You need to approach your design as if your system has to withstand radiation flares on Mars. Without the right mindset during the planning phase, you won’t allocate time and energy to hardening your systems appropriately.

2. Pay attention to failures

    Setting up the right diagnostics is just half the battle. The other half is listening to what they have to say. We can be pretty cavalier about waving away alerts we see too often or issues that seem familiar. Part of focusing on quality is understanding when something needs to be corrected. Don’t ignore signs in the logs that there might be some memory corruption or load problems in your application; those are the kinds of things that become more severe when left unattended. When NASA saw the signs of memory corruption in Side A, they reacted immediately and failed over to Side B so they could troubleshoot the problem. Which, of course, is the perfect segue to our next item on the list.
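One way to stop familiar warnings from fading into background noise is to count them and escalate the ones that keep coming back. The sketch below assumes a simple `LEVEL: message` log format and an arbitrary threshold of three; the function name and the sample log lines are invented for the example.

```python
from collections import Counter

ESCALATE_AFTER = 3  # assumed threshold; tune it for your own system


def find_recurring(log_lines, threshold=ESCALATE_AFTER):
    """Return warning/error messages seen at least `threshold` times."""
    counts = Counter()
    for line in log_lines:
        level, _, message = line.partition(": ")
        if level in ("WARN", "ERROR"):
            counts[message] += 1
    return sorted(msg for msg, seen in counts.items() if seen >= threshold)


# Made-up log lines for illustration.
logs = [
    "INFO: started",
    "WARN: memory page remapped",
    "WARN: memory page remapped",
    "WARN: slow response",
    "WARN: memory page remapped",
]
recurring = find_recurring(logs)
```

Here `recurring` contains only the message that crossed the threshold, which is the one worth a human's attention.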

3. Have a backup plan

    Most applications do have some level of system protection in place, and the Curiosity is no different. Redundant and isolated systems in the Curiosity mean the team has a clean option for continuing basic activity without fatal disruption, even when one system has to be shut down for maintenance or troubleshooting. (Scientific studies were put on hold until the A-side could be repaired and put back online as the new backup but, significantly, Curiosity continued to function and assist the lab with its maintenance.) When your product is roaming around in the unpredictable atmosphere of another planet, resiliency is critical; when your data center is on the other side of the globe, your product might as well be on another planet. While redundancy is no longer an earth-shaking concept, it is deferred or neglected more often than you would think.
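At its simplest, an A/B arrangement like this is two isolated units and a rule for swapping them. The following is a rough sketch under that assumption; `Side` and `RedundantPair` are hypothetical names, and a real system would derive `healthy` from actual diagnostics rather than a flag set by hand.

```python
from dataclasses import dataclass


@dataclass
class Side:
    name: str
    healthy: bool = True


class RedundantPair:
    """Hypothetical primary/standby pair with a manual health flag."""

    def __init__(self, primary: Side, standby: Side):
        self.primary = primary
        self.standby = standby

    def check(self) -> bool:
        """Swap roles if the primary is unhealthy and the standby is usable."""
        if not self.primary.healthy and self.standby.healthy:
            self.primary, self.standby = self.standby, self.primary
            return True
        return False


side_a = Side("A")
side_b = Side("B")
pair = RedundantPair(primary=side_a, standby=side_b)

side_a.healthy = False  # e.g. diagnostics report memory corruption
swapped = pair.check()
```

After the swap, the unhealthy side is still there as the standby, so it can be examined and repaired while the other side keeps things running.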

4. Recover cleanly

    Many companies engage in Disaster Recovery exercises, in which they simulate catastrophic disasters to their data centers, software, office space and applications, then attempt to come back online in as short a period as possible. The more critical your system, the more critical your ability to recover when disaster strikes. Despite all of the fail-safes built into the rover’s software (you can read the coding guidelines here if you want a study in cautious and mindful coding), its world is unpredictable and unknown. When the B-side computer had to take over as the primary brains of the outfit, NASA had procedures for “informing” it about its current conditions and activities as well as procedures for informing the A-side that it was now the dormant backup.
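A clean handover has two halves: brief the new primary, and tell the old one it has been demoted. The sketch below shows that shape only; the `Computer` class, the `swap_roles` function, and the state values are invented, and the assumption that the briefing comes from an external record (rather than from the retiring side's memory) is mine, not a description of NASA's procedure.

```python
from dataclasses import dataclass, field


@dataclass
class Computer:
    name: str
    role: str  # "primary" or "backup"
    context: dict = field(default_factory=dict)


def swap_roles(old_primary: Computer, new_primary: Computer, known_state: dict) -> None:
    """Promote the backup, brief it, and demote the old primary."""
    if new_primary.role != "backup":
        raise ValueError(f"{new_primary.name} is not a backup")
    # Brief the new primary from an external record, not from the
    # possibly corrupted memory of the side being retired.
    new_primary.context = dict(known_state)
    new_primary.role = "primary"
    # Tell the old primary explicitly, so both sides never claim the role.
    old_primary.role = "backup"
    old_primary.context = {}


# Illustrative values only.
side_a = Computer("A", "primary", {"activity": "drilling"})
side_b = Computer("B", "backup")
swap_roles(side_a, side_b, {"activity": "idle", "location": "site-1"})
```

Writing the procedure down as code, and rehearsing it, is what turns a failover from an improvised scramble into a routine operation.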

While most of us aren’t building interplanetary embedded software systems, there are some basic lessons we can learn from approaching our applications as if they were. We have become more complacent these days about software quality… until it affects our ability to perform critical tasks or manage our business. Maybe looking to the night sky on occasion to admire the mindful programming of the Mars rover will lead to more hardening of our earthbound systems as well.

Photo source: NASA
