There’s a popular link floating around today to a (surprise!) Stack Overflow question that, quite predictably, dispenses some worst-case advice regarding the use of kill -9. As the top-voted answer puts it:
If you don’t give the process a chance to finish what it’s doing and clean up, it may leave corrupted files (or other state) around that it won’t be able to understand once restarted.
And that, ladies and gentlemen, is exactly why you should kill -9 software as a matter of routine – as “kill -TERM” is to “kill -KILL”, so “kill -KILL” is to a lightning strike that just melted your generator and set fire to a tree 3 miles down the road.
When relying on software to pay our bills, ideally we’d like to discover these corruption scenarios before they’re rolled into production. If an app can get into a bad state due to a perfectly acceptable software condition (e.g. the OOM killer kicking in on a poorly configured Linux box), imagine how that same software would deal with an actually difficult fault occurring in hardware.
If you have software that cannot round-trip a kill -9 without failing to restart, or without corrupting user data (or worse, an account balance), then you have a ticking time bomb waiting for the day you, or a customer, or a little condensation trips a breaker at the worst possible moment.
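As a rough illustration of what round-tripping a kill -9 can look like in a test, here is a minimal Python sketch. The worker script, the state file layout and the function name kill9_round_trip are all made up for this example; a real test would launch your actual service instead. The toy worker saves its state by writing a temporary file and renaming it over the old one, so a kill at any instant should leave either the previous or the next complete state on disk.

```python
import json
import os
import subprocess
import sys
import time

# Hypothetical toy worker: bumps a counter forever, saving it after each step.
WORKER = r'''
import json, os, sys
path = sys.argv[1]
n = 0
while True:
    n += 1
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"count": n}, f)
        f.flush()
        os.fsync(f.fileno())
    # Atomic rename: a reader sees the old file or the new one, never half of one.
    os.replace(tmp, path)
'''

def kill9_round_trip(state_path, run_for=0.2):
    """Start the worker, kill it without warning, then load what it left behind."""
    proc = subprocess.Popen([sys.executable, "-c", WORKER, state_path])
    # Wait (up to 10 seconds) for the first complete write to land.
    deadline = time.time() + 10
    while not os.path.exists(state_path) and time.time() < deadline:
        time.sleep(0.01)
    time.sleep(run_for)   # let it run so the kill lands at an arbitrary point
    proc.kill()           # sends SIGKILL on POSIX, i.e. kill -9
    proc.wait()
    # A restarted process would begin here: the state must still parse.
    with open(state_path) as f:
        return json.load(f)
```

Calling kill9_round_trip with a path inside a scratch directory returns the last saved state, or raises if the file is missing or unparseable. Running it in a loop with varying run_for values is a cheap way to kill the worker at different points.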
Don’t forget that in the brave new world, perfectly acceptable conditions include Amazon’s network going down, or your EC2 instance being terminated without warning.
In most cases it is quite a simple matter to remove non-transactional behaviour from a software design, especially when it is built from well-matured storage and messaging primitives, and even when its function involves bridging with external systems over which the developer has no control. It is one of the main reasons for insisting on storing data in an SQL database or similar, even when alternatives (e.g. the filesystem) exist.
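To make the SQL point concrete, here is a small sketch using Python’s built-in sqlite3 module. The accounts table and the open_ledger and transfer helpers are invented for illustration. Both balance updates happen inside one transaction, so a process killed between them should leave no half-applied transfer behind: SQLite rolls the uncommitted work back the next time the database is opened.

```python
import sqlite3

def open_ledger(path):
    """Open (or create) a hypothetical ledger database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS accounts ("
        "name TEXT PRIMARY KEY, balance INTEGER NOT NULL)"
    )
    conn.commit()
    return conn

def transfer(conn, src, dst, amount):
    """Move amount from src to dst as a single all-or-nothing step."""
    # "with conn" commits if the block completes and rolls back if it raises.
    # If the process is killed mid-block, nothing was committed at all.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )
```

The same two updates done as separate writes to two files would have a window in which the money has left one account and not yet arrived in the other, which is exactly the state a kill -9 will eventually find.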
This design approach is well understood and even comes with a fancy name: crash-only software. There are fringe benefits too. From LWN, July 2006:
George Candea and Armando Fox noticed that, counter-intuitively, many software systems can crash and recover more quickly than they can be shutdown and restarted. […] In their experiments, no important data was lost. This is not surprising as, after all, good software is designed to safely handle crashes
This would have made a good lead-in to discussing why it’s a great idea to always treat software as a distributed system, long before it ever interacts with a network interface, but I’m hungry and so it seems that’s a rant for another time.
Happy kill -9’ing!