hmmm, reminds me of http://www.twilio.com/blog/2013/07/billing-incident-post-mor...

If you have a software stack and it is going to do something that is not idempotent (like billing customers or sending emails), you need a state machine with more states than "not done" and "done". You need a "doing" state, recorded before the action starts. After service is restored and everything is running smoothly, you go through all the tasks stuck in "doing" and decide for each one whether to retry or abort, based on other logs or an evaluation of the consequences of not acting versus acting twice. What you do not do is have your software just keep hammering away until everything magically turns "done".
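
To make that concrete, here is a rough sketch of the idea in Python. Everything in it is made up for illustration: `TaskStore`, `run`, `stuck`, `resolve`, the `charge` function and the invoice ids are hypothetical names, the dict stands in for whatever durable storage you actually use, and the extra "aborted" state is my own assumption about how you would record an abort decision.

```python
from enum import Enum


class State(Enum):
    NOT_DONE = "not done"
    DOING = "doing"
    DONE = "done"
    ABORTED = "aborted"  # assumption: a terminal state for abandoned tasks


class TaskStore:
    """In-memory stand-in for a durable table of task states (hypothetical)."""

    def __init__(self, task_ids):
        self.states = {task_id: State.NOT_DONE for task_id in task_ids}

    def run(self, task_id, side_effect):
        # Only tasks that were never started are picked up automatically.
        if self.states[task_id] is not State.NOT_DONE:
            return False
        # Record the intent before performing the non-idempotent action.
        self.states[task_id] = State.DOING
        side_effect(task_id)  # if this raises, the task stays in DOING
        self.states[task_id] = State.DONE
        return True

    def stuck(self):
        # Tasks whose outcome is unknown and needs a decision.
        return [t for t, s in self.states.items() if s is State.DOING]

    def resolve(self, task_id, retry):
        # Called after the incident, once other logs have been checked.
        if self.states[task_id] is not State.DOING:
            raise ValueError(f"{task_id} is not stuck in DOING")
        self.states[task_id] = State.NOT_DONE if retry else State.ABORTED


charged = []


def charge(task_id):
    # Hypothetical billing call: the charge goes through, then the
    # connection drops before we hear back for one of the invoices.
    charged.append(task_id)
    if task_id == "inv-2":
        raise ConnectionError("lost connection after charging")


store = TaskStore(["inv-1", "inv-2"])
store.run("inv-1", charge)
try:
    store.run("inv-2", charge)
except ConnectionError:
    pass

# A blind retry loop would bill inv-2 again; here the retry is refused.
retried = store.run("inv-2", charge)
needs_review = store.stuck()

# Other logs show the charge went through, so do not act a second time.
store.resolve("inv-2", retry=False)
```

The point of the design is that `run` refuses to touch anything already in "doing", so a retry loop cannot double-bill. The retry-or-abort call is made explicitly in `resolve`, after someone (or something) has looked at the evidence.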