.net – Diagnosing application hang in a production .NET desktop program

Tags: .net, diagnostics, freeze, production

One of the users of an application I'm developing is occasionally, but regularly, experiencing an application hang.

When this happens, we find an entry with a source of "Application Hang" in the machine's Event Log, with the informative message "Hanging application [my app], version [the right version], hang module hungapp, version 0.0.0.0, hang address 0x00000000."

I'm logging all unhandled exceptions that my application throws, and there aren't any entries in my log files when this happens.

My current working hypothesis is that this hang is occurring during the application's call to an unsafe legacy API. This wouldn't astonish me; I've been working with this API for years, and while I haven't seen it hang before, it's genuinely crappy code. Also, the user reports that the program seems to hang at random times. I don't think that's really the case. It's not that I don't believe her; it's that the code that talks to the legacy API runs inside a method called by a BackgroundWorker. If the background thread were making the application hang, it could very well look to the user as if it were happening randomly.

So, I have two questions, one specific, one general.

The specific question: I would expect that if a method running on a non-UI thread were to hang, it would just kill the thread. Would it actually kill the whole application?

The general question:

I'm already logging all unhandled exceptions. My program's already set up to use tracing (though I'm going to need to add instrumentation code to trace activity in suspect methods). Are there other things I should be doing? Are there diagnostic tools that allow some kind of post-crash analysis when a .NET application hangs? Are there mechanisms inside the .NET framework that I can invoke to capture more (and more usable) data?

EDIT: On closer examination of my code, I'm remembering that all of its usage of BackgroundWorker goes through a utility class I implemented, which wraps the called method in an exception handler. This handler logs the exception and then returns it as a property of the utility object. The completion event handler in the UI thread re-throws the exception (less than ideal, since I lose the call stack, but it's already been logged), causing the UI's main exception handler to report the exception in a message box and then terminate the app.

Since none of that is happening, I'm pretty confident that there's no exception being thrown in the background thread. Well, no .NET exception, anyway.
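For anyone with a similar wrapper: the lost call stack mentioned in the edit can be avoided by wrapping the captured exception in a new one instead of re-throwing it directly, since the original stack trace stays available on InnerException. The following is only a minimal sketch of that idea; the class and method names (WorkerErrors, Capture, ThrowIfFailed) are hypothetical and are not taken from the utility class described above.

```csharp
using System;

public static class WorkerErrors
{
    // Hypothetical stand-in for the background-side handler:
    // runs the work and returns the exception instead of letting it escape.
    public static Exception Capture(Action work)
    {
        try
        {
            work();
            return null;
        }
        catch (Exception ex)
        {
            return ex; // a real wrapper would also log it here
        }
    }

    // Hypothetical stand-in for the completion handler on the UI thread:
    // wraps rather than re-throws, so the original stack trace is preserved
    // on InnerException.
    public static void ThrowIfFailed(Exception captured)
    {
        if (captured != null)
        {
            throw new InvalidOperationException(
                "Background work failed; see InnerException.", captured);
        }
    }

    static void Main()
    {
        Exception captured = Capture(() =>
        {
            throw new TimeoutException("simulated failure in background work");
        });

        try
        {
            ThrowIfFailed(captured);
        }
        catch (InvalidOperationException ex)
        {
            // The inner exception still carries the stack of the original throw.
            Console.WriteLine(ex.InnerException.GetType().Name);
            Console.WriteLine(ex.InnerException.StackTrace);
        }
    }
}
```

BackgroundWorker also does something similar on its own: an exception that escapes the DoWork handler is surfaced as the Error property of RunWorkerCompletedEventArgs.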

Further followup:

Mercifully, I've now gotten enough data from the users to be certain that the hang isn't occurring inside the legacy API. This means it's clearly something I'm doing wrong, which means that I can fix it, so, win. It also means that I can isolate the problem through tracing, which is another win. I'm very happy with the answers I got to this question; I'm even happier that I probably don't need them for this problem.

Also: PostSharp is outstanding. If you need to add instrumentation code to an existing application, you almost certainly should be using it.

Best Solution

In answer to your specific question, when a background/worker thread blocks or hangs, the effect on the rest of the application would depend a lot on the synchronization happening between the threads in the app. There's no particular reason why it would necessarily hang the whole app, but it's entirely possible that it would.
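To make the dependence on synchronization concrete, here is a small console sketch (not your application's code) in which a worker thread "hangs" either while holding a lock that the main thread also needs or without holding anything shared. The names HangDemo, SharedLock and UiCanProceed are made up for illustration, and the calling thread stands in for the UI thread.

```csharp
using System;
using System.Threading;

public static class HangDemo
{
    // Hypothetical lock standing in for any state shared by both threads.
    static readonly object SharedLock = new object();

    // Returns true if the calling ("UI") thread can take the shared lock
    // within timeoutMs while the worker thread is stuck.
    public static bool UiCanProceed(bool workerHoldsLock, int timeoutMs)
    {
        using (var workerStuck = new ManualResetEventSlim(false))
        using (var release = new ManualResetEventSlim(false))
        {
            var worker = new Thread(() =>
            {
                if (workerHoldsLock)
                {
                    lock (SharedLock)
                    {
                        workerStuck.Set();
                        release.Wait(); // simulated hang while holding the lock
                    }
                }
                else
                {
                    workerStuck.Set();
                    release.Wait();     // simulated hang with nothing shared held
                }
            });
            worker.IsBackground = true;
            worker.Start();
            workerStuck.Wait();

            // A real UI thread would typically block here with no timeout,
            // which is what makes the whole application appear hung.
            bool acquired = Monitor.TryEnter(SharedLock, timeoutMs);
            if (acquired)
            {
                Monitor.Exit(SharedLock);
            }

            release.Set(); // end the simulated hang so the demo can exit
            worker.Join();
            return acquired;
        }
    }

    static void Main()
    {
        Console.WriteLine(UiCanProceed(false, 200)); // worker stuck, UI unaffected
        Console.WriteLine(UiCanProceed(true, 200));  // worker stuck holding the lock
    }
}
```

In the first case the stuck worker costs you only that thread; in the second, any thread that needs the lock stalls behind it. The same applies to a UI thread that calls Join, waits on an event, or makes a blocking call that depends on the worker finishing.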

One possible way to diagnose this would be to generate a dump of the process while it's hung (assuming someone is around to notice when it happens). This can be done with MiniDumpWriteDump, from dbghelp.dll. It's fairly straightforward to write a simple tool that dumps a process (based on its PID), which could be provided to the customer experiencing the issue. Since this is a managed app, a full memory dump is preferable (MiniDumpWithFullMemory), but a normal dump should still have some useful info. Once you have the dump, you can use WinDbg or your post-mortem debugger of choice to see what might be going on.
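A minimal sketch of such a tool, calling MiniDumpWriteDump through P/Invoke, might look like the following. The class name ProcessDumper, the command-line shape, and the error handling are my own assumptions; the two dump-type values come from the MINIDUMP_TYPE enumeration in dbghelp.h. It is Windows-only, and the tool should be built with the same bitness as the target process.

```csharp
using System;
using System.ComponentModel;
using System.Diagnostics;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

public static class ProcessDumper
{
    // Values from the MINIDUMP_TYPE enumeration (dbghelp.h).
    public const uint MiniDumpNormal = 0x00000000;
    public const uint MiniDumpWithFullMemory = 0x00000002;

    [DllImport("dbghelp.dll", SetLastError = true)]
    static extern bool MiniDumpWriteDump(
        IntPtr hProcess,
        uint processId,
        SafeFileHandle hFile,
        uint dumpType,
        IntPtr exceptionParam,   // no exception info: we are dumping a hang
        IntPtr userStreamParam,
        IntPtr callbackParam);

    // Writes a dump of the process with the given PID to the given path.
    public static void WriteDump(int pid, string path, uint dumpType)
    {
        using (Process target = Process.GetProcessById(pid))
        using (var file = new FileStream(path, FileMode.Create,
                                         FileAccess.ReadWrite, FileShare.None))
        {
            bool ok = MiniDumpWriteDump(target.Handle, (uint)target.Id,
                                        file.SafeFileHandle, dumpType,
                                        IntPtr.Zero, IntPtr.Zero, IntPtr.Zero);
            if (!ok)
            {
                throw new Win32Exception(Marshal.GetLastWin32Error());
            }
        }
    }

    // Hypothetical usage: ProcessDumper.exe <pid> <output.dmp>
    public static int Main(string[] args)
    {
        int pid;
        if (args.Length != 2 || !int.TryParse(args[0], out pid))
        {
            Console.Error.WriteLine("usage: ProcessDumper <pid> <output.dmp>");
            return 1;
        }

        WriteDump(pid, args[1], MiniDumpWithFullMemory);
        Console.WriteLine("Dump written to " + args[1]);
        return 0;
    }
}
```

The user would run it against the PID of the hung process (visible in Task Manager) while the application is still frozen, then send you the resulting file. Full-memory dumps can be large, so passing MiniDumpNormal instead is an option if file size is a problem.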

If you go this route, this MSDN article is a good starting point for managed dump debugging.