The Slim Reader/Writer Lock in Orcas
Joe Duffy has posted about the new ReaderWriterLockSlim class in Orcas. Everything is cool, and I am very happy to see a replacement for the ReaderWriterLock. The Dynamic Proxy library had (twice!) bugs related to the way ReaderWriterLock was used (vs. the way it ought to work), which caused it to fail under high load.
So, I was very happy, until I got to the end of the post and saw this:
Lastly, I mentioned there are some caveats around where this lock’s use is appropriate. Well, there’s one, really: it’s not hardened to be reliable. This means a few things.
...
Next, the lock is not robust to asynchronous exceptions such as thread aborts and out of memory conditions. If one of these occurs while in the middle of one of the lock’s methods, the lock state can be corrupt, causing subsequent deadlocks, unhandled exceptions, and (sadly) due to the use of spin locks internally, a pegged 100% CPU.
So, I can't use this for anything that needs to be reliable. Where do I usually use a lock? In multi-threading scenarios, which mostly happen on servers, which have to be reliable. Hell, the way it sounds, using it in a web environment and editing the web.config is playing Russian roulette*. I tried to play with it in the January CTP, but it is not there yet.
What about cross-AppDomain stuff? Does it work across AppDomains? If so, it really does need to handle AppDomain unloads while keeping the process (and the lock) safe for further use. What happens if I am using plugins and I need to monitor rogue code and maybe kill it if it takes too long? That is a great DoS attack against my code (actually, throwing an exception from a new thread or simply causing a stack overflow will both do that as well).
* Okay, not fair, the CLR is supposed to handle AppDomain unload cleanly, but I am not sure whether this holds here as well.
Comments
I started getting disillusioned with the CLR's ability to make guarantees a while ago when we (foolishly) wrote our own server process from scratch. First was the lack of FIFO locks anywhere in the .NET Framework. I spoke with Joe Duffy and Jan Gray about this at PDC05, and they both weren't interested in enabling this. I have come to believe that Microsoft's philosophy is if you want reliability and guarantees with your managed code, you need to leave that up to your unmanaged host, such as IIS, SQL Server, or WAS.
Another disillusionment point was when Windows Server 2003 SP1 was released which caused System.Threading.Timer in the 1.1 framework to eventually stop firing. You had to call PSS to get the fix. It was ONLY broken on Windows Server 2003. Yet another sign (perhaps unintentional) that Microsoft doesn't take .NET on servers seriously. If you have an unmanaged host like IIS recycling your code on a regular basis, why go to the extra effort of making .NET reliable?
See this post for the gory details of the timer issue which had a ripple effect on things like SqlClient.ConnectionPool and System.Web.HttpRuntime: http://groups.google.com/group/microsoft.public.dotnet.framework.clr/browse_thread/thread/772a5528aba714fa/a320226391613871
This is very sad.
I have already had to write more than a few locks because they didn't exist (wait for consumers, for instance).
Considering that a .NET service is something that they are really going to start enabling with WCF (which will not sit on top of IIS in all cases), this is worrying.
A simple example of why I need reliability is a service that needs to have a huge cache in order to work. The first 15 minutes of loading the service are dedicated to populating that cache, which means that any shutdown usually leads to an outage of about 20 minutes, with any other application on the machine performing very badly.
The bug you describe is very scary, by the way.
People like Chris Brumme worked their tails off to make .NET reliable, so perhaps I'm being a bit harsh. Reading Chris's huge blog posts, it became clear that there were quite a few outstanding hard problems that just had to wait until later versions of the framework to be solved. Hopefully the work to enable greater parallelism in managed code will motivate better reliability.
The non-IIS unmanaged WCF hosting option is WAS - Windows Activation Service. It's basically IIS for all the other protocols, so they've got us covered if self-hosting with non-HTTP transports isn't reliable enough.
Ayende, AppDomain unloads are not a problem since RWLSlim can't be shared across them. Individual thread aborts are. Because ASP.NET doesn't use the CLR's V2 hosting interfaces, however, nothing the RWLSlim could do would help the situation; effectively any lock written in managed code would suffer from these problems (without extreme measures).
Regarding your question about isolation and plug-in tear-down, the CLR is working on a new AddIn model, described a bit in this MSDN article: http://msdn.microsoft.com/msdnmag/issues/07/02/CLRInsideOut/. Honestly, until then my personal advice is process isolation: it's the safest and least risky.
Oran, you are correct, the CLR places a lot (but not all) of the reliability responsibility in the hands of the host. A host is, after all, the only component that should be introducing individual async thread aborts without also unloading the AppDomain, so once the host decides to do this it also accepts the additional responsibility. Any use of aborts w/out using the hosting APIs to guarantee they are done safely is asking for problems.
The CLR locks are in fact weakly FIFO, but not strictly. I'm sorry if I seemed to shrug this off at the PDC, but there is actually a good reason for this design. It has to do with some pretty serious liveness problems that can result otherwise (and have in fact resulted in the past when locks in Windows were originally strictly FIFO). I wrote about this more here: http://www.bluebytesoftware.com/blog/PermaLink,guid,e40c2675-43a3-410f-8f85-616ef7b031aa.aspx. FWIW.
--joe
@Joe,
Thanks for the response.
I am concerned about leaving reliability to the host, since I write quite a few Windows services which have to be reliable, and those run without a host.
My question about plug-in teardown is actually also relevant to ASP.NET timeout behavior, which will also abort a thread. Obviously separate processes are not an option here.
"Any use of aborts w/out using the hosting APIs to guarantee they are done safely is asking for problems."
Is there a way for the managed process to tell the host to kill a thread and do it safely? What happens when I am not running inside a host (a Windows service again)?
I am not sure that WAS is a good idea for those services at any rate; they mostly watch resources and act upon changes rather than exposing services.
When writing my own host (like a Windows service) I usually wrap the thread class with one of my own, one which exposes a Stop method. That way, I let the thread finish its current unit of work before it stops at some safe point. This keeps my locks in a consistent state when threads start and stop.
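A minimal sketch of that kind of wrapper, written in Python rather than C# so that it runs standalone. The names StoppableWorker, work_unit and stop are hypothetical and are not taken from the commenter's code or from the .NET Framework. The idea it illustrates is only that the stop flag is checked between units of work, never inside one, so a lock taken by a unit is always released before the thread exits.

```python
import threading
import time


class StoppableWorker:
    """Hypothetical thread wrapper that stops only between units of work."""

    def __init__(self, work_unit):
        self._work_unit = work_unit                # callable that performs one unit of work
        self._stop_requested = threading.Event()   # shared flag set by the shutdown initiator
        self._thread = threading.Thread(target=self._run)

    def start(self):
        self._thread.start()

    def _run(self):
        # The flag is polled only at this safe point, so a unit of work
        # that has started always runs to completion.
        while not self._stop_requested.is_set():
            self._work_unit()

    def stop(self, timeout=None):
        """Request shutdown, wait for the thread, and report whether it exited."""
        self._stop_requested.set()
        self._thread.join(timeout)
        return not self._thread.is_alive()


# Usage: each unit of work takes a lock and updates two counters together.
lock = threading.Lock()
state = {"started": 0, "finished": 0}


def unit():
    with lock:
        state["started"] += 1
        time.sleep(0.001)          # stand-in for real work done under the lock
        state["finished"] += 1


worker = StoppableWorker(unit)
worker.start()
time.sleep(0.05)
stopped = worker.stop(timeout=5)
```

Because the worker is never interrupted in the middle of a unit, the two counters always agree after stop returns and the lock is left free. The price is that stop can take as long as one full unit of work, and a unit that blocks forever will never notice the flag.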
Hi Ayende,
There should be no thread aborts happening in an unhosted Windows Service, so this ought not to be an issue.
(Any rogue, trusted code can call Thread.Abort so long as it has a reference to the Thread object, but this is a highly discouraged practice.)
The right way to shut down a thread is exactly what Udi suggests: cooperative shutdown, where the thread polls a shared flag at safe points and shuts down voluntarily once the shutdown initiator sets it. At some point, I hope the Framework gives better support for things like IO Cancellation to make this approach more responsive in blocking scenarios. You can also consider using thread interrupts to wake up blocked threads, though there are some pros/cons to this.
ASP.NET isn’t safe in the way it aborts threads, ever, regardless of whether the lock calls Begin/EndCriticalRegion or not. That’s because it doesn’t use the new V2.0 hosting interfaces and will attempt to cancel individual threads rather than the entire AppDomain in many cases.
Reliable locks can be built in managed code to tolerate this kind of thing, but only if you introduce lengthy delay-abort regions, possibly even while blocking, which is a horrible practice (since it prevents, say, ASP.NET from reclaiming a thread).
Reliability is mostly about statistics, and statistically speaking, most of this won't be a problem most of the time. It's likely there will never be a foolproof way to do any of this, which means failsafes need to be built into the system, to deal with data corruption, unresponsive threads, and other statistically infrequent undesirable events.
--joe