Concurrency Hour: Notes on the Aftermath Spring 2011 The concurrency quiz with the process manager problem didn't work out quite like I'd hoped. Now, I recognize that this problem might be confusing under pressure. It's half a page of text. But the concepts should already be familiar based on our discussion of the Unix process abstraction. Be sure you understand the class slides on Unix process trees and the process-related system calls (fork, wait, and exit). Anyway, that's why I gave just this one question in a 30 minute quiz. And I will also note that this problem is the FIRST PROBLEM on the list of OS-related synchronization problems that I offered for practice to prepare for this quiz. So it should not have been a surprise to anyone to see it. I thought I was being soft. I graded the problem on a 40-point scale. About 20% of the class got what I would call A grades (35 or 40). So at least somebody is with the program out there. But the rest of you need to step it up a notch. Duke students are smart and wonderful, and it is a privilege to teach here. You are all going to do great things out there. But in here you are going to have to study concurrent programming, BEFORE you go out there. Look at the slides I have given you. Read the Birrell paper I have given you. Work the practice problems I have given you. If you have trouble, look at the solutions I have given you. Do it. Learn it. Enjoy it. This is not over. In this problem, each process has one thread and a Process object (PCB), like the basic "Classical OS" slide you have seen many times. This problem asked you to write the three procedures Birth, Death, and Join, which will be used inside the kernel. These methods are designed so that all the details of memory allocation, new/delete, creating and destroying threads, etc., are all outside of them, so you don't have to deal with any of that. Nothing in the problem suggests that any of these three methods should call any of the others. But many people did weird stuff with them along those lines, which I mostly ignored. But anyway, the process threads call Birth and Death on their own Process objects as they start and exit. A thread MAY call Join on the Process object for any or all of its children, but is not required to. Now, let's focus on the fundamentals. These threads have some shared data structures: the Process objects, which are linked in a tree data structure. So the threads are going to need mutual exclusion to operate on those structures. Also, the threads sometimes have to wait for each other to reach a specific point in their execution. In particular, Join has to wait for the designated child (this) to signal its Death. And Death does some waiting around too. So: we're going to need some condition variables or semaphores. Inexplicably, 20% of the class somehow missed that the solution to a concurrent programming problem on an advertised concurrency quiz would require some use of synchronization to control concurrency. I got five answers with no locks, no condition variables, no semaphores. These are zero-point answers. Since I know for a fact that a couple of the people who did this are excellent students, I'm chalking it up to general pressure, confusion, and/or lack of sleep. Or something. So everybody gets another chance. But I hope it won't happen again. I got a couple more answers with some hints of synchronization, like interrupt disables or some such. These were worth a few points. Everybody else did something with some mutexes and/or condition variables. If you did some condition variable waits or broadcasts, with no locking, that was worth a maximum of 15 points. If you got the locking mostly right but didn't have any waits or signals or broadcasts, that was at best a half-credit answer (20). If you got some waits but no broadcasts or signals then it was at most 25. If you got some reasonable waits and broadcasts in there that were suitably paired and suitably protected by locks, you got 30 or above. If you got that far, you were in the top third of the class. In particular, the following is a 30-point answer: Birth(p) { lock(); install this in process tree unlock(); } Join() { lock(); while (this is not dead) wait(); unlock(); } Death() { lock(); say this is dead unlock(); signal(); } This solution gets the fundamentals right: all accesses to shared data structures are locked down, Join waits for its child to die, and a dying child signals its parent. It doesn't name multiple condition variables, so the grader assumes there is only one, which is sufficient. It doesn't associate condition variables with locks, but there is one mutex and it is used correctly. In particular, all calls to wait() are locked. So the grader looks elsewhere for mistakes. Of course, this answer is incomplete. We have to make sure that Death wakes up the right parent. It is simplest and safest to use one condition variable, as in this solution, but signal is not going to cut it. If two parents are waiting for their children, and the wrong parent gets the child's signal, then the parent will just go back to sleep, and the right parent is not notified. But we can fix that by using broadcast() instead of signal() to wake EVERYONE up. In general, using broadcast instead of signal is always safe if all of the waiters loop before they leap. We can be sure the right one will be awakened, and all the others just go back to sleep. This is always good practice for a "rough cut, get it right, don't care if it's fast" solution. Just make sure all of your waiters loop before they leap. There are a few other issues. The problem requires that dying children wait for their parents to join or die. That means the children have to wait, and the parent also has to signal its children when it joins or dies, and we have to get the sequencing right to avoid missed signals and deadlock. Again, the safest solution is to stick with the same mutex and condition variable, use broadcast instead of signal, and make sure all the waiters loop before leaping. Birth(p) { lock(); install this in process tree; unlock(); } Join() { lock(); say parent is joining on this; while (this is not dead) wait(); broadcast(); unlock(); } Death() { lock(); say this is dead; broadcast(); while (parent is still alive and has not joined on this yet) wait(); unlock(); } Note that the broadcasts are locked to be sure to avoid missed signals given the more complex signal patterns: slow but safe. Now, there are lots of little things missing here. We could flesh out all the flags and data structures, and make sure that Join returns the status as it should, which is just a little tricky. And there's another wait condition in the problem that is not reflected in the solution above. I have never published a full solution to the problem, so I will leave that as an exercise. But this covers the basics. If you read this far, say "Boo!" at the start of the next class.