Paul McKenney: Multi-core Linux

Syndicate content
paulmck paulmck 2010-03-01T01:43:44Z
Updated: 8 min 38 sec ago

Parallel Programming: Administrators as Architects

Sun, 02/28/2010 - 20:43
Although IT professionals should take care to avoid engineering envy, it is often useful to learn from the experiences of other engineering disciplines. In this posting, I will compare and contrast construction of a building to implementation of a large software project.

Leaving aside financial engineering, building construction starts with an architect, who lays out the general shape and look of the building. A structural engineer creates a detailed design, with an eye towards ensuring that the building will remain standing despite the best efforts of wind, gravity, and plate tectonics. A construction engineer works out the details of the construction process — for example, it is good if the building can support itself while being built as opposed to doing so only when completed. Other engineering specialties may be required as well, for example, HVAC (heating, ventilating, and air conditioning).

Once the building is built, different skills are needed, including operating engineers, maintenance personnel, and janitors.

A very similar sequence of events can play out for a large software application. Software architects (for better or worse) lay out the general shape of the project, developers design and code it, and others ensure that it is built, tested, and safely ensconced in some source-code management system.

However, once the application is completed, it is likely that its care and feeding will be taken over by application, database, and system administrators. The architects and developers will switch to other projects (possibly version N+1 of this same application), and perhaps even retire or otherwise move on. Of course, if the application runs at multiple sites, there might well be a separate set of administrators for each site. But for simplicity, let's assume that this application runs at only one site.

Now suppose that it is necessary to parallelize this application.

This is tantamount to major structural change to the building, such as adding several new floors. A structural change of this nature is clearly not a job that you would normally entrust to operating engineers, maintenance personnel, or janitors.

But what else can you do if the original architects and developers are gone?

Parallel Programming: Selective Struggling

Fri, 02/12/2010 - 18:33
An Eminent Reader privately indicated some distaste for the non-technical nature of recent parallel programming posts. Given that many of the obstacles to successful development of parallel software are non-technical, there will be future non-technical posts, but there is no reason not to take a technical break from these issues. And so, just for you, Eminent Reader, I present this parallel programming puzzle.

This puzzle stems from some researchers’ very selective struggles with parallel algorithms. Of course, it should be no surprise that many people, researchers and developers included, will struggle quite happily with their “baby”, but will even more happily bad-mouth competing approaches, even when (or perhaps especially when) those approaches requiring much less struggling. And yes, some might accuse me of favoring RCU in just this manner, but this is my answer to the likes of them.

Such selective struggling seems to have given rise to an interesting urban legend within the concurrency research community, namely that allowing concurrent access to both ends of a double-ended queue is difficult when using locking.

Can you come up with a lock-based solution that permits the two ends of a double-ended queue to be manipulated concurrently?

Parallel Programming: Questionable Quality Assurance

Tue, 02/09/2010 - 14:44
An earlier post noted that parallel programming suffers more potential failures in planning than does sequential programming due to the usual suspects: deadlocks, memory misordering, race conditions, and performance/scalability issues. This should lead us to suspect that parallel programs might need better quality assurance (Q/A) than do sequential programs. Q/A activities include validation, verification, inspection, review, and of course testing.

Traditionally, Q/A groups serve many roles:


  1. Run tests and find bugs.
  2. Break in new hires, who, strangely enough, are sometimes reluctant to irritate developers.
  3. Distract developers who are already behind schedule with pesky bugs.
  4. Act as scapegoat for schedule slips.
  5. Act as a target of complaints from developers who are tired of debugging either their new features or any bugs located by the Q/A group.

Although there are many highly effective Q/A groups in many software development organizations, it is not hard to find Q/A groups that find bugs, but that either cannot or will not get developers to pay attention to them. It is also not hard to find Q/A groups that are overridden whenever they point out problems that might cause a schedule slip. One way to avoid these problems is via enlightened management based on (for example) bug trends over time, and another way is for the Q/A organization to report high up into the organization. Of course, with this latter approach, one wonders just how often the Q/A organization can get away with yanking on the silver chain connecting to their executive sponsor.

Of course, FOSS communities have their own Q/A challenges, but the fact that the maintainers are usually responsible for the quality of their code adds a breath of fresh air to the process. Not least, their gatekeeper role enables them to vigorously enforce any design and coding guidelines that their FOSS community might have.

But what are the technical effects of parallel software on Q/A?

Parallel Programming: Terrible Tooling

Tue, 01/19/2010 - 21:00
In a previous life, I worked on a UNIX kernel on a team of a few tens of developers. This smallish team size resulted in a very smallish set of development tools. To see why, imagine a tool that required one developer one year to create, and that would provide a 1% improvement in everyone's productivity. Given 50 developers, it would take two years to recoup the time spent developing this tool. In contrast, consider the Linux kernel, which usually has thousands of developers. Assuming 1,000 developers, it would take less than six weeks to recoup the time spent developing this tool.

This example does much to explain the large quantity of tooling available for the Linux kernel. For example, the UNIX kernel I worked on did not have anything like lockdep, powertop, latencytop, or perf for in-kernel use, nor anything like valgrind or perf for user-mode use. Given the relatively small number of developers on that project, it almost always made more sense to find and fix problems manually.

It is important to note that this effect is due to the number of Linux-kernel developers, not to the FOSS nature of the Linux kernel — except insofar as the FOSS nature of the Linux kernel has enabled it to accumulate a large number of developers.

So what does this have to do with the perceived difficulty of parallel programming?

Parallel Programming: Amdahl's Anguish

Fri, 12/25/2009 - 00:29
The initialization portion of a kernel or server application may be amortized over an arbitrarily large runtime, limited only by the robustness of the kernel or application. If the runtime portion of that kernel or application is fully parallel, Amdahl's law gives a speedup as follows:

S = 1 / (1 - P)

As the runtime of the kernel/application increases, the parallelizable fraction P approaches
1, causing S to approach infinity, for perfect speedup.

We obtain the same result from Gustafson's law:

S(P) = P - α (P - 1)

As the runtime increases, the non-parallelizable portion of the program α approaches zero, so that the speedup S(P) approaches the number of processors P, which indicates perfect linear scalability.

What do you do if Amdahl's law says your program is perfect?

Parallel Programming: Bogus Benchmarks

Thu, 12/17/2009 - 20:38
Benchmarking has a checkered history in IT. It is not for nothing that some have paraphrased the famous and well-worn quotation “lies, damned lies, and statistics” to “lies, damned lies, and benchmarks.” This is of course unfair to benchmarks, which have been an important driving force behind improvements to computing technologies ranging from CPUs to compilers to storage arrays to graphics processors to databases. At their best, benchmarks play a number of exceedingly important roles:

  1. Providing a fair framework for comparing competing implementations on performance, price-performance, performance per watt, time to solution, and much else besides.
  2. Focusing competitive energy on improving implementations in ways that matter to users.
  3. Serving as example uses of the technologies being benchmarked.

Benchmarks from the Transaction Processing Council (TPC) and the Standard Performance Evaluation Corporation (SPEC) have been very helpful in the past, and so we can hope that good benchmarks will play an important role in accelerating the evolution of multicore hardware and software. That said, designing benchmarks is a painstaking exercise, not to be undertaken lightly. It is therefore interesting to look carefully at the Fibonacci benchmark that has been used for well over a decade to evaluate parallel work-stealing schedulers. An early paper using this benchmark admits that this is “a terrible way to compute Fibonacci numbers”, but argues that it stresses task creation and synchronization overhead.

A doubly recursive C-language implementation of the Fibonacci function takes about 18 seconds to compute fib(47) on my laptop. (Why fib(47)?) One could imagine this running faster if executed in parallel, although considerable imagination is required. For purposes of comparison, an iterative C-language implementation computes fib(47) in less than 100 nanoseconds, more than 8 orders of magnitude faster. Worse yet, the interpretive computer algebra system Maxima computes the exact 200,000-digit value of fib(1,000,000) in less than 500 milliseconds, which is still much less time than that required for fib(47) by the doubly recursive C-language implementation.

But again, the authors of the aforementioned paper argue that this stresses the thread creation and synchronization overhead. Is this a good justification for use of this benchmark?

Parallel Programming: Sequestered Source Code

Wed, 12/16/2009 - 20:05
Earlier, we noted that parallel programming requires additional planning. Failure to properly carry out this additional planning can result in deadlocks, data races, livelocks, and the usual litany of perils of parallelism. It is important to note that many of these perils are global in nature, for example, a deadlock cycle might involve any or all portions of a shared-memory parallel program.

Deadlocks are not difficult to find: once the system deadlocks, it certainly isn't going anywhere, and straightforward instrumentation can usually trace out the deadlock cycle. Although repairing deadlocks can occasionally be extremely challenging, it is quite often a simple matter of adjusting lock-acquisition order.

Not much of a problem, that is, if you have access to the source code. If different pieces of the source code are owned by different organizations, the process of fixing the code can easily be subordinated to a finger-pointing exercise. After all, breaking the deadlock cycle at any point fixes the problem, so which of the players is going to blink first and prepare the fix? And when one of the players does provide a fix, how can anyone else be sure that it actually fixes the problem, as opposed to greatly reducing its probability of occurrence? Or even not so greatly reducing its probability of occurrence?

In short, given the current state of the shared-memory-parallel software engineering art, it seems best to place an address-space boundary between code from different proprietary players — and between proprietary code and FOSS code.

Of course, this latter is something that numerous GPL boosters, including many in the Linux kernel community, have been advocating for quite a few years. Those of you who dismissed their stance as irrational free-software religion just might want to think again. ;–)

Parallel Programming: Dismal Designs, Atrocious APIs, and Crufty Code

Tue, 12/15/2009 - 20:46
Imagine a software project that has been written and maintained for many years as a single-threaded program. In such an environment, using a singleton object to hand out consecutive transaction IDs is the most straightforward thing in the world: simple, fast, and easily encapsulated. Similarly, a singleton object is also quite attractive as a central “mailbox” or communications hub. Again, simple, fast, and easily encapsulated.

That is, these approaches are attractive only until it is time to run on a multicore system, at which point their poor scalability will become painfully apparent. The following figure illustrates this point, showing the overhead of concurrent atomic increment of a single global variable in a tight loop.

Atomic increment does not scale well

Perfect scalability would result in a horizontal line in the above figure. The upward-slanting line shows that the more CPUs we add, the slower atomic increment gets. This is the antithesis of good scalability.

Of course, if the transaction ID was used only as a tag, there are a number of easy fixes, for example, assigning ranges of numbers to individual threads so that they can generate transaction IDs without the need for costly atomic operations. But what if there is a recovery mechanism that absolutely requires that the IDs be assigned consecutively? This might require a complete rewrite of what is usually extremely tricky code.

Worse yet, suppose that this transaction ID was part of some heavily used API. Modifications might then be required in many pieces of code throughout many organizations. It is likely that no one person would be able to find all the relevant code within the confines of a single organization, let alone across multiple organizations.

Oops!!!

One can argue that this is not parallelism's fault. After all, parallelism did not hold a gun to anyone's head and insist that poor design choices be made. But this is cold comfort to anyone tasked with parallelizing such software. Besides which, in absence of parallelism, these design choices were eminently reasonable.

That said, the figure above shows that atomic increment consumes less than 400 nanoseconds, even on a 16-CPU system. In many cases, this 400 nanoseconds is not a large overhead. So, is the use of global singletons really a problem for parallel code, and if so, what can be done about it?

Parallel Programming: Parallel Paranoia

Mon, 12/14/2009 - 14:20
It is all too easy to feel smug when remembering all the dismissals of parallelism, some as recent as five years ago. These dismissals portrayed parallelism as an aberration that would soon be swept into the dustbin of computing history by inexorable Moore's-Law increases in CPU frequency. It is easy to lampoon these dismissals as shortsighted or even wrongheaded, but doing so is counterproductive. As always, the reality is much more complex and much more instructive than any lampoonery, entertaining though the lampoonery might be.

But let me give you an example.

At a 1995 workshop, I was approached by a software architect asking for advice on parallelism. This guy was quite forward-thinking: even though he understood little about parallelism, he clearly understood that hardware-cost trends meant that he was going to have to deal with it sooner or later. Not many people can legitimately claim to have been ten years ahead of their fellows, but this guy was one of them.

So he asked me how best to go about parallelizing his application. After a bit of dicussion, I learned that it looked roughly like this:

Multiple instances connected to database

Here, an instance of the application is spawned for each user in order to mediate that user's interaction with the database. I asked if the application's execution was a significant component of the overall response time, and learned that it was not. I therefore told him that the best way to handle parallel systems was to continue running single-threaded per-user instances, and to let the database deal with the shared-memory parallelism. He seemed both happy with and relieved by this answer.

Why happy and relieved? Keep in mind that in the great majority of cases, parallelism is a performance optimization, and is but one potential optimization of many. Furthermore, performance is but one of many features that are of interest to the user. So the choice of parallelism can be an unwise one, especially when you have no parallel-savvy developers on your project and when parallel machines are a good order of magnitude more expensive than single-CPU machines, as they were in 1995. Adding parallelism to your project too soon might benefit a tiny fraction of your users, but the resulting deadlocks, race conditions, and other instability would greatly annoy your entire user base. Going parallel too soon could thus jeopardize your project's very existence. Therefore, simple Darwinian selection could easily generate reactions to parallelism ranging from strong distaste to absolute paranoia.

But then in the early part of this decade, the rules suddenly changed. Parallel systems are now quite cheap and readily available. So, should I give different advice to the same project today?

Regardless of the answer to this question, it is clear that some projects now need to lose their parallel paranoia if they are to thrive in our new multi-core world.

Parallel Programming: Uniprocessor Programmers and Sequential Staff

Fri, 12/11/2009 - 16:02
If the demand for parallel programmers has outstripped supply, then why not simply train up more parallel programmers? After all, there are a lot of parallel programs out there, and the developers who created them did not spring fully formed from the void.

Unfortunately, I cannot put myself forward as an expert in parallel programming courseware, because I did not learn parallel programming from a college class or a training course. Instead, I gained most of my parallel programming expertise directly from the hardware, with the aid of a logic analyzer or two. I would not have it any other way, as each new generation of hardware has something new to teach, and I have grown used to learning directly from it. Nevertheless, the fact that I have not learned much from parallel courseware makes it difficult for me to judge courseware quality.

On the other hand, I have had the good fortune to be a member of two communities distinguished by their success in quickly bringing new members up to speed on the art and science of writing parallel code. The first such community was Sequent's DYNIX/ptx engineering staff, and the second was (and is) the Linux kernel community. My experience with these two communities leads me to believe that the following conditions are required to quickly turn competent programmers into competent parallel programmers:

  1. Readily available parallel hardware.
  2. Large body of high-quality parallel code available for review and tinkering.
  3. Easy access to parallel-programming experts, preferably with a variety of viewpoints.
  4. Immediate “frank and open” expert feedback on new code.

Over the past 20 years, the situation for the first two items has improved beyond all recognition: cheap parallel hardware is readily available and there a wealth of open-source parallel software of all sorts is just waiting to be downloaded. The number of parallel-proficient developers has greatly increased, but there is still no shortage of projects lacking parallel expertise, and I have yet to see hordes of parallel-programming experts sitting around waiting to receive newbie parallel programs in need of review. That said, the shift from zero of four to two of four is a great improvement. Furthermore, one can argue that these first two conditions are the most important of the four, as they promote spontaneous generation of parallel programmers through self-education.

Of course, these communities both use what can be thought of as an apprenticeship approach, where new developers learn on the job. This can be quite effective, but it is not clear that it can produce the needed number of parallel developers. Perhaps formal education will start filling the gap.

In the meantime, what is the one most important parallel lesson?

All of this existing and expected improvement in parallel-programming expertise is encouraging, but what do you for a team of uniprocessor programmers who need to parallelize a large complex application?

There are now a number of parallel-programming classes and textbooks, so formal continuing education is becoming a feasible option, although it is too early to judge its effectiveness. As noted above, there are also a number of parallel open-source projects in full swing, so having some of your team members join one of these communities might provide effective education. Hiring a parallel-programming expert is also an option, assuming you pay close attention to team dynamics. Having parallel-programming expertise is not enough &mdash successful parallelization also requires deep knowledge of the application being parallelized.

I will leave you with the following caution:

Raw intelligence is completely insufficient to guarantee success in parallel programming. In fact, raw intelligence in absence of parallel design and coding know-how is a negative. The more intelligent you are, the deeper the hole will be before you realize that you need to stop digging. It all comes back to the need for additional planning when parallel programming. The more intelligent you are, the more likely you are to delude yourself into believing that your seat-of-the-pants “planning” is actually working, and the worse trouble you, your team, and your project will get into. There are a very few exceptions to this rule, one example being Bob Beck, the guy who defined Sequent's approach to parallelism in the early 1980s. Unfortunately for all of us, the number of people with this level of talent, ability, and experience is quite small.

Parallel Programming: Embarrassing Examples

Thu, 12/10/2009 - 00:55
Later in this series of postings, we will get back to the question of what lessons from the old Great Software Crisis might be applied to the new Great Parallel Software Crisis. In the meantime, this post will look at what can make parallel programming easy. Just to maintain my reputation for arguing with myself, several upcoming posts will examine things that can make parallel programming difficult.

The easiest problems to parallelize are those whose data is partitionable. In this case, just assign a separate CPU to each partition and stand back! This can be as easy as the following shell script:

./myprogram 1 & ./myprogram 2 & ./myprogram 3 &

Here we run three copies of the program, each working on a different part of the problem. Other examples include the data parallelism used in technical computing and the data locking used in operating systems and server applications.

Many types of computer-aided design have datasets that are nearly partitionable, a classic case being finite-element analysis. Here, an grid is superimposed on the object being analyzed, and a CPU assigned to a region of the grid. As we make the grid more fine-grained (thus usually making the solution more accurate) by a factor of N, the amount of computation per CPU rises as N3, while the amount of communication between CPUs only rises as N2. As N rises, the fraction of effort required for communication becomes arbitrarily small, allowing arbitrarily good performance and scalability.

However, as soon as there is communication, we do need to start being careful. Even in the tightest SMP, the overhead of communicating is a good order of magnitude greater than that of computation, and this ratio often amounts to multiple orders of magnitude. Therefore, even a small amount of communication can destroy your software's scalability, regardless of whether that communications is within a single CPU core or across a large cluster interconnect. This is why partitioning is a good thing, the more complete the partitioning the better. Completely partitionable problems are often called “embarrassingly parallel”, though has never been clear to me why partitioning should embarrass anyone. Perhaps it is a culture thing, but whatever it is, I simply think of it as an embarrassment of performance and scalability riches.

But what about RCU? RCU neither partitions data nor replicates it, so why is it fast?

Of course, for lazy people like me, the best cases are where the parallelism is done for you, as has long been the case for SQL developers. The database's query optimizer handles the parallelism, so that the developers don't have to. That said, a quick Google search for “SQL performance” should suffice to convince you that even SQL developers have not yet reached Parallel Nirvana.

But can you think of cases where parallelism is easy but useless?

Partitioning and replicating your program's data can be a wonderful path to excellent performance and scalability, but life is not always that easy. The next set of posts will look at some of the things that can make life difficult.

Parallel Programming: Heeding History

Tue, 12/08/2009 - 00:47
Given that parallel systems have been in existence for decades, it is worth asking why they have caused so much fuss over the past few years. Many argue that this is due to the end of Moore's-Law-induced frequency scaling, while others note that the business models of some corporations would suffer if people no longer felt the need to buy new computers every few years.

Although there is no doubt some validity to both of these arguments, the real reason is economics. Sure, parallel systems have been commercially available for some decades, but how many people could afford to splash out $1M in the 1980s? $100K in the early 1990s? $10K in the late 1990s? In contrast, how about a few hundred of 2009's sadly inflated US dollars?

As the price of parallel systems has plummeted, the number of situations where it makes economic sense to use them has increased exponentially. This in turn means that the demand for parallel software has also grown suddenly, outstripping the supply of developers with parallel-programming experience. Voilà, a parallel-software crisis.

But this is most definitely not the first software crisis. A very similar crisis arose in the late 1970s, with very similar history. A computer cost millions of dollars in the 1960s, tens of thousands with the advent of the minicomputer in the early 1970s, and mere thousands with the advent of the microcomputer and the personal computer in the late 1970s and early 1980s. Then as now, as the price of computer systems plummeted, the number of situations where it made economic sense to use them increased exponentially. Then as now, the demand for computer software grew suddenly, outstripping the supply of programmers. Then as now, a software crisis was proclaimed.

Many new programming languages were put forward to deal with this crisis, and these can be categorized, not into the good, the bad, and the ugly, but rather into the good, the fad, and the ugly.

The programming languages in the “ugly” category are still with us, though the fraction of code written using them has decreased. We still use various flavors of shell, sed, awk, and C, as well as holdovers from earlier times, including FORTRAN and COBOL. I myself used the Bourne shell and C for production software in the early 1980s, and would never have guessed that I would still be using them more than a quarter century later. They are simply too ugly — and too useful — to die.

The programming languages in the “fad” category include darlings such as PASCAL, MODULA, Scheme, Eiffel, Smalltalk, and CLU. There are no doubt a few developers still playing with these toys, but these darlings never were able to deliver on the promises made by their proponents, and never managed to gain a large developer base. (And given that I designed, coded, and put a 50,000-line PASCAL program into production, I know whereof I speak.)

So what does it mean for a programming language to be “good”? Given that the goal is to solve a software crisis, the only reasonable measures of goodness are: (1) an increase in productivity of existing developers by orders of magnitude, (2) an increase in the fraction of the population who can use computers, again by orders of magnitude, or preferably (3) both.

So, what were the “good” programming languages that solved the Great Software Crisis of the late 1970s and early 1980s?

And what lessons should we draw from the Great Software Crisis to help us deal with the Great Parallel Software Crisis?

Parallel Programming: Inappropriate Intuitions

Sat, 12/05/2009 - 18:38
I had the good fortune to attend a panel on parallel-programming education at the recent SC09 conference. To their credit, many of the panelists noted that parallel programming was not all that hard, and many of them took their more sequential-minded colleagues to task for telling students that parallel programming is difficult. Of course, many of them also said that the locking-plus-threads programming paradigm is difficult, never mind the fact that locking-plus-threads has been used very successfully in a great many software projects.

Nevertheless, this panel's message was a welcome change of pace from the many years of hand-wringing over parallel programming. And although I have given much thought to whether and why parallel programming might be difficult, this panel's message motivated me to step back and think on it some more.

There have been any number of people claiming that parallel programming is completely counter-intuitive, often with little or no supporting evidence. In contrast, we need to think carefully about this important issue, and the place to start is with sequential programming. Please note that it is easy to make a solid case for sequential programming being counter-intuitive. If you don't believe me, you should talk to people who took up programming, who mastered it, but who found it unappealing.

My discussions with such people uncovered the following major insults to their intuition:

  1. People expect intelligent beings, whether organic or inorganic, to have some degree of common sense. Despite the decades of research sacrificed at the altar of artificial intelligence, computers remain almost completely devoid of common sense.
  2. People also expect intelligent beings to have some understanding of their intent, which is called theory of mind. If anything, computers' lack of theory of mind is even more profound than their lack of common sense.
  3. People expect to be able to make a fragmentary plan, and nevertheless expect this plan to have the expected results. Computers are famously unforgiving of fragmentary plans, though perhaps GPS units are moving in the right direction.

The first two points should be uncontroversial. If you doubt them, consider the failures of the much-maligned Clippy and Microsoft Bob. These two product features attempted to relate to the user as people, thus raising common-sense and theory-of-mind expectations, expectations that these two products proved completely incapable of meeting.

The reliance people cheerfully place on fragmentary plans deserves further discussion. This reliance is apparently due to the expectation that the person executing the plan will be intelligent, have common sense, and have an understanding of the intent behind the plan, especially in the real-world common case where the person making the plan and the person executing it are one and the same. In the case, the plan will be revised as a matter of course to account for unexpected events and obstacles. This worked quite well through much of human history, which is no surprise: far better to start hunting in a random direction than to starve to death while planning for all possible situations that might arise during the hunt.

Unfortunately, even the best human reflexes simply cannot keep up with a 5GHz CPU. Even if we imagine a hyper-caffeinated 5GHz superhero, there are many millions of computers to be kept up with. And so the modern microprocessor invalidates untold millenia of evolution, frustrating untold numbers of would-be computer professionals.

Please note that these three inappropriate intuitions have absolutely nothing to do with parallel processing. This should be no surprise, as the universe is, always has been, and always will be highly concurrent. Given that, at least as far as we know, human beings have spent their entire history living in this universe, shouldn't we expect concurrency to be intuitive?

Of course, concurrency being intuitive does not necessarily mean that concurrent programming is intuitive. After all, the typical American-football player must deal with 21 other players on the field, the ball, and a few coaches and umpires, which should be excellent concurrency training. And perhaps it is excellent training, but I know of only one football player who went on to be a parallel programmer. Then again, he is an extremely talented and capable parallel programmer, so perhaps the generation being raised by soccer moms will show us old guys a thing or three about concurrency.

All that aside, the usual examples of parallel-programming difficulties — deadlocks, race conditions, memory misordering, performance and scalability issues — are really examples of item #3 above: failure to plan properly. In other words, parallel programming does not add new types of counter-intuitive behavior, but rather exacerbates the effects of existing sequential-program counter-intuitive behavior. One of the parallel-education panelists at the recent SC 09 conference claimed that parallel programming represented only about 5% additional difficulty over and above that of sequential programming, and this line of reasoning certainly supports that claim.

Nevertheless, 5% of the effort required to truly master sequential programming is a substantial undertaking, so the remainder of this series of blog posts will examine exactly what makes parallel programming such a challenge — and how to surmount this challenge.

Hunting More Heisenbugs

Fri, 11/20/2009 - 19:57
How should you celebrate a successful heisenbug hunt? Any five people would likely have ten different opinions, but whatever your choice, you should also fix your broken tests. Just what makes me say that your tests are broken? Consider the following definitions:

  1. The only bug-free programs are trivial programs.
  2. A reliable program has no known bugs.

From these two definitions, it clearly follows that any reliable non-trivial program has at least one bug that you don't know about. The fact that you don't know about it means that your tests have not yet found that bug, and therefore, as asserted above, your tests are broken. And yes, I am giving your program the benefit of the doubt by assuming that it is non-trivial, and in any case, I claim that the Linux kernel's RCU implementation is non-trivial. RCU therefore contains at least one bug I don't know about — one bug my tests didn't find — and I therefore need to raise my rcutorture game.

I fixed the rcutorture test process as follows:

  1. Choosing CONFIG_RCU_FANOUT=2 on an eight-CPU machine, which exercised code paths that would otherwise require 1024 CPUs.
  2. Decreasing the wait time successive randomly chosen CPU-hotplug operations from three seconds to a hundred milliseconds (using “sleep 0.1” from a bash script!)
  3. Retaining the artificially low one-scheduler-clock-tick delay between invocations of force_quiescent_state().

This improved testing regimen resulted in grace-period hangs, but only with CONFIG_PREEMPT_TREE_RCU. The really nice thing about this improved test was that the failures occurred within a few minutes, which permits use of printk() debugging, a rare luxury when working with RCU infrastructure. I nevertheless suppressed my urge to wildly code up random printk() statements in favor of enabling the RCU tracing infrastructure. The tracing data clearly showed that all leaf rcu_node structures had recorded quiescent-state passage for all their CPUs, but that this information had not propagated up the rcu_node hierarchy. Given that these grace-period hangs occurred only in CONFIG_TREE_PREEMPT_RCU, this data fingered two possible code paths:

  1. rcu_read_unlock_special() invoked from a task that had blocked within the prior RCU read-side critical section, and
  2. __rcu_offline_cpu() when offlining the last CPU from a given leaf rcu_node structure when that structure has queued a task that has blocked within its current RCU read-side critical section.

A little code inspection on these two code paths located the bug and the fix. With this fix, RCU appears to be quite reliable.

So how should I celebrate success in this latest hunt?