Home

Parallel Programming: Terrible Tooling

Paul McKenney: Multi-core Linux - Tue, 01/19/2010 - 21:00
In a previous life, I worked on a UNIX kernel on a team of a few tens of developers. This smallish team size resulted in a very smallish set of development tools. To see why, imagine a tool that required one developer one year to create, and that would provide a 1% improvement in everyone's productivity. Given 50 developers, it would take two years to recoup the time spent developing this tool. In contrast, consider the Linux kernel, which usually has thousands of developers. Assuming 1,000 developers, it would take less than six weeks to recoup the time spent developing this tool.

This example does much to explain the large quantity of tooling available for the Linux kernel. For example, the UNIX kernel I worked on did not have anything like lockdep, powertop, latencytop, or perf for in-kernel use, nor anything like valgrind or perf for user-mode use. Given the relatively small number of developers on that project, it almost always made more sense to find and fix problems manually.

It is important to note that this effect is due to the number of Linux-kernel developers, not to the FOSS nature of the Linux kernel — except insofar as the FOSS nature of the Linux kernel has enabled it to accumulate a large number of developers.

So what does this have to do with the perceived difficulty of parallel programming?

Using perf, the Linux Performance Analysis tool on Ubuntu Karmic

Anton Blanchard: Kernel hacker - Sun, 01/10/2010 - 08:57

A lot has been going on with Linux performance counters (now called performance events), but there is enough functionality in the 2.6.31 kernel that ships with Ubuntu karmic to be able to use some of the features available in perf. I recently found it useful when debugging a performance issue on my mythtv frontend.

To build perf, first install the dependencies:

sudo apt-get install libelf-dev binutils-dev

Then grab a recent kernel source tree and build perf:

wget http://www.kernel.org/pub/linux/kernel/v2.6/testing/linux-2.6.33-rc3.tar.bz2
tar xjf linux-2.6.33-rc3.tar.bz2
cd linux-2.6.33-rc3/tools/perf
make
make install

It will warn that libdwarf-dev is not installed, but the version in karmic is too old and regardless libdwarf is only required for event tracing that appeared in more recent kernels. perf installs into ~/bin/perf. You should then be able to use the top, stat, list, record, report and annotate options.

New Trusted Computing Blueprint published.

Emily Ratliff: Open Source Security - Thu, 01/07/2010 - 10:15

by Rajiv Andrade, Linux Technology Center

Since the foundation of the Trusted Computing Group, previously named Trusted Computing Platform Alliance, the pillars required to win most of today’s security challenges have been heavily developed.

The Trusted Platform Module and the Trusted Software Stack are two of these. Now that we have in our hands the required enablement, the next expected step is to come up with the development of detailed and implementable use cases that were originally envisioned when starting the Trusted Computing Initiative.

The use case presented in this newly published Blueprint exploits the integrity measurement capability that the TPM provides. Other than using a passphrase as an authorization token, it describes how to use a machine’s integrity to authorize access to sensitive files, by means of a key sealed to those integrity parameters.

The parameters include the loaded kernel image, the bootloader and its configuration file, and the BIOS. Thus, if one tries to load a different flawed kernel image, those sensitive files won’t be accessible. It’s also worth mentioning that the bootloader used is able also to measure critical system files (e.g. the libraries placed at /lib), making the job of a rootkit even harder.

The next step is to attest a machine’s integrity using the Integrity Measurements Architecture (IMA) logs that contain a list of measurements of all files accessed by the root user during runtime.

Check it out at: http://publib.boulder.ibm.com/infocenter/lnxinfo/v3r0m0/topic/liaai/tpm/liaaitpmstart.htm

Methods for relocating network connectivity

Methods for redirecting network connectivity

When a client talks to a service over a network, and the server providing the service fails, or the service needs to be moved for administrative reasons, what methods are available for redirecting network references to that service refer to a different server?

This post outlines the methods that I know of for doing this.  But note the word redirect - that word is the key to all these methods.  These different methods are ways of redirecting various layers in the networking stack.  So let's first look at what happens for a normal IPv4 network connection over ethernet, and what all layers are involved, and what all places there are to redirect (or reroute) the network traffic to another server.

For simplicity's sake, we will ignore load balancers - both in hardware form, and DNS-level load balancers, and we assume a modern "switched" network.

Information Routing

What happens when a client establishes a connection to a service on a server?

Here is a brief synopsis of how connections get established.

  1. The client is given or obtains from configuration information, bookmarks, etc. a name of the server running that service.
  2. The client consults a DNS server to translate the server/service name into a 32-bit IPv4 address.
  3. The client holds onto this name/IPv4 mapping in order to optimize future references to the server name.  DNS lookup libraries normally do this themselves, but some applications also perform their own address caching.
  4. The client then examines the IPv4 address, and determines which interface and gateway to send it out on on the basis of its local configuration and the IPv4 address itself.
  5. The client OS then sends out an ARP packet to determine the 48-bit Media Access Control (MAC) address of the gateway, or the server itself, if the client and server are on the same subnet.  It may have this MAC/IP correspondence cached from earlier packets it had received from the server.
  6. The client OS sends out the packet to the MAC address determined in step 5 over the interface selected in step 4.
  7. At some earlier time, the switch network will have "learned" which switch port the corresponds to the selected MAC address.  It does this by observing which port sends packets for that given MAC address.
  8. The switch network then routes the packet to the chosen MAC address on the subnet (this could either be the MAC address of the gateway or the server - as discussed earlier).
  9. If the server is on the same subnet as the client, it receives the packet and examines the packet to see if the destination IP address is one it provides.  If it does, then all is well.  If it does not, then the packet is dropped.    So ends the "same subnet" case.
  10. Assuming the server is not on the same subnet as the client...
  11. The gateway receives the packet, and examines its routing table to decide where to route the packet to.  This is determined by the routing protocol the gateway is running - for example, OSPF or BGP.
  12. The "network cloud" routes the packet to the "final gateway" on the same subnet as the destination server.  As before, this is determined by the various routing protocol(s) along the way from the first gateway to the last one.  (This explanation is similar to the "then a miracle occurs" in the middle of a math proof).
  13. The final gateway sends out an ARP packet to determine the MAC address of the destination server.  It is typically cached for a few minutes up to an hour.
  14. The final gateway then sends the packet out to the MAC address above over the selected interface based on the routing protocol it is running.
  15. The destination server receives the packet and examines the packet to see if the destination IP address is one it provides.  If it does, then all is well.  If it does not, then the packet is dropped.

There are several address transformations that transform from one conceptual address space into another lower level address space.  These are:

  • Translation from "conceptual knowledge" of the server to the DNS name of the server.
  • Translation from the DNS server name to the destination IPv4 address
  • Translation of the destination IPv4 address to the destination gateway using routing information.
  • Translation from the destination IPv4 address to the destination MAC address.
  • Translation from MAC address to destination switch port.

Each of these transformations is a place where a redirection can occur.

  • The conceptual knowledge layer can be redirected by telling all clients to switch to a new server name.
  • The DNS layer can be redirected by updating DNS entries.
  • The network routing layer can be redirected by updating routing information in the network and pushing out the new route information.
  • The IPv4->MAC layer can be redirected by updating the ARP information and forcing the various ARP caches to be updated.
  • The MAC->switch port can be redirected by updating MAC addresses and forcing the switch network to learn the new MAC->switch correspondence.

Subsequent sections present detailed explanations of how to perform these various kinds of redirections.

Conceptual Knowledge Layer

There is no universal automated to update the conceptual knowledge layer - nevertheless, server relocations are sometimes handled at this layer.  One can use automated client update tools to update client configuration files, one can use word-of-mouth, email, or any number of ad hoc tools.  This is the least commonly used method for redirection on failure.  Arguably, since it is hard to automate, it doesn't have much place on a blog on managing with automation.

DNS layer

Updating the DNS layer can be easily automated.  The advantages are - it's universal, and little or no prior preparation has to occur, and no server/network political boundaries have to be dealt with, and the two servers don't have to be on the same subnet.  The disadvantages are - not all clients use DNS addresses, Client OS DNS caching can interfere, Client software itself can interfere by caching the address outside DNS.  Even with Dynamic DNS, it can take minutes to hours for changes to propagate and the new server address become known (and usable) to all clients.  If the client application caches the address itself, then client applications have to be restarted.  This last subcase can be difficult to automate.

Network routing layer

If a server fails, routing can be used to redirect the traffic for the failed server to another server on a different physical access segment.  The advantages of this are - the two servers don't have to be on the same network segment, routing protocols are designed to deal with this kind of situation.  The disadvantages include - if the IP address is public, then  you have to move over at least 256 addresses at a time, there are often political boundaries making it hard for servers to automatically update network routing information, the additional routes for handling a large number of such movable addresses may slow down the routers involved.

ARP layer

When a server fails, another server can bring up the IP address of the dead service, update the ARP cache (typically using gratuitous ARPs - sometimes called ARP spoofing) and packets destined for the now-dead server go to the live one.  The advantages include:  IP address takeover can occur in less than a second, there is well-tested software for doing this, most organizations have a good method for allocating and managing additional IP addresses.  The disadvantages include:  The two servers have to be on the same network segment, some organizations lock down their network gear to make this "impossible" (which it doesn't - it just slows it down), and it typically increases the number of IP addresses needed by the servers and services.

MAC->switch port layer

MAC address takeover is a technique where a given network card is given multiple MAC addresses - one for an administrative address, and one for each group of independently-failable services.  Retraining the switches to understand which switch port services the given MAC address is accomplished by simply sending any IPv4 packet with the new MAC address.  The advantages include:  Takeover can be very fast and quite reliable.  The disadvantages include: the two servers have to be on the same network segment, some organizations lock down their network gear to make this "impossible" (which it doesn't - it just slows it down), and it typically increases the number of MAC addresses needed by the servers and services, organizations almost never have methods for allocating and managing MAC addresses like they do IP addresses.

Which Method is "Best"?

I have heard it said that when you ask an engineer a question the answer is always the same regardless of the question - "It Depends".  So it is here...

  • For servers on the same VLAN (or network segment) - IP address takeover (IP address spoofing) is the most common.
  • For servers on different VLANs or network segments:
    • Network route updating - if politically and technically feasible
    • DNS updating
    • Updating the conceptual knowledge layer
Note that there are some circumstances in which there is no easy answer.  You may have to change your network configuration, solve political problems, buy an external netblock, or execute various other combinations of uncomfortable or difficult steps.

Parallel Programming: Amdahl's Anguish

Paul McKenney: Multi-core Linux - Fri, 12/25/2009 - 00:29
The initialization portion of a kernel or server application may be amortized over an arbitrarily large runtime, limited only by the robustness of the kernel or application. If the runtime portion of that kernel or application is fully parallel, Amdahl's law gives a speedup as follows:

S = 1 / (1 - P)

As the runtime of the kernel/application increases, the parallelizable fraction P approaches
1, causing S to approach infinity, for perfect speedup.

We obtain the same result from Gustafson's law:

S(P) = P - α (P - 1)

As the runtime increases, the non-parallelizable portion of the program α approaches zero, so that the speedup S(P) approaches the number of processors P, which indicates perfect linear scalability.

What do you do if Amdahl's law says your program is perfect?

sVirt Stronger Security for Linux Virtualization

Emily Ratliff: Open Source Security - Tue, 12/22/2009 - 17:24

By Bryan Jacobson, Linux Technology Center.

While Virtualization offers many benefits, there can also be increased security risks. For example, consider a system running two hundred virtual images. All two hundred images are at risk if a flaw in the hypervisor (or configuration) allows any virtual guest to “break out” into the host environment and affect other virtual guests.

sVirt is a project to improve the security of Linux virtualization. Svirt applies the Mandatory Access Control (MAC) features of SELinux to strengthen the isolation between virtual images. Svirt works with KVM/QEMU and other Linux virtualization systems where the virtual image runs as a Linux user space process.

sVirt is a community project, with founding authors from Red Hat: Daniel Berrange, James Morris, and Dan Walsh. sVirt is integrated with libvirt.

One of my favorite sVirt use cases is: “Strongly isolating desktop applications by running them in separately labeled VMs (e.g. online banking in one VM and World of Warcraft in another; opening untrusted office documents in an isolated VM for view/print only).” (From the 8/11/2008 sVirt project announcement at www.redhat.com/archives/libvir-list/2008-August/msg00255.html).

The project announcement also identifies an excellent design goal: “Initially, sVirt should “just work” as a means to isolate VMs, with minimal administrative interaction. e.g. an option is added to virt-manager which allows a VM to be designated as “isolated”, and from then on, it is automatically run in a separate security context, with policy etc. being generated and managed by libvirt.”.

You can find a 48 minute video of James Morris’s February 2009 presentation on sVirt at Linux.conf.au: video.google.com/videoplay?docid=5750618585157629496#

Slides from that presentation are at: namei.org/presentations/svirt-lca-2009.pdf

Parallel Programming: Bogus Benchmarks

Paul McKenney: Multi-core Linux - Thu, 12/17/2009 - 20:38
Benchmarking has a checkered history in IT. It is not for nothing that some have paraphrased the famous and well-worn quotation “lies, damned lies, and statistics” to “lies, damned lies, and benchmarks.” This is of course unfair to benchmarks, which have been an important driving force behind improvements to computing technologies ranging from CPUs to compilers to storage arrays to graphics processors to databases. At their best, benchmarks play a number of exceedingly important roles:

  1. Providing a fair framework for comparing competing implementations on performance, price-performance, performance per watt, time to solution, and much else besides.
  2. Focusing competitive energy on improving implementations in ways that matter to users.
  3. Serving as example uses of the technologies being benchmarked.

Benchmarks from the Transaction Processing Council (TPC) and the Standard Performance Evaluation Corporation (SPEC) have been very helpful in the past, and so we can hope that good benchmarks will play an important role in accelerating the evolution of multicore hardware and software. That said, designing benchmarks is a painstaking exercise, not to be undertaken lightly. It is therefore interesting to look carefully at the Fibonacci benchmark that has been used for well over a decade to evaluate parallel work-stealing schedulers. An early paper using this benchmark admits that this is “a terrible way to compute Fibonacci numbers”, but argues that it stresses task creation and synchronization overhead.

A doubly recursive C-language implementation of the Fibonacci function takes about 18 seconds to compute fib(47) on my laptop. (Why fib(47)?) One could imagine this running faster if executed in parallel, although considerable imagination is required. For purposes of comparison, an iterative C-language implementation computes fib(47) in less than 100 nanoseconds, more than 8 orders of magnitude faster. Worse yet, the interpretive computer algebra system Maxima computes the exact 200,000-digit value of fib(1,000,000) in less than 500 milliseconds, which is still much less time than that required for fib(47) by the doubly recursive C-language implementation.

But again, the authors of the aforementioned paper argue that this stresses the thread creation and synchronization overhead. Is this a good justification for use of this benchmark?

Parallel Programming: Sequestered Source Code

Paul McKenney: Multi-core Linux - Wed, 12/16/2009 - 20:05
Earlier, we noted that parallel programming requires additional planning. Failure to properly carry out this additional planning can result in deadlocks, data races, livelocks, and the usual litany of perils of parallelism. It is important to note that many of these perils are global in nature, for example, a deadlock cycle might involve any or all portions of a shared-memory parallel program.

Deadlocks are not difficult to find: once the system deadlocks, it certainly isn't going anywhere, and straightforward instrumentation can usually trace out the deadlock cycle. Although repairing deadlocks can occasionally be extremely challenging, it is quite often a simple matter of adjusting lock-acquisition order.

Not much of a problem, that is, if you have access to the source code. If different pieces of the source code are owned by different organizations, the process of fixing the code can easily be subordinated to a finger-pointing exercise. After all, breaking the deadlock cycle at any point fixes the problem, so which of the players is going to blink first and prepare the fix? And when one of the players does provide a fix, how can anyone else be sure that it actually fixes the problem, as opposed to greatly reducing its probability of occurrence? Or even not so greatly reducing its probability of occurrence?

In short, given the current state of the shared-memory-parallel software engineering art, it seems best to place an address-space boundary between code from different proprietary players — and between proprietary code and FOSS code.

Of course, this latter is something that numerous GPL boosters, including many in the Linux kernel community, have been advocating for quite a few years. Those of you who dismissed their stance as irrational free-software religion just might want to think again. ;–)

Parallel Programming: Dismal Designs, Atrocious APIs, and Crufty Code

Paul McKenney: Multi-core Linux - Tue, 12/15/2009 - 20:46
Imagine a software project that has been written and maintained for many years as a single-threaded program. In such an environment, using a singleton object to hand out consecutive transaction IDs is the most straightforward thing in the world: simple, fast, and easily encapsulated. Similarly, a singleton object is also quite attractive as a central “mailbox” or communications hub. Again, simple, fast, and easily encapsulated.

That is, these approaches are attractive only until it is time to run on a multicore system, at which point their poor scalability will become painfully apparent. The following figure illustrates this point, showing the overhead of concurrent atomic increment of a single global variable in a tight loop.

Atomic increment does not scale well

Perfect scalability would result in a horizontal line in the above figure. The upward-slanting line shows that the more CPUs we add, the slower atomic increment gets. This is the antithesis of good scalability.

Of course, if the transaction ID was used only as a tag, there are a number of easy fixes, for example, assigning ranges of numbers to individual threads so that they can generate transaction IDs without the need for costly atomic operations. But what if there is a recovery mechanism that absolutely requires that the IDs be assigned consecutively? This might require a complete rewrite of what is usually extremely tricky code.

Worse yet, suppose that this transaction ID was part of some heavily used API. Modifications might then be required in many pieces of code throughout many organizations. It is likely that no one person would be able to find all the relevant code within the confines of a single organization, let alone across multiple organizations.

Oops!!!

One can argue that this is not parallelism's fault. After all, parallelism did not hold a gun to anyone's head and insist that poor design choices be made. But this is cold comfort to anyone tasked with parallelizing such software. Besides which, in absence of parallelism, these design choices were eminently reasonable.

That said, the figure above shows that atomic increment consumes less than 400 nanoseconds, even on a 16-CPU system. In many cases, this 400 nanoseconds is not a large overhead. So, is the use of global singletons really a problem for parallel code, and if so, what can be done about it?

Parallel Programming: Parallel Paranoia

Paul McKenney: Multi-core Linux - Mon, 12/14/2009 - 14:20
It is all too easy to feel smug when remembering all the dismissals of parallelism, some as recent as five years ago. These dismissals portrayed parallelism as an aberration that would soon be swept into the dustbin of computing history by inexorable Moore's-Law increases in CPU frequency. It is easy to lampoon these dismissals as shortsighted or even wrongheaded, but doing so is counterproductive. As always, the reality is much more complex and much more instructive than any lampoonery, entertaining though the lampoonery might be.

But let me give you an example.

At a 1995 workshop, I was approached by a software architect asking for advice on parallelism. This guy was quite forward-thinking: even though he understood little about parallelism, he clearly understood that hardware-cost trends meant that he was going to have to deal with it sooner or later. Not many people can legitimately claim to have been ten years ahead of their fellows, but this guy was one of them.

So he asked me how best to go about parallelizing his application. After a bit of dicussion, I learned that it looked roughly like this:

Multiple instances connected to database

Here, an instance of the application is spawned for each user in order to mediate that user's interaction with the database. I asked if the application's execution was a significant component of the overall response time, and learned that it was not. I therefore told him that the best way to handle parallel systems was to continue running single-threaded per-user instances, and to let the database deal with the shared-memory parallelism. He seemed both happy with and relieved by this answer.

Why happy and relieved? Keep in mind that in the great majority of cases, parallelism is a performance optimization, and is but one potential optimization of many. Furthermore, performance is but one of many features that are of interest to the user. So the choice of parallelism can be an unwise one, especially when you have no parallel-savvy developers on your project and when parallel machines are a good order of magnitude more expensive than single-CPU machines, as they were in 1995. Adding parallelism to your project too soon might benefit a tiny fraction of your users, but the resulting deadlocks, race conditions, and other instability would greatly annoy your entire user base. Going parallel too soon could thus jeopardize your project's very existence. Therefore, simple Darwinian selection could easily generate reactions to parallelism ranging from strong distaste to absolute paranoia.

But then in the early part of this decade, the rules suddenly changed. Parallel systems are now quite cheap and readily available. So, should I give different advice to the same project today?

Regardless of the answer to this question, it is clear that some projects now need to lose their parallel paranoia if they are to thrive in our new multi-core world.

Parallel Programming: Uniprocessor Programmers and Sequential Staff

Paul McKenney: Multi-core Linux - Fri, 12/11/2009 - 16:02
If the demand for parallel programmers has outstripped supply, then why not simply train up more parallel programmers? After all, there are a lot of parallel programs out there, and the developers who created them did not spring fully formed from the void.

Unfortunately, I cannot put myself forward as an expert in parallel programming courseware, because I did not learn parallel programming from a college class or a training course. Instead, I gained most of my parallel programming expertise directly from the hardware, with the aid of a logic analyzer or two. I would not have it any other way, as each new generation of hardware has something new to teach, and I have grown used to learning directly from it. Nevertheless, the fact that I have not learned much from parallel courseware makes it difficult for me to judge courseware quality.

On the other hand, I have had the good fortune to be a member of two communities distinguished by their success in quickly bringing new members up to speed on the art and science of writing parallel code. The first such community was Sequent's DYNIX/ptx engineering staff, and the second was (and is) the Linux kernel community. My experience with these two communities leads me to believe that the following conditions are required to quickly turn competent programmers into competent parallel programmers:

  1. Readily available parallel hardware.
  2. Large body of high-quality parallel code available for review and tinkering.
  3. Easy access to parallel-programming experts, preferably with a variety of viewpoints.
  4. Immediate “frank and open” expert feedback on new code.

Over the past 20 years, the situation for the first two items has improved beyond all recognition: cheap parallel hardware is readily available and there a wealth of open-source parallel software of all sorts is just waiting to be downloaded. The number of parallel-proficient developers has greatly increased, but there is still no shortage of projects lacking parallel expertise, and I have yet to see hordes of parallel-programming experts sitting around waiting to receive newbie parallel programs in need of review. That said, the shift from zero of four to two of four is a great improvement. Furthermore, one can argue that these first two conditions are the most important of the four, as they promote spontaneous generation of parallel programmers through self-education.

Of course, these communities both use what can be thought of as an apprenticeship approach, where new developers learn on the job. This can be quite effective, but it is not clear that it can produce the needed number of parallel developers. Perhaps formal education will start filling the gap.

In the meantime, what is the one most important parallel lesson?

All of this existing and expected improvement in parallel-programming expertise is encouraging, but what do you for a team of uniprocessor programmers who need to parallelize a large complex application?

There are now a number of parallel-programming classes and textbooks, so formal continuing education is becoming a feasible option, although it is too early to judge its effectiveness. As noted above, there are also a number of parallel open-source projects in full swing, so having some of your team members join one of these communities might provide effective education. Hiring a parallel-programming expert is also an option, assuming you pay close attention to team dynamics. Having parallel-programming expertise is not enough &mdash successful parallelization also requires deep knowledge of the application being parallelized.

I will leave you with the following caution:

Raw intelligence is completely insufficient to guarantee success in parallel programming. In fact, raw intelligence in absence of parallel design and coding know-how is a negative. The more intelligent you are, the deeper the hole will be before you realize that you need to stop digging. It all comes back to the need for additional planning when parallel programming. The more intelligent you are, the more likely you are to delude yourself into believing that your seat-of-the-pants “planning” is actually working, and the worse trouble you, your team, and your project will get into. There are a very few exceptions to this rule, one example being Bob Beck, the guy who defined Sequent's approach to parallelism in the early 1980s. Unfortunately for all of us, the number of people with this level of talent, ability, and experience is quite small.

Parallel Programming: Embarrassing Examples

Paul McKenney: Multi-core Linux - Thu, 12/10/2009 - 00:55
Later in this series of postings, we will get back to the question of what lessons from the old Great Software Crisis might be applied to the new Great Parallel Software Crisis. In the meantime, this post will look at what can make parallel programming easy. Just to maintain my reputation for arguing with myself, several upcoming posts will examine things that can make parallel programming difficult.

The easiest problems to parallelize are those whose data is partitionable. In this case, just assign a separate CPU to each partition and stand back! This can be as easy as the following shell script:

./myprogram 1 & ./myprogram 2 & ./myprogram 3 &

Here we run three copies of the program, each working on a different part of the problem. Other examples include the data parallelism used in technical computing and the data locking used in operating systems and server applications.

Many types of computer-aided design have datasets that are nearly partitionable, a classic case being finite-element analysis. Here, an grid is superimposed on the object being analyzed, and a CPU assigned to a region of the grid. As we make the grid more fine-grained (thus usually making the solution more accurate) by a factor of N, the amount of computation per CPU rises as N3, while the amount of communication between CPUs only rises as N2. As N rises, the fraction of effort required for communication becomes arbitrarily small, allowing arbitrarily good performance and scalability.

However, as soon as there is communication, we do need to start being careful. Even in the tightest SMP, the overhead of communicating is a good order of magnitude greater than that of computation, and this ratio often amounts to multiple orders of magnitude. Therefore, even a small amount of communication can destroy your software's scalability, regardless of whether that communications is within a single CPU core or across a large cluster interconnect. This is why partitioning is a good thing, the more complete the partitioning the better. Completely partitionable problems are often called “embarrassingly parallel”, though has never been clear to me why partitioning should embarrass anyone. Perhaps it is a culture thing, but whatever it is, I simply think of it as an embarrassment of performance and scalability riches.

But what about RCU? RCU neither partitions data nor replicates it, so why is it fast?

Of course, for lazy people like me, the best cases are where the parallelism is done for you, as has long been the case for SQL developers. The database's query optimizer handles the parallelism, so that the developers don't have to. That said, a quick Google search for “SQL performance” should suffice to convince you that even SQL developers have not yet reached Parallel Nirvana.

But can you think of cases where parallelism is easy but useless?

Partitioning and replicating your program's data can be a wonderful path to excellent performance and scalability, but life is not always that easy. The next set of posts will look at some of the things that can make life difficult.

Parallel Programming: Heeding History

Paul McKenney: Multi-core Linux - Tue, 12/08/2009 - 00:47
Given that parallel systems have been in existence for decades, it is worth asking why they have caused so much fuss over the past few years. Many argue that this is due to the end of Moore's-Law-induced frequency scaling, while others note that the business models of some corporations would suffer if people no longer felt the need to buy new computers every few years.

Although there is no doubt some validity to both of these arguments, the real reason is economics. Sure, parallel systems have been commercially available for some decades, but how many people could afford to splash out $1M in the 1980s? $100K in the early 1990s? $10K in the late 1990s? In contrast, how about a few hundred of 2009's sadly inflated US dollars?

As the price of parallel systems has plummeted, the number of situations where it makes economic sense to use them has increased exponentially. This in turn means that the demand for parallel software has also grown suddenly, outstripping the supply of developers with parallel-programming experience. Voilà, a parallel-software crisis.

But this is most definitely not the first software crisis. A very similar crisis arose in the late 1970s, with very similar history. A computer cost millions of dollars in the 1960s, tens of thousands with the advent of the minicomputer in the early 1970s, and mere thousands with the advent of the microcomputer and the personal computer in the late 1970s and early 1980s. Then as now, as the price of computer systems plummeted, the number of situations where it made economic sense to use them increased exponentially. Then as now, the demand for computer software grew suddenly, outstripping the supply of programmers. Then as now, a software crisis was proclaimed.

Many new programming languages were put forward to deal with this crisis, and these can be categorized, not into the good, the bad, and the ugly, but rather into the good, the fad, and the ugly.

The programming languages in the “ugly” category are still with us, though the fraction of code written using them has decreased. We still use various flavors of shell, sed, awk, and C, as well as holdovers from earlier times, including FORTRAN and COBOL. I myself used the Bourne shell and C for production software in the early 1980s, and would never have guessed that I would still be using them more than a quarter century later. They are simply too ugly — and too useful — to die.

The programming languages in the “fad” category include darlings such as PASCAL, MODULA, Scheme, Eiffel, Smalltalk, and CLU. There are no doubt a few developers still playing with these toys, but these darlings never were able to deliver on the promises made by their proponents, and never managed to gain a large developer base. (And given that I designed, coded, and put a 50,000-line PASCAL program into production, I know whereof I speak.)

So what does it mean for a programming language to be “good”? Given that the goal is to solve a software crisis, the only reasonable measures of goodness are: (1) an increase in productivity of existing developers by orders of magnitude, (2) an increase in the fraction of the population who can use computers, again by orders of magnitude, or preferably (3) both.

So, what were the “good” programming languages that solved the Great Software Crisis of the late 1970s and early 1980s?

And what lessons should we draw from the Great Software Crisis to help us deal with the Great Parallel Software Crisis?

Parallel Programming: Inappropriate Intuitions

Paul McKenney: Multi-core Linux - Sat, 12/05/2009 - 18:38
I had the good fortune to attend a panel on parallel-programming education at the recent SC09 conference. To their credit, many of the panelists noted that parallel programming was not all that hard, and many of them took their more sequential-minded colleagues to task for telling students that parallel programming is difficult. Of course, many of them also said that the locking-plus-threads programming paradigm is difficult, never mind the fact that locking-plus-threads has been used very successfully in a great many software projects.

Nevertheless, this panel's message was a welcome change of pace from the many years of hand-wringing over parallel programming. And although I have given much thought to whether and why parallel programming might be difficult, this panel's message motivated me to step back and think on it some more.

There have been any number of people claiming that parallel programming is completely counter-intuitive, often with little or no supporting evidence. In contrast, we need to think carefully about this important issue, and the place to start is with sequential programming. Please note that it is easy to make a solid case for sequential programming being counter-intuitive. If you don't believe me, you should talk to people who took up programming, who mastered it, but who found it unappealing.

My discussions with such people uncovered the following major insults to their intuition:

  1. People expect intelligent beings, whether organic or inorganic, to have some degree of common sense. Despite the decades of research sacrificed at the altar of artificial intelligence, computers remain almost completely devoid of common sense.
  2. People also expect intelligent beings to have some understanding of their intent, which is called theory of mind. If anything, computers' lack of theory of mind is even more profound than their lack of common sense.
  3. People expect to be able to make a fragmentary plan, and nevertheless expect this plan to have the expected results. Computers are famously unforgiving of fragmentary plans, though perhaps GPS units are moving in the right direction.

The first two points should be uncontroversial. If you doubt them, consider the failures of the much-maligned Clippy and Microsoft Bob. These two product features attempted to relate to the user as people, thus raising common-sense and theory-of-mind expectations, expectations that these two products proved completely incapable of meeting.

The reliance people cheerfully place on fragmentary plans deserves further discussion. This reliance is apparently due to the expectation that the person executing the plan will be intelligent, have common sense, and have an understanding of the intent behind the plan, especially in the real-world common case where the person making the plan and the person executing it are one and the same. In the case, the plan will be revised as a matter of course to account for unexpected events and obstacles. This worked quite well through much of human history, which is no surprise: far better to start hunting in a random direction than to starve to death while planning for all possible situations that might arise during the hunt.

Unfortunately, even the best human reflexes simply cannot keep up with a 5GHz CPU. Even if we imagine a hyper-caffeinated 5GHz superhero, there are many millions of computers to be kept up with. And so the modern microprocessor invalidates untold millenia of evolution, frustrating untold numbers of would-be computer professionals.

Please note that these three inappropriate intuitions have absolutely nothing to do with parallel processing. This should be no surprise, as the universe is, always has been, and always will be highly concurrent. Given that, at least as far as we know, human beings have spent their entire history living in this universe, shouldn't we expect concurrency to be intuitive?

Of course, concurrency being intuitive does not necessarily mean that concurrent programming is intuitive. After all, the typical American-football player must deal with 21 other players on the field, the ball, and a few coaches and umpires, which should be excellent concurrency training. And perhaps it is excellent training, but I know of only one football player who went on to be a parallel programmer. Then again, he is an extremely talented and capable parallel programmer, so perhaps the generation being raised by soccer moms will show us old guys a thing or three about concurrency.

All that aside, the usual examples of parallel-programming difficulties — deadlocks, race conditions, memory misordering, performance and scalability issues — are really examples of item #3 above: failure to plan properly. In other words, parallel programming does not add new types of counter-intuitive behavior, but rather exacerbates the effects of existing sequential-program counter-intuitive behavior. One of the parallel-education panelists at the recent SC 09 conference claimed that parallel programming represented only about 5% additional difficulty over and above that of sequential programming, and this line of reasoning certainly supports that claim.

Nevertheless, 5% of the effort required to truly master sequential programming is a substantial undertaking, so the remainder of this series of blog posts will examine exactly what makes parallel programming such a challenge — and how to surmount this challenge.

Git Rebase - The Missing Link

Darren Hart: Real Time - Thu, 12/03/2009 - 13:22

git-logo.png I've taken longer than many to get comfortable with git. I've become very comfortable with quilt and git consistently behaved non-intuitively (for me). A couple months back someone pointed me at "git rebase -i" - the clouds parted, heavenly rays shone down upon me, and all was well once again in the Zion in my head.

With quilt you work with patches, and you really need to have them well defined in your head before you start. You can get away with some lack of planning if your various patches are restricted to distinct files. More often than not for me however, my patch series work on a single file, or maybe two or three. Merging patches in quilt is ... well ... just like merging patches without quilt ... a flurry of filesystem commands and manual merging.

With git, you work with the source, and generate the patches from your changes. The git motto "commit often" is key to success here. I can start a project with only a rough idea of what needs doing and commit small changes along the way. When the project is done, I may have dozens of commits in my local git tree, none of which are functionally complete, and most likely won't even compile. With "git rebase -i" I can rearrange these patches, merge them, annotate them, and generate a series suitable for submission to LKML. Ahhhhh. I git it! Finally.

read more

Getting Thing Done with Remember the Milk

Darren Hart: Real Time - Fri, 11/27/2009 - 04:07

rtm

Todos pile up, balls get dropped, and I search for solutions to managing the chaos. Several years back I found inspiration in David Allen's Getting Things Done (GTD) approach to task management. The simplicity of the various lists and the generic process definition that relied on no specific medium or implementation appealed to my need to customize and make everything my own. GTD seemed perfectly matched to a software application that could manage the various views and relationships of tasks, projects, and contexts. I tried various existing tools, but they each were cumbersome, slow, and unintelligent (forcing me to spend more time in the tool than I was willing to give). I went so far as to program my own, Braindump, which addressed my frustrations. While I accomplished what I had set out to do, the increasingly pervasive presence of smart phones, netbooks, and other internet devices were highlighting the rather crude single-computer usage model of my application. It also suffered from what every other stand-alone task manager suffers: lack of calendar integration.

Among the list of rejected applications years back was Remember the Milk (RTM). While RTM kept things simple and didn't burden the user with a rigid work flow, I felt (at the time) that its extrememly free-form nature would leave too much process enforcement to the user, making it as awkward and cumbersome as its competitors which enforced too much process internally. I was also just not ready to make the leap into the world of "cloud services". After catching a blurb about a new RTM Google Calendar gadget, I decided to give it another shot... and I'm glad I did. I read several other user experiences on using RTM for GTD, one in particular I feel is worthy of citing here as being influential in my approach: Guest Post: Advanced GTD with Remember The Milk by Doug Ireton.

GTD Overview

If you are already familiar with GTD, skip this section. For those of you that haven't been indoctrinated, allow me to present a very brief introduction to the principles of GTD. It's all about freeing your mind from thinking about what you have to do so you can focus on doing it. You need a trusted system to store information so your brain doesn't have to. The process itself is defined by five stages: collect, process, organize, review, and do. These stages are implemented using lists of tasks, or "next actions", and "projects" (work items that encompass more than one task). A key point about next actions is they are immediately actionable. They have no dependency on anything else. During the "do" stage, you should be able to open the next actions list and immediately get to work performing those tasks. Another core mechanism of GTD is the "context". By organizing your next actions by context, such as Calls, Errands, Office, and Home, you can easily see the tasks that you can complete at any given time, based on where you are (by a phone, in the office, driving between work and home, etc.). Lastly, a weekly review is critical to making GTD work. During this review, you review each of your projects and ensure they have a next action so you can make progress on them during the week. Before getting too far into constructing a task management system, I'd recommend reading through GTD.

GTD with RTM

GTD is essentially implemented as a collection of lists. This translates well to Remember the Milk which is just that - a set of user defined lists. RTM also provides some basic properties to list items, such as due date, time estimate, and notes. What makes RTM smarter than a set of paper lists are the tags, locations, and powerful searches. Tags allow you to use short terms to relate tasks to eachother. Locations are intended to allow you to save off the address where a task needs to be completed, which is cute but largely useless IMNSHO, so I'll overload this feature a bit later. RTM provides users with a powerful boolean logic search and the ability to save those searches as custom lists which are automatically updated when your tasks change. Let's see how to put these all together to create a powerful GTD system using RTM!

Core GTD

Let's start with a set of real lists (created via Settings->Lists). I find it useful to separate work and personal lists with "W-" and "P-" respectively.

Inbox and Sent are defined by RTM and cannot be changed. The Inbox works well as a staging ground for thoughts you have that you haven't fully processed yet. You'll work through these during the "Process" and "Organize" stages, converting them into projects and tasks. I use the Personal and Work lists to store all my tasks, currently actionable or not. I use the "na" tag to mark tasks as next actions to distinguish them from tasks that are dependent on something before they can be executed. The two Project lists are used to store an item for every project I'm working on. I tag each project with a short identifier which I also use to tag all the associated tasks so they can be identified by project. Let's look at a project and a couple tasks.

Under the W-Project list I have an item titled "RT Elevator Pitch" tagged with "rt-elevator". I then create a task on the Work list titled "Brainstorm topics" and tag it with "rt-elevator" as well as "na". If I want to see all the tasks for this project, I can click on the project item on the W-Projects list and then click on "rt-elevator" in the Task box on the right. This will open a search for all the items tagged with "rt-elevator".

GTD refers to next actions lists, not task lists. When it's time to get something done, you shouldn't have to wade through all the tasks to find one you can do right now. In the search box, enter "tag:na", this will filter the list to only those tasks which are currently actionable. This doesn't take into account work vs. personal though. To limit it to just actionable work tasks, search for: "list:Work AND tag:na", click on the Save tab and call it "W-Next Actions".

To refine your task list even further, let's take a look at the GTD concept of "Contexts". David Allen suggests the use of the @ (for various reasons) to indicate contexts. This is particularly convenient with RTM as that is the hotkey for specifying location when entering a task. Since I find the locations feature rather useless in its intended form, I overload it as my contexts list. Click on locations and create your GTD contexts list (RTM wants you to specify a map location for each context - pick some place exotic!). For me this list includes: Calls, Errands, Home, and Office. You might add "Computer" ... but if you're reading this that would probably be akin to adding an "awake" context... it just doesn't refine your search very much ;-) Now you can specify a context for each task. To filter by context, search for each of your contexts and save the searches. For example, search for "tag:na AND location:office" and save the search as "@Office". This list will show only those actionable tasks that must be completed at the office! Note that I don't separate Work and Personal lists here. If I'm out running errands I might as well get those for work as well as home done in one go. Same for a batch of phone calls.

Lastly, you'll want a tag and a saved search to keep track of tasks you are waiting on from someone else. I use the tag "wait" for this, and will sometimes include a nickname for the delegate, like "john". Then create Waiting lists, such as "list:Work AND tag:wait" as "W-Waiting".

Calendaring

cal-sidebar Something that has always bothered me about the various advanced task management software that I've used, is the lack of integration with my calendaring. And no, the task add-ons in things like Evolution and Google calendar (and all the others) don't count for reasons that are hopefully obvious by now. RTM provides several very nice services for calendaring. You can get an iCalendar (ics) URL for any of your lists. I add the Personal and Work lists to my Google Calendar and all the tasks with due dates appear on the day they are due. RTM also has two Google Calendar gadgets, one that displays a check icon on each day so you can work with tasks of that day (mostly useless IMO) as well as a very nice sidebar gadget that will display the list of your choosing and group the tasks by due date: Overdue, Today, Tomorrow, Monday, Anytime.

There are some tasks that must be done on a certain day, but not at a particular time. You could just add these to your calendar application, but I find it convenient to keep all my tasks in one place. You don't want to have these tasks show up on your next actions lists until the day they are due. That means you either have to remember to set the na tag the morning of (yeah, not very likely right?) or come up with another mechanism. I tag these items as "cal" and ensure they have a due date. I then augment my next action searches to look for "tag:na OR (tag:cal AND dueBefore:tomorrow)". This way I only see the cal tasks when they're due (and afterwards if I failed to complete them).

Extras

When you complete tasks in RTM, it records the completion date. You can use this in a search as well to generate a weekly (or monthly, etc.) report for your boss. Consider searching for 'list:Work AND completedWithin:"1 week of today"' and saving it as "Weekly Status".

If you can convince your colleagues to setup an RTM account (they don't have to be GTD junkies by the way). You can use the RTM "Send To" feature to delegate tasks. The task will then appear be moved to the Sent list. You may need to adjust your waiting searches to include the Sent folder, and possible tag delegated items with "work" or "personal" as there aren't seprate Sent lists. Personally, I'd rather RTM didn't move my tasks to another list after I send them.

Wrap-up

Remember the Milk provide an incredibly flexible tool for managing tasks. Not only is it highly functional in its own right, but it also integrates brilliantly with services like Google Calendar and Google Gears (for offline use). RTM also provides a minimal mobile web interface (which is well... minimal), but if you're an iPhone or Android user, you can download an RTM application if you're a pro user for a much improved mobile interface. Well worth the $25 a year in my opinion. RTM, almost you convince me to purchase a smart-phone... and a $30-40/month data plan. Almost.

read more

Hunting More Heisenbugs

Paul McKenney: Multi-core Linux - Fri, 11/20/2009 - 19:57
How should you celebrate a successful heisenbug hunt? Any five people would likely have ten different opinions, but whatever your choice, you should also fix your broken tests. Just what makes me say that your tests are broken? Consider the following definitions:

  1. The only bug-free programs are trivial programs.
  2. A reliable program has no known bugs.

From these two definitions, it clearly follows that any reliable non-trivial program has at least one bug that you don't know about. The fact that you don't know about it means that your tests have not yet found that bug, and therefore, as asserted above, your tests are broken. And yes, I am giving your program the benefit of the doubt by assuming that it is non-trivial, and in any case, I claim that the Linux kernel's RCU implementation is non-trivial. RCU therefore contains at least one bug I don't know about — one bug my tests didn't find — and I therefore need to raise my rcutorture game.

I fixed the rcutorture test process as follows:

  1. Choosing CONFIG_RCU_FANOUT=2 on an eight-CPU machine, which exercised code paths that would otherwise require 1024 CPUs.
  2. Decreasing the wait time successive randomly chosen CPU-hotplug operations from three seconds to a hundred milliseconds (using “sleep 0.1” from a bash script!)
  3. Retaining the artificially low one-scheduler-clock-tick delay between invocations of force_quiescent_state().

This improved testing regimen resulted in grace-period hangs, but only with CONFIG_PREEMPT_TREE_RCU. The really nice thing about this improved test was that the failures occurred within a few minutes, which permits use of printk() debugging, a rare luxury when working with RCU infrastructure. I nevertheless suppressed my urge to wildly code up random printk() statements in favor of enabling the RCU tracing infrastructure. The tracing data clearly showed that all leaf rcu_node structures had recorded quiescent-state passage for all their CPUs, but that this information had not propagated up the rcu_node hierarchy. Given that these grace-period hangs occurred only in CONFIG_TREE_PREEMPT_RCU, this data fingered two possible code paths:

  1. rcu_read_unlock_special() invoked from a task that had blocked within the prior RCU read-side critical section, and
  2. __rcu_offline_cpu() when offlining the last CPU from a given leaf rcu_node structure when that structure has queued a task that has blocked within its current RCU read-side critical section.

A little code inspection on these two code paths located the bug and the fix. With this fix, RCU appears to be quite reliable.

So how should I celebrate success in this latest hunt?

Page allocator failure warnings in recent kernels

Mel Gorman: Kernel Spanner - Wed, 11/11/2009 - 19:36
So, in recent kernels since 2.6.31-rc1, there is a seemingly benign problem whose apparently manifestation is page allocation failures of GFP_ATOMIC allocations. The system recovers but there are large stalls even though on server systems, everything goes faster overall. The problem is particularly pronounced when using certain wireless cards but manifests in harder-to-diagnose stalls on machines with low memory under stress. The development methodology means that kernels come out very quickly even though right now, I would really prefer if the world would slow down while my poor test machines try to catch up.

I think I have a solution to this but it take several hours each time to figure out if forward progress has been made or not.

The lesson learnt here? Panic makes for poor decisions. I sent one patch what looked great at the time but have found out in the last few hours that it really sucks. While figuring this out for sure, I have to wait looking at a screen to painfully slowly update. To help the waiting, I found some beer, it's the Irish thing to do. Wonder what the rest of ye do :/

Hunting Heisenbugs

Paul McKenney: Multi-core Linux - Sun, 11/01/2009 - 18:35
My children's opinions notwithstanding, I recently found myself pursuing some nasty concurrency bugs in Linux's TREE_RCU implementation. This was not particularly surprising, given that I recently added preemption support, and the code really hadn't put up that much of a fight. In fact, I was getting the feeling that the bugs had gotten together and decided to hide out, the better to ambush me. This feeling wasn't far from wrong.

My first hint of trouble appeared when I began running longer sequences of rcutorture runs, seeing an occasional failure on one-hour runs. My first reaction was to increase the duration to ten hours and attempt to bisect the problem. Of course, even with bisection, finding the bug takes quite some time given ten hours for each probe, so rather than use “git bisect”, I manually ran parallel runs and (for example) quadrisected. I also ran multiple configurations. The results initially hinted that CONFIG_NO_HZ might have something to do with it, but later runs showed no shortage of failures in !CONFIG_NO_HZ runs as well.

The initial results of the bisection were quite puzzling, converging on a commit that could not possibly change RCU's behavior. Then I noticed that one of the machines seemed to be generating more failures than others, and, sure enough, this difference in failure rate was responsible for the false convergence. I therefore started keeping more careful records, including the machine name, test duration, configuration parameters, commit number, and number of errors for each run. These records proved extremely helpful later on.

Further testing showed that 2.6.32-rc1 (AKA 2.6.32-rc2) was reliable, even for the error-prone machine, and that 2.6.32-rc3 was buggy. Unfortunately, there are no RCU-related commits between 2.6.32-rc1 and 2.6.32-rc3. Unless you count commit #828c0950, which simply applies const to a few data structures involved in RCU tracing, which I don't and you shouldn't. So I ran a few more runs on 2.6.32-rc1, and eventually did trigger a failure. In contrast, 2.6.31 was rock solid.

Now there are quite a few RCU-related patches between 2.6.31 and 2.6.32-rc1, so I started searching for the offending commit. However, by this time I had written scripts to analyze rcutorture output, which I used to check the status of the test runs, stopping runs as soon as they generated an error. This sped things up considerably, because failed runs now took on average only a few hours rather than the 10 hours I was using as a (rough) success criterion.

Quick Quiz 1: If successful tests take 10 hours and failed runs take only a single hour, is bisection still the optimal bug-finding method?

Testing eventually converged on commit #b8d57a76. By this time, I getting a bit paranoid, so I ran no fewer than three ten-hour runs at the preceding commit on the most error-prone machine, none of which failed. But this commit does nothing to RCU, but rather makes rcutorture testing more severe, inserting delays of up to 50 milliseconds in RCU read-side critical sections. I therefore cherry-picked this commit back onto 2.6.31 and 2.6.30, and, sure enough, got failures in both cases. As it turned out, I was dealing with a day-one bug in TREE_RCU.

This did simplify matters, permitting me to focus my testing efforts on the most recent version of RCU rather than spreading my testing efforts across every change since 2.6.31. In addition, the fact that long-running RCU read-side critical sections triggered the bug told me roughly where the bug had to be: force_quiescent_state() or one of the functions it calls. This function runs more often in face of long-running RCU read-side critical sections. In addition, this explained the earlier CONFIG_NO_HZ results, because one of the force_quiescent_state() function's responsibilities is detecting dyntick-idle CPUs. In addition, it raised the possibility that the bug was unrelated to memory ordering, which motivated me to try a few runs on x86 — which, to my surprise, resulted in much higher failure rates than did the earlier tests on the Power machines.

I stubbed out force_quiescent_state() to check my assumption that it was to blame (but please, please do not do this on production systems!!!). Stubbing out force_quiescent_state() resulted in a statistically significant 3x decrease in failures on the x86 machine, confirming my assumption, at least for some subset of the bugs. Now that there was a much smaller section of code to inspect, I was able to locate one race involving mishandling of the ->completed values. This reduced the error rate on the x86 machine by roughly the same amount as did stubbing out force_quiescent_state(). One bug down, but more bugs still hiding.

I was also now in a position to take some good advice from Ingo Molnar: when you see a failure, work to increase the failure rate. This might seem counter-intuitive, but the more frequent the failures, the shorter the test runs, and the faster you can find the bug. I therefore changed the value of RCU_JIFFIES_TILL_FORCE_QS from three to one, which increased the failure rate by well over an order of magnitude on the x86 machine.

Quick Quiz 2: How could increasing the frequency of force_quiescent_state() by a factor of three increase the rcutorture failure rate by more than an order of magnitude? Wouldn't the increase instead be about a factor of three?

Given that the race I found involved unsynchronized access to the ->completed values, it made sense to look at other unsynchronized accesses. I found three other such issues, and testing of the resulting patches has thus far turned up zero rcutorture failures.

And it only took 882.5 hours of machine time to track down these bugs. :–)

This raises the question of why these bugs happened in the first place. After all, I do try to be quite careful with RCU-infrastructure code. In this case, it appears that these bugs were inserted during a bug-fixing session fairly late in the TREE_RCU effort. Bug-fixing is often done under considerably more time pressure than is production of new code, and the mistake in this case was failing to follow up with more careful analysis.

Another question is the number of bugs remaining. This is of course hard to say at present, but Murphy would assert that, no matter what you do, there will always be at least a few more bugs.

Answer to Quick Quiz 1: If successful tests take 10 hours and failed runs take only a single hour, is bisection still the optimal bug-finding method?.

Answer to Quick Quiz 2: How could increasing the frequency of force_quiescent_state() by a factor of three increase the rcutorture failure rate by more than an order of magnitude? Wouldn't the increase instead be about a factor of three?

Linux Static Tracepoints

Anton Blanchard: Kernel hacker - Tue, 10/06/2009 - 20:41

Linux has had dynamic trace functionality for a long time in the form of kprobes. Kprobes provides a kernel API for placing probes on kernel instructions and they can be exploited directly via a kernel module, or via systemtap which provides a high level scripting language. Dynamic tracing has a number of advantages – it has zero overhead when disabled and probes can be placed on almost any instruction in the kernel, not just where a kernel developer thinks you should.

All this flexibility does have some downsides. An executed kprobe has a significant overhead since it uses breakpoints and exception handlers. Having said that, there are patches that avoid the breakpoint and instead branch directly to the handler. Another issue is probe placement; kprobes are easily placed at function entry and exit but if you need to probe inside a function or probe local variables then you really need systemtap and a kernel compiled with CONFIG_DEBUG_INFO. On the other hand a static tracepoint can be placed anywhere in a function and can be passed any important local variables. Various static tracepoint patches have been available for Linux, but as of 2.6.32 a complete implementation is in mainline.

Adding a static tracepoint is very simple, an example can be found here. In this case I am adding to an existing trace group (irq), so I only need the tracepoint definitions and the tracepoints themselves. An explanation of the 5 parts of a tracepoint definition can be found in linux/samples/trace_events/trace-events-sample.h. For more complicated scenarios, refer to the files in linux/samples/trace_events/

Using static tracepoints

There are only a few steps to make use of static tracepoints. First ensure that debugfs is mounted. Most distros mount it on /sys/kernel/debug:

# mount | grep debugfs

debugfs on /sys/kernel/debug type debugfs (rw)

A list of available tracepoints can be found in tracing/available_events:

# cat /sys/kernel/debug/tracing/available_events

skb:skb_copy_datagram_iovec
skb:kfree_skb
block:block_rq_remap
block:block_remap
block:block_split
block:block_unplug_io
block:block_unplug_timer
...

Since we added our tracepoints to the irq group, we can find them in tracing/events/irq:

# ls /sys/kernel/debug/tracing/events/irq/

enable  irq_handler_entry  softirq_entry  tasklet_entry
filter  irq_handler_exit   softirq_exit   tasklet_exit

Enable the tasklet tracepoints:

# echo 1 >  /sys/kernel/debug/tracing/events/irq/tasklet_entry/enable
# echo 1 >  /sys/kernel/debug/tracing/events/irq/tasklet_exit/enable

And the output is available in the trace buffer:

# cat /sys/kernel/debug/tracing/trace

# tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
-0     [000]   327.349213: tasklet_entry: func=.rpavscsi_task
-0     [000]   327.349217: tasklet_exit: func=.rpavscsi_task

When finished, we can disable the tracepoints. There are enable files at all levels of the hierarchy, so we can disable all tracepoints in one go:

# echo 0 > /sys/kernel/debug/tracing/events/enable

Using static tracepoints in kernel modules

Kernel modules can also make use of static tracepoints. A simple module that hooks the tasklet_entry tracepoint and printks the function name of the tasklet might look like (I’ve called it tracepoint-example.c):

#include <linux/module.h>
#include <trace/events/irq.h>

static void probe_tasklet_entry(struct tasklet_struct *t)
{
        printk("tasklet_entry %pf\n", t->func);
}

static int __init trace_init(void)
{
        WARN_ON(register_trace_tasklet_entry(probe_tasklet_entry));
        return 0;
}

static void __exit trace_exit(void)
{
        unregister_trace_tasklet_entry(probe_tasklet_entry);
}

module_init(trace_init)
module_exit(trace_exit)
MODULE_LICENSE("GPL");

If you are wondering, %pf is a printk formatter trick to pretty print a function name so you don’t have to go searching for the address in System.map.

Here is a Makefile to go with it:

obj-m := tracepoint-example.o
KDIR := /lib/modules/$(shell uname -r)/build
PWD := $(shell pwd)
default:
        $(MAKE) -C $(KDIR) SUBDIRS=$(PWD) modules
Syndicate content