Bill Buros: Improving Performance on Linux

Syndicate content
Updated: 20 min 39 sec ago

RHEL 5.2 and HPC performance hints

Thu, 07/17/2008 - 07:10
Building on the SLES 10 sp2 kernel build post from a couple of weeks ago, we got the equivalent RHEL 5.2 page posted under the developerWorks umbrella. Mostly the same conceptual steps, but a little different in the specifics. And of course, in the RHEL 5.2 example, we reverse the example from SLES 10 by building a 4KB kernel where "normally" the RHEL 5.2 kernel is based on the 64KB pages. It's a good experiment to play with when you want to see the performance gains that emerge from leveraging larger page sizes.
We linked this in under the HPC Central wiki page where several of us are playing around with adding descriptive how-to's for HPC workloads based on practical experience.

See HPC Central, follow the link to the Red Hat Enterprise Linux page, which is where the kernel page is linked in. We plan to replicate these pieces for SUSE Linux Enterprise Server next month.

The building blocks of HPC

Tue, 07/15/2008 - 07:04
Top 500 again. Linpack HPL. Hitting half a teraflop on a single system.

Using RHEL 5.2 on a single IBM Power 575 system, Linux was able to hit half a teraflop with Linpack. These water-cooled systems are pretty nice. Thirty-two POWER6 cores packed into a fairly dense 2U rack form factor. These systems are designed for clusters, so 14 nodes (14 single systems) can be loaded into a single rack. Water piping winds its way into each system and over the cores (we of course had to pop one open to see how things looked and worked). The systems can be loaded with 128GB or 256GB of memory. A colleague provided a nice summary of the Linpack result over on IBM's developerWorks.

For Linux, there are several interesting pieces, especially as we look at Linpack as one of the key workloads that takes advantage of easy HPC building blocks. RHEL 5.2 comes with 64KB pages. The 64KB pages provides easy performance gains out of the box. The commercial compilers and math libraries provide the tailored and easy exploitation of the POWER6 systems. Running Linpack on clusters is the whole basis for the Top 500 workloads.

It's easy to take advantage of the building blocks in RHEL 5.2. OpenMPI in particular, the Infiniband stack, libraries tuned for the POWER hardware are all included. When we fire up a cluster node, we automatically install these components.
  • openmpi including -devel packages
  • openib
  • libehca
  • libibverbs-utils
  • openssl
These building blocks allow us to take the half-a-teraflop single system Linpack result and begin running it "out-of-the-box" on multiple nodes. There are cluster experts around that I'm learning from. Lots of interesting new challenges in the interconnect technologies and configurations. In this realm, I'm learning that one of the technology shifts emerging is the 10GBe (10GB Ethernet) interconnect vs Infiniband. Infiniband has all sorts of learning curves associated with it. Everytime I try to do something with Infiniband, I'm finding another thing to learn. It'll be interesting to see whether the 10GBe technology will be more like simply plugging in an Ethernet cable and off we go. A good summer project...

Building a distro kernel on Power - not so bad

Sun, 06/29/2008 - 13:58
This should be simple. And when you know all the steps, it is. But I was surprised how challenging it's been to find easy examples of the steps to re-build a commercially shipping "distro" kernel, in this case the SLES 10 sp1 kernel.

It's probably documented cleverly in the end user documentation - but I'm far too addicted to the ease of googling compared to the inevitable drudgery of digging through user documentation. I always wonder when the "documentation community" will simply shift to wiki pages to document, but more importantly, maintain the correctness and accessibility, of end user documentation.

For this exercise, turns out we wanted to do something simple to a SLES 10 kernel shipping on Power. In our case, we wanted to see if we could re-build the distro kernel to support the 64KB pages available in the Power6 hardware systems. For the performance angle, 64KB pages can often significantly improve the performance of applications. Normally, when working with the Linux community, we simply snag the latest mainline kernel and work with that, but in this case, we were really interested in the specific performance differences between 4KB today and the expected performance of 64KB pages on the same base.

Out of that exercise, we created a new wiki page which documented the steps to re-build the SLES 10 kernel. A peer, Peter Wong, has already documented the RHEL 5.2 steps, we're just waiting for some web site maintenance to complete on the IBM developerWorks infrastructure to get that page posted as well.

For the SLES 10 sp1 (and sp2) kernel re-build instructions, see
Recently Jon Tollefson was playing around on the SLES 10 sp2 kernel and found that there's a missing file in the SLES 10 sp2 kernel package, so we had to comment out a line in the kernel-ppc64.spec file (modprobe.d-qeth).

One interesting aspect is I had thought the kernel re-build process would be precise and seamless. But there were a few tricks that had to be done to make it work.

One of them was adding control to be able to run the make on all of the CPUs seen by Linux.
%define jobs %(cat /proc/cpuinfo | grep processor | wc -l)
We've been playing recently on some of the sweet top-of-the-line POWER6 systems, in one case the Power 575 system with 32 cores. When running with SMT enabled, that's 64 CPUs that Linux controls. The kernel build goes very fast on that system.

Second, and there's probably a more clever way to do this, but we ended up having to unpack, modify, and re-pack the config.tar.bz2 file for the platform.

The last interesting aspect was the built-in "kabi" protections. When we first re-built the kernel, the build failed because this failed the kabi tolerance level. Very clever. I assume various kernel interfaces are flagged with KABI values, which when changed, cause the build to fail. In our case, we knew it would change things in the kernel, so we modified the tolerance value to allow for the kernel re-build.

So. Easy to do, easy to make changes, and for a performance team, easy to minimize how much is changing from one step to the next. By starting with a known entity in the distro kernel, we make one change, verify the performance differences, and then proceed to the next change. Simple. Methodical. Straight-forward.

PowerVM Lx86. Technology that works.

Mon, 05/05/2008 - 21:17
PowerVM Lx86. Running Linux 32-bit x86 apps on Power systems. Slick. Super-easy with Linux. And does it work? Absolutely.

Now, as a performance analyst, I'm often asked: "Is it a performance play?" My quick answer: "Nah - not usually..."   But for everything I've tried: "It just works" - which in and of itself is pretty cool. You really want the best performance for your app? Re-compile and run it natively. Duh. You want easy access to existing x86 compiled apps? Give this product a shot. And in some cases, the performance of the translated product is just fine for the user's needs.

In essence, this product is the flip-side of Transitive's translator technology for Apple which translates older Apple Power applications to run on the new x86-based Apple systems. Check out these web sites if you missed the technology introduction several years ago:


IBM and Transitive (http://transitive.com/customers/ibm) have already introduced the second release (Ver 1.2) of the IBM PowerVM Lx86 product.

Originally discussed in the press as p-AVE (for example, see an article from http://www.it-analysis.com/), IBM's product naming wizards must have been at work with the preliminary name of IBM System p Application Virtual Environment (System p AVE). Later they followed it with a newer official name under the IBM PowerVM umbrella as PowerVM Lx86 for x86 Linux applications. "p-AVE" certainly rolled off the tongue far easier than the PowerVM Lx86 name. But the PowerVM naming admit'ably fits better with the overall virtualization strengths of the Power line.

For a page full of pointers and interesting helpful hints, check out http://www-128.ibm.com/developerworks/linux/lx86/.

  • For example, the product is download'able from IBM's web site... gotta dig through the DeveloperWorks various registration pages - but you're looking for: IBM PowerVM Lx86 V1.2 (formerly System p Application Virtual environment or p AVE) p-ave-1.2.0.0-1.tar (8,294,400 bytes).

For a clever approach to using PowerVM Lx86, a nice demo was created which you can see on YouTube.

  • I really like this demo since it highlights one of several cases where the translation performance of the product isn't perceivable to an end user. The demo is an actual run, no tricks. As the demo narrative says, when you try to execute an x86 application on the Power system, the x86 executable is automatically recognized and the translation environment invoked.

  • The video clip goes on to show some of the highlights of the Power line where Power logical partitions can be migrated from one physical system to another. More cool stuff.


Another example of common product usage is in the world of graphing performance results. Users can check out a really nice set of charting libraries from Advanced Software Engineering (http://www.advsofteng.com/) available with the ChartDirector product. The executable run-time libraries are available for a variety of platforms, including Linux on i386, but alas, not for Linux on ppc64 systems. But when the i386 libraries are installed on a Power system running Linux with the additional PowerVM Lx86 product, Power users can use the graphing routines directly. Again, the perceptible performance differences are minimal, and the full function of the i386 routines are available to the Power users.

The IBM web site for PowerVM Virtualization Software offerings has a good description of the capabilities of the Linux product and the services available for software vendors to enable their apps for native execution while still exploiting the Power systems running Linux with their existing applications.

Keep in mind there are the normal obligatory footnotes and qualifications on what i386 applications can function under this product - check out the product web sites for that information.

Finally, as a performance team, we always tend to agonize over the corner cases which highlight the performance challenges of translating an application from one system platform base to another, and there certainly are some areas where performance can be a challenge. Java is a good example. There are too many steps of translating byte codes to executables, then those executables are translated again to execute on the Power platform, which can make for a rather poor execution path. If your Java app is a minor piece of a bigger application (the prime example is as an application installer), shrug. But whew, if you're thinking about snagging a full comprehensive Java based product and running it in in translation mode - as opposed to verifying that the Java code runs on the Power platform - I can anticipate you may be disappointed with the performance. One would've hoped that the Java world of write once, run anywhere would've panned out better than the write once test everywhere implementation.

In the meantime, if you need easy access to x86 executables and applications on your Power systems, give this a shot.

What "one thing" do you want in the Linux kernel?

Mon, 04/07/2008 - 13:06
An interesting question came by this week. If I could have one free wish, one thoughtful choice, one wise selection, what would I want in the Linux kernel?

I was of course assured that my choice didn't mean I'd get it, but it was an interesting hallway poll for the day. I suspect the answers were being pulled together as input for this week's Linux Foundation meetings happening this week in Austin Tx... lots of people in town.

The mind races.

So back to the Linux Weather Forecast. Whew. What to choose, what to choose.
In no particular order,
  • A new completely fair scheduler - the promise of an updated scheduler is intriguing for performance teams. Course, on a practical level, the thoughts of rather extensive regression testing keeps popping up in my mind. Watching the progress of that effort is reassuring though, so this will be cool in the next revs of the distros.

  • Kernel markers - cool technology which will make it easier for tools to hook into the kernel, which of course a performance team is always interested in. We need to make it absolutely seamless and safe for a customer, on a production system, as a protected but definitely non-root user, to gather system metric information.

  • Memory use profiling- now this will be really nice - way too many times we're asked about memory usage of an application - which is particularly dependent on other things happening in the operating system.

  • Real-time enhancements - continued integration of the real-time work happening in parallel in the community. This is proving particularly helpful in the HPC cluster space as work continues to improve the deterministic behavior of the operating system.

  • Memory fragmentation avoidance - another longer-term project which positions the kernel for far more aggressive memory management and control of varying page sizes.

  • And numerous other choices... better filesystems in ext4 and btrfs, better virtualization and container support, better energy management, improvements in glibc, etc etc
But really? All of these are in play today. All of them are being polished, updated, with many creative minds at work.

So what I asked for was for the kernel and Linux community to continue strengthening the ability to help customers easily understand and improve the performance of their applications and the system. Out of the box. Across all of the components. No special adds, rpms, or kernels on their installed system.

The nice part is the Linux programmers I work with are all committed to this. So the work continues - fit, finish, and polish - and continue working with the longer term changes which are being developed.