Productivity and Performance Using Partitioned Global Address Space Languages
Katherine Yelick: University of California at Berkeley & Lawrence Berkeley National Laboratory

Partitioned Global Address Space (PGAS) languages combine the programming convenience of shared memory with the locality and performance control of message passing. One such language, Unified Parallel C (UPC), is an extension of ISO C defined by a consortium and supported by multiple proprietary and open-source compilers. Another PGAS language, Titanium, is a dialect of Java™ designed for high-performance scientific computation. In this talk we describe some of the highlights of two related projects: the Titanium project, centered at U.C. Berkeley, and the UPC project, centered at Lawrence Berkeley National Laboratory. Both compilers use a source-to-source strategy that translates the parallel languages to C with calls to a communication layer called GASNet. The result is a pair of portable, high-performance compilers that run on a large variety of shared and distributed memory multiprocessors. Both projects combine compiler, runtime, and application efforts to demonstrate some of the performance and productivity advantages of these languages.
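To make the PGAS model concrete, here is a minimal UPC sketch (written for this summary, not taken from the talk). A shared array is distributed across all threads, and the affinity expression of upc_forall runs each iteration on the thread that owns the element being written, so the loop body touches only local data:

    #include <stdio.h>
    #include <upc_relaxed.h>   /* UPC with relaxed memory consistency */

    #define BLK 256            /* elements per thread */
    shared double a[BLK*THREADS], b[BLK*THREADS], c[BLK*THREADS];

    int main(void) {
        int i;
        /* The affinity expression &c[i] gives iteration i to the thread
           that owns c[i]; with the default cyclic layout, a[i], b[i],
           and c[i] all live on that same thread, so every access below
           is local. */
        upc_forall (i = 0; i < BLK*THREADS; i++; &c[i])
            c[i] = a[i] + b[i];
        upc_barrier;
        if (MYTHREAD == 0)
            printf("vector sum computed by %d threads\n", THREADS);
        return 0;
    }

Under the source-to-source strategy described in the abstract, remote shared accesses in a program like this become GASNet communication calls; here every access happens to be local, so the loop generates no communication at all.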
High Performance Computing: the Software Challenges
Mike Bauer: The University of Western Ontario

In the past decade we have seen significant advances in the reliability and performance of commodity computing elements, such as processors, disks, and network devices. Processors, in particular, have increased the computational power available in desktops and laptops. The advent of these reliable and powerful off-the-shelf computational elements has also spurred a new generation of high performance computing systems. These systems, so-called commodity clusters, have become a mainstay of today's high-performance computing facilities. With today's processors comprising multiple cores, such systems may include thousands or tens of thousands of processing elements connected by commodity networking and using storage built from commodity disks and devices. System and communication software provides the glue that enables these processing elements to operate in parallel. Applications from a growing number of disciplines must be adapted to execute in parallel, but can then address significantly more complex problems or analyze significantly greater amounts of data. Today, these parallel clusters dominate the list of the top 500 supercomputing facilities in the world.

While there have been significant advances in hardware, software to efficiently utilize these systems and software to facilitate the development of applications for them have been progressing at a much slower pace. Based on our experiences with SHARCNET and trends in hardware for next-generation clusters, we see a growing gap between the capabilities of the systems themselves and the software to efficiently operate those systems, the software to support computational grids, and, more importantly, the software used to develop applications. For example, the emergence of multi-core architectures, while increasing the processing capabilities of nodes, has created challenges for compilers and for programmers, who must now manage parallelism not only between nodes but within them. Such parallelism will likely be a key element in the push toward petascale computation, but it will also create challenges in communication among processors and make programming more complex for most researchers.

We outline the status of current software to support parallel clusters and the development of applications, and then identify some of the existing challenges and limitations of that software. We then consider some of the emerging trends in hardware, processors, storage, and networks, and their implications for these kinds of clusters and grids. We identify some of the future challenges for software to take advantage of these systems and describe some of the current approaches being considered.
Multithreaded Programming in Cilk
Matteo Frigo: Cilk Arts

Cilk is a multithreaded programming language aimed at making parallel programming a seamless extension of commodity serial computing. Cilk minimally extends the C programming language to allow interactions among computational threads to be specified in a simple and high-level fashion. Cilk's provably efficient runtime system dynamically maps a user's program onto the available physical resources, freeing the programmer from concerns about communication protocols and load balancing. In addition, Cilk provides an abstract performance model that a programmer can use to predict the multiprocessor performance of an application from its execution on a single processor. Cilk programs not only scale up to run efficiently on multiple processors, they also "scale down": the efficiency of a Cilk program on one processor rivals that of a comparable C program.

In this talk, I will provide a tutorial on the Cilk language for people with a basic background in computer programming. I will explain how to program multithreaded applications in Cilk and how to analyze their performance. I will illustrate some of the ideas behind Cilk using the example of MIT's championship computer-chess programs, *Socrates and Cilkchess. I will also briefly sketch how the software technology underlying Cilk actually works.
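The flavor of the language is usually conveyed with the classic recursive Fibonacci example; the sketch below is our reconstruction in MIT Cilk-5 syntax, not code from the talk. The cilk keyword marks a procedure that may spawn parallel work, spawn forks a child that can run in parallel with its caller, and sync waits for all outstanding children:

    #include <stdio.h>
    #include <stdlib.h>

    /* Parallel Fibonacci in MIT Cilk-5 syntax, compiled with the cilkc
       source-to-source compiler. Deleting cilk, spawn, and sync leaves
       an ordinary C program: the sense in which Cilk "scales down". */
    cilk int fib(int n)
    {
        if (n < 2) {
            return n;
        } else {
            int x, y;
            x = spawn fib(n - 1);   /* child may execute in parallel */
            y = spawn fib(n - 2);
            sync;                   /* wait for both children */
            return x + y;
        }
    }

    cilk int main(int argc, char *argv[])
    {
        int n = (argc > 1) ? atoi(argv[1]) : 30;
        int result;
        result = spawn fib(n);
        sync;
        printf("fib(%d) = %d\n", n, result);
        return 0;
    }

The abstract performance model mentioned above predicts the running time on P processors as roughly T_P ≈ T_1/P + T_∞, where T_1 is the total work of the one-processor execution and T_∞ is the critical-path length.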
KAAPI: a Thread Scheduling Runtime System for Data Flow Computations on Cluster of Multi-Processors
Thierry Gautier: INRIA

Multiprocessor clusters are now widely available and, at first sight, attractive because they aggregate high performance. Nevertheless, obtaining peak performance on irregular applications, such as computer algebra problems, remains challenging. Memory access times are non-uniform, and the irregularity of the computations requires scheduling algorithms that automatically balance the workload among the processors. This talk focuses on the implementation of runtime support that exploits the computational resources of a multiprocessor cluster with high efficiency. The originality of our approach lies in the implementation of an efficient work-stealing algorithm for macro data flow computations, based on a minor extension of the POSIX thread interface.
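Although the abstract gives no code, the core work-stealing idea can be sketched directly on top of POSIX threads. The toy below is our illustration only: the names (deque_t, push, take) are invented for this sketch, the deques are mutex-protected rather than lock-free, and termination is a crude idle counter; it is not KAAPI's actual interface or implementation. Each worker executes tasks from the bottom of its own deque, preserving locality, and steals the oldest task from another worker's deque when it runs dry:

    /* Toy work-stealing scheduler over POSIX threads (C99). */
    #include <pthread.h>
    #include <stdio.h>

    #define NWORKERS 4
    #define MAXTASKS 1024

    typedef void (*task_fn)(void *);
    typedef struct { task_fn fn; void *arg; } task_t;

    typedef struct {
        task_t tasks[MAXTASKS];
        int top, bottom;           /* thieves take from top, owner from bottom */
        pthread_mutex_t lock;
    } deque_t;

    static deque_t dq[NWORKERS];

    static void push(int w, task_t t)      /* owner adds work at the bottom */
    {
        pthread_mutex_lock(&dq[w].lock);
        dq[w].tasks[dq[w].bottom++] = t;
        pthread_mutex_unlock(&dq[w].lock);
    }

    static int take(int w, int lifo, task_t *out)
    {
        /* lifo != 0: owner pops its newest task (locality);
           lifo == 0: thief steals the victim's oldest task. */
        int ok = 0;
        pthread_mutex_lock(&dq[w].lock);
        if (dq[w].bottom > dq[w].top) {
            *out = lifo ? dq[w].tasks[--dq[w].bottom]
                        : dq[w].tasks[dq[w].top++];
            ok = 1;
        }
        pthread_mutex_unlock(&dq[w].lock);
        return ok;
    }

    static void *worker(void *arg)
    {
        int w = (int)(long)arg, idle = 0;
        task_t t;
        while (idle < 1000000) {           /* toy termination condition */
            if (take(w, 1, &t) ||                       /* own deque first */
                take((w + 1 + idle) % NWORKERS, 0, &t)) /* ...then steal   */
            {
                t.fn(t.arg);
                idle = 0;
            } else {
                idle++;
            }
        }
        return NULL;
    }

    static void hello(void *arg) { printf("task %ld\n", (long)arg); }

    int main(void)
    {
        pthread_t th[NWORKERS];
        long i, w;
        for (w = 0; w < NWORKERS; w++)
            pthread_mutex_init(&dq[w].lock, NULL);
        for (i = 0; i < 64; i++)           /* seed all work on worker 0 */
            push(0, (task_t){ hello, (void *)i });
        for (w = 0; w < NWORKERS; w++)
            pthread_create(&th[w], NULL, worker, (void *)w);
        for (w = 0; w < NWORKERS; w++)
            pthread_join(th[w], NULL);
        return 0;
    }

Stealing from the opposite end of the victim's deque is the standard work-stealing heuristic: the oldest tasks tend to represent the largest remaining subcomputations, so a thief amortizes the cost of each steal over more work.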
Automating Renormalization of Quantum Field Theories
Tony Kennedy: University of Edinburgh (UK)

We give an overview of state-of-the-art multi-loop Feynman diagram computations, and explain how we use symbolic manipulation to generate renormalized integrals that are then evaluated numerically. We explain how we automate BPHZ renormalization using "henges" and "sectors", and give a brief description of the symbolic tensor and Dirac gamma-matrix manipulation that is required (a small illustrative example appears after the bio below). We shall compare the use of general computer algebra systems such as Maple with domain-specific languages such as FORM, highlighting in particular memory management issues.

Bio: Tony Kennedy has interests which span theoretical physics, computer science, and mathematics. In theoretical physics he has worked on the perturbative renormalization of quantum field theory, where, together with Bill Caswell, he constructed a new proof of the fundamental BPH theorem; this proof is now to be found in textbooks on the subject. He also introduced a new algorithm for simplifying Dirac gamma-matrix traces using diagrammatic techniques; further work using such methods led to the proof of the negative-dimensional isomorphisms of Lie algebras SU(n) = SU(-n) and SO(n) = Sp(-n).

For the past 20 years or so he has worked principally in the area of lattice quantum field theory, which involves using large computers to carry out numerical evaluation of the infinite-dimensional integrals that define quantum field theory in the path-integral formulation. Here he was one of the authors of the Hybrid Monte Carlo algorithm, which made computations that include dynamical fermions feasible. This algorithm is used universally in the field, is now a standard Markov chain Monte Carlo method in many other areas such as neural networks, and appears in many textbooks. Over the past few years he developed the Rational Hybrid Monte Carlo algorithm (in collaboration with his PhD student Mike Clark), which has led to at least an order-of-magnitude speedup in lattice computations and has been widely used in the community. As the computational requirements of lattice field theory are so great, he was led to work on developing specialized computers for lattice quantum chromodynamics (QCD): he was one of the designers of QCDSP, which won the Gordon Bell prize for best price/performance, and more recently of QCDOC, the direct precursor of IBM's BlueGene supercomputer. Among many other activities, he has served on the CFI High Performance Computing Expert Committee several times, been an editor of several leading physics journals, and is on the advisory committee for the annual international lattice field theory conference.
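As a concrete illustration of the gamma-matrix manipulation mentioned in the abstract above, these are the standard four-dimensional Dirac trace identities that systems such as FORM apply automatically (standard textbook algebra, not material from the talk), written here in LaTeX:

    \begin{align}
      \operatorname{Tr}\bigl(\gamma^\mu \gamma^\nu\bigr)
        &= 4\, g^{\mu\nu}, \\
      \operatorname{Tr}\bigl(\gamma^\mu \gamma^\nu \gamma^\rho \gamma^\sigma\bigr)
        &= 4\bigl( g^{\mu\nu} g^{\rho\sigma}
                 - g^{\mu\rho} g^{\nu\sigma}
                 + g^{\mu\sigma} g^{\nu\rho} \bigr),
    \end{align}

while the trace of an odd number of gamma matrices vanishes. Multi-loop computations apply such rewrite rules to expressions with enormous numbers of terms, which is where the memory-management behavior of the underlying algebra system, highlighted in the abstract, becomes critical.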