Productivity and Performance Using Partitioned Global Address Space Languages
Katherine Yelick: University of California at Berkeley & Lawrence Berkeley National Laboratory

Partitioned Global Address Space (PGAS) languages combine the programming convenience of shared memory with the locality and performance control of message passing. One such language, Unified Parallel C (UPC), is an extension of ISO C defined by a consortium and supported by multiple proprietary and open-source compilers. Another PGAS language, Titanium, is a dialect of Java™ designed for high-performance scientific computation. In this talk we describe some of the highlights of two related projects: the Titanium project, centered at U.C. Berkeley, and the UPC project, centered at Lawrence Berkeley National Laboratory. Both compilers use a source-to-source strategy that translates the parallel languages to C with calls to a communication layer called GASNet. The result is a pair of portable, high-performance compilers that run on a large variety of shared and distributed memory multiprocessors. Both projects combine compiler, runtime, and application efforts to demonstrate some of the performance and productivity advantages of these languages.
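To make the PGAS model concrete, here is a minimal UPC sketch (written for this summary, not taken from the talk). A shared array is distributed across all threads, and the affinity expression of upc_forall runs each iteration on the thread that owns the element being written, so the loop body touches only local data:

    #include <stdio.h>
    #include <upc_relaxed.h>   /* UPC with relaxed memory consistency */

    #define BLK 256            /* elements per thread */
    shared double a[BLK*THREADS], b[BLK*THREADS], c[BLK*THREADS];

    int main(void) {
        int i;
        /* The affinity expression &c[i] gives iteration i to the thread
           that owns c[i]; with the default cyclic layout, a[i], b[i],
           and c[i] all live on that same thread, so every access below
           is local. */
        upc_forall (i = 0; i < BLK*THREADS; i++; &c[i])
            c[i] = a[i] + b[i];
        upc_barrier;
        if (MYTHREAD == 0)
            printf("vector sum computed by %d threads\n", THREADS);
        return 0;
    }

Under the source-to-source strategy described in the abstract, remote shared accesses in a program like this become GASNet communication calls; here every access happens to be local, so the loop generates no communication at all.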
High Performance Computing: the Software Challenges
Mike Bauer: The University of Western Ontario

In the past decade we have seen significant advances in the reliability and performance of commodity computing elements, such as processors, disks, and network devices. Processors, in particular, have increased the computational power available in desktops and laptops. The advent of these reliable and powerful off-the-shelf computational elements has also spurred a new generation of high performance computing systems. These systems, so-called commodity clusters, have become a mainstay of today's high-performance computing facilities. With today's processors comprising multiple cores, such systems may include thousands or tens of thousands of processing elements connected by commodity networking and using storage built from commodity disks and devices. System and communication software provides the glue that enables these processing elements to operate in parallel. Applications from a growing number of disciplines must be adapted to execute in parallel, but can then address significantly more complex problems or analyze significantly greater amounts of data. Today, these parallel clusters dominate the list of the top 500 supercomputing facilities in the world.

While there have been significant advances in hardware, software to efficiently utilize these systems and software to facilitate the development of applications for them have been progressing at a much slower pace. Based on our experiences with SHARCNET and trends in hardware for next-generation clusters, we see a growing gap between the capabilities of the systems themselves and the software to efficiently operate those systems, the software to support computational grids, and, more importantly, the software used to develop applications. For example, the emergence of multi-core architectures, while increasing the processing capabilities of nodes, has created challenges for compilers and for programmers, who must now manage parallelism not only between nodes but within them. Such parallelism will likely be a key element in the push toward petascale computation, but it will also create challenges in communication among processors and make programming more complex for most researchers.

We outline the status of current software to support parallel clusters and the development of applications, and then identify some of the existing challenges and limitations of that software. We then consider some of the emerging trends in hardware, processors, storage, and networks, and their implications for these kinds of clusters and grids. We identify some of the future challenges for software to take advantage of these systems and describe some of the current approaches being considered.
Multithreaded Programming in Cilk
Matteo Frigo: Cilk Arts

Cilk is a multithreaded programming language aimed at making parallel programming a seamless extension of commodity serial computing. Cilk minimally extends the C programming language to allow interactions among computational threads to be specified in a simple and high-level fashion. Cilk's provably efficient runtime system dynamically maps a user's program onto the available physical resources, freeing the programmer from concerns about communication protocols and load balancing. In addition, Cilk provides an abstract performance model that a programmer can use to predict the multiprocessor performance of an application from its execution on a single processor. Cilk programs not only scale up to run efficiently on multiple processors, they also "scale down": the efficiency of a Cilk program on one processor rivals that of a comparable C program.

In this talk, I will provide a tutorial on the Cilk language for people with a basic background in computer programming. I will explain how to program multithreaded applications in Cilk and how to analyze their performance. I will illustrate some of the ideas behind Cilk using the example of MIT's championship computer-chess programs, *Socrates and Cilkchess. I will also briefly sketch how the software technology underlying Cilk actually works.
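The flavor of the language is usually conveyed with the classic recursive Fibonacci example; the sketch below is our reconstruction in MIT Cilk-5 syntax, not code from the talk. The cilk keyword marks a procedure that may spawn parallel work, spawn forks a child that can run in parallel with its caller, and sync waits for all outstanding children:

    #include <stdio.h>
    #include <stdlib.h>

    /* Parallel Fibonacci in MIT Cilk-5 syntax, compiled with the cilkc
       source-to-source compiler. Deleting cilk, spawn, and sync leaves
       an ordinary C program: the sense in which Cilk "scales down". */
    cilk int fib(int n)
    {
        if (n < 2) {
            return n;
        } else {
            int x, y;
            x = spawn fib(n - 1);   /* child may execute in parallel */
            y = spawn fib(n - 2);
            sync;                   /* wait for both children */
            return x + y;
        }
    }

    cilk int main(int argc, char *argv[])
    {
        int n = (argc > 1) ? atoi(argv[1]) : 30;
        int result;
        result = spawn fib(n);
        sync;
        printf("fib(%d) = %d\n", n, result);
        return 0;
    }

The abstract performance model mentioned above predicts the running time on P processors as roughly T_P ≈ T_1/P + T_∞, where T_1 is the total work of the one-processor execution and T_∞ is the critical-path length.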
KAAPI: a Thread Scheduling Runtime System for Data Flow Computations on Cluster of Multi-Processors
Thierry Gautier: INRIA

Multiprocessor clusters are now widely available and, at first sight, attractive because they aggregate high performance. Nevertheless, obtaining peak performance on irregular applications, such as computer algebra problems, remains challenging. Memory access times are non-uniform, and the irregularity of the computations requires scheduling algorithms that automatically balance the workload among the processors. This talk focuses on the implementation of runtime support that exploits the computational resources of a multiprocessor cluster with high efficiency. The originality of our approach lies in the implementation of an efficient work-stealing algorithm for macro data flow computations, based on a minor extension of the POSIX thread interface.
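Although the abstract gives no code, the core work-stealing idea can be sketched directly on top of POSIX threads. The toy below is our illustration only: the names (deque_t, push, take) are invented for this sketch, the deques are mutex-protected rather than lock-free, and termination is a crude idle counter; it is not KAAPI's actual interface or implementation. Each worker executes tasks from the bottom of its own deque, preserving locality, and steals the oldest task from another worker's deque when it runs dry:

    /* Toy work-stealing scheduler over POSIX threads (C99). */
    #include <pthread.h>
    #include <stdio.h>

    #define NWORKERS 4
    #define MAXTASKS 1024

    typedef void (*task_fn)(void *);
    typedef struct { task_fn fn; void *arg; } task_t;

    typedef struct {
        task_t tasks[MAXTASKS];
        int top, bottom;           /* thieves take from top, owner from bottom */
        pthread_mutex_t lock;
    } deque_t;

    static deque_t dq[NWORKERS];

    static void push(int w, task_t t)      /* owner adds work at the bottom */
    {
        pthread_mutex_lock(&dq[w].lock);
        dq[w].tasks[dq[w].bottom++] = t;
        pthread_mutex_unlock(&dq[w].lock);
    }

    static int take(int w, int lifo, task_t *out)
    {
        /* lifo != 0: owner pops its newest task (locality);
           lifo == 0: thief steals the victim's oldest task. */
        int ok = 0;
        pthread_mutex_lock(&dq[w].lock);
        if (dq[w].bottom > dq[w].top) {
            *out = lifo ? dq[w].tasks[--dq[w].bottom]
                        : dq[w].tasks[dq[w].top++];
            ok = 1;
        }
        pthread_mutex_unlock(&dq[w].lock);
        return ok;
    }

    static void *worker(void *arg)
    {
        int w = (int)(long)arg, idle = 0;
        task_t t;
        while (idle < 1000000) {           /* toy termination condition */
            if (take(w, 1, &t) ||                       /* own deque first */
                take((w + 1 + idle) % NWORKERS, 0, &t)) /* ...then steal   */
            {
                t.fn(t.arg);
                idle = 0;
            } else {
                idle++;
            }
        }
        return NULL;
    }

    static void hello(void *arg) { printf("task %ld\n", (long)arg); }

    int main(void)
    {
        pthread_t th[NWORKERS];
        long i, w;
        for (w = 0; w < NWORKERS; w++)
            pthread_mutex_init(&dq[w].lock, NULL);
        for (i = 0; i < 64; i++)           /* seed all work on worker 0 */
            push(0, (task_t){ hello, (void *)i });
        for (w = 0; w < NWORKERS; w++)
            pthread_create(&th[w], NULL, worker, (void *)w);
        for (w = 0; w < NWORKERS; w++)
            pthread_join(th[w], NULL);
        return 0;
    }

Stealing from the opposite end of the victim's deque is the standard work-stealing heuristic: the oldest tasks tend to represent the largest remaining subcomputations, so a thief amortizes the cost of each steal over more work.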
Automating Renormalization of Quantum Field Theories
Tony Kennedy: University of Edinburgh (UK)

We give an overview of state-of-the-art multi-loop Feynman diagram computations, and explain how we use symbolic manipulation to generate renormalized integrals that are then evaluated numerically. We explain how we automate BPHZ renormalization using "henges" and "sectors", and give a brief description of the symbolic tensor and Dirac gamma-matrix manipulation that is required (a small illustrative example appears after the bio below). We shall compare the use of general computer algebra systems such as Maple with domain-specific languages such as FORM, highlighting in particular memory management issues.

Bio: Tony Kennedy has interests which span theoretical physics, computer science, and mathematics. In theoretical physics he has worked on the perturbative renormalization of quantum field theory, where, together with Bill Caswell, he constructed a new proof of the fundamental BPH theorem; this proof is now to be found in textbooks on the subject. He also introduced a new algorithm for simplifying Dirac gamma-matrix traces using diagrammatic techniques; further work using such methods led to the proof of the negative-dimensional isomorphisms of Lie algebras SU(n) = SU(-n) and SO(n) = Sp(-n).

For the past 20 years or so he has worked principally in the area of lattice quantum field theory, which involves using large computers to carry out numerical evaluation of the infinite-dimensional integrals that define quantum field theory in the path-integral formulation. Here he was one of the authors of the Hybrid Monte Carlo algorithm, which made computations that include dynamical fermions feasible. This algorithm is used universally in the field, is now a standard Markov chain Monte Carlo method in many other areas such as neural networks, and appears in many textbooks. Over the past few years he developed the Rational Hybrid Monte Carlo algorithm (in collaboration with his PhD student Mike Clark), which has led to at least an order-of-magnitude speedup in lattice computations and has been widely used in the community. As the computational requirements of lattice field theory are so great, he was led to work on developing specialized computers for lattice quantum chromodynamics (QCD): he was one of the designers of QCDSP, which won the Gordon Bell prize for best price/performance, and more recently of QCDOC, the direct precursor of IBM's BlueGene supercomputer. Among many other activities, he has served on the CFI High Performance Computing Expert Committee several times, been an editor of several leading physics journals, and is on the advisory committee for the annual international lattice field theory conference.
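As a concrete illustration of the gamma-matrix manipulation mentioned in the abstract above, these are the standard four-dimensional Dirac trace identities that systems such as FORM apply automatically (standard textbook algebra, not material from the talk), written here in LaTeX:

    \begin{align}
      \operatorname{Tr}\bigl(\gamma^\mu \gamma^\nu\bigr)
        &= 4\, g^{\mu\nu}, \\
      \operatorname{Tr}\bigl(\gamma^\mu \gamma^\nu \gamma^\rho \gamma^\sigma\bigr)
        &= 4\bigl( g^{\mu\nu} g^{\rho\sigma}
                 - g^{\mu\rho} g^{\nu\sigma}
                 + g^{\mu\sigma} g^{\nu\rho} \bigr),
    \end{align}

while the trace of an odd number of gamma matrices vanishes. Multi-loop computations apply such rewrite rules to expressions with enormous numbers of terms, which is where the memory-management behavior of the underlying algebra system, highlighted in the abstract, becomes critical.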