
HPC and The Grid

The latest fashion in some academic IT circles is ``The Grid'', and many people hold the quite incorrect view that ``The Grid'' is in some way deeply connected to high performance computing, or even that it is high performance computing in its latest guise.

``The Grid'' is not related to high performance computing.

High performance computing and supercomputing have been around for decades without ``The Grid'', and they will continue to be around for decades without it.

The other view, which is also quite incorrect, is that there is just one ``Grid'', namely ``The Grid'', and that this grid is based on ``Globus'', which is a collection of utilities and libraries developed by various folks, but mostly by folks from the University of Chicago and the Argonne National Laboratory.

Many people who actually know something about distributed computing have pointed out that what is called ``The Grid'' nowadays was called ``Distributed Computing'' only a decade ago. It is often the case in Information Technology, especially in academia, that old, washed-out ideas are given new names and flogged off yet again by the same people who failed to sell them under the old names.

There are some successful examples of grids in place today. The most successful one, and probably the only one that will truly flourish in years to come, is the Microsoft ``.NET Passport'' program. It works like this: when you start up your PC running Windows, ``MSN Messenger'' logs you in with your ``.NET Passport''. This way you acquire credentials, which are then passed to all other WWW sites that participate in the ``.NET Passport'' program. For example, once I have been authenticated to ``.NET Passport'', I can connect to Amazon.com, Nature, Science, Monster, The New York Times, and various other well known sites, which recognize me instantaneously and provide me with customized services.

Another example of a grid is AFS, the Andrew File System. AFS is a world-wide file system, which, when mounted on a client machine, provides its user with transparent access to file systems at various institutions. It can be compared to the World Wide Web, but unlike WWW, AFS provides access to files on the kernel and file system level. You don't need to use a special tool such as a WWW browser. If you have AFS mounted on your computer, you can use native OS methods in order to access, view, modify, and execute files that live at other institutions. User authentication and verification is based on MIT Kerberos and user authorization is based on AFS Access Control Lists (ACLs).

Another example of a grid is the one that is currently being built by CERN and that is going to be used by Europe's high energy physicists and their collaborators. They have their own highly specialized protocols, libraries and utilities that are built on top of ``Globus'', but the latter is used as a low-level library only. Recall that it was at CERN that the WWW was invented in the first place.

Another example of non-Globus Grid software developed in Europe is Unicore. It was developed by Forschungszentrum Jülich GmbH in cooperation with German software companies, the supercomputer centers in Stuttgart, Munich, Karlsruhe, and Berlin, the European Centre for Medium-Range Weather Forecasts in Reading, UK, and various hardware companies, such as Fujitsu, Hitachi, NEC and Siemens, as well as some American partners: HP, IBM and Sun Microsystems.

Europeans want to do things their own way and for various reasons eschew being dependent on American technology, including Information Technology.

Yet another example of Grid software is Legion, developed by researchers from the University of Virginia under various contracts with DoE, DoD and DARPA. Legion is much more usable and more functional than Globus and provides numerous higher level utilities and abstractions. An insightful comparison between Legion and Globus can be found in ``A philosophical and technical comparison of Legion and Globus'' by Grimshaw, Humphrey and Natrajan, IBM Journal of Research and Development, vol. 48, no. 2.

Why are grid and high performance computing not the same thing?

The purpose of the grid is to provide users with connections to various computing and storage resources, usually at other institutions.

There is no need for grid protocols within an institution, because other site-wide authentication and verification methods such as Kerberos or Active Directory work better in this context. So the grid is for long-haul connections.

Long-haul connections always have very high latencies. This is caused, first, by the speed of light, which by supercomputer standards is very low, and, second, by the fact that you have to pass through numerous routers, switches and sometimes even firewalls on your way between the institutions. These add even more latency to the connection, and some of them (e.g., the firewalls) can also restrict the available bandwidth severely. High latency kills high performance computing. The only jobs that are relatively immune to it are ``high capacity computing'' jobs, i.e., trivially parallelizable jobs, which run as numerous small programs on a large number of computers and which don't communicate with each other. But ``high capacity computing'' is not high performance computing, nor is it supercomputing, where communication between processes running on various parts of the system is intense, frequent and often synchronized.
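The effect of latency on the two kinds of jobs can be illustrated with a toy model. The sketch below is not a measurement of any real system: it assumes that a job alternates between a fixed amount of computation and one synchronization, and that the synchronization costs exactly one latency. The function name and all three timing constants are illustrative assumptions.

```python
def parallel_efficiency(compute_s, sync_latency_s):
    """Fraction of wall-clock time spent computing when every step of
    length compute_s is followed by one synchronization that costs
    sync_latency_s. A toy model: it ignores bandwidth and any overlap
    of communication with computation."""
    return compute_s / (compute_s + sync_latency_s)


STEP_COMPUTE_S = 1e-3        # assumed: 1 ms of arithmetic between synchronizations
CLUSTER_LATENCY_S = 10e-6    # assumed: about 10 microseconds inside one cluster
LONG_HAUL_LATENCY_S = 50e-3  # assumed: about 50 ms between distant institutions

# A tightly coupled job run inside one cluster and across a long-haul link.
tightly_coupled_local = parallel_efficiency(STEP_COMPUTE_S, CLUSTER_LATENCY_S)
tightly_coupled_grid = parallel_efficiency(STEP_COMPUTE_S, LONG_HAUL_LATENCY_S)

# A "high capacity" job never synchronizes, so latency costs it nothing.
high_capacity_grid = parallel_efficiency(STEP_COMPUTE_S, 0.0)

print(f"tightly coupled, one cluster: {tightly_coupled_local:.1%}")
print(f"tightly coupled, long haul:   {tightly_coupled_grid:.1%}")
print(f"high capacity, long haul:     {high_capacity_grid:.1%}")
```

Under these assumed numbers the tightly coupled job keeps nearly all of its efficiency inside one cluster and loses nearly all of it across the long-haul link, while the job that never communicates is unaffected.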

The I-light link between Bloomington and Indianapolis does not run ``as the crow flies''. It goes around quite a lot and its total length is about 80 miles. Were it not for switches and routers, it would take about 0.5 ms for the light signal to traverse this distance. In reality it takes much longer. But let's stay with this ideal number of 0.5 ms. In this time a 1 TFLOPS supercomputer can perform 500,000,000 floating point operations. If a program running on an IA32 cluster split between IUB and IUPUI has to synchronize operations frequently, then every time it needs to do so we'll lose hundreds of millions of floating point operations waiting for the synchronization to complete.
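The arithmetic behind these numbers can be checked with a few lines of Python. This is a back-of-the-envelope sketch: the 80 miles and 1 TFLOPS come from the text above, while the function names and the factor of 2/3 for the speed of light in optical fiber are assumptions added here (glass slows light down to roughly two thirds of its vacuum speed).

```python
MILES_TO_KM = 1.609344
C_VACUUM_KM_S = 299_792.458  # speed of light in vacuum, km/s
FIBER_FACTOR = 2.0 / 3.0     # assumed: light in glass travels at roughly 2/3 c


def one_way_latency_s(distance_miles, speed_km_s=C_VACUUM_KM_S):
    """Time for a light signal to cover the distance, ignoring all equipment."""
    return distance_miles * MILES_TO_KM / speed_km_s


def flops_lost(latency_s, flops_per_s=1e12):
    """Floating point operations a machine could have done while waiting."""
    return latency_s * flops_per_s


vacuum_s = one_way_latency_s(80)
fiber_s = one_way_latency_s(80, C_VACUUM_KM_S * FIBER_FACTOR)

print(f"80 miles in vacuum: {vacuum_s * 1e3:.2f} ms")
print(f"80 miles in fiber:  {fiber_s * 1e3:.2f} ms")
print(f"operations lost in 0.5 ms at 1 TFLOPS: {flops_lost(0.5e-3):.1e}")
```

The vacuum and fiber figures bracket the ideal 0.5 ms used in the text, and a single one-way trip of that length costs a 1 TFLOPS machine about 5 x 10^8 operations. A synchronization normally needs at least a round trip, so the real cost is at least twice that.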

The other problem with long-haul connections is that effective bandwidths on such connections for large data transfers are usually very low, even if the lines are advertised as high bandwidth ones and even if there are no firewalls. It is enough for a very small fraction of packets to get dropped on some routers to bring the effective transfer rates down from the nominal hundreds of MB/s to a mere five or so. We have seen some long-haul transcontinental transfer rate records established recently, for example, between SLAC and the University of Edinburgh (40 MB/s) and between Caltech and CERN (80 MB/s). But 40 MB/s or 80 MB/s is very little by high performance computing standards. Here we need transfer rates of GB/s or better.
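Why a tiny packet loss rate is so damaging can be seen from a standard approximation for a single TCP stream, usually attributed to Mathis and co-authors: the sustained rate is bounded by roughly (MSS / RTT) * sqrt(3/2) / sqrt(p), where MSS is the segment size, RTT the round-trip time and p the loss probability. The text above does not name any particular model, so this choice, the function name, and the numbers plugged in below (1460-byte segments, a 150 ms round trip) are assumptions for illustration only.

```python
import math


def mathis_throughput_bytes_per_s(mss_bytes, rtt_s, loss_rate):
    """Approximate upper bound on the steady-state rate of one TCP stream
    with classic loss-based congestion control (Mathis approximation)."""
    return (mss_bytes / rtt_s) * math.sqrt(1.5) / math.sqrt(loss_rate)


MSS_BYTES = 1460  # assumed: typical Ethernet-sized TCP segment
RTT_S = 0.150     # assumed: 150 ms round trip on an intercontinental path

# Even one lost packet in a million caps a single stream far below
# the nominal capacity of the line.
for loss_rate in (1e-4, 1e-5, 1e-6):
    rate = mathis_throughput_bytes_per_s(MSS_BYTES, RTT_S, loss_rate)
    print(f"loss rate {loss_rate:.0e}: about {rate / 1e6:.1f} MB/s")
```

With these assumed parameters the bound stays in the range of a few MB/s, whatever the advertised bandwidth of the line. Because the bound scales as 1/RTT, the same loss rate hurts a long-haul connection far more than a connection inside a machine room.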

But let's get back to ``The Grid''. Assume that we have it in place. ``The Grid'' will provide us with tools to access, say, NCSA, SDSC, PSC, ARSC and some other supercomputer centers. The tools in question may even be quite nice given another 20 years of development. You'll click some push buttons, turn some dials and... you'll be there. Now what? You got connected to, say, NCSA, and you still have to write a supercomputer program to run on the NCSA cluster utilizing all its nodes in parallel and communicating frequently between various processes. ``The Grid'' is of no help here. It doesn't tell you how to do this.

This course, however, will. Think of it in the following terms: this course is about what you need to do once you have connected to a supercomputer resource. You may have connected using ``The Grid'', or using ssh, or using Kerberized telnet, or simply using a telephone line. But now you are there and you have to reach for tools other than ssh, ``The Grid'', or the telephone line, in order to construct, submit and run a supercomputer job.


Zdzislaw Meglicki
2004-04-29