EPFL Dr Jean-Albert Ferrez

I have left EPFL!

My research activities are still on these pages, but the rest, in particular the Linux resources, is now on my new site hosted at IDIAP.

A quick evaluation of the machines available at DMA with HINT

Abstract

This page tries to evaluate the computing power of a 400 MHz Pentium II PC, comparing it to classical RISC workstations.

Summary

The aim is to evaluate the computing power of a 400 MHz Pentium II PC relative to classical RISC workstations.

Update: This page was written in 1998 and describes the hardware we had at that time. Some of the tests were repeated in 1999 with the new hardware, and the results have led us to the same conclusions.

Roadmap

The hardware
The benchmark: HINT
Results and comparisons
A word about parallelism
Applications and libraries
Conclusion

Past and current hardware

The hardware reviewed here is a selection of the machines currently available at the Mathematics Department of EPFL, or available for immediate purchase. Most of the comments are comparisons with the Indigo2, since this is what we are used to working with.

  • PC Pentium II (1998): A dual-Pentium II @ 400 MHz, 512 KB of cache per processor, 128 MB main memory, running Linux, using gcc-2.7.2 with CFLAGS="-O3". For the first time, cheap PCs benefit from 100 MHz buses and memories. Only single-PE performance was benchmarked.
  • SGI Indigo2 (1997): An SGI Indigo2, MIPS R10000 @ 195 MHz, 1 MB cache, 96 MB memory, running IRIX 6.2, using MIPSpro C 7.2 with CFLAGS="-mips4 -r10000 -n32 -O3". This is currently our fastest machine. The DMA has about 50 similar or older workstations. Some of them can be (and have been) used as a cluster.
  • SUN Ultra 10 (1998): A SUN Ultra 10, UltraSPARC-IIi @ 300 MHz, 512 KB cache, 128 MB memory, running SunOS 5.6, using WorkShop C Compiler 4.2 with CFLAGS="-fast" (courtesy of Paul-Jean Cagnard from LITH-EPFL).
  • Swiss T0 (1998): A Swiss T0 prototype, 8 nodes, each with an Alpha 21164 @ 500 MHz, 1 MB cache, 256 MB memory, running Digital Unix, using DEC C V5.2-038 with CFLAGS="-arch host -tune host -fast -O5". Only single-PE performance was benchmarked. This machine was included in this study because of its Alpha chip, believed to be the fastest available, at least for floating point operations. It is actually located at SIC.

The benchmark: HINT

Unlike most "famous" benchmarks (SPECint_9x, SPECfp_9x, LINPACK, Peak FLOPS,...), HINT aims at benchmarking the machines over a wide range of problem sizes. Although it does also return a single number (in QUIPS, QUality Improvements Per Second), HINT gives a curve of the performance as a function of time or, equivalently, memory usage.

Find out more about the ideas behind HINT, and see this nice tutorial on Understanding HINT Graph.

The HINT benchmark was run several times on each platform with different data types: INT32, INT64, FLOAT and DOUBLE.
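
The idea of a "quality improvement" can be illustrated with a small sketch. To our understanding, HINT bounds the area under (1-x)/(1+x) on [0, 1] by successively subdividing intervals, and defines quality as the reciprocal of the gap between the upper and lower bounds. The code below is not the HINT source: it is a simplified floating-point illustration, and the names `refine` and `quips_curve` are hypothetical. The real benchmark uses the chosen data type throughout and manages its memory very differently.

```python
import heapq
import time


def f(x):
    # Integrand usually quoted for HINT; it is decreasing on [0, 1].
    return (1.0 - x) / (1.0 + x)


def refine(n_splits):
    """Return (lower, upper) bounds on the integral of f over [0, 1]
    after n_splits greedy interval splits."""
    # Heap entries are (-error, a, b). The error of [a, b] is the area
    # between the rectangles of height f(a) and f(b).
    heap = [(-(f(0.0) - f(1.0)), 0.0, 1.0)]
    for _ in range(n_splits):
        # Always split the interval that currently contributes most error.
        _, a, b = heapq.heappop(heap)
        m = 0.5 * (a + b)
        for lo, hi in ((a, m), (m, b)):
            heapq.heappush(heap, (-(f(lo) - f(hi)) * (hi - lo), lo, hi))
    # f is decreasing, so the right endpoint gives the lower bound.
    lower = sum(f(b) * (b - a) for _, a, b in heap)
    upper = sum(f(a) * (b - a) for _, a, b in heap)
    return lower, upper


def quips_curve(split_counts):
    """Return (n_splits, quality, quality_per_second) for each count."""
    rows = []
    for n in split_counts:
        start = time.perf_counter()
        lower, upper = refine(n)
        # Guard against a zero reading on a coarse timer.
        elapsed = max(time.perf_counter() - start, 1e-9)
        quality = 1.0 / (upper - lower)
        rows.append((n, quality, quality / elapsed))
    return rows
```

Calling `quips_curve([10, 100, 1000, 10000])` gives one point per problem size. More splits need more memory and more time, which is why a HINT curve can be read against either axis.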

Results and comparisons

MQUIPS table (single CPU, the higher the better)
          Pentium II   Indigo2   Ultra 10   Swiss T0
INT32       18.705     10.801     10.591      N/A
INT64        7.497      8.567      4.517     14.352
FLOAT        9.593     10.230     11.344     17.290
DOUBLE      14.182     14.060     17.267     22.005

A quick look at these figures, taking the Indigo2 as reference, indicates that the new 400 MHz Pentium II performs very well on small integers, but still lacks floating point power. The Sun has a problem with non-optimized 64-bit integers, but does fairly well on floating point. The Alpha chip does not have 32-bit integers but otherwise beats everyone everywhere.
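
Taking the Indigo2 as reference can be made explicit by dividing each MQUIPS figure by the Indigo2 figure for the same data type. The following is a minimal sketch using the numbers from the table above; the function name `relative_to` is a hypothetical helper, not part of HINT.

```python
# MQUIPS figures copied from the table; None marks the missing INT32 run.
MQUIPS = {
    "INT32": {"Pentium II": 18.705, "Indigo2": 10.801,
              "Ultra 10": 10.591, "Swiss T0": None},
    "INT64": {"Pentium II": 7.497, "Indigo2": 8.567,
              "Ultra 10": 4.517, "Swiss T0": 14.352},
    "FLOAT": {"Pentium II": 9.593, "Indigo2": 10.230,
              "Ultra 10": 11.344, "Swiss T0": 17.290},
    "DOUBLE": {"Pentium II": 14.182, "Indigo2": 14.060,
               "Ultra 10": 17.267, "Swiss T0": 22.005},
}


def relative_to(reference, table=MQUIPS):
    """Return each figure as a ratio to the reference machine,
    rounded to two decimals (None where a figure is missing)."""
    ratios = {}
    for dtype, row in table.items():
        ref = row[reference]
        ratios[dtype] = {
            machine: None if value is None or ref is None
            else round(value / ref, 2)
            for machine, value in row.items()
        }
    return ratios
```

With `relative_to("Indigo2")`, the Pentium II comes out at about 1.73 on INT32 and 0.94 on FLOAT, while the Ultra 10 drops to about 0.53 on INT64.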

Let us now take a closer look at each machine:

The PC-PII results

HINT graph for the PC-PII

Intel makes no secret that the Pentium's design aims at the business market, and not at big, number-crunching scientific applications. So it is no surprise to see the best results for 32-bit integer computations. Unfortunately, this data type is too small to allow HINT to test larger problem sizes.

The 64-bit integer and 64-bit double curves both show the memory structure of the machine very clearly: performance is best until the 16 KB on-chip L1 cache is saturated, and then holds steady up to the limit of the 512 KB L2 cache. This cache is not on-chip, but sits on the same daughter card and is accessed at half the processor clock speed, 200 MHz in our case. By today's standards, 512 KB of cache is not enough, but this is mitigated by very efficient main memory and a fast 100 MHz bus.

The Indigo2 results

HINT graph for the Indigo2

With the Indigo2, we leave the business world and get closer to what scientists expect: floating point performance. Again the various memory regimes (32 KB L1, 1 MB L2, 96 MB main memory) are clearly visible, but the drops between them are bigger.

The Ultra 10 results

HINT graph for the Ultra 10

The integer performance of this machine is surprisingly low, with an obvious problem on 64-bit integers. It is interesting to note the high memory bandwidth on large problems using DOUBLEs.

The T0 results

HINT graph for the T0

The Alpha chip of the T0 is a true 64-bit chip. It loses against the Pentium II for small integers. The high MFlops advertised can be seen in the left part of the blue curve, but they cannot be sustained when the CPU is not fed from the internal cache. Even a three-level caching system (8 KB, 96 KB, 1 MB) does not help. Finally, the larger memory (256 MB) avoided the final drop, where all other systems started to swap to disk.
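
The memory regimes visible in the curves can be related to the capacities quoted above with a small lookup. This is only a sketch: the thresholds are the nominal sizes from the text, whereas in practice the drops appear somewhat earlier because the operating system and other data also occupy cache and memory. The Ultra 10 is left out because its L1 size is not given here, and the name `regime` is hypothetical.

```python
KB = 1024
MB = 1024 * KB

# Nominal capacities quoted in the text, smallest level first.
HIERARCHY = {
    "PC-PII": [("L1 cache", 16 * KB), ("L2 cache", 512 * KB),
               ("main memory", 128 * MB)],
    "Indigo2": [("L1 cache", 32 * KB), ("L2 cache", 1 * MB),
                ("main memory", 96 * MB)],
    "T0": [("L1 cache", 8 * KB), ("L2 cache", 96 * KB),
           ("L3 cache", 1 * MB), ("main memory", 256 * MB)],
}


def regime(machine, working_set_bytes):
    """Return the smallest memory level that nominally holds the
    working set, or "swap" if it exceeds main memory."""
    for name, capacity in HIERARCHY[machine]:
        if working_set_bytes <= capacity:
            return name
    return "swap"
```

For instance, a 300 KB working set nominally fits in the L2 cache of the PC and the Indigo2, but only in the third cache level of the T0, and a 100 MB working set already exceeds the Indigo2's main memory.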

Comparison of the machines for various data types

The INT32 results

INT32

For small problems based on 32-bit integers, the Pentium II can be up to twice as fast as the MIPS and the Alpha, and four times faster than the Ultra. Again, the Alpha was tested over a wider range because its small integers are in fact 64 bits wide.

The INT64 results

INT64

When longer integers are needed, the Alpha takes the lead again thanks to its 64-bit architecture. One interesting thing to note is that the Indigo2 is better than the PC for small problem sizes, but when main memory starts to play a dominant role, the Pentium II's 100 MHz bus and memory are able to provide a higher bandwidth.

The FLOAT results

FLOAT

It is interesting to note that as long as the 32-bit FLOAT datatype provides enough precision, the PC, Indigo2 and Ultra 10 give similar performance despite their different clock frequencies.

The DOUBLE results

DOUBLE

The DOUBLE datatype is probably the most commonly used. For small, cache-resident problem sizes, the T0 remains ahead. For larger problems, the gap between the systems is smaller. The Ultra 10 and the PC both suffer from their small L2 cache, but benefit from better memory bandwidth.

A word about parallelism

This study covers exclusively single CPU performance of the machines, although some are "parallel machines":

The PC is a Dual-Pentium, and with the SMP support of Linux, it can be seen as a parallel machine, offering multithreading on shared memory as well as message passing (MPI).

The Swiss T0 prototype is the first of a suite of parallel machines that aims at providing "Teraflops" performance. But, however efficient the interconnection network is, single CPU efficiency is always a prerequisite.

Our results measured on the Indigo2 were matched by those of an SGI Origin2000 server based on the same CPU, up to the improvement due to the bigger L2 cache (4 MB).

Although the HINT benchmark can measure it, parallel performance, including scalability, communication latency and bandwidth, was not addressed at all in this study.

Applications and libraries

Heavy users of CPU power at the DMA have various needs. Some rely on high-level packages (Cplex, Splus, Mathematica...) that may or may not be available on every platform (CPU or OS). Some rely on basic linear algebra (BLAS) and need highly optimized versions of these tools, typically offered by UNIX vendors; the status and degree of optimization of these tools on the various Pentium chips is a key issue for them. Most users develop their own codes in Fortran, C or C++ and need efficient compilers. Traditional UNIX vendors offer solid compilers, optimized for their respective architectures, but Linux usually ships with the generic GNU gcc. More efficient alternatives are available either from third-party compiler vendors or from the upcoming pgcc and egcs projects.

Conclusion

This quick review is intended to help people decide whether a cheap PC can replace a workstation, not just as an X terminal but to actually run CPU-intensive applications. There is still no easy answer, even if, for the first time, the PC outperforms the workstations in some areas.

The enhancements due to the new 100 MHz motherboards and memory result in very efficient systems. The current cache size of 512 KB is too small, but future versions of the Pentium II will accept up to 2 MB.


©2002 ROSO-EPFL, 1015 Lausanne, Jean-Albert Ferrez
update: 27 October 2002