
WARP (systolic array)


The Warp machines were three generations of increasingly general-purpose systolic array processors. Each generation became more general-purpose than the last by increasing memory capacity and loosening the coupling between processors. Only the original WW-Warp forced a truly lock-step sequencing of stages, which severely restricted its programmability but made it, in a sense, the purest “systolic-array” design.

History

The Warp machines were created by Carnegie Mellon University (CMU), in conjunction with industrial partners G.E., Honeywell and Intel, and funded by the U.S. Defense Advanced Research Projects Agency (DARPA).

The Warp projects were started in 1984 by H. T. Kung at Carnegie Mellon University. They yielded research results, publications, and advances in general-purpose systolic hardware design, compiler design, and systolic software algorithms.

A two-cell prototype of WW-Warp was completed at Carnegie Mellon in June 1985. Two essentially identical ten-cell WW-Warp machines were produced in 1986, one by Honeywell and one by G.E., for use at Carnegie Mellon University. The system from G.E. was delivered in February 1986; the system from Honeywell was delivered in June 1986. The first of the significantly redesigned production model, the PC-Warp, was delivered by G.E. in April 1987. About twenty production models of the PC-Warp were produced and sold by G.E. during 1987–1989.

In 1986, Intel was selected, as the result of competitive bidding, to be the industrial partner for the integrated circuit implementation of Warp. The first iWarp system, a 12-node system, became operational in March 1990. After a number of steppings of the part, about 39 machines, each consisting of ten or more C-Step iWarp chips running at 20 MHz, were produced and sold by Intel in 1992 and 1993 to universities, government agencies and industrial research laboratories.

Architecture

There were three distinct machine designs known as the WW-Warp (Wire Wrap Warp), PC-Warp (Printed Circuit Warp), and iWarp (integrated circuit Warp, conveniently also a play on the “i” for Intel).

WW-Warp

WW-Warp forced a truly lock-step sequencing of stages.

Linear array of ten or more programmable processing elements (PEs), each delivering 10 MFLOPS (single precision).
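
As an illustration of that lock-step discipline, here is a minimal simulation sketch (illustrative Python, not Warp microcode or W2): one function stands in for each cell, and on every global tick all cells fire at once, each consuming the value its left neighbor latched on the previous tick.

    # Illustrative only: simulate a lock-step linear pipeline. On each global
    # tick every cell fires at once; inputs are assumed non-None.
    def lockstep_pipeline(stages, inputs):
        latches = [None] * len(stages)        # value at each cell's input
        outputs = []
        for x in list(inputs) + [None] * len(stages):   # extra ticks to drain
            carry = x
            for i, stage in enumerate(stages):
                # Latch the incoming value; emit the result computed from the
                # value that was latched one tick earlier.
                latches[i], carry = carry, (
                    stage(latches[i]) if latches[i] is not None else None
                )
            if carry is not None:
                outputs.append(carry)
        return outputs

    # Ten identical cells, each adding 1: a datum gains 10 crossing the array.
    print(lockstep_pipeline([lambda v: v + 1] * 10, [0, 5]))  # [10, 15]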

PC-Warp

Linear array of ten or more programmable processing elements (PEs), each delivering 10 MFLOPS (single precision).

iWarp

Main article: IWarp

Linear array of ten or more programmable processing elements (PEs), each delivering 20 MFLOPS (single precision).

One PE consists of two main agents, a Computation Agent and a Communication Agent; a sketch of the array-level arithmetic these figures imply follows the list.

  • Computation Agent: This agent is responsible for processing data. It has a processing power of 20 MFLOPS (Millions of Floating-point Operations Per Second) and 20 MIPS (Millions of Instructions Per Second). It has access to local memory with a bandwidth of 160 MBytes/sec.
  • Communication Agent: This agent handles data transfer between this PE and its neighbors. Each of its physical ports has a bandwidth of 40 MBytes/sec.
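
The following back-of-envelope sketch (illustrative Python; the variable names are not from any Warp software) derives array-level figures from the per-PE numbers above for a ten-cell configuration:

    # Illustrative arithmetic only, using the per-PE numbers quoted above.
    cells = 10                    # a minimal iWarp array size
    mflops_per_pe = 20            # computation agent, single precision
    local_mem_mb_s = 160          # local memory bandwidth per PE
    port_mb_s = 40                # per physical port of the communication agent

    peak_mflops = cells * mflops_per_pe               # 200 MFLOPS for ten cells
    bytes_per_flop = local_mem_mb_s / mflops_per_pe   # 8 bytes/FLOP from memory
    ports_matching_memory = local_mem_mb_s / port_mb_s  # 4 ports saturate memory
    print(peak_mflops, bytes_per_flop, ports_matching_memory)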

The iWarp machines were based on a single-chip custom 700,000-transistor microprocessor, designed specifically for the Warp project, that used long-instruction-word (LIW) instructions and tightly integrated communication with the computation processor. The standard iWarp configuration arranged iWarp nodes in a 2m × 2n torus. All iWarp machines included the “backedges” and were therefore tori.
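
To illustrate the wraparound, here is a small sketch (a hypothetical helper, not Intel's routing code) that computes a node's four neighbors; the modular arithmetic supplies the backedges that close each row and column into a torus.

    # Hypothetical helper: neighbors of node (r, c) on a rows x cols torus.
    def torus_neighbors(r, c, rows, cols):
        return [
            ((r - 1) % rows, c),   # north (wraps from row 0 to row rows-1)
            ((r + 1) % rows, c),   # south
            (r, (c - 1) % cols),   # west
            (r, (c + 1) % cols),   # east
        ]

    # A corner node on an 8 x 8 torus reaches the opposite edges directly.
    print(torus_neighbors(0, 0, 8, 8))  # [(7, 0), (1, 0), (0, 7), (0, 1)]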

Applications

Warp machines were attached to UNIX-based Sun workstations, on which all software development for the Warp machines was done.

The originally intended application for Warp machines was low-level computer vision (convolutions, filtering, etc.). The machines then found applications in magnetic resonance image processing, repetitive image texture analysis, and linear algebra.
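
For a flavor of such a kernel, here is a sequential sketch of a 1-D convolution of the kind Warp targeted, written the way the systolic mapping computes it, one kernel weight held per cell (illustrative Python, not W2 code):

    # Illustrative only: 1-D convolution as a linear systolic array computes it.
    def systolic_convolve(signal, kernel):
        out = []
        for i in range(len(signal) - len(kernel) + 1):
            acc = 0.0
            for j, w in enumerate(kernel):    # cell j holds stationary weight w
                acc += w * signal[i + j]      # on Warp these MACs overlap in time
            out.append(acc)
        return out

    # A 3-tap smoothing filter over a short signal.
    print(systolic_convolve([1, 2, 3, 4, 5], [0.25, 0.5, 0.25]))  # [2.0, 3.0, 4.0]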

Neural network

The 10-cell Warp (not iWarp) computer was benchmarked on forward-backward propagation for the NETtalk network. It achieved 16.5 MC/s (million connections per second), meaning that one forward and one backward pass over NETtalk's 18,629 weights takes 18629/(16.5 × 10^6) s ≈ 1.13 ms.

This was an 8x speedup over a backpropagation implementation on the Connection Machine-1, and a 340x speedup over the original implementation on the Ridge 32. When the 10-cell iWarp arrived, the authors ran backpropagation on it with essentially the same implementation; it ran at 36 MC/s, a 760x speedup.
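
The benchmark arithmetic can be checked directly (plain arithmetic on the figures quoted above; nothing Warp-specific):

    weights = 18_629                      # NETtalk connections (weights)
    warp_cps = 16.5e6                     # Warp: connections per second
    iwarp_cps = 36e6                      # iWarp: connections per second

    print(weights / warp_cps)             # ~1.13e-3 s per forward+backward pass
    print(weights / iwarp_cps)            # ~5.2e-4 s on iWarp
    print(iwarp_cps / warp_cps)           # iWarp ~2.2x faster than Warp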

Compiler

A research compiler for a language known as “W2” targeted all three machines; it was the only compiler for the WW-Warp and PC-Warp, and served as an early compiler during development of the iWarp. The production compiler for iWarp was a C and Fortran compiler based on the AT&T pcc compiler for UNIX, ported under contract for Intel and then extensively modified and extended by Intel.

Categories: Parallel processing computers