Cell (microprocessor)
The Cell is a microprocessor jointly developed by IBM, Toshiba and Sony. The Cell architecture is intended to scale from handheld devices to mainframe computers through parallel processing. Sony plans to use the chip in its PlayStation 3 game console, due for release in the second quarter of 2006.
History
In 2000, IBM, Sony Computer Entertainment Inc., and Toshiba Corp. formed an alliance to design and build the processor. Design work began in the design centers in March 2001. The Cell was designed over a period of four years, using enhanced versions of the design tools for the POWER4 processor. Over 400 engineers from the three companies worked together in 10 of IBM's design centers.
On May 17, 2005, Sony Computer Entertainment confirmed the specification of the Cell processor that will ship in the forthcoming PlayStation 3 console. This Cell will have one PPE and seven active SPEs ("Synergistic Processing Elements", see below), with an eighth SPE reserved for redundancy. It will be clocked at 3.2 GHz, although in lab conditions the processor has reportedly been clocked successfully at up to 5.2 GHz. The chips are being fabricated using IBM's 90-nanometre SOI (silicon on insulator) process at its fab in East Fishkill, New York. Full production may switch at some later date to a 65-nm or 45-nm process jointly developed by IBM and Toshiba at their Nagasaki fabrication plant. Sony is currently also using its 90-nm process to produce the integrated GS/EE for the PSX, the Japan-only combination PlayStation 2/DVR unit. (This usage of "PSX" is distinct from the commonly used informal designation of the original PlayStation.)
Open Specs
In May 2005, IBM developers mailed patches for the Cell processor to the Linux kernel mailing list (http://lkml.org/lkml/2005/5/13/217). Arnd Bergmann of IBM will describe and present the Linux-based Cell architecture at LinuxTag 2005 (June 22-25).
Architecture
Power Processor Element
The PPE is based on the POWER Architecture, which is the basis of IBM's existing POWER line and is related to the PowerPC used by Apple Computer and others. The PPE is not the primary processor for the system; it acts as a controller for the eight SPEs, which handle most of the computational workload. It has 32 KB instruction and data Level 1 cache, and 512 KB Level 2 cache.
Synergistic Processing Elements
Each SPE is composed of a "Synergistic Processing Unit" ("SPU") and an SMF unit (DMA, MMU, and bus interface). An SPE is a general-purpose RISC processor with a 128-bit SIMD organization for single- and double-precision instructions. It has 256 KB of high-speed local memory for instructions and data, which is also visible to the PPE so that it can be loaded with data and programs as needed. It has 128 registers of 128 bits each. It measures 14.5 mm² (90 nm process). It also has its own DMA unit, connected to the EIB through an MMU for address translation.
The high-speed local memory is called the 'Local Store'. It services loads and stores, DMA transactions, and instruction fetches into an instruction-line buffer.
In general use, the system will load the SPEs with small programs and chain the SPEs together to handle each step in a complex operation. For instance, a set-top box could load programs for reading a DVD, decoding video and audio, and driving the display, and the data would be passed from SPE to SPE until it finally ends up on the TV. At 4 GHz, each SPE delivers 32 GFLOPS, giving the eight SPEs a combined 256 GFLOPS. The performance of the PPE's VMX unit is unclear, but it should add around 32 GFLOPS to that of the SPEs.
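The per-SPE figure can be reproduced with a short back-of-the-envelope calculation. The sketch below is illustrative only: it assumes that a 128-bit SIMD register holds four single-precision values and that each lane completes one multiply-add (counted as two floating-point operations) per cycle. The function name and parameters are hypothetical.

```python
def peak_gflops(clock_ghz: float, simd_lanes: int = 4, flops_per_lane: int = 2) -> float:
    """Theoretical single-precision peak of one SPE, in GFLOPS.

    Assumptions (not stated in the article): 4 lanes of 32-bit floats per
    128-bit register, and one multiply-add (2 flops) per lane per cycle.
    """
    return clock_ghz * simd_lanes * flops_per_lane


NUM_SPES = 8

per_spe = peak_gflops(4.0)       # one SPE at 4 GHz
all_spes = NUM_SPES * per_spe    # all eight SPEs combined

print(f"per SPE: {per_spe} GFLOPS, eight SPEs: {all_spes} GFLOPS")
```

These are theoretical peaks; sustained performance depends on how well a program keeps every lane busy on every cycle.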
"The SPU is an in-order dual-issue statically scheduled architecture. Two SIMD instructions can be issued per cycle: one compute instruction and one memory operation. The SPU branch architecture does not include dynamic branch prediction, but instead relies on compiler-generated branch prediction using "prepare-to-branch" instructions to redirect instruction prefetch to branch targets."Template:Ref
Element Interconnect Bus
The EIB is the unit that enables communication from one core to another.
Memory controller and I/O
The memory controller, a dual XDR controller, is incorporated in the Cell processor (25.6 GB/s at 3.2 GHz). This replaces the north bridge, as in Athlon 64 processors. The processor also features two reconfigurable I/O interfaces called FlexIO (76.8 GB/s at 6.4 GHz), which eliminate the need for a south bridge.
Broadband Engine
Much less information is available about the 'Broadband Engine'; most of it comes from patent applications. It is believed that the Cell allows multiple processing cores to be put onto one die, and the patent (http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=/netahtml/search-adv.htm&r=1&f=G&l=50&d=PTXT&p=1&p=1&S1=) showed four cores on one die, called the "Broadband Engine", potentially giving over 1 TFLOPS of theoretical performance. The companies designing the chip have said they intend to scale performance for various uses, both low-end and high-end, by varying the number of cores on the chip and the number of units in a single core, and by linking multiple chips to each other via a network or memory bus.
Initial speculation about TFLOPS performance was largely based on claims of a 65 nm SOI process. Though IBM, Sony and Toshiba followed this plan in the beginning, Intel and AMD's renewed interest in multi-core processing and Sony's desire for a first-mover advantage in next-generation gaming consoles may have forced them to go with a 90 nm SOI process very similar to the Intel Prescott core manufacturing process. However, the 'Broadband Engine' integrated into the Cell helps it attain enough bandwidth for a theoretical 1 TFLOPS performance, though real-world systems may rarely reach such a figure.
Similar multiple-core designs include Sun Microsystems' MAJC (pronounced "magic"). The first MAJC chip was originally designed for multimedia processing, although Sun have subsequently repositioned the MAJC chip as a high-end graphics processor for workstations. In addition, Stanford University's Imagine Stream Processor (http://cva.stanford.edu/imagine/project/im_arch.html) shares a similar conceptual underpinning.
Facts
The most commonly cited configuration appears to be:
- 256 GFLOPS in single-precision operations at 4 GHz
- 25-30 GFLOPS in double-precision operations at 4 GHz (probably 26)
- 234 million transistors
- 1 PPE with 32 KB instruction and data Level 1 cache, and 512 KB Level 2 cache
- 8 SPEs, each with 256 KB of local memory for instructions and data
- 0.9-1.3 V nominal supply voltage
- 10 digital thermal sensors
- 5 power management states (Dynamic Power Management)
- 221 square millimeter die (90 nm process)
- Power consumption is not yet known; one source speaks of 30 W, another of 50-80 W or more.
Architecture compared
In some ways the Cell system resembles early Seymour Cray designs in reverse. The famed CDC 6600 used a single very fast processor to handle the mathematical calculations, while a series of ten slower systems were given smaller programs to keep the main memory fed with data. In the Cell the problem has been reversed: reading the data is no longer the difficult part. Because of the complex encodings used in industry, the problem today is efficiently decoding that data into an ever-less-compressed version as quickly as possible.
In other ways the Cell resembles a modern desktop computer on a single chip.
Modern graphics cards have multiple elements very similar to the SPEs, known as vertex shader units, with an attached high-speed memory. Programs, known as shaders, are loaded onto the units to process the basic geometry fed from the computer's CPU, apply styles, and display it.
The main differences are that the Cell's SPEs are much more general purpose than shader units, and the ability to chain the SPEs under program control offers considerably more flexibility, allowing the Cell to handle graphics, sound, or anything else.
Devices
Blade server
IBM has already presented a blade server prototype based on two Cell processors, running Linux kernel 2.6.11. The processors ran at 2.4-2.8 GHz. IBM expects to run them at 3 GHz, giving 200 GFLOPS per CPU (or 400 GFLOPS per board), and to put seven boards in a single rack for a total performance of 2.8 TFLOPS. This is equivalent to the 70th-ranked supercomputer in the TOP500 list as of 11/2004 (http://www.top500.org/lists/plists.php?Y=2004&M=11), or the 125th as of 06/2005 (http://www.top500.org/lists/plists.php?TB=2&M=06&Y=2005). However, those supercomputers use between 600 and 1000 CPUs.
IBM probably plans to build 16 TFLOPS racks. Sixty-four such racks would deliver about 1 PFLOPS (a million GFLOPS).
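The rack-level figures follow from simple multiplication. The sketch below only restates the numbers quoted in this section (200 GFLOPS per CPU, two CPUs per board, seven boards per rack, and 16 TFLOPS racks). The constant and function names are hypothetical.

```python
GFLOPS_PER_CPU = 200        # expected per-CPU figure at 3 GHz, as quoted above
CPUS_PER_BOARD = 2
BOARDS_PER_RACK = 7


def rack_gflops(gflops_per_cpu: int, cpus_per_board: int, boards_per_rack: int) -> int:
    """Aggregate theoretical performance of one rack, in GFLOPS."""
    return gflops_per_cpu * cpus_per_board * boards_per_rack


board = GFLOPS_PER_CPU * CPUS_PER_BOARD                               # 400 GFLOPS per board
rack = rack_gflops(GFLOPS_PER_CPU, CPUS_PER_BOARD, BOARDS_PER_RACK)   # 2800 GFLOPS = 2.8 TFLOPS

# Scaling the 16 TFLOPS rack mentioned above to 64 racks.
PLANNED_RACK_TFLOPS = 16
RACKS = 64
cluster_tflops = PLANNED_RACK_TFLOPS * RACKS                          # 1024 TFLOPS, about 1 PFLOPS

print(board, rack, cluster_tflops)
```

The result for 64 racks, 1024 TFLOPS, is slightly more than the rounded figure of 1 PFLOPS.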
Video Games
Sony's PlayStation 3 video game console will use a 3.2 GHz Cell processor, providing 218 GFLOPS.
Home Cinema
Toshiba will probably manufacture HDTVs using this technology. It has already presented a system that decodes 48 MPEG-2 streams simultaneously on a 1920x1080 screen.
Software engineering
The PPE is the conductor; the SPEs are the orchestra. The PPE should be used to control synchronization, perform random access to memory, communicate with devices, and run the operating system. SPEs should be used to execute repetitive tasks with limited memory access. Due to the flexible nature of the Cell, there are several ways to use it:
Job queue
The PPE maintains the job queue, schedules jobs on the SPEs, and monitors progress. Each SPE runs a mini kernel whose role is to get a job, execute it, and synchronize with the PPE. (More here (http://www.research.scea.com/research/html/CellGDC05/26.html))
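The job-queue model can be illustrated with ordinary threads. This is a minimal sketch, not Cell code: Python threads stand in for the SPEs, the calling thread plays the PPE, and all names (`spe_mini_kernel`, `run_job_queue`) are hypothetical.

```python
import queue
import threading


def spe_mini_kernel(spe_id, jobs, results):
    """Loop run by each simulated SPE: get a job, execute it, report back."""
    while True:
        job = jobs.get()
        if job is None:              # sentinel from the "PPE": no more work
            jobs.task_done()
            break
        job_id, func, arg = job
        results.put((job_id, spe_id, func(arg)))
        jobs.task_done()


def run_job_queue(work, num_spes=8):
    """Simulated PPE: owns the queue, hands out jobs, collects the results."""
    jobs = queue.Queue()
    results = queue.Queue()
    workers = [
        threading.Thread(target=spe_mini_kernel, args=(i, jobs, results))
        for i in range(num_spes)
    ]
    for w in workers:
        w.start()
    for job_id, (func, arg) in enumerate(work):
        jobs.put((job_id, func, arg))
    for _ in workers:                # one sentinel per worker
        jobs.put(None)
    jobs.join()                      # "monitor progress": wait for completion
    for w in workers:
        w.join()
    finished = {}
    while not results.empty():
        job_id, _spe_id, value = results.get()
        finished[job_id] = value
    return [finished[i] for i in range(len(work))]


def square(x):
    return x * x


squares = run_job_queue([(square, n) for n in range(10)])
```

Results are reordered by job id because the workers may finish in any order.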
Self-multitasking of SPEs
The kernel and scheduling are distributed across the SPEs. Tasks are synchronized using mutexes or semaphores, as in a conventional operating system. Ready-to-run tasks are either being run by an SPE or held in a waiting queue; other tasks wait. Tasks are kept in shared memory. This maximizes the utilization of the SPEs, and the PPE has nothing to do. (More here (http://www.research.scea.com/research/html/CellGDC05/37.html))
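In contrast to the job queue, here no central scheduler hands out work. The sketch below shows one way to picture it, under stated assumptions: threads stand in for SPEs, a lock-protected deque stands in for the task list in shared memory, and each worker claims its own next task. The class and function names are hypothetical.

```python
import threading
from collections import deque


class SharedTaskPool:
    """Ready-to-run tasks kept in memory shared by all simulated SPEs."""

    def __init__(self, tasks):
        self._lock = threading.Lock()    # mutex guarding the shared state
        self._ready = deque(tasks)
        self.results = []

    def take(self):
        """Claim the next ready task, or return None if none are left."""
        with self._lock:
            return self._ready.popleft() if self._ready else None

    def record(self, value):
        with self._lock:
            self.results.append(value)


def spe_worker(pool):
    """Each simulated SPE schedules itself: claim a task, run it, repeat."""
    while True:
        task = pool.take()
        if task is None:
            break
        func, arg = task
        pool.record(func(arg))


def run_self_scheduled(tasks, num_spes=8):
    pool = SharedTaskPool(tasks)
    workers = [threading.Thread(target=spe_worker, args=(pool,)) for _ in range(num_spes)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    # Completion order is nondeterministic, so sort for a stable result.
    return sorted(pool.results)


magnitudes = run_self_scheduled([(abs, -n) for n in range(20)])
```

The launching thread only starts the workers and waits for them; it takes no part in scheduling, which mirrors the idle PPE described above.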
Stream processing
Each SPE runs a program. Data comes from an input stream and is sent to the SPEs. When an SPE has finished processing, the output data is sent to the output stream. (More here (http://www.research.scea.com/research/html/CellGDC05/41.html))
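A stream-processing arrangement can be sketched as a chain of stages connected by queues. This is an illustration under assumptions, not Cell code: each thread stands in for an SPE running one program, and the linear chain is only one possible layout. The names `spe_stage` and `run_stream` are hypothetical.

```python
import queue
import threading

_END = object()   # marker for the end of the stream


def spe_stage(program, inbox, outbox):
    """One simulated SPE: apply its program to each item and pass it on."""
    while True:
        item = inbox.get()
        if item is _END:
            outbox.put(_END)         # propagate end-of-stream downstream
            break
        outbox.put(program(item))


def run_stream(programs, input_stream):
    """Chain one stage per program and push the input stream through."""
    queues = [queue.Queue() for _ in range(len(programs) + 1)]
    stages = [
        threading.Thread(target=spe_stage, args=(prog, queues[i], queues[i + 1]))
        for i, prog in enumerate(programs)
    ]
    for s in stages:
        s.start()
    for item in input_stream:
        queues[0].put(item)
    queues[0].put(_END)
    output = []
    while True:
        item = queues[-1].get()
        if item is _END:
            break
        output.append(item)
    for s in stages:
        s.join()
    return output


def add_one(x):
    return x + 1


def double(x):
    return x * 2


stream_out = run_stream([add_one, double], [1, 2, 3])
```

Each stage is a single thread reading from a first-in, first-out queue, so items leave the pipeline in the order they entered.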
Software development
Both the PPE and the SPEs are programmable in C/C++ using a common API provided by libraries. No assembly is required to access SIMD instructions, because the compiler provides built-in functions. A compiler, debugger, IDE, performance analyzer, and Cell emulator should be made available.
Acronyms
- EIB: Element Interconnect Bus
- LS: Local Store (SPE's local memory)
- MIC: Memory Interface Controller
- PPE: Power Processor Element
- SPE: Synergistic Processing Element
- SPU: Synergistic Processing Unit
- STI: Sony Computer Entertainment Inc., Toshiba Corp., IBM
External links
- IBM Research Labs (http://www.research.ibm.com/cell/)
- Power.org Community (http://www.power.org)
- Sony, IBM, and Toshiba announce Cell development (http://www-306.ibm.com/chips/news/2001/0312_sony-toshiba.html)
- Patent #6,526,491 (related to the Cell processor) (http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=/netahtml/search-adv.htm&r=1&f=G&l=50&d=PTXT&p=1&p=1&S1=((Sony+AND+PE)+AND+APU)&OS=Sony+AND+PE+AN%20D+APU&RS=((Sony+AND+PE)+AND+APU))
- EE Times article on ISSCC paper presentation (http://www.eet.com/semi/news/showArticle.jhtml?articleId=54200580)
- Sony/Toshiba Press Release on Cell Production (http://www.scei.co.jp/corporate/release/pdf/041129ae.pdf)
- Sony PR on one-rack 16 TFLOP workstation (http://www.scei.co.jp/corporate/release/pdf/041129be.pdf)
- Link to image of ISSCC presentation abstract for 90nm process (http://pcweb.mycom.co.jp/news/2004/11/29/011bl.jpg)
- Technical details of the Cell Architecture (presented at the ISSCC 2005) (http://www.realworldtech.com/page.cfm?ArticleID=RWT021005084318)
- In-depth look at the architecture (http://www.blachford.info/computer/Cells/Cell0.html)
- Criticism of "In-depth look ..." (http://arstechnica.com/news.ars/post/20050124-4551.html)
- Introducing the Cell Processor (Part I) (http://arstechnica.com/articles/paedia/cpu/cell-1.ars) - Jon "Hannibal" Stokes on Ars Technica
- Introducing the Cell Processor (Part II) (http://arstechnica.com/articles/paedia/cpu/cell-2.ars) - Jon "Hannibal" Stokes on Ars Technica
- IBM/Sony/Toshiba PR on key details of the Cell Chip (http://www-1.ibm.com/press/PressServletForm.wss?MenuChoice=pressreleases&TemplateName=ShowPressReleaseTemplate&SelectString=t1.docunid=7502&TableName=DataheadApplicationClass&SESSIONKEY=any&WindowTitle=Press+Release&STATUS=publish)
- Site offering news and info on the Cell processor (http://cell.raw.net)
- "PlayStation 3 chip has split personality" (http://news.com.com/PlayStation+3+chip+has+split+personality/2100-1043_3-5566340.html?tag=nl) – By David Becker, CNET News.com, 7 Feb 2005
- "It's the Software, Stupid!" (http://www.pbs.org/cringely/pulpit/pulpit20050217.html) - Robert X. Cringely piece about why software is key to the Cell success.
- "Because It's an Once in a Lifetime Challenge" (http://techon.nikkeibp.co.jp/english/NEWS_EN/20050407/103542) - Ken Kutaragi