45-nm Intel Penryn and Nehalem: architectural details
Introduction
One of the most intriguing news topics of this season is Intel's
release of processors built on the 45-nm process technology. As more
details of the new chips come in, our web site has published a series
of articles on the topic. Today, we'll try to bring them together and
review the new features and technologies implemented in the new
processors. We'll recap the theoretical material, to be followed by
practical tests of engineering samples and retail processors.
Microarchitectures of the next few years
First of all, here are Intel's most recent plans for introducing CPU
architectures over the next two years. The new-generation desktop CPUs
codenamed Penryn will be built on an enhanced Intel Core
microarchitecture. Their main distinctions are the move to the 45-nm
process technology and a number of architectural changes, which should
result in better power efficiency, higher clock speeds, and more
instructions executed per cycle.

Once mass production of Penryn chips is established, Intel plans to
present Nehalem processors based on the microarchitecture of the same
name, which will replace Intel Core.
About 2-3 years after the announcement of 45-nm processors, in roughly
2009-2010, Intel hopes to present a new, finer 32-nm process
technology. For now, these plans are still pretty vague: even the
transition to the 45-nm process was accompanied by serious problems
and required entirely new materials (high-k dielectrics and metal
gates). The 32-nm process will bring processors codenamed Westmere,
formerly known as Nehalem-C, which are based on the same Nehalem
microarchitecture.
Two years after the release of Nehalem, the Gesher microarchitecture
will replace it. There is still very little information on it; we only
know that the first Gesher processors will be manufactured on the
32-nm process technology. That is where the forecasts of future
processor development end.
Judging by these plans, Intel is sticking to its established strategy
of replacing the microarchitecture and moving to a new process
technology once every two years. It is hard to tell whether the leader
of the CPU industry will be able to keep up such a fast pace. Intel
calls this product release strategy "tick-tock". Every "tick" marks a
new stage in semiconductor manufacturing technology together with
microarchitecture improvements (e.g., Penryn). Every "tock" marks the
creation of a new microarchitecture (e.g., Nehalem).
Penryn processors in detail
Penryn family processors will appear earlier than Nehalem, so we
start with them. Currently, there are over 15 Penryn family products
under development. Among the first ones, we'll see chips aimed at
various market sectors.
Until recently, the known plans covered a dual-core processor for
notebooks, 2- and 4-core models for desktop PCs, and 2- and 4-core
processors for the server segment. During the Intel Developer Forum in
Beijing, we also found out about the company's plans to release 45-nm
chips for UMPC (Ultra Mobile PC) devices. The new processors will be a
serious contender and may shake the positions of manufacturers such as
AMD, VIA Technologies and others.
The improvements that the new process technology brings are
interesting to look at in quantitative terms. For instance, quad-core
Penryn processors will include about 820 million transistors placed on
two dies of 107 mm² each. For comparison, current quad-core Intel
Kentsfield processors have 582 million transistors, and the die area
of quad-core processors manufactured on the 65-nm process technology
is 143 mm².
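As a rough check on these figures, the transistor density of the two
generations can be worked out directly. The sketch below is only
illustrative: it assumes that both quad-core parts are two-die packages
and that the quoted areas are per die (the text states this only for
Penryn), and the function name is made up for the example.

```python
def density_millions_per_mm2(transistors_millions, die_area_mm2, die_count=2):
    """Transistor density in millions of transistors per mm^2.

    Assumption: the package holds `die_count` identical dies and
    `die_area_mm2` is the area of a single die.
    """
    return transistors_millions / (die_area_mm2 * die_count)


# Figures quoted in the text above.
penryn_45nm = density_millions_per_mm2(820, 107)
kentsfield_65nm = density_millions_per_mm2(582, 143)

# How many times denser the 45-nm part is under these assumptions.
density_gain = penryn_45nm / kentsfield_65nm
```

Under these assumptions the densities come out at roughly 3.8 versus
2.0 million transistors per mm², a gain of a little under 1.9x, which
is close to the (65/45)² ≈ 2.1 that an ideal linear shrink would give.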
The new features of the next generation of processors can be grouped
under Intel's five current technologies: Wide Dynamic Execution,
Advanced Smart Cache, Smart Memory Access, Advanced Digital Media
Boost, and Intelligent Power Capability.
Wide Dynamic Execution provides for executing more instructions per
cycle, which boosts performance and improves power efficiency. Under
this heading, Intel will present an improved, faster divider based on
the radix-16 method, as well as Enhanced Intel Virtualization
Technology. The radix-16 design will substantially reduce the latency
of both integer and floating-point division operations. The diagram
below shows the results, which speak for themselves.
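To see why a higher radix shortens division, recall that a radix-16
divider produces four quotient bits per iteration. The sketch below is
a simplified integer model, not Intel's circuit: real dividers use more
elaborate digit selection, and the radix-4 baseline used for comparison
is an assumption, since the text does not say what the earlier divider
used. All names are illustrative.

```python
def divide_digit_recurrence(dividend, divisor, width=32, bits_per_step=4):
    """Restoring division that retires `bits_per_step` quotient bits per step.

    bits_per_step=4 models radix-16, bits_per_step=2 models radix-4.
    Returns (quotient, remainder, iterations).
    """
    if divisor <= 0 or dividend < 0 or dividend >= (1 << width):
        raise ValueError("operands out of range for this model")

    digit_mask = (1 << bits_per_step) - 1
    iterations = -(-width // bits_per_step)  # ceiling division
    quotient = 0
    remainder = 0

    for step in range(iterations - 1, -1, -1):
        # Bring down the next group of dividend bits.
        chunk = (dividend >> (step * bits_per_step)) & digit_mask
        remainder = (remainder << bits_per_step) | chunk

        # Select one quotient digit in the range 0 .. radix-1.
        # Hardware does this with comparators or a lookup table;
        # repeated subtraction keeps the model easy to follow.
        digit = 0
        while remainder >= divisor:
            remainder -= divisor
            digit += 1

        quotient = (quotient << bits_per_step) | digit

    return quotient, remainder, iterations


q16, r16, steps_radix16 = divide_digit_recurrence(1000003, 97, bits_per_step=4)
q4, r4, steps_radix4 = divide_digit_recurrence(1000003, 97, bits_per_step=2)
```

For a 32-bit operand the radix-16 model needs 8 iterations against 16
for radix-4, while both return the same quotient and remainder.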
The Advanced Smart Cache technology is aimed at higher performance
and cache efficiency. Intel decided to increase the cache size in
Penryn family processors: dual-core processors will be equipped with
an L2 cache of up to 6 MB, whereas some quad-core models will get
12 MB of cache. As for clock speeds, there is mention of breaking the
3 GHz barrier.
For the Smart Memory Access technology, Intel mentions increased bus
bandwidth. FSB speeds as high as 1600 MHz have been confirmed.
Reportedly, the 1600 MHz FSB will appear in some processor models
aimed at servers and workstations; it has not yet been specified when
desktop models with the faster bus will be released.
The Advanced Digital Media Boost technology speeds up the processing
of video, images and speech. To increase performance when handling
media data, Intel decided to add SSE4 (Streaming SIMD Extensions 4) to
the instruction set architecture (ISA); it will be available in most
mainstream desktop PC segments with the advent of 45-nm processors.
The new instruction set includes about 50 new instructions, which fall
into two groups:
- Vectorization primitives for compilers and multimedia
application accelerators;
- String and text data processing accelerators.
We'll dwell on SSE4 in more detail, since this technology is one of
the key innovations. To start with, here are the applications it will
affect: graphics, video encoding and processing, 3D imaging, games,
Web servers, and application servers. According to Intel, the
performance of compute-intensive applications will go up, namely data
mining, database management systems, complex search and
pattern-matching algorithms, algorithms for compression of audio,
video, images and data, parsing and state-machine-based algorithms,
and many others.

According to Intel, SSE4 is the most substantial extension to the
Intel ISA since SSE2. The SSE4 instruction set includes several
vectorization primitives for compilers, which further improve the
performance and efficiency of multimedia applications. There are also
new instructions for string processing.
Another enhancement is the Super Shuffle Engine. The new engine can
shuffle values across the entire 128-bit register in a single cycle.
This substantially raises the performance of shuffle-related
operations (packing, unpacking, shifts of packed values, inserts). The
diagram compares the number of cycles required to execute SSE
operations; on average, we see an almost twofold performance boost.
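For readers unfamiliar with the term, a shuffle rearranges the bytes
of a register according to a control mask. The sketch below models the
result of a byte shuffle on a 128-bit (16-byte) register in plain
Python, following the semantics of the existing PSHUFB instruction. It
says nothing about how the Super Shuffle Engine itself is built, and
the function name is illustrative.

```python
def shuffle_bytes(register, control):
    """Model of a 16-byte shuffle with PSHUFB-style semantics.

    For each output position i:
      - if the top bit of control[i] is set, the output byte is 0;
      - otherwise the low 4 bits of control[i] pick a source byte.
    """
    if len(register) != 16 or len(control) != 16:
        raise ValueError("both operands must be exactly 16 bytes")
    return bytes(0 if c & 0x80 else register[c & 0x0F] for c in control)


reg = bytes(range(16))  # bytes 0, 1, ..., 15

# Reverse the byte order of the whole register.
reversed_reg = shuffle_bytes(reg, bytes(range(15, -1, -1)))

# Broadcast source byte 3 to every position.
broadcast = shuffle_bytes(reg, bytes([3] * 16))
```

Packing, unpacking and inserts can all be seen as special cases of this
kind of permutation, which is why a faster shuffle unit helps all of
them.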
There are also interesting innovations aimed at reducing power
consumption and raising performance per watt. Here Intel presented two
new technologies: Deep Power Down Technology and Enhanced Dynamic
Acceleration Technology.
Deep Power Down Technology will be introduced primarily in processors
for mobile platforms (Mobile Penryn). To reduce power consumption when
idle, one more CPU state has been added, named the Deep Power Down
Technology State, or C6. In this mode the cores are shut down and the
cache memory is disabled completely. This allows a substantial
decrease in core voltage and power consumption, which in turn prolongs
battery life.

Another interesting new feature is Enhanced Dynamic Acceleration
Technology (EDAT). The idea behind it is as follows. For ease of
description, take a dual-core CPU. Since a single-threaded application
makes little use of multiple cores, what matters most is the
performance of a single core. That is why Intel allows the clock speed
of the busy core to be raised while the idle core sits in one of the
idle states C3-C6 and its heat output drops sharply. The busy core
uses the freed-up thermal headroom to raise its clock speed until the
TDP limit is reached. The following illustration shows the idea.
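The mechanism can be sketched as a simple power-budget calculation.
This is a toy model under stated assumptions, not Intel's algorithm:
power is assumed to grow with the cube of the clock (frequency times
voltage squared, with voltage scaling with frequency), the clock moves
in fixed steps, and all wattages and clocks in the example are
hypothetical.

```python
def boosted_clock_mhz(tdp_w, base_mhz, busy_w_at_base, other_core_w,
                      step_mhz=100, max_mhz=None):
    """Highest clock the busy core can reach without exceeding the TDP.

    Toy model (assumption): busy-core power scales with (clock / base)**3.
    `other_core_w` is what the second core draws: a small value when it
    sleeps in C3-C6, or its full active power when it is busy too.
    """
    clock = base_mhz
    while True:
        candidate = clock + step_mhz
        if max_mhz is not None and candidate > max_mhz:
            break
        busy_w = busy_w_at_base * (candidate / base_mhz) ** 3
        if busy_w + other_core_w > tdp_w:
            break  # the next step would break the TDP limit
        clock = candidate
    return clock


# Hypothetical numbers: 35 W TDP, 2000 MHz base clock, 17 W per busy core.
one_core_idle = boosted_clock_mhz(35, 2000, 17, other_core_w=2)
both_cores_busy = boosted_clock_mhz(35, 2000, 17, other_core_w=17)
```

With these made-up figures the busy core gains a few clock steps when
its neighbour sleeps and gains nothing when both cores are loaded,
which is the behaviour described above.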

Now for the TDP levels of 45-nm processors. Unfortunately, there is
still no data on the heat output of mobile chips. Dual-core Penryn for
desktop PCs will fall within the 65 W class, whereas for their
quad-core relatives the TDP will be 95 and 130 W. In the server
segment, the TDP for dual-core Intel Xeon will be 40, 65, and 80 W,
and for quad-core models 50, 80, and 120 W.
According to Intel's in-house tests, the new chips show a 20%
performance boost in gaming applications and a boost of over 40% in
video decoding (with SSE4 enabled). If we compare a server Penryn at
clock speeds over 3 GHz with the most powerful current quad-core Xeon
(Xeon X5355, 2.66 GHz, FSB 1333 MHz), the performance boost in
applications that make intensive use of floating-point operations and
are sensitive to bandwidth will be about 45%.