Presentation of Intel Sandy Bridge processors: model range and architectural features. New Intel Turbo Boost Mode

A few years ago, during the reign of the Pentium brand and the first appearance of the Intel Core trademark and the microarchitecture of the same name, the next generation of Intel microarchitecture was first mentioned on slides about future processors under the working name Gesher ("bridge" in Hebrew), which later transformed into Sandy Bridge.

In those distant times of NetBurst processor dominance, the contours of the future Nehalem cores were just beginning to emerge, and we were getting acquainted with the internal structure of the first representatives of the Core microarchitecture: Conroe for desktop systems, Merom for mobile, and Woodcrest for servers...

In a word, back when the grass was greener and Sandy Bridge was still a distant prospect, Intel representatives were already saying that it would be a completely new processor microarchitecture. That is roughly how you might imagine the mysterious Haswell microarchitecture today, which will appear after the Ivy Bridge generation, which in turn will replace Sandy Bridge next year.

However, the closer the release date of a new microarchitecture, the more we learn about its features, the more noticeable the similarities between neighboring generations become, and the more obvious the evolutionary path of changes in processor design. Indeed, while there is a real abyss of differences between the initial incarnations of the first Core architecture, Merom/Conroe, and the firstborn of the second Core generation, Sandy Bridge, the latest version of the current Core generation, the Westmere core, and the first representative of Core II considered today, the Sandy Bridge core, may seem similar.

And yet the differences are significant. So significant that now we can finally talk about the end of the 15-year era of the P6 microarchitecture (Pentium Pro) and the emergence of a new generation of Intel microarchitecture.

Sandy Bridge microarchitecture: bird's eye view

The Sandy Bridge chip is a quad-core 64-bit out-of-order processor with support for two threads per core (Hyper-Threading) and four instructions per clock; with an integrated graphics core and an integrated DDR3 memory controller; with a new ring bus and support for 3- and 4-operand (128/256-bit) AVX (Advanced Vector Extensions) vector instructions; manufactured on Intel's modern 32 nm process.

That, in short, is how one might try to characterize in a single sentence the new generation of Intel Core II processors for mobile and desktop systems, mass shipments of which will begin in the very near future.

Intel Core II processors based on the Sandy Bridge microarchitecture will ship in a new 1155-pin LGA1155 package for new motherboards based on Intel 6 Series chipsets.

Approximately the same microarchitecture will be used in Intel Sandy Bridge-EP server solutions, with the natural differences of a larger number of processor cores (up to eight), the corresponding LGA2011 processor socket, a larger L3 cache, an increased number of DDR3 memory controllers, and PCI Express 3.0 support.

The previous generation, the Westmere microarchitecture represented by Arrandale and Clarkdale for mobile and desktop systems, is a two-die design: a 32 nm processor die and an additional 45 nm "coprocessor" with a graphics core and memory controller on board, placed on a single substrate and exchanging data via the QPI bus. In effect, at that stage Intel engineers, relying mainly on previous developments, created a kind of integrated hybrid chip.

When creating the Sandy Bridge architecture, the developers completed the integration process begun with Arrandale/Clarkdale and placed all the elements on a single 32 nm die, abandoning the classic QPI bus in favor of a new ring bus. At the same time, the essence of the Sandy Bridge microarchitecture remained within Intel's previous ideology, which relies on increasing overall processor performance by improving the "individual" efficiency of each core.

The structure of the Sandy Bridge chip can be divided into the following main elements: processor cores, graphics core, L3 cache memory, and the so-called System Agent.

In general, the structure of the Sandy Bridge microarchitecture is clear. Our task today is to find out the purpose and features of the implementation of each of the elements of this structure.

Ring bus (Ring Interconnect)

The whole history of Intel processor microarchitecture modernization in recent years is inextricably linked with the sequential integration into a single chip of an increasing number of modules and functions that were previously located outside the processor: in the chipset, on the motherboard, and so on. Accordingly, as processor performance and the degree of chip integration increased, the bandwidth requirements for internal interconnect buses grew at an even faster pace. For a while, even after a graphics die was added in the Arrandale/Clarkdale design, it was possible to make do with intercomponent buses of the usual cross topology; that was enough.

However, the efficiency of such a topology is high only with a small number of components participating in the data exchange. In the Sandy Bridge microarchitecture, to improve the overall performance of the system, the developers decided to turn to a ring topology for the 256-bit interconnect bus, based on a new version of QPI (QuickPath Interconnect) technology: expanded, refined, and first implemented in the server Nehalem-EX chip (Xeon 7500), and also planned for use in the Larrabee chip architecture.

The ring bus in the desktop and mobile (Core II) version of Sandy Bridge is used to exchange data between six key components of the chip: four x86 processor cores, the graphics core, the L3 cache, and the system agent. The bus consists of four 32-byte rings: a data ring (Data Ring), a request ring (Request Ring), a snoop ring (Snoop Ring), and an acknowledge ring (Acknowledge Ring); in practice this allows access to the 64-byte last-level cache interface to be split into two packets. The buses are managed by a distributed arbitration protocol, and requests are pipelined at the clock frequency of the processor cores, which gives the architecture additional flexibility during overclocking. Ring bus performance is rated at 96 GB/s per connection at 3 GHz, effectively four times faster than in previous-generation Intel processors.
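The quoted 96 GB/s figure is easy to sanity-check: a sketch assuming (as the text implies, though Intel does not spell it out) that it is simply the 32-byte ring width multiplied by a 3 GHz core clock.

```python
# Back-of-the-envelope check of the quoted ring-bus figure.
# Assumption (not stated explicitly in the article): "96 GB/s per
# connection" is the 32-byte data-ring width times the core clock.

RING_WIDTH_BYTES = 32          # width of the Data Ring
CLOCK_HZ = 3_000_000_000       # the bus runs at the core clock, here 3 GHz

bandwidth_gb_s = RING_WIDTH_BYTES * CLOCK_HZ / 1e9
print(bandwidth_gb_s)          # 96.0, matching the quoted figure
```

Since the ring runs at core frequency, overclocking the cores raises interconnect bandwidth proportionally, which is the "additional flexibility" mentioned above.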

The ring topology and bus organization ensure minimal latency in request processing, maximum performance, and excellent scalability of the technology for chip versions with different numbers of cores and other components. According to company representatives, in the future up to 20 processor cores per die can be "connected" to the ring bus, and such a redesign, as you understand, can be done very quickly, as a flexible and prompt response to current market needs. In addition, the ring bus is physically located directly above the L3 cache blocks in the upper metallization layer, which simplifies the design layout and allows the chip to be made more compact.

L3: last level cache, LLC

As you may have noticed, on Intel slides the L3 cache is referred to as the LLC, that is, Last Level Cache. In the Sandy Bridge microarchitecture the L3 cache is shared not only among the four processor cores but, thanks to the ring bus, also with the graphics core and the system agent, which among other things includes the hardware graphics acceleration module and the video output block. A special tracking mechanism prevents access conflicts between the processor cores and the graphics.

Each of the four processor cores has direct access to "its own" segment of the L3 cache, and each segment provides half of its bus width for access to the ring data bus, while the physical addressing of all four cache segments is handled by a single hash function. Each L3 cache segment has its own independent ring bus access controller, which is responsible for processing requests for the allocation of physical addresses. In addition, the cache controller constantly interacts with the system agent regarding L3 misses, control of inter-component data exchange, and non-cacheable accesses.
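The idea of a single hash function mapping physical addresses to cache segments can be illustrated with a toy sketch. Intel has not published the real LLC hash, so the function below is purely hypothetical; it only shows how a fixed hash spreads consecutive cache lines across the four slices.

```python
# Illustrative sketch only: the real Sandy Bridge slice hash is not
# public, so this toy function just demonstrates the concept of
# mapping a physical address to one of four L3 slices.

NUM_SLICES = 4
LINE_SIZE = 64  # bytes per cache line

def slice_for_address(phys_addr: int) -> int:
    """Pick an L3 slice by hashing the cache-line address (toy hash)."""
    line = phys_addr // LINE_SIZE
    # XOR-fold higher address bits so neighbouring lines are spread
    # across slices instead of clustering on one controller.
    h = line ^ (line >> 2) ^ (line >> 7)
    return h % NUM_SLICES

# Four consecutive cache lines land on four different slices:
print([slice_for_address(a) for a in range(0, 4 * LINE_SIZE, LINE_SIZE)])
```

Because every core uses the same hash, any core can locate any line's home slice without consulting a directory; the ring then carries the request to that slice's controller.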

Additional details about the structure and functioning of the L3 cache memory of Sandy Bridge processors will appear later in the text, in the process of getting acquainted with the microarchitecture, as necessary.

System agent: DDR3 memory controller, PCU, and more

Previously, instead of the term "system agent", Intel's terminology featured the so-called Uncore, that is, "everything that is not the Core": the L3 cache, graphics, memory controller, and other controllers such as PCI Express. Out of habit, we often called most of these elements the north bridge, transferred from the chipset into the processor.

The system agent of the Sandy Bridge microarchitecture includes a DDR3 memory controller, a power control unit (Power Control Unit, PCU), PCI Express 2.0 and DMI controllers, a video output unit, and so on. Like all other elements of the architecture, the system agent is connected to the overall system via the high-performance ring bus.

The standard version of the Sandy Bridge system agent provides 16 PCI-E 2.0 lanes, which can also be split into two 8-lane PCI-E 2.0 links, or one 8-lane link plus two 4-lane links. The dual-channel DDR3 memory controller has now "returned" to the processor die (in Clarkdale chips it was located outside it) and will most likely provide noticeably lower latency.
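The lane-splitting options above form a small fixed set, which a minimal sketch can make explicit. This is not a real firmware interface, just a check of a requested slot layout against the configurations the text describes.

```python
# Hypothetical helper, not a real API: validates a requested slot
# layout against the lane splits Sandy Bridge offers for its
# 16 PCI-E 2.0 lanes, as described in the text.

ALLOWED_SPLITS = {(16,), (8, 8), (8, 4, 4)}

def is_valid_split(lanes_per_slot) -> bool:
    """True if the per-slot lane counts match an offered configuration."""
    return tuple(lanes_per_slot) in ALLOWED_SPLITS

print(is_valid_split((8, 8)))        # dual-GPU layout: allowed
print(is_valid_split((4, 4, 4, 4)))  # four x4 slots: not offered
```

Note that the total is always 16: bifurcation redistributes the CPU's lanes, it never adds any.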

The fact that the memory controller in Sandy Bridge has become dual-channel is unlikely to please those who have already spent good money on overclocker kits of triple-channel DDR3 memory. Well, it happens: kits of one, two, or four modules will now be relevant.

We have some thoughts on the return to the dual-channel memory controller scheme. Perhaps Intel has begun preparing its microarchitectures to work with DDR4 memory, which, due to the move from a "star" topology to a point-to-point topology, will by definition be dual-channel in desktop and mobile versions (servers will use special multiplexer modules)? However, these are just guesses; there is still not enough information about the DDR4 standard itself for confident assumptions.

The power control unit located in the system agent is responsible for timely dynamic scaling of the supply voltages and clock frequencies of the processor cores, graphics core, caches, memory controller, and interfaces. It is especially important to emphasize that power and clock frequency are managed independently for the processor cores and the graphics core.

A completely new version of Turbo Boost technology has been implemented, in no small part thanks to this power control unit. Depending on the current state of the system and the complexity of the task at hand, the Sandy Bridge microarchitecture allows Turbo Boost to "overclock" the processor cores and integrated graphics to a level significantly exceeding the TDP, for a fairly long time. Indeed, why not take advantage of this opportunity regularly while the cooling system is still cold and can remove more heat than when it has already warmed up?
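The "cold heatsink" idea can be modelled as a thermal budget that is spent while the chip draws more than its rated TDP. All the numbers below are invented for illustration; Intel's real controller is far more sophisticated, but the sketch captures the principle of a time-limited excursion above TDP.

```python
# Toy model of the Turbo Boost 2.0 idea: boost above TDP while a
# "thermal budget" lasts, then fall back. All numbers are invented
# for illustration and do not come from Intel documentation.

TDP_W = 95.0                      # assumed sustained power rating
BUDGET_J = (130.0 - 95.0) * 25    # headroom sized so 130 W lasts ~25 s

def simulate(power_w: float, seconds: int) -> list:
    """Return per-second power draw: turbo until the budget runs out."""
    budget = BUDGET_J
    trace = []
    for _ in range(seconds):
        excess = power_w - TDP_W
        if excess > 0 and budget >= excess:
            budget -= excess              # spend headroom while boosting
            trace.append(power_w)
        else:
            trace.append(min(power_w, TDP_W))  # sustained operation
    return trace

trace = simulate(power_w=130.0, seconds=80)
print(trace.count(130.0))  # 25 seconds spent above TDP, then throttling
```

A better cooler effectively enlarges the budget, which is exactly the lever the article suggests motherboard makers might expose.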

Besides the fact that Turbo Boost technology now allows all four cores to be regularly "overclocked" beyond the TDP limits, it is also worth noting that in the Arrandale/Clarkdale chips, where the graphics were in effect only built in rather than fully integrated into the processor, performance and thermal management of the graphics core was handled by the driver. Now, in the Sandy Bridge architecture, this task is also assigned to the PCU. Such tight integration of the voltage and frequency control system made it possible to implement much more aggressive Turbo Boost scenarios, in which both the graphics and all four processor cores can, if necessary and under certain conditions, simultaneously run at increased clock frequencies, significantly exceeding the TDP, but without any side effects.

The workings of the new version of Turbo Boost implemented in Sandy Bridge processors were perfectly described in a multimedia presentation shown in September at the Intel Developer Forum in San Francisco. The video below, recorded at that moment of the presentation, will tell you about Turbo Boost faster and better than any retelling.

How effectively this technology will work in mass-produced processors remains to be seen, but what Intel specialists showed during the closed demonstration of Sandy Bridge capabilities at IDF in San Francisco is simply amazing: the simultaneous increase in clock frequency, and accordingly in the performance of the processor and graphics, can reach simply fantastic levels.

There is information that for standard cooling systems this kind of Turbo Boost "overclocking" beyond the TDP will be limited in the BIOS to a period of 25 seconds. But what if motherboard manufacturers can guarantee better heat removal with some exotic cooling system? This is where room opens up for overclockers...

Each of the four Sandy Bridge cores can be independently put into a low-power mode when needed, and the graphics core can also be switched to a very low-power mode. The ring bus and L3 cache, being shared among the other resources, cannot be disabled; however, the ring bus has a special economical standby mode when it is not loaded, and the L3 cache uses the technology, already familiar from previous microarchitectures, of powering down unused transistors. Thus, Sandy Bridge processors provide mobile PCs with long battery life.

The video output and multimedia hardware decoding modules are also among the elements of the system agent. Unlike its predecessors, where hardware decoding was assigned to the graphics core (we will talk about its capabilities next time), the new architecture uses a separate, much more capable and economical module to decode multimedia streams; only when encoding (compressing) multimedia data are the shader units of the graphics core and the L3 cache used.

In line with current trends, 3D content playback tools are provided: the Sandy Bridge hardware decoding module can easily process two independent MPEG2, VC1, or AVC streams in Full HD resolution at once.

Today we got acquainted with the structure of the new generation of the Intel Core II microarchitecture, working title Sandy Bridge, and examined the structure and principles of operation of a number of its key elements: the ring bus, the L3 cache, and the system agent, which includes the DDR3 memory controller, the power control unit, and other components.

However, this is only a small part of the new technologies and ideas implemented in the Sandy Bridge microarchitecture; no less impressive and large-scale changes have affected the architecture of the processor cores and the integrated graphics system. So our story about Sandy Bridge does not end there - to be continued.


1. Microarchitecture of Sandy Bridge: briefly

The Sandy Bridge chip is a quad-core 64-bit processor with:
● out-of-order execution;
● support for two threads per core (HT);
● execution of four instructions per clock;
● an integrated graphics core and an integrated DDR3 memory controller;
● a new ring bus;
● support for 3- and 4-operand (128/256-bit) AVX (Advanced Vector Extensions) vector instructions;
manufactured on Intel's 32 nm process.

That, in one sentence, describes the new generation of Intel Core II processors for mobile and desktop systems, shipping since 2011.

Intel Core II processors based on the Sandy Bridge microarchitecture come in a new 1155-pin LGA1155 package for new motherboards based on Intel 6 Series chipsets (Intel B65 Express, H61 Express, H67 Express, P67 Express, Q65 Express, Q67 Express, Z68 Express, Z77).


Approximately the same microarchitecture is used in Intel Sandy Bridge-E server solutions, with differences in the form of a larger number of processor cores (up to 8), the LGA2011 processor socket, more L3 cache, more DDR3 memory controllers, and PCI-Express 3.0 support.

The previous generation, the Westmere microarchitecture, was a two-die design: ● a 32 nm processor die and ● an additional 45 nm "coprocessor" with a graphics core and memory controller on board, placed on a single substrate and exchanging data via the QPI bus, i.e. an integrated hybrid chip.

When creating the Sandy Bridge microarchitecture, the developers placed all the elements on a single 32 nm die, abandoning the classic QPI bus in favor of the new ring bus.

The essence of the Sandy Bridge architecture has remained the same - a bet on increasing the overall performance of the processor by improving the "individual" efficiency of each core.



The structure of the Sandy Bridge chip can be divided into the following main elements: ■ processor cores, ■ graphics core, ■ L3 cache, and ■ System Agent. Let us describe the purpose and implementation features of each of these elements.

The entire history of Intel processor microarchitecture upgrades in recent years is connected with the sequential integration into a single die of an increasing number of modules and functions that were previously located outside the processor: in the chipset, on the motherboard, and so on. As processor performance and the degree of chip integration increased, the bandwidth requirements of the internal intercomponent buses grew at an even faster pace. Previously, intercomponent buses with a cross topology sufficed.

However, the efficiency of such a topology is high only with a small number of components participating in the data exchange. In Sandy Bridge, to improve overall system performance, the developers turned to a ring topology for the 256-bit interconnect bus, based on a new version of QPI (QuickPath Interconnect).

The bus is used for data exchange between the chip components:


● 4 x86 processor cores,

● the graphics core,

● the L3 cache, and

● the system agent.


The bus consists of 4 32-byte rings:

■ a data ring (Data Ring), ■ a request ring (Request Ring),

■ a snoop ring (Snoop Ring), and ■ an acknowledge ring (Acknowledge Ring).


The buses are managed by a distributed arbitration protocol, and pipelined request processing occurs at the clock frequency of the processor cores, which gives the microarchitecture additional flexibility during overclocking. Bus performance is rated at 96 GB/s per connection at a 3 GHz clock, 4 times higher than in the previous generation of Intel processors.

The ring topology and bus organization provide ● low latency in request processing, ● maximum performance, and ● excellent scalability of the technology for chip versions with different numbers of cores and other components.

In the future, up to 20 processor cores per die can be "connected" to the ring bus, and such a redesign can be done very quickly, as a flexible and prompt response to current market needs.

In addition, the ring bus is physically located directly above the L3 cache blocks in the upper metallization layer, which simplifies the design layout and allows the chip to be made more compact.

These days Intel is introducing to the world the long-awaited Sandy Bridge processors, whose architecture was previously christened revolutionary. But the processors are not the only novelties: all the related components of the new desktop and mobile platforms are new as well.

So, this week as many as 29 new processors, 10 chipsets, and 4 wireless adapters for laptops and for desktop work and gaming computers were announced.

Mobile innovations include:

    Intel Core i7-2920XM, Core i7-2820QM, Core i7-2720QM, Core i7-2630QM, Core i7-2620M, Core i7-2649M, Core i7-2629M, Core i7-2657M, Core i7-2617M, Core i5-2540M, Core i5-2520M, Core i5-2410M, Core i5-2537M, Core i3-2310M;

    chipsets Intel QS67, QM67, HM67, HM65, UM67 Express;

    wireless network controllers Intel Centrino Advanced-N + WiMAX 6150, Centrino Advanced-N 6230, Centrino Advanced-N 6205, Centrino Wireless-N 1030.

In the desktop segment will appear:

    processors Intel Core i7-2600K, Core i7-2600S, Core i7-2600, Core i5-2500K, Core i5-2500S, Core i5-2500T, Core i5-2500, Core i5-2400, Core i5-2400S, Core i5-2390T, Core i5-2300;

    Intel P67, H67, Q67, Q65, B65 Express chipsets.

But it is worth noting right away that the new platform announcement does not cover all processor and chipset models at once: since the beginning of January only mainstream solutions are available, and most of the more affordable mass-market models will go on sale a little later. Along with the release of Sandy Bridge desktop processors, a new processor socket, LGA 1155, is being introduced for them. Thus, the new products do not complement the Intel Core i3/i5/i7 lineup but replace the LGA 1156 processors, most of which now become rather unpromising purchases, because in the near future their production should cease altogether. Only for enthusiasts does Intel promise to continue producing the older quad-core models based on the Lynnfield core until the end of the year.

However, judging by the roadmap, the long-lived Socket T (LGA 775) platform will remain relevant at least until the middle of the year as the basis for entry-level systems. For the most powerful gaming systems and true enthusiasts, processors based on the Bloomfield core for the LGA 1366 socket will remain relevant until the end of the year. As you can see, the life cycle of dual-core processors with "integrated" graphics on the Clarkdale core turned out to be very short, only one year, but they paved the way for the Sandy Bridge presented today, accustoming consumers to the idea that not only a memory controller but also a video core can be integrated into the processor. Now it is time not just to release faster versions of such processors, but to seriously upgrade the architecture to provide a noticeable increase in their efficiency.

The key features of the Sandy Bridge architecture processors are:

    release in compliance with the 32 nm process technology;

    noticeably increased energy efficiency;

    optimized Intel Turbo Boost technology and support for Intel Hyper-Threading;

    a significant increase in the performance of the integrated graphics core;

    implementation of the new Intel Advanced Vector Extensions (AVX) instruction set to speed up floating-point processing.

But all the above innovations would not justify talk of a truly new architecture if they were not now implemented within a single die, unlike in processors based on the Clarkdale core.

Naturally, for all processor units to work in concert, fast information exchange between them had to be organized; an important architectural innovation here is the Ring Interconnect bus.

The Ring Interconnect links the L3 cache, now called LLC (Last Level Cache), the processor cores, the graphics core, and the System Agent, which includes the memory controller, the PCI Express bus controller, the DMI controller, the power management module, and other controllers and modules previously united under the name "uncore".

The Ring Interconnect bus is the next stage in the development of the QPI (QuickPath Interconnect) bus, which, after a "run-in" in server processors with the updated 8-core Nehalem-EX architecture, migrated to the die of processors for desktop and mobile systems. The Ring Interconnect forms four 32-byte rings: the Data Ring, Request Ring, Snoop Ring, and Acknowledge Ring. The ring bus operates at the frequency of the cores, so its bandwidth, latency, and power consumption depend entirely on the frequency of the processor's computing units.

The third-level cache (LLC, Last Level Cache) is shared by all computing cores, the graphics core, the system agent, and other blocks. The graphics driver determines which data streams to place in the cache, but any other block can access all the data in the LLC. A special mechanism controls the distribution of cache memory so that collisions do not occur. To speed things up, each processor core has its own segment of the cache memory to which it has direct access. Each such segment includes an independent access controller to the Ring Interconnect bus, while constantly interacting with the system agent, which performs overall cache management.

The System Agent is, in essence, a north bridge built into the processor; it integrates the PCI Express, DMI, and RAM controllers, the video processing unit (media processor and interface control), the power manager, and other auxiliary units. The system agent interacts with the other processor nodes via the ring bus. In addition to streamlining data flows, the system agent monitors the temperature and load of the various blocks and, through the Power Control Unit, manages supply voltages and frequencies to ensure the best energy efficiency at high performance. It can also be noted here that powering the new processors requires a three-component voltage regulator (or two, if the integrated video core remains inactive): separate supplies for the computing cores, the system agent, and the integrated graphics.

The PCI Express bus built into the processor complies with the 2.0 specification and has 16 lanes for increasing the power of the graphics subsystem with a powerful external 3D accelerator. When paired with the higher-end chipsets and with licensing issues settled, these 16 lanes can be divided between two or three slots in 8x+8x or 8x+4x+4x modes, respectively, for NVIDIA SLI and/or AMD CrossFireX.

The DMI 2.0 bus is used to exchange data with the rest of the system (drives, I/O ports, peripherals whose controllers are in the chipset) and can transfer up to 2 GB/s of useful data in each direction.

An important part of the system agent is the processor's dual-channel DDR3 memory controller, which nominally supports modules at 1066-1333 MHz, but when used on motherboards based on the Intel P67 Express chipset it handles modules at frequencies up to 1600 and even 2133 MHz without problems. Placing the memory controller on the same die as the processor cores (the Clarkdale design consisted of two dies) should reduce memory latency and, accordingly, increase system performance.
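It is easy to see what those module ratings mean in terms of theoretical peak bandwidth: a sketch assuming standard 64-bit (8-byte) DDR3 channels and treating the module rating as the transfer rate in MT/s.

```python
# Rough peak-bandwidth arithmetic for the dual-channel DDR3 controller.
# Assumption: standard 64-bit (8-byte) channels; the module rating
# (e.g. DDR3-1333) is taken as millions of transfers per second.

CHANNELS = 2
BYTES_PER_TRANSFER = 8   # 64-bit channel width

def peak_gb_s(mt_per_s: float) -> float:
    """Theoretical peak in GB/s for a given DDR3 transfer rate."""
    return CHANNELS * BYTES_PER_TRANSFER * mt_per_s * 1e6 / 1e9

for rate in (1066, 1333, 1600, 2133):
    print(f"DDR3-{rate}: {peak_gb_s(rate):.1f} GB/s")
```

So the move from DDR3-1333 to DDR3-2133 lifts the theoretical ceiling from about 21 GB/s to about 34 GB/s, although real-world gains are smaller.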

Thanks in part to the Power Control Unit's advanced monitoring of all cores, caches, and auxiliary units, Sandy Bridge processors now feature the improved Intel Turbo Boost 2.0 technology. Now, depending on the load and the tasks being performed, the processor cores can be accelerated even beyond the thermal package, as with ordinary manual overclocking, when the need is high. The system agent monitors the temperature of the processor and its components, and when "overheating" is detected, the frequencies of the nodes are gradually reduced. However, desktop processors have a limited run time in this super-accelerated mode, even though it is much easier to organize far more efficient cooling for them than a "boxed" cooler. Such an "overboost" provides a performance gain at moments critical for the system, which should give the user the impression of working with a more powerful machine and reduce the system's response time. Intel Turbo Boost 2.0 also gives the integrated video core dynamic performance in desktop computers.

The architecture of Sandy Bridge processors implies not only changes in the structure of intercomponent interaction and improvement of the capabilities and energy efficiency of these components, but also internal changes in each computing core. If we discard the "cosmetic" improvements, the most important will be the following:

    the return of a cache for about 1.5 thousand decoded micro-operations (L0, an idea used in the Pentium 4), organized as a separate part of L1, which simultaneously ensures more uniform loading of the pipelines and reduces power consumption by allowing the rather complex decoder circuits to pause more often;

    increased efficiency of the branch prediction unit thanks to larger buffers for branch target addresses, instruction history, and branch history, which improved pipeline utilization;

    increased capacity of the reorder buffer (ROB) and higher efficiency of this part of the processor thanks to the introduction of a physical register file (PRF, also characteristic of the Pentium 4) for storing data, as well as the expansion of other buffers;

    doubling the width of the registers for streaming floating-point data, which in some cases can double the speed of operations using them;

    increasing the efficiency of executing encryption instructions for AES, RSA and SHA algorithms;

    introduction of new Advanced Vector Extension (AVX) vector instructions;

    optimization of the first-level (L1) and second-level (L2) caches.

An important feature of the graphics core of Sandy Bridge processors is that it is now located on the same die as the other blocks, and the system agent controls its parameters and monitors its state at the hardware level. The unit for processing media data and generating signals for the video outputs is also placed in the system agent. Such integration provides closer interaction, lower latency, greater efficiency, and so on.

However, the architecture of the graphics core itself has not changed as much as we would like. Instead of the expected DirectX 11 support, only DirectX 10.1 support has been added. Accordingly, applications using OpenGL are limited to hardware compatibility with version 3 of this free API. And although improvements to the execution units are claimed, there are still the same number of them, 12, and even then only in the higher-end processors. However, raising the clock frequency to 1350 MHz promises a noticeable performance boost in any case.

On the other hand, it is very difficult to create an integrated video core with really high performance and functionality for modern games while keeping power consumption low. Therefore, the lack of support for new APIs will only affect compatibility with new games, and for really comfortable gaming, performance will need to be increased with a discrete 3D accelerator. But the expanded multimedia functionality, primarily video encoding and decoding within Intel Clear Video Technology HD, can be counted among the advantages of Intel HD Graphics II (Intel HD Graphics 2000/3000).

The updated media processor offloads the processor cores when encoding video in MPEG2 and H.264 formats, and also expands the set of post-processing functions with hardware implementations of algorithms for automatic image contrast adjustment (ACE, Adaptive Contrast Enhancement), color correction (TCC, Total Color Control), and improved skin-tone rendering (STE, Skin Tone Enhancement). Support for the HDMI 1.4 interface, compatible with Blu-ray 3D (Intel InTru 3D), further increases the appeal of the integrated video core.

All of the above architectural features give the new generation of processors a noticeable performance advantage over previous-generation models, both in computing tasks and when working with video.

As a result, the Intel LGA 1155 platform becomes more productive and functional, replacing the LGA 1156.

To sum up, the Sandy Bridge family of processors is designed to handle a very wide range of tasks with high energy efficiency, which should make these CPUs genuinely mainstream in new high-performance systems, especially once more affordable models become widely available.

In the near future, 8 desktop processors of different levels will gradually become available: Intel Core i7-2600K, Intel Core i7-2600, Intel Core i5-2500K, Intel Core i5-2500, Intel Core i5-2400, Intel Core i5-2300, Intel Core i3-2120 and Intel Core i3-2100. Models with the K suffix feature an unlocked multiplier and the faster integrated Intel HD Graphics 3000 video adapter.

Energy-efficient (S suffix) and highly energy-efficient (T suffix) models have also been released for power-critical systems.

Motherboards based on the Intel P67 Express and Intel H67 Express chipsets are already available to support the new processors, and Intel Q67 Express and Intel B65 Express, aimed at corporate users and small businesses, are expected in the near future. All of these chipsets finally support SATA 3.0 drives, although not on every port. Support for the seemingly even more popular USB 3.0 bus, however, is still absent. An interesting feature of the new consumer chipsets is that they no longer support the PCI bus. In addition, the clock generator is now built into the chipset, and its frequency can be adjusted without affecting system stability only within a very narrow range: ±10 MHz at best, and in practice even less.

It should also be noted that the different chipsets are optimized for different processors and system purposes. The Intel P67 Express differs from the Intel H67 Express not only in its lack of support for the integrated video, but also in its advanced overclocking and performance-tuning features. The Intel H67 Express, in turn, ignores the unlocked multiplier of K-series models entirely.

Because of these architectural features, overclocking Sandy Bridge processors is still only possible via the multiplier, and only on K-series models. All models, however, allow some tuning and "over-boost".

Thus, even models with a locked multiplier are capable of noticeable short-term acceleration, temporarily creating the illusion of a much more powerful processor. As mentioned above, the duration of such acceleration in desktop systems is limited in hardware, and not only by temperature, as in mobile PCs.

After presenting all the architectural features and innovations, as well as updated proprietary technologies, it remains only to sum up once again why Sandy Bridge is so innovative and remind you of positioning.

For high-performance and mainstream systems, processors of the Intel Core i7 and Intel Core i5 series will soon be available; they differ in support for Intel Hyper-Threading technology (disabled in the quad-core Intel Core i5 models) and in L3 cache size. For more budget-minded buyers, the new Intel Core i3 models are offered: they have half as many cores (though with Intel Hyper-Threading support), only 3 MB of LLC cache, no Intel Turbo Boost 2.0 support, and are all equipped with Intel HD Graphics 2000.

In the middle of the year, Intel Pentium processors (it is clearly hard to give up this brand, even though its demise was predicted a year ago) based on a heavily simplified Sandy Bridge architecture will be introduced for mainstream systems. In practice, these "workhorse" processors will resemble the current Core i3-3xx on the Clarkdale core in capabilities, since they will lose almost all the features inherent in the higher-end LGA 1155 models.

It remains to note that the release of Sandy Bridge processors and the entire LGA 1155 desktop platform is another "Tock" within Intel's "Tick-Tock" concept, i.e. a major architecture update released on the already-proven 32 nm process technology. In about a year we can expect Ivy Bridge processors with an optimized architecture on the 22 nm process technology, which will surely again bring "revolutionary energy efficiency" but, we hope, will not abandon the LGA 1155 socket. Well, wait and see. In the meantime, we have at least a year to study the Sandy Bridge architecture and to test it comprehensively, which we intend to begin in the coming days.

At the IDF 2010 forum held on September 13-15, Intel for the first time disclosed the details of a new processor microarchitecture, codenamed Sandy Bridge. A Sandy Bridge processor was actually demonstrated at last year's IDF 2009, but the details of the new microarchitecture were not revealed at the time (apart from the most general information). Let us note right away that not all of its details have been made public even now. Some the company intends to keep secret until the official announcement, which should take place at the very beginning of next year. In particular, details regarding the performance of the new processors, the model range, and some architectural features have not been disclosed.
So, let's take a closer look at the new Sandy Bridge microarchitecture, as well as the features of processors based on it, which we will call Sandy Bridge processors in the future.

Briefly about Sandy Bridge processors

All processors codenamed Sandy Bridge will initially be manufactured using the 32nm process. In the future, when the transition to the 22-nm process technology takes place, processors based on the Sandy Bridge microarchitecture will be codenamed Ivy Bridge (Fig. 1).

Fig. 1. Evolution of Intel processor families and processor microarchitectures

Sandy Bridge processors, just like Westmere processors, form three families in the desktop and mobile segments: Intel Core i7, Intel Core i5 and Intel Core i3; however, the logos of these processors will change slightly (Fig. 2). More precisely, we are talking about the second generation (2nd Generation) of the Intel Core families.

Fig. 2. New logos for Sandy Bridge processors

It is known that the processor labeling system will change completely, but nothing was reported at the IDF 2010 forum about the new model numbering scheme.

According to unofficial data, Sandy Bridge processors will be marked with a four-digit number, with the first digit - 2 - meaning the second generation of the Intel Core family. That is, there will be, for example (again, according to unofficial data), an Intel Core i7-2600 or Intel Core i5-2500 processor. The Intel Core i7 and Intel Core i5 families will have both locked and unlocked processors, the latter being denoted by the letter K (Intel Core i7-2600K, Intel Core i5-2500K).

The main differences between the Intel Core i7, Intel Core i5 and Intel Core i3 families will be the size of the L3 cache, the number of cores, and support for Hyper-Threading and Turbo Boost technologies.

The processors of the Intel Core i7 family will be quad-core with support for Hyper-Threading and Turbo Boost technologies, and the L3 cache size will be 8 MB.

The Intel Core i5 family of processors will be quad-core, but will not support Hyper-Threading Technology. The cores of these processors will support Turbo Boost technology, and the L3 cache size will be 6 MB.

Processors of the Intel Core i3 family will be dual-core with support for Hyper-Threading technology, but without support for Turbo Boost technology. The L3 cache size in these processors will be 3 MB.

Having covered the unofficial information, let's move on to reliable data.

All new Sandy Bridge processors will use a new LGA 1155 socket and, of course, will not be compatible with motherboards based on Intel 5-series chipsets. Motherboards based on the new Intel 6-series chipsets will be designed for Sandy Bridge processors. New in these single-chip chipsets is support for two SATA 6 Gb/s (SATA III) ports as well as full-speed PCI Express 2.0 lanes (at 5 GHz). An integrated USB 3.0 controller, however, is still missing.

However, back to the Sandy Bridge processors. The new LGA 1155 socket will most likely require new coolers, since coolers for the LGA 1156 socket will be incompatible with it. This is just our guess based on simple logic: after all, Intel must somehow stimulate the release of new cooler models, lest cooler manufacturers die out completely.

A distinctive feature of all Sandy Bridge processors is an integrated next-generation graphics core. Whereas in the previous-generation processors (Clarkdale and Arrandale) the processor cores and the graphics core were located on separate dies manufactured on different process technologies, in Sandy Bridge processors all components are produced on the 32-nm process technology and placed on a single die.

It is important to emphasize that, conceptually, the graphics core of the Sandy Bridge processor can be regarded as a fifth core (in the case of quad-core processors). Like the computing cores, the graphics core has access to the L3 cache.

Just like previous-generation Clarkdale and Arrandale processors, Sandy Bridge processors will have an integrated PCI Express 2.0 interface for using discrete graphics cards. Moreover, all processors support 16 PCI Express 2.0 lanes, which can be grouped either as one PCI Express x16 port or as two PCI Express x8 ports.

It should also be noted that all Sandy Bridge processors will have an integrated dual-channel DDR3 memory controller. Variants with a three-channel memory controller are not planned, because the Sandy Bridge range will not cover the segment of top desktop processors. The top desktop processor will be a new six-core Gulftown model (Intel Core i7-990X), while the Sandy Bridge line will focus on performance, mainstream and budget PCs.

Another feature of processors based on the Sandy Bridge microarchitecture is that instead of the QPI (Intel QuickPath Interconnect) bus, which was previously used to connect individual processor components to each other, a fundamentally different interface is now used, called the ring bus (Ring Bus), which we will consider in detail below.

In general, it should be noted that the architecture of the Sandy Bridge processor implies a modular, easily scalable structure (Fig. 3).

Fig. 3. Modular structure of the Sandy Bridge processor

Another feature of the Sandy Bridge microarchitecture is that it supports the Intel AVX (Intel Advanced Vector Extension) instruction set.

Intel AVX is a new set of extensions for the Intel architecture that provides 256-bit vector floating-point calculations based on SIMD (Single Instruction, Multiple Data).

Intel AVX is a comprehensive extension of the instruction set architecture for the Intel 64 microarchitecture and has the following features:

  • support for vector data with a higher bit depth (up to 256 bits);
  • an efficient instruction coding scheme that supports three- and four-operand instruction syntax;
  • a flexible programming environment that provides a variety of possibilities - from branch processing instructions to reduced requirements for aligning offsets in memory;
  • new primitives for manipulating data and speeding up arithmetic calculations, including broadcast (broadcast), permutation (permute), simultaneous multiplication and addition (fused-multiply-add, FMA), etc.

Given that the new Intel AVX instruction set can be used by any application in which a significant share of the computation consists of SIMD operations, the biggest performance gains from the new technology will be seen in applications that mostly perform floating-point calculations and can be parallelized. Examples include audio and video codecs, image and video editing software, modeling and financial analysis applications, and industrial and engineering applications.
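The lane-parallel idea behind AVX, including the fused multiply-add primitive listed above, can be sketched in plain Python (a toy model, not real AVX code; the eight-element lists stand in for one 256-bit register of single-precision values):

```python
# A 256-bit AVX register holds 8 single-precision (32-bit) floats.
LANES = 8

def fma(a, b, c):
    """Fused multiply-add over all lanes: a*b + c, one result per lane."""
    assert len(a) == len(b) == len(c) == LANES
    return [x * y + z for x, y, z in zip(a, b, c)]

a = [1.0] * LANES
b = [2.0] * LANES
c = [0.5] * LANES
print(fma(a, b, c))  # each lane computes 1.0*2.0 + 0.5 = 2.5
```

In real hardware all eight lanes are computed in a single instruction, which is exactly why code dominated by such data-parallel arithmetic benefits most.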

Speaking of the Sandy Bridge microarchitecture, note that it is a development of Nehalem, and thus of Intel Core (since the Nehalem microarchitecture is itself a development of Intel Core). The differences between Nehalem and Sandy Bridge are quite significant, but it still cannot be called fundamentally new in the way the Intel Core microarchitecture was in its time. It is, rather, a substantially modified Nehalem microarchitecture.

Now let's take a closer look at the innovations of Sandy Bridge microarchitecture and its differences from Nehalem.

Processor core based on Sandy Bridge microarchitecture

Before considering the differences between the Sandy Bridge and Nehalem microarchitectures, recall that any processor comprises several structural elements: the L1 data and instruction caches, the preprocessor (Front End) and the postprocessor, also called the execution engine (Execution Engine).

Data processing includes the following steps. First, instructions and data are fetched from the L1 cache (this step is called fetching). The instructions fetched from the cache are then decoded into machine primitives (micro-operations) that the processor understands; this procedure is called decoding. The decoded instructions are then sent to the processor's execution units, executed, and the result is written to memory.

The processes of fetching instructions from the cache, their decoding and promotion to execution units are carried out in the preprocessor, and the process of executing instructions is carried out in the postprocessor.

Now let's take a closer look at the Sandy Bridge processor core and compare it with the Nehalem core. In both the Nehalem and the Sandy Bridge core, x86 instructions are fetched from the 32 KB 8-way L1 instruction cache (Instruction Cache). Instructions are loaded from the cache in fixed-length blocks, from which the instructions to be decoded are extracted. Because x86 instructions have variable length while the blocks loaded from the cache are fixed, the boundaries between individual instructions have to be determined before decoding.

Instruction-size information is stored in the L1 instruction cache in special fields (3 bits of information for each instruction byte). In principle, this boundary information could be used directly in the decoder during decoding. However, that would inevitably reduce decoding speed and make it impossible to decode several instructions simultaneously. Therefore, before decoding, the instructions are extracted from the fetched block; this procedure is called pre-decoding (PreDecode). Pre-decoding maintains a constant decoding rate regardless of instruction length and structure.
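The boundary-marking step can be illustrated with a toy model (a sketch only: real pre-decode reads the per-byte length marks from the cache; here a ready list of instruction lengths plays that role, and the instruction bytes are arbitrary):

```python
def predecode(block, lengths):
    """Split a fetched byte block into instructions using length marks.
    'lengths' stands in for the boundary info stored alongside the
    L1 instruction cache (a simplified, hypothetical form)."""
    insts, pos = [], 0
    for n in lengths:
        if pos + n > len(block):
            break  # this instruction spills into the next fetch block
        insts.append(block[pos:pos + n])
        pos += n
    return insts

block = bytes(range(16))                  # one 16-byte fetch block
print(predecode(block, [2, 3, 5, 4, 6]))  # the 6-byte instruction doesn't fit
```

Note how the last instruction, straddling the block boundary, must wait for the next 16-byte fetch: exactly the case where long instructions reduce decode efficiency.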

Processors with the Nehalem and Sandy Bridge microarchitecture fetch instructions in 16-byte blocks, that is, a 16-byte instruction block is loaded from the cache for each clock cycle.

After the fetch operation, the commands are queued (Instruction Queue), and then transmitted to the decoder. During decoding (Decode), commands are converted into machine micro-operations of a fixed length (denoted as micro-ops or uOps).

The decoder of the Sandy Bridge processor core has not changed. Just as in the Nehalem microarchitecture, it is four-wide and can decode up to four x86 instructions per clock cycle. As already noted, in the Nehalem and Sandy Bridge microarchitectures a 16-byte instruction block is loaded from the cache every cycle, from which individual instructions are extracted during pre-decoding. In principle, a single instruction can be up to 16 bytes long, but the average instruction length is 4 bytes. Therefore, on average each block holds four instructions, which a four-wide decoder decodes simultaneously in one clock cycle.

The four-wide decoder consists of three simple decoders that decode simple instructions into one micro-op each, and one complex decoder that can decode an instruction into up to four micro-ops (a 4-1-1-1 decoder). For even more complex instructions, decoded into more than four micro-ops, the complex decoder is coupled to a uCode Sequencer block used to decode such instructions.

Naturally, decoding four instructions per clock is possible only if one 16-byte block contains at least four instructions. However, there are instructions longer than 4 bytes, and when loading several such instructions in one block, the decoding efficiency decreases.

When decoding instructions in the Nehalem and Sandy Bridge microarchitectures, two interesting technologies are used - Macro-Fusion and Micro-Fusion.

Macro-Fusion is the merging of two x86 instructions into one complex micro-op. In previous versions of the processor microarchitecture, each x86 instruction was decoded independently of the others. With Macro-Fusion, certain pairs of instructions (for example, a comparison and a conditional branch) can be merged into one micro-operation during decoding, which is then executed exactly as one micro-operation. Note that to support Macro-Fusion effectively, the Nehalem and Sandy Bridge microarchitectures use extended ALUs (Arithmetic Logic Units) capable of executing the fused micro-operations. Note also that without Macro-Fusion only four instructions can be decoded per cycle (in a four-wide decoder), whereas with Macro-Fusion five instructions can be read per cycle, merged into four, and then decoded.

Note that Macro-Fusion technology was also used in the Intel Core microarchitecture; in the Nehalem microarchitecture, however, the set of x86 instructions eligible for fusion into one micro-operation was expanded. In addition, in the Intel Core microarchitecture x86 instruction fusion was not supported in the 64-bit operating mode, that is, Macro-Fusion worked only in 32-bit mode. In the Nehalem architecture this bottleneck was fixed, and fusion works in both 32-bit and 64-bit modes. In the Sandy Bridge microarchitecture, the set of fusible x86 instructions has been extended even further.

Micro-Fusion is the merging of two micro-operations (not x86 instructions but micro-operations) into one that contains two elementary actions. The two fused micro-ops are subsequently processed as one, which reduces the number of micro-ops in flight and thereby increases the total number of instructions the processor executes per cycle. Clearly, fusion is not possible for every pair of micro-operations. The Sandy Bridge microarchitecture uses exactly the same Micro-Fusion operation (for the same set of micro-ops) as Nehalem.
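The effect of Macro-Fusion on decode bandwidth can be sketched as follows (a toy model: the fusable pair sets are illustrative, not the real eligibility rules, and mnemonics stand in for decoded instructions):

```python
# Assumption: simplified pair sets for illustration only.
FUSABLE_FIRST = {"cmp", "test"}
FUSABLE_SECOND = {"je", "jne", "jl", "jg"}

def macro_fuse(instructions):
    """Merge eligible compare + conditional-branch pairs into one micro-op."""
    uops, i = [], 0
    while i < len(instructions):
        if (i + 1 < len(instructions)
                and instructions[i] in FUSABLE_FIRST
                and instructions[i + 1] in FUSABLE_SECOND):
            uops.append(instructions[i] + "+" + instructions[i + 1])
            i += 2  # two x86 instructions consumed, one micro-op emitted
        else:
            uops.append(instructions[i])
            i += 1
    return uops

print(macro_fuse(["mov", "cmp", "je", "add", "test", "jne"]))
```

Here six x86 instructions collapse into four micro-ops, which is how a four-wide decoder can effectively consume more than four instructions per cycle.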

Speaking of instruction fetch in the Nehalem microarchitecture, one must note the loop detection unit (Loop Stream Detector), which takes part in instruction fetch and avoids repeating the same work on every loop iteration. The Loop Stream Detector (LSD) is also used in the Intel Core microarchitecture, but differs from the LSD in Nehalem. In the Intel Core architecture, an 18-instruction LSD buffer is located before the decoder, so only loops containing no more than 18 instructions can be tracked and recognized. When a loop is detected, its instructions skip the fetch and branch prediction (Branch Prediction) phases and are fed to the decoder from the LSD buffer instead. This both reduces the power consumption of the processor core and bypasses the instruction-fetch phase. If the loop contains more than 18 instructions, the instructions go through all the standard stages every time.

In the Nehalem microarchitecture, the loop detection block is located not before, but after the decoder, and holds 28 already-decoded instructions. Since the LSD now stores decoded instructions, loop instructions skip not only the branch prediction and fetch phases, as before, but also the decoding phase (in effect, the processor's preprocessor is switched off while the loop executes). As a result, in Nehalem loop instructions pass through the pipeline faster and power consumption is lower than in the Intel Core architecture (Fig. 4).
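The front-end work the LSD saves can be illustrated with a toy decode-count model (the 28-uop capacity comes from the text above; the counting itself is a deliberate simplification, not the real hardware behavior):

```python
LSD_CAPACITY = 28  # decoded micro-ops, as in Nehalem / Sandy Bridge

def decode_work(loop_uops, iterations):
    """Count decode operations the front end performs for a loop.
    If the body fits in the LSD, it is decoded once and then replayed
    from the buffer; otherwise every iteration is decoded again."""
    if loop_uops <= LSD_CAPACITY:
        return loop_uops                 # decoded once, replayed afterwards
    return loop_uops * iterations        # decoded on every iteration

print(decode_work(20, 1000))   # fits in the LSD: 20 decodes total
print(decode_work(40, 1000))   # too large: 40000 decodes
```

The three-orders-of-magnitude difference in decode work is where the power savings come from.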

Fig. 4. LSD buffer in Intel Core and Nehalem microarchitectures

In the Sandy Bridge microarchitecture, the developers went even further: alongside the 28-micro-op LSD buffer they added a Decoded Uop Cache (Fig. 5). All decoded micro-ops are sent to this cache. The decoded micro-op cache holds approximately 1500 micro-ops (apparently of average length), which is equivalent to roughly a 6 KB x86-instruction cache.

Fig. 5. Cache of decoded micro-ops in the Sandy Bridge microarchitecture

The idea of the decoded micro-op cache is to store sequences of micro-ops. The micro-op cache works not at the level of a single instruction, but at the level of a 32-byte instruction block. The cache is divided into 32 sets of 8 lines each; each line holds up to 6 micro-operations, and up to 3 lines (18 micro-ops) can be mapped to one 32-byte block. Tagging is done by the instruction pointer (IP). The predicted instruction pointer is checked in parallel in both the instruction cache and the micro-op cache; on a hit, the lines that make up the 32-byte block are fetched from the micro-op cache and placed in the queue, and there is no need to fetch and decode again.
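The 32-byte-block organization above can be sketched in a few lines (a toy model of the addressing only; the set/way bookkeeping and replacement policy are omitted):

```python
# Geometry from the text: 32 sets x 8 lines, 6 micro-ops per line,
# at most 3 lines per 32-byte instruction block.
SETS, WAYS, UOPS_PER_LINE, MAX_LINES_PER_BLOCK = 32, 8, 6, 3

def block_tag(ip):
    """Map an instruction pointer to its 32-byte block address,
    the granularity at which the decoded micro-op cache is tagged."""
    return ip & ~0x1F  # clear the low 5 bits (32-byte alignment)

def block_fits(uop_count):
    """A 32-byte block may occupy at most 3 lines of 6 micro-ops each."""
    return uop_count <= MAX_LINES_PER_BLOCK * UOPS_PER_LINE

print(hex(block_tag(0x1234)))         # -> 0x1220
print(block_fits(18), block_fits(19)) # 18 uops fit, 19 do not
```

A block that decodes into more than 18 micro-ops cannot be cached and must go through the legacy decode path every time.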

The efficiency of using the decoded micro-op cache largely depends on the efficiency of the Branch Prediction Unit (BPU). Recall that the branch prediction unit is used in all modern processors, and in Sandy Bridge processors it is significantly improved compared to the BPU in the Nehalem microarchitecture (Fig. 6).

Fig. 6. Branch Prediction Unit in the Sandy Bridge microarchitecture

To understand why the branch prediction unit is so important and how it affects performance, recall that virtually any non-trivial program contains conditional branch instructions. A conditional branch means the following: if a certain condition is true, execution continues from one address, and if not, from another. From the processor's point of view, a conditional branch is a stumbling block: until it is known whether the condition is true, the processor does not know which part of the program code to execute next and would be forced to idle. To avoid this, the branch prediction unit tries to guess which section of code the conditional branch will lead to even before it is executed. Based on the prediction, the corresponding x86 instructions are fetched either from the L1 cache or from the decoded micro-op cache.

When a conditional jump instruction is encountered for the first time, a so-called static prediction is applied. In essence, the BPU simply guesses which software branch will be executed next. Moreover, static prediction is based on the assumption that most backward branches occur in repeating loops, when a branch instruction is used to determine whether the loop continues or exits. More often than not, the loop continues, so the processor will re-execute the loop code again. For this reason, static prediction assumes that all backward branches are always executed.

As statistics on the outcomes of conditional branches accumulate (the branch history), the dynamic branch prediction algorithm takes over, based precisely on analyzing the statistics of previously executed branches. Dynamic branch prediction algorithms use a Branch History Table (BHT) and a table of branch target addresses (Branch Target Buffer, BTB). These tables hold information about already-executed branches. The BHT contains all conditional branches from the last few cycles, together with bits indicating the probability that the same branch will be taken again, set according to the statistics of previous outcomes. In the standard bimodal (2-bit) scheme there are four states: strongly taken, taken, not taken, and strongly not taken.
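The bimodal scheme described above is a 2-bit saturating counter per branch, which can be sketched directly (the initial state is an arbitrary choice for the sketch):

```python
class BimodalPredictor:
    """2-bit saturating counter: 0 = strongly not taken, 1 = not taken,
    2 = taken, 3 = strongly taken."""
    def __init__(self):
        self.state = 2  # start weakly "taken" (an arbitrary choice)

    def predict(self):
        return self.state >= 2  # True means "predict taken"

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = BimodalPredictor()
correct = 0
for taken in [True, True, False, True, True]:  # a loop branch: mostly taken
    if p.predict() == taken:
        correct += 1
    p.update(taken)
print(correct, "of 5 predicted correctly")
```

Note that the single not-taken loop exit only weakens the counter from "strongly taken" to "taken", so the predictor stays right on the very next iteration. This hysteresis is the whole point of using two bits instead of one.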

To decide whether to execute a branch speculatively, the unit must know the exact location of the code in the L1 cache at the branch destination, called the branch target. The targets of already-executed branches are stored in the BTB. When a branch executes, the BPU simply takes the branch target from the table and tells the preprocessor to start fetching instructions at that address.

It is clear that the reliability of branch prediction depends on the size of the BHT and BTB tables. The more entries in these tables, the higher the reliability of the prediction.

It should be noted that the probability of correct branch prediction in modern processors is very high (about 97-99%), and in fact the struggle is already going on for a fraction of a percent.

There are several BPU improvements in the Sandy Bridge microarchitecture. First, instead of a separate probability entry for each branch in the BHT, one probability value is now shared by several branches. This makes the BHT more compact, which in turn improves prediction reliability.

The second BPU improvement in the Sandy Bridge microarchitecture is an optimized BTB. Whereas previously a fixed number of bits was used to store every branch target, wasting space, the number of bits now depends on the address itself. This effectively allows more targets to be stored in the table, which increases prediction reliability.

More accurate data on the sizes of the BHT and BTB tables are not yet available.

So, we talked about the changes in the preprocessor of the Sandy Bridge microarchitecture (the decoded micro-ops cache and the updated branch prediction block). Let's go further.

After x86 instructions are decoded, their execution begins. First comes the renaming and allocation of additional processor registers (the Allocate/Rename/Retirement block), registers that are not defined by the instruction set architecture.

Register renaming enables out-of-order execution of instructions. The idea is as follows. In the x86 architecture, the number of general-purpose registers is relatively small: eight are available in 32-bit mode and 16 in 64-bit mode. Imagine that an instruction is waiting for its operand values to be loaded into a register from memory. This is a long operation, and it would be useful to let that register be used by another instruction whose operands are closer (for example, in the first-level cache). To do this, the "waiting" register is temporarily renamed and the renaming history is tracked, while a "ready-to-work" register is given the standard name so that the instruction whose operands are available can execute right away. When the data arrives from memory, the renaming history is consulted and the original register gets its legal name back. In other words, register renaming reduces stalls, and the renaming history is used to resolve conflicts.
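The renaming idea can be sketched with a minimal rename table (a toy model: real hardware also tracks when physical registers can be freed, which is omitted here):

```python
class RenameTable:
    """Maps architectural register names to physical registers.
    A minimal sketch of the renaming idea, not the real hardware."""
    def __init__(self, num_physical):
        self.free = list(range(num_physical))  # free physical registers
        self.map = {}                          # arch name -> physical reg

    def rename(self, arch_reg):
        """Give 'arch_reg' a fresh physical register for a new result, so
        earlier in-flight readers of the old value are unaffected."""
        phys = self.free.pop(0)
        self.map[arch_reg] = phys
        return phys

rt = RenameTable(num_physical=8)
p1 = rt.rename("eax")   # first write to eax gets physical reg 0
p2 = rt.rename("eax")   # a second write gets a new physical reg 1
print(p1, p2, rt.map["eax"])
```

Because the second write to `eax` lands in a different physical register, an instruction still waiting on the first value and an instruction producing the second can be in flight simultaneously.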

At the next stage (the reorder buffer, ReOrder Buffer, ROB), micro-operations can be reordered (Out-of-Order) so that they can later be executed more efficiently on the execution units. Note that the ReOrder Buffer and the Retirement Unit form a single processor block: first the instructions are reordered, and the Retirement Unit comes into play later, when the executed instructions must be retired in the order specified by the program.

In the Nehalem microarchitecture, the size of the reorder buffer was increased in comparison with the size of the reorder buffer in the Intel Core microarchitecture. So, if in Intel Core it was designed for 98 micro-ops, then in Nehalem you can already place 128 micro-ops.

Next, the micro-operations are distributed among the execution units. The Reservation Station block forms queues of micro-ops, from which they are issued to one of the ports of the functional units (dispatch ports). This process is called dispatching (Dispatch), and the ports act as a gateway to the functional units.

After the micro-ops pass the dispatch ports, they are sent to the appropriate functional blocks for further execution.

In the Sandy Bridge microarchitecture, the Allocate/Rename/Retirement (Out-of-Order) cluster has been significantly changed. In the Intel Core and Nehalem microarchitectures, each micro-op carries a copy of the operand or operands it requires. This means that the out-of-order cluster's buffers must be quite large, since they have to hold micro-ops together with their operands. In the Nehalem architecture operands could be up to 128 bits; with the introduction of the AVX extension an operand can be 256 bits, which would require doubling the size of all out-of-order cluster buffers.

Instead, the Sandy Bridge microarchitecture uses a physical register file (Physical Register File, PRF) that stores the operands of micro-operations (Fig. 7). The micro-ops themselves then hold only pointers to their operands, not the operand values. On the one hand, this reduces power consumption, since moving micro-operations down the pipeline together with their operands costs considerable power. On the other hand, the physical register file saves die area, and the freed space is used to enlarge the out-of-order cluster buffers (Load Buffers, Store Buffers, Reorder Buffers) - see table. In the Sandy Bridge microarchitecture, the physical register file for integer operands (PRF Integer) holds 160 entries, and the one for floating-point operands (PRF Float Point) holds 144 entries.
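The pointers-instead-of-values idea can be sketched as follows (a toy model: the PRF sizes are from the text, while the micro-op layout and the dict representation are illustrative assumptions):

```python
# Physical register file sizes in Sandy Bridge, from the text above.
PRF_INT, PRF_FP = 160, 144

prf = [0.0] * PRF_FP  # the operand values live here, in one place

# A micro-op now carries only small indices into the PRF instead of
# dragging (up to 256-bit) operand copies down the pipeline with it.
uop = {"op": "add", "src": (5, 7), "dst": 9}

prf[5], prf[7] = 1.5, 2.25
prf[uop["dst"]] = prf[uop["src"][0]] + prf[uop["src"][1]]
print(prf[9])  # 1.5 + 2.25 = 3.75
```

An index into a 160- or 144-entry file needs only 8 bits, versus up to 256 bits for an AVX operand value, which is the space and power saving the PRF design buys.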

Fig. 7. Using physical register files in the Sandy Bridge microarchitecture

In the Sandy Bridge architecture, the execution units of the processor core have also undergone significant reworking. There are still six ports of functional units (three computational and three for memory operations), but their purpose, like that of the execution units themselves, has changed (Fig. 8). Recall that a processor based on the Nehalem microarchitecture can perform up to six operations per cycle: three computational operations and three memory operations simultaneously.

Fig. 8. Execution units in the Sandy Bridge microarchitecture

In the Sandy Bridge architecture, the three execution units can perform eight floating-point (FP) operations or two operations on 256-bit AVX data per clock.

In the Sandy Bridge microarchitecture, not only the three execution units have changed, but also the functional blocks for memory operations. Recall that the Nehalem microarchitecture had three memory ports: Load (data loading), Store Address, and Store Data (Fig. 9).

Fig. 9. Execution units for working with memory in the Nehalem microarchitecture

The Sandy Bridge microarchitecture also uses three memory ports, but two of them have become universal and can handle not only data loads (Load) but also store-address operations (Store Address). The third port is unchanged and handles store data (Store Data) - Fig. 10.

Fig. 10. Execution units for working with memory in the Sandy Bridge microarchitecture

Accordingly, the bandwidth between the execution units and the L1 data cache has increased. Where the Nehalem microarchitecture could transfer 32 bytes of data per cycle between the L1 data cache and the memory execution units, the Sandy Bridge microarchitecture transfers 48 bytes: two 16-byte (128-bit) read requests and one write request of up to 16 bytes.
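The per-cycle figures follow directly from the port widths quoted in the text; a back-of-the-envelope check:

```python
# L1 data cache bandwidth per cycle: Nehalem vs. Sandy Bridge
# (port counts and widths as given in the text).
LOAD_WIDTH = 16    # bytes per load request (128 bits)
STORE_WIDTH = 16   # bytes per store request

nehalem = 1 * LOAD_WIDTH + 1 * STORE_WIDTH        # one load + one store port
sandy_bridge = 2 * LOAD_WIDTH + 1 * STORE_WIDTH   # two universal load/store-address
                                                  # ports + one store-data port
print(nehalem, "B/cycle ->", sandy_bridge, "B/cycle")   # 32 B/cycle -> 48 B/cycle
```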

To conclude the description of the processor core based on the Sandy Bridge microarchitecture, let us bring everything together. Fig. 11 shows the block diagram of a processor core based on the Sandy Bridge microarchitecture. Yellow marks the blocks that are changed or new in the Sandy Bridge microarchitecture, and blue marks the blocks present in both the Nehalem and Sandy Bridge microarchitectures.

Fig. 11. Differences between the Sandy Bridge and Nehalem microarchitectures
(common blocks are marked in blue; blocks changed or new
in the Sandy Bridge microarchitecture are marked in yellow)

Ring bus in Sandy Bridge microarchitecture

In the Nehalem microarchitecture, each L2 cache communicated with the L3 cache shared by all cores over a dedicated internal processor bus with about a thousand contacts, while the individual processor units (memory controller, graphics controller, etc.) communicated over the QPI bus. In the Sandy Bridge microarchitecture, both the QPI bus and the L2/L3 interconnect have been replaced by a new ring bus (Ring Bus) - Fig. 12. It links the L2 caches of the processor cores with the L3 cache, and also gives the graphics core (GPU) and the video transcoding engine access to the L3 cache. In addition, the same ring bus provides access to the memory controller. Note in passing that Intel now calls the L3 cache the Last Level Cache (LLC) and the L2 cache the Middle Level Cache (MLC).

Fig. 12. Ring bus in the Sandy Bridge microarchitecture

The ring bus combines four separate buses: a 256-bit (32-byte) Data ring, a Request ring, an Acknowledge ring, and a Snoop ring.

The ring bus also reduces L3 cache latency: in previous-generation (Westmere) processors, L3 access latency is 36 cycles, while in Sandy Bridge processors it is 26-31 cycles. In addition, the L3 cache now runs at the core clock frequency (in Westmere processors, the L3 cache clock did not match the core clock).

The L3 cache is divided into separate slices, each associated with a particular processor core, yet the entire L3 cache remains accessible to every core. Each L3 slice has its own ring bus access agent. Similar access agents exist for the L2 caches of each processor core, for the graphics core, and for the system agent that handles data exchange with the memory controller.
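A cache that is sliced per core yet globally accessible is typically built by hashing physical addresses across the slices, so every core can reach every slice and traffic spreads evenly over the ring. A simplified sketch of the idea (the hash function here is purely illustrative; Intel's actual function is not documented):

```python
# Toy model of a sliced last-level cache: each address maps to exactly one
# slice via a hash of its cache-line index, but any core may access any slice.

NUM_SLICES = 4    # one slice per core on a quad-core Sandy Bridge

def slice_for_address(phys_addr: int) -> int:
    line = phys_addr >> 6                    # 64-byte cache line index
    # Illustrative hash: XOR-fold the line index, then reduce mod slice count.
    h = line ^ (line >> 11) ^ (line >> 23)
    return h % NUM_SLICES

# Consecutive cache lines land on different slices, spreading ring traffic:
for addr in range(0, 4 * 64, 64):
    print(hex(addr), "-> slice", slice_for_address(addr))
```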

In conclusion, note that the L3 cache in the Sandy Bridge microarchitecture remains fully inclusive with respect to the L2 caches (as in the Nehalem microarchitecture).

Graphics core in Sandy Bridge microarchitecture

One of the major innovations of the Sandy Bridge microarchitecture is the new graphics core. As already noted, unlike the graphics core in Clarkdale/Arrandale processors, it is located on the same die as the compute cores and, in addition, has access to the L3 cache via the ring bus. Moreover, the performance of the new graphics core is expected to be roughly twice that of the graphics core in Clarkdale/Arrandale processors. Of course, the graphics core in Sandy Bridge processors cannot match discrete graphics in performance (notably, DirectX 11 support has not even been announced for the new core), but in fairness, this core is not positioned as a gaming solution.

Depending on the processor model, the new graphics core contains 6 or 12 execution units (EU), which, however, cannot be compared with the unified shader processors in NVIDIA or AMD GPUs, where there are several hundred of them (Fig. 13). This graphics core is aimed primarily not at 3D games but at hardware decoding and encoding of video (including HD video); accordingly, the graphics core includes hardware decoders. They are complemented by scaling, noise reduction (denoise filtering), interlace detection and removal (deinterlacing/film-mode detection), and detail-enhancement filters. Post-processing for improved playback includes STE (skin tone enhancement), ACE (adaptive contrast enhancement), and TCC (total color control).

Fig. 13. Block diagram of the graphics core in the Sandy Bridge microarchitecture

The multi-format hardware codec supports the MPEG-2, VC-1, and AVC formats, performing all decoding steps in dedicated hardware, whereas in the current generation of integrated graphics processors this work is done by the general-purpose EU execution units.

New Intel Turbo Boost Mode

One of the notable features of Sandy Bridge processors is support for a new Turbo Boost mode. Recall that the essence of Turbo Boost technology is dynamic overclocking of the processor cores' clock frequencies under certain conditions.

To implement Turbo Boost, the processor contains a dedicated functional block, the PCU (Power Control Unit), which monitors the load on the processor cores and the processor temperature, and is responsible for powering each core and regulating its clock frequency. An integral part of the PCU is the so-called power gate, used to put each processor core individually into the C6 power state (in effect, the power gate disconnects a processor core from, or connects it to, the VCC power line).

In Clarkdale and Arrandale processors, Turbo Boost works as follows. If some processor cores are idle, they are simply disconnected from the power line by the power gate (their power consumption then drops to zero). The clock frequency and supply voltage of the remaining loaded cores can then be raised dynamically in steps of 133 MHz, as long as the processor's power consumption does not exceed its TDP. In other words, the power saved by disabling idle cores is spent on overclocking the remaining ones, such that the added power never exceeds the power saved. Turbo Boost also engages when all processor cores are loaded, provided the processor's power consumption stays below the TDP.
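The budgeting logic described above can be sketched as follows: power freed by gated-off idle cores becomes headroom that is converted into 133 MHz bins on the active cores, stopping before the package would exceed TDP. A deliberately simplified model (all wattage figures are made-up illustrative numbers, not Intel specifications):

```python
# Simplified Turbo Boost budget model (Clarkdale/Arrandale style: never exceed TDP).
TDP = 73.0          # W, package limit (illustrative value)
BIN_MHZ = 133       # MHz per turbo bin
CORE_POWER = 15.0   # W per active core at base frequency (illustrative)
BIN_POWER = 8.0     # W of extra power per turbo bin per active core (illustrative)

def turbo_bins(active_cores: int) -> int:
    """How many 133 MHz bins the active cores may climb without exceeding TDP."""
    budget = TDP - active_cores * CORE_POWER   # headroom left by gated-off cores
    bins = 0
    while (bins + 1) * BIN_POWER * active_cores <= budget:
        bins += 1
    return bins

# Fewer loaded cores -> larger budget -> more turbo bins:
for n in (1, 2, 4):
    print(n, "cores active ->", turbo_bins(n), "bins =",
          turbo_bins(n) * BIN_MHZ, "MHz of boost")
```

With these toy numbers, a single active core gets many bins, two cores get a couple, and four fully loaded cores get none, which mirrors the behavior the text describes.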

In mobile Arrandale processors with an integrated graphics core, Turbo Boost covers not only the processor cores but also the graphics core. That is, depending on the current temperature and power consumption, the graphics core can be overclocked as well. For example, if in some application the main load falls on the GPU while the processor cores remain underloaded, the spare TDP budget is used to overclock the graphics core, without exceeding the graphics core's TDP limit.

Since in Sandy Bridge processors (both desktop and mobile) the graphics core is, in effect, a core on a par with the compute cores, Turbo Boost covers both the compute cores and the graphics core. In addition (and this is the main innovation), the new Turbo Boost mode allows the processor's TDP to be exceeded for a short time when overclocking the cores.

The point is that when the TDP is exceeded, the processor does not overheat immediately, but only after a certain time. Given that in many applications 100% processor load arrives in bursts and only for very short periods, during these bursts the clock frequency can be raised beyond what the TDP limit would normally allow.

In Turbo Boost mode, Sandy Bridge processors can exceed the TDP for up to 25 seconds (Fig. 14).
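The short excursion above TDP works because package temperature lags power: the thermal mass of the die, package, and heatsink absorbs the burst. A crude first-order thermal model illustrates why a window on the order of tens of seconds exists (all constants below are hypothetical, chosen only to make the effect visible):

```python
# First-order thermal model: die temperature approaches ambient + power * R_TH
# with time constant TAU. While temperature is below the limit, the PCU can
# tolerate power above TDP. All constants are illustrative, not Intel data.
import math

R_TH = 0.5     # K/W, package-to-ambient thermal resistance (illustrative)
TAU = 20.0     # s, thermal time constant of package + heatsink (illustrative)
T_AMB = 40.0   # deg C, baseline temperature
T_MAX = 85.0   # deg C, limit enforced by the PCU (illustrative)

def temp_after(power_w: float, seconds: float) -> float:
    """Die temperature after running at constant power for `seconds`."""
    return T_AMB + power_w * R_TH * (1 - math.exp(-seconds / TAU))

# At 95 W (well above a hypothetical 65 W TDP) the limit is not hit at once:
for t in (1, 10, 25, 60):
    T = temp_after(95, t)
    print(f"t={t:>2} s  T={T:.1f} C  within limit: {T < T_MAX}")
```

With these constants the die is still below its limit after 25 seconds of over-TDP operation but would cross it on a sustained run, which is exactly why the boost must be time-limited.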

Conclusion

Let us summarize our review of the Sandy Bridge microarchitecture. This new microarchitecture is a major revision of Nehalem. Among the innovations are a cache of decoded micro-operations, a redesigned branch prediction unit, a physical register file, larger out-of-order cluster buffers, and improved execution units and memory blocks. In addition, Sandy Bridge processors use a ring bus that gives the processor cores access to the L3 cache and memory. Sandy Bridge processors also received a new, more efficient graphics core with access to the L3 cache.

In addition, Sandy Bridge processors have a new Turbo Boost mode that allows you to squeeze the maximum performance out of the processor.

The capabilities of the Sandy Bridge GPU are generally comparable to those of the previous generation of Intel's integrated solutions, except that DirectX 10.1 support has now been added on top of DirectX 10, instead of the expected DirectX 11 support. Accordingly, the relatively few applications that use OpenGL are limited to hardware compatibility with version 3 of that free API.

Nevertheless, there are a lot of innovations in Sandy Bridge graphics, and they are mainly aimed at increasing performance when working with 3D graphics.

According to Intel representatives, the main emphasis in developing the new graphics core was on making maximum use of fixed-function hardware for computing 3D functions, and likewise for processing media data. This approach differs radically from the fully programmable hardware model adopted, for example, by NVIDIA, or by Intel itself in the Larrabee design (with the exception of texture units).

However, in the Sandy Bridge implementation, the retreat from programmable flexibility has undeniable advantages that matter more for integrated graphics: lower latency when executing operations, better performance combined with energy savings, a simpler driver programming model, and, importantly, a smaller physical graphics module.

Sandy Bridge's programmable shader execution units, traditionally called Execution Units (EU) at Intel, feature enlarged register files, which enables efficient execution of complex shaders. The new execution units also apply branch optimization to achieve better parallelization of the executed instructions.

Overall, according to Intel representatives, the new execution units have twice the throughput of the previous generation of integrated graphics, and the performance of transcendental computations (trigonometry, natural logarithms, and so on) will increase by a factor of 4 to 20 thanks to the emphasis on hardware computation.

The internal instruction set, reinforced in Sandy Bridge with a number of new instructions, allows most DirectX 10 API instructions to be mapped one-to-one, much as in a CISC architecture, which yields noticeably higher performance at the same clock frequency.

Fast access over the ring bus to a distributed L3 cache with dynamically configurable segmentation reduces latency, increases performance, and at the same time reduces how often the GPU has to access RAM.

Ring bus

The entire history of Intel's processor microarchitecture modernization in recent years is inseparable from the step-by-step integration into a single chip of ever more modules and functions that previously sat outside the processor: in the chipset, on the motherboard, and so on. Accordingly, as processor performance and chip integration grew, the bandwidth requirements for the internal interconnect buses grew even faster. For a time, even after a graphics chip was added in the Arrandale/Clarkdale architecture, the usual crossbar topology of the inter-component buses was still sufficient.

However, such a topology is efficient only with a small number of components participating in the data exchange. In the Sandy Bridge microarchitecture, to improve overall system performance, the developers turned to a ring topology for the 256-bit interconnect bus (Fig. 6.1), based on a new version of QPI (QuickPath Interconnect) technology - expanded, refined, and first implemented in the Nehalem-EX server chip (Xeon 7500), and also planned for use in the Larrabee chip architecture.

The ring bus (Ring Interconnect) in the desktop and mobile versions of the Sandy Bridge architecture is used to exchange data between six key components of the chip: the four x86 processor cores, the graphics core, the L3 cache (now called the LLC, Last Level Cache), and the system agent. The bus consists of four 32-byte rings: a data ring (Data Ring), a request ring (Request Ring), a snoop ring (Snoop Ring), and an acknowledge ring (Acknowledge Ring); in practice, this allows an access to the 64-byte last-level-cache interface to be split into two packets. The rings are controlled by a distributed arbitration protocol, and requests are pipelined at the clock frequency of the processor cores, which gives the architecture additional flexibility when overclocking. Ring bus bandwidth is rated at 96 GB/s per link at 3 GHz, effectively four times that of previous-generation Intel processors.

Fig. 6.1. Ring bus (Ring Interconnect)

The ring topology and bus organization ensure minimal request-handling latency, maximum performance, and excellent scalability for chip versions with different numbers of cores and other components. According to company representatives, up to 20 processor cores per chip could be "connected" to the ring bus in the future, and such a redesign can be carried out very quickly, as a flexible and prompt response to market needs. In addition, the ring bus runs physically directly above the L3 cache blocks in the upper metallization layer, which simplifies the layout and makes the chip more compact.
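The 96 GB/s figure quoted above is easy to verify: the data ring moves 32 bytes per clock, and the ring runs at the core frequency:

```python
# Ring bus data bandwidth per link (numbers from the text).
RING_WIDTH_BYTES = 32            # 256-bit data ring
CORE_CLOCK_HZ = 3_000_000_000    # 3 GHz: the ring is clocked at core frequency

bandwidth = RING_WIDTH_BYTES * CORE_CLOCK_HZ
print(bandwidth / 1e9, "GB/s")   # 96.0 GB/s
```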