Digital-Daily.com

CPU-boundedness of the video system Part I - Analysis

Author: Dmitry Sofronov
Date: 26/07/2007

Preface

Everyone who owns a computer has at some point thought about an upgrade, that is, about boosting performance by one means or another. The reasons vary: a program runs slowly and needs to be sped up, or a new version of a program requires a new operating system, which in turn demands more performance. Another typical example: a new game with fantastic graphics has been released, but the old "hardware buddy" can no longer run it and shows a "slide show" instead of a smooth "movie".
Those who can afford it simply buy a new, more powerful computer, and the upgrade issue is no longer relevant. More often, though, the system is basically fine, it would just be nice if "this or that" component were a bit more powerful, and the user is short of money to buy a new computer right away. What to do then? Evidently, the only option is an incremental upgrade. Of course, the direction of the upgrade strongly depends on the main tasks the computer is used for. For mathematical computations, programming and databases, the CPU speed, the RAM capacity and the speed of the disk subsystem matter most. If the computer is used for games, the "capacity" of the video card is among the most important factors, although the above requirements also matter, and it is not easy to say which of them takes priority.

In fact, we'd like to dispel the established misconception that a very powerful computer is needed for «serious work», whereas some «middling» computer would suffice for games. Strange as it may seem, for most of the «serious» tasks the average user does at work (texts, spreadsheets, databases, Internet surfing, etc.) the performance of that very "middling" computer is quite enough.
And vice versa: even a 3D game that is not the most recent is rather demanding of computer resources.

Just compare: Leo Tolstoy wrote his monumental "War and Peace" without a computer, using only pen and paper, whereas the games of Peter the Great's "toy regiments" (a real-time strategy plus a 3D shooter simulator) required substantial human and material resources.

Modern computer games can load all the available resources of even the most powerful computer to the full (solitaire games don't count). Why is that? It is very simple. A game comprises a lot of objects, and, as a real-world simulator, it tries to reproduce "real" illumination and the laws of physics, and uses artificial intelligence components to emulate lifelike behavior of the characters, etc.

Hence the conclusion: if we regard modern computer games as a serious "load" for the computer and not just as "fun", then when buying a computer "for games" we should choose the most powerful one we can afford.

At the same time, we should keep the balance of the system as a whole in mind. Finding a balanced configuration is quite difficult and strongly depends on the task. In our case, we have already determined the requirement imposed on the computer: 3D games. We'll deal with the issue of optimum balance a bit later; for now, just as an illustration, here are some evident examples of unbalanced configurations. For example, a weak CPU and a powerful video card: evidently, a weak CPU is unlikely to unlock the video card's full potential. The opposite example: a powerful CPU paired with a clearly weak video subsystem is unlikely to deliver a high-quality picture.

So, we start our study with an analysis of the results we get in a typical performance test of video cards.

Notes on the tests

A few words on the figures you see in our tests. The matter is so evident that it is hardly ever thought about consciously, so to be precise let's spell it out in detail. Every figure produced in a test reflects the performance of a video card under certain conditions. What do these "conditions" mean? First, the test setup: the CPU type and clock speed, the RAM capacity, etc. Secondly, the application being tested, e.g. Half-Life 2. Thirdly, the demo within the game itself. Clearly, the absolute speed values (frames per second) will differ from game to game. Also, don't forget that the resulting figure depends on the complexity of the selected demo.

That raises the question: how "true to life" are results produced under such conditions? User computers come in a vast variety of configurations, and far from all of them have the top-end CPU used in the test setup. On the other hand, demos may vary greatly in complexity, even within a single game. So how should we interpret the results?

Fortunately, things are not as bad as they seem at first glance. As regards the demos used in the tests, the general rule that engineers at test labs adhere to is that a demo should be content-rich enough, "hard", and should reflect the capabilities of the graphics engine and the character of the game. If these conditions hold, we have every reason to assert that in other parts of the game the results will be no worse than those produced on the demo scene. That is, the results produced with the "right" demo scene are the so-called "bottom level" of performance.

Then what about the difference in PC configurations (let alone the differences among platforms)? Our methodology allows finding the correct solution to this complex problem. Strange as it may seem, we will be helped by a limitation that we started running into while testing the latest top-performance video cards.

Processor dependence

While reading various video card reviews you must have come across the phrase "the results are CPU-bound". The meaning of this phrase is easiest to demonstrate with a diagram like this:


Diagram 1

This diagram shows the results of testing the GeForce 7800GT and 7800GTX video cards in Half-Life 2. As you can see, contrary to expectations, combining two video cards in SLI mode adds nothing at all to the performance (measured in FPS): the ceiling is exactly the same for a single GeForce 7800GTX and for a pair of GeForce 7800GTX cards in SLI mode. Evidently, the growth of results is limited by something else, and that "something else" is the CPU. We took the diagram from an earlier review. Let me remind you that we used an AMD Athlon64 4000+ running at 2.4 GHz as the CPU. The above diagram is by no means a special case for Half-Life 2 and the 1024x768 resolution. You can find more diagrams like it here: http://www.3dnews.ru/video/ati_x1900xtx_crossfire/index04.htm. The point is simply that to see CPU-boundedness in tests, you need a really powerful video subsystem.

…Or a "weak" CPU. Indeed, if instead of the AMD Athlon64 4000+ used in our test setup we take a slower CPU, the FPS value that caps the performance of the video cards will be lower. With an even slower CPU, the ceiling will be lower still. To make this clearer, let's illustrate it with a diagram similar to Diagram 1 above, but laid out slightly differently: the result columns are vertical, and the nominal CPU performance runs along the horizontal axis. This gives us Diagram 2.


Diagram 2

Evidently, Diagram 2 is quite schematic. We intentionally did not label the CPUs for the left-hand and middle groups of results, restricting ourselves to descriptive terms like "a CPU weaker than Athlon 64 4000+". The red marks stand for the limit the results run up against. The left-hand and middle groups show only presumed results for weaker CPUs. We'll find out later what the real results look like.

Our desire to see what "CPU-boundedness" looks like in reality led us to think about a methodology for tests run with a «weak CPU». We thought nothing could be easier: just take a few processors of various clock speeds and check. Unfortunately, the results of such tests would hardly be comparable. What should we use as a measure of CPU performance? The various ratings are a fine kettle of fish: the ratings for Athlon and Sempron are different, and it is not quite clear how to compare the AMD and Intel platforms. That is why we took a different route.

Methodology for CPU-boundedness tests

We decided to use the real CPU clock speed in MHz as the measure of CPU performance, because it can be adjusted within some range and we can expect a proportional change in CPU performance without conversions to any ratings. As for the other parameters of modern CPUs, such as cache size, the number of memory controller channels, etc., for now we leave their effect out of consideration, although they could be taken into account properly. What matters most to us now is the linearity of clock speed as a measure of CPU performance.

We took the AMD Athlon64 4000+ as the basis. AMD processors perform very well in gaming applications in particular, and although the AMD Athlon64 4000+ is no longer the flagship model, it is still one of the most powerful CPUs in the line to date. The undoubted convenience of the AMD Athlon64 4000+ for this test is that its multiplier is not locked and can be adjusted downwards, and it is this feature that we decided to use. By varying only the CPU multiplier, we obtained a "line" of processors running at various clock speeds but absolutely identical in all other respects: cache size, system bus speed, etc. That is what allowed us to deduce certain regularities in the joint operation of the CPU and the video subsystem in graphics applications.

Since the material took quite a long time to prepare, we used various video cards as the video subsystem, but that does not affect the quality of the final result. Moreover, while working on the material we used this as indirect confirmation that the methodology was chosen correctly: according to it, different video subsystems should behave in exactly the same way under certain conditions, and we made sure of that more than once.

It all started back when the GeForce 7800GTX was the most powerful video card available. The drivers used for the tests were version 81.85; as we already said, the material was prepared over a long time, which is why the driver version is so old. Later on, we did not upgrade the video drivers, to keep the test results comparable and consistent. However, as we'll see later, the driver version is not decisive for the regularities we deduced. For verification we repeated the tests with driver version 84.21 and got similar results. For ATI fans, I'd like to note that we conducted exactly the same experiments with the Radeon X1900XTX and got the same results (within the measurement error).

OK, then, off we go.

"Measurements" of CPU-boundedness

Test setup
Bus: PCI-E
CPU: AMD Athlon64 4000+
Motherboard: ASUS A8N-SLI Deluxe
Memory: Kingston HyperX PC3200, 2x512 MB
OS: WinXP + SP2 + DirectX 9.0c
PSU: Hiper 525W

By varying the CPU multiplier we obtained the following set of CPU clock speeds (in MHz): 1000, 1200, 1400, 1600, 1800, 2000, 2200, 2400.
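The set of clock speeds can be reproduced with one line of arithmetic. The sketch below assumes a 200 MHz reference clock and integer multipliers from 5 to 12; neither value is stated in the article, they are chosen so that the top step matches the 2.4 GHz quoted for the Athlon64 4000+.

```python
# Assumption (not stated in the article): 200 MHz reference clock
# and integer multipliers from 5 to 12.
REFERENCE_CLOCK_MHZ = 200


def clock_speeds(multipliers, reference_mhz=REFERENCE_CLOCK_MHZ):
    """Return the CPU clock speed in MHz for each multiplier."""
    return [m * reference_mhz for m in multipliers]


speeds = clock_speeds(range(5, 13))
print(speeds)
```

Only the multiplier changes between steps, which is why the cache size and the bus speed stay the same across the whole "line" of processors.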

To start with, we take results in Half-Life 2 at 1024x768 in the «maximum details» mode but with antialiasing and anisotropic filtering disabled. There is no contradiction here: the «maximum details» settings are responsible for the image quality, while disabling AA/AF lets us obtain FPS values that do "rest against" the CPU performance. We plot the results on a graph, with the X axis denoting the CPU clock speed and the Y axis showing the video card performance in FPS.


Graph 1

The result is a curve that strongly resembles a straight line. In fact, that is just how it should be: if the video subsystem performance is not the limiting factor, the results are proportional to the CPU clock speed. Let me explain why. Let's see how the computer draws an image in the general case. For clarity, here is a drawing.


Figure 1

As you know, every 3D object is defined by a model made up of polygons, elementary geometric objects. In shaping every frame, the CPU calculates the number of objects, their location in space, the light sources, etc.; that is, the frame is built in a "skeleton" representation (in the drawing, it is a kettle made of "wires"). Then this "skeleton", along with the information on how it should be "painted", is passed to the video adapter. Finally, once all the required textures, light sources and shadows have been applied to the skeleton, we get the final image that we see on the monitor.

That is, the image is drawn in two main stages. The first stage, drawing the skeleton of the frame, is done by the CPU. The second stage, "painting the skeleton", is done by the video adapter.

Therefore, when the video subsystem performance (the speed of "painting") is more than sufficient, the number of frames per second is limited by the number of "skeletons" the CPU is able to process, that is, it is proportional to CPU performance. Certainly, this example is rather schematic, and the real pattern of load distribution between the CPU and the video card is more complex (which is why, in the general case, "the line of maximum results" does not have to be a straight line).
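The two-stage picture can be put into a few lines of code. This is only an idealized sketch: it assumes the CPU ceiling is a straight line through the origin with the slope implied by the ~146 FPS at 2400 MHz quoted in this article, and the video-card limits of 300 and 60 FPS are hypothetical values chosen for illustration.

```python
# Slope of the CPU ceiling, taken from the ~146 FPS at 2400 MHz quoted
# in the article and assumed to pass through the origin.
FRAMES_PER_MHZ = 146 / 2400


def frame_rate(cpu_mhz, gpu_fps_limit, frames_per_mhz=FRAMES_PER_MHZ):
    """Idealized FPS: the slower of the two stages sets the pace."""
    cpu_fps_limit = cpu_mhz * frames_per_mhz  # "skeletons" per second
    return min(cpu_fps_limit, gpu_fps_limit)  # "painting" may cap it


# Hypothetical video-card limits for a light and a heavy graphics mode.
for clock in (1000, 1600, 2400):
    light = frame_rate(clock, gpu_fps_limit=300)
    heavy = frame_rate(clock, gpu_fps_limit=60)
    print(f"{clock} MHz: light mode {light:.1f} FPS, heavy mode {heavy:.1f} FPS")
```

In the light mode the output grows with the clock speed, while in the heavy mode it stays at the video card's limit. Real measurements bend smoothly between the two regimes instead of switching sharply.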

Now we can state the physical meaning of the line on Graph 1: it is the maximum number of frames the CPU can process at a given clock speed. In other words, it is the upper bound of the results achievable for the given application on the given CPU under the set test conditions. That is, for each CPU clock speed the line shows the ceiling that we cannot exceed no matter how much we build up the capacity of the video subsystem.

That is just what the diagram at the beginning of the article demonstrates. Of course, that diagram presents results for the 4AA/16AF mode, but that makes no difference. The upper bound of ~146 FPS at a CPU clock speed of 2400 MHz also holds for the more powerful system based on Radeon X1900 CrossFire, as can be seen on this diagram.

Let's look at Graph 1 once again. You may have noticed that this graph is built not quite "correctly": the CPU clock speed values start not at «0» but at 1000 MHz. Yes, we intentionally built the graph this way to make it easier to judge the straightness of the resulting line. Now we redraw the graph so that the CPU clock speed values start at «0» MHz, and also add results for the resolutions 1280x1024 and 1600x1200, plus three more lines for the same resolutions in the 4AA/16AF mode.


Graph 2

Let's analyze the results. Evidently, increasing the load on the video subsystem (by raising the resolution and enabling the AA/AF modes) should make the FPS drop.

That is just what we see on the graph. Note how the character of the lines changes. The "easiest" of the modes presented here, 1024x768 NO AA/AF, is depicted by an almost straight line. As more load is applied to the video subsystem, the lines of results smoothly «bend down» towards the X axis in the right-hand part of the graph, at high CPU clock speeds, but in the left-hand part they keep the characteristic slope and almost merge into the slanted straight line (line 2). For the most "demanding" mode, the line of results becomes parallel to the X axis at high clock speeds (line 1). What does it all mean? When CPU performance is insufficient, the results practically do not depend on how "hard" the graphics mode is and are bounded only by the CPU performance (the slanted line). When the performance of the video subsystem is insufficient, the results at some point stop depending on the CPU clock speed (the horizontal line on the graph). The explanation is very simple: the video adapter processes only as many frames as it is able to "paint", even though the CPU could draw many more "skeletons".
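Under the same idealized assumptions (a straight slanted line through the origin and a perfectly flat horizontal line), the clock speed at which the two lines cross can be computed directly. The function names and the 73 FPS limit below are hypothetical; the slope is the one implied by the ~146 FPS at 2400 MHz quoted in the article.

```python
def knee_clock_mhz(gpu_fps_limit, frames_per_mhz):
    """Clock speed at which the slanted CPU line meets the horizontal shelf."""
    if frames_per_mhz <= 0:
        raise ValueError("frames_per_mhz must be positive")
    return gpu_fps_limit / frames_per_mhz


def limiting_factor(cpu_mhz, gpu_fps_limit, frames_per_mhz):
    """Name the component that bounds the result at this clock speed."""
    knee = knee_clock_mhz(gpu_fps_limit, frames_per_mhz)
    return "CPU" if cpu_mhz < knee else "video card"


slope = 146 / 2400  # FPS per MHz, from the ceiling quoted in the article
knee = knee_clock_mhz(73, slope)  # hypothetical 73 FPS video-card limit
print(f"knee at about {knee:.0f} MHz")
print(limiting_factor(1000, 73, slope))
print(limiting_factor(2000, 73, slope))
```

Below the crossing point a faster CPU raises the result; above it, only a faster video subsystem does.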

However, a few more very interesting and important conclusions can be drawn from the resulting graph. That is what we turn to now.

How to estimate a video card's performance correctly

This heading may seem far-fetched. What could be the problem: just take a top-end test setup, install a video card and run it in all the modes. This is normally how it is done. In fact, the question is about the interpretation of the results, and it is not as easy as it seems. Let's look at our graph once again for the "pitfalls". So as not to overload the picture, we have left only the lines for three modes.


Graph 3

The red line depicts the results of the video card in the 1024x768 NO AA/AF mode. What are we measuring in this case? The video card's performance? Unlikely. See how strongly the results vary as the CPU clock speed changes. But the video card is the same, so its performance cannot have changed! Conclusion: in this mode we are in fact measuring the CPU's performance at generating frames, because the results depend practically linearly on the CPU clock speed.

Now look at the brown line, which depicts the 1600x1200 NO AA/AF mode. The load on the video card has gone up substantially, and we no longer see a straight line. Nevertheless, the spread of results remains significant. So which value reflects the video card's performance? As we can see, a correct answer must state not only the test conditions but also the CPU performance (clock speed) at which the result was produced.

Finally, the third line (green) depicts the 1600x1200 4AA/16AF mode. Note that starting from a CPU clock speed of 1600 MHz, the results shown by the video card no longer depend on the CPU. That is why we can state with confidence that the FPS level of the horizontal "shelf" reflects the performance of the video card itself! Hence, we formulate

The criterion for correct comparison of video card performance (other things being equal): it is admissible to compare only those performance values (FPS) that correspond to the horizontal levels on the CPU-boundedness graph.

To that end, we need a test setup with a powerful enough CPU and a test mode chosen so as to produce a horizontal "shelf" of results. The resulting "shelf" is precisely the performance level of the video card in the given mode.
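The criterion can be checked mechanically on a series of measurements. The sketch below is one possible way to do it, not the procedure used for this article: the function name, the 3% tolerance, the three-point minimum and both data series are hypothetical.

```python
def find_shelf(clocks_mhz, fps, tolerance=0.03, min_points=3):
    """Look for a horizontal shelf at the high-clock end of a series.

    Returns (clock where the shelf starts, mean FPS of the shelf),
    or None if the last points are not flat enough.
    """
    if len(clocks_mhz) != len(fps) or not fps:
        raise ValueError("need two non-empty series of equal length")
    reference = fps[-1]
    start = len(fps) - 1
    # Walk left while the points stay within the tolerance band.
    while start > 0 and abs(fps[start - 1] - reference) <= tolerance * reference:
        start -= 1
    run = fps[start:]
    if len(run) < min_points:
        return None
    return clocks_mhz[start], sum(run) / len(run)


clocks = [1000, 1200, 1400, 1600, 1800, 2000, 2200, 2400]
heavy_mode = [42, 50, 57, 61, 61, 62, 61, 62]      # hypothetical data
light_mode = [61, 73, 85, 97, 110, 122, 134, 146]  # hypothetical data

print(find_shelf(clocks, heavy_mode))  # a shelf is found
print(find_shelf(clocks, light_mode))  # no shelf: the series is CPU-bound
```

Only a series for which a shelf is found yields a figure that can fairly be compared between video cards.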

Interpretation of results shown by multi-GPU systems

You are probably already familiar with the methods for combining video cards, SLI and CrossFire, and know that the purpose of these technologies is to boost the performance of the video subsystem. In simple terms (looking again at Figure 1), two video cards "paint" the "skeleton" of a frame much faster than a single one. We intentionally avoid the phrase "twice as fast", because ordinary arithmetic does not apply here and "two" is not always twice as fast as "one". Among the hindrances are the overhead of distributing the load between two video cards, the time spent on synchronization, etc. So a twofold performance boost from combining two video cards is possible only in theory; in practice the maximum boost is limited to approximately 80-90%, that is, 1.8-1.9 times. And even an 80% boost from installing a second video card is far from always observed. Using the above graphs, we can now explain why.
We take Graph 3 and, after adding a few lines, show how that can be done.


Graph 4

As before, the green, brown and red lines depict the results shown by the video card in various graphics modes depending on the CPU clock speed. The blue straight line is "the line of maximum results", which arises from the performance limit imposed by the CPU. The grey double-headed arrows show the theoretically possible boost from building up the video subsystem. As is easy to see, the "boost margin" is smallest for the red line, because it almost merges with the blue straight line. Hence we see a minimal or zero performance boost even if the capacity of the video subsystem is increased with SLI or CrossFire. For the brown line, which depicts a more demanding mode, the «boost margin» is somewhat higher, but it is still smaller than the theoretical limit of 80-90% (120 fps + 80% ~ 220 fps, but we gain merely about 50 fps). The most favorable situation is seen for the most demanding graphics mode, 4AA/16AF at 1600x1200. In this case the "boost margin" is even greater, so a combination of two video cards can show its full worth. As you can see, to make the most of SLI and CrossFire we need a powerful CPU (the "boost margin" grows along the X axis) as well as tests in demanding graphics modes.
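The "boost margin" reasoning amounts to capping the theoretical two-card result at the CPU ceiling. In the sketch below the function names are hypothetical, and the ceiling of about 170 FPS is not a figure from the article: it is inferred here from the quoted gain of "about 50 fps" over 120 fps.

```python
def dual_card_fps(single_fps, cpu_ceiling_fps, scaling=1.8):
    """Expected two-card result: ideal scaling, capped by the CPU ceiling."""
    return min(single_fps * scaling, cpu_ceiling_fps)


def boost_margin(single_fps, cpu_ceiling_fps, scaling=1.8):
    """FPS that a second card can actually add at this clock speed."""
    return dual_card_fps(single_fps, cpu_ceiling_fps, scaling) - single_fps


# Brown-line example: 120 FPS on one card, 80% theoretical scaling.
# Assumption: CPU ceiling of about 170 FPS, inferred from the ~50 FPS gain.
theoretical = 120 * 1.8
actual_gain = boost_margin(120, cpu_ceiling_fps=170)
print(f"theoretical: {theoretical:.0f} FPS, real gain: {actual_gain:.0f} FPS")
```

The lower the single-card result sits below the ceiling, the closer the real gain comes to the full 80%.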

Certainly, all these conclusions were intuitively clear back when the technologies for combining video cards were announced; we have simply shown vividly where the boost should be looked for.

You might ask: what are those orange dotted lines doing on the graph? Assume that the lower of these dotted lines depicts the results in an even more demanding mode (say, 4AA/16AF at 2048x1536). The upper dotted line runs at a level 80% higher, that is, it depicts the performance of two video cards in SLI or CrossFire (the lower dotted arrow). Then what does the upper dotted arrow show? Of course, it shows the remaining «boost margin», which can be realized, e.g., with … Quad SLI. As you can see, the search for a real performance boost in this case requires an even more demanding graphics mode and, of course, a powerful CPU. (Note: the Quad SLI example does not depict real performance values for this combination; it merely demonstrates that the approach used in this article can be successfully applied to such video solutions.)

Verification of the findings with other 3D-applications

Up to now, we have based our reasoning on a single 3D application, namely Half-Life 2 with the demo scene «d1_canals_09 3dnews02». How valid are our results for other applications? Let's check. Below are two consolidated graphs similar to Graph 2, but for the games DOOM 3 and F.E.A.R., using the demos built into these games.


Graph 5

As you can see, the overall picture is very similar to what we saw in Half-Life 2. Naturally, the absolute FPS values are different, but the overall behavior of the lines is preserved.


Graph 6

F.E.A.R. is so "hard" that even with the rather powerful 7800GTX we get horizontal "shelves" almost immediately, even in the NO AA/AF modes, that is, with antialiasing and anisotropic filtering disabled. Therefore, to find the "line of maximum possible results" we had to use the lowest available resolution, 640x480 (the dark green line on the graph). As for the much higher resolutions, some «roughness» of the lines is caused by the fact that the test built into F.E.A.R. outputs integer values, which at small absolute values gives a significant relative error.
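The size of that relative error is easy to estimate. The sketch assumes the built-in test rounds to the nearest integer, so a reported value can be off by up to half a frame per second; the article does not say whether the test rounds or truncates, and the FPS values below are hypothetical.

```python
def max_relative_error(reported_fps, max_abs_error=0.5):
    """Worst-case relative error of an integer FPS reading.

    Assumption: rounding to the nearest integer, i.e. up to 0.5 FPS off.
    """
    if reported_fps <= 0:
        raise ValueError("reported_fps must be positive")
    return max_abs_error / reported_fps


for fps in (20, 50, 150):  # hypothetical readings
    print(f"{fps} FPS: up to {max_relative_error(fps):.1%} off")
```

The lower the frame rate, the larger the share of the reading that rounding can account for, which is why the lines for the heaviest modes look rougher.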

Finally, the most popular synthetic tests: 3DMark. As an example, we took 3DMark’05.


Graph 7

As it turned out, as the CPU clock speed rises, the results for the GeForce 7800GTX (green line) under standard settings in 3DMark’05 flatten into a «shelf». According to our criterion for correct comparison of video card performance, that means the performance of the GeForce 7800GTX was measured correctly in this test. It also means that "weaker" video cards can be compared correctly with 3DMark’05.

I believe it is now clear why we decided not to present results from 3DMark’06. Since this benchmark includes CPU tests, we cannot obtain a horizontal "shelf" on the CPU-boundedness graph, so a correct comparison of video card performance would be questionable.

Back to Graph 7. To find the «line of maximum results» in this test we used the Radeon X1900XTX (since 3DMark’05 favors Radeons) and tested it at 320x240 (with the other test settings left unchanged). Although the resulting red line is not geometrically straight, it is quite suitable for the role of the «line of maximum results». As you can see, with the Athlon 64 4000+ running at 2400 MHz the maximum number of "marks" is about 12000, or 12500 if we follow the approximating curve. So far, none of the systems we have tested (7900GTX-SLI, CrossFire, Quad-SLI) has cleared the 12000 «marks» bar in 3DMark’05 on our test setup, which confirms our conclusions.

Forecasting the results

It is now time to move on to practical conclusions based on the theory we have developed.
You are probably curious to know what clock speed the Athlon 64 would have to be overclocked to in order to reach 20000 "marks" in 3DMark'05. We can now easily find that out with Graph 7.


Graph 8

We have merely changed the scale of Graph 7, extending it in both height and width. The result is evident: to reach 20000 «marks» in 3DMark’05, we need to overclock the Athlon 64 to at least 4000 MHz (real megahertz, not rating figures) and only then start overclocking the video subsystem.
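The same estimate can be reproduced numerically. The sketch assumes the «line of maximum results» is a straight line through the origin, which is a simplification, since the measured line is not geometrically straight; the function name is hypothetical, while the 12000 and 12500 "marks" at 2400 MHz are the figures read from Graph 7.

```python
def required_clock_mhz(target_score, reference_score, reference_clock_mhz):
    """Clock speed needed for target_score if the ceiling scales linearly."""
    if reference_score <= 0:
        raise ValueError("reference_score must be positive")
    return target_score * reference_clock_mhz / reference_score


# Ceiling of about 12000 "marks" at 2400 MHz, as read from the measurements.
print(required_clock_mhz(20000, 12000, 2400))
# Ceiling of about 12500 "marks" at 2400 MHz, from the approximating curve.
print(required_clock_mhz(20000, 12500, 2400))
```

With the measured ceiling the estimate comes out at 4000 MHz, and with the approximating curve at 3840 MHz, which agrees with the "at least 4000" read off the extended graph.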

Conclusion

In the next part of the review «CPU-boundedness of the video system», we'll introduce a methodology that can help you decide whether an upgrade is necessary, taking a scientific approach to the decision. You won't need any additional "hardware" for that: a few simple tests run on the computer to be upgraded will suffice.

Copyright © 2005 Digital-Daily. All Rights Reserved.
contact - info@digital-daily.com