CPU-boundedness of the video system. Part II – Effect of the CPU cache memory and memory speed
Author: Dmitry Sofronov
Date: 28/10/2006
Preface to the second part

Since the publication of the article «CPU-boundedness of the video system. Part I – Analysis» we have received extensive feedback from you, dear readers. Along with questions about whether a second part would be released, there were many remarks about the presented graphs and doubts about their trustworthiness in some specific cases. Today we explain some fine points that attracted special interest among readers but were not described in detail in the first part. We examine the effect of the CPU cache size and the memory operating speed on performance in 3D games. We will also approach the matter of comparing platforms as a whole. Well, let's go ahead.
«Non-fitment» No. 1, or where the "zero" has gone

Note the graph presented below.
![]()
This graph was taken from the first part of the review. We see that the lines reflecting the performance of the video card in various modes converge to the same slanting line as the CPU clock speed goes down. The "non-fitment" is this: if we extend this approximating line until it crosses the "FPS" axis, we see that the straight line does not arrive at the origin of coordinates but somewhat higher. It turns out that at zero CPU clock speed we could play at as much as 15 FPS, doesn't it?! How can that be?

If we set aside the test conditions under which we produced the results, then in theory such a situation is possible. Assume that we load some data, together with the shader programs that the video chip is to execute, into the memory of the video card and let everything run on its own; the CPU is not needed here. Examples of using video processors for mathematical computations are known. But under our test conditions it is physically impossible to produce such a result: something would have to compute the positions of objects in the game scene instead of the CPU!

How should the graph of CPU-boundedness behave as the CPU clock speed tends to zero? Running the experiment on real hardware is complicated by the fact that much lower values of the CPU multiplier are not available, and if we simply take a much weaker processor, we change the platform, that is, the test conditions, and can no longer compare the results adequately. What to do then? Let's try to predict the behavior of the CPU-boundedness curve using logic and considering the real behavior of a typical personal computer. To that end, we have to go deeper into the principles of operating systems with preemptive multitasking. Don't be scared by this long term. Most likely, you use just such an operating system for work and games.
We mean the well-known operating system Windows XP. Apart from Windows XP, Windows 2000 and the various flavors of Linux also belong to the group of preemptive multitasking operating systems. The trait of these operating systems that matters for our consideration is the way they use hardware resources, namely the distribution of processor time among several tasks that appear to run simultaneously. Sitting at a personal computer, we think that everything runs at once: downloading files from the Internet, playing music, recording a CD. In reality it is somewhat different. All the applications you have started on your computer run in a strict sequence! There is no contradiction in that. Since there is only one processor, the applications run one after another, in "pieces". But these pieces are so tiny, and the operating system switches between them so fast, that a human cannot perceive it, which creates the illusion of simultaneously running applications.

Simply put, all the operating time of the CPU is divided into portions, or time "quanta". These "quanta" are then "issued" to applications, along the lines of "here you are, use the processor for a couple of milliseconds". The core of the multitasking operating system itself consumes some of these "quanta" so that the system services can operate, and it needs some time to decide which "quantum" should be given to which application. That is, some processor time is lost (from the point of view of the user application) to servicing the operating system itself. All of this bears directly on our "non-fitment". Let me explain why.
If the operating system requires some fixed number of CPU time "quanta" for its own operation, then it is evident that as the CPU clock speed decreases, the number of vacant "quanta" that can be given to the application (in our case, a 3D game) drops faster than the CPU clock speed. This can be put another way. Suppose that at a CPU clock speed of 100 MHz the CPU performance is enough to service only the operating system. Then, to obtain the equivalent CPU clock speed, that is, the number of MHz available to the application, we should deduct those 100 MHz allocated to the operating system from the real CPU clock speed. In this case, at a CPU clock speed of 1000 MHz the "operating system correction factor" amounts to 10%, at 200 MHz it is as much as 50%, and at 100 MHz we get 0 FPS. The graph below illustrates all of the above.
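The arithmetic in this paragraph can be written out as a minimal sketch. It assumes, as the text does, a fixed 100 MHz share taken by the operating system; the function names and the default value are illustrative only and are not measurements from the tests.

```python
def equivalent_clock_mhz(cpu_mhz, os_overhead_mhz=100.0):
    """Clock speed left for the application after the OS takes its fixed share."""
    return max(cpu_mhz - os_overhead_mhz, 0.0)


def os_share(cpu_mhz, os_overhead_mhz=100.0):
    """Fraction of the CPU consumed by the OS (the 'correction factor')."""
    return min(os_overhead_mhz / cpu_mhz, 1.0)


# The three clock speeds mentioned in the text.
for mhz in (1000, 200, 100):
    print(f"{mhz} MHz: {equivalent_clock_mhz(mhz):.0f} MHz for the game, "
          f"OS share {os_share(mhz):.0%}")
```

Because the deducted share is fixed, the fraction lost grows as the clock speed falls, which is why the curve has to bend down towards zero before the clock speed itself reaches zero.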
![]()
The red dashed line depicts the presumed behavior of the CPU-boundedness curve as the CPU clock speed tends to zero. Attention! This line is drawn arbitrarily and does not reflect any experimental data!

You may find it strange that we devote so much time and attention to this matter. Processors of such low clock speeds are no longer used in personal computers, and at first glance such an experiment, even if we succeeded in performing it, would bring no practical benefit. That is true, but not quite. Let us ask ourselves the question: "how can the effect of the operating system on the speed of an application be excluded or minimized?" That is, is it really possible to produce a graph of CPU-boundedness that runs through the origin of coordinates? Running ahead, we say that it is possible if the operating system runs ... on another processor. We'll come back to this point a bit later.
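Extending the approximating line to the "FPS" axis, as discussed in this section, amounts to fitting a straight line and reading off its intercept. The sketch below uses ordinary least squares. The sample points are made up for illustration (chosen so that the intercept comes out at the 15 FPS mentioned above) and are not the measured values from the graph.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical points on the CPU-bound part of the graph (NOT measured data).
clock_mhz = [1000, 1200, 1400, 1600]
fps = [65, 75, 85, 95]

slope, intercept = fit_line(clock_mhz, fps)
print(f"slope = {slope:.3f} FPS per MHz, FPS at 0 MHz = {intercept:.1f}")
```

A non-zero intercept obtained this way is an artifact of extrapolating far outside the measured range, which is exactly the "non-fitment" described above.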
«Non-fitment» No. 2. A nonlinearity of the «line of maximum possible results»

We have just examined the behavior of "the line of maximum possible results" as the CPU clock speed decreases. Now let's see what happens if we move in the opposite direction and increase the CPU clock speed.
![]()
In fact, the essence of this "non-fitment" is clearly visible on the same graph. Namely, no matter at which point of the "line of maximum possible results" we draw the tangent, the results further along deviate downwards from the tangent as the CPU clock speed goes up. Why doesn't the graph follow a linear law, and why does it start "bending" towards the X axis? Let us name a few causes that can explain this phenomenon.

The first cause is the consumption of CPU capacity for the needs of the operating system. We have already discussed this.

The second cause is the effect of the CPU multiplier. What role does the multiplier play? If the CPU clock speed is increased by raising the multiplier, we raise the data-processing power of the CPU, but the data still has to be delivered to the CPU core, and the CPU bus speed remains unchanged. For tasks with a large amount of data that does not fit within the CPU cache, there may come a moment when the CPU has already processed the available data and is waiting for the next portion. That is, the processor starts running idle, which can be regarded as a reduction of the "effective" clock speed of the CPU.

The third possible cause is the pattern of processor time distribution between the graphics driver (executed on the CPU) and the game's own computations (also run on the CPU). The situation is somewhat entangled, since both tasks use the CPU, and the graphics driver can be regarded both as a component of the operating system (in terms of architecture) and as an important link in running a 3D application.

Other possible causes include the latency and bandwidth of the memory, the CPU bus, and so on.
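The second cause can be illustrated with a deliberately simplified model: each frame needs a fixed number of CPU cycles plus a fixed amount of time waiting for data that does not shrink when only the multiplier is raised. The per-frame figures below are assumptions picked for illustration, not values derived from the tests.

```python
def fps(cpu_mhz, compute_mcycles_per_frame=20.0, memory_ms_per_frame=4.0):
    """Frame rate when a frame costs CPU cycles plus a fixed memory wait.

    The compute time scales with 1 / clock speed. The memory wait does not,
    because the bus and memory speeds stay the same when only the CPU
    multiplier changes.
    """
    compute_ms = compute_mcycles_per_frame / cpu_mhz * 1000.0
    return 1000.0 / (compute_ms + memory_ms_per_frame)


for mhz in (1000, 1500, 2000, 2500):
    print(f"{mhz} MHz: {fps(mhz):.1f} FPS")
```

In this model doubling the clock speed from 1000 to 2000 MHz gives less than double the frame rate, and the gap widens as the clock speed grows, which reproduces the "bending" of the line towards the X axis. With the memory wait set to zero the dependence becomes strictly linear.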
This list of causes is neither final nor exhaustive, and if necessary we could find a few more factors that make the "line of maximum results" deviate from a straight line. Determining the extent of each cause and searching for the bottlenecks is quite an extensive topic for research. Before moving to specific matters, let us formulate a general postulate: in a multi-factor environment, a linear dependence of a quantity on a certain parameter can be achieved only if none of the other parameters imposes a restriction. In other words, the parameter against which the graph is built should be the most limiting one. Applied to our research into the CPU-boundedness of the video system, this means that the performance of all components other than the CPU should be sufficient and should not impose any restrictions. That is, the video card has to be powerful enough and has to operate in the lightest of modes (e.g., 640x480 instead of 1600x1200, other settings being equal), the RAM should run at the maximum speed, the effect of the operating system should be minimized, and so on.

One way or another, in practice we do see the "line of maximum possible results" rise as the CPU clock speed goes up. Although this rise is not strictly linear, it is good enough for estimating the limit of a platform's possible performance in 3D applications. Next we examine a few factors that affect computer performance in 3D applications. We will talk about things that we can control to some extent through the choice of the CPU and the type of RAM, that is, through the choice of the "platform" for running 3D applications.
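The postulate can be expressed as a small sketch: the observed frame rate is the smallest of the limits imposed by the individual components, so the dependence on the CPU is linear only while the CPU limit is the lowest one. The linear CPU model, the `fps_per_mhz` coefficient, and the 60 FPS video card limit are assumptions for illustration.

```python
def cpu_limit_fps(cpu_mhz, fps_per_mhz=0.05):
    """Hypothetical CPU-imposed limit, taken as linear in the clock speed."""
    return cpu_mhz * fps_per_mhz


def observed_fps(limits):
    """The observed frame rate is capped by the most limiting component."""
    return min(limits.values())


def bottleneck(limits):
    """Name of the component that currently imposes the lowest limit."""
    return min(limits, key=limits.get)


for mhz in (800, 1000, 1400, 2000):
    limits = {"cpu": cpu_limit_fps(mhz), "video card": 60.0}
    print(f"{mhz} MHz: {observed_fps(limits):.0f} FPS, "
          f"limited by the {bottleneck(limits)}")
```

As long as the CPU entry is the smallest, the result follows the clock speed. Once another component becomes the minimum, the curve flattens into a "shelf", which is why the other components must be unrestrictive when the line of maximum results is measured.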
Effect of the memory (RAM) operating speed

In the first part of the review, despite the generality of the problem posed, we used a single platform: an Athlon 64 4000+ CPU, a motherboard based on the nForce 4 SLI chipset, and DDR400 memory running in dual-channel mode. Of these components, only the CPU clock speed was varied, by reducing the multiplier, whereas the FSB speed, the memory operating speed, and everything else remained unchanged. A reasonable question is how the graphs of CPU-boundedness will look when other parameters change. As is known, both the memory operating speed and the CPU cache size affect performance, so we will now examine the extent of their effect. You already know the test conditions used in the first part.
We used the method of finding the «line of maximum possible results», that is, for a selected 3D application we set the maximum possible resolution with neither full-screen anti-aliasing (FSAA) nor anisotropic filtering (AF). In this case, the results are determined not by the video card performance but by the CPU, or even by the platform as a whole! Apart from the already conducted tests of the standard configuration with DDR400 memory in dual-channel mode, we produced results for the following configurations: Single Channel DDR400, Dual Channel DDR200, and Single Channel DDR200.
Don't be confused by the apparent "artificiality" of these memory modes. However strange it sounds, the Single Channel DDR400 mode can quite easily be found in users' home computers. The reasons are mundane: a single memory module installed on the principle of "I'll buy one more whenever I get the money", or two memory modules wrongly installed into the slots of the same channel. The Dual Channel DDR200 mode looks more exotic, but it can also be found sometimes. With 4 memory modules installed, some motherboards automatically reduce the operating speed to DDR333 or even DDR266 to improve stability. Reducing the speed to DDR200 is something of an exaggeration, but we simply want to illustrate how the results change under such minimum settings. The same applies to the Single Channel DDR200 mode. The results are depicted on the following graph.
![]()
What conclusions can be inferred from this graph? As it turns out, the memory operating speed matters more than the number of channels! The single-channel DDR400 mode is faster than the dual-channel DDR200 mode, although the maximum theoretical bandwidth is the same in both cases. Needless to say, the lowest results are demonstrated by single-channel DDR200 memory. Interestingly, the platform with Dual Channel DDR400 memory differs from the platform with Single Channel DDR200 memory in maximum memory bandwidth by as much as a factor of 4, but the difference in results (for the same CPU clock speed) is merely about 50%, that is, a factor of 1.5. The system with Dual Channel DDR200 memory lags behind the leader by 25%, whereas the system with Single Channel DDR400 lags by merely 10%. As regards the other possible types of memory (DDR333 and DDR266), the results of such systems will evidently fall between those of the DDR200 and DDR400 systems.

That is the answer to the question of how the operating mode and the memory speed affect the maximum possible results for a selected platform. We stress this phrase on purpose, since in reality the demonstrated FPS is normally restricted by the performance of the video card. Assume that under the conditions of our test some video card can demonstrate no more than 60 FPS. Then at CPU clock speeds above 1400 MHz a system with Single Channel DDR200 memory turns out to be quite enough to reveal all the capabilities of the video card! The practical conclusion for thrifty users is: don't be hasty in getting rid of old DDR266 and DDR333 memory. There is never too much RAM, and it will still be of use for some time. Now let's move to a more difficult question.
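The theoretical bandwidth figures behind this comparison can be checked with a short calculation. It assumes the standard 64-bit (8-byte) width of a DDR memory channel, so that DDR400 corresponds to 400 million transfers per second per channel; the configuration names mirror those used in the text.

```python
BUS_WIDTH_BYTES = 8  # one DDR memory channel is 64 bits wide


def peak_bandwidth_mb_s(mega_transfers_per_s, channels):
    """Theoretical peak memory bandwidth in MB/s."""
    return mega_transfers_per_s * BUS_WIDTH_BYTES * channels


configs = {
    "DDR400 Dual Channel": (400, 2),
    "DDR400 Single Channel": (400, 1),
    "DDR200 Dual Channel": (200, 2),
    "DDR200 Single Channel": (200, 1),
}

for name, (rate, channels) in configs.items():
    print(f"{name}: {peak_bandwidth_mb_s(rate, channels)} MB/s")
```

Single-channel DDR400 and dual-channel DDR200 come out equal in theoretical bandwidth, yet the tests above show them performing differently, and the 4-fold bandwidth gap between the fastest and slowest configurations translates into only about a 1.5-fold difference in results.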
The effect of CPU cache memory size

As is evident from the heading, the next subject of study is an attempt to estimate the influence of the CPU cache on the performance of the platform in 3D applications. As a large number of tests show, performance in 3D games does not depend much on the CPU cache size. Today, we are ready to move from intuitive perception to figures and present a quantitative estimate of the effect of CPU cache size on performance in games. The difficulty is that we cannot arbitrarily change the size of the CPU cache, so the only way to solve this problem is to compare two processors that differ only in cache size, with the other parameters unchanged.

Our "master" processor, which we have used so far in preparing materials for this review, is the Athlon 64 4000+. This processor offers 128 K of L1 cache and 1024 K of L2 cache. For comparison we could take the Athlon 64 3800+, whose L1 cache size is the same and whose L2 cache is half the size (512 K), but we decided to go even further. For comparison, we examine the Sempron Socket 754 family of processors, which is quite popular in the mainstream sector. The Sempron we use has the rating 3400+, a clock speed of 2000 MHz, 128 K of L1 cache, and 256 K of L2 cache. That is, the L2 cache of the Sempron is 4 times smaller than that of the Athlon 64 4000+.

As regards the correctness of the experiment, processors of the Sempron family also support lowering the multiplier, so we use the same methodology for building the graph of CPU-boundedness as in the case of the Athlon 64 4000+. The Sempron s754 differs from the Athlon 64 s939 in supporting only the single-channel memory mode and in the reduced cache size. We put the resulting "line of maximum possible results" on the same graph where we compared the performance of Athlon 64 platforms with various RAM types.
For the Sempron s754 platform, we used DDR400 Single Channel.
![]()
So, what can we see now? The results demonstrated by the Sempron s754 with DDR400 Single Channel almost reproduce the results produced by the Athlon 64 with DDR200 Dual Channel. Surprisingly, the Sempron s754 shows decent performance in 3D games and does not lag far behind its elder brethren. You might ask: that's fine, but where does the cache size come in, and how can its effect on the performance of the platform be estimated? That's very simple: let's remove everything unnecessary from the above graph and take a closer look. On the graph below, we left only the two lines that correspond to the Sempron s754 and Athlon 64 DDR400 Single Channel platforms. Note that for these platforms the speed and the operating mode of the memory are the same, and the only difference is the cache size of the processors.
![]()
As you can see, with a 4-fold difference in L2 cache size, the Sempron shows results merely 10-12% worse than the Athlon 64 running at the same clock speed. (Note: the results for the Sempron go no higher than 2000 MHz, since that is the maximum nominal speed of this processor, and overclocking would have changed the system bus speed and the memory speed, and therefore distorted the results.) The above graph also implies that for an Athlon 64 with 512 K of L2 cache the "line of maximum possible results" will take an intermediate position, that is, the difference compared to the Sempron will be even smaller. Therefore, for processors of the AMD K8 architecture, an increase in cache size has an insignificant effect on the overall performance of the platform in 3D games.

But what will happen if we install a powerful enough video card, such as a 7900GT, on the Sempron platform and enable the 1280x1024 4AA/16AF mode? We set the screen resolution to 1280x1024 because it is the "native" resolution of most 17" and 19" LCD monitors, whereas for most CRT monitors of the same sizes the recommended resolutions are 1024x768 and 1280x1024, respectively. We made the graphics mode more demanding by activating anisotropic filtering and full-screen anti-aliasing in order to demonstrate that a value processor is no reason to give up high-quality graphics.
![]()
As the graphs show, with the Sempron processor both the "line of maximum results" and the curve of CPU-boundedness for the 1280x1024 4AA/16AF mode lie below the respective lines for the Athlon 64. Such behavior of the lines is quite normal, since the game is fairly old and the video card used in the tests is powerful. Therefore, for both CPUs at the specified clock speeds in the 1280x1024 4AA/16AF mode we get a transitional area rather than a "shelf". But even so, it can be seen that a Sempron s754 at 1600 MHz (the clock speed of the lower-end models in this family) is quite capable of showing a result as high as 70 FPS.

Of course, alongside Sempron models with 256 K of L2 cache there are products with 128 K of L2 cache. But as was shown above, the CPU cache size has an insignificant effect on overall performance. Even if we deduct an extra 10% from the results for the Sempron s754 platform depicted on the graph, the performance of even the lowest Sempron models, with clock speeds as low as 1600 MHz, would be sufficient to provide over 60 FPS!

Of course, both Half-Life 2 and DOOM 3 are rather old games. You may object that in modern games the Sempron is weak and performance will be limited by the power of the CPU. Let's verify that using the game F.E.A.R., which is very demanding of system resources.
![]()
As you can see, when we build «the line of maximum results», the performance of the Sempron platform at the same CPU clock speed still lags a bit behind the Athlon 64 (as it should), but once we enable the quality mode, performance is immediately limited by the video card!

As regards the Sempron Socket AM2 processors, we did not test them while preparing this article. But proceeding from the above, we can assume that the performance difference in 3D games between Athlon 64 AM2 and Sempron AM2 processors will be even smaller, since Sempron AM2 processors offer a dual-channel memory controller like the Athlon 64 AM2 processors. We have to admit that on the Socket AM2 platform the research into the effect of CPU cache size could have been conducted with less effort. However, as you can see, we were able to do it by comparing the Socket 939 and Socket 754 platforms. Do not think that we decided to confine ourselves to the Socket 939 and Socket 754 platforms. Next in turn is the Socket AM2 platform. The results that we produced, albeit predicted in theory, are impressive all the same.
We have already compared various platforms, although they came from one and the same manufacturer and were close relatives. Let's complicate the task a bit and try to produce results using the same methodology, but this time for a dual-core processor. For that, we took an Athlon 64 X2 4000+ Socket AM2, whose nominal clock speed is 2000 MHz. Let's produce "the line of maximum results" for it in the same way as we did before.
![]()
Here we see a very interesting picture! Look: on the dual-core processor the "line of maximum results" coincides precisely with a straight line and runs precisely through zero! No wonder: the system services are executed on the first core, whereas the application (in our case, DOOM 3) is executed on the vacant second core. The results have gone up substantially, although we did not change the test conditions in terms of graphics settings. We did not even look for any patches to the game that could make efficient use of the dual-core architecture of the CPU. It turns out that on this graph we see a performance boost due to the second core, without any optimization of the game for the second CPU core.

Now we can answer the question "what performance boost does the dual-core architecture give in graphics applications not optimized for multiple cores?" The answer is evident from the graph. At the same CPU clock speed, a system with the dual-core CPU is 20-40% faster than a system based on the single-core processor. And that is without any optimizations!

Certainly, we are not going to confine our study to AMD processors only. In the near future, we will present the results of tests run following our methodology on the Intel Celeron, Pentium 4, Pentium D, and of course Intel Core Duo platforms. But we'll dwell on that in the third part of the review. Stay with us.