Intel Woodcrest: the Birth of a New King
by Jason Clark & Ross Whitehead on July 13, 2006 12:05 AM EST
Posted in IT Computing
Architecture Summary
Woodcrest's home is a newer revision of the Bensley platform than what Dempsey launched with, which means that it's a drop-in part for newer Bensley-based systems. If all goes to plan, Clovertown (Quad-Core Xeon) should be a drop-in upgrade as well (depending on the system vendor). As we discussed in our Dempsey article, the Bensley platform features FB-DIMM with a peak bandwidth of 21GB/sec, SAS/SATA support and a 1066/1333MHz FSB.
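As a rough sanity check on the 21GB/sec figure, the peak number can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes four FB-DIMM channels of DDR2-667, each with a 64-bit (8-byte) data path; those parameters are assumptions for illustration and are not stated in the paragraph above. The helper name is hypothetical, and the same formula gives the theoretical peak of a single 1333MHz FSB.

```python
def peak_bandwidth_gb_s(mega_transfers_per_sec, bytes_per_transfer, channels):
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return mega_transfers_per_sec * 1e6 * bytes_per_transfer * channels / 1e9


# ASSUMPTION: 4 FB-DIMM channels, DDR2-667, 8 bytes per transfer.
fbdimm_peak = peak_bandwidth_gb_s(667, 8, 4)

# ASSUMPTION: 64-bit wide front side bus at 1333 MT/s, one bus only.
fsb_peak_per_bus = peak_bandwidth_gb_s(1333, 8, 1)

print(f"FB-DIMM peak: {fbdimm_peak:.1f} GB/s")        # FB-DIMM peak: 21.3 GB/s
print(f"FSB peak per bus: {fsb_peak_per_bus:.1f} GB/s")  # FSB peak per bus: 10.7 GB/s
```

These are theoretical ceilings only; sustained bandwidth in real workloads is lower.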
Woodcrest Highlights:
Shared 4MB L2 "Smart Cache"
Dempsey-based processors had a separate 2MB L2 cache for each core, but Woodcrest has 4MB of L2 cache shared between both cores. Because the cores share a single cache, data does not need to be replicated as it is with separate L2 caches; this results in more efficient data sharing between cores. The shared cache also helps with mismatched loads: when one core consistently uses more cache than the other, the CPU can allocate more L2 cache to that core. Both of these techniques are illustrated below.
Wide Dynamic Execution Enhancements
With the Intel Core micro-architecture, every execution core is 33% wider than in previous generations, allowing each core to fetch, dispatch, execute and retire up to four full instructions simultaneously. The Opteron, as well as all previous NetBurst Xeon processors, can only handle three at a time.
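The 33% figure follows directly from the two issue widths quoted above (four instructions versus three). A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
def relative_width_gain(new_width, old_width):
    """Fractional increase in issue width going from old_width to new_width."""
    return (new_width - old_width) / old_width


gain = relative_width_gain(4, 3)
print(f"{gain:.0%}")  # 33%
```

This describes peak width only; it says nothing about how often real code can keep all four slots busy.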
Macro Fusion
Macro-fusion combines certain common pairs of x86 instructions into a single instruction for execution. Without macro-fusion, four instructions at a time are fetched from the queue and each instruction is decoded into separate micro-ops. With macro-fusion, five instructions can be fetched at a time, and if a fusible pair is present it can be sent to a single decoder. A single micro-op can then represent two regular x86 instructions.
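To make the idea concrete, here is a toy model of the counting involved. It is not a description of Intel's decoder: it assumes every instruction decodes to exactly one micro-op, and it assumes the fusible pairs are a compare or test followed by a conditional jump, which is how Intel's public material describes macro-fusion on Core. The function names and the instruction list are hypothetical.

```python
# ASSUMPTION: fusible pairs are cmp/test followed by a conditional jump.
FUSIBLE_FIRST = {"cmp", "test"}


def is_conditional_jump(mnemonic):
    """True for jcc-style mnemonics such as jne or jz, but not plain jmp."""
    return mnemonic.startswith("j") and mnemonic != "jmp"


def count_micro_ops(instructions, macro_fusion):
    """Count micro-ops, treating each instruction (or fused pair) as one."""
    micro_ops = 0
    i = 0
    while i < len(instructions):
        fusible = (
            macro_fusion
            and i + 1 < len(instructions)
            and instructions[i] in FUSIBLE_FIRST
            and is_conditional_jump(instructions[i + 1])
        )
        i += 2 if fusible else 1  # a fused pair consumes two instructions
        micro_ops += 1
    return micro_ops


loop_body = ["mov", "add", "cmp", "jne", "mov"]
unfused = count_micro_ops(loop_body, macro_fusion=False)
fused = count_micro_ops(loop_body, macro_fusion=True)
print(unfused, fused)  # 5 4
```

In this toy example five x86 instructions become four micro-ops once the compare and branch are fused, which is the effect the paragraph above describes.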
Beyond 2 Sockets, is Intel's FSB still an Achilles Heel?
As we've seen in past benchmarks, the front side bus has been a thorn in Intel's side, especially in quad-socket systems. Whether the architectural changes that Intel has made with Woodcrest will relieve enough of that pressure to overcome the scalability of Opteron in four-socket configurations is unknown at this point. Intel is quite confident that, with the shared cache and its dual independent FSB running at 1333MHz, bus bandwidth is not a concern; at some point, however, the bus bottleneck will be a problem. One of Intel's architects has stated that an integrated memory controller is possible, and Intel has already shown us a demo of one.
59 Comments
Kiijibari - Thursday, July 13, 2006
Due to the integrated memory controller, the scaling of Opterons is "nearly" linear. 10% more frequency gives you around 8% better benchmark results. That is true for SMP setups, too, because you also add more memory channels with each CPU. Of course you have to set up NUMA correctly then (SRAT enabled, node interleave disabled). By using SRAT it may also be possible to raise the performance of a 2-way system. I am not sure if that was done for the benchmarks in the article; it just stated that NUMA was "enabled", not which kind ... :(
Anyways, I doubt that there will be a 3 GHz S940 Opteron. It will be S1207, i.e. it will feature DDR2 memory. Hence the performance scaling will be even better than "linear" (if you are willing to compare S940 vs. S1207) ;-)
cheers
Kiijibari
Accord99 - Thursday, July 13, 2006
Maybe if your benchmark is heavily CPU bound, but not every test is, especially ones dealing with multi-gigabyte databases where the storage subsystem becomes more important.
Too bad there weren't more Opteron scores, but a simple linear extrapolation from the two Opteron results would indicate that it would take a 3.4GHz Opteron to match the Woodcrest at saturation for the Dell DVD Store benchmark, while a 3GHz Opteron would match the Woodcrest on the Forum benchmark at saturation. At the lower load points, it would probably take 4+GHz.
Kiijibari - Thursday, July 13, 2006
Ah ok, sorry, if you were referring to the database test; of course I meant CPU-bound applications.
However, I can't see your point. If you are looking at databases, then the most important part is the I/O subsystem, as long as the workload does not stress the CPU too much. Thus I don't understand why a Woodcrest should be better than an Opteron or a NetBurst setup.
As long as they feature the same hard disks and controllers and the CPU load is low, performance should be the same.
But the two tests here were both CPU bound. You can see that in the DVD test: all systems tie until load 3 or 4; after that the Woodcrests pull away, due to their higher processing power.
With the Forum benchmark, well, I guess there were some problems with "throttling", mentioned in the text, thus the benchmark already benefits in stage 1 from the higher Woodcrest performance.
cheers
Kiijibari
Kiijibari - Thursday, July 13, 2006
Yes indeed, the test is (mostly) crap.
On the one hand it is ok with me, because you can get (somehow) a Woodcrest system and nothing better from AMD.
However, I expect something more in the conclusion then, but there are just "we don't know" sentences: "How those parts will compete with future AMD products is unknown".
Dear people at AnandTech, I give you a hint concerning that topic:
AMD will introduce 65nm technology. That will enable them to raise core clocks while lowering power consumption. This is really no big speculation, it's a well-known fact.
In addition to that, there is also an error in the article: Socket F won't add FB-DIMM, it will add DDR2. Download and read the updated BIOS guide from the AMD webpage. (For your convenience, here is the link: http://www.amd.com/us-en/assets/content_type/white... )
*No*, I repeat *no*, mention of FB-DIMM, but of course a lot of information about DDR2.
Hence you can easily draw the conclusion that AMD will have the better wattage package in 2007, as they lower CPU wattage with 65nm, and lower RAM wattage with DDR2, too.
Maybe the lowered DDR2 wattage will already be enough to even out the Woodcrest wattage advantage with a Socket F 90nm CPU, but that is speculation. I don't know the absolute wattage differences between DDR1 and DDR2.
Anyways, the current Intel advantage is just due to the former "mobile" CPU Core2. Everybody knows that NetBurst was/is a power-hungry monster and that FBDs draw more power than any other kind of memory nowadays. Thus, any wattage advantage is due to the CPU.
cheers
Kiijibari
defter - Thursday, July 13, 2006
Just a few months ago, when there were Conroe samples and benchmarks available, some people were saying: "we know nothing about real performance, let's wait for final benchmarks". Now when talking about 65nm Opterons these people are saying: "it's a fact that 65nm Opteron will be much faster", even though there aren't even any samples available. Funny how things change...
How about some real facts:
- Fastest 130nm K8 reached 2.6GHz
- First 90nm K8s became available in about October 2004
- First faster-than-130nm 90nm K8 (2.8GHz model) became available in June 2005
With the 130nm->90nm transition it took AMD 9 months until the newer process (90nm) achieved higher clock speeds than the older process (130nm). Now, you seem to think that this kind of situation is impossible with 65nm and K8 will get a sudden and major boost immediately?
Well, the last part is pretty obvious. DDR2 consumes significantly less power at equal bandwidth than DDR1. However, I would guess that AMD fans would scream bloody murder if somebody benchmarked a Socket F Opteron with DDR2-400 :) When comparing DDR1-400 against DDR2-667, I doubt that there will be a significant difference in power consumption.
Kiijibari - Thursday, July 13, 2006
Hi, you can't compare it to Conroe. Conroe is a new architecture; the first 65nm AMD chips will be a simple die shrink, nothing to worry about as long as AMD does not have major problems.
I haven't checked your introduction dates, but I remember something similar. AMD always introduces mid-range CPUs first. Because of that, I did not state that a 3 GHz part will be out around Christmas (this year). It will be some time in 2007; maybe they will skip higher-clocked parts and move to lower-clocked quad-core parts. I don't know. However, Intel will have a clock advantage until then. It could also be that they hold that advantage. But there is too much speculation in that: AMD is using SOI and adding SiGe with 65nm, Intel is not.
To the RAM power consumption issue:
FB-DIMMs run with DDR2 memory chips, too. Thus the additional +5W FBD wattage (source: http://www.techreport.com/etc/2006q2/woodcrest/ind... ) is (only) due to the controller.
I don't think that there is a big wattage difference between different DDR2 speed grades. Well, there is surely some, but the important thing is the voltage, and that is always lower with DDR2 than DDR1 (1.8V vs. 2.5V).
Concerning that topic:
http://download.micron.com/pdf/pubs/designline/dl1...
There they calculate 2.7W for moderate use of a DDR2-533 module and 3W for a DDR2-400 module under high load.
Compared to DDR1 that is 40-50% less power usage (http://www.kingston.com/press/2006/memory/05b.asp).
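The 40-50% figure is roughly what a first-order estimate from the voltages alone would give. The sketch below assumes the standard CMOS approximation that dynamic power scales with the square of the supply voltage, and it assumes capacitance and clock frequency are held equal, which real DDR1 and DDR2 modules do not guarantee. The function name is hypothetical.

```python
def dynamic_power_ratio(new_voltage, old_voltage):
    """Ratio of dynamic power at new_voltage vs old_voltage (P ~ C * V^2 * f).

    ASSUMPTION: capacitance C and frequency f are identical for both parts.
    """
    return (new_voltage / old_voltage) ** 2


ratio = dynamic_power_ratio(1.8, 2.5)  # DDR2 at 1.8V vs DDR1 at 2.5V
reduction = 1 - ratio
print(f"DDR2 uses about {reduction:.0%} less dynamic power")
# DDR2 uses about 48% less dynamic power
```

This ignores termination, refresh and speed-grade effects, so it should be read only as a plausibility check on the quoted range.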
Anyways, FBD adds +5W for every additional memory module. That's bad, because normally you have a lot of them in servers :(
But hopefully that will be changed with new, better controllers and/or bigger modules.
cheers
Kiijibari
JarredWalton - Thursday, July 13, 2006
Could be wrong (replying to Kiijibari), but DDR2 is pin compatible with FB-DIMMs; you just need to implement the correct memory controller. Initial Socket F should be registered DDR2, but the rumor mill has it that later revisions will include FB-DIMM support.
Kiijibari - Thursday, July 13, 2006
Sorry, I didn't see your post soon enough, otherwise I would have answered it in my other post :)
FBD is *not* pin compatible with DDR2 modules; FBDs have their own interface. But they use normal, off-the-shelf DDR2 chips, the same ones that are also used on DDR2 modules.
The advantage of this is that you can change the memory used on the modules, and the mainboard is still compatible with the new modules. That is because the system just sees the FBD module controller; what is behind it is not of interest.
Because of this, there will be compatible DDR3 FBDs for current Bensley platforms.
But there is also a rumour that DDR3 modules will be compatible with DDR2 ones, too. There is very little difference (only voltage and 8x prefetch) between DDR2 and DDR3, and the modules will feature the same 240-pin interface.
Anyways, AMD will go to FBD later (~2008), maybe after the wattage problem has been solved by Intel et al ;-) (source: http://pc.watch.impress.co.jp/docs/2006/0623/kaiga... )
But as I said earlier, Socket F is DDR2 only, no FBD in the BIOS guide.
cheers
Kiijibari
JarredWalton - Thursday, July 13, 2006
Er, sorry, but by "pin compatible" I mean it also uses 240 pins. Same size DIMMs, but you need the right type of memory controller - just like registered vs. unbuffered RAM. One way or another, I'm sure AMD will support FBD in the future; the question is when? Although, since they link RAM to each CPU socket it's certainly not as big of a concern since two sockets already support four memory channels. If they can get 4 registered DIMMs per channel, they're already up to 16 DIMM sockets.
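The DIMM count in that last comment is simple multiplication. A minimal sketch, assuming two memory channels per socket (implied by "two sockets already support four memory channels") and a hypothetical helper name:

```python
def total_dimm_slots(sockets, channels_per_socket, dimms_per_channel):
    """Total DIMM slots when each CPU socket has its own memory channels."""
    return sockets * channels_per_socket * dimms_per_channel


# ASSUMPTION: 2 channels per socket, derived from 4 channels across 2 sockets.
slots = total_dimm_slots(sockets=2, channels_per_socket=2, dimms_per_channel=4)
print(slots)  # 16
```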