Original Link: https://www.anandtech.com/show/5072/nvidias-tegra-3-launched-architecture-revealed



Originally announced in February of this year at MWC, NVIDIA is finally officially launching its next-generation SoC. Previously known under the code name Kal-El, the official name is Tegra 3 and we'll see it in at least one product before the end of the year.

Like Tegra 2 before it, NVIDIA's Tegra 3 is an SoC aimed at both smartphones and tablets, built on TSMC's 40nm LPG process. Die size has grown significantly, from 49mm² to somewhere in the 80mm² range.

The Tegra 3 design is unique in the industry: it is the first to put four ARM Cortex A9s on a chip aimed at the bulk of the high end Android market. NVIDIA's competitors have focused on ramping up the performance of their dual-core solutions, either through higher clocks (Samsung Exynos) or through higher performing microarchitectures (Qualcomm Krait, ARM Cortex A15). While other companies have announced quad-core ARM based solutions, Tegra 3 will likely be the first (and only) one to ship in Android tablets and smartphones in 2011 - 2012.

NVIDIA will eventually focus on improving per-core performance with subsequent iterations of the Tegra family (perhaps starting with Wayne in 2013), but until then Tegra 3 attempts to increase performance by exploiting thread level parallelism in Android.

GPU performance also gets a boost thanks to a larger, more efficient GPU in Tegra 3, but first let's talk about the CPU.

Tegra 3's Five Cores

The Cortex A9 implementation in Tegra 3 is an improvement over Tegra 2; each core now includes full NEON support via an ARM MPE (Media Processing Engine). Tegra 2 lacked any support for NEON instructions in order to keep die size small.


NVIDIA's Tegra 2 die


NVIDIA's Tegra 3 die, A9 cores highlighted in yellow

L1 and L2 cache sizes remain unchanged. Each core has a 32KB/32KB L1, and all four share a 1MB L2 cache. Doubling core count over Tegra 2 without a corresponding increase in L2 cache size is a bit troubling, but it does indicate that NVIDIA doesn't expect the majority of use cases to saturate all four cores. L2 cache latency is 2 cycles lower on Tegra 3 than on Tegra 2, while L1 cache latencies haven't changed. NVIDIA isn't commenting on L2 frequencies at this point.

The A9s in Tegra 3 can run at a higher max frequency than those in Tegra 2. With one core active, the max clock is 1.4GHz (up from 1.0GHz in the original Tegra 2 SoC). With more than one core active, however, the max clock is 1.3GHz. Each core can be power gated in Tegra 3, which wasn't the case in Tegra 2. This should allow lightly threaded workloads to execute on Tegra 3 in the same power envelope as Tegra 2. It's only in applications that fully utilize more than two cores that you'll see Tegra 3 drawing more power than its predecessor.

The increase in clock speed and the integration of the MPE should improve performance a bit over Tegra 2 based designs, but the real hope for a performance improvement obviously comes from using all four of Tegra 3's cores. Android is already well threaded, so we should see gains in the multithreaded portions of tasks like web page rendering.

It's an interesting situation that NVIDIA finds itself in. Tegra 3 will show its biggest performance advantage in applications that can utilize all four cores, yet it will be most power efficient in applications that use as few cores as possible.

There's of course a fifth Cortex A9 on Tegra 3, limited to a maximum clock speed of 500MHz and built using LP transistors like the rest of the chip (unlike the four-core A9 cluster, which uses faster general purpose transistors). NVIDIA intends for this companion core to be used for processing background tasks, for example when your phone is locked and in your pocket. In light use cases where the companion core is active, the four high performance A9s will be power gated and overall power consumption should be tangibly lower than on Tegra 2.

Despite Tegra 3 featuring a total of five Cortex A9 cores, only four can be active at one time. Furthermore, the companion core cannot be active alongside any of the high performance A9s. Either the companion core is enabled and the quad-core cluster disabled or the opposite.

NVIDIA handles all of the core juggling through its own firmware. Depending on the level of performance Android requests, NVIDIA will either enable the companion core or one or more of the four remaining A9s. The transition should be seamless to the OS, and as all of the cores are architecturally identical, any apps you're running shouldn't know the difference between them.
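To make the constraint concrete, here is a minimal sketch (in Python, purely illustrative and not NVIDIA's actual firmware logic) of how a governor might choose between the companion core and the main cluster. The threshold, thread count heuristic, and function names are assumptions; the clock limits come from the figures quoted above.

```python
# Hypothetical sketch of Tegra 3-style cluster switching, not NVIDIA's firmware.
# Invariant from the article: either the 500MHz companion core is active,
# or one to four of the main A9s are active -- never both at once.

COMPANION_MAX_MHZ = 500          # companion core ceiling (from the article)
MAIN_SINGLE_CORE_MAX_MHZ = 1400  # one main core active
MAIN_MULTI_CORE_MAX_MHZ = 1300   # two or more main cores active

def pick_configuration(required_mhz: int, runnable_threads: int) -> dict:
    """Return a plausible core configuration for a given performance request."""
    if required_mhz <= COMPANION_MAX_MHZ and runnable_threads <= 1:
        # Light/background load: power gate the main cluster, run the companion core.
        return {"cluster": "companion", "active_cores": 1,
                "max_mhz": COMPANION_MAX_MHZ}
    # Otherwise power gate the companion core and wake only as many
    # main cores as there are runnable threads (up to four).
    cores = max(1, min(4, runnable_threads))
    max_mhz = MAIN_SINGLE_CORE_MAX_MHZ if cores == 1 else MAIN_MULTI_CORE_MAX_MHZ
    return {"cluster": "main", "active_cores": cores, "max_mhz": max_mhz}

# Example: a locked phone syncing email vs. a four-thread game.
print(pick_configuration(300, 1))   # -> companion core only
print(pick_configuration(1200, 4))  # -> four main cores at up to 1300MHz
```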



The Tegra 3 GPU: 2x Pixel Shader Hardware of Tegra 2

Tegra 3's GPU is very much an evolution of what we saw in Tegra 2. The GeForce in Tegra 2 featured four pixel shader units and four vertex shader units; in Tegra 3 the number of pixel shader units doubles while the vertex processors remain unchanged. This brings Tegra 3's GPU core count up to 12. NVIDIA still hasn't embraced a unified architecture, but given how closely it's mimicking the evolution of its PC GPUs I wouldn't expect such a move until the next-gen architecture - possibly in Wayne.

Mobile SoC GPU Comparison
|                 | Adreno 225 | PowerVR SGX 540 | PowerVR SGX 543 | PowerVR SGX 543MP2 | Mali-400 MP4 | GeForce ULP | Kal-El GeForce |
| SIMD Name       | -    | USSE | USSE2 | USSE2 | Core  | Core | Core |
| # of SIMDs      | 8    | 4    | 4     | 8     | 4 + 1 | 8    | 12   |
| MADs per SIMD   | 4    | 2    | 4     | 4     | 4 / 2 | 1    | 1    |
| Total MADs      | 32   | 8    | 16    | 32    | 18    | 8    | 12   |
| GFLOPS @ 200MHz | 12.8 | 3.2  | 6.4   | 12.8  | 7.2   | 3.2  | 4.8  |
| GFLOPS @ 300MHz | 19.2 | 4.8  | 9.6   | 19.2  | 10.8  | 4.8  | 7.2  |
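The GFLOPS rows follow directly from the MAD counts: a multiply-add counts as two floating point operations, so peak throughput is total MADs x 2 x clock in GHz. A quick sanity check of a few of the table's entries in Python (the GPU names and clocks are simply those from the table above):

```python
# Peak GFLOPS = total MADs per clock x 2 FLOPs per MAD x clock speed (GHz).
# MAD counts taken from the comparison table above.
total_mads = {
    "Adreno 225": 32,
    "PowerVR SGX 543MP2": 32,
    "Mali-400 MP4": 18,
    "GeForce ULP (Tegra 2)": 8,
    "Kal-El GeForce (Tegra 3)": 12,
}

for clock_ghz in (0.2, 0.3):
    for gpu, mads in total_mads.items():
        gflops = mads * 2 * clock_ghz
        print(f"{gpu}: {gflops:.1f} GFLOPS @ {int(clock_ghz * 1000)}MHz")
# e.g. Kal-El GeForce: 12 x 2 x 0.2 = 4.8 GFLOPS @ 200MHz, 7.2 GFLOPS @ 300MHz
```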

Per core performance has improved a bit. NVIDIA worked on the timing of critical paths through the GPU's execution units to help it run at higher clock speeds. NVIDIA wouldn't confirm the target clock for Tegra 3's GPU, other than to say it's higher than Tegra 2's 300MHz. Peak floating point throughput per core is unchanged (one MAD per clock), but each core should be more efficient thanks to larger caches in the design.

A combination of these improvements and newer drivers is what gives Tegra 3's GPU its 2x - 3x performance advantage over Tegra 2, despite only a 50% increase in overall execution resources. In pixel shader bound scenarios there's an effective doubling of execution horsepower, so the 2x gains are more believable there. I don't expect many games to be vertex processing bound, so the lack of significant improvement there shouldn't be a big issue for Tegra 3.

Ready for Gaming: Stereoscopic 3D and Expanded Controller Support

Tegra 3 now supports stereoscopic 3D for displaying content from YouTube, NVIDIA's own 3D Vision Live website and some Tegra Zone games. In its port of Android, NVIDIA has also added expanded controller support for PS3, Xbox 360 and Wii controllers among others.

Tegra 3 Video Encoding/Decoding and ISP

There's unfortunately not too much to go on here, especially not until we have some testable hardware in hand, but NVIDIA is claiming a much improved video decoder and more efficient video encoder in Tegra 3.

Tegra 3's video decoder can accelerate 1080p H.264 High Profile content at up to 40Mbps, although device vendors can impose their own bitrate caps and file limitations on the silicon. NVIDIA wouldn't go into greater detail as to what's changed since Tegra 2, other than to say that the video decoder is more efficient. The video encoder is capable of 1080p H.264 Baseline Profile encode at 30 fps.

The Image Signal Processor (ISP) in Tegra 3 is twice as fast as what was in Tegra 2 and NVIDIA promised more details would be forthcoming (likely alongside the first Tegra 3 smartphone announcements).

Memory Interface: Still Single Channel, DDR3L-1500 Supported

Tegra 3 supports higher frequency memories than Tegra 2 did, but the memory controller itself is mostly unchanged from the previous design. While Tegra 2 supported LPDDR2 at data rates of up to 600MHz, Tegra 3 increases that to LPDDR2-1066, and DDR3L is supported at data rates of up to 1500MHz. The memory interface is still only 32 bits wide, resulting in far less theoretical bandwidth than Apple's A5, Samsung's Exynos 4210, TI's OMAP 4, or Qualcomm's upcoming MSM8960. This is particularly concerning given the increase in core count as well as GPU execution resources. NVIDIA doesn't expect memory bandwidth to be a limitation, but I can't see how that wouldn't be the case in 3D games. Perhaps it's a good thing that Infinity Blade doesn't yet exist for Android.
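As a rough illustration of why the single channel interface is a concern, theoretical peak bandwidth is simply the data rate multiplied by the bus width. The sketch below uses the data rates quoted above; the dual-channel LPDDR2-800 comparison point is an assumption about a competing SoC class (roughly what a part like Apple's A5 uses), not a figure from this article.

```python
# Rough theoretical peak bandwidth: data rate (MT/s) x bus width (bytes), in GB/s.
def peak_bandwidth_gbps(data_rate_mtps: int, bus_width_bits: int) -> float:
    return data_rate_mtps * (bus_width_bits / 8) / 1000

# Tegra 3: a single 32-bit channel at the data rates quoted above.
print(peak_bandwidth_gbps(1066, 32))   # LPDDR2-1066 -> ~4.3 GB/s
print(peak_bandwidth_gbps(1500, 32))   # DDR3L-1500  -> ~6.0 GB/s

# Assumed comparison point: a dual-channel (2 x 32-bit) LPDDR2-800 competitor.
print(peak_bandwidth_gbps(800, 64))    # -> ~6.4 GB/s
```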

SATA II Controller: On Die

Given that Tegra 3 will find itself in convertible Windows 8 tablets, this next feature makes a lot of sense. NVIDIA's latest SoC includes an on-die SATA II controller, a feature that wasn't present in Tegra 2.



ASUS' Transformer Prime: The First Tegra 3 Tablet

With Tegra 2, Motorola was the primary launch partner both for smartphones and tablets. Since then, ASUS has risen in the ranks and is now a serious competitor in the Android tablet space. It's no surprise that the first Tegra 3 tablet out of the gate is ASUS' Transformer Prime.

ASUS will launch the Transformer Prime in the US before the end of the year. The tablet's specs are below:

Tablet Specification Comparison
|            | ASUS Eee Pad Transformer | ASUS Eee Pad Transformer Prime | Apple iPad 2 | Samsung Galaxy Tab 10.1 |
| Dimensions | 271 x 175 x 12.95mm | 263 x 180.8 x 8.3mm | 241.2 x 185.7 x 8.8mm | 256.6 x 172.9 x 8.6mm |
| Display    | 10.1-inch 1280 x 800 | 10.1-inch 1280 x 800 Super IPS+ | 9.7-inch 1024 x 768 IPS | 10.1-inch 1280 x 800 PLS |
| Weight     | 675g | 586g | 601g | 565g |
| Processor  | 1GHz NVIDIA Tegra 2 (2 x Cortex A9) | 1.3GHz NVIDIA Tegra 3 (4 x Cortex A9) | 1GHz Apple A5 (2 x Cortex A9) | 1GHz NVIDIA Tegra 2 (2 x Cortex A9) |
| Memory     | 1GB | 1GB | 512MB | 1GB |
| Storage    | 16GB + microSD slot | 32GB/64GB + microSD slot | 16GB | 16GB |
| Pricing    | $399 | $499/$599 | $499 | $499 |

Final Words

At a high level Tegra 3 doesn't surprise us much. The improved GeForce GPU should deliver tangible performance gains both through increased operating frequency and more pixel shader hardware. CPU performance should also be better than Tegra 2 based designs thanks to an increase in clock speed, the inclusion of MPE and the availability of more cores for threaded applications. In the move from one to two cores we saw significant performance increases across the board in Android. I don't expect that we'll see gains of a similar magnitude in moving from two to four cores, but there will be some benefit.

For the majority of use cases I believe NVIDIA has done the hardware homework necessary to extend battery life. Individual cores can now be power gated and the companion core should do most of the lifting while your device is locked or mostly idle, processing background tasks.

How much of an impact we'll actually see from all of this remains to be seen. We hope to have our hands on the first Tegra 3 hardware in the coming weeks, so before the year is up we'll hopefully have some answers.
