http://www.sisoftware.co.uk/?d=news&f=2012_release
* Benchmark Results Certification
---------------------------------
What is it? Certification uses the benchmark results submitted by users to work out whether your score (benchmark result) is valid (i.e. the device you tested is performing correctly) and how it compares to the scores obtained by other users when testing the same device.
By aggregating the scores submitted for each device and performing statistical analysis (e.g. computing mean/average, standard deviation, etc.) we can use statistical tools (e.g. normal distribution, T-distribution, etc.) to work out whether the score is within the expected range (confidence intervals).
You can see whether the score is above/below average, but also how significant the difference is. For some devices, +/-10% is a lot; for others +/-30% may be just fine. It depends.
Based on the variability of scores you can determine whether the performance of your device is consistent or varies significantly from test to test. A large variability would indicate a problem either with the device or your environment (e.g. OS device drivers, virus checkers, etc.) that should be addressed.
* Ranker functionality must be enabled: the reference results are downloaded from the Ranker, so if it is disabled, certification cannot obtain the reference results for the device tested.
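The kind of statistical check described above can be sketched as follows. This is a minimal illustration only: the scores, the 95% band based on the normal distribution and the `certify` helper are made-up assumptions, not Sandra's actual reference data or algorithm (which, as noted, may use a T-distribution for small samples):

```python
import statistics

# Hypothetical reference scores submitted by other users for the same device
reference_scores = [102.0, 98.5, 101.2, 99.8, 100.4, 97.9, 103.1, 100.6]

mean = statistics.mean(reference_scores)
stdev = statistics.stdev(reference_scores)  # sample standard deviation

# Approximate 95% confidence band using the normal distribution
low, high = mean - 1.96 * stdev, mean + 1.96 * stdev

def certify(score):
    """Return True if the score falls within the expected range."""
    return low <= score <= high

print(certify(100.9))  # within the band: performing as expected
print(certify(70.0))   # far below average: likely a device/driver problem
```

A score outside the band is exactly the "large variability" case: either the device or the environment (drivers, background software) needs attention.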
* General-Purpose (GP) Computing benchmarks - CPU vs. GP(GPU) vs. (GP)APU
-------------------------------------------------------------------------
Sandra's GP (formerly GPGPU) benchmarks may still be the only ones that allow full APU performance measurement against CPUs or even GP(GPU)s (through OpenCL), as they use the *same workload* as the native as well as the software VM (.Net/Java) counterparts, allowing apples-to-apples comparisons:
- GP Performance (OpenCL / DX CS / CUDA) = CPU Multi-Media / .Net Multi-Media / Java Multi-Media / Video Shading
- GP Cryptography (OpenCL / DX CS / CUDA) = CPU Cryptography / .Net Cryptography / Java Cryptography
- GP Bandwidth (OpenCL / DX CS / CUDA) = Memory Bandwidth / Video Bandwidth
As a user, you would not care if a program uses native CPU instructions, the GP(GPU) or even your APU (CPU+GPU) to get your work done faster.
The point is that you are benchmarking CPU+GPU together, not just the CPU or GPU individually; both resources (thus the whole APU) are used to perform the computations faster, which is the whole point.
As we support OpenCL, DirectX ComputeShader and CUDA, just about all (GP)GPUs are supported.
* System Overall benchmark - Reloaded
-------------------------------------
What is it? The updated version generates an overall system performance score (geometric mean) based on the individual benchmarks that allows system-to-system comparisons:
- Native CPU performance: CPU Arithmetic, Multimedia
- Software VM performance: .Net* Arithmetic, .Net Multimedia
- Native Memory & Cache performance
- Storage performance
- GP(GPU/APU)** (General Purpose) performance: Arithmetic, Memory
* Why .Net? While many applications have already been ported to .Net and WPF (Windows Presentation Foundation), the trend is accelerating with the launch of Windows 8/Server 2012, where new applications will need to use the METRO environment.
** Why GP(GPU/APU)? With the recent introduction of APUs (CPU with built-in (GP)GPU) we believe most future applications will use both (CPU+GPU aka APU) simultaneously for best performance.
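The geometric-mean combination can be illustrated with a short sketch. The benchmark names and scores below are made-up placeholders, not Sandra's actual categories, units or weighting:

```python
import math

# Hypothetical per-domain scores (arbitrary units); real Sandra results differ
scores = {
    "cpu_arithmetic": 120.0,
    "cpu_multimedia": 95.0,
    "net_arithmetic": 60.0,
    "net_multimedia": 55.0,
    "memory_cache": 80.0,
    "storage": 40.0,
    "gp_arithmetic": 150.0,
    "gp_memory": 110.0,
}

def overall(scores):
    """Geometric mean: the n-th root of the product of the n scores.
    Unlike the arithmetic mean, a single outlier score (very fast GPU,
    very slow disk) cannot dominate the overall result."""
    values = list(scores.values())
    return math.prod(values) ** (1.0 / len(values))

print(round(overall(scores), 1))
```

The geometric mean always lands between the lowest and highest component score, which is what makes it suitable for combining benchmarks measured in different units.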
* New Memory Latency Test: In-Page Random Access Pattern
--------------------------------------------------------
What is it? It is a new pattern that ensures the memory accesses stay "in-page", so we do not incur "out-of-page" latencies as we move beyond the L1D and L2 caches. The latencies reported are thus "best case" rather than "worst case" and match the latencies reported by vendors (which are always "in-page"/"best case").
Our view is that if you are unlucky enough to miss both L1D and L2, you are unlikely to be "in-page": considering the native page size is just 4kB, you are very likely to be "out-of-page". Even with large pages (2MB; these cannot easily be used in Windows, so very few applications, usually servers, support them), considering most L2 caches are around this size, you are still likely to be "out-of-page" if you missed L2.
We believe in giving you a choice, so you can select "in-page random access", "full random access" or "sequential access".
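The difference between the two random patterns can be sketched in a few lines. The buffer size and accesses-per-page count are illustrative assumptions; Sandra's actual walker is native code, but the index-generation idea is the same:

```python
import random

PAGE = 4 * 1024       # native x86 page size (4kB)
BUFFER = 1024 * 1024  # 1MB test buffer (illustrative size)

def page_of(index):
    return index // PAGE

def in_page_indices(per_page, rng):
    """Random accesses constrained to the current page: pages advance
    sequentially, so a random jump never crosses a page boundary."""
    out = []
    for base in range(0, BUFFER, PAGE):
        out.extend(base + rng.randrange(PAGE) for _ in range(per_page))
    return out

def full_random_indices(count, rng):
    """Unconstrained random accesses anywhere in the buffer."""
    return [rng.randrange(BUFFER) for _ in range(count)]

def page_changes(indices):
    """Count consecutive accesses that land on a different page."""
    return sum(1 for a, b in zip(indices, indices[1:])
               if page_of(a) != page_of(b))

rng = random.Random(42)
in_page = in_page_indices(4, rng)              # 4 accesses per page
full = full_random_indices(len(in_page), rng)  # same total access count
print(page_changes(in_page), page_changes(full))
```

With the in-page pattern the page changes only at each sequential page boundary, while the full-random pattern changes page on almost every access; that difference is exactly the "out-of-page" latency the new test avoids.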
Thanks go to Michael Schuette (Lost Circuits) and Joel Hruska for their testing, advice and support.
* Transcoding benchmark - CPU vs. GP(GPU) vs. (GP)APU
-----------------------------------------------------
The key advantage of Sandra's benchmark is WMF (Windows Media Foundation): it can use either software (CPU) transcoding, GP(GPU) (Intel/nVidia) or APU (AMD) transcoding depending on the encoders/decoders installed. So you can benchmark CPU vs. GP(GPU) or APU using the *same workload*.
Other benchmarks may use only software decoders/encoders, which means they test only CPU performance and ignore GP(GPU) or APU performance entirely. Only by using the hardware-accelerated decoders/encoders can you harness the power of the GP(GPU) and APU.
* Large-page support for all memory tests
-----------------------------------------
Using large-pages (2MB on x86/x64 Windows) instead of native pages (4kB) results in fewer out-of-page accesses and thus lower latencies. Unfortunately huge-pages (1GB) are not currently supported by Windows, and we do not know whether Windows 8/Server 2012 will enable them.
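One way to see why large pages reduce out-of-page accesses is to count how many page translations are needed to cover a working set; the working-set size and the TLB capacity below are illustrative assumptions, not a specific CPU:

```python
BUFFER = 64 * 1024 * 1024  # 64MB working set (illustrative)
SMALL = 4 * 1024           # native page size (4kB)
LARGE = 2 * 1024 * 1024    # large page size on x86/x64 Windows (2MB)
TLB_ENTRIES = 64           # typical L1 DTLB capacity (illustrative)

pages_small = BUFFER // SMALL  # translations needed with native pages
pages_large = BUFFER // LARGE  # translations needed with large pages

# With 4kB pages the working set needs far more translations than the
# TLB holds, so random accesses keep missing ("out-of-page"); with 2MB
# pages the whole buffer is covered by a handful of TLB entries.
print(pages_small, pages_large)
```

Here 16384 translations versus 32: the large-page buffer fits entirely within the assumed TLB, which is where the lower latencies come from.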
* FMA3 & FMA4 instruction set support for just released & future CPUs
---------------------------------------------------------------------
Using a 256-bit register width (instead of the 128-bit registers of SSE/2/3/4) yields further performance gains through greater parallelism in most algorithms. Combined with the increase in processor cores and threads, we will soon have CPUs rivaling GPGPUs in performance.
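The parallelism gain can be quantified with simple peak-throughput arithmetic. This is a simplified model under stated assumptions: the core count and clock are hypothetical, and the pre-FMA case is modeled as one floating-point operation per lane per cycle:

```python
def peak_gflops(cores, ghz, register_bits, fma):
    """Theoretical single-precision peak: 32-bit lanes per register,
    times 2 FLOPs per lane per cycle when multiply-add is fused (FMA),
    times cores and clock. A deliberately simplified model."""
    lanes = register_bits // 32  # 32-bit floats per register
    flops_per_cycle = lanes * (2 if fma else 1)
    return cores * ghz * flops_per_cycle

# Hypothetical 8-core 3.0GHz CPU
sse = peak_gflops(8, 3.0, 128, fma=False)      # 128-bit SSE, no FMA
avx_fma = peak_gflops(8, 3.0, 256, fma=True)   # 256-bit registers + FMA
print(sse, avx_fma)
```

In this simplified model the wider registers double throughput and the fused multiply-add doubles it again, a combined 4x per core, which is why the gap to GPGPUs narrows.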
-----------------------------------------------------------------------------