We've launched a new benchmark in Sandra 2013: the GP (GPU/APU) Cache and Memory Latency benchmark.
It works on the same principle as the System Memory Latency benchmark you know (and love?), but because GPUs differ from CPUs, some things are not quite the same.
Here is an article to clarify its operation through concrete examples: the different kinds of GP memory (global, constant, shared, private, texture), how the latency is measured through the different access patterns (full random, in-page random, sequential/linear), how TLB caches affect latencies, etc.
http://www.sisoftware.net/?d=qa&f=gpu_mem_latency
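To make the three access patterns concrete, here is a minimal sketch of how pointer-chasing index chains for them could be built. It is not Sandra's actual code: the function names, the page size parameter and the "shuffle within each page, visit pages in order" layout for the in-page pattern are all assumptions made for illustration. On a real device a kernel would repeat `idx = chain[idx]` many times and the elapsed time divided by the number of steps would give the latency per access; here the walk runs on the host only to show the chain layout.

```python
import random

def _chain_from_order(order):
    """Turn a visiting order into a chain: each element stores the index of the next one."""
    n = len(order)
    chain = [0] * n
    for k in range(n):
        chain[order[k]] = order[(k + 1) % n]
    return chain

def sequential_chain(n):
    """Sequential/linear: element i points to i + 1, and the last wraps to 0."""
    return _chain_from_order(list(range(n)))

def in_page_random_chain(n, page, rng):
    """In-page random (assumed layout): random order inside each page,
    pages visited one after another, so most hops stay within one page."""
    order = []
    for start in range(0, n, page):
        block = list(range(start, min(start + page, n)))
        rng.shuffle(block)
        order.extend(block)
    return _chain_from_order(order)

def full_random_chain(n, rng):
    """Full random: one random cycle over all elements, so any hop may cross pages."""
    order = list(range(n))
    rng.shuffle(order)
    return _chain_from_order(order)

def walk(chain, steps, start=0):
    """Dependent loads: each index comes from the previous load, so they cannot overlap."""
    idx = start
    for _ in range(steps):
        idx = chain[idx]
    return idx

def page_crossings(chain, page):
    """Count hops that land in a different page (a rough proxy for TLB pressure)."""
    return sum(1 for i, nxt in enumerate(chain) if i // page != nxt // page)

if __name__ == "__main__":
    rng = random.Random(2013)   # fixed seed, hypothetical value
    n, page = 1024, 64          # element counts, not bytes; illustrative only
    for name, chain in [
        ("sequential", sequential_chain(n)),
        ("in-page random", in_page_random_chain(n, page, rng)),
        ("full random", full_random_chain(n, rng)),
    ]:
        print(name, "page crossings:", page_crossings(chain, page))
```

In this sketch the sequential and in-page random chains leave each page exactly once per pass, while the full random chain crosses pages on most hops, which is the kind of difference that lets a latency test separate cache effects from TLB effects.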
GPUs are somewhat more secretive than CPUs: the different cache levels and types are not always published, still less the latencies of the various levels. While nVidia's specific architectures (G80, GT200) have been analysed through micro-benchmarking in CUDA before, here Sandra can benchmark and contrast different architectures of GPUs and APUs from different vendors through OpenCL (and CUDA).
Unlike you, we don't have many devices to test, but even the 4 devices tested here (2 GPUs, 2 APUs from different vendors) show pretty interesting results.
We've used these very results to improve our own benchmarks in Sandra 2013 (GP Cryptography - AES encrypt/decrypt kernels) with significant gains (+25-40%), especially on AMD and Intel (details in the next article ;)
While the optimisation is fairly simple, it is not that obvious without the latency data: we had worked with both vendors on the previous version and nobody thought of it. So the latency benchmark may be more useful than it first appears.