Skybuck's CUDA Memory Bandwidth Performance Test, Manual

Software version 0.14 created on 10 March 2015 by Skybuck Flying
Manual version 0.02 created on 21 February 2015 by Skybuck Flying

Skybuck's E-Mail Address: skybuck2000@hotmail.com
Software Website: http://www.skybuck.org/CUDA/BandwidthTest/

To use the program:

1. Open the program.
2. Click the "Start" button at the bottom; the bold status should update from "Not running" to "Starting..." to "Running".
3. Watch the bandwidth chart.
4. Wait until the status says "Stopping..." and/or "Not running". It is now done.
5. If there are problems, check the Log and e-mail the author, or try different settings.

How the program works:

The program divides the available amount of GPU RAM into blocks of size "MemoryBlockSize" (specified in bytes) and calculates how many blocks fit into the available GPU RAM. It allocates a block, measures its read speed, then allocates the next block, and so on until all available GPU RAM is used up.

To measure the speed of a block, the program spawns threads according to the formula: "EstimatedBandwidth" (specified in bytes) divided by 16 bytes (a 128-bit data transfer per thread, by reading a float4).

For example: the GT 520 has roughly 1 GB of RAM available. MemoryBlockSize is set to 128 MB (1024 x 1024 x 128 = 134217728 bytes). There is room for 7 blocks, so a total of 7 x 128 MB = 896 MB of RAM will be used.

The estimated bandwidth is calculated by the program as follows: memory clock frequency in hertz, multiplied by memory bus width in bits, multiplied by 2 for double data rate, divided by 8 for bytes.

For example: the GT 520 memory clock frequency is 600,000,000 hertz (600 MHz), its memory bus width is 64 bits, and its memory type is DDR3 (double data rate). So the estimated bandwidth is: (600,000,000 x 64 x 2) / 8 = 9,600,000,000 bytes per second.
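The estimated-bandwidth formula and the block count above can be sketched as follows (this is an illustrative sketch, not code from the program; the variable names are invented, and the 1000 MB figure stands in for the GT 520's "roughly 1 GB"):

```python
# Sketch of the manual's GT 520 example arithmetic (illustrative only).

MEGABYTE = 1024 * 1024

# Estimated bandwidth = clock (Hz) * bus width (bits) * 2 (DDR) / 8 (bits -> bytes)
memory_clock_hz = 600_000_000    # 600 MHz
bus_width_bits = 64              # GT 520 memory bus width
estimated_bandwidth = memory_clock_hz * bus_width_bits * 2 // 8
print(estimated_bandwidth)       # 9600000000 bytes per second

# Number of MemoryBlockSize blocks that fit into available GPU RAM
available_ram = 1000 * MEGABYTE  # hypothetical "roughly 1 GB" available
memory_block_size = 128 * MEGABYTE
block_count = available_ram // memory_block_size
print(block_count)                                   # 7 blocks
print(block_count * memory_block_size // MEGABYTE)   # 896 MB used
```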
The program will now calculate how many threads to spawn to measure this bandwidth: 9,600,000,000 bytes per second / 16 bytes = 600,000,000 threads per second (roughly). The actual bandwidth is unknown, so the program tries to execute 600 million threads per second and measures the kernel time that elapsed. It then computes the bandwidth as follows:

ActualBandwidth := (EstimatedBandwidth / KernelTimeElapsedInMilliseconds) * 1000;

This actual bandwidth number is then added to the bandwidth chart.

(In reality it first performs a "warm up round" per block, to make sure the block is read into GPU RAM and is not sitting in a page file or other "virtual RAM". The number of warm up rounds can be set with the "Rounds" setting. The default is 2, which means 1 warm up round and 1 actual measuring round per block.)

(There is also an additional chart called "Overhead". This chart was an experiment to see how much overhead memory is allocated per block. Usually this will be 0; if it is not, that is somewhat odd, and it is unclear why. It simply measures free memory before and after a block is allocated, and computes any difference beyond the block size that was allocated:

Overhead = OldFreeMemory - (NewFreeMemory + BlockSize)

If more memory was allocated than just the block size, the overhead is positive; if less, negative. Perhaps other programs influence the amount of memory available, or something else inside the driver does, perhaps the charts themselves; this is uncertain.)

(The estimated bandwidth approach is needed to prevent the kernel from completing too fast. It indicates to the program how much work to generate for the GPU, so as to get a good/long enough reading, which should last roughly 1 second, or 1000 milliseconds, per block.)

(The PTX file was edited after compilation to delete the branch and store instructions, to test read speed only.)
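The measuring arithmetic above can be sketched as follows (an illustrative sketch, not code from the program; the kernel time and free-memory figures are hypothetical, and the Overhead sign convention follows the prose: positive means extra memory was consumed):

```python
# Sketch of the thread-count, ActualBandwidth, and Overhead formulas (illustrative).

estimated_bandwidth = 9_600_000_000             # bytes per second (GT 520 example)
threads_per_second = estimated_bandwidth // 16  # one float4 (16 bytes) per thread
print(threads_per_second)                       # 600000000

# ActualBandwidth := (EstimatedBandwidth / KernelTimeElapsedInMilliseconds) * 1000
kernel_time_ms = 1250.0                         # hypothetical measured kernel time
actual_bandwidth = (estimated_bandwidth / kernel_time_ms) * 1000
print(actual_bandwidth)                         # 7680000000.0 bytes per second

# Overhead per block: free memory sampled before and after the allocation
old_free = 900 * 1024 * 1024                    # hypothetical free memory before
block_size = 128 * 1024 * 1024
new_free = old_free - block_size - 4096         # pretend the driver used 4 KB extra
overhead = old_free - (new_free + block_size)
print(overhead)                                 # 4096 (positive: extra memory used)
```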
(The program will now step down through compute capabilities if a launch failure occurs: it starts with the device's compute capability and proceeds to lower compute capabilities on each launch failure until all have been tried; if it then still fails, it stops.)

The workload settings allow the estimated bandwidth setting to be overridden; the volume bandwidth is the actual bandwidth that is going to be tested. The program tries to limit all workload settings to the estimated bandwidth so that the volume bandwidth does not exceed the estimated bandwidth by too much. Testing too-large bandwidths could make the program run for a long time; this constraint tries to prevent that. To increase the volume bandwidth, set the estimated bandwidth higher first.

The minimum for a workload setting is indicated by "min"; the maximum by "max". BlockWidth * BlockHeight * BlockDepth cannot exceed MaxThreadsPerBlock; the program forces a limit so this is not exceeded, as otherwise the kernel launch would fail.

EstimatedBandwidth is initially set to the theoretical maximum bandwidth; this could cause the program to run very long! Try lowering the estimated bandwidth if that is the case!
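The MaxThreadsPerBlock constraint can be sketched as follows (an illustrative sketch only: the manual does not say how the program shrinks the dimensions, so reducing depth first, then height, then width is an assumed strategy, and the 1024-thread limit is just a common device value):

```python
# Sketch of clamping BlockWidth * BlockHeight * BlockDepth to
# MaxThreadsPerBlock (illustrative; the shrink order is an assumption).

def clamp_block_dims(width, height, depth, max_threads_per_block):
    """Shrink block dimensions until width * height * depth fits the limit."""
    while width * height * depth > max_threads_per_block and depth > 1:
        depth -= 1
    while width * height * depth > max_threads_per_block and height > 1:
        height -= 1
    while width * height * depth > max_threads_per_block and width > 1:
        width -= 1
    return width, height, depth

# 32 x 16 x 4 = 2048 threads exceeds a 1024-thread limit,
# so depth is reduced until the product fits.
print(clamp_block_dims(32, 16, 4, 1024))  # (32, 16, 2)
```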