Matrix multiplication: CUDA GPU

The first CUDA example doesn’t include any performance-measurement code. Add some, and report how many clock cycles of overhead even a simple operation like that one incurs.
The code here measures only steps 3 through 5 of the five-step process of working with the GPU, that is, from copying the data to the GPU through copying the result back. Modify the code to measure both that span and the parallel (kernel) execution alone. How much is overhead, and how much is parallel execution? How does that match your understanding of Amdahl’s Law?
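A common way to take both measurements is with CUDA events; the sketch below (the kernel and sizes are placeholders, not the book's example) brackets the full copy-run-copy span with one event pair and the kernel launch alone with another:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void simpleKernel(float *d, int n) {  // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) h[i] = 1.0f;
    float *d;
    cudaMalloc(&d, bytes);

    cudaEvent_t totalStart, totalStop, kernStart, kernStop;
    cudaEventCreate(&totalStart); cudaEventCreate(&totalStop);
    cudaEventCreate(&kernStart);  cudaEventCreate(&kernStop);

    cudaEventRecord(totalStart);                      // steps 3-5 begin
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  // step 3: copy in

    cudaEventRecord(kernStart);                       // step 4 alone
    simpleKernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(kernStop);

    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);  // step 5: copy out
    cudaEventRecord(totalStop);
    cudaEventSynchronize(totalStop);

    float totalMs, kernMs;
    cudaEventElapsedTime(&totalMs, totalStart, totalStop);
    cudaEventElapsedTime(&kernMs, kernStart, kernStop);
    printf("total (copy+run+copy): %.3f ms, kernel only: %.3f ms\n",
           totalMs, kernMs);
    printf("overhead: %.3f ms\n", totalMs - kernMs);

    cudaFree(d); free(h);
    return 0;
}
```

The difference between the two readings is the serial overhead in Amdahl's terms: no matter how fast the kernel runs, the copies bound the overall speedup.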
Assuming that one clock cycle (a delta of 1) is 1/(2.2 GHz) seconds, calculate how many FLOPS we are getting out of the GPU.