From Nuno:
The GPU on the Jetson (GK20A), with CUDA compute capability 3.2, also supports double precision.
Performance for gauge fixing on a 16^4 lattice volume using MILC+QUDA.
With overrelaxation code:
GPU: time_GK20A / time_980GTX
single ~11.2x
double ~9.7x
CPU: time_ARM / time_hybrid
single ~4.5x
double ~4.9x
With FFT:
GPU: time_GK20A / time_980GTX
single ~13.7x
double ~9.6x
The GTX980 has ~10x more CUDA cores than the GK20A.
Problems I found when compiling on the Jetson:
- had to remove the -m32 flag from the QUDA code
- cannot use cudaHostRegister(); from the CUDA 6.5 toolkit release notes:
"Mapping host memory allocated outside of CUDA to device memory is not allowed on ARM; because of this, cudaHostRegister() is not supported by the CUDA driver on ARM platforms. If required, cudaHostAlloc() with the flag cudaHostAllocMapped can be used to allocate device-mapped host-accessible memory"
- compiling was a bit slow.
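
For reference, the workaround suggested in the release notes might look roughly like the sketch below. It is only an illustration, not what QUDA actually does: the kernel scale() and the helper mapped_scale_demo() are made-up names, and the buffer size is arbitrary. The idea is to let CUDA allocate the host buffer with cudaHostAlloc() and the cudaHostAllocMapped flag, instead of registering an existing allocation with cudaHostRegister(), and then ask for the matching device pointer with cudaHostGetDevicePointer().

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: doubles every element in place.
__global__ void scale(double *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0;
}

// Hypothetical helper: returns true if the mapped buffer holds the
// expected values after the kernel has run.
bool mapped_scale_demo(int n) {
    // Request support for mapped host memory. On older toolkits this
    // has to happen before the CUDA context is created.
    if (cudaSetDeviceFlags(cudaDeviceMapHost) != cudaSuccess) return false;

    // Allocate host memory through CUDA instead of malloc() + cudaHostRegister().
    double *h = NULL;
    if (cudaHostAlloc((void **)&h, n * sizeof(double), cudaHostAllocMapped) != cudaSuccess)
        return false;
    for (int i = 0; i < n; ++i) h[i] = i;

    // Device-side pointer that refers to the same host buffer.
    double *d = NULL;
    if (cudaHostGetDevicePointer((void **)&d, h, 0) != cudaSuccess) {
        cudaFreeHost(h);
        return false;
    }

    scale<<<(n + 255) / 256, 256>>>(d, n);
    bool ok = (cudaGetLastError() == cudaSuccess);
    // Wait for the kernel before reading the buffer on the host.
    ok = (cudaDeviceSynchronize() == cudaSuccess) && ok;

    for (int i = 0; ok && i < n; ++i) ok = (h[i] == 2.0 * i);
    cudaFreeHost(h);
    return ok;
}

int main() {
    printf("%s\n", mapped_scale_demo(1024) ? "ok" : "failed");
    return 0;
}
```

The catch is that the buffer has to be allocated by CUDA from the start, so any code that currently calls cudaHostRegister() on memory it got from somewhere else would need its allocation path changed, not just the one call.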
Best regards,
Nuno