In this entry I'm going to take someone's existing Go implementation of the smallpt global illumination renderer and try to parallelise it. I hope to demonstrate how easily CSP-based techniques lend themselves to multi-threading, even across machines!
The single-threaded Go version gives me:
% time ./smallpt 8
Rendering (8 spp) 100.00
721.50 user 5.08 system 12:06.84 elapsed 99%CPU
Single-threaded C++ with -O3:
% time ./small-sthread 8
Rendering (8 spp) 100.00%
61.76u 0.27s 62.04r ./small-sthread 8
OpenMP multi-threaded code, 2 cores + HT:
% time ./small-omp 8
Rendering (8 spp) 100.00%
88.55u 0.90s 24.83r ./small-omp 8
Oh dear, the Go version is amazingly piss poor: 12 minutes against 62 seconds for the single-threaded C++. That's an order of magnitude to find!
OK, I added an I/O thread to gather the pixels:
time ./smallpt.io 8
Rendering (8 spp) 100.00
849.65 user 54.19 system 14:02.77 elapsed 107%CPU
And, not surprisingly, I added about 50s of syscalls, some 10% overhead, but at least they seem to have ended up on 1 CPU.
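For readers who haven't seen the pattern, here is a minimal sketch of gathering pixels on a separate goroutine. It is not the code that produced the timings above: the `pixel` type, the `gather` function and the `shade` callback are all hypothetical names, and `shade` merely stands in for the real radiance calculation.

```go
package main

import "fmt"

// pixel is a hypothetical result record: where the sample goes and its value.
type pixel struct {
	x, y  int
	value float64
}

// gather renders a w*h image. The render loop only computes and sends;
// a single collector goroutine owns the image slice and does all the writes.
// shade is a stand-in for the real per-pixel radiance calculation.
func gather(w, h int, shade func(x, y int) float64) []float64 {
	img := make([]float64, w*h)
	results := make(chan pixel, w) // buffered, so the renderer need not block on every send
	done := make(chan struct{})

	// Collector: drains the channel until it is closed, then signals done.
	go func() {
		for p := range results {
			img[p.y*w+p.x] = p.value
		}
		close(done)
	}()

	for y := 0; y < h; y++ {
		for x := 0; x < w; x++ {
			results <- pixel{x, y, shade(x, y)}
		}
	}
	close(results)
	<-done // wait until the collector has written the last pixel
	return img
}

func main() {
	img := gather(4, 3, func(x, y int) float64 { return float64(y*4 + x) })
	fmt.Println(img)
}
```

Sending one message per pixel is the simplest thing to write, but every send is a synchronisation point, which may be part of why the extra goroutine costs more than it saves here.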
Ultimate overload: 1 thread per pixel.
It crashed.
With a thread pool of 4 on a machine with 2 HT CPUs:
time ./smallpt 8
Submitting (8 spp) 100.00
1106.38 user 369.83 system 15:47.15 elapsed 155% CPU
This used more CPU time but the running time was much the same. All the SMP gain is used up by the overhead.
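As an illustration of the thread pool idea, here is a rough sketch of a fixed pool of goroutines fed with whole rows instead of single pixels. It assumes that per-pixel hand-offs are where much of the overhead goes, which I have not measured. The names `renderRows` and `shade` are made up for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// renderRows feeds row numbers to a fixed pool of worker goroutines.
// shade is a stand-in for the real per-pixel radiance calculation.
func renderRows(w, h, workers int, shade func(x, y int) float64) []float64 {
	img := make([]float64, w*h)
	rows := make(chan int)
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for y := range rows {
				// Each row is handed to exactly one worker, so the writes
				// never overlap and no locking is needed.
				for x := 0; x < w; x++ {
					img[y*w+x] = shade(x, y)
				}
			}
		}()
	}

	for y := 0; y < h; y++ {
		rows <- y // one channel operation per row, not per pixel
	}
	close(rows)
	wg.Wait() // all workers have finished writing before img is returned
	return img
}

func main() {
	img := renderRows(5, 4, 4, func(x, y int) float64 { return float64(y*5 + x) })
	fmt.Println(img)
}
```

With a pool of 4 this does `h` channel operations in total, so the scheduler has far less to do than with one goroutine or one message per pixel.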
I manually inlined the function calls, getting rid of the Vec methods:
% time ./smallpt.un 8
Rendering (8 spp) 100.00
273.82 user 5.62 system 4:40.05 elapsed 99% CPU
Shit, now that I've fixed it, it is still 4x slower than the C++!
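To show what "manually inlining" means here, this is a small sketch of the same vector arithmetic written both ways. The `Vec` layout and the method names `Add` and `Mul` are assumptions for illustration and may not match the renderer's actual code.

```go
package main

import "fmt"

// Vec is a hypothetical 3-component vector, similar in spirit to smallpt's.
type Vec struct{ X, Y, Z float64 }

// Add and Mul are the kind of small value methods that got inlined by hand.
func (a Vec) Add(b Vec) Vec     { return Vec{a.X + b.X, a.Y + b.Y, a.Z + b.Z} }
func (a Vec) Mul(s float64) Vec { return Vec{a.X * s, a.Y * s, a.Z * s} }

// pointOnRay computes o + d*t using the methods: two calls, one temporary Vec.
func pointOnRay(o, d Vec, t float64) Vec {
	return o.Add(d.Mul(t))
}

// pointOnRayInlined is the same arithmetic written out per component,
// with no method calls and no intermediate Vec.
func pointOnRayInlined(o, d Vec, t float64) Vec {
	return Vec{o.X + d.X*t, o.Y + d.Y*t, o.Z + d.Z*t}
}

func main() {
	o := Vec{1, 2, 3}
	d := Vec{0.5, 0.25, 2}
	fmt.Println(pointOnRay(o, d, 4))
	fmt.Println(pointOnRayInlined(o, d, 4))
}
```

The hand-inlined form is uglier to read, so it is only worth doing in the hot inner loops, and only when the compiler isn't already inlining the methods for you.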
One interesting question I have: do goroutines that don't have a CPU of their own operate in a non-preemptive, co-operative mode like Limbo? So if I spawn a goroutine to do slow I/O, will it be interleaved, or will it sit there doing nothing while the heavy CPU work co-ops?
Goroutine version:
time ./smallpt_gr.go 8
1012.35 user 454.21 system 16:21.66 elapsed 149% CPU
You'll also need the vector code, which I split out into a library.