SIMD-friendly vectorized processing of particles
@hdembinski posted a detailed review of the benefits and a possible design for SIMD-friendly, vectorizable processing of particles in the process interface as part of MR !269 (merged). This is an important design decision that needs to be discussed, so his comment is reproduced here as a starting point for further discussion:
Hans Dembinski @hdembinski · 8 hours ago Maintainer
To clarify: I was suggesting to use a std::tuple if you stick to the static design, because std::tuple is a standard container for heterogeneous types, but a static design is not what I want.
I am absolutely convinced that this design, where you build the sequence at compile time, is a dead end. It is not an efficient design, and you are not doing yourself any favors with it. What you should use instead is a classic dynamic design with a virtual base class and derived classes. The minor additional cost of calling virtual functions is amortized if each process handles a chunk of particles at once rather than a single particle at a time.
A question to you: do you understand my arguments regarding cache efficiency and SIMD-friendliness? If you understand these, it should be absolutely clear that you should not proceed with a design that processes one particle at a time.
I spent a long workday developing a demo of what I have in mind. You can find it here:
It contains a simple ProcessList, a simple Stack, and some demo Processes. Everything is wrapped for Python with pybind11. You can build the ProcessList and configure the Processes from Python, which is what you usually want.
One iteration of the algorithm works as follows:
1. For each process in the process list, compute the step size for each active particle on the stack, and remember for each particle which process gave the smallest step size (this can be stored as a short index into the process list). https://github.com/HDembinski/corsika_dyn_demo/blob/722d011b93a278f44c49f29b7140ec196431d284/main.cpp#L306
2. Now each particle knows which process is handling it. We invert the association by looping over the particles and storing each particle index in a cache vector held by its process. https://github.com/HDembinski/corsika_dyn_demo/blob/722d011b93a278f44c49f29b7140ec196431d284/main.cpp#L310 https://github.com/HDembinski/corsika_dyn_demo/blob/722d011b93a278f44c49f29b7140ec196431d284/main.cpp#L125
3. We loop over the processes and let each one handle its associated particles. https://github.com/HDembinski/corsika_dyn_demo/blob/722d011b93a278f44c49f29b7140ec196431d284/main.cpp#L320
4. Particles that reached the observation level are sorted to the front of the stack and skipped in the next iteration; particles that fell below the minimum energy threshold are freed. This is the only time particles on the stack are copied around. https://github.com/HDembinski/corsika_dyn_demo/blob/722d011b93a278f44c49f29b7140ec196431d284/main.cpp#L294 https://github.com/HDembinski/corsika_dyn_demo/blob/722d011b93a278f44c49f29b7140ec196431d284/main.cpp#L324
Some key design points:
- StackParticle inherits from Particle to add the bookkeeping data for the processes. https://github.com/HDembinski/corsika_dyn_demo/blob/722d011b93a278f44c49f29b7140ec196431d284/main.cpp#L101 There are other ways to store this bookkeeping data; it could be a cache inside the ProcessList. Keeping it physically close to the Particle data is cache-friendly, because they are frequently accessed together.
- The base class ProcessBase defines two methods, step and run, which derived classes need to implement. Both operate on a range of particles at once, which is cache-friendly and enables optimizations. One can see in the generated assembly that the optimizer already inserts SIMD instructions in that part, even though I just wrote ordinary C++.
The code can be tuned further to gain even more speed after profiling. The central point is that users can now write alternative implementations of each Process that locally accelerate that Process's computation in some way. A process that only operates on one particle at a time can never benefit from SIMD or a GPU; that is impossible by design. Only by processing particles in chunks can one implement optimizations like this at all.
The other benefit is that I can fully configure and control the process list from Python.