2018.01.15 23:29 "Re: [Tiff] Strategies for multi-core speedups", by Larry Gritz
Yes, thanks, I'm already an experienced programmer of large multithreaded systems.
For zip compression, it is easily demonstrable that the compression dominates the raw I/O by a large factor. The libIlmImf library is a proof of concept that merely parallelizing the decompression (of multiple scanlines, strips, or tiles, when reading large pieces of an image at once) can achieve a 6x or better speedup in wall-clock time, even with all the other sources of overhead that you mention, as well as serialized I/O and serialized processing of things like making sense of the image header and metadata.
There is no mystery here -- the I/O itself is still serialized, so if it accounts for, say, 10% of the total runtime, then even perfectly linear thread scaling on an unlimited number of cores, with no lock contention at all, caps out at a 10x improvement. Amdahl's Law and all that. That effect alone adequately explains why this application stops scaling beyond 8 or so cores.
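To make the Amdahl's Law arithmetic concrete, here is a small sketch (my own illustration, not anything from libtiff) of the speedup bound when a fixed fraction of the runtime stays serialized:

```python
def amdahl_speedup(serial_fraction, n_threads):
    """Maximum speedup when `serial_fraction` of the work cannot be
    parallelized and the rest scales perfectly across `n_threads`."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_threads)

# With serialized I/O at 10% of total runtime:
print(round(amdahl_speedup(0.10, 8), 2))      # 8 cores  -> ~4.71x
print(round(amdahl_speedup(0.10, 10**9), 2))  # "unlimited" cores -> ~10x
```

Note how quickly the curve flattens: going from 8 cores to effectively infinite cores barely doubles the speedup, which matches the observed plateau around 8 threads.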
But there's still a huge win to be had by parallelizing the decompression.
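The shape of that win can be sketched as follows. This is a hypothetical illustration (the helper names and the flat offset/size list are my assumptions, not the real libtiff strip API): the file reads stay serialized, while the zlib decompression of the already-read strips fans out across a thread pool, in the spirit of what libIlmImf does for EXR. In CPython this works because zlib releases the GIL while inflating.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def read_strips_serial(f, offsets_and_sizes):
    """I/O stays serialized: one reader walks the file in order."""
    chunks = []
    for offset, size in offsets_and_sizes:
        f.seek(offset)
        chunks.append(f.read(size))
    return chunks

def decompress_strips_parallel(compressed_strips, max_workers=8):
    """CPU-bound decompression of independent strips runs in parallel."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(zlib.decompress, compressed_strips))
```

Each strip is an independent compression stream, so there is no shared state to lock during the inflate itself; only the work handoff is synchronized.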
> On Jan 15, 2018, at 3:12 PM, Mike Stanton <email@example.com> wrote:
> To begin with, I am an Operating Systems guy. When you are looking at going multi-threaded, what have you determined your setup/teardown cost for the individual threads to be? How large a chunk will each thread be processing? Will the individual scanlines and/or tiles be sufficiently large to absorb the cost of that setup and teardown given the performance gain you are projecting?
> To set up a thread you not only have to create an instance of the execution environment on the CPU, you have to go through the memory manager to allocate space for stacks and thread-local variables. You are also introducing synchronization costs in parsing out the workload to the individual threads. The individual units of work will have to be considerably larger than the aforementioned setup, teardown, and synchronization actions. In addition to the behavior of the application, there is underlying locking done by the kernel in processing multiple memory allocations (Buffer and Page Pool) and I/O operations (which can also impact memory allocations due to page-pinning actions) against the same set of resources.
> The increases due to locking conflicts in modern systems are also not necessarily linear in nature. The magnitude of “loss due to friction” can be logarithmic depending on the access patterns to the data. This will be visible both in the application and in the supporting OS kernel, and in how the particular system architecture supports conflict resolution between shared owners of a data structure. As you introduce more threads sharing access to the data, there is the real possibility that boundary conditions in scalability could be encountered at higher concurrencies that were not visible in 4- or even 8-core solutions. When you get up to 15 or 20 cores you start seeing interactions where shared data structures (switching queues, workload dispatch points, buffer pools) can steal any gains realized by parallel execution. I have worked on systems that worked great with 8 threads sharing data, but when you get up to about 24 threads, the propagation delay of operand miss conditions or spin locks and semaphores results in a significant loss of capacity overall.
> Just a few thoughts.
> - M
> From: firstname.lastname@example.org [mailto:email@example.com] On Behalf Of Larry Gritz
> Sent: Monday, January 15, 2018 3:42 PM
> To: Bob Friesenhahn
> Hi, Bob, thanks for the reply.
> Of course the rest of my app already is threaded, and it works so well that for certain use cases, the TIFF I/O is by far the biggest remaining chunk