2018.01.15 21:41 "Re: [Tiff] Strategies for multi-core speedups", by Larry Gritz
Hi, Bob, thanks for the reply.
Of course the rest of my app already is threaded, and it works so well that for certain use cases, the TIFF I/O is by far the biggest remaining chunk of wall clock time.
As for the timings, let's test your hypothesis!
Full read, then full write, of an 8k x 8k x 4 channel, float image, zip compressed. Running on a desktop machine with 16 cores. I'm using float simply because it's the one data type in common between TIFF and OpenEXR, so gives more of an apples-to-apples comparison on the compression and I/O.
libtiff, one thread: 16.2s
libtiff, multithread: 16.2s
OpenEXR, one thread: 18.3s
OpenEXR, multithread: 2.7s (6.8x improvement!)
I am certainly hitting a point of diminishing returns -- restricting it to 8 threads runs in 3.1s, for "only" a 5.9x speedup. So I'm not benefiting much beyond 8 threads, perhaps because of memory bandwidth, or perhaps just Amdahl's law and the serialized raw I/O and other overhead of using the library. But still, I'd celebrate a 6x I/O speedup any day of the week.
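For what it's worth, a back-of-the-envelope Amdahl's-law check (my own rough calculation, not a measurement) supports the memory-bandwidth suspicion: if you fit the parallel fraction from the 8-thread result, Amdahl alone would predict roughly 9x at 16 threads, noticeably better than the ~6.8x actually measured, so something besides serial overhead is capping it.

```python
# Back-of-the-envelope Amdahl's-law check on the OpenEXR timings above.
# Fit the parallel fraction p from the measured 8-thread speedup, then see
# what that p alone would predict for 16 threads.

def amdahl_speedup(p, n):
    """Predicted speedup for parallel fraction p on n threads."""
    return 1.0 / ((1.0 - p) + p / n)

# Measured 8-thread speedup: 18.3s single-thread vs 3.1s on 8 threads.
s8 = 18.3 / 3.1                            # ~5.9x
p = (1.0 - 1.0 / s8) / (1.0 - 1.0 / 8)     # solve Amdahl's law for p at n=8

predicted_16 = amdahl_speedup(p, 16)       # ~9.1x predicted
observed_16 = 18.3 / 2.7                   # ~6.8x actually measured

print(f"parallel fraction p ~= {p:.3f}")
print(f"Amdahl predicts {predicted_16:.1f}x at 16 threads; measured {observed_16:.1f}x")
```

Since the measured 16-thread speedup falls well short of what the fitted serial fraction predicts, a second bottleneck (plausibly memory bandwidth) is kicking in at higher thread counts.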
Interestingly, with a single thread, libIlmImf is slightly slower reading and writing the same size file, compared to libtiff. When allowed to be multithreaded, it absolutely clobbers tiff. But there's no reason to think that libtiff couldn't beat it for the multithread case as well, if only the APIs and internals supported it.
So I think we can establish that there's a big performance gain to be had with this strategy.
It would be great, long term, for libtiff to directly support multi-scanline/multi-tile reads and writes, with internal threading to at least parallelize the compression or decompression of strips/tiles.
Until then, I'm very interested in any suggestions for how I can hack it on the app side, while preferably still using libtiff for as much as possible. My read of the libtiff internals is that although I can read or write raw (uncompressed) strips or tiles, the codec implementations have state and are therefore not reentrant -- I can't call them in parallel on multiple raw strips that I've already read, nor can I call the compressors on multiple data blocks to generate raw strips in parallel.
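To make the shape of the workaround concrete, here's a minimal sketch of the pattern I have in mind, boiled down to its essence in Python: fetch the raw (still-compressed) strips serially, then decompress them in parallel. For Deflate-compressed TIFFs each strip should be an independent zlib stream, so plain zlib can stand in for the libtiff codec. This is a simplification -- it ignores the horizontal-differencing Predictor tag, which would have to be undone per row after inflation, and the strip buffers here are synthetic stand-ins rather than output of TIFFReadRawStrip.

```python
# Sketch of app-side parallel strip decompression, assuming
# COMPRESSION_DEFLATE strips, each an independent zlib stream.
# (Simplified: no Predictor handling, synthetic strip data.)
import zlib
from concurrent.futures import ThreadPoolExecutor

def decompress_strips_parallel(raw_strips, max_workers=8):
    """raw_strips: compressed strip buffers in strip order (as you'd get
    from TIFFReadRawStrip). Returns the decompressed strips, in order."""
    # zlib.decompress releases the GIL, so threads give real parallelism.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(zlib.decompress, raw_strips))

# Stand-in data: pretend each strip is 64 rows of an 8k-wide, 4-channel
# float image (8192 * 4 channels * 4 bytes * 64 rows).
strips = [bytes(8192 * 4 * 4 * 64) for _ in range(8)]
raw = [zlib.compress(s) for s in strips]
assert decompress_strips_parallel(raw) == strips
```

The writing direction would be the mirror image: compress blocks in parallel with zlib.compress, then hand each finished buffer to TIFFWriteRawStrip serially.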
> On Jan 15, 2018, at 12:23 PM, Bob Friesenhahn <firstname.lastname@example.org> wrote:
> On Mon, 15 Jan 2018, Larry Gritz wrote:
>> I tried to do this myself using libtiff -- for the common case of needing to read an entire image, my aim was read all the raw strips (TIFFReadRawStrip, serially), then dole them out to different threads to decompress in parallel (and when writing, compress in parallel, then TIFFWriteRawStrip each one serially). If libIlmImf is any indication, doing this on our typical 12 or 16 core machines ought to speed up TIFF I/O by an order of magnitude, easily.
> An order of magnitude is unlikely. 2-4x is possible. The problem is that a computer has limited memory bandwidth. With enough threading, the bottleneck moves from CPU to memory bandwidth and CPUs offer 2-4x memory-path parallelism (more on expensive server CPUs and less on desktop CPUs). EXR uses floating point which requires a lot more CPU so there is more to be gained with many cores.
> It is likely that even with sequential single-threaded I/O through libtiff that other parts of your app will benefit by threading. I see up to 4x benefit when reading SMPTE DPX format sequentially (one thread doing I/O at a time), but then allowing each thread to continue processing the data in a multi-threaded fashion.
>> But my plan was thwarted because it sure seems that the API for the codecs is inherently stateful and non-reentrant.
> This would be very useful to fix.
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/