2008.06.19 21:34 "[Tiff] Scalability problem in tif_dirread.c - detecting loops in IFD offsets", by Steve Eddins
The file tif_dirread.c in 3.8.2 (and back at least to 3.7.0) contains code meant to detect a loop in the list of IFD offsets. The detection code maintains an array of IFD offsets already seen. When reading each new directory, it scans the array to see if the next offset has been seen before. Then it realloc's the array to make room for processing the next directory.
I think this particular code is related to a performance problem reported by some of our users. Believe it or not, some of our users are working with TIFF files containing tens of thousands of IFDs. They are complaining that our software takes longer and longer to read each image in the file. For example, a Sandia National Laboratory user sent us a 1.1 GB file containing around 129,000 IFDs. It takes about 200 time longer to read images at the end of the IDF chain, compared to the time it takes to read images at the beginning. I suspect that the loop-detection code is causing the time to read the N-th image to be N^2. (If I understand the code correctly, there are 129,000 calls to realloc required just to arrive at the last IFD, not to mention all those linear scans.) The constant of proportionality is very low, so you don't notice the N^2 behavior unless you are processing files with a very large number of IFDs.
I'm curious about this code and why it was put in. Was it inserted in response to actual perverted TIFF files detected in the wild? Or just plain-ol' cautious programming?
Either way, the current implementation scales badly with the number of IFDs.
I'm considering writing my own routine to quickly scan a TIFF file to determine the list of IFD offsets, and then patching tif_dirread.c to add a new method that can read an IFD at a specified offset. This would allow us to provide our users with much faster "random IFD" access for their TIFF files. I'm wondering if I need to implement loop detection in my IFD offset list routine. If so, I'd be tempted to use Floyd's cycle detection algorithm, which is linear time and requires no extra space.
Does anyone have any thoughts about the cycle detection code in tif_dirread.c, or about my ideas for implementing a work-around?
The MathWorks, Inc.