2005.01.21 09:32 "Re: [Tiff] tiff question - performance on large files", by Sherlog

problem is the poor performance in case of very large files

I suspect the problem is due to the way libtiff just "walks" the list of images when doing various kinds of operations rather than maintaining a master in-memory list of all available image directories.

I think it is the same problem with quite a bunch of image format packages. Many of them seem to have been designed with single-image files in mind and hacked for multi-page TIFF support afterwards (not libtiff, of course <g>). Some - like LEAD tools - even re-open the file and close it again for reading field values of image pages that have just been retrieved (that is, one open-walk-close to retrieve a certain image page, one open-walk-close for each field value requested). This can add up to considerable amounts of time for files containing hundreds or even thousands of pages.

Add the brand-new network file handling b0rkenness in WinXP SP2 - which basically adds a semi-random delay of about 0.5 s to closing *large* network files - and you are looking at something like half an hour for what should take less than 5 seconds if you consider only network throughput and latency (and which does take less than 5 seconds if you load the file into memory and process it from there). Consider LEAD tools: reading image pages and two fields per page -> 1000 * (1 + 1 + 1) == 3000 file closes -> ~1500 seconds. I sh*t you not, this happened to our production system when we added a couple of WinXP SP2 boxen to help with the workload.

Of course, libtiff does not have the SP2 problem but it still suffers the consequences of TIFF files being simple linked lists (or, lately, trees) without global directory information. Accessing IFD <n> requires you to parse the <n-1> preceding IFDs and this can cause a lot of data to be pulled into OS FS caches, apart from the latency if network access is involved. For the past couple of years we avoided the problem by the simple expedient of storing single-page TIFFs in memo fields of data base tables, but for various reasons this will no longer do.
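To make the linked-list cost concrete, here is a minimal sketch of the walk. It assumes a little-endian classic TIFF (not BigTIFF); `make_chain` and `ifd_offset` are names made up for this illustration, and the generated file is synthetic test data with one dummy field per IFD, not a usable image.

```python
import io
import struct

def make_chain(n_pages):
    """Build a synthetic little-endian TIFF holding n_pages >= 1 tiny IFDs."""
    out = bytearray(struct.pack("<2sHI", b"II", 42, 8))  # header, first IFD at 8
    for page in range(n_pages):
        # Each IFD here is 18 bytes: entry count, one 12-byte entry, next pointer.
        nxt = len(out) + 18 if page < n_pages - 1 else 0
        out += struct.pack("<HHHIII", 1, 256, 4, 1, page + 1, nxt)
    return bytes(out)

def ifd_offset(f, index):
    """Return (file offset of IFD `index` (0-based), number of IFDs parsed)."""
    f.seek(0)
    order, magic, offset = struct.unpack("<2sHI", f.read(8))
    if order != b"II" or magic != 42:
        raise ValueError("this sketch handles little-endian classic TIFF only")
    hops = 0
    for _ in range(index):
        if offset == 0:
            raise IndexError("no such page")
        f.seek(offset)
        (count,) = struct.unpack("<H", f.read(2))
        f.seek(offset + 2 + 12 * count)      # skip the entries, land on the link
        (offset,) = struct.unpack("<I", f.read(4))
        hops += 1
    if offset == 0:
        raise IndexError("no such page")
    return offset, hops
```

Every hop is a seek plus two small reads at an unpredictable position, which is exactly what hurts on a network share. With `f = io.BytesIO(make_chain(5))`, asking for index 4 parses four IFDs before it can answer.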

I have seen some workarounds, which basically boil down to 'add fake out-of-line TIFF field at the end of the file that points to the last IFD'. This ensures constant-time append performance regardless of the number of pages already in the file, but of course this violates the rule that TIFF fields except for strip/tile offset and SubIFD (and old JPEG <g>) must not contain file offsets.

I think a better solution is to keep the trailer out of the TIFF structure entirely. From the POV of a TIFF reader it would not exist at all and from the POV of a thorough TIFF structure checker (does such a beast exist?) it would appear as some unaccounted-for space at the end of the file. So, structurally the file would be sound regardless of what's written at the end of the file since slack space content is not part of the specification.

The only drawbacks I can see are these:

(1) the space is wasted if an unaware tool appends a new page to the file
(2) the data will become stale or even corrupted if an unaware tool modifies pages within the file in a manner that leaves the file size unchanged

Regarding (2), if the trailer says that page <k> is located at file offset <o> and we find something looking like a valid IFD at this file offset this does not mean that we do indeed have a valid IFD. It may be a stale copy that is no longer part of the IFD chain because a tool rewrote the IFD for that page at another place within the file or simply deleted (unlinked) it. To a certain degree this can be worked around by including the file's last modified time in the trailer and careful choice of file open sharing modes (to be sure that no other process is currently modifying the file because the time stamp may only get updated at file close).

As far as I can see there are two compatible motivations for a trailer:

(1) locating the last IFD quickly in order to get constant append performance
(2) locating any IFD quickly (by page index) in order to get constant retrieval performance

I think if people are going to write such trailers - they are doing so already - it might be desirable to find a consensus as to how exactly to go about it. As far as I can see the trailer should contain

(1) a magic
(2) the total page (IFD) count of the 'main' chain
(3) an array of offsets for the main IFDs
(4) perhaps the file's last modified time when the trailer was generated

The last item is a bit thorny (low filetime resolution on UN*X, even lower on FAT, time warping twice a year under Win32, local time vs. UTC and so on) but in a business process this can work, even without the time thingy.
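One possible layout, purely as a strawman: the offset array first, then a fixed-size footer at the very end of the file so that a reader can find everything from the file size alone. The magic value `IFDINDEX`, the field order, the field widths and the function names are all assumptions of this sketch, not an existing convention.

```python
import struct

MAGIC = b"IFDINDEX"               # hypothetical 8-byte magic
FOOTER = struct.Struct("<QI8s")   # modified time (seconds), page count, magic

def build_trailer(offsets, mtime):
    """Serialize the offset array followed by the fixed-size footer."""
    body = struct.pack("<%dI" % len(offsets), *offsets)
    return body + FOOTER.pack(mtime, len(offsets), MAGIC)

def parse_trailer(data):
    """Return (offsets, mtime), or None if the file does not end in a trailer."""
    if len(data) < 8 + FOOTER.size:
        return None
    mtime, count, magic = FOOTER.unpack(data[-FOOTER.size:])
    if magic != MAGIC:
        return None
    start = len(data) - FOOTER.size - 4 * count
    if start < 8:                 # the array would overlap the TIFF header
        return None
    offsets = struct.unpack("<%dI" % count, data[start:start + 4 * count])
    return list(offsets), mtime
```

A reader would still have to compare the stored time against the file system's idea of the last modification, and treat the offsets as hints to be verified rather than as truth. Note also that writing the trailer itself changes the file's modified time, so a writer would need to record the time after the final write or reset it afterwards.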

Is there any interest in such schemes here on this list? I mean schemes that work when opening a multi-page TIFF 'cold'. If the chain of IFDs has been walked already then the software *should* have remembered the offset of the last IFD so that the next IFD offset can be patched in, which gives constant performance for repeated appends without any structural tricks like guerilla trailers.
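The 'warm' case can be sketched as follows: walk the chain once, remember where the last next-IFD link lives, and patch that link on every append. This works on an in-memory buffer and assumes little-endian classic TIFF with inline field values only; the `Appender` class and its attribute names are hypothetical, and a real writer would also have to emit the fields sorted by tag.

```python
import struct

class Appender:
    """Remembers the position of the last 'next IFD' link between appends."""

    def __init__(self, data):
        self.data = bytearray(data)
        self.walked = 0       # IFDs parsed during the one-time initial walk
        self.link_pos = 4     # position of the link to patch (header field at first)
        (offset,) = struct.unpack_from("<I", self.data, 4)
        while offset:
            (count,) = struct.unpack_from("<H", self.data, offset)
            self.link_pos = offset + 2 + 12 * count
            (offset,) = struct.unpack_from("<I", self.data, self.link_pos)
            self.walked += 1

    def append_ifd(self, entries):
        """Append one IFD given as (tag, type, count, inline value) tuples."""
        if len(self.data) % 2:
            self.data.append(0)            # IFDs must begin on a word boundary
        new_offset = len(self.data)
        self.data += struct.pack("<H", len(entries))
        for tag, typ, count, value in entries:
            self.data += struct.pack("<HHII", tag, typ, count, value)
        self.data += struct.pack("<I", 0)  # the new IFD terminates the chain
        # Patch the remembered link; no walk is needed, whatever the page count.
        struct.pack_into("<I", self.data, self.link_pos, new_offset)
        self.link_pos = new_offset + 2 + 12 * len(entries)
        return new_offset
```

The cost of `append_ifd` does not depend on the number of pages already present; only the constructor pays for the walk, and only once per open.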

P.S.: apologies to Frank Warmerdam - I accidentally flung this post into his inbox instead of posting it to the list. Mea culpa.

In the meantime I have found that newer versions of the LEAD tools allow the user to pass the file offset of a 'base' IFD when accessing pages in a multipage file. This means processing multiple pages in a big multi-page TIFF is O(n), instead of O(n**2) as with the unhinted approach. I think such an approach would be reasonably easy to integrate into libtiff, and it would remove a class of performance problems without requiring things like trailer records.
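A rough model of the difference, counting IFD parses with and without a hint. The parameter names `base_index` and `base_offset` are made up for this sketch and are not the LEAD tools or libtiff API; the test file is synthetic (little-endian classic TIFF, one dummy field per IFD).

```python
import struct

def make_chain(n_pages):
    """Build a synthetic little-endian TIFF holding n_pages >= 1 tiny IFDs."""
    out = bytearray(struct.pack("<2sHI", b"II", 42, 8))
    for page in range(n_pages):
        nxt = len(out) + 18 if page < n_pages - 1 else 0
        out += struct.pack("<HHHIII", 1, 256, 4, 1, page + 1, nxt)
    return bytes(out)

def find_ifd(data, index, base_index=0, base_offset=None):
    """Walk from a known IFD (or from the header) to IFD `index`.

    Requires index >= base_index. Returns (offset, number of IFDs parsed).
    """
    offset = base_offset
    if offset is None:
        (offset,) = struct.unpack_from("<I", data, 4)
    hops = 0
    for _ in range(index - base_index):
        (count,) = struct.unpack_from("<H", data, offset)
        (offset,) = struct.unpack_from("<I", data, offset + 2 + 12 * count)
        if offset == 0:
            raise IndexError("no such page")
        hops += 1
    return offset, hops

def total_hops(data, n_pages, hinted):
    """Visit pages 0..n_pages-1 in order and count all IFD parses."""
    total, base_index, base_offset = 0, 0, None
    for i in range(n_pages):
        offset, hops = find_ifd(data, i, base_index, base_offset)
        total += hops
        if hinted:
            base_index, base_offset = i, offset   # hint for the next call
    return total
```

For a 1000-page chain the unhinted loop parses 0 + 1 + ... + 999 = 499500 IFDs, the hinted one 999.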

Of course, the same speedup is possible even if the TIFF-reading library is uncooperative: extract the data for the page under consideration from the multi-page TIFF, repackage as a single-page TIFF, and then hand that to the uncooperative reader (e.g. libtiff). This is reasonably easy (we have been using such a scheme successfully for more than seven years) but it requires knowledge of TIFF fields where in addition to the usual TIFF field logic the *values* also need to be recalculated, e.g. StripOffsets, TileOffsets and so on.
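A stripped-down sketch of such a repackaging step, to show where the values need recalculating. It is not our production code: it assumes little-endian classic TIFF with strips only, and it ignores tiles, SubIFDs, old-style JPEG offset fields and everything else that carries file offsets. The function names are made up for the illustration.

```python
import struct

# Size in bytes of one value for each TIFF field type (1 = BYTE ... 12 = DOUBLE).
TYPE_SIZE = {1: 1, 2: 1, 3: 2, 4: 4, 5: 8, 6: 1,
             7: 1, 8: 2, 9: 4, 10: 8, 11: 4, 12: 8}
STRIP_OFFSETS, STRIP_BYTE_COUNTS = 273, 279

def raw_value(data, typ, count, field):
    """Return the value bytes of one entry, whether inline or out of line."""
    size = TYPE_SIZE[typ] * count
    if size <= 4:
        return field[:size]
    (pos,) = struct.unpack("<I", field)
    return data[pos:pos + size]

def as_ints(raw, typ):
    """Decode a SHORT (3) or LONG (4) array."""
    code = {3: "H", 4: "I"}[typ]
    return list(struct.unpack("<%d%s" % (len(raw) // TYPE_SIZE[typ], code), raw))

def extract_page(data, ifd_offset):
    """Repackage the page whose IFD sits at ifd_offset as a single-page TIFF."""
    (n,) = struct.unpack_from("<H", data, ifd_offset)
    entries = []
    for i in range(n):
        tag, typ, count, field = struct.unpack_from(
            "<HHI4s", data, ifd_offset + 2 + 12 * i)
        entries.append([tag, typ, count, raw_value(data, typ, count, field)])
    by_tag = {e[0]: e for e in entries}
    old_offsets = as_ints(by_tag[STRIP_OFFSETS][3], by_tag[STRIP_OFFSETS][1])
    lengths = as_ints(by_tag[STRIP_BYTE_COUNTS][3], by_tag[STRIP_BYTE_COUNTS][1])

    # New layout: header (8) | IFD | strip data | out-of-line values.
    cursor = 8 + 2 + 12 * n + 4
    strips, new_offsets = bytearray(), []
    for old, length in zip(old_offsets, lengths):
        new_offsets.append(cursor + len(strips))   # the recalculated *value*
        strips += data[old:old + length]
    by_tag[STRIP_OFFSETS][1] = 4                   # always rewrite as LONG
    by_tag[STRIP_OFFSETS][3] = struct.pack("<%dI" % len(new_offsets), *new_offsets)
    cursor += len(strips)

    ifd, extra = bytearray(struct.pack("<H", n)), bytearray()
    for tag, typ, count, raw in entries:
        if len(raw) <= 4:
            field = raw.ljust(4, b"\0")            # value fits inline
        else:
            if (cursor + len(extra)) % 2:
                extra.append(0)                    # keep values word-aligned
            field = struct.pack("<I", cursor + len(extra))
            extra += raw
        ifd += struct.pack("<HHI4s", tag, typ, count, field)
    ifd += struct.pack("<I", 0)                    # single page: end of chain
    return bytes(struct.pack("<2sHI", b"II", 42, 8) + ifd + strips + extra)
```

Ordinary out-of-line values only need their pointer rewritten, whereas for StripOffsets the array contents themselves have to change. TileOffsets would need the same treatment.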

Regards,

Sherlog