TIFF and LibTiff Mailing List Archive
2005.01.21 09:32 "Re: tiff question - performance on large files", by <sherlog@t-online.de>

> > problem is the poor performance in case of very large files
>
> I suspect the problem is due to the way libtiff just "walks" the list
> of images when doing various kinds of operations rather than
> maintaining a master in-memory list of all available image directories.
I think it is the same problem with quite a bunch of image format packages.
Many of them seem to have been designed with single-image files in mind and
hacked for multi-page TIFF support afterwards (not libtiff, of course <g>).
Some - like LEAD tools - even re-open the file and close it again for reading
field values of image pages that have just been retrieved (that is, one
open-walk-close to retrieve a certain image page, one open-walk-close for
each field value requested). This can add up to considerable amounts of time
for files containing hundreds or even thousands of pages.
Add the brand-new network file handling b0rkenness in WinXP SP2 - which
basically adds a semi-random delay of about 0.5 s to closing *large* network
files - and you are looking at something like half an hour for what should
take less than 5 seconds if you consider only network throughput and latency
(and which does take less than 5 seconds if you load the file into memory and
process it from there). Consider LEAD tools: reading image pages and two
fields per page -> 1000 * (1 + 1 + 1) == 3000 file closes -> ~1500 seconds. I
sh*t you not, this happened to our production system when we added a couple
of WinXP SP2 boxen to help with the workload.
Of course, libtiff does not have the SP2 problem but it still suffers the
consequences of TIFF files being simple linked lists (or, lately, trees)
without global directory information. Accessing IFD <n> requires you to parse
the <n-1> preceding IFDs and this can cause a lot of data to be pulled into
OS FS caches, apart from the latency if network access is involved. For the
past couple of years we avoided the problem by the simple expedient of
storing single-page TIFFs in memo fields of data base tables, but for various
reasons this will no longer do.
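
To make the cost of the linked-list layout concrete, here is a minimal sketch of walking the IFD chain of a classic (non-BigTIFF) file. It is not how libtiff is implemented; `build_tiff` and `find_ifd` are hypothetical names, and the fixture writes IFDs with a single dummy ImageWidth entry and no image data. Page indices are zero-based here, so reaching page n parses the n IFDs in front of it.

```python
import io
import struct

def build_tiff(num_pages):
    """Test fixture (assumption): little-endian TIFF, one dummy entry per IFD."""
    buf = bytearray(b"II" + struct.pack("<HI", 42, 8))   # header, first IFD at 8
    for page in range(num_pages):
        # ImageWidth (tag 256), type SHORT (3), count 1, value stored inline
        entry = struct.pack("<HHIHH", 256, 3, 1, page + 1, 0)
        next_offset = len(buf) + 2 + 12 + 4 if page < num_pages - 1 else 0
        buf += struct.pack("<H", 1) + entry + struct.pack("<I", next_offset)
    return bytes(buf)

def find_ifd(f, n):
    """Return (file offset of IFD n, number of IFDs parsed to get there)."""
    f.seek(0)
    order = "<" if f.read(2) == b"II" else ">"
    magic, offset = struct.unpack(order + "HI", f.read(6))
    if magic != 42:
        raise ValueError("not a classic TIFF")
    parsed = 0
    for _ in range(n):
        if offset == 0:
            raise IndexError("page index past end of chain")
        f.seek(offset)
        (count,) = struct.unpack(order + "H", f.read(2))
        f.seek(offset + 2 + 12 * count)      # skip the entries, land on the link
        (offset,) = struct.unpack(order + "I", f.read(4))
        parsed += 1
    if offset == 0:
        raise IndexError("page index past end of chain")
    return offset, parsed
```

Each loop iteration costs one seek and two small reads, so over a network share the latency grows linearly with the page index even though almost no data is transferred.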
I have seen some workarounds, which basically boil down to 'add fake
out-of-line TIFF field at the end of the file that points to the last IFD'.
This ensures constant-time append performance regardless of the number of
pages already in the file, but of course this violates the rule that TIFF
fields except for strip/tile offset and SubIFD (and old JPEG <g>) must not
contain file offsets.
I think a better solution is to keep the trailer out of the TIFF structure
entirely. From the POV of a TIFF reader it would not exist at all and from
the POV of a thorough TIFF structure checker (does such a beast exist?) it
would appear as some unaccounted-for space at the end of the file. So,
structurally the file would be sound regardless of what's written at the end
of the file since slack space content is not part of the specification.
The only drawbacks I can see are these:
(1) the space is wasted if an unaware tool appends a new page to the file
(2) the data will become stale or even corrupted if an unaware tool modifies
pages within the file in a manner that leaves the file size unchanged
Regarding (2), if the trailer says that page <k> is located at file offset
<o> and we find something looking like a valid IFD at this file offset this
does not mean that we do indeed have a valid IFD. It may be a stale copy that
is no longer part of the IFD chain because a tool rewrote the IFD for that
page at another place within the file or simply deleted (unlinked) it. To a
certain degree this can be worked around by including the file's last
modified time in the trailer and careful choice of file open sharing modes
(to be sure that no other process is currently modifying the file because the
time stamp may only get updated at file close).
As far as I can see there are two compatible motivations for a trailer:
- locating the last IFD quickly in order to get constant append performance
- locating any IFD quickly (by page index) in order to get constant retrieval
performance
I think if people are going to write such trailers - they are doing so
already - it might be desirable to find a consensus on how exactly to go
about it. As far as I can see the trailer should contain
- a magic
- the total page (IFD) count of the 'main' chain
- an array of offsets for the main IFDs
- perhaps the file's last modified time when the trailer was generated
The last item is a bit thorny (low filetime resolution on UN*X, even lower on
FAT, time warping twice a year under Win32, local time vs. UTC and so on) but
in a business process this can work, even without the time thingy.
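
The item list above could be serialized along the following lines. This is only a sketch under stated assumptions: no agreed trailer format exists, so the magic value `TIFTRLR1`, the field widths, and the choice to put a fixed-size footer at the very end of the file (so a reader can find it with a single seek from EOF) are all hypothetical. The timestamp is whatever the caller supplies; how it is compared against the file's actual modification time is left open.

```python
import struct

MAGIC = b"TIFTRLR1"                    # hypothetical magic, not a standard
FOOTER = struct.Struct("<QI8s")        # mtime, page count, magic (20 bytes)

def append_trailer(data, offsets, mtime):
    """Return data + [uint32 IFD offsets...] + fixed-size footer."""
    body = struct.pack("<%dI" % len(offsets), *offsets)
    return data + body + FOOTER.pack(mtime, len(offsets), MAGIC)

def read_trailer(data):
    """Return (offsets, mtime), or None if no plausible trailer is present."""
    if len(data) < FOOTER.size:
        return None
    mtime, count, magic = FOOTER.unpack_from(data, len(data) - FOOTER.size)
    if magic != MAGIC:
        return None
    start = len(data) - FOOTER.size - 4 * count
    if start < 8:                      # would overlap the TIFF header
        return None
    offsets = list(struct.unpack_from("<%dI" % count, data, start))
    if any(off < 8 or off >= start for off in offsets):
        return None                    # offsets must point inside the TIFF part
    return offsets, mtime
```

A reader that finds no trailer, or one that fails these checks, simply falls back to walking the chain. Passing the checks does not prove the offsets are current: as noted above, a stale IFD copy looks just like a live one.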
Is there any interest in such schemes here on this list? I mean schemes that
work when opening a multi-page TIFF 'cold'. If the chain of IFDs has been
walked already then the software *should* have remembered the offset of the
last IFD so that the next IFD offset can be patched in, which gives constant
performance for repeated appends without any structural tricks like guerilla
trailers.
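
A sketch of the "remember the last link" idea, assuming a little-endian classic TIFF held in a `bytearray`; the class name `TiffAppender` and its interface are hypothetical. The chain is walked once when the object is created, and every later append only patches the 4-byte link field whose position was remembered.

```python
import struct

class TiffAppender:
    """Remembers where the chain's terminating 'next IFD' field lives."""

    def __init__(self, buf):
        self.buf = buf                 # bytearray, little-endian classic TIFF
        self.ifds_walked = 0
        link = 4                       # header field holding the first IFD offset
        while True:                    # one cold walk, done once
            (offset,) = struct.unpack_from("<I", buf, link)
            if offset == 0:
                break
            (count,) = struct.unpack_from("<H", buf, offset)
            link = offset + 2 + 12 * count
            self.ifds_walked += 1
        self.link = link

    def append_ifd(self, entries):
        """Append an IFD built from pre-packed 12-byte entries; return its offset."""
        if len(self.buf) & 1:
            self.buf += b"\0"          # IFDs must begin on a word boundary
        offset = len(self.buf)
        self.buf += struct.pack("<H", len(entries)) + b"".join(entries)
        new_link = len(self.buf)
        self.buf += struct.pack("<I", 0)                 # new end of chain
        struct.pack_into("<I", self.buf, self.link, offset)
        self.link = new_link
        return offset
```

Usage, starting from a bare header purely as a fixture (a header with a zero first-IFD offset is not a valid TIFF by itself):

```python
buf = bytearray(b"II" + struct.pack("<HI", 42, 0))
app = TiffAppender(buf)
entry = struct.pack("<HHIHH", 256, 3, 1, 640, 0)   # ImageWidth = 640
app.append_ifd([entry])
app.append_ifd([entry])
```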
P.S.: apologies to Frank Warmerdam - I accidentally flung this post into his
inbox instead of posting it to the list. Mea culpa. In the meantime I have
found that newer versions of the LEAD tools allow the user to pass the file
offset of a 'base' IFD when accessing pages in a multipage file. This means
processing multiple pages in a big multi-page TIFF is O(n), instead of
O(n**2) as with the unhinted approach. I think such an approach would be
reasonably easy to integrate into libtiff, and it would remove a class of
performance problems without requiring things like trailer records.
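
The O(n) versus O(n**2) difference can be checked by counting IFD parses. The sketch below does not use the LEAD tools or libtiff APIs; `seek_page` with its `base_index`/`base_offset` hint parameters is a hypothetical stand-in for "start walking from a known IFD", and `build_chain` is a fixture that writes dummy one-entry IFDs in a little-endian classic TIFF.

```python
import struct

def build_chain(pages):
    """Fixture (assumption): each IFD is 18 bytes with one dummy entry."""
    buf = bytearray(b"II" + struct.pack("<HI", 42, 8))
    for page in range(pages):
        next_offset = len(buf) + 18 if page < pages - 1 else 0
        buf += struct.pack("<HHHIHHI", 1, 256, 3, 1, page + 1, 0, next_offset)
    return bytes(buf)

def seek_page(data, target, base_index=0, base_offset=None):
    """Return (offset of IFD `target`, IFDs parsed), optionally from a hint."""
    if base_offset is None:
        base_index = 0
        (offset,) = struct.unpack_from("<I", data, 4)   # start at the header
    else:
        offset = base_offset                             # start at the hint
    if target < base_index:
        raise ValueError("the chain can only be walked forwards")
    parsed = 0
    for _ in range(target - base_index):
        (count,) = struct.unpack_from("<H", data, offset)
        (offset,) = struct.unpack_from("<I", data, offset + 2 + 12 * count)
        parsed += 1
    return offset, parsed

def total_parses(data, pages, hinted):
    """Visit pages 0..pages-1 in order and count every IFD parse."""
    total, index, offset = 0, 0, None
    for page in range(pages):
        if hinted:
            offset, parsed = seek_page(data, page, index, offset)
            index = page
        else:
            offset, parsed = seek_page(data, page)
        total += parsed
    return total
```

For 100 pages the unhinted loop parses 0 + 1 + ... + 99 = 4950 IFDs, while the hinted loop parses 99.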
Of course, the same speedup is possible even if the TIFF-reading library is
uncooperative: extract the data for the page under consideration from the
multi-page TIFF, repackage as a single-page TIFF, and then hand that to the
uncooperative reader (e.g. libtiff). This is reasonably easy (we have been
using such a scheme successfully for more than seven years) but it requires
knowledge of TIFF fields where in addition to the usual TIFF field logic the
*values* also need to be recalculated, e.g. StripOffsets, TileOffsets and so
on.
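
A rough sketch of such a repackaging step, to show where the value recalculation comes in. It is not the scheme referred to above, whose details are not given. Assumptions: little-endian classic TIFF, strip-based page, and none of the other offset-carrying fields (tiles, SubIFDs, old-style JPEG, Exif IFD), which are rejected rather than handled. The function name `extract_page` is hypothetical.

```python
import struct

# Bytes per value for TIFF 6.0 field types 1..12.
TYPE_SIZE = {1: 1, 2: 1, 3: 2, 4: 4, 5: 8, 6: 1, 7: 1, 8: 2, 9: 4, 10: 8, 11: 4, 12: 8}
STRIP_OFFSETS, STRIP_BYTE_COUNTS, LONG = 273, 279, 4
# Tags whose values are file offsets too; each would need its own logic.
UNSUPPORTED = {324, 330, 513, 519, 520, 521, 34665}

def _ints(payload, typ):
    """Decode a SHORT or LONG array payload."""
    fmt = {3: "H", 4: "I"}[typ]
    return list(struct.unpack("<%d%s" % (len(payload) // TYPE_SIZE[typ], fmt), payload))

def extract_page(data, ifd_offset):
    """Repackage the stripped page at ifd_offset as a single-page TIFF."""
    (n,) = struct.unpack_from("<H", data, ifd_offset)
    fields = []                                  # [tag, type, count, payload]
    for i in range(n):
        pos = ifd_offset + 2 + 12 * i
        tag, typ, count = struct.unpack_from("<HHI", data, pos)
        if tag in UNSUPPORTED:
            raise NotImplementedError("tag %d needs its own relocation" % tag)
        size = TYPE_SIZE[typ] * count
        start = pos + 8 if size <= 4 else struct.unpack_from("<I", data, pos + 8)[0]
        fields.append([tag, typ, count, data[start:start + size]])
    by_tag = {f[0]: f for f in fields}
    old_offsets = _ints(by_tag[STRIP_OFFSETS][3], by_tag[STRIP_OFFSETS][1])
    lengths = _ints(by_tag[STRIP_BYTE_COUNTS][3], by_tag[STRIP_BYTE_COUNTS][1])
    # Reserve StripOffsets as LONGs; the real values are computed below.
    by_tag[STRIP_OFFSETS][1] = LONG
    by_tag[STRIP_OFFSETS][3] = bytes(4 * len(old_offsets))
    # Layout: header (8), IFD, out-of-line field values, then strip data.
    cursor = 8 + 2 + 12 * n + 4
    placement = {}
    for tag, typ, count, payload in fields:
        if len(payload) > 4:
            placement[tag] = cursor
            cursor += len(payload) + (len(payload) & 1)   # keep word alignment
    new_offsets = []
    for length in lengths:
        new_offsets.append(cursor)
        cursor += length + (length & 1)
    by_tag[STRIP_OFFSETS][3] = struct.pack("<%dI" % len(new_offsets), *new_offsets)
    # Emit header, IFD entries, terminating link, values, strips.
    out = bytearray(b"II" + struct.pack("<HI", 42, 8) + struct.pack("<H", n))
    for tag, typ, count, payload in fields:
        if tag in placement:
            value = struct.pack("<I", placement[tag])
        else:
            value = payload.ljust(4, b"\0")
        out += struct.pack("<HHI", tag, typ, count) + value
    out += struct.pack("<I", 0)                  # single page: no next IFD
    for tag, typ, count, payload in fields:
        if tag in placement:
            out += payload + b"\0" * (len(payload) & 1)
    for off, length in zip(old_offsets, lengths):
        out += data[off:off + length] + b"\0" * (length & 1)
    return bytes(out)
```

Ordinary fields are copied verbatim and only relocated if they live out of line; StripOffsets is the one field here whose values are rewritten, which is exactly the extra knowledge the approach requires.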
Regards,
Sherlog