2007.01.15 01:09 "[Tiff] bigtiff", by Albert Cahalan

I've written some TIFF read/write code by hand (w/o libtiff), and I've been looking at bigtiff.

The design looks like it got pulled between the conflicting desires of compatibility and sanity, getting rather bent out of shape in the process. It's not too late to reconsider.

On the one hand, you could go for compatibility. Define a new type, the big-IFD. Within a big-IFD, all file offsets are expressed in multiples of 4096 (you shift by 12 bits). The same can be done for strip offsets and so on. The most logical way is to use one of the type bits to mark this. For example, bit 7 (0x80) could indicate that any offset will be large. The very first IFD must go in the low 4 GB of course, which nicely allows many older applications to at least see a valid TIFF -- or let a change of magic indicate only that one issue. Maybe add one more type, shifting the BYTE type in a similar manner, if people really want individual tiles to be big.

That would be REALLY EASY to deal with. No data structures would need to change. If a flag is set when reading, shift some bits. If a number is too big when writing, set the flag and shift some bits. The hardest problems would be rounding up and adding an old-style IFD, neither of which is really any trouble at all.
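A minimal sketch of that read/write rule, assuming a 32-bit offset field, a shift of 12 bits, and bit 7 (0x80) of the type as the flag. The names `encode_offset`, `decode_offset`, and `round_up_to_page` are made up for illustration; nothing here comes from libtiff or the bigtiff draft.

```python
PAGE_SHIFT = 12      # offsets counted in units of 4096 bytes (assumed)
BIG_FLAG = 0x80      # hypothetical type bit: "this offset is shifted"


def round_up_to_page(n):
    """Round a byte position up to the next multiple of 4096."""
    page = 1 << PAGE_SHIFT
    return (n + page - 1) & ~(page - 1)


def encode_offset(byte_offset, field_type):
    """Writer side: return (stored_offset, field_type) for a 32-bit field."""
    if byte_offset < (1 << 32):
        return byte_offset, field_type          # fits, leave it alone
    if byte_offset % (1 << PAGE_SHIFT):
        raise ValueError("large offsets must be 4096-byte aligned")
    stored = byte_offset >> PAGE_SHIFT
    if stored >= (1 << 32):
        raise ValueError("offset is past what 32 bits plus the shift can reach")
    return stored, field_type | BIG_FLAG        # set the flag, shift the bits


def decode_offset(stored, field_type):
    """Reader side: return (byte_offset, field_type) with the flag cleared."""
    if field_type & BIG_FLAG:
        return stored << PAGE_SHIFT, field_type & ~BIG_FLAG
    return stored, field_type
```

A writer would call `round_up_to_page` on the current file position before placing anything past the 4 GB mark, then pass the result to `encode_offset`.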

On the other hand, you could eliminate some cruft.

The old TIFF spec "requires" alignment, wasting bits in offsets and bytes in the file, but a real TIFF reader still has to tolerate images which lack the alignment, which gives a writer no incentive to comply. If the low bits were simply missing (shifted away), then readers could rely on alignment. RISC systems tend to need 4-byte alignment, some SSE instructions need 16-byte alignment, most raw IO needs 512-byte alignment, newer disks (and thus likely newer raw IO) need 4096-byte alignment, and mmap on typical hardware needs 4096-byte alignment. Somewhere in there is a reasonable value.
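To put numbers on the trade-off, here is a small sketch that computes, for each alignment listed above, how many low bits get shifted away and how far an offset field of a given width could then reach. The function name `reach` and the default 32-bit field width are assumptions for illustration only.

```python
def reach(alignment, field_bits=32):
    """Return (shift, addressable_bytes) for offsets stored in units of `alignment`."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a power of two")
    shift = alignment.bit_length() - 1           # low bits that are shifted away
    return shift, (1 << field_bits) * alignment  # total bytes the field can address


if __name__ == "__main__":
    for alignment in (4, 16, 512, 4096):
        shift, total = reach(alignment)
        print(f"align {alignment:5d}: shift {shift:2d}, reach {total >> 30} GiB")
```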

If all the data structures are changing anyway, one might as well use a UUID/GUID (128-bit value) as the tag type. Old tags get zero-padded. This would get rid of the problem of number allocation: if you need a private tag, just generate 128 random bits and you have a tag number. An IFD entry could be laid out as a nice power of two:

128-bit UUID

64-bit value/offset (with offsets shifted to enforce alignment)

64-bit count, with 8-bit or 16-bit type in the upper bits

As a beneficial side effect, that limits the count such that the count multiplied by the size of the data type will not wrap a 64-bit signed or unsigned integer.
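The layout above could be sketched as a 32-byte record like this. Several details are assumptions, not part of the proposal: little-endian byte order, an 8-bit type in the top byte of the count word (leaving 56 bits of count), and old tag numbers zero-padded in the high bytes. The helper names are hypothetical.

```python
import secrets
import struct

ENTRY = struct.Struct("<16sQQ")      # 16-byte tag, 64-bit value/offset, 64-bit count word
TYPE_SHIFT = 56                      # assumed: 8-bit type in the top byte
COUNT_MASK = (1 << TYPE_SHIFT) - 1   # low 56 bits hold the count


def legacy_tag(number):
    """Old 16-bit tag number, zero-padded to 128 bits (padding placement assumed)."""
    return number.to_bytes(16, "little")


def private_tag():
    """A private tag: just 128 random bits."""
    return secrets.token_bytes(16)


def pack_entry(tag, field_type, count, value):
    if not 0 <= field_type < 256:
        raise ValueError("type must fit in 8 bits")
    if not 0 <= count <= COUNT_MASK:
        raise ValueError("count must fit in 56 bits")
    return ENTRY.pack(tag, value, (field_type << TYPE_SHIFT) | count)


def unpack_entry(raw):
    tag, value, word = ENTRY.unpack(raw)
    return tag, word >> TYPE_SHIFT, word & COUNT_MASK, value
```

With a 56-bit count and element sizes of at most 8 bytes, count times element size stays below 2^59, so it cannot wrap a signed or unsigned 64-bit integer.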

The IFD entry count could go. There are other ways to get it. The parent could hold the count, consistent with the way other things are done. An end-marker could be used instead. If the count stays in the old location, at least it could be padded out to allow a decent alignment.

BTW, byte swapping is faster than deciding if byte swapping is required. One could notice that even the Mac has gone little-endian, so there really isn't much call for big-endian anymore. I say this despite actually using a now-obsolete PowerPC Mac right now.