1999.10.04 17:05 "libtiff problems with Group 4", by Joel Schumacher

Dear Sam Leffler,

We currently have an imaging project to scan and store documents as TIFF files. To make a long story short, we're trying to OCR them, and both our OCR vendor and tiffinfo report problems with some of our TIFF images.

I turned on DEBUG mode and recompiled tiffinfo to get it to show exactly what it's decoding, then ran it with the -D option on a problem file:

0000001F/9: V0         0        1
0000000F/8: V0         0        1
00000007/7: V0         0        1
00002003/14: V0         0       1
00001001/13: V0         0       1
00000800/12: EOL        0       0000000
Fax4Decode: 169439.tif: Bad code word at scanline 3291 (x 0).
Fax4Decode: Warning, 169439.tif: Premature EOL at scanline 3291 (got 0, expected 2544).
00000010/13: VL         2       000010
00000000/7: EOL        0        0000000
Fax4Decode: 169439.tif: Bad code word at scanline 3292 (x 2542).
Fax4Decode: Warning, 169439.tif: Premature EOL at scanline 3292 (got 2542, expected 2544).
00000008/8: Pass       0        0001
00000000/12: EOL        0       0000000
Fax4Decode: Warning, 169439.tif: Premature EOL at scanline 3294 (got 0, expected 2544).
00000000/7: EOL        0        0000000
Fax4Decode: Warning, 169439.tif: Premature EOF at scanline 3295 (x 0).
Fax4Decode: Warning, 169439.tif: Premature EOL at scanline 3295 (got 0, expected 2544).

At the point where it's failing (last 16 bytes of the data), we see this:

21fb0 - 21fbf : FF FF FF FF FF FF FF FF FF FF FF F0 01 00 10 00

The problem seems to be at the F0 01 00 10. In binary, that's

    1111 0000 0000 0001 0000 0000 0001 0000

The leading run of 1 bits decodes as V0 codes; decoding then fails at the 000000000001 000000000001 that follows.

This is an EOFB (end of facsimile block) codeword. It is a 24-bit codeword defined as follows:

   000000000001000000000001

As you can see, libtiff is not interpreting this properly: it sees seven 0 bits in a row and reports an invalid codeword.
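To make the bit-level reading above easy to check, here is a minimal sketch that scans a buffer MSB-first for the 24-bit EOFB pattern. The helper names get_bit and find_eofb are hypothetical and are not part of libtiff, and the sketch assumes FillOrder = 1 (most significant bit first). It is only an illustration: a real decoder would recognize the EOFB at a codeword boundary rather than by scanning raw bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a libtiff function): return bit number `pos`,
 * where bit 0 is the most significant bit of buf[0]. Assumes FillOrder = 1. */
static unsigned get_bit(const uint8_t *buf, size_t pos)
{
    return (buf[pos / 8] >> (7 - pos % 8)) & 1u;
}

/* Hypothetical helper (not a libtiff function): return the bit offset of the
 * first occurrence of the 24-bit EOFB pattern 000000000001000000000001,
 * or -1 if the pattern does not occur in the buffer. */
static long find_eofb(const uint8_t *buf, size_t nbytes)
{
    const uint32_t EOFB = 0x001001u;   /* two 12-bit EOL codes back to back */
    size_t nbits = nbytes * 8;
    uint32_t window = 0;               /* the most recent 24 bits seen */

    for (size_t pos = 0; pos < nbits; pos++) {
        window = ((window << 1) | get_bit(buf, pos)) & 0xFFFFFFu;
        if (pos >= 23 && window == EOFB)
            return (long)(pos - 23);   /* offset of the pattern's first bit */
    }
    return -1;
}
```

Called on the four bytes F0 01 00 10, find_eofb returns 4: the four leading 1 bits are the V0 codes, the EOFB occupies the next 24 bits, and the last four 0 bits are padding.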

The EOFB is defined in section 2.4.1.1 of Recommendation T.6 and is also described on page 52 of the TIFF 6.0 spec.

So it would seem we're encountering the end of the page before we've decoded the 3300 lines that RowsPerStrip and ImageLength define. It looks like we decoded 3290 lines and then ran across an EOFB. This is not necessarily a problem, as the TIFF 6.0 spec states on page 53:

If a TIFF reader encounters EOFB before the expected number of
lines has been extracted, it is appropriate to assume that the
missing rows consist entirely of white pixels. Cautious readers
might produce an unobtrusive warning if such an EOFB is followed
by anything other than pad bits.

Readers that successfully decode the RowsPerStrip (or TileLength
or residual ImageLength) number of lines are not required to
verify that an EOFB follows. That is, it is generally appropriate
to stop decoding when the expected lines are decoded or the EOFB
is detected, whichever occurs first. Whether error indications or
warnings are also appropriate depends upon the application and
whether more precise troubleshooting of encoding deviations is
important.

However, libtiff doesn't seem to recognize the EOFB at all, let alone handle it arriving before the expected number of lines has been decoded.
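The handling the spec passage describes could be sketched roughly as follows. This is an illustration under stated assumptions, not libtiff's code: fill_missing_rows_white is a hypothetical helper, the image is assumed to be a packed 1-bit-per-pixel buffer with rows stored contiguously, and the caller is assumed to supply the byte value that means "all white" for the image's PhotometricInterpretation.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not a libtiff function): after an early EOFB, fill the
 * rows that were never decoded with white and return how many rows were filled.
 * `white_byte` is an assumption supplied by the caller, e.g. 0x00 when 0 bits
 * mean white, 0xFF when 1 bits mean white. */
static size_t fill_missing_rows_white(uint8_t *image, size_t row_bytes,
                                      size_t rows_decoded, size_t rows_expected,
                                      uint8_t white_byte)
{
    if (rows_decoded >= rows_expected)
        return 0;                      /* nothing is missing */

    size_t missing = rows_expected - rows_decoded;
    memset(image + rows_decoded * row_bytes, white_byte, missing * row_bytes);
    return missing;
}
```

With the figures from this file (2544 pixels per row, so 318 bytes per row, and 3300 rows expected with 3290 decoded), such a helper would fill the last 10 rows and the reader could then issue a warning instead of an error.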

Can you verify that this is indeed a problem with libtiff?

And do you also interpret the EOFB coming early as something that shouldn't be a problem and should be handled?

______________________________________________________________________
Joel Schumacher                    JCPenney Co. - UNIX Network Systems
jschumac@uns-dv1.jcpenney.com      12700 Park Central Pl   M/S 6021
(972) 591-7543                     Dallas TX  75251