
2006.04.22 03:37 "[Tiff] Microsoft Document Imaging status / snapshot", by Brad Hards
After a very long hiatus (I've been working on some Qt-based crypto), I recently spent a little bit of time working on the .mdi file format extensions to tiff.
I've almost got the "OCR'd text" tag (37679) sorted out: it looks like the text version of the document contents. The contents are 0x01 0x00, followed by a length (4 bytes, aka long) whose value is 6 less than the actual length of this field, followed by the ASCII text version. Each phrase is delimited by a space followed by a newline (0x20 0x0a aka ' \n'). The end is 0x0d 0x00. There are sometimes additional multi-byte sequences (e.g. 0xe2 0x80 0x9c) which appear to be some kind of character / symbol encoding; the ones seen so far are all consistent with UTF-8. Combinations include:
0xef 0x82 0xa7 = some kind of bullet point symbol (U+F0A7, private use area)
0xef 0x82 0xb7 = some kind of bullet point symbol (different to a7; U+F0B7, private use area)
0xe2 0x80 0x93 = en-dash (U+2013)
0xe2 0x80 0x9c = `` (smart doublequotes, left side of quoted material; U+201C)
0xe2 0x80 0x9d = '' (smart doublequotes, right side of quoted material; U+201D)
0xe2 0x80 0x99 = ' (apostrophe of some kind; U+2019)
0xe2 0x80 0xa6 = ellipsis (U+2026)
0xe2 0x80 0x94 = em-dash (U+2014)
0xc3 0xa9 = e with acute. (00e9 is the Unicode equivalent, and 0xc3 0xa9 is its UTF-8 encoding, which is the pattern the other sequences follow too)
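The field layout and byte sequences described above can be sketched in C as follows. This is only an illustration under stated assumptions: the helper names `mdi_ocr_text` and `utf8_decode` are hypothetical (they are not part of libtiff or the attached patch), the length is assumed to be stored little-endian, and the extra bytes are assumed to be UTF-8.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: locate the text inside an OCR-text tag payload.
 * Assumed layout (from the notes above): 0x01 0x00, a 4-byte length whose
 * value is (total - 6), then the text, ending in 0x0d 0x00.
 * Little-endian byte order for the length is an assumption. */
static int mdi_ocr_text(const unsigned char *buf, size_t total,
                        const unsigned char **text, size_t *text_len)
{
    uint32_t stored;

    if (total < 6 || buf[0] != 0x01 || buf[1] != 0x00)
        return 0;
    stored = (uint32_t)buf[2] | ((uint32_t)buf[3] << 8) |
             ((uint32_t)buf[4] << 16) | ((uint32_t)buf[5] << 24);
    if ((size_t)stored + 6 != total)
        return 0;
    *text = buf + 6;
    *text_len = total - 6;
    /* Drop the 0x0d 0x00 terminator if it is present. */
    if (*text_len >= 2 && buf[total - 2] == 0x0d && buf[total - 1] == 0x00)
        *text_len -= 2;
    return 1;
}

/* Decode one UTF-8 sequence of up to three bytes, which is enough for the
 * sequences listed above. No overlong or surrogate checks are done.
 * Returns the number of bytes consumed, or 0 if s is not a valid start. */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    if (n >= 1 && s[0] < 0x80) {
        *cp = s[0];
        return 1;
    }
    if (n >= 2 && (s[0] & 0xe0) == 0xc0 && (s[1] & 0xc0) == 0x80) {
        *cp = ((uint32_t)(s[0] & 0x1f) << 6) | (uint32_t)(s[1] & 0x3f);
        return 2;
    }
    if (n >= 3 && (s[0] & 0xf0) == 0xe0 &&
        (s[1] & 0xc0) == 0x80 && (s[2] & 0xc0) == 0x80) {
        *cp = ((uint32_t)(s[0] & 0x0f) << 12) |
              ((uint32_t)(s[1] & 0x3f) << 6) | (uint32_t)(s[2] & 0x3f);
        return 3;
    }
    return 0;
}
```

Feeding the three bytes 0xe2 0x80 0x9c to `utf8_decode` yields code point 0x201c, so the decoder can be used to test whether a new, unlisted sequence fits the same pattern.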
My current set of notes is attached.
I've been hacking libtiff to try to figure out what is going on. A cvs diff is also attached - you'll need the tif_mdi.c file as well, if you want it to compile (it goes into libtiff/libtiff/).
Work on the actual image content (compression type) has only just started. Right now I have absolutely no idea what the format could be, although it certainly does appear to use some kind of compression. I generated a few trivial files (a large filled blue rectangle, the same rectangle without fill, the same rectangle in a slightly different shade of blue, and the same rectangle in green). Here is what the content looks like:
/home/bradh/mdi/greenrect.MDI:
char count: 336
02 00 00 00 7b 00 00 00 ac 02 00 00 03 00 00 00 00 00 00 00 ff ff ff ff 00 00
00 00 78 01 5d 91 bd 4a c4 50 10 85 cf cd 46 37 82 8a 18 11 41 8b b5 f0 0f 1b
c1 b5 b7 d1 46 5c c4 42 7b d1 42 10 0b 5d b0 dd c2 c2 97 c8 1b 58 28 68 ef 33
f8 04 3e 85 ed fa 9d 35 37 9b cd c0 c9 3d 33 77 66 ce 64 6e 90 f4 0c 8a 20 e5 89 d4 cb e0 6d 02 a5 65 e7 52 da 97 3a 47 a7 c7 52 d0 2b 39 eb dc b5 c0 2c b8 2b f3 de 66 a4 8f 39 e9 1b 7f 87 5e 75 3b e8 b6 b4 ff 92 ea 4c 0f ba d5 bd fa ea 80 1b 3d f2 b5 6d 00 f7 42 76 10 39 6e c5 e7 e1 ab 04 16 40 34 a4 26 fc 2b 7c f7 b0 f4 75 c9 97 a8 7b 82 73 4e 58 d4 e0 57 2b 8d 29 f8 2e 59 2b e0 eb
67 38 ac 23 e6 37 e7 5b 26 d7 9a ae 59 2c b9 f5 ad e7 b8 67 ac f3 13 fc 43 b0
07 d0 63 0b ff f3 ba a6 fe 6f ef ec f2 d3 c5 a5 45 3f ce e1 b9 b7 b8 b3 ae 6b 47 36 10 2f 33 de a5 e3 17 c0 5a cd b7 b5 76 11 f2 a4 97 19 45 bb 08 46 9e 6c 12 5f 03 a3 46 9c 95 d1 3b 6a fb ee 12 74 41 b3 ef 34 b1 df 74 dc 2f f6 77 2c d6 37 77 b8 4d 8d 77 e5 be 91 7b 76 de bc 7a 37 ef c6 7b fc 03 fe c8 3b 4d
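Since the test files differ only in small ways, one cheap way to use dumps like the one above is to compare two of them byte by byte and see where they first diverge. The following is a minimal sketch of that idea; the function name `first_difference` is hypothetical and the approach is only a suggestion, not something the patch does.

```c
#include <stddef.h>

/* Return the offset of the first byte at which a and b differ.
 * If one buffer is a prefix of the other, return the shorter length. */
static size_t first_difference(const unsigned char *a, size_t alen,
                               const unsigned char *b, size_t blen)
{
    size_t n = alen < blen ? alen : blen;
    size_t i;

    for (i = 0; i < n; ++i) {
        if (a[i] != b[i])
            return i;
    }
    return n;
}
```

Applied to the green and blue rectangle dumps in this message, the first 38 bytes are identical and the buffers first differ at offset 38 (0xcf versus 0x4f).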
/home/bradh/mdi/bluerect.MDI:
char count: 337
02 00 00 00 7b 00 00 00 ac 02 00 00 03 00 00 00 00 00 00 00 ff ff ff ff 00 00
00 00 78 01 5d 91 bd 4a c4 50 10 85 4f b2 d1 8d a0 22 46 44 d0 62 2d fc c3 46
70 ed 6d dc 66 71 11 0b ed 45 0b 41 2c 74 c1 d6 c2 c2 97 c8 1b 58 28 28 58 fa
0c 3e 81 4f 61 1b bf b3 e6 66 b3 19 38 b9 67 e6 ce cc 99 cc 8d 24 3d 81 3c 92 b2 58 1a a4 f0 36 81 d2 d2 53 29 19 4a 9d a3 e3 9e 14 e9 85 9c 75 ee 5a 60 16 dc 94 79 af 33 d2 fb 9c f4 8d bf 43 af ba 1d 74 5b da 7f 4e 74 a2 3b 5d eb 56 43 75 c0 95 ee f9 da 36 80 7b 21 fb 18 38 6e c5 e7 e1 ab 04 16 40 30 a4 26 fc 0b 7c f7 b0 f4 65 c9 97 a8 7b 80 73 4e 58 d0 e0 57 2b 8d 29 f8 2e 59 2b e0
eb a7 28 ea 08 f9 cd f9 96 c9 b5 a6 6b 16 4b 6e 7d eb 39 ee 19 eb bc 8f 7f 08
f6 00 7a 6c e1 7f 5e d7 d4 ff ed 8d 5d 7e b8 b8 b4 e0 87 39 3c f7 16 77 d6 75 ed c8 3e 0b 5e 66 bc 4b c7 cf 80 b5 9a 6f 6b ed 3c ca e2 41 6a e4 ed 3c 32 b2 78 93 f8 1a 18 35 e2 ac 8c de 41 db 77 e7 a0 0b 9a 7d a7 89 fd 26 e3
MDI contains images of the page and the text that it contains. It is based on the TIFF format.
Office Document Imaging creates MDI files in these formats:
- Monochrome: one bit per pixel, MODI BW compression
- Grayscale: 8 bits per pixel, MODI Color compression
- Color: 24 bits RGB, MODI Color compression
Office Document Imaging supports:
- All compression types listed in the TIFF 6.0 specification.
- Different compression types for each page of a multi-page document.
- TIFF images with 1-bit, 4-bit, 8-bit, or 24-bit color depth (both palette and non-palette).
- MDI images with 1-bit, 8-bit, or 24-bit color depth.
- RGB and CMYK color spaces.
- Tiled images.
Office Document Imaging does not support:
- YCbCr color space, except when the image is JPEG.
- CIE Lab color space.
- Images with more than five samples per pixel, or a sample size larger than 32 bits.
- Images in Planar format.
MDI supports document annotations:
- add text as a note or comment
- apply highlighting to important text
- draw freeform text or shapes to circle text in question
- insert a picture in your document by using the buttons on the Annotations toolbar.
- Move, resize, and remove annotations.
- Select the font and background color (if any) for text you add to your document.
- Choose the thickness and color of ink for the pens you use to add highlighting and drawings.
- Make your annotations a permanent part of your document.
- Print your document with or without annotations.
Each MDI document consists of an ordered collection of pages (images), plus metadata.
The metadata has a standard set of properties, plus a custom set. Standard ("built in") properties include "Title", "Author" and "Creation Date". Title and Author may not be present. "Last print date" and "Last save time" are "available but not used".
File format can vary:
- DEFAULTVALUE -1
- MDI 4
- TIFF 1
- TIFF_LOSSLESS 2
Compression level can vary:
- High (2)
- Low (0)
- Medium (1)
Compression type (259, 0x0103) can vary:
- MODI_BLC 34718
- MODI_PTC 34720
- MODI_VECTOR 34719
- TIFF_CCITT1D 2
- TIFF_CCITT3 3
- TIFF_CCITT4 4
- TIFF_JPEG 7
- TIFF_JPEG6 6
- TIFF_LZW 5
- TIFF_NONE 1
- TIFF_PACK 32773
- UNKNOWN 0
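As a small illustration of how the MODI-specific compression values listed above might be used, here is a sketch that maps a Compression tag value to a name. The numeric values come from the list above and the macro names match those used in the attached patch; the helper `modi_compression_name` is hypothetical.

```c
#include <stddef.h>

/* Values from the notes above; macro names as used in the patch. */
#define COMPRESSION_MODI_BLC    34718   /* registered as "MODI BW" in the patch */
#define COMPRESSION_MODI_VECTOR 34719   /* registered as "MODI Vector" */
#define COMPRESSION_MODI_PTC    34720   /* registered as "MODI Colour" */

/* Hypothetical helper: return the MODI scheme name for a Compression
 * tag (259) value, or NULL if the value is not one of the MODI schemes. */
static const char *modi_compression_name(unsigned int scheme)
{
    switch (scheme) {
    case COMPRESSION_MODI_BLC:
        return "MODI_BLC";
    case COMPRESSION_MODI_VECTOR:
        return "MODI_VECTOR";
    case COMPRESSION_MODI_PTC:
        return "MODI_PTC";
    default:
        return NULL;
    }
}
```

Returning NULL for everything else keeps the standard TIFF schemes (LZW, CCITT and so on) out of this table, since libtiff already knows about those.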
Variable font families:
- Century 3
- DEFAULT 6
- Helvetica 1
- HelveticaCentury 5
- Times 2
- TimesCentury 4
Variable face styles:
- BOLD 3
- BOLD_ITALIC 4
- ITALIC 2
- ROMAN 1
Serif styles:
- SerifRND 4
- SerifSANS 1
- SerifSQ 3
- SerifSTYLE_UNKNOWN 5
- SerifTHIN 2
Languages:
- CHINESE_SIMPLIFIED 2052
- CHINESE_TRADITIONAL 1028
- CZECH 5
- DANISH 6
- DUTCH 19
- ENGLISH 9
- FINNISH 11
- FRENCH 12
- GERMAN 7
- GREEK 8
- HUNGARIAN 14
- ITALIAN 16
- JAPANESE 17
- KOREAN 18
- NORWEGIAN 20
- POLISH 21
- PORTUGUESE 22
- RUSSIAN 25
- SPANISH 10
- SWEDISH 29
- SYSDEFAULT 2048
- TURKISH 31
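The language values above look like Windows language identifiers (LANGIDs), where the low 10 bits hold the primary language and the remaining bits hold the sublanguage. That is an inference from the numbers, not something confirmed for MDI files. Under that assumption, the larger values can be split with a sketch like this (the function names are hypothetical):

```c
/* Assumption: the value is a Windows-style LANGID.
 * Low 10 bits: primary language. Upper bits: sublanguage. */
static unsigned int primary_language(unsigned int langid)
{
    return langid & 0x3ffu;
}

static unsigned int sub_language(unsigned int langid)
{
    return langid >> 10;
}
```

With this split, CHINESE_SIMPLIFIED (2052) and CHINESE_TRADITIONAL (1028) share primary language 4 and differ only in sublanguage (2 and 1), while the small values such as ENGLISH (9) are a primary language with sublanguage 0.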
Thumbnail sizes:
- LARGE 3
- MAXIMUM 3
- MEDIUM 2
- SMALL 1
- TINY 0
Each page can have different image properties:
- picture of page
- thumbnail of page
- BitsPerPixel The bits per pixel.
- Compression The compression level.
- Layout The results of OCR on the page.
- PixelHeight The height in pixels.
- PixelWidth The width in pixels.
- XDPI The X-axis pixels per inch.
- YDPI The Y-axis pixels per inch.
Layout - provides summary information (such as the number of words) about the recognized text on the page and gives access to the recognized text itself and to each individual word in the text.
The Word object exposes additional information about each word's font, its location on the page, and even the OCR engine's RecognitionConfidence factor, which estimates the likelihood of a recognition error:
Layout properties
- Language The language setting used by the OCR process.
- NumChars The number of characters in the rec
? autom4te.cache
? rects.txt
? tiffmdi-snapshot-2006-04-22.patch
? libtiff/tif_mdi.c
Index: libtiff/Makefile.am
===================================================================
RCS file: /cvs/maptools/cvsroot/libtiff/libtiff/Makefile.am,v
retrieving revision 1.21
diff -u -4 -p -r1.21 Makefile.am
--- libtiff/Makefile.am 21 Apr 2006 14:18:54 -0000 1.21
+++ libtiff/Makefile.am 22 Apr 2006 04:16:56 -0000
@@ -70,8 +70,9 @@ SRCS = \
tif_getimage.c \
tif_jpeg.c \
tif_luv.c \
tif_lzw.c \
+ tif_mdi.c \
tif_next.c \
tif_ojpeg.c \
tif_open.c \
tif_packbits.c \
Index: libtiff/tif_codec.c
===================================================================
RCS file: /cvs/maptools/cvsroot/libtiff/libtiff/tif_codec.c,v
retrieving revision 1.10
diff -u -4 -p -r1.10 tif_codec.c
--- libtiff/tif_codec.c 21 Dec 2005 12:23:13 -0000 1.10
+++ libtiff/tif_codec.c 22 Apr 2006 04:16:56 -0000
@@ -68,8 +68,11 @@ static int NotConfigured(TIFF*, int);
#endif
#ifndef LOGLUV_SUPPORT
#define TIFFInitSGILog NotConfigured
#endif
+#ifndef MDI_SUPPORT
+#define TIFFInitMDI NotConfigured
+#endif
/*
* Compression schemes statically built into the library.
*/
@@ -94,8 +97,12 @@ TIFFCodec _TIFFBuiltinCODECS[] = {
{ "AdobeDeflate", COMPRESSION_ADOBE_DEFLATE , TIFFInitZIP },
{ "PixarLog", COMPRESSION_PIXARLOG, TIFFInitPixarLog },
{ "SGILog", COMPRESSION_SGILOG, TIFFInitSGILog },
{ "SGILog24", COMPRESSION_SGILOG24, TIFFInitSGILog },
+ /* TODO - add proper decompression for these */
+ { "MODI BW", COMPRESSION_MODI_BLC, TIFFInitMDI },
+ { "MODI Colour", COMPRESSION_MODI_PTC, TIFFInitMDI },
+ { "MODI Vector", COMPRESSION_MODI_VECTOR, TIFFInitMDI },
{ NULL, 0, NULL }
};
static int
Index: libtiff/tif_dirinfo.c
===================================================================
RCS file: /cvs/maptools/cvsroot/libtiff/libtiff/tif_dirinfo.c,v
retrieving revision 1.62
diff -u -4 -p -r1.62 tif_dirinfo.c
--- libtiff/tif_dirinfo.c 7 Feb 2006 10:45:38 -0000 1.62
+++ libtiff/tif_dirinfo.c 22 Apr 2006 04:16:57 -0000
@@ -268,8 +268,16 @@ tiffFieldInfo[] = {
{ TIFFTAG_STONITS, 1, 1, TIFF_DOUBLE, FIELD_CUSTOM,
0, 0, "StoNits" },
{ TIFFTAG_INTEROPERABILITYIFD, 1, 1, TIFF_LONG, FIELD_CUSTOM,
0, 0, "InteroperabilityIFDOffset" },
+ { TIFFTAG_MDIOCRTEXT, -1, -1, TIFF_UNDEFINED, FIELD_CUSTOM,
+ 0, 0, "TextContentsMDI" },
+/* MDI tag for document level metadata? */
+ { TIFFTAG_MDIMETADATA, -1, -1, TIFF_UNDEFINED, FIELD_CUSTOM,
+ 0, 0, "MDIMetaData" },
+/* MDI tag for page thumbnail? */
+ { TIFFTAG_MDITHUMBNAIL, -1, -1, TIFF_UNDEFINED, FIELD_CUSTOM,
+ 0, 0, "MDIThumbnail" },
/* begin DNG tags */
{ TIFFTAG_DNGVERSION, 4, 4, TIFF_BYTE, FIELD_CUSTOM,
0, 0, "DNGVersion" },
{ TIFFTAG_DNGBACKWARDVERSION, 4, 4, TIFF_BYTE, FIELD_CUSTOM,
Index: libtiff/tif_dirread.c
===================================================================
RCS file: /cvs/maptools/cvsroot/libtiff/libtiff/tif_dirread.c,v
retrieving revision 1.84
diff -u -4 -p -r1.84 tif_dirread.c
--- libtiff/tif_dirread.c 4 Apr 2006 02:00:08 -0000 1.84
+++ libtiff/tif_dirread.c 22 Apr 2006 04:16:57 -0000
@@ -29,8 +29,9 @@
*
* Directory Read Support Routines.
*/
#include "tiffiop.h"
+#include "ctype.h"
#define IGNORE 0 /* tag placeholder used below */
/*
 * Copyright (c) Brad Hards <bradh@frogmouth.net>
 *
 * Permission to use, copy, modify, distribute, and sell this software and
 * its documentation for any purpose is hereby granted without fee, provided
 * that the above copyright notices and this permission notice appear in
 * all copies of the software and related documentation.
 *
 * THE SOFTWARE IS PROVIDED "AS-IS" AND WITHOUT WARRANTY OF ANY KIND,
 * EXPRESS, IMPLIED OR OTHERWISE, INCLUDING WITHOUT LIMITATION, ANY
 * WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
 *
 * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
 * ANY SPECIAL, INCIDENTAL, INDIRECT OR CONSEQUENTIAL DAMAGES OF ANY KIND,
 * OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS,
 * WHETHER OR NOT ADVISED OF THE POSSIBILITY OF DAMAGE, AND ON ANY THEORY OF
 * LIABILITY, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE
 * OF THIS SOFTWARE.
 */
#include "tiffiop.h"
#ifdef MDI_SUPPORT
/*
* TIFF Library.
*
* MDI Image Support
*
*/
static int
MDISetupDecode(TIFF* tif)
{
    return (1);
}
/*
 * Setup state for decoding a strip.
 */
static int
MDIPreDecode(TIFF* tif, tsample_t s)
{
    return 1;
}
static int
MDIDecode(TIFF* tif, tidata_t op, tsize_t occ, tsample_t s)
{
    int lv;
    printf("decode size: %i (%i)\n", occ, 2480*3508*3);
    printf("char count: %i\n", tif->tif_rawcc);
    for (lv = 0; lv < tif->tif_rawcc; ++lv) {
        printf("%02x ", tif->tif_rawcp[lv]);
    }
    printf("\n");
    return 0;
}
static int
MDISetupEncode(TIFF* tif)
{
    return 0;
}
/*
 * Reset encoding state at the start of a strip.
 */
static int
MDIPreEncode(TIFF* tif, tsample_t s)
{
    return 0;
}
/*
 * Encode a chunk of pixels.
 */
static int
MDIEncode(TIFF* tif, tidata_t bp, tsize_t cc, tsample_t s)
{
    return (1);
}
/*
 * Finish off an encoded strip by flushing the last
 * string and tacking on an End Of Information code.
 */
static int
MDIPostEncode(TIFF* tif)
{
    return 1;
}
static void
MDICleanup(TIFF* tif)
{
}
int
TIFFInitMDI(TIFF* tif, int scheme)
{
    (void) scheme;  /* only used by the assert below */
    assert( (scheme == COMPRESSION_MODI_BLC)
        || (scheme == COMPRESSION_MODI_VECTOR)
        || (scheme == COMPRESSION_MODI_PTC)
        );
    /* printf("Init MDI\n"); */
    /*
     * Install codec methods.
     */
    tif->tif_setupdecode = MDISetupDecode;
    tif->tif_predecode = MDIPreDecode;
    tif->tif_decoderow = MDIDecode;
    tif->tif_decodestrip = MDIDecode;
    tif->tif_decodetile = MDIDecode;
    tif->tif_setupencode = MDISetupEncode;
    tif->tif_preencode = MDIPreEncode;
    tif->tif_postencode = MDIPostEncode;
    tif->tif_encoderow = MDIEncode;
    tif->tif_encodestrip = MDIEncode;
    tif->tif_encodetile = MDIEncode;
    tif->tif_cleanup = MDICleanup;
    return 1;  /* codec methods installed */
}
#endif
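The debugging loop in MDIDecode prints the whole raw buffer on a single line, which is hard to compare between files. A possible alternative, sketched here purely as a suggestion, is to print fixed-width rows with an offset prefix. Both function names are hypothetical and nothing here depends on libtiff.

```c
#include <stddef.h>
#include <stdio.h>

/* Format up to n bytes as "xx xx xx" into dst. Returns the number of
 * characters written (excluding the NUL); stops early if dst is full. */
static size_t format_hex_row(char *dst, size_t dstsz,
                             const unsigned char *p, size_t n)
{
    size_t used = 0;
    size_t i;

    if (dstsz == 0)
        return 0;
    dst[0] = '\0';
    /* Each step needs room for up to 3 characters plus the NUL. */
    for (i = 0; i < n && used + 3 < dstsz; ++i) {
        used += (size_t)snprintf(dst + used, dstsz - used,
                                 i ? " %02x" : "%02x", p[i]);
    }
    return used;
}

/* Print a buffer as rows of per_row bytes, each prefixed by its offset. */
static void hexdump_rows(FILE *out, const unsigned char *p, size_t n,
                         size_t per_row)
{
    char row[3 * 32 + 1];
    size_t off;

    if (per_row == 0 || per_row > 32)
        per_row = 16;   /* fall back to a conventional row width */
    for (off = 0; off < n; off += per_row) {
        size_t chunk = (n - off < per_row) ? (n - off) : per_row;
        format_hex_row(row, sizeof row, p + off, chunk);
        fprintf(out, "%04lx  %s\n", (unsigned long)off, row);
    }
}
```

Inside the decoder this could be called as `hexdump_rows(stdout, tif->tif_rawcp, (size_t)tif->tif_rawcc, 26)` to get rows of 26 bytes.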