TIFF and LibTiff Mailing List Archive
October 2005

Archive maintained by AWare Systems


2005.10.18 10:20 "Notes on Microsoft Office Document Imaging file format", by Brad Hards

I've been looking at the Microsoft Office Document Imaging (.mdi) file format. 
Notes to date are below - hope this helps. Also, any suggestions or updates 
would be appreciated.


An MDI file contains images of the pages, plus the text that they contain.
It is based on the TIFF format.

Uses a different magic number from TIFF: 0x5045
Same version number: 0x002a
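
The header check described above could be sketched roughly as follows. This
is only an illustration: the helper name classify_header is made up, and it
assumes that 0x5045 is the 16-bit value read in little-endian order (so the
file would start with the bytes 0x45 0x50). The TIFF values 0x4949 and 0x4d4d
are the standard byte-order marks.

```python
import struct

TIFF_LE_MAGIC = 0x4949  # "II", standard little-endian TIFF
TIFF_BE_MAGIC = 0x4D4D  # "MM", standard big-endian TIFF
MDI_MAGIC = 0x5045      # value from these notes; little-endian read is assumed
VERSION = 0x002A        # same version number for TIFF and MDI


def classify_header(header):
    """Return 'MDI', 'TIFF' or 'unknown' for the first 4 bytes of a file."""
    if len(header) < 4:
        raise ValueError("need at least 4 header bytes")
    (magic,) = struct.unpack_from("<H", header, 0)
    # Only big-endian TIFF stores the version number big-endian.
    fmt = ">H" if magic == TIFF_BE_MAGIC else "<H"
    (version,) = struct.unpack_from(fmt, header, 2)
    if version != VERSION:
        return "unknown"
    if magic == MDI_MAGIC:
        return "MDI"
    if magic in (TIFF_LE_MAGIC, TIFF_BE_MAGIC):
        return "TIFF"
    return "unknown"
```

For example, classify_header(open(path, "rb").read(4)) would give a quick
first guess before handing the file to a TIFF reader.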

There are unknown fields:
37679 - appears on every page, always starts with 0x01 0x00, then varies
37680 - only appears to occur on the first page, always appears to be length 
4096, always starts with 0xd0 0xcf 0x11 0xe0 0xa1 0xb1 0x1a 0xe1, then a 
string of zeros, and then varies.
37681 - appears on every page, always starts with 0x02 0x00 (+ 0x00, 0x00?), 
then varies

These unknown properties appear to occur in both TIFF and MDI files.

37679 - looks like the text version of the document contents. The contents are
0x01 0x00, followed by a length (4 bytes, aka long) which is 6 bytes less
than the actual length of this field, followed by the ASCII text
version. Each phrase is delimited by a space followed by a newline 
(0x20 0x0a aka ' \n'). The end is 0x0d 0x00. 
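
As a rough sketch of that layout, the following splits a raw 37679 payload
into phrases. The function name parse_text_field is hypothetical, and the
4-byte length is assumed to be little-endian (the notes do not say which
byte order it uses).

```python
import struct


def parse_text_field(payload):
    """Split the raw bytes of field 37679 into a list of phrases."""
    if len(payload) < 6:
        raise ValueError("payload too short")
    if payload[:2] != b"\x01\x00":
        raise ValueError("payload does not start with 0x01 0x00")
    # Assumption: the length is stored little-endian.
    (length,) = struct.unpack_from("<I", payload, 2)
    if length + 6 != len(payload):
        raise ValueError("length is not 6 less than the field size")
    body = payload[6:]
    # Drop the 0x0d 0x00 terminator if it is present.
    if body.endswith(b"\x0d\x00"):
        body = body[:-2]
    text = body.decode("ascii", errors="replace")
    # Phrases are delimited by a space followed by a newline.
    return [phrase for phrase in text.split(" \n") if phrase]
```

The strict length check is deliberate: if it fails on real files, the byte
order or the "6 bytes less" observation needs another look.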

37680 might be some kind of metadata dictionary. It is located at the end of
the file, and there are 16-bit wide characters that look like "Root Entry", 
"CONTENTS" (sometimes more than once, even if only one page), 
"prop2" (sometimes more than once), "prop3" (sometimes more than once), 
"DICT", "Summary Information", "Owner" and some names. There might be some 
random stuff / fill in there too.
There also appears to be a consistent string "AuvsxjatP0udlw1Aaq5eubr5h" (this
might not be ASCII though - there is always a 0x05 0x00 on the front of it).
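
One way to poke at a 37680 payload is to check the leading bytes and pull out
the 16-bit strings. The sketch below assumes the wide characters are UTF-16
little-endian; the helper names are made up for illustration. The eight
leading bytes match the well-known signature of OLE2 compound files, which
would fit names like "Root Entry" and "Summary Information", but that
interpretation has not been verified against these files.

```python
import re

# Leading bytes observed in field 37680.
SIGNATURE = bytes([0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1])


def has_signature(payload):
    """True if the payload starts with the observed 8-byte signature."""
    return payload.startswith(SIGNATURE)


def wide_strings(payload, min_chars=4):
    """Return runs of printable ASCII stored as 16-bit little-endian chars."""
    # Each character is one printable byte followed by a zero byte.
    pattern = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_chars)
    return [m.group().decode("utf-16-le") for m in pattern.finditer(payload)]
```

Running wide_strings() over the 4096 bytes should list the entry names
mentioned above along with any fill that happens to look like text.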

37681 hasn't been looked at yet - possibly the thumbnail image?

There are new kinds of image compression (values of tag 259, 0x0103):
 - MODI_BLC      34718
 - MODI_VECTOR   34719
 - MODI_PTC      34720
Plus existing TIFF compression types can be used. MDI appears to be mostly 
MODI_VECTOR. Don't know how any of this works yet.
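
For anyone dumping tags, a small lookup for the new compression values might
look like this. It only names the codes listed above and does not decode
anything; compression_name is a hypothetical helper.

```python
# Compression values (tag 259) seen in MDI files, per the notes above.
MODI_COMPRESSION = {
    34718: "MODI_BLC",
    34719: "MODI_VECTOR",
    34720: "MODI_PTC",
}


def compression_name(code):
    """Return the MODI name for a compression code, or a generic label."""
    return MODI_COMPRESSION.get(code, "other (%d)" % code)
```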