AWARE [SYSTEMS] Imaging expertise for the Delphi developer

LibTiff Mailing List

TIFF and LibTiff Mailing List Archive
April 2006


The TIFF Mailing List Homepage
This list is run by Frank Warmerdam
Archive maintained by AWare Systems






Thread

2006.04.22 03:37 "Microsoft Document Imaging status / snapshot", by Brad Hards
2006.04.22 07:02 "Re: Microsoft Document Imaging status / snapshot", by Brad Hards
2006.04.24 11:39 "Re: Microsoft Document Imaging status / snapshot", by Gerben Vos
2006.04.25 14:49 "Re: Microsoft Document Imaging status / snapshot", by Glenn Widener

2006.04.22 03:37 "Microsoft Document Imaging status / snapshot", by Brad Hards

After a very long hiatus (I've been working on some Qt-based crypto), I 
recently spent a little bit of time working on the .mdi file format 
extensions to TIFF.

I've almost got the "OCR'd text" tag sorted out:
37679 - looks like the text version of the document contents. The
contents are 0x01 0x00, followed by a length (4 bytes, aka long) whose
value is 6 less than the actual length of this field, followed by the
text version, which is mostly ASCII. Each phrase is delimited by a
space followed by a newline (0x20 0x0a aka ' \n'). The end is 0x0d
0x00. There are sometimes additional multi-byte sequences (e.g. 0xe2
0x80 0x9c), which decode as UTF-8 encoded characters / symbols.
Combinations include:
0xef  0x82  0xa7  = U+F0A7 (private use area), some kind of bullet point symbol
0xef  0x82  0xb7  = U+F0B7 (private use area), some kind of bullet point symbol (different to a7)
0xe2  0x80  0x93  = U+2013, en dash
0xe2  0x80  0x9c  = U+201C, `` (smart doublequotes, left side of quoted material)
0xe2  0x80  0x9d  = U+201D, '' (smart doublequotes, right side of quoted material)
0xe2  0x80  0x99  = U+2019, ' (right single quote, used as an apostrophe)
0xe2  0x80  0xa6  = U+2026, ellipsis
0xe2  0x80  0x94  = U+2014, em dash
0xc3  0xa9  = U+00E9, e with acute (0xc3 0xa9 is how UTF-8 encodes
U+00E9, which accounts for the pattern)
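
For anyone who wants to poke at the tag contents, the layout described above can be sketched in Python roughly as follows. This is only a sketch under assumptions the notes do not confirm: the 4-byte length is read as little-endian, the text is decoded as UTF-8 (the multi-byte sequences listed above all happen to be valid UTF-8), and `parse_ocr_tag` and `describe` are made-up helper names, not anything in libtiff. The sample field is fabricated purely to exercise the code.

```python
import struct
import unicodedata


def parse_ocr_tag(payload):
    """Split the raw value of tag 37679 into phrases (hypothetical helper)."""
    if payload[:2] != b"\x01\x00":
        raise ValueError("expected the field to start with 0x01 0x00")
    # Assumption: the 4-byte length is stored little-endian.
    (length,) = struct.unpack_from("<I", payload, 2)
    # The stored length is 6 less than the field size
    # (2 leading bytes + 4 length bytes).
    if length + 6 != len(payload):
        raise ValueError("stored length does not match the field size")
    body = payload[6:]
    # The field ends with 0x0d 0x00; drop that terminator if present.
    if body.endswith(b"\x0d\x00"):
        body = body[:-2]
    # Assumption: the text is UTF-8; bad bytes are replaced, not fatal.
    text = body.decode("utf-8", errors="replace")
    # Phrases are delimited by a space followed by a newline.
    return [phrase for phrase in text.split(" \n") if phrase]


def describe(hex_seq):
    """Decode one observed byte sequence as UTF-8 and name the character."""
    ch = bytes.fromhex(hex_seq).decode("utf-8")
    name = unicodedata.name(ch, "(unnamed, private use)")
    return "U+%04X %s" % (ord(ch), name)


# Made-up sample field built to match the layout described above.
body = "First phrase \n\u201cquoted\u201d caf\u00e9 \n".encode("utf-8") + b"\x0d\x00"
sample = b"\x01\x00" + struct.pack("<I", len(body)) + body

print(parse_ocr_tag(sample))
for seq in ("ef82a7", "e28093", "e28094", "c3a9"):
    print(seq, describe(seq))
```

The two bullet sequences fall in the private use area, so what they actually look like depends on the font the document used; the code only reports the code point for those.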

My current set of notes is attached.

I've been hacking libtiff to try to figure out what is going on. A CVS diff is 
also attached; you'll need the tif_mdi.c file as well if you want it to 
compile (it goes into libtiff/libtiff/).

Work on the actual image content (compression type) has only just started. 
Right now I have absolutely no idea what the format could be, although it 
certainly appears to be compressed in some way. I generated a few trivial 
files (a large filled blue rectangle, the same rectangle without fill, the 
same rectangle in a slightly different shade of blue, and the same rectangle 
in green). Here is what the content looks like:
/home/bradh/mdi/greenrect.MDI:
char count: 336
02 00 00 00 7b 00 00 00 ac 02 00 00 03 00 00 00 00 00 00 00 ff ff ff ff 00 00 
00 00 78 01 5d 91 bd 4a c4 50 10 85 cf cd 46 37 82 8a 18 11 41 8b b5 f0 0f 1b
 c1 b5 b7 d1 46 5c c4 42 7b d1 42 10 0b 5d b0 dd c2 c2 97 c8 1b 58 28 68 ef 33 
f8 04 3e 85 ed fa 9d 35 37 9b cd c0 c9 3d 33 77 66 ce 64 6e 90 f4 0c 8a 20 e5
89 d4 cb e0 6d 02 a5 65 e7 52 da 97 3a 47 a7 c7 52 d0 2b 39 eb dc b5 c0 2c
b8 2b f3 de 66 a4 8f 39 e9 1b 7f 87 5e 75 3b e8 b6 b4 ff 92 ea 4c 0f ba d5 bd
fa ea 80 1b 3d f2 b5 6d 00 f7 42 76 10 39 6e c5 e7 e1 ab 04 16 40 34 a4 26 fc 
2b 7c f7 b0 f4 75 c9 97 a8 7b 82 73 4e 58 d4 e0 57 2b 8d 29 f8 2e 59 2b e0 eb
 67 38 ac 23 e6 37 e7 5b 26 d7 9a ae 59 2c b9 f5 ad e7 b8 67 ac f3 13 fc 43 b0 
07 d0 63 0b ff f3 ba a6 fe 6f ef ec f2 d3 c5 a5 45 3f ce e1 b9 b7 b8 b3 ae 6b
47 36 10 2f 33 de a5 e3 17 c0 5a cd b7 b5 76 11 f2 a4 97 19 45 bb 08 46 9e
6c 12 5f 03 a3 46 9c 95 d1 3b 6a fb ee 12 74 41 b3 ef 34 b1 df 74 dc 2f f6 77
2c d6 37 77 b8 4d 8d 77 e5 be 91 7b 76 de bc 7a 37 ef c6 7b fc 03 fe c8 3b 4d

/home/bradh/mdi/bluerect.MDI:
char count: 337
02 00 00 00 7b 00 00 00 ac 02 00 00 03 00 00 00 00 00 00 00 ff ff ff ff 00 00 
00 00 78 01 5d 91 bd 4a c4 50 10 85 4f b2 d1 8d a0 22 46 44 d0 62 2d fc c3 46
 70 ed 6d dc 66 71 11 0b ed 45 0b 41 2c 74 c1 d6 c2 c2 97 c8 1b 58 28 28 58 fa 
0c 3e 81 4f 61 1b bf b3 e6 66 b3 19 38 b9 67 e6 ce cc 99 cc 8d 24 3d 81 3c 92
b2 58 1a a4 f0 36 81 d2 d2 53 29 19 4a 9d a3 e3 9e 14 e9 85 9c 75 ee 5a 60
16 dc 94 79 af 33 d2 fb 9c f4 8d bf 43 af ba 1d 74 5b da 7f 4e 74 a2 3b 5d eb
56 43 75 c0 95 ee f9 da 36 80 7b 21 fb 18 38 6e c5 e7 e1 ab 04 16 40 30 a4 26 
fc 0b 7c f7 b0 f4 65 c9 97 a8 7b 80 73 4e 58 d0 e0 57 2b 8d 29 f8 2e 59 2b e0
 eb a7 28 ea 08 f9 cd f9 96 c9 b5 a6 6b 16 4b 6e 7d eb 39 ee 19 eb bc 8f 7f 08 
f6 00 7a 6c e1 7f 5e d7 d4 ff ed 8d 5d 7e b8 b8 b4 e0 87 39 3c f7 16 77 d6 75
ed c8 3e 0b 5e 66 bc 4b c7 cf 80 b5 9a 6f 6b ed 3c ca e2 41 6a e4 ed 3c 32
b2 78 93 f8 1a 18 35 e2 ac 8c de 41 db 77 e7 a0 0b 9a 7d a7 89 fd 26 e3 7e a1
bf 63 a1 be b9 c3 6d 6a bc 2b f7 0d dc b3 f3 e6 d5 bb 79 37 de e3 1f b4 ae 3d 
bb

/home/bradh/mdi/blue8rect.MDI:
char count: 339
02 00 00 00 7b 00 00 00 ac 02 00 00 03 00 00 00 00 00 00 00 ff ff ff ff 00 00 
00 00 78 01 5d 91 bf 2e 04 61 14 c5 cf cc 0e 3b 12 44 cc 46 24 14 ab f0 2f 1a
 89 d5 6b 6c 23 36 a2 a0 17 0a 89 28 d8 44 ab 50 78 05 c5 bc 81 82 84 de 03 a8 
3c 81 a7 d0 8e df 61 be d9 d9 b9 c9 99 ef dc fb dd 7b cf 9d fb 45 92 1e 40 1e
49 59 2c 0d 52 78 9b 40 69 e9 b1 94 0c a5 ee fe 61 5f 8a f4 4c ce 0a 77 2d
30 0d ae ca bc 97 29 e9 6d 46 fa c2 df a4 57 dd 76 7b 2d ed 3c 26 3a d2 8d 2e
75 ad a1 ba e0 42 b7 7c 6d ab c0 bd 90 bd 0f 1c b7 e2 b3 f0 25 02 73 20 18 52 
63 fe 19 be 7b 58 fa bc e4 1d ea ee e0 9c 63 16 34 f8 d5 4a 63 02 be 45 d6 22
 f8 f8 2e 8a 3a 42 7e 73 be 05 72 ad e9 9a f9 92 5b df 7a 8e 7b c6 3a 3f c0 df 
03 db 00 3d b6 f0 3f af 6b ea ff f6 ca 2e df 5d 5c 5a f0 c3 1c 9e 7b 9d 3b eb
ba d6 f6 f4 59 f0 32 a3 5d 3a 7e 02 ac d5 7c 5b 6b e7 51 16 0f 52 23 6f e7
91 91 c5 6b c4 97 c1 5f 23 ce 60 ee 1d b4 7d 77 0a 7a a0 d9 77 92 d8 4f 32 ea
17 fa 3b 16 ea 9b 3b dc a0 c6 bb 72 df c0 3d 3b 6f 5e bd 9b 77 e3 3d fe 02 aa 
91 3f 15

/home/bradh/mdi/bluenofill.MDI:
char count: 299
02 00 00 00 7b 00 00 00 34 02 00 00 03 00 00 00 00 00 00 00 ff ff ff ff 00 00 
00 00 78 01 5d 91 b1 4a c4 50 10 45 6f 62 34 11 74 11 57 44 d0 62 1b c5 65 1b
 c1 d8 6f a3 8d b8 88 8d bd 68 21 88 85 2e d8 5a 58 ec 4f e4 1f 14 14 2c fd 06 
bf c0 af b0 8d e7 e2 7b 21 d9 81 9b 77 67 de cc bb 33 93 44 d2 0b a8 12 a9 9f
4a 93 02 9e 13 08 56 5c 48 d9 54 1a 1c 9f 9d 48 89 4a 72 b6 b9 e3 d0 0a b8
0b 79 af cb d2 fb aa f4 8d 3f e4 ad b6 1d 95 0b 3a 9c 65 3a d7 83 6e 75 af a9
06 e0 46 8f 7c 6d bb c0 6f 21 fb 1c 39 6e c3 7b 70 6b ae 81 68 48 75 fc 2b 7c 
bf 61 e9 eb c0 37 a8 7b 82 73 76 2c 6a 30 6a a3 b1 08 1f 91 b5 05 be 7e ea ba
 8d 98 3f df df 26 b9 d6 74 cd 7a e0 d6 b7 9e e3 ee b1 cd 4f f1 c7 e0 00 a0 c7 
16 fe fb 75 4d 7b b6 37 76 f9 e1 e2 60 d1 df c3 df 01 ce ef d8 67 9d c6 1e 7d
77 09 4a 30 ff 4f 97 88 fd 66 55 5e 25 46 3f 9d 14 46 95 3b 16 eb 3d e3 3e
79 71 97 ec be d9 9f 7b f4 3c 7f e9 85 33 2a

There are some obvious similarities between the files, but what they mean 
isn't clear to me. I'd appreciate any suggestions though!
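
One way to line the dumps up is to convert each to bytes and compare them mechanically. The sketch below is only that, a sketch: it assumes the first 28 bytes are seven little-endian 32-bit words (a guess from how the leading bytes group), and it checks whether the two bytes that follow pass the zlib header rule, since 78 01 at that offset happens to satisfy it. Whether the rest really is a zlib stream is not tested here, and the helper names are made up. The sample strings are just the opening bytes of the green and blue dumps above.

```python
import struct

HEADER_LEN = 28  # assumption: seven 32-bit words come before the payload


def parse_dump(hex_text):
    """Convert a hex dump to bytes, ignoring all whitespace and line wraps."""
    return bytes.fromhex("".join(hex_text.split()))


def header_words(data):
    """Read the leading bytes as seven little-endian 32-bit words (a guess)."""
    return struct.unpack("<7I", data[:HEADER_LEN])


def looks_like_zlib(data, offset=HEADER_LEN):
    """Apply the zlib header rule: deflate method, 16-bit value divisible by 31."""
    cmf, flg = data[offset], data[offset + 1]
    return (cmf & 0x0F) == 8 and ((cmf << 8) | flg) % 31 == 0


def common_prefix_len(a, b):
    """Count how many leading bytes two dumps share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Opening bytes of greenrect.MDI and bluerect.MDI, copied from the dumps.
green = parse_dump("""
02 00 00 00 7b 00 00 00 ac 02 00 00 03 00 00 00 00 00 00 00 ff ff ff ff 00 00
00 00 78 01 5d 91 bd 4a c4 50 10 85 cf cd
""")
blue = parse_dump("""
02 00 00 00 7b 00 00 00 ac 02 00 00 03 00 00 00 00 00 00 00 ff ff ff ff 00 00
00 00 78 01 5d 91 bd 4a c4 50 10 85 4f b2
""")

print(header_words(green))             # (2, 123, 684, 3, 0, 4294967295, 0)
print(looks_like_zlib(green))          # True
print(common_prefix_len(green, blue))  # 38
```

If that check also holds on the complete files, `zlib.decompress(data[28:])` from the Python standard library would be a natural next experiment; it raises `zlib.error` when the bytes are not a valid stream, so a failure there is informative too.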

I'm happy to provide the test file collection (each file is less than 
5 kbytes) to anyone with an interest; just let me know.

Brad