AWARE [SYSTEMS] Imaging expertise for the Delphi developer

LibTiff Mailing List

TIFF and LibTiff Mailing List Archive
August 2004

The TIFF Mailing List Homepage
This list is run by Frank Warmerdam
Archive maintained by AWare Systems






Thread

2004.08.22 11:37 "tiff2pdf: character encoding in PDF meta-data", by Peter Adolphs
2004.08.22 16:23 "Re: tiff2pdf: character encoding in PDF meta-data", by Ross A Finlayson

2004.08.22 16:23 "Re: tiff2pdf: character encoding in PDF meta-data", by Ross A Finlayson

On Sun, 22 Aug 2004, Peter Adolphs wrote:
> tiff2pdf from libtiff 3.6.1 does not seem to correctly encode non-ASCII
> characters in meta-data as author, title, keywords, etc., that are
> provided as commandline arguments. Vowels with accents (â, á, è), German
> umlauts (ä, Ä, ü, Ü, ö, Ö), etc. are all mapped to \377 which is
> displayed as ¾. Should I file this as a bug or has this already been fixed?

Hi Peter,

Do you know the codepage on your system?  For example, it might be
ISO_8859-1 or ISO_8859-15.  The PDF metadata strings in the Info
dictionary (for the keys /Title, /Author, et cetera) are probably stored
in PDFDocEncoding or else in a Unicode encoding.  PDFDocEncoding matches
ASCII for the printable characters 32-126, while the values 128-255 do
not necessarily match any other code page / glyph map.
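
If the Unicode route is open to you, here is a minimal sketch of it.  It
assumes the input really is ISO_8859-1, where every byte value equals its
Unicode code point, and it relies on the PDF convention that a text string
beginning with the bytes FE FF is read as big-endian UTF-16.  The function
name latin1_to_pdf_utf16be_hex is made up for this example and is not part
of tiff2pdf.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of tiff2pdf.
 * Writes the ISO 8859-1 string `in` to `out` as a PDF hex string in
 * big-endian UTF-16 with a leading byte order mark, e.g. <FEFF00E4>.
 * Returns the number of characters written (without the terminating NUL),
 * or -1 if an argument is NULL or `out` is too small. */
int latin1_to_pdf_utf16be_hex(const char *in, char *out, size_t outsize)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t len, i, n = 0;

    if (in == NULL || out == NULL)
        return -1;
    len = strlen(in);
    /* '<' + "FEFF" + 4 hex digits per input byte + '>' + NUL */
    if (outsize < 4 * len + 7)
        return -1;

    out[n++] = '<';
    out[n++] = 'F'; out[n++] = 'E'; out[n++] = 'F'; out[n++] = 'F';
    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)in[i];
        /* Latin-1 byte 0xXX is code point U+00XX, so the high byte is 00 */
        out[n++] = '0';
        out[n++] = '0';
        out[n++] = hex[c >> 4];
        out[n++] = hex[c & 0x0F];
    }
    out[n++] = '>';
    out[n] = '\0';
    return (int)n;
}
```

This only holds for ISO_8859-1 input.  For ISO_8859-15 the eight code
points that differ from ISO_8859-1 would need their own Unicode values.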

It would be convenient to add the feature you request if there were a
standard C library function to convert a string in place from the host
codepage to PDFDocEncoding, but there is none.

You might consider writing your own small application to convert the
string's data, for example using something like ICU or ICU4J.

Actually, it is simpler than that.  What you need is a function to
convert the given input string into PDFDocEncoding.  For each character
with bit seven set, i.e. 128-255, you might need to set it to a different
value.  For strings in the PDF other than the document information and
bookmark titles, you could use a font mapping with differences, but the
PDFDocEncoding for the Info values is fixed.  A skeleton of such a
function for ISO_8859-15 input:

int t2p_convert_ISO_8859_15_to_PDFDocEncoding(char* str){

	uint32 i = 0;
	if(str != NULL){
		while(str[i] != 0){
			/* only bytes with bit seven set (128-255) can differ */
			if(str[i] & 0x80){
				/* set str[i] to the correct value */
			}
			i++;
		}
	}
	return 0;
}

Then, call that function from t2p_validate on each of the metadata
strings of the T2P*:

t2p_convert_ISO_8859_15_to_PDFDocEncoding(t2p->pdf_creator);
t2p_convert_ISO_8859_15_to_PDFDocEncoding(t2p->pdf_author);
t2p_convert_ISO_8859_15_to_PDFDocEncoding(t2p->pdf_title);
t2p_convert_ISO_8859_15_to_PDFDocEncoding(t2p->pdf_subject);
t2p_convert_ISO_8859_15_to_PDFDocEncoding(t2p->pdf_keywords);

To set str[i] to the correct value, you could have a function that, given
str[i], returns the char holding the matching code point in
PDFDocEncoding.  You could make an array char points[256] and index into
it by (unsigned char)str[i], or you could use a switch statement, with a
case for each point that differs and a default for those points that are
the same.
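
A minimal sketch of the switch variant follows.  The function names are
made up for this example, and the target values are my reading of the
PDFDocEncoding table in the PDF Reference, so please check them against
the specification before relying on them.  Only the code points where
ISO_8859-15 departs from ISO_8859-1 are listed, plus the no-break space,
on the assumption that PDFDocEncoding uses 0xA0 for the euro sign.
Everything else passes through unchanged, which is not necessarily right
for bytes 0x80-0x9F or for 0xAD.

```c
#include <stddef.h>

/* Hypothetical helper: map one ISO 8859-15 byte to PDFDocEncoding.
 * Target values are assumptions taken from the PDFDocEncoding table;
 * verify them against the PDF Reference. */
unsigned char iso8859_15_to_pdfdoc(unsigned char c)
{
    switch (c) {
    case 0xA0: return 0x20; /* no-break space -> plain space (assumed:
                               0xA0 is the euro sign in PDFDocEncoding) */
    case 0xA4: return 0xA0; /* euro sign */
    case 0xA6: return 0x97; /* S with caron */
    case 0xA8: return 0x9D; /* s with caron */
    case 0xB4: return 0x99; /* Z with caron */
    case 0xB8: return 0x9E; /* z with caron */
    case 0xBC: return 0x96; /* OE ligature */
    case 0xBD: return 0x9C; /* oe ligature */
    case 0xBE: return 0x98; /* Y with dieresis */
    default:   return c;    /* treated as identical in both encodings */
    }
}

/* Convert a NUL-terminated ISO 8859-15 string in place.
 * A NULL pointer is accepted and ignored.  Always returns 0. */
int convert_iso8859_15_to_pdfdoc(char *str)
{
    size_t i;

    if (str == NULL)
        return 0;
    for (i = 0; str[i] != 0; i++) {
        unsigned char c = (unsigned char)str[i];
        if (c & 0x80) /* only bytes 128-255 can differ */
            str[i] = (char)iso8859_15_to_pdfdoc(c);
    }
    return 0;
}
```

The conversion is safe to do in place because each input byte maps to
exactly one output byte, so the string never grows.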

That's assuming I'm correct and that the metadata is supposed to be in
PDFDocEncoding.  Massive overexposure to specification can lead to acronym
fatigue.

You might understand why that is not part of the program already:  there
are widely varying codepages in use, and generally handling character code
page conversions, while not necessarily overly complicated, involves
several hundred kilobytes of input source code, hopefully humanely
generated.

So, you should be able to add this function to a local version or copy of
your program.  If you need further assistance, please feel free to contact
me personally.

Regards,

Ross F.