2004.08.22 11:37 "[Tiff] tiff2pdf: character encoding in PDF meta-data", by Peter Adolphs

2004.08.22 16:23 "Re: [Tiff] tiff2pdf: character encoding in PDF meta-data", by Ross A. Finlayson

Hi Peter,

Do you know the codepage on your system? For example, it might be ISO_8859-1 or ISO_8859-15. The PDF metadata strings in the Info dictionary (for the PDF names /Title, /Author, and so on) are probably stored in PDFDocEncoding or else in a Unicode encoding. PDFDocEncoding matches ASCII for the characters 32-126 of 0-255, while the values 128-255 do not necessarily match any other code page or glyph map.
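If you are not sure which codepage is in use, a POSIX system can usually report it. The following is only a minimal sketch, assuming setlocale() and nl_langinfo(CODESET) are available on your platform; host_codeset is a hypothetical name and not part of tiff2pdf.

```c
#include <langinfo.h>
#include <locale.h>

/* Hypothetical helper, not part of tiff2pdf: return the name of the
 * character encoding used by the user's locale. Assumes a POSIX system. */
const char *host_codeset(void)
{
    /* Adopt the locale from the environment (LANG, LC_ALL, ...).
     * If that fails, the program stays in the default "C" locale. */
    setlocale(LC_CTYPE, "");

    /* CODESET asks for the encoding name of the current locale. */
    return nl_langinfo(CODESET);
}
```

Printing the result with printf("%s\n", host_codeset()) should show the encoding name for your environment. The exact spelling of the name varies between systems.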

It would be convenient to add the feature you request if there were a standard C library function to convert a string in place from the host codepage to PDFDocEncoding, but there is not.

You might consider writing your own small application to convert the string's data, for example using something like ICU or ICU4J.

Basically, it is simpler than that. What you need is a function to convert the given input string into PDFDocEncoding. For each character with bit seven set, i.e. 128-255, you might need to set it to a different value. For strings in the PDF besides the document information and bookmark titles, you could use a font mapping with differences, but the PDFDocEncoding for the Info values is immutable. So, for example, you would write a function to convert a string in the input codepage to PDFDocEncoding:

/* Note: a C identifier cannot contain '-', so the name uses ISO_8859_15. */
int t2p_convert_ISO_8859_15_to_PDFDocEncoding(char* str){

        uint32 i = 0;
        if(str != NULL){
                while(str[i] != 0){
                        /* bit seven set: code point is in 128-255 */
                        if(str[i] & 0x80){
                                /* set str[i] to the correct value */
                        }
                        i++;
                }
        }
        return 0;
}

Then, call that function from t2p_validate on each of the string members of the T2P*:

t2p_convert_ISO_8859_15_to_PDFDocEncoding(t2p->pdf_creator);
t2p_convert_ISO_8859_15_to_PDFDocEncoding(t2p->pdf_author);
t2p_convert_ISO_8859_15_to_PDFDocEncoding(t2p->pdf_title);
t2p_convert_ISO_8859_15_to_PDFDocEncoding(t2p->pdf_subject);
t2p_convert_ISO_8859_15_to_PDFDocEncoding(t2p->pdf_keywords);

To set str[i] to the correct value, you could have a function return a char when given str[i] that is the correct code point in PDFDocEncoding to match the code point in your encoding. You could make an array of char points[256] and index into the array by (unsigned char)str[i], or you could have a switch statement, with cases for each point that is different, and a default for those points that are the same.
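As a rough illustration of the switch statement variant, here is a sketch for ISO 8859-15 input. The function names are hypothetical. The target code points are my reading of the PDFDocEncoding table in the PDF Reference appendix, so please check them against the specification before relying on them. The replacements chosen for characters that have no PDFDocEncoding equivalent ('?', ' ' and '-') are arbitrary assumptions.

```c
#include <stddef.h>

/* Hypothetical helper: map one ISO 8859-15 byte to PDFDocEncoding.
 * The code points are assumptions to be verified against the PDF Reference. */
static unsigned char iso8859_15_to_pdfdoc_char(unsigned char c)
{
    if (c < 0x80)
        return c;            /* lower half: left unchanged */
    if (c < 0xA0)
        return '?';          /* C1 control codes: arbitrary replacement */

    switch (c) {
    case 0xA0: return ' ';   /* no-break space; 0xA0 is taken by Euro */
    case 0xA4: return 0xA0;  /* Euro sign */
    case 0xA6: return 0x97;  /* S with caron */
    case 0xA8: return 0x9D;  /* s with caron */
    case 0xAD: return '-';   /* soft hyphen; no equivalent assumed */
    case 0xB4: return 0x99;  /* Z with caron */
    case 0xB8: return 0x9E;  /* z with caron */
    case 0xBC: return 0x96;  /* OE ligature */
    case 0xBD: return 0x9C;  /* oe ligature */
    case 0xBE: return 0x98;  /* Y with dieresis */
    default:   return c;     /* assumed identical in both encodings */
    }
}

/* Hypothetical in-place conversion of a NUL-terminated string.
 * Returns 0, following the convention of the function above. */
int convert_iso8859_15_to_pdfdoc(char *str)
{
    size_t i;

    if (str == NULL)
        return 0;
    for (i = 0; str[i] != 0; i++)
        str[i] = (char)iso8859_15_to_pdfdoc_char((unsigned char)str[i]);
    return 0;
}
```

The switch only lists the points that differ, which keeps the code short. A 256-entry array would do the same job with one lookup per character, at the cost of writing out the entire table.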

That's assuming I'm correct and that the metadata is supposed to be in PDFDocEncoding. Massive overexposure to specification can lead to acronym fatigue.

You might understand why that is not part of the program already: there are widely varying codepages in use, and generally handling character code page conversions, while not necessarily overly complicated, involves several hundred kilobytes of input source code, hopefully humanely generated.

So, you should be able to add this function to a local version or copy of your program. If you need further assistance, please feel free to contact me personally.

Regards,

Ross F.