2003.12.16 23:46 "Re: [Tiff] Stupid question", by Joris
No pressure. Frankly if you could assemble the email, clean out the spam and make it available for a normal threaded archive I think that would be great.
When text is concerned, I strongly believe in 'single-sourcing'. You know, grabbing data, building the code to convert it to a proprietary format that is really cut out for the particular data, and next being able to build the code to convert it into anything else. The main point of this 'single-sourcing' is that the actual content should not be tied to any output format. Since we live in a real world, that means it should have a format that is exactly fitted to the data and can be transformed into any other.
Having progressed a little more, I can see now that I had best work in two stages. Indeed, there should be no regrouping or categorizing in this first stage. Therefore, the result of the first stage is a normal threaded archive. I'll be able to deliver mbox, I'm sure, as well as HTML pages, a Word doc, PDF, or just about anything. I have already built a limited HTML and PDF codec that I can reuse, and I have experience with Word OLE, so only mbox is new to me, but that will be no problem at all, I guess. I realize this talk about an in-between 'proprietary format' may not sound very conventional, but, trust me, I have done similar stuff before. The only real reason for concern is the magnitude of the project.
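As an illustration of the mbox delivery mentioned above, here is a minimal sketch using Python's standard mailbox module. It is not the code I am writing for the archive: the function name write_mbox and the dictionary keys (from, subject, date, body) are made up for the example, and they stand in for whatever the intermediate format really holds.

```python
import mailbox
from email.message import EmailMessage


def write_mbox(path, messages):
    """Append already-cleaned messages to an mbox file at `path`.

    `messages` is a list of dicts with the hypothetical keys
    'from', 'subject', 'date' and 'body'.
    """
    box = mailbox.mbox(path)  # creates the file if it does not exist
    try:
        for m in messages:
            msg = EmailMessage()
            msg["From"] = m["from"]
            msg["Subject"] = m["subject"]
            msg["Date"] = m["date"]
            msg.set_content(m["body"])
            box.add(msg)
    finally:
        # close() flushes pending writes to disk.
        box.close()
```

The sketch does no file locking, so it assumes nothing else is writing to the same mbox at the same time.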
If you can run (and trust) a Windows executable, I'll send you, in a few days, a viewer for the proprietary format together with the data I have processed up to that point. That executable will already be able to build plain text output, HTML pages and a Word doc. If not, I'll send you the HTML pages. That is, if you would like to receive such a preview, of course. The current processing involves:
- extracting from the HTML pages (the aug 99 archive)
- processing of the headers to no longer include 'Company' and 'CC' and 'Reply-To' and such, if originally present
- processing of the 'From' field in the header to uniquely identify a sender, even if that sender changed his/her e-mail address during these many years
- converting the date indications to GMT
- filtering out test messages, mailing list software generated messages, and spam
- converting the occasional HTML message to plain text
- Perhaps I'll add some de-word-wrapping code, so as to end up with formattable text, as opposed to pre-formatted, but I haven't given that a lot of thought yet.
All of these actions are done automatically, for the most part, by my temporary dirty code. But I still review each processed message by hand, either to give my consent or to improve the code, and I plan to continue doing that... which means that this will take some time...
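To make three of the steps above concrete (dropping unwanted headers, identifying the sender, converting dates to GMT), here is a rough sketch in Python. It is only an illustration, not my actual code. The header set, the alias table and all function names are assumptions made up for the example, and treating a date without a time zone as GMT is also just an assumption.

```python
from datetime import timezone
from email.utils import parseaddr, parsedate_to_datetime

# Headers to drop. The list says 'Company', 'CC', 'Reply-To' "and such";
# this exact set is an assumption.
DROPPED_HEADERS = {"company", "cc", "reply-to"}

# Hypothetical, hand-maintained table mapping every known address of a
# sender to one canonical identity.
SENDER_ALIASES = {
    "old.address@example.com": "Example Sender",
    "new.address@example.org": "Example Sender",
}


def clean_headers(headers):
    """Return a copy of `headers` without the unwanted fields."""
    return {k: v for k, v in headers.items()
            if k.lower() not in DROPPED_HEADERS}


def canonical_sender(from_field):
    """Map a 'From' value to one identity, whatever address was used."""
    name, addr = parseaddr(from_field)
    return SENDER_ALIASES.get(addr.lower(), name or addr)


def to_gmt(date_field):
    """Parse an RFC 2822 style date and convert it to GMT (UTC)."""
    dt = parsedate_to_datetime(date_field)
    if dt.tzinfo is None:
        # Assumption: a date without a usable zone is taken as GMT.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


def process(headers):
    """Apply the three steps to one message's header dict."""
    out = clean_headers(headers)
    out["From"] = canonical_sender(headers["From"])
    out["Date"] = to_gmt(headers["Date"]).strftime("%Y.%m.%d %H:%M")
    return out
```

For example, a message dated "Tue, 16 Dec 2003 18:46:00 -0500" would come out of process() with the date "2003.12.16 23:46". The alias table is the part that cannot be automated; it has to grow from the manual review.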
Even better if you can extract a few FAQs with answers.
In a second stage, I would still like to build a 'TIFF-sensitive knowledge base' kind of thing, one that is able to enumerate relevant pointers into the spec and relevant mailing list messages from any TIFF page, and I like this idea of a FAQ too. Categorizing the messages in the complete archive built in the first stage will be the main issue here, I guess.
I don't see any problem with your hosting a copy, or any reasonable use you put the email archive to. You have my complete approval.