2003.12.16 15:34 "Re: [Tiff] Stupid question", by Joris
I think my only hope of completing this project is not to go about things manually anyway. For example, the 'old' on-line archive that goes back to August 1999 is 6217 messages long, and accessible only through 6217 separate HTML pages. I didn't quite feel up to hitting 'Save as' 6217 times, so I brewed a few lines of code that downloaded the 6217 messages to my hard disc with a single click. I figure that's the only feasible strategy for all post-processing and most of the grouping and indexing too. So I don't mind CR/LF; in fact, that's going to be quite a relief after brewing code to extract the true messages from the HTML pages and filtering out all the Viagra-related stuff.
You can say that again. ;-)
No, really, it's not that bad. The HTML is very predictable and uniform across pages, as is typical for generated HTML, of course, so I coded up a little thing to extract the first message (your 'test' message, quite a symbol), and that is already enough to handle most pages.
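For the curious, the kind of extraction I mean looks roughly like the sketch below. This is not my actual code; it assumes, purely for illustration, that each generated page wraps the message text in a single <pre> block, and the names PRE_BLOCK, TAG and extract_message are made up for this example.

```python
import html
import re
from typing import Optional

# Assumption (hypothetical layout): the message text sits in the first
# <pre>...</pre> block of each generated page.
PRE_BLOCK = re.compile(r"<pre[^>]*>(.*?)</pre>", re.DOTALL | re.IGNORECASE)
TAG = re.compile(r"<[^>]+>")


def extract_message(page: str) -> Optional[str]:
    """Return the plain-text message found in one HTML page, or None."""
    match = PRE_BLOCK.search(page)
    if match is None:
        return None  # page does not follow the assumed layout
    # Strip any inline markup first, then decode entities such as &lt;
    text = TAG.sub("", match.group(1))
    text = html.unescape(text)
    # Normalise CR/LF line endings to plain LF
    return text.replace("\r\n", "\n").strip()
```

Because generated pages are so uniform, a pattern this simple goes a long way; pages that return None are the ones that need a closer look by hand.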
Note, Andrey and I are happy to provide more direct access to the archives... such as the mbox format recent archive if you need it.
The extraction from those HTML pages is nearly finished. But I figure I need to handle this mbox format anyhow, if I understand you correctly that this is the format of the 2003 archive. And I'll most definitely end up with code eliminating duplicates, so anything is helpful.
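To give an idea of the duplicate elimination I have in mind, here is a minimal sketch using Python's standard mailbox module. It assumes that the Message-ID header is a good enough identity for a message, with a fallback on From/Date/Subject when that header is missing; the function names dedup_key and unique_messages are hypothetical, and the real thing will likely need fuzzier matching.

```python
import mailbox


def dedup_key(msg):
    """Build an identity key for a message (assumption: Message-ID is reliable)."""
    msg_id = msg.get("Message-ID")
    if msg_id:
        return str(msg_id).strip()
    # Fallback for messages without a Message-ID header
    return "|".join(str(msg.get(name, "")) for name in ("From", "Date", "Subject"))


def unique_messages(path):
    """Return the messages of an mbox file, keeping only the first of each duplicate."""
    box = mailbox.mbox(path, create=False)
    seen = set()
    unique = []
    try:
        for msg in box:
            key = dedup_key(msg)
            if key in seen:
                continue  # already seen this message, skip the duplicate
            seen.add(key)
            unique.append(msg)
    finally:
        box.close()
    return unique
```

The same key function could be applied to the messages extracted from the HTML pages, so that both sources merge into one duplicate-free set.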
Nevertheless, the folder with the auto-downloaded HTML pages, each saved to a separate text file, is about 30 meg. This mbox format is a lot more efficient, but still, I think it'll be 10 to 15 meg, right? It's not going to be easy to mail that to me, especially seeing as my provider's mail servers are having a bad week again (or is it a bad year?). So, bottom line is: yes, I'd appreciate receiving 'em, unless it's too much trouble sending them.
Also, we are happy to host what you come up with as a cleaned up archive.
Let's first wait and see if I get this job done; I'm not yet quite 100% sure that it's going to be feasible, even with maximum non-manual handling. I've got a few ideas for useful regrouping as a kind of context-sensitive help. Except that it'll be more like 'TIFF-sensitive' instead of context-sensitive, and more like 'knowledge base' instead of help. That is not quite the complete archive, but it is almost guaranteed to be feasible, I think... So I guess something useful is going to come out of this anyhow, even though it may perhaps not be a completely restored archive for the last 15 years or so. Anyway, let's just wait and see first.
As to the hosting... That may be the only part of the job that might be somewhat beneficial to me. I may want to take you up on the hosting offer, but I'm also wondering if you would mind me hosting the results myself. Even if I'm allowed to host it myself, it will, of course, be a completely free download, with no strings or even advertising attached, except for the hosting domain name and maybe a single pointer to my site or something. If this is not acceptable, just say so; I'm not going to do anything on this project that either you or Andrey object to. Your approval is important to me.