Wednesday, June 30, 2010

gau ko vreji fi le samymri (or, Save the Email!)

by Jennifer Davis

The origins of languages are not usually well-documented. But as the brand-new language, Lojban, is being created in modern times, with details and decisions being hammered out via electronic communication, there is an opportunity to capture the linguistics and history from the very start. I talked to Robin Powell, the web administrator for lojban.org, about archiving and preserving the language's history.

Lojban is described as "a carefully constructed spoken language designed in the hope of removing a large portion of the ambiguity from human communication" on the official Lojban site. The web page states that Lojban began development in 1987, and Robin Powell, web administrator, treasurer, and secretary of the parent organization, says that a related mailing list has been around since 1989. Recently he took it upon himself to collect the group's emails from various venues since that time and consolidate them into a Google group. His experience might serve as a template for other web historians or those who are interested in archiving public electronic records.

Robin is a Linux systems administrator for EngineYard. He's been working with the Lojban organization for years; he has helped publish a book on Lojban and has organized Lojban-related conferences. At the time of this interview, he had spent about 60 hours on the email project.

The following conversation was conducted, appropriately, via AOL Instant Messenger (AIM).

Acquisition

Jennifer: Can you describe how you've been ferreting out the emails? I assume you've had to look in multiple places?

Robin: Well, there was an archive that was on the site when I took it over. Which was woefully incomplete, and covered about 1989 to April 1998. Far from complete coverage in that range, too.

Jennifer: So where did you turn from there?

Robin: I've mostly been breaking it down by months; without a known-complete archive, and without being willing to go through by hand and look for mails that seem to be replying to nothing or not have replies or that sort of forensics, I can't ever know what I'm missing for sure. There are mails from most months in that time period in the old site archive, but not all. Then there's the archive I've kept since I took over handling the mailing list on my computer; that one, I trust implicitly, but it starts in mid-2002. From late 1998 through to that time, it was on ONElist, which got taken over by Yahoo Groups, which still exists. I was able to retrieve their mail. The rest was covered by simply asking members of the community for what they had, which was woefully incomplete, it turns out.

Jennifer: How did you do that, exactly? The Yahoo Groups part?

Robin: I found someone's script for extracting mails from Yahoo Groups (yahoo2mbox). The Yahoo Groups archives cover about 28K mails, so not something I could do by hand. Then I had to deal with the address munging [a process of disguising email addresses]. For instance with old mails, they would come in as bob@XXXXXXXX.XXXX instead of bob@foobar.com, or whatever. That was semi-manual; hunt through the mails to find the real address, copy & replace throughout.
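[Editor's sketch: the copy-and-replace step Robin describes might look roughly like this in Python. This is not his script. The lookup table `KNOWN_ADDRESSES`, the function name `unmunge`, and the sample address are assumptions for illustration; the real addresses still have to be found by hand.]

```python
import re

# Hypothetical table built by hand: local part -> real address.
# Each entry would come from finding an unmunged copy of the sender's
# address somewhere else in the archive.
KNOWN_ADDRESSES = {
    "bob": "bob@foobar.com",
}

# Matches addresses whose domain was replaced by runs of X,
# e.g. bob@XXXXXXXX.XXXX
MUNGED = re.compile(r"\b([A-Za-z0-9._+-]+)@X+(?:\.X+)+")


def unmunge(text, known=KNOWN_ADDRESSES):
    """Replace munged addresses with known real ones; leave unknown ones alone."""
    def fix(match):
        local = match.group(1)
        return known.get(local.lower(), match.group(0))
    return MUNGED.sub(fix, text)
```

Keying on the local part alone is a simplification: two different people named "bob" at different domains would collide, which is one reason the real process stayed semi-manual.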

Jennifer: Where/how did you find the scripts?

Robin: The first script was something I had known about for years, and also friends mentioned it. The second was found through Google.

Jennifer: Were there any permissions issues for grabbing all those emails from Yahoo?

Robin: I can't see how there would be; it's a publicly accessible archive. I didn't ask, though.

Jennifer: Ah, okay...no membership required then?

Robin: I think if you're not logged in, you always get munged emails. If you are, only for the older ones. But I'm pretty sure you can see them without being logged in, other than that.

Format Conversion

Robin: It was trying to de-convert MHonArc emails that I ended up finding another script for. Well, MHonArc was one format. That's a system to archive mails on a website. So it converts them to HTML.

Jennifer: So then people sent you what they had stored up. What formats did they send them? Were they bundled, as in zip files?

Robin: They were all bundled. One of the people who sent me mails sent them already imported into MHonArc, which means they look nothing like UNIX mail files. But it turns out all the information, in particular the Message-ID, which is the most important bit, was in the MHonArc files.

So I started playing with de-converting them, because it covered periods I did not otherwise have coverage for, and discovered other people had already done that. I suspect the search I used to find the script was "mhonarc mbox", but I couldn't swear to it.

Jennifer: So it was good to have the MHonArc format in the end; it preserved more information?

Robin: I would MUCH rather have had a standard UNIX mailbox format. But it was better than nothing, and better than some of the other archives I was sent, which did not have Message-ID headers, which makes them nearly useless. Other than the MHonArc mails, they were mostly in variants of UNIX mailbox format; that is, plain text with some amount of headers.

I got enough that I have just shy of 60K mails (after extensive de-duplication) in the archive folders, and only a very small list of months have less than 30 mails, which is my arbitrary cutoff for "that month is probably missing stuff."
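[Editor's sketch: the 30-mails-per-month check could be expressed along these lines in Python. The function name `sparse_months` and the idea of feeding it raw Date headers are assumptions; the interview does not say how Robin counted.]

```python
from collections import Counter
from email.utils import parsedate_to_datetime

CUTOFF = 30  # the arbitrary threshold mentioned in the interview


def sparse_months(date_headers, cutoff=CUTOFF):
    """Return {(year, month): count} for months with fewer than `cutoff` mails."""
    counts = Counter()
    for raw in date_headers:
        try:
            sent = parsedate_to_datetime(raw)
        except (TypeError, ValueError):
            continue  # unparseable Date header; skip rather than guess
        counts[(sent.year, sent.month)] += 1
    return {month: n for month, n in sorted(counts.items()) if n < cutoff}
```

A month with no mails at all never appears in the counts, so a complete check would also need to walk the calendar from the first month to the last.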

Jennifer: Wow. So the MHonArc...does that give you a lot more info, or is it just a lot of header cruft?

Robin: A UNIX mail file with full headers is as much information as any mail format, at least for mail that's going out over the general Internet. Everything else is either (1) a strict subset (this is the most common), (2) a re-arrangement of the same data (MHonArc) or (3) contains idiosyncratic information particular to the person who archived the mail.

Jennifer: Did you ask for people's emails in email, IRC, or where?

Robin: I asked both in email and IRC. I did not specify format. Mostly it was zip or .tar.gz. In some cases it was one giant file with lots of mails in it. Which is a standard UNIX thing, actually; pretty much all mail used to be like that.
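[Editor's sketch: the "one giant file with lots of mails in it" layout is commonly called mbox, and Python's standard `mailbox` module can read it. The function name `list_message_ids` is an assumption, and this only illustrates the format; it is not one of Robin's scripts.]

```python
import mailbox


def list_message_ids(mbox_path):
    """Return the Message-ID of every mail in an mbox file (None if missing)."""
    # create=False makes a wrong path fail loudly instead of creating a file
    box = mailbox.mbox(mbox_path, create=False)
    try:
        return [msg.get("Message-ID") for msg in box]
    finally:
        box.close()
```

Mails that come back as `None` are the ones without a Message-ID, which is the case Robin singles out as the hardest to work with.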

Jennifer: About what percentage response do you think you got? In terms of people who answered versus people you asked.

Robin: I asked a mailing list with about 800 members; I got perhaps half a dozen helpful responses. But then, only a few people have been around long enough to have decent archives anyways.

Jennifer: You got 6 responses out of 800?

Robin: I have everything from Jan 1999 on, you see. So I was asking for the stuff I didn't have.

Jennifer: Okay, then, 6 responses out of how many ideally? Like, how many major long-term players?

Robin: I actively communicated with every long-term player that I could easily get ahold of. I wasn't about to go rooting around for ancient email addresses or anything. Everyone I explicitly asked, responded. Just turns out that people's archives have been lost/destroyed in various ways. Or never existed. Several people simply didn't archive in the first place; short on disk space.

What Was Lost

Jennifer: Were there any stories of how others lost stuff?

Robin: Quoting: "I searched through the backups I have on my home computer and found complete mailbox copies for (1993-10 - 1996-02), (1996-04 - 1996-08) and (1996-11 - 1997-09) plus the whole of 1992 split into individual postings and converted to HTML. My personal mail archives at work (including the backups) were shredded when I retired so I cannot check whether there might have been any additional saved archives." One member lost his to a drive crash.

Jennifer: And as far as you know, none of the emails that you've saved through the years have been lost--i.e., you haven't had any of those disasters yourself?

Robin: I do not seem to have my own archives of Lojban list mail before about 2004, but then I didn't look carefully because I only joined the list in 2001, and as I said all that time period is covered. Certainly the archives that I caused to be made automatically once I took over the list are fully intact, and with complete headers, and formatted usefully, and so on. I went to some effort to ensure that. It's a thing--a personal goal/pride. The idea that a mailing list I run would not also, as a side effect, generate archivally useful versions of itself is abhorrent to me.

Jennifer: Do you have more you think you need to collect?

Robin: There is more I would like to collect. I'm almost certain, for example, that we're missing much of March 1998. But I don't see any way to get it.

Archiving

Jennifer: So what else do you intend to do on this project?

Robin: The rest of it has been merging, converting, de-duplicating, and uploading to Google Groups. The actual goal here is two-fold: (1) have an as-pristine-as-possible archive of the list, and (2) move the list to Google Groups so I don't have to manage it anymore. I would feel bad if I moved it without also uploading a decent archive.

Jennifer: Are you doing these things by hand, per email?

Robin: Heh. There are, right now, 47,855 emails that my scripts considered "good". Counting all the duplicates and rejects that I've expunged, the grand total is 95,554, it seems. If I took 30 seconds per email, that would be about 30 days of 24/7 work.

Jennifer: Okay. So how are you accomplishing those 4 fine things? Especially considering that you are dealing with emails in multiple formats--UNIX standard, MHonArc, and I assume others?

Robin: Variants on UNIX standard; missing headers, extra headers, that sort of thing. Many many mails without Message-ID, which is the worst. The one guy whose emails include, in the body above the regular text, a copy of some of the headers, for no apparent reason, blocked out in a special format. *shudder*

Jennifer: No unique identifier?

Robin: Correct.

Jennifer: So how are you doing these things?

Robin: Scripts. Some of which require a small amount of input from me, but mostly automated. Whole pile of scripts. Mostly /bin/sh, but also some Perl.

Jennifer: Have you done any testing on these scripts?

Robin: I suppose it depends how you define "testing." But certainly I test each before I let it loose on the archive as a whole. It's hard to recover from mistakes at this stage of the game.

Jennifer: Do you keep a copy of the originals?

Robin: As much as I can without causing myself too much extra work, yes. I have 3 or 4 copies of the whole archive right now, snapshots taken at various points.

Jennifer: So give me an example of testing.

Robin: Copying a month's worth of data and running the script on the copy. And then deleting and copying again, cuz it didn't work. :)

Jennifer: How can you tell it didn't work? You check a few? You read through the whole thing?

Robin: Depends on the script, and what it's supposed to do. Because of the nature of this process, I'm usually leaving debugging output turned on. So I'll watch that scroll past and eyeball for problems. Which means that every time the script makes a change, it either says what it's going to change, or shows me diffs [a UNIX command that finds all differences between two files] afterwards, or both. My basic testing usually means taking a copy, running the script, and then diffing.
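[Editor's sketch: the copy, run, diff routine could be approximated in Python as below. The function name `trial_run` is an assumption, and `script` is any Python callable standing in for one of the shell or Perl scripts, which are not shown in the interview.]

```python
import difflib
import shutil
import tempfile
from pathlib import Path


def trial_run(month_dir, script):
    """Run `script` on a throwaway copy of `month_dir`; return unified diff lines."""
    month_dir = Path(month_dir)
    diffs = []
    with tempfile.TemporaryDirectory() as scratch:
        copy = Path(scratch) / month_dir.name
        shutil.copytree(month_dir, copy)  # the original is never touched
        script(copy)
        for original in sorted(p for p in month_dir.rglob("*") if p.is_file()):
            changed = copy / original.relative_to(month_dir)
            before = original.read_text(errors="replace").splitlines(keepends=True)
            # A file the script deleted or moved aside shows up as fully removed
            after = (changed.read_text(errors="replace").splitlines(keepends=True)
                     if changed.exists() else [])
            diffs.extend(difflib.unified_diff(before, after,
                                              fromfile=str(original),
                                              tofile=str(changed)))
    return diffs
```

An empty result means the script changed nothing in the existing files. Files the script newly creates are not reported by this sketch.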

Jennifer: Where are you keeping all these copies?

Robin: On my hard drive? No particular organization, if that's what you're asking.

Jennifer: And how do you back that up? Nightly? Offsite?

Robin: Yep. It's all automated; has been for years.

Weeding for Duplicates

Robin: I actually haven't talked about any of the hard bits; I've had to do 3 different types of de-duplication, for example. Crazy stuff.

I have one script that, for any given month, compares all of the email bodies with all of the other email bodies stripped of whitespace, and considers any identicals to be duplicates and tosses one out. (Not permanently; it gets moved aside, that whole 95k vs. 48k thing.) Got thousands of hits on that; no idea what information might have been lost (like if one copy says the mail is from Alice but the other copy says it's from Bob; I just toss one at random). But since the bodies are the same, any loss will be minor.
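[Editor's sketch: a body-based pass of this kind might look like the following in Python. The names `body_key` and `split_body_duplicates` are assumptions, and keeping the first copy seen is a simplification of "toss one at random".]

```python
import re


def body_key(body):
    """Collapse a mail body to a comparison key by removing all whitespace."""
    return re.sub(r"\s+", "", body)


def split_body_duplicates(mails):
    """Partition (name, body) pairs into kept mails and set-aside duplicates.

    Returns (kept, set_aside). Each set_aside entry is
    (duplicate_name, name_of_the_kept_mail_it_matched), so nothing is
    discarded without a record of why.
    """
    seen = {}
    kept, set_aside = [], []
    for name, body in mails:
        key = body_key(body)
        if key in seen:
            set_aside.append((name, seen[key]))
        else:
            seen[key] = name
            kept.append(name)
    return kept, set_aside
```

Using a dictionary keyed on the stripped body avoids comparing every mail against every other mail directly, while giving the same result for exact matches.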

Another that does the same thing, but with the headers, ignoring the body. Because if all the headers are the same, it's the same mail; any differences are encoding, or errors. But again, since one mail might be an error and the other not, I may be losing information there, too. Which is why I don't actually throw them away; if someone comes back years later and says "Hey, where's the rest of this mail?", I want to at least have a chance of being able to answer the question. Got thousands of hits on that one too.
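[Editor's sketch: the header-only comparison could be written like this in Python using the standard `email` package. The names `header_key` and `header_duplicates` are assumptions, as is the choice to normalize whitespace inside header values.]

```python
from collections import defaultdict
from email import message_from_string


def header_key(raw_mail):
    """Build a comparison key from a mail's headers, ignoring the body."""
    msg = message_from_string(raw_mail)
    # Lower-case names and collapse whitespace so folded lines compare equal
    return tuple((name.lower(), " ".join(value.split()))
                 for name, value in msg.items())


def header_duplicates(mails):
    """Group mail names whose headers are identical.

    `mails` maps a name (for example a filename) to the raw mail text.
    Only groups with more than one member are returned.
    """
    groups = defaultdict(list)
    for name, raw in mails.items():
        groups[header_key(raw)].append(name)
    return [names for names in groups.values() if len(names) > 1]
```

Returning whole groups rather than deleting anything matches the approach in the interview of moving suspected duplicates aside instead of throwing them away.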

Third one is header-based, but respects Message-ID. It treats any two mails with the same message id as potential duplicates, but this one tries to be smarter; it tries tossing away some headers, and re-arranging others. If it can make the headers look identical by doing that, it calls them duplicates. If it can't, it asks me. Probably only a few hundred hits with that one, but since it's manual and I'm not done uploading, I'm not actually done running it.
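[Editor's sketch: the Message-ID pass, with its fall-back to asking a human, might be structured as below in Python. The set `IGNORED` is a guess at which headers vary by delivery path; the interview does not say which headers Robin's script tosses or re-arranges. The names `normalized_headers` and `classify_pair` are likewise assumptions.]

```python
from email import message_from_string

# Assumed list of headers that differ between copies of the same mail.
# This is illustrative only; the real script's list is not given.
IGNORED = {"received", "return-path", "delivered-to", "status", "x-status"}


def normalized_headers(raw_mail, ignored=IGNORED):
    """Drop ignorable headers and sort the rest so ordering does not matter."""
    msg = message_from_string(raw_mail)
    return sorted((name.lower(), " ".join(value.split()))
                  for name, value in msg.items()
                  if name.lower() not in ignored)


def classify_pair(raw_a, raw_b):
    """Return 'different', 'duplicate', or 'ask' for two raw mails."""
    id_a = message_from_string(raw_a).get("Message-ID")
    id_b = message_from_string(raw_b).get("Message-ID")
    if not id_a or not id_b or id_a.strip() != id_b.strip():
        return "different"  # not even candidate duplicates
    if normalized_headers(raw_a) == normalized_headers(raw_b):
        return "duplicate"
    return "ask"  # same Message-ID but headers disagree: needs a human
```

The "ask" outcome is the manual step: the pair shares a Message-ID, but normalization could not make the headers match, so a person decides.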

I have almost everything automated at this point. The two stopping points are that I refuse to automate everything, because then things could get shredded and I'd never know, and Google Groups doesn't like how fast I'm uploading things. :)

Reuse?

Jennifer: Are those scripts that are so specialized to this process that no one else could use them, or could they be useful to others who might have to do something similar? And if the latter, would you be willing to make them available on your website?

Robin: That's kind of a toss-up; they're pretty idiosyncratic. I probably will make them available, but mostly undocumented. It just seems like way too much work at this point to document them properly, but I don't see any reason not to provide the scripts themselves and one-liner explanations.

Jennifer: Maybe they would be useful as a kind of template, for the types of things someone might need to do. Do you despair about having to archive emails that are "Me too!"--i.e., no useful content?

Robin: Not a bit. Because it's part of our history. Maybe someone flamed a "me too" poster in Lojban; that would be great historical stuff for us. If I worried about tossing useless mails, I'd have to risk losing context. And I don't have a prayer of wading through 50K mails for content anyways. :) Besides, bits are cheap.


Robin went on to comment that the project was far more work than he expected. He said that the community has been grateful for his work, which has helped fuel his ongoing efforts. He feels that these archives will provide insight into why decisions were made about language nuances. They will also document old usage because Lojban, even though it is a new language, has already evolved. This archive project, and others like it, are the new face of historical documentation. People like Robin with patience and training are needed to cull and preserve electronic conversations for archival purposes.


Jennifer is in her third year of a seven-year MLIS at SJSU. She hopes to become a data curator one day. In the meantime, she tests software part-time at a biotech company and savors one class per semester.
