The mbox Format
Original source: to be verified
Original author: Heinz Tschabitscher
License: unknown
Translator: MillenniumDark
Proofreader: MillenniumDark
Applies to: all versions
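How do email clients store mail on the hard disk?
When you look at your email client, you will usually find your messages stored in folders. There is one folder for Fred, one for the duck-painters mailing list, another for work, and so on. This is how mail storage looks inside the email program. But how does the program store the messages on the hard disk?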
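Email Storage
It would be possible to keep all messages in a single file, together with information about which folder each message should appear in. This kind of mail storage can be implemented as a database, for example, and sometimes is.
Another approach is to create one file per message and arrange the files in a file system hierarchy that mirrors the folders in the user interface. While this approach may have security benefits, it requires a lot of hard disk activity, which is slow.
Between these two approaches lies a third. We create one file for every mailbox and put all the messages of the corresponding folder into that file. Most email clients use this format, which is called the mbox format.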
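The mbox Format
If we store messages in the mbox format, we put all of them in one file. The result is a text file of varying length that contains one message after another (Internet email always exists only as 7-bit ASCII text; everything else, such as attachments, is encoded into that form). How do we know where one message ends and the next begins?
Fortunately, every email has at least one From line at its very beginning. Every message begins with "From " (From followed by a space character; such a line is also called a "From_" line). If a line beginning with "From " is preceded by an empty line or sits at the top of the file, we have found the beginning of a message.
So what we look for when parsing an mbox file is, essentially, an empty line followed by "From ".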
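As a regular expression, we can write this as "\n\nFrom .*\n". Only the very first message is different: it merely starts with "From " at the beginning of a line ("^From .*\n").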
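The boundary rule above can be sketched in Python. This is only an illustration under stated assumptions: the names `FROM_LINE` and `split_mbox` are hypothetical, the whole mbox file is assumed to have been read into one string already, and the two expressions from the text are merged into one by using `\A` (start of text) and a lookbehind for the empty line.

```python
import re

# A From_ line either at the very start of the text or directly after an
# empty line. The lookbehind keeps the empty line out of the match itself.
FROM_LINE = re.compile(r"(?:\A|(?<=\n\n))From .*\n")


def split_mbox(text: str) -> list[str]:
    """Split the text of an mbox file into messages (hypothetical helper).

    Each returned string starts with its From_ line and runs up to the
    next From_ line, or to the end of the text.
    """
    starts = [match.start() for match in FROM_LINE.finditer(text)]
    ends = starts[1:] + [len(text)]
    return [text[start:end] for start, end in zip(starts, ends)]
```

Calling `split_mbox` on the contents of a file with two messages would return a list of two strings, each beginning with its own From_ line. A line that starts with "From " but does not follow an empty line is not treated as a boundary.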
"From " in the Body
What if exactly this sequence appears in the body of a message? What if the following text is part of an email?
...I send you the most recent report.

From this report, you need not...
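Here we have an empty line followed by a line that begins with "From ". If this appears in an mbox file, it unmistakably marks the beginning of a new message. At least that is what the parser thinks, which is why both the email client and we would be quite confused by a message that contains neither sender nor recipient but begins with "From this report".
To avoid such disastrous situations, we need to make sure that "From " never appears at the beginning of a line that follows an empty line in the body of a message.
Whenever we add a new message to an mbox file, we look for such sequences in its body and simply replace "From" with ">From", so that no misinterpretation is possible. The example above now looks like this and no longer triggers the parser: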
...I send you the most recent report.

>From this report, you need not...
This is why you may sometimes find ">From" in an email where you would expect a plain "From".
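The quoting step can be sketched in Python as well. Again this is a minimal illustration, not a reference implementation: `BODY_FROM` and `quote_from_lines` are hypothetical names, and the function is assumed to receive only the body of a message, without its headers. Because the body itself follows the empty line that ends the headers, the sketch also quotes a "From " at the very start of the body.

```python
import re

# "From " at the start of a line that follows an empty line. The start of
# the body counts too, since the body comes right after the empty line
# that separates it from the headers.
BODY_FROM = re.compile(r"(?:\A|(?<=\n\n))From ")


def quote_from_lines(body: str) -> str:
    """Return the body with boundary-like "From " sequences quoted."""
    return BODY_FROM.sub(">From ", body)
```

Applied to the example above, the function turns the line "From this report, you need not..." into ">From this report, you need not..." and leaves the rest of the body unchanged. Some mbox writers are stricter than the rule described here and quote every body line that starts with "From ", whether or not an empty line precedes it.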
Original text
The mbox Format
From Heinz Tschabitscher, Your Guide to Email.
How email clients store mail on your hard disk
When you look at your email client, you will -- usually -- find your email messages stored in folders. There's one folder for Fred, one for the duck-painters mailing list, another for work and so on. This is how mail storage looks in the email program. But how does the program store the messages on the hard disk?
Email Storage
It would be possible to have all email messages in one file, together with information about which folder each email should appear in. This kind of mail storage could be, and sometimes is, implemented as a database, for example.
Another approach would create one file for every email message and arrange them in a file system hierarchy representing the folders in the user interface. While this approach may have security benefits, it requires a lot of hard disk activity, which is slow.
Somewhere in between those two attempts lies another approach to storing email messages. We create a file for every mailbox and put all the messages in the corresponding folder in this file. This is the format used by most email clients, and it is called the mbox format.
The mbox Format
If we use the mbox format to store emails, we put all of them in one file. This creates a more or less long text file (Internet email always exists only as 7-bit ASCII text; everything else -- attachments, for example -- is encoded) containing one email message after the other. How do we know where one ends and another starts?
Fortunately, every email has at least one From-line at its very beginning. Every message begins with "From " (From followed by a white space character, also called a "From_" line). If this sequence ("From ") at the beginning of a line is preceded by an empty line or is at the top of the file, we have found the beginning of a message.
So what we look for when parsing an mbox file is, essentially, an empty line followed by "From ".
As a regular expression, we can write this as "\n\nFrom .*\n". Only the very first message is different. It starts merely with "From " at the beginning of a line ("^From .*\n").
"From " in the Body
What if exactly the sequence above appears in the body of an email message? What if the following is part of an email?
...I send you the most recent report.

From this report, you need not...
Here, we have an empty line followed by "From " at the beginning of the line. If this appears in an mbox file, we unmistakably have the beginning of a new message. At least that's what the parser thinks -- which is why both the email client and we would be quite confused by an email message that contains neither sender nor recipient, but begins with "From this report".
To avoid such disastrous conditions, we need to make sure "From " never appears at the beginning of a line following an empty line in the body of an email.
Whenever we add a new message to an mbox file, we look for such sequences in the body and simply replace "From" with ">From". This makes misinterpretations impossible. The example above now looks like this and no longer triggers the parser:
...I send you the most recent report.

>From this report, you need not...
This is why you may sometimes find ">From" in an email where you'd expect a mere "From".