The code and tips here are all included in the new pyzmail library. More samples and tips can be found in the API documentation.
The first part was about mail headers. This second part focuses on the mail content.
Today's emails include HTML-formatted text, pictures and other attachments.
MIME makes it possible to mix all these items into a single email. But MIME is complex, and not all emails comply with the standards.
Even if few MUAs are strictly compliant with the RFCs, most come close. The poorest emails come from self-made programs and mail merge applications. Even if bad emails are often useless (in my experience), your email parser must at least handle them without crashing.
The Python email library does a wonderful job of splitting an email into parts following the MIME philosophy.
The email parts can be split into 3 categories: the message content itself (usually plain text, HTML, or both), content related to the message (such as pictures embedded in the HTML), and ordinary attachments. MIME doesn't clearly indicate which part is the message content. The plain text version followed by the HTML version are usually at the top, so that MIME-unaware mail readers can read them easily. We must be careful not to use an ordinary attachment as the message of the email. This is what my function search_message_bodies() tries to do.
To understand the complexity, I will explain how different content can be mixed into a single email. Parts have a type that can be, among others, 'text/plain', 'text/html', 'image/*', 'application/*', or 'multipart/*' to indicate a container. Containers can contain other containers. Here are the most interesting containers defined by MIME:
multipart/mixed: Used to mix files of different types. Parts can be displayed inline or as attachments depending on the Content-Disposition header (which is often missing).
multipart/alternative: Each part is an alternative version of the same content, each in a different format. The formats are ordered by how faithful they are to the original, with the least faithful first and the most faithful last. You are supposed to process the last part whose format you are able to handle. This is how an email includes a text and an HTML version of the same message. Be careful, sometimes one part or the other is missing, and sometimes the text part just says: "Read the HTML part"!
multipart/related: Parts must be considered as an aggregate whole. The root part, which is usually the first one, references the other parts inline using their "Content-ID" header. This is how pictures are embedded into HTML.
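To make the multipart/related mechanism more concrete, here is a minimal sketch that maps the "cid:" references found in an HTML root part to the sibling parts carrying the matching Content-ID. It is written for Python 3's email package (the listing in this article targets Python 2), and the helper name content_id_map, the Content-ID value and the placeholder image bytes are assumptions made up for the illustration.

```python
import re
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Build a tiny multipart/related container: HTML root + one inline picture.
related = MIMEMultipart("related")
related.attach(MIMEText('<p><img src="cid:logo@example"></p>', "html"))
image = MIMEImage(b"GIF89a", _subtype="gif")  # placeholder bytes, not a real picture
image.add_header("Content-ID", "<logo@example>")
related.attach(image)


def content_id_map(container):
    """Hypothetical helper: map each Content-ID (without '<>') to its part."""
    mapping = {}
    for part in container.get_payload():
        cid = part.get("Content-ID")
        if cid:
            mapping[cid.strip().strip("<>")] = part
    return mapping


# The root part is assumed to be the first one (no 'start' parameter here).
root = related.get_payload()[0]
html = root.get_payload(decode=True).decode(root.get_content_charset() or "ascii")

# Collect the identifiers referenced from the HTML and resolve them.
referenced = re.findall(r'cid:([^"\'>\s]+)', html)
by_cid = content_id_map(related)
for cid in referenced:
    print(cid, "->", by_cid[cid].get_content_type())
```

A real message may name its root part through the 'start' parameter of the container, which this sketch ignores.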
Other containers exist: multipart/report is used for mail delivery notifications and contains message/* parts. Python handles these message/* parts as messages too: it splits each one into a header and a body, and the body is in turn parsed and split into parts. This is one option, but I prefer to consider the attached message as a whole. This is why I don't use Python's Message.walk() to iterate over the parts of the message.
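The difference between Message.walk() and an iteration that treats message/* parts as a whole can be seen on a small example. This is only a sketch for Python 3 (the article's listing targets Python 2); the generator iter_parts and the sample message are made up for the illustration.

```python
import email

RAW = """\
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary=outer

--outer
Content-Type: text/plain

See the forwarded mail.
--outer
Content-Type: message/rfc822

Subject: inner
Content-Type: text/plain

Inner body
--outer--
"""


def iter_parts(msg):
    """Hypothetical helper: yield leaf parts depth-first, message/* kept whole."""
    if msg.get_content_maintype() == "message":
        yield msg  # do not descend into the attached message
    elif msg.is_multipart():
        for sub in msg.get_payload():
            yield from iter_parts(sub)
    else:
        yield msg


msg = email.message_from_string(RAW)
# walk() yields containers too and descends into the attached message.
walked = [part.get_content_type() for part in msg.walk()]
# iter_parts() stops at the message/rfc822 part.
leaves = [part.get_content_type() for part in iter_parts(msg)]
print(walked)
print(leaves)
```

With walk(), the body of the attached message shows up as one more text/plain part, which makes it easy to mistake for the body of the outer email.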
Here are some structures you can find when parsing emails from different sources. The first one is my favorite, the one I use when I have to send a complex email including multiple message formats, related content and attachments:
multipart/mixed
 |
 +-- multipart/related
 |    |
 |    +-- multipart/alternative
 |    |    |
 |    |    +-- text/plain
 |    |    +-- text/html
 |    |
 |    +-- image/gif
 |
 +-- application/msword
You can also find a simple structure, without related content or attachments:
multipart/alternative
 |
 +-- text/plain
 +-- text/html
This unbalanced structure is also valid:
multipart/alternative
 |
 +-- text/plain
 +-- multipart/related
      |
      +-- text/html
      +-- image/gif
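As an illustration, the first structure can be assembled with the email.mime classes and printed back as an indented tree. This is a sketch for Python 3 (the article's listing targets Python 2); the structure() helper, the file name report.doc and the placeholder payloads are assumptions chosen for the example.

```python
from email.mime.application import MIMEApplication
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def structure(part, depth=0):
    """Hypothetical helper: return the MIME tree as a list of indented lines."""
    lines = ["    " * depth + part.get_content_type()]
    if part.is_multipart():
        for sub in part.get_payload():
            lines.extend(structure(sub, depth + 1))
    return lines


# Both versions of the message, least faithful first.
alternative = MIMEMultipart("alternative")
alternative.attach(MIMEText("Hello World", "plain"))
alternative.attach(MIMEText("<p>Hello <img src='cid:logo'></p>", "html"))

# The message plus the picture it references.
related = MIMEMultipart("related")
related.attach(alternative)
image = MIMEImage(b"GIF89a", _subtype="gif")  # placeholder bytes
image.add_header("Content-ID", "<logo>")
related.attach(image)

# The whole message plus a regular attachment.
mixed = MIMEMultipart("mixed")
mixed.attach(related)
doc = MIMEApplication(b"placeholder", _subtype="msword")
doc.add_header("Content-Disposition", "attachment", filename="report.doc")
mixed.attach(doc)

tree = structure(mixed)
print("\n".join(tree))
```

The same structure() helper can be pointed at a parsed email to see which of the layouts above a given sender uses.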
Some attachments must be shown inline, but if you cannot render such content, you must make it available as a separate attachment. Regular attachments must have a filename, but sometimes it is stored in the wrong place and sometimes it is simply missing. If the filename contains non-ASCII characters, it must be encoded using RFC 2231, but nearly everybody wrongly uses RFC 2047 instead. The function get_filename() searches for the filename and decodes it.
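The two encodings can be compared side by side. The sketch below is written for Python 3 (the article's listing targets Python 2) and follows the same idea as get_filename(): look in Content-Disposition first, then in Content-Type, collapse an RFC 2231 value, then try RFC 2047. The name filename_of and the sample message are made up for the illustration.

```python
import email
from email.header import decode_header
from email.utils import collapse_rfc2231_value

RAW = """\
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary=b

--b
Content-Type: application/pdf
Content-Disposition: attachment; filename*=utf-8''r%C3%A9sum%C3%A9.pdf

x
--b
Content-Type: application/pdf; name="=?ISO-8859-1?Q?r=E9sum=E9.pdf?="

x
--b--
"""


def filename_of(part):
    """Hypothetical helper: find and decode the filename of a part."""
    value = part.get_param("filename", None, "content-disposition")
    if not value:
        value = part.get_param("name", None)  # looks in Content-Type
    if not value:
        return None
    # RFC 2231: get_param() returns a (charset, language, value) tuple.
    value = collapse_rfc2231_value(value).strip()
    # RFC 2047: wrongly used by many senders, decode it anyway.
    chunks = []
    for text, charset in decode_header(value):
        if isinstance(text, bytes):
            try:
                text = text.decode(charset or "ascii", errors="replace")
            except LookupError:  # unknown charset name
                text = text.decode("ascii", errors="replace")
        chunks.append(text)
    return "".join(chunks)


parts = email.message_from_string(RAW).get_payload()
names = [filename_of(part) for part in parts]
print(names)
```

The first part uses the standard RFC 2231 form in the expected header; the second one uses RFC 2047 in the 'name' parameter of Content-Type, which is the non-standard but common case.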
Once you have a filename, you still need to sanitize it to be sure you can save the file on your filesystem. Some characters are forbidden, like '/' on Un*x and '\' on Windows, and characters must match your filesystem charset (if any is in use). Also, Windows doesn't accept some names like "COM1" or "NUL".
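One possible way to apply these rules is sketched below for Python 3 (the article's listing targets Python 2). The function sanitize_filename and its default value are assumptions, not part of the article's code; the sketch handles neither filename collisions nor the filesystem charset.

```python
import re

# Characters unsafe on at least one common filesystem, plus control characters.
INVALID_CHARS = re.compile(r'[<>:"/\\|?*%\'\x00-\x1f]')
# Device names that Windows refuses as file names, with or without an extension.
RESERVED_NAMES = frozenset(
    "CON PRN AUX NUL COM1 COM2 COM3 COM4 COM5 COM6 COM7 COM8 COM9 "
    "LPT1 LPT2 LPT3 LPT4 LPT5 LPT6 LPT7 LPT8 LPT9".split()
)


def sanitize_filename(filename, default="attachment"):
    """Hypothetical helper: return a name that is safe to use on disk."""
    # Replace forbidden characters, then drop leading/trailing dots and spaces.
    name = INVALID_CHARS.sub("_", filename or "").strip(" .")
    if not name:
        return default
    # Windows checks the part before the first dot against the reserved names.
    stem, dot, rest = name.partition(".")
    if stem.upper() in RESERVED_NAMES:
        stem = "_" + stem
    return stem + dot + rest


for candidate in ["a/b\\c.txt", "NUL", "com1.tar.gz", None]:
    print(repr(candidate), "->", sanitize_filename(candidate))
```

Replacing path separators also prevents a hostile filename such as "../../x" from escaping the target directory.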
My function get_mail_contents() returns a list of Attachment objects with related attributes. When an attachment is of type text/*, the payload must be decoded using the charset, if set. Be careful, the charset is not always accurate! Use the function decode_text(). Attachments holding the message content can be identified by their is_body attribute.
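The effect of an inaccurate charset can be shown with a simplified variant of decode_text(), sketched here for Python 3 where the payload is a bytes object (the article's listing targets Python 2). The fallback list and the final replacement step are assumptions for the illustration.

```python
def decode_text(payload, charset, fallbacks=("ascii", "utf-8", "windows-1252")):
    """Decode bytes with the declared charset, then with some fallbacks.

    Returns (text, charset_used); charset_used is None when every
    candidate failed and undecodable bytes had to be replaced.
    """
    candidates = ([charset] if charset else []) + list(fallbacks)
    for name in candidates:
        try:
            return payload.decode(name), name
        except (UnicodeError, LookupError):
            # UnicodeError: wrong charset; LookupError: unknown charset name
            continue
    return payload.decode("ascii", errors="replace"), None


# The part claims us-ascii but really contains UTF-8.
mislabeled = "Pointcarr\u00e9".encode("utf-8")
text, used = decode_text(mislabeled, "us-ascii")
print(repr(text), used)
```

The order of the fallbacks matters: a single-byte charset such as windows-1252 accepts almost any byte sequence, so it should come after the stricter candidates.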
The code includes pieces from the first part and can be downloaded here. Run without parameters, it parses the embedded sample. The path of a saved raw email can be passed as an argument. Here is the output for the embedded sample:
Subject: u'Dr. Pointcarr\xe9'
From: (u'Alain Spineux', 'alain.spineux@gmail.com')
To: [(u'Dr. Pointcarr\xe9', 'alain.spineux@gmail.com')]
    filename=None is_body=text/plain type=text/plain charset=ISO-8859-1 desc=None size=12
        Hello World
    filename=None is_body=text/html type=text/html charset=ISO-8859-1 desc=None size=21
    filename=u'smile.png' is_body=None type=image/png charset=None desc=None size=473
And some explanation:
search_message_bodies() walks the MIME tree of the mail to retrieve the parts that contain the message, keyed by their format. It returns something like this:
{ 'text/plain' : <email.message.Message instance at 0xXXXX>, 'text/html' : <email.message.Message instance at 0xYYYY> }
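A dictionary of this shape can then be used to choose which version to display. The sketch below is for Python 3 (the article's listing targets Python 2); pick_body and its preference order are assumptions, and the dictionary is built by hand to stand in for the value returned by search_message_bodies().

```python
from email.mime.text import MIMEText


def pick_body(bodies, preferred=("text/html", "text/plain")):
    """Hypothetical helper: return (content_type, text) of the preferred body."""
    for content_type in preferred:
        part = bodies.get(content_type)
        if part is None:
            continue
        payload = part.get_payload(decode=True)  # bytes, transfer encoding undone
        charset = part.get_content_charset() or "ascii"
        try:
            return content_type, payload.decode(charset, errors="replace")
        except LookupError:  # unknown charset name
            return content_type, payload.decode("ascii", errors="replace")
    return None, None


# Stand-in for the dictionary returned by search_message_bodies().
bodies = {
    "text/plain": MIMEText("Hello World", "plain", "iso-8859-1"),
    "text/html": MIMEText("<b>Hello World</b>", "html", "iso-8859-1"),
}
chosen_type, chosen_text = pick_body(bodies)
print(chosen_type, chosen_text)
```

A text-only reader would simply pass preferred=("text/plain", "text/html") to reverse the choice.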
And now the code:
import sys, os, re, StringIO
import email, mimetypes

invalid_chars_in_filename='<>:"/\\|?*\%\''+reduce(lambda x,y:x+chr(y), range(32), '')
invalid_windows_name='CON PRN AUX NUL COM1 COM2 COM3 COM4 COM5 COM6 COM7 COM8 COM9 LPT1 LPT2 LPT3 LPT4 LPT5 LPT6 LPT7 LPT8 LPT9'.split()

# email address REGEX matching the RFC 2822 spec from perlfaq9
#    my $atom       = qr{[a-zA-Z0-9_!#\$\%&'*+/=?\^`{}~|\-]+};
#    my $dot_atom   = qr{$atom(?:\.$atom)*};
#    my $quoted     = qr{"(?:\\[^\r\n]|[^\\"])*"};
#    my $local      = qr{(?:$dot_atom|$quoted)};
#    my $domain_lit = qr{\[(?:\\\S|[\x21-\x5a\x5e-\x7e])*\]};
#    my $domain     = qr{(?:$dot_atom|$domain_lit)};
#    my $addr_spec  = qr{$local\@$domain};
#
# Python's translation

atom_rfc2822=r"[a-zA-Z0-9_!#\$\%&'*+/=?\^`{}~|\-]+"
atom_posfix_restricted=r"[a-zA-Z0-9_#\$&'*+/=?\^`{}~|\-]+" # without '!' and '%'
atom=atom_rfc2822
dot_atom=atom + r"(?:\." + atom + ")*"
quoted=r'"(?:\\[^\r\n]|[^\\"])*"'
local="(?:" + dot_atom + "|" + quoted + ")"
domain_lit=r"\[(?:\\\S|[\x21-\x5a\x5e-\x7e])*\]"
domain="(?:" + dot_atom + "|" + domain_lit + ")"
addr_spec=local + "\@" + domain

email_address_re=re.compile('^'+addr_spec+'$')

class Attachment:
    def __init__(self, part, filename=None, type=None, payload=None, charset=None, content_id=None, description=None, disposition=None, sanitized_filename=None, is_body=None):
        self.part=part          # original python part
        self.filename=filename  # filename in unicode (if any)
        self.type=type          # the mime-type
        self.payload=payload    # the MIME decoded content
        self.charset=charset    # the charset (if any)
        self.description=description    # if any
        self.disposition=disposition    # 'inline', 'attachment' or None
        self.sanitized_filename=sanitized_filename # cleanup your filename here (TODO)
        self.is_body=is_body        # usually in (None, 'text/plain' or 'text/html')
        self.content_id=content_id  # if any
        if self.content_id:
            # strip '<>' to ease search and replace in "root" content (TODO)
            if self.content_id.startswith('<') and self.content_id.endswith('>'):
                self.content_id=self.content_id[1:-1]

def getmailheader(header_text, default="ascii"):
    """Decode header_text if needed"""
    try:
        headers=email.Header.decode_header(header_text)
    except email.Errors.HeaderParseError:
        # This already happened in email.base64mime.decode()
        # instead return a sanitized ascii string
        # this fails on '=?UTF-8?B?15HXmdeh15jXqNeVINeY15DXpteUINeTJ9eV16jXlSDXkdeg15XXldeUINem15PXpywg15TXptei16bXldei15nXnSDXqdecINek15zXmdeZ?==?UTF-8?B?157XldeR15nXnCwg157Xldek16Ig157Xl9eV15wg15HXodeV15bXnyDXk9ec15DXnCDXldeh15gg157Xl9eR16rXldeqINep15wg15HXmdeQ?==?UTF-8?B?15zXmNeZ?='
        return header_text.encode('ascii', 'replace').decode('ascii')
    else:
        for i, (text, charset) in enumerate(headers):
            try:
                headers[i]=unicode(text, charset or default, errors='replace')
            except LookupError:
                # if the charset is unknown, force default
                headers[i]=unicode(text, default, errors='replace')
        return u"".join(headers)

def getmailaddresses(msg, name):
    """retrieve addresses from header, 'name' supposed to be from, to, ..."""
    addrs=email.utils.getaddresses(msg.get_all(name, []))
    for i, (name, addr) in enumerate(addrs):
        if not name and addr:
            # only one string! Is it the address or is it the name ?
            # use the same for both and see later
            name=addr

        try:
            # address must be ascii only
            addr=addr.encode('ascii')
        except UnicodeError:
            addr=''
        else:
            # address must match address regex
            if not email_address_re.match(addr):
                addr=''
        addrs[i]=(getmailheader(name), addr)
    return addrs

def get_filename(part):
    """Many mail user agents send attachments with the filename in
    the 'name' parameter of the 'content-type' header instead
    of in the 'filename' parameter of the 'content-disposition' header.
    """
    filename=part.get_param('filename', None, 'content-disposition')
    if not filename:
        filename=part.get_param('name', None) # default is 'content-type'

    if filename:
        # RFC 2231 must be used to encode parameters inside MIME header
        filename=email.Utils.collapse_rfc2231_value(filename).strip()

    if filename and isinstance(filename, str):
        # But a lot of MUA erroneously use RFC 2047 instead of RFC 2231
        # in fact nearly everybody misuses RFC 2047 here !!!
        filename=getmailheader(filename)

    return filename

def _search_message_bodies(bodies, part):
    """recursive search of the multiple versions of the 'message' inside
    the message structure of the email, used by search_message_bodies()"""

    type=part.get_content_type()
    if type.startswith('multipart/'):
        # explore only true 'multipart/*'
        # because 'message/rfc822' are also python 'multipart'
        if type=='multipart/related':
            # the first part or the one pointed by start
            start=part.get_param('start', None)
            related_type=part.get_param('type', None)
            for i, subpart in enumerate(part.get_payload()):
                if (not start and i==0) or (start and start==subpart.get('Content-Id')):
                    _search_message_bodies(bodies, subpart)
                    return
        elif type=='multipart/alternative':
            # all parts are candidates and latest is best
            for subpart in part.get_payload():
                _search_message_bodies(bodies, subpart)
        elif type in ('multipart/report', 'multipart/signed'):
            # only the first part is candidate
            try:
                subpart=part.get_payload()[0]
            except IndexError:
                return
            else:
                _search_message_bodies(bodies, subpart)
                return
        else:
            # unknown types must be handled as 'multipart/mixed'
            # This piece of code could probably be improved, I use a heuristic :
            # - if not already found, use first valid non 'attachment' parts found
            for subpart in part.get_payload():
                tmp_bodies=dict()
                _search_message_bodies(tmp_bodies, subpart)
                for k, v in tmp_bodies.iteritems():
                    if not subpart.get_param('attachment', None, 'content-disposition')=='':
                        # if not an attachment, initiate value if not already found
                        bodies.setdefault(k, v)
            return
    else:
        bodies[part.get_content_type().lower()]=part
        return

    return

def search_message_bodies(mail):
    """search message content into a mail"""
    bodies=dict()
    _search_message_bodies(bodies, mail)
    return bodies

def get_mail_contents(msg):
    """split an email in a list of attachments"""

    attachments=[]

    # retrieve messages of the email
    bodies=search_message_bodies(msg)
    # reverse bodies dict
    parts=dict((v,k) for k, v in bodies.iteritems())

    # organize the stack to handle depth-first search
    stack=[ msg, ]
    while stack:
        part=stack.pop(0)
        type=part.get_content_type()
        if type.startswith('message/'):
            # ('message/delivery-status', 'message/rfc822', 'message/disposition-notification'):
            # I don't want to explore the tree deeper here and just save source using msg.as_string()
            # but I don't use msg.as_string() because I want to use mangle_from_=False
            from email.Generator import Generator
            fp = StringIO.StringIO()
            g = Generator(fp, mangle_from_=False)
            g.flatten(part, unixfrom=False)
            payload=fp.getvalue()
            filename='mail.eml'
            attachments.append(Attachment(part, filename=filename, type=type, payload=payload, charset=part.get_param('charset'), description=part.get('Content-Description')))
        elif part.is_multipart():
            # insert new parts at the beginning of the stack (depth-first search)
            stack[:0]=part.get_payload()
        else:
            payload=part.get_payload(decode=True)
            charset=part.get_param('charset')
            filename=get_filename(part)

            disposition=None
            if part.get_param('inline', None, 'content-disposition')=='':
                disposition='inline'
            elif part.get_param('attachment', None, 'content-disposition')=='':
                disposition='attachment'

            attachments.append(Attachment(part, filename=filename, type=type, payload=payload, charset=charset, content_id=part.get('Content-Id'), description=part.get('Content-Description'), disposition=disposition, is_body=parts.get(part)))

    return attachments

def decode_text(payload, charset, default_charset):
    if charset:
        try:
            return payload.decode(charset), charset
        except UnicodeError:
            pass

    if default_charset and default_charset!='auto':
        try:
            return payload.decode(default_charset), default_charset
        except UnicodeError:
            pass

    for chset in [ 'ascii', 'utf-8', 'utf-16', 'windows-1252', 'cp850' ]:
        try:
            return payload.decode(chset), chset
        except UnicodeError:
            pass

    return payload, None

if __name__ == "__main__":

    raw="""MIME-Version: 1.0
Received: by 10.229.233.76 with HTTP; Sat, 2 Jul 2011 04:30:31 -0700 (PDT)
Date: Sat, 2 Jul 2011 13:30:31 +0200
Delivered-To: alain.spineux@gmail.com
Message-ID: <CAAJL_=kPAJZ=fryb21wBOALp8-XOEL-h9j84s3SjpXYQjN3Z3A@mail.gmail.com>
Subject: =?ISO-8859-1?Q?Dr.=20Pointcarr=E9?=
From: Alain Spineux <alain.spineux@gmail.com>
To: =?ISO-8859-1?Q?Dr=2E_Pointcarr=E9?= <alain.spineux@gmail.com>
Content-Type: multipart/mixed; boundary=mixed

--mixed
Content-Type: multipart/alternative; boundary=alternative

--alternative
Content-Type: text/plain; charset=ISO-8859-1

Hello World

--alternative
Content-Type: text/html; charset=ISO-8859-1

Hello World<br>
<br>

--alternative--
--mixed
Content-Type: image/png; name="smile.png"
Content-Disposition: attachment; filename="smile.png"
Content-Transfer-Encoding: base64

iVBORw0KGgoAAAANSUhEUgAAAA4AAAAOBAMAAADtZjDiAAAAMFBMVEUQEAhaUjlaWlp7e3uMezGU
hDGcnJy1lCnGvVretTnn5+/3pSn33mP355T39+//75SdwkyMAAAACXBIWXMAAA7EAAAOxAGVKw4b
AAAAB3RJTUUH2wcJDxEjgefAiQAAAAd0RVh0QXV0aG9yAKmuzEgAAAAMdEVYdERlc2NyaXB0aW9u
ABMJISMAAAAKdEVYdENvcHlyaWdodACsD8w6AAAADnRFWHRDcmVhdGlvbiB0aW1lADX3DwkAAAAJ
dEVYdFNvZnR3YXJlAF1w/zoAAAALdEVYdERpc2NsYWltZXIAt8C0jwAAAAh0RVh0V2FybmluZwDA
G+aHAAAAB3RFWHRTb3VyY2UA9f+D6wAAAAh0RVh0Q29tbWVudAD2zJa/AAAABnRFWHRUaXRsZQCo
7tInAAAAaElEQVR4nGNYsXv3zt27TzHcPup6XDBmDsOeBvYzLTynGfacuHfm/x8gfS7tbtobEM3w
n2E9kP5n9N/oPZA+//7PP5D8GSCYA6RPzjlzEkSfmTlz+xkgffbkzDlAuvsMWAHDmt0g0AUAmyNE
wLAIvcgAAAAASUVORK5CYII=

--mixed--
"""

    if len(sys.argv)>1:
        raw=open(sys.argv[1]).read()

    msg=email.message_from_string(raw)
    attachments=get_mail_contents(msg)

    subject=getmailheader(msg.get('Subject', ''))
    from_=getmailaddresses(msg, 'from')
    from_=('', '') if not from_ else from_[0]
    tos=getmailaddresses(msg, 'to')

    print 'Subject: %r' % subject
    print 'From: %r' % (from_, )
    print 'To: %r' % (tos, )

    for attach in attachments:
        # don't forget to sanitize 'filename' and to check
        # for filename collisions before saving:
        print '\tfilename=%r is_body=%s type=%s charset=%s desc=%s size=%d' % (attach.filename, attach.is_body, attach.type, attach.charset, attach.description, 0 if attach.payload is None else len(attach.payload))

        if attach.is_body=='text/plain':
            # print first 3 lines
            payload, used_charset=decode_text(attach.payload, attach.charset, 'auto')
            for line in payload.split('\n')[:3]:
                # be careful, the console may be unable to display unicode characters
                if line:
                    print '\t\t', line
Comments
Cosmin L. Neagu (not verified)
Wed, 08/24/2011 - 13:31
Excellent article
After I struggled for a few days to write my own email parsing code, I found yours, and all I can say is that your code is way better. The multipart explanations also helped a lot. Thank you.