Parsing email using Python part 1 of 2 : The Header

Submitted by aspineux on Sun, 07/03/2011 - 13:45

The code and tips here are all included in the new pyzmail library. More samples and tips can be found in the API documentation.

Welcome to this series of two short articles about parsing emails with Python. This first one covers the mail header; the second one covers the mail content.

Many programs and libraries commonly used to send email do not comply with the RFCs. Ignoring such messages is not an option, because every mail matters. A parser has to do its best with whatever it receives, as the most popular MUAs do.

Python has one of the best libraries for parsing emails: the email package.

First part: how to decode mail headers

According to RFC 2047, non-ASCII text in a header must be encoded. RFC 2822 distinguishes between kinds of header fields: *text fields such as Subject: and address fields such as To:, each with different encoding rules. The reason is that RFC 822 forbids some ASCII characters in certain positions because they have a special meaning there. Those characters can still be used once they are encoded, because the encoded form does not disturb the parsing of the string.
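To make the RFC 2047 format concrete, here is a minimal Python 3 sketch (the listing in this article targets Python 2) that takes one encoded word of the form =?charset?encoding?encoded-text?= apart by hand. The helper name decode_encoded_word and its regular expression are assumptions made up for this illustration and are not part of the email package; in real code email.header.decode_header() does this work.

```python
import base64
import quopri
import re

# One RFC 2047 encoded word: =?charset?encoding?encoded-text?=
# (hypothetical pattern, written for this illustration only)
ENCODED_WORD_RE = re.compile(r"^=\?([^?]+)\?([bBqQ])\?([^?]*)\?=$")

def decode_encoded_word(word):
    """Decode a single encoded word into a str."""
    match = ENCODED_WORD_RE.match(word)
    if match is None:
        # Not an encoded word: return it unchanged
        return word
    charset, encoding, payload = match.groups()
    raw = payload.encode("ascii")
    if encoding.upper() == "B":
        # "B" means the payload is base64
        data = base64.b64decode(raw)
    else:
        # "Q" is a quoted-printable variant where "_" stands for a space
        data = quopri.decodestring(raw, header=True)
    return data.decode(charset, errors="replace")

print(decode_encoded_word("=?ISO-8859-1?Q?Dr=2E_Pointcarr=E9?="))
```

The encoded word contains only plain ASCII letters, digits and a few punctuation marks, which is why it can sit safely inside a structured field such as To:.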

Python provides email.Header.decode_header() for decoding headers. The function decodes each atom and returns a list of (text, charset) tuples. Each text still has to be converted to Unicode using its charset, and the pieces joined, to get the full text. This is done in my getmailheader() function.
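As a rough illustration of that decode-and-join step, here is a minimal Python 3 sketch. It is not the article's own function, which is written for Python 2; the name decode_mail_header is hypothetical. It assumes Python 3's email.header.decode_header(), which returns bytes for headers containing encoded words and a plain str otherwise, so both cases are handled.

```python
from email.errors import HeaderParseError
from email.header import decode_header

def decode_mail_header(header_text, default="ascii"):
    """Decode a raw header value into one str (hypothetical helper)."""
    try:
        chunks = decode_header(header_text)
    except HeaderParseError:
        # Malformed encoded word: fall back to a sanitized ASCII string
        return header_text.encode("ascii", "replace").decode("ascii")
    parts = []
    for text, charset in chunks:
        if isinstance(text, bytes):
            try:
                text = text.decode(charset or default, errors="replace")
            except LookupError:
                # Unknown charset name: force the default
                text = text.decode(default, errors="replace")
        parts.append(text)
    return "".join(parts)

print(decode_mail_header("=?ISO-8859-1?Q?Dr.=20Pointcarr=E9?="))
```

Using errors="replace" and falling back on LookupError keeps the function from raising on mail that announces a wrong or unknown charset, which matches the goal of doing the best possible job on non-compliant messages.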

For addresses, Python provides email.utils.getaddresses(), which splits addresses into a list of (display-name, address) tuples. The display-name needs to be decoded too, and the address must match the RFC 2822 syntax. The getmailaddresses() function does the whole job.
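The splitting step can be sketched in Python 3 as follows. The helper name split_addresses is hypothetical, and the sketch uses the standard library's make_header() to decode the display-name instead of the article's getmailheader(). It also skips the address validation that the article's function performs.

```python
from email.header import decode_header, make_header
from email.utils import getaddresses

def split_addresses(field_values):
    """Return (decoded display-name, address) pairs (hypothetical helper)."""
    pairs = []
    for name, addr in getaddresses(field_values):
        # make_header() + str() turns the decode_header() chunks
        # back into a single Unicode string
        decoded_name = str(make_header(decode_header(name)))
        pairs.append((decoded_name, addr))
    return pairs

# Values as msg.get_all('to', []) might return them
to_values = [
    "=?ISO-8859-1?Q?Dr=2E_Pointcarr=E9?= <alain.spineux@gmail.com>",
    "Alain Spineux <alain.spineux@gmail.com>",
]
pairs = split_addresses(to_values)
print(pairs)
```

Note that getaddresses() only splits the field; the display-name comes back still encoded, which is why the extra decoding pass is needed.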

Here are the functions in action.

import re
import email
import email.utils  # getaddresses() is called as email.utils.getaddresses() below
from email.Header import decode_header

# email address REGEX matching the RFC 2822 spec
# from perlfaq9
#    my $atom       = qr{[a-zA-Z0-9_!#\$\%&'*+/=?\^`{}~|\-]+};
#    my $dot_atom   = qr{$atom(?:\.$atom)*};
#    my $quoted     = qr{"(?:\\[^\r\n]|[^\\"])*"};
#    my $local      = qr{(?:$dot_atom|$quoted)};
#    my $domain_lit = qr{\[(?:\\\S|[\x21-\x5a\x5e-\x7e])*\]};
#    my $domain     = qr{(?:$dot_atom|$domain_lit)};
#    my $addr_spec  = qr{$local\@$domain};
# 
# Python translation
    
atom_rfc2822=r"[a-zA-Z0-9_!#\$\%&'*+/=?\^`{}~|\-]+"
atom_posfix_restricted=r"[a-zA-Z0-9_#\$&'*+/=?\^`{}~|\-]+" # without '!' and '%'
atom=atom_rfc2822
dot_atom=atom  +  r"(?:\."  +  atom  +  ")*"
quoted=r'"(?:\\[^\r\n]|[^\\"])*"'
local="(?:"  +  dot_atom  +  "|"  +  quoted  +  ")"
domain_lit=r"\[(?:\\\S|[\x21-\x5a\x5e-\x7e])*\]"
domain="(?:"  +  dot_atom  +  "|"  +  domain_lit  +  ")"
addr_spec=local  +  "\@"  +  domain

email_address_re=re.compile('^'+addr_spec+'$')

raw="""MIME-Version: 1.0
Received: by 10.229.233.76 with HTTP; Sat, 2 Jul 2011 04:30:31 -0700 (PDT)
Date: Sat, 2 Jul 2011 13:30:31 +0200
Delivered-To: alain.spineux@gmail.com
Message-ID: <CAAJL_=kPAJZ=fryb21wBOALp8-XOEL-h9j84s3SjpXYQjN3Z3A@mail.gmail.com>
Subject: =?ISO-8859-1?Q?Dr.=20Pointcarr=E9?=
From: Alain Spineux <alain.spineux@gmail.com>
To: =?ISO-8859-1?Q?Dr=2E_Pointcarr=E9?= <alain.spineux@gmail.com>
Content-Type: multipart/alternative; boundary=000e0cd68f223dea3904a714768b

--000e0cd68f223dea3904a714768b
Content-Type: text/plain; charset=ISO-8859-1

-- 
Alain Spineux

--000e0cd68f223dea3904a714768b
Content-Type: text/html; charset=ISO-8859-1



-- 
Alain Spineux


--000e0cd68f223dea3904a714768b--
"""

def getmailheader(header_text, default="ascii"):
    """Decode header_text if needed"""
    try:
        headers=decode_header(header_text)
    except email.Errors.HeaderParseError:
        # This has already happened in email.base64mime.decode();
        # in that case return a sanitized ASCII string instead
        return header_text.encode('ascii', 'replace').decode('ascii')
    else:
        for i, (text, charset) in enumerate(headers):
            try:
                headers[i]=unicode(text, charset or default, errors='replace')
            except LookupError:
                # if the charset is unknown, force default 
                headers[i]=unicode(text, default, errors='replace')
        return u"".join(headers)

def getmailaddresses(msg, header_name):
    """Return the (display-name, address) pairs of an address header such as From:, To: or Cc:"""
    addrs=email.utils.getaddresses(msg.get_all(header_name, []))
    for i, (name, addr) in enumerate(addrs):
        if not name and addr:
            # only one string! Is it the address or is it the name?
            # use the same for both and see later
            name=addr

        try:
            # address must be ASCII only
            addr=addr.encode('ascii')
        except UnicodeError:
            addr=''
        else:
            # address must match the address regex
            if not email_address_re.match(addr):
                addr=''
        addrs[i]=(getmailheader(name), addr)
    return addrs

msg=email.message_from_string(raw)
subject=getmailheader(msg.get('Subject', ''))
from_=getmailaddresses(msg, 'from')
from_=('', '') if not from_ else from_[0]
tos=getmailaddresses(msg, 'to')
    
print 'Subject: %r' % subject
print 'From: %r' % (from_, )
print 'To: %r' % (tos, )

And the output:

Subject: u'Dr. Pointcarr\xe9'
From: (u'Alain Spineux', 'alain.spineux@gmail.com')
To: [(u'Dr. Pointcarr\xe9', 'alain.spineux@gmail.com')]
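To see what the address pattern from the listing accepts and rejects, here is a small Python 3 sketch that rebuilds the same expression and tries it on a few strings. The sample addresses and the helper name is_valid_addr_spec are made up for illustration.

```python
import re

# Same building blocks as in the listing above
atom = r"[a-zA-Z0-9_!#\$\%&'*+/=?\^`{}~|\-]+"
dot_atom = atom + r"(?:\." + atom + r")*"
quoted = r'"(?:\\[^\r\n]|[^\\"])*"'
local = "(?:" + dot_atom + "|" + quoted + ")"
domain_lit = r"\[(?:\\\S|[\x21-\x5a\x5e-\x7e])*\]"
domain = "(?:" + dot_atom + "|" + domain_lit + ")"
addr_spec = local + "@" + domain

email_address_re = re.compile("^" + addr_spec + "$")

def is_valid_addr_spec(addr):
    """True if addr matches the addr-spec pattern (hypothetical helper)."""
    return email_address_re.match(addr) is not None

samples = [
    "alain.spineux@gmail.com",   # ordinary dot-atom local part and domain
    '"John Doe"@example.com',    # quoted local part
    "user@[192.168.1.1]",        # domain literal
    "a..b@example.com",          # empty atom between the two dots
    "Alain Spineux",             # a display-name, not an address
]
for sample in samples:
    print("%-28s %s" % (sample, is_valid_addr_spec(sample)))
```

The last sample is the case getmailaddresses() guards against: when a header holds only a name, the string ends up in the address slot and the pattern is what rejects it.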

Comments

tweety (not verified)

Wed, 07/06/2011 - 19:52

get Email Content

Thanks! But what should I do if I want to get the email content? Any suggestions?

aspineux

Thu, 07/07/2011 - 00:51

Part 2 will speak about mail content

This is the subject of the upcoming part 2.

tweety (not verified)

Thu, 07/07/2011 - 12:59

Thanks, I will look forward to it. Currently I'm using the email parser to get the content with:
msg = email.message_from_string(raw_message)
But with multipart messages, or if the content type in the email header is not well defined, it does not always return correct results.

I hope your next tutorial will help with this. Thanks again! :-)

Terry Bates (not verified)

Mon, 10/29/2012 - 04:43

Thank you for this code. I was mucking about with O'Reilly's new book "Agile Data." There is an Open Feedback system that one can use to read the book before it is published. It talked about using code to download Gmail folder contents, but did not provide the code I would need to parse it. I think I will be using this snippet in the future.
