Hi,
I've had a number of unhandled exceptions related to all the Unicode issues (#32, #33, #35).
I originally installed the Python 2 version of maildir-deduplicate by accident. The Python 3 version behaves better, but several of my mails still cause unhandled exceptions.
These seem to predominantly be mails which have been wrongly encoded - which should have been marked and encoded as UTF, but haven't been.
Python does seem to recognize that something is off and replaces those characters with U+FFFD, the Unicode Replacement Character, but it nevertheless fails with:
UnicodeEncodeError: 'ascii' codec can't encode character '\ufffd' in position 13: ordinal not in range(128)
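For reference, the failure itself is easy to reproduce outside the tool. This is only a minimal sketch with a made-up string (`text` is my own example, not taken from any of the affected mails); it shows that encoding U+FFFD as ASCII raises by default, while passing `errors="replace"` to `str.encode` degrades to a `?` instead of crashing:

```python
# Hypothetical sample text containing the Unicode Replacement Character.
text = "Subject: caf\ufffd"

# Strict ASCII encoding (the default) raises UnicodeEncodeError.
try:
    text.encode("ascii")
    raised = False
except UnicodeEncodeError:
    raised = True

# With errors="replace", the unencodable character becomes "?".
safe = text.encode("ascii", errors="replace")
```

Whether lossy replacement is acceptable for computing duplicates is a separate question; the point is only that the error is avoidable or catchable.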
I understand that maildir-deduplicate can't magically know in what particular way a fucked up mail was fucked up and treat the wrong data correctly. That mail was encoded wrongly and that's my problem, not maildir-deduplicate's.
What's annoying me is the lack of handling on the exception.
There are over 4000 mails in that maildir, the vast majority of which are perfectly RFC-compliant, and I can't parse that folder because a handful of mails are screwed and maildir-deduplicate doesn't properly handle it.
I don't expect the software to magically fix broken input. But if I have 1 broken e-mail out of a thousand, I do expect it to just skip the broken one and do the other 999.
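Roughly what I have in mind is sketched below. This is not maildir-deduplicate's actual code; `process_all`, `handler` and the logger name are hypothetical names I made up for illustration. The idea is just that each message is handled inside its own `try`/`except`, so a failure is logged and recorded and the loop moves on:

```python
import logging

logger = logging.getLogger("dedup_sketch")  # hypothetical logger name


def process_all(items, handler):
    """Apply handler to every (key, message) pair.

    Messages whose handler raises are logged and collected in `skipped`
    instead of aborting the whole run.
    """
    processed, skipped = [], []
    for key, message in items:
        try:
            processed.append((key, handler(message)))
        except Exception as exc:
            # Broad on purpose: one broken mail must not stop the batch.
            logger.warning("Skipping %s: %s: %s", key, type(exc).__name__, exc)
            skipped.append((key, exc))
    return processed, skipped
```

At the end of the run the tool could print the skipped keys, so I know which mails to look at by hand.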
Unicode-errors are the most common ones, but it's not limited to that: I've also had one run fail on me because of a missing header in a collection:
  File "/usr/local/lib/python3.4/dist-packages/maildir_deduplicate/deduplicate.py", line 355, in get_lines_from_message_body
    header_text, sep, body = message.as_string().partition("\n\n")
  File "/usr/lib/python3.4/email/message.py", line 159, in as_string
    g.flatten(self, unixfrom=unixfrom)
  File "/usr/lib/python3.4/email/generator.py", line 112, in flatten
    self._write(msg)
  File "/usr/lib/python3.4/email/generator.py", line 178, in _write
    self._dispatch(msg)
  File "/usr/lib/python3.4/email/generator.py", line 211, in _dispatch
    meth(msg)
  File "/usr/lib/python3.4/email/generator.py", line 269, in _handle_multipart
    g.flatten(part, unixfrom=False, linesep=self._NL)
  File "/usr/lib/python3.4/email/generator.py", line 112, in flatten
    self._write(msg)
  File "/usr/lib/python3.4/email/generator.py", line 178, in _write
    self._dispatch(msg)
  File "/usr/lib/python3.4/email/generator.py", line 211, in _dispatch
    meth(msg)
  File "/usr/lib/python3.4/email/generator.py", line 269, in _handle_multipart
    g.flatten(part, unixfrom=False, linesep=self._NL)
  File "/usr/lib/python3.4/email/generator.py", line 112, in flatten
    self._write(msg)
  File "/usr/lib/python3.4/email/generator.py", line 186, in _write
    msg.replace_header('content-transfer-encoding', munge_cte[0])
  File "/usr/lib/python3.4/email/message.py", line 559, in replace_header
    raise KeyError(_name)
KeyError: 'content-transfer-encoding'
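A guard around the `as_string()` call would already be enough to survive this one. The sketch below is an assumption on my part, not the project's code: `body_or_none` is a hypothetical helper that mirrors the `partition("\n\n")` line from the traceback and returns `None` when serialising the message fails, so the caller can skip that mail:

```python
import email


def body_or_none(message):
    """Return the body text of a message, or None if it cannot be serialised.

    Hypothetical helper: mirrors the as_string().partition() call from the
    traceback, but catches the two failures reported in this issue.
    """
    try:
        header_text, sep, body = message.as_string().partition("\n\n")
    except (KeyError, UnicodeError):
        # Malformed mail (missing header, bad encoding): signal "skip me".
        return None
    return body


# A well-formed message goes through unchanged.
good = email.message_from_string("Subject: hi\n\nbody text\n")
good_body = body_or_none(good)
```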
The fact that you're using exceptions at all is good. But not handling exceptions is bad, and not handling exceptions in a program designed for batch processing is just wrong.
I will try to hack something up for my local installation and I will submit a patch if I succeed, but this is ultimately a question of design mentality: You are currently placing the burden of dealing with problematic input on the user. You're essentially saying "this program will work fine...if you made sure those 10000 mails you want to scan are all RFC-compliant in advance!".
I do believe it would greatly increase the usefulness of this tool if you expected it to fail on some messages and dealt with that gracefully, instead of just crashing back into the terminal in the middle of processing.
Thank you for your efforts. I haven't actually gotten this tool to work yet, but thanks to your work, I at least have a shot at dealing with these mails. I do appreciate the time you're investing.