Fast duplicate message finding script for Apple's Mail.app

What versions of Mail does it support?

This version is tested on Mail 3.5 on OS X 10.5. It ought to work on earlier and later versions of Mail too, but I don't have other versions available to test. The previous version was tested in Mail on 10.3 and 10.4 and the new version doesn't use any new features.

What's new in Version 1.3?

Works in Mail 3.x in OS X 10.5: The old version should, in principle, still have worked but for large numbers of messages it generated an AppleScript timeout. That's now fixed.

Faster: the message collection is much faster and is now mostly limited by disk seek performance. This version typically runs in between two-thirds and one-tenth of the time of the old one. For a few tens of duplicates in a few hundred messages it should only take a few seconds to finish. Sorting and selecting are now a little faster too. For the curious AppleScript hacker: it now gets Mail to generate the lists it needs rather than iterating through every message collecting data, and it uses parallel lists for quicker access, rather than AppleScript's notion of objects.

Better handling of messages with no headers: It actually handles them properly now! Thanks to Brendan Ferguson for pointing out this problem (that was a year ago; it's taken me a while to polish up the version I hacked together to deal with his problem). It's quite slow handling messages like this, because it has to parse the full content of every message before even doing the sort.

How do I use it?

To use it from within Mail directly, you'll need to have the Script Menu turned on. To turn it on in 10.4 or 10.5, start AppleScript Utility, which is in /Applications/AppleScript, and tick "Show Script Menu in menu bar". To turn it on in 10.3, open "Install Script Menu" in /Applications/AppleScript.

Next, download the zipfile and extract Select Duplicate Messages.scpt to ~/Library/Scripts/Mail Scripts/ (the ~ means your home folder). You may have to create the folders if you're using 10.3.

You can now run the script by selecting it from under Mail Scripts in the Script Menu, which will be in the menu bar somewhere on the right. It's between the battery indicator and the AirPort icon on my machine, but exactly where it is will depend on what you have in your menu bar.

The other way to run it is from within Script Editor. Just extract the zipfile anywhere and double-click the script. This will start Script Editor with the script loaded. Hit "Run" and it'll start Mail if required and switch to it.

You will be presented with a window asking you to check that you have the right folders and mailboxes selected and that 'Organize By Thread' is switched off (the script can't select messages it can't see). Then you decide whether you want to compare message bodies or full messages including the headers. The script's view of your messages is limited to the list of messages currently displayed, so you shouldn't change the view while it is running.

Once the script has finished running it will display how many duplicates it found. You can click on the "Stats" button to see some statistics which are probably only of interest to me. The duplicate messages will be highlighted in the list of messages; this may not be immediately apparent if none of the duplicates are in the portion of the list currently visible. If you scroll through the list you will be able to see which messages are selected.

You may now do whatever you like with the selected messages - drag them to another mailbox, flag them, mark them as Junk or anything else you can normally do with messages in Mail. Chances are that you just want to get rid of them. If the duplicates are in a normal mailbox, you probably want to move them to the Trash. Either click and drag them to the Trash, or go to the "Message" menu and choose "Move To" then "Trash".

If the duplicates are in the Trash and you want to permanently delete them, ctrl-click or right-click on one of the highlighted messages and select "Delete" from the pop-up menu. Needless to say, if you're going to be permanently deleting messages you might want to take a backup of your mail first, just in case.

How does it work?

It extracts the message-id from the headers of each message, or creates a "unique" id from the contents of the message if there is no message-id header; sorts the ids into order; then runs through the list looking for adjacent items with the same message-id. Once it has found messages with identical message-ids, it checks the size of the message to see if AppleScript can handle the full content (AppleScript doesn't like strings larger than about 256KiB). If it can, it compares either the body or the full source of the messages, depending on which you chose when you ran the script. If the message is too large for AppleScript to handle, it just compares the sizes. Once the full list of verified duplicates is prepared, it selects them ready for you to move them to the Trash, erase them or do anything else you like. Finally, it beeps and lets you know it's finished.
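
For the curious, the steps above can be sketched roughly as follows. This is an illustration in Python, not the script's actual AppleScript: the Message class, the function names and the use of a SHA-1 hash as the fallback "unique" id are all assumptions made for the example, and the text field stands in for whichever of body or full source you chose to compare.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional

# Assumed cutoff, taken from the "about 256KiB" figure mentioned above.
MAX_COMPARE_BYTES = 256 * 1024


@dataclass
class Message:
    """Hypothetical stand-in for a message in the mail client's list."""
    message_id: Optional[str]  # None when there is no message-id header
    text: str                  # body or full source, whichever is compared
    size: int                  # message size in bytes


def key_for(msg: Message) -> str:
    """Use the message-id, or derive an id from the content if it is missing."""
    if msg.message_id:
        return msg.message_id
    # Assumption: any content hash would do; SHA-1 is just an example.
    return "hash:" + hashlib.sha1(msg.text.encode("utf-8")).hexdigest()


def same_message(a: Message, b: Message) -> bool:
    """Verify a suspected duplicate; fall back to sizes for large messages."""
    if a.size > MAX_COMPARE_BYTES or b.size > MAX_COMPARE_BYTES:
        return a.size == b.size
    return a.text == b.text


def find_duplicates(messages: List[Message]) -> List[int]:
    """Return the positions of messages that duplicate an earlier one."""
    # Sort positions by id so that candidates end up next to each other.
    order = sorted(range(len(messages)), key=lambda i: key_for(messages[i]))
    duplicates = []
    keeper = None  # position of the message currently being compared against
    for i in order:
        if (keeper is not None
                and key_for(messages[keeper]) == key_for(messages[i])
                and same_message(messages[keeper], messages[i])):
            duplicates.append(i)
        else:
            keeper = i
    return sorted(duplicates)
```

Note that this sketch only compares each message with the most recent non-duplicate that has the same id, which is a simplification. The positions it returns are the ones a script would then select in the message list.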

There are already scripts on the web. Why write another?

Searching the web for a suitable tool or script was the first thing I did when I accidentally created 6000 duplicate messages, which took my Trash up to a total of 24 000 messages. There are several available. I tried all the ones I could find, but none seemed to be able to handle my very large Trash and large number of duplicates. Perhaps they could have handled it eventually, but I got bored of waiting after several hours. The most promising script (the one which I think would have worked, given enough time) moved files to a folder, which wouldn't help in getting rid of duplicates from the Trash anyway.

How fast is fast?

Faster than the others I could find, but not mind-blowing.

Hundreds of messages with a few duplicates should take a couple of seconds.
Thousands of messages with a few duplicates should take a few tens of seconds.
Thousands of messages with thousands of duplicates should take a few minutes.
Tens of thousands of messages with thousands of duplicates should take a few tens of minutes.

The number of duplicates and the size of the messages matter a lot - the script has to process the contents of every message to verify that suspected duplicates really do match. That means lots of Apple Events to collect the data, and AppleScript isn't very fast at that kind of text processing. Verifying messages can only be done at a rate of about 25 per second, mostly limited by the rate at which Apple Events can be processed, and usually represents the bulk of the time taken.
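
As a rough back-of-the-envelope check, the verification rate alone gives a lower bound on the run time. The sketch below is only arithmetic on the "about 25 per second" figure quoted above; the function name is made up for the example, and it ignores the time spent collecting and sorting the ids.

```python
# Approximate verification rate quoted above (messages per second).
VERIFY_RATE_PER_SECOND = 25


def estimated_verify_seconds(suspected_duplicates: int) -> float:
    """Estimate the time spent verifying suspected duplicates, in seconds."""
    return suspected_duplicates / VERIFY_RATE_PER_SECOND


# 6000 suspected duplicates at about 25 per second is about 240 seconds,
# i.e. roughly four minutes of verification.
print(estimated_verify_seconds(6000) / 60, "minutes")
```

That is in line with the "few minutes" quoted above for thousands of duplicates.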

Why is it faster than the rest?

There seems to be one main reason - the algorithm used to detect the duplicate messages. It's not that special: it just sorts the messages, then runs through the list looking for adjacent duplicates. The sort algorithm used (Combsort) isn't that amazing either; a Quicksort would be better, but I'd never written a Combsort and wanted to try it out. The sort is far from being the performance bottleneck, so I just stuck with the sub-optimal Combsort. The sort-then-select approach is much faster for large numbers of messages than the naive approach, which is to compare every message with every other message.
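
For anyone who hasn't met Combsort, here is a minimal sketch of it in Python. It is a generic textbook-style version rather than the code from the script, and the shrink factor of 1.3 is the commonly used value, assumed here rather than taken from the script.

```python
def comb_sort(items):
    """Sort a list in place with Combsort and return it."""
    gap = len(items)
    shrink = 1.3  # assumed shrink factor; 1.3 is the usual choice
    swapped = True
    # Keep passing over the list until the gap is down to 1 and a full
    # pass has completed without any swaps.
    while gap > 1 or swapped:
        gap = max(1, int(gap / shrink))
        swapped = False
        for i in range(len(items) - gap):
            # Compare items that are `gap` apart and swap if out of order.
            if items[i] > items[i + gap]:
                items[i], items[i + gap] = items[i + gap], items[i]
                swapped = True
    return items


ids = ["<c@example>", "<a@example>", "<b@example>", "<a@example>"]
print(comb_sort(ids))
```

Once the ids are sorted, duplicates sit next to each other, so a single pass over the list finds them. For comparison, checking every message against every other message in a mailbox of 24 000 messages would need 24 000 x 23 999 / 2 = 287 988 000 comparisons.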

It doesn't work!

Let me know (my email address is at the bottom of the page). In order to help, I need to know which version of Mail and which version of OS X you are using, exactly what you did and exactly what happened. Even that might not be enough for me to reproduce the problem, so ideally you will also run the script from within Script Editor with Event Log selected at the bottom (scripts run very slowly this way, but it does provide excellent detail for debugging), and paste the contents of the Event Log into your bug report. That way I'll know exactly where it went wrong and even if I can't reproduce the problem myself I'll have a fighting chance of fixing it.


Email: tim (at) my surname without the initial 'a' (dot) org