Manage Learn to apply best practices and optimize your operations.

Remove duplicate attachments from user's e-mail database

This tip shows you a way to remove duplicate attachments from an e-mail database.

View member feedback to this tip.

As a corollary to the recent attachment-related tips, here's a way to remove duplicate attachments from an e-mail database.

It evaluates not only the attachment filename but also the size, so if a file is passed back and forth for modification (i.e., for redlining), it will leave all copies intact. But if, for example, a user simply receives a file and forwards it on, saving the sent copy as well, it will strip out all duplicate copies, leaving the most recent.

We've added this agent to the mail template, giving users an easy way to clean up their account without losing any data. The users love it, and it saves us a huge amount of disk space. In one user's case, the agent removed over a gigabyte of duplication.

  
  Code: '(Declarations)
Dim gsesCurrent As NotesSession
Dim grtsRemovalDescription 
As NotesRichTextStyle Sub Initialize
 On Error Goto ErrorHandler
 Dim dbCurrent As NotesDatabase
 Dim vwAllDocuments As NotesView 
 Dim docEmail As NotesDocument 
 Dim rtiBody As NotesRichTextItem 
 Dim nrtiEmailBody As NotesRichTextItem
 Dim boolTestMode As Boolean 
 Dim intFileCount As Integer
 Dim lngTotalBytesFreed As Long
 Dim strFileNames As Variant
 
 Select Case Messagebox
("Do you wish to permanently remove all duplicate 
attachments (select No to run this utility in 
test mode)?", 291, "Permanently 
Delete?")
 Case 6
  boolTestMode = False
 Case 7
  boolTestMode = True
 Case Else  
  Exit Sub
 End Select
 
 lngTotalBytesFreed = 0
 Redim strFileNames(0)
 Set gsesCurrent = New NotesSession
 DefineRichTextStyle
 Set dbCurrent = gsesCurrent.CurrentDatabase
 Set vwAllDocuments = dbCurrent.GetView ("($All)")
 Set docEmail = vwAllDocuments.GetLastDocument   
 Do While Not (docEmail Is Nothing)
  intFileCount = 0   
  If docEmail.HasEmbedded 
And docEmail.HasItem("Body") Then  
   Print "Scanning: " & docEmail.Subject(0)   
   Set rtiBody = docEmail.GetFirstItem("Body")
   If ( rtiBody.Type = RICHTEXT ) Then
    Forall object In rtiBody.EmbeddedObjects
     If ( object.Type = EMBED_ATTACHMENT ) Then
      If IsArrayMember(strFileNames, 
object.Source & Cstr(object.FileSize)) 
Then
       intFileCount = intFileCount + 1
       If intFileCount = 1 Then
        Call rtiBody.AddNewLine (2) 
        Call rtiBody.AppendStyle
(grtsRemovalDescription) 
        Call rtiBody.AppendText
("Attachment(s) removed from this document:")
        Call rtiBody.addnewline (1)
       End If
       Call rtiBody.AddNewLine (1)
       Call rtiBody.AppendText(object.Source)
       lngTotalBytesFreed = 
lngTotalBytesFreed + object.FileSize
       Print "Removed: " & Cstr(lngTotalBytesFreed) 
& " bytes"
       If Not (boolTestMode) Then Call object.Remove       
      Else
       AddToArray strFileNames, 
object.Source & Cstr(object.FileSize)
      End If
     End If
    End Forall
   End If 
   If intFileCount > 0 And Not (boolTestMode) Then
    Call docEmail.Save(True, True)
   End If
  End If
  Set docEmail = vwAllDocuments.
GetPrevDocument(docEmail)
 Loop
 
 Set docEmail = dbCurrent.CreateDocument
 docEmail.Form = "Memo"
 docEmail.Subject = "Results of removal 
of duplicate attachments from 
database " & dbCurrent.Title
 Set nrtiEmailBody = docEmail.CreateRichTextItem("Body")
 Call nrtiEmailBody.AppendText("Bytes freed: " 
& Cstr(lngTotalBytesFreed))
 Call docEmail.Send
(False,gsesCurrent.EffectiveUserName)
 Exit Sub
ErrorHandler:
 Print {Error #} & Str(Err()) & { in line } 
& Str(Erl()) & { of } & {Remove 
Duplicate Attachments} & {: "} & Error$ & {"}
 Resume Next
End Sub

Function AddToArray(pvarParentArray 
As Variant, pvarArrayItem As 
Variant) 'This function appends a
 value of any datatype to an array of the 
same datatype
 On Error Goto ErrorHandler
 If Isnull(pvarParentArray) 
Then Redim Preserve pvarParentArray(0)
 Redim Preserve 
pvarParentArray(Ubound(pvarParentArray) + 1)
 pvarParentArray(Ubound(pvarParentArray)) = 
pvarArrayItem
 Exit Function
ErrorHandler:
%REM 
All errors in this context assume 
the array has not yet been initialized; 
hence, the array is redimmed to a 
bound of 0 to store the parameter's value / 
object.
%END REM
 Redim Preserve pvarParentArray(0)
 Resume Next
End Function

Sub DefineRichTextStyle
 Set grtsRemovalDescription = 
gsesCurrent.CreateRichTextStyle() 
 grtsRemovalDescription.NotesColor = 8 'Maroon
 grtsRemovalDescription.FontSize = 8 
 grtsRemovalDescription.Bold = True
End Sub

Function IsArrayMember 
(pvarParentArray As Variant, pvarArrayItemValue As 
Variant) As Variant
 On Error Goto ErrorHandler
 If Isnull (Arraygetindex(pvarParentArray,
 pvarArrayItemValue, 5)) Then
  IsArrayMember = False
 Else
  IsArrayMember = True
 End If 
 Exit Function
ErrorHandler:
 Redim Preserve pvarParentArray(0)
 Resume Next
End Function

MEMBER FEEDBACK TO THIS TIP

The code in this tip works great, and it relies on the filename and file size to find the duplicates. I guess it is not foolproof because the file size does not change when there is a change in the format (upper case to lower case or vice versa) of the document that is attached with the filename remaining the same.

—Dhana S.

******************************************

By default it retains a copy of the attachment that was last sent. How can we change it so it keeps the original attachment? It's currently harder to find things if you have to go to the sent items to find the attachment. It should keep the original received attachment instead (unless it's modified of course).

—Tony R.

******************************************

The modification you're requesting only requires a change to two lines of code:

  1. Change the line right before the start of the document loop from:

    Set docEmail = vwAllDocuments.GetLastDocument
    To:
    Set docEmail = vwAllDocuments.GetFirstDocument

  2. Change line right before the end of the document loop from:

    Set docEmail = vwAllDocuments.GetPrevDocument(docEmail)
    To:
    Set docEmail = vwAllDocuments.GetNextDocument(docEmail)

With this, change the code scans the database in ascending order instead of descending.

—Tim Tripcony, tip author

******************************************

I've used this utility and found one pitfall: If you have more than two duplicates of the attachment it will not remove all occurrences. For example, I have one attachment that appears in eight different places. Each time I run the utility, it removes one copy only. I thought that it would find all eight and remove the seven copies and leave one -- maybe I am wrong. Thanks in any case!

—Luke S.

*******************************************

I would LOVE to know how to restrict this agent to attachments in messages older than X months. And I know several other companies that I have passed this code along to would love to know how to do the same.

—John G.

********************************************

Modifications to the "Remove duplicate attachments" agent that allow age-based removal:

'Add the following lines to your Declarations: 
        Dim datCreation As NotesDateTime 
        Dim datExpiration As NotesDateTime 
        Dim strExpirationPrompt As String 

Insert the following block immediately prior to the "Do While" loop. This ensures the user will not be prompted multiple times. I've chosen to prompt the user to enter the age in days; changing the unit reference in the prompt to "months" and the default value to "3", and using .AdjustMonth instead of .AdjustDay, would be approximately equivalent. The difference, of course, being that adjusting by month should return the same day number in the adjusted month as the current month (i.e., from 7/14 to 4/14), whereas adjusting by day will return a date a couple days ahead if it overlaps months with 31 days.

NOTE: The following block can technically be inserted anywhere prior to the loop (i.e., before asking the user whether they wish to run the utility live or in test mode), but if the NotesSession variable has not yet been instantiated, the agent will fail when the code attempts to instantiate the NotesDateTime variable (in this case, datExpiration).

         strExpirationPrompt = Inputbox
("How many days would you like to retain?", 
"Expiration Preference", "90") 
        If (strExpirationPrompt = "" 
or Not (IsNumeric(strExpirationPrompt))) 
Then 'Checks for valid entry 
                MessageBox "Please enter a
 valid number of days." 
                Exit Sub 
        Else 
                Set datExpiration = gsesCurrent.
CreateDateTime(Cstr(Now)) 
                Call datExpiration.AdjustDay(
1 - Cint(strExpirationPrompt)) 'Results in 
adjustment by a negative number 
        End If 

'Insert the following immediately after 
opening the loop. 
                Set datCreation = gsesCurrent.
CreateDateTime(Cstr(docEmail.Created)) 
                If datExpiration.TimeDifference(datCreation) 
> 0 Then 'This skips any documents 
created after the cutoff date 

Finally, because the last line opens an If block, be sure to insert an "End If" before "Loop." Word of caution: The If block must actually be closed prior to "Set docEmail = vwAllDocuments.GetPrevDocument(docEmail)"... if "End If" is the last line before "Loop", the code will keep looping forever, assuming it found at least one document that was created prior to the cutoff date.

Also worth noting: Using .LastModified or .LastAccessed instead of .Created in the first of the two lines in the above block would give you even more flexibility (particularly if you also include a prompt up front to determine which date property the user would prefer, similar to the standard archiving feature in Notes). Instead of simply purging attachments based on their original age, you're taking into account some measure of the validity/pertinence of the data. In other words, if I'm not even viewing the data on a regular basis (.LastAccessed), there may be no value in continuing to retain it -- unless, of course, there are regulatory requirements involved -- and certainly not in duplicate.

—Tim Tripcony, tip author

********************************************

It would be nice to see removed filenames in the result e-mail. Otherwise the "test mode" is not very helpful.

—Vladimir T.

********************************************

The result e-mail was originally written as sort of a sales pitch to users -- if they didn't believe it would make much of a difference I suggested that they run it in test mode, confident that no data would be touched, but that they'd get an e-mail saying that running it live could remove, for example, 450,000,000 bytes from their account, without deleting anything they didn't still have elsewhere. Once the test mode finished, running a simple full-text search in the All Documents view for "Attachment(s) removed" would immediately display all messages impacted by the agent, so we hadn't yet identified a need for additional detail in the result email.

That said, what Vladimir is requesting wouldn't be too difficult. One approach would be to maintain two separate Lists througout instead of the Array it currently does, and set the list members to instances of a class/UDT (user-defined type) with two members: name and size. One list would store all the files already found, so that additional matching files would be removed (or logged, in test mode), while the other list would store the files removed. Then the code that generates the e-mail can loop through the second list, writing the file information to the e-mail body.

If the class or UDT had additional members, such as the date/time, sender, and subject of the email impacted, this utility could generate a full log of what it did. In fact, if the e-mail is formatted as MIME instead of standard rich text, they could display all of this information in a pretty little table -- not that you can't do that in rich text with the new LotusScript classes, but for something like this, MIME is just far easier and more elegant.

—Tim Tripcony, tip author

********************************************

I have written a tip that fixed the bug that caused only one deletion of multiple copies per run. In addition, this code has added sender, subject and filename to the summary e-mail.

—Vladimir T.

********************************************

The ability to add a doclink to the message where the attachment was removed, pointing to the e-mail that still contains the attachment, would be handy. Also, I don't think it was mentioned anywhere, it seems that this code is for version 6 or higher.

—Glenn P.

********************************************

This tip sounded great at first, but what happens if you're dealing with thousands of attachments and have dozens of similarly sized identically-named files? There's a risk of tossing a unique item. Is there anyway to also include metadata from the attachment, as well as a creation date or a last modification date?

—Bradley M.

********************************************

I have one more correction to the code in this tip. The first time I ran the code in the debugger, I noticed several hits of the OnError section resulting from documents with rtiBody.EmbeddedObjects beeing empty (although docEmail.HasEmbedded is True). To fix this, use the correction below.

Replace:

                  If ( rtiBody.Type = RICHTEXT ) Then

with this:

                  If ( rtiBody.Type = RICHTEXT ) And Not
Isempty(rtiBody.EmbeddedObjects) Then

—Markus K.

Do you have comments on this tip? Let us know.

This tip was submitted to the SearchDomino.com tip exchange by member Tim Tripcony. Please let others know how useful it is via the rating scale below. Do you have a useful Notes/Domino tip or code to share? Submit it to our monthly tip contest and you could win a prize and a spot in our Hall of Fame.

Dig Deeper on LotusScript

Start the conversation

Send me notifications when other members comment.

Please create a username to comment.

-ADS BY GOOGLE

SearchWindowsServer

Search400

  • iSeries tutorials

    Search400.com's tutorials provide in-depth information on the iSeries. Our iSeries tutorials address areas you need to know about...

  • V6R1 upgrade planning checklist

    When upgrading to V6R1, make sure your software will be supported, your programs will function and the correct PTFs have been ...

  • Connecting multiple iSeries systems through DDM

    Working with databases over multiple iSeries systems can be simple when remotely connecting logical partitions with distributed ...

SearchDataCenter

SearchContentManagement

Close