[SGVLUG] How to find files/messages that are "almost" the same

Emerson, Tom (*IC) Tom.Emerson at wbconsultant.com
Mon Apr 20 13:21:56 PDT 2009


> -----Original Message----- Of John E. Kreznar
> [replying to Matti's comment]
> > I would also only use diff on files close to the
> > same size, as obviously files of significantly
> > different sizes are very different.
>
> Wouldn't work for something like Swantje's post, the size of
> which is more than doubled by including two versions of the
> same content.
>
> The focus should be first on trying to /canonicalize/
> individual messages, before trying to compare pairs of them.

True enough :)  Some "mutilations", however, are due to the writers -- you and I, for instance, tend to mark /italicized/ text with slashes, and some MUAs actually render that as italics (and remove the slashes).  An HTML version would have the same text within a particular <i>set of tags</i>.
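To illustrate, here is a rough sketch in Python of that one canonicalization step, so that the slash form and the tag form compare equal.  The function name canonicalize_emphasis is made up for this example, and the slash pattern is only a heuristic: it leaves paths like /usr/bin/env alone, but it is not a complete treatment of every MUA's markup.

```python
import re

def canonicalize_emphasis(text):
    """Reduce /slash/ and <i>tag</i> emphasis to the bare words (heuristic)."""
    # Drop HTML italic/emphasis tags but keep the text inside them.
    text = re.sub(r"</?(?:i|em)\s*>", "", text, flags=re.IGNORECASE)
    # Drop the slashes around /emphasized words/. The lookarounds require
    # a non-word character (or the string edge) outside each slash, which
    # usually keeps paths and "and/or" constructions untouched.
    text = re.sub(r"(?<![\w/])/(\w[^/\n]*?\w|\w)/(?![\w/])", r"\1", text)
    # Collapse whitespace runs so re-wrapped lines compare equal.
    return " ".join(text.split())
```

With this, "a particular <i>set of tags</i>" and "a particular /set of tags/" both reduce to the same string before any diff is attempted.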

> David Lawyer's post, which was taken by Tom to be lacking a
> reply, may actually have been a subtle reply by exhibiting a
> canonicalization of Tom's message.

In my particular case, the fact that David's reply was essentially unchanged from my question is only marginally significant -- I generally wouldn't care that these were "duplicates", mainly because they are plain text and such duplications are rare: this is the first time I've seen one in the 10+ years that this e-mail list has existed.

What I'm after are the cases where the MUA has, according to the "rules" it has been given, created multiple instances of the same message.  One rule tells the client to "keep the reply in the folder where it started", and another rule says that anything addressed to a particular group should be placed in a certain directory.  Replies "to the group" then qualify under both rules, and two copies are created.  The copies differ in both the "header" (because the "copy" is made 2 seconds after the message was "posted" and thus has a different timestamp) and the main body of the message (sometimes by as little as a single character between the two "internal" copies of the message).
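A minimal sketch of the kind of comparison I have in mind, in Python, using only the standard email and difflib modules.  The set of headers to ignore and the 0.95 threshold are assumptions for illustration, not measurements from my mailbox, and the names stable_parts and almost_same are made up.  It only handles single-part plain-text messages.

```python
import difflib
from email import message_from_string

# Assumption: only these headers differ between the two filed copies.
VOLATILE_HEADERS = {"date", "received"}
# Assumption: bodies at least this similar count as "almost the same".
THRESHOLD = 0.95

def stable_parts(raw):
    """Return (stable headers, whitespace-normalized body) for one message."""
    msg = message_from_string(raw)
    headers = sorted(
        (name.lower(), " ".join(str(value).split()))
        for name, value in msg.items()
        if name.lower() not in VOLATILE_HEADERS
    )
    payload = msg.get_payload()
    # A multipart message returns a list here; this sketch skips those.
    body = payload if isinstance(payload, str) else ""
    return headers, " ".join(body.split())

def almost_same(raw_a, raw_b, threshold=THRESHOLD):
    """True if the stable headers match and the bodies are nearly equal."""
    headers_a, body_a = stable_parts(raw_a)
    headers_b, body_b = stable_parts(raw_b)
    if headers_a != headers_b:
        return False
    matcher = difflib.SequenceMatcher(None, body_a, body_b, autojunk=False)
    return matcher.ratio() >= threshold
```

Two copies that differ only in their Date header and in one character of the body would pass this test, while messages with different subjects or senders are rejected before the (slower) body comparison runs.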

Now, couple that "copy" with the fact that the original may have had a full-sized screen shot (or two) as well as an attached file (spreadsheet, document, whatever), making the final file size a megabyte or more (5, 10 ...).
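One way to keep those attachments from dominating the comparison, sketched in Python with the standard email and hashlib modules: pull out the text parts for a fuzzy comparison and reduce every other part to a digest, so a multi-megabyte screen shot is only ever compared as a short hash.  The function name fingerprint is made up for this example, and it assumes the copies carry byte-identical attachments.

```python
import hashlib
from email import message_from_bytes

def fingerprint(raw_bytes):
    """Return (joined text parts, sorted SHA-256 digests of other parts)."""
    msg = message_from_bytes(raw_bytes)
    texts, digests = [], []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no payload of their own
        # decode=True undoes base64 / quoted-printable transfer encoding.
        payload = part.get_payload(decode=True) or b""
        if part.get_content_maintype() == "text":
            charset = part.get_content_charset() or "utf-8"
            try:
                texts.append(payload.decode(charset, errors="replace"))
            except LookupError:  # unknown charset name in the header
                texts.append(payload.decode("utf-8", errors="replace"))
        else:
            digests.append(hashlib.sha256(payload).hexdigest())
    return "\n".join(texts), sorted(digests)
```

Two messages are then candidates for "almost the same" when their digest lists are equal and their text parts are close, and only the small text part needs to go through diff or a similarity ratio.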

