[SGVLUG] How to find files/messages that are "almost" the same

John E. Kreznar jek at ininx.com
Mon Apr 20 12:55:48 PDT 2009


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

In a message purporting to be from matti <mathew_2000 at yahoo.com> but
lacking a digital signature it was written:

> I would also only use diff on files close to the
> same size, as obviously files of significantly
> different sizes are very different.

That wouldn't work for something like Swantje's post, the size of which
is more than doubled by including two versions of the same content.

The focus should be first on trying to /canonicalize/ individual
messages, before trying to compare pairs of them.

David Lawyer's post, which was taken by Tom to be lacking a reply, may
actually have been a subtle reply by exhibiting a canonicalization of
Tom's message.

In the case of email, canonicalization should begin by assembling a
catalogue of all the kinds of mutilation that are done by the various
Mail User Agents.  Then, a program could be attempted which, for each
message, identifies the particular mutilations it has suffered, and
de-mutilates it accordingly.  Only then would comparison of pairs of
messages be attempted.
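
As a rough illustration of the catalogue-then-compare idea above, here is
a minimal sketch in Python. The three "mutilations" it undoes (reply quote
markers, a trailing signature block, and re-wrapped lines) are only
assumed examples of what a real catalogue might contain, and the function
names are hypothetical. The comparison uses the standard library's
difflib, which is one possible choice rather than the only one.

```python
import difflib
import re


def drop_signature(text):
    # Cut everything from the conventional "-- " signature delimiter onward.
    return re.split(r"(?m)^-- ?$", text, maxsplit=1)[0]


def strip_quote_markers(text):
    # Remove leading "> " reply markers (possibly nested) from each line.
    return re.sub(r"(?m)^(?:>[ \t]?)+", "", text)


def unwrap_whitespace(text):
    # Collapse every whitespace run to one space so that differently
    # wrapped copies of the same paragraph become identical.
    return " ".join(text.split())


# The "catalogue": one undo step per known mutilation, applied in order.
# This list is an assumption for illustration; a real one would be longer.
MUTILATIONS = [drop_signature, strip_quote_markers, unwrap_whitespace]


def canonicalize(text):
    # De-mutilate a single message body before any comparison is made.
    for undo in MUTILATIONS:
        text = undo(text)
    return text


def similarity(a, b):
    # Compare the canonical forms only; returns a ratio between 0 and 1.
    matcher = difflib.SequenceMatcher(None, canonicalize(a), canonicalize(b))
    return matcher.ratio()


if __name__ == "__main__":
    original = (
        "The focus should be first on trying to canonicalize\n"
        "individual messages.\n"
        "-- \n"
        "John\n"
    )
    quoted = (
        "> The focus should be first on\n"
        "> trying to canonicalize individual\n"
        "> messages.\n"
    )
    print(canonicalize(original) == canonicalize(quoted))
    print(similarity(original, quoted))
```

Note that the two sample bodies differ in size and in line breaks, yet
compare as equal once each has been canonicalized on its own, which is
the point of doing that step before any pairwise comparison.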

- -- 
 John E. Kreznar jek at ininx.com 9F1148454619A5F08550 705961A47CC541AFEF13

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Processed by Mailcrypt 3.5.8+ <http://mailcrypt.sourceforge.net/>

iD8DBQFJ7NMaYaR8xUGv7xMRAqTaAJ4peZU4DihCrG/2wZqYSIWca2R2bwCeO1PX
hS8WAVC9dKEfli3cRxvWG5A=
=Obdq
-----END PGP SIGNATURE-----


