[SGVLUG] How to find files/messages that are "almost" the same

Jeremy Leader jleader at alumni.caltech.edu
Mon Apr 20 11:52:10 PDT 2009


If you're really ambitious, there's a technique called 
"shingleprinting", which computes hashes for small chunks of the text, 
and determines the number of identical chunks between two files (or email 
messages, or whatever you're trying to de-dupe).  The theory is that the 
proportion of chunks shared between two files gives a measure of their 
similarity.  A quick search turned up two implementations, which I 
haven't investigated further:

http://research.microsoft.com/en-us/downloads/4e0d0535-ff4c-4259-99fa-ab34f3f57d67/default.aspx
http://wiki.cs.pdx.edu/forge/simhash.html
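
For anyone who wants to see the idea before trying either package, here
is a minimal sketch in Python. It is not taken from either implementation
above; the function names, the 5-word chunk size, and the use of SHA-1 are
all my own assumptions. It hashes every run of k consecutive words and
reports the shared chunks as a fraction of all distinct chunks (the
Jaccard similarity).

```python
import hashlib


def shingles(text, k=5):
    """Return the set of hashes of every run of k consecutive words.

    k=5 is an arbitrary choice for illustration, not a recommended value.
    """
    words = text.split()  # split() also collapses runs of whitespace
    if len(words) < k:
        # Too short for a full window: treat the whole text as one chunk.
        return {hashlib.sha1(" ".join(words).encode("utf-8")).hexdigest()}
    return {
        hashlib.sha1(" ".join(words[i:i + k]).encode("utf-8")).hexdigest()
        for i in range(len(words) - k + 1)
    }


def similarity(a, b, k=5):
    """Shared chunks divided by all distinct chunks (Jaccard similarity)."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)
```

Because the text is split on whitespace first, two messages that differ
only in spacing score 1.0, and messages with no 5-word run in common
score 0.0. Where to put the "these are duplicates" cutoff in between is
something you would have to tune on your own mail.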

-- 
Jeremy Leader
jleader at alumni.caltech.edu

matti wrote:
> 
> Hmmm... interesting problem
> 
> let the world of command line save you!
> 
> lol! ;-)
> 
> ok!
> 
> this IS how I would try to solve the problem
> 
> use "diff" - maybe piping to wc or something
> and seeing which files are minimally different
> 
> I would also only use diff on files close to the
> same size, as obviously files of significantly
> different sizes are very different.
> 
> http://en.wikipedia.org/wiki/Diff
> 
> hmm, so probably use "ls -lta" and pipe to a file,
> then sort the file based on size of the files,
> extract the nearest files based on size, diff
> those -> use wc to determine how big the difference
> is, and then manually look and see if it worked ;)
> 
> best
> matti
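
The steps matti describes could be sketched roughly as follows. This is
only an illustration under my own assumptions: it uses Python's difflib
rather than the diff and wc commands, the function name is made up, and
the 10% size gap and 0.9 similarity threshold are arbitrary. It sorts the
files by size, compares each file only with its nearest neighbour in that
ordering, and reports the pairs that are nearly the same.

```python
import difflib
import os


def near_duplicates(paths, max_size_gap=0.1, threshold=0.9):
    """Yield (path_a, path_b, ratio) for size-adjacent files that look alike.

    max_size_gap and threshold are illustrative defaults, not tuned values.
    """
    by_size = sorted(paths, key=os.path.getsize)
    # Only neighbours in the size ordering are compared with each other.
    for a, b in zip(by_size, by_size[1:]):
        size_a, size_b = os.path.getsize(a), os.path.getsize(b)
        # Skip pairs whose sizes differ by more than the allowed fraction.
        if size_b - size_a > max_size_gap * size_b:
            continue
        with open(a, errors="replace") as fa, open(b, errors="replace") as fb:
            matcher = difflib.SequenceMatcher(None, fa.readlines(), fb.readlines())
        ratio = matcher.ratio()  # 1.0 means the two line lists are identical
        if ratio >= threshold:
            yield a, b, ratio
```

Comparing only neighbours keeps the number of comparisons small, but it
can miss a duplicate when an unrelated file of similar size sorts between
the two copies, so the manual check at the end is still worthwhile.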
> 
> --- On Mon, 4/20/09, Emerson, Tom (*IC) <Tom.Emerson at wbconsultant.com> wrote:
> 
>> From: Emerson, Tom (*IC) <Tom.Emerson at wbconsultant.com>
>> Subject: [SGVLUG] How to find files/messages that are "almost" the same
>> To: "'SGVLUG Discussion List.'" <sgvlug at sgvlug.net>
>> Date: Monday, April 20, 2009, 8:50 AM
>> One more reason to dislike certain
>> email clients: using automation to sort e-mails can end up
>> with "duplicates" in multiple folders, however these are
>> not-quite-perfect duplicates, so a binary comparison will
>> see them as distinct messages when in fact the /content/ is
>> the same.
>>
>> Does anyone know of a product or program that would ignore
>> small differences (such as an extra space at the end of a
>> line) when comparing the body/text of a message?
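
For the specific case in the question (an extra space at the end of a
line), a much simpler approach may be enough: normalize the whitespace,
then compare hashes. The sketch below is hypothetical; the function names
are mine, and it assumes the message bodies have already been extracted
into a dict of strings keyed by some name such as folder and message
number.

```python
import hashlib
from collections import defaultdict


def normalized_digest(body):
    """Hash the text after dropping trailing whitespace on every line
    and blank lines at the start and end of the body."""
    lines = [line.rstrip() for line in body.splitlines()]
    text = "\n".join(lines).strip("\n")
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def group_duplicates(messages):
    """Given {name: body}, return lists of names whose bodies match
    once the whitespace differences above are ignored."""
    groups = defaultdict(list)
    for name, body in messages.items():
        groups[normalized_digest(body)].append(name)
    return [names for names in groups.values() if len(names) > 1]
```

This only ignores trailing whitespace, line-ending style, and leading or
trailing blank lines. Any other difference, such as a changed word or
re-wrapped lines, makes the bodies count as distinct.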


