[SGVLUG] Question for "text" database gurus

Emerson, Tom (*IC) Tom.Emerson at wbconsultant.com
Mon Jun 16 17:06:55 PDT 2008


When I went to Germany and Ireland, I dabbled a little with
"geotagging", and worked up enough material for a presentation on it,
which I'll give next month.  On my last trip, however, I managed to
create an excessive amount of data, some of which I'm certain has been
"duplicated".

When I started each day, I saved and cleared any previous track.  During
the day (when I remembered), I'd also save the track, but not "clear" it
(though in retrospect perhaps I should have; I'll save that discussion
for the meeting).  So the first "track" of the day would run from point
"A" to "B", but the second track would run from "A" THROUGH "B" and on
to "C" -- at least, that's the theory.  In practice, once a track was
"saved", some or all of the previous track would be cleared
automatically, so the second track might begin anywhere BETWEEN points
"A" and "B".

Either way, I'm likely to end up with duplicated data for anything
between "A" and "B".  What is the simplest "unix recipe" for detecting
and removing duplicate data from multiple plain-text files?

I can make this a little easier:  the "plain text" format of a Magellan
GPS unit is one line per reading, and although it is "comma delimited",
in practice the columns of data are fixed width.  Since there are no
header/trailer lines in the Magellan format, it is a simple matter to
combine the files into a single one:

   $ cat ireland_*.trk >> all_of_ireland.trk

(where the "*" matches the sequence number of each file as I saved it)
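As a sanity check -- again assuming exact duplicates -- I figure I can
at least estimate how much was duplicated by comparing the raw line
count against the unique line count:

   $ # total readings in the combined file
   $ wc -l all_of_ireland.trk
   $ # readings remaining after collapsing identical lines
   $ sort all_of_ireland.trk | uniq | wc -l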

Once combined, I know I can use "sort" to order the records by date (or
even by the entire record), but then how do I detect and remove the
duplicate rows from this data?
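Here's roughly what I've been experimenting with, comparing whole
records only; I haven't tried keying on just the date/time columns yet,
and "all_of_ireland_dedup.trk" is just a placeholder name:

   $ # detection: show each record that appears more than once
   $ sort all_of_ireland.trk | uniq -d

   $ # removal: write out a single copy of every record
   $ sort -u all_of_ireland.trk > all_of_ireland_dedup.trk

Note that this only collapses exact duplicates; two readings that differ
by even one character (a timestamp a second off, say) would both
survive, which may or may not matter for the overlapping A-to-B stretch.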


