[SGVLUG] Help with file format

Wed Nov 29 10:10:02 PST 2006

Keep in mind that "file" only looks at the first few hundred bytes of the file. 
If it's ostensibly a text file, though, you should be able to break it up into 
smaller chunks and run "file" on each of them.  You might also want to try 
"file -k", which tells it to keep going and report more than one matching 
format, since the format it's reporting is very likely a false positive.
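
As a rough sketch of the chunking idea (the sample file, the 2-line chunk 
size and the "chunk." prefix are all made up for illustration; for the real 
file you'd want something more like 1000 lines per chunk):

```shell
# Hypothetical sample standing in for the real file: 6 short lines.
workdir=$(mktemp -d)
printf 'a\nb\nc\nd\ne\nf\n' > "$workdir/sample.txt"

# Split into 2-line chunks named chunk.aa, chunk.ab, ...
split -l 2 "$workdir/sample.txt" "$workdir/chunk."

# Run file on every chunk; -k keeps going after the first match.
for chunk in "$workdir"/chunk.*; do
  file -k "$chunk"
done
```

The first chunk where the reported type changes is probably close to where 
the file goes bad.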

Also, in vim, try doing ":set fileencoding?" to see what encoding vim thinks the 
file is in.
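
Outside of vim, one quick check is to look at the first few bytes for a 
byte-order mark.  This is only a sketch: the sample file is made up, it 
ignores UTF-32, and plenty of Unicode files have no BOM at all, so "no BOM 
found" doesn't prove anything either way.

```shell
# Hypothetical sample: a UTF-16LE file starting with the BOM ff fe.
sample=$(mktemp)
printf '\377\376s\000e\000' > "$sample"

# Dump the first 4 bytes as hex, with no offset column (-An).
first_bytes=$(od -An -tx1 -N4 "$sample" | tr -d ' \n')

# Compare against the well-known BOM byte sequences.
case "$first_bytes" in
  fffe*)   guess="UTF-16LE (BOM ff fe)" ;;
  feff*)   guess="UTF-16BE (BOM fe ff)" ;;
  efbbbf*) guess="UTF-8 with BOM (ef bb bf)" ;;
  *)       guess="no BOM found" ;;
esac
echo "$sample: $guess"
```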

od is a handy tool; my favorite invocation is something like this:

od -tx1 -tc ./garbage.foo | less

which shows alternating lines of hex bytes and ASCII text:

0000000 73 65 6c 65 63 74 20 6d 65 73 73 61 67 65 2c 20
           s   e   l   e   c   t       m   e   s   s   a   g   e   ,
0000020 63 6f 75 6e 74 28 2a 29 20 66 72 6f 6d 20 64 69
           c   o   u   n   t   (   *   )       f   r   o   m       d   i
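
Since the trouble reportedly starts around a particular line, it may be 
easier to jump od straight there instead of paging from the top.  A minimal 
sketch, using a made-up 3-line sample and assuming lines are counted by 
newline bytes the way "head" counts them:

```shell
# Hypothetical sample; in the thread the trouble starts near line 15103.
sample=$(mktemp)
printf 'one\ntwo\nthree\n' > "$sample"
line=3

# Bytes occupied by all lines before the suspect line.
offset=$(head -n "$((line - 1))" "$sample" | wc -c | tr -d ' ')

# Skip to that offset (-j), dump at most 64 bytes (-N), decimal offsets (-A d).
od -A d -j "$offset" -N 64 -tx1 -tc "$sample"
```

For the real file you would set line=15103 and point it at the customer's 
file instead of the sample.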

-- 
Jeremy Leader
jleader at alumni.caltech.edu
leaderj at yahoo-inc.com (work)

on 11/29/2006 08:11 AM Ted Arden wrote:
> sounds like the file is multi-part.  you could
> try stripping off the txt bits in front, redirecting
> the 'garbage bits' to another file then asking
> linux to tell you what that 'garbage' is..
> octal dumps are kinda fun too.
> 
> od -c ./garbage.foo | less
> 
> then you can kinda see what sorta characters are
> there.
> 
> if there's any sorta texty type stuff in the
> 'garbage', use strings to strip it out as well.
> 
> anyway, the od stuff is olde skewl unix commands
> back from my OSF/1 days.. strings sometimes works
> a bit better to *read* files like that.
> 
> strings ./garbage.foo
> 
> =ted=
> 
> On Wed, 29 Nov 2006, James Neff wrote:
> 
>> Greetings,
>>
>> We received a file from a customer and I'm having trouble determining what
>> the character set is.
>>
>> When I run the "file" utility:
>>
>> [root@appserver2 06-11-28]# file customer-file.txt
>> customer-file.txt: MPEG ADTS, layer I, v1,  96 kBits, 44.1 kHz, Stereo
>>
>>
>> When I run "less" it thinks it's a binary file and I see garbage if I
>> choose to look at it anyway.
>>
>> When I run "vi" I can read the file just fine from start to finish but
>> at the bottom of the terminal is:
>>
>> "customer-file.txt" [converted][dos] 47830L, 9943298C
>>
>> The line count is correct.
>>
>> When I run "more" I can read the file just fine from start to finish.
>>
>> When I try to use "split", the first 15103 lines look ok, but after that
>> everything looks like garbage, as if it's binary.
>>
>> Before I can go back to our customer and ask them for a proper file, I
>> need to at least tell them what is wrong with this file (other than
>> saying something is wrong with it).
>>
>> What started this problem was when we tried to import this into our MS
>> SQL database using DTS.  At line 15103 the DTS reported an error saying
>> there were extra columns in that record.  When we first opened DTS it
>> reported the file is in UNICODE.   How would I go about verifying that?
>>
>> So how do I determine what exactly is wrong with this?  Any ideas?
>>
>> Thanks in advance,
>> James