[SGVLUG] Grep "quickie" needed -- searching for hi-bit characters

Christopher Smith x at xman.org
Fri Jan 4 20:18:53 PST 2008


Claude Felizardo wrote:
> On Jan 4, 2008 3:57 PM, Emerson, Tom (*IC) <Tom.Emerson at wbconsultant.com> wrote:
>   
>>
>> I've got an odd one here -- I know how I'd do this on an HP using some
>> proprietary tools I've used for the last 15 years, but this is on a *nix
>> system so I need to know how to do this using grep.
>>
>> We have some files that were transferred from one machine to another [one of
>> which was a PC], and somewhere in the process, it appears that some
>> local-language/"multi-byte" characters got translated to
>> multiple-ascii-bytes, which in turn buggered up the record length.
>> Fortunately, these are easy to detect visually as the new values for each
>> "byte" of the character are between 128 and 255 and generally look like
>> "line noise" when cat'd to the screen.  Unfortunately, the files involved
>> are thousands of lines long, so a pure visual search is out of the question.
>>
>> What would I use as a regex to find characters with a byte (ascii) value >
>> 127?
>>     
>
> sounds like you should be using sed or perl.
> can't think of the regex right now but if it's supposed to be regular
> text, what about just running the files through strings?
>   
This is simple enough to do in C, let alone perl:

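#include <stdio.h>
#include <sys/types.h>   /* off_t */

int main(void)
{
    off_t offset = 0;    /* byte position within stdin, starting at 0 */
    int byte;
    /* getchar() returns each byte as a value in 0..255, or EOF at the end */
    while (EOF != (byte = getchar())) {
        if (byte > 127) {    /* high bit set, so not 7-bit ASCII */
            printf("Offset: %lld\t Character value: %d\n",
                   (long long) offset, byte);
        }
        ++offset;
    }
    return 0;
}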
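
Since the files in question are record-oriented, it may be handier to know which lines are affected rather than raw byte offsets. Here is a minimal sketch of that variant, assuming records end in a newline; the function name report_high_bit_lines is made up for illustration and is not part of any library:

```c
#include <stdio.h>

/* Hypothetical helper: scan `in` and write one report line to `out` for
 * every input line that contains at least one byte above 127.
 * Returns the number of lines reported. */
long report_high_bit_lines(FILE *in, FILE *out)
{
    long line = 1;      /* current line number, 1-based */
    long flagged = 0;   /* lines containing at least one high-bit byte */
    long high = 0;      /* high-bit bytes seen on the current line */
    int byte;

    while (EOF != (byte = getc(in))) {
        if (byte > 127) {
            ++high;
        } else if (byte == '\n') {
            if (high > 0) {
                fprintf(out, "Line %ld: %ld high-bit byte(s)\n", line, high);
                ++flagged;
            }
            high = 0;
            ++line;
        }
    }
    /* A final line without a trailing newline still gets reported. */
    if (high > 0) {
        fprintf(out, "Line %ld: %ld high-bit byte(s)\n", line, high);
        ++flagged;
    }
    return flagged;
}
```

Calling report_high_bit_lines(stdin, stdout) from main() would turn this into a filter you can feed a file on standard input, the same way as the program above.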

--Chris

