onehourhacks-GameSpot: Finding Strings in Raw Data

Occasionally you'll want to pull human-readable strings out of binary data such as JPEGs, EXEs, or even raw disk dumps. Common reasons include:

Performing malware analysis
Finding lost files
Efficient searching

Algorithm

Extracting interesting text from an arbitrary file is somewhat difficult because:

you don't know what format the file is in
you don't know what language the text is in
you don't know how relevant a snippet is

Therefore, the only thing we can rely on is whether each character is printable or not. Here, printable means ASCII letters, digits, punctuation, and whitespace. Any further filtering has to weed out the junk without discarding things like IP addresses, which may look very much like junk, but are not.
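
To make the printable versus non-printable split concrete, here is a minimal sketch that pulls printable runs out of a chunk of bytes. It is not the script from this post: the name printable_runs, the regular expression, and the sample bytes are my own assumptions for illustration, and the five-character minimum simply mirrors the length check in the script further down.

```python
import re

# Assumption: "printable" means ASCII 0x20-0x7E plus tab, CR and LF.
# Runs shorter than five bytes are ignored.
PRINTABLE_RUN = re.compile(br"[\x20-\x7e\t\r\n]{5,}")

def printable_runs(data):
    """Return every run of five or more printable bytes as text."""
    return [m.group().decode("ascii") for m in PRINTABLE_RUN.finditer(data)]

# Made-up sample: binary noise around a short tag, an IP address and a notice.
sample = b"\x00\x01JFIF\x00\xff192.168.0.1\x00\xfeCopyright (c) 1998\x00"
print(printable_runs(sample))
# ['192.168.0.1', 'Copyright (c) 1998']
```

Note that the IP address survives this step even though it contains no letters, while the four-character JFIF tag is dropped for being too short.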

The method I found that works relatively well is this:

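Look for runs of printable characters in a file that are at least 5 characters long.
Within each of those strings, look for runs of uppercase letters, lowercase letters, or numbers.
For each character of each run, set the value of the nth character to n - 1, so the first character of a run scores 0.
Sum the values over the whole string to get its score.
Sort the strings by score, highest first, and return them.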

This algorithm is implemented in the code below.
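
As a quick worked example of the scoring step, here is a simplified sketch. The helper names char_class and run_score are hypothetical, and unlike the full script this version does not fold spaces and common punctuation into the lowercase class. Positions are counted from zero, as the script's runs list does.

```python
import string

def char_class(c):
    """Classify one character (hypothetical helper, simplified classes)."""
    if c in string.ascii_uppercase:
        return "upper"
    if c in string.ascii_lowercase:
        return "lower"
    if c in string.digits:
        return "digit"
    return "other"

def run_score(text):
    """Sum each character's zero-based position within its run."""
    total = 0
    position = 0       # position of the current character inside its run
    previous = None    # class of the previous character
    for c in text:
        current = char_class(c)
        if current != "other" and current == previous:
            position += 1
        else:
            position = 0   # a class change or an "other" character resets the run
        total += position
        previous = current
    return total

print(run_score("rightOutsetlong"))
# 46
```

The string splits into the runs "right" (0+1+2+3+4 = 10), "O" (0) and "utsetlong" (0+1+...+8 = 36), which adds up to the 46 shown for rightOutsetlong in the sample output.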

Outputs

The software below replaces newlines and carriage returns with spaces so that each string stays on a single line, and prints out items in the following format:

<score><tab><file><tab><string><newline>

Example:

00000250    dd.img  &'()*56789:CDEFGHIJSTUVWXYZcdefghijstuvwxyz
00000176    dd.img  .IEC 61966-2.1 Default RGB colour space - sRGB
00000176    dd.img  .IEC 61966-2.1 Default RGB colour space - sRGB
00000138    dd.img  ;CREATOR: gd-jpeg v1.0 (using IJG JPEG v62), quality = 95 
00000136    dd.img  ,Reference Viewing Condition in IEC61966-2.1
00000136    dd.img  ,Reference Viewing Condition in IEC61966-2.1
00000099    dd.img  Copyright (c) 1998 Hewlett-Packard Company
00000078    dd.img  <?xpacket begin='
00000075    dd.img  Hhttp://ns.adobe.com/xap/1.0/
00000075    dd.img  Adobe Photoshop CS2 Macintosh
00000063    dd.img  $54545,,4,44455,554,,4,,,4,55456,,,,,,/5554-,,4,24,
00000058    dd.img  IEC http://www.iec.ch
00000058    dd.img  IEC http://www.iec.ch
00000051    dd.img  bottomOutsetlong
00000046    dd.img  rightOutsetlong

It appears this unallocated space actually has files!
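
For reference, one row of that output can be rebuilt with a single format call. This is only a sketch: format_record is a hypothetical helper, and it assumes integer scores, so it uses the d format type where the script itself uses {:08.0f}.

```python
def format_record(score, path, text):
    """Build one output line, without the trailing newline."""
    # Score zero-padded to eight digits, then file and string, tab-separated.
    return "{:08d}\t{}\t{}".format(score, path, text)

line = format_record(46, "dd.img", "rightOutsetlong")
print(repr(line))
# '00000046\tdd.img\trightOutsetlong'
```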

Operation

Save the code below as reader.py and pass it the paths of the files you want it to read. There are no other options.
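
Since the script has no filtering options, one way to narrow the results is to post-process its tab-separated output. The sketch below is an assumption about how you might do that, not part of reader.py: parse_records and filter_records are hypothetical names, and the sample lines are copied from the example output above.

```python
def parse_records(lines):
    """Yield (score, path, text) tuples from reader.py output lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the first two tabs only, because the string may contain tabs.
        score, path, text = line.split("\t", 2)
        yield int(score), path, text

def filter_records(lines, needle, min_score=0):
    """Keep records whose string contains needle and scores at least min_score."""
    return [r for r in parse_records(lines)
            if r[0] >= min_score and needle in r[2]]

sample = [
    "00000075\tdd.img\tHhttp://ns.adobe.com/xap/1.0/\n",
    "00000058\tdd.img\tIEC http://www.iec.ch\n",
    "00000046\tdd.img\trightOutsetlong\n",
]
print(filter_records(sample, "http"))
# [(75, 'dd.img', 'Hhttp://ns.adobe.com/xap/1.0/'), (58, 'dd.img', 'IEC http://www.iec.ch')]
```

In practice you would redirect the script's output to a file and pass the open file object as lines.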

Source

#!/usr/bin/env python
'''
A program that extracts strings from binary files.

Usage: reader.py file [file ...]

Copyright (c) 2013, Joseph Lewis III <joseph@josephlewis.net> | <joehms22@gmail.com>
All rights reserved.

Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:

    Redistributions of source code must retain the above copyright notice, this 
    list of conditions and the following disclaimer.
    
    Redistributions in binary form must reproduce the above copyright notice, 
    this list of conditions and the following disclaimer in the documentation 
    and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND 
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES 
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON 
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

'''

import sys

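# Character classes used by the scoring heuristic.
ALPHABET_UPPER = frozenset("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
# Lowercase letters plus common prose punctuation and whitespace, so a run of
# lowercase text is not reset by spaces, commas, hyphens or full stops.
ALPHABET = frozenset("abcdefghijklmnopqrstuvwxyz,-. \t\n")
NUMBERS = frozenset("0123456789")
# Punctuation and whitespace; a string made up only of these is treated as junk.
OTHER_CHARACTERS = frozenset("!\"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r")
# Every character that may appear in an extracted string.
PRINTABLE = frozenset('0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r')
# Number of bytes read from the file per iteration.
READ_BLOCK_SIZE_B = 2048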

def type_of_char(c):
    if c in ALPHABET_UPPER:
        return 'A'
    if c in ALPHABET:
        return 'a'
    if c in NUMBERS:
        return 'n'
    return 'o'

def value_of_string(string):
    ''' A heuristic to give an approximate value to a string.

    Each character scores its zero-based position within the current run of
    same-typed characters; the value of the string is the sum of those scores.
    '''

    # make sure the string isn't total junk
    if all(c in OTHER_CHARACTERS for c in string):
        return 0

    types = [type_of_char(c) for c in string]
    runs = [0] * len(string)

    # longer runs are better; 'o' (other) characters always reset the run
    for i in range(1, len(types)):
        if types[i] != 'o' and types[i] == types[i - 1]:
            runs[i] = runs[i - 1] + 1

    return sum(runs)

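def process_string(string, path, extracts):
    '''Scores a string and records it if it is long and interesting enough.'''
    if len(string) < 5:
        return

    score = value_of_string(string)

    if score > 0:
        # path is passed in explicitly rather than read from a global
        extracts.append((score, path, string))

def extract_strings(path, extracts):
    '''Extracts "interesting" strings from a file'''

    mystr = []

    with open(path, 'rb') as fd:
        while True:
            # Decode as Latin-1 so every byte maps to exactly one character;
            # this keeps the loop working on both Python 2 and Python 3.
            chars = fd.read(READ_BLOCK_SIZE_B).decode('latin-1')
            if not chars:
                break

            for char in chars:
                if char in "\r\n":
                    mystr.append(" ")
                elif char in PRINTABLE:
                    mystr.append(char)
                elif mystr:
                    # a non-printable character ends the current string
                    process_string("".join(mystr), path, extracts)
                    mystr = []

    # flush a string that runs up to the end of the file
    process_string("".join(mystr), path, extracts)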


if __name__ == "__main__":
    if len(sys.argv) == 1:
        print("Usage: {} file [file ...]".format(sys.argv[0]))
        sys.exit(1)

    # a list of tuples (score, document, string)
    extracts = []

    for path in sys.argv[1:]:
        extract_strings(path, extracts)

    extracts = sorted(extracts, reverse=True)
    for score, path, string in extracts:
        print("{:08.0f}\t{}\t{}".format(score, path, string))

Thursday, May 30, 2013
