Thursday, May 30, 2013

Finding Strings in Raw Data

Occasionally, you'll find yourself wishing you could pull out human-readable strings from binary formats, like JPEGs, EXEs or even raw disk dumps. This may be for a variety of reasons, such as:
  • Performing malware analysis
  • Finding lost files
  • Efficient searching

Algorithm

Extracting interesting text from a file is difficult because:
  • you don't know what format the file is in
  • you don't know what language the text is in
  • you don't know how relevant a snippet is
Therefore, the only thing we can rely on is whether each character is printable, which here means an ASCII letter, digit, punctuation mark, or whitespace character. From there, we can use the shape of each run of printable characters to weed out the junk without discarding things like IP addresses, which may look very much like junk but are not.
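
To make "printable" concrete, here is a minimal sketch that pulls runs of printable bytes out of a blob. It assumes plain ASCII text; the names `PRINTABLE_BYTES` and `printable_runs` and the `min_length` parameter are made up for illustration and are not part of the program further down.

```python
import string

# Assumption: "printable" means ASCII letters, digits, punctuation, and space.
PRINTABLE_BYTES = frozenset(
    (string.ascii_letters + string.digits + string.punctuation + " ").encode("ascii")
)

def printable_runs(data, min_length=5):
    """Yield each run of printable bytes that is at least min_length long."""
    run = bytearray()
    for byte in bytearray(data):
        if byte in PRINTABLE_BYTES:
            run.append(byte)
            continue
        # a non-printable byte ends the current run
        if len(run) >= min_length:
            yield run.decode("ascii")
        run = bytearray()
    if len(run) >= min_length:
        yield run.decode("ascii")

data = b"\x00\x01hello world\xff\xfeab\x00192.168.0.1"
print(list(printable_runs(data)))  # ['hello world', '192.168.0.1']
```

Note that the IP address survives because periods count as printable, while the two-character run "ab" is dropped for being too short.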

The method I found that works relatively well is this:
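  1. Look for runs of printable characters in a file that are at least 5 characters long.
  2. Within each run, look for sub-runs of uppercase letters, lowercase letters, or numbers.
  3. For each sub-run, give the nth character a value of n - 1, so the first character is worth 0, the second 1, and so on.
  4. Sum the values to get a score for the whole string.
  5. Sort the strings by score, highest first, and return them.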
This algorithm is implemented in the code below.
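
As a worked example of the scoring steps on their own, here is a minimal sketch. The name `run_score` is hypothetical, and it classifies characters with Python's built-in `str` methods rather than the character sets used in the full program, so scores for strings containing punctuation or spaces may differ.

```python
def run_score(text):
    """Sum each character's zero-based position within its run of same-class characters."""
    def char_class(c):
        if c.isupper():
            return "upper"
        if c.islower():
            return "lower"
        if c.isdigit():
            return "digit"
        return None  # anything else never extends a run

    total = 0
    position = 0
    previous = None
    for c in text:
        current = char_class(c)
        if current is not None and current == previous:
            position += 1  # same class as the previous character: the run grows
        else:
            position = 0   # class changed (or not a letter/digit): start over
        total += position
        previous = current
    return total

print(run_score("rightOutsetlong"))  # 46
print(run_score("$%^&*"))            # 0
```

For "rightOutsetlong", the run "right" contributes 0+1+2+3+4 = 10, the lone "O" contributes 0, and "utsetlong" contributes 0+1+...+8 = 36, for a total of 46. Long runs of a single class dominate the score, which is why real words float to the top.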

Outputs

The software below replaces newlines and carriage returns with spaces, and prints each string in the following format:
<score><tab><file><tab><string><newline>
Example:
00000250 dd.img &'()*56789:CDEFGHIJSTUVWXYZcdefghijstuvwxyz
00000176 dd.img .IEC 61966-2.1 Default RGB colour space - sRGB
00000176 dd.img .IEC 61966-2.1 Default RGB colour space - sRGB
00000138 dd.img ;CREATOR: gd-jpeg v1.0 (using IJG JPEG v62), quality = 95
00000136 dd.img ,Reference Viewing Condition in IEC61966-2.1
00000136 dd.img ,Reference Viewing Condition in IEC61966-2.1
00000099 dd.img Copyright (c) 1998 Hewlett-Packard Company
00000078 dd.img <?xpacket begin='
00000075 dd.img Hhttp://ns.adobe.com/xap/1.0/
00000075 dd.img Adobe Photoshop CS2 Macintosh
00000063 dd.img $54545,,4,44455,554,,4,,,4,55456,,,,,,/5554-,,4,24,
00000058 dd.img IEC http://www.iec.ch
00000058 dd.img IEC http://www.iec.ch
00000051 dd.img bottomOutsetlong
00000046 dd.img rightOutsetlong
It appears this unallocated space actually has files!
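
Since the same string can show up more than once, it can be handy to post-process the output. Here is a minimal sketch that assumes the tab-separated format described above; `parse_line`, `unique_hits`, and the `min_score` cutoff are hypothetical names and values chosen for illustration.

```python
def parse_line(line):
    """Split one '<score> TAB <file> TAB <string>' line into (score, file, string)."""
    # maxsplit=2 keeps any tabs inside the extracted string intact
    score, path, text = line.rstrip("\n").split("\t", 2)
    return int(score), path, text

def unique_hits(lines, min_score=50):
    """Return hits scoring at least min_score, dropping exact duplicates, in input order."""
    seen = set()
    hits = []
    for line in lines:
        hit = parse_line(line)
        if hit[0] >= min_score and hit not in seen:
            seen.add(hit)
            hits.append(hit)
    return hits

lines = [
    "00000058\tdd.img\tIEC http://www.iec.ch\n",
    "00000058\tdd.img\tIEC http://www.iec.ch\n",
    "00000046\tdd.img\trightOutsetlong\n",
]
print(unique_hits(lines))  # [(58, 'dd.img', 'IEC http://www.iec.ch')]
```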

Operation

Save the code below as reader.py and pass it the paths of the files you want it to read. It takes no other options.

Source

#!/usr/bin/env python
'''
A program that extracts strings from binary files.

Usage: reader.py file [file ...]

Copyright (c) 2013, Joseph Lewis III <joseph@josephlewis.net> | <joehms22@gmail.com>
All rights reserved.

Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.

Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

'''

import sys

ALPHABET_UPPER = frozenset("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
ALPHABET = frozenset("abcdefghijklmnopqrstuvwxyz,-. \t\n")
NUMBERS = frozenset("0123456789")
OTHER_CHARACTERS = frozenset("!\"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r")
PRINTABLE = frozenset('0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r')
READ_BLOCK_SIZE_B = 2048

def type_of_char(c):
    if c in ALPHABET_UPPER:
        return 'A'
    if c in ALPHABET:
        return 'a'
    if c in NUMBERS:
        return 'n'
    return 'o'

def value_of_string(string):
    ''' A heuristic to give an approximate value to a string. '''

    # make sure the string isn't total junk
    if all([c in OTHER_CHARACTERS for c in string]):
        return 0

    types = [type_of_char(c) for c in string]
    runs = [0 for c in string]

    # higher runs are better
    for i in range(1, len(types)):
        if types[i] == 'o':
            runs[i] = 0
        elif types[i] == types[i - 1]:
            runs[i] = runs[i - 1] + 1
        else:
            runs[i] = 0

    return sum(runs)

def process_string(string, path, extracts):
    ''' Scores a string and records it if it is long enough and not junk. '''
    if len(string) < 5:
        return

    score = value_of_string(string)

    if score > 0:
        extracts.append((score, path, string))

def extract_strings(path, extracts):
    '''Extracts "interesting" strings from a file'''

    mystr = []

    with open(path, 'rb') as fd:
        # decode as latin-1 so every byte maps to exactly one character
        chars = fd.read(READ_BLOCK_SIZE_B).decode('latin-1')

        while chars:
            for char in chars:
                if char in "\r\n":
                    mystr.append(" ")
                elif char in PRINTABLE:
                    mystr.append(char)
                elif len(mystr) == 0:
                    continue
                else:
                    # a non-printable byte ends the current run
                    process_string("".join(mystr), path, extracts)
                    mystr = []
            chars = fd.read(READ_BLOCK_SIZE_B).decode('latin-1')

    # flush whatever run was still open at the end of the file
    process_string("".join(mystr), path, extracts)


if __name__ == "__main__":
    if len(sys.argv) == 1:
        print("Usage: {} file [file ...]".format(sys.argv[0]))
        sys.exit(1)

    # a list of tuples (score, document, string)
    extracts = []

    for path in sys.argv[1:]:
        extract_strings(path, extracts)

    # highest scores first
    extracts = sorted(extracts, reverse=True)
    for score, path, string in extracts:
        print("{:08d}\t{}\t{}".format(score, path, string))
