JPEGs
, EXEs
or even raw disk dumps. This may be for a variety of reasons, such as:- Performing malware analysis
- Finding lost files
- Efficient searching
Algorithm
Actually extracting interesting text from a file is somewhat difficult because,- you don't know what format the file is in
- you don't know what language the text is in
- you don't know how relevant a snippet is
The method I found that works relatively well is this:
- Look for runs of printable characters in a file with length greater than 5
- Look for runs of uppercase letters, lowercase letters, or numbers.
- For each character of each run, set the value of the nth letter to n.
- Sum the run totals
- Sort the scores and return.
Outputs
The software below strips special terminal characters (like newlines and carriage returns), and prints out items in the following format:<score><tab><file><tab><string><newline>
Example:00000250 dd.img &'()*56789:CDEFGHIJSTUVWXYZcdefghijstuvwxyz
00000176 dd.img .IEC 61966-2.1 Default RGB colour space - sRGB
00000176 dd.img .IEC 61966-2.1 Default RGB colour space - sRGB
00000138 dd.img ;CREATOR: gd-jpeg v1.0 (using IJG JPEG v62), quality = 95
00000136 dd.img ,Reference Viewing Condition in IEC61966-2.1
00000136 dd.img ,Reference Viewing Condition in IEC61966-2.1
00000099 dd.img Copyright (c) 1998 Hewlett-Packard Company
00000078 dd.img <?xpacket begin='
00000075 dd.img Hhttp://ns.adobe.com/xap/1.0/
00000075 dd.img Adobe Photoshop CS2 Macintosh
00000063 dd.img $54545,,4,44455,554,,4,,,4,55456,,,,,,/5554-,,4,24,
00000058 dd.img IEC http://www.iec.ch
00000058 dd.img IEC http://www.iec.ch
00000051 dd.img bottomOutsetlong
00000046 dd.img rightOutsetlong
It appears this unallocated space actually has files!Operation
Save the below code toreader.py
and provide it paths to the files you want it to read. There are no parameters.Source
#!/usr/bin/env python
'''
A program that extracts strings from binary files.
Usage: reader.py file [file ...]
Copyright (c) 2013, Joseph Lewis III <joseph@josephlewis.net> | <joehms22@gmail.com>
All rights reserved.
Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:
Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
'''
import sys
ALPHABET_UPPER = frozenset("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
ALPHABET = frozenset("abcdefghijklmnopqrstuvwxyz,-. \t\n")
NUMBERS = frozenset("0123456789")
OTHER_CHARACTERS = frozenset("!\"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r")
PRINTABLE = frozenset('0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r')
READ_BLOCK_SIZE_B = 2048
def type_of_char(c):
if c in ALPHABET_UPPER:
return 'A'
if c in ALPHABET:
return 'a'
if c in NUMBERS:
return 'n'
return 'o'
def value_of_string(string):
''' A heuristic to give an approximate value to a string. '''
# make sure the string isn't total junk
if all([c in OTHER_CHARACTERS for c in string]):
return 0
types = [type_of_char(c) for c in string]
runs = [0 for c in string]
# higher runs are better
for i in range(1, len(types)):
if types[i] == 'o':
runs[i] = 0
elif types[i] == types[i - 1]:
runs[i] = runs[i - 1] + 1
else:
runs[i] = 0
return sum(runs)
def process_string(string, extracts):
if len(string) < 5:
return
score = value_of_string(string)
if score > 0:
extracts.append((score, path, string))
def extract_strings(path, extracts):
'''Extracts "interesting" strings from a file'''
mystr = []
with open(path, 'rb') as fd:
chars = fd.read(READ_BLOCK_SIZE_B)
while chars != "":
for char in chars:
if char in "\r\n":
mystr.append(" ")
elif char in PRINTABLE:
mystr.append(char)
elif len(mystr) == 0:
continue
else:
process_string("".join(mystr), extracts)
mystr = []
chars = fd.read(READ_BLOCK_SIZE_B)
process_string("".join(mystr), extracts)
if __name__ == "__main__":
if len(sys.argv) == 1:
print("Usage: {} file [file ...]".format(sys.argv[0]))
exit(1)
# a list of tuples (score, document, string)
extracts = []
for path in sys.argv[1:]:
extract_strings(path, extracts)
extracts = sorted(extracts, reverse=True)
for score, path, string in extracts:
print("{:08.0f}\t{}\t{}".format(score, path, string))