Thursday, May 30, 2013

Finding Strings in Raw Data

Occasionally, you'll find yourself wishing you could pull out human-readable strings from binary formats, like JPEGs, EXEs or even raw disk dumps. This may be for a variety of reasons, such as:
  • Performing malware analysis
  • Finding lost files
  • Efficient searching

Algorithm

Actually extracting interesting text from a file is somewhat difficult because,
  • you don't know what format the file is in
  • you don't know what language the text is in
  • you don't know how relevant a snippet is
Therefore, we can only rely on if a character is printable or non-printable, whatever that means. Then we can use information to try and weed out the junk without compromising things like IP addresses, which may look very much like junk, but are not.

The method I found that works relatively well is this:
  1. Look for runs of printable characters in a file with length greater than 5
  2. Look for runs of uppercase letters, lowercase letters, or numbers.
  3. For each character of each run, set the value of the nth letter to n.
  4. Sum the run totals
  5. Sort the scores and return.
This algorithm is implemented in the code below.

Outputs

The software below strips special terminal characters (like newlines and carriage returns), and prints out items in the following format:
<score><tab><file><tab><string><newline>
Example:
00000250    dd.img  &'()*56789:CDEFGHIJSTUVWXYZcdefghijstuvwxyz
00000176 dd.img .IEC 61966-2.1 Default RGB colour space - sRGB
00000176 dd.img .IEC 61966-2.1 Default RGB colour space - sRGB
00000138 dd.img ;CREATOR: gd-jpeg v1.0 (using IJG JPEG v62), quality = 95
00000136 dd.img ,Reference Viewing Condition in IEC61966-2.1
00000136 dd.img ,Reference Viewing Condition in IEC61966-2.1
00000099 dd.img Copyright (c) 1998 Hewlett-Packard Company
00000078 dd.img <?xpacket begin='
00000075 dd.img Hhttp://ns.adobe.com/xap/1.0/
00000075 dd.img Adobe Photoshop CS2 Macintosh
00000063 dd.img $54545,,4,44455,554,,4,,,4,55456,,,,,,/5554-,,4,24,
00000058 dd.img IEC http://www.iec.ch
00000058 dd.img IEC http://www.iec.ch
00000051 dd.img bottomOutsetlong
00000046 dd.img rightOutsetlong
It appears this unallocated space actually has files!

Operation

Save the below code to reader.py and provide it paths to the files you want it to read. There are no parameters.

Source

#!/usr/bin/env python
'''
A program that extracts strings from binary files.

Usage: reader.py file [file ...]

Copyright (c) 2013, Joseph Lewis III <joseph@josephlewis.net> | <joehms22@gmail.com>
All rights reserved.

Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.

Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

'''

import sys

ALPHABET_UPPER = frozenset("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
ALPHABET = frozenset("abcdefghijklmnopqrstuvwxyz,-. \t\n")
NUMBERS = frozenset("0123456789")
OTHER_CHARACTERS = frozenset("!\"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r")
PRINTABLE = frozenset('0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r')
READ_BLOCK_SIZE_B = 2048

def type_of_char(c):
if c in ALPHABET_UPPER:
return 'A'
if c in ALPHABET:
return 'a'
if c in NUMBERS:
return 'n'
return 'o'

def value_of_string(string):
''' A heuristic to give an approximate value to a string. '''

# make sure the string isn't total junk
if all([c in OTHER_CHARACTERS for c in string]):
return 0

types = [type_of_char(c) for c in string]
runs = [0 for c in string]

# higher runs are better
for i in range(1, len(types)):
if types[i] == 'o':
runs[i] = 0
elif types[i] == types[i - 1]:
runs[i] = runs[i - 1] + 1
else:
runs[i] = 0

return sum(runs)

def process_string(string, extracts):
if len(string) < 5:
return

score = value_of_string(string)

if score > 0:
extracts.append((score, path, string))

def extract_strings(path, extracts):
'''Extracts "interesting" strings from a file'''

mystr = []

with open(path, 'rb') as fd:
chars = fd.read(READ_BLOCK_SIZE_B)

while chars != "":
for char in chars:
if char in "\r\n":
mystr.append(" ")
elif char in PRINTABLE:
mystr.append(char)
elif len(mystr) == 0:
continue
else:
process_string("".join(mystr), extracts)
mystr = []
chars = fd.read(READ_BLOCK_SIZE_B)

process_string("".join(mystr), extracts)


if __name__ == "__main__":
if len(sys.argv) == 1:
print("Usage: {} file [file ...]".format(sys.argv[0]))
exit(1)

# a list of tuples (score, document, string)
extracts = []

for path in sys.argv[1:]:
extract_strings(path, extracts)

extracts = sorted(extracts, reverse=True)
for score, path, string in extracts:
print("{:08.0f}\t{}\t{}".format(score, path, string))

Friday, May 24, 2013

Friday Operating System News

Operating Systems

NixOS is a neat little OS that is built around a single config file. The file can be moved to a different computer, and it'll set itself up as a clone, or as you're more likely to use it, as a single file that does version control on your system (so next time you upgrade, you can roll it back right away!)

Jolla has released it's first phones running Sailfish OS.

Wednesday, May 15, 2013

Finding Deleted JPEGs in Disk Dumps

If you've ever accidentally deleted items from your digital camera, or want to find them from a drive that has been lost, this script could be for you!

It works by looking through a disk image for a sequence of bytes that signify the beginning and end of a JPEG image.

Running It

First save the below script as ImageCarver.java

Then compile it:
javac ImageCarver.java
Afterwards, run it with the path to the disk image you want it to search:
java ImageCarver disk.img
It will report images it found, as well as the locations in the disk that it found them:
1.jpg 0xe3fa 0x10272
2.jpg 0xe000 0x20865
3.jpg 0x1000 0x2557e
4.jpg 0x41000 0x4437d
5.jpg 0x26000 0x654d4
6.jpg 0x25000 0x96816
7.jpg 0x86000 0x9eca0
8.jpg 0x6d000 0xdc614
9.jpg 0xe2000 0xe7ac9
10.jpg 0xa4000 0xeb28a
11.jpg 0xed000 0xfc9a9
12.jpg 0x136000 0x13b08e
13.jpg 0x1b5144 0x1b629e
14.jpg 0x1b5000 0x1b794a
15.jpg 0x1b67f0 0x1bea75
16.jpg 0x1c9000 0x1cfada
17.jpg 0x22d000 0x237824
18.jpg 0x281000 0x288928

The Script

import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.LinkedList;

/**
* Carves images from a file given from the command line.
*
* @author Joseph Lewis <joehms22@gmail.com>
*
*/
public class ImageCarver
{
byte[] STARTING_BYTES = new byte[]{(byte)0xFF, (byte)0xD8,(byte)0xFF};
byte[] ENDING_BYTES = new byte[]{(byte)0xFF,(byte)0xD9};
int imgno = 0;


/**
* Carves images out of a file.
* @param sbc - the byte channel to read from.
* @throws IOException - on a file error
*/
public ImageCarver(SeekableByteChannel sbc) throws IOException
{
// find all potential starting and ending points in the file.
LinkedList<Long> startingPoints = ringMatch(sbc, STARTING_BYTES);
LinkedList<Long> endingPoints = ringMatch(sbc, ENDING_BYTES);

// keep writing out we're out of images
while(startingPoints.size() > 0)
{
carveImage(sbc, startingPoints, endingPoints);
}
}

/**
* Carves an image out of the file.
* @param sbc - the channel to read from
* @param starts - the start positions of the image
* @param ends - the end positions of the image
* @throws IOException - on error
*/
public void carveImage(SeekableByteChannel sbc, LinkedList<Long> starts, LinkedList<Long> ends) throws IOException
{
// clear off no longer useful end points
while(starts.peek() > ends.peek())
{
ends.pop();
}

// If ends is empty, add the literal file end.
if(ends.size() == 0)
{
ends.push(sbc.size());
}

// get our starting position.
long sp = starts.pop();

// Check to see if there's an image inside of us.
if(starts.size() > 0 && starts.peek() < ends.peek())
{
carveImage(sbc, starts, ends);
}

// get our closest end point.
long endloc = ends.pop();

imgno++;
createImage(imgno + ".jpg", sbc, sp, endloc);
}

/**
* Extracts an image from the SeekableByteChannel.
* @param name - the image name
* @param src - the channel to read from
* @param begin - the offset to the beginning of the image
* @param end - the offset to the end of the image
* @throws IOException - on error
*/
public void createImage(String name, SeekableByteChannel src, long begin, long end) throws IOException
{
// alert the user about the image we're outputting
System.out.println(String.format("%s 0x%x 0x%x", name, begin, end));

// write out the given slice of bytes
try(OutputStream os = Files.newOutputStream(Paths.get(name), StandardOpenOption.CREATE))
{
src.position(begin);
ByteBuffer b = ByteBuffer.allocate((int) (end - begin));
src.read(b);
os.write(b.array());
}
}

/**
* Searches through a channel looking for a given array of bytes to match.
*
* @param sbc - the channel to read from
* @param toMatch - the bytes to match in the stream
* @return a list of all offsets where the given bytes are found
* @throws IOException - on file read error
*/
public LinkedList<Long> ringMatch(SeekableByteChannel sbc, byte[] toMatch) throws IOException
{
// holds all found vars
LinkedList<Long> finds = new LinkedList<Long>();

for(long i = 0; i < sbc.size() - toMatch.length; i++)
{
ByteBuffer b = ByteBuffer.allocate(toMatch.length);
sbc.position(i);
sbc.read(b);

boolean matches = true;
for(int j = 0; j < toMatch.length; j++)
{
if(b.array()[j] != toMatch[j])
{
matches = false;
break;
}
}

if(matches)
{
finds.add(i);
}

}
return finds;
}

public static void main(String[] args) throws Exception
{
// Make sure we have an arg.
if(args.length != 1)
{
System.out.println("Must supply exactly one argument.");
return;
}

Path p = Paths.get(args[0]);

try (SeekableByteChannel sbc = Files.newByteChannel(p))
{
new ImageCarver(sbc);
}
catch (IOException x)
{
System.out.println("Caught exception: " + x);
}
}
}

Friday, May 10, 2013

Doing a Diff on a Specific Git Version

Sometimes you just need to look up something that changed between versions of a repository instead of reverting a full file, here two git commands come in handy.

The first shows the log of your recent commits, along with hex numbers which represent the versions:
git log
When you find the revision you want, just type:
git show af78ec673f46db23ae5537a8dbea1253e6e844f3
replacing the revision number with the one you want, and you'll get a diff from that version!

Thursday, May 2, 2013

If A Robot Makes Your Makefile

If a robot makes your Makefile, it is no better than pressing that magic green arrow at the top of an IDE and watching the project compile.
 
LIST=CPU
ifndef RECURSE
RECURSE=recurse.mk
ifdef CONFIG
RDIR=$(dir $(CONFIG))
endif
endif
include settings.mk
include $(RDIR)$(RECURSE)

While a robot may easily understand this, it isn't any better than using g++ by hand.

Makefiles, in this programmer's humble opinion, should be clean, elegant, and easy to understand. They are the easiest way to immediately jump in to a project and figure out what is important and the way it is structured.
Even more importantly, they can point to dead ends and code that is never actually accessed by the program, keeping wasted days at bay.