Wednesday, October 31, 2012

Fake Text Generation Through Markov Chains

A Markov chain can be roughly described as a finite state machine in which you know the probability of transitioning from one state to another. In this case, we build a graph of words based upon an input text file: for each word in the text, we pair it with the word right after it and count the occurrences, e.g.

"The quick red fox jumped over the lazy brown dog" is turned into:
  • <START> the
  • the quick
  • quick red
  • ...
  • the lazy
  • lazy brown
  • brown dog
  • dog <END>
In this case the word "the" has a 50% chance of being followed by "quick" and a 50% chance of being followed by "lazy"; when constructing a new text, we pick a start word, then choose the word right after it according to those probabilities. Rinse, lather, repeat until you hit an <END>.
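The counting scheme above can be sketched in a few lines of Python (a minimal illustration, separate from the full program at the bottom of the page):

```python
text = "the quick red fox jumped over the lazy brown dog".split()

# Count, for every word, how often each word follows it.
follows = {}
last = "<START>"
for word in text + ["<END>"]:
    follows.setdefault(last, {})
    follows[last][word] = follows[last].get(word, 0) + 1
    last = word

# "the" is followed by "quick" once and "lazy" once: a 50/50 split.
print(follows["the"])   # {'quick': 1, 'lazy': 1}
```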

The code at the bottom of the page loads a text file, cleans and parses it, then generates 150 "random" sentences.

Here are some fun generations:

Alice's Adventures in Wonderland
Will you had made her lips. I'm on it did not get to herself and the reason so savage when i dare say you so Alice did so bills got into a little recovered his voice. i suppose by this she had been it kills all like said it does it was a present thought it away. oh don't believe i wonder at his ear and who for the other however this so he said Alice you know why then came upon her head must crossexamine the newspapers at one hand in a farmer you know when it must be more puzzled her for she said Alice. once in that the sort of that said the neck nicely by the air.
The Holy Bible
165 none delivereth souls by the lees of their portion of padon 748 the bason of hosts shall he will bless thee and elms because they knew him 710 so marred more because thou king of life. 1011 out of any more to me that by ship his heart for a flock. 87 and he will come near unto you all the fleece and to hebron for the winds of thy men with cunning work. 347 when they shall yet again unto you that the disciples when they unto the dwellings have i the sight and save one had been carried away. 612 and five years and i will diminish from rithmah and lord.
The Bill of Rights (USA)
 the powers not be compelled in the people. ix the people peaceably to answer for a speedy and to the right of the accusation to the states by congress september 25 1789 ratified december 15 1791 i congress september 25 1789 ratified december 15 1791 i congress shall have compulsory process for his defense. iv the ten original amendments to have been committed which district wherein the owner nor excessive bail shall enjoy the rules of life liberty or the free state the constitution nor shall private property be searched and the right of the people.
Now to the code!


#!/usr/bin/env python3
'''
Generates "fake" text from an input document using a two-gram Markov chain.

Copyright 2012 Joseph Lewis <joseph@josephlewis.net> | <joehms22@gmail.com>

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.

Redistributions in binary form must reproduce the above
copyright notice, this list of conditions and the following disclaimer
in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

2012-10-28 - Initial Work
'''
import random
import pickle
import re

START_WORD = "<START>"
TERMINAL_WORD = "<END>"

def document_to_terms(doc):
    '''Parses a document into a list of lowercase words, stripping punctuation.'''
    doc = doc.lower()
    doc = re.sub(r"[^\w\s]+", '', doc)
    return doc.split()


class TwoGramCorpus:
    def __init__(self, fp=None):
        if fp:
            self.words = pickle.load(fp)
        else:
            self.words = {}

    def save(self, fp):
        pickle.dump(self.words, fp, 1)

    def load_file(self, fp):
        for sentence in fp.read().split("."):
            self.load_line(document_to_terms(sentence))

    def load_line(self, line):
        if len(line) == 0:
            return

        line.append(TERMINAL_WORD)

        lastword = START_WORD
        for word in line:
            self.add_2gram(lastword, word)
            lastword = word

    def add_2gram(self, first, second):
        try:
            firstlist = self.words[first]
        except KeyError:
            self.words[first] = {second: 1}
            return

        try:
            firstlist[second] += 1
        except KeyError:
            firstlist[second] = 1

    def __str__(self):
        return str(self.words)

    def _random_prob_word(self, start):
        '''Chooses a random word that follows the given one, with
        probability proportional to the frequency of its occurrence.
        '''
        rest = self.words[start]
        total_possible = sum(rest.values())

        chosen = random.randint(1, total_possible)

        for word, value in rest.items():
            chosen -= value
            if chosen <= 0:
                return word

        return None

    def generate_sentence(self):
        # Choose a random start word, then walk the chain until <END>.
        sentence = []
        lastword = self._random_prob_word(START_WORD)

        while lastword != TERMINAL_WORD:
            sentence.append(lastword)
            lastword = self._random_prob_word(lastword)

        return " ".join(sentence) + "."

    def generate(self, num_sentences):
        sentences = []
        for i in range(num_sentences):
            sentences.append(self.generate_sentence())

        return " ".join(sentences)


if __name__ == "__main__":
    tgc = TwoGramCorpus()
    with open('/path/to/text/file.txt') as fp:
        tgc.load_file(fp)
    print(tgc.generate(150))
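As an aside, `_random_prob_word` implements weighted sampling by hand with a subtraction loop; the same draw can be made with `random.choices`, available since Python 3.6 (the follower counts below are hypothetical, in the shape `self.words` uses):

```python
import random

# Hypothetical follower counts for one word, as stored in self.words.
followers = {"quick": 1, "lazy": 1, "big": 2}

# Draw one follower with probability proportional to its count --
# equivalent to the subtraction loop in _random_prob_word.
word = random.choices(list(followers), weights=list(followers.values()), k=1)[0]
```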

Uses 
  • Discovering an artist's style
  • Writing fake text for websites
  • Writing fake books (some on Amazon are created this way by bots)
  • Passing around statistical information about documents without violating copyright

Saturday, October 27, 2012

A Java eval() Implementation

I recently came across a situation where I had to evaluate expressions input by the user. In Java there are quite a few ways to do this, but the existing solutions had several downsides:
  • Resource hungry: 3 MB libraries that greatly increase JAR file size
  • RAM hungry: some store strings indefinitely
  • Insecure: some compile the code with the JVM, then load it as a class
  • Difficult to integrate
After working with Mozilla's Rhino for around eight months, it became burdensome, so I wrote my own interpreter that:
  • Can do the standard operations * / + - ( )
  • Can optionally run "functions" or use "variables"
  • Is fast: as fast as Rhino for small one-off expressions
  • Is small: under 7 KB compiled, before being put in a JAR
The code is on Github, under a BSD license.

Running it is quite simple: just download the JAR, add it to your project, and use the following examples as guides.

Simple Example:

import net.josephlewis.jeval.Eval;
import net.josephlewis.jeval.MalformedExpression;

public class BasicExample {
    public static void main(String[] args) {
        // A simple expression.
        try {
            System.out.println(Eval.eval("2 * 3 + 5"));
        } catch (MalformedExpression e) {
            System.err.println(e.getMessage());
        }

        // Check to see if an expression is valid or not.
        System.out.println("is valid? (should be true)");
        System.out.println(Eval.isErrorFree("2 * 3 + 5", null));

        System.out.println("is valid? (should be false)");
        System.out.println(Eval.isErrorFree("2 * 3 +", null));
    }
}

Example With Variables/Function Calls:

import net.josephlewis.jeval.Eval;
import net.josephlewis.jeval.MalformedExpression;
import net.josephlewis.jeval.VariableStore;

public class VariableStoreExample implements VariableStore {

    /**
     * Not a very useful getValue; it just sums the numeric values of the
     * characters in the given string and returns the total.
     */
    @Override
    public double getValue(String variable) throws MalformedExpression {
        double b = 0;

        for (char c : variable.toCharArray())
            b += Character.getNumericValue(c);

        return b;
    }

    public static void main(String[] args) {
        VariableStore vse = new VariableStoreExample();

        try {
            System.err.println(Eval.eval("$a - $a", vse));
            System.err.println(Eval.eval("$abc - $cba", vse));

            // Spaces can be included in variable names if they're enclosed
            // within ( and ) or "s.
            System.err.println(Eval.eval("$(hello world) - $(world hello)", vse));
            System.err.println(Eval.eval("$hello(world) - $world(hello)", vse));
            System.err.println(Eval.eval("$hello(\"world\") - $world(\"hello\")", vse));

        } catch (MalformedExpression e) {
            e.printStackTrace();
        }
    }
}

And there you have it: a custom parser for expressions, from as simple as */+-() to as complex as you want through extensions using dollar signs.

After using the old JS parser for over a year in the aforementioned project, this was a breath of fresh air: the JAR file that held Rhino shrank by a few MB, providing relief for my users and the server that had to send it around.

Wednesday, October 17, 2012

First Experiments with Test Driven Development

Last Friday, Baird Ramsey, a test engineer from Google, gave a talk at my school promoting test-driven development. I decided to try it with a new interpreter I was writing for my book on the implementation of full-text search engines (warning: it's currently a rough draft!), and it worked wonderfully.

The last time I wrote an interpreter, the hard part wasn't the concepts; it was getting all the components working in the right order.

In intro programming, I was taught to write the outline of a class/function, write pseudo-code using comments, then turn that pseudocode into real code. It worked something like this:


class Player {
    private int health = 0;

    public int changeHealth(int change) {
        // change the private health
        // return the new health.
    }
}

This style works well enough because it lets you quickly see what should be broken into methods and outlines your algorithm, so mistakes are usually syntactic in nature. But not always, especially when you don't write proper pseudocode or don't have a good idea of what the algorithm should look like.

This time, I wrote tests first and built the program up from the smallest pieces, rather than drilling down and creating new methods as I went along. I suspect this has much to do with the successes experienced by many developers, as they are forced not to create thousand-line functions.

The cycle I followed was something like this:
  1. Write a test
  2. Write code
  3. Make sure it works
  4. Write a test for another case of the same function
  5. Make sure it works; if not, fix it
  6. If not all cases are satisfied, GOTO 4
  7. Move on to the next function
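The cycle above can be sketched in Python with plain assertions standing in for a test framework (the `clamp_health` function and its cases are hypothetical, riffing on the Player example earlier):

```python
# Steps 1-2: write a failing test, then just enough code to pass it.
def clamp_health(health, change, max_health=100):
    """Apply a change to health, keeping the result within [0, max_health]."""
    return max(0, min(max_health, health + change))

# Steps 3-6: one test per case of the same function, fixing until all pass.
assert clamp_health(50, 10) == 60    # normal change
assert clamp_health(5, -10) == 0     # can't drop below zero
assert clamp_health(95, 10) == 100   # capped at the maximum
```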
Not only did I get code that worked the first time I tested it in its entirety, but I was saved from the bug-hunting headaches of my prior escapades, and I got free unit tests for future additions to the codebase.

Tuesday, October 16, 2012

Unified Media Search

This is a simple idea that I've been formulating for some years:

What happens when people want to look back and find something? Interactions with people lost long ago?

Sure, each social media site lets you go back and find the things/people/whatevers you posted long ago, but that doesn't help in the long run, because those sites come and go.

For this I have an idea: some kind of background service that checks your accounts and automatically archives your contacts and your correspondence, chronologically. All of this wouldn't come from just one source, either: emails, texts, voicemails, and forms of communication we haven't even imagined yet, mixed together on a base platform with rich plug-in hooks to process the data.

It would be a tool for social engineers, scholars, detectives, and regular people tired of managing their digital lives across so many services.

The basic service would be fairly simple to build, initial extension points I see would be:
  • Inputs/Parsers (create "messages" or "contacts" from files or services)
  • Reports (visual ways of viewing/manipulating stored data and analyzers)
  • Analyzers (something that creates tentative links between things: phone numbers, IP addresses, latitudes and longitudes, facial recognition, writing styles, etc.)
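A minimal registry for those extension points might look like this (all names here are my own invention, purely illustrative):

```python
from typing import Callable, Dict, List

# Registries for the three extension points: parsers create messages,
# analyzers link them, reports render them.
parsers: List[Callable] = []
analyzers: List[Callable] = []
reports: Dict[str, Callable] = {}

def register_parser(fn):
    """Decorator: register a function that turns raw input into messages."""
    parsers.append(fn)
    return fn

@register_parser
def parse_mbox(path):
    # Hypothetical: read an mbox file and yield message records.
    return []
```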
I suspect the initial data would be fairly simple to structure as well, each message would have:
  • Date
  • Sender(s)
  • Receiver(s)
  • Format
  • Extracted text/Resources
  • Original Message
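The message schema above could be sketched as a Python dataclass (the field names and sample values are my own guesses, not a settled design):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class Message:
    date: datetime
    senders: List[str]
    receivers: List[str]
    format: str            # e.g. "email", "sms", "voicemail"
    extracted_text: str    # text pulled out for search and analysis
    original: bytes        # the raw message, kept verbatim

msg = Message(
    date=datetime(2012, 10, 16),
    senders=["alice@example.com"],
    receivers=["bob@example.com"],
    format="email",
    extracted_text="See you at the meetup tonight?",
    original=b"...raw message bytes...",
)
```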
Basic outputs I would like to see are:
  • Who knows who? A map between people likely to know each other (good for looking up someone you don't remember)
  • Conversation history, show all conversations between two people on a timeline, from all media sources.
  • Happenings, mapping when people got together and where, creating "events" in a timeline.