Wednesday, October 31, 2012

Fake Text Generation Through Markov Chains

A Markov chain can be roughly described as a finite state machine in which you know the probability of transitioning from one state to another. In this case, we build a graph of words based upon an input text file: for each word in the text, we pair it with the word right after it and count the occurrences, e.g.

"The quick red fox jumped over the lazy brown dog" is turned into:
  • <START> the
  • the quick
  • quick red
  • ...
  • the lazy
  • lazy brown
  • brown dog
  • dog <END>
In this case the word "the" has a 50% chance of being followed by "quick" and a 50% chance of being followed by "lazy"; when constructing a new text, we pick a start word, then choose the word right after it according to those probabilities. Rinse, lather, repeat until you hit an <END>.
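The counting scheme above can be sketched in a few lines of Python (a minimal illustration, separate from the full program at the bottom of the page):

```python
text = "the quick red fox jumped over the lazy brown dog".split()

# Count, for every word, how often each word follows it.
follows = {}
last = "<START>"
for word in text + ["<END>"]:
    follows.setdefault(last, {})
    follows[last][word] = follows[last].get(word, 0) + 1
    last = word

# "the" is followed by "quick" once and "lazy" once: a 50/50 split.
print(follows["the"])   # {'quick': 1, 'lazy': 1}
```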

The code at the bottom of the page loads a text file, cleans and parses it, then generates 150 "random" sentences.

Here are some fun generations:

Alice's Adventures in Wonderland
Will you had made her lips. I'm on it did not get to herself and the reason so savage when i dare say you so Alice did so bills got into a little recovered his voice. i suppose by this she had been it kills all like said it does it was a present thought it away. oh don't believe i wonder at his ear and who for the other however this so he said Alice you know why then came upon her head must crossexamine the newspapers at one hand in a farmer you know when it must be more puzzled her for she said Alice. once in that the sort of that said the neck nicely by the air.
The Holy Bible
165 none delivereth souls by the lees of their portion of padon 748 the bason of hosts shall he will bless thee and elms because they knew him 710 so marred more because thou king of life. 1011 out of any more to me that by ship his heart for a flock. 87 and he will come near unto you all the fleece and to hebron for the winds of thy men with cunning work. 347 when they shall yet again unto you that the disciples when they unto the dwellings have i the sight and save one had been carried away. 612 and five years and i will diminish from rithmah and lord.
The Bill of Rights (USA)
 the powers not be compelled in the people. ix the people peaceably to answer for a speedy and to the right of the accusation to the states by congress september 25 1789 ratified december 15 1791 i congress september 25 1789 ratified december 15 1791 i congress shall have compulsory process for his defense. iv the ten original amendments to have been committed which district wherein the owner nor excessive bail shall enjoy the rules of life liberty or the free state the constitution nor shall private property be searched and the right of the people.
Now to the code!


#!/usr/bin/env python3
'''
Generates "fake" text from an input document using a two-gram Markov chain.

Copyright 2012 Joseph Lewis <joseph@josephlewis.net> | <joehms22@gmail.com>

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.

Redistributions in binary form must reproduce the above
copyright notice, this list of conditions and the following disclaimer
in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

2012-10-28 - Initial Work
'''
import random
import pickle
import re

START_WORD = "<START>"
TERMINAL_WORD = "<END>"

def document_to_terms(doc):
    '''Parses a document into a list of lowercase words, stripping punctuation.'''
    doc = doc.lower()
    doc = re.sub(r"[^\w\s]+", '', doc)
    return doc.split()


class TwoGramCorpus:
    def __init__(self, fp=None):
        if fp:
            self.words = pickle.load(fp)
        else:
            self.words = {}

    def save(self, fp):
        pickle.dump(self.words, fp, 1)

    def load_file(self, fp):
        for sentence in fp.read().split("."):
            self.load_line(document_to_terms(sentence))

    def load_line(self, line):
        if len(line) == 0:
            return

        line.append(TERMINAL_WORD)

        lastword = START_WORD
        for word in line:
            self.add_2gram(lastword, word)
            lastword = word

    def add_2gram(self, first, second):
        try:
            firstlist = self.words[first]
        except KeyError:
            self.words[first] = {second: 1}
            return

        try:
            firstlist[second] += 1
        except KeyError:
            firstlist[second] = 1

    def __str__(self):
        return str(self.words)

    def _random_prob_word(self, start):
        '''Chooses a random word that follows the given one, with
        probability proportional to the frequency of its occurrence.
        '''
        rest = self.words[start]
        total_possible = sum(rest.values())

        chosen = random.randint(1, total_possible)

        for word, value in rest.items():
            chosen -= value
            if chosen <= 0:
                return word

        return None

    def generate_sentence(self):
        # Choose a random start word, then walk the chain until <END>.
        sentence = []
        lastword = self._random_prob_word(START_WORD)

        while lastword != TERMINAL_WORD:
            sentence.append(lastword)
            lastword = self._random_prob_word(lastword)

        return " ".join(sentence) + "."

    def generate(self, num_sentences):
        sentences = []
        for i in range(num_sentences):
            sentences.append(self.generate_sentence())

        return " ".join(sentences)


if __name__ == "__main__":
    tgc = TwoGramCorpus()
    with open('/path/to/text/file.txt') as fp:
        tgc.load_file(fp)
    print(tgc.generate(150))
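As an aside, `_random_prob_word` implements weighted sampling by hand with a subtraction loop; the same draw can be made with `random.choices`, available since Python 3.6 (the follower counts below are hypothetical, in the shape `self.words` uses):

```python
import random

# Hypothetical follower counts for one word, as stored in self.words.
followers = {"quick": 1, "lazy": 1, "big": 2}

# Draw one follower with probability proportional to its count --
# equivalent to the subtraction loop in _random_prob_word.
word = random.choices(list(followers), weights=list(followers.values()), k=1)[0]
```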

Uses 
  • Discovering an artist's style
  • Writing fake text for websites
  • Writing fake books (some on Amazon are created this way by bots)
  • Passing around statistical information about documents without violating copyright

Saturday, October 27, 2012

A Java eval() Implementation

I recently came across a situation where I had to evaluate expressions input by the user. In Java there are quite a few ways to do this, but the existing solutions had several downsides:
  • Resource hungry: 3 MB libraries that greatly increase JAR file size
  • RAM hungry: some store strings indefinitely
  • Insecure: some compile the code with the JVM, then load it as a class
  • Difficult to integrate
After working with Mozilla's Rhino for around eight months, it became burdensome, so I wrote my own interpreter that:
  • Can do the standard operations * / + - ( )
  • Can optionally run "functions" or use "variables"
  • Is fast: as fast as Rhino for small one-off expressions
  • Is small: under 7 KB compiled, before being put in a JAR
The code is on Github, under a BSD license.

Running it is quite simple: just download the JAR, add it to your project, and use the following examples as guides.

Simple Example:

import net.josephlewis.jeval.Eval;
import net.josephlewis.jeval.MalformedExpression;

public class BasicExample {
    public static void main(String[] args) {
        // A simple expression.
        try {
            System.out.println(Eval.eval("2 * 3 + 5"));
        } catch (MalformedExpression e) {
            System.err.println(e.getMessage());
        }

        // Check to see if an expression is valid or not.
        System.out.println("is valid? (should be true)");
        System.out.println(Eval.isErrorFree("2 * 3 + 5", null));

        System.out.println("is valid? (should be false)");
        System.out.println(Eval.isErrorFree("2 * 3 +", null));
    }
}

Example With Variables/Function Calls:

import net.josephlewis.jeval.Eval;
import net.josephlewis.jeval.MalformedExpression;
import net.josephlewis.jeval.VariableStore;

public class VariableStoreExample implements VariableStore {

    /**
     * Not a very useful getValue; it just sums the numeric values of the
     * characters in the given string and returns the total.
     */
    @Override
    public double getValue(String variable) throws MalformedExpression {
        double b = 0;

        for (char c : variable.toCharArray())
            b += Character.getNumericValue(c);

        return b;
    }

    public static void main(String[] args) {
        VariableStore vse = new VariableStoreExample();

        try {
            System.err.println(Eval.eval("$a - $a", vse));
            System.err.println(Eval.eval("$abc - $cba", vse));

            // Spaces can be included in variable names if they're enclosed
            // within ( and ) or "s.
            System.err.println(Eval.eval("$(hello world) - $(world hello)", vse));
            System.err.println(Eval.eval("$hello(world) - $world(hello)", vse));
            System.err.println(Eval.eval("$hello(\"world\") - $world(\"hello\")", vse));

        } catch (MalformedExpression e) {
            e.printStackTrace();
        }
    }
}

And there you have it: a custom parser for expressions, from as simple as */+-() to as complex as you want through extensions using dollar signs.

After using the old JS parser for over a year in the aforementioned project, this was a breath of fresh air: the JAR file that held Rhino shrank by a few MB, providing relief for my users and the server that had to send it around.

Wednesday, October 17, 2012

First Experiments with Test Driven Development

Last Friday, Baird Ramsey, a test engineer from Google, gave a talk at my school promoting test-driven development. I decided to try it with a new interpreter I was writing for my book on the implementation of full-text search engines (warning: it's currently a rough draft!), and it worked wonderfully.

The last time I wrote an interpreter, the hard part wasn't the concepts; it was getting all the components working in the right order.

In intro programming, I was taught to write the outline of a class/function, write pseudo-code using comments, then turn that pseudocode into real code. It worked something like this:


class Player {
    private int health = 0;

    public int changeHealth(int change) {
        // change the private health
        // return the new health.
    }
}

This style works well enough because it lets you quickly see what should be broken into methods and outlines your algorithm, so mistakes are usually syntactic in nature. But not always, especially when you don't write proper pseudocode or don't have a good idea of what the algorithm should look like.

This time, I wrote tests first and built the program up from the smallest pieces, rather than drilling down and creating new methods as I went along. I suspect this has much to do with the successes experienced by many developers, as they are forced not to create thousand-line functions.

The cycle I followed was something like this:
  1. Write a test
  2. Write code
  3. Make sure it works
  4. Write a test for another case of the same function
  5. Make sure it works; if not, fix it
  6. If not all cases are satisfied, GOTO 4
  7. Move on to the next function
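The cycle above can be sketched in Python with plain assertions standing in for a test framework (the `clamp_health` function and its cases are hypothetical, riffing on the Player example earlier):

```python
# Steps 1-2: write a failing test, then just enough code to pass it.
def clamp_health(health, change, max_health=100):
    """Apply a change to health, keeping the result within [0, max_health]."""
    return max(0, min(max_health, health + change))

# Steps 3-6: one test per case of the same function, fixing until all pass.
assert clamp_health(50, 10) == 60    # normal change
assert clamp_health(5, -10) == 0     # can't drop below zero
assert clamp_health(95, 10) == 100   # capped at the maximum
```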
Not only did I get code that worked the first time I tested it in its entirety, but I was saved from the bug-hunting headaches of my prior escapades, and I got free unit tests for future additions to the codebase.

Tuesday, October 16, 2012

Unified Media Search

This is a simple idea that I've been formulating for some years:

What happens when people want to look back and find something? Interactions with people lost long ago?

Sure, each social media site lets you go back and find the things/people/whatevers you posted long ago, but that doesn't help in the long run, because those sites come and go.

For this I have an idea: some kind of background service that checks your accounts and automatically archives your contacts and your correspondence, chronologically. All of this wouldn't come from just one source, either: emails, texts, voicemails, and forms of communication we haven't even imagined yet, mixed together on a base platform with rich plug-in hooks to process the data.

It would be a tool for social engineers, scholars, detectives, and regular people tired of managing their digital lives across so many services.

The basic service would be fairly simple to build, initial extension points I see would be:
  • Inputs/Parsers (create "messages" or "contacts" from files or services)
  • Reports (visual ways of viewing/manipulating stored data and analyzers)
  • Analyzers (something that creates tentative links between things: phone numbers, IP addresses, latitudes and longitudes, facial recognition, writing styles, etc.)
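A minimal registry for those extension points might look like this (all names here are my own invention, purely illustrative):

```python
from typing import Callable, Dict, List

# Registries for the three extension points: parsers create messages,
# analyzers link them, reports render them.
parsers: List[Callable] = []
analyzers: List[Callable] = []
reports: Dict[str, Callable] = {}

def register_parser(fn):
    """Decorator: register a function that turns raw input into messages."""
    parsers.append(fn)
    return fn

@register_parser
def parse_mbox(path):
    # Hypothetical: read an mbox file and yield message records.
    return []
```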
I suspect the initial data would be fairly simple to structure as well, each message would have:
  • Date
  • Sender(s)
  • Receiver(s)
  • Format
  • Extracted text/Resources
  • Original Message
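The message schema above could be sketched as a Python dataclass (the field names and sample values are my own guesses, not a settled design):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class Message:
    date: datetime
    senders: List[str]
    receivers: List[str]
    format: str            # e.g. "email", "sms", "voicemail"
    extracted_text: str    # text pulled out for search and analysis
    original: bytes        # the raw message, kept verbatim

msg = Message(
    date=datetime(2012, 10, 16),
    senders=["alice@example.com"],
    receivers=["bob@example.com"],
    format="email",
    extracted_text="See you at the meetup tonight?",
    original=b"...raw message bytes...",
)
```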
Basic outputs I would like to see are:
  • Who knows who? A map between people likely to know each other (good for looking up someone you don't remember)
  • Conversation history, show all conversations between two people on a timeline, from all media sources.
  • Happenings, mapping when people got together and where, creating "events" in a timeline.