Wednesday, October 31, 2012

Fake Text Generation Through Markov Chains

A Markov Chain is roughly described as a finite state machine, in which you know the chance of one state transitioning to another. In this case, we build a graph of words, based upon an input text file, for each word in the text we match it with the word right after and count the occurrences i.e.

The quick red fox jumped over the lazy brown dog is turned in to:
  • <START> the
  • the quick
  • quick red
  • ...
  • the lazy
  • lazy brown
  • brown dog
  • dog <END>
In this case the word "the" has a 50% chance of being followed by red or lazy; when constructing a new text, we find a start word,  and choose a the word right after. Rinse, lather, repeat until you hit an <END>.

The code at the bottom of the page loads a text file, cleans and parses it, then generates 15 "random" sentences afterward.

Here are some fun generations:

Alice's Adventures in Wonderland
Will you had made her lips. I'm on it did not get to herself and the reason so savage when i dare say you so Alice did so bills got into a little recovered his voice. i suppose by this she had been it kills all like said it does it was a present thought it away. oh don't believe i wonder at his ear and who for the other however this so he said Alice you know why then came upon her head must crossexamine the newspapers at one hand in a farmer you know when it must be more puzzled her for she said Alice. once in that the sort of that said the neck nicely by the air.
The Holy Bible
165 none delivereth souls by the lees of their portion of padon 748 the bason of hosts shall he will bless thee and elms because they knew him 710 so marred more because thou king of life. 1011 out of any more to me that by ship his heart for a flock. 87 and he will come near unto you all the fleece and to hebron for the winds of thy men with cunning work. 347 when they shall yet again unto you that the disciples when they unto the dwellings have i the sight and save one had been carried away. 612 and five years and i will diminish from rithmah and lord.
The Bill of Rights (USA)
 the powers not be compelled in the people. ix the people peaceably to answer for a speedy and to the right of the accusation to the states by congress september 25 1789 ratified december 15 1791 i congress september 25 1789 ratified december 15 1791 i congress shall have compulsory process for his defense. iv the ten original amendments to have been committed which district wherein the owner nor excessive bail shall enjoy the rules of life liberty or the free state the constitution nor shall private property be searched and the right of the people.
Now to the code!


#!/usr/bin/env python3
'''
Provides common functions used through the similarity engines.

Copyright 2012 Joseph Lewis <joseph@josephlewis.net> | <joehms22@gmail.com>

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.

Redistributions in binary form must reproduce the above
copyright notice, this list of conditions and the following disclaimer
in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

2012-10-28 - Initial Work
'''
import random
import pickle
import re

START_WORD = "<START>"
TERMINAL_WORD = "<END>"

def document_to_terms(doc):
'''Parses a document in to a list of strings.'''
doc = doc.lower()
doc = re.sub(r"""[^\w\s]+""", '', doc)
return doc.split()

class TwoGramCorpus():
words = None
def __init__(self, fp=None):
if fp:
self.words = pickle.load(fp)
else:
self.words = {}

def save(self, fp):
pickle.dump(self.words, fp, 1)

def load_file(self, fp):
for sentence in fp.read().split("."):
self.load_line(document_to_terms(sentence))

def load_line(self, line):
if len(line) == 0:
return

line.append(TERMINAL_WORD)

lastword = START_WORD
for word in line:
self.add_2gram(lastword, word)
lastword = word

def add_2gram(self, first, second):
firstlist = None
try:
firstlist = self.words[first]
except KeyError:
self.words[first] = {second : 1}
return
try:
firstlist[second] += 1
except KeyError:
firstlist[second] = 1

def __str__(self):
return str(self.words)

def __unicode__(self):
return str(self.words)

def _random_prob_word(self, start):
''' chooses a random word that follows the given one
with the probability of the frequency it occurs.
'''

rest = self.words[start]
total_possible = sum(rest.values())

chosen = random.randint(1,total_possible)

for word, value in rest.items():
chosen -= value

if chosen <= 0:
return word

return None



def generate_sentence(self):
# choose a random start word.
sentence = []
lastword = self._random_prob_word(START_WORD)

while lastword != TERMINAL_WORD:
sentence.append(lastword)
lastword = self._random_prob_word(lastword)

return " ".join(sentence) + "."


def generate(self, num_sentences):
sentences = []
for i in range(num_sentences):
sentences.append(self.generate_sentence())

return " ".join(sentences)

if __name__ == "__main__":
tgc = TwoGramCorpus()
tgc.load_file(open('/path/to/text/file.txt'))
print(tgc.generate(150))

Uses 
  • Discovering an artist's style
  • Writing fake text for websites
  • Writing fake books (some on Amazon are created this way by bots)
  • Passing around statistical information about documents without violating copyright

No comments:

Post a Comment