Tuesday, September 27, 2011

Automatic Summary of a Document

It is the primary dogma of computer science that computer scientists are lazy. That is why they work with tiny stupid slaves all day, forcing them to do their bidding. Another strange and seemingly unrelated fact is that reading documents and providing summaries is mind-numbingly boring.

Some years ago the folks at Microsoft made a little addition to their word processor that would allow it to automatically generate summaries. The feature is little used and little needed. However, it is still fun, and it could provide some nice summaries for sites like Manybooks.net, which would definitely benefit from having some of their documents (especially lesser-known books) summarized.

Thinking about having your computer generate nice summaries and actually having it do so are, it turns out, two entirely different things. In more structured documents, however, such as non-fiction, some simple math can work out which sentences are important and which are not.

Before we can worry about that, let's first find the important words in a document. One of the better ways is simply to count every occurrence of every word, then demote fairly common words; the example code below removes the top 100 English words by setting their scores to a low constant, although different values can be tried to weight a sentence up or down. For example, the sentence "Call me Ishmael." would likely score rather low, though it is one of the best-known sentences of all time. If the word "me" were given a higher value (it is in the top 100), the sentence might make it into a summary.
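The counting-and-demoting idea can be sketched in a few lines. This is a minimal sketch, not the script below: the `word_scores` name, the tiny `STOPWORDS` set (a stand-in for the full top-100 list), and the -1 penalty are all illustrative choices.

```python
import re
from collections import Counter

STOPWORDS = {"the", "me", "a", "of"}  # stand-in for the full top-100 list


def word_scores(text, penalty=-1):
    """Count every word, then demote very common words to a penalty score."""
    words = re.findall(r"[a-z]+", text.lower())
    scores = Counter(words)
    for w in STOPWORDS:
        if w in scores:
            scores[w] = penalty
    return scores
```

With this, `word_scores("Call me Ishmael. Call the whale.")` counts "call" twice but scores "me" and "the" at -1, so a sentence full of common words averages out low.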

Sentences can then be split up at punctuation marks (. ! ?), and a total for each sentence can be generated by adding all of its word scores and dividing by the number of words in the sentence, thus creating an average word score for the sentence.
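The splitting and averaging step looks roughly like this (again a sketch; `sentence_score` is an illustrative name, and unknown words are assumed to count as zero):

```python
import re


def sentence_score(sentence, scores):
    """Average score of the sentence's words (unknown words count as 0)."""
    words = re.findall(r"[a-z]+", sentence.lower())
    if not words:
        return 0.0
    return sum(scores.get(w, 0) for w in words) / len(words)


# Split on sentence-ending punctuation, then score each piece.
scores = {"whales": 3, "swim": 1}
pieces = [p for p in re.split(r"[.!?]", "Whales swim. Call me Ishmael!")
          if p.strip()]
```

Here "Whales swim" averages (3 + 1) / 2 = 2.0, while a sentence of unscored words averages 0.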

Once all sentences have been scored, the top ones are taken and put back in their original order, preserving some of the structural integrity of the original document, before being returned.
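The reordering step is just a sort on the sentences' original positions. A minimal sketch (the `combine_summary` name is illustrative):

```python
def combine_summary(summary):
    """summary: list of (original_position, sentence) tuples.

    Sorting by position restores the original reading order before
    the sentences are joined into a paragraph.
    """
    return " ".join(s for _, s in sorted(summary))
```

So even if the third sentence scored highest, it still appears after the first in the output.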

In a few of my tests this was not enough, however. I also added a multiplier that takes into account the average length of the words in the sentence, on the theory that main points tend to be made in sentences with longer words, while short-worded supporting sentences would otherwise get more credit than they deserve.
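The multiplier itself is just average characters per word, as in the script's `chars_per_word`. A quick sketch showing the effect:

```python
def chars_per_word(sentence):
    """Average characters per word, used as a score multiplier."""
    words = sentence.split()
    return len(sentence) / len(words) if words else 0.0
```

A sentence of short function words like "to be or" averages about 2.7 characters per word, while "fundamental considerations" averages 13, so the latter's score gets boosted considerably.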
Talk is cheap, show me the code!

#!/usr/bin/env python
# -*- coding: utf-8 -*-
'''Provides An Automated Summary of a document.

Copyright 2011 Joseph Lewis

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above
copyright notice, this list of conditions and the following disclaimer
in the documentation and/or other materials provided with the
distribution.
* Neither the name of the nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
'''

import re
import argparse

__author__ = "Joseph Lewis"
__copyright__ = "Copyright 2011, Joseph Lewis"
__license__ = "BSD"

TOP_HUNDRED_SCORE = 0
EN_TOP = '''the be to of and a in that have i it for not on with he as you do at
this but his by from they we say her she or an will my one all would
there their what so up out if about who get which go me when make can
like time no just him know take person into year your good some could
them see other than then now look only come its over think also back
after use two how our work first well way even new want because any
these give day most us'''


text = """Fourscore and seven years ago our fathers brought forth upon this continent a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battlefield of that war. We have come to dedicate a portion of that field as a final resting-place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this. But in a larger sense we cannot dedicate, we cannot consecrate, we cannot hallow this ground. The brave men, living and dead, who struggled here, have consecrated it far above our poor power to add or detract. The world will little note, nor long remember, what we say here. But it can never forget what they did here. It is for us, the living, rather to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us, that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion; that we here highly resolve that these dead shall not have died in vain; that this nation, under God, shall have a new birth of freedom, and that government of the people, by the people, and for the people, shall not perish from the earth. """


def chars_per_word(sentence):
    '''Returns the average characters per word.'''
    return float(len(sentence)) / float(len(sentence.split(' ')))

def clean_text(text, leavepunct=False):
    # Strip to a-z A-Z (TODO: I18N).
    text = text.replace("\n", " ")
    if leavepunct:
        return re.sub("[^a-z.!? ]", " ", text.strip().lower())
    return re.sub("[^a-z ]", " ", text.strip().lower())

def word_frequency(text):
    '''Counts the frequency of words in the piece of text, and returns
    a dict of words (a-z, lowercased) where each key is a word and the
    value is the number of times that word was found.

    '''
    words = {}

    tmp = clean_text(text)

    # Cut to words.
    for word in tmp.split():
        if word in words:
            words[word] += 1
        else:
            words[word] = 1

    return words

def set_words_to_value(word_dict, top=EN_TOP, value=TOP_HUNDRED_SCORE):
    '''Sets the given words to the given value in the given word_dict.
    The default is to set the top hundred words in English to the
    TOP_HUNDRED_SCORE.

    '''
    # Get the words in the top-hundred text and set each one's score.
    for w in word_frequency(top).keys():
        word_dict[w] = value

def sentences_in(text):
    '''Returns a list of sentences in the text.

    '''
    text = text.replace("\n", " ")
    return re.split('[?.!]', text)


def score_sentence(sentence, words):
    '''The scoring function: given a dictionary of word:value pairs,
    creates a score for the sentence.

    '''
    # Score value based upon words and frequencies of those words.
    tmp = clean_text(sentence)
    total = 0

    for word in tmp.split():
        if word in words:
            total += words[word]

    # Make the total into an average word score.
    try:
        total /= float(len(tmp.split()))

        # Secret ingredient: a higher characters-per-word ratio
        # generally means a more important sentence.
        total *= chars_per_word(tmp)

        return total
    except ZeroDivisionError:
        return -100


def top_sentences(text, word_freq_dict):
    '''Returns a sorted list with the top-rated sentences first, as
    tuples of (score, sentence_location, sentence_text).

    For example, the sentence "Call me Ishmael" might come back as:
    (1.8304283, 0, "Call me Ishmael.")

    where 0 means it was the first sentence in the text, and it had a
    score of 1.83...

    '''
    sentences = []  # tuples of (total score, sentence num, sentence text)
    known_sentences = set()  # Don't add duplicate sentences.
    for currs, s in enumerate(sentences_in(text)):
        total = score_sentence(s, word_freq_dict)

        s = s.strip()
        if s not in known_sentences:
            sentences.append((total, currs, s + "."))
            known_sentences.add(s)

    # Sort highest-rated sentences to lowest.
    sentences = sorted(sentences)
    sentences.reverse()

    return sentences


def __combine_summary(summary):
    '''Combines a summary into a meaningful paragraph. A summary is an
    array of tuples with values of (position of sentence, text). This
    creates better summary paragraphs by ordering important sentences
    in the order they originally appeared.

    '''
    summary = sorted(summary)

    paragraph = ""
    for l, s in summary:
        paragraph += s + " "

    return paragraph

def percentage_top(percentage, sorted_sentences):
    '''Returns the top-rated percentage of the given sentences, in
    paragraph form.

    i.e. to get the top 25 percent of a text, call with:
    percentage_top(25, sorted_sentences)

    '''
    percentage = percentage / 100.0

    num_sentences = int(len(sorted_sentences) * percentage) + 1

    if num_sentences >= len(sorted_sentences):
        num_sentences = len(sorted_sentences)

    # Create summary (top num_sentences sentences).
    summary = []
    for j in range(num_sentences):
        t, l, s = sorted_sentences[j]
        summary.append((l, s))

    return __combine_summary(summary)



def top_words(num_words, sorted_sentences):
    '''Returns roughly num_words words from the top-rated sentences.'''
    summary = []
    words = 0
    for t, l, s in sorted_sentences:
        if words >= num_words:
            break

        words += len(s.split())
        summary.append((l, s))

    return __combine_summary(summary)


if __name__ == "__main__":

    parser = argparse.ArgumentParser(description='Creates summaries from documents by magic.')
    parser.add_argument('-p', '--percent', type=int, default=None,
        help='the percent of the original document to give as a summary')
    parser.add_argument('-w', '--words', type=int, default=None,
        help='the number of words to output as a summary')

    parser.add_argument('PATH', help='the path of the textfile to read from')

    args = parser.parse_args()

    try:
        text = open(args.PATH).read()
        words = word_frequency(text)

        set_words_to_value(words)
        sentences = top_sentences(text, words)

        if args.words:
            print(top_words(args.words, sentences))
        if args.percent:
            if args.words:
                print("\n\n")
            print(percentage_top(args.percent, sentences))

    except IOError:
        print("File given can't be found.")
