Sunday, December 30, 2012

DOCX Text Mining in Python and Java

In recent years Microsoft has made its document formats much easier to parse. Instead of opaque binary lumps like the old .doc format, Word now saves XML inside a zip file with the extension .docx. This format can be parsed quickly and easily in Python (see my earlier xlsx parser).

The Format

Every .docx file contains a folder named word, and within this folder is a file named document.xml. That file holds the document body, with the visible text stored in <w:t> tags. To extract the text, just find those tags and pull out their contents!
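To make the layout concrete, here is a minimal sketch that builds a tiny stand-in archive in memory and inspects it. The sample XML and the helper names (build_sample_docx, list_text_runs) are made up for illustration; the archive has only the one member this post cares about, so it is not a complete Word document.

```python
import io
import zipfile
import xml.dom.minidom

# Hypothetical, stripped-down document.xml; real files carry many more elements.
SAMPLE_XML = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:body>'
    '<w:p><w:r><w:t>Hello</w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>'
    '</w:body>'
    '</w:document>'
)

def build_sample_docx():
    """Return an in-memory zip archive laid out like a tiny .docx file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", SAMPLE_XML)
    buffer.seek(0)
    return buffer

def list_text_runs(fp):
    """Return the archive member names and the contents of each <w:t> tag."""
    with zipfile.ZipFile(fp) as archive:
        names = archive.namelist()
        dom = xml.dom.minidom.parseString(archive.read("word/document.xml"))
    runs = [node.firstChild.nodeValue
            for node in dom.getElementsByTagName("w:t")
            if node.firstChild is not None]
    return names, runs

if __name__ == "__main__":
    names, runs = list_text_runs(build_sample_docx())
    print(names)  # the members of the archive
    print(runs)   # the text found inside the <w:t> tags
```

Running the same list_text_runs function on a real .docx path should show the other members Word writes alongside word/document.xml.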

The Code (Python)

#!/usr/bin/env python3
'''

Copyright 2012 Joseph Lewis <joehms22@gmail.com> | <joseph@josephlewis.net>

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above
copyright notice, this list of conditions and the following disclaimer
in the documentation and/or other materials provided with the
distribution.
* Neither the name of the nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

This software receives docx file paths on the command line and prints the
text of each file.

'''

import zipfile
import xml.dom.minidom
import sys


def extract_docx_text(fp):
    '''Return the contents of every <w:t> tag in the docx file at fp.'''
    with zipfile.ZipFile(fp) as myFile:
        share = xml.dom.minidom.parseString(myFile.read('word/document.xml'))

    text = share.getElementsByTagName('w:t')

    # empty <w:t/> tags have no child text node, so skip them
    return " ".join([node.firstChild.nodeValue for node in text
                     if node.firstChild is not None])

if __name__ == "__main__":
    if len(sys.argv) == 1:
        print("usage: docx.py FILENAME [FILENAME...]")
    else:
        for arg in sys.argv[1:]:  # sys.argv[0] is the script itself
            try:
                print(extract_docx_text(arg))
            except Exception:
                pass # not a valid zipfile, probably.

The Code (Java)

There is an issue with this Java implementation, but it should work for most purposes: special characters that are escaped in XML, like &amp;, will not be unescaped in processing.
/**

Copyright 2012 Joseph Lewis <joehms22@gmail.com> | <joseph@josephlewis.net>

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above
copyright notice, this list of conditions and the following disclaimer
in the documentation and/or other materials provided with the
distribution.
* Neither the name of the nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

This software receives docx file paths on the command line and prints the
text of each file.

**/
import java.io.IOException;
import java.io.InputStream;
import java.util.Scanner;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class DOCXExtracter {
    private static String convertStreamToString(java.io.InputStream is) {
        // document.xml is read as UTF-8 rather than the platform default charset
        try (Scanner s = new java.util.Scanner(is, "UTF-8").useDelimiter("\\A")) {
            return s.hasNext() ? s.next() : "";
        }
    }

    public static String extractFromPath(String path) throws IOException
    {
        // try-with-resources closes the archive even if extraction fails
        try (ZipFile zipFile = new ZipFile(path)) {
            return extractFromFile(zipFile);
        }
    }

    public static String extractFromFile(ZipFile f) throws IOException
    {
        ZipEntry e = f.getEntry("word/document.xml");
        InputStream is = f.getInputStream(e);
        String contents = convertStreamToString(is);
        is.close();
        // replace every tag with a space, leaving only the text between tags
        return contents.replaceAll("(?s)\\<.*?\\>", " ");
    }

    public static void main(String[] args)
    {
        for (int i = 0; i < args.length; i++) {
            try {
                System.out.println(extractFromPath(args[i]));
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
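The escaping caveat mentioned for the Java version is easy to see in a few lines of Python. This is only an illustration of the effect, using the same tag-stripping regex on a made-up snippet. The helper names are hypothetical, and html.unescape is used as a convenient stand-in that also decodes HTML-only entities, so it is slightly more permissive than strict XML unescaping.

```python
import html
import re

def strip_tags(xml_text):
    """Replace every tag with a space, as the regex in the Java version does."""
    return re.sub(r"(?s)<.*?>", " ", xml_text)

def strip_tags_and_unescape(xml_text):
    """Strip tags, then turn entities such as &amp; back into characters."""
    return html.unescape(strip_tags(xml_text))

if __name__ == "__main__":
    sample = "<w:p><w:r><w:t>Fish &amp; chips</w:t></w:r></w:p>"
    print(strip_tags(sample).strip())               # entity left as-is
    print(strip_tags_and_unescape(sample).strip())  # entity decoded
```

The first result still contains the literal text &amp;, which is what the Java code would return; the second shows the text after a separate unescaping pass.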

A Python Hopfield Network

Quick Background

Hopfield networks are recurrent neural networks that take a set of binary input values and produce an output of the same length, settling toward the stored pattern that the input most resembles. They are particularly useful in signal processing and in cleaning up images where noise is an issue (think smudged characters).

I imagine they would be particularly useful in normalizing inputs from different hardware (e.g. creating color profiles, removing static or echoes from some microphones, etc.)

This Implementation

This implementation is a translation of the algorithm found in the Clever Algorithms book from Ruby to Python.
#!/usr/bin/env python3
# A translation of http://www.cleveralgorithms.com/nature-inspired/neural/hopfield_network.html
import random

def random_vector(minmax):
    return [row[0] + ((row[1] - row[0]) * random.random()) for row in minmax]

def initialize_weights(problem_size):
    return random_vector([[-0.5,0.5] for i in range(problem_size)])

def create_neuron(num_inputs):
    return {'weights' : initialize_weights(num_inputs)}

def propagate_was_change(neurons):
    i = random.randrange(len(neurons))
    activation = 0

    for j, other in enumerate(neurons):
        activation += other['weights'][i] * other['output'] if i != j else 0

    output = 1 if activation >= 0 else -1
    change = output != neurons[i]['output']
    neurons[i]['output'] = output
    return change

def flatten(nested):
    try:
        return [item for sublist in nested for item in sublist]
    except TypeError:
        return nested

def get_output(neurons, pattern, evals=100):
    vector = flatten(pattern)

    for i, neuron in enumerate(neurons):
        neuron['output'] = vector[i]

    for j in range(evals):
        propagate_was_change(neurons)

    return [neuron['output'] for neuron in neurons]

def train_network(neurons, patters):
    for i, neuron in enumerate(neurons):
        for j in range((i+1), len(neurons)):
            if i == j:
                continue

            wij = 0.0

            for pattern in patters:
                vector = flatten(pattern)
                wij += vector[i] * vector[j]

            neurons[i]['weights'][j] = wij
            neurons[j]['weights'][i] = wij

def to_binary(vector):
    return [0 if i == -1 else 1 for i in vector]

def print_patterns(provided, expected, actual):
    p, e, a = to_binary(provided), to_binary(expected), to_binary(actual)
    # each 3x3 pattern is printed one row (three values) at a time
    p1, p2, p3 = ', '.join(map(str, p[0:3])), ', '.join(map(str, p[3:6])), ', '.join(map(str, p[6:9]))
    e1, e2, e3 = ', '.join(map(str, e[0:3])), ', '.join(map(str, e[3:6])), ', '.join(map(str, e[6:9]))
    a1, a2, a3 = ', '.join(map(str, a[0:3])), ', '.join(map(str, a[3:6])), ', '.join(map(str, a[6:9]))
    print("Provided\tExpected\tGot")
    print("%s\t\t%s\t\t%s" % (p1, e1, a1))
    print("%s\t\t%s\t\t%s" % (p2, e2, a2))
    print("%s\t\t%s\t\t%s" % (p3, e3, a3))


def calculate_error(expected, actual):
    return sum([1 for i in range(len(expected)) if expected[i] != actual[i]])

def perturb_pattern(vector, num_errors=1):
    perturbed = [v for v in vector]
    # start with one random index, then add distinct ones up to num_errors
    indicies = [random.randrange(len(perturbed))]

    while len(indicies) < num_errors:
        index = random.randrange(len(perturbed))
        if index not in indicies:
            indicies.append(index)

    # flip the value at each chosen index
    for i in indicies:
        perturbed[i] = -1 if perturbed[i] == 1 else 1

    return perturbed


def test_network(neurons, patterns):
    error = 0.0

    for pattern in patterns:
        vector = flatten(pattern)
        perturbed = perturb_pattern(vector)
        output = get_output(neurons, perturbed)
        error += calculate_error(vector, output)
        print_patterns(perturbed, vector, output)

    error = error / float(len(patterns))
    print("Final Result: avg pattern error=%s" % (error))

    return error

def execute(patters, num_inputs):
    neurons = [create_neuron(num_inputs) for i in range(num_inputs)]
    train_network(neurons, patters)
    test_network(neurons, patters)
    return neurons

if __name__ == "__main__":
    # problem configuration
    num_inputs = 9
    p1 = [[1,1,1],[-1,1,-1],[-1,1,-1]] # T
    p2 = [[1,-1,1],[1,-1,1],[1,1,1]] # U
    patters = [p1, p2]
    # execute the algorithm
    execute(patters, num_inputs)
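Because the script above picks neurons and errors at random, its output differs from run to run. As a reproducible sanity check on the same idea, here is a condensed sketch that stores the T and U patterns with the same sum-of-products weight rule and then tries every possible single-bit error. The fixed sweep order and the helper names (hebbian_weights, recall) are assumptions of this sketch, not part of the translated algorithm.

```python
def flatten(pattern):
    """Turn a 3x3 grid into a flat list of 9 values."""
    return [value for row in pattern for value in row]

def hebbian_weights(patterns):
    """w[i][j] = sum over patterns of x_i * x_j, with a zero diagonal."""
    size = len(patterns[0])
    weights = [[0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i != j:
                weights[i][j] = sum(p[i] * p[j] for p in patterns)
    return weights

def recall(weights, state, sweeps=5):
    """Update each neuron in a fixed order; repeat for a few sweeps."""
    state = list(state)
    for _ in range(sweeps):
        for i in range(len(state)):
            activation = sum(weights[i][j] * state[j] for j in range(len(state)))
            state[i] = 1 if activation >= 0 else -1
    return state

if __name__ == "__main__":
    t = flatten([[1, 1, 1], [-1, 1, -1], [-1, 1, -1]])  # T
    u = flatten([[1, -1, 1], [1, -1, 1], [1, 1, 1]])    # U
    weights = hebbian_weights([t, u])
    for pattern in (t, u):
        for k in range(len(pattern)):
            noisy = list(pattern)
            noisy[k] = -noisy[k]  # flip exactly one value
            assert recall(weights, noisy) == pattern
    print("every single-bit error was corrected")
```

With only two stored patterns in nine neurons, each pattern is a stable state and any one flipped value is pulled back to it, which is the noise-cleaning behavior described in the background section.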

Saturday, December 29, 2012

Get CRON to Send Email Reports

Here is a quick tip: if you have cron running jobs and want email confirmation that they are getting done, do the following:
Open your crontab for editing:
crontab -e
You'll see something like this:
# Edit this file to introduce tasks to be run by cron.
#
# Each task to run has to be defined through a single line
# indicating with different fields when the task will be run
# and what command to run for the task
#
# To define the time you can provide concrete values for
# minute (m), hour (h), day of month (dom), month (mon),
# and day of week (dow) or use '*' in these fields (for 'any').
#
# Notice that tasks will be started based on the cron's system
# daemon's notion of time and timezones.
#
# Output of the crontab jobs (including errors) is sent through
# email to the user the crontab file belongs to (unless redirected).
#
# For example, you can run a backup of all your user accounts
# at 5 a.m every week with:
# 0 5 * * 1 tar -zcf /var/backups/home.tgz /home/
#
# For more information see the manual pages of crontab(5) and cron(8)
#
# m h dom mon dow command

28 * * * * cd /home/joseph/google_drive && grive
Now add the following line above your jobs (cron applies a setting like this only to the entries that come after it in the file):
MAILTO=somebody@your.domain
Then save and close; as long as you have a mail server set up and running, you'll get an email with the output of each command in the file every time it runs.

If You Don't Have a Mail Server

If you don't have a mail server set up and are running Ubuntu, you may want to check out Juju Charms; they are Ubuntu's newest way to set up complex server configurations with a single command. Or just follow this tutorial on how to set up Postfix.

Friday, December 28, 2012

Friday Link List - Sixth Sense, End of Social Media and 3D Printer Sale

Operating Systems

  • OUYA, the $99 alternative video game console that runs Android, has just gotten its first batch of prototypes in.

Curiosities

  • North Paw is an interesting project that lets you always know where north is (it uses a haptic feedback anklet; after you wear it for several weeks, the sense of direction gets stuck in your mind).
  • Warren Ellis published an article about the end of the first generation of social media, definitely worth a read.

Algorithms

  • Nothing to see here!

Making

Thursday, December 27, 2012

One in One Out

I'm an avid fan of Leo Babauta, author of zenhabits. I admire his work to pare down his worldly things because, in the end, they are the things that own you. While I may not be ready to cut down all of my possessions to 100 items, I do have a practice of one in, one out.

Any time I get something new, I get rid of something old. At first this seems rather difficult, as you may think "I need all of this stuff," but you start to realize it isn't true. One of the easiest things to cut is books; with so many being open-sourced or reaching the public domain every day, it is easy to trade out the paper for the digital. (Not that I advocate becoming an archive of everything digital.)

Another great place to purge is the paper you collect; a friend of my father's used to run under the rule "what is the worst that can happen if I don't have this?" and that kept her office neat and clean. Chances are that you don't really need most of the paper you keep around.

The simple guide to removing clutter is this:
  • Have I used the thing in the past year? (if no, chances are you don't need it)
  • Can I get the thing digitally for free, or transfer it to digital easily?
  • If I donate this thing, will it help someone else?
  • Will this thing become mostly obsolete before I use it again?
  • Do I already have one of these?
Since this is one hour hacks, give yourself a one-hour challenge to get rid of 100 things, even if they're small (small things still take up room in your brain!)

Wednesday, December 26, 2012

On MOOCs

I recently read an opinion piece titled "Jump Off the Coursera Bandwagon," published in The Chronicle of Higher Education, which raises a few interesting points against online education:
There are critical pedagogical issues at stake in the online market, and MOOC's have not done nearly enough to deal with those concerns.

Coursera and its devotees simply have it wrong. The Coursera model doesn't create a learning community; it creates a crowd. In most cases, the crowd lacks the loyalty, initiative, and interest to advance a learning relationship beyond an informal, intermittent connection.

Why should we be impressed that an online course can reach 100,000 students at once? By celebrating massification, advocates of Coursera elevate volume as the chief objective of online learning. Is that truly our goal in academe?
Because posing a rhetorical question is a very poor form of argument, it is here I'll stop quoting and answer YES.

What MOOCs Do

MOOCs are an invaluable tool, and align better with the underlying principles of pedagogy than do traditional methods.
The goal of the professor is to ignite a curiosity in the minds of the uninitiated and stoke it into a roaring flame. However, in order to stay employed and live in the real world, professors have to meet certain ratings from students, and students are easily upset by lower GPAs. They have to appease everyone, including all of the students who surf Facebook all the way through lectures.
If you must subjugate your students to treating you with a formal respect, the war is already lost. Teachers gain respect by being unmatched. MOOCs promise to bring these kinds of educators out into the open, where they'll be respected by hundreds of thousands instead of a few in a classroom.
Students should never be totally loyal to their professors, and MOOCs allow students to quickly and easily seek out alternative styles of teaching, and alternative viewpoints.
As for the claim that the crowd lacks the initiative to advance the relationship, that is an unfounded statement. Students who actively watch and participate in MOOCs are clearly far beyond those in traditional colleges who try to escape their classes with a good grade.
Now to the questions the author posed:
"Why should we be impressed that an online course can reach 100,000 students at once?"
Because it proves there is a vast market for education that is not being met, and that there are 100,000 students interested enough in the topic that weren't getting their needs met through traditional means.
"By celebrating massification, advocates of Coursera elevate volume as the chief objective of online learning. Is that truly our goal in academe?"
MOOCs allow everyone with a decent network connection to become classically educated. There is no bias based upon race, religion, color, ethnicity, or prior level of education.

Conclusion

Locking away education has been the prerogative of tyrants. In a world that is dominated by technologies few understand because of their complexities, open on-demand education has no real drawbacks, other than possibly causing a change in the way education works.
I don't feel universities will go away for a good long time; they are simply too important an institution to fail. But they certainly will change if MOOCs catch on, and I feel that many of the mediocre professors will be looking for new things to do with their lives, as will many of the mediocre students.
As a student, my goal is to graduate with a high enough GPA that I can get into a good grad school. There are some classes in topics such as Mathematics, Chemistry, or Circuitry that I would happily take if they didn't count toward my GPA. I want the knowledge, and feel it would be exciting to learn, but not at the expense of my future; so I have turned to Coursera for these topics, and will never regret the time I spend learning.

Monday, December 24, 2012