Starting to muck about with text generation in anticipation of NaNoGenMo this year. I trained a second-order Markov model on over 200 million words of data (three thousand or so texts from Project Gutenberg). Written in Ruby, using my native Sooth library, the entire process took 28 hours and resulted in a 674 MB model file. Because Sooth uses a 32-bit context, I used a 16-bit dictionary of words, which I generated by stripping punctuation, capitalising words and then selecting the 64822 most frequent. To do that, I wrote a script to count word frequencies and select the words that occurred at least n times, such that the result would contain fewer than 65536 words; I think n ended up being 27 or something like that.
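The dictionary step can be sketched roughly as follows. This is a minimal illustration and not the actual script: `pick_threshold`, `build_dictionary` and `pack_context` are hypothetical names, and packing two 16-bit word ids into one 32-bit integer is only my assumption about why a 32-bit context implies a 16-bit dictionary. Nothing here calls Sooth itself.

```ruby
# Hypothetical sketch: choose the smallest threshold n such that the words
# occurring at least n times number fewer than `limit`.
MAX_WORDS = 65_536

def pick_threshold(counts, limit = MAX_WORDS)
  n = 1
  n += 1 while counts.count { |_, c| c >= n } >= limit
  n
end

# Count word frequencies and assign a small integer id to every kept word.
def build_dictionary(tokens, limit = MAX_WORDS)
  counts = tokens.each_with_object(Hash.new(0)) { |w, h| h[w] += 1 }
  n = pick_threshold(counts, limit)
  words = counts.select { |_, c| c >= n }.keys.sort
  [n, words.each_with_index.to_h]
end

# Assumption: a second-order context is two 16-bit ids packed into 32 bits.
def pack_context(id_a, id_b)
  (id_a << 16) | id_b
end
```

With a real corpus the loop would settle on whatever n first brings the vocabulary under the limit, which is presumably how a value like 27 falls out.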
I want to use the Markov model to generate sentences, but at the moment it does a rather poor job. Here are some examples:
<SENTENCE> SOME GENERATION OF ARCHITECT OF GREATNE WOULD COME FOR THE TERM OF ADORATION <SENTENCE>
<SENTENCE> YES SIR HE MADE HER A MAGICIAN HE EXCLAIMED <SENTENCE>
<SENTENCE> AND THIS THOU BUT <BLANK> WAS A PRAYER OVER THAT ALIEN ELEMENT <SENTENCE>
I should also note that <SENTENCE> and <BLANK> are special words, as are <PARAGRAPH>, <CHAPTER> and <BOOK>. And that I strip S from the ends of words, so that GREATNESS becomes GREATNE, in an effort to reduce the number of unique words (as removing an S will often turn a word from plural to singular).
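A rough sketch of that normalisation, under some loudly stated assumptions: the exact stripping rule isn't given, and the samples show short words such as THIS, IS and WAS keeping their S, so the `min_length` cutoff of 5 below is a guess that merely happens to match the examples. The names `normalise` and `encode` are hypothetical, and mapping out-of-dictionary words to <BLANK> is inferred from the samples.

```ruby
BLANK = "<BLANK>"

# Upper-case the word, drop anything that is not A-Z (so punctuation and
# accented letters are lost), then strip trailing S characters from words
# of at least `min_length` letters. The cutoff is an assumption.
def normalise(word, min_length = 5)
  w = word.upcase.gsub(/[^A-Z]/, "")
  w.length >= min_length ? w.sub(/S+\z/, "") : w
end

# Replace every word that is not in the dictionary with the BLANK token.
def encode(text, dictionary)
  text.split.map { |w| normalise(w) }.reject(&:empty?).map do |w|
    dictionary.key?(w) ? w : BLANK
  end
end
```

For instance, `normalise("greatness,")` gives GREATNE and `normalise("This")` stays THIS.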
As an example, here is the opening chapter of “The Emerald City of Oz” by L. Frank Baum as presented to the inference algorithm, once parsed into the 16-bit dictionary:
<BOOK> <SENTENCE> PERHAP I SHOULD ADMIT ON THE TITLE PAGE THAT THIS BOOK IS BY L FRANK BAUM AND HIS CORRESPONDENT FOR I HAVE USED MANY SUGGESTION CONVEYED TO ME IN LETTER FROM CHILDREN <SENTENCE> <SENTENCE> ONCE ON A TIME I REALLY IMAGINED MYSELF AN AUTHOR OF FAIRY TALE BUT NOW I AM MERELY AN EDITOR OR PRIVATE SECRETARY FOR A HOST OF YOUNGSTER WHOSE IDEA I AM <BLANK> TO WEAVE INTO THE THREAD OF MY STORIE <SENTENCE> <PARAGRAPH> <SENTENCE> THESE IDEA ARE OFTEN CLEVER <SENTENCE> <SENTENCE> THEY ARE ALSO LOGICAL AND INTERESTING <SENTENCE> <SENTENCE> SO I HAVE USED THEM WHENEVER I COULD FIND AN OPPORTUNITY AND IT IS BUT JUST THAT I ACKNOWLEDGE MY INDEBTEDNE TO MY LITTLE FRIEND <SENTENCE> <PARAGRAPH> <SENTENCE> MY WHAT IMAGINATION THESE CHILDREN HAVE DEVELOPED <SENTENCE> <SENTENCE> SOMETIME I AM FAIRLY ASTOUNDED BY THEIR DARING AND GENIU <SENTENCE> <SENTENCE> THERE WILL BE NO LACK OF FAIRY TALE AUTHOR IN THE FUTURE I AM SURE <SENTENCE> <SENTENCE> MY READER HAVE TOLD ME WHAT TO DO WITH DOROTHY AND AUNT EM AND UNCLE HENRY AND I HAVE OBEYED THEIR MANDATE <SENTENCE> <SENTENCE> THEY HAVE ALSO GIVEN ME A VARIETY OF SUBJECT TO WRITE ABOUT IN THE FUTURE ENOUGH IN FACT TO KEEP ME BUSY FOR SOME TIME <SENTENCE> <SENTENCE> I AM VERY PROUD OF THIS ALLIANCE <SENTENCE> <SENTENCE> CHILDREN LOVE THESE STORIE BECAUSE CHILDREN HAVE HELPED TO CREATE THEM <SENTENCE> <SENTENCE> MY READER KNOW WHAT THEY WANT AND REALIZE THAT I TRY TO PLEASE THEM <SENTENCE> <SENTENCE> THE RESULT IS VERY SATISFACTORY TO THE PUBLISHER TO ME AND I AM QUITE SURE TO THE CHILDREN <SENTENCE> <PARAGRAPH> <SENTENCE> I HOPE MY DEAR IT WILL BE A LONG TIME BEFORE WE ARE OBLIGED TO DISSOLVE PARTNERSHIP <SENTENCE> <CHAPTER>
So, how to generate a novel novel from this mess? Here are my thoughts:
- Generate a prototype sentence, which consists of a certain number of empty slots for words, with the length of the sentence statistically consistent with what has been observed in the past.
- Populate the slots with some candidate keywords that have high mutual information according to the previous two sentences.
- For the remainder of the slots, determine a list of words that could fill those slots, as constrained by the other known words in the prototype sentence.
- Fill the empty slots with the candidate words, preferring to fixate on a relevant keyword, provided the choice is legal according to the Markov model.
- Generate hundreds of candidate sentences, and select the best according to some heuristic.
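The steps above might be sketched like this. It is a simplified stand-in and not a finished design: slots are filled left to right only, a single keyword is forced into one randomly chosen slot, and the model is a toy in-memory Hash and not a Sooth model. All names (`train`, `fill`, `generate`) are hypothetical, and the selection heuristic is passed in as a block.

```ruby
START = "<SENTENCE>"

# Toy second-order model: model[[w1, w2]] maps each next word to its count.
def train(sentences)
  model = Hash.new { |h, k| h[k] = Hash.new(0) }
  sentences.each do |words|
    padded = [START, START] + words + [START]
    padded.each_cons(3) { |a, b, c| model[[a, b]][c] += 1 }
  end
  model
end

# Pick a word with probability proportional to its count.
def weighted_pick(counts, rng)
  target = rng.rand(counts.values.sum)
  counts.each { |word, c| return word if (target -= c) < 0 }
end

# Fill a prototype of `length` slots, forcing `keyword` into slot `at`.
# Returns nil when the model does not allow the sentence.
def fill(model, length, keyword, at, rng)
  words = []
  context = [START, START]
  length.times do |i|
    # The end marker may not appear in the middle of a sentence.
    options = model.fetch(context, {}).reject { |w, _| w == START }
    return nil if options.empty?
    if i == at
      return nil unless options.key?(keyword)
      word = keyword
    else
      word = weighted_pick(options, rng)
    end
    words << word
    context = [context[1], word]
  end
  # The sentence must be allowed to end after the last slot.
  model.fetch(context, {}).key?(START) ? words : nil
end

# Generate many candidates and keep the one the block scores highest.
# `lengths` is a list of observed sentence lengths to sample from.
def generate(model, lengths, keyword, attempts: 500, rng: Random.new)
  candidates = Array.new(attempts) do
    length = lengths.sample(random: rng)
    fill(model, length, keyword, rng.rand(length), rng)
  end.compact.uniq
  candidates.max_by { |words| yield words }
end
```

A call such as `generate(model, observed_lengths, "MAGICIAN") { |words| score_of(words) }` would return the best legal candidate, or nil if none of the attempts survived; `score_of` here stands for whatever heuristic is chosen.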
The heuristic for selecting the best generation will be a function of two factors: the average information of the generated sentence, as measured by the Markov model, and the average mutual information of the generated sentence, as measured by a model that takes the previous few sentences into account, and possibly also the fixation words.
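One possible shape for such a heuristic, as a sketch only. The assumptions are mine: "average information" is taken to be mean surprisal in bits per word, mutual information is approximated by pointwise mutual information (PMI) averaged over word pairs, unseen pairs contribute 0 instead of negative infinity, and the two factors are combined linearly with lower surprisal preferred. `Cooccurrence` and every function name are hypothetical.

```ruby
# Mean surprisal in bits per word under a second-order model, where
# model[[w1, w2]] maps each next word to its count.
def average_information(model, words, start = "<SENTENCE>")
  padded = [start, start] + words + [start]
  bits = padded.each_cons(3).map do |a, b, c|
    counts = model.fetch([a, b], {})
    seen = counts.fetch(c, 0)
    return Float::INFINITY if seen.zero?
    -Math.log2(seen.to_f / counts.values.sum)
  end
  bits.sum / bits.size
end

# pairs:   co-occurrence counts keyed by a sorted [x, y] pair
# singles: per-word counts over the same windows
# total:   number of windows observed
Cooccurrence = Struct.new(:pairs, :singles, :total)

# Pointwise mutual information in bits; 0.0 for pairs never seen together.
def pmi(stats, x, y)
  joint = stats.pairs.fetch([x, y].sort, 0)
  return 0.0 if joint.zero?
  Math.log2(joint.to_f * stats.total / (stats.singles[x] * stats.singles[y]))
end

# Average PMI between every candidate word and every context word.
def average_mutual_information(stats, words, context)
  pairs = words.product(context)
  return 0.0 if pairs.empty?
  pairs.sum { |x, y| pmi(stats, x, y) } / pairs.size
end

# Assumed combination: reward relevance to the context, penalise surprisal.
def score(model, stats, words, context, weight: 1.0)
  weight * average_mutual_information(stats, words, context) -
    average_information(model, words)
end
```

Whether surprising sentences should be penalised or rewarded is an open choice; flipping the sign of the second term gives the opposite preference.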
These fixation words should also be determined stochastically from data, by observing words that tend to occur in clusters. I am trying here to identify character names, locations, objects and so on that are pertinent to the story. If the model generates a sentence containing the word SHERLOCK, for instance, then the mere presence of this word in the story should make it much more likely to occur in the future. This is something to be figured out.
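Two small pieces that could serve as a starting point, offered as a guess and not a settled approach. The first scores how clustered a word is by dividing its total occurrences by the number of chunks (chapters, say) that contain it. The second reweights the model's next-word counts so that words already used in the story become more likely, in the spirit of a cache language model. The names `burstiness` and `boost_with_cache` and the `strength` parameter are all hypothetical.

```ruby
# Mean occurrences per chunk, counting only chunks that contain the word.
# Words that cluster (names, places) score well above 1.0.
def burstiness(chunks)
  totals = Hash.new(0)
  spread = Hash.new(0)
  chunks.each do |words|
    words.each { |w| totals[w] += 1 }
    words.uniq.each { |w| spread[w] += 1 }
  end
  totals.map { |w, n| [w, n.to_f / spread[w]] }.to_h
end

# Scale each candidate's count by how often it has already appeared in the
# story so far. `cache` maps words to their counts in the generated text.
def boost_with_cache(options, cache, strength: 5.0)
  options.map do |word, count|
    [word, count * (1.0 + strength * cache.fetch(word, 0))]
  end.to_h
end
```

With `cache = { "SHERLOCK" => 2 }` and the default strength, SHERLOCK's weight is multiplied by 11 while unseen words keep their original weight.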