Nonsense Poetry

I made another bot on Twitter that generates nonsensical poetry in the form of haikus, limericks, love poems, and quatrains! Check it out on Infinite Typewriters (@infinite_poetry).

As an offshoot of my Twitter bot project, I decided to look into poetry generation. It started with wanting to write Markov-generated haikus. However, this requires counting the syllables of a word, which is actually very difficult to do. At first, I tried to use a naive approach that counted vowels and groups of vowels. Unfortunately, English does not adhere to many of its own rules, especially regarding syllable counts, and this approach was unreliable.
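
A rough sketch of that kind of vowel-group counter shows where it goes wrong. The function name `naive_syllable_count` and the sample words are illustrative assumptions, not the code I originally wrote:

```python
import re

def naive_syllable_count(word):
    """Estimate syllables by counting runs of consecutive vowels (y included)."""
    vowel_groups = re.findall(r"[aeiouy]+", word.lower())
    return len(vowel_groups)

# Compare the estimate with the real syllable count for a few words
samples = [("hello", 2), ("make", 1), ("poetry", 3), ("quiet", 2)]
for word, actual in samples:
    print(word, "estimated:", naive_syllable_count(word), "actual:", actual)
```

Silent letters ("make") inflate the count, while adjacent vowels that belong to different syllables ("poetry", "quiet") deflate it, so patching one rule tends to break another.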

Finally, I settled on using the Natural Language Toolkit (NLTK), a powerful Python package that provides a platform for language processing. Specifically, it includes the Carnegie Mellon University (CMU) Pronouncing Dictionary, which provides not only the phonemes of a word but also the lexical stress of each vowel phoneme. This opened up an entirely new avenue of investigation: not only could I count the syllables in a word, I could also use its stress pattern to determine the cadence of a sentence.

Using the CMU Dictionary with NLTK is relatively easy. After installing NLTK with pip install nltk and downloading the cmudict corpus (for example with nltk.download('cmudict')), the code is simple:

import nltk
from nltk.corpus import cmudict

# Get the pronunciation as a Python dictionary
d = cmudict.dict()

print(d['hello'])

This prints [['HH', 'AH0', 'L', 'OW1'], ['HH', 'EH0', 'L', 'OW1']], one list of phonemes for each pronunciation of the word.

Now, counting the syllables is simply a matter of counting the phonemes in a pronunciation that carry a stress digit (0 for no stress, 1 for primary stress, 2 for secondary stress). Further, reading those digits in order gives the lexical stress pattern of the word, which, in this case, is 01.
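
As a minimal sketch, both steps can be written against a single pronunciation, here hard-coded from the output above so that the snippet runs without NLTK. The helper names `count_syllables` and `stress_pattern` are hypothetical:

```python
def count_syllables(pron):
    """Count the phonemes that end in a stress digit (0, 1, or 2)."""
    return sum(1 for phoneme in pron if phoneme[-1].isdigit())

def stress_pattern(pron):
    """Join the stress digits of a pronunciation into a string such as '01'."""
    return "".join(phoneme[-1] for phoneme in pron if phoneme[-1].isdigit())

# First pronunciation of "hello", as listed by cmudict
hello = ['HH', 'AH0', 'L', 'OW1']

print(count_syllables(hello))  # 2
print(stress_pattern(hello))   # 01
```

With the real dictionary, the same helpers could be applied to d[word][0], assuming the word is present in d.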

Therefore, given a word_list and a helper nsyl(word) that returns the syllable count of a word, the code to generate a line with a given number of syllables is:

import random

def generate_line(num_syl):
    '''
    Recursively generates a line with given number of syllables
    '''

    # Base case: zero syllables
    if num_syl == 0:
        return []

    # Recursive case
    else:
        word = random.choice(word_list)

        # Re-draw until the word has at least one syllable and fits in the remaining count
        while nsyl(word) > num_syl or nsyl(word) == 0:
            word = random.choice(word_list)

        return [word] + generate_line(num_syl - nsyl(word))

In this case, the word_list I used was the 10,000 most common English words, as determined by n-gram frequency analysis of Google's Trillion Word Corpus. You can find it on this GitHub repo. One could conceivably replace this set with a set of words obtained from any corpus.
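
One caveat of the re-draw loop is that it could spin for a long time if few words in the list fit the remaining syllables. A possible variant, sketched here as an assumption rather than what the bot actually runs, filters the candidates first and builds the line iteratively. The toy `SYLLABLES` table stands in for word_list and nsyl, and `generate_line_iter` is a hypothetical name:

```python
import random

# Hypothetical toy data: word -> syllable count (stand-in for word_list and nsyl)
SYLLABLES = {
    "quantum": 2, "dimension": 3, "save": 1,
    "bring": 1, "validation": 4, "thank": 1,
}

def generate_line_iter(num_syl, syllables=SYLLABLES):
    """Build a line word by word, only drawing from words that still fit."""
    line = []
    remaining = num_syl
    while remaining > 0:
        # Keep only words with a known, non-zero count that fit in what is left
        candidates = [w for w, n in syllables.items() if 0 < n <= remaining]
        if not candidates:
            raise ValueError("no word fits the remaining syllables")
        word = random.choice(candidates)
        line.append(word)
        remaining -= syllables[word]
    return line

print(" ".join(generate_line_iter(5)))
```

Filtering before choosing means each draw is guaranteed to make progress, at the cost of scanning the word list once per word.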

The CMU Dictionary also allows us to find rhymes for a given word by matching the ends of the pronunciations. This brute-force method goes through every word in word_list and keeps those whose last phonemes match the last phonemes of the input word, where the level is the number of trailing phonemes that must match.

def rhyme_set(input_word, level):
    '''
    Returns the set of all words in word_list that rhyme with the input word
    at the given level, i.e. the number of trailing phonemes that must match
    '''

    # Get the first listed pronunciation of the word
    pron = d[input_word][0]

    # Find all matching pronunciations, i.e. rhymes, of the word,
    # skipping words that are missing from the dictionary
    rhymes = [word for word in word_list
              if word in d and d[word][0][-level:] == pron[-level:]]

    # Collapse duplicates
    rhyming_set = set(rhymes)

    # The input does not rhyme with itself
    # (discard avoids a KeyError if the input is not in word_list)
    rhyming_set.discard(input_word)

    return rhyming_set
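
If many rhyme lookups are needed, one alternative to scanning the whole list each time is to index the words by their trailing phonemes once. This is only a sketch under assumptions: the toy `PRONS` table stands in for the first pronunciations from the CMU Dictionary, and `build_rhyme_index` and `rhymes_for` are hypothetical names:

```python
from collections import defaultdict

# Hypothetical toy pronunciations, written in CMU Dictionary style
PRONS = {
    "rings": ["R", "IH1", "NG", "Z"],
    "brings": ["B", "R", "IH1", "NG", "Z"],
    "springs": ["S", "P", "R", "IH1", "NG", "Z"],
    "box": ["B", "AA1", "K", "S"],
    "stocks": ["S", "T", "AA1", "K", "S"],
}

def build_rhyme_index(prons, level):
    """Map each tuple of the last `level` phonemes to the words ending in it."""
    index = defaultdict(set)
    for word, pron in prons.items():
        index[tuple(pron[-level:])].add(word)
    return index

def rhymes_for(word, prons, index, level):
    """Look up the words sharing the trailing phonemes, excluding the word itself."""
    key = tuple(prons[word][-level:])
    return index.get(key, set()) - {word}

index = build_rhyme_index(PRONS, 3)
print(sorted(rhymes_for("rings", PRONS, index, 3)))  # ['brings', 'springs']
print(sorted(rhymes_for("box", PRONS, index, 3)))    # ['stocks']
```

The index has to be rebuilt for each level, so this only pays off when the same level is queried repeatedly.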

Finally, the last element of poetry generation is matching cadences, that is, checking a sequence of stresses against a target stress pattern.

def cadence_match(cad, pattern):
    '''
    Recursively traverses both patterns to match

    By default, the matching is done in the forward 
    direction, so returns true if the input cadence 
    matches the start of the pattern
    '''

    # Base cases: empty strings
    if len(cad) == 0:
        return True
    elif len(pattern) == 0:
        return False

    # Recursive case
    # Note: Patterns match if stresses match exactly, or the pattern is indeterminate
    # and the input cadence has no stress. We do not want a stressed syllable matching 
    # an indeterminate syllable. 
    else:
        if cad[0] == pattern[0] or (cad[0] == '0' and pattern[0] == '*'):
            return cadence_match(cad[1:], pattern[1:])
        else:
            return False

This method required some tinkering. At first, when I looked up the lexical stress pattern of a word, one-syllable words would always be stressed. However, in normal conversation, one-syllable words are often unstressed, though they can carry stress. Thus, I assigned the character * to one-syllable words to denote an indeterminate stress.
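
A minimal sketch of that adjustment, assuming pronunciations in the CMU format shown earlier. The helper name `word_cadence` is hypothetical, and the pronunciation for "box" is written out by hand so the snippet runs without NLTK:

```python
def word_cadence(pron):
    """Return the stress digits of a pronunciation, or '*' for one syllable."""
    stresses = [phoneme[-1] for phoneme in pron if phoneme[-1].isdigit()]
    if len(stresses) == 1:
        # A one-syllable word may or may not be stressed in context
        return "*"
    return "".join(stresses)

print(word_cadence(['B', 'AA1', 'K', 'S']))     # *
print(word_cadence(['HH', 'AH0', 'L', 'OW1']))  # 01
```

The cadence of a whole line would then be the concatenation of the cadences of its words.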

We can put all of these methods together to generate a line with a given number of syllables that matches a given pattern and rhymes with a given word, which is all the foundation we need to start generating poetry. For example, here is a haiku it generated:

"Validation Quantum"
A Nonsense Haiku
By Poetry Bot

Quantum dimension
Offense guardian save bring
Thank validation

Composed in 0.01 seconds

Limericks are more involved and take longer, but are quite fun to say aloud.

"Aluminum"
A Nonsense Limerick
By Poetry Bot

Performance mas spoken hint rings
Romantic rush pregnant patch brings
Aluminum box
Bed patrick cheat stocks
Victorian pockets peas springs

Composed in 4.37 seconds

Overall, poetry generation is really cool, especially with powerful tools like the CMU Pronouncing Dictionary. Check out my GitHub repo to keep up to date with this project as I continue to add functionality and improve performance.

Andy Zhang

Programming enthusiast and math geek
