
Perception comparisons

To create a machine that can combine existing concepts into new, creative ideas, we need to give it knowledge of what the real world is like: how things look and feel, and how they relate to each other.

Providing this knowledge by hand is a daunting task that would take more than a lifetime. In this article we discuss a number of techniques, using a mash-up of different NodeBox libraries, to automatically discover new rules for the Perception module (a NodeBox add-on that contains knowledge of how things look and feel) and comparative relations between concepts (what's the new cool thing? what's the biggest thing in human culture?).

[image: perception_comparisons0]
Lots of people tend to think that God, and science, are very, very big. But not as big as Harry Potter.

Text patterns

A powerful feature in the NodeBox Linguistics library is the en.sentence.find() command. It analyzes a text and retrieves the portions that match a certain pattern. For example, the pattern "I like NN" will retrieve any string of words in the text that starts with "I like" followed by any noun: I like brownies, I like chickens, etc. Patterns can also contain optional wildcards: "I like (JJ) NN" will return I like brownies as well as I like tasty brownies. This way we can quickly find out what kind of things the "I" in the text likes. Or, "* like NN" will yield I like brownies as well as smells like brownies.
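
For instance, here's a quick sketch of the wildcard pattern in action (the exact tags in the output depend on the tagger):

import en
# Both sentences match "* like NN" - the asterisk stands for
# any single word (here "I" and "smells").
for s in ("I like brownies", "It smells like brownies"):
    print en.sentence.find(s, "* like NN")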

The NN and JJ are called part-of-speech tags - representing a noun and an adjective respectively. Under the hood, the Linguistics library will mark each word in the text with its part-of-speech (noun, pronoun, adjective, verb, ...), taking into consideration its place in the sentence.

import en
print en.sentence.tag("I like eating tasty brownies")
>>> I/PRP like/IN eating/VBG tasty/JJ brownies/NNS

To find out what a tag means, or which tags you can use in your patterns, have a look at the tag list in the Linguistics library documentation.

Part-of-speech tagging has been used in natural language processing for over forty years to build and refine large corpora that record what each word in a sentence does, enabling a machine to "read" a text. The Linguistics library uses a Brill corpus of tagged words and a tagger written by Jason Wiener.

Once the text has been tagged, we scan it for words that match the pattern, or words whose POS-tag matches the pattern:

print en.sentence.find("I like tasty brownies", "I like (JJ) NN")
>>> [[('I', 'PRP'), ('like', 'IN'), ('tasty', 'JJ'), ('brownies', 'NNS')]]

The result is a list of matches. Each item in the list is itself a list of (word, tag)-tuples. As you can see you'll need to use some list operations to get the content you want from the result.
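
For example, a minimal sketch that pulls the plain words back out of each match:

matches = en.sentence.find("I like tasty brownies", "I like (JJ) NN")
for match in matches:
    # Each match is a list of (word, tag)-tuples; keep just the words.
    print " ".join(word for word, tag in match)
>>> I like tasty brownies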

Searching, matching and parsing online texts

Now where can we find a large volume of text to analyze? That's right: the World Wide Web! There's a wealth of content written by all sorts of people, waiting to be parsed for information.

The Perception library has a number of tools that combine the Web, Linguistics and Graph libraries to retrieve and visualize online information. At the heart is the perception.search_match_parse() command. It works in three steps:

  • Search: use the Web library to retrieve results from a Google or Yahoo! query,
  • Match: in the description of each result (the few lines you get to read when Google or Yahoo! displays the results it found), look for patterns using en.sentence.find(),
  • Parse: extract the things we need from the pattern matches.

Here's a short example. It attempts to find out what people say aliens want... cows, war, Scully, and disclosure (???), among other things!

perception = ximport("perception")
results = perception.search_match_parse(
    "aliens want *",
    "aliens want NN",
    lambda chunk: perception.clean(chunk[2][0]),
    service="google"
)
print perception.count(results)
>>> {u'disclosure': 2, u'patrol': 2, u'alien-human': 2, 
>>>  u'roxella': 1, u'driver': 2, u'scully': 2, u'planet': 1, 
>>>  u'cows': 1, u'war': 1}

The first parameter supplied to the perception.search_match_parse() command is a Google query. It will yield a number of website links that have aliens want [something] in their content. Note the use of the asterisk wildcard, a powerful Google feature: the asterisk can stand for any word.

The second parameter is the pattern used to extract matches from the description of each result. Here we are being a bit more specific: the thing aliens want must be a noun.

The third parameter requires a closer look. Basically, it is a user-defined function that receives input from en.sentence.find() and cleans it up. The input could be something like:
[(u'aliens', 'NNS'), (u'want', 'VBP'), (u'cows', 'NNS')]

The parse command is responsible for extracting the information we need. In this case, that's the noun at the end of the chunk. In other words: the third element in the list, which is a (word, tag)-tuple, of which we take the first element: chunk[2][0]. Next, we run it through the perception.clean() command, which strips any quotes, brackets, commas, etc. from the word. The returned output is cows.

In this example I used a lambda function as the parse command. A lambda function is a small, anonymous function. The code above means exactly the same as:

def parse_noun(chunk):
    return perception.clean(chunk[2][0])
results = perception.search_match_parse(
    "aliens want *",
    "aliens want NN",
    parse_noun,
...

The perception.search_match_parse() command will then return a list of all the nouns it retrieved. Some nouns will occur more than once, so as a final step we run the output through perception.count(), which pairs each noun with the number of times it was found.
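
If you're wondering what perception.count() does, it behaves roughly like the plain-Python tally below (a sketch, not the library's actual implementation):

def count(words):
    # Map each word to the number of times it occurs in the list.
    counted = {}
    for word in words:
        counted[word] = counted.get(word, 0) + 1
    return counted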

Simile

So, now that we can scan the entire web for patterns, we can start harvesting and comparing all kinds of information. We have already bundled some of the searches we need for the Perception module into robust commands. One of them is based on simile. A simile is a literary device that uses the words "like" or "as": as proud as a lion, as big as a house, like a fish on a line.

In their scientific paper "Learning to Understand Figurative Language: From Similes to Metaphors to Irony" [1], Tony Veale and Yanfen Hao explain how they employ the Google search engine to look for "as [adjective] as *" patterns, ending up with over 70,000 properties linked to nouns. You can review the results on their website (The Creative Language System Group, Department of Computer Science, University College Dublin). Check out their Sardonicus project.

The is-property-of relation is a central paradigm in the Perception module: red is-property-of flower, chaotic is-property-of traffic jam, etc. Since the module is concerned with how things look and feel, this automated simile retrieval method is compelling to us as well.

The perception.suggest_properties() command returns adjectives for a given noun:

print perception.suggest_properties("princess")
>>> {u'beautiful': 4, u'adorned': 1, u'haughty': 5, u'polite': 2, 
>>>  u'pretty': 4, u'dainty': 3, u'lovely': 2, u'happy': 4}

The perception.suggest_objects() command returns nouns for a given adjective:

print perception.suggest_objects("blue")
>>> {u'sea': 3, u'sky': 6, u'loo': 4, u'cornflower': 6, 
>>>  u'iceberg': 4, u'ocean': 7, u'smurf': 4, u'robin': 6, ...}

These we can then use to update and expand the Perception module with is-property-of rules (see the sketch below):

  • haughty is-property-of princess
  • blue is-property-of ocean
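
For example, here is a minimal sketch that prints such rules from the suggestions above (the threshold of 3 is an arbitrary cut-off we pick here, not part of the library):

properties = perception.suggest_properties("princess")
for adjective in properties:
    # Keep only adjectives that turned up at least three times.
    if properties[adjective] >= 3:
        print adjective, "is-property-of princess"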

Comparatives

Another search heuristic is based on comparatives. A comparative is the form of an adjective used to express that one thing has a property to a greater degree than another: a gazelle is faster than a turtle.

Using the Yahoo! search engine, we can look for relations between concepts such as: "[noun] is more important than [noun]" or "[noun] is the new [noun]". These are excellent to get an idea of what people appreciate or value based on what they are writing online. Furthermore, once we have a set of these relations we can link them together in a graph, and find out what is the most important or the newest concept.

The perception.compare_concepts() command returns a list of (word1, word2)-tuples:

results = perception.compare_concepts("is more important than", cached=True)
print results
>>> [(u'life', u'tradition'), 
>>>  (u'security', u'privacy'), 
>>>  (u'experience', u'knowledge'),
>>>  (u'chocolate', u'blogging') ... ]

Note the cached=True parameter. When set to False, the command will execute a live query on Yahoo! each time it is run. This would allow us to start looking at how people's opinions change over time.
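
For example, a sketch of our own (not part of the library) that stores a timestamped snapshot of each live query, so that later runs can be compared:

import time
results = perception.compare_concepts("is more important than", cached=False)
f = open("comparisons-%d.txt" % time.time(), "w")
for concept1, concept2 in results:
    # One tab-separated comparison per line, e.g. "life<tab>tradition".
    f.write("%s\t%s\n" % (concept1.encode("utf-8"), concept2.encode("utf-8")))
f.close()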

Drawing a graph of the results is easy:

results = perception.compare_concepts("is more important than", cached=True)
g = results.graph()
g = g.split()[0]
g.solve()
g.styles.apply()
g.draw(weighted=True, directed=True, traffic=3)

[image: perception_comparisons2]
is-more-important-than graph

Subgraphs 

Since not every result is connected to the others, the graph may contain a number of unconnected subgraphs. Drawing them all at once takes time and clutters the canvas, so we use the graph's split() method, which returns a list of the different subgraphs sorted by size. By picking different subgraphs from the list we can visualize different aspects of the comparison.
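
A quick sketch to see how many subgraphs there are before picking one (assuming the Graph library's graph.nodes list):

subgraphs = results.graph().split()
print len(subgraphs)           # the number of unconnected clusters
print len(subgraphs[0].nodes)  # the first subgraph is the biggest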

[image: perception_comparisons4]
g.split()[4] and g.split()[6]

Node and edge weight

When the graph is "solved", weights for the nodes and edges (connections) in it are calculated. The weight of a connection depends on how many times the comparison occurs in the search results. Node weight is based on eigenvector centrality: nodes that are pointed at by high-scoring nodes get a higher score themselves (between 0.0 and 1.0). The node with the highest weight is therefore the most important.
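
Once the graph is solved we can inspect the weights directly. A sketch, assuming each node in the Graph library exposes its score as node.weight:

g.solve()
# Sort the nodes by weight, most important first.
nodes = sorted(g.nodes, key=lambda n: n.weight, reverse=True)
for node in nodes[:5]:
    print node.id, round(node.weight, 2)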

[image: perception_comparisons6]
Nodes become bigger when they have more connections. Nodes that have four or more connections also become darker. Edges with a heavier weight (i.e. more people have voiced their opinion on this relation) get a bigger shadow.

[image: perception_comparisons7]
Nodes with a large shadow score high on "betweenness centrality". These are the nodes that potentially connect a lot of other nodes (e.g. they are like waypoints or landmarks).

[image: perception_comparisons8]
Nodes with a weight greater than 0.75 get a double stroke. These can be considered salient concepts in the network.

In the case of our example graph, life is more important than anything. Other important things (the darker nodes) are money, security, health, fitness, love, experience and attitude. That's not a bad analysis for a piece of NodeBox code.


Ranking

To get a ranking of node and edge weight, we can use the result's rank() method, optionally supplying one of the subgraphs as a parameter to process only those nodes. The method returns a list of (concept, [concept2, concept3, ...])-tuples, with the most important concept first.

results = perception.compare_concepts("is bigger than")
g = results.graph().split()[0]
g.solve()
g.styles.apply()
g.draw(weighted=True, directed=True, traffic=3)
 
for concept1, comparisons in results.rank(g):
    for concept2 in comparisons:
        print concept1, results.relation, concept2
>>> potter is-bigger-than god
>>> potter is-bigger-than star
>>> love is-bigger-than life

[image: perception_comparisons5]

Here we find that science and mess are bigger than god. God is bigger than life however (and so are love and Bigfoot). The biggest of them all is potter: bigger than God and the stars alike! I can only assume this must be Harry Potter.

When we calculate the rank of the entire result set (so without the graph parameter) we find that love is bigger than everything else. Some other interesting tidbits some of you put online:

  • god is bigger than superman
  • nobody is bigger than mcdonald
  • facebook is bigger than myspace
  • weapons-gate is bigger than breast-gate
  • everybody is bigger than I

Ironically, by listing some examples here, Perception might pick up the page next time and make the mentioned concepts even bigger. This seems like an ideal moment to note that NodeBox is bigger than love and more important than life, don't you think?

Let's examine another example. Here's a subgraph for things we compared with "is better than":

[image: perception_comparisons9]

Suggesting comparisons

Finally, we can look for what kind of comparisons are possible between two given concepts:

print perception.suggest_comparisons("mac", "pc")
>>> {u'smaller': 2, u'slower': 3, u'faster': 3, u'actually cheaper': 2, 
     u'cooler': 4, u'less': 1, u'cheaper': 1, u'better': 2, 
     u'more expensive': 2, u'safer': 2, u'intelligent': 1, 
     u'easier': 4, u'worse': 1, u'greater': 4}
print perception.suggest_comparisons("pc", "mac")
>>> {u'no different': 2, u'worse': 2, u'cooler': 2, u'stronger': 2, 
     u'totally different': 2, u'cheaper': 3, u'better': 10, 
     u'more entertaining': 2, u'less expensive': 1, u'bigger': 2, 
     u'faster': 2}

Created by Tom De Smedt and Frederik De Bleser