Linguistics
Description
With the Nodebox English Linguistics library you can do grammar inflection and semantic operations on English content. You can use the library to conjugate verbs, pluralize nouns, write out numbers, find dictionary descriptions and synonyms for words, summarise texts and parse grammatical structure from sentences.
The library bundles WordNet (using Oliver Steele's PyWordNet), NLTK, Damian Conway's pluralisation rules, Bermi Ferrer's singularization rules, Jason Wiener's Brill tagger, several algorithms adapted from Michael Granger's Ruby Linguistics module, John Wiseman's implementation of the Regressive Imagery Dictionary, Charles K. Ogden's list of basic English words, and Peter Norvig's spelling corrector.
Download
linguistics.zip (15MB)
Last updated for NodeBox 1.9.4.2
Licensed under GPL
Author: Tom De Smedt
Documentation
- How to get the library up and running
- Categorise words as nouns, verbs, numbers and more
- Categorise words as emotional, persuasive or connective
- Converting between numbers and words
- Quantification of numbers and lists (e.g. 367 x chicken = hundreds of chickens)
- Indefinite article: a or an
- Pluralization/singularization of nouns
- Emotional value of a word
- WordNet glossary, synonyms, antonyms, components
- Verb conjugation
- Spelling corrections
- Shallow parsing, the grammatical structure of a sentence
- Summarisation of text to keywords
- Regressive Imagery Dictionary, content analysis
- Ogden's basic English words
How to get the library up and running
Put the en library folder in the same folder as your script so NodeBox can find the library. You can also put it in ~/Library/Application Support/NodeBox/. Loading all the data takes some time the first time you import the library.
import en
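If you keep your libraries somewhere else entirely, you can also point Python at that folder before importing. This is a minimal sketch; the path below is a placeholder, not a location the library requires:

import sys
# make the folder that contains the en library importable (placeholder path)
sys.path.append("/path/to/your/libraries")
import en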
Categorise words as nouns, verbs, numbers and more
The is_number() command returns True when the given value is a number:
print en.is_number(12)
print en.is_number("twelve")
>>> True
>>> True
The is_noun() command returns True when the given string is a noun. You can also check for is_verb(), is_adjective() and is_adverb():
print en.is_noun("banana") >>> True
The is_tag() command returns True when the given string is a tag, for example an HTML or XML tag. The is_html_tag() command returns True when the string is an HTML tag.
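A short sketch of both commands. The exact strings they accept are an assumption on my part (I'm assuming a tag is written with angle brackets, as in <html>):

print en.is_tag("<a>")
print en.is_html_tag("<html>")
>>> True
>>> True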
Categorise words as emotional, persuasive or connective
The is_basic_emotion() command returns True if the given word expresses a basic emotion (anger, disgust, fear, joy, sadness, surprise):
print en.is_basic_emotion("cheerful")
>>> True
The is_persuasive() command returns True if the given word is a "magic" word (you, money, save, new, results, health, easy, ...):
print en.is_persuasive("money")
>>> True
The is_connective() command returns True if the word is a connective (nevertheless, whatever, secondly, ... and words like I, the, own, him which have little semantic value):
print en.is_connective("but")
>>> True
Converting between numbers and words
The number.ordinal() command returns the ordinal of the given number, 100 yields 100th, 3 yields 3rd and twenty-one yields twenty-first:
print en.number.ordinal(100)
print en.number.ordinal("twenty-one")
>>> 100th
>>> twenty-first
The number.spoken() command writes out the given number:
print en.number.spoken(25)
>>> twenty-five
Quantification of numbers and lists
The number.quantify() command quantifies the given word:
print en.number.quantify(10, "chicken")
print en.number.quantify(800, "chicken")
>>> a number of chickens
>>> hundreds of chickens
The list.conjunction() command quantifies a list of words. Notice how goose is correctly pluralized and duck gets the right article:

farm = ["goose", "goose", "chicken", "chicken", "chicken", "duck"]
print en.list.conjunction(farm)
>>> several chickens, a pair of geese and a duck
You can also quantify the types of things in the given list, class or module:
print en.list.conjunction((1,2,3,4,5), generalize=True)
print en.list.conjunction(en, generalize=True)
>>> several integers
>>> a number of modules, a number of functions, a number of strings,
>>> a pair of lists, a pair of dictionaries, an en verb, an en sentence,
>>> an en number, an en noun, an en list, an en content, an en adverb,
>>> an en adjective, a None type and a nodebox graphics cocoa Context class
Indefinite article: a or an
The noun.article() command returns the noun with its indefinite article:
print en.noun.article("university") print en.noun.article("owl") print en.noun.article("hour") >>> a university >>> an owl >>> an hour
Pluralization and singularization of nouns
The noun.plural() command pluralizes the given noun:
print en.noun.plural("child") print en.noun.plural("kitchen knife") print en.noun.plural("wolf") print en.noun.plural("part-of-speech") >>> children >>> kitchen knives >>> wolves >>> parts-of-speech
You can also do adjective.plural().
An optional classical parameter is True by default and determines whether classical or modern inflection is used (e.g. classical pluralization of octopus yields octopodes instead of octopuses).
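A short sketch of the classical parameter in action; the exact calls are my assumption, but the outputs follow the octopus example above:

print en.noun.plural("octopus")
print en.noun.plural("octopus", classical=False)
>>> octopodes
>>> octopuses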
The noun.singular() command singularizes the given plural:
print en.noun.singular("people") >>> person
Emotional value of a word
The noun.is_emotion() command guesses whether the given noun expresses an emotion, by checking if there are synonyms of the word that are basic emotions. By default it returns True or False.
print en.noun.is_emotion("anger") >>> True
With the boolean=False parameter, the command instead returns a string that provides some more information:
print en.adjective.is_emotion("anxious", boolean=False)
>>> fear
An additional optional parameter shallow=True speeds up the lookup process but doesn't check as many synonyms. You can also use verb.is_emotion(), adjective.is_emotion() and adverb.is_emotion().
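For example, a hedged sketch of a shallow lookup with verb.is_emotion(); whether "worry" maps to a basic emotion depends on which synonyms the shallow lookup reaches, so the output here is an assumption:

print en.verb.is_emotion("worry", shallow=True)
>>> True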
WordNet glossary, synonyms, antonyms, components
WordNet describes semantic relations between synonym sets.
The noun.gloss() command returns the dictionary description of a word:
print en.noun.gloss("book") >>> a written work or composition that has been published (printed on pages >>> bound together); "I am reading a good book on economics"
A word can have multiple senses, for example "tree" can mean a tree in a forest but also a tree diagram, or a person named Sir Herbert Beerbohm Tree:
print en.noun.senses("tree") >>> [['tree'], ['tree', 'tree diagram'], ['Tree', 'Sir Beerbohm Tree']]
print en.noun.gloss("tree", sense=1) >>> a figure that branches from a single root; "genealogical tree"
The noun.lexname() command returns a categorization for the given word:
print en.noun.lexname("book") >>> communication
The noun.hyponym() command returns examples of the given word:
print en.noun.hyponym("vehicle") >>> [['bumper car', 'Dodgem'], ['craft'], ['military vehicle'], ['rocket', >>> 'projectile'], ['skibob'], ['sled', 'sledge', 'sleigh'], ['steamroller', >>> 'road roller'], ['wheeled vehicle']]
print en.noun.hyponym("tree", sense=1) >>> [['cladogram'], ['stemma']]
The noun.hypernym() command returns abstractions of the given word:
print en.noun.hypernym("earth") print en.noun.hypernym("earth", sense=1) >>> [['terrestrial planet']] >>> [['material', 'stuff']]
You can also execute a deep query on hypernyms and hyponyms. Notice how returned values become more and more abstract:
print en.noun.hypernyms("vehicle", sense=0) >>> [['vehicle'], ['conveyance', 'transport'], >>> ['instrumentality', 'instrumentation'], >>> ['artifact', 'artefact'], ['whole', 'unit'], >>> ['object', 'physical object'], >>> ['physical entity'], ['entity']]
The noun.holonym() command returns components of the given word:
print en.noun.holonym("computer") >>> [['busbar', 'bus'], ['cathode-ray tube', 'CRT'], >>> ['central processing unit', 'CPU', 'C.P.U.', 'central processor', >>> 'processor', 'mainframe'] ...
The noun.meronym() command returns the collection in which the given word can be found:
print en.noun.meronym("tree") >>> [['forest', 'wood', 'woods']]
The noun.antonym() command returns the semantic opposite of the word:
print en.noun.antonym("black") >>> [['white', 'whiteness']]
Find out what two words have in common:
print en.noun.meet("cat", "dog", sense1=0, sense2=0) >>> [['carnivore']]
The noun.absurd_gloss() command returns an absurd description for the word:
print en.noun.absurd_gloss("typography")
>>> a business deal on a trivial scale
The return value of a WordNet command is usually a list containing other lists of related words. You can use the en.list.flatten() command to flatten the list:
print en.list.flatten(en.noun.senses("tree"))
>>> ['tree', 'tree', 'tree diagram', 'Tree', 'Sir Herbert Beerbohm Tree']
If you want a list of all nouns/verbs/adjectives/adverbs, there are the wordnet.all_nouns(), wordnet.all_verbs() ... commands:
print len(en.wordnet.all_nouns())
>>> 117096
All of the commands shown here for nouns are also available for verbs, adjectives and adverbs: en.verb.hypernyms("run"), en.adjective.gloss("beautiful") etc. are valid commands.
Verb conjugation
NodeBox English Linguistics knows the verb tenses for about 10000 English verbs.
The verb.infinitive() command returns the infinitive form of a verb:
print en.verb.infinitive("swimming")
>>> swim
The verb.present() command returns the present tense for the given person. Known values for person are 1, 2, 3, "1st", "2nd", "3rd", "plural", "*". Just use the one you like most.
print en.verb.present("gave") print en.verb.present("gave", person=3, negate=False) >>> give >>> gives
The verb.present_participle() command returns the present participle tense:
print en.verb.present_participle("be")
>>> being
Return the past tense:
print en.verb.past("give") print en.verb.past("be", person=1, negate=True) >>> gave >>> wasn't
Return the past participle tense:
print en.verb.past_participle("be")
>>> been
A list of all possible tenses:
print en.verb.tenses()
>>> ['past', '3rd singular present', 'past participle', 'infinitive',
>>> 'present participle', '1st singular present', '1st singular past',
>>> 'past plural', '2nd singular present', '2nd singular past',
>>> '3rd singular past', 'present plural']
The verb.tense() command returns the tense of the given verb:
print en.verb.tense("was") >>> 1st singular past
Return True if the given verb is in the given tense:
print en.verb.is_tense("wasn't", "1st singular past", negated=True) print en.verb.is_present("does", person=1) print en.verb.is_present_participle("doing") print en.verb.is_past_participle("done") >>> True >>> False >>> True >>> True
The verb.is_tense() command also accepts shorthand aliases for tenses: inf, 1sgpres, 2sgpres, 3sgpres, pl, prog, 1sgpast, 2sgpast, 3sgpast, pastpl and ppart.
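For example, a short sketch using the ppart alias; the output assumes the alias resolves to the past participle tense, as listed above:

print en.verb.is_tense("given", "ppart")
>>> True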
Spelling corrections
NodeBox English Linguistics is able to perform spelling corrections based on Peter Norvig's algorithm. The spelling corrector has an accuracy of about 70%.
The spelling.suggest() command returns a list of possible corrections for a given word. The spelling.correct() command returns the corrected version (best guess) of the word.
print en.spelling.suggest("comptuer")
>>> ['computer']
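A matching sketch for spelling.correct(); the output assumes the best guess equals the single suggestion shown above:

print en.spelling.correct("comptuer")
>>> computer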
Shallow parsing, the grammatical structure of a sentence
NodeBox English Linguistics is able to do sentence structure analysis using a combination of Jason Wiener's tagger and NLTK's chunker. The tagger assigns a part-of-speech tag to each word in the sentence using Brill's lexicon. A postag is something like NN or VBP, marking words as nouns, verbs, determiners, pronouns, etc. The chunker is then able to group syntactic units in the sentence. A syntactic unit is, for example, a determiner followed by adjectives followed by a noun: the tasty little chicken is a syntactic unit.
The sentence.tag() command tags the given sentence. The return value is a list of (word, tag) tuples. However, when you print it out it will look like a string.
print en.sentence.tag("this is so cool") >>> this/DT is/VBZ so/RB cool/JJ
There are lots of part-of-speech tags (the Penn Treebank tag set) and it takes some time getting to know them. The sentence.tag_description() command returns a (description, examples) tuple for a given tag:
print en.sentence.tag_description("NN") >>> ('noun, singular or mass', 'tiger, chair, laughter')
The sentence.chunk() command returns the chunked sentence:
from pprint import pprint
pprint( en.sentence.chunk("we are going to school") )
>>> [['SP',
>>>   ['NP', ('we', 'PRP')],
>>>   ['AP',
>>>    ['VP', ('are', 'VBP'), ('going', 'VBG'), ('to', 'TO')],
>>>    ['NP', ('school', 'NN')]]]]
Now what does all this mean?
- NP: noun phrases, syntactic units describing a noun, for example: a big fish.
- VP: verb phrases, units of verbs and auxiliaries, for example: are going to.
- AP: a verb/argument structure: a verb phrase and the noun phrase it influences.
- SP: a subject structure: a noun phrase which is the executor of a verb phrase or verb/argument structure.
A handy sentence.traverse(sentence, cmd) command lets you feed a chunked sentence to your own command chunk by chunk:
s = "we are going to school" def callback(chunk, token, tag): if chunk != None : print en.sentence.tag_description(chunk)[0].upper() if chunk == None : print token, "("+en.sentence.tag_description(tag)[0]+")" en.sentence.traverse(s, callback) >>> SUBJECT PHRASE >>> NOUN PHRASE >>> we (pronoun, personal) >>> VERB PHRASE AND ARGUMENTS >>> VERB PHRASE >>> are (verb, non-3rd person singular present) >>> going (verb, gerund or present participle) >>> to (infinitival to) >>> NOUN PHRASE >>> school (noun, singular or mass)
An even handier sentence.find(sentence, pattern) command lets you find patterns of text in a sentence:
s = "The world is full of strange and mysterious things." print en.sentence.find(s, "JJ and JJ NN") >>> [[('strange', 'JJ'), ('and', 'CC'), >>> ('mysterious', 'JJ'), ('things', 'NNS')]]
The returned list contains all chunks of text that matched the pattern. In the example above it retrieved all chunks of the form adjective + and + adjective + noun. Notice that when you use something like NN (noun) in your pattern, NNS (plural nouns) are returned as well.
s = "The hairy hamsters visited the cruel dentist." matches = en.sentence.find(s, "JJ NN") print matches >>> [[('hairy', 'JJ'), ('hamsters', 'NNS')], [('cruel', 'JJ'), ('dentist', 'NN')]]
An optional chunked parameter can be set to False to return strings instead of token/tag tuples. You can put pieces of the pattern between brackets to make them optional, or use wildcards:
s = "This makes the parser an extremely powerful tool." print en.sentence.find(s, "(extreme*) (JJ) NN", chunked=False) >>> ['parser', 'extremely powerful tool']
Finally, if you feel up to it, you can feed the following command a list of your own regular expression units to chunk. Mine are pretty basic, as I'm not a linguist.
print en.sentence.chunk_rules()
Summarisation of text to keywords
NodeBox English Linguistics is able to strip keywords from a given text.
en.content.keywords(txt, top=10, nouns=True, singularize=True, filters=[])
The content.keywords() command guesses a list of words that frequently occur in the given text. The return value is a list (length defined by top) of (count, word) tuples. When nouns is True, returns only nouns. The command furthermore ignores connectives, numbers and tags. When singularize is True, attempts to singularize nouns in the text. The optional filters parameter is a list of words which the command should ignore.
So, assuming you want to summarise web content, you can do the following:
from urllib import urlopen
html = urlopen("http://news.bbc.co.uk/").read()
meta = ["news", "health", "uk", "version", "weather", "video", "sport", "return", "read", "help"]
print en.content.keywords(html, filters=meta)
>>> [(6, 'funeral'), (5, 'beirut'), (3, 'war'), (3, 'service'), (3, 'radio'),
>>> (3, 'lebanon'), (3, 'islamist'), (3, 'function'), (3, 'female')]
Regressive Imagery Dictionary, psychological content analysis
NodeBox English Linguistics is able to do psychological content analysis using John Wiseman's Python implementation of the Regressive Imagery Dictionary. The RID assigns scores to primary, secondary and emotional process thoughts in a text.
- Primary: free-form associative thinking involved in dreams and fantasy
- Secondary: logical, reality-based and focused on problem solving
- Emotions: expressions of fear, sadness, hate, affection, etc.
en.content.categorise(str)
The content.categorise() command returns a sorted list of categories found in the text. Each item in the list has the following properties:
- item.name: the name of the category
- item.count: the number of words in the text that fall into this category
- item.words: a list of words from the text that fall into this category
- item.type: the type of category, either "primary", "secondary" or "emotions".
Let's run a little test with Lucas' Ideas from the Heart text:
txt = open("heart.txt").read() summary = en.content.categorise(txt) print summary.primary print summary.secondary print summary.emotions >>> 0.290155440415 >>> 0.637305699482 >>> 0.0725388601036 # Lucas' text has a 64% secondary value.
# The top 5 categories in the text:
for category in summary[:5]:
    print category.name, category.count
>>> instrumental behavior 30
>>> abstraction 30
>>> social behavior 28
>>> temporal references 24
>>> concreteness 18
# Words in the top "instrumental behavior" category:
print summary[0].words
>>> ['students', 'make', 'students', 'reached', 'countless',
>>> 'student', 'workshop', 'workshop', 'students', 'finish',
>>> 'spent', 'produce', 'using', 'work', 'students', 'successful',
>>> 'workshop', 'students', 'pursue', 'skills', 'use',
>>> 'craftsmanship', 'use', 'using', 'workshops', 'workshops',
>>> 'result', 'students', 'workshops', 'student']
You can find all the categories for primary, secondary and emotional scores in the en.rid.primary, en.rid.secondary and en.rid.emotions lists.
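To see which categories exist, you can print those lists directly; the attribute names come straight from the sentence above:

print en.rid.primary
print en.rid.secondary
print en.rid.emotions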
Ogden's basic English words
NodeBox English Linguistics comes bundled with Charles K. Ogden's list of basic English words: a set of 2000 words that can express 90% of the concepts in English. The list is stored as en.basic.words. It can be filtered for nouns, verbs, adjectives and adverbs:
print en.basic.words
>>> ['a', 'able', 'about', 'account', 'acid', 'across', ... ]
print en.basic.verbs
>>> ['account', 'act', 'air', 'amount', 'angle', 'answer', ... ]
include("util/comment.php"); ?>