How to Utilize WordNet with Python NLTK for Natural Language Processing (NLP)
This article explains how to utilize the WordNet lexical database within the NLTK (Natural Language Toolkit) framework.
We will cover the fundamental use of WordNet, including finding synonyms, antonyms, hypernyms, hyponyms, and holonyms for words. Additionally, we will explore how to determine the similarities between two words.
WordNet, a network of words, connects terms through various linguistic relationships such as synonyms, hypernyms, and hyponyms. It encompasses a vast collection of English vocabulary, where words are interconnected and organized into sets based on their meanings.
Nouns, verbs, adjectives, and adverbs are organized into groups of cognitive synonyms called synsets, with each synset representing a unique concept. These synsets are connected through various conceptual-semantic and lexical relationships.
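WordNet ships as one of the corpora available through NLTK. If it is not yet installed locally, it can be downloaded once by running nltk.download('wordnet') after importing nltk.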
Loading WordNet Corpus
Here, we import the WordNet reader and look up every synset that contains a specific word.
from nltk.corpus import wordnet as wn

print(wn.synsets('good'))
'''
Output:
[Synset('good.n.01'), Synset('good.n.02'), Synset('good.n.03'),
 Synset('commodity.n.01'), Synset('good.a.01'), Synset('full.s.06'),
 Synset('good.a.03'), Synset('estimable.s.02'), Synset('beneficial.s.01'),
 Synset('good.s.06'), Synset('good.s.07'), Synset('adept.s.01'),
 Synset('good.s.09'), Synset('dear.s.02'), Synset('dependable.s.04'),
 Synset('good.s.12'), Synset('good.s.13'), Synset('effective.s.04'),
 Synset('good.s.15'), Synset('good.s.16'), Synset('good.s.17'),
 Synset('good.s.18'), Synset('good.s.19'), Synset('good.s.20'),
 Synset('good.s.21'), Synset('well.r.01'), Synset('thoroughly.r.02')]
'''
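The synsets() function returns every synset that contains the specified word, here "good", across all parts of speech. A synset is a set of synonymous lemmas that share a single meaning. Synsets are identified by a three-part name in the format word.pos.nn, where word is one of the synset's lemmas, pos is the part-of-speech tag, and nn is a two-digit sense number. This is why the list above also contains entries such as commodity.n.01: "good" is one of the lemmas of that synset.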
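To make the naming convention concrete, here is a minimal sketch that splits a synset name into its three parts with plain string handling. The helper parse_synset_name is a hypothetical name introduced for illustration; it is not part of NLTK.

```python
def parse_synset_name(name):
    """Split a name like 'good.n.01' into (word, pos, sense_number)."""
    # Split from the right so a word that itself contains dots stays intact.
    word, pos, sense = name.rsplit(".", 2)
    return word, pos, int(sense)


print(parse_synset_name("good.n.01"))       # ('good', 'n', 1)
print(parse_synset_name("commodity.n.01"))  # ('commodity', 'n', 1)
```

The sense number is converted to an integer here only for convenience; WordNet itself writes it as a zero-padded string.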
The synsets() function also accepts a second parameter, pos, which restricts the results to a single part of speech (POS).
The POS tags are as follows: 'a' for adjectives (ADJ), 's' for adjective satellites (ADJ_SAT), 'r' for adverbs (ADV), 'n' for nouns (NOUN), and 'v' for verbs (VERB). The names in parentheses are the constants exposed by the WordNet reader, so pos=wn.NOUN and pos='n' are equivalent.
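As an illustration of how these tags partition the results, the following sketch counts a handful of the synset names from the "good" lookup by POS tag. The POS_LABELS dictionary and the count_by_pos function are hypothetical helpers, not NLTK APIs, and the names are hard-coded so that the snippet runs without the corpus.

```python
from collections import Counter

# Human-readable labels for the single-letter WordNet POS tags.
POS_LABELS = {
    "a": "adjective",
    "s": "adjective satellite",
    "r": "adverb",
    "n": "noun",
    "v": "verb",
}


def count_by_pos(synset_names):
    """Count synset names such as 'good.n.01' per part of speech."""
    counts = Counter()
    for name in synset_names:
        pos = name.rsplit(".", 2)[1]  # middle field of word.pos.nn
        counts[POS_LABELS[pos]] += 1
    return dict(counts)


sample = ["good.n.01", "commodity.n.01", "good.a.01", "full.s.06", "well.r.01"]
print(count_by_pos(sample))
# {'noun': 2, 'adjective': 1, 'adjective satellite': 1, 'adverb': 1}
```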
# Equivalent: print(wn.synsets('good', pos=wn.NOUN))
print(wn.synsets('good', pos='n'))
'''
Output:
[Synset('good.n.01'), Synset('good.n.02'), Synset('good.n.03'), Synset('commodity.n.01')]
'''

my_word = wn.synset('good.n.01')
print(my_word.definition())
# Output: benefit
print(my_word.examples())
'''
Output:
['for your own good', "what's the good of worrying?"]
'''

my_word = wn.synset('good.n.02')
print(my_word.definition())
# Output: moral excellence or admirableness
print(my_word.examples())
'''
Output:
['there is much good to be found in people']
'''

my_word = wn.synset('good.n.03')
print(my_word.definition())
# Output: that which is pleasing or valuable or useful
print(my_word.examples())
'''
Output:
['weigh the good against the bad', 'among the highest goods of all are happiness and self-realization']
'''

my_word = wn.synset('good.a.01')
print(my_word.definition())
# Output: having desirable or positive qualities especially those suitable for a thing specified
print(my_word.examples())
'''
Output:
['a good report card', 'when she was good she was very very good',
 'a good knife is one good for cutting', 'this stump will make a good picnic table',
 'a good check', 'a good joke', 'a good exterior paint', 'a good secretary',
 'a good dress for the office']
'''

my_word = wn.synset('good.a.03')
print(my_word.definition())
# Output: morally admirable
print(my_word.examples())
# Output: []
SYNONYMS & ANTONYMS
We can use the lemmas() method of a synset to obtain the synonyms that belong to that specific synset.
Synonyms
my_word = wn.synset('good.n.01')
print(my_word.lemmas())
# Output: [Lemma('good.n.01.good')]
print(my_word.lemmas()[0].name())
# Output: good
print(my_word.lemmas()[0].antonyms())
# Output: []

my_word = wn.synset('good.n.02')
print(my_word.lemmas())
# Output: [Lemma('good.n.02.good'), Lemma('good.n.02.goodness')]
print(my_word.lemmas()[0].name())
# Output: good
print(my_word.lemmas()[1].name())
# Output: goodness
Antonyms
In WordNet, antonymy is a relationship between lemmas rather than between synsets. So we first obtain the lemmas (synonyms) of a synset with the lemmas() method, and then call antonyms() on each lemma.
my_word = wn.synset('good.n.02')
print(my_word.lemmas())
# Output: [Lemma('good.n.02.good'), Lemma('good.n.02.goodness')]

print(my_word.lemmas()[0].name())
# Output: good
print(my_word.lemmas()[0].antonyms())
# Output: [Lemma('evil.n.03.evil')]
print(my_word.lemmas()[0].antonyms()[0].name())
# Output: evil

print(my_word.lemmas()[1].name())
# Output: goodness
print(my_word.lemmas()[1].antonyms())
# Output: [Lemma('evil.n.03.evilness')]
print(my_word.lemmas()[1].antonyms()[0].name())
# Output: evilness
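The two steps above can be wrapped in a small loop that gathers the synonyms and antonyms of a word across all of its synsets. The sketch below is one possible way to do it, not an NLTK function: collect_synonyms_and_antonyms is a hypothetical helper, and it receives the WordNet reader as an argument. To keep the snippet runnable without the corpus, it is exercised with a hand-built stand-in that mimics only the synsets(), lemmas(), name(), and antonyms() calls used in this article.

```python
from types import SimpleNamespace


def collect_synonyms_and_antonyms(word, reader):
    """Gather lemma names and antonym names across every synset of `word`."""
    synonyms, antonyms = set(), set()
    for synset in reader.synsets(word):
        for lemma in synset.lemmas():
            synonyms.add(lemma.name())
            for antonym in lemma.antonyms():
                antonyms.add(antonym.name())
    return synonyms, antonyms


# Hypothetical stand-in objects (an assumption for illustration only).
def _lemma(name, antonym_names=()):
    ants = [SimpleNamespace(name=lambda n=n: n, antonyms=lambda: [])
            for n in antonym_names]
    return SimpleNamespace(name=lambda: name, antonyms=lambda: ants)


_good_n_02 = SimpleNamespace(
    lemmas=lambda: [_lemma("good", ["evil"]), _lemma("goodness", ["evilness"])]
)
stub_reader = SimpleNamespace(
    synsets=lambda word: [_good_n_02] if word == "good" else []
)

synonyms, antonyms = collect_synonyms_and_antonyms("good", stub_reader)
print(sorted(synonyms))  # ['good', 'goodness']
print(sorted(antonyms))  # ['evil', 'evilness']
```

With the corpus installed, the same function should work when the wn object imported earlier is passed as the reader, in which case the sets will cover every sense of the word rather than the single synset modelled here.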
SIMILARITY BETWEEN TWO WORDS
NLTK offers various methods to measure similarity. These include:
- Path Similarity: Scores two word senses by the shortest path that links them within the is-a (hypernym/hyponym) hierarchy. The score lies between 0 and 1, and a sense compared with itself scores 1.
- Leacock-Chodorow (LCH) Similarity: Scores two word senses from the shortest path between them (as in Path Similarity) and the maximum depth of the taxonomy in which the senses occur, computed as -log(p / 2d), where p is the shortest path length and d is the taxonomy depth.
- Wu-Palmer (WUP) Similarity: Offers a similarity score by considering the depth of both word senses in the taxonomy and the depth of their closest shared ancestor (Least Common Subsumer).
- Resnik (RES) Similarity: Provides a similarity score based on the Information Content (IC) of the closest shared ancestor (Least Common Subsumer) of the two word senses.
- Jiang-Conrath (JCN) Similarity: Combines the IC of the Least Common Subsumer (lcs) with the IC of the two input senses as 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)).
- Lin Similarity: Uses the same three IC values as Jiang-Conrath but as a ratio, 2 * IC(lcs) / (IC(s1) + IC(s2)).
The Resnik, Jiang-Conrath, and Lin methods take an information content dictionary as an additional argument; the other three do not.
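To show how the path-based scores arise, here is a minimal sketch that computes Path and Wu-Palmer similarity on a tiny hand-made taxonomy. The PARENT tree and all function names are assumptions made for illustration. The sketch uses the textbook formulas on a single-parent tree with depth counted in nodes, whereas NLTK works on the full WordNet graph, so the numbers will not match those from real synsets.

```python
# Toy is-a hierarchy (child -> parent); an assumption for illustration only.
PARENT = {
    "animal": None,
    "carnivore": "animal",
    "canine": "carnivore",
    "feline": "carnivore",
    "dog": "canine",
    "cat": "feline",
}


def ancestors(node):
    """Return the chain from `node` up to the root, starting with `node`."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(ancestors(node))


def least_common_subsumer(a, b):
    """Deepest node that is an ancestor of (or equal to) both a and b."""
    b_chain = set(ancestors(b))
    for node in ancestors(a):
        if node in b_chain:
            return node
    raise ValueError("no common ancestor")


def path_similarity(a, b):
    """1 / (number of edges on the shortest is-a path + 1)."""
    lcs = least_common_subsumer(a, b)
    edges = (depth(a) - depth(lcs)) + (depth(b) - depth(lcs))
    return 1 / (edges + 1)


def wup_similarity(a, b):
    """2 * depth(lcs) / (depth(a) + depth(b))."""
    lcs = least_common_subsumer(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(path_similarity("dog", "cat"))  # 0.2
print(wup_similarity("dog", "cat"))   # 0.5
print(path_similarity("dog", "dog"))  # 1.0
```

In this toy tree, dog and cat are four edges apart and meet at carnivore (depth 2), which gives 1 / 5 for Path and 4 / 8 for Wu-Palmer. A deeper shared ancestor raises the Wu-Palmer score, which is why closely related senses score higher in the much deeper real hierarchy.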
print(wn.synsets('bad'))
'''
Output:
[Synset('bad.n.01'), Synset('bad.a.01'), Synset('bad.s.02'), Synset('bad.s.03'),
 Synset('bad.s.04'), Synset('regretful.a.01'), Synset('bad.s.06'), Synset('bad.s.07'),
 Synset('bad.s.08'), Synset('bad.s.09'), Synset('bad.s.10'), Synset('bad.s.11'),
 Synset('bad.s.12'), Synset('bad.s.13'), Synset('bad.s.14'), Synset('badly.r.05'),
 Synset('badly.r.06')]
'''

word_1 = wn.synset('good.n.01')
word_2 = wn.synset('bad.n.01')
print(word_1.wup_similarity(word_2))
# Output: 0.666666666667
print(word_2.wup_similarity(word_1))
# Output: 0.666666666667

word_1 = wn.synset('good.n.01')
word_2 = wn.synset('evil.n.01')
print(word_1.wup_similarity(word_2))
# Output: 0.25

word_1 = wn.synset('bad.n.01')
word_2 = wn.synset('evil.n.01')
print(word_1.wup_similarity(word_2))
# Output: 0.285714285714

print(wn.synsets('eat'))
'''
Output:
[Synset('eat.v.01'), Synset('eat.v.02'), Synset('feed.v.06'), Synset('eat.v.04'),
 Synset('consume.v.05'), Synset('corrode.v.01')]
'''
print(wn.synsets('sleep'))
'''
Output:
[Synset('sleep.n.01'), Synset('sleep.n.02'), Synset('sleep.n.03'), Synset('rest.n.05'),
 Synset('sleep.v.01'), Synset('sleep.v.02')]
'''

word_1 = wn.synset('eat.v.01')
word_2 = wn.synset('sleep.v.01')
print(word_1.wup_similarity(word_2))
# Output: 0.25

word_1 = wn.synset('dog.n.01')
word_2 = wn.synset('cat.n.01')
print(word_1.wup_similarity(word_2))
# Output: 0.857142857143
print(word_1.path_similarity(word_2))
# Output: 0.2
print(word_1.lch_similarity(word_2))
# Output: 2.02814824729

word_1 = wn.synset('ship.n.01')
word_2 = wn.synset('boat.n.01')
print(word_1.wup_similarity(word_2))
# Output: 0.909090909091
print(word_1.path_similarity(word_2))
# Output: 0.333333333333
print(word_1.lch_similarity(word_2))
# Output: 2.53897387106
HYPERNYMS, HYPONYMS, & HOLONYMS
All synsets are linked to one another through different semantic relationships. Some examples of these relationships are:
- Hypernyms: Y is a hypernym of X if every X is a (kind of) Y.
- Hyponyms: Y is a hyponym of X if every Y is a (kind of) X.
- Holonyms: Y is a holonym of X if X is a part of Y.
In the example code below, we can see the following:
- Canine is a broader category that includes dogs. According to the definition of a hypernym, Canine (Y) is a hypernym of Dog (X) because every Dog (X) is a type of Canine (Y).
- Basenji is a breed of hunting dog. Following the definition of a hyponym, Basenji (Y) is a hyponym of Dog (X) because every Basenji (Y) is a type of Dog (X).
- Canis is a genus in the Canidae family that includes species such as wolves, dogs, and coyotes. These species are known for their moderate to large size, strong skulls and teeth, long legs, and relatively short ears and tails. Based on the definition of a holonym, Canis (Y) is a holonym of Dog (X) because Dog (X) is a member of the Canis (Y) genus. WordNet distinguishes member, part, and substance holonyms; the code uses member_holonyms() because a dog is a member of the genus rather than a physical part of it.
dog = wn.synset('dog.n.01')

print(dog.hypernyms())
'''
Output:
[Synset('canine.n.02'), Synset('domestic_animal.n.01')]
'''

print(dog.hyponyms())
'''
Output:
[Synset('basenji.n.01'), Synset('corgi.n.01'), Synset('cur.n.01'),
 Synset('dalmatian.n.02'), Synset('great_pyrenees.n.01'), Synset('griffon.n.02'),
 Synset('hunting_dog.n.01'), Synset('lapdog.n.01'), Synset('leonberg.n.01'),
 Synset('mexican_hairless.n.01'), Synset('newfoundland.n.01'), Synset('pooch.n.01'),
 Synset('poodle.n.01'), Synset('pug.n.01'), Synset('puppy.n.01'), Synset('spitz.n.01'),
 Synset('toy_dog.n.01'), Synset('working_dog.n.01')]
'''

print(dog.member_holonyms())
'''
Output:
[Synset('canis.n.01'), Synset('pack.n.06')]
'''
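The hypernym and hyponym definitions are mirror images of each other, which the following sketch tries to make explicit. It stores a few of the hypernym links from the output above in a plain dictionary and derives hyponyms by inverting them. The HYPERNYMS table and both function names are hypothetical and cover only a small subset of the real data.

```python
# Hypernym edges copied from a subset of the output above (child -> parents).
HYPERNYMS = {
    "basenji": ["dog"],
    "corgi": ["dog"],
    "dog": ["canine", "domestic_animal"],
}


def hyponyms_of(concept):
    """Invert the hypernym edges: every child that lists `concept` as a parent."""
    return sorted(child for child, parents in HYPERNYMS.items()
                  if concept in parents)


def all_hypernyms(concept):
    """Follow hypernym edges transitively and collect every ancestor."""
    found = []
    frontier = list(HYPERNYMS.get(concept, []))
    while frontier:
        node = frontier.pop(0)
        if node not in found:
            found.append(node)
            frontier.extend(HYPERNYMS.get(node, []))
    return found


print(hyponyms_of("dog"))        # ['basenji', 'corgi']
print(all_hypernyms("basenji"))  # ['dog', 'canine', 'domestic_animal']
```

Because the is-a relation is transitive, a basenji is also a canine and a domestic animal even though only dog is its direct hypernym. On real synsets, NLTK's closure() method on a synset can be used for this kind of transitive traversal.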