Lemmatizing English Words in XML-Formatted Sentences with Python
Python 2.7, NLTK 3.0
The input XML file looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<sentences version="1.0">
<item id="1" asks-for="cause" most-plausible-alternative="1">
<p>my body cast a shadow over the grass . </p>
<a1>the sun be rise . </a1>
<a2>the grass be cut . </a2>
</item>
<item id="2" asks-for="cause" most-plausible-alternative="1">
<p>the woman tolerate the woman friend 's difficult behavior . </p>
<a1>the woman know the woman friend be go through a hard time . </a1>
<a2>the woman felt that the woman friend take advantage of her kindness . </a2>
</item>
...
</sentences>
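To make the structure concrete, here is a minimal sketch that loads one item from the sample with ElementTree and lists its attributes and child sentences. The fragment is inlined as a string so that the snippet runs without an input file; the names `SAMPLE` and `rows` are only illustrative.

```python
import xml.etree.ElementTree as ET

# One <item> copied from the sample input above, inlined for convenience.
SAMPLE = """<sentences version="1.0">
<item id="1" asks-for="cause" most-plausible-alternative="1">
<p>my body cast a shadow over the grass . </p>
<a1>the sun be rise . </a1>
<a2>the grass be cut . </a2>
</item>
</sentences>"""

root = ET.fromstring(SAMPLE)

rows = []
for item in root:            # each <item> element
    for sentence in item:    # its <p>, <a1> and <a2> children
        rows.append((item.get("id"), item.get("asks-for"),
                     sentence.tag, sentence.text.strip()))

for row in rows:
    print(row)
```

Each item carries its metadata as attributes, and the sentences to be lemmatized are the text of its child elements.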
Python Code
# This setting is only needed to work around the 'encoding utf-8' error (Python 2 only)
import sys
reload(sys)
sys.setdefaultencoding("utf-8")

import xml.etree.cElementTree as ET  # library for XML processing
from nltk.tokenize import word_tokenize  # library for word tokenization
from nltk.stem import WordNetLemmatizer  # library for lemmatization

wordnet_lemmatizer = WordNetLemmatizer()
tree = ET.parse('input.xml')  # parse the XML tree from input.xml
root = tree.getroot()  # get the root element of the tree
for item_of_root in root:  # for each item
    for sentence in item_of_root:  # for each sentence in the item
        words = word_tokenize(sentence.text)  # split the sentence into words
        sentenceNew = ""  # container for the new lemmatized sentence
        for word in words:  # for each word in the sentence
            # lemmatize the word, treating it as a verb
            lamWord = wordnet_lemmatizer.lemmatize(word, pos='v')
            sentenceNew += lamWord + ' '  # append the lemmatized word to the container
        sentence.text = sentenceNew  # store the new sentence in the tree
tree.write('output.xml')  # write the lemmatized tree to a file
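The same loop can also be packaged as a function that takes the lemmatizer as a parameter, which makes it easy to try without NLTK or the WordNet data installed. The sketch below is an illustration under several assumptions: `lemmatize_tree`, `toy_lemmatize` and `TOY_LEMMAS` are hypothetical names, the lookup table is a toy stand-in that does not try to reproduce WordNet's output, and `str.split` replaces `word_tokenize` on the assumption that the input is already space-separated, as in the sample.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for WordNetLemmatizer: a one-entry lookup table.
TOY_LEMMAS = {"felt": "feel"}

def toy_lemmatize(word):
    """Return the table entry for word, or the word itself if it is unknown."""
    return TOY_LEMMAS.get(word, word)

def lemmatize_tree(root, lemmatize):
    """Rewrite the text of every sentence element under each item, in place."""
    for item in root:
        for sentence in item:
            if sentence.text is None:  # skip empty elements
                continue
            words = sentence.text.split()  # assumes space-separated tokens
            sentence.text = " ".join(lemmatize(w) for w in words)
    return root

# One sentence taken from the sample input above.
SAMPLE = """<sentences version="1.0">
<item id="2" asks-for="cause" most-plausible-alternative="1">
<a2>the woman felt that the woman friend take advantage of her kindness . </a2>
</item>
</sentences>"""

root = lemmatize_tree(ET.fromstring(SAMPLE), toy_lemmatize)
result = root.find("item/a2").text
print(result)
```

With NLTK available, the real lemmatizer could be passed in instead, for example `lambda w: wordnet_lemmatizer.lemmatize(w, pos='v')`. Joining with `" ".join` also avoids the trailing space that string concatenation leaves at the end of each sentence.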
References
The ElementTree XML API – Python 2.7.12 Documentation
Dive Into NLTK, Part I: Getting Started with NLTK
Dive Into NLTK, Part II: Sentence Tokenize and Word Tokenize
Dive Into NLTK, Part IV: Stemming and Lemmatization