1.Common NLP Techniques

  • Tokenization (word segmentation): Converting raw text into separate words or tokens. Word boundaries and punctuation vary across natural languages, so this can be a non-trivial task.
  • Parsing & Tagging: Parsing creates a tree-like structure over the words, focusing on the relationships between them; tagging attaches additional information to tokens.
  • Stemming: Reducing words to their base form by applying rules.
  • Lemmatization: Reducing words to their base dictionary form (called the lemma).
  • Stop Word Filtering: Removing common, trivial words to reduce clutter before analysis.
  • Parts of Speech Tagging: Determining the part of speech of each word and tagging it accordingly.
  • Named Entity Recognition: Identifying proper names in the text, e.g. names of people and places.

Some open-source tools:

image-20220725152300585

1.1 Tokenization

Tokenization is about breaking text into components (tokens)

  • Tokenization uses prefix, suffix and infix characters, and punctuation rules to split text into tokens.
  • Tokens are pieces of the original text; no transformation is performed.
  • Tokens form the building blocks of a “Doc” object
  • Tokens have a variety of useful attributes and methods

tokenization-9b27c0f6fe98dcb26239eba4d3ba1f3d

1.2 Stemming

  • Technique of reducing words to their base form by applying rules. The rules can be crude, such as chopping letters off the end until the stem is reached, or a bit more sophisticated
  • For example, words like boat, boater and boating may reduce to the same stem, which helps if you are searching for occurrences of a word
  • One of the most common, and effective, stemming tools is Porter’s Algorithm, developed by Martin Porter in 1980
  • The algorithm employs five phases of word reduction, each with its own set of mapping rules

In the first phase, simple suffix mapping rules are defined, such as:

image-20220723122800290

From a given set of stemming rules, only one rule is applied, based on the longest matching suffix S1.

image-20220723122930964

More sophisticated phases consider the length/complexity of the word before applying a rule. For example:

image-20220723123129742
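As an illustration of the longest-suffix idea, here is a minimal sketch in Python (a toy rule set in the spirit of the first phase, not Porter's full algorithm):

# Toy version of "apply the single rule whose suffix S1 is the longest match".
rules = {'sses': 'ss', 'ies': 'i', 'ss': 'ss', 's': ''}

def apply_longest_suffix_rule(word):
    # Try candidate suffixes from longest to shortest; the first match wins.
    for suffix in sorted(rules, key=len, reverse=True):
        if word.endswith(suffix):
            return word[:len(word) - len(suffix)] + rules[suffix]
    return word

for w in ['caresses', 'ponies', 'caress', 'cats']:
    print(w, '---->', apply_longest_suffix_rule(w))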

  • Snowball is the name of a stemming language also developed by Martin Porter. (The algorithm is also known as the Porter2 stemmer. It is almost universally regarded as better than the original Porter stemmer, even by Porter himself; Snowball adds a number of refinements on top of Porter, and its output differs from Porter's in roughly 5% of cases.)
  • The algorithm used here is more accurately called the “English Stemmer” or “Porter2 Stemmer”.
  • It offers a slight improvement over the original Porter stemmer, both in logic and speed

1.3 Lemmatization

  • In contrast to stemming, lemmatization looks beyond word reduction, and considers a language’s full vocabulary to apply a morphological analysis to words.
  • The lemma of ‘was’ is ‘be’ and the lemma of ‘mice’ is ‘mouse’. Further, the lemma of ‘meeting’ might be ‘meet’ or ‘meeting’ depending on its use in a sentence
  • Lemmatization is typically seen as much more informative than simple stemming
  • Some libraries such as Spacy have opted to support only lemmatization and do not support stemming techniques
  • Lemmatization looks at the surrounding text to determine a given word’s part of speech; it does not categorize phrases

1.4 Stop Word Filtering

  • Words like “a” and “the” appear so frequently that they don’t require tagging as thoroughly as nouns, verbs and modifiers
  • We call these stop words, and they can be filtered from the text to be processed.
  • NLP libraries typically hold a list of stop words. For example, Spacy holds a built-in list of some 305 English stop words

1.5 Parts of Speech Tagging

  • Parts of Speech tagging is a technique of using linguistic knowledge to add useful information to tokens (words)
  • Parts of speech are categories of words in natural language text, governed by the grammar.
  • In the English language there are ten parts of speech: noun, pronoun, adjective, verb, adverb, preposition, conjunction, interjection, determiner, and article.
  • For example, in English, Parts of Speech mean categorizing tokens as noun, verb, adjective, etc. Most NLP libraries have additional tags such as plural noun, past tense of a verb etc.
  • The premise is that the same word used in a different context may mean something completely different.
  • In NLP, POS tagging is essential for building parse trees, which are used for identifying named entities and noun phrases, and for extracting relationships between words.

1.6 Named Entity Recognition

  • Named entities are real-world objects (e.g. persons, organizations, cities and countries, etc.) that can be given proper names
  • Named-entity recognition (NER) seeks to locate and classify named entities in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
  • NER helps to extract the main entities in a text and detect important information, which is crucial when dealing with a large dataset.

image-20220723130626663

Business Use Cases for NER:

  • Categorizing tickets in customer support
  • Gaining insights from customer feedback
  • Speeding up content recommendation
  • Processing resumes
  • Detecting fake news
  • Efficient search algorithms

2.Introduction to NLTK and Spacy

2.1 NLTK

  • NLTK (Natural Language Toolkit) is a very popular open-source NLP library.
  • Initially released in 2001, it is much older than Spacy (released in 2015); it was created essentially for teaching and research.
  • It also provides many functionalities, but includes less efficient implementations.

2.2 Spacy

  • Open Source Natural Language Processing Library.
  • Designed to effectively handle NLP tasks with the most efficient implementation of common algorithms
  • Designed to get things done
  • For many NLP tasks, Spacy only has one implemented method, choosing the most efficient algorithm currently available.
  • This means you often don’t have the option to choose other algorithms.
  • It is opinionated software!
  • For many common NLP tasks, Spacy is much faster and more efficient, at the cost of the user not being able to choose algorithmic implementations.
  • However, Spacy does not include pre-created models for some applications, such as sentiment analysis, which is typically easier to perform with NLTK.

NLTK vs Spacy (processing tests)

image-20220723131521938

image-20220723131625158
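The figures above come from the course material. As a rough sketch of how such a comparison could be run yourself (assuming nltk, spacy and the en_core_web_sm model are installed; note that nlp() runs spaCy's full pipeline, not just tokenization):

import time
import nltk
import spacy

nltk.download('punkt', quiet=True)        # tokenizer data used by word_tokenize
nlp = spacy.load('en_core_web_sm')

text = "Apple is looking at buying a U.K. startup for $1 billion. " * 100

start = time.time()
nltk_tokens = nltk.word_tokenize(text)
print('NLTK :', len(nltk_tokens), 'tokens in', round(time.time() - start, 4), 's')

start = time.time()
spacy_tokens = [t.text for t in nlp(text)]
print('spaCy:', len(spacy_tokens), 'tokens in', round(time.time() - start, 4), 's')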

Spacy works with a pipeline object

image-20220723131848369

  • The nlp() function from Spacy automatically takes raw text and performs a series of operations to tag, parse, and describe the text

Installing Spacy:

  • Download Spacy

    pip install Spacy -i https://pypi.doubanio.com/simple

  • Download the English language library and link it with Spacy (this command appears to download the spacy-model en_core_web_sm package as well; the conda commands below can be used instead)

    python -m spacy download en

  • Alternatively, in administrator mode, add the anaconda mirror channels and install the English spacy models packaged for conda:

    conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
    conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
    conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/
    conda config --set show_channel_urls yes

    conda install -c conda-forge spacy-model-en_core_web_sm
    conda install -c conda-forge spacy-model-en_core_web_md
    conda install -c conda-forge spacy-model-en_core_web_lg
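After installation, a quick load check (a minimal sketch, assuming en_core_web_sm was installed as above):

import spacy
nlp = spacy.load('en_core_web_sm')   # raises OSError if the model is missing
print(nlp.pipe_names)                # lists the pipeline components, e.g. tagger, parser, ner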

3.Spacy

3.1 Introduction

Some additional references:

  • The official spacy website: https://spacy.io/usage
  • image-20220725013503799

Import the spacy package and load an English model:

import spacy
nlp = spacy.load('en_core_web_sm')  # load an English model

Feed a sentence into the nlp() model:

doc = nlp(u'Apple is looking at buying a U.K. startup for $10 Billion')  # u denotes a unicode string
type(doc)  # the string has been split into tokens

image-20220725001718753

  • The doc object holds tokens
  • Each token of the original sentence has, among others, the following attributes:
    • token.pos: the part-of-speech ID (an integer)
    • token.pos_: the part-of-speech label (a string)
    • token.dep_: gives more information, such as the dependency relation
for token in doc:
    print('%-15s%-15s%-15s%-15s'%(token.text, token.pos, token.pos_, token.dep_))

image-20220725001855887

View the components contained in the nlp() pipeline:

nlp.pipeline  # inspect the pipeline

image-20220725001914976

The components are described as follows:

image-20220725002542229

Feed in another sentence and inspect the result:

doc2 = nlp("Apple isn't looking into buying \
startups in U.K. anymore.")

print('%-15s%-15s%-15s%-15s'%('token.text', 'token.pos', 'token.pos_', 'token.dep_'))
print('--------------------------------------------------------')
for token in doc2:
    print('%-15s%-15s%-15s%-15s'%(token.text, token.pos, token.pos_, token.dep_))

Note that spacy splits isn't into two parts, and distinguishes the abbreviation periods in U.K. from a sentence-ending period:

image-20220725002018647

A doc object can also be indexed; you can inspect the attributes of the indexed token and describe them with spacy.explain:

print(doc2[0], '--', type(doc2[0]))  # tokens can be indexed; check the type
print(doc2[0].pos_, doc2[0].dep_)    # part of speech and dependency label
print(spacy.explain(str(doc2[0].pos_)))
print(spacy.explain(str(doc2[0].dep_)))  # explain the corresponding tag

image-20220725002132515

  • Span objects: a span is a slice of a doc object
doc3 = nlp(u'Although commonly attributed to John Lennon from \
his song "Beautiful Boy", the phrase "Life is what \
happens to us while we are making other plans" was \
written by cartoonist Allen Saunders and published \
in Reader\'s Digest in 1957, when Lennon was 17.')

Slice out the quote from this sentence:

life_quote = doc3[16:30]
print(life_quote)

image-20220725002316336

Check the data type of the slice; it is a Span:

type(life_quote)

image-20220725002348218

  • doc.sents: splits the text into sentences, each of which is a Span object:
doc4 = nlp(u'This is first sentence.Hey, second sentence. \
Third sentence. Fourth sentence. Stupid sentence.')

for sentence in doc4.sents:
    print(sentence, type(sentence))  # split into sentences

image-20220725002724800

You can also use token.is_sent_start to check whether a token is the start of a sentence:

print(doc4[0], doc4[0].is_sent_start)  # is this token the start of a sentence?
print(doc4[5], doc4[5].is_sent_start)

3.2 Tokenization

How Spacy turns a sentence into tokens:

image-20220723122015779

import spacy
nlp = spacy.load('en_core_web_sm')
  • Splitting quotation marks:
doc = nlp(u'"We\'re moving to L.A.!"')
for token in doc:
    print(token, end = ' | ')

image-20220725014206009

  • Splitting email addresses and URLs:
doc2 = nlp(u"We're here to help! Send snail-mail, \
email support@oursite.com or visit us at \
http://www.oursite.com!")
for token in doc2:
    print(token, end = ' | ')

image-20220725014240074

  • Splitting currency symbols and amounts:
doc3 = nlp(u"I paid $50.23 for a used furniture.")
for token in doc3:
    print(token, end = ' | ')

image-20220725014300346

  • Handling periods in abbreviations:
doc4 = nlp(u"Let's visit St. Louis in U.S. next month")
for token in doc4:
    print(token, end = ' | ')

image-20220725014332054

Also note that spacy.tokens.doc.Doc objects do not support item assignment:

image-20220725014346835
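A minimal sketch of that immutability (assigning to a token position raises a TypeError):

doc4 = nlp(u"Let's visit St. Louis in U.S. next month")
try:
    doc4[0] = 'You'          # Doc objects do not support item assignment
except TypeError as e:
    print(e)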

Named Entities

  • entity.text: the entity text
  • entity.label_: the entity label
doc8 = nlp(u"Apple is trying to build a new factory in Hong Kong in 2021")
for token in doc8:
    print(token, end = ' | ')
print('\n')

# Named Entities
for entity in doc8.ents:
    print('%-15s%-15s%-30s'%(entity.text, entity.label_,
                             spacy.explain(str(entity.label_))))

There are three named entities (Span objects), and an explanation of each label can be printed:

image-20220725014521268
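Each entity is a Span, so character offsets into the original text are also available (a small addition to the loop above):

for entity in doc8.ents:
    # start_char / end_char are offsets into the original text
    print(entity.text, entity.label_, entity.start_char, entity.end_char)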

Noun Chunks: noun phrase chunking, i.e. extracting noun phrases

doc9 = nlp(u"Autonomous cars shift insurance \
liability toward manufacturers.")
for chunk in doc9.noun_chunks:
    print(chunk.text)

image-20220725014551438
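Each noun chunk also exposes its syntactic head via chunk.root (a small sketch based on the same doc9):

for chunk in doc9.noun_chunks:
    # chunk.root is the token that connects the chunk to the rest of the parse
    print(chunk.text, '|', chunk.root.text, '|', chunk.root.dep_)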

Visualization

Visualizing dependency relations:

from spacy import displacy
displacy.render(doc9, style = 'dep', jupyter = True,
                options = {'distance':90})

image-20220725014655913

Visualizing named entities:

doc8 = nlp(u"Apple is trying to build a new factory in \
Hong Kong in 2021")
displacy.render(doc8, style = 'ent', jupyter = True)

image-20220725014709095
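Outside a Jupyter notebook, displacy can serve the same visualization over HTTP instead (a minimal sketch; the call blocks until interrupted and serves on http://localhost:5000 by default):

from spacy import displacy
displacy.serve(doc8, style = 'ent')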

3.3 Lemmatization

import spacy
nlp = spacy.load('en_core_web_sm')

Use token.lemma_ to view the lemma:

def show_lemmas(doc):
    print('%-15s%-15s%-15s%-15s%-15s'%('token.text', 'token.pos',
                                       'token.pos_', 'token.dep_',
                                       'token.lemma_'))
    for token in doc:
        print('%-15s%-15s%-15s%-15s%-15s'%(token.text, token.pos,
                                           token.pos_, token.dep_,
                                           token.lemma_))

Test a first sentence:

doc = nlp("I am a runner running in a race because \
I love to run and I ran earlier this morning")
show_lemmas(doc)

image-20220725122044343

  • Irregular noun plurals are reduced to their singular base form:
doc1 = nlp(u'I saw eighteen mice today')
show_lemmas(doc1)

image-20220725122107682

  • Forms of the verb "be" are restored to the base form:
doc2 = nlp(u"That's an apple.")
show_lemmas(doc2)  # 's --> be

image-20220725122159013

  • Adverbs are not reduced to their adjective form, because the meaning differs:
doc4 = nlp(u"That was ridiculously easy and easily done and fairly")
show_lemmas(doc4)

image-20220725122233154

3.4 Stop Words

nlp.Defaults.stop_words is the default list of stop words:

print(nlp.Defaults.stop_words)

image-20220725164326466

nlp.vocab['whenever']

image-20220725164352151

  • Use is_stop to check whether a word is a stop word:
nlp.vocab['whenever'].is_stop

image-20220725164437764

  • Check how many stop words there are:
len(nlp.Defaults.stop_words)

image-20220725164507084

# Two ways to add a stop word:
nlp.Defaults.stop_words.add('btu')
nlp.vocab['btw'].is_stop = True
print(nlp.vocab['btu'].is_stop)
print(nlp.vocab['btw'].is_stop)

image-20220725164529232
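Stop words can also be removed again, mirroring the two methods above (a small sketch that undoes the changes):

nlp.Defaults.stop_words.discard('btu')   # remove from the default set
nlp.vocab['btu'].is_stop = False         # also clear the lexeme flag
nlp.vocab['btw'].is_stop = False
print(nlp.vocab['btu'].is_stop, nlp.vocab['btw'].is_stop)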

3.5 Parts of Speech Tagging

Use token.tag_ to view the fine-grained POS tag:

print('%-15s%-15s%-15s%-15s%-30s'%('token.text', 'token.pos_',
                                   'token.dep_', 'token.tag_',
                                   'explain token_tag'))
print('-----------------------------------------------------------------------------------')
for token in doc:
    print('%-15s%-15s%-15s%-15s%-30s'%(token.text, token.pos_,
                                       token.dep_, token.tag_,
                                       spacy.explain(str(token.tag_))))

image-20220726171426406

  • Counting POS tags:
doc = nlp(u"Apple is looking at buying a U.K. startup for $1 Billion")
pos_counts = doc.count_by(spacy.attrs.POS)
print(pos_counts)
print(doc.vocab[96].text)

image-20220726171538698

for k, v in sorted(pos_counts.items()):
    print(f'{k}.{doc.vocab[k].text:{6}}:{v}')

image-20220726171554293

tag_counts = doc.count_by(spacy.attrs.TAG)
for k, v in sorted(tag_counts.items()):
    print(f'{k:<{23}}{doc.vocab[k].text:{4}}:{v:<{5}} \
    {spacy.explain(doc.vocab[k].text)}')

image-20220726171626881
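The same count_by pattern works for other attributes, for example dependency labels (a small sketch):

dep_counts = doc.count_by(spacy.attrs.DEP)
for k, v in sorted(dep_counts.items()):
    print(f'{k:<{23}}{doc.vocab[k].text:{10}}:{v}')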

Visualization:

from spacy import displacy
doc5 = nlp(u'A quick brown fox jumps over the lazy dog')
displacy.render(doc5, style = 'dep', jupyter = True,
                options = {'distance':110})

image-20220726171700191

Defining other visualization options:

options = {'distance' : 110,
           'compact' : 'True',
           'color' : 'yellow',
           'bg' : '#09a3d5',
           'font' : 'Times'}
displacy.render(doc5,
                style = 'dep',
                jupyter = True,
                options = options)

image-20220726171735467

3.6 NER

import spacy
nlp = spacy.load('en_core_web_sm')

Define a function that displays the named entities:

def show_ents(doc):
    if doc.ents:
        for entity in doc.ents:
            print(entity.text + '---' + entity.label_ + '---',
                  spacy.explain(str(entity.label_)))
    else:
        print("No named entities found!")

doc = nlp(u'I am heading to New York City and will visit \
Statue of Liberty tomorrow')
show_ents(doc)

image-20220803141253127

Next, add a custom named entity:

doc2 = nlp(u'Tesla is planning to build a new \
plant in U.K. for $50 Million')
show_ents(doc2)

image-20220803143518170

Check the type of a named entity; it is spacy.tokens.span.Span:

type(doc2.ents[0])

image-20220803143541084

Look up the hash value corresponding to a given entity label:

from spacy.tokens import Span
ORG = doc.vocab.strings['ORG']
ORG

image-20220803143614013

# Create a new named entity for the token "Tesla" (tokens 0 to 1)
new_entity = Span(doc2, 0, 1, label = ORG)
# Add it to the document's list of named entities
doc2.ents = list(doc2.ents) + [new_entity]
show_ents(doc2)

image-20220803143640931

Adding multiple named entities for all matching spans

doc3 = nlp(u"Our company plans to introduce new vaccum cleaner. "
           "If this works out the new vaccum cleaner will be our "
           "first product")
show_ents(doc3)

image-20220803143716208

Create a list of phrases to match as named entities:

from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)
phrase_list = ['vaccum cleaner', 'vaccum-cleaner',
               'vaccum_cleaner', 'vaccumcleaner']
phrase_patterns = [nlp(text) for text in phrase_list]

type(phrase_patterns[0])

image-20220803143744676

matcher.add('clientproducts', None, *phrase_patterns)  # spaCy v2 signature; in v3 it is matcher.add('clientproducts', phrase_patterns)

doc3 = nlp(u"Our company plans to introduce new vaccum cleaner. "
           "If this works out the new vaccum cleaner will be our "
           "first product")
matches = matcher(doc3)
matches

image-20220803143805895

prod = doc.vocab.strings[u'PRODUCT']
prod

image-20220803143835718

new_entities = [Span(doc3, match[1],
                     match[2], label = prod) for match in matches]
print(len(new_entities))
print(type(new_entities[0]))

image-20220803143900404

doc3.ents = list(doc3.ents) + new_entities
show_ents(doc3)

image-20220803143913652

Counting entities of a certain type (label)

doc4 = nlp(u'I found a furniture priced at $2000 which \
is marked down by 500 dollars.')
show_ents(doc4)

image-20220803143945976

# Count the entities with a given label
len([ent for ent in doc4.ents if ent.label_ == 'MONEY'])

image-20220803144025943
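To count every entity label at once, collections.Counter works nicely (a small sketch):

from collections import Counter
print(Counter(ent.label_ for ent in doc4.ents))   # e.g. Counter({'MONEY': 2})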

Visualization:

from spacy import displacy
doc = nlp(u'Over the last quarter Apple sold nearly 20 Million iPhone \
12s for a profit of $20 million.')
displacy.render(doc, style = 'ent', jupyter = True)

image-20220803145526316

Restrict rendering to specific entity types:

options = {'ents': ['ORG', 'DATE', 'MONEY']}
displacy.render(doc, style = 'ent',
                jupyter = True,
                options = options)

image-20220803145548661

Set colors for specific entity types:

colors = {'ORG': 'orange', 'MONEY': 'yellow'}
options = {'ents': ['ORG', 'DATE', 'MONEY'],
           'colors': colors}
displacy.render(doc, style = 'ent',
                jupyter = True,
                options = options)

image-20220803145636203

4.NLTK

4.1 Stemming

import nltk
from nltk.stem.porter import PorterStemmer

Create a PorterStemmer() object:

p_stemmer = PorterStemmer()

Use the Porter algorithm to process the words in a list:

words = ['run', 'runs', 'runner', 'running', 'ran', 'easily', 'fairly']
for word in words:
    print(word + '---->' + p_stemmer.stem(word))

image-20220725120104043

Notice that the handling of words like easily and fairly is not very smart.

Next, try the Snowball algorithm by creating a SnowballStemmer() object:

# snowball stemmer
from nltk.stem.snowball import SnowballStemmer
s_stemmer = SnowballStemmer(language = 'english')

words = ['run', 'runs', 'runner', 'running', 'ran', 'easily', 'fairly']
for word in words:
    print(word + '---->' + s_stemmer.stem(word))

image-20220725120248336

A limitation of the stemming algorithms in NLTK:

phrase = 'I am meeting Raj at the meeting this afternoon'
for word in phrase.split():
    print(word + '----->' + s_stemmer.stem(word))
# the stemmer cannot distinguish the two occurrences of 'meeting'

image-20220725120326292
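For comparison, a lemmatizer that uses part of speech can tell the two occurrences of "meeting" apart (a minimal sketch, assuming spaCy and en_core_web_sm are installed; "meeting" as a verb should lemmatize to "meet", while the noun stays "meeting"):

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I am meeting Raj at the meeting this afternoon')
for token in doc:
    print(f'{token.text:<12}{token.pos_:<8}{token.lemma_}')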