1.Common NLP Techniques

  • Tokenization (word segmentation): Converting raw text into separate words or tokens. Word boundaries and punctuation vary across natural languages, so this can be a non-trivial task.
  • Parsing & Tagging: Parsing creates a tree-like structure over the words, focusing on the relationships between them; tagging attaches additional information to tokens.
  • Stemming: Reducing words to their base form by applying rules.
  • Lemmatization: Reducing words to their base dictionary form (called the lemma).
  • Stop Word Filtering: Removing common, trivial words to reduce clutter before analysis.
  • Parts of Speech Tagging: Determining the part of speech of each word and tagging it accordingly.
  • Named Entity Recognition: Identifying proper names in the text, e.g. names of people and places.

Some open-source tools:

image-20220725152300585

1.1 Tokenization

Tokenization is about breaking text into components (tokens)

  • Tokenization uses prefix, suffix and infix characters, and punctuation rules to split text into tokens.
  • Tokens are pieces of the original text; no transformation is performed.
  • Tokens form the building blocks of a “Doc” object
  • Tokens have a variety of useful attributes and methods

tokenization-9b27c0f6fe98dcb26239eba4d3ba1f3d

1.2 Stemming

  • Technique of reducing words to their base form by applying rules. The rules can be crude, such as chopping letters off the end until the stem is reached, or a bit more sophisticated
  • For example, words like boat, boater and boating may reduce to the same stem, which helps if you are searching for occurrences of a word
  • One of the most common, and effective, stemming tools is Porter’s Algorithm, developed by Martin Porter in 1980
  • The algorithm employs five phases of word reduction, each with its own set of mapping rules

In the first phase, simple suffix mapping rules are defined, such as:

image-20220723122800290

From a given set of stemming rules, only one rule is applied, based on the longest matching suffix S1.

image-20220723122930964

More sophisticated phases consider the length/complexity of the word before applying a rule. For example:

image-20220723123129742
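As an illustration of the longest-suffix idea, here is a minimal sketch in Python (a toy rule set in the spirit of the first phase, not Porter's full algorithm):

# Toy version of "apply the single rule whose suffix S1 is the longest match".
rules = {'sses': 'ss', 'ies': 'i', 'ss': 'ss', 's': ''}

def apply_longest_suffix_rule(word):
    # Try candidate suffixes from longest to shortest; the first match wins.
    for suffix in sorted(rules, key=len, reverse=True):
        if word.endswith(suffix):
            return word[:len(word) - len(suffix)] + rules[suffix]
    return word

for w in ['caresses', 'ponies', 'caress', 'cats']:
    print(w, '---->', apply_longest_suffix_rule(w))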

  • Snowball is the name of a stemming language also developed by Martin Porter. (The algorithm is also known as the Porter2 stemmer. It is almost universally regarded as better than the original Porter stemmer, even by Porter himself; Snowball adds a number of refinements on top of Porter, and its output differs from Porter's in roughly 5% of cases.)
  • The algorithm used here is more accurately called the “English Stemmer” or “Porter2 Stemmer”.
  • It offers a slight improvement over the original Porter stemmer, both in logic and speed

1.3 Lemmatization

  • In contrast to stemming, lemmatization looks beyond word reduction, and considers a language’s full vocabulary to apply a morphological analysis to words.
  • The lemma of ‘was’ is ‘be’ and the lemma of ‘mice’ is ‘mouse’. Further, the lemma of ‘meeting’ might be ‘meet’ or ‘meeting’ depending on its use in a sentence
  • Lemmatization is typically seen as much more informative than simple stemming
  • Some libraries such as Spacy have opted to support only lemmatization and do not support stemming techniques
  • Lemmatization looks at the surrounding text to determine a given word’s part of speech; it does not categorize phrases

1.4 Stop Word Filtering

  • Words like “a” and “the” appear so frequently that they don’t require tagging as thoroughly as nouns, verbs and modifiers
  • We call these stop words, and they can be filtered from the text to be processed.
  • NLP libraries typically hold a list of stop words. For example, Spacy holds a built-in list of some 305 English stop words

1.5 Parts of Speech Tagging

  • Parts of Speech tagging is a technique of using linguistic knowledge to add useful information to tokens (words)
  • Parts of speech are categories of words in natural language text, governed by the grammar.
  • In the English language there are ten parts of speech: noun, pronoun, adjective, verb, adverb, preposition, conjunction, interjection, determiner, and article.
  • For example, in English, Parts of Speech mean categorizing tokens as noun, verb, adjective, etc. Most NLP libraries have additional tags such as plural noun, past tense of a verb etc.
  • The premise is that the same word used in a different context may mean something completely different.
  • In NLP, POS tagging is essential for building parse trees, which are used for identifying named entities and noun phrases, and for extracting relationships between words.

1.6 Named Entity Recognition

  • Named entities are real-world objects (e.g. persons, organizations, cities and countries, etc.) that can be given proper names
  • Named-entity recognition (NER) seeks to locate and classify named entities in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
  • NER helps to extract the main entities in a text and detect important information, which is crucial when dealing with a large dataset.

image-20220723130626663

Business Use Cases for NER:

  • Categorizing tickets in customer support
  • Gaining insights from customer feedback
  • Speeding up content recommendation
  • Processing resumes
  • Detecting fake news
  • Efficient search algorithms

2.Introduction to NLTK and Spacy

2.1 NLTK

  • NLTK (Natural Language Toolkit) is a very popular open-source NLP library.
  • Initially released in 2001, it is much older than Spacy (released in 2015); it was created essentially for teaching and research.
  • It also provides many functionalities, but includes less efficient implementations.

2.2 Spacy

  • Open Source Natural Language Processing Library.
  • Designed to effectively handle NLP tasks with the most efficient implementation of common algorithms
  • Designed to get things done
  • For many NLP tasks, Spacy only has one implemented method, choosing the most efficient algorithm currently available.
  • This means you often don’t have the option to choose other algorithms.
  • It is opinionated software!
  • For many common NLP tasks, Spacy is much faster and more efficient, at the cost of the user not being able to choose algorithmic implementations.
  • However, Spacy does not include pre-created models for some applications, such as sentiment analysis, which is typically easier to perform with NLTK.

NLTK vs Spacy (processing tests)

image-20220723131521938

image-20220723131625158
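The figures above come from the course material. As a rough sketch of how such a comparison could be run yourself (assuming nltk, spacy and the en_core_web_sm model are installed; note that nlp() runs spaCy's full pipeline, not just tokenization):

import time
import nltk
import spacy

nltk.download('punkt', quiet=True)        # tokenizer data used by word_tokenize
nlp = spacy.load('en_core_web_sm')

text = "Apple is looking at buying a U.K. startup for $1 billion. " * 100

start = time.time()
nltk_tokens = nltk.word_tokenize(text)
print('NLTK :', len(nltk_tokens), 'tokens in', round(time.time() - start, 4), 's')

start = time.time()
spacy_tokens = [t.text for t in nlp(text)]
print('spaCy:', len(spacy_tokens), 'tokens in', round(time.time() - start, 4), 's')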

Spacy works with a pipeline object

image-20220723131848369

  • The nlp() function from Spacy automatically takes raw text and performs a series of operations to tag, parse, and describe the text

Installing Spacy:

  • Download Spacy

    pip install Spacy -i https://pypi.doubanio.com/simple

  • Download the English language library and link it with Spacy (this command appears to download the spacy-model en_core_web_sm package as well; the conda commands below can be used instead)

    python -m spacy download en

  • Alternatively, in administrator mode, add the anaconda mirror channels and install the English spacy models packaged for conda:

    conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
    conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
    conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/
    conda config --set show_channel_urls yes

    conda install -c conda-forge spacy-model-en_core_web_sm
    conda install -c conda-forge spacy-model-en_core_web_md
    conda install -c conda-forge spacy-model-en_core_web_lg
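After installation, a quick load check (a minimal sketch, assuming en_core_web_sm was installed as above):

import spacy
nlp = spacy.load('en_core_web_sm')   # raises OSError if the model is missing
print(nlp.pipe_names)                # lists the pipeline components, e.g. tagger, parser, ner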

3.Spacy

3.1 Introduction

Some additional references:

  • The official spacy website: https://spacy.io/usage
  • image-20220725013503799

Import the spacy package and load an English model:

import spacy
nlp = spacy.load('en_core_web_sm')  # load an English model

Feed a sentence into the nlp() model:

doc = nlp(u'Apple is looking at buying a U.K. startup for $10 Billion')  # u denotes a unicode string
type(doc)  # the string has been split into tokens

image-20220725001718753

  • The doc object holds tokens
  • Each token of the original sentence has, among others, the following attributes:
    • token.pos: the part-of-speech ID (an integer)
    • token.pos_: the part-of-speech label (a string)
    • token.dep_: gives more information, such as the dependency relation
for token in doc:
    print('%-15s%-15s%-15s%-15s'%(token.text, token.pos, token.pos_, token.dep_))

image-20220725001855887

View the components contained in the nlp() pipeline:

nlp.pipeline  # inspect the pipeline

image-20220725001914976

The components are described as follows:

image-20220725002542229

Feed in another sentence and inspect the result:

doc2 = nlp("Apple isn't looking into buying \
startups in U.K. anymore.")

print('%-15s%-15s%-15s%-15s'%('token.text', 'token.pos', 'token.pos_', 'token.dep_'))
print('--------------------------------------------------------')
for token in doc2:
    print('%-15s%-15s%-15s%-15s'%(token.text, token.pos, token.pos_, token.dep_))

Note that spacy splits isn't into two parts, and distinguishes the abbreviation periods in U.K. from a sentence-ending period:

image-20220725002018647

A doc object can also be indexed; you can inspect the attributes of the indexed token and describe them with spacy.explain:

print(doc2[0], '--', type(doc2[0]))  # tokens can be indexed; check the type
print(doc2[0].pos_, doc2[0].dep_)    # part of speech and dependency label
print(spacy.explain(str(doc2[0].pos_)))
print(spacy.explain(str(doc2[0].dep_)))  # explain the corresponding tag

image-20220725002132515

  • Span objects: a span is a slice of a doc object
doc3 = nlp(u'Although commonly attributed to John Lennon from \
his song "Beautiful Boy", the phrase "Life is what \
happens to us while we are making other plans" was \
written by cartoonist Allen Saunders and published \
in Reader\'s Digest in 1957, when Lennon was 17.')

Slice out the quote from this sentence:

life_quote = doc3[16:30]
print(life_quote)

image-20220725002316336

Check the data type of the slice; it is a Span:

type(life_quote)

image-20220725002348218

  • doc.sents: splits the text into sentences, each of which is a Span object:
doc4 = nlp(u'This is first sentence.Hey, second sentence. \
Third sentence. Fourth sentence. Stupid sentence.')

for sentence in doc4.sents:
    print(sentence, type(sentence))  # split into sentences

image-20220725002724800

You can also use token.is_sent_start to check whether a token is the start of a sentence:

print(doc4[0], doc4[0].is_sent_start)  # is this token the start of a sentence?
print(doc4[5], doc4[5].is_sent_start)

3.2 Tokenization

How Spacy turns a sentence into tokens:

image-20220723122015779

import spacy
nlp = spacy.load('en_core_web_sm')
  • Splitting quotation marks:
doc = nlp(u'"We\'re moving to L.A.!"')
for token in doc:
    print(token, end = ' | ')

image-20220725014206009

  • Splitting email addresses and URLs:
doc2 = nlp(u"We're here to help! Send snail-mail, \
email support@oursite.com or visit us at \
http://www.oursite.com!")
for token in doc2:
    print(token, end = ' | ')

image-20220725014240074

  • Splitting currency symbols and amounts:
doc3 = nlp(u"I paid $50.23 for a used furniture.")
for token in doc3:
    print(token, end = ' | ')

image-20220725014300346

  • Handling periods in abbreviations:
doc4 = nlp(u"Let's visit St. Louis in U.S. next month")
for token in doc4:
    print(token, end = ' | ')

image-20220725014332054

Also note that spacy.tokens.doc.Doc objects do not support item assignment:

image-20220725014346835
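A minimal sketch of that immutability (assigning to a token position raises a TypeError):

doc4 = nlp(u"Let's visit St. Louis in U.S. next month")
try:
    doc4[0] = 'You'          # Doc objects do not support item assignment
except TypeError as e:
    print(e)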

Named Entities

  • entity.text: the entity text
  • entity.label_: the entity label
doc8 = nlp(u"Apple is trying to build a new factory in Hong Kong in 2021")
for token in doc8:
    print(token, end = ' | ')
print('\n')

# Named Entities
for entity in doc8.ents:
    print('%-15s%-15s%-30s'%(entity.text, entity.label_,
                             spacy.explain(str(entity.label_))))

There are three named entities (Span objects), and an explanation of each label can be printed:

image-20220725014521268
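Each entity is a Span, so character offsets into the original text are also available (a small addition to the loop above):

for entity in doc8.ents:
    # start_char / end_char are offsets into the original text
    print(entity.text, entity.label_, entity.start_char, entity.end_char)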

Noun Chunks: noun phrase chunking, i.e. extracting noun phrases

doc9 = nlp(u"Autonomous cars shift insurance \
liability toward manufacturers.")
for chunk in doc9.noun_chunks:
    print(chunk.text)

image-20220725014551438
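Each noun chunk also exposes its syntactic head via chunk.root (a small sketch based on the same doc9):

for chunk in doc9.noun_chunks:
    # chunk.root is the token that connects the chunk to the rest of the parse
    print(chunk.text, '|', chunk.root.text, '|', chunk.root.dep_)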

Visualization

Visualizing dependency relations:

from spacy import displacy
displacy.render(doc9, style = 'dep', jupyter = True,
                options = {'distance':90})

image-20220725014655913

Visualizing named entities:

doc8 = nlp(u"Apple is trying to build a new factory in \
Hong Kong in 2021")
displacy.render(doc8, style = 'ent', jupyter = True)

image-20220725014709095
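Outside a Jupyter notebook, displacy can serve the same visualization over HTTP instead (a minimal sketch; the call blocks until interrupted and serves on http://localhost:5000 by default):

from spacy import displacy
displacy.serve(doc8, style = 'ent')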

3.3 Lemmatization

import spacy
nlp = spacy.load('en_core_web_sm')

Use token.lemma_ to view the lemma:

def show_lemmas(doc):
    print('%-15s%-15s%-15s%-15s%-15s'%('token.text', 'token.pos',
                                       'token.pos_', 'token.dep_',
                                       'token.lemma_'))
    for token in doc:
        print('%-15s%-15s%-15s%-15s%-15s'%(token.text, token.pos,
                                           token.pos_, token.dep_,
                                           token.lemma_))

Test a first sentence:

doc = nlp("I am a runner running in a race because \
I love to run and I ran earlier this morning")
show_lemmas(doc)

image-20220725122044343

  • Irregular noun plurals are reduced to their singular base form:
doc1 = nlp(u'I saw eighteen mice today')
show_lemmas(doc1)

image-20220725122107682

  • Forms of the verb "be" are restored to the base form:
doc2 = nlp(u"That's an apple.")
show_lemmas(doc2)  # 's --> be

image-20220725122159013

  • Adverbs are not reduced to their adjective form, because the meaning differs:
doc4 = nlp(u"That was ridiculously easy and easily done and fairly")
show_lemmas(doc4)

image-20220725122233154

3.4 Stop Words

nlp.Defaults.stop_words is the default list of stop words:

print(nlp.Defaults.stop_words)

image-20220725164326466

nlp.vocab['whenever']

image-20220725164352151

  • Use is_stop to check whether a word is a stop word:
nlp.vocab['whenever'].is_stop

image-20220725164437764

  • Check how many stop words there are:
len(nlp.Defaults.stop_words)

image-20220725164507084

# Two ways to add a stop word:
nlp.Defaults.stop_words.add('btu')
nlp.vocab['btw'].is_stop = True
print(nlp.vocab['btu'].is_stop)
print(nlp.vocab['btw'].is_stop)

image-20220725164529232
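Stop words can also be removed again, mirroring the two methods above (a small sketch that undoes the changes):

nlp.Defaults.stop_words.discard('btu')   # remove from the default set
nlp.vocab['btu'].is_stop = False         # also clear the lexeme flag
nlp.vocab['btw'].is_stop = False
print(nlp.vocab['btu'].is_stop, nlp.vocab['btw'].is_stop)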

3.5 Parts of Speech Tagging

Use token.tag_ to view the fine-grained POS tag:

print('%-15s%-15s%-15s%-15s%-30s'%('token.text', 'token.pos_',
                                   'token.dep_', 'token.tag_',
                                   'explain token_tag'))
print('-----------------------------------------------------------------------------------')
for token in doc:
    print('%-15s%-15s%-15s%-15s%-30s'%(token.text, token.pos_,
                                       token.dep_, token.tag_,
                                       spacy.explain(str(token.tag_))))

image-20220726171426406

  • Counting POS tags:
doc = nlp(u"Apple is looking at buying a U.K. startup for $1 Billion")
pos_counts = doc.count_by(spacy.attrs.POS)
print(pos_counts)
print(doc.vocab[96].text)

image-20220726171538698

for k, v in sorted(pos_counts.items()):
    print(f'{k}.{doc.vocab[k].text:{6}}:{v}')

image-20220726171554293

tag_counts = doc.count_by(spacy.attrs.TAG)
for k, v in sorted(tag_counts.items()):
    print(f'{k:<{23}}{doc.vocab[k].text:{4}}:{v:<{5}} \
    {spacy.explain(doc.vocab[k].text)}')

image-20220726171626881
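The same count_by pattern works for other attributes, for example dependency labels (a small sketch):

dep_counts = doc.count_by(spacy.attrs.DEP)
for k, v in sorted(dep_counts.items()):
    print(f'{k:<{23}}{doc.vocab[k].text:{10}}:{v}')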

Visualization:

from spacy import displacy
doc5 = nlp(u'A quick brown fox jumps over the lazy dog')
displacy.render(doc5, style = 'dep', jupyter = True,
                options = {'distance':110})

image-20220726171700191

Defining other visualization options:

options = {'distance' : 110,
           'compact' : 'True',
           'color' : 'yellow',
           'bg' : '#09a3d5',
           'font' : 'Times'}
displacy.render(doc5,
                style = 'dep',
                jupyter = True,
                options = options)

image-20220726171735467

3.6 NER

import spacy
nlp = spacy.load('en_core_web_sm')

Define a function that displays the named entities:

def show_ents(doc):
    if doc.ents:
        for entity in doc.ents:
            print(entity.text + '---' + entity.label_ + '---',
                  spacy.explain(str(entity.label_)))
    else:
        print("No named entities found!")

doc = nlp(u'I am heading to New York City and will visit \
Statue of Liberty tomorrow')
show_ents(doc)

image-20220803141253127

Next, add a custom named entity:

doc2 = nlp(u'Tesla is planning to build a new \
plant in U.K. for $50 Million')
show_ents(doc2)

image-20220803143518170

Check the type of a named entity; it is spacy.tokens.span.Span:

type(doc2.ents[0])

image-20220803143541084

Look up the hash value corresponding to a given entity label:

from spacy.tokens import Span
ORG = doc.vocab.strings['ORG']
ORG

image-20220803143614013

# Create a new named entity for the token "Tesla" (tokens 0 to 1)
new_entity = Span(doc2, 0, 1, label = ORG)
# Add it to the document's list of named entities
doc2.ents = list(doc2.ents) + [new_entity]
show_ents(doc2)

image-20220803143640931

Adding multiple named entities for all matching spans

doc3 = nlp(u"Our company plans to introduce new vaccum cleaner. "
           "If this works out the new vaccum cleaner will be our "
           "first product")
show_ents(doc3)

image-20220803143716208

Create a list of phrases to match as named entities:

from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)
phrase_list = ['vaccum cleaner', 'vaccum-cleaner',
               'vaccum_cleaner', 'vaccumcleaner']
phrase_patterns = [nlp(text) for text in phrase_list]

type(phrase_patterns[0])

image-20220803143744676

matcher.add('clientproducts', None, *phrase_patterns)  # spaCy v2 signature; in v3 it is matcher.add('clientproducts', phrase_patterns)

doc3 = nlp(u"Our company plans to introduce new vaccum cleaner. "
           "If this works out the new vaccum cleaner will be our "
           "first product")
matches = matcher(doc3)
matches

image-20220803143805895

prod = doc.vocab.strings[u'PRODUCT']
prod

image-20220803143835718

new_entities = [Span(doc3, match[1],
                     match[2], label = prod) for match in matches]
print(len(new_entities))
print(type(new_entities[0]))

image-20220803143900404

doc3.ents = list(doc3.ents) + new_entities
show_ents(doc3)

image-20220803143913652

Counting entities of a certain type (label)

doc4 = nlp(u'I found a furniture priced at $2000 which \
is marked down by 500 dollars.')
show_ents(doc4)

image-20220803143945976

# Count the entities with a given label
len([ent for ent in doc4.ents if ent.label_ == 'MONEY'])

image-20220803144025943
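To count every entity label at once, collections.Counter works nicely (a small sketch):

from collections import Counter
print(Counter(ent.label_ for ent in doc4.ents))   # e.g. Counter({'MONEY': 2})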

Visualization:

from spacy import displacy
doc = nlp(u'Over the last quarter Apple sold nearly 20 Million iPhone \
12s for a profit of $20 million.')
displacy.render(doc, style = 'ent', jupyter = True)

image-20220803145526316

Restrict rendering to specific entity types:

options = {'ents': ['ORG', 'DATE', 'MONEY']}
displacy.render(doc, style = 'ent',
                jupyter = True,
                options = options)

image-20220803145548661

Set colors for specific entity types:

colors = {'ORG': 'orange', 'MONEY': 'yellow'}
options = {'ents': ['ORG', 'DATE', 'MONEY'],
           'colors': colors}
displacy.render(doc, style = 'ent',
                jupyter = True,
                options = options)

image-20220803145636203

4.NLTK

4.1 Stemming

import nltk
from nltk.stem.porter import PorterStemmer

Create a PorterStemmer() object:

p_stemmer = PorterStemmer()

Use the Porter algorithm to process the words in a list:

words = ['run', 'runs', 'runner', 'running', 'ran', 'easily', 'fairly']
for word in words:
    print(word + '---->' + p_stemmer.stem(word))

image-20220725120104043

Notice that the handling of words like easily and fairly is not very smart.

Next, try the Snowball algorithm by creating a SnowballStemmer() object:

# snowball stemmer
from nltk.stem.snowball import SnowballStemmer
s_stemmer = SnowballStemmer(language = 'english')

words = ['run', 'runs', 'runner', 'running', 'ran', 'easily', 'fairly']
for word in words:
    print(word + '---->' + s_stemmer.stem(word))

image-20220725120248336

A limitation of the stemming algorithms in NLTK:

phrase = 'I am meeting Raj at the meeting this afternoon'
for word in phrase.split():
    print(word + '----->' + s_stemmer.stem(word))
# the stemmer cannot distinguish the two occurrences of 'meeting'

image-20220725120326292
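For comparison, a lemmatizer that uses part of speech can tell the two occurrences of "meeting" apart (a minimal sketch, assuming spaCy and en_core_web_sm are installed; "meeting" as a verb should lemmatize to "meet", while the noun stays "meeting"):

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I am meeting Raj at the meeting this afternoon')
for token in doc:
    print(f'{token.text:<12}{token.pos_:<8}{token.lemma_}')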