Tokenization
Converting raw text into separate words or tokens. Word boundaries and punctuation vary across natural languages, so this can be a non-trivial task.
Parsing & Tagging
Parsing is about creating a tree-like structure over words, focusing on the relationships between them. Tagging attaches additional information to tokens.
Stemming
Reducing words to their base form by applying rules.
Lemmatization
Reducing words to their base dictionary form (called the lemma).
Stop Word Filtering
Removing common, trivial words to reduce clutter before analysis.
Parts of Speech Tagging
Determining the part of speech of each word and tagging it accordingly.
Named Entity Recognition
Detecting proper names in the text, e.g., names of people and places.
Some open-source tools for these tasks, such as NLTK and Spacy, are introduced in section 2.
1.1 Tokenization
Tokenization is about breaking text into components (tokens)
Tokenization uses prefix, suffix and infix characters, and punctuation rules to split text into tokens.
Tokens are pieces of the original text; no transformation is performed.
Tokens form the building blocks of a “Doc” object.
Tokens have a variety of useful attributes and methods
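As a quick illustration, here is a minimal tokenization sketch with Spacy; the model name en_core_web_sm and the example sentence are assumptions for illustration, not part of the original notes:

import spacy

nlp = spacy.load('en_core_web_sm')   # assumes the small English model is installed
doc = nlp(u"Tesla isn't looking into startups anymore.")
for token in doc:
    # each token is a piece of the original text with useful attributes
    print(token.text, token.idx, token.is_alpha, token.is_punct)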
1.2 Stemming
A technique of reducing words to their base form by applying rules. The rules can be crude, such as chopping letters off the end until the stem is reached, or a bit more sophisticated.
For example, words like boat, boater, and boating may reduce to the same stem, which helps if you are searching for any of those words.
One of the most common (and effective) stemming tools is Porter's algorithm, developed by Martin Porter in 1980.
The algorithm employs five phases of word reduction, each with its own set of mapping rules
In the first phase, simple suffix mapping rules are defined, such as SSES → SS (caresses → caress), IES → I (ponies → poni), and S → '' (cats → cat).
From a given set of stemming rules, only one rule is applied, based on the longest matching suffix S1.
More sophisticated phases consider the length/complexity of the word before applying a rule. For example, the rule (m > 1) EMENT → '' reduces 'replacement' to 'replac' but leaves 'cement' unchanged, because the remaining stem 'c' would be too short.
Snowball is the name of a stemming language also developed by Martin Porter. (Its English stemmer is also known as the Porter2 stemmer. It is almost universally regarded as better than the original Porter stemmer, even by Porter himself; Snowball adds a number of refinements, and its output differs from Porter's on roughly 5% of words.)
The algorithm used here is more accurately called the “English Stemmer” or “Porter2 Stemmer”.
It offers a slight improvement over the original Porter stemmer, both in logic and speed
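A minimal sketch of both stemmers using NLTK (the word list is only an assumption for illustration):

from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer(language='english')   # the "English" / Porter2 stemmer

for word in ['boat', 'boater', 'boating', 'generous', 'generously']:
    # compare the original Porter output with the Snowball (Porter2) output
    print(f'{word:<12}{porter.stem(word):<12}{snowball.stem(word)}')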
1.3 Lemmatization
In contrast to stemming, lemmatization looks beyond word reduction, and considers a language’s full vocabulary to apply a morphological analysis to words.
The lemma of ‘was’ is ‘be’ and the lemma of ‘mice’ is ‘mouse’. Further, the lemma of ‘meeting’ might be ‘meet’ or ‘meeting’ depending on its use in a sentence
Lemmatization is typically seen as much more informative than simple stemming
Some libraries, such as Spacy, have opted to support only lemmatization and do not provide stemming.
Lemmatization looks at the surrounding text to determine a given word's part of speech; it does not categorize phrases.
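A minimal Spacy sketch (again assuming the en_core_web_sm model) showing that the lemma of 'meeting' depends on its role in the sentence:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"I am meeting him tomorrow at the meeting.")
for token in doc:
    # print each token with its part of speech and lemma
    print(f'{token.text:<12}{token.pos_:<8}{token.lemma_}')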
1.4 Stop Word Filtering
Words like “a” and “the” appear so frequently that they don’t require tagging as thoroughly as nouns, verbs and modifiers
We call these stop words, and they can be filtered from the text to be processed.
NLP libraries typically hold a list of stop words. For example, Spacy holds a built-in list of some 305 English stop words.
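A small sketch of how Spacy exposes its stop-word list and how tokens can be filtered (model name assumed as above):

import spacy

nlp = spacy.load('en_core_web_sm')
print(len(nlp.Defaults.stop_words))   # size of the built-in stop word list

doc = nlp(u"The quick brown fox jumped over a lazy dog")
filtered = [token.text for token in doc if not token.is_stop]
print(filtered)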
1.5 Parts of Speech Tagging
Parts of Speech tagging is a technique of using linguistic knowledge to add useful information to tokens (words)
Parts of speech are a categorization of the words in a natural-language text, governed by the grammar of the language.
In the English language there are ten parts of speech: noun, pronoun, adjective, verb, adverb, preposition, conjunction, interjection, determiner, and article.
For example, in English, Parts of Speech mean categorizing tokens as noun, verb, adjective, etc. Most NLP libraries have additional tags such as plural noun, past tense of a verb etc.
The premise is that the same word in a different order may mean something completely different.
In NLP, POS tagging is essential for building parse trees, which are used for identifying named entities and noun phrases and for extracting relationships between words.
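A small sketch (assuming en_core_web_sm) illustrating the premise above: the same word receives different tags in different contexts:

import spacy

nlp = spacy.load('en_core_web_sm')
for text in (u"I read books every day.", u"I read a book yesterday."):
    doc = nlp(text)
    for token in doc:
        if token.text == 'read':
            # coarse POS tag, fine-grained tag, and its explanation
            print(text, '->', token.pos_, token.tag_, spacy.explain(token.tag_))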
1.6 Named Entity Recognition
Named entities are real-world objects (e.g. persons, organizations, cities and countries, etc.) that can be given proper names
Named-entity recognition (NER) seeks to locate and classify named entities in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, and percentages.
NER helps to extract the main entities in a text and detect important information, which is crucial when dealing with large datasets.
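A minimal NER sketch with Spacy (the example sentence is an assumption; a fuller walkthrough follows in section 2):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"Google was founded by Larry Page and Sergey Brin in California.")
for ent in doc.ents:
    # each recognized entity exposes its text and label
    print(ent.text, ent.label_, spacy.explain(ent.label_))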
Business Use Cases for NER:
Categorizing tickets in customer support
Gaining insights from customer feedback
Speeding up content recommendation
Processing resumes
Detecting fake news
Efficient search algorithms
2. Introduction to NLTK and Spacy
2.1 NLTK
NLTK (Natural Language Toolkit) is a very popular open-source NLP library.
Initially released in 2001, it is much older than Spacy (released in 2015). It was created essentially for teaching and research.
It provides a wide range of functionality, but some of its implementations are less efficient.
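For reference, a minimal NLTK sketch (the resource downloads and the example sentence are assumptions):

import nltk

# one-time downloads of the tokenizer and tagger resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

tokens = nltk.word_tokenize("NLTK was released in 2001 and is widely used for teaching.")
print(nltk.pos_tag(tokens))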
2.2 Spacy
Open Source Natural Language Processing Library.
Designed to effectively handle NLP tasks with the most efficient implementation of common algorithms
Designed to get things done
For many NLP tasks, Spacy only has one implemented method, choosing the most efficient algorithm currently available.
This means you often don’t have the option to choose other algorithms.
It is opinionated software!
For many common NLP tasks, Spacy is much faster and more efficient, at the cost of the user not being able to choose algorithmic implementations.
However, Spacy does not include pre-created models for some applications, such as sentiment analysis, which is typically easier to perform with NLTK.
NLTK vs. Spacy (processing tests)
Spacy works with a pipeline object
The nlp() function from Spacy automatically takes raw text and performs a series of operations to tag, parse, and describe the text.
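The snippets below assume such a pipeline has already been created; a minimal setup sketch (en_core_web_sm is one common choice of model):

import spacy

nlp = spacy.load('en_core_web_sm')
print(nlp.pipe_names)   # the components run by nlp(), e.g. tagger, parser, ner

doc = nlp(u"Spacy runs the whole pipeline on this raw text.")
print([(token.text, token.pos_) for token in doc])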
doc3 = nlp(u'Although commonly attributed to John Lennon from \
his song "Beautiful Boy", the phrase "Life is what \
happens to us while we are making other plans" was \
written by cartoonist Allen Saunders and published \
in Reader\'s Digest in 1957, when Lennon was 17.')
Slicing out the quote in this sentence:
life_quote = doc3[16:30]
print(life_quote)
Checking the data type of this slice shows that it is a Span:
type(life_quote)
doc.sents splits a text into sentences; each sentence is a Span object:
doc4 = nlp(u'This is the first sentence. Hey, second sentence. \
Third sentence. Fourth sentence. Stupid sentence.')
for sentence in doc4.sents:
    print(sentence, type(sentence))  # split into sentences
doc = nlp(u'"We\'re moving to L.A.!"') for token in doc: print(token, end = ' | ')
Splitting links (email addresses and URLs):
doc2 = nlp(u"We're here to help! Send snail-mail,\ emall support@oursite.com or visit us at\ http://www.oursite.com!") for token in doc2: print(token, end = ' | ')
Splitting special symbols:
doc3 = nlp(u"I paid $50.23 for a used furniture.") for token in doc3: print(token, end = ' | ')
Handling the '.' character (abbreviations):
doc4 = nlp(u"Let's visit St. Louis in U.S. next month") for token in doc4: print(token, end = ' | ')
Note that a spacy.tokens.doc.Doc object does not support modification (item assignment):
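For example (a small illustrative sketch, reusing doc4 from above):

try:
    doc4[0] = 'Do'    # attempting to overwrite a token
except TypeError as e:
    print(e)          # Doc objects do not support item assignment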
Named Entities:
entity.text: the text of the entity
entity.label_: the entity's label
doc8 = nlp(u"Apple is trying to build a new factory in Hong Kong in 2021") for token in doc8: print(token, end = ' | ') print('\n') # Name Entities for entity in doc8.ents: print('%-15s%-15s%-30s'%(entity.text,entity.label_, spacy.explain(str(entity.label_))))
There are three named entities (Span objects), and an explanation of each entity's category can also be printed:
Noun Chunks: noun-chunk analysis extracts noun phrases:
doc9 = nlp(u"Autonomous cars shift insurance \ liability toward manufactures.") for chunk in doc9.noun_chunks: print(chunk.text)
Printing each token's POS tag, dependency label, and fine-grained tag:

print('%-15s%-15s%-15s%-15s%-30s' % ('token.text', 'token.pos_', 'token.dep_', 'token.tag_', 'explain token.tag_'))
print('-' * 85)
for token in doc:
    print('%-15s%-15s%-15s%-15s%-30s' % (token.text, token.pos_, token.dep_, token.tag_, spacy.explain(str(token.tag_))))
Counting POS tags:
doc = nlp(u"Apple is looking at buying a U.K. startup for $1 Billion") pos_counts = doc.count_by(spacy.attrs.POS) print(pos_counts) print(doc.vocab[96].text)
for k, v in sorted(pos_counts.items()):
    print(f'{k}. {doc.vocab[k].text:{6}}: {v}')
tag_counts = doc.count_by(spacy.attrs.TAG)
for k, v in sorted(tag_counts.items()):
    print(f'{k:<{23}}{doc.vocab[k].text:{4}}: {v:<{5}} {spacy.explain(doc.vocab[k].text)}')
Visualizing the dependency parse:
from spacy import displacy

doc5 = nlp(u'A quick brown fox jumps over the lazy dog')
displacy.render(doc5, style='dep', jupyter=True, options={'distance': 110})
A helper function for displaying named entities:

def show_ents(doc):
    if doc.ents:
        for entity in doc.ents:
            print(entity.text + '---' + entity.label_ + '---', spacy.explain(str(entity.label_)))
    else:
        print("No named entities found!")
doc = nlp(u'I am heading to New York City and will visit \
the Statue of Liberty tomorrow')
show_ents(doc)
Next, adding a custom named entity:
doc2 = nlp(u'Tesla is planning to build a new \
plant in U.K. for $50 Million')
show_ents(doc2)
Checking a named entity's type shows that it is a spacy.tokens.span.Span:
type(doc2.ents[0])
Looking up the hash value corresponding to an entity label:
from spacy.tokens import Span

ORG = doc.vocab.strings['ORG']
ORG
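The step that actually attaches the custom entity is not shown above; a minimal reconstruction, assuming 'Tesla' is token 0 of doc2 and should be labelled ORG:

# hypothetical reconstruction of the missing step
new_ent = Span(doc2, 0, 1, label=ORG)      # Span over token 0 with the ORG label
doc2.ents = list(doc2.ents) + [new_ent]    # append it to the existing entities
show_ents(doc2)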
Adding multiple named entities for all matching spans:
doc3 = nlp(u"Our company plans to introduce new vaccum cleaner." "If this works out the new vaccum cleaner will be our\ first product") show_ents(doc3)
Creating a list of phrases to match as named entities:
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)
phrase_list = ['vacuum cleaner', 'vacuum-cleaner', 'vacuum_cleaner', 'vacuumcleaner']
phrase_patterns = [nlp(text) for text in phrase_list]
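The call that registers these patterns with the matcher is missing above; a minimal sketch (the pattern name 'newproduct' is a hypothetical choice, and the signature differs between Spacy versions):

# spaCy 2.x signature; in spaCy 3.x use matcher.add('newproduct', phrase_patterns)
matcher.add('newproduct', None, *phrase_patterns)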
doc3 = nlp(u"Our company plans to introduce new vaccum cleaner." "If this works out the new vaccum cleaner will be our \ first product") matches = matcher(doc3) matches
prod = doc.vocab.strings[u'PRODUCT']
prod
new_entities = [Span(doc3, match[1], match[2], label=prod) for match in matches]
print(len(new_entities))
print(type(new_entities[0]))
doc3.ents = list(doc3.ents) + new_entities
show_ents(doc3)
Counting entities of a certain type (label):
doc4 = nlp(u'I found furniture priced at $2000 which \
is marked down by 500 dollars.')
show_ents(doc4)
# counting entities labeled MONEY
len([ent for ent in doc4.ents if ent.label_ == 'MONEY'])
Visualizing the entities:
from spacy import displacy

doc = nlp(u'Over the last quarter Apple sold nearly 20 Million iPhone \
12s for a profit of $20 million.')
displacy.render(doc, style='ent', jupyter=True)