Text Classification

1. Overview

  • Text Classification is the technique of categorizing natural language texts into pre-defined, organized groups
  • In other words, it is the activity of labeling texts with categories from a pre-defined set, based on their content
  • Classic examples include the classification of books in libraries or the segmentation of news articles, based on their content.

  • Text classification is a sub-field of text analytics, which uses machine learning to extract meaning from text documents

  • Text classification has been used successfully for:
    • Sentiment analysis
    • Topic detection
    • Language detection
    • Fraud, Profanity, and Emergency detection
    • Urgency detection in customer support

2. Machine Learning

Machine learning for NLP

  • Processing natural language text is complex, and traditional rules-based, explicitly programmed approaches are not practical
  • Machine Learning allows algorithms to iteratively learn from text and extract rules, instead of explicitly programming for it
  • Machine learning can improve, accelerate and automate NLP tasks and text analytics functions

Core NLP tasks are performed with machine learning models.

Machine Learning Process: CRISP-DM

  • The Cross-Industry Standard Process for Data Mining (CRISP-DM) is a well-established process model
  • The process steps are:
    • Business Understanding
    • Data Understanding
    • Data Preparation
    • Modeling
    • Evaluation
    • Deployment


Model Building & Evaluation Deep Dive


3. Classification Modeling Example


Model Evaluation: Confusion Matrix

For a classification problem, each prediction is either right or wrong; however, there are two dimensions along which to evaluate the results.


| Metric | Definition | Calculation using the confusion matrix |
| --- | --- | --- |
| Accuracy | # correct predictions / total # predictions | $(TP+TN)/\text{Total}$ |
| Recall (sensitivity) | Ability to find all relevant cases within the dataset. Tells us how complete the result is | $(TP)/(TP+FN)$ |
| Precision | Ability to find only the relevant data points. Tells us how valid the result is | $(TP)/(TP+FP)$ |
| F1-Score | Harmonic mean of recall and precision | $F1=2\times \frac{\text{precision}\times\text{recall}}{\text{precision}+\text{recall}}$ |

An example:

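A minimal sketch of these calculations with made-up counts (the TP/FP/FN/TN values below are hypothetical, chosen only to illustrate the formulas):

# hypothetical confusion-matrix counts
TP, FP, FN, TN = 80, 10, 20, 90

total = TP + FP + FN + TN
accuracy = (TP + TN) / total
recall = TP / (TP + FN)
precision = TP / (TP + FP)
f1 = 2 * precision * recall / (precision + recall)

print(f'accuracy={accuracy:.3f} recall={recall:.3f} precision={precision:.3f} f1={f1:.3f}')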

4. scikit-learn

Installation:

pip install scikit-learn

Scikit-learn usage steps (5 high-level steps)

  1. Select a model (estimator object)

    from sklearn.linear_model import LinearRegression
    model = LinearRegression(normalize=True)  # note: the 'normalize' argument was removed in scikit-learn 1.2; on newer versions use LinearRegression() and scale features in preprocessing
  2. Split the data into test and training sets

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
  3. Train/fit the model

    model.fit(X_train, y_train)
  4. Predict labels for the test data

    predictions = model.predict(X_test)
  5. Evaluate the model (compute metrics); a minimal sketch is shown below
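
    A minimal sketch of step 5 for the regression example above (assuming the model, X_test, and y_test from the previous steps; the choice of metrics here is illustrative):

    from sklearn.metrics import mean_squared_error, r2_score
    predictions = model.predict(X_test)
    print('MSE:', mean_squared_error(y_test, predictions))  # average squared prediction error
    print('R^2:', r2_score(y_test, predictions))            # proportion of variance explained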

sklearn demo

import numpy as np
import pandas as pd
df = pd.read_csv('./smsspamcollection.tsv', sep='\t')
df.head()


# check for missing data
df.isnull().sum()


There are no missing values in this dataset. If there were blank entries, the following code could be used to locate whitespace-only records:

blanks = []  # start with an empty list

# note: this unpacking assumes the DataFrame has exactly two columns (label, review)
for index, label, review in df.itertuples():  # iterate over the DataFrame
    if type(review) == str:       # avoid NaN values
        if review.isspace():      # test 'review' for whitespace
            blanks.append(index)  # add the matching index to the list

print(len(blanks), 'blanks: ', blanks)


print(df['label'].unique())       # view the classes
print(df['label'].value_counts()) # count the records in each class


# descriptive statistics for one column
df['length'].describe()


X = df[['length','punct']]
y = df['label']
print(X.shape)
print(y.shape)


from sklearn.model_selection import train_test_split
# split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.33,
                                                    random_state=42)
print('Training Data Shape:', X_train.shape)
print('Testing Data Shape: ', X_test.shape)


Create a LogisticRegression model

# build the model
from sklearn.linear_model import LogisticRegression
lr_model = LogisticRegression(solver='lbfgs')
lr_model.fit(X_train, y_train)


from sklearn import metrics
# make predictions
predictions = lr_model.predict(X_test)
# print the confusion matrix
print(metrics.confusion_matrix(y_test, predictions))
# print the classification report
print(metrics.classification_report(y_test, predictions))


Create a naïve Bayes classifier

# Train a naïve Bayes classifier:
from sklearn.naive_bayes import MultinomialNB
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

predictions = nb_model.predict(X_test)
print(metrics.confusion_matrix(y_test, predictions))
print(metrics.classification_report(y_test, predictions))
print(metrics.accuracy_score(y_test, predictions))


Create a support vector machine (SVC) model

from sklearn.svm import SVC
svc_model = SVC(gamma='auto')
svc_model.fit(X_train, y_train)

predictions = svc_model.predict(X_test)
print(metrics.confusion_matrix(y_test, predictions))
print(metrics.classification_report(y_test, predictions))
print(metrics.accuracy_score(y_test, predictions))


5. Text Feature Extraction

  • Machine learning algorithms (models) need numerical features to perform learning and prediction activities
  • We need to extract numerical features from the raw text

5.1 Count Vectorization


The CountVectorizer class:

  • creates a matrix of counts, with columns as words, i.e., it converts the words in the text into a term-frequency matrix
  • each matrix element a[i][j] is the frequency of word j in document i. This sparse matrix is called the Document-Term Matrix (DTM)
  • fit_transform() learns the vocabulary and counts the occurrences of each word
  • get_feature_names() (get_feature_names_out() in newer scikit-learn versions) returns the words in the vocabulary
  • toarray() shows the term-frequency matrix as a dense array (see the toy example below)
# Scikit-learn's CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)  # X_train is assumed to hold the raw text messages here
X_train_counts.shape

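A toy example on a made-up three-document corpus (illustrative only), showing the vocabulary and the dense term-frequency matrix; get_feature_names_out() is the newer name for get_feature_names() in recent scikit-learn versions:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['the dog barks', 'the cat meows', 'the dog chases the cat']  # made-up documents
cv = CountVectorizer()
dtm = cv.fit_transform(corpus)     # sparse Document-Term Matrix
print(cv.get_feature_names_out())  # vocabulary: ['barks' 'cat' 'chases' 'dog' 'meows' 'the']
print(dtm.toarray())               # dense term-frequency matrix, one row per document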

5.2 Term Frequency (TF)

  • Term Frequency tf(t, d): the raw count of a term in a document, i.e., the number of times term t occurs in document d
  • However, term frequency alone is not enough for a thorough feature analysis of the text. Consider stop words like “a” or “the”
  • Because the term “the” is so common, term frequency tends to incorrectly emphasize documents which happen to use the word “the” more frequently, without giving enough weight to rarer, more meaningful terms (e.g., “red” and “dogs” in a query about red dogs).

5.3 Inverse Document Frequency (IDF)

  • In order to reduce the unwanted impact of common words, an inverse document frequency factor is incorporated which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely
  • It is the logarithmically scaled inverse fraction of the documents that contain the word (obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient), as written out below
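
In symbols (one common formulation; libraries differ in smoothing details), for a corpus $D$ of $N$ documents:

$\mathrm{idf}(t) = \log \frac{N}{|\{d \in D : t \in d\}|}$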

5.4 TF-IDF

  • TF-IDF = Term Frequency × Inverse Document Frequency
  • That is, the raw term count is down-weighted according to how common the term is across the document set (see the formula below)
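
Written out (again, one common formulation; scikit-learn's TfidfTransformer adds smoothing and normalization on top of this):

$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t) = \mathrm{tf}(t, d) \times \log \frac{N}{|\{d \in D : t \in d\}|}$
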
# Transform Counts to Frequencies with Tf-idf
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape


5.5 TF-IDF Vectorizer

  • The TF-IDF Vectorizer is superior to a raw Count Vectorizer
  • TF-IDF allows us to understand the context of words across an entire corpus of documents, instead of just their relative importance in a single document
  • We use Scikit-Learn’s TfidfVectorizer (which combines CountVectorizer and TfidfTransformer) to train and fit our models
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)  # uses the raw text X_train directly
X_train_tfidf.shape


5.6 Pipeline

  • Pipeline of transforms with a final estimator.
  • Sequentially apply a list of transforms and a final estimator. Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. The final estimator only needs to implement fit. The transformers in the pipeline can be cached using the memory argument.
  • The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. For this, it enables setting parameters of the various steps using their names and the parameter name separated by a '__' (see the sketch after the pipeline example below). A step’s estimator may be replaced entirely by setting the parameter with its name to another estimator, or a transformer removed by setting it to 'passthrough' or None.
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
text_clf = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])
# Feed the training data through the pipeline
text_clf.fit(X_train, y_train)
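
A short sketch of the 'step-name__parameter' convention mentioned above (the parameter values here are arbitrary, chosen only for illustration):

# set parameters of individual steps via <step name>__<parameter name>
text_clf.set_params(tfidf__ngram_range=(1, 2), clf__C=0.5)  # the pipeline must be refit afterwards

# the same naming convention is used when grid-searching over a pipeline
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(text_clf, param_grid={'clf__C': [0.1, 1.0, 10.0]}, cv=5)
# grid.fit(X_train, y_train) would then refit the whole pipeline for each candidate C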


Evaluating the predictions shows a clear improvement in accuracy:

from sklearn import metrics
predictions = text_clf.predict(X_test)                     # predict
print(metrics.confusion_matrix(y_test, predictions))       # confusion matrix
print(metrics.classification_report(y_test, predictions))  # classification report
print(metrics.accuracy_score(y_test, predictions))         # accuracy


Use the trained pipeline to classify two new sentences:

text_clf.predict(["Hello, how are you?"])
text_clf.predict(['Congratulations, you have won $1 M'])


Text Classification Project:

  • Read in a collection of documents - a corpus
  • Transform text into numerical vector data using a pipeline
  • Create a classifier
  • Fit/train the classifier
  • Test the classifier on new data
  • Evaluate performance

6. Semantic Analysis

6.1 Overview

  • Semantic analysis is the process of drawing meaning from natural language text
  • It attempts to mimic the process humans follow, i.e., processing words in the context of their appearance, relating them to other words, and selecting the most appropriate meaning (removing ambiguities)
  • Context plays an important role; it helps to attribute the correct meaning
  • It is an essential sub-task of NLP and the driving force behind machine learning tools such as chatbots, search engines and text analysis

How does Semantic Analysis work?

  • Semantic analysis begins by understanding relationships between lexical items (words, phrasal verbs, noun phrases, etc.)
  • It creates lexical hierarchies using:
    • Hyponyms/Hypernyms: inheritance-like structure
    • Meronomy: whole/part structure
    • Polysemy: relationship based on a common core meaning
    • Synonyms: words with the same meaning, which can substitute for each other
    • Antonyms: words with opposite meanings
    • Homonyms: words with the same sound and/or spelling but different meanings
  • Semantic analysis considers signs and symbols (semiotics) and collocations (words that often go together)
  • Automated semantic analysis works with the help of machine learning algorithms. By feeding semantically enhanced algorithms with sample text, you can train machines to make accurate predictions based on past observations
  • Two important sub-tasks involved in this approach are:
    • Word sense disambiguation (e.g., “Orange” could mean a color, a fruit, or a county in California)
    • Relationship extraction (e.g., relationships between persons, organizations, and places)

6.2 Semantic Analysis Techniques


Classification Models

  • Topic Classification: sorting text into the predefined topics it belongs to. For example, a service ticket could be regarding a “payment issue” or a “shipping problem”.
  • Sentiment Analysis: detecting positive, negative, or neutral emotions. This could mean, for example, how customers feel about a product or service.
  • Intent Classification: classifying text based on what customers intend to do next. This could mean, for example, that a customer wants to talk to an expert.

Extraction Models

  • Keyword Extraction

    • Finding relevant words and expressions in a text
    • Used for more granular insight, e.g., for feedback classified as negative, which words or topics are mentioned most often
  • Entity Extraction

    • Identifying named entities in text. This could be customized to automatically detect company-specific terms, e.g., product/service names, ticket numbers, etc. (a minimal sketch follows this list)
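
A minimal entity-extraction sketch using spaCy's pre-trained named-entity recognizer (the model name and example sentence are illustrative; any English pipeline with an 'ner' component works, e.g., the en_core_web_md model used later in these notes):

import spacy

nlp = spacy.load('en_core_web_sm')  # small English pipeline with an NER component
doc = nlp(u'Apple is opening a new office in San Francisco for $50 million.')
for ent in doc.ents:
    print(ent.text, ent.label_)     # entity text and its label, e.g., ORG, GPE, MONEY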

7. Word Vectors

What are word vectors?

  • A word vector is a set of numbers that represents the meaning of a word, together with a lot of contextual information
  • A vector representing a word is an array of real-valued numbers, where each point captures a dimension of the word’s meaning
  • These numbers encode the meaning of words in such a way that words close in vector space are expected to have similar meaning
  • Creation of word vectors is a critical component of semantic analysis, and this approach is called word embedding


Why word vectors?

  • Representing words with numbers enables mathematical operations: such as detecting (cosine) similarity, adding & subtracting vectors, finding associations, and predicting meaning

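The cosine similarity mentioned above is the standard measure used throughout this section; for two word vectors $u$ and $v$:

$\text{similarity}(u, v) = \cos\theta = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$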

Interesting relationships can be established between word vectors.


How are word vectors created?

  • Word vectors are created by feeding a large corpus of text into a deep learning model (a neural network)
  • Word vectors follow the distributional hypothesis, which states: “You shall know a word by the company it keeps”
  • Words that share similar contexts tend to have similar meanings.
  • Models either use the context to predict a target word (the CBOW method) or use a word to predict a target context (the Skip-Gram method); a training sketch follows this list
  • word2vec is a two-layer neural network model
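
A minimal training sketch using the gensim library (a library assumed here for illustration only; the toy corpus is made up). sg=1 selects Skip-Gram, sg=0 selects CBOW; the rest of this section uses spaCy's pre-trained vectors instead:

from gensim.models import Word2Vec

toy_corpus = [['the', 'dog', 'barks'], ['the', 'cat', 'meows'], ['the', 'dog', 'chases', 'the', 'cat']]
w2v = Word2Vec(sentences=toy_corpus, vector_size=50, window=2, min_count=1, sg=1)

print(w2v.wv['dog'])                       # the 50-dimensional vector for 'dog'
print(w2v.wv.most_similar('dog', topn=2))  # nearest neighbours in the toy vector space
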
import spacy
nlp = spacy.load('en_core_web_md')  # medium English model, which ships with word vectors
nlp(u'lion').vector                 # the word vector for 'lion'


Check the dimensionality of the vectors:

doc = nlp(u'lion')
print(doc.vector.shape)
print(doc.vocab.vectors.shape)


A sentence also has a vector representation (the average of its token vectors), again with 300 dimensions:

doc = nlp(u'The quick brown fox jumps over the lazy dog')
doc.vector


Identifying similar vectors: the best way to expose vector relationships is through the .similarity() method of Token and Doc objects.

tokens = nlp(u'dog cat monkey')
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))


Note that order doesn’t matter. token1.similarity(token2) has the same value as token2.similarity(token1)

Vector norms

It’s sometimes helpful to aggregate the 300 dimensions into a Euclidean (L2) norm, computed as the square root of the sum of squared vector components. This is accessible through the .vector_norm token attribute. Other helpful attributes include .has_vector and .is_oov (out of vocabulary).
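
In symbols, for a vector $v$ with components $v_1, \dots, v_{300}$:

$\lVert v \rVert_2 = \sqrt{\sum_i v_i^2}$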

tokens = nlp(u'dog cat nowaythere')
for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)


Vector arithmetic

Believe it or not, we can actually calculate new vectors by adding & subtracting related vectors. A famous example suggests

"king" - "man" + "woman" = "queen"

from scipy import spatial

cosine_similarity = lambda x, y: 1 - spatial.distance.cosine(x, y)

king = nlp.vocab['king'].vector
man = nlp.vocab['man'].vector
woman = nlp.vocab['woman'].vector
queen = nlp.vocab['queen'].vector

new_vector = king - man + woman
print(cosine_similarity(new_vector, queen))
print(cosine_similarity(new_vector, woman))


8. Sentiment Analysis

What is sentiment analysis?

  • Sentiment analysis is a method that detects polarity (e.g.,positive or negative opinion) within the text
  • It is also used to detect emotions (happy, sad, angry, etc.), urgency (urgent vs. not urgent) and intentions (interested vs. not interested)
  • Sentiment analysis can be rule-based (manually crafted rules), automatic (feature extraction & text classification), or hybrid
  • Sentiment analysis is hard due to multiple reasons: sarcasm, idioms, negation handling, adverbial modifiers, comparisons, etc.

How is sentiment analysis used?

  • Since people share their opinion more openly than ever before, sentiment analysis is useful in a variety of ways such as: social media monitoring, customer support, customer feedback, brand monitoring, voice of customer, market research, etc.
  • Sentiment analysis can also be used as a real-time analysis tool, especially if events requiring urgent action need to be detected

9. VADER

  • Valence Aware Dictionary for Sentiment Reasoning (VADER) is a model used for text sentiment analysis that is sensitive to both polarity (positive/negative) and intensity (strength) of emotion
  • Primarily, VADER sentiment analysis relies on a dictionary which maps lexical features to emotion intensities called sentiment scores
  • The sentiment score of a text can be obtained by summing up the intensity of each word in the text
  • VADER is a rule-based system
  • VADER understands that words like “love”, “like”, “enjoy”, “happy” all convey a positive sentiment.
  • VADER is intelligent enough to understand basic context of these words, such as “did not love” as a negative sentiment.
  • It uses rules to also understand that the capitalization and punctuation enhance intensity of emotions, e.g., “LOVE!!!!”
  • VADER is available in the NLTK package and can be applied directly to unlabeled data.

Download the VADER lexicon.

import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()


VADER’s SentimentIntensityAnalyzer() takes in a string and returns a dictionary of scores in each of four categories:

  • negative
  • neutral
  • positive
  • compound (computed by normalizing the scores above)

Compute the sentiment scores for each example sentence:

a = 'This was a good movie.'
print(sid.polarity_scores(a))

a = 'This was the best, most awesome movie EVER MADE!!!'
print(sid.polarity_scores(a))

a = 'This was the worst film to ever disgrace the screen.'
print(sid.polarity_scores(a))


Adding Scores and Labels to the DataFrame

Apply the analyzer to each review and store the resulting score dictionary in a new 'scores' column:

df['scores'] = df['review'].apply(lambda review: sid.polarity_scores(review))
df.head()


10. Sentiment Analysis Project

For this project, we’ll perform the same type of NLTK VADER sentiment analysis, this time on our movie reviews dataset.

The 2,000-record IMDb movie review dataset is accessible through NLTK directly with

from nltk.corpus import movie_reviews

However, since we already have it in a tab-delimited file we’ll use that instead.

Load the Data

import numpy as np
import pandas as pd
df = pd.read_csv('../TextFiles/moviereviews.tsv', sep='\t')
df.head()


Remove Blank Records (optional)

# REMOVE NaN VALUES AND EMPTY STRINGS:
df.dropna(inplace=True)

blanks = []  # start with an empty list
for i, lb, rv in df.itertuples():  # iterate over the DataFrame
    if type(rv) == str:            # avoid NaN values
        if rv.isspace():           # test 'review' for whitespace
            blanks.append(i)       # add matching index numbers to the list

df.drop(blanks, inplace=True)


Import SentimentIntensityAnalyzer and create an sid object

from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

Use sid to append a comp_score to the dataset

df['scores'] = df['review'].apply(lambda review: sid.polarity_scores(review))
df['compound'] = df['scores'].apply(lambda score_dict: score_dict['compound'])
df['comp_score'] = df['compound'].apply(lambda c: 'pos' if c >= 0 else 'neg')
df.head()


Perform a comparison analysis between the original label and comp_score

from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, confusion_matrix
accuracy_score(df['label'], df['comp_score'])


print(classification_report(df['label'], df['comp_score']))


print(confusion_matrix(df['label'], df['comp_score']))
