Table of contents
1. Bag of words
2. Building the bag-of-words model
3. Training the text classification model
4. Prediction
Exercise:
1. Evaluation approach
2. Data preprocessing and modeling
3. Training
4. Prediction
5. Evaluating the model
6. Improvements

Learn from https://www.kaggle.com/learn/natural-language-processing
A common task in NLP is text classification. This is "classification" in the traditional machine-learning sense, applied to text. Applications include spam detection, sentiment analysis, and tagging customer queries.

In this tutorial you will learn text classification with spaCy. The classifier will detect spam messages, a common feature of most email clients.
Read the data:
```python
import pandas as pd

# Load the spam dataset
spam = pd.read_csv('./spam.csv')
spam.head(10)
```

1. Bag of words
Models cannot learn directly from raw text; the text first has to be converted into numeric features. The simplest approach is one-hot encoding.

For example:

Sentence 1: "Tea is life. Tea is love."
Sentence 2: "Tea is healthy, calming, and delicious."

Ignoring punctuation, the vocabulary is {tea, is, life, love, healthy, calming, and, delicious}.

Counting how often each word occurs in each sentence gives a vector representation:
v1 = [2, 2, 1, 1, 0, 0, 0, 0], v2 = [1, 1, 0, 0, 1, 1, 1, 1]
This is the bag-of-words representation: similar documents will have similar bag-of-words vectors. Another common representation is TF-IDF (Term Frequency - Inverse Document Frequency), which reweights the counts so that words shared by many documents count for less; the short sketch below reproduces both.
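As an aside (this is not part of the Kaggle lesson, which relies on spaCy's built-in "bow" architecture), the counts above are easy to reproduce with scikit-learn, assuming it is installed:

```python
# Minimal sketch: bag-of-words and TF-IDF vectorization with scikit-learn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sentences = ["Tea is life. Tea is love.",
             "Tea is healthy, calming, and delicious."]

# Bag of words: each column counts one vocabulary word.
# Note: scikit-learn orders the vocabulary alphabetically, so the columns
# are {and, calming, delicious, healthy, is, life, love, tea}.
bow = CountVectorizer()
print(bow.fit_transform(sentences).toarray())
# [[0 0 0 0 2 1 1 2]
#  [1 1 1 1 1 0 0 1]]

# TF-IDF: the same counts, reweighted so that words appearing in many
# documents (here "tea" and "is") contribute less
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(sentences).toarray().round(2))
```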
2. Building the bag-of-words model
spaCy's TextCategorizer handles the bag-of-words conversion and builds a simple linear model. It is a spaCy pipeline component:
```python
import spacy

# Create an empty English model
nlp = spacy.blank("en")

# Create the TextCategorizer with exclusive classes
# and "bow" architecture
textcat = nlp.create_pipe(
    "textcat",
    config={
        "exclusive_classes": True,  # mutually exclusive classes (binary here)
        "architecture": "bow",
    })

# Add the TextCategorizer to the empty model
nlp.add_pipe(textcat)

# help(nlp.create_pipe)
```
```
Help on method create_pipe in module spacy.language:

create_pipe(name, config={}) method of spacy.lang.en.English instance
    Create a pipeline component from a factory.

    name (unicode): Factory name to look up in Language.factories.
    config (dict): Configuration parameters to initialise component.
    RETURNS (callable): Pipeline component.

    DOCS: https://spacy.io/api/language#create_pipe
```
```python
# Add labels to text classifier
textcat.add_label("ham")   # legitimate email
textcat.add_label("spam")  # spam email
```

3. Training the text classification model
Get the data:
```python
train_texts = spam['text'].values
train_labels = [{'cats': {'ham': label == 'ham',
                          'spam': label == 'spam'}}
                for label in spam['label']]

# Zip the texts together with their corresponding labels
train_data = list(zip(train_texts, train_labels))
train_data[:3]
```

Output:
```
[('Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',
  {'cats': {'ham': True, 'spam': False}}),
 ('Ok lar... Joking wif u oni...',
  {'cats': {'ham': True, 'spam': False}}),
 ("Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
  {'cats': {'ham': False, 'spam': True}})]
```

Prepare to train the model:
- Create an optimizer with `optimizer = nlp.begin_training()`; spaCy uses it to update the model weights
- Split the data into batches with `minibatch`
- Update the model parameters with `nlp.update`
```python
from spacy.util import minibatch

spacy.util.fix_random_seed(1)
optimizer = nlp.begin_training()

# Split the data into batches
batches = minibatch(train_data, size=8)
# Iterate over the batches
for batch in batches:
    texts, labels = zip(*batch)
    nlp.update(texts, labels, sgd=optimizer)
```

This is only a single epoch. As an aside, `zip(*batch)` unpacks a batch of (text, label) pairs into separate tuples:

```python
batch = [(1, True), (2, False)]
texts, labels = zip(*batch)
texts   # (1, 2)
labels  # (True, False)
```

https://www.runoob.com/python/python-func-zip.html
Iterate over multiple epochs:
```python
import random

random.seed(1)
spacy.util.fix_random_seed(1)
optimizer = nlp.begin_training()

loss = {}
for epoch in range(10):
    # Shuffle the data each epoch
    random.shuffle(train_data)
    # Split the data into batches
    batches = minibatch(train_data, size=8)
    # Iterate over the batches
    for batch in batches:
        texts, labels = zip(*batch)
        nlp.update(texts, labels, drop=0.3, sgd=optimizer, losses=loss)
    print(loss)

# help(nlp.update)
```
```
Help on method update in module spacy.language:

update(docs, golds, drop=0.0, sgd=None, losses=None, component_cfg=None) method of spacy.lang.en.English instance
    Update the models in the pipeline.

    docs (iterable): A batch of `Doc` objects.
    golds (iterable): A batch of `GoldParse` objects.
    drop (float): The dropout rate.
    sgd (callable): An optimizer.
    losses (dict): Dictionary to update with the loss, keyed by component.
    component_cfg (dict): Config parameters for specific pipeline
        components, keyed by component name.

    DOCS: https://spacy.io/api/language#update
```

Output:
```
{'textcat': 0.22436044702671132}
{'textcat': 0.41457826484549287}
{'textcat': 0.5661000985640895}
{'textcat': 0.7119002992385974}
{'textcat': 0.8301601885299159}
{'textcat': 0.9572314705652767}
{'textcat': 1.050187804254974}
{'textcat': 1.1268915971417424}
{'textcat': 1.2132206293363608}
{'textcat': 1.3000399094508472}
```
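The lesson stops here, but if you want to keep the trained classifier around, spaCy v2 pipelines can be serialized to disk. A minimal sketch ("spam_model" is just an example directory name, and the test text is made up):

```python
# Save the trained pipeline to a directory and reload it later
nlp.to_disk("spam_model")

reloaded = spacy.load("spam_model")
doc = reloaded("WINNER!! Claim your FREE prize now")  # made-up example text
print(doc.cats)  # e.g. {'ham': ..., 'spam': ...}
```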
4. Prediction

Before predicting, first tokenize the texts with `nlp.tokenizer`:
```python
texts = ["Are you ready for the tea party????? It's gonna be wild",
         "URGENT Reply to this message for GUARANTEED FREE TEA"]
docs = [nlp.tokenizer(text) for text in texts]

textcat = nlp.get_pipe('textcat')
scores, _ = textcat.predict(docs)
print(scores)
```

Output (predicted probabilities):
```
[[9.9999392e-01 6.1252954e-06]
 [4.1843491e-04 9.9958152e-01]]
```

Print the predicted labels:
```python
predicted_labels = scores.argmax(axis=1)
print([textcat.labels[label] for label in predicted_labels])
```

```
['ham', 'spam']
```

Exercise
In the previous exercise you did such great work for DeFalco's restaurant that the chef hired you for a new project.

The restaurant's menu includes an email address where visitors can send feedback about their food.

The manager wants you to create a tool that automatically forwards all negative reviews to him so he can fix them, and automatically forwards all positive reviews to the owner, so the manager can ask for a raise.

You will first build a model that distinguishes positive from negative reviews using Yelp reviews, since those come with a rating for each review. Your data consists of the body of each review and its star rating.

Ratings of 1-2 stars count as "negative" samples and 4-5 stars as "positive" samples. 3-star ratings are "neutral" and have already been removed from the data; a sketch of this mapping follows.
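The course CSV (yelp_ratings.csv) already ships with a binary `sentiment` column, but if you had to derive it yourself from a hypothetical `stars` column, the mapping described above might look like this (both the file name and schema are assumptions for illustration):

```python
import pandas as pd

# Hypothetical raw file with a 1-5 'stars' column (not the course CSV)
reviews = pd.read_csv("raw_reviews.csv")

# Drop the "neutral" 3-star reviews, then map 4-5 stars -> 1 (positive)
# and 1-2 stars -> 0 (negative)
reviews = reviews[reviews["stars"] != 3]
reviews["sentiment"] = (reviews["stars"] >= 4).astype(int)
```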
1. Evaluation approach
The advantage of this approach is that you can distinguish positive from negative emails even if you don't have historical emails labeled as positive or negative.

The drawback is that emails may differ substantially from Yelp reviews (a different distribution), which lowers the model's accuracy. For example, customers often use different words or slang in email, and a model based on Yelp reviews will not have seen those words.

If you want to know how severe this problem is, you can compare the word frequencies of the two sources (a rough sketch follows). In practice, manually reading a few emails from each source is enough to judge whether it is a serious problem. If you want to do something fancier, you could create a dataset containing both Yelp reviews and emails and see whether a model can tell a review's source from its text content. Ideally you would find that this model performs poorly, because that would mean your data sources are similar.
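A rough sketch of that word-frequency comparison, counting tokens with the spaCy tokenizer (here `reviews` and `emails` are hypothetical lists of strings; nothing like `emails` exists in the course data):

```python
from collections import Counter

def word_freq(texts, nlp):
    """Count lowercased tokens across texts, skipping punctuation."""
    counts = Counter()
    for doc in nlp.tokenizer.pipe(texts):
        counts.update(tok.text.lower() for tok in doc if not tok.is_punct)
    return counts

# Compare the most common words in each source
review_freq = word_freq(reviews, nlp)  # 'reviews': list of Yelp review strings
email_freq = word_freq(emails, nlp)    # 'emails': hypothetical email strings
print(review_freq.most_common(20))
print(email_freq.most_common(20))
```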
2. Data preprocessing and modeling

Split the dataset:
```python
def load_data(csv_file, split=0.9):
    data = pd.read_csv(csv_file)

    # Shuffle data
    train_data = data.sample(frac=1, random_state=7)

    texts = train_data.text.values
    labels = [{"POSITIVE": bool(y), "NEGATIVE": not bool(y)}
              for y in train_data.sentiment.values]
    split = int(len(train_data) * split)

    train_labels = [{"cats": labels} for labels in labels[:split]]
    val_labels = [{"cats": labels} for labels in labels[split:]]

    return texts[:split], train_labels, texts[split:], val_labels

train_texts, train_labels, val_texts, val_labels = load_data('../input/nlp-course/yelp_ratings.csv')
```

Inspect the training data:
```python
print("Texts from training data\n------")
print(train_texts[:2])
print("\nLabels from training data\n------")
print(train_labels[:2])
```

Output:
```
Texts from training data
------
["Some of the best sushi I've ever had....and I come from the East Coast. Unreal toro, have some of it's available."
 "One of the best burgers I've ever had and very well priced. I got the tortilla burger and is was delicious especially with there tortilla soup!"]

Labels from training data
------
[{'cats': {'POSITIVE': True, 'NEGATIVE': False}},
 {'cats': {'POSITIVE': True, 'NEGATIVE': False}}]
```

Modeling:
```python
import spacy

# Create an empty English model
nlp = spacy.blank("en")

# Create the TextCategorizer with exclusive classes
# and "bow" architecture
textcat = nlp.create_pipe(
    "textcat",
    config={
        "exclusive_classes": True,
        "architecture": "bow",
    })

# Add the TextCategorizer to the empty model
nlp.add_pipe(textcat)

# Add NEGATIVE and POSITIVE labels to text classifier
textcat.add_label("NEGATIVE")  # negative reviews
textcat.add_label("POSITIVE")  # positive reviews
```

3. Training
```python
from spacy.util import minibatch
import random

def train(model, train_data, optimizer, batch_size=8):
    loss = {}
    random.seed(1)
    random.shuffle(train_data)
    batches = minibatch(train_data, size=batch_size)
    for batch in batches:
        # train_data is a list of tuples [(text0, label0), (text1, label1), ...]
        # Split batch into texts and labels
        texts, labels = zip(*batch)
        # Update model with texts and labels
        model.update(texts, labels, sgd=optimizer, losses=loss)
    return loss
```

Train:
```python
# Fix seed for reproducibility
spacy.util.fix_random_seed(1)
random.seed(1)

# This may take a while to run!
optimizer = nlp.begin_training()
train_data = list(zip(train_texts, train_labels))
losses = train(nlp, train_data, optimizer)
print(losses['textcat'])
```

Try it out:
```python
text = "This tea cup was full of holes. Do not recommend."
doc = nlp(text)
print(doc.cats)
```

Output:
```
{'NEGATIVE': 0.7731374502182007, 'POSITIVE': 0.22686253488063812}
```

This cup of tea was no good: the negative class gets the larger probability.
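To collapse `doc.cats` into a single label, take the key with the highest score (a small convenience snippet, not part of the original exercise):

```python
# Pick the class with the highest predicted probability
print(max(doc.cats, key=doc.cats.get))  # NEGATIVE
```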
4. Prediction
```python
def predict(nlp, texts):
    # Use the model's tokenizer to tokenize each input text
    docs = [nlp.tokenizer(text) for text in texts]

    # Use textcat to get the scores for each doc
    textcat = nlp.get_pipe('textcat')
    scores, _ = textcat.predict(docs)

    # From the scores, find the class with the highest score/probability
    pred_labels = scores.argmax(axis=1)
    return pred_labels
```
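A quick sanity check of `predict` on a couple of validation texts (a sketch; which texts get printed depends on your data split):

```python
texts = val_texts[:2]
predictions = predict(nlp, texts)

for p, t in zip(predictions, texts):
    # textcat.labels maps a column index back to its label string
    print(f"{textcat.labels[p]}: {t[:60]}...")
```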
5. Evaluating the model

```python
def evaluate(model, texts, labels):
    """Returns the accuracy of a TextCategorizer model.

    Arguments
    ---------
    model: spaCy model with a TextCategorizer
    texts: Text samples, from load_data function
    labels: True labels, from load_data function
    """
    # Get predictions from textcat model (using your predict method)
    predicted_class = predict(model, texts)

    # From labels, get the true class as a list of integers (POSITIVE -> 1, NEGATIVE -> 0)
    true_class = [int(label['cats']['POSITIVE']) for label in labels]

    # A boolean or int array indicating correct predictions
    correct_predictions = (true_class == predicted_class)

    # The accuracy: number of correct predictions divided by all predictions
    accuracy = sum(correct_predictions) / len(true_class)
    return accuracy

accuracy = evaluate(nlp, val_texts, val_labels)
```
```python
print(f"Accuracy: {accuracy:.4f}")
```

Output (validation accuracy 92.39%):
```
Accuracy: 0.9239
```

Train for several more iterations:
```python
# This may take a while to run!
n_iters = 5
for i in range(n_iters):
    losses = train(nlp, train_data, optimizer)
    accuracy = evaluate(nlp, val_texts, val_labels)
    print(f"Loss: {losses['textcat']:.3f} \t Accuracy: {accuracy:.3f}")
```

```
Loss: 6.752 	 Accuracy: 0.940
Loss: 4.105 	 Accuracy: 0.947
Loss: 2.904 	 Accuracy: 0.945
Loss: 2.267 	 Accuracy: 0.946
Loss: 1.826 	 Accuracy: 0.944
```

6. Improvements
There are various hyperparameters you can tune here. The most important one is the TextCategorizer's `architecture`.

Above we used the simplest architecture ("bow"); it trains quickly but will likely perform worse than a CNN or an ensemble model. A sketch of switching architectures follows.
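For example, spaCy v2's TextCategorizer also ships "simple_cnn" and "ensemble" architectures, so switching is a one-line config change (a sketch; training then proceeds exactly as before, just more slowly):

```python
import spacy

nlp = spacy.blank("en")

# Same setup as before, but with the built-in CNN architecture
textcat = nlp.create_pipe(
    "textcat",
    config={
        "exclusive_classes": True,
        "architecture": "simple_cnn",  # or "ensemble"
    })
nlp.add_pipe(textcat)
```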
My CSDN blog: https://michael.blog.csdn.net/

Long-press or scan the QR code to follow my WeChat official account "Michael阿明"; let's keep learning and improving together!