1. Concept
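Let's first look at what stopwords are, and then at how to remove English stopwords with nltk.
Some very common words occur with extremely high frequency, for example "a", "the" and "he" in English, or 我, 它 and 个 in Chinese. Almost every page contains them. If a search engine indexed them as keywords, every site would match and the results would carry no discriminating power, so these words are usually dropped and never treated as keywords.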
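To make the "no discriminating power" point concrete, here is a minimal sketch that counts in how many pages each word occurs. The three "pages" and the helper `matching_pages` are made up for illustration; no real search engine works on data this small.

```python
from collections import Counter

# Hypothetical mini-corpus: each string stands in for one web page.
pages = [
    "the cat sat on the mat",
    "he read the news about a storm",
    "a guide to the python language",
]

# Document frequency: in how many pages does each word occur at least once?
doc_freq = Counter()
for page in pages:
    doc_freq.update(set(page.split()))


def matching_pages(keyword):
    """Return the indices of the pages that contain the keyword."""
    return [i for i, page in enumerate(pages) if keyword in page.split()]


# "the" matches every page, so it cannot tell pages apart;
# "python" matches a single page, so it is a useful keyword.
print(doc_freq["the"], matching_pages("the"))
print(doc_freq["python"], matching_pages("python"))
```

A word that matches every page narrows nothing down, which is why it is a natural candidate for a stopword list.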
2. Removing English stopwords with nltk
First, import stopwords. The code is as follows:
from nltk.corpus import stopwords
# The stopwords corpus must be installed; fetch it once with nltk.download('stopwords')
words = stopwords.words('english')
print(words)
Here is the stopword list that gets printed:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn']
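Every entry in the printed list is lowercase, so a plain membership test will not catch capitalized tokens such as "The". The sketch below illustrates this with a small hand-copied subset of the list (`sample_stop_words` and `tokens` are hypothetical names and data chosen for the example), and shows one possible way to handle it by lowercasing only for the lookup.

```python
# A few entries copied from the printed list; the full list would come
# from stopwords.words('english').
sample_stop_words = {"i", "the", "is", "a", "of"}

# Hypothetical token list for illustration.
tokens = ["The", "capital", "of", "France", "is", "Paris"]

# Case-sensitive comparison keeps "The" because only "the" is listed.
case_sensitive = [t for t in tokens if t not in sample_stop_words]

# Lowercasing only for the lookup keeps the original casing of the
# tokens that survive the filter.
case_insensitive = [t for t in tokens if t.lower() not in sample_stop_words]

print(case_sensitive)
print(case_insensitive)
```

Whether to lowercase depends on the task; for case-sensitive corpora it may be preferable to leave tokens untouched.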
Of course, in many tasks, dialogue tasks for example, the stopwords also include the following symbols and suffixes:
['!', ',', '.', '?', '-s', '-ly', '', 's']
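Use the following code to add them:
# Build a set first: stopwords.words() returns a list, which has no add()
stop_words = set(stopwords.words('english'))
for w in ['!', ',', '.', '?', '-s', '-ly', '', 's']:
    stop_words.add(w)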
Removing stopwords is then very easy. Assuming our corpus is in word_list, we only need the following code:
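from nltk.corpus import stopwords
# Build the set once so each lookup is fast and the extra entries are honored
stop_words = set(stopwords.words('english'))
for w in ['!', ',', '.', '?', '-s', '-ly', '', 's']:
    stop_words.add(w)
filtered_words = [word for word in word_list if word not in stop_words]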
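The steps above can be combined into a small end-to-end sketch. The helper names `load_english_stop_words` and `remove_stopwords`, the sample `word_list`, and the tiny fallback set are assumptions made for illustration; the fallback exists only so the sketch still runs when nltk or its stopwords corpus is not installed. The extra entries are the legible ones from the symbol list above.

```python
def load_english_stop_words():
    """Return NLTK's English stopwords as a set, or a tiny fallback subset.

    The fallback subset is an assumption for illustration, not NLTK's list.
    """
    try:
        from nltk.corpus import stopwords
        # Raises LookupError if the stopwords corpus has not been downloaded.
        return set(stopwords.words("english"))
    except (ImportError, LookupError):
        return {"i", "me", "my", "the", "a", "an", "is", "are", "of", "to"}


def remove_stopwords(word_list, extra=("!", ",", ".", "?", "-s", "-ly", "s")):
    """Filter word_list against the stopword set plus task-specific extras."""
    stop_words = load_english_stop_words()
    stop_words.update(extra)
    return [word for word in word_list if word not in stop_words]


# Hypothetical, already tokenized input.
word_list = ["the", "weather", "is", "nice", "today", "!"]
filtered_words = remove_stopwords(word_list)
print(filtered_words)
```

Converting the list to a set once, instead of calling `stopwords.words('english')` inside the comprehension, avoids rebuilding the list for every word and ensures the added symbols actually take part in the filtering.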