

news 2026/1/13 17:19:56

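Contents

1. Using the spaCy library for NLP
2. Tokenizing
3. Text preprocessing
4. Pattern matching
Exercise: menu satisfaction survey
    1 Find menu items in a review
    2 Match across all reviews
    3 The least popular dish
    4 How often each dish is mentioned

Learned from https://www.kaggle.com/learn/natural-language-processing

1. Using the spaCy library for NLP

spaCy: https://spacy.io/usage

spaCy needs to know which language it is working with; load a language model with spacy.load().

Open cmd as administrator and run python -m spacy download en to download the English (en) model.

    import spacy
    # 'en' is the spaCy 2.x shortcut name; spaCy 3.x dropped shortcuts,
    # so there the model is downloaded and loaded as 'en_core_web_sm'
    nlp = spacy.load('en')

Now you can process text:

    doc = nlp("Tea is healthy and calming, don't you think?")

2. Tokenizing

Tokenizing returns a document object that contains tokens. A token is a unit of text in the document, such as a single word or a punctuation mark. spaCy splits contractions like "don't" into two tokens, "do" and "n't". You can see the tokens by iterating over the document.

    for token in doc:
        print(token)

Output:

    Tea
    is
    healthy
    and
    calming
    ,
    do
    n't
    you
    think
    ?

3. Text preprocessing

A few kinds of preprocessing can improve how we model with words. The first is lemmatizing. The lemma of a word is its base form. For example, "walk" is the lemma of "walking", so lemmatizing "walking" converts it to "walk". Removing stopwords is also common. Stopwords are words that occur frequently in a language and carry little information. English stopwords include "the", "is", "and", "but", and "not".

token.lemma_ returns the lemma of the word.
token.is_stop returns True if the token is a stopword and False otherwise.

    print("Token \t\tLemma \t\tStopword")
    print("-" * 40)
    for token in doc:
        print(f"{str(token)}\t\t{token.lemma_}\t\t{token.is_stop}")

In the sentence above, the important words are tea, healthy, and calming. Removing stopwords may help a predictive model focus on the relevant words. Lemmatizing likewise helps by collapsing multiple forms of the same word into one base form: calming, calms, and calmed all become calm. However, lemmatizing and removing stopwords can also make a model perform worse, so treat this preprocessing as part of hyperparameter optimization.

4.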
Pattern matching

Another common NLP task is matching words or phrases within a chunk of text or a whole document. You can do pattern matching with regular expressions, but spaCy's matching tools tend to be easier to use. To match individual tokens, create a Matcher. When you want to match a list of terms, PhraseMatcher is easier and more efficient. For example, to find where different smartphone models show up in some text, create patterns for the model names of interest. First, create the PhraseMatcher:

    from spacy.matcher import PhraseMatcher
    matcher = PhraseMatcher(nlp.vocab, attr='LOWER')

Here the matcher uses the vocabulary of the English model that is already loaded, and attr='LOWER' makes it compare the lowercased text, so matching is case-insensitive.

Create the list of terms to match:

    terms = ['Galaxy Note', 'iPhone 11', 'iPhone XS', 'Google Pixel']
    patterns = [nlp(text) for text in terms]
    print(patterns)
    # Output: [Galaxy Note, iPhone 11, iPhone XS, Google Pixel]
    matcher.add("match1", patterns)
    # help(matcher.add)

    text_doc = nlp("Glowing review overall, and some really interesting side-by-side "
                   "photography tests pitting the iPhone 11 Pro against the "
                   "Galaxy Note 10 Plus and last year’s iPhone XS and Google Pixel 3.")
    for i, text in enumerate(text_doc):
        print(i, text)
    matches = matcher(text_doc)
    print(matches)

Output (one token per line when run; wrapped here):

    0 Glowing   1 review   2 overall   3 ,   4 and   5 some   6 really   7 interesting
    8 side   9 -   10 by   11 -   12 side   13 photography   14 tests   15 pitting
    16 the   17 iPhone   18 11   19 Pro   20 against   21 the   22 Galaxy   23 Note
    24 10   25 Plus   26 and   27 last   28 year   29 ’s   30 iPhone   31 XS
    32 and   33 Google   34 Pixel   35 3   36 .
    [(12981744483764759145, 17, 19),  # iPhone 11
     (12981744483764759145, 22, 24),  # Galaxy Note
     (12981744483764759145, 30, 32),  # iPhone XS
     (12981744483764759145, 33, 35)]  # Google Pixel

Each match is a tuple (match id, start position, end position). The end position is exclusive, so text_doc[start:end] is the matched span.

    match_id, start, end = matches[3]
    print(nlp.vocab.strings[match_id], text_doc[start:end])

Output:

    match1 Google Pixel

Exercise: menu satisfaction survey

You are a consultant for DelFalco's Italian restaurant. The owner has asked you to find out whether any items on the menu disappoint diners. The owner suggests using reviews from Yelp to judge which dishes people like and dislike. You have pulled the data from Yelp. Before starting the analysis, run the code cell below for a quick look at the data you will work with.

    import pandas as pd
    data = pd.read_json('../input/nlp-course/restaurant.json')
    data.head()

The owner also gave you this list of menu items and common alternate spellings:

    menu = ["Cheese Steak", "Cheesesteak", "Steak and Cheese", "Italian Combo", "Tiramisu", "Cannoli",
            "Chicken Salad", "Chicken Spinach Salad", "Meatball", "Pizza", "Pizzas", "Spaghetti",
            "Bruchetta", "Eggplant", "Italian Beef", "Purista", "Pasta", "Calzones", "Calzone",
            "Italian Sausage", "Chicken Cutlet", "Chicken Parm", "Chicken Parmesan", "Gnocchi",
            "Chicken Pesto", "Turkey Sandwich", "Turkey Breast", "Ziti", "Portobello", "Reuben",
            "Mozzarella Caprese", "Corned Beef", "Garlic Bread", "Pastrami", "Roast Beef",
            "Tuna Salad", "Lasagna", "Artichoke Salad", "Fettuccini Alfredo", "Chicken Parmigiana",
            "Grilled Veggie", "Grilled Veggies", "Grilled Vegetable", "Mac and Cheese", "Macaroni",
            "Prosciutto", "Salami"]

Given the Yelp data and the list of menu items, how could you find the menu items that disappoint diners? One approach is to group the reviews by the menu items they mention and then compute the average rating for each item. Items that are mentioned in reviews with low scores are candidates for a changed recipe or for removal from the menu.

1 Find menu items in a review

    import spacy
    from spacy.matcher import PhraseMatcher

    index_of_review_to_test_on = 14
    text_to_test_on = data.text.iloc[index_of_review_to_test_on]

    # Load a blank English pipeline (tokenizer only)
    nlp = spacy.blank('en')

    # Create the tokenized version of text_to_test_on
    review_doc = nlp(text_to_test_on)

    # Create the PhraseMatcher object. The shared vocab is the first argument.
    # Use attr='LOWER' to match regardless of capitalization
    matcher = PhraseMatcher(nlp.vocab, attr='LOWER')

    # Create a list of tokenized docs, one for each item in the menu
    menu_tokens_list = [nlp(item) for item in menu]

    # Add the item patterns to the matcher.
    # Look at https://spacy.io/api/phrasematcher#add in the docs for help with this step
    matcher.add("MENU",            # Just a name for the set of rules we're matching to
                menu_tokens_list)

    # Find matches in the review_doc
    matches = matcher(review_doc)

    for i, text in enumerate(review_doc):
        print(i, text)
    for match in matches:
        print(f"Token number {match[1]}: {review_doc[match[1]:match[2]]}")

This finds the positions in the review where menu terms appear (one token per line when run; wrapped here; token 46 is whitespace, so it prints as blank):

    0 The   1 Il   2 Purista   3 sandwich   4 has   5 become   6 a   7 staple
    8 of   9 my   10 life   11 .   12 Mozzarella   13 ,   14 basil   15 ,
    16 prosciutto   17 ,   18 roasted   19 red   20 peppers   21 and   22 balsamic
    23 vinaigrette   24 blend   25 into   26 a   27 front   28 runner   29 for
    30 the   31 best   32 sandwich   33 in   34 the   35 valley   36 .   37 Goes
    38 great   39 with   40 sparkling   41 water   42 or   43 a   44 beer   45 .
    46    47 DeFalco   48 's   49 also   50 has   51 other   52 Italian   53 fare
    54 such   55 as   56 a   57 delicious   58 meatball   59 sub   60 and
    61 classic   62 pastas   63 .
    Token number 2: Purista
    Token number 16: prosciutto
    Token number 58: meatball

2 Match across all reviews

Build a dictionary in which each menu item found in the reviews is a key and the value is a list of star ratings [stars, ...]. For every review, append its rating to the list of each item it mentions.

    from collections import defaultdict

    # item_ratings is a dictionary of lists. If a key doesn't exist in item_ratings,
    # the key is added with an empty list as the value.
    item_ratings = defaultdict(list)

    for idx, review in data.iterrows():
        doc = nlp(review.text)
        # Using the matcher from the previous exercise
        matches = matcher(doc)

        # Create a set of the items found in the review text.
        # Lowercasing the item strings makes the keys case-insensitive
        found_items = set([doc[m[1]:m[2]].lower_ for m in matches])

        # Update item_ratings with the rating for each item in found_items
        for item in found_items:
            item_ratings[item].append(review.stars)

3 The least popular dish

    # Calculate the mean rating for each menu item as a dictionary
    mean_ratings = {name: sum(scores) / len(scores) for name, scores in item_ratings.items()}

    # Find the worst item and store its name as a string in worst_item.
    # This can be multiple lines of code if you want.
    worst_item = sorted(mean_ratings, key=lambda x: mean_ratings[x])[0]

    # Print the worst item, along with its average rating.
    print(worst_item)
    print(mean_ratings[worst_item])

Output:

    chicken cutlet
    3.4

4 How often each dish is mentioned

How many reviews mention each dish?

    counts = {item: len(ratings) for item, ratings in item_ratings.items()}

    item_counts = sorted(counts, key=counts.get, reverse=True)
    for item in item_counts:
        print(f"{item:25}{counts[item]:5}")

Output:

    pizza                      265
    pasta                      206
    meatball                   128
    cheesesteak                 97
    cheese steak                76
    cannoli                     72
    calzone                     72
    eggplant                    69
    purista                     63
    lasagna                     59
    italian sausage             53
    prosciutto                  50
    chicken parm                50
    garlic bread                39
    gnocchi                     37
    spaghetti                   36
    calzones                    35
    pizzas                      32
    salami                      28
    chicken pesto               27
    italian beef                25
    tiramisu                    21
    italian combo               21
    ziti                        21
    chicken parmesan            19
    chicken parmigiana          17
    portobello                  14
    mac and cheese              11
    chicken cutlet              10
    steak and cheese             9
    pastrami                     9
    roast beef                   7
    fettuccini alfredo           6
    grilled veggie               6
    tuna salad                   5
    turkey sandwich              5
    artichoke salad              5
    macaroni                     5
    chicken salad                5
    reuben                       4
    chicken spinach salad        2
    corned beef                  2
    turkey breast                1

Print the 10 worst and the 10 best items by mean rating:

    sorted_ratings = sorted(mean_ratings, key=mean_ratings.get)

    print("Worst rated menu items:")
    for item in sorted_ratings[:10]:
        print(f"{item:20} Ave rating: {mean_ratings[item]:.2f} \tcount: {counts[item]}")

    print("\n\nBest rated menu items:")
    for item in sorted_ratings[-10:]:
        print(f"{item:20} Ave rating: {mean_ratings[item]:.2f} \tcount: {counts[item]}")

Output:

    Worst rated menu items:
    chicken cutlet       Ave rating: 3.40    count: 10
    turkey sandwich      Ave rating: 3.80    count: 5
    spaghetti            Ave rating: 3.89    count: 36
    italian beef         Ave rating: 3.92    count: 25
    tuna salad           Ave rating: 4.00    count: 5
    macaroni             Ave rating: 4.00    count: 5
    italian combo        Ave rating: 4.05    count: 21
    garlic bread         Ave rating: 4.13    count: 39
    roast beef           Ave rating: 4.14    count: 7
    eggplant             Ave rating: 4.16    count: 69


    Best rated menu items:
    chicken pesto        Ave rating: 4.56    count: 27
    chicken salad        Ave rating: 4.60    count: 5
    purista              Ave rating: 4.67    count: 63
    prosciutto           Ave rating: 4.68    count: 50
    reuben               Ave rating: 4.75    count: 4
    steak and cheese     Ave rating: 4.89    count: 9
    artichoke salad      Ave rating: 5.00    count: 5
    fettuccini alfredo   Ave rating: 5.00    count: 6
    turkey breast        Ave rating: 5.00    count: 1
    corned beef          Ave rating: 5.00    count: 2

The less data you have for a particular item, the less you can trust that its average rating reflects customers' "true" sentiment. I would take dishes off the menu only if they are rated low and have been reviewed by more than 20 people.

My CSDN blog: https://michael.blog.csdn.net/ (WeChat official account: Michael阿明)
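The closing rule of thumb (low rating and more than 20 reviews) can be written as a small filter over the mean_ratings and counts dictionaries. The following is only a sketch: the function name flag_dishes, its parameters, and the 4.0 cutoff for a "low" rating are assumptions for illustration, since the article does not say what counts as low. The sample values are copied from the "Worst rated menu items" output.

```python
def flag_dishes(mean_ratings, counts, min_reviews=20, max_rating=4.0):
    """Return dishes with more than `min_reviews` reviews and a mean rating
    below `max_rating`, sorted from worst to best.

    Both thresholds are illustrative assumptions, not values from the course.
    """
    flagged = [
        item
        for item, rating in mean_ratings.items()
        if counts.get(item, 0) > min_reviews and rating < max_rating
    ]
    return sorted(flagged, key=mean_ratings.get)


# A few rows copied from the "Worst rated menu items" output
mean_ratings = {
    "chicken cutlet": 3.40,
    "turkey sandwich": 3.80,
    "spaghetti": 3.89,
    "italian beef": 3.92,
    "italian combo": 4.05,
}
counts = {
    "chicken cutlet": 10,
    "turkey sandwich": 5,
    "spaghetti": 36,
    "italian beef": 25,
    "italian combo": 21,
}

flagged = flag_dishes(mean_ratings, counts)
print(flagged)  # ['spaghetti', 'italian beef']
```

With these thresholds, chicken cutlet has the lowest mean rating but is not flagged because it has only 10 reviews, which matches the caution about small samples.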


http://www.yutouwan.com/news/162795/