首页 \ 问答 \ Spacy vs Word2Vec中的文档相似性(Document similarity in Spacy vs Word2Vec)

Spacy vs Word2Vec中的文档相似性(Document similarity in Spacy vs Word2Vec)

我有一个约12k文档的利基语料库,并且我想测试具有类似含义的近乎重复的文档 - 思考关于由不同新闻机构覆盖的同一事件的文章。

我曾尝试过gensim的Word2Vec,即使测试文档位于语料库中,它也会给我带来可怕的相似性分数(<0.3),并且我尝试了SpaCy,它给了我> 5k个文档的相似度> 0.9。 我测试了SpaCy最相似的文档,而且它几乎没有用处。

这是相关的代码。

tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=40)
doc = preprocess(query)
vec_bow = dictionary.doc2bow(doc)
vec_lsi_tfidf = lsi[tfidf[vec_bow]] # convert the query to LSI space
index = similarities.Similarity(corpus = corpus, num_features = len(dictionary), output_prefix = "pqr")
sims = index[vec_lsi_tfidf] # perform a similarity query against the corpus
most_similar = sorted(list(enumerate(sims)), key = lambda x:x[1])

for mid in most_similar[-100:]:
    print(mid, file_list[mid[0]])

使用gensim我已经找到了一个体面的方法,进行了一些预处理,但相似性分数仍然很低。 有没有人遇到过这样的问题,并且有一些可能有用的资源或建议?


I have a niche corpus of ~12k docs, and I want to test near-duplicate documents with similar meanings across it - think article about the same event covered by different news organisations.

I have tried gensim's Word2Vec, which gives me terrible similarity score(<0.3) even when the test document is within the corpus, and I have tried SpaCy, which gives me >5k documents with similarity > 0.9. I tested SpaCy's most similar documents, and it was mostly useless.

This is the relevant code.

tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=40)
doc = preprocess(query)
vec_bow = dictionary.doc2bow(doc)
vec_lsi_tfidf = lsi[tfidf[vec_bow]] # convert the query to LSI space
index = similarities.Similarity(corpus = corpus, num_features = len(dictionary), output_prefix = "pqr")
sims = index[vec_lsi_tfidf] # perform a similarity query against the corpus
most_similar = sorted(list(enumerate(sims)), key = lambda x:x[1])

for mid in most_similar[-100:]:
    print(mid, file_list[mid[0]])

Using gensim I have found a decent approach, with some preprocessing, but the similarity score is still quite low. Has anyone faced such a problem, and are there are some resources or suggestions that could be useful?


原文:https://stackoverflow.com/questions/49767270
更新时间:2023-09-04 09:09

最满意答案

$ []是v3.6的新功能。

为了使其工作,您需要mongodb v3.6, 并将 FeatureCompatibilityVersion设置为“3.6”。


$更新数组中的单个元素。

对于它的工作,你的查询应该包含数组中元素的过滤器,例如

const query = {'_id': new ObjectID(currentPrediction._id), "predictions.status" : "FT"};

$是第一个匹配的元素,没有过滤器就没有匹配。


$[] is a new feature of v3.6.

For it to work you need mongodb v3.6, and set FeatureCompatibilityVersion to "3.6".


$ updates a single element in the array.

For it to work your query should include a filter for elements in the array, e.g.

const query = {'_id': new ObjectID(currentPrediction._id), "predictions.status" : "FT"};

The $ refers to the first matching element, and without filter there are no matches.

相关问答

更多

相关文章

更多

最新问答

更多
  • 如何使用自由职业者帐户登录我的php网站?(How can I login into my php website using freelancer account? [closed])
  • 如何打破按钮上的生命周期循环(How to break do-while loop on button)
  • C#使用EF访问MVC上的部分类的自定义属性(C# access custom attributes of a partial class on MVC with EF)
  • 如何获得facebook app的publish_stream权限?(How to get publish_stream permissions for facebook app?)
  • 如何并排放置两个元件?(How to position two elements side by side?)
  • 在MySQL和/或多列中使用多个表用于Rails应用程序(Using multiple tables in MySQL and/or multiple columns for a Rails application)
  • 如何隐藏谷歌地图上的登录按钮?(How to hide the Sign in button from Google maps?)
  • Mysql左连接旋转90°表(Mysql Left join rotate 90° table)
  • 带有ImageMagick和许多图像的GIF动画(GIF animation with ImageMagick and many images)
  • 电脑高中毕业学习去哪里培训
  • 电脑系统专业就业状况如何啊?
  • IEnumerable linq表达式(IEnumerable linq expressions)
  • 如何在Spring测试中连接依赖关系(How to wire dependencies in Spring tests)
  • Solr可以在没有Lucene的情况下运行吗?(Can Solr run without Lucene?)
  • 如何保证Task在当前线程上同步运行?(How to guarantee that a Task runs synchronously on the current thread?)
  • 在保持每列的类的同时向数据框添加行(Adding row to data frame while maintaining the class of each column)
  • 的?(The ? marks in emacs/haskell and ghc mode)
  • 一个线程可以调用SuspendThread传递自己的线程ID吗?(Can a thread call SuspendThread passing its own thread ID?)
  • 延迟socket.io响应,并“警告 - websocket连接无效”(Delayed socket.io response, and “warn - websocket connection invalid”)
  • 悬停时的图像转换(Image transition on hover)
  • IIS 7.5仅显示homecontroller(IIS 7.5 only shows homecontroller)
  • 没有JavaScript的复选框“关闭”值(Checkbox 'off' value without JavaScript)
  • java分布式框架有哪些
  • Python:填写表单并点击按钮确认[关闭](Python: fill out a form and confirm with a button click [closed])
  • PHP将文件链接到根文件目录(PHP Linking Files to Root File Directory)
  • 我如何删除ListView中的项目?(How I can remove a item in my ListView?)
  • 您是否必须为TFS(云)中的每个BUG创建一个TASK以跟踪时间?(Do you have to create a TASK for every BUG in TFS (Cloud) to track time?)
  • typoscript TMENU ATagParams小写(typoscript TMENU ATagParams lowercase)
  • 武陟会计培训类的学校哪个好点?
  • 从链接中删除文本修饰(Remove text decoration from links)