Document similarity in Spacy vs Word2Vec

I have a niche corpus of ~12k docs, and I want to find near-duplicate documents with similar meanings across it - think articles about the same event covered by different news organisations.
I have tried gensim's Word2Vec, which gives me terrible similarity scores (<0.3) even when the test document is within the corpus, and I have tried SpaCy, which gives me >5k documents with similarity > 0.9. I checked the documents SpaCy ranked as most similar, and the results were mostly useless.
This is the relevant code.
from gensim import models, similarities

# corpus, dictionary, preprocess, query and file_list are defined elsewhere
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=40)
doc = preprocess(query)
vec_bow = dictionary.doc2bow(doc)
vec_lsi_tfidf = lsi[tfidf[vec_bow]]  # convert the query to LSI space
index = similarities.Similarity(corpus=corpus, num_features=len(dictionary), output_prefix="pqr")
sims = index[vec_lsi_tfidf]  # perform a similarity query against the corpus
most_similar = sorted(enumerate(sims), key=lambda x: x[1])  # ascending by score

# print the 100 highest-scoring documents
for mid in most_similar[-100:]:
    print(mid, file_list[mid[0]])
 
Using gensim I have found a decent approach, with some preprocessing, but the similarity scores are still quite low. Has anyone faced such a problem, and are there any resources or suggestions that could be useful?

Source: https://stackoverflow.com/questions/49767270

Updated: 2023-09-04 09:09

Best answer

$[] is a new feature of MongoDB v3.6.
For it to work you need MongoDB v3.6 with featureCompatibilityVersion set to "3.6".

$ updates a single element in the array.
For it to work, your query must include a filter on the elements of the array, e.g.
const query = {'_id': new ObjectID(currentPrediction._id), "predictions.status" : "FT"};

$ refers to the first matching element, and without such a filter there is nothing for it to match.
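
As a rough Python illustration of the query shape described above, the sketch below only builds the filter and update documents and does not connect to a database. The helper name build_positional_update and the field name score are hypothetical; the predictions.status filter comes from the answer. The resulting pair has the shape expected by a PyMongo-style update_one(filter, update) call.

```python
def build_positional_update(doc_id, status, new_values):
    """Build a filter and a $set update that targets the first matching array element."""
    # The filter must mention the array field, otherwise "$" has nothing to refer to.
    query = {"_id": doc_id, "predictions.status": status}

    # "predictions.$.<field>" addresses the first element matched by the filter.
    update = {"$set": {f"predictions.$.{field}": value for field, value in new_values.items()}}
    return query, update


# Hypothetical identifier and field values, for illustration only.
query, update = build_positional_update("some-id", "FT", {"score": "2-1"})
print(query)
print(update)
# With a real collection object this pair would be passed as:
#   collection.update_one(query, update)
```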
