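I have a niche corpus of ~12k docs, and I want to test for near-duplicate documents with similar meanings across it: think of articles about the same event covered by different news organisations.

I have tried gensim's Word2Vec, which gives me terrible similarity scores (<0.3) even when the test document is in the corpus, and I have tried SpaCy, which gives me >5k documents with similarity > 0.9. I checked SpaCy's most similar documents, and the results were mostly useless.

This is the relevant code.
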
from gensim import models, similarities

# `corpus` (bag-of-words vectors), `dictionary`, `preprocess`, `query` and
# `file_list` are defined elsewhere.
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=40)

doc = preprocess(query)
vec_bow = dictionary.doc2bow(doc)
vec_lsi_tfidf = lsi[tfidf[vec_bow]]  # convert the query to LSI space
index = similarities.Similarity(corpus=corpus, num_features=len(dictionary), output_prefix="pqr")
sims = index[vec_lsi_tfidf]  # perform a similarity query against the corpus
most_similar = sorted(enumerate(sims), key=lambda x: x[1])  # ascending by score

# Print the 100 highest-scoring documents.
for mid in most_similar[-100:]:
    print(mid, file_list[mid[0]])

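Using gensim I have found a decent approach, with some preprocessing, but the similarity scores are still quite low. Has anyone faced such a problem, and are there any resources or suggestions that could be useful?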
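
One thing that may be worth checking, offered as a guess rather than a confirmed diagnosis: in the code above the query is converted to LSI space, but the index is built from the raw bag-of-words corpus with num_features = len(dictionary), and the LSI model is trained on corpus rather than corpus_tfidf. If the indexed documents and the query do not pass through the same transformations, the cosine scores are hard to interpret. A possible adjustment in gensim would be to train LSI on corpus_tfidf and index lsi[corpus_tfidf] with num_features equal to the number of topics. The dependency-free sketch below (plain Python TF-IDF, not gensim; the toy documents and helper names are made up for illustration) shows the sanity check this enables: a document taken from the corpus should score about 1.0 against itself.

```python
import math
from collections import Counter


def weigh(tokens, idf):
    """Map each known token to its TF-IDF weight (raw count * idf)."""
    counts = Counter(tokens)
    return {tok: cnt * idf[tok] for tok, cnt in counts.items() if tok in idf}


def tfidf_vectors(token_lists):
    """Return (vectors, idf) for a list of tokenised documents."""
    n_docs = len(token_lists)
    doc_freq = Counter(tok for toks in token_lists for tok in set(toks))
    # Smoothed idf so that no weight is zero or negative.
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
           for tok, df in doc_freq.items()}
    return [weigh(toks, idf) for toks in token_lists], idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


# Hypothetical toy corpus: two reports of the same event, one unrelated story.
docs = [
    "storm floods city centre after heavy rain".split(),
    "heavy rain floods the city centre during storm".split(),
    "central bank raises interest rates again".split(),
]

doc_vectors, idf = tfidf_vectors(docs)

# The query goes through exactly the same weighting as the indexed documents.
query_vector = weigh(docs[0], idf)
sims = [cosine(query_vector, vec) for vec in doc_vectors]
ranked = sorted(enumerate(sims), key=lambda pair: pair[1], reverse=True)

for doc_id, score in ranked:
    print(doc_id, round(score, 3))
```

If a document from the corpus does not come back as its own best match with a score close to 1.0, the mismatch is more likely in the pipeline than in the data.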


$[] (the all-positional operator) is a new feature of MongoDB v3.6.

For it to work you need MongoDB v3.6, with featureCompatibilityVersion set to "3.6".

The positional $ operator, by contrast, updates a single element in the array.

For it to work, your query must include a filter on the elements of the array, e.g.

const query = { _id: new ObjectID(currentPrediction._id), "predictions.status": "FT" };

The $ refers to the first matching element, and without a filter on the array there are no matches.
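
As a rough illustration of the difference, here is a sketch in Python that only builds the filter and update documents; it does not talk to a database. The field being set (predictions.status), the new value, and the helper names are assumptions, since the original update document is not shown. With pymongo, each pair would be passed to collection.update_one(query, update).

```python
def positional_update(prediction_id, new_status):
    """Build (query, update) for the positional $ operator.

    The query must filter on the array field; $ then refers to the
    first element that matched that filter.
    """
    query = {"_id": prediction_id, "predictions.status": "FT"}
    update = {"$set": {"predictions.$.status": new_status}}  # hypothetical field/value
    return query, update


def all_positional_update(prediction_id, new_status):
    """Build (query, update) for the all-positional $[] operator (v3.6+).

    No filter on the array is needed; $[] addresses every element.
    """
    query = {"_id": prediction_id}
    update = {"$set": {"predictions.$[].status": new_status}}  # hypothetical field/value
    return query, update


# "some-id" stands in for the document's real ObjectId.
query, update = positional_update("some-id", "DONE")
print(query)
print(update)

query_all, update_all = all_positional_update("some-id", "DONE")
print(query_all)
print(update_all)
```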






