Cosine Similarity with Lucene only for documents that match

Lucene is an inverted index system. As far as I understand, its power lies in the fact that it compares a query only with documents that match at least one of the query's tokens.
Compared to the naive approach, where the query is compared to every document (even those that contain none of the query's tokens), this is a great benefit.
For example, suppose I have indexed these documents:
D1: "Hello world said the guy"
D2: "Hello, what a beautiful world"
D3: "random text"
 
As I see it, a search for the query "Hello world" will only look at the indexed documents D1 and D2 and skip D3, which saves time.
Is this correct? 
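
To make the question concrete, here is a minimal plain-Python sketch of the idea described above. It is not Lucene code: the `tokenize` helper is a hypothetical stand-in for an analyzer, and the index is just a dictionary from token to document ids. It assumes a query matches a document when they share at least one token.

```python
import re
from collections import defaultdict

docs = {
    "D1": "Hello world said the guy",
    "D2": "Hello, what a beautiful world",
    "D3": "random text",
}

def tokenize(text):
    # Hypothetical analyzer: lowercase, then split on non-alphanumeric characters.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(documents):
    # Map each token to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def candidates(index, query):
    # Union of the posting sets of the query tokens: only these documents
    # are ever looked at; documents sharing no token are never visited.
    matched = set()
    for token in tokenize(query):
        matched |= index.get(token, set())
    return matched

index = build_inverted_index(docs)
matching = candidates(index, "Hello world")
print(sorted(matching))
```

In this sketch the lookup touches only the posting sets for "hello" and "world", so D3 is never examined.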
Now, I'm trying to calculate the cosine similarity between documents. The input query will be a document, and the output should be the cosine score, a number between 0 and 1.
I've already read about some approaches that calculate the cosine similarity, but they all do this by comparing the term vectors of every document. For example, this blog post says:
 
 If you really need cosine similarity between documents, you have to enable term vectors for the source fields, and use them to calculate the angle. The problem is that this does not scale well, you would need to calculate angles with virtually all other documents. 
 
and this SO answer seems to say the same:
 
  
  iterate over all doc ids, 0 to maxDoc(); 
  
 
Isn't there a way to calculate the cosine similarity only for documents that match the query, and have it returned as the document's score?
As a side note, I did read that TFIDFSimilarity comes close. I believe the VSM part is exactly what I need; however, this part seems to have disappeared in the Lucene Practical Scoring Function. I'm not sure how I can "transform" this Similarity class so that the result is the pure cosine similarity.
So, to recap my questions:
 
 Is my understanding of how inverted indexes save time correct?
 Is there a way to calculate cosine similarity only for documents that actually match at least one of the tokens, instead of for all documents?
 Can I use/transform the TFIDFSimilarity class to end up with the pure cosine similarity?
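
For the second question, one relevant observation is that a document sharing no term with the query has a dot product of zero with it, so its cosine is zero and skipping it loses nothing. The sketch below illustrates this in plain Python, not with the Lucene API. It assumes TF-IDF weights with a simple `log(N/df) + 1` idf, and that document norms are precomputed at index time; `tokenize`, `build_index` and `cosine_scores` are hypothetical names introduced for illustration only.

```python
import math
import re
from collections import Counter, defaultdict

docs = {
    "D1": "Hello world said the guy",
    "D2": "Hello, what a beautiful world",
    "D3": "random text",
}

def tokenize(text):
    # Hypothetical analyzer: lowercase, then split on non-alphanumeric characters.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    # postings: term -> {doc_id: term frequency}
    postings = defaultdict(dict)
    for doc_id, text in documents.items():
        for term, tf in Counter(tokenize(text)).items():
            postings[term][doc_id] = tf
    n = len(documents)
    # Assumed idf formula; any positive weighting would work for the sketch.
    idf = {t: math.log(n / len(p)) + 1.0 for t, p in postings.items()}
    # Precompute the Euclidean norm of every document's TF-IDF vector.
    squared = defaultdict(float)
    for term, plist in postings.items():
        for doc_id, tf in plist.items():
            squared[doc_id] += (tf * idf[term]) ** 2
    norms = {d: math.sqrt(v) for d, v in squared.items()}
    return postings, idf, norms

def cosine_scores(query, postings, idf, norms):
    # Query terms unknown to the index are ignored (assumption): they cannot
    # contribute to any dot product, and they have no idf in this sketch.
    q_weights = {t: tf * idf[t]
                 for t, tf in Counter(tokenize(query)).items() if t in idf}
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))
    # Accumulate dot products by walking only the query terms' postings,
    # so documents that share no term with the query are never touched.
    dots = defaultdict(float)
    for term, q_w in q_weights.items():
        for doc_id, tf in postings[term].items():
            dots[doc_id] += q_w * tf * idf[term]
    return {d: dot / (q_norm * norms[d]) for d, dot in dots.items()}

postings, idf, norms = build_index(docs)
scores = cosine_scores("Hello world", postings, idf, norms)
for doc_id in sorted(scores):
    print(doc_id, round(scores[doc_id], 3))
```

Only D1 and D2 receive a score here; D3 is absent from the result, which is equivalent to a cosine of 0. Whether Lucene's own scoring can be made to produce exactly this value is the open part of the question and is not settled by this sketch.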

Source: https://stackoverflow.com/questions/32443482

Updated: 2022-04-20 19:04

Best answer

Wrap it in another query: 
SELECT a, b, Total, Total / 2 FROM (
  SELECT a, b, (SELECT count(*) FROM X) AS Total
  FROM Y
  WHERE ...) Z
