Cosine Similarity with Lucene only for documents that match

Lucene is an inverted-index system. As far as I understand, its power lies in the fact that it compares a query only with documents that match at least one of the query's tokens.

Compared to the naive approach, where the query is compared to every document (even those that contain no token from the query), this is a great benefit.

For example, if I have the indexed documents:

D1: "Hello world said the guy"
D2: "Hello, what a beautiful world"
D3: "random text"

As I see it, a search for the query "Hello world" will only look at the indexed documents D1 and D2 and skip D3, which saves time.

Is this correct?
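
To make the idea concrete, here is a minimal sketch of a toy inverted index over the three documents above. It is not Lucene's implementation: the `tokenize`, `build_index` and `candidates` helpers are hypothetical names, and the lowercasing regex is only a stand-in for a real analyzer. It shows that only documents appearing in the postings of a query term are ever considered.

```python
import re
from collections import defaultdict

docs = {
    "D1": "Hello world said the guy",
    "D2": "Hello, what a beautiful world",
    "D3": "random text",
}

def tokenize(text):
    # Lowercase and keep runs of letters; a stand-in for a real analyzer.
    return re.findall(r"[a-z]+", text.lower())

def build_index(docs):
    # Map each term to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def candidates(index, query):
    # Union of the postings of the query terms: only these documents get scored.
    matched = set()
    for term in tokenize(query):
        matched |= index.get(term, set())
    return matched

index = build_index(docs)
print(sorted(candidates(index, "Hello world")))  # ['D1', 'D2']
```

D3 never shows up because neither "hello" nor "world" has it in its postings, so no work is spent on it.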

Now, I'm trying to calculate the cosine similarity between documents. The input query will be a document and the output should be the cosine score, which is a number between 0 and 1.
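
For reference, the standard definition of the score in question, written with assumed notation: $w_{t,q}$ and $w_{t,d}$ are the weights (for example raw term frequency or tf-idf) of term $t$ in the query document $q$ and in an indexed document $d$.

```latex
\cos(q, d)
  = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert}
  = \frac{\sum_{t} w_{t,q}\, w_{t,d}}
         {\sqrt{\sum_{t} w_{t,q}^{2}} \; \sqrt{\sum_{t} w_{t,d}^{2}}}
```

With non-negative weights the value lies in $[0, 1]$. A term contributes to the numerator only if it occurs in both $q$ and $d$, so a document that shares no term with the query scores exactly 0.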

I've already read about some approaches that calculate the cosine similarity, but they all do this by comparing the term vectors of every document. For example, this blog mentions the following:

If you really need cosine similarity between documents, you have to enable term vectors for the source fields, and use them to calculate the angle. The problem is that this does not scale well, you would need to calculate angles with virtually all other documents.

and this SO answer seems to say the same:

  1. iterate over all doc ids, 0 to maxDoc();

Isn't there a way to calculate the cosine similarity only for documents that match the query, and have it returned as the document's score?
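
What I have in mind could be sketched roughly as follows. This is plain Python, not Lucene's API: the helper names are hypothetical, and the weights are assumed to be raw term frequencies with no idf. Each document's vector length is stored at index time, and the dot product is accumulated only while walking the postings of the query's terms, so documents that share no term with the query are never visited.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercase and keep runs of letters; a stand-in for a real analyzer.
    return re.findall(r"[a-z]+", text.lower())

def build_index(docs):
    # postings: term -> {doc_id: term frequency}
    # norms:    doc_id -> Euclidean length of the document's tf vector
    postings = defaultdict(dict)
    norms = {}
    for doc_id, text in docs.items():
        tf = Counter(tokenize(text))
        for term, freq in tf.items():
            postings[term][doc_id] = freq
        norms[doc_id] = math.sqrt(sum(f * f for f in tf.values()))
    return postings, norms

def cosine_scores(postings, norms, query):
    q_tf = Counter(tokenize(query))
    q_norm = math.sqrt(sum(f * f for f in q_tf.values()))
    # Accumulate dot products only for documents found in the postings.
    dots = defaultdict(float)
    for term, q_freq in q_tf.items():
        for doc_id, d_freq in postings.get(term, {}).items():
            dots[doc_id] += q_freq * d_freq
    return {doc_id: dot / (q_norm * norms[doc_id]) for doc_id, dot in dots.items()}

docs = {
    "D1": "Hello world said the guy",
    "D2": "Hello, what a beautiful world",
    "D3": "random text",
}
postings, norms = build_index(docs)
scores = cosine_scores(postings, norms, "Hello world")
print(scores)  # only D1 and D2 appear; D3 is never touched
```

Every document missing from the result has a cosine of exactly 0, so skipping it loses nothing.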

As a side note, I did read that TFIDFSimilarity comes close. I believe the VSM part is exactly what I need; however, this part seems to have disappeared in the Lucene Practical Scoring Function. I'm not sure how I can "transform" this Similarity class to end up with only the pure cosine similarity as the result.

So, a recap of my questions:

  1. Is my understanding of how inverted indexes save time correct?

  2. Is there a way to calculate cosine similarity only for documents that actually match one of the tokens, instead of for all the documents?

  3. Can I use/transform the TFIDFSimilarity class to end up with the pure cosine similarity?

Source: https://stackoverflow.com/questions/32443482
Updated: 2022-04-20 19:04

Best answer

Wrap it in another query:

SELECT a, b, Total, Total / 2 FROM (
  SELECT a, b, (SELECT count(*) FROM X) AS Total
  FROM Y
  WHERE ...) Z 
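
A minimal way to try this pattern is an in-memory SQLite database. The sketch below makes several assumptions: the column types and sample rows for X and Y are invented for illustration, and the elided WHERE clause is left out rather than guessed. It shows that the outer query can reuse the `Total` alias computed in the inner one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schemas and rows; the original only names the tables X and Y.
conn.executescript("""
    CREATE TABLE X (id INTEGER);
    CREATE TABLE Y (a TEXT, b TEXT);
    INSERT INTO X VALUES (1), (2), (3), (4);
    INSERT INTO Y VALUES ('a1', 'b1'), ('a2', 'b2');
""")

# The inner query computes Total once per row; the outer query reuses the alias.
rows = conn.execute("""
    SELECT a, b, Total, Total / 2 FROM (
      SELECT a, b, (SELECT count(*) FROM X) AS Total
      FROM Y
    ) Z
    ORDER BY a
""").fetchall()
print(rows)  # [('a1', 'b1', 4, 2), ('a2', 'b2', 4, 2)]
```

Note that `Total / 2` is integer division here because both operands are integers.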
