Cosine Similarity with Lucene only for documents that match
Lucene is an inverted index system. As far as I understand, its power lies in the fact that it compares a query only with documents that match at least one token.
Compared to the naive approach, where the query is compared to every document (even those that contain none of the query's tokens), this is a great benefit.
For example, if I have the indexed documents:
D1: "Hello world said the guy"
D2: "Hello, what a beautiful world"
D3: "random text"
As I see it, a search for the query "Hello world" will only look at the indexed documents D1 and D2 and skip D3, which saves time.
Is this correct?
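To make the idea concrete, here is a toy sketch of candidate selection through an inverted index, using the three documents above. It is not Lucene's implementation: `tokenize`, `build_inverted_index` and `candidates` are hypothetical helper names, and the lowercase/split tokenizer is only a stand-in for a real analyzer.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on non-letters (a simplistic stand-in for an analyzer)."""
    return re.findall(r"[a-z]+", text.lower())

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def candidates(index, query):
    """Union of the postings of the query terms: only these documents get scored."""
    result = set()
    for term in tokenize(query):
        result |= index.get(term, set())
    return result

docs = {
    "D1": "Hello world said the guy",
    "D2": "Hello, what a beautiful world",
    "D3": "random text",
}
index = build_inverted_index(docs)
matched = candidates(index, "Hello world")
print(sorted(matched))  # ['D1', 'D2']
```

D3 never appears in the postings of "hello" or "world", so in this sketch it is never visited at all.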
Now, I'm trying to calculate the cosine similarity between documents. The input query will be a document and the output should be the cosine score, which is a number between 0 and 1.
I've already read about some approaches that calculate the cosine similarity, but they all do this by comparing the term vectors of every document. For example, this blog mentions the following:
If you really need cosine similarity between documents, you have to enable term vectors for the source fields, and use them to calculate the angle. The problem is that this does not scale well, you would need to calculate angles with virtually all other documents.
and this SO answer seems to say the same:
- iterate over all doc ids, 0 to maxDoc();
Isn't there a way to only calculate the cosine similarity for documents that match the query and let this return as score for the document?
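As an illustration of what is being asked for, the sketch below accumulates dot products term by term over the postings lists, so only documents sharing at least one term with the query are ever touched. A document with no shared term has a cosine of 0, so skipping it loses nothing. The assumptions are loud ones: raw term frequencies are used as vector weights (not Lucene's TFIDFSimilarity formula), document norms are precomputed at index time, and `build_index` and `cosine_scores` are hypothetical names, not Lucene APIs.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on non-letters (a simplistic stand-in for an analyzer)."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(docs):
    """Return postings (term -> {doc_id: tf}) and the Euclidean norm of each document."""
    postings = defaultdict(dict)
    norms = {}
    for doc_id, text in docs.items():
        tf = Counter(tokenize(text))
        for term, count in tf.items():
            postings[term][doc_id] = count
        norms[doc_id] = math.sqrt(sum(c * c for c in tf.values()))
    return postings, norms

def cosine_scores(postings, norms, query):
    """Cosine similarity for documents that share at least one term with the query."""
    q_tf = Counter(tokenize(query))
    q_norm = math.sqrt(sum(c * c for c in q_tf.values()))
    dot = defaultdict(float)
    # Walk only the postings of the query terms; other documents are never visited.
    for term, q_count in q_tf.items():
        for doc_id, d_count in postings.get(term, {}).items():
            dot[doc_id] += q_count * d_count
    return {doc_id: d / (q_norm * norms[doc_id]) for doc_id, d in dot.items()}

docs = {
    "D1": "Hello world said the guy",
    "D2": "Hello, what a beautiful world",
    "D3": "random text",
}
postings, norms = build_index(docs)
scores = cosine_scores(postings, norms, "Hello world")
for doc_id in sorted(scores):
    print(doc_id, round(scores[doc_id], 4))
```

With non-negative weights the scores fall between 0 and 1, and D3 receives no score because none of its terms occur in the query.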
As a side note, I did read that TFIDFSimilarity comes close. I believe the VSM part is exactly what I need; however, this part seems to have disappeared in the Lucene Practical Scoring Function. I'm not sure how I can "transform" this Similarity class to end up with only the pure cosine similarity as the result.
So, a recap of my questions:
- Is my understanding of how the inverted index saves time correct?
- Is there a way to calculate cosine similarity only for documents that actually match one of the tokens, instead of for all documents?
- Can I use/transform the TFIDFSimilarity class to end up with the pure cosine similarity?
Source: https://stackoverflow.com/questions/32443482
Best answer
Wrap it in another query:
SELECT a, b, Total, Total / 2 FROM ( SELECT a, b, (SELECT count(*) FROM X) AS Total FROM Y WHERE ...) Z