How to correctly interpret Solr similarity score?

python

I am aware that the similarity scores returned by Solr are relevant only for a specific query and that they have only relative meaning. Having said that, is there a way to determine the 'goodness' of a score in a global fashion? 
For example: Suppose I run an MLT query and get 5 documents. Each document has a score but the fact is that the document with the highest score is not necessarily the most relevant. I want to be able to specify a threshold score below which I do not even consider the documents. 
How can this threshold be determined? Is it only by empirical measurement, or can I say that, usually, a similarity score larger than 3 indicates good resemblance in content, while a similarity score smaller than 1 usually means the document is completely irrelevant? Or alternatively, can I say that results scoring less than 80% of a document's similarity to itself are irrelevant?
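
To make the last idea in the question concrete, here is a minimal sketch of a relative cutoff: keep only results whose score is at least a given fraction of a reference score, such as the score of the document matched against itself. It only illustrates the mechanics and makes no claim about which ratio is appropriate. The function name, the result dictionaries, and all score values are hypothetical, and no Solr API is called.

```python
def filter_by_relative_score(docs, reference_score, min_ratio=0.8):
    """Keep docs whose score is at least min_ratio * reference_score.

    docs is a list of dicts with a numeric "score" key (a hypothetical
    shape, standing in for already-parsed search results).
    """
    if reference_score <= 0:
        # No meaningful reference to compare against.
        return []
    cutoff = min_ratio * reference_score
    return [doc for doc in docs if doc["score"] >= cutoff]


# Hypothetical MLT results and a hypothetical self-similarity score.
results = [
    {"id": "a", "score": 4.2},
    {"id": "b", "score": 3.5},
    {"id": "c", "score": 0.9},
]
self_score = 4.2

kept = filter_by_relative_score(results, self_score, min_ratio=0.8)
print([doc["id"] for doc in kept])
```

Because the cutoff is expressed as a ratio rather than an absolute number such as 1 or 3, it stays tied to the query at hand, which matches the question's premise that scores only have relative meaning.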

Source: https://stackoverflow.com/questions/21382258

Updated: 2022-02-14 14:02

Best answer

You can set upload_to to a function like this:
from os.path import splitext

# generate_sha1 and get_datetime_now are helper functions that this answer
# uses without showing their definitions or imports.
def upload_to_id_image(instance, filename):
    # Keep the original file extension, lowercased.
    extension = splitext(filename)[1].lower()
    salt, hashed = generate_sha1(instance.id)
    path = 'profiles/%(id)s_%(date_now)s_' % {
        'id': instance.user.id,
        'date_now': get_datetime_now().date().strftime("%Y%m%d")}
    # Result: profiles/<user id>_<YYYYMMDD>_<first 16 hash characters><extension>
    return '%(path)s%(hash)s%(extension)s' % {'path': path,
                                              'hash': hashed[:16],
                                              'extension': extension}

Then change your field definition to use it:
photo = ProcessedImageField(upload_to=upload_to_id_image,

Of course, you can drop the hash. For file security, though, a hashed file name is better.
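
Since the answer does not show where generate_sha1 and get_datetime_now come from, the following is a self-contained sketch of the same naming scheme using only the standard library. It is an assumption-laden stand-in, not the answer's actual helpers: hashlib.sha1 replaces generate_sha1 (without a salt), date.today() replaces get_datetime_now(), the optional today parameter exists only to make the example reproducible, and SimpleNamespace stands in for a model instance with id and user.id attributes.

```python
import hashlib
from datetime import date
from os.path import splitext
from types import SimpleNamespace


def upload_to_id_image(instance, filename, today=None):
    """Build 'profiles/<user id>_<YYYYMMDD>_<16 hex chars><ext>'."""
    # Keep the original extension, lowercased.
    extension = splitext(filename)[1].lower()
    today = today or date.today()
    # Unsalted SHA-1 of the instance id; a stand-in for generate_sha1.
    digest = hashlib.sha1(str(instance.id).encode("utf-8")).hexdigest()
    return "profiles/{}_{}_{}{}".format(
        instance.user.id, today.strftime("%Y%m%d"), digest[:16], extension
    )


# Hypothetical stand-in for a model instance.
fake_instance = SimpleNamespace(id=7, user=SimpleNamespace(id=42))
path = upload_to_id_image(fake_instance, "Photo.JPG", today=date(2022, 2, 14))
print(path)
```

Django calls an upload_to callable with the instance and the original filename, so the extra today argument keeps its default in real use. Note that an unsaved instance may not have an id yet, in which case this sketch would hash the string "None"; hashing something else, such as the file contents or a random value, may be a better fit.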
