Solr: how to highlight the whole search phrase only?
A) I need to perform a phrase search. The search results contain the exact phrase matches, but in the highlighted parts the phrase is tokenized. This is what I get when I search for the phrase "Day 1":
<arr name="post"> <str><em>Day</em> <em>1</em> We have begun a new adventure! An early morning (4:30 a.m.) has found me meeting with</str> </arr>
This is what I want to receive as a result:
<arr name="post"> <str><em>Day 1</em> We have begun a new adventure! An early morning (4:30 a.m.) has found me meeting with</str> </arr>
This is the query I am running. Admin console:
q = day 1
fq = post:"day 1" OR title:"day 1"
hl = true
hl.fl = title,post
select?q=day+1&fq=post%3A%22day+1%22+OR+title%3A%22day+1%22&wt=xml&indent=true&hl=true&hl.fl=title%2Cpost&hl.simple.pre=%3Cem%3E&hl.simple.post=%3C%2Fem%3E
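One detail of the request above may explain part of the behaviour: Solr's highlighter takes its terms from q (or hl.q when it is set), not from fq, and here q contains the two separate terms day and 1 rather than a phrase. The following is a minimal sketch, not a confirmed fix, of building the same request with the phrase moved into q. The helper name build_highlight_query is hypothetical; the parameters q, hl, hl.fl, hl.usePhraseHighlighter, hl.simple.pre and hl.simple.post are standard Solr request parameters.

```python
from urllib.parse import urlencode


def build_highlight_query(phrase, fields=("title", "post")):
    """Build a /select query string that searches `phrase` as a quoted phrase.

    Hypothetical helper: the phrase goes into q itself (not only fq), so the
    highlighter sees a phrase query instead of two independent terms.
    """
    quoted = '"{}"'.format(phrase)
    params = {
        # e.g. title:"day 1" OR post:"day 1"
        "q": " OR ".join("{}:{}".format(field, quoted) for field in fields),
        "wt": "xml",
        "indent": "true",
        "hl": "true",
        "hl.fl": ",".join(fields),
        # Highlight terms only where they occur as part of the phrase.
        "hl.usePhraseHighlighter": "true",
        "hl.simple.pre": "<em>",
        "hl.simple.post": "</em>",
    }
    return "select?" + urlencode(params)


print(build_highlight_query("day 1"))
```

With a phrase query in q, a lone "Day" in "Day 2" should no longer be highlighted. Whether the two terms of a matched phrase end up inside a single tag still depends on the highlighter implementation in use.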
These are my fields:
<field name="post" type="text_general" indexed="true" stored="true" required="true" multiValued="false" />
<field name="post" type="text_general" indexed="true" stored="true" required="true" multiValued="false" />
This is the Solr schema section for my field type text_general:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.GreekStemFilterFactory"/>
    <filter class="solr.GreekLowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
B) The highlight section also contains more disturbing results: instead of the whole phrase, as expected, single terms are highlighted:
.where you get to see all of Athens ... <em>Day</em> 2 - Carmens
I don't want to see this result in the highlighted section (I only need to see both words, "Day 1", highlighted together). Any ideas? I am reading the Solr highlighting documentation, but there is not even one example.
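If the highlighter keeps wrapping each term of a matched phrase in its own tag, one client-side workaround is to merge adjacent tags after the response arrives. The sketch below is an assumption about post-processing and not a Solr feature; the function name merge_adjacent_highlights is hypothetical, and its defaults mirror the hl.simple.pre and hl.simple.post values from the request.

```python
import re


def merge_adjacent_highlights(snippet, pre="<em>", post="</em>"):
    """Collapse '<em>a</em> <em>b</em>' into '<em>a b</em>'.

    Only highlight tags separated by nothing but whitespace are merged;
    the whitespace between them is kept inside the merged tag.
    """
    pattern = re.escape(post) + r"(\s+)" + re.escape(pre)
    return re.sub(pattern, r"\1", snippet)


snippet = "<em>Day</em> <em>1</em> We have begun a new adventure!"
print(merge_adjacent_highlights(snippet))
# <em>Day 1</em> We have begun a new adventure!
```

This merges any two highlighted terms that happen to be adjacent, whether or not they came from the same phrase, and it does nothing about a stray single-term highlight such as the "Day" in "Day 2".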
Source: https://stackoverflow.com/questions/25930180
Best answer
    import csv

    # src: path to the semicolon-delimited CSV file
    with open(src, newline='') as file:
        r = csv.reader(file, delimiter=';')
        for line in r:
            # first field is exactly two letters and field 17 (index 16) is '15%'
            if len(line[0]) == 2 and line[0].isalpha() and line[16] == '15%':
                print(line)  # Or whatever it is you want to do

No regex is really necessary, but

    r'[a-zA-Z]{2}'

could also work.
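To make the regex variant concrete, here is a small self-contained sketch that runs the same filter on in-memory sample data. The sample rows, the function name matching_rows and its parameters are made up for illustration. The length guard is an addition so that short rows are skipped instead of raising IndexError.

```python
import csv
import io
import re

TWO_LETTERS = re.compile(r"[a-zA-Z]{2}")


def matching_rows(lines, code_col=0, rate_col=16, rate="15%"):
    """Return rows whose code column is exactly two ASCII letters
    and whose rate column equals `rate`."""
    reader = csv.reader(lines, delimiter=";")
    return [
        row
        for row in reader
        if len(row) > rate_col                  # skip rows that are too short
        and TWO_LETTERS.fullmatch(row[code_col])
        and row[rate_col] == rate
    ]


# Build hypothetical sample data: 17 columns per full row.
filler = ["x"] * 15
rows = [
    ["AB"] + filler + ["15%"],   # kept
    ["ABC"] + filler + ["15%"],  # three letters
    ["C1"] + filler + ["15%"],   # contains a digit
    ["DE"] + filler + ["20%"],   # wrong rate
    ["FG"] + filler + ["15%"],   # kept
    ["short"],                   # too few columns
]
buffer = io.StringIO()
csv.writer(buffer, delimiter=";").writerows(rows)
buffer.seek(0)

result = matching_rows(buffer)
print([row[0] for row in result])  # ['AB', 'FG']
```

One difference between the two approaches: str.isalpha() also accepts non-ASCII letters, while [a-zA-Z] matches ASCII letters only, and fullmatch is needed so that longer strings are not accepted.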
Related Q&A
-
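Start with this: pd.read_csv(tweets_data_path, sep="::", header = None, usecols = [0,1,2,3]) The above should bring in 4 columns; then you can find out how many rows are being dropped and whether the data makes sense. Use this pattern: data["lang"].unique() The reason is that you have a problem with the data, not with where it is. You need to step back and use Python's csv reader. This should get you started. import csv reader = csv.reader(tweets_data_path) t ...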
-
A better way to handle this is to use Python's csv module. From the look of your CSV, I am guessing it is tab-delimited, so I am working on that assumption. import csv match = "JB74XYZ" matched_row = None with open("RegDD.txt", "r") as file: # Read file as a CSV delimited by tabs. reader = csv.reader(file, delimiter='\t') for row in ...
-
with open(src, newline='') as file: r = csv.reader(file, delimiter=';') for line in r: if len(line[0]) ==2 and line[0].isalpha() and line[16]=='15%': print(line) #Or whatever it is you want to do No regex is really necessary, but r'[a-zA-Z]{2}' could also ...
-
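You can use r'\./([^\W_]+)_word1_([0-9.]+)_([0-9.]+)_([0-9]+(?:\.[0-9]+)*)' See the regex demo. Details: \. - a literal dot (if it is not escaped, it matches any character except a newline) / - a / symbol (no need to escape it in a Python regex pattern) ([^\W_]+) - Group 1 matching 1 or more letters or digits (if you want to match chunks that contain _, keep the original (\w+) pattern) _word1_ - a literal substring ([0-9.]+) - Group 2 matching 1 or more digits and/or . symbols _ ...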
-
A CSV file is a series of rows, each of which has multiple fields. Your x variable refers to each row in turn; but a row is a list, and you can't use a regex on a list. I'm not sure what you are trying to do; if each row has only one field, you shouldn't use the csv module at all, just iterate over the lines in the file.
-
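Use the re.UNICODE flag: >>> import re >>> P = re.compile(r'[\s\t]+', flags=re.UNICODE) >>> re.sub(P, u' ', u'\xa0 haha') u' haha' Without the flag, only ASCII whitespace is matched; \xa0 is not part of the ASCII standard (it is a Latin-1 code point). The re.UNICODE flag is the default in Python 3; if you want the Python 2 (bytestring) behavior, use re.ASCII. Note that including \t in the character class is pointless ...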
-
import re DataL = [ '''Grand Total for ATHLET:,,,"1,312 ",,62:58:18,130.62 ,,''', '''Grand Total for SELF:,,,"6,589 ",,397:57:58,708.53 ,,''' ] Pattern = re.compile(r''',(?=[^"']*(?:(?:[^'"]*["'][^"']*){2})*$)''') for (i, d) in enum ...
-
I'm afraid the answer for the three packages you asked about is no. However, you can simply replace('\t', ',') (or the reverse). For example: from StringIO import StringIO # py3k: from io import StringIO import csv with open('./file') as fh: io = StringIO(fh.read().replace('\t', ',')) reader = csv.reader(io) for row in reader: p ...
-
pandas read csv with regex [2022-11-22]
I would collect all of those CSVs into a dictionary of DataFrames with the following structure: df['20140803'] - a DF containing the concatenated data of all the df_trip_20140803_*.csv CSV files. Solution: import os import re import glob import pandas as pd fpattern = r'D:\temp\.data\41444939\df_trip_{}_{}.csv' files = glob.glob(fpattern.format('*','*')) dates ...