Full-Text Search in ASP.NET using Lucene.NET

2019-03-27 01:00 | Source: Web

This post is about the full-text search engine Lucene.NET and how I integrated it into BugTracker.NET. If you are thinking of adding full-text search to your application, you might find this post useful. I'm not saying this is THE way of using Lucene.NET, but it is an example of ONE way.
 
Lucene.NET is a C# port of the original Lucene, an Apache Software Foundation open source project written in Java.

Why did I use Lucene.NET instead of the SQL Server full-text search engine? Well, I'd like to say that I did some research into the pros and cons of the two choices, but actually I didn't do any comparative research. What happened was that during a Stackoverflow podcast I heard Joel Spolsky mention that FogBugz uses Lucene as its engine and that he was happy with it. I trust him, and I was curious, so one weekend I downloaded Lucene.NET and played with it a bit, and before the weekend was over I was already done integrating it into BugTracker.NET. I never looked at the SQL Server alternative at all, so I can't tell you anything about it.

Lucene itself is a class library, not an executable. You call Lucene functions to do the search. There is an open source standalone server built on Lucene called Solr. You send Solr messages to do the search. One way of using Lucene would have been to have my users run Solr side-by-side with SQL Server. As with SQL Server full-text search, I can't tell you anything about Solr because I didn't try it. It wouldn't have made sense to use Solr for BugTracker.NET, I think, because Solr would have been an additional installation hassle. And running a server wouldn't have been doable at all at a cheap shared host like GoDaddy, where my own BugTracker.NET demo lives. So, instead of using Solr, I used the Lucene class libraries directly.

To integrate Lucene, I had to build the following, which I list here and then describe in more detail below.

1) How Lucene would build its searchable index.   Lucene doesn't search my SQL Server database directly.   Instead, it searches its own "database", its own index.
2) The design of the Lucene index
3) How I would update Lucene's index whenever data in my database changes.
4) Sending the search query to Lucene.
5) Displaying the results.


Now the details. I've simplified my code for this post so that you can more easily see the overall design and understand the concepts and my design choices.



1) Building the index.

When an ASP.NET application receives its first HTTP request after having been shut down, the Application_OnStart event fires, which I handle in Global.asax. I call my "build_lucene_index" method. Notice that I have a configuration setting "EnableLucene". I was nervous about my understanding of Lucene and whether my way of using it was the right architecture, so I wanted to make sure I gave my users a way of turning Lucene off in case it was causing trouble. More on that in a bit.

For a really big database, you wouldn't necessarily want to build the search index from scratch over and over, but I'm counting on BugTracker.NET databases being on the small side. Is that a safe assumption? A bug database shouldn't be that big, or else you're doing it wrong, right?


public void Application_OnStart(Object sender, EventArgs e)
{
    if (btnet.Util.get_setting("EnableLucene", "1") == "1")
    {
        build_lucene_index(this.Application);
    }
}
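For reference, a setting like "EnableLucene" read by btnet.Util.get_setting would typically live in the appSettings section of Web.config. This is a hypothetical sketch of the entry; where BugTracker.NET actually keeps its settings may differ:

```xml
<!-- Hypothetical Web.config fragment: the key name matches the code above,
     but the actual configuration layout is BugTracker.NET-specific. -->
<configuration>
  <appSettings>
    <add key="EnableLucene" value="1" />
  </appSettings>
</configuration>
```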



The build_lucene_index method starts a new worker thread, where the real work is done.


public static void build_lucene_index(System.Web.HttpApplicationState app)
{
    System.Threading.Thread thread = new System.Threading.Thread(threadproc_build);
    thread.Start(app);
}


The worker thread first grabs a lock so that it can build the index without being disturbed by other threads. The other threads would be the result of users searching, or of users updating text and thereby triggering a modification to Lucene's index. I don't want those threads to be dealing with a partially built index, so I make them wait for the one-and-only lock.

My way of handling multithreading was one of the things I was nervous about. I feared some sort of hard-to-reproduce deadlock or race condition, but so far there have been no reports from BugTracker.NET users of any trouble, so my design appears to be solid.

To create the index, I create a Lucene "IndexWriter".   I run a SQL query against my database to fetch the text I want to be able to search and the database keys that go with that text.   Then I loop through the query results adding a Lucene "Document" for each row.   Actually, in my real code, I get the searchable text from several different fields in my database, but in the snippet below I have simplified my harvesting of text from my database.  



static Lucene.Net.Analysis.Standard.StandardAnalyzer analyzer = new Lucene.Net.Analysis.Standard.StandardAnalyzer();

// The one-and-only lock, shared by the code that builds, updates, and searches the index.
static object my_lock = new object();

static void threadproc_build(object obj)
{
    lock (my_lock)
    {
        try
        {

            Lucene.Net.Index.IndexWriter writer = new Lucene.Net.Index.IndexWriter("c:\\folder_where_lucene_index_lives", analyzer, true);
            
            DataSet ds = btnet.DbUtil.get_dataset("select bug_id, bug_text from bugs");
            
            foreach (DataRow dr in ds.Tables[0].Rows)
            {
                writer.AddDocument(create_doc(
                    (int)dr["bug_id"],
                    (string)dr["bug_text"]));
            }
            
            writer.Optimize();
            writer.Close();
        }
        catch (Exception e)
        {
            btnet.Util.write_to_log("exception building Lucene index: " + e.Message);
        }
    }
}
 



2) The design of the Lucene Index

Here's where I create a Lucene "Document". An index contains a list of documents. A doc has fields that you define. My doc shown here has three fields. The first field, "text", is what Lucene will analyze and index: the searchable text. The second field is the key I will use to link the Lucene data to the rows in my database. Notice I tell Lucene that this key should be UN_TOKENIZED, stored as is. That's all you need for a minimal Lucene doc: a key for you, and some text for Lucene to search on. The third field in my example is the text again, but this time UN_TOKENIZED, stored as is. I will use that stored text to have Lucene highlight, on my results page, the snippets where the hits are. More on highlighting later.

One of the decisions you'll have to make when using Lucene is what text to index and how to package it for Lucene. In my database, the text doesn't just live in one field. A bug has a short text description, a list of comments, a list of incoming and outgoing emails, and even Digg-style tags. In my real code, as opposed to the snippets here, I fetch text from all these places. My real Lucene doc has four fields, the fourth being another database key that I can use to link to the specific comment or email where the search hit is. BugTracker.NET supports custom text fields, and in the future I hope to harvest that text from the database and add it to the Lucene doc.

So, if your app is like mine, with text in many different places, then you'll have a challenge like mine, how to package the text into a Lucene doc.
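As a rough sketch of that packaging step, the text gathering might look like the helper below. The table and column names (short_desc, comments, comment_text) are illustrative, not BugTracker.NET's actual schema:

```csharp
// Hypothetical helper: gather the searchable text for one bug from several
// tables into a single string, which then becomes the "text" field of the doc.
static string gather_text_for_bug(int bug_id)
{
    System.Text.StringBuilder sb = new System.Text.StringBuilder();

    // The bug's short description (illustrative column name).
    DataSet ds = btnet.DbUtil.get_dataset(
        "select short_desc from bugs where bug_id = " + Convert.ToString(bug_id));
    sb.Append((string)ds.Tables[0].Rows[0][0]);

    // All of the bug's comments (illustrative table and column names).
    ds = btnet.DbUtil.get_dataset(
        "select comment_text from comments where comment_bug_id = " + Convert.ToString(bug_id));
    foreach (DataRow dr in ds.Tables[0].Rows)
    {
        sb.Append(" ");
        sb.Append((string)dr[0]);
    }

    return sb.ToString();
}
```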


static Lucene.Net.Documents.Document create_doc(int bug_id, string text)
{     
    Lucene.Net.Documents.Document doc = new Lucene.Net.Documents.Document();
    
    doc.Add(new Lucene.Net.Documents.Field(
        "text",
        new System.IO.StringReader(text)));
    
    doc.Add(new Lucene.Net.Documents.Field(
        "bug_id",
        Convert.ToString(bug_id),
        Lucene.Net.Documents.Field.Store.YES,
        Lucene.Net.Documents.Field.Index.UN_TOKENIZED));
    
    // For the highlighter, store the raw text
    doc.Add(new Lucene.Net.Documents.Field(
        "raw_text",
        text,
        Lucene.Net.Documents.Field.Store.YES,
        Lucene.Net.Documents.Field.Index.UN_TOKENIZED));

    return doc;
}



3) Updating the index

Whenever a user updates text in a bug I launch a worker thread to update the index.    The worker thread grabs a lock so that only one thread is updating the index at a time.   
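That launcher can be sketched just like build_lucene_index above; the method name update_lucene_index is my invention, but threadproc_update is the real worker shown below:

```csharp
// Hypothetical launcher, mirroring build_lucene_index: after a bug's text is
// saved, update the index on a worker thread, passing the bug id as the argument.
public static void update_lucene_index(int bug_id)
{
    System.Threading.Thread thread = new System.Threading.Thread(threadproc_update);
    thread.Start(bug_id); // threadproc_update casts this object back to int
}
```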

The worker thread creates a Lucene "IndexModifier", deletes the old doc, and replaces it with a new one.

Notice that the thread closes the "searcher".   The searcher is a Lucene "Searcher".   The life cycle of a Searcher is that it first loads the index and then does its searches using that loaded, cached version of the index.   If the real index changes on disk, the searcher wouldn't know about it.   It would continue searching the out-of-date cached copy of the index in its memory.    That might be ok for your situation, and if your index is very big and the cost of creating a new searcher is high, you might be forced to use a searcher with a stale index.   BugTracker.NET databases tend to be small, so I can get away with making sure my searcher always has an up-to-date index to work with.

The official Lucene FAQ says that a Searcher (aka IndexSearcher) "is thread-safe. Multiple search threads may use the same instance of IndexSearcher concurrently without any problems. It is recommended to use only one IndexSearcher from all threads in order to save memory."



Lucene.Net.Search.Searcher searcher = null;

static void threadproc_update(object obj)
{
    lock (my_lock) // If a thread is updating the index, no other thread should be doing anything with it.
    {
        
        try
        {
            if (searcher != null)
            {
                try
                {
                    searcher.Close();
                }
                catch (Exception e)
                {
                    btnet.Util.write_to_log("Exception closing lucene searcher:" + e.Message);
                }
                searcher = null;
            }
            
            Lucene.Net.Index.IndexModifier modifier = new Lucene.Net.Index.IndexModifier("c:\\folder_where_lucene_index_lives", analyzer, false);
            
            // Same as the build, but uses "modifier" instead of "writer",
            // and adds a "where" clause for the bug_id.
            
            int bug_id = (int)obj;
            
            modifier.DeleteDocuments(new Lucene.Net.Index.Term("bug_id", Convert.ToString(bug_id)));
            
            DataSet ds = btnet.DbUtil.get_dataset("select bug_id, bug_text from bugs where bug_id = " + Convert.ToString(bug_id));
            
            foreach (DataRow dr in ds.Tables[0].Rows) // one row...
            {
                modifier.AddDocument(create_doc(
                    (int)dr["bug_id"],
                    (string)dr["bug_text"]));
            }
            
            modifier.Flush();
            modifier.Close();
            
        }
        catch (Exception e)
        {
            btnet.Util.write_to_log("exception updating Lucene index: " + e.Message);
        }
    }
}




4) Sending the search query to Lucene

To search, create a Lucene "QueryParser".    Call its Parse() method passing the text the user typed in.   The Parse() method returns a "Query".   Call the Searcher's Search() method passing the Query.   The Search() method returns a Lucene "Hits" object, a collection of the search hits.    
          
As I've mentioned, I want my searcher to always be using the most up-to-date index, so whenever I update the index, I destroy the old searcher and then recreate it the next time it's needed.

Since IIS is handling the HTTP requests with multiple threads, these searches are happening on multiple threads.   Each search tries to grab my one-and-only lock, the one that keeps the updating threads from conflicting with each other and that keeps the updating threads from conflicting with searches.     Because there is just this one-and-only lock, all the searches on the website have to line up in single-file to get through this bottleneck.   Sounds terrible, doesn't it?   But so far, no reports of any problems.   It's just a bug tracker, not twitter, and so I can get away with this design, and there's no confusion ever about people doing searches with out-of-date indexes.

     
Lucene.Net.QueryParsers.QueryParser parser = new Lucene.Net.QueryParsers.QueryParser("text", analyzer);
Lucene.Net.Search.Query query = null;

try
{
    if (string.IsNullOrEmpty(text_user_entered))
    {
        throw new Exception("You forgot to enter something to search for...");
    }
    
    query = parser.Parse(text_user_entered);
    
}
catch (Exception e)
{
    display_exception(e);
}


lock (my_lock)
{
    
    Lucene.Net.Search.Hits hits = null;
    try
    {
        if (searcher == null)
        {
            searcher = new Lucene.Net.Search.IndexSearcher("c:\\folder_where_lucene_index_lives");
        }

        hits = searcher.Search(query);

    }
    catch (Exception e)
    {
        display_exception(e);
    }
    
    for (int i = 0; i < hits.Length(); i++)
    {
        Lucene.Net.Documents.Document doc = hits.Doc(i);
        // ... more processing of the hits and the Lucene docs here ...
    }
}



5) Displaying the results

If you didn't like my design prior to this point, what with the locking and the bottleneck, then you are really going to hate it now, because it gets weird. The search results I get back from Lucene are in the form of a Hits object, a collection of hits that you access by index. The collection is ordered by relevance score, which you can get using the Hits.Score() method. You can also get at the Lucene Document related to the hit via the Hits.Doc() method.

Now, back when I was designing my Lucene Document, I had to think ahead about how I would display the results. Would I display the results based purely on what's in the document? If so, then I would have had to add fields to the doc for everything I wanted to eventually display, not just the fields I needed for search. The more fields I put in the doc, the more I would have to update the doc and the index to keep them in sync with my database, and the more I would be duplicating database data in the Lucene index. So, there was a downside to relying strictly on the Lucene doc for my display.

Also, and for me more importantly, I already have a page in my app that knows how to display a list of bugs based on the result of a SQL query. I didn't want to have to adapt that page to work with a Lucene Hits object. I wanted to somehow convert the Lucene results into the format expected by that existing page.

So, I decided to try importing the Hits into the database, then letting my existing page fetch the hits out of the database, joining the hits to my bugs table to pick up the fields that I had not bothered to duplicate in the Lucene doc as fields.

The code below shows how I imported the Lucene hits into the database. In short, I create a big batch of SQL statements and execute them in one trip to the server. The batch creates a temporary table with a unique name, plus a bunch of insert statements, one for every Lucene hit I want to import and display. I import the best 100 hits, which is more than enough. Lucene can find multiple hits in the same document, but I only want to list a given bug once in the search results, so I have logic for that below: the dict_already_seen_ids dictionary.

You will probably want to show your users the text around where the hit is, with the searched-for words highlighted, displayed in their context. Lucene can prepare that displayable snippet of text for you. You have to create a bunch of Lucene objects: a Formatter, a SimpleFragmenter, a QueryScorer, a Highlighter, etc., as my code below does. I specified a snippet length of 400 characters, and I specified that the highlighting be done using this HTML: <span style='background:yellow;'></span>. I feed the highlighter the original Query and the raw text that I had saved in the doc. Lucene then gives me the formatted, highlighted snippets, which I insert into my temporary database table.

You might think that the import of the Lucene hits into the database would perform poorly, but actually it's fast. Had this not worked, my plan B would have been to create a more complete Lucene doc and then somehow programmatically synthesize an ADO.NET recordset for the downstream page that displays the results.
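For completeness, the end of that same batch could select the hits back out, joining the temp table to the bugs table. A sketch, with illustrative column names and an invented ds_results variable:

```csharp
// Hypothetical tail of the SQL batch: after the create table and the inserts,
// join the imported hits back to the bugs table, best score first, so the
// existing bug-list page can render the rows. Column names are illustrative.
sb.Append("\nselect bugs.bug_id, temp_score, temp_text from #");
sb.Append(guid);
sb.Append(" inner join bugs on bugs.bug_id = temp_bg_id order by temp_score desc");

DataSet ds_results = btnet.DbUtil.get_dataset(sb.ToString()); // one trip to the server
```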



Lucene.Net.Highlight.Formatter formatter = new Lucene.Net.Highlight.SimpleHTMLFormatter(
    "<span style='background:yellow;'>",
    "</span>");

Lucene.Net.Highlight.SimpleFragmenter fragmenter = new Lucene.Net.Highlight.SimpleFragmenter(400);
Lucene.Net.Highlight.QueryScorer scorer = new Lucene.Net.Highlight.QueryScorer(query);
Lucene.Net.Highlight.Highlighter highlighter = new Lucene.Net.Highlight.Highlighter(formatter, scorer);
highlighter.SetTextFragmenter(fragmenter); 

StringBuilder sb = new StringBuilder();
string guid = Guid.NewGuid().ToString().Replace("-", "");
Dictionary<string, int> dict_already_seen_ids = new Dictionary<string, int>();

// Create the temp table, using the guid to make its name unique.
sb.Append("create table #");
sb.Append(guid);
sb.Append(@"
    (
        temp_bg_id int,
        temp_score float,
        temp_text nvarchar(3000)
    )
");


// insert the search results into a temp table which we will join with what's in the database
for (int i = 0; i < hits.Length(); i++)
{
    if (dict_already_seen_ids.Count < 100)
    {
        Lucene.Net.Documents.Document doc = hits.Doc(i);
        string bg_id = doc.Get("bug_id");
        if (!dict_already_seen_ids.ContainsKey(bg_id))
        {
            dict_already_seen_ids[bg_id] = 1;
            sb.Append("insert into #");
            sb.Append(guid);
            sb.Append(" values(");
            sb.Append(bg_id);
            sb.Append(",");
            // In some locales the score's decimal separator is a comma; SQL wants a period.
            sb.Append(Convert.ToString(hits.Score(i)).Replace(",", "."));
            sb.Append(",N'");
            
            string raw_text = Server.HtmlEncode(doc.Get("raw_text"));


            Lucene.Net.Analysis.TokenStream stream = analyzer.TokenStream("", new System.IO.StringReader(raw_text));

            string highlighted_text = highlighter.GetBestFragments(stream, raw_text, 1, "...").Replace("'", "''");


            if (highlighted_text == "") // sometimes the highlighter fails to emit text...
            {
                highlighted_text = raw_text.Replace("'", "''");
            }
            if (highlighted_text.Length > 3000)
            {
                highlighted_text = highlighted_text.Substring(0,3000);
            }
            sb.Append(highlighted_text);
            sb.Append("'");
            sb.Append(")\n");
        }
    }
    else
    {
        break;
    }
}  
 

We're done.  I'd be very interested in your feedback.   Was my explanation here helpful to you?   Were my design choices stupid?   I'd like to hear from you.
 

Reposted from: http://www.cnblogs.com/zcm123/archive/2013/06/03/3116192
