Open Source Search Engines in Java

2019-03-27 01:07 | Source: the web

    Compass

    The Compass Framework is a first-class open source Java framework that declaratively brings search-engine semantics to your application stack. Built on top of the excellent Lucene search engine, Compass integrates seamlessly with popular development frameworks such as Hibernate and Spring. It provides search capability over your application's data model and synchronises changes with the data source. With Compass: write less code, find data quicker.

    Go To Compass
    Oxyus

    Oxyus Search Engine is a Java-based application for indexing web documents so they can be searched from an intranet or the Internet, similar to proprietary search engines in the industry. Oxyus has a web module that presents search results to clients through web browsers, using Java servlets that access a JDBC repository through JavaBeans.

    Go To Oxyus
    BDDBot

    BDDBot is a web robot, search engine, and web server written entirely in Java. It was written as an example for a book chapter on how to write your own search engine, and as such it is very simplistic.

    Go To BDDBot
    Egothor

    Egothor is an open source, high-performance, full-featured text search engine written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. It can be configured as a standalone engine, a metasearcher, or a peer-to-peer hub, and it can also be used as a library by an application that needs full-text search.

    Go To Egothor
    Nutch

    Nutch is a nascent effort to implement an open-source web search engine. Nutch provides a transparent alternative to commercial web search engines.

    Go To Nutch
    Lucene

    Jakarta Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

    Go To Lucene
    Zilverline

    Zilverline is what you could call a 'Reverse Search Engine'.

    It indexes documents from your local disks (and UNC-style network paths) and lets you search through them locally or, if you are away from your machine, through a web server running on it.

    Zilverline supports collections. A collection is a set of files and directories under a directory. PDF, Word, txt, Java, CHM and HTML files are supported, as well as zip and rar archives. A collection can be indexed and searched. Search results can be retrieved from the local disk or remotely, if you run a web server on your machine. Files inside zip, rar and chm files are extracted, indexed and can be cached. The cache can be mapped to sit behind your web server as well.
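Zilverline's notion of a collection — indexable files gathered from a directory tree — can be sketched with a plain-Java directory walk. This is a generic illustration, not Zilverline's actual API; the extension list simply follows the formats named above:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class CollectionScanner {
    // Extensions treated as indexable, mirroring the formats listed above.
    private static final Set<String> INDEXABLE =
            Set.of("pdf", "doc", "txt", "java", "chm", "html", "zip", "rar");

    /** Recursively collect regular files under root whose extension is indexable. */
    public static List<Path> scan(Path root) throws IOException {
        try (Stream<Path> walk = Files.walk(root)) {
            return walk.filter(Files::isRegularFile)
                       .filter(p -> INDEXABLE.contains(extensionOf(p)))
                       .collect(Collectors.toList());
        }
    }

    private static String extensionOf(Path p) {
        String name = p.getFileName().toString();
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1).toLowerCase();
    }
}
```

A real indexer would then hand each collected file to a format-specific extractor (PDF, Word, archive, ...) before adding its text to the index.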

    Go To Zilverline
    YaCy

    YaCy is a distributed web crawler and caching HTTP proxy. Its web interface lets you configure personal settings, proxy settings, access control and crawling properties; start crawls; send messages to other peers; and monitor your index, cache status and crawling processes. Most importantly, the search page lets you search either your own index or the global one.

    Go To YaCy
    Lius

    LIUS - Lucene Index Update and Search
    LIUS is an indexing Java framework based on the Jakarta Lucene project. The LIUS framework adds to Lucene indexing functionality for many file formats, such as MS Word, MS Excel, MS PowerPoint, RTF, PDF, XML, HTML, TXT, the OpenOffice suite and JavaBeans.
    LIUS is very easy to use: all indexing configuration (types of files to be indexed, fields, etc.) as well as search configuration is defined in an XML file, so the user only has to write a few lines of code to carry out indexing or searching.

    LIUS has been developed from a range of Java technologies and full open source applications.

    Go To Lius
    Solr

    Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP APIs, caching, replication, and a web administration interface.
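Because Solr exposes its search over plain HTTP, any Java client can talk to it just by building a request URL. A minimal sketch — the host, port, and core name `mycore` are hypothetical placeholders:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SolrQueryUrl {
    /** Build a Solr /select URL for the given query string, asking for XML output. */
    public static String build(String baseUrl, String core, String query)
            throws UnsupportedEncodingException {
        return baseUrl + "/" + core + "/select?q="
                + URLEncoder.encode(query, "UTF-8") + "&wt=xml";
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical local Solr instance; fetching this URL would return XML results.
        System.out.println(build("http://localhost:8983/solr", "mycore",
                                 "title:lucene AND type:book"));
        // prints http://localhost:8983/solr/mycore/select?q=title%3Alucene+AND+type%3Abook&wt=xml
    }
}
```

In practice the URL would be fetched with any HTTP client and the XML (or JSON) response parsed; this fragment only shows how the query is encoded.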

    Go To Solr
    regain

    regain is a fast search engine built on top of Jakarta Lucene. It crawls through files or web pages using a plugin architecture of preparators for several file formats and data sources. Search requests are handled via a browser-based user interface using Java Server Pages. regain is released under the LGPL and comes in two versions:

    1. a standalone desktop search program including a crawler and HTTP server
    2. a server-based installation providing full-text search functionality for a website or intranet file server, configured via XML files.

    Go To regain
    MG4J

    MG4J (Managing Gigabytes for Java) is a collaborative effort aimed at providing a free Java implementation of inverted-index compression techniques; as a by-product, it offers several general-purpose optimised classes, including fast and compact mutable strings, bit-level I/O, fast unsynchronised buffered streams, and (possibly signed) minimal perfect hashing. MG4J functions as a full-fledged text-indexing system: it can analyse, index, and query very large document collections.
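The inverted index that MG4J compresses is, at its core, a map from each term to a sorted posting list of document IDs. A toy plain-Java version, illustrating only the data structure and an AND query — not MG4J's actual API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class InvertedIndex {
    // term -> sorted set of document IDs containing that term (the "posting list")
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    /** Tokenize a document naively and record each term's posting. */
    public void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** AND query: IDs of documents containing every given term, in ascending order. */
    public List<Integer> search(String... terms) {
        TreeSet<Integer> result = null;
        for (String term : terms) {
            TreeSet<Integer> docs =
                    postings.getOrDefault(term.toLowerCase(), new TreeSet<>());
            if (result == null) result = new TreeSet<>(docs);
            else result.retainAll(docs);
        }
        return result == null ? new ArrayList<>() : new ArrayList<>(result);
    }
}
```

MG4J's contribution is storing such posting lists in compressed form (gap encoding, instantaneous codes) while keeping intersection queries fast; this sketch skips all of that.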

    Go To MG4J
    Piscator

    Piscator is a small SQL/XML search engine. Once an XML feed is loaded, it can be queried using plain SQL. The setup is almost identical to the DB2 side tables approach.

    Go To Piscator
    Hounder

    Hounder is a simple and complete search system. Out of the box, Hounder crawls the web targeting only those documents of interest, and presents them through a simple search web page and through an API, ideal for integrating into other projects. It is designed to scale on all fronts: the number of indexed pages, the crawling speed, and the number of simultaneous search queries. It is in use in many large-scale search systems.

    Go To Hounder
    HSearch
    HSearch is an open source NoSQL search engine built on Hadoop and HBase. HSearch features include:

     * Multiple document formats
     * Record- and document-level search access control
     * Continuous index updates
     * Parallel indexing using multiple machines
     * Embeddable in applications
     * A RESTful web service gateway that supports XML
     * Auto sharding
     * Auto replication
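"Auto sharding" generally means routing each document to a shard by a deterministic function of its key, so writes spread across machines without manual partitioning. A generic sketch of hash-based routing — not HSearch's actual implementation:

```java
public class ShardRouter {
    private final int numShards;

    public ShardRouter(int numShards) {
        if (numShards <= 0) throw new IllegalArgumentException("numShards must be positive");
        this.numShards = numShards;
    }

    /** Map a document key to a shard in [0, numShards). floorMod keeps it non-negative. */
    public int shardFor(String docKey) {
        return Math.floorMod(docKey.hashCode(), numShards);
    }
}
```

Because the mapping is deterministic, both indexing and search can compute which shard holds a document; systems that also need rebalancing when shards are added typically move to consistent hashing instead.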


Reposted from: http://www.cnblogs.com/lexus/archive/2012/03/07/2383599
