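Apache Tika: A General-Purpose Content Analysis Tool

2019-03-27 01:13 | Source: Internet

Project Overview

Tika is a content analysis tool. It ships with a comprehensive set of parser classes that handle nearly all common file formats, extracting each file's metadata and content and returning them as structured output. In short, it works as a general-purpose parsing tool, and it is especially useful in the crawling and processing stages of a search engine.

Tika is an open-source Apache project with a clear purpose and a simple interface. The figure below traces how Tika came about.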

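Tika began inside the Nutch project (which most readers will know), later became a subproject of Lucene, and is now a top-level Apache project, so its roots are in search engines. Quite a lot was incubated during Nutch's development, and much of the credit goes to Doug Cutting. I find this admirable: building a general-purpose search engine like Nutch meant covering both full-text indexing and web crawling, and along the way core supporting pieces emerged and grew into projects of their own. Hadoop and Tika grew out of Nutch, and Nutch itself was built on Lucene, which Doug Cutting also created. All three are now mainstream, widely used general-purpose projects that have made life easier across the IT industry. I find this very inspiring.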

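What the Source Tells Us

The packages and main classes under src show what Tika can do. Setting the core package aside, tika-parsers shows the file types and content that Tika can handle: audio, images, text, and files in many other formats each have a corresponding parser class. Tika also exposes parser interfaces for extension. tika-bundle lets Tika run inside an OSGi container. tika-app is a jar for using Tika without writing any code; it can be downloaded directly from the official site and offers both a GUI and a command-line mode, which makes it an easy way to try the tool. I show it in use below.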

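Tika Architecture

The figure below shows Tika's architecture and the main design goal of each key component: a parser framework (center), a MIME detection mechanism (right), language detection (left), and a facade component (the schematic in the middle) that ties all the components together. External interfaces, including the command line and the GUI (briefly covered in the next section), let users integrate Tika into scripts or applications and interact with it directly. The architecture is extensible throughout: new parsers can be added and removed easily.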
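The "detect the type, then hand off to a registered parser" flow described above can be sketched in a few lines of plain Java. This is only a toy illustration of the idea, not Tika's code: the class name MiniFacade, its methods, and the magic-byte rules are all hypothetical and use nothing but the JDK.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Toy model of a facade over MIME detection and a parser registry. Not Tika's API. */
public class MiniFacade {
    // Registry of parsers keyed by MIME type; entries can be added or removed at runtime.
    private final Map<String, Function<byte[], String>> parsers = new LinkedHashMap<>();

    public void register(String mimeType, Function<byte[], String> parser) {
        parsers.put(mimeType, parser);
    }

    public void unregister(String mimeType) {
        parsers.remove(mimeType);
    }

    /** Very rough MIME detection based on leading "magic" bytes. */
    public static String detect(byte[] data) {
        if (startsWith(data, "%PDF-")) return "application/pdf";
        if (startsWith(data, "<?xml")) return "application/xml";
        return "text/plain"; // fallback chosen for this sketch only
    }

    private static boolean startsWith(byte[] data, String prefix) {
        byte[] p = prefix.getBytes(StandardCharsets.US_ASCII);
        if (data.length < p.length) return false;
        for (int i = 0; i < p.length; i++) {
            if (data[i] != p[i]) return false;
        }
        return true;
    }

    /** Facade entry point: detect the type, look up a parser, return its text ("" if none). */
    public String parseToString(byte[] data) {
        Function<byte[], String> parser = parsers.get(detect(data));
        return parser == null ? "" : parser.apply(data);
    }

    public static void main(String[] args) {
        MiniFacade facade = new MiniFacade();
        // A "parser" here is just a function from bytes to text.
        facade.register("text/plain", bytes -> new String(bytes, StandardCharsets.UTF_8));
        byte[] sample = "hello tika".getBytes(StandardCharsets.UTF_8);
        System.out.println(detect(sample));
        System.out.println(facade.parseToString(sample));
    }
}
```

The caller only ever touches parseToString; supporting a new format means registering one more entry, which is the extensibility property the architecture aims for.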

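Using Tika

To use Tika directly, run java -jar tika-app-1.2.jar --gui. You can then open a local file, enter the URL of something you want parsed, or simply drag files into the Tika window and inspect the results. Downloading the jar and trying it out is very convenient. The View menu lets you choose what to look at (Metadata, text, and so on). For images, Tika mainly provides metadata and does not analyze what the image depicts, so a PDF made of scanned images naturally yields no text.

From the command line, the syntax looks like this: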

java -jar tika-app-1.0.jar --text document.doc
java -jar tika-app-1.0.jar --encoding=UTF-8 --text document.doc
java -jar tika-app-1.0.jar --metadata document.doc

To use Tika from another project, add Tika to your Maven dependencies, create a Tika instance, and call its parse methods to get the processed content. The simplest possible example:

import java.io.File;
import org.apache.tika.Tika;

public class SimpleTextExtractor {
    public static void main(String[] args) throws Exception {
        // Create a Tika instance with the default configuration
        Tika tika = new Tika();
        // Parse all given files and print out the extracted text content
        for (String file : args) {
            String text = tika.parseToString(new File(file));
            System.out.print(text);
        }
    }
}
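
The example above returns text only. As a complement, here is a minimal sketch that also collects metadata by calling the parser framework directly. It assumes tika-core and tika-parsers are on the classpath; the class name MetadataExtractor and its extract helper are made up for illustration.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class MetadataExtractor {
    /** Parses the stream, appends the extracted text to textOut, and returns the metadata. */
    public static Metadata extract(InputStream in, StringBuilder textOut) throws Exception {
        // AutoDetectParser detects the document type and delegates to a matching parser
        AutoDetectParser parser = new AutoDetectParser();
        // The no-argument handler stops after 100,000 characters; pass -1 to lift the limit
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        parser.parse(in, handler, metadata, new ParseContext());
        textOut.append(handler.toString());
        return metadata;
    }

    public static void main(String[] args) throws Exception {
        for (String file : args) {
            StringBuilder text = new StringBuilder();
            try (InputStream in = Files.newInputStream(Paths.get(file))) {
                Metadata metadata = extract(in, text);
                // Print every metadata field the parser reported
                for (String name : metadata.names()) {
                    System.out.println(name + " = " + metadata.get(name));
                }
            }
            System.out.println(text);
        }
    }
}
```

Which metadata fields appear depends on the file format and on the parsers available at run time, so treat the printed names as something to explore rather than a fixed schema.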

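Summary

I introduced Tika for two reasons:

1. It is a general, practical, and easy-to-use analysis tool that combines well with Lucene and Solr and is a natural fit for search engines.

2. I admire the history of the Nutch project and the close, natural, and elegant relationships among Apache's open-source projects.

For more, see the book "Tika in Action".

Reposted from: http://www.cnblogs.com/chenying99/archive/2013/03/07/2947283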
