Generating a SequenceFile

Given data in the following format (tag_uri image_uri image_uri image_uri ...), I need to convert it into Hadoop SequenceFile format for further processing by Mahout (e.g. clustering):
http://flickr.com/photos/tags/100commentgroup http://flickr.com/photos/34254318@N06/4019040356 http://flickr.com/photos/46857830@N03/5651576112
http://flickr.com/photos/tags/100faves http://flickr.com/photos/21207178@N07/5441742937
...
 
Previously, I converted the input into CSV (or ARFF) as follows:
http://flickr.com/photos/tags/100commentgroup,http://flickr.com/photos/tags/100faves,...
0,1,...
1,1,...
...
 
Each row describes one tag. The ARFF file is then converted into a vector file that Mahout uses for further processing. I am trying to skip the ARFF generation step and generate a SequenceFile directly. If I am not mistaken, to represent my data as a SequenceFile I would need to store each row with $tag_uri as the key and $image_vector as the value. What is the proper way to do this (and, if possible, can the tag_uri for each row be included somewhere in the SequenceFile)?
Some references I found, though I am not sure whether they are relevant:
 
 Writing a SequenceFile 
 Formatting input matrix for svd matrix factorization (can I store my matrix in this form?) 
 RandomAccessSparseVector (considering I only list images that are assigned with a given tag instead of all the images in a line, is it possible to represent it using this vector?) 
 SequenceFile write 
 SequenceFile explanation 
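
To make the key/value idea above concrete, here is a minimal sketch of the preprocessing step only, written in Python for brevity. It assumes whitespace-separated input lines as shown in the question; the function name `build_sparse_rows` and the first-seen column numbering are illustrative choices, not anything Mahout prescribes. Each resulting (tag_uri, indices) pair could then plausibly be written from Java as a Hadoop `Text` key with a Mahout `VectorWritable` value wrapping a `RandomAccessSparseVector` whose cardinality is the number of distinct images, but that mapping is an assumption and is not shown here.

```python
def build_sparse_rows(lines):
    """Turn 'tag_uri image_uri ...' lines into (tag_uri, column indices) rows.

    Only the images actually attached to a tag are recorded, which is the
    sparse representation the question asks about.
    """
    image_index = {}  # image_uri -> column index, assigned in first-seen order
    rows = []         # (tag_uri, sorted column indices), in input order
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        tag_uri, image_uris = parts[0], parts[1:]
        # A set removes images repeated within one line.
        cols = sorted({image_index.setdefault(u, len(image_index))
                       for u in image_uris})
        rows.append((tag_uri, cols))
    return rows, image_index


# The two sample lines from the question.
SAMPLE = [
    "http://flickr.com/photos/tags/100commentgroup "
    "http://flickr.com/photos/34254318@N06/4019040356 "
    "http://flickr.com/photos/46857830@N03/5651576112",
    "http://flickr.com/photos/tags/100faves "
    "http://flickr.com/photos/21207178@N07/5441742937",
]

rows, image_index = build_sparse_rows(SAMPLE)
for tag_uri, cols in rows:
    # Each non-zero entry would be set to 1.0 in the sparse vector.
    print(tag_uri, cols)
```

Keeping the tag URI as the record key (or as the vector's name, if a named vector is used) is what would let the tag travel with its row through later processing.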

Source: https://stackoverflow.com/questions/7062327

Updated: 2023-09-14 08:09

Best answer

Yes, that's one of the main differences between HTML5 and XHTML. You should be able to parse any HTML page with an HTML5 parser.
