Custom InputFormat or InputReader for Excel files (xls)

I need to read an Excel (xls) file stored on a Hadoop cluster. After some research I found that I need to create a custom InputFormat for that. I have read many articles, but none of them helps from a programming point of view. Could someone provide sample code for a custom InputFormat, so that I can understand the basics of "Programming InputFormat" and use the Apache POI library to read the Excel file? I have already written a MapReduce program that reads a text file. I also need to know: even if I somehow manage to write my own custom InputFormat, where does that code go in relation to the MapReduce program I have already written?

PS: Converting the .xls file into a .csv file is not an option.


Source: https://stackoverflow.com/questions/21133013
Updated: 2022-11-13 09:11

Best answer

Your first version is hard to read, and as far as I can tell it also contains two copy-and-paste errors (see the lines parse_href_text = ... and parse_box_text =). As others have mentioned in the comments, using a library to parse the page is probably the better choice.
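To illustrate what "using a library" can look like, here is a minimal sketch that relies only on the standard library's html.parser module, so it runs without extra dependencies. The names TextExtractor and extract_text and the sample markup are made up for this example; they do not come from the original code.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the markup itself."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for every run of text between tags.
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


def extract_text(html):
    # Hypothetical helper: feed the markup to the parser and join the text.
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = '<div><h1>Title</h1><p>Some <a href="/x">link</a> text</p></div>'
print(extract_text(sample))  # Title Some link text
```

Because the parser understands attributes and nesting, nothing like href="/x" leaks into the result, which is the main advantage over string replacement.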

But let's take a look at your code and try to simplify it:

First, you have a set of tags, and for each of them you want to delete </tag>, <tag> and the opening fragment <tag. You can write this as a function:

def remove_tag(text, tag):
    return text.replace("</{}>".format(tag), '').replace("<{}>".format(tag), '').replace("<{}".format(tag), '')
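A quick check of what this function does on two small inputs may help. The sample strings are invented for illustration. Note that for a tag with attributes only the "<a" prefix is removed, so the attribute text and the closing ">" are left behind for a later cleanup step.

```python
def remove_tag(text, tag):
    # Strip the closing tag, the bare opening tag, and the "<tag" prefix
    # of an opening tag that carries attributes.
    return (text.replace("</{}>".format(tag), '')
                .replace("<{}>".format(tag), '')
                .replace("<{}".format(tag), ''))


plain = remove_tag("<p>hello</p>", "p")
with_attrs = remove_tag('<a href="/x">link</a>', "a")

print(repr(plain))       # 'hello'
print(repr(with_attrs))  # ' href="/x">link'
```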

Then call this function for all of your tags:

# 'h' comes last so that its "<h" prefix cannot clip "<h1>" to "<h5>" first
tag_list = ['a', 'p', 'div', 'li', 'span', 'img', 'ul', 'ol', 'label', 'h1', 'h2', 'h3', 'h4', 'h5', 'h']

for tag in tag_list:
    text = remove_tag(text, tag)
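Because remove_tag also strips the bare "<tag" prefix, a short name such as 'h' can clip a longer tag such as "<h1>" before that tag gets its turn, so the order of the list affects the result. The sketch below shows this on an invented input; remove_tags is a hypothetical helper that wraps the loop.

```python
def remove_tag(text, tag):
    return (text.replace("</{}>".format(tag), '')
                .replace("<{}>".format(tag), '')
                .replace("<{}".format(tag), ''))


def remove_tags(text, tags):
    # Hypothetical helper: apply remove_tag for each name, in list order.
    for tag in tags:
        text = remove_tag(text, tag)
    return text


html = "<h1>Title</h1>"
short_first = remove_tags(html, ['h', 'h1'])
long_first = remove_tags(html, ['h1', 'h'])

print(short_first)  # 1>Title
print(long_first)   # Title
```

Listing longer names before their prefixes avoids this particular case, although any prefix match can still clip unrelated tags (for example "<a" inside "<abbr>"), which is one more reason to prefer a real parser.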

Finally, remove the other items:

others = ['href=', '<', '>', '[]', '[', ']', '\n', '   ', '{', '}', '#']

for s in others:
    text = text.replace(s, '')

Putting it all together:

import requests
from bs4 import BeautifulSoup

url = 'http://www.grabexample.com'
geturl = requests.get(url)
some_text = geturl.text
soup = BeautifulSoup(some_text, "html.parser")

all_on_URL = soup.find_all('a')        # every <a> element on the page
grab_text = soup.get_text(strip=True)  # page text with the markup removed

def remove_tag(text, tag):
    # Strip the closing tag, the bare opening tag, and the "<tag" prefix
    return text.replace("</{}>".format(tag), '').replace("<{}>".format(tag), '').replace("<{}".format(tag), '')

# 'h' comes last so that its "<h" prefix cannot clip "<h1>" to "<h5>" first
tag_list = ['a', 'p', 'div', 'li', 'span', 'img', 'ul', 'ol',
            'label', 'h1', 'h2', 'h3', 'h4', 'h5', 'h']
for tag in tag_list:
    grab_text = remove_tag(grab_text, tag)

others = ['href=', '<', '>', '[]', '[', ']', '\n', '   ', '{', '}', '#']
for s in others:
    grab_text = grab_text.replace(s, '')

print(grab_text)
