Custom InputFormat or InputReader for Excel files (xls)

I need to read an Excel (xls) file stored on a Hadoop cluster. I did some research and found that I need to create a custom InputFormat for that. I have read many articles, but none of them is helpful from a programming point of view. Could someone help me with sample code for a custom InputFormat, so that I can understand the basics of programming an InputFormat and use the Apache POI library to read the Excel file? I have already written a MapReduce program that reads a text file. Even if I somehow manage to code my own custom InputFormat, I also need to know where that code would go in relation to the MapReduce program I have already written.
PS: Converting the .xls file to a .csv file is not an option.

Source: https://stackoverflow.com/questions/21133013

Updated: 2022-11-13 09:11

Best answer

Your first version is hard to read and, as far as I can tell, also contains two copy-and-paste errors (look at the lines parse_href_text = ... and parse_box_text =). As others have mentioned in the comments, using a library to parse the website might be the better choice.
But let's take a look at your code and try to simplify it.
First, you have a set of tags, and for each of them you want to delete </tag>, <tag> and <tag. You can write this as a function:
def remove_tag(text, tag):
    """Strip '</tag>', '<tag>' and the bare prefix '<tag' from text."""
    return text.replace("</{}>".format(tag), '').replace("<{}>".format(tag), '').replace("<{}".format(tag), '')
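
A quick check of what this helper does on small inputs may make its behaviour clearer. The sample strings below are made up for illustration and do not come from the original question. Note that for a tag with attributes only the `<a` prefix is removed, so the attribute text and the closing `>` stay behind until a later cleanup step.

```python
def remove_tag(text, tag):
    return text.replace("</{}>".format(tag), '').replace("<{}>".format(tag), '').replace("<{}".format(tag), '')

# Hypothetical sample inputs, chosen only to illustrate the behaviour.
plain = remove_tag("<p>hello</p>", "p")
with_attr = remove_tag('<a href="x.html">link</a>', "a")

print(repr(plain))      # 'hello'
print(repr(with_attr))  # ' href="x.html">link'
```

The second result shows why a tag-by-tag replacement alone is not enough: everything after the tag name survives.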
 
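Then call this function for each of your tags:
# Longer names come first: remove_tag also strips the bare prefix '<tag',
# so handling 'h' before 'h1' would turn '<h1>' into '1>'.
tag_list = ['a', 'p', 'div', 'li', 'span', 'img', 'ul', 'ol', 'label', 'h1', 'h2', 'h3', 'h4', 'h5', 'h']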

for tag in tag_list:
    text = remove_tag(text, tag)
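
One caveat with this loop, sketched below on the assumption that the input still contains raw markup: because `remove_tag` also strips the bare prefix `<tag`, a short name such as `h` that is handled before `h1` turns `<h1>` into `1>`, and the later `h1` pass no longer finds the opening tag. Processing longer names first is one possible way around that. `strip_tags` is a hypothetical helper name and the sample string is made up; neither is part of the original answer.

```python
def remove_tag(text, tag):
    return text.replace("</{}>".format(tag), '').replace("<{}>".format(tag), '').replace("<{}".format(tag), '')

def strip_tags(text, tags):
    # Longest names first, so that 'h1' is handled before 'h'.
    for tag in sorted(tags, key=len, reverse=True):
        text = remove_tag(text, tag)
    return text

sample = "<h1>Title</h1><p>Body</p>"  # made-up input

# Short name first: 'h' runs before 'h1' and leaves debris behind.
naive = sample
for tag in ['p', 'h', 'h1']:
    naive = remove_tag(naive, tag)

ordered = strip_tags(sample, ['p', 'h', 'h1'])

print(naive)    # 1>TitleBody
print(ordered)  # TitleBody
```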
 
Finally, remove the remaining items:
others = ['href=', '<', '>', '[]', '[', ']', '\n', '   ', '{', '}', '#']

for s in others:
    text = text.replace(s, '')
 
Putting it all together: 
import requests
from bs4 import BeautifulSoup

url = 'http://www.grabexample.com'
geturl = requests.get(url)
some_text = geturl.text
soup = BeautifulSoup(some_text, "html.parser")

all_on_URL = soup.find_all('a')
# get_text() returns only the text nodes, so most markup is already gone here.
grab_text = soup.get_text(strip=True)

def remove_tag(text, tag):
    return text.replace("</{}>".format(tag), '').replace("<{}>".format(tag), '').replace("<{}".format(tag), '')

# Longer names first, so that 'h' does not clobber '<h1>' ... '<h5>'.
tag_list = ['a', 'p', 'div', 'li', 'span', 'img', 'ul', 'ol',
            'label', 'h1', 'h2', 'h3', 'h4', 'h5', 'h']
for tag in tag_list:
    grab_text = remove_tag(grab_text, tag)

# Strip leftover attribute names, brackets and other debris.
others = ['href=', '<', '>', '[]', '[', ']', '\n', '   ', '{', '}', '#']
for s in others:
    grab_text = grab_text.replace(s, '')

print(grab_text)
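
The library route mentioned at the start can also be sketched with Python's standard-library `html.parser` module instead of string replacement. This is only an illustration under assumptions: it swaps in `html.parser` rather than the BeautifulSoup calls used above so that it runs without third-party packages, and `TextExtractor`, `extract_text` and the sample markup are hypothetical names and data.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text nodes of a document, ignoring all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called by the parser for every run of text between tags.
        stripped = data.strip()
        if stripped:
            self.parts.append(stripped)

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

# Made-up markup, used only to show the idea.
sample = '<div><h1>Title</h1><p>Some <a href="x.html">link</a> text</p></div>'
print(extract_text(sample))  # Title Some link text
```

Because the parser itself decides where tags begin and end, attributes such as href never leak into the result and no list of tag names has to be maintained.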
