
Scrape information from Scraped URL


I am new to Scrapy and am currently learning how to scrape information from a list of scraped URLs. I was able to scrape information from a single URL by following the tutorial on the Scrapy website. However, I am having trouble scraping information from the list of URLs scraped from that page, even after searching online for a solution.

The spider I have written below is able to scrape the first URL, but it fails to scrape from the list of scraped URLs. The problem starts at `def parse_following_urls(self, response):`, where I am unable to scrape anything from the list of scraped URLs.

Can anyone help solve this? Thanks in advance.

import scrapy
from scrapy.http import Request

class SET(scrapy.Item):
    title = scrapy.Field()
    open = scrapy.Field()
    hi = scrapy.Field()
    lo = scrapy.Field()
    last = scrapy.Field()
    bid = scrapy.Field()
    ask = scrapy.Field()
    vol = scrapy.Field()
    exp = scrapy.Field()
    exrat = scrapy.Field()
    exdat = scrapy.Field()

class ThaiSpider(scrapy.Spider):
    name = "warrant"
    allowed_domains = ["marketdata.set.or.th"]
    start_urls = ["http://marketdata.set.or.th/mkt/stocklistbytype.do?market=SET&language=en&country=US&type=W"]

    def parse(self, response):
        for sel in response.xpath('//table[@class]/tbody/tr'):
            item = SET()
            item['title'] = sel.xpath('td[1]/a[contains(@href,"ssoPageId")]/text()').extract()
            item['open'] = sel.xpath('td[3]/text()').extract()
            item['hi'] = sel.xpath('td[4]/text()').extract()
            item['lo'] = sel.xpath('td[5]/text()').extract()
            item['last'] = sel.xpath('td[6]/text()').extract()
            item['bid'] = sel.xpath('td[9]/text()').extract()
            item['ask'] = sel.xpath('td[10]/text()').extract()
            item['vol'] = sel.xpath('td[11]/text()').extract()
            yield item
        urll = response.xpath('//table[@class]/tbody/tr/td[1]/a[contains(@href,"ssoPageId")]/@href').extract()
        urls = ["http://marketdata.set.or.th/mkt/"+ i for i in urll]
        for url in urls:
            request = scrapy.Request(url, callback=self.parse_following_urls, dont_filter=True)
            yield request
        request.meta['item'] = item

    def parse_following_urls(self, response):
        for sel in response.xpath('//table[3]/tbody'):
            item = response.meta['item']
            item['exp'] = sel.xpath('tr[1]/td[2]/text()').extract()
            item['exrat'] = sel.xpath('tr[2]/td[2]/text()').extract()
            item['exdat'] = sel.xpath('tr[3]/td[2]/text()').extract()
            yield item
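A scrapy-free sketch (using plain dicts as stand-ins for `scrapy.Request`, and hypothetical URLs) of one problem in this first version: `request.meta['item'] = item` runs once, after the `for url in urls:` loop, so only the last request ever carries the item, and every response arriving at `parse_following_urls` for the other requests has no `'item'` key in its meta.

```python
# Stand-in demonstration: each dict plays the role of a scrapy.Request,
# appending plays the role of `yield request`.
requests = []
for url in ["u1", "u2", "u3"]:          # hypothetical URLs
    request = {"url": url, "meta": {}}  # stands in for scrapy.Request(url, ...)
    requests.append(request)            # stands in for `yield request`
request["meta"]["item"] = "item"        # runs after the loop, like the original code

# Only the last request received the item:
assert [r["meta"] for r in requests] == [{}, {}, {"item": "item"}]
```

Attaching the meta inside the loop, before each request is yielded (as the second version below does), gives every request its own item.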

I have rewritten the code after trying the suggestions given and looking at the output. Below is the edited code. However, I got another error stating `Request url must be str or unicode, got list`. How do I convert the URL from a list to a string?

I thought the URL would be a string, since it is built inside a for loop. I have marked the failing line with a comment in the code below. Is there any way to solve this?

import scrapy
from scrapy.http import Request

class SET(scrapy.Item):
    title = scrapy.Field()
    open = scrapy.Field()
    hi = scrapy.Field()
    lo = scrapy.Field()
    last = scrapy.Field()
    bid = scrapy.Field()
    ask = scrapy.Field()
    vol = scrapy.Field()
    exp = scrapy.Field()
    exrat = scrapy.Field()
    exdat = scrapy.Field()

class ThaiSpider(scrapy.Spider):
    name = "warrant"
    allowed_domains = ["marketdata.set.or.th"]
    start_urls = ["http://marketdata.set.or.th/mkt/stocklistbytype.do?market=SET&language=en&country=US&type=W"]

    def parse(self, response):
        for sel in response.xpath('//table[@class]/tbody/tr'):
            item = SET()
            item['title'] = sel.xpath('td[1]/a[contains(@href,"ssoPageId")]/text()').extract()
            item['open'] = sel.xpath('td[3]/text()').extract()
            item['hi'] = sel.xpath('td[4]/text()').extract()
            item['lo'] = sel.xpath('td[5]/text()').extract()
            item['last'] = sel.xpath('td[6]/text()').extract()
            item['bid'] = sel.xpath('td[9]/text()').extract()
            item['ask'] = sel.xpath('td[10]/text()').extract()
            item['vol'] = sel.xpath('td[11]/text()').extract()
            url = ["http://marketdata.set.or.th/mkt/"]+ sel.xpath('td[1]/a[contains(@href,"ssoPageId")]/@href').extract()
            request = scrapy.Request(url, callback=self.parse_following_urls, dont_filter=True) #Request url must be str or unicode, got list: How to solve this?
            request.meta['item'] = item
            yield item
            yield request

    def parse_following_urls(self, response):
        for sel in response.xpath('//table[3]/tbody'):
            item = response.meta['item']
            item['exp'] = sel.xpath('tr[1]/td[2]/text()').extract()
            item['exrat'] = sel.xpath('tr[2]/td[2]/text()').extract()
            item['exdat'] = sel.xpath('tr[3]/td[2]/text()').extract()
            yield item
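The error comes from the fact that Scrapy's `.extract()` always returns a list of strings, even when the XPath matches a single node, so `["http://..."] + [...]` concatenates two lists. A minimal, scrapy-free sketch of the conversion (the href value below is a hypothetical example of what the XPath query might return):

```python
# .extract() returns a list of strings; scrapy.Request needs a single str.
hrefs = ["stockquotation.do?symbol=ABC-W1"]  # hypothetical .extract() result

# Wrong: concatenating two lists produces a list, which Request rejects.
url_as_list = ["http://marketdata.set.or.th/mkt/"] + hrefs
assert isinstance(url_as_list, list)

# Right: take the first matched string, then concatenate strings.
url = "http://marketdata.set.or.th/mkt/" + hrefs[0]
assert isinstance(url, str)
```

Inside the spider itself, `sel.xpath('td[1]/a[contains(@href,"ssoPageId")]/@href').extract_first()` yields the same single string (or `None` when nothing matches), and Scrapy's `response.urljoin(href)` can then build the absolute URL.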

Source: https://stackoverflow.com/questions/35358867
Updated: 2022-07-30 07:07

Best answer



Whenever you use a template in C++, its template type arguments have to be complete types; that requires you to include the <string> header when you want to use a vector of strings. An include is nothing more than copying and pasting the code of the included file into the place where the include directive appears.

1> #include <string>
2> #include <vector>
3>
4> class Foo {
5> private:
6>     std::vector<std::string> bar;
7> };

When line 6 is compiled, the compiler has to know both types as complete types (string because it is used as a template argument, vector because the member is an object rather than a pointer). The includes are placed above the class, so the compiler already knows both types when it compiles line 6. It does not matter in which order you include them.
