
Scrape information from Scraped URL


I am new to Scrapy and am currently learning how to scrape information from a list of scraped URLs. I was able to scrape information from a single URL by following the tutorial on the Scrapy website. However, I am having trouble scraping information from the list of URLs scraped from that page, even after searching online for a solution.

The spider I have written below is able to scrape the first URL, but it fails to scrape from the list of scraped URLs. The problem starts at `def parse_following_urls(self, response):`, where I am unable to scrape anything from the list of scraped URLs.

Can anyone help solve this? Thanks in advance.

import scrapy
from scrapy.http import Request

class SET(scrapy.Item):
    title = scrapy.Field()
    open = scrapy.Field()
    hi = scrapy.Field()
    lo = scrapy.Field()
    last = scrapy.Field()
    bid = scrapy.Field()
    ask = scrapy.Field()
    vol = scrapy.Field()
    exp = scrapy.Field()
    exrat = scrapy.Field()
    exdat = scrapy.Field()

class ThaiSpider(scrapy.Spider):
    name = "warrant"
    allowed_domains = ["marketdata.set.or.th"]
    start_urls = ["http://marketdata.set.or.th/mkt/stocklistbytype.do?market=SET&language=en&country=US&type=W"]

    def parse(self, response):
        for sel in response.xpath('//table[@class]/tbody/tr'):
            item = SET()
            item['title'] = sel.xpath('td[1]/a[contains(@href,"ssoPageId")]/text()').extract()
            item['open'] = sel.xpath('td[3]/text()').extract()
            item['hi'] = sel.xpath('td[4]/text()').extract()
            item['lo'] = sel.xpath('td[5]/text()').extract()
            item['last'] = sel.xpath('td[6]/text()').extract()
            item['bid'] = sel.xpath('td[9]/text()').extract()
            item['ask'] = sel.xpath('td[10]/text()').extract()
            item['vol'] = sel.xpath('td[11]/text()').extract()
            yield item
        urll = response.xpath('//table[@class]/tbody/tr/td[1]/a[contains(@href,"ssoPageId")]/@href').extract()
        urls = ["http://marketdata.set.or.th/mkt/"+ i for i in urll]
        for url in urls:
            request = scrapy.Request(url, callback=self.parse_following_urls, dont_filter=True)
            yield request
        request.meta['item'] = item

    def parse_following_urls(self, response):
        for sel in response.xpath('//table[3]/tbody'):
            item = response.meta['item']
            item['exp'] = sel.xpath('tr[1]/td[2]/text()').extract()
            item['exrat'] = sel.xpath('tr[2]/td[2]/text()').extract()
            item['exdat'] = sel.xpath('tr[3]/td[2]/text()').extract()
            yield item
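A scrapy-free sketch (using plain dicts as stand-ins for `scrapy.Request`, and hypothetical URLs) of one problem in this first version: `request.meta['item'] = item` runs once, after the `for url in urls:` loop, so only the last request ever carries the item, and every response arriving at `parse_following_urls` for the other requests has no `'item'` key in its meta.

```python
# Stand-in demonstration: each dict plays the role of a scrapy.Request,
# appending plays the role of `yield request`.
requests = []
for url in ["u1", "u2", "u3"]:          # hypothetical URLs
    request = {"url": url, "meta": {}}  # stands in for scrapy.Request(url, ...)
    requests.append(request)            # stands in for `yield request`
request["meta"]["item"] = "item"        # runs after the loop, like the original code

# Only the last request received the item:
assert [r["meta"] for r in requests] == [{}, {}, {"item": "item"}]
```

Attaching the meta inside the loop, before each request is yielded (as the second version below does), gives every request its own item.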

I have rewritten the code after trying the suggestions given and looking at the output. Below is the edited code. However, I got another error stating `Request url must be str or unicode, got list`. How do I convert the URL from a list to a string?

I thought the URL would be a string, since it is built inside a for loop. I have marked the failing line with a comment in the code below. Is there any way to solve this?

import scrapy
from scrapy.http import Request

class SET(scrapy.Item):
    title = scrapy.Field()
    open = scrapy.Field()
    hi = scrapy.Field()
    lo = scrapy.Field()
    last = scrapy.Field()
    bid = scrapy.Field()
    ask = scrapy.Field()
    vol = scrapy.Field()
    exp = scrapy.Field()
    exrat = scrapy.Field()
    exdat = scrapy.Field()

class ThaiSpider(scrapy.Spider):
    name = "warrant"
    allowed_domains = ["marketdata.set.or.th"]
    start_urls = ["http://marketdata.set.or.th/mkt/stocklistbytype.do?market=SET&language=en&country=US&type=W"]

    def parse(self, response):
        for sel in response.xpath('//table[@class]/tbody/tr'):
            item = SET()
            item['title'] = sel.xpath('td[1]/a[contains(@href,"ssoPageId")]/text()').extract()
            item['open'] = sel.xpath('td[3]/text()').extract()
            item['hi'] = sel.xpath('td[4]/text()').extract()
            item['lo'] = sel.xpath('td[5]/text()').extract()
            item['last'] = sel.xpath('td[6]/text()').extract()
            item['bid'] = sel.xpath('td[9]/text()').extract()
            item['ask'] = sel.xpath('td[10]/text()').extract()
            item['vol'] = sel.xpath('td[11]/text()').extract()
            url = ["http://marketdata.set.or.th/mkt/"]+ sel.xpath('td[1]/a[contains(@href,"ssoPageId")]/@href').extract()
            request = scrapy.Request(url, callback=self.parse_following_urls, dont_filter=True) #Request url must be str or unicode, got list: How to solve this?
            request.meta['item'] = item
            yield item
            yield request

    def parse_following_urls(self, response):
        for sel in response.xpath('//table[3]/tbody'):
            item = response.meta['item']
            item['exp'] = sel.xpath('tr[1]/td[2]/text()').extract()
            item['exrat'] = sel.xpath('tr[2]/td[2]/text()').extract()
            item['exdat'] = sel.xpath('tr[3]/td[2]/text()').extract()
            yield item
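The error comes from the fact that Scrapy's `.extract()` always returns a list of strings, even when the XPath matches a single node, so `["http://..."] + [...]` concatenates two lists. A minimal, scrapy-free sketch of the conversion (the href value below is a hypothetical example of what the XPath query might return):

```python
# .extract() returns a list of strings; scrapy.Request needs a single str.
hrefs = ["stockquotation.do?symbol=ABC-W1"]  # hypothetical .extract() result

# Wrong: concatenating two lists produces a list, which Request rejects.
url_as_list = ["http://marketdata.set.or.th/mkt/"] + hrefs
assert isinstance(url_as_list, list)

# Right: take the first matched string, then concatenate strings.
url = "http://marketdata.set.or.th/mkt/" + hrefs[0]
assert isinstance(url, str)
```

Inside the spider itself, `sel.xpath('td[1]/a[contains(@href,"ssoPageId")]/@href').extract_first()` yields the same single string (or `None` when nothing matches), and Scrapy's `response.urljoin(href)` can then build the absolute URL.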

Source: https://stackoverflow.com/questions/35358867
Updated: 2022-07-30 07:07

Best answer



Whenever you use a template in C++, its template type arguments have to be complete types; that requires you to include the <string> header when you want to use a vector of strings. An include is nothing more than copying and pasting the code of the included file into the place where the include directive appears.

1> #include <string>
2> #include <vector>
3>
4> class Foo {
5> private:
6>     std::vector<std::string> bar;
7> };

When line 6 is compiled, the compiler has to know both types as complete types (string because it is used as a template argument, vector because the member is an object rather than a pointer). The includes are placed above the class, so the compiler already knows both types when it compiles line 6. It does not matter in which order you include them.
