Parse the text after using BS4, Python and Selenium

After running my scrape script:

from selenium import webdriver
from bs4 import BeautifulSoup
import csv

browser = webdriver.Firefox()
browser.get('http://dyn.com/about/events/')
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly so results do not depend on what is installed
titles = [tag.text for tag in soup.find_all('p','pubdate')]

I get a result that looks like this:

[u'\n\n\t\t\tWEBINAR: How To Expand Your Global Reach To China\xa0\n\t\t\t\n\t\t\tOct 22, 2014\t\t\t\nspeak \n', u'\n\n\t\t\tLAUNCH Scale \u2013 San Francisco, CA\xa0\n\t\t\t\n\t\t\tOct 23 - 24, 2014\t\t\t\nattend \n', u'\n\n\t\t\tAcquia Engage User Conference \u2013 Boston, MA\xa0\n\t\t\t\n\t\t\tNov 3 - 5, 2014\t\t\t\nexhibitattend \n', u'\n\n\t\t\tCloud Expo \u2013 Santa Clara, CA\xa0\n\t\t\t\n\t\t\tNov 4 - 6, 2014\t\t\t\nexhibit \n', u'\n\n\t\t\tThe Global Carrier Awards 2014 \u2013 Amsterdam\xa0\n\t\t\t\n\t\t\tNov 4, 2014\t\t\t\n\n', u'\n\n\t\t\tWeb Summit \u2013 Dublin, Ireland\xa0\n\t\t\t\n\t\t\tNov 4 - 6, 2014\t\t\t\nspeak \n', u'\n\n\t\t\tVelocity Europe \u2013 Barcelona, Spain\xa0\n\t\t\t\n\t\t\tNov 17 - 19, 2014\t\t\t\nexhibit \n', u'\n\n\t\t\tNH/VT FIRST LEGO League Championship Event\xa0\n\t\t\t\n\t\t\tDec 6, 2014\t\t\t\nspeak \n']

I am new to Python. How can I extract the event name, date, and event type from each string in this result?

Thanks!
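
For reference, here is a minimal sketch of one way the strings shown above could be split into the three fields. It assumes every entry follows the layout visible in the sample output (name, then date, then an optional type, each on its own line), and that the only event types are the three words seen in that output (speak, attend, exhibit). The function name parse_event and the constant KNOWN_TYPES are hypothetical names introduced for illustration.

```python
import re

# Assumption: only these three type words occur, as in the sample output.
KNOWN_TYPES = re.compile(r'speak|attend|exhibit')

def parse_event(raw):
    """Split one scraped string into (name, date, list_of_types)."""
    # Turn non-breaking spaces into plain spaces, then drop blank lines.
    text = raw.replace(u'\xa0', u' ')
    lines = [ln.strip() for ln in text.splitlines()]
    lines = [ln for ln in lines if ln]

    name = lines[0] if len(lines) > 0 else u''
    date = lines[1] if len(lines) > 1 else u''
    # Types can be run together (e.g. 'exhibitattend'), so scan for each word.
    types = KNOWN_TYPES.findall(lines[2]) if len(lines) > 2 else []
    return name, date, types

sample = (u'\n\n\t\t\tAcquia Engage User Conference \u2013 Boston, MA\xa0'
          u'\n\t\t\t\n\t\t\tNov 3 - 5, 2014\t\t\t\nexhibitattend \n')
name, date, types = parse_event(sample)
print(name)
print(date)
print(types)
```

Applied to the scraped list, this would be events = [parse_event(t) for t in titles]. If the page markup is available, selecting the name, date, and type elements separately with BeautifulSoup is likely more robust than splitting text on line breaks.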


Source: https://stackoverflow.com/questions/26484951
Updated: 2022-11-26 07:11

Best answer

You should use an outer join.

select
    A.ID,
    A.DataA1,
    A.DataA2,
    B.A_ID,
    B.DataB1,
    B.DataB2,
    C.A_ID,
    C.DataC1,
    C.DataC2
from A 
left join B
on A.ID = B.A_ID
left join C
on A.ID = C.A_ID

For a good explanation of SQL joins, check out: http://www.codinghorror.com/blog/2007/10/a-visual-explanation-of-sql-joins.html
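
To see what the left joins in the query above return, here is a small sketch that runs it against an in-memory SQLite database using Python's sqlite3 module. The column types and the sample rows are assumptions made up for illustration. Only the table and column names come from the query.

```python
import sqlite3

conn = sqlite3.connect(':memory:')

# Hypothetical schema and data: A has two rows, B matches only ID 1,
# and C matches only ID 2.
conn.executescript("""
    create table A (ID integer primary key, DataA1 text, DataA2 text);
    create table B (A_ID integer, DataB1 text, DataB2 text);
    create table C (A_ID integer, DataC1 text, DataC2 text);
    insert into A values (1, 'a1', 'a2');
    insert into A values (2, 'a3', 'a4');
    insert into B values (1, 'b1', 'b2');
    insert into C values (2, 'c1', 'c2');
""")

# Same query as above, with an order by added so the output is stable.
rows = conn.execute("""
    select A.ID, A.DataA1, A.DataA2,
           B.A_ID, B.DataB1, B.DataB2,
           C.A_ID, C.DataC1, C.DataC2
    from A
    left join B on A.ID = B.A_ID
    left join C on A.ID = C.A_ID
    order by A.ID
""").fetchall()

for row in rows:
    print(row)
```

Every row of A appears in the result. Where B or C has no matching row, its columns come back as None (SQL NULL) instead of the row being dropped, which is the difference from an inner join.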
