BeautifulSoup find_all("img") not working for all sites

I'm trying to write a Python script to download images from any website. It works, but inconsistently: find_all("img") does not work for the second URL listed below. The script is:

# works for http://proof.nationalgeographic.com/2016/02/02/photo-of-the-day-best-of-january-3/
# but not http://www.nationalgeographic.com/photography/proof/2017/05/lake-chad-desertification/
import requests
from PIL import Image
from io import BytesIO
from bs4 import BeautifulSoup

def url_to_image(url, filename):
    # get HTTP response, open as bytes, save the image
    # http://docs.python-requests.org/en/master/user/quickstart/#binary-response-content
    req = requests.get(url)
    i = Image.open(BytesIO(req.content))
    i.save(filename)

# open page, get HTML request and parse with BeautifulSoup
html = requests.get("http://proof.nationalgeographic.com/2016/02/02/photo-of-the-day-best-of-january-3/")
soup = BeautifulSoup(html.text, "html.parser")

# find all JPEGS in our soup and write their "src" attribute to array
urls = []
for img in soup.find_all("img"):
    if img["src"].endswith("jpg"):
        print("endswith jpg")
        urls.append(str(img["src"]))
    print(str(img))

jpeg_no = 0
for url in urls:
    url_to_image(url, filename="NatGeoPix/" + str(jpeg_no) + ".jpg")
    jpeg_no += 1
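
One possible cause, not confirmed for the second URL above, is that some pages do not put the image address in a plain "src" attribute. Lazy-loading markup often leaves "src" empty or missing and stores the address in "srcset" or in a custom attribute, in which case img["src"] raises KeyError or never ends with "jpg". The sketch below is a more defensive collector under that assumption. The function name collect_image_urls, the "data-src" attribute name, and the example.com URLs are hypothetical choices for illustration. It cannot help when the images are inserted by JavaScript after the page loads, because requests only receives the initial HTML.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# "data-src" is an assumption: lazy-loading scripts use various attribute
# names, and this is only one common choice.
CANDIDATE_ATTRS = ("src", "data-src")


def collect_image_urls(html, base_url, extensions=(".jpg", ".jpeg")):
    """Return absolute image URLs found in the <img> tags of an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for img in soup.find_all("img"):
        # .get() returns None instead of raising KeyError when missing.
        candidates = [img.get(attr) for attr in CANDIDATE_ATTRS]
        # srcset holds comma-separated "url descriptor" pairs; keep the URLs.
        for entry in (img.get("srcset") or "").split(","):
            parts = entry.split()
            if parts:
                candidates.append(parts[0])
        for candidate in candidates:
            if not candidate:
                continue
            # Resolve relative addresses against the page URL.
            absolute = urljoin(base_url, candidate)
            # Ignore the query string when checking the file extension.
            path = absolute.split("?", 1)[0].lower()
            if path.endswith(extensions) and absolute not in urls:
                urls.append(absolute)
    return urls


sample = """
<img src="/a/one.jpg">
<img data-src="https://cdn.example.com/two.JPG?w=800">
<img srcset="three.jpeg 1x, three-2x.jpeg 2x">
<img alt="no source at all">
"""
found = collect_image_urls(sample, "http://example.com/page/")
print(found)
```

The returned list could replace the urls list built in the script, with html.text and the page address passed in as the two arguments. Printing str(img) for every tag, as the script already does, remains the quickest way to see which attributes a given site actually uses.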

Source: https://stackoverflow.com/questions/43985554
Updated: 2023-09-20 22:09

Best answer

It's better to let the Windows service start as quickly as possible. You could move the initialization code to a separate thread as follows:

protected override void OnStart(string[] args)
{
   Task.Run(() => StartSynchro());
}
