Parsing HTML Tables with BS4

I've been trying different methods of scraping data from this site (http://nflcombineresults.com/nflcombinedata.php?year=1999&pos=WR&college=) and can't seem to get any of them to work. I've tried playing with the indices given, but can't seem to make it work. I think I've tried too many things at this point, so if someone could point me in the right direction I would really appreciate it.

I would like to pull all of the information and export it to a .csv file, but at this point I'm just trying to get the name and position to print to get started.

Here's my code:

import urllib2
from bs4 import BeautifulSoup
import re

url = ('http://nflcombineresults.com/nflcombinedata.php?year=1999&pos=&college=')

page = urllib2.urlopen(url).read()

soup = BeautifulSoup(page)
table = soup.find('table')

for row in table.findAll('tr')[0:]:
    col = row.findAll('tr')
    name = col[1].string
    position = col[3].string
    player = (name, position)
    print "|".join(player)

Here's the error I'm getting: line 14, in name = col[1].string IndexError: list index out of range.

--UPDATE--

Ok, I've made a little progress. It now allows me to go from start to finish, but it requires knowing how many rows are in the table. How would I get it to just go through them until the end? Updated Code:

import urllib2
from bs4 import BeautifulSoup
import re

url = ('http://nflcombineresults.com/nflcombinedata.php?year=1999&pos=&college=')

page = urllib2.urlopen(url).read()

soup = BeautifulSoup(page)
table = soup.find('table')


for row in table.findAll('tr')[1:250]:
    col = row.findAll('td')
    name = col[1].getText()
    position = col[3].getText()
    player = (name, position)
    print "|".join(player)
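
One possible way to walk every row without knowing the row count is to drop the upper slice bound and skip any row that has too few cells. The sketch below is written for Python 3 (the code above is Python 2) and runs on a small inline table instead of the live page. The sample markup, the helper names extract_players and to_csv, and the CSV column names are all hypothetical; the column indices 1 and 3 are taken from the code above.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>Year</th><th>Name</th><th>College</th><th>POS</th></tr>
  <tr><td>1999</td><td>Player A</td><td>College A</td><td>WR</td></tr>
  <tr><td colspan="4">spacer row</td></tr>
  <tr><td>1999</td><td>Player B</td><td>College B</td><td>QB</td></tr>
</table>
"""


def extract_players(html):
    """Return (name, position) tuples for every data row of the first table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    players = []
    # No slice bounds: iterate over all rows, however many there are.
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        # Header rows (<th> only) and short rows lack index 3, so skip them
        # instead of raising IndexError.
        if len(cells) < 4:
            continue
        name = cells[1].get_text(strip=True)
        position = cells[3].get_text(strip=True)
        players.append((name, position))
    return players


def to_csv(players):
    """Serialize the (name, position) tuples as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name", "position"])
    writer.writerows(players)
    return buffer.getvalue()


players = extract_players(SAMPLE_HTML)
print(to_csv(players))
```

To write a real file, the same csv.writer calls can target a file opened with open("players.csv", "w", newline="") instead of the in-memory buffer.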

Source: https://stackoverflow.com/questions/22078620
Updated: 2022-06-27 10:06

Best answer

In Audacity you have to check the 'Use custom mix' radio button in the Import/Export section of the preferences. This will let you export multi-channel files, and manually assign tracks to channels.

Other than that, plain old .wav works fine for this.

But you can also use SoX to create the files in a more automated manner.

Manually you can combine (or 'merge' as it's referred to in the documentation) five distinct files into a single five-channel file like this:

sox -M chan1.wav chan2.wav chan3.wav chan4.wav chan5.wav multi.wav

To automate the process I put together a short Bash routine for producing a multichannel file with staggered test tones:

NUM=5    # Number of channels
LEN=2    # Length of each test tone, in seconds
OVL=0.5  # Overlap between test tones, in seconds

# A one-channel base file containing simple white noise,
# faded at both ends with a quarter-wave envelope to ensure
# smooth equal-power transitions
sox -n -b 24 -c 1 out.wav synth $LEN whitenoise fade q $OVL -0 $OVL

# Instead of white noise you can for example make a 1kHz tone
# like this:
# sox -n -b 24 -c 1 out.wav synth $LEN sine 1k fade q $OVL -0 $OVL

# Or a sweep from 10Hz to 10kHz like this:
# sox -n -b 24 -c 1 out.wav synth $LEN sine 10-10k fade q $OVL -0 $OVL

# Produces a sequence of the number of seconds each channel
# shall be padded with
SEQ=$(for ((i=1; i<=NUM; i++))
do 
  echo "$i 1 - [$LEN $OVL -]x * p" | dc  # reverse-Polish arithmetic
done)

echo $SEQ

# Padding the base file to various degrees and saving them separately
for j in $SEQ
do 
  sox -c 1 out.wav outpad${j}.wav pad $j
done

# Finding the just-produced individual files
FIL=$(ls | grep ^outpad)

# Merging the individual files into a single multi-channel file
sox -M $FIL multi.wav

rm $FIL  # removing the individual files

# Producing a multi-channel waveform plot
ffmpeg -i multi.wav -y -filter_complex "showwavespic=s=2400x900:split_channels=1" -frames:v 1 waveform.png

# displaying the waveform plot
open waveform.png

As the waveform plot clearly shows, the result is a single file with five channels, each carrying the same content shifted in time:

multichannel audio file with staggered and cross-faded test tones
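
The padding arithmetic that the dc line performs can be checked with a few lines of Python. This is only a sketch of the formula (i - 1) * (LEN - OVL) using the values from the script above; the function name pad_offsets is hypothetical and nothing here calls SoX.

```python
NUM = 5    # number of channels, as in the script above
LEN = 2.0  # length of each test tone, in seconds
OVL = 0.5  # overlap between test tones, in seconds


def pad_offsets(num, length, overlap):
    """Seconds of leading silence for channels 1..num."""
    step = length - overlap  # each channel starts this much later than the previous one
    return [(i - 1) * step for i in range(1, num + 1)]


offsets = pad_offsets(NUM, LEN, OVL)
last_tone_end = offsets[-1] + LEN  # when the final channel's tone finishes

print(offsets)
print(last_tone_end)
```

With these values each channel starts 1.5 seconds after the previous one, so consecutive tones overlap by the 0.5 seconds that the fade covers.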

More on reverse-Polish arithmetic using dc: http://wiki.bash-hackers.org/howto/calculate-dc

More on displaying waveforms using ffmpeg: https://trac.ffmpeg.org/wiki/Waveform
