0302 Beautiful Soup
Beautiful Soup is an HTML/XML parsing library that provides simple, Pythonic functions for navigating, searching, and modifying the parse tree.
Installation:
pip3 install beautifulsoup4
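Fixing the "No module named 'bs4'" error in VS Code:
- 1. Press Command+Shift+P (Ctrl+Shift+P on Windows).
- 2. Python Interpreter -> Python 3.12.0 (that is, pick the interpreter that beautifulsoup4 was installed into).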
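The error usually means the editor is running a different interpreter than the one pip installed into. A minimal sketch for checking this from inside the script itself is shown here; the helper name `describe_environment` is hypothetical and only the standard library is used.

```python
import importlib.util
import sys


def describe_environment(module_name="bs4"):
    """Return (interpreter path, whether module_name can be imported here)."""
    found = importlib.util.find_spec(module_name) is not None
    return sys.executable, found


interpreter, has_bs4 = describe_environment()
print("Interpreter:", interpreter)
print("bs4 importable:", has_bs4)
if not has_bs4:
    # Installing through this exact interpreter avoids a pip/interpreter mismatch.
    print(f"Try: {interpreter} -m pip install beautifulsoup4")
```

If the printed interpreter path is not the Python 3.12.0 installation selected in VS Code, the two environments differ and the package has to be installed into the one that is actually running.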
Verify the installation with code
from bs4 import BeautifulSoup
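# First argument: the HTML text. Second argument: the parser.
# Supported parsers: lxml, xml, html.parser, html5lib (lxml and html5lib are installed separately).
soup = BeautifulSoup('<p>Hello</p>', 'lxml')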
print(soup.p.string)
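The check above fails if lxml itself is not installed, even when Beautiful Soup is. As a hedged sketch, the snippet below tries a list of parsers in order and falls back to the built-in `html.parser`; the helper name `make_soup` and the parser order are assumptions made for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=("lxml", "html.parser")):
    """Build a soup with the first parser in `parsers` that is installed."""
    for parser in parsers:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser library is missing; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


soup, parser_used = make_soup("<p>Hello</p>")
print(parser_used)    # whichever parser was found first
print(soup.p.string)  # Hello
```

Different parsers can build slightly different trees from malformed HTML, so a fallback like this is best limited to quick checks.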
Usage (p. 101)
A simple example
html = """<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="title" name="dromouse">
<b>The Dormouse's story</b>
</p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body><button name="b1"/><button name="b2"/><button name="b3"/><button name="b4"/>
</html>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
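Selector operations
Operation | Description | Example |
---|---|---|
prettify() | Formats the document with indentation | |
Node selector | Use .tag_name to select a node directly | soup.title selects the title node and returns <title>The Dormouse's story</title> |
Get the node name | .n.name reads the node name from the name attribute | soup.title.name returns title |
Get node attributes | 1. .n.attrs["xx_attr"] 2. .n["xx_attr"] | soup.p.attrs["name"] and soup.p["name"] both return dromouse |
Get node content | string | soup.title.string returns The Dormouse's story |
Nested node selection | .n1.n2.n3 | soup.head.title.string returns The Dormouse's story |
Get direct children | 1. contents (returns a list) 2. children (returns an iterator) | soup.body.contents and soup.body.children. Note that nodes nested inside a child are not listed separately |
Get parent/ancestor nodes | parent; parents (an iterator over all ancestors) | soup.p.parent returns the body node; soup.p.parents walks upward and yields body, html, [document] |
Sibling nodes | next_sibling (the next node), next_siblings, previous_sibling (the previous node), previous_siblings. Note: whitespace and line breaks also count as nodes | soup.button.next_sibling["name"] returns b2; soup.button.previous_sibling.name returns body |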
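The operations in the table can be tried on a small document. The sketch below uses an invented, whitespace-free HTML string (not the Dormouse example) and the built-in `html.parser`, so the sibling results do not depend on line breaks or on lxml being installed.

```python
from bs4 import BeautifulSoup

# Illustrative markup, written without whitespace between tags on purpose.
doc = (
    '<html><head><title>Demo</title></head><body>'
    '<p class="title" name="intro"><b>Bold</b></p>'
    '<a id="a1" href="/one">One</a><a id="a2" href="/two">Two</a>'
    '</body></html>'
)
soup = BeautifulSoup(doc, "html.parser")

print(soup.prettify())                     # indented view of the tree

node_name = soup.title.name                # 'title'
name_attr = soup.p["name"]                 # 'intro' (same as soup.p.attrs["name"])
class_attr = soup.p.attrs["class"]         # ['title'], class is multi-valued
nested = soup.head.title.string            # 'Demo'

children = [child.name for child in soup.body.contents]  # ['p', 'a', 'a']
ancestors = [parent.name for parent in soup.p.parents]   # ['body', 'html', '[document]']

next_id = soup.a.next_sibling["id"]        # 'a2'
prev_name = soup.a.previous_sibling.name   # 'p'

# With a line break between tags, the sibling is the whitespace text node.
spaced = BeautifulSoup("<div><i>1</i>\n<i>2</i></div>", "html.parser")
whitespace_sibling = spaced.i.next_sibling  # '\n'
```

The last two lines show why sibling navigation on real pages often needs a check for text nodes before accessing attributes.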
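Methods and attributes (p. 107)
Method | Description | Example |
---|---|---|
find_all(name,attrs,recursive,text,**kwargs) | Finds all nodes that match the conditions. Parameters: name: the node name. attrs: the node attributes (a dict; common attributes such as id and class [written class_] can be passed directly as keyword arguments). text: matches node text and returns the matching text itself (can be a regular expression; called string in Beautiful Soup 4.4.0 and later) | soup.find_all(name="p", attrs={"class": "story"}) soup.find_all(name="p", class_="story") soup.find_all(text=re.compile("Once upon")) returns ['Once upon a time there were three little sisters; and their names were\n '] |
find | A reduced version of find_all that returns only the first matching node | soup.find(name="p") |
select() | CSS selector | soup.select("body p a#link1") selects the a node with id="link1" under body -> p; soup.select("p.story") selects p nodes with class="story" |
string, get_text() | Get the text | soup.select("body p a#link2")[0].string and soup.select("body p a#link2")[0].get_text() both return Lacie |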
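The search methods above can be compared side by side. This is a minimal sketch on an invented HTML string, using the built-in `html.parser`; it passes `string=` instead of `text=`, which is the name the argument carries in Beautiful Soup 4.4.0 and later.

```python
import re

from bs4 import BeautifulSoup

# Illustrative markup, loosely modelled on the story example.
doc = (
    '<body><p class="story">Once upon a time '
    '<a class="sister" id="link1" href="/elsie">Elsie</a> and '
    '<a class="sister" id="link2" href="/lacie">Lacie</a></p>'
    '<p class="other">The end</p></body>'
)
soup = BeautifulSoup(doc, "html.parser")

# find_all: the attrs dict and the class_ keyword are equivalent here.
by_attrs = soup.find_all(name="p", attrs={"class": "story"})
by_keyword = soup.find_all("p", class_="story")

# find: only the first match.
first_p = soup.find("p")

# Matching on text returns the text nodes themselves, not the tags.
text_hits = soup.find_all(string=re.compile("Once upon"))  # ['Once upon a time ']

# select: CSS selectors always return a list.
link2 = soup.select("body p a#link2")[0]
sister_ids = [a["id"] for a in soup.select("p.story a")]   # ['link1', 'link2']

print(link2.string)      # Lacie
print(link2.get_text())  # Lacie
```

`string` returns None when a tag has more than one child, while `get_text()` concatenates all nested text, so the two only agree for simple nodes like the link here.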
This post is licensed under CC BY 4.0 by the author.