
0302 Beautiful Soup

Beautiful Soup is an HTML/XML parsing library. It provides simple, Pythonic functions for navigating, searching, and modifying the parse tree.

Installation:

pip3 install beautifulsoup4

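Fixing the No module named 'bs4' error in VS Code:

  • 1. Press Command+Shift+P (Ctrl+Shift+P on Windows)
  • 2. Run Python: Select Interpreter and pick the interpreter that beautifulsoup4 was installed into (here, Python 3.12.0)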
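
To check which interpreter is actually running a script, and whether it can see bs4, a small standard-library check like the following may help. It is only a sketch; the printed path will differ on every machine.

```python
import importlib.util
import sys

# Path of the interpreter running this script; compare it with the one
# selected in VS Code.
print(sys.executable)

# find_spec returns None when the package cannot be imported from this interpreter.
bs4_available = importlib.util.find_spec("bs4") is not None
print("bs4 importable:", bs4_available)
```

If the second line prints False, the selected interpreter is probably not the one that pip3 installed into.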

Verifying the installation with code

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>Hello</p>', 'lxml')  # first argument is the HTML text, second is the parser (supported parsers: lxml, xml, html.parser, html5lib)
print(soup.p.string)
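
The lxml parser is a separate package, so the line above fails if only beautifulsoup4 is installed. One possible way to handle that is sketched below. The helper name make_soup is hypothetical; it falls back to the built-in html.parser when the requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello</p>")
print(soup.p.string)  # Hello
```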

Usage (p. 101)

A simple example

html = """<html>
  <head>
    <title>The Dormouse's story</title>
  </head>
  <body>
    <p class="title" name="dromouse">
      <b>The Dormouse's story</b>
    </p>
    <p class="story">Once upon a time there were three little sisters; and their names were
      <a class="sister" href="http://example.com/elsie" id="link1">
        <!-- Elsie --></a>,
      <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and
      <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
    <p class="story">...</p>
  </body><button name="b1"/><button name="b2"/><button name="b3"/><button name="b4"/>
</html>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
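
After parsing, it can help to look at the tree the parser actually built. The sketch below assumes a short hypothetical fragment with an unclosed <p> tag, and uses the built-in html.parser only so that no extra install is needed. prettify() returns the tree as an indented string, with the missing closing tag filled in.

```python
from bs4 import BeautifulSoup

# Hypothetical short fragment (note the unclosed <p>), standing in for the
# longer html string above.
fragment = '<p class="title"><b>The Dormouse\'s story</b>'
soup = BeautifulSoup(fragment, "html.parser")

# prettify() puts each tag and text node on its own line, indented by depth.
pretty = soup.prettify()
print(pretty)
```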

Selector operations

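| Operation | Description | Example |
|---|---|---|
| prettify() | Formats the parse tree as an indented string | |
| Node selector | Use .node_name to select a node directly | soup.title selects the title node and returns <title>The Dormouse's story</title> |
| Get the node name | .n.name: read the name attribute to get the node name | soup.title.name returns title |
| Get node attributes | 1. .n.attrs["xx_attr"]; 2. .n["xx_attr"] | soup.p.attrs["name"]; soup.p["name"] returns dromouse |
| Get node content | string | soup.title.string returns The Dormouse's story |
| Nested node selection | .n1.n2.n3 | soup.head.title.string returns The Dormouse's story |
| Get direct children | 1. contents (returns a list); 2. children (returns an iterator) | soup.body.contents (note: nodes nested inside a child are not listed separately); soup.body.children |
| Get the parent / ancestors | parent; parents (returns all ancestors as an iterator) | soup.p.parent returns the body node; soup.p.parents walks upward and yields body, html, [document] |
| Siblings | next_sibling (the next node); next_siblings; previous_sibling (the previous node); previous_siblings. Note: whitespace and newlines between tags also count as nodes | soup.button.next_sibling["name"] returns b2; soup.button.previous_sibling.name returns body |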
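
The operations above can be tried on a smaller document. The following is a minimal sketch, assuming a hypothetical compact string with no whitespace between tags and the built-in html.parser. Results on the full example with lxml may differ in detail.

```python
from bs4 import BeautifulSoup

# Hypothetical compact document: no whitespace between tags, so no
# whitespace-only text nodes show up among the children and siblings.
doc = ('<html><head><title>The Dormouse\'s story</title></head>'
       '<body><p class="title" name="dromouse"><b>The Dormouse\'s story</b></p>'
       '<a id="link1">Elsie</a><a id="link2">Lacie</a></body></html>')
soup = BeautifulSoup(doc, "html.parser")

print(soup.title.name)           # title
print(soup.p["name"])            # dromouse
print(soup.p.attrs["class"])     # ['title'] (class is multi-valued, so a list)
print(soup.head.title.string)    # The Dormouse's story

# contents is a list of direct children; children iterates over the same nodes.
print([child.name for child in soup.body.contents])   # ['p', 'a', 'a']

# parents starts at the parent, not at the node itself.
print([parent.name for parent in soup.p.parents])     # ['body', 'html', '[document]']

first_link = soup.a
print(first_link.next_sibling["id"])       # link2
print(first_link.previous_sibling.name)    # p

# With a newline between tags, the newline itself is the next sibling.
spaced = BeautifulSoup("<b>x</b>\n<i>y</i>", "html.parser")
print(repr(spaced.b.next_sibling))         # '\n'
```

Because text nodes have no attributes, indexing a whitespace sibling with ["name"] raises an error. Checking the sibling's type first, or using find_next_sibling(), avoids this.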

Methods and attributes (p. 107)

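| Method | Description | Example |
|---|---|---|
| find_all(name, attrs, recursive, text, **kwargs) | Finds all nodes that match the conditions. Parameters: name: the node name; attrs: node attributes as a dict (common ones such as id and class, written class_, can be passed directly as keyword arguments); text: matches node text and returns the text itself, and may be a regular expression (newer Beautiful Soup versions name this argument string) | soup.find_all(name="p", attrs={"class": "story"}); soup.find_all(name="p", class_="story"); soup.find_all(text=re.compile("Once upon")) returns ['Once upon a time there were three little sisters; and their names were\n '] |
| find | A reduced version of find_all that returns only the first matching node | soup.find(name="p") |
| select() | CSS selector | soup.select("body p a#link1") selects the a node with id="link1" under body -> p; soup.select("p.story") selects p nodes with class="story" |
| string; get_text() | Get the text | soup.select("body p a#link2")[0].string; soup.select("body p a#link2")[0].get_text() returns Lacie |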
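
The search methods above can be exercised together. This is a sketch on a hypothetical shortened document (the first link's text is an assumption), again using the built-in html.parser. It uses string= instead of text=, the name this argument has had since Beautiful Soup 4.4.0.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical shortened version of the example document.
doc = ('<body><p class="title"><b>The Dormouse\'s story</b></p>'
       '<p class="story">Once upon a time there were three little sisters'
       '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
       '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>'
       '</p></body>')
soup = BeautifulSoup(doc, "html.parser")

# attrs takes a dict; class_ is the keyword shortcut for the class attribute.
story_paragraphs = soup.find_all(name="p", attrs={"class": "story"})
same_paragraphs = soup.find_all("p", class_="story")
print(len(story_paragraphs), len(same_paragraphs))   # 1 1

# Matching on text returns the matching strings, not the enclosing tags.
print(soup.find_all(string=re.compile("Once upon")))
# ['Once upon a time there were three little sisters']

# find returns only the first match.
print(soup.find("p")["class"])                       # ['title']

# select takes a CSS selector and always returns a list.
link = soup.select("body p a#link2")[0]
print(link.string, link.get_text(), link["href"])    # Lacie Lacie http://example.com/lacie
print([a["id"] for a in soup.select("p.story a")])   # ['link1', 'link2']
```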
This post is licensed under CC BY 4.0 by the author.