0304parsel
parsel(p124)可解析 HTML和XML,并支持支持使用XPath和CSS选择器对内容提取修改,同时还融合了正则表达式的提取功能。
parsel特点
- LXML支持XPath
- Beautiful Soup以python的函数方式处理html和xml
- pyquery支持CSS选择器
安装:pip3 install parsel
使用
1
2
3
4
5
6
7
8
9
10
11
12
13
html = '''<div>
<ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
'''
from parsel import Selector
selector = Selector(text=html) #创建Selector对象
item = selector.xpath("//li[cotains(@class,'item-0')]") #返回的是 parsel.selector.SelectorList 对象
1.提取文本
1
2
3
4
5
6
7
8
9
10
11
#①.get(返回一个str)
items = selector.css('.item-0')
for item in items:
text = item.xpath('.//text()').get()
print(text)
result = selector.xpath('//li[contains(@class, "item-0")]//text()').get() #返回第一个 first item
print(result)
#②.getall(返回list)
result = selector.css('.item-0 *::text').getall()
print(result)
2.提取属性
- ①.css提取:
::attr(x_attr)
1
2
result = selector.css('.item-0.active a::attr(href)').get() #获取href属性
print(result) #返回link3.html
- ②.XPath提取
1
2
result = selector.xpath('//li[contains(@class, "item-0") and contains(@class, "active")]/a/@href').get()
print(result)#返回link3.html
3.正则提取
1
2
3
4
5
6
result = selector.css('.item-0').re('link.*')
print(result) #返回:['link3.html"><span class="bold">third item</span></a></li>', 'link5.html">fifth item</a></li>']
# re_first:提取第一个符合规则的文本值
result = selector.css('.item-0').re_first('<span class="bold">(.*?)</span>')
print(result) #返回:third item
This post is licensed under CC BY 4.0 by the author.