
0304parsel

parsel (p124) parses HTML and XML, supports extracting and modifying content with XPath and CSS selectors, and also integrates regular-expression extraction.

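parsel features, compared with other parsing libraries that each cover one approach

  • lxml supports XPath
  • Beautiful Soup processes HTML and XML through Python function calls
  • pyquery supports CSS selectors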

Installation: pip3 install parsel

Usage

html = '''<div>
    <ul>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>
'''
from parsel import Selector
selector = Selector(text=html)  # create a Selector object
item = selector.xpath("//li[contains(@class,'item-0')]")  # returns a parsel.selector.SelectorList object

1. Extracting text

# ① .get (returns a single str)
items = selector.css('.item-0')
for item in items:
    text = item.xpath('.//text()').get()
    print(text)

result = selector.xpath('//li[contains(@class, "item-0")]//text()').get()  # returns the first match: first item
print(result)
# ② .getall (returns a list)
result = selector.css('.item-0 *::text').getall()
print(result)

2. Extracting attributes

  • ① CSS extraction: ::attr(x_attr)
result = selector.css('.item-0.active a::attr(href)').get()  # get the href attribute
print(result)  # returns link3.html
  • ② XPath extraction
result = selector.xpath('//li[contains(@class, "item-0") and contains(@class, "active")]/a/@href').get()
print(result)  # returns link3.html

3. Regex extraction

result = selector.css('.item-0').re('link.*')
print(result)  # returns: ['link3.html"><span class="bold">third item</span></a></li>', 'link5.html">fifth item</a></li>']

# re_first: extracts the first text value that matches the pattern
result = selector.css('.item-0').re_first('<span class="bold">(.*?)</span>')
print(result)  # returns: third item
This post is licensed under CC BY 4.0 by the author.