0705反爬案例分析

Posted Oct 6, 2023 Updated Feb 22, 2024

By gold-duo 3 min read

CSS位移偏、字体移反爬案例

CSS位移偏移反爬案例分析与实战(p282)

实例中的标题通过CSS(class=char)设置元素绝对位置(position:absolute),配合style设置元素的位移(style:left)来混淆元素的排列顺序

  
<h3 class="m-b-sm name">
    <span class="char" style="left: 0px;">明</span>
    <span class="char" style="left: 16px;">朝</span>
    <span class="char" style="left: 32px;">那</span>
    <span class="char" style="left: 48px;">些</span>
    <span class="char" style="left: 64px;">事</span>
    <span class="char" style="left: 80px;">儿</span>
</h3>

爬取：

将位移(style.left)按从小到大排序，再拼接字符串即可

书中的实例的讲解还用到了PyQery来对节点操作实属多余了，即使单用Selenium、Pyeteer、Playwright都可以完全交给js来提取出名称，下面是三种实现方法

  
#Selenium
nodes=item.find_element(By.SELECTOR,'h3.m-b-sm').find_elements(By.SELECTOR,'span.char')
name=browser.execute_script('''arguments[0]=>arguments[0].map(n=>{
                r={}
                r.t=n.innerText
                r.l=parseInt(n.style.left.replace(/(px)/g,""))
                return r}).sort((a,b)=>a.l-b.l).map(n=>n.t).join('')''',nodes)

js='''nodes=>nodes.map(n=>{
                r={}
                r.t=n.innerText
                r.l=parseInt(n.style.left.replace(/(px)/g,""))
                return r}).sort((a,b)=>a.l-b.l).map(n=>n.t).join('')'''
#Pyeteer
h3=item.J('h3.m-b-sm')
name=h3.JJeval('span.char',js)

#Playwright
h3=await item.query_selector('h3.m-b-sm')
name=await h3.eval_on_selector_all('span.char',js)

字体反爬案例分析与实战(p287)

实例中电影的scoe（评分）并不是直接显示成文本，而是通过css映射内容.

:before伪元素

其将成为匹配选中的元素的第一个子元素。常通过 content 属性来为一个元素添加修饰性的内容。

  
<style>
q::before {
content: "«";
color: blue;
}
q::after {
content: "»";
color: red;
}
</style>
<q>一些引用</q>, 他说，<q>比没有好。</q><!--«一些引用, 他说，比没有好。»-->

爬取：

先将对应的css文件下载下来(书中是通过request请求，如果用Pyeteer,Playwright也可用page.on监听response获取到css，不过涉及同步问题可能更复杂)，将icon-xxx:before中的content用map保存起来(正则表达式.icon-(.*?):before\{content:"(.*?)"\})；爬取节点时获取到class属性提取到icon-xxx编号从map里取相应的内容,再将内容串起来。

  
<style>
.icon {
    font-family: scrape!important;
    speak: none;
    font-style: normal;
    font-weight: 400;
    font-variant: normal;
    text-transform: none;
    line-height: 1;
    -webkit-font-smoothing: antialiased;
    -moz-osx-font-smoothing: grayscale
}
.icon-789:before {
    content: "9"
}
.icon-981:before {
    content: "."
}
.icon-504:before {
    content: "5"
}
</style>
<p class="score m-t-md m-b-n-sm">
    <span><i class="icon icon-789"></i></span>
    <span><i class="icon icon-981"></i></span>
    <span><i class="icon icon-504"></i></span>
</p><!--显示：9.5-->

更简单的做法

按照这种需求完全可以用更简单的方式获取。完全无需提前下载css文件，直接通过window.getComputedStyle(n,'::before').getPropertyValue('content')获取伪类内容，下面是我实现的核心代码

  
p=await item.query_selector('p.score') #item是页面中每一个 item(class=item)
score= await score.eval_on_selector_all('span i.icon','''nodes=>nodes.map(n=>getComputedStyle(n,'::before').getPropertyValue('content') ).join('').replaceAll('"','')''')

python, 网络爬虫开发实战

This post is licensed under CC BY 4.0 by the author.

CSS位移偏移反爬案例分析与实战(p282)

爬取：

字体反爬案例分析与实战(p287)

:before伪元素

爬取：

更简单的做法

Trending Tags