0701Selenium

Posted Sep 24, 2023 Updated Feb 22, 2024

By gold-duo 10 min read

Selenium(p212) 是一个自动化测试工具，利用它我们可以驱动浏览器执行特定的动作，如点击、下拉等等操作，对于一些 JavaScript 渲染的页面来说，此种抓取方式非常有效

安装

1.安装Chrome 浏览器
2.安装ChromeDriver
- 下载对应浏览器版本(115后的版本)
- 解压包:/usr/local/chromedriver
- 设置环境变量:export PATH="$PATH:/usr/local/chromedriver"
3.安装 Slenium python库：pip3 install selenium

使用

简单的例子

  
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

browser = webdriver.Chrome() #获取浏览器对象(也可以是Firefox()、Edge()、Safari())
try:
    browser.get('https://www.baidu.com')   #请求页面
    input = browser.find_element(By.ID,'kw') #查找id=kw的输入框
    input.send_keys('Python')
    input.send_keys(Keys.ENTER)

    wait = WebDriverWait(browser, 10)   #等待超时十秒
    wait.until(EC.presence_of_element_located((By.ID, 'content_left'))) #直到id=content_left元素出现
    print(browser.current_url)
    print(browser.get_cookies())
    print(browser.page_source)
finally:
    browser.close()

查找单个节点`find_element(by,value)`。by值可以是

ID
XPATH
LINK_TEXT
PARTIAL_LINK_TEXT
NAME
TAG_NAME
CLASS_NAME
CSS_SELECTOR

查找多个节点`find_elements(by,value)`(返回list)

  
browser.get('https://www.taobao.com')
values = browser.find_elements(By.CSS_SELECTOR,'.service-bd li')

节点交互

  
browser.get('https://www.taobao.com')
input = browser.find_element(By.ID,'q')
input.send_keys('iPhone')
time.sleep(1)
input.clear()
input.send_keys('iPad')
button = browser.find_element(By.CSS_SELECTOR,'button.btn-search.tb-bg')
button.click()#button点击

`ActionChains`:

一种自动执行低级交互的方法，例如鼠标移动、鼠标按钮操作、按键和上下文菜单交互。(调用ActionChains的方法时，不会立即执行，而是会将所有的操作按顺序存放在一个队列里，当你调用perform()方法时，队列中的时间会依次执行)。ActionChains的一些方法:

方法	作用
click(on_element=None)	单击鼠标左键
click_and_hold(on_element=None)	点击鼠标左键，不松开
context_click(on_element=None)	点击鼠标右键
double_click(on_element=None)	双击鼠标左键
drag_and_drop(source, target)	拖拽到某个元素然后松开
drag_and_drop_by_offset(source, xoffset, yoffset)	拖拽到某个坐标然后松开
key_down(value, element=None)	按下某个键盘上的键
key_up(value, element=None)	松开某个键
move_by_offset(xoffset, yoffset)	鼠标从当前位置移动到某个坐标
move_to_element(to_element)	鼠标移动到某个元素
move_to_element_with_offset(to_element, xoffset, yoffset)	移动到距某个元素（左上角坐标）多少距离的位置
release(on_element=None)	在某个元素位置松开鼠标左键
send_keys(*keys_to_send)	发送某个键到当前焦点的元素
send_keys_to_element(element, *keys_to_send)	发送某个键到指定元素
perform()	执行链中的所有动作

执行js code:

browser.execute_script(str_js_code)

  
browser.execute_script('window.scrollTo(0, document.body.scrollHeight)') #scroll到底部
browser.execute_script('alert("To Bottom")')#alert

element = browser.find_element_by_id("ok") 
browser.execute_script("arguments[0].click();", element)#arguments[0]=element

获取信息

获取属性:element.get_attribute('attr_name')
获取文本:element.text
获取id/位置/标签名/大小(宽高):element.id/location/tag_name/size

  
url = 'https://spa2.scrape.center/'
browser.get(url)
print(browser.find_element(By.CLASS_NAME,'logo-image').get_attribute('src'))
input = browser.find_element(By.CLASS_NAME,'logo-title')
print(input.text)
print(input.id)
print(input.location)
print(input.tag_name)
print(input.size)

切换iframe:

browser.switch_to('iframe_name')

Selenium不能直接操作子iframe里的节点，需要切换到子 iframe
子iframe切换回父级:browser.switch_to.parent_frame()
切换iframe在Selenium中总是很慢，慢到怀疑执行情况

延时等待

隐式：browser.implicitly_wait(seconds):执行查找节点等待时间(默认为0，全局设置),找不到抛异常

  
browser.implicitly_wait(10)
browser.get(url)
input = browser.find_element('logo-image')

显式：

  
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
browser.get('https://www.taobao.com/')
wait = WebDriverWait(browser, 10)               #等待10秒
callback=EC.presence_of_element_located((By.ID, 'q'))   #等待条件。节点出现，按id查找：id=q
input = wait.until(callback)

等待条件表

条件	作用	例子
title_is	判断当前页面的title是否精确等于预期
title_contains	判断当前页面的title是否包含预期字符串
presence_of_element_located	判断某个元素是否被加到了dom树里，并不代表该元素一定可见	EC.presence_of_element_located((By.ID, ‘q’))
visibility_of_element_located	判断某个元素是否可见.可见代表元素非隐藏，并且元素的宽和高都不等于0	EC.visibility_of_element_located((By.TAG_NAME, ‘h2’))
visibility_of_all_elements_located	判断所有元素是否可见.	EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ‘#index .item’))
visibility_of	跟上面的方法做一样的事情，只是上面的方法要传入locator，这个方法直接传定位到的element就好了
presence_of_all_elements_located	判断是否至少有1个元素存在于dom树中。	举个例子，如果页面上有n个元素的class都是’column-md-3’，那么只要有1个元素存在，这个方法就返回True
text_to_be_present_in_element	判断某个元素中的text是否包含了预期的字符串
text_to_be_present_in_element_value	判断某个元素中的value属性是否包含了预期的字符串
frame_to_be_available_and_switch_to_it	判断该frame是否可以switch进去，如果可以的话，返回True并且switch进去，否则返回False
invisibility_of_element_located	判断某个元素中是否不存在于dom树或不可见
element_to_be_clickable	判断某个元素中是否可见并且是enable的，这样的话才叫clickable
staleness_of	等某个元素从dom树中移除，注意，这个方法也是返回True或False
element_to_be_selected	判断某个元素是否被选中了,一般用在下拉列表
element_selection_state_to_be	判断某个元素的选中状态是否符合预期
element_located_selection_state_to_be	跟上面的方法作用一样，只是上面的方法传入定位到的element，而这个方法传入locator
alert_is_present	判断页面上是否存在alert，这是个老问题，很多同学会问到

反屏蔽 & 无界面模式：

移除window.navigaator='webdriver'

  
option = ChromeOptions()
option.add_argument('--headless')#无界面启动，也可以直接设置options.headless=True
option.add_experimental_option('excludeSwitches', ['enable-automation'])#隐藏WebDriver提示条
option.add_experimental_option('useAutomationExtension', False) #隐藏自动化扩展信息
browser = webdriver.Chrome(options=option)
browser.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {  #在页面加载前执行js
    'source': 'Object.defineProperty(navigator, "webdriver", {get: () => undefined})'
})#移除`window.navigaator='webdriver'`

  
browser.get_cookies()   #获取
browser.add_cookies({}) #添加
browser.delete_all_cookies()#删除

异常：

from selenium.common.exception import TimeoutException ,NoSuchElementException

前进后退：

browser.back()/forward()

选项卡(浏览器的tab)管理

  
browser.get('https://www.baidu.com')
browser.execute_script('window.open()')#打开一个新tab
print(browser.window_handles)
browser.switch_to.window(browser.window_handles[1])#选中第二个tab
browser.get('https://www.taobao.com')#在第二个tab打开页面

Q&A

'WebDriver' object has no attribute 'find_element_by_id': find_element_*之类的方法被删除了用find_element(By.ID/XPATH/CSS_SELECTOR,value)替换

  
from selenium.webdriver.common.by import By

实例(p269)

目的：和之前的用例一样（由于详情url带了加密的参数token），爬取https://spa2.scrape.center/页面里的所有条目详情。
遍历所有页面：还是通过构造100个页面索引遍历请求(https://spa2.scrape.center/pIdx)

详情页url的获取：

  
elements =find_elements(By.CSS_SELECTOR,'#index .item .name') #获取到10条记录
#。。。
url=urljoin(INDEX_URL, element.get_attribute('href') )#合并url (只会用INDEX_URL的base url合并)

浏览器请求页面和判读呈现的实现

  
def scrape_page(url, condition, locator: Tuple[str, str]): 
    logging.info('scraping %s', url)
    try:
        browser.get(url)                #请求url
        wait.until(condition(locator))  #等待函数传入的条件
    except TimeoutException:
        logging.error('error occurred while scraping %s', url, exc_info=True)
# 上面这个实现将wait.until的method构造分成两个参数有点迷惑了，下面构造成一个参数
def scrape_page(url, wait_method):
    try:
        browser.get(url)
        wait.until(wait_method)
    except TimeoutException:
        pass

scrape_page(url, EC.visibility_of_all_elements_located((By.CSS_SELECTOR, '#index .item')) ) #请求页面，等待所有的 #index .item 呈现
scrape_page(url, EC.visibility_of_element_located((By.TAG_NAME, 'h2')))                     #请求详情，等待h2标签呈现

实例实现的问题：

从页码请求 -> 解析出详情页url -> 详情页请求 ->解析并保存相求数据 ->下一页请求重复流程，都是一气呵成的。比如并没有考虑页面请求出错了是否要退出整个流程还是跳个执行下一页流程

python, 网络爬虫开发实战

This post is licensed under CC BY 4.0 by the author.