Python web scraping: Selenium notes

To do

Read up on XPath and get comfortable with it; it is closely tied to Selenium.
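
As a warm-up for the XPath item above, here is a small sketch of a few common expression shapes. It uses the standard library's xml.etree.ElementTree, which supports only a limited subset of XPath, so take it as an illustration of the syntax rather than of everything Selenium accepts (Selenium hands the full expression to the browser). The PAGE snippet and its ids and names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for a real page.
PAGE = """
<html>
  <body>
    <form id="login">
      <input name="user" type="text" />
      <select name="lang">
        <option value="py">Python</option>
        <option value="r">R</option>
      </select>
    </form>
    <a href="/signin">Sign In</a>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# .//select[@name='lang']/option : every <option> under the named <select>
option_values = [o.get("value") for o in root.findall(".//select[@name='lang']/option")]

# .//a[@href] : every link that carries an href attribute
link_texts = [a.text for a in root.findall(".//a[@href]")]

# .//form[@id='login']/input : direct <input> children of one specific form
input_names = [i.get("name") for i in root.findall(".//form[@id='login']/input")]

print(option_values, link_texts, input_names)
```

In Selenium the same expressions are usually written with a leading `//` instead of `.//`, for example `//select[@name='lang']/option`.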

Selenium

Disabling image loading in Selenium makes pages load much faster.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Selenium with image loading disabled: much faster.
option = Options()
option.page_load_strategy = "none"  # do not block get() until the page has fully loaded
prefs = {"profile.managed_default_content_settings.images": 2}  # 2 = block images
option.add_experimental_option("prefs", prefs)  # apply the no-image preference

driver = webdriver.Chrome(options=option)  # options= replaces the deprecated chrome_options=

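Handling iframes with BeautifulSoup

One solution is to read the iframe's src attribute, then request and parse the content at that URL.
The other is to switch into the iframe with Selenium and parse its source: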

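from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By

driver.get(url)
iframe = driver.find_elements(By.TAG_NAME, "iframe")[1]  # the second iframe on the page
driver.switch_to.frame(iframe)  # the most important step
soup = BeautifulSoup(driver.page_source, "html.parser")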
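
The first approach mentioned above (reading each iframe's src and fetching it separately) could look roughly like the sketch below. It uses only the standard library's HTMLParser to stay self-contained; BeautifulSoup would do the same job. The names IframeSrcCollector and iframe_urls, and the example page and URLs, are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeSrcCollector(HTMLParser):
    """Collect the src attribute of every <iframe> in a document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def iframe_urls(page_html, base_url):
    """Return absolute URLs for all iframes found in page_html."""
    collector = IframeSrcCollector()
    collector.feed(page_html)
    # src is often relative, so resolve it against the page's own URL.
    return [urljoin(base_url, src) for src in collector.sources]


# Hypothetical page and base URL, for illustration only.
html = '<p>intro</p><iframe src="/embed/1"></iframe><iframe src="https://cdn.example.com/w"></iframe>'
urls = iframe_urls(html, "https://example.com/post/42")
print(urls)
```

With a live browser, driver.page_source and driver.current_url would be the natural inputs, and each returned URL can then be requested and parsed on its own.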

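Common personal mistakes, misconceptions, and pitfalls

driver.execute_script(JS) is the call that actually runs JavaScript.
Note that the method is execute_script, not execute.

Page waits. This part is fairly critical.

Explicit waits. They look more cumbersome, and I rarely use them.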

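from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

wait = WebDriverWait(driver, 10)  # wait up to 10 seconds
element = wait.until(EC.element_to_be_clickable((By.ID, 'someid')))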
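
To make the explicit wait less mysterious: what it does is essentially poll a condition until it returns something truthy or the timeout runs out. The sketch below is a simplified stand-in written with the standard library only, not Selenium's actual implementation; wait_until, element_ready, and the timing values are made up for illustration. The real WebDriverWait also ignores NoSuchElementException between polls by default, which this sketch does not model.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Stand-in for "the element is not there yet": succeeds on the third poll.
attempts = {"count": 0}


def element_ready():
    attempts["count"] += 1
    return "element" if attempts["count"] >= 3 else None


found = wait_until(element_ready, timeout=2.0, poll=0.01)
print(found, attempts["count"])
```

The point of the design is that the wait ends as soon as the condition holds, instead of always sleeping for a fixed time.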

Implicit waits. Recommended.

driver.implicitly_wait(10) # seconds

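Locating elements

Before locating elements, add this line; it is safer.

bot.implicitly_wait(10)  # this line is key; bot is the WebDriver instance

Methods for finding elements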

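# Selenium 4.3+ removed the find_element_by_*() helpers; pass a By strategy to find_element() instead.
find_element(By.ID, value)                  # was find_element_by_id()
find_element(By.NAME, value)                # was find_element_by_name(); "name" is an attribute of the tag
find_element(By.XPATH, value)               # was find_element_by_xpath()
find_element(By.LINK_TEXT, value)           # was find_element_by_link_text(), e.g. 'Sign In'
find_element(By.PARTIAL_LINK_TEXT, value)   # was find_element_by_partial_link_text()
find_element(By.TAG_NAME, value)            # was find_element_by_tag_name()
find_element(By.CLASS_NAME, value)          # was find_element_by_class_name()
find_element(By.CSS_SELECTOR, value)        # was find_element_by_css_selector()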

Basic setup and imports

import os
import random
import json
import pickle
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import pyautogui as pt
import pyperclip

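Switching frames

When you hit an iframe, it is best to switch into it; see https://blog.csdn.net/huilan_same/article/details/52200586

driver.switch_to.frame(0)  # locate the frame by its index; the first frame is 0

Clicking elements. For elements that a normal click cannot reach, use the method below.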

def real_click(self, driver, ele):
    # Move the mouse over the element, then click it through an action chain.
    actions = ActionChains(driver)
    actions.move_to_element(ele)
    actions.click(ele)
    actions.perform()
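
Another common workaround for elements that refuse a normal click is to try the native click first and fall back to a JavaScript click. The sketch below is duck-typed so that it runs without a browser: click_with_fallback and the two underscore-prefixed classes are made-up names, and the fake element simply raises an error to simulate a blocked click. With real Selenium you would probably pass narrower exception types, such as ElementClickInterceptedException, as fallback_on.

```python
def click_with_fallback(driver, ele, fallback_on=(Exception,)):
    """Try a native click; if it fails, click through JavaScript instead."""
    try:
        ele.click()
        return "native"
    except fallback_on:
        # arguments[0] is the element passed as the second parameter.
        driver.execute_script("arguments[0].click();", ele)
        return "javascript"


class _BlockedElement:
    """Test double: behaves like an element covered by an overlay."""

    def click(self):
        raise RuntimeError("element is covered by an overlay")


class _RecordingDriver:
    """Test double: records scripts instead of running them in a browser."""

    def __init__(self):
        self.calls = []

    def execute_script(self, script, *args):
        self.calls.append((script, args))


demo_driver = _RecordingDriver()
how = click_with_fallback(demo_driver, _BlockedElement())
print(how)
```

A JavaScript click bypasses the browser's visibility checks, so it is best kept as a fallback rather than the default.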

Executing JS and scrolling the page

# Scroll down first, then back to the top.
# To jump to the very bottom: window.scrollTo(0, document.body.scrollHeight);

js = "var q=document.documentElement.scrollTop=500"  # scroll down to 500 px from the top
bot.execute_script(js)

js2 = "document.body.scrollTop=document.documentElement.scrollTop=0;"  # back to the top
bot.execute_script(js2)
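
For pages that load more content as you scroll, the scrollTo call from the comment above is often wrapped in a loop that stops once the page height no longer grows. This is a sketch under assumptions: scroll_to_bottom, pause, and max_rounds are names chosen here, and FakeDriver exists only so the loop can be tried without a browser.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll down until the page height stops growing; return the rounds used."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_no in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_no
        last_height = new_height
    return max_rounds


class FakeDriver:
    """Test double: the page grows twice, then stays at 3000 px."""

    def __init__(self):
        self.heights = [1000, 2000, 3000, 3000]
        self.scrolls = 0

    def execute_script(self, script):
        if script.startswith("return"):
            return self.heights[min(self.scrolls, len(self.heights) - 1)]
        self.scrolls += 1
        return None


fake = FakeDriver()
rounds = scroll_to_bottom(fake, pause=0)
print(rounds, fake.scrolls)
```

With a real driver the call would be something like scroll_to_bottom(bot, pause=1.5); max_rounds keeps the loop from running forever on endless feeds.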

Filling in forms. I still need to read up on this.

from selenium.webdriver.common.by import By

element = driver.find_element(By.XPATH, "//select[@name='name']")
choices = element.find_elements(By.TAG_NAME, "option")
for c in choices:
    print("Value is: %s" % c.get_attribute("value"))
    c.click()  # select this option

Wrapping up some methods I use often

@staticmethod
def save_html(bot):             # save the page HTML
    filename = 'ret.html'
    data = bot.page_source
    with open(filename, 'w', encoding='utf-8') as f:  # explicit encoding avoids errors on non-ASCII pages
        f.write(data)
    print("HTML saved!")

@staticmethod
def real_click(driver, ele):    # click an element
    actions = ActionChains(driver)
    actions.move_to_element(ele)
    actions.click(ele)
    actions.perform()

@staticmethod
def send_word(ele, word):       # type text into an input box
    ele.clear()
    ele.send_keys(word)
    ele.send_keys(Keys.RETURN)  # press Enter to submit

Interesting and useful things in the source code

Driver

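driver.current_url # a property, not a method, so no parentheses

driver.page_source # save the HTML source for local debugging, which cuts down on network requests

driver.save_screenshot('foo.png')

driver.get_log('driver')

driver.title returns the page title directly, which works well as a file name.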
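
One caveat when using the title as a file name: titles often contain characters such as / or ? that file systems reject. The sketch below cleans the title first and then writes page_source under that name. safe_filename and save_page are made-up helper names, and the set of replaced characters is an assumption aimed at Windows and Unix, not an exhaustive rule.

```python
import re
from pathlib import Path


def safe_filename(title, default="page", max_len=80):
    """Turn a page title into a file-system-friendly name."""
    # Collapse runs of reserved characters and whitespace into one underscore.
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", title).strip("_.")
    return cleaned[:max_len] or default


def save_page(driver, directory="."):
    """Write driver.page_source to <directory>/<cleaned title>.html."""
    path = Path(directory) / (safe_filename(driver.title) + ".html")
    path.write_text(driver.page_source, encoding="utf-8")
    return path


print(safe_filename('Sign In | Example: "Docs"?'))
```

save_page only reads the title and page_source attributes, so it works with any driver object, for example save_page(bot, "pages").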

WebElement

ele.id # available directly; this is Selenium's internal element reference, not the HTML id attribute

ele.get_attribute("class") # used very often
