Using BS4 to scrape the methods and attributes of the selenium.chrome.WebDriver class

chrome.webdriver: selenium.webdriver.chrome.webdriver --- Selenium 4.18.1 documentation

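class selenium.webdriver.chrome.webdriver.WebDriver is the Selenium class for controlling the Chrome browser. WebDriver is one of Selenium's core classes: it drives the browser to perform actions such as opening pages, locating elements, and simulating user input.

Combined with Selenium's other classes and methods, selenium.webdriver.chrome.webdriver.WebDriver can be used for automated testing, web scraping, and similar tasks. It talks to Chrome through ChromeDriver, the driver program specific to the Chrome browser, which gives a script control over the browser.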

Screenshot of the extracted information:


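import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Selenium API page that documents selenium.webdriver.chrome.webdriver.WebDriver
url = "https://www.selenium.dev/selenium/docs/api/py/webdriver_chrome/selenium.webdriver.chrome.webdriver.html#module-selenium.webdriver.chrome.webdriver"
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop on an HTTP error instead of parsing an error page
soup = BeautifulSoup(response.text, 'html.parser')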

data = []

# Extract methods: look for <dl> elements whose class includes 'method'
methods = soup.find_all('dl', class_='method')
for method in methods:
    name = method.find('dt').find('code').text.strip()
    description = method.find('dd').text.strip()
    description = re.sub(r'\n\s*\n', '\n', description)  # collapse blank lines into a single newline
    data.append(['Method', name, description])

# Extract attributes: look for <dl> elements whose class includes 'attribute'
attributes = soup.find_all('dl', class_='attribute')
for attribute in attributes:
    name = attribute.find('dt').find('code').text.strip()
    description = attribute.find('dd').text.strip()
    description = re.sub(r'\n\s*\n', '\n', description)  # collapse blank lines into a single newline
    data.append(['Attribute', name, description])



# Store the rows in a DataFrame
df = pd.DataFrame(data, columns=['Type', 'Name', 'Description'])

# Write the DataFrame to an Excel file (pandas needs an Excel engine such as openpyxl for .xlsx)
excel_file = "methods_attributes.xlsx"
df.to_excel(excel_file, index=False)
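
The parsing logic can also be tried without network access. The following is a minimal sketch, assuming the documentation page marks up each entry roughly like the hypothetical SAMPLE_HTML snippet (a simplified stand-in, not copied from the real page). The helper name extract_rows is introduced here only for illustration, and the Excel step is left out.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the documentation markup.
SAMPLE_HTML = """
<dl class="method">
  <dt><code class="descname">quit</code>()</dt>
  <dd><p>Closes the browser.</p>

<p>Shuts down the driver executable.</p></dd>
</dl>
<dl class="attribute">
  <dt><code class="descname">name</code></dt>
  <dd><p>Returns the name of the underlying browser.</p></dd>
</dl>
"""


def extract_rows(html):
    """Return [type, name, description] rows for methods and attributes."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for kind, css_class in (("Method", "method"), ("Attribute", "attribute")):
        for entry in soup.find_all("dl", class_=css_class):
            name = entry.find("dt").find("code").text.strip()
            description = entry.find("dd").text.strip()
            # Collapse blank lines into a single newline.
            description = re.sub(r"\n\s*\n", "\n", description)
            rows.append([kind, name, description])
    return rows


rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

If the real page uses different tags for the entry name, the inner find() calls would return None and raise AttributeError, so the sample markup is worth comparing against the live page before relying on this.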

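The regular expression r'\n\s*\n' breaks down as follows:

\n: matches a newline character.

\s*: matches zero or more whitespace characters (spaces, tabs, and also further newlines).

\n: matches another newline character.

Taken together, r'\n\s*\n' matches two newlines with nothing but whitespace between them, that is, one or more blank lines in a row, including lines that contain only spaces or tabs.

Replacing each match with a single '\n' removes those blank lines and leaves one newline between the surrounding lines of text.

The code uses it so that a description never contains runs of blank lines, only single newlines between its paragraphs.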
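
The effect of the pattern can be checked on a small string. This is a minimal sketch; the text in raw is a made-up description, not taken from the Selenium documentation.

```python
import re

# Made-up description with runs of blank lines, one of them containing spaces.
raw = "Quits the driver.\n\n\n   \nCloses every associated window.\n\nUsage: driver.quit()"

# Each run of blank lines collapses into a single newline.
cleaned = re.sub(r'\n\s*\n', '\n', raw)

print(cleaned)
```

Because \s also matches newlines, one match can swallow several blank lines at once, so a single pass is enough.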

Background:
BeautifulSoup's find_all() method can combine several conditions and techniques to locate and extract the elements you need.

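Common ways of searching with find_all() include:

By tag name: soup.find_all('tag_name')

By class name: soup.find_all(class_='class_name')

By id: soup.find_all(id='element_id')

By attribute: soup.find_all(attrs={'attribute': 'value'})

By several conditions at once: soup.find_all('tag', class_='class_name', attrs={'attribute': 'value'})

By text content: soup.find_all(string='desired_text') (text= is the older name of the same argument)

By text content with a regular expression: soup.find_all(string=re.compile(r'regex_pattern'))

Among the descendants of an element: parent_element.find_all('child_tag')

By position in the result list: soup.find_all('tag_name')[index]

By the presence of an attribute: soup.find_all(attrs={'attribute': True}), or with a CSS selector, soup.select('[attribute]')

With a list comprehension: [tag for tag in soup.find_all() if condition]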
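
The variants listed above can be compared side by side on a small document. This is a minimal sketch; the HTML snippet and all variable names are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Made-up HTML used only to exercise the different search styles.
html = """
<div id="main" class="content">
  <p class="intro">Hello</p>
  <p class="note" data-kind="tip">Use quit() when done</p>
  <a href="/docs">Docs</a>
  <a>No link</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

paragraphs = soup.find_all("p")                        # by tag name
intro = soup.find_all(class_="intro")                  # by class name
main = soup.find_all(id="main")                        # by id
tips = soup.find_all(attrs={"data-kind": "tip"})       # by attribute value
note_tip = soup.find_all("p", class_="note", attrs={"data-kind": "tip"})  # combined
exact = soup.find_all(string="Hello")                  # exact text match
quit_text = soup.find_all(string=re.compile(r"quit\(\)"))  # regex on text
with_href = soup.find_all("a", attrs={"href": True})   # attribute is present
second_p = soup.find_all("p")[1]                       # by index in the result list

# List comprehension: names of tags whose stripped text is at most 5 characters.
short = [t.name for t in soup.find_all(True) if len(t.get_text(strip=True)) <= 5]

print(len(paragraphs), second_p["class"], short)
```

Note that searching by string returns the matching text nodes rather than the tags that contain them.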

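find(name, attrs, recursive, string, **kwargs): searches inside the current tag and returns the first matching element as a Tag object, or None if nothing matches.

find_all(name, attrs, recursive, string, limit, **kwargs): searches inside the current tag and returns all matching elements as a list.

find_parent(name, attrs, **kwargs): returns the nearest matching ancestor of the current tag as a Tag object.

find_next_sibling(name, attrs, string, **kwargs): returns the next matching sibling of the current tag as a Tag object.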
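
These four methods can be seen together on one small fragment. This is a minimal sketch; the HTML is a made-up fragment shaped like a single documentation entry.

```python
from bs4 import BeautifulSoup

# Made-up fragment shaped like one documentation entry.
html = "<dl class='method'><dt><code>quit</code></dt><dd>Closes the browser.</dd></dl>"
soup = BeautifulSoup(html, "html.parser")

code = soup.find("code")            # first <code> tag
missing = soup.find("table")        # no match, so the result is None
dt = code.find_parent("dt")         # nearest <dt> ancestor
dl = code.find_parent("dl")         # nearest <dl> ancestor
dd = dt.find_next_sibling("dd")     # next <dd> sibling of the <dt>

print(code.get_text(), dl["class"], dd.get_text())
```

Because find() returns None when nothing matches, chained calls such as find('dt').find('code') fail with AttributeError on unexpected markup.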

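tag.name: the tag name of the element.

tag.text or tag.get_text(): the text content of the element.

tag['attribute'] or tag.get('attribute'): the value of an attribute. The first form raises KeyError if the attribute is missing; get() returns None instead.

tag.contents: the list of the element's direct child nodes.

tag.parent or tag.parents: the parent node, or an iterator over all ancestor nodes.

tag.next_sibling or tag.previous_sibling: the next or previous sibling node.

tag.next_element or tag.previous_element: the next or previous node in document order, which may be a tag, a string, or a comment.

tag.has_attr('attribute'): whether the element has the given attribute.

tag.find_previous(name=None, attrs={}, string=None, **kwargs) and tag.find_all_previous(name=None, attrs={}, string=None, limit=None, **kwargs): search for matching elements that appear before this element in the document. The parameters work as in find() and find_all().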
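
The properties and methods above behave as follows on a small tree. This is a minimal sketch; the HTML is made up and deliberately contains no whitespace between tags, so that sibling navigation lands on tags rather than on text nodes.

```python
from bs4 import BeautifulSoup

# Made-up HTML with no whitespace between tags.
html = '<div id="box"><p class="a">one</p><p>two</p><span>three</span></div>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("p")
span = soup.find("span")

tag_name = first.name                            # tag name
text = first.get_text()                          # text content
classes = first["class"]                         # class is multi-valued, so this is a list
no_id = first.get("id")                          # missing attribute gives None
has_class = first.has_attr("class")              # attribute presence check
children = soup.div.contents                     # direct children of the <div>
parent_name = first.parent.name                  # direct parent
ancestors = [p.name for p in first.parents]      # all ancestors, nearest first
next_tag = first.next_sibling                    # the second <p>
next_node = first.next_element                   # the text node inside the first <p>
prev_p = span.find_previous("p")                 # nearest <p> before the <span>
all_prev = [p.get_text() for p in span.find_all_previous("p")]  # nearest first

print(tag_name, text, classes, ancestors, all_prev)
```

With pretty-printed HTML, next_sibling would often return a whitespace string instead of a tag, which is why find_next_sibling() is usually the safer choice.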

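tag.select_one(selector): finds elements using CSS selector syntax and returns the first match, or None if there is none.

tag.select(selector): finds elements using CSS selector syntax and returns all matches as a list.

CSS selectors are often a more convenient way to locate elements than chained find_all() calls. The details of select() are as follows:

Syntax: select(selector)

Parameters

selector: a string containing a CSS selector expression that specifies which elements to find.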

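CSS selector syntax examples

Type selector: tagname. For example, p selects all <p> tags. soup.select('p')

Class selector: .classname. For example, .content selects all elements whose class attribute includes content. soup.select('.content')

ID selector: #idname. For example, #footer selects the element whose id attribute is footer. soup.select('#footer')

Descendant selector: ancestor descendant. For example, div p selects all <p> tags that have a <div> ancestor at any depth, not only as the direct parent. soup.select('div p')

Child selector: parent > child. For example, div.content > p selects all <p> tags whose direct parent is a <div> with the class content. soup.select('div.content > p')

Descendant selector with a class: ancestor descendant. For example, div .content selects all elements with the class content that have a <div> ancestor. soup.select('div .content')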
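
The difference between the descendant and child selectors is easiest to see on a concrete tree. This is a minimal sketch; the HTML and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML: one <p> directly inside div.content, one nested deeper,
# one outside any <div>, and a span.content inside a second <div>.
html = """
<div class="content">
  <p>direct child</p>
  <section><p>nested</p></section>
</div>
<p>outside</p>
<div id="footer"><span class="content">footer text</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

all_p = soup.select("p")                    # every <p>
content = soup.select(".content")           # div.content and span.content
footer = soup.select("#footer")             # the element with id="footer"
descendants = soup.select("div p")          # <p> at any depth inside a <div>
children = soup.select("div.content > p")   # <p> directly under div.content
inside_div = soup.select("div .content")    # .content elements inside a <div>
first_p = soup.select_one("p")              # first match only
no_table = soup.select_one("table")         # no match, so None

print(len(all_p), len(descendants), len(children))
```

Here div p matches both the direct and the nested paragraph, while div.content > p matches only the direct one.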
