ScrapeGraphAI: When AI Meets Web Scraping, Crawling Becomes Simple and Smart

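Introduction

In today's era of information overload, acquiring and processing data matters more than ever. Whether for scientific research, business analysis, or market research, extracting data effectively is key to success. Traditional web scraping, however, usually means writing complex code, understanding the page structure in depth, and paying a high maintenance cost. With the rapid development of artificial intelligence (AI) and large language models (LLMs), this is changing. This article introduces ScrapeGraphAI, a tool that combines scraping with LLMs so that you describe the data you want in natural language instead of writing extraction rules by hand.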

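Challenges of Traditional Web Scraping

Web scraping, the process of extracting data from web pages, has long been a complex and technically demanding task. Traditional scraping faces the following main challenges:

Complex page structure: Page structure varies widely from site to site; the nesting depth of tags and the location of the data can differ completely. Writing a scraper requires analyzing the target page in detail and understanding its DOM structure, which is close to impossible for someone without programming experience.
Dynamic content loading: Modern pages increasingly load content dynamically with JavaScript, so the content is not all present on the initial load and may only appear after user interaction. Traditional scraping tools then have to simulate user behavior and handle asynchronous requests to obtain the complete data.
Anti-scraping mechanisms: Many sites protect their data with measures such as IP bans and CAPTCHAs. These make scraping harder and force scrapers to keep adjusting their strategy.
High maintenance cost: Page structure changes frequently, and the extraction rules must be updated with it. Scrapers therefore need frequent maintenance, which costs time and labor.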
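
To make the maintenance problem concrete, here is a minimal sketch of a rule-based extractor built only on Python's standard library. The markup in OLD_PAGE and NEW_PAGE is hypothetical and was invented for this illustration; the point is only that a hard-coded rule stops matching as soon as the page is redesigned, even though the content is unchanged.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text of every <h2 class="title"> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # The rule is hard-coded: only <h2 class="title"> counts as a title.
        if tag == "h2" and dict(attrs).get("class") == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.titles.append(data.strip())


def extract_titles(html):
    parser = TitleExtractor()
    parser.feed(html)
    return parser.titles


# Hypothetical page markup before and after a small redesign.
OLD_PAGE = '<div><h2 class="title">Rotary Pendulum RL</h2></div>'
NEW_PAGE = '<div><h3 class="project-name">Rotary Pendulum RL</h3></div>'

print(extract_titles(OLD_PAGE))  # ['Rotary Pendulum RL']
print(extract_titles(NEW_PAGE))  # [] -- same content, but the rule no longer matches
```

Every such rule has to be rewritten by hand after a redesign, which is the cost a prompt-driven approach tries to avoid.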

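Bringing in AI and LLMs

AI techniques, and large language models (LLMs) in particular, have shown strong capabilities in areas such as natural language processing and image recognition. Trained on massive amounts of data, these models can understand and generate human-like language and analyze complex problems. ScrapeGraphAI applies this technology to web scraping.

ScrapeGraphAI is a web scraping Python library that uses LLMs and direct graph logic to create scraping pipelines for websites, documents, and XML files. Just say which information you want to extract and the library will do it for you!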

🚀 Quick install

The reference page for Scrapegraph-ai is available on the official PyPI page: pypi.

pip install scrapegraphai

You also need to install Playwright for JavaScript-based scraping:

playwright install

Note: it is recommended to install the library in a virtual environment to avoid conflicts with other libraries 🐱

🔍 Demo

Official demo:

📖 Documentation

The documentation for ScrapeGraphAI can be found here.

Also check out the Docusaurus documentation.

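💻 Usage

You can use the SmartScraperGraph class to extract information from a website using a prompt.

SmartScraperGraph is a direct graph implementation that uses the most common nodes found in a web scraping pipeline. See the documentation for details.

Case 1: Extracting information using Ollama

Remember to download the model on Ollama separately!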

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "ollama/mistral",
        "temperature": 0,
        "format": "json",  # Ollama needs the format to be specified explicitly
        "base_url": "http://localhost:11434",  # set Ollama URL
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",  # set Ollama URL
    }
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the articles",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)
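
Because the Ollama models have to be pulled separately, it can help to check that they are present before running the graph. The sketch below assumes the local Ollama server exposes its list of pulled models at /api/tags and that each entry carries a "name" field such as "mistral:latest"; the helper names model_is_available and fetch_tags are made up for this example and are not part of ScrapeGraphAI.

```python
import json
from urllib.request import urlopen


def model_is_available(tags_payload, model_name):
    """Return True if model_name appears in an Ollama /api/tags style payload."""
    names = [m.get("name", "") for m in tags_payload.get("models", [])]
    # Names are assumed to carry a tag ("mistral:latest"), so compare the base name too.
    return any(n == model_name or n.split(":")[0] == model_name for n in names)


def fetch_tags(base_url="http://localhost:11434"):
    """Ask the local Ollama server which models have been pulled."""
    with urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


# Offline demonstration with a payload shaped like the assumed response.
sample = {"models": [{"name": "mistral:latest"}]}
print(model_is_available(sample, "mistral"))           # True
print(model_is_available(sample, "nomic-embed-text"))  # False
```

Against a running server, model_is_available(fetch_tags(), "mistral") performs the same check on live data; if it returns False, pull the model before calling run().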

Case 2: Extracting information using Docker

Note: before using the local model, remember to create the Docker container!

    docker-compose up -d
    docker exec -it ollama ollama pull stablelm-zephyr

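Instead of stablelm-zephyr, you can use any model available on Ollama or your own model. Note that the configuration below references ollama/mistral, so pull that model (or change the model name) for the example to work.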

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "ollama/mistral",
        "temperature": 0,
        "format": "json",  # Ollama needs the format to be specified explicitly
        # "model_tokens": 2000, # set context length arbitrarily
    },
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the articles",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects",  
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

Case 3: Extracting information using an OpenAI model

from scrapegraphai.graphs import SmartScraperGraph
OPENAI_API_KEY = "YOUR_API_KEY"

graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "gpt-3.5-turbo",
    },
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the articles",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)
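
Hard-coding the API key as above is fine for a quick test, but a key committed to a repository is easy to leak. One possible alternative, sketched below, reads the key from an environment variable and fails early if it is missing. The function name build_openai_config and the variable name OPENAI_API_KEY are assumptions made for this example; the returned dictionary has the same shape as the graph_config used above.

```python
import os


def build_openai_config(env=os.environ):
    """Build the graph config from the environment instead of a hard-coded key."""
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        # Fail before any request is made, with a message that says what to fix.
        raise RuntimeError("Set the OPENAI_API_KEY environment variable first")
    return {
        "llm": {
            "api_key": api_key,
            "model": "gpt-3.5-turbo",
        },
    }


# Demonstration with an explicit mapping; in real use, call build_openai_config().
config = build_openai_config({"OPENAI_API_KEY": "sk-example"})
print(config["llm"]["model"])  # gpt-3.5-turbo
```

The resulting dictionary can be passed as config=... to SmartScraperGraph exactly like the hand-written one.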

Case 4: Extracting information using Groq

import os

from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

# Read the Groq API key from the environment (None if the variable is unset)
groq_key = os.getenv("GROQ_APIKEY")

graph_config = {
    "llm": {
        "model": "groq/gemma-7b-it",
        "api_key": groq_key,
        "temperature": 0
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "temperature": 0,
        "base_url": "http://localhost:11434", 
    },
    "headless": False
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their description and the author.",
    source="https://perinim.github.io/projects",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

Case 5: Extracting information using Azure

import os

from langchain_openai import AzureChatOpenAI
from langchain_openai import AzureOpenAIEmbeddings
from scrapegraphai.graphs import SmartScraperGraph

llm_model_instance = AzureChatOpenAI(
    openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
    azure_deployment=os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT_NAME"]
)

embedder_model_instance = AzureOpenAIEmbeddings(
    azure_deployment=os.environ["AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT_NAME"],
    openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
)
graph_config = {
    "llm": {"model_instance": llm_model_instance},
    "embeddings": {"model_instance": embedder_model_instance}
}

smart_scraper_graph = SmartScraperGraph(
    prompt="""List me all the events, with the following fields: company_name, event_name, event_start_date, event_start_time, 
    event_end_date, event_end_time, location, event_mode, event_category, 
    third_party_redirect, no_of_days, 
    time_in_hours, hosted_or_attending, refreshments_type, 
    registration_available, registration_link""",
    source="https://www.hmhco.com/event",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

Case 6: Extracting information using Gemini

from scrapegraphai.graphs import SmartScraperGraph
GOOGLE_APIKEY = "YOUR_API_KEY"

# Define the configuration for the graph
graph_config = {
    "llm": {
        "api_key": GOOGLE_APIKEY,
        "model": "gemini-pro",
    },
}

# Create the SmartScraperGraph instance
smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the articles",
    source="https://perinim.github.io/projects",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

In all of the cases above, the output is a dictionary containing the extracted information, for example:

{
    'titles': [
        'Rotary Pendulum RL'
        ],
    'descriptions': [
        'Open Source project aimed at controlling a real life rotary pendulum using RL algorithms'
        ]
}
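
A dictionary of parallel lists like this is often easier to work with as one record per item. The sketch below assumes the result has exactly the "titles" and "descriptions" keys shown in the sample; since the keys are produced by the LLM, they may differ for other prompts, and to_records is a helper name invented for this example.

```python
def to_records(result):
    """Pair up the parallel lists into one dictionary per project."""
    titles = result.get("titles", [])
    descriptions = result.get("descriptions", [])
    # zip stops at the shorter list, so a length mismatch silently drops items.
    return [{"title": t, "description": d} for t, d in zip(titles, descriptions)]


# The sample output shown above.
result = {
    "titles": ["Rotary Pendulum RL"],
    "descriptions": [
        "Open Source project aimed at controlling a real life rotary pendulum using RL algorithms"
    ],
}

for record in to_records(result):
    print(f"{record['title']}: {record['description']}")
```

Using .get with a default keeps the helper from raising when the model returns a differently shaped dictionary; it then simply yields no records.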

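Core Advantages of ScrapeGraphAI

ScrapeGraphAI is a web scraping tool that combines AI with LLMs. It simplifies the traditional scraping workflow and aims to improve extraction efficiency and accuracy. Its core advantages include:

Natural language instructions: ScrapeGraphAI lets users scrape data with simple natural language instructions. For example, a user can enter "scrape the product names and prices from an e-commerce site" and the system analyzes the target page structure and extracts the requested data. This greatly lowers the barrier to entry, so even people without programming experience can use it.
Automated structure analysis: ScrapeGraphAI analyzes page structure and identifies the important data nodes automatically. Whether the content is static or dynamically loaded, it locates and extracts the requested data, so users do not have to hand-write complex extraction rules.
Intelligent handling of anti-scraping mechanisms: Using AI techniques, ScrapeGraphAI can simulate real user behavior to get around a site's anti-scraping mechanisms. For example, it can handle CAPTCHAs automatically, adjust the scraping frequency, and use proxy IPs, which keeps scraping stable and continuous.
Continuous learning and optimization: ScrapeGraphAI has self-learning capabilities. Through ongoing scraping practice and feedback, it can optimize its own algorithms and improve results. Even when facing new page structures or anti-scraping strategies, it can adapt quickly and keep scraping efficiently.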
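
Whatever the library does internally, the scraping frequency can also be limited on the caller side. The sketch below is a generic throttling helper and not a ScrapeGraphAI API: run_throttled, its parameters, and the stand-in scrape function are assumptions made for this example. In real use, scrape would wrap a call such as building a SmartScraperGraph for the given source and returning its run() result.

```python
import time


def run_throttled(sources, scrape, min_interval=2.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Call scrape(source) for each source, at least min_interval seconds apart."""
    results = []
    last_start = None
    for source in sources:
        if last_start is not None:
            # Wait only for whatever is left of the interval since the last start.
            remaining = min_interval - (clock() - last_start)
            if remaining > 0:
                sleep(remaining)
        last_start = clock()
        results.append(scrape(source))
    return results


# Stand-in scrape function so the sketch runs without network access.
pages = ["https://example.com/a", "https://example.com/b"]
print(run_throttled(pages, lambda url: {"source": url}, min_interval=0.1))
```

Passing sleep and clock as parameters is a design choice that keeps the helper easy to test with fake timers.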

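Note

Of course, however capable ScrapeGraphAI is, a good tool is no excuse to code your way into jail: all scraping must stay within what the law allows and comply with the relevant laws, regulations, and each site's terms of use. When using ScrapeGraphAI to scrape data, always keep compliance in mind and make sure what you do is legal. Obtaining data through legitimate channels avoids legal disputes and also safeguards the use and sharing of that data, so that the technology genuinely serves our work and lives.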
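
One practical habit that fits the advice above is to check a site's robots.txt before scraping it. The sketch below uses Python's standard urllib.robotparser; the rules in SAMPLE_ROBOTS and the example.com URLs are invented for illustration, and is_allowed is a helper name assumed for this example. A robots.txt check covers only the site's stated crawling preferences, so the terms of use and applicable law still need to be read separately.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, inlined so the sketch runs offline.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS, "https://example.com/projects"))      # True
print(is_allowed(SAMPLE_ROBOTS, "https://example.com/private/data"))  # False
```

For a live site, RobotFileParser can also fetch the file itself through set_url() followed by read().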