Libvio.link Crawler Breakdown: Getting Past the Anti-Scraping Mechanisms

Hey everyone, V哥 here. Today I want to talk about crawling techniques for video sites like Libvio. To be clear up front: this is purely a technical exchange for study and research. Don't use it for anything illegal; if you do, V哥 takes no responsibility.

A Bit of Background First

A lot of you have messaged V哥 saying you wanted to scrape data from video sites for analysis, only to get stuck the moment the anti-scraping mechanisms kicked in. Today V哥 will use sites like Libvio as the example and walk you through how this stuff works.

Step 1: Scout the Target

Crawling is a bit like casing a joint (just kidding): you scout first. Open the browser's developer tools and see what kind of site we're dealing with.

python
# Start with the simplest possible request to test the waters
import requests

url = "https://www.libvio.link/"
response = requests.get(url)
print(response.status_code)
print(response.text[:500])

Run that and you'll find it either returns a 403 or a pile of mangled-looking JS, and you never get a normal page. That's the anti-scraping machinery at work.

Step 2: Analyze Its Anti-Scraping Tricks

In V哥's experience, sites like this usually rely on a few tricks:

Trick 1: User-Agent detection

This is the most basic one. The server checks your request headers to see whether you look like a real browser. The default UA from the requests library screams "crawler", so you get blocked straight away.

Trick 2: Cookie verification

The site plants a cookie on your first visit, and every subsequent request has to carry that cookie or you don't get in.

Trick 3: JS dynamic rendering

This one is nastier. The page content is loaded dynamically by JavaScript, so what requests gives you is just an empty shell; the real data only appears after the JS has run.

Trick 4: Cloudflare protection

Some sites sit behind Cloudflare, which means the 5-second challenge page and sometimes a CAPTCHA. That one is a real headache.
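
Before you start writing bypass code, it's worth a quick probe to figure out which of these tricks you're actually facing. Here's a minimal sketch; the checks (status codes, Cloudflare markers, body size) are rough heuristics V哥 is assuming, not guarantees:

python
import requests

def probe(url):
    """Rough guess at which anti-scraping mechanism is in play."""
    resp = requests.get(url, timeout=10)
    body = resp.text.lower()
    if 'cf-ray' in resp.headers or 'cloudflare' in body:
        return "Cloudflare markers present: expect a challenge page"
    if resp.status_code == 403:
        return "403: probably UA / header filtering"
    if resp.status_code == 200 and len(body) < 2000 and '<script' in body:
        return "Tiny page that is mostly JS: content is probably rendered client-side"
    return f"Status {resp.status_code}, {len(body)} bytes: inspect manually in devtools"

print(probe("https://www.libvio.link/"))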

Step 3: Break Through Them One by One

Let's write some code and clear these obstacles step by step.

Solving the User-Agent problem

python
import requests
from fake_useragent import UserAgent

# Use a random UA, switching to a new one on every request
ua = UserAgent()

headers = {
    'User-Agent': ua.random,
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
}

# If fake_useragent keeps throwing errors, you can maintain your own list
UA_LIST = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
]

import random
headers['User-Agent'] = random.choice(UA_LIST)

Solving the Cookie and Session problem

python
import requests

class LibvioSpider:
    def __init__(self):
        # Use a Session to keep the conversation going; cookies are managed automatically
        self.session = requests.Session()
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
            'Referer': 'https://www.libvio.link/',
        }
        self.session.headers.update(self.headers)
    
    def get_page(self, url):
        try:
            response = self.session.get(url, timeout=10)
            response.raise_for_status()
            return response.text
        except Exception as e:
            print(f"请求出错了兄弟:{e}")
            return None

# Usage
spider = LibvioSpider()
html = spider.get_page("https://www.libvio.link/")
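
To handle the first-visit cookie described in Trick 2, it usually helps to hit the home page once before requesting inner pages; the Session then carries whatever cookie the site set. A minimal sketch (the /type/1.html path is just the list page used later in this post):

python
spider = LibvioSpider()
spider.get_page("https://www.libvio.link/")       # first visit: the site may set a verification cookie here
print(spider.session.cookies.get_dict())          # inspect what was set
html = spider.get_page("https://www.libvio.link/type/1.html")  # later requests reuse the same cookie automatically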

Solving the JS dynamic rendering problem

This is the main event. V哥 has three approaches for you:

Approach 1: Brute-force it with Selenium

This is the most direct route: open a real browser, let the JS finish running, then grab the data.

python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

class SeleniumSpider:
    def __init__(self, headless=True):
        options = Options()
        if headless:
            options.add_argument('--headless')  # headless mode: no visible browser window
        
        # These flags matter: they make the browser look more like a real user
        options.add_argument('--disable-blink-features=AutomationControlled')
        options.add_argument('--disable-extensions')
        options.add_argument('--no-sandbox')
        options.add_argument('--disable-dev-shm-usage')
        options.add_argument('--disable-gpu')
        options.add_argument('--window-size=1920,1080')
        
        # Set the UA
        options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36')
        
        self.driver = webdriver.Chrome(options=options)
        
        # This call is key: it hides the webdriver flag from detection scripts
        self.driver.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
            'source': '''
                Object.defineProperty(navigator, 'webdriver', {
                    get: () => undefined
                })
            '''
        })
    
    def get_page(self, url, wait_time=5):
        self.driver.get(url)
        time.sleep(wait_time)  # wait for the JS to finish loading
        return self.driver.page_source
    
    def get_movie_list(self, url):
        """获取电影列表"""
        self.get_page(url)  # load the page; we re-read page_source after the explicit wait below
        
        # Wait for the target list element to appear
        try:
            WebDriverWait(self.driver, 10).until(
                EC.presence_of_element_located((By.CLASS_NAME, "stui-vodlist"))
            )
        except Exception:
            print("Page load timed out; we probably got blocked")
            return []
        
        # Parse with BeautifulSoup
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(self.driver.page_source, 'html.parser')
        
        movies = []
        items = soup.select('.stui-vodlist li')
        for item in items:
            title_tag = item.select_one('.title')
            link_tag = item.select_one('a')
            if title_tag and link_tag:
                movies.append({
                    'title': title_tag.get_text(strip=True),
                    'link': link_tag.get('href', '')
                })
        
        return movies
    
    def close(self):
        self.driver.quit()

# Usage example
spider = SeleniumSpider(headless=True)
movies = spider.get_movie_list("https://www.libvio.link/type/1.html")
for movie in movies:
    print(movie)
spider.close()

Approach 2: Playwright, faster than Selenium

Playwright is from Microsoft and performs noticeably better than Selenium; it's what V哥 prefers these days.

python
from playwright.sync_api import sync_playwright
import time

class PlaywrightSpider:
    def __init__(self, headless=True):
        self.playwright = sync_playwright().start()
        self.browser = self.playwright.chromium.launch(headless=headless)
        self.context = self.browser.new_context(
            user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            viewport={'width': 1920, 'height': 1080}
        )
        self.page = self.context.new_page()
        
        # Get past the webdriver check
        self.page.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined
            });
        """)
    
    def get_page(self, url, wait_selector=None):
        self.page.goto(url)
        
        if wait_selector:
            self.page.wait_for_selector(wait_selector, timeout=10000)
        else:
            time.sleep(3)
        
        return self.page.content()
    
    def get_movie_detail(self, url):
        """获取电影详情"""
        self.page.goto(url)
        time.sleep(2)
        
        # Wait for the page to finish loading
        self.page.wait_for_load_state('networkidle')
        
        # Use Playwright's selectors directly
        title = self.page.query_selector('.stui-content__detail h1')
        desc = self.page.query_selector('.stui-content__desc')
        
        return {
            'title': title.inner_text() if title else '',
            'description': desc.inner_text() if desc else ''
        }
    
    def close(self):
        self.browser.close()
        self.playwright.stop()

# Install: pip install playwright
# Then run: playwright install chromium
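
Usage mirrors the Selenium version; a quick sketch (the detail URL below is hypothetical, grab a real one from the list page):

python
spider = PlaywrightSpider(headless=True)
html = spider.get_page("https://www.libvio.link/type/1.html", wait_selector=".stui-vodlist")
detail = spider.get_movie_detail("https://www.libvio.link/detail/12345.html")  # hypothetical detail URL
print(detail)
spider.close()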

Approach 3: Hit the API endpoints directly

This is the approach V哥 recommends most: fast and cheap on resources. Many sites render the front end with JS, but the data actually comes from API endpoints, so we just call the endpoints directly and we're done.

python
import requests
import json

class APISpider:
    def __init__(self):
        self.session = requests.Session()
        self.base_url = "https://www.libvio.link"
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Accept': 'application/json, text/plain, */*',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Referer': self.base_url,
            'X-Requested-With': 'XMLHttpRequest',
        }
    
    def find_api(self):
        """
        Tips for finding the API:
        1. Open the browser's developer tools
        2. Switch to the Network tab
        3. Filter to XHR/Fetch requests
        4. Refresh the page or go to the next page
        5. Look for requests that return JSON data
        """
        pass
    
    def get_video_list(self, category_id, page=1):
        """
        Assume we have found the API endpoint.
        The real URL has to be discovered by inspecting traffic yourself.
        """
        api_url = f"{self.base_url}/api/video/list"
        params = {
            'category': category_id,
            'page': page,
            'limit': 20
        }
        
        try:
            response = self.session.get(api_url, params=params, headers=self.headers)
            if response.status_code == 200:
                return response.json()
        except Exception as e:
            print(f"接口请求失败:{e}")
        
        return None
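
Assuming the endpoint and parameters you found in devtools match the sketch above, usage would look like this (again, the endpoint and the 'list'/'title'/'url' fields are made up for illustration):

python
spider = APISpider()
data = spider.get_video_list(category_id=1, page=1)
if data:
    # The JSON structure depends entirely on the real API; 'list' is just a guess
    for item in data.get('list', []):
        print(item.get('title'), item.get('url'))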

Getting past Cloudflare protection

If the site sits behind Cloudflare things get trickier. Here are a couple of ideas from V哥:

python
# Option 1: the cloudscraper library
# pip install cloudscraper

import cloudscraper

scraper = cloudscraper.create_scraper(
    browser={
        'browser': 'chrome',
        'platform': 'windows',
        'mobile': False
    }
)

response = scraper.get("https://www.libvio.link/")
print(response.text)

python
# Option 2: undetected_chromedriver
# pip install undetected-chromedriver

import undetected_chromedriver as uc
import time

class StealthSpider:
    def __init__(self):
        options = uc.ChromeOptions()
        options.add_argument('--headless')
        
        self.driver = uc.Chrome(options=options)
    
    def get_page(self, url):
        self.driver.get(url)
        # Wait for the Cloudflare challenge to pass
        time.sleep(8)  # the "5-second shield" usually needs a bit longer than 5 seconds
        return self.driver.page_source
    
    def close(self):
        self.driver.quit()
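
Usage follows the same pattern as the Selenium version above; a quick sketch using the site's home page:

python
spider = StealthSpider()
html = spider.get_page("https://www.libvio.link/")
print(html[:300])  # if the challenge passed, this should be real page markup
spider.close()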

Step 4: A Complete, Hands-On Example

Alright, the pieces above were scattered; now V哥 will pull them together into one complete crawler project:

python
"""
Libvio video-site crawler - by V哥
Purpose: crawl movie lists and detail pages
Disclaimer: for learning and research only
"""

import time
import random
import json
from typing import List, Dict, Optional
from dataclasses import dataclass, asdict
from bs4 import BeautifulSoup

# Pick whichever of the following approaches you prefer
# from selenium import webdriver
from playwright.sync_api import sync_playwright

@dataclass
class MovieInfo:
    """电影信息数据类"""
    title: str
    link: str
    cover: str = ""
    year: str = ""
    category: str = ""
    description: str = ""
    play_links: Optional[List[str]] = None
    
    def __post_init__(self):
        if self.play_links is None:
            self.play_links = []

class LibvioSpider:
    def __init__(self, headless: bool = True):
        self.base_url = "https://www.libvio.link"
        self.headless = headless
        self.playwright = None
        self.browser = None
        self.page = None
        self._init_browser()
    
    def _init_browser(self):
        """初始化浏览器"""
        self.playwright = sync_playwright().start()
        self.browser = self.playwright.chromium.launch(
            headless=self.headless,
            args=['--disable-blink-features=AutomationControlled']
        )
        self.context = self.browser.new_context(
            user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            viewport={'width': 1920, 'height': 1080}
        )
        self.page = self.context.new_page()
        
        # Inject JS to get past automation detection
        self.page.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
            Object.defineProperty(navigator, 'plugins', {get: () => [1, 2, 3, 4, 5]});
            Object.defineProperty(navigator, 'languages', {get: () => ['zh-CN', 'zh', 'en']});
            window.chrome = {runtime: {}};
        """)
    
    def _random_delay(self, min_sec: float = 1, max_sec: float = 3):
        """随机延迟,模拟人类行为"""
        time.sleep(random.uniform(min_sec, max_sec))
    
    def get_movie_list(self, category: str = "1", page: int = 1) -> List[MovieInfo]:
        """
        Fetch the movie list
        category: category ID, e.g. 1 for movies, 2 for TV series
        page: page number
        """
        url = f"{self.base_url}/type/{category}-{page}.html"
        print(f"正在爬取:{url}")
        
        self.page.goto(url, wait_until='networkidle')
        self._random_delay(2, 4)
        
        # Parse the page
        html = self.page.content()
        soup = BeautifulSoup(html, 'html.parser')
        
        movies = []
        items = soup.select('.stui-vodlist__box, .stui-vodlist li')
        
        for item in items:
            try:
                link_tag = item.select_one('a')
                title_tag = item.select_one('.title, h4, .stui-vodlist__title')
                img_tag = item.select_one('img')
                
                if not link_tag:
                    continue
                
                movie = MovieInfo(
                    title=title_tag.get_text(strip=True) if title_tag else "Unknown",
                    link=self.base_url + link_tag.get('href', ''),
                    cover=img_tag.get('data-original', img_tag.get('src', '')) if img_tag else ""
                )
                movies.append(movie)
                
            except Exception as e:
                print(f"解析单个电影出错:{e}")
                continue
        
        print(f"本页共获取 {len(movies)} 部电影")
        return movies
    
    def get_movie_detail(self, movie: MovieInfo) -> MovieInfo:
        """获取电影详情"""
        print(f"正在获取详情:{movie.title}")
        
        self.page.goto(movie.link, wait_until='networkidle')
        self._random_delay(1, 2)
        
        html = self.page.content()
        soup = BeautifulSoup(html, 'html.parser')
        
        # Get the description
        desc_tag = soup.select_one('.stui-content__desc, .detail-content')
        if desc_tag:
            movie.description = desc_tag.get_text(strip=True)
        
        # Get the year, category and other info (the labels matched below are the site's Chinese labels)
        info_tags = soup.select('.stui-content__detail p')
        for tag in info_tags:
            text = tag.get_text()
            if '年份' in text:
                movie.year = text.replace('年份:', '').strip()
            if '类型' in text:
                movie.category = text.replace('类型:', '').strip()
        
        # Get the play links
        play_links = soup.select('.stui-content__playlist a')
        movie.play_links = [self.base_url + a.get('href', '') for a in play_links]
        
        return movie
    
    def search(self, keyword: str) -> List[MovieInfo]:
        """搜索电影"""
        url = f"{self.base_url}/search/{keyword}-------------.html"
        print(f"搜索关键词:{keyword}")
        
        self.page.goto(url, wait_until='networkidle')
        self._random_delay(2, 3)
        
        html = self.page.content()
        soup = BeautifulSoup(html, 'html.parser')
        
        movies = []
        items = soup.select('.stui-vodlist__box')
        
        for item in items:
            try:
                link_tag = item.select_one('a')
                title_tag = item.select_one('.title')
                
                if link_tag and title_tag:
                    movie = MovieInfo(
                        title=title_tag.get_text(strip=True),
                        link=self.base_url + link_tag.get('href', '')
                    )
                    movies.append(movie)
            except Exception:
                continue
        
        return movies
    
    def save_to_json(self, movies: List[MovieInfo], filename: str):
        """保存到JSON文件"""
        data = [asdict(movie) for movie in movies]
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(data, f, ensure_ascii=False, indent=2)
        print(f"数据已保存到 {filename}")
    
    def close(self):
        """清理资源"""
        if self.browser:
            self.browser.close()
        if self.playwright:
            self.playwright.stop()

def main():
    """主函数"""
    spider = LibvioSpider(headless=True)
    
    try:
        # Crawl the movie list
        all_movies = []
        for page in range(1, 4):  # crawl the first 3 pages
            movies = spider.get_movie_list(category="1", page=page)
            all_movies.extend(movies)
            spider._random_delay(3, 5)  # rest a bit between pages
        
        # Fetch details (only the first 5 here, as a demo)
        for movie in all_movies[:5]:
            spider.get_movie_detail(movie)
            spider._random_delay(2, 4)
        
        # Save the data
        spider.save_to_json(all_movies, "movies.json")
        
    except Exception as e:
        print(f"爬虫出错了:{e}")
    
    finally:
        spider.close()

if __name__ == "__main__":
    main()

Step 5: Some Lessons from V哥's Experience

Folks, with crawling, technique is one thing but experience matters just as much. A few points V哥 has learned:

1. Throttle your request rate; don't go too hard

python
import time
import random

def polite_request(url, session):
    """礼貌的请求,不给服务器太大压力"""
    time.sleep(random.uniform(2, 5))  # 随机等待2-5秒
    return session.get(url)

2. Handle exceptions and retries properly

python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def create_session_with_retry():
    session = requests.Session()
    
    # Configure the retry policy
    retry = Retry(
        total=3,  # retry up to 3 times in total
        backoff_factor=1,  # back-off factor between retries
        status_forcelist=[500, 502, 503, 504, 429]  # these status codes trigger a retry
    )
    
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    
    return session
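
Combined with the polite delays from the previous point, usage looks roughly like this (it reuses UA_LIST and polite_request from earlier in this post):

python
session = create_session_with_retry()
session.headers['User-Agent'] = random.choice(UA_LIST)  # reuse a UA string from step 3
response = polite_request("https://www.libvio.link/", session)
print(response.status_code)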

3. Use a proxy pool

python
class ProxyPool:
    def __init__(self):
        self.proxies = [
            'http://ip1:port',
            'http://ip2:port',
            'http://ip3:port',
        ]
        self.current = 0
    
    def get_proxy(self):
        proxy = self.proxies[self.current]
        self.current = (self.current + 1) % len(self.proxies)
        return {'http': proxy, 'https': proxy}

# Usage
pool = ProxyPool()
response = requests.get(url, proxies=pool.get_proxy())

4. Save progress so you can resume interrupted crawls

python
import json
import os

class ProgressManager:
    def __init__(self, filename='progress.json'):
        self.filename = filename
        self.progress = self._load()
    
    def _load(self):
        if os.path.exists(self.filename):
            with open(self.filename, 'r') as f:
                return json.load(f)
        return {'crawled_urls': [], 'last_page': 0}
    
    def save(self):
        with open(self.filename, 'w') as f:
            json.dump(self.progress, f)
    
    def is_crawled(self, url):
        return url in self.progress['crawled_urls']
    
    def mark_crawled(self, url):
        self.progress['crawled_urls'].append(url)
        self.save()
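
Hooked into the crawl loop from the complete example above, it looks like this sketch (all_movies and spider come from that example):

python
progress = ProgressManager()
for movie in all_movies:
    if progress.is_crawled(movie.link):
        continue  # already handled in a previous run, skip it
    spider.get_movie_detail(movie)
    progress.mark_crawled(movie.link)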

A Few Words to Wrap Up

Alright folks, that's it for today. V哥 will say it once more: the technology itself is neutral, but using it for anything illegal is not okay. We learn crawling to level up our skills and do data analysis and research, not to pirate content or infringe on anyone's rights.

Also, crawling is all about adapting to whatever you run into. Every site's anti-scraping strategy is different; the key is learning how to analyze and solve problems. When you hit something new, think it through, do some research, and don't give up at the first obstacle.

If you have questions, drop them in the comments and V哥 will reply when he sees them. See you next time!


Original content by V哥. Please credit the source when reposting.
