Hey everyone, V哥 here. Today let's talk about scraping techniques for video sites like Libvio. Fair warning up front: this is purely a technical discussion for learning and research. Don't use it for anything illegal; if you do, that's on you, not V哥.
A bit of background first
A lot of you have messaged V哥 saying you want to scrape data from video sites for analysis, only to get a headache from the anti-scraping mechanisms the moment you start. So today V哥 will use Libvio-style sites as an example and walk you through how this all works.
Step 1: Scout the target first
Writing a scraper is a bit like casing a joint (just kidding), you have to scout first. Open the browser's developer tools and take a look at what this site is actually doing.
python
# Start with the simplest possible request to test the waters
import requests
url = "https://www.libvio.link/"
response = requests.get(url)
print(response.status_code)
print(response.text[:500])
Run this and you'll find you either get a 403 back or a pile of unreadable JS, and no usable page at all. That's the anti-scraping machinery at work.
Step 2: Analyze its anti-scraping tricks
V哥's summary: sites like this usually rely on a few standard moves (a quick way to check which one you're up against comes right after this list):
Trick 1: User-Agent checks
This is the most basic one. The server inspects your request headers to see whether you came from a real browser. The default UA of the requests library screams "bot", so you get blocked on the spot.
Trick 2: Cookie validation
The site sets a cookie on your first visit, and every later request has to carry that cookie or you don't get in.
Trick 3: JS-rendered content
This one is nastier. The page content is loaded dynamically with JavaScript, so what requests fetches is just an empty shell; the real data only shows up after the JS has run.
Trick 4: Cloudflare protection
Some sites sit behind Cloudflare, complete with the 5-second challenge page and captcha challenges. That one is more of a pain.
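Before breaking these one by one, it helps to figure out which layer is actually blocking you. Here's a rough diagnostic sketch: compare a bare request against one with browser-like headers and look at the status code, the Server header, and the body. The "cloudflare in Server header" check and the rules of thumb in the comments are heuristics V哥 is assuming here, not guarantees.
python
import requests

url = "https://www.libvio.link/"

browser_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
}

for label, headers in [('bare request', {}), ('browser-like headers', browser_headers)]:
    resp = requests.get(url, headers=headers, timeout=10)
    served_by_cf = 'cloudflare' in resp.headers.get('Server', '').lower()
    print(f"{label}: status={resp.status_code}, "
          f"cloudflare={served_by_cf}, "
          f"body starts with: {resp.text[:80]!r}")

# Rough reading of the results (not foolproof):
# - 403 on the bare request but 200 with browser headers -> probably just UA/header checks
# - 503 plus a Cloudflare Server header -> Cloudflare challenge page
# - 200 but the body is mostly <script> tags with no real content -> JS-rendered page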
Step 3: Break them one by one
Let's write some code and knock these obstacles down one at a time.
Fixing the User-Agent problem
python
import requests
from fake_useragent import UserAgent

# Grab a random UA and rotate it on every request
ua = UserAgent()
headers = {
    'User-Agent': ua.random,
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
}

# If fake_useragent keeps throwing errors, you can also maintain your own list
UA_LIST = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
]
import random
headers['User-Agent'] = random.choice(UA_LIST)
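With those headers built, the actual request is the same requests.get call from Step 1, just with headers attached. A minimal usage sketch, reusing the headers dict from the snippet above:
python
import requests

url = "https://www.libvio.link/"
response = requests.get(url, headers=headers, timeout=10)  # headers from the snippet above
print(response.status_code)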
Fixing the Cookie and Session problem
python
import requests
class LibvioSpider:
    def __init__(self):
        # Use a Session to keep the conversation going; cookies are managed automatically
        self.session = requests.Session()
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
            'Referer': 'https://www.libvio.link/',
        }
        self.session.headers.update(self.headers)

    def get_page(self, url):
        try:
            response = self.session.get(url, timeout=10)
            response.raise_for_status()
            return response.text
        except Exception as e:
            print(f"Request failed: {e}")
            return None
# Usage
spider = LibvioSpider()
html = spider.get_page("https://www.libvio.link/")
Fixing the JS dynamic rendering problem
This is the main event. V哥 will give you three approaches:
Approach 1: Brute-force it with Selenium
This is the most direct option: spin up a real browser, let the JS finish running, then grab the data.
python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
class SeleniumSpider:
    def __init__(self, headless=True):
        options = Options()
        if headless:
            options.add_argument('--headless')  # Headless mode: no visible browser window
        # These flags matter; they make the browser look more like a real user's
        options.add_argument('--disable-blink-features=AutomationControlled')
        options.add_argument('--disable-extensions')
        options.add_argument('--no-sandbox')
        options.add_argument('--disable-dev-shm-usage')
        options.add_argument('--disable-gpu')
        options.add_argument('--window-size=1920,1080')
        # Set the UA
        options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36')
        self.driver = webdriver.Chrome(options=options)
        # This call is key; it helps slip past some webdriver checks
        self.driver.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
            'source': '''
                Object.defineProperty(navigator, 'webdriver', {
                    get: () => undefined
                })
            '''
        })

    def get_page(self, url, wait_time=5):
        self.driver.get(url)
        time.sleep(wait_time)  # Give the JS time to finish loading
        return self.driver.page_source

    def get_movie_list(self, url):
        """Fetch the movie list"""
        self.get_page(url)
        # Wait for a specific element to appear
        try:
            WebDriverWait(self.driver, 10).until(
                EC.presence_of_element_located((By.CLASS_NAME, "stui-vodlist"))
            )
        except Exception:
            print("Page load timed out; we may have been blocked by the anti-bot layer")
            return []
        # Parse the rendered page with BeautifulSoup
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(self.driver.page_source, 'html.parser')
        movies = []
        items = soup.select('.stui-vodlist li')
        for item in items:
            title_tag = item.select_one('.title')
            link_tag = item.select_one('a')
            if title_tag and link_tag:
                movies.append({
                    'title': title_tag.get_text(strip=True),
                    'link': link_tag.get('href', '')
                })
        return movies

    def close(self):
        self.driver.quit()
# Usage example
spider = SeleniumSpider(headless=True)
movies = spider.get_movie_list("https://www.libvio.link/type/1.html")
for movie in movies:
    print(movie)
spider.close()
Approach 2: Playwright, faster than Selenium
Playwright comes from Microsoft and performs noticeably better than Selenium; it's what V哥 prefers these days.
python
from playwright.sync_api import sync_playwright
import time
class PlaywrightSpider:
    def __init__(self, headless=True):
        self.playwright = sync_playwright().start()
        self.browser = self.playwright.chromium.launch(headless=headless)
        self.context = self.browser.new_context(
            user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            viewport={'width': 1920, 'height': 1080}
        )
        self.page = self.context.new_page()
        # Hide the webdriver flag to dodge detection
        self.page.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined
            });
        """)

    def get_page(self, url, wait_selector=None):
        self.page.goto(url)
        if wait_selector:
            self.page.wait_for_selector(wait_selector, timeout=10000)
        else:
            time.sleep(3)
        return self.page.content()

    def get_movie_detail(self, url):
        """Fetch movie details"""
        self.page.goto(url)
        time.sleep(2)
        # Wait for the page to settle
        self.page.wait_for_load_state('networkidle')
        # Query elements directly with Playwright selectors
        title = self.page.query_selector('.stui-content__detail h1')
        desc = self.page.query_selector('.stui-content__desc')
        return {
            'title': title.inner_text() if title else '',
            'description': desc.inner_text() if desc else ''
        }

    def close(self):
        self.browser.close()
        self.playwright.stop()
# Install: pip install playwright
# Then run: playwright install chromium
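For completeness, here's how V哥 would use the class above. Same caveat as before: the category URL and the .stui-vodlist selector are assumptions about the site's layout, so verify them in devtools first.
python
spider = PlaywrightSpider(headless=True)
try:
    # Wait for the list container instead of sleeping blindly (selector is an assumption)
    html = spider.get_page("https://www.libvio.link/type/1.html", wait_selector='.stui-vodlist')
    print(len(html))
finally:
    spider.close()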
Approach 3: Go straight for the API
This is V哥's favorite way: fast and cheap on resources. Plenty of sites render the frontend with JS, but the data itself comes from an API, so we just call the API directly and we're done.
python
import requests
import json
class APISpider:
    def __init__(self):
        self.session = requests.Session()
        self.base_url = "https://www.libvio.link"
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Accept': 'application/json, text/plain, */*',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Referer': self.base_url,
            'X-Requested-With': 'XMLHttpRequest',
        }

    def find_api(self):
        """
        How to find the API:
        1. Open the browser's developer tools
        2. Switch to the Network tab
        3. Filter for XHR/Fetch requests
        4. Refresh the page or flip to the next page
        5. Look for the requests that return JSON data
        """
        pass

    def get_video_list(self, category_id, page=1):
        """
        Assumes we've already located the API endpoint.
        The real URL is something you need to find yourself by capturing the traffic.
        """
        api_url = f"{self.base_url}/api/video/list"
        params = {
            'category': category_id,
            'page': page,
            'limit': 20
        }
        try:
            response = self.session.get(api_url, params=params, headers=self.headers)
            if response.status_code == 200:
                return response.json()
        except Exception as e:
            print(f"API request failed: {e}")
        return None
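And a quick usage sketch. Keep in mind the /api/video/list path above is a stand-in that the code already flags as hypothetical; swap in whatever endpoint you actually find in the Network tab.
python
spider = APISpider()
data = spider.get_video_list(category_id=1, page=1)
if data:
    # The shape of the JSON depends entirely on the real endpoint; this is just a peek
    print(type(data), list(data)[:5] if isinstance(data, dict) else data[:5])
else:
    print("No data back; the endpoint, params, or headers probably need adjusting")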
Dealing with Cloudflare protection
If the site is behind Cloudflare, things get more annoying. A few options from V哥:
python
# Option 1: the cloudscraper library
# pip install cloudscraper
import cloudscraper

scraper = cloudscraper.create_scraper(
    browser={
        'browser': 'chrome',
        'platform': 'windows',
        'mobile': False
    }
)
response = scraper.get("https://www.libvio.link/")
print(response.text)
python
# Option 2: undetected_chromedriver
# pip install undetected-chromedriver
import time

import undetected_chromedriver as uc

class StealthSpider:
    def __init__(self):
        options = uc.ChromeOptions()
        options.add_argument('--headless')
        self.driver = uc.Chrome(options=options)

    def get_page(self, url):
        self.driver.get(url)
        # Give the Cloudflare check time to clear (the classic 5-second challenge)
        time.sleep(8)
        return self.driver.page_source

    def close(self):
        self.driver.quit()
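Usage mirrors the Selenium spider above. One extra tip, hedged: if the challenge page never clears, try commenting out the --headless flag, since Cloudflare tends to be better at spotting headless browsers.
python
spider = StealthSpider()
try:
    html = spider.get_page("https://www.libvio.link/")
    print(html[:500])
finally:
    spider.close()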
Step 4: A complete, end-to-end example
Alright, everything so far has been bits and pieces. Now V哥 will pull it together into one complete scraper project:
python
"""
Libvio视频网站爬虫 - V哥出品
功能:爬取电影列表和详情信息
声明:仅供学习研究使用
"""
import time
import random
import json
from typing import List, Dict, Optional
from dataclasses import dataclass, asdict
from bs4 import BeautifulSoup
# 你可以根据需要选择以下任一种方式
# from selenium import webdriver
from playwright.sync_api import sync_playwright
@dataclass
class MovieInfo:
    """Data class for a single movie"""
    title: str
    link: str
    cover: str = ""
    year: str = ""
    category: str = ""
    description: str = ""
    play_links: List[str] = None

    def __post_init__(self):
        if self.play_links is None:
            self.play_links = []

class LibvioSpider:
    def __init__(self, headless: bool = True):
        self.base_url = "https://www.libvio.link"
        self.headless = headless
        self.playwright = None
        self.browser = None
        self.page = None
        self._init_browser()

    def _init_browser(self):
        """Start the browser"""
        self.playwright = sync_playwright().start()
        self.browser = self.playwright.chromium.launch(
            headless=self.headless,
            args=['--disable-blink-features=AutomationControlled']
        )
        self.context = self.browser.new_context(
            user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            viewport={'width': 1920, 'height': 1080}
        )
        self.page = self.context.new_page()
        # Inject JS to paper over the usual automation tells
        self.page.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
            Object.defineProperty(navigator, 'plugins', {get: () => [1, 2, 3, 4, 5]});
            Object.defineProperty(navigator, 'languages', {get: () => ['zh-CN', 'zh', 'en']});
            window.chrome = {runtime: {}};
        """)

    def _random_delay(self, min_sec: float = 1, max_sec: float = 3):
        """Random delay to look a bit more human"""
        time.sleep(random.uniform(min_sec, max_sec))
    def get_movie_list(self, category: str = "1", page: int = 1) -> List[MovieInfo]:
        """
        Fetch one page of the movie list.
        category: category ID, e.g. 1 for movies, 2 for TV series
        page: page number
        """
        url = f"{self.base_url}/type/{category}-{page}.html"
        print(f"Crawling: {url}")
        self.page.goto(url, wait_until='networkidle')
        self._random_delay(2, 4)
        # Parse the rendered page
        html = self.page.content()
        soup = BeautifulSoup(html, 'html.parser')
        movies = []
        items = soup.select('.stui-vodlist__box, .stui-vodlist li')
        for item in items:
            try:
                link_tag = item.select_one('a')
                title_tag = item.select_one('.title, h4, .stui-vodlist__title')
                img_tag = item.select_one('img')
                if not link_tag:
                    continue
                movie = MovieInfo(
                    title=title_tag.get_text(strip=True) if title_tag else "Unknown",
                    link=self.base_url + link_tag.get('href', ''),
                    cover=img_tag.get('data-original', img_tag.get('src', '')) if img_tag else ""
                )
                movies.append(movie)
            except Exception as e:
                print(f"Failed to parse one list item: {e}")
                continue
        print(f"Got {len(movies)} movies on this page")
        return movies
    def get_movie_detail(self, movie: MovieInfo) -> MovieInfo:
        """Fetch the detail page for one movie"""
        print(f"Fetching details: {movie.title}")
        self.page.goto(movie.link, wait_until='networkidle')
        self._random_delay(1, 2)
        html = self.page.content()
        soup = BeautifulSoup(html, 'html.parser')
        # Description
        desc_tag = soup.select_one('.stui-content__desc, .detail-content')
        if desc_tag:
            movie.description = desc_tag.get_text(strip=True)
        # Year, category and so on (the page labels are in Chinese, so we match on them)
        info_tags = soup.select('.stui-content__detail p')
        for tag in info_tags:
            text = tag.get_text()
            if '年份' in text:
                movie.year = text.replace('年份:', '').strip()
            if '类型' in text:
                movie.category = text.replace('类型:', '').strip()
        # Play links
        play_links = soup.select('.stui-content__playlist a')
        movie.play_links = [self.base_url + a.get('href', '') for a in play_links]
        return movie
    def search(self, keyword: str) -> List[MovieInfo]:
        """Search for movies"""
        url = f"{self.base_url}/search/{keyword}-------------.html"
        print(f"Searching for: {keyword}")
        self.page.goto(url, wait_until='networkidle')
        self._random_delay(2, 3)
        html = self.page.content()
        soup = BeautifulSoup(html, 'html.parser')
        movies = []
        items = soup.select('.stui-vodlist__box')
        for item in items:
            try:
                link_tag = item.select_one('a')
                title_tag = item.select_one('.title')
                if link_tag and title_tag:
                    movie = MovieInfo(
                        title=title_tag.get_text(strip=True),
                        link=self.base_url + link_tag.get('href', '')
                    )
                    movies.append(movie)
            except Exception:
                continue
        return movies
    def save_to_json(self, movies: List[MovieInfo], filename: str):
        """Dump the results to a JSON file"""
        data = [asdict(movie) for movie in movies]
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(data, f, ensure_ascii=False, indent=2)
        print(f"Data saved to {filename}")

    def close(self):
        """Release browser resources"""
        if self.browser:
            self.browser.close()
        if self.playwright:
            self.playwright.stop()

def main():
    """Entry point"""
    spider = LibvioSpider(headless=True)
    try:
        # Crawl the movie listings
        all_movies = []
        for page in range(1, 4):  # first 3 pages
            movies = spider.get_movie_list(category="1", page=page)
            all_movies.extend(movies)
            spider._random_delay(3, 5)  # take a breather between pages
        # Fetch details (only the first 5 here, as a demo)
        for movie in all_movies[:5]:
            spider.get_movie_detail(movie)
            spider._random_delay(2, 4)
        # Save everything
        spider.save_to_json(all_movies, "movies.json")
    except Exception as e:
        print(f"The crawler hit an error: {e}")
    finally:
        spider.close()

if __name__ == "__main__":
    main()
Step 5: Some hard-won advice from V哥
Scraping is part technique and part experience. A few points V哥 has learned:
1. Throttle your request rate; don't hammer the site
python
import time
import random
def polite_request(url, session):
    """Be polite: don't pile pressure on the server"""
    time.sleep(random.uniform(2, 5))  # wait a random 2-5 seconds
    return session.get(url)
2. Handle errors and retry properly
python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
def create_session_with_retry():
    session = requests.Session()
    # Retry policy
    retry = Retry(
        total=3,                                    # retry up to 3 times
        backoff_factor=1,                           # back off between attempts
        status_forcelist=[500, 502, 503, 504, 429]  # status codes that trigger a retry
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session
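Combining this with the polite_request helper from tip 1 is straightforward; a short usage sketch (the URL is just illustrative):
python
session = create_session_with_retry()
response = polite_request("https://www.libvio.link/type/1.html", session)
print(response.status_code)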
3. Use a proxy pool
python
class ProxyPool:
    def __init__(self):
        # Placeholder proxies; plug in your own
        self.proxies = [
            'http://ip1:port',
            'http://ip2:port',
            'http://ip3:port',
        ]
        self.current = 0

    def get_proxy(self):
        # Simple round-robin rotation
        proxy = self.proxies[self.current]
        self.current = (self.current + 1) % len(self.proxies)
        return {'http': proxy, 'https': proxy}

# Usage
pool = ProxyPool()
response = requests.get(url, proxies=pool.get_proxy())
4. Save progress so you can resume where you left off
python
import json
import os
class ProgressManager:
    def __init__(self, filename='progress.json'):
        self.filename = filename
        self.progress = self._load()

    def _load(self):
        if os.path.exists(self.filename):
            with open(self.filename, 'r') as f:
                return json.load(f)
        return {'crawled_urls': [], 'last_page': 0}

    def save(self):
        with open(self.filename, 'w') as f:
            json.dump(self.progress, f)

    def is_crawled(self, url):
        return url in self.progress['crawled_urls']

    def mark_crawled(self, url):
        self.progress['crawled_urls'].append(url)
        self.save()
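Here's roughly how it slots into a crawl loop (the URL list below is made up purely for illustration):
python
progress = ProgressManager('progress.json')
urls = [f"https://www.libvio.link/type/1-{page}.html" for page in range(1, 4)]  # illustrative URLs

for url in urls:
    if progress.is_crawled(url):
        print(f"Skipping already-crawled page: {url}")
        continue
    # ... fetch and parse the page here ...
    progress.mark_crawled(url)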
A few closing words
Alright, that's it for today. V哥 will say it one more time: the technology itself is neutral, but using it for illegal purposes is not. We learn scraping to sharpen our skills and do data analysis and research, not to enable piracy or copyright infringement.
Also, scraping is all about adapting to whatever the site throws at you. Every site's anti-scraping strategy is different; what matters is learning how to analyze and solve the problem in front of you. When you hit something new, think it through, do your homework, and don't give up at the first obstacle.
Drop any questions in the comments; V哥 will reply when he sees them. See you next time!
Original content by V哥. Please credit the source when reposting.