Collecting Xiaohongshu Note Details in Python via API, with JSON Output

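Based on the Xiaohongshu open platform rules and the available technical approaches, this guide walks through collecting Xiaohongshu note details in Python and returning them as JSON: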

1. Official API approach (the recommended, compliant route)

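Applying for access
- Register an account and complete enterprise or individual real-name verification.
- Create an application under the "Data Analysis" or "Content Management" category, then apply for permission to call the note/detail endpoint (review takes 3-5 working days).
- Obtain the app_key, app_secret, and access_token (the token is issued through the OAuth 2.0 authorization flow).
API call example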

```python
import hashlib
import hmac
import json
from datetime import datetime

import requests


# Signature function (example only; follow the platform's documented scheme)
def generate_sign(secret, params):
    """HMAC-SHA256 over the key-sorted, concatenated request parameters."""
    sorted_params = sorted(params.items())
    sign_str = "".join(f"{k}{v}" for k, v in sorted_params)
    return hmac.new(secret.encode(), sign_str.encode(), hashlib.sha256).hexdigest()


def get_note_detail(note_id, access_token, app_secret):
    url = "https://api.xiaohongshu.com/note/detail"
    params = {
        "note_id": note_id,
        "access_token": access_token,
        "fields": "title,content,like_count,comment_count,images,author",
    }
    headers = {
        "Content-Type": "application/json",
        "X-Sign": generate_sign(app_secret, params),  # request signature
    }

    # A timeout keeps the call from hanging if the server never responds.
    response = requests.get(url, headers=headers, params=params, timeout=10)
    if response.status_code == 200:
        return response.json()
    raise RuntimeError(f"API error: {response.status_code} - {response.text}")


# Usage example
if __name__ == "__main__":
    note_data = get_note_detail("NOTE_ID_123", "YOUR_ACCESS_TOKEN", "YOUR_APP_SECRET")
    filename = f"note_detail_{datetime.now().strftime('%Y%m%d')}.json"
    # Write UTF-8 explicitly so Chinese text survives on every platform.
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(note_data, f, ensure_ascii=False, indent=2)
```

Response structure

```json
{
  "code": 0,
  "msg": "success",
  "data": {
    "note_id": "648a7b2f0000000012345678",
    "title": "夏日穿搭指南",
    "content": "本季流行元素解析...",
    "like_count": 12580,
    "comment_count": 890,
    "images": [
      "https://img.xiaohongshu.com/1.jpg",
      "https://img.xiaohongshu.com/2.jpg"
    ],
    "author": {
      "user_id": "12345678",
      "nickname": "时尚达人",
      "avatar": "https://avatar.xiaohongshu.com/avatar.jpg"
    }
  }
}
```
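
Before storing a response, it can help to validate the envelope and copy the fields into a typed record. The following is a minimal sketch that assumes every response follows the sample shape above (`code`, `msg`, `data`) and that `code == 0` means success. `NoteDetail` and `parse_note_response` are illustrative names, not part of any official SDK.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class NoteDetail:
    """Hypothetical record holding the fields shown in the sample response."""
    note_id: str
    title: str
    content: str
    like_count: int
    comment_count: int
    images: List[str] = field(default_factory=list)
    author_nickname: str = ""


def parse_note_response(payload: Dict[str, Any]) -> NoteDetail:
    """Check the envelope, then extract the note fields."""
    # Assumption: a non-zero code signals an error, mirroring the sample.
    if payload.get("code") != 0:
        raise ValueError(
            f"API returned an error: {payload.get('code')} - {payload.get('msg')}"
        )
    data = payload.get("data") or {}
    author = data.get("author") or {}
    return NoteDetail(
        note_id=data["note_id"],  # required; a missing id raises KeyError
        title=data.get("title", ""),
        content=data.get("content", ""),
        like_count=int(data.get("like_count", 0)),
        comment_count=int(data.get("comment_count", 0)),
        images=list(data.get("images", [])),
        author_nickname=author.get("nickname", ""),
    )
```

Optional fields fall back to empty defaults, so a response requested with a narrower `fields` list still parses.
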

2. Scraping approach (when API access is unavailable)

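Tool choices
- Use selenium to drive a real browser (suited to dynamically rendered pages)
- Or use requests + BeautifulSoup to parse static pages (anti-scraping measures must be handled)
Scraper example code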

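```python
import json
import re

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def parse_count(text):
    """Convert a displayed count such as '1.2万' or '12,580' to an int."""
    text = text.strip().replace(",", "")
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(万)?", text)
    if not match:
        return 0  # non-numeric label, e.g. a bare "like" button
    value = float(match.group(1))
    # "万" means ten thousand, so "1.2万" is 12000 (a plain string replace breaks on decimals).
    return round(value * 10000) if match.group(2) else int(value)


def scrape_note_detail(note_url):
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    driver = webdriver.Chrome(options=chrome_options)

    try:
        driver.get(note_url)
        # Wait until the title is present instead of sleeping for a fixed time.
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, "//h1[@class='title']"))
        )

        # Extract elements (adjust the selectors to the actual page structure)
        title = driver.find_element(By.XPATH, "//h1[@class='title']").text
        content = driver.find_element(By.XPATH, "//div[@class='content']").text
        likes = driver.find_element(By.XPATH, "//span[@class='like-count']").text

        return {
            "title": title,
            "content": content,
            "likes": parse_count(likes),
            "images": [
                img.get_attribute("src")
                for img in driver.find_elements(By.CSS_SELECTOR, "img.note-image")
            ],
        }
    finally:
        driver.quit()


# Usage example
if __name__ == "__main__":
    note = scrape_note_detail("https://www.xiaohongshu.com/note/123456")
    print(json.dumps(note, ensure_ascii=False, indent=2))
```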

3. Key considerations

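Compliance requirements
- Strictly follow the Personal Information Protection Law; collecting sensitive information such as users' phone numbers or addresses is prohibited
- Rate limits: at most 100 requests per minute for the official API; for scraping, leave at least 1 second between requests
- Data use must stay within the purpose declared in the application (for example, data obtained for content analysis must not be used for commercial marketing)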
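
One way to stay under the 100 requests per minute ceiling mentioned above is a client-side sliding-window limiter. The sketch below is illustrative: `SlidingWindowLimiter` is a hypothetical helper, and the clock and sleep functions are injectable only so the behavior can be checked without real waiting.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most max_calls within any window of `period` seconds."""

    def __init__(
        self,
        max_calls: int = 100,
        period: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._calls: Deque[float] = deque()  # timestamps of recent calls

    def acquire(self) -> float:
        """Block until a call is allowed; return the seconds waited."""
        now = self._clock()
        # Drop timestamps that have already left the window.
        while self._calls and now - self._calls[0] >= self.period:
            self._calls.popleft()
        waited = 0.0
        if len(self._calls) >= self.max_calls:
            # Sleep until the oldest call in the window expires.
            waited = self.period - (now - self._calls[0])
            self._sleep(waited)
            now = self._clock()
            self._calls.popleft()
        self._calls.append(now)
        return waited
```

Calling `limiter.acquire()` immediately before each API request keeps the request count within the window. For scraping, `SlidingWindowLimiter(max_calls=1, period=1.0)` enforces the suggested one-second minimum interval.
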
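Anti-scraping strategies
- Rotating User-Agent: use the fake_useragent library to generate a random User-Agent
- Proxy IP pool: implement it with scrapy-rotating-proxies or a third-party proxy service
- Request interval: use a random delay (1-3 seconds) to avoid a fixed pattern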
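
The random delay and User-Agent rotation listed above can be sketched with the standard library alone. This is a rough illustration, not a complete anti-blocking setup: the `USER_AGENTS` pool is a hardcoded placeholder (fake_useragent could supply the strings instead), and `random_headers` and `polite_pause` are hypothetical helper names.

```python
import random
import time
from typing import Callable, Dict, Sequence

# Placeholder pool for illustration only.
USER_AGENTS = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
)


def random_headers(pool: Sequence[str] = USER_AGENTS, rng=random) -> Dict[str, str]:
    """Return request headers with a User-Agent picked at random from the pool."""
    return {"User-Agent": rng.choice(pool)}


def polite_pause(
    min_s: float = 1.0,
    max_s: float = 3.0,
    rng=random,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Sleep for a random duration in [min_s, max_s]; return the delay used."""
    delay = rng.uniform(min_s, max_s)
    sleep(delay)
    return delay
```

A scraping loop would call `polite_pause()` between page loads and pass `random_headers()` to each `requests` call.
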
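Data storage suggestions
- Structured data: store the JSON documents in MongoDB
- Unstructured data: serve images and videos through a CDN for faster access
- Caching: give non-real-time data (such as author information) a 30-minute cache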
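
The 30-minute cache for slow-changing data can be as simple as an in-memory dictionary with timestamps. The sketch below assumes a single process and no size limit. `TTLCache` is a hypothetical helper, and a shared store such as Redis may be a better fit once several workers are involved.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(
        self,
        ttl_seconds: float = 30 * 60,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_load(self, key: Hashable, loader: Callable[[], Any]) -> Any:
        """Return the cached value if still fresh, else call loader and cache it."""
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]
        value = loader()
        self._store[key] = (now, value)
        return value
```

For example, `cache.get_or_load(user_id, lambda: fetch_author(user_id))` would hit the network at most once per 30 minutes per author, where `fetch_author` stands in for whatever lookup function the project defines.
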