[Douyin] Fetch video data from a user's profile share link, with optional transcription of the video speech

Scrape a Douyin user's video data starting from the share link of their profile page.

Core component: the GitHub project Douyin_TikTok_Download_API

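Read the project's description on GitHub carefully, then deploy the project locally. The repository includes deployment documentation and demo videos. GitHub project: Douyin_TikTok_Download_API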

We mainly use one of the project's endpoints: fetch a user's profile posts - /api/douyin/web/fetch_user_post_videos

Reference pseudocode (pass the returned max_cursor into the next request, and loop until all video data has been fetched):


# Get profile videos info
# Pseudocode: `client` is assumed to be an already-initialized API client
async def get_profile_videos_info(sec_user_id: str):
    async def fetch_videos(max_cursor: int):
        print(f"Fetching videos with max_cursor: {max_cursor}")
        # Perform API request
        response = await client.DouyinAppV3.fetch_user_post_videos(sec_user_id, max_cursor=max_cursor, count=20)
        # Extract and return the required information
        data = response["data"]
        return data["aweme_list"], data["has_more"], data["max_cursor"]

    # Create an empty list to store all videos info
    all_videos_info = []
    has_more = True
    max_cursor = 0

    # Loop to get video info until there are no more
    while has_more:
        videos_info, has_more, max_cursor = await fetch_videos(max_cursor)
        all_videos_info.extend(videos_info)
        print(f"Total videos fetched: {len(all_videos_info)}")

    return all_videos_info

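Note: the project's PyPI package has not been maintained for a long time, so the latest version cannot be installed with pip. Deploy the project locally instead, or use its commercial API. The online demo API only returns the first page of a user's profile videos and does not support pagination, so it cannot be used to fetch all of a user's videos.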
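As a rough sketch of the same pagination loop against a locally deployed instance, the endpoint can be called over plain HTTP. The base URL `http://localhost:80`, the helper names `fetch_page_http` and `fetch_all_posts`, and the `max_pages` safety cap are assumptions made for illustration; the query parameters and response fields simply mirror the pseudocode above and may differ in your deployment.

```python
import json
import urllib.parse
import urllib.request

# Assumed address of the local deployment; adjust to your own setup.
BASE_URL = "http://localhost:80"
ENDPOINT = "/api/douyin/web/fetch_user_post_videos"


def fetch_page_http(sec_user_id, max_cursor, count=20):
    """Fetch one page from the (assumed) local deployment and return its "data" dict."""
    query = urllib.parse.urlencode(
        {"sec_user_id": sec_user_id, "max_cursor": max_cursor, "count": count}
    )
    with urllib.request.urlopen(f"{BASE_URL}{ENDPOINT}?{query}", timeout=30) as resp:
        return json.load(resp)["data"]


def fetch_all_posts(sec_user_id, fetch_page=fetch_page_http, max_pages=500):
    """Follow max_cursor page by page until has_more is falsy."""
    all_videos = []
    max_cursor = 0
    for _ in range(max_pages):  # hypothetical cap so a bad cursor cannot loop forever
        data = fetch_page(sec_user_id, max_cursor)
        all_videos.extend(data.get("aweme_list") or [])
        if not data.get("has_more"):
            break
        max_cursor = data["max_cursor"]
    return all_videos
```

Passing `fetch_page` as a parameter keeps the loop independent of the transport, so it can be exercised with a stub before pointing it at a real deployment.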

Transcribing video speech with Alibaba Cloud's audio model (optional, paid)

Get an Alibaba Cloud Bailian API key

Go to the Alibaba Cloud Bailian model marketplace.

After logging in, create an API key from the top-right corner:

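Once created, the key works with every model in the Bailian model marketplace, and a free quota is available each month for trial use.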
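One way to keep the key out of source code is to read it from an environment variable. The sketch below assumes the key is stored in `DASHSCOPE_API_KEY`, the variable name used by the DashScope Python SDK; the helper `load_api_key` is hypothetical and not part of any SDK.

```python
import os


def load_api_key(env=None):
    """Return the API key from the environment, or raise if it is missing."""
    env = os.environ if env is None else env
    key = env.get("DASHSCOPE_API_KEY", "").strip()
    if not key:
        raise RuntimeError("Set DASHSCOPE_API_KEY to the key created in the Bailian console")
    return key


# With the SDK installed, the key could then be applied like this:
# import dashscope
# dashscope.api_key = load_api_key()
```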

Full workflow

Get the share link of the Douyin user's profile page: tap Share, then tap Copy Link.


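For example, the copied share text looks like this (the Chinese message reads "Long-press to copy this message, open Douyin search, and view more of their posts"): 9- 长按复制此条消息，打开抖音搜索，查看TA的更多作品。 https://v.douyin.com/he-nohwV0QA/ 3@2.com :0pm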

Resolve the short link over HTTP to get the full Douyin URL, then extract the user's real ID (sec_user_id). Reference code:


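import re

import requests


def main(share_text) -> dict:
    try:
        # Extract the first URL from the share text with a regular expression
        url_pattern = re.compile(r'https?://[^\s]+')
        match = url_pattern.search(share_text)
        if not match:
            return {"sec_user_id": ""}
        share_url = match.group(0)
        # Send the request and follow redirects; the timeout avoids hanging forever
        response = requests.get(share_url, allow_redirects=True, timeout=10)
        # Final URL after all redirects
        final_url = response.url
        # Locate the "/user/" segment; give up if the final URL does not contain it
        marker = "/user/"
        marker_index = final_url.find(marker)
        if marker_index == -1:
            return {"sec_user_id": ""}
        partial_sec_user_id = final_url[marker_index + len(marker):]
        # Drop the query string and any trailing slash, keeping only sec_user_id
        sec_user_id = partial_sec_user_id.split('?')[0].strip('/')
        return {"sec_user_id": sec_user_id}
    except Exception:
        # Any network or parsing failure yields an empty ID
        return {"sec_user_id": ""}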
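The ID-extraction step can also be isolated into a pure function, which makes it easy to check without any network access. This is a minimal sketch using `urllib.parse`; the helper name `extract_sec_user_id` and the example URL are made up for illustration, and it assumes the redirected URL carries the ID in the path segment right after `user`.

```python
from urllib.parse import urlparse


def extract_sec_user_id(final_url):
    """Return the path segment that follows "user" in the URL, or "" if absent."""
    segments = [s for s in urlparse(final_url).path.split("/") if s]
    if "user" in segments:
        idx = segments.index("user")
        if idx + 1 < len(segments):
            return segments[idx + 1]
    return ""


# Hypothetical redirected URL, for illustration only.
example = "https://www.douyin.com/user/MS4wLjABAAAAexample?from_ssr=1"
print(extract_sec_user_id(example))
```

Parsing the path rather than searching the raw string means query parameters and trailing slashes never leak into the ID.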

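With the user's real ID, call the /api/douyin/web/fetch_user_post_videos endpoint, write the full fetch loop following the pseudocode above, and clean the returned data yourself. Reference code: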


def format_data(sec_user_id, aweme_data):
    table_data = []
    for aweme in aweme_data:
        item_data = {
            # User's sec_user_id
            'sec_user_id': sec_user_id,
            # Video ID
            'aweme_id': aweme['aweme_id'],
            # Video title
            'description': aweme['desc'],
            # Author nickname
            'author_name': aweme['author']['nickname'],
            # Author uid
            'author_uid': aweme['author']['uid'],
            # Watermark-free video URL
            'video_url': aweme['video']['bit_rate'][-1]['play_addr']['url_list'][-1],
            # Like count
            'digg_count': aweme['statistics']['digg_count'],
            # Comment count
            'comment_count': aweme['statistics']['comment_count'],
            # Share count
            'share_count': aweme['statistics']['share_count'],
            # Favorite count
            'collect_count': aweme['statistics']['collect_count'],
            # Video hashtags
            "hashtags": [text['hashtag_name'] for text in aweme['text_extra'] if 'hashtag_name' in text],
            # Share link
            'share_url': aweme['share_info']['share_url'],
            # Share text
            'share_link_desc': aweme['share_info']['share_link_desc'],
            # Creation time
            'create_time': aweme['create_time'],
            # Audio URL (empty string when the video has no music entry)
            'music_url': aweme['music']['play_url']['url_list'][0] if aweme.get('music') and aweme['music'].get('play_url') and aweme['music']['play_url'].get('url_list') else "",
        }
        table_data.append(item_data)
    return table_data
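Once the rows are cleaned, they can be written out for analysis. The following is a small sketch that serializes rows shaped like the output of `format_data` to CSV text; the helper name `rows_to_csv` is hypothetical, and it assumes `create_time` is a Unix timestamp in seconds, which is worth verifying against your own data.

```python
import csv
import io
from datetime import datetime, timezone


def rows_to_csv(table_data):
    """Serialize the cleaned rows to CSV text, using the first row's keys as columns."""
    if not table_data:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(table_data[0].keys()))
    writer.writeheader()
    for row in table_data:
        flat = dict(row)
        # A list does not fit in one CSV cell, so join the hashtags with "|".
        flat["hashtags"] = "|".join(row["hashtags"])
        # Assumption: create_time is a Unix timestamp in seconds.
        flat["create_time"] = datetime.fromtimestamp(
            row["create_time"], tz=timezone.utc
        ).isoformat()
        writer.writerow(flat)
    return buffer.getvalue()
```

The returned string can be saved with an ordinary `open(path, "w", newline="", encoding="utf-8")` call.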

Transcribe the video speech (optional) by calling the Alibaba Cloud model API. Reference code:


import time
from http import HTTPStatus

import requests
from dashscope.audio.asr import Transcription

# The DashScope SDK needs an API key, e.g. via the DASHSCOPE_API_KEY environment variable


def get_videos_text(file_urls):
    # Submit an asynchronous transcription task
    transcribe_response = Transcription.async_call(
        model='paraformer-v2',
        file_urls=file_urls,
        language_hints=['zh', 'en']
    )

    # Poll until the task finishes, pausing between requests
    while transcribe_response.output.task_status not in ('SUCCEEDED', 'FAILED'):
        time.sleep(2)
        transcribe_response = Transcription.fetch(task=transcribe_response.output.task_id)

    if transcribe_response.status_code != HTTPStatus.OK:
        return []

    results = transcribe_response.output['results']
    for result in results:
        if result.get('subtask_status') == 'SUCCEEDED':
            # Download the transcription JSON and keep the first transcript's text
            text_response = requests.get(result.get('transcription_url'), timeout=30)
            text_data = text_response.json()
            result['text'] = text_data['transcripts'][0]['text']
        else:
            result['text'] = ''
    return results
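To tie the transcripts back to the cleaned video rows, the results can be matched by URL. The sketch below makes two assumptions: that each result dict carries the submitted URL under `file_url`, and that requests are best sent in batches (the batch size of 100 is a guess, so check the service's documented limit). The helpers `chunked` and `attach_transcripts` are hypothetical names.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def attach_transcripts(table_data, results, url_field="music_url"):
    """Add a "text" field to each row by matching its URL against the results."""
    # Assumption: every result includes the original URL under "file_url".
    text_by_url = {r.get("file_url"): r.get("text", "") for r in results}
    for row in table_data:
        row["text"] = text_by_url.get(row.get(url_field), "")
    return table_data


# Possible usage with the functions from this article (not executed here):
# urls = [row["music_url"] for row in table_data if row["music_url"]]
# results = []
# for batch in chunked(urls, 100):
#     results.extend(get_videos_text(batch))
# table_data = attach_transcripts(table_data, results)
```

Rows whose URL has no matching result simply receive an empty `text`, so a partially failed batch does not break the table.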