Essential Skills for the Data Alchemist: MySQL and Crawlers in Practice for LLM Data Engineering

From business databases to high-quality fine-tuning datasets: building the "data fuel" supply chain for large-model training

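Preface

In LLM training and fine-tuning workflows, data quality sets the upper bound on model performance. MySQL and web crawlers are often the key means of obtaining "private data" and "fresh data", respectively.

For deeply technical roles that emphasize "hand-written LoRA", MySQL and crawlers are not as central as model architecture, but together they form the supply chain for high-quality training data. This article takes an LLM data engineering perspective and walks through how to use MySQL to store, manage, and query training data, and how to design a robust crawler that collects training corpora from the web.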


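Part 1: MySQL, the "central warehouse" for training data

1.1 Where MySQL fits in large-model training

In an LLM training pipeline, MySQL mainly plays the following roles:

| Role | Description | Typical scenario |
| --- | --- | --- |
| Data warehouse | Stores raw conversation logs, user feedback, and business documents | Exporting historical tickets from a CRM system |
| Metadata management | Tracks dataset versions, labeling status, and data sources | Recording the source and cleaning status of each data batch |
| Data sampling | Extracts training samples by rule | Sampling by time, label, or quality score |
| Data augmentation | Stores augmented data | Storing rewritten or translated text variants |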

1.2 Core table design

Conversation data tables (core tables)
-- Conversation history table
CREATE TABLE conversations (
    id BIGINT AUTO_INCREMENT PRIMARY KEY COMMENT 'Conversation record ID',
    session_id VARCHAR(64) NOT NULL COMMENT 'Session ID',
    user_id VARCHAR(64) NOT NULL COMMENT 'User ID',
    role ENUM('user', 'assistant', 'system') NOT NULL COMMENT 'Role',
    content TEXT NOT NULL COMMENT 'Message content',
    intent VARCHAR(64) DEFAULT NULL COMMENT 'User intent (optional)',
    sentiment_score FLOAT DEFAULT NULL COMMENT 'Sentiment score',
    turn INT DEFAULT 0 COMMENT 'Turn index within the session',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP COMMENT 'Creation time',
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'Last update time',
    INDEX idx_session (session_id),
    INDEX idx_user (user_id),
    INDEX idx_created (created_at)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci COMMENT='Conversation history table';

-- Data labeling table
CREATE TABLE data_labels (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    conversation_id BIGINT NOT NULL COMMENT 'Related conversation record ID',
    labeler_id VARCHAR(64) NOT NULL COMMENT 'Labeler ID',
    label_type VARCHAR(32) NOT NULL COMMENT 'Label type: quality/type/topic',
    label_value VARCHAR(128) NOT NULL COMMENT 'Label value',
    confidence FLOAT DEFAULT 1.0 COMMENT 'Confidence',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (conversation_id) REFERENCES conversations(id) ON DELETE CASCADE,
    INDEX idx_conversation (conversation_id),
    UNIQUE KEY uk_label (conversation_id, labeler_id, label_type)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

-- Training dataset version table
CREATE TABLE training_datasets (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    dataset_name VARCHAR(128) NOT NULL COMMENT 'Dataset name',
    version VARCHAR(32) NOT NULL COMMENT 'Version string',
    description TEXT COMMENT 'Dataset description',
    total_samples INT DEFAULT 0 COMMENT 'Total number of samples',
    data_source VARCHAR(128) COMMENT 'Data source',
    filtering_criteria JSON COMMENT 'Filtering criteria (JSON)',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    created_by VARCHAR(64) NOT NULL,
    -- Unique per (name, version), so one dataset name can carry several versions
    UNIQUE KEY uk_name_version (dataset_name, version)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

-- Link table between datasets and conversation records
CREATE TABLE dataset_conversations (
    dataset_id BIGINT NOT NULL,
    conversation_id BIGINT NOT NULL,
    sampled_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (dataset_id, conversation_id),
    FOREIGN KEY (dataset_id) REFERENCES training_datasets(id) ON DELETE CASCADE,
    FOREIGN KEY (conversation_id) REFERENCES conversations(id) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
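
To make the role of the two versioning tables concrete, here is a minimal sketch of registering a dataset version and linking its sampled rows in one transaction. It uses Python's built-in sqlite3 with a simplified stand-in schema so that it runs without a MySQL server. The helper name `snapshot_dataset`, the sample dataset name, and the reduced column set are assumptions for illustration, and the stand-in treats (dataset_name, version) as the unique key so that one name can carry several versions. With pymysql the placeholders would be `%s` instead of `?`.

```python
import sqlite3
from typing import Iterable

def snapshot_dataset(conn: sqlite3.Connection, name: str, version: str,
                     conversation_ids: Iterable[int], created_by: str) -> int:
    """Register one dataset version and link the sampled conversation rows to it."""
    ids = list(conversation_ids)
    with conn:  # commits on success, rolls back if any statement raises
        cur = conn.execute(
            "INSERT INTO training_datasets (dataset_name, version, total_samples, created_by) "
            "VALUES (?, ?, ?, ?)",
            (name, version, len(ids), created_by),
        )
        dataset_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO dataset_conversations (dataset_id, conversation_id) VALUES (?, ?)",
            [(dataset_id, cid) for cid in ids],
        )
    return dataset_id

# Simplified stand-in schema in SQLite syntax (illustration only)
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE training_datasets (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    dataset_name TEXT NOT NULL,
    version TEXT NOT NULL,
    total_samples INTEGER DEFAULT 0,
    created_by TEXT NOT NULL,
    UNIQUE (dataset_name, version)
);
CREATE TABLE dataset_conversations (
    dataset_id INTEGER NOT NULL,
    conversation_id INTEGER NOT NULL,
    PRIMARY KEY (dataset_id, conversation_id)
);
""")

v1 = snapshot_dataset(conn, "support_qa", "v1", [1, 2, 3], "alice")
v2 = snapshot_dataset(conn, "support_qa", "v2", [2, 3, 4, 5], "alice")
```

Recording membership per version this way means any exported file can later be traced back to the exact rows it was built from.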

1.3 Extracting training data in practice

Extracting training samples by condition
import pymysql
import json
from typing import List, Dict, Any
from datetime import datetime, timedelta

class TrainingDataExtractor:
    """Utility class for extracting training data from MySQL"""
    
    def __init__(self, connection_config: Dict[str, str]):
        self.conn = pymysql.connect(**connection_config)
        self.cursor = self.conn.cursor(pymysql.cursors.DictCursor)
    
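    def extract_qa_pairs(
        self,
        start_date: str,
        end_date: str,
        min_turns: int = 2,
        quality_threshold: float = 0.7,
        limit: int = 10000
    ) -> List[Dict[str, Any]]:
        """
        Extract high-quality QA pairs
        for building an instruction fine-tuning dataset.

        start_date / end_date: bounds on created_at; a date-only value
            is read as midnight at the start of that day.
        min_turns: minimum number of messages a session must contain.
        quality_threshold: applied here as the minimum labeler confidence
            for a 'high' quality label (one possible interpretation).
        """
        # The session filter is a subquery rather than GROUP BY on the outer
        # query, which MySQL rejects under ONLY_FULL_GROUP_BY and which would
        # collapse each session to a single row.
        query = """
        SELECT 
            c1.id AS message_id,
            c1.session_id,
            c1.content AS user_question,
            c2.content AS assistant_answer,
            c1.created_at,
            dl.label_value AS quality_label
        FROM conversations c1
        INNER JOIN conversations c2 
            ON c1.session_id = c2.session_id 
            AND c1.turn = c2.turn - 1
            AND c2.role = 'assistant'
        LEFT JOIN data_labels dl 
            ON c1.id = dl.conversation_id 
            AND dl.label_type = 'quality'
        WHERE c1.role = 'user'
            AND c1.created_at BETWEEN %s AND %s
            AND CHAR_LENGTH(c2.content) > 20
            AND (dl.label_value IS NULL
                 OR (dl.label_value = 'high' AND dl.confidence >= %s))
            AND c1.session_id IN (
                SELECT session_id FROM conversations
                GROUP BY session_id
                HAVING COUNT(*) >= %s
            )
        ORDER BY RAND()
        LIMIT %s
        """
        
        # ORDER BY RAND() sorts the whole candidate set, so it is slow on large tables
        self.cursor.execute(
            query, (start_date, end_date, quality_threshold, min_turns, limit)
        )
        results = self.cursor.fetchall()
        
        # Convert to the standard instruction format
        qa_pairs = []
        seen_ids = set()
        for row in results:
            # A message labeled 'high' by several labelers appears once per label
            if row['message_id'] in seen_ids:
                continue
            seen_ids.add(row['message_id'])
            qa_pairs.append({
                "instruction": "Answer the user's question",
                "input": row['user_question'],
                "output": row['assistant_answer'],
                "source": "conversation",
                "session_id": row['session_id']
            })
        
        return qa_pairs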
    
    def extract_by_intent(
        self,
        intent: str,
        limit: int = 5000
    ) -> List[Dict[str, Any]]:
        """
        Extract data by intent category
        for building scenario-specific fine-tuning data
        """
        query = """
        SELECT 
            content AS user_question,
            (
                SELECT content 
                FROM conversations c2 
                WHERE c2.session_id = c1.session_id 
                    AND c2.role = 'assistant'
                    AND c2.turn = c1.turn + 1
                LIMIT 1
            ) AS assistant_answer
        FROM conversations c1
        WHERE c1.role = 'user'
            AND c1.intent = %s
            AND EXISTS (
                SELECT 1 FROM conversations c2 
                WHERE c2.session_id = c1.session_id 
                    AND c2.role = 'assistant'
                    AND c2.turn = c1.turn + 1
            )
        ORDER BY RAND()
        LIMIT %s
        """
        
        self.cursor.execute(query, (intent, limit))
        return self.cursor.fetchall()
    
    def export_to_jsonl(
        self,
        qa_pairs: List[Dict[str, Any]],
        output_path: str
    ):
        """Export to JSONL (directly usable for fine-tuning)"""
        with open(output_path, 'w', encoding='utf-8') as f:
            for pair in qa_pairs:
                f.write(json.dumps(pair, ensure_ascii=False) + '\n')
        print(f"✅ Exported {len(qa_pairs)} samples to {output_path}")
    
    def close(self):
        self.cursor.close()
        self.conn.close()

# Usage example
extractor = TrainingDataExtractor({
    'host': 'localhost',
    'user': 'root',
    'password': 'your_password',
    'database': 'chat_history',
    'charset': 'utf8mb4'
})

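# Extract high-quality conversations from the last 30 days.
# The window ends at tomorrow's date so that today's rows are included;
# a date-only end value is read as midnight at the start of that day.
qa_pairs = extractor.extract_qa_pairs(
    start_date=(datetime.now() - timedelta(days=30)).strftime('%Y-%m-%d'),
    end_date=(datetime.now() + timedelta(days=1)).strftime('%Y-%m-%d'),
    min_turns=2,
    quality_threshold=0.7,
    limit=10000
)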

extractor.export_to_jsonl(qa_pairs, "training_data_v1.jsonl")
extractor.close()
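
Before fine-tuning, exported pairs are usually split into training and validation sets. The following is a minimal sketch, assuming records shaped like the output of `extract_qa_pairs`. Hashing `session_id` keeps every pair from one session on the same side of the split, which avoids leakage between the two sets. The function name `split_by_session` and the 10% validation ratio are illustrative assumptions, not part of the extractor above.

```python
import hashlib
from typing import Any, Dict, List, Tuple

def split_by_session(
    qa_pairs: List[Dict[str, Any]],
    val_ratio: float = 0.1,
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Deterministically split samples so that a session never spans both sets."""
    train, val = [], []
    for pair in qa_pairs:
        digest = hashlib.sha256(pair["session_id"].encode("utf-8")).hexdigest()
        # Map the first 32 bits of the hash to a stable bucket in [0, 1)
        bucket = int(digest[:8], 16) / 0x100000000
        (val if bucket < val_ratio else train).append(pair)
    return train, val

# Synthetic samples for illustration: two pairs per session
samples = [
    {"instruction": "Answer the user's question", "input": f"q{i}",
     "output": f"a{i}", "session_id": f"session-{i // 2}"}
    for i in range(200)
]
train_set, val_set = split_by_session(samples, val_ratio=0.1)
```

Because the bucket depends only on the session ID, re-running the split after new data arrives leaves previously assigned sessions where they were.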

1.4 Incremental data sync strategy

import json
import os
from datetime import datetime, timedelta

# Second-level timestamps, so several runs on one day neither re-read nor skip rows
TIME_FORMAT = '%Y-%m-%d %H:%M:%S'

class IncrementalDataSync:
    """Incremental sync tool: periodically extracts newly added training data"""
    
    def __init__(self, extractor: TrainingDataExtractor):
        self.extractor = extractor
        self.last_sync_time = None
        self.checkpoint_file = "sync_checkpoint.json"
        self._load_checkpoint()
    
    def _load_checkpoint(self):
        """Load the time of the last sync"""
        if os.path.exists(self.checkpoint_file):
            with open(self.checkpoint_file, 'r') as f:
                data = json.load(f)
                self.last_sync_time = data.get('last_sync')
    
    def _save_checkpoint(self, sync_time: str):
        """Persist the sync time and keep the in-memory copy current"""
        with open(self.checkpoint_file, 'w') as f:
            json.dump({'last_sync': sync_time}, f)
        self.last_sync_time = sync_time
    
    def sync_new_data(self, output_path_template: str = "incremental_data_{date}.jsonl"):
        """
        Incrementally sync new data
        """
        now = datetime.now()
        if self.last_sync_time is None:
            # First sync: take the last 7 days
            start_date = (now - timedelta(days=7)).strftime(TIME_FORMAT)
        else:
            start_date = self.last_sync_time
        
        end_date = now.strftime(TIME_FORMAT)
        
        # Extract the new data. The extractor samples at most `limit` rows,
        # so rows beyond the limit in this window are not picked up later.
        qa_pairs = self.extractor.extract_qa_pairs(
            start_date=start_date,
            end_date=end_date,
            min_turns=2,
            limit=5000
        )
        
        if qa_pairs:
            # Include the time of day so a second run does not overwrite the first
            output_path = output_path_template.format(
                date=now.strftime('%Y%m%d_%H%M%S')
            )
            self.extractor.export_to_jsonl(qa_pairs, output_path)
            
            # Advance the checkpoint
            self._save_checkpoint(end_date)
            print(f"✅ Incremental sync finished, {len(qa_pairs)} new samples")
        else:
            print("ℹ️ No new data")
        
        return qa_pairs
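
A checkpoint file that is half-written when the process dies can break the next sync. One common way to guard against that is to write to a temporary file and rename it over the target. The sketch below illustrates this under stated assumptions: the function names `save_checkpoint_atomic` and `load_checkpoint` are hypothetical, and the JSON layout mirrors the `last_sync` key used above.

```python
import json
import os
import tempfile

def save_checkpoint_atomic(path: str, sync_time: str) -> None:
    """Write the checkpoint to a temp file, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump({"last_sync": sync_time}, f)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes reach the disk first
        os.replace(tmp_path, path)  # swap the file in as a single step
    except BaseException:
        # Clean up the temp file if anything went wrong before the rename
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

def load_checkpoint(path: str):
    """Return the stored sync time, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f).get("last_sync")

# Demonstration in a throwaway directory
demo_path = os.path.join(tempfile.mkdtemp(), "sync_checkpoint.json")
before = load_checkpoint(demo_path)
save_checkpoint_atomic(demo_path, "2024-01-01 00:00:00")
after = load_checkpoint(demo_path)
```

The temporary file is created in the same directory as the target because a rename is only a single-step operation within one filesystem.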

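Part 2: Crawler, collecting training corpora from the web

2.1 What makes an LLM data crawler different

A traditional crawler cares about "getting the data". A crawler for LLM training data also has to care about:

| Dimension | Requirement |
| --- | --- |
| Content quality | Filter out low-quality and spam content |
| Diversity | Cover different domains, styles, and lengths |
| Freshness | Be able to pick up the latest content |
| Compliance | Respect robots.txt and keep the request rate reasonable |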
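
The compliance requirement can be checked with the standard library's `urllib.robotparser`. The sketch below parses robots.txt text that is already in memory, so it runs offline. The rules shown and the user-agent name `my-llm-crawler` are hypothetical; a real crawler would more likely load the file from the site with `set_url()` and `read()`.

```python
from urllib.robotparser import RobotFileParser

def build_robot_rules(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file into a rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

# Hypothetical robots.txt content for illustration
example_rules = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rules = build_robot_rules(example_rules)

# Ask before fetching: is this URL allowed for our user agent?
allowed = rules.can_fetch("my-llm-crawler", "https://example.com/blog/post-1")
blocked = rules.can_fetch("my-llm-crawler", "https://example.com/private/data")

# Crawl-delay, if the site declares one, is a lower bound for the request interval
delay = rules.crawl_delay("my-llm-crawler")
```

A crawler would typically keep one parsed rule set per domain and consult it before each request.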

2.2 A robust crawler architecture

import requests
from bs4 import BeautifulSoup
import time
import random
from urllib.parse import urljoin, urlparse
from typing import List, Dict, Any, Optional
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed

class RobustWebCrawler:
    """
    Robust web crawler, designed for collecting LLM training data
    """
    
    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        })
        
        self.visited_urls = set()
        self.results = []
        self.failed_urls = []
        
        # Configure logging
        logging.basicConfig(level=logging.INFO)
        self.logger = logging.getLogger(__name__)
    
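    def fetch_page(self, url: str, retries: int = 3) -> Optional[str]:
        """
        Fetch the page HTML (with retries)
        """
        for attempt in range(retries):
            try:
                response = self.session.get(
                    url,
                    timeout=10,
                    allow_redirects=True
                )
                if response.status_code == 200:
                    # Detect the encoding from the content, falling back to UTF-8
                    response.encoding = response.apparent_encoding or 'utf-8'
                    return response.text
                elif response.status_code == 429:
                    # Rate limited: wait longer before retrying
                    wait_time = 30 * (attempt + 1)
                    self.logger.warning(f"Rate limited, waiting {wait_time}s")
                    time.sleep(wait_time)
                elif 400 <= response.status_code < 500:
                    # Other client errors (403, 404, ...) will not succeed on retry
                    self.logger.warning(f"HTTP {response.status_code}: {url}")
                    break
                else:
                    self.logger.warning(f"HTTP {response.status_code}: {url}")
                    time.sleep(2 ** attempt)  # exponential backoff
                    
            except requests.exceptions.RequestException as e:
                self.logger.warning(f"Attempt {attempt+1} failed: {str(e)}")
                time.sleep(2 ** attempt)  # exponential backoff
        
        self.failed_urls.append(url)
        return None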
    
    def extract_articles(self, html: str, url: str) -> List[Dict[str, str]]:
        """
        Extract article content from HTML
        Supports several common page structures
        """
        soup = BeautifulSoup(html, 'html.parser')
        
        # Remove noisy tags
        for tag in soup(['script', 'style', 'nav', 'footer', 'header', 'aside']):
            tag.decompose()
        
        # Try different content selectors, most specific structure first
        content_selectors = [
            'article', 'main', '.post-content', '.article-content',
            '.content', '.entry-content', '.post-body'
        ]
        
        articles = []
        
        for selector in content_selectors:
            elements = soup.select(selector)
            for elem in elements:
                text = elem.get_text(separator='\n', strip=True)
                if len(text) > 200:  # skip content that is too short
                    articles.append({
                        'url': url,
                        'title': self._extract_title(soup, elem),
                        'content': text,
                        'length': len(text),
                        'source': urlparse(url).netloc
                    })
            if articles:
                # Stop at the first selector that yields content, so nested
                # containers (e.g. .post-content inside <article>) are not
                # extracted twice
                break
        
        # If no article structure was found, fall back to joining paragraphs
        if not articles:
            paragraphs = soup.find_all('p')
            text = '\n'.join([p.get_text(strip=True) for p in paragraphs if len(p.get_text()) > 20])
            if len(text) > 200:
                articles.append({
                    'url': url,
                    'title': self._extract_title(soup),
                    'content': text,
                    'length': len(text),
                    'source': urlparse(url).netloc
                })
        
        return articles
    
    def _extract_title(self, soup: BeautifulSoup, context=None) -> str:
        """Extract the page title"""
        # Prefer a title found inside the article context
        if context:
            title_candidates = [
                context.find('h1'), context.find('h2'),
                context.find(attrs={'class': 'title'})
            ]
            for candidate in title_candidates:
                if candidate:
                    return candidate.get_text(strip=True)
        
        # Fall back to the page <title>
        if soup.title:
            return soup.title.get_text(strip=True)
        return ""
    
    def crawl_pages(
        self,
        seed_urls: List[str],
        max_depth: int = 2,
        max_pages: int = 100,
        same_domain: bool = True,
        delay: float = 1.0
    ) -> List[Dict[str, str]]:
        """
        Core crawl loop (breadth-first)
        
        Args:
            seed_urls: list of starting URLs
            max_depth: maximum crawl depth
            max_pages: maximum number of pages to fetch
            same_domain: whether to stay on the same domain
            delay: delay between requests (seconds)
        """
        queue = [(url, 0) for url in seed_urls]  # (url, depth)
        pages_fetched = 0  # counts fetched pages, not extracted articles
        
        while queue and pages_fetched < max_pages:
            url, depth = queue.pop(0)
            
            if url in self.visited_urls:
                continue
            self.visited_urls.add(url)
            
            self.logger.info(f"🔄 Crawling [{depth}] {url}")
            
            # Fetch the page
            html = self.fetch_page(url)
            if not html:
                continue
            pages_fetched += 1
            
            # Extract articles
            articles = self.extract_articles(html, url)
            self.results.extend(articles)
            self.logger.info(f"   ✅ Extracted {len(articles)} articles")
            
            # Below the maximum depth, collect new links to follow
            if depth < max_depth:
                links = self._extract_links(html, url)
                for link in links:
                    if same_domain and urlparse(link).netloc != urlparse(url).netloc:
                        continue
                    if link not in self.visited_urls:
                        queue.append((link, depth + 1))
            
            # Politeness delay with random jitter
            time.sleep(delay * (0.8 + 0.4 * random.random()))
        
        return self.results
    
    def _extract_links(self, html: str, base_url: str) -> List[str]:
        """Extract all crawlable links from the page"""
        soup = BeautifulSoup(html, 'html.parser')
        links = []
        # File types that are not worth fetching as HTML
        skip_extensions = ('.pdf', '.jpg', '.png', '.mp4', '.zip')
        
        for a in soup.find_all('a', href=True):
            href = a['href']
            # Skip empty links and in-page anchors
            if not href or href.startswith('#'):
                continue
            
            parsed = urlparse(urljoin(base_url, href))
            # Keep only http(s) links; this also drops javascript:, mailto:, tel:
            if parsed.scheme not in ('http', 'https'):
                continue
            # Check the path, so a query string cannot hide the extension
            if parsed.path.lower().endswith(skip_extensions):
                continue
            
            # Normalize the URL by dropping the fragment
            links.append(parsed._replace(fragment='').geturl())
        
        return list(set(links))  # deduplicate
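
The visited set only helps if equivalent URLs map to the same key. As a sketch of how that could be tightened, the helper below lowercases the scheme and host, drops the fragment and a few tracking parameters, and sorts the remaining query parameters. The function name `canonicalize_url` and the list of tracking parameters are assumptions for illustration. Stripping the trailing slash is also a simplification, since some servers treat the two forms differently.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

# Query parameters assumed to be tracking-only (illustrative list)
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content"}

def canonicalize_url(url: str) -> str:
    """Return a canonical form of the URL for use as a visited-set key."""
    parts = urlparse(url)
    # Drop tracking parameters and sort the rest so parameter order does not matter
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    )
    # Treat "/blog/" and "/blog" as the same page; keep "/" for the site root
    path = parts.path.rstrip("/") or "/"
    return urlunparse((
        parts.scheme.lower(),
        parts.netloc.lower(),
        path,
        parts.params,
        urlencode(query),
        "",  # fragment removed
    ))

a = canonicalize_url("https://Example.com/blog/?utm_source=x&b=2&a=1#top")
b = canonicalize_url("https://example.com/blog?a=1&b=2")
# Both calls produce "https://example.com/blog?a=1&b=2"
```

Applying such a function before the visited-set lookup would prevent the crawler from fetching one page several times under different spellings.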

2.3 Content cleaning and deduplication

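from hashlib import md5
import re
from typing import List, Dict

class ContentCleaner:
    """Extracts high-quality text from crawled web pages"""
    
    @staticmethod
    def clean_article(text: str) -> str:
        """Clean a single article"""
        # Drop short lines (likely ads, navigation, etc.), but keep blank
        # lines so that paragraph breaks survive
        kept_lines = []
        for line in text.split('\n'):
            stripped = line.strip()
            if not stripped:
                kept_lines.append('')
            elif len(stripped) > 15:
                kept_lines.append(line)
        text = '\n'.join(kept_lines)
        
        # Collapse runs of blank lines
        text = re.sub(r'\n{3,}', '\n\n', text)
        
        # Strip leading and trailing whitespace
        return text.strip()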
    
    @staticmethod
    def deduplicate(articles: List[Dict[str, str]]) -> List[Dict[str, str]]:
        """Deduplicate based on a content hash"""
        seen = set()
        unique = []
        
        for article in articles:
            # Hash the first 200 characters of the content and append the length
            content_key = md5(article['content'][:200].encode()).hexdigest()
            content_key += f"_{len(article['content'])}"
            
            if content_key not in seen:
                seen.add(content_key)
                unique.append(article)
        
        return unique
    
    @staticmethod
    def filter_by_length(articles: List[Dict[str, str]], min_len: int = 300) -> List[Dict[str, str]]:
        """Filter by length"""
        return [a for a in articles if len(a['content']) >= min_len]
    
    @staticmethod
    def filter_by_quality(articles: List[Dict[str, str]]) -> List[Dict[str, str]]:
        """Filter out low-quality content using heuristic rules"""
        filtered = []
        
        for article in articles:
            text = article['content']
            
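            # Look for obvious spam markers; the Chinese phrases mean
            # "click here", "click to buy", "get more", "scan to follow"
            spam_indicators = [
                'window.location', 'function(){', 'advertisement',
                '点击这里', '点击购买', '获取更多', '扫码关注'
            ]
            
            if any(indicator in text for indicator in spam_indicators):
                continue
            
            # Count sentence-ending punctuation, full-width and ASCII (at least 5 sentences)
            sentence_count = len(re.findall(r'[。!?.?!]', text))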
            if sentence_count < 5:
                continue
            
            filtered.append(article)
        
        return filtered
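
Exact hashing misses pages that differ only by a character or a footer line. A common complement is near-duplicate detection with character shingles and Jaccard similarity, sketched below. This is a swapped-in technique rather than part of the cleaner above: the function names, the shingle size of 5, and the 0.8 threshold are illustrative assumptions. The pairwise comparison is quadratic, so large corpora usually need MinHash or a similar approximation instead.

```python
from typing import Dict, List, Set

def shingles(text: str, k: int = 5) -> Set[str]:
    """Return the set of overlapping k-character substrings of the text."""
    normalized = "".join(text.split())  # ignore whitespace differences
    if len(normalized) <= k:
        return {normalized}
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}

def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def drop_near_duplicates(
    articles: List[Dict[str, str]],
    threshold: float = 0.8,
) -> List[Dict[str, str]]:
    """Keep an article only if it is not too similar to one already kept."""
    kept, kept_shingles = [], []
    for article in articles:
        current = shingles(article["content"])
        if all(jaccard(current, seen) < threshold for seen in kept_shingles):
            kept.append(article)
            kept_shingles.append(current)
    return kept

docs = [
    {"content": "MySQL stores the conversation history used for fine-tuning."},
    {"content": "MySQL stores the conversation history used for fine-tuning!"},
    {"content": "A crawler collects fresh articles from the public web."},
]
unique_docs = drop_near_duplicates(docs)
# The second document differs from the first only in its last character and is dropped
```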

Part 3: MySQL + Crawler end to end

3.1 A complete training-data pipeline

class DataPipeline:
    """
    The complete data pipeline:
    crawl → clean → store in MySQL → extract training data
    """
    
    def __init__(self, mysql_config: Dict[str, str]):
        self.crawler = RobustWebCrawler()
        self.cleaner = ContentCleaner()
        self.extractor = TrainingDataExtractor(mysql_config)
    
    def crawl_and_store(
        self,
        seed_urls: List[str],
        max_pages: int = 200,
        source_name: str = "web_data"
    ) -> int:
        """
        Crawl web pages and store them in MySQL
        """
        # 1. Run the crawler
        print("🔄 Starting crawl...")
        articles = self.crawler.crawl_pages(
            seed_urls,
            max_depth=2,
            max_pages=max_pages,
            same_domain=True,
            delay=0.5
        )
        
        print(f"✅ Crawl finished, collected {len(articles)} articles")
        
        # 2. Clean and deduplicate
        articles = [{
            'content': self.cleaner.clean_article(a['content']),
            'title': a['title'],
            'url': a['url'],
            'source': a['source']
        } for a in articles]
        
        articles = self.cleaner.deduplicate(articles)
        articles = self.cleaner.filter_by_length(articles, min_len=300)
        articles = self.cleaner.filter_by_quality(articles)
        
        print(f"✅ {len(articles)} articles kept after cleaning")
        
        # 3. Store in MySQL (as new rows in the conversations table)
        # Note: articles are converted to a simulated conversation format for uniform storage
        self._store_articles(articles, source_name)
        
        return len(articles)
    
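    def _store_articles(self, articles: List[Dict], source_name: str):
        """
        Store articles in MySQL.
        Each article becomes a two-message session: a user request
        followed by the full article text as the assistant reply.
        """
        cursor = self.extractor.conn.cursor()
        
        # `turn` is set explicitly: extract_qa_pairs pairs a user message
        # at turn N with the assistant message at turn N + 1
        insert_query = """
        INSERT INTO conversations (session_id, user_id, role, content, intent, turn, created_at)
        VALUES (%s, %s, %s, %s, %s, %s, NOW())
        """
        
        for article in articles:
            # Build the simulated conversation
            user_query = f"Please introduce the following: {article['title']}"
            assistant_response = article['content']
            
            # Hash URL plus content, so several articles from one page
            # get separate sessions
            session_id = md5((article['url'] + article['content']).encode()).hexdigest()
            
            # Insert the user message (turn 0)
            cursor.execute(insert_query, (
                session_id,
                'crawler_bot',
                'user',
                user_query,
                'web_content',
                0
            ))
            
            # Insert the assistant reply (turn 1)
            cursor.execute(insert_query, (
                session_id,
                'crawler_bot',
                'assistant',
                assistant_response,
                'web_content',
                1
            ))
        
        self.extractor.conn.commit()
        print(f"✅ Stored {len(articles)} articles in MySQL")
        cursor.close()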
    
    def close(self):
        self.extractor.close()

# ========== Complete usage example ==========
if __name__ == "__main__":
    mysql_config = {
        'host': 'localhost',
        'user': 'root',
        'password': 'your_password',
        'database': 'training_data',
        'charset': 'utf8mb4'
    }
    
    pipeline = DataPipeline(mysql_config)
    
    # 1. Crawl training data from the web
    seed_urls = [
        'https://example.com/blog',
        'https://example.com/docs',
    ]
    count = pipeline.crawl_and_store(seed_urls, max_pages=100)
    print(f"✅ Crawled and stored {count} articles")
    
    # 2. Extract training data from MySQL.
    # The window ends at tomorrow's date so that rows stored a moment ago
    # are included; a date-only end value is read as midnight.
    qa_pairs = pipeline.extractor.extract_qa_pairs(
        start_date=(datetime.now() - timedelta(days=7)).strftime('%Y-%m-%d'),
        end_date=(datetime.now() + timedelta(days=1)).strftime('%Y-%m-%d'),
        min_turns=2,
        limit=10000
    )
    
    # 3. Export as training data
    pipeline.extractor.export_to_jsonl(qa_pairs, "final_training_data.jsonl")
    
    pipeline.close()
    print("🎉 Full data pipeline finished!")
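
Before an exported file goes into a fine-tuning job, a quick structural check can catch empty answers and malformed lines. The sketch below assumes the instruction/input/output record layout produced by `export_to_jsonl`. The function name `validate_jsonl_lines` and the rule that all three fields must be non-empty strings are assumptions for illustration.

```python
import json
from typing import Iterable, List, Tuple

REQUIRED_KEYS = ("instruction", "input", "output")

def validate_jsonl_lines(lines: Iterable[str]) -> Tuple[int, List[str]]:
    """Return (number of valid records, list of problems found)."""
    valid, problems = 0, []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as e:
            problems.append(f"line {line_no}: invalid JSON ({e.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {line_no}: record is not a JSON object")
            continue
        # Every required field must be a non-empty string
        bad = [
            key for key in REQUIRED_KEYS
            if not isinstance(record.get(key), str) or not record[key].strip()
        ]
        if bad:
            problems.append(f"line {line_no}: empty or missing {bad}")
        else:
            valid += 1
    return valid, problems

sample_lines = [
    json.dumps({"instruction": "Answer the user's question", "input": "q", "output": "a"}),
    json.dumps({"instruction": "Answer the user's question", "input": "q", "output": ""}),
    "{not json}",
]
valid_count, issues = validate_jsonl_lines(sample_lines)
```

For a file on disk, the open file object can be passed directly, since iterating over it yields lines.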

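Summary: in data engineering, both skills have to be solid

For technical roles that emphasize "hand-written LoRA", data engineering skills are just as much a core competency:

| Skill | Core value | Role in LoRA training |
| --- | --- | --- |
| MySQL | Storing, managing, and extracting business data | Supplies high-quality conversation data for LoRA fine-tuning |
| Crawler | Collecting fresh corpora from the web | Adds diversity to the training data and extends knowledge coverage |
| Data processing | Cleaning, deduplication, formatting | Guarantees a quality floor for the training data |

Key insight: good model trainers recognize that data engineering quality often matters more than small improvements to the model architecture. No matter how advanced the LoRA technique, a model fed garbage data will not turn out well.


Original-content notice: this is an original article by a CSDN blogger, summarizing hands-on experience in LLM data engineering. Discussion is welcome!

Last updated: June 2026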