LLaMA-Factory data error

Fine-tuning data in long multi-turn conversation format triggered an error. The WebUI showed the following:

bash
[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/llama_fine/LLaMA-Factory/src/llamafactory/launcher.py", line 185, in <module>
[rank0]:     run_exp()
[rank0]:   File "/root/llama_fine/LLaMA-Factory/src/llamafactory/train/tuner.py", line 132, in run_exp
[rank0]:     _training_function(config={"args": args, "callbacks": callbacks})
[rank0]:   File "/root/llama_fine/LLaMA-Factory/src/llamafactory/train/tuner.py", line 93, in _training_function
[rank0]:     run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank0]:   File "/root/llama_fine/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 51, in run_sft
[rank0]:     dataset_module = get_dataset(template, model_args, data_args, training_args, stage="sft", **tokenizer_module)
[rank0]:   File "/root/llama_fine/LLaMA-Factory/src/llamafactory/data/loader.py", line 314, in get_dataset
[rank0]:     dataset = _get_preprocessed_dataset(
[rank0]:   File "/root/llama_fine/LLaMA-Factory/src/llamafactory/data/loader.py", line 271, in _get_preprocessed_dataset
[rank0]:     raise RuntimeError("Cannot find valid samples, check `data/README.md` for the data format.")
[rank0]: RuntimeError: Cannot find valid samples, check `data/README.md` for the data format.

Checking the terminal reveals more detailed error messages.

bash
[WARNING|2025-12-07 09:16:34] llamafactory.data.processor.supervised:148 >> Dropped invalid example: []
[WARNING|2025-12-07 09:16:34] llamafactory.data.processor.supervised:148 >> Dropped invalid example: []
[WARNING|2025-12-07 09:16:34] llamafactory.data.processor.supervised:148 >> Dropped invalid example: []
[WARNING|2025-12-07 09:16:34] llamafactory.data.processor.supervised:148 >> Dropped invalid example: []
[WARNING|2025-12-07 09:16:34] llamafactory.data.processor.supervised:148 >> Dropped invalid example: []
[WARNING|2025-12-07 09:16:34] llamafactory.data.processor.supervised:148 >> Dropped invalid example: []
[WARNING|2025-12-07 09:16:34] llamafactory.data.processor.supervised:148 >> Dropped invalid example: []
[WARNING|2025-12-07 09:16:34] llamafactory.data.processor.supervised:148 >> Dropped invalid example: []
[WARNING|2025-12-07 09:16:34] llamafactory.data.processor.supervised:148 >> Dropped invalid example: []
[WARNING|2025-12-07 09:16:34] llamafactory.data.processor.supervised:148 >> Dropped invalid example: []
[WARNING|2025-12-07 09:16:34] llamafactory.data.processor.supervised:148 >> Dropped invalid example: []
[WARNING|2025-12-07 09:16:34] llamafactory.data.processor.supervised:148 >> Dropped invalid example: []



[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/llama_fine/LLaMA-Factory/src/llamafactory/launcher.py", line 185, in <module>
[rank0]:     run_exp()
[rank0]:   File "/root/llama_fine/LLaMA-Factory/src/llamafactory/train/tuner.py", line 132, in run_exp
[rank0]:     _training_function(config={"args": args, "callbacks": callbacks})
[rank0]:   File "/root/llama_fine/LLaMA-Factory/src/llamafactory/train/tuner.py", line 93, in _training_function
[rank0]:     run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank0]:   File "/root/llama_fine/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 51, in run_sft
[rank0]:     dataset_module = get_dataset(template, model_args, data_args, training_args, stage="sft", **tokenizer_module)
[rank0]:   File "/root/llama_fine/LLaMA-Factory/src/llamafactory/data/loader.py", line 314, in get_dataset
[rank0]:     dataset = _get_preprocessed_dataset(
[rank0]:   File "/root/llama_fine/LLaMA-Factory/src/llamafactory/data/loader.py", line 271, in _get_preprocessed_dataset
[rank0]:     raise RuntimeError("Cannot find valid samples, check `data/README.md` for the data format.")
[rank0]: RuntimeError: Cannot find valid samples, check `data/README.md` for the data format.
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s][rank0]:[W1207 09:16:35.668906936 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())

Loading checkpoint shards:  25%|██▌       | 1/4 [00:01<00:04,  1.41s/it]W1207 09:16:37.200695 131050570364736 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2707322 closing signal SIGTERM
W1207 09:16:37.201630 131050570364736 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2707323 closing signal SIGTERM
W1207 09:16:37.201792 131050570364736 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2707324 closing signal SIGTERM
E1207 09:16:37.579527 131050570364736 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 2707321) of binary: /opt/conda/envs/llama_factory/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/llama_factory/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
  File "/opt/conda/envs/llama_factory/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/llama_factory/lib/python3.9/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/opt/conda/envs/llama_factory/lib/python3.9/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/opt/conda/envs/llama_factory/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/llama_factory/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/root/llama_fine/LLaMA-Factory/src/llamafactory/launcher.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-12-07_09:16:37
  host      : cc00db2e7278
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2707321)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Error analysis

From the error we can infer that after preprocessing, most (possibly all) samples were judged invalid, so every one of them was dropped.

Since the training arguments already set template: llama3, the next step is to first inspect the dataset itself, then figure out what the data looks like after LLaMA-Factory's processing that causes everything to be dropped.

Inspecting the data format

First, check whether the data format itself is wrong.

python
import json

# Read one sample
with open('/root/llama_fine/LLaMA-Factory/data/nas_train.jsonl', 'r') as f:
    sample = json.loads(f.readline())

print("Raw sample structure:")
print(json.dumps(sample, indent=2, ensure_ascii=False)[:500])

print("\nChecks:")
print(f"1. Has a 'conversations' field: {'conversations' in sample}")
print(f"2. Type of conversations: {type(sample.get('conversations'))}")
print(f"3. Length of conversations: {len(sample.get('conversations', []))}")

if 'conversations' in sample:
    conv = sample['conversations']
    print(f"\nFirst message:")
    print(f"  Fields: {conv[0].keys()}")
    print(f"  from: {conv[0].get('from')}")
    print(f"  First 100 chars of value: {conv[0].get('value', '')[:100]}")

Output:

bash
Raw sample structure:
{
  "id": "gen_8ef63e11",
  "conversations": [
    {
      "from": "system",
      "value": "\n        You are a neural network architecture design expert. Your task is to generate improved network configurations based on given constraints and insights.\n        \n\n        **Conv Type in the Search Space:**\n            1. DWSepConvBlock: Depthwise separable convolution (Depthwise + Pointwise) structure with skip connection support.\n            2. MBConvBlock: Inverted residual structure (expa
Checks:
1. Has a 'conversations' field: True
2. Type of conversations: <class 'list'>
3. Length of conversations: 7
First message:
  Fields: dict_keys(['from', 'value'])
  from: system
  First 100 chars of value: 
        You are a neural network architecture design expert. Your task is to generate improved netw

The output shows the dataset structure itself is fine. Next, check dataset_info.json under LLaMA-Factory's data folder.

json
"nas_train": {
    "file_name": "nas_train.jsonl",
    "formatting": "sharegpt",
    "columns": {
      "messages": "conversations"
    },
    "tags": {
      "role_tag": "from",
      "content_tag": "value",
      "user_tag": "human",
      "assistant_tag": "gpt",
      "system_tag": "system"
    }
  },
  "nas_val": {
    "file_name": "nas_val.jsonl",
    "formatting": "sharegpt",
    "columns": {
      "messages": "conversations"
    },
    "tags": {
      "role_tag": "from",
      "content_tag": "value",
      "user_tag": "human",
      "assistant_tag": "gpt",
      "system_tag": "system"
    }
  }

The fields and format are all correct, so next check whether tokenizing the data for the model being fine-tuned causes any problems.

python
import sys
sys.path.insert(0, '/root/llama_fine/LLaMA-Factory/src')

import json
from datasets import Dataset
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained('/root/llama_fine/llama3.1-8b')

# Read a few samples
samples = []
with open('/root/llama_fine/LLaMA-Factory/data/nas_train.jsonl', 'r') as f:
    for i, line in enumerate(f):
        if i >= 5:
            break
        samples.append(json.loads(line))

print(f"Read {len(samples)} samples")

# Build a dataset
dataset = Dataset.from_list(samples)

print(f"Dataset features: {dataset.features}")
print(f"Dataset size: {len(dataset)}")

# Access a sample directly
example = dataset[0]
print(f"\nFirst sample:")
print(f"  Keys: {list(example.keys())}")
print(f"  Number of conversations: {len(example['conversations'])}")

# Simulate LLaMA-Factory's map function
def process_func(examples):
    """Simulate LLaMA-Factory's processing"""
    results = {
        'input_ids': [],
        'labels': [],
        'attention_mask': []
    }
    
    for conversations in examples['conversations']:
        # Concatenate the whole dialogue
        text = ''
        for msg in conversations:
            role = msg['from']
            content = msg['value']
            text += f"<|{role}|> {content}\n"
        
        # Tokenize
        encoded = tokenizer(
            text,
            truncation=True,
            max_length=8192,
            padding=False,
            return_tensors=None
        )
        
        if len(encoded['input_ids']) == 0:
            print(f"  ⚠️ Sample tokenized to empty!")
            continue
        
        results['input_ids'].append(encoded['input_ids'])
        results['labels'].append(encoded['input_ids'])
        results['attention_mask'].append(encoded['attention_mask'])
    
    return results

# Test the processing
print("\nTesting...")
try:
    processed = dataset.map(
        process_func,
        batched=True,
        batch_size=1,
        remove_columns=dataset.column_names,
        num_proc=1
    )
    
    print(f"✓ Processing succeeded!")
    print(f"  Samples after processing: {len(processed)}")
    
    if len(processed) > 0:
        print(f"  input_ids length of first sample: {len(processed[0]['input_ids'])}")
    else:
        print(f"  ⚠️ No valid samples after processing!")
        
except Exception as e:
    print(f"✗ Processing failed: {e}")
    import traceback
    traceback.print_exc()

Output:

bash
Read 5 samples
Dataset features: {'id': Value('string'), 'conversations': List({'from': Value('string'), 'value': Value('string')}), 'metadata': {'memory_constraint': Value('string'), 'parent_reward': Value('float64'), 'success': Value('bool'), 'task_type': Value('string'), 'total_attempts': Value('int64')}}
Dataset size: 5
First sample:
  Keys: ['id', 'conversations', 'metadata']
  Number of conversations: 7
Testing...
Map: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 52.11 examples/s]
✓ Processing succeeded!
  Samples after processing: 5
  input_ids length of first sample: 4850

The output shows that the llama3.1-8b tokenizer can process my dataset just fine, so the error is not caused by samples being too long.

To find out what is actually failing, we need to inspect what the data looks like after LLaMA-Factory processes it, or go straight to the source code that raises the error and see why it fires.

bash
# See how LLaMA-Factory handles the ShareGPT format
cd /root/llama_fine/LLaMA-Factory
grep -r "sharegpt" src/llamafactory/data/ --include="*.py"

Output:

bash
grep -r "sharegpt" src/llamafactory/data/ --include="*.py"
src/llamafactory/data/parser.py:    formatting: Literal["alpaca", "sharegpt", "openai"] = "alpaca"
src/llamafactory/data/parser.py:    # sharegpt columns
src/llamafactory/data/parser.py:    # sharegpt tags
src/llamafactory/data/converter.py:    "sharegpt": SharegptDatasetConverter,

This narrows it down to converter.py. Next, locate the file that actually emits the warning that drops our samples.

bash
# cd into the LLaMA-Factory directory
cd /root/llama_fine/LLaMA-Factory

# List all processor-related files
# by finding the files where the "Dropped invalid" warning appears
find src/llamafactory/data -name "*.py" | xargs grep -l "Dropped invalid"


# Or jump straight to the offending lines
grep -rn "Dropped invalid example" src/llamafactory/data/

Output:

bash
src/llamafactory/data/processor/pairwise.py:77:                    "Dropped invalid example: {}".format(examples["_prompt"][i] + examples["_response"][i])
src/llamafactory/data/processor/feedback.py:92:                    "Dropped invalid example: {}".format(examples["_prompt"][i] + examples["_response"][i])
src/llamafactory/data/processor/unsupervised.py:65:                    "Dropped invalid example: {}".format(examples["_prompt"][i] + examples["_response"][i])
grep: src/llamafactory/data/processor/__pycache__/pairwise.cpython-39.pyc: binary file matches
grep: src/llamafactory/data/processor/__pycache__/supervised.cpython-39.pyc: binary file matches
grep: src/llamafactory/data/processor/__pycache__/unsupervised.cpython-39.pyc: binary file matches
grep: src/llamafactory/data/processor/__pycache__/feedback.cpython-39.pyc: binary file matches
src/llamafactory/data/processor/supervised.py:95:                    "Dropped invalid example: {}".format(examples["_prompt"][i] + examples["_response"][i])
src/llamafactory/data/processor/supervised.py:138:                    "Dropped invalid example: {}".format(examples["_prompt"][i] + examples["_response"][i])

At this point the source of the error can be pinned down.

Pinpointing the problem

The check at lines 95 and 138 of supervised.py:

python
if len(examples["_prompt"][i]) % 2 != 1 or len(examples["_response"][i]) != 1:
    logger.warning_rank0(
        "Dropped invalid example: {}".format(examples["_prompt"][i] + examples["_response"][i])
    )
    continue

For a sample to survive, this condition requires:

  1. the length of _prompt must be odd (the sample is dropped when `% 2 != 1`, i.e. when it is even)
  2. the length of _response must be exactly 1
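
As a sanity check, here is a minimal sketch of that rule on plain role lists (is_valid is a hypothetical helper for illustration, not LLaMA-Factory code):

python
# Hypothetical helper mirroring the supervised.py check:
# a sample survives only if the prompt length is odd and the response length is 1.
def is_valid(prompt, response):
    return len(prompt) % 2 == 1 and len(response) == 1

print(is_valid(['human', 'gpt', 'human'], ['gpt']))           # True: odd prompt, single response
print(is_valid(['human', 'gpt', 'human', 'human'], ['gpt']))  # False: even prompt -> dropped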

Now look at what SharegptDatasetConverter does (in converter.py):

python
# Line 159-160 (normal example)
prompt = aligned_messages[:-1]
response = aligned_messages[-1:]

Key finding:

My data has 7 messages:

[system, human, gpt, human, human, gpt, human]

After SharegptDatasetConverter processes it:

  • aligned_messages = all 7 messages (after role conversion)
  • prompt = aligned_messages[:-1] = the first 6 messages
  • response = aligned_messages[-1:] = the last message

Problem: the last message is from human, not the assistant.

Moreover, prompt has 6 messages (an even number), which fails the check!
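
A quick sketch reproducing the converter's split on the role sequence above makes the failure obvious:

python
# Reproduce the prompt/response split on the roles of sample 1 (sketch only)
msgs = ['system', 'human', 'gpt', 'human', 'human', 'gpt', 'human']
prompt, response = msgs[:-1], msgs[-1:]
print(len(prompt), prompt)   # 6 (even) -> fails the odd-length prompt check
print(response)              # ['human'] -> the response is not an assistant turn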

Verifying the problem

python
import json

with open('/root/llama_fine/LLaMA-Factory/data/nas_train.jsonl', 'r') as f:
    for i, line in enumerate(f):
        if i >= 10:
            break
        
        sample = json.loads(line)
        conversations = sample['conversations']
        
        # Role of the last message
        last_role = conversations[-1]['from']
        
        # Collect all roles
        roles = [msg['from'] for msg in conversations]
        
        print(f"Sample {i+1} ({sample['id']}):")
        print(f"  Role sequence: {' -> '.join(roles)}")
        print(f"  Last role: {last_role}")
        print(f"  Total: {len(conversations)}, parity: {'odd' if len(conversations) % 2 == 1 else 'even'}")
        
        # Check against LLaMA-Factory's requirements:
        # after removing system, messages should alternate user/assistant and end with assistant
        msgs_without_system = [msg for msg in conversations if msg['from'] != 'system']
        
        if len(msgs_without_system) > 0:
            last_without_system = msgs_without_system[-1]['from']
            print(f"  Last role without system: {last_without_system}")
            print(f"  Total without system: {len(msgs_without_system)}")
            
            # LLaMA Factory expects an odd-length prompt (user-assistant-...-user),
            # meaning an even number of messages precede the final assistant response,
            # so the total message count (without system) should be even!
            if last_without_system != 'gpt':
                print(f"  ⚠️ Last message is not assistant (gpt)!")
            
            if len(msgs_without_system) % 2 != 0:
                print(f"  ⚠️ Message count is odd; LLaMA Factory expects even!")
        
        print()

Output:

bash
Sample 1 (gen_8ef63e11):
  Role sequence: system -> human -> gpt -> human -> human -> gpt -> human
  Last role: human
  Total: 7, parity: odd
  Last role without system: human
  Total without system: 6
  ⚠️ Last message is not assistant (gpt)!

Sample 2 (gen_6f3a89f0):
  Role sequence: system -> human -> gpt -> human
  Last role: human
  Total: 4, parity: even
  Last role without system: human
  Total without system: 3
  ⚠️ Last message is not assistant (gpt)!
  ⚠️ Message count is odd; LLaMA Factory expects even!

Sample 3 (gen_dd0dccb3):
  Role sequence: system -> human -> gpt -> human -> human -> human -> gpt -> human -> human -> gpt -> human
  Last role: human
  Total: 11, parity: odd
  Last role without system: human
  Total without system: 10
  ⚠️ Last message is not assistant (gpt)!

Sample 4 (gen_1b3b77a2):
  Role sequence: system -> human -> gpt -> human -> human -> gpt -> human -> human
  Last role: human
  Total: 8, parity: even
  Last role without system: human
  Total without system: 7
  ⚠️ Last message is not assistant (gpt)!
  ⚠️ Message count is odd; LLaMA Factory expects even!

Sample 5 (gen_bd98c230):
  Role sequence: system -> human -> gpt -> human -> human -> gpt -> human
  Last role: human
  Total: 7, parity: odd
  Last role without system: human
  Total without system: 6
  ⚠️ Last message is not assistant (gpt)!

Sample 6 (gen_7f6cf248):
  Role sequence: system -> human -> gpt -> human -> human -> gpt -> human
  Last role: human
  Total: 7, parity: odd
  Last role without system: human
  Total without system: 6
  ⚠️ Last message is not assistant (gpt)!

Sample 7 (gen_0b03afae):
  Role sequence: system -> human -> gpt -> human -> human
  Last role: human
  Total: 5, parity: odd
  Last role without system: human
  Total without system: 4
  ⚠️ Last message is not assistant (gpt)!

Sample 8 (gen_3f6912ca):
  Role sequence: system -> human -> gpt -> human -> human
  Last role: human
  Total: 5, parity: odd
  Last role without system: human
  Total without system: 4
  ⚠️ Last message is not assistant (gpt)!

Sample 9 (gen_172829d1):
  Role sequence: system -> human -> gpt -> human -> human
  Last role: human
  Total: 5, parity: odd
  Last role without system: human
  Total without system: 4
  ⚠️ Last message is not assistant (gpt)!

Sample 10 (gen_3d6fbf98):
  Role sequence: system -> human -> gpt -> human -> human
  Last role: human
  Total: 5, parity: odd
  Last role without system: human
  Total without system: 4
  ⚠️ Last message is not assistant (gpt)!

Now the picture is clear. Two problems need fixing: every conversation must end with an assistant (gpt) message, and the messages must strictly alternate, human -> gpt -> human -> gpt.

So the dataset needs preprocessing.

Wherever the turns do not alternate, merge the consecutive same-role messages.

That is, [gpt, human, human, gpt]

is merged into [gpt, human, gpt].

In addition, the last turn of each conversation must end up with the assistant role (here, the trailing evaluation feedback mislabeled as human is relabeled to gpt).

python
import json

def merge_consecutive_messages(conversations):
    """Merge consecutive messages that share the same role."""
    if not conversations:
        return []
    
    merged = []
    current_role = None
    current_values = []
    
    for msg in conversations:
        role = msg['from']
        value = msg['value']
        
        if role == current_role:
            # Same role: accumulate the content
            current_values.append(value)
        else:
            # Role changed: flush what has accumulated so far
            if current_role is not None:
                merged.append({
                    'from': current_role,
                    'value': '\n\n'.join(current_values)
                })
            
            # Start a new role
            current_role = role
            current_values = [value]
    
    # Flush the last group
    if current_role is not None:
        merged.append({
            'from': current_role,
            'value': '\n\n'.join(current_values)
        })
    
    return merged


def process_file(input_file, output_file):
    processed = 0
    total = 0
    skipped = 0
    merged_count = 0
    
    with open(input_file, 'r') as f_in, open(output_file, 'w') as f_out:
        for line in f_in:
            total += 1
            sample = json.loads(line)
            conversations = sample['conversations']
            
            # 1. Relabel evaluation feedback as the gpt role
            for msg in conversations:
                if msg['from'] == 'human' and 'EVALUATION COMPLETE' in msg['value']:
                    msg['from'] = 'gpt'
            
            # 2. Separate the system message from the rest
            system_msg = None
            other_msgs = []
            
            for msg in conversations:
                if msg['from'] == 'system':
                    system_msg = msg
                else:
                    other_msgs.append(msg)
            
            # 3. Merge consecutive same-role messages
            original_len = len(other_msgs)
            other_msgs = merge_consecutive_messages(other_msgs)
            
            if len(other_msgs) < original_len:
                merged_count += 1
            
            # 4. Rebuild conversations (system + the rest)
            if system_msg:
                conversations = [system_msg] + other_msgs
            else:
                conversations = other_msgs
            
            # 5. Validate the format
            valid = True
            
            if len(other_msgs) < 2:
                valid = False
            elif other_msgs[-1]['from'] != 'gpt':
                valid = False
            else:
                # Verify human-gpt alternation
                for i, msg in enumerate(other_msgs):
                    expected = 'human' if i % 2 == 0 else 'gpt'
                    if msg['from'] != expected:
                        valid = False
                        break
            
            # 6. Write out
            if valid:
                sample['conversations'] = conversations
                f_out.write(json.dumps(sample, ensure_ascii=False) + '\n')
                processed += 1
            else:
                skipped += 1
                if skipped <= 5:
                    print(f"  ⚠️  Skipped sample {sample['id']}: failed format validation")
                    roles = ' -> '.join([msg['from'] for msg in other_msgs])
                    print(f"      Role sequence: {roles}")
    
    return total, processed, skipped, merged_count

# Process the training set
print("Processing the training set...")
total, processed, skipped, merged = process_file(
    '/root/llama_fine/LLaMA-Factory/data/nas_train.jsonl',
    '/root/llama_fine/LLaMA-Factory/data/nas_train_fixed.jsonl'
)
print(f"  Total samples: {total}")
print(f"  Processed: {processed}")
print(f"  Skipped: {skipped}")
print(f"  Samples with merged messages: {merged}")

# Process the validation set
print("\nProcessing the validation set...")
total, processed, skipped, merged = process_file(
    '/root/llama_fine/LLaMA-Factory/data/nas_val.jsonl',
    '/root/llama_fine/LLaMA-Factory/data/nas_val_fixed.jsonl'
)
print(f"  Total samples: {total}")
print(f"  Processed: {processed}")
print(f"  Skipped: {skipped}")
print(f"  Samples with merged messages: {merged}")

print("\n✓ Done!")


Merge example

Original dialogue:

system: You are an architecture design expert...
human: Generate a configuration...
gpt: {"config": ...}
gpt: FAILURE: DUPLICATE  ← evaluation feedback relabeled as gpt
human: Generate a different configuration...
human: (contains the earlier feedback)  ← consecutive human messages
gpt: {"config": ...}
gpt: SUCCESS: Memory 14.45 MB  ← evaluation feedback relabeled as gpt

After merging:

system: You are an architecture design expert...
human: Generate a configuration...
gpt: {"config": ...}

FAILURE: DUPLICATE  ← merged into the previous gpt turn
human: Generate a different configuration...
(contains the earlier feedback)  ← two human messages merged
gpt: {"config": ...}

SUCCESS: Memory 14.45 MB  ← merged into the previous gpt turn
Finally, update dataset_info.json: only the data file paths need to change, pointing file_name at the fixed files.
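
For example, the nas_train entry would simply point at the fixed file (nas_val is updated the same way; nothing else changes):

json
"nas_train": {
    "file_name": "nas_train_fixed.jsonl",
    "formatting": "sharegpt",
    "columns": {
      "messages": "conversations"
    },
    "tags": {
      "role_tag": "from",
      "content_tag": "value",
      "user_tag": "human",
      "assistant_tag": "gpt",
      "system_tag": "system"
    }
  },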
