Paper Design and Writing (1)

Contents

- Core takeaways from the two links
- I. Repositioning the paper's contributions
  - Titles better suited to the NeurIPS E&D Track
  - The core contribution to lead with
  - Three to four revised contribution statements
- II. Restructuring the paper for the NeurIPS E&D Track
  - 1. Introduction
  - 2. Related Work
  - 3. Benchmark Construction
    - 3.1 Dataset registry
    - 3.2 Disease and label ontology
    - 3.3 Split protocol
    - 3.4 Data license and access
  - 4. Evaluation Framework
  - 5. Experiments
  - 6. Analysis
  - 7. Limitations & Ethics
  - 8. Benchmark Release
- III. Redesigning the experimental program
  - Model family design
- IV. A genuinely novel benchmark design
  - Benchmark name
  - Task Suite
  - Overall scoring scheme
  - Multiple leaderboards
- V. Answering the old reviews point by point
- VI. Key conclusions the final paper should reach
- VII. Executable experiment plan
  - P0: required experiments that make the paper stand
  - P1: high-novelty experiments that raise the NeurIPS acceptance odds
  - P2: bonus experiments, if time allows
- VIII. Figure and table design: 12 key figures/tables
- IX. Final writing strategy
- X. Condensed version
  - 1. Recommended paper title
  - 2. One-sentence thesis
  - 3. Three core contributions
  - 4. Main paper structure
  - 5. Required experiment checklist
  - 6. The most compelling novelty
  - 7. Why the paper fits the NeurIPS E&D Track

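You are now my NeurIPS Evaluations & Datasets Track paper advisor, domain expert, and demanding reviewer. Do not just polish the text, and do not confine yourself to the structure of my old paper. Start instead from the question "how can this become an evaluation and dataset paper with real scientific value that convinces NeurIPS E&D Track reviewers?" and help me redesign the whole paper.
Before you start designing, you must first read and summarize the requirements and orientation of the following two links, then map those requirements explicitly onto my paper design:

Official introduction to the NeurIPS 2026 Evaluations & Datasets Track: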

https://blog.neurips.cc/2026/03/23/introducing-the-evaluations-datasets-track-at-neurips-2026/

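Please focus on extracting:

why the track was renamed from Datasets & Benchmarks to Evaluations & Datasets;

what "evaluation itself as a scientific object of study" actually means;

what kind of evaluation paper reviewers will expect;

what kind of benchmark / dataset paper will be judged an insufficient contribution;

how the paper should reflect the scientific question, evaluation protocol, reproducibility, data and code release, limitations, responsible AI, and societal impact.

Important requirement: your answer must begin with a table that aligns three columns: "NeurIPS E&D Track requirement / gap in my old paper / how the new paper should close it". Only then move on to the paper title, storyline, experiments, and writing design.

A comparable evaluation paper that I am using as a reference: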

https://arxiv.org/abs/2510.09872

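Please focus on extracting:

how this paper defines a benchmark;

what is worth borrowing from its task design, evaluation protocol, leaderboard, and reproducible toolchain;

why it is more than a display of model results;

how my speech disease evaluation paper can borrow its "evaluation framework" thinking without copying its task content.

Information about my old paper: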

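Old title: A Unified Benchmark for Speech-Based Disease Diagnosis

Proposed new title: A Unified Benchmark and Evaluation Framework for Speech-Based Disease Diagnosis

Background: Speech-based disease detection is highly fragmented. Much of the work targets a single disease or a single dataset: dementia models are evaluated only on dementia datasets, and respiratory disease models only on cough/breath datasets. There is no unified evaluation benchmark across diseases, and no analysis of generalization across diseases, datasets, and acquisition conditions. This holds back a future "unified speech disease monitoring foundation model", or "speech health foundation model".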

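Data resources I have collected and curated so far:

27 public speech-disease datasets
8 major disease categories covered
425,075 samples in total
769.4 hours of speech in total
Disease categories include speech disorders, dysarthria, Alzheimer's, Parkinson's, respiratory disorders, heart/lung sounds, rare diseases, and psychological disorders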

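Experiments in the old version:

Ran MLP, CNN, Wav2Vec2.0, Mantis, and similar baselines separately on each dataset / disease category.
Merged all datasets and ran full multi-class disease classification.
Reported mainly accuracy, precision, and F1.

The old version was rejected at AAAI/ICASSP-type venues. The main review comments were:

The methods are conventional and the novelty is insufficient.
The experimental design has flaws and lacks scientific rigor and logic.
It reads as a results display or model bake-off rather than a real benchmark.
Data download, data licenses, a standard evaluation protocol, evaluation code, and a reproducibility plan are not clearly provided.
Merging different datasets is not sufficiently motivated.
Diseases differ widely in features, sample sizes, and acquisition conditions, so comparing them uniformly with a few models may be unfair.
Resampling may have negative effects and needs a more systematic analysis.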

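I now want to submit to the NeurIPS 2026 Evaluations & Datasets Track. Please build closely on the track's current positioning, evaluation itself as a scientific object of study, rather than on submitting only a dataset or a results table. Help me replan the paper so that it reads as an "evaluation science paper" rather than an ordinary application paper.

Please complete the following tasks:

I. Reposition the paper's contributions

Please give 3 to 5 paper titles better suited to the NeurIPS E&D Track. The titles should emphasize: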

speech health / speech-based disease diagnosis
unified evaluation
cross-disease generalization
robustness / distribution shift / fairness / calibration / open-set detection
evaluation framework rather than only dataset collection

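Please tell me plainly: what core contribution should this paper lead with? Do not make "27 datasets + a few baselines" the only contribution; upgrade it into an evaluation framework built around a scientific question. Give the paper thesis in one sentence, for example: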

"Current speech disease models appear strong under within-dataset evaluation but fail under clinically realistic cross-dataset, cross-disease, and open-set evaluation; therefore we introduce a multi-axis evaluation framework to measure whether a model is truly ready for speech health foundation modeling."

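II. Design the paper outline to NeurIPS E&D Track standards

Please design the complete paper structure, including:

Introduction: how to articulate the pain point, why current single-disease evaluation misleads progress in the field, and why a future unified speech disease monitoring foundation model needs an evaluation yardstick.
Related Work: which directions to cover, e.g. speech health, medical audio benchmark, foundation model evaluation, OOD generalization, benchmark design, dataset documentation.
Benchmark Construction: how to describe the integration of the 27 datasets, the label ontology, data licenses, download methods, metadata, Croissant metadata, Responsible AI fields, and data cards / evaluation cards.
Evaluation Framework: this is the core of the paper. Please design a multi-dimensional evaluation framework rather than reporting only classification accuracy.
Experiments: give a concrete experiment matrix and state which scientific question each experiment answers.
Analysis: which in-depth analyses are needed to move from "results display" to "evaluation research".
Limitations & Ethics: how to write about medical speech, privacy, language/age/device bias, the fact that models cannot replace clinical judgment, public data licenses, and so on.
Benchmark Release: what to release, including dataset index, download scripts, preprocessing pipeline, evaluation harness, leaderboard, model submission format, metadata schema.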

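III. Redesign the experimental program

Please do not just suggest "run more models". Design an experimental program that is logical, driven by scientific questions, and novel. For every experiment, state:

What is the research question?
Why is this experiment valuable for the NeurIPS E&D Track?
How exactly is it carried out?
Which metrics are used?
What conclusion is expected?
If the results fall short of expectations, how should they be explained?

Include at least the following directions, and extend them where useful:

A. Within-dataset evaluation

Conventional single-dataset training/testing, used to align with prior literature but not as a core contribution.

B. Cross-dataset generalization

Leave-one-dataset-out within the same disease category: train on several datasets and test on an unseen one. This evaluates whether the model has learned a disease representation rather than dataset/device/language bias.

C. Cross-disease transfer

Train on some disease categories and test or fine-tune on others. This answers whether a transferable speech health representation exists.

D. Unified disease taxonomy classification

Build a hierarchical label system: healthy vs abnormal, disease family, specific dataset label. Evaluate coarse-to-fine diagnostic ability.

E. Open-set / unknown disease detection

Only some categories are seen during training and unseen diseases appear at test time; evaluate whether the model can reject an unknown disease. This matters greatly for real clinical deployment.

F. Domain shift and dataset bias diagnosis

Analyze whether the model predicts the disease or instead predicts the dataset source, recording device, language, task type, sampling rate, or speech content. Possible designs: dataset-ID prediction, label leakage analysis, shortcut learning analysis.

G. Calibration and clinical reliability

Look beyond F1 to ECE, Brier score, AUROC, AUPRC, selective prediction, and the risk-coverage curve. In medical evaluation it is critical that "the model knows when it is uncertain".

H. Few-shot / low-resource adaptation

Give each new disease or new dataset only 1%, 5%, or 10% labeled data and evaluate how quickly the model adapts. This matches foundation model evaluation.

I. Robustness stress tests

Add noise, compression, resampling, truncation, silence, and varying segment lengths, and evaluate model stability.

J. Subgroup fairness / stratified analysis

Where metadata allows, analyze by age, sex, language, device, speech task type, and similar groups. Where metadata is incomplete, still design a metadata completeness analysis.

K. Evaluation protocol sensitivity

Systematically compare how different split, resampling, metric, and aggregation choices change model rankings. This experiment fits "evaluation as scientific object of study" particularly well. The goal is to show that a careless evaluation protocol leads to wrong scientific conclusions.

L. Model family comparison

Do not compare only MLP/CNN/Wav2Vec2/Mantis. Please suggest which model families to add:

handcrafted acoustic features + classical ML
CNN / CRNN / Transformer
self-supervised speech models: wav2vec 2.0, HuBERT, WavLM, Whisper encoder
audio foundation models: BEATs, CLAP, AudioMAE, etc.
time-series foundation models
possible speech LLM / multimodal models if appropriate
Also state which are evaluated by frozen probing, which by fine-tuning, and which by parameter-efficient tuning.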

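IV. Propose a genuinely novel benchmark design

Please design a new benchmark name and task system, for example:

SpeechHealth-Eval, UniSpeechDisease-Eval, VocalHealthBench, SpeechDx-Eval, etc.

Please propose an overall scoring scheme that does not rely on average F1 alone. You may design a composite score, for example:

Speech Health Generalization Score = within-dataset score + cross-dataset score + open-set score + calibration score + robustness score + fairness score

Do not simply pile up metrics, though. Explain which clinical or scientific capability each metric represents, how the weights are set, and whether multiple leaderboards are provided.

Please consider setting up multiple leaderboards: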

Standard supervised leaderboard
Frozen foundation model leaderboard
Cross-dataset generalization leaderboard
Open-set detection leaderboard
Low-resource adaptation leaderboard
Robustness/calibration leaderboard

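V. Answer the review comments point by point

Please turn the old review comments into design principles for the new paper:

"The method is not novel" → make clear that the paper does not propose a model; it proposes new evaluation-science questions, protocols, and tools.
"It is only a results display" → turn it into multi-dimensional evaluation, protocol sensitivity analysis, and failure mode analysis.
"No data download or evaluation scheme" → design a dataset index, download scripts, license table, Croissant metadata, evaluation harness, and leaderboard.
"The motivation for merging datasets is unclear" → re-argue it through cross-dataset generalization, label ontology, and clinical deployment shift.
"Resampling has side effects" → add a resampling sensitivity experiment and imbalance-aware metrics.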

VI. Tell me which key conclusions the final paper should reach

Please give 5 to 8 core findings that the paper could ideally reach, for example:

within-dataset performance overestimates real deployment performance;
models often learn dataset identity shortcuts;
speech SSL models are not automatically better for disease diagnosis;
calibration is poor even when F1 is high;
open-set disease rejection remains unsolved;
low-resource adaptation reveals whether representations are clinically reusable;
evaluation protocol choices can reverse model rankings;
a unified evaluation suite is necessary for future speech health foundation models.

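VII. Give me an executable experiment plan

Please give me a prioritized execution roadmap:

Required experiments: make the paper stand
High-novelty experiments: raise the NeurIPS acceptance odds
Bonus experiments: only if time allows
For each experiment, estimate the required data processing, model training, metric computation, and figure/table presentation.

VIII. Give me the figure and table design

Please design 8 to 12 key figures/tables, including: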

benchmark overview figure
dataset taxonomy and label ontology
evaluation framework diagram
dataset statistics table
task/leaderboard definition table
main result table
cross-dataset generalization heatmap
protocol sensitivity ranking plot
calibration/reliability diagram
robustness curve
open-set detection results
failure mode / dataset shortcut analysis

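IX. Give me the final writing strategy

Tell me how the abstract, introduction, and contribution statements should be written so that reviewers see at first glance that:

this is not a simple "I collected many datasets and ran models" paper, but a paper that "builds the evaluation yardstick for future unified speech health foundation models".

X. Finish with a condensed version

Finally, please give me:

The recommended paper title
A one-sentence thesis
Three core contributions
The main paper structure
The required experiment checklist
The most compelling novelty
Why the paper fits the NeurIPS E&D Track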

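| NeurIPS E&D Track requirement / orientation | Gap in my old paper | How the new paper should close it |
|---|---|---|
| The rename from Datasets & Benchmarks to Evaluations & Datasets signals "evaluation first". Evaluation is no longer just an experimental step but an object of study: what is measured, under which assumptions, and how results are interpreted directly determine the scientific conclusion. (NeurIPS blog) | The old draft still reads as "I collected 27 datasets and then ran MLP/CNN/Wav2Vec2/Mantis". Its weight sits on data scale and baseline tables rather than on the evaluation question itself. The 27 datasets, 8 disease categories, 425,075 samples, and 769.4 hours of speech are real, but they are raw material, not a sufficient E&D contribution. | Recast the main question as: do the evaluation conclusions of existing speech disease models still hold across datasets, across diseases, on unknown diseases, and under calibration, robustness, and fairness tests? Dataset integration serves this evaluation-science question. |
| E&D papers need not propose a new model or beat a baseline; they can advance understanding through evaluation protocols, audits, stress tests, negative results, and failure mode analysis. (NeurIPS blog) | The old draft's main narrative was "which model has the higher accuracy/F1", which reviewers easily read as a bake-off. Reviewers explicitly questioned the point of Table 2: a benchmark should consist of data, labels, standard processing, and measures, not of who wins. | Make the main contribution a multi-axis evaluation framework + executable benchmark suite + protocol sensitivity audit. Model results serve only to expose evaluation failure modes. |
| Dataset papers must state which evaluative claims the data supports, under which assumptions those claims hold, and where the boundaries lie. Releasing a large data collection without a problem formulation, evaluation setup, and interpretive boundaries is considered an insufficient contribution. (NeurIPS blog) | The motivation for "merging different datasets" is still weak. Reviewers noted that public data + conventional models + an unclear reason for merging does not show value to the community. | State the claim each task supports: within-dataset supports only "in-domain recognition"; cross-dataset supports "generalization across acquisition conditions"; open-set supports "unknown disease rejection"; calibration supports "clinical reliability"; protocol sensitivity supports "robustness of evaluation conclusions". |
| The official requirements call for datasets and code to be properly hosted, accessible, and documented. When a benchmark or evaluation tool is the executable artifact, the code must run at submission time. Data additionally needs Croissant core + Responsible AI fields. (NeurIPS) | The most damaging reviewer questions were "Where do we get the data? Under what licensing terms?" and whether a unified database / benchmark is actually provided rather than a set of reported results. | Release the dataset index, license table, download scripts, checksums, split manifests, preprocessing pipeline, evaluation harness, model submission schema, Croissant metadata, RAI fields, and data cards / evaluation cards. For data that cannot be redistributed, release only lawful download scripts and the index. |
| A benchmark is a standardized, reusable evaluation setup, not "several datasets + one results table". The lesson from WARC-Bench: define a capability that existing benchmarks do not cover, then provide the environment, task goals, a deterministic evaluator, and a reproducible toolchain. (ar5iv) | The old draft never broke "what capabilities should a speech health foundation model have" into measurable tasks. The merged 68-class classification is too crude and looks like an artificial concatenation. | Design SpeechHealth-Eval around "clinical deployment capability", with tasks for known-disease classification, cross-dataset generalization, cross-disease transfer, open-set rejection, calibration, robustness, fairness, and protocol fragility. |
| E&D encourages comparing evaluation designs and showing that different assumptions lead to different conclusions. (NeurIPS) | The old draft used a uniform 70/15/15 split, 16 kHz resampling, and oversampling/downsampling, without showing systematically that these protocol choices leave model rankings unchanged. Old reviews also noted that resampling may have negative effects. | Add Evaluation Protocol Sensitivity: compare patient-level vs random split, native sampling vs 16 kHz, oversampling vs class weighting, and macro vs weighted aggregation; report rank reversal rate / Kendall τ / protocol-induced score variance. |
| Responsible AI, limitations, and societal impact are not appendix decoration but boundary conditions of the evaluation claims: medical data, privacy, bias, licensing, and the inability to replace clinical judgment must all appear in the main text. Croissant RAI fields are also officially required. (NeurIPS) | The old draft says too little about age, sex, language, device, task type, acquisition conditions, differences in disease definitions, and public data licenses. | Add a Responsible Speech Health Evaluation subsection: metadata completeness, subgroup fairness, privacy/consent/license audit, clinical non-diagnosis disclaimer, misuse risks, dataset representativeness. |
| WARC-Bench is more than a display of model results: it first identifies that existing GUI benchmarks lack the "intermediate subtasks" capability layer, then defines task constraints, replayable environments, a programmatic evaluator, a train/dev/test structure, and behavioral analysis. (ar5iv) | The old draft does not show "which key capabilities existing SDD evaluation misses". | The paper must show that within-dataset SDD evaluation misses the capabilities that matter most in deployment: cross-dataset generalization, unknown disease rejection, reliable uncertainty, resistance to acquisition perturbations, fairness, and protocol robustness. |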

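Core takeaways from the two links

The change in the NeurIPS 2026 E&D Track essentially promotes "the evaluation method itself" from a supporting role in experiments to a scientific object. The official definition of evaluation covers processes, practices, tools, resources, datasets, benchmarks, auditing, red-teaming, metrics, interaction protocols, and study designs. A good E&D paper therefore studies "which kind of evaluation can support which kind of conclusion" rather than only reporting a set of model scores. The organizers also state that a dataset must describe its scope, assumptions, and limitations, and how it supports evaluative claims; releasing a large data collection without a problem formulation and interpretive boundaries will not carry a review. (NeurIPS blog)

What is worth borrowing from WARC-Bench is not the web task itself but its benchmark thinking. It first defines a capability layer that existing evaluations miss: GUI subtasks, short-horizon tasks that sit between single-step grounding and long-horizon web navigation and typically take 1 to 20 atomic UI actions. It then defines the benchmark as environment + goal + evaluator rather than only as data inputs and labels. It uses replayable WARC environments, natural-language task goals, a programmatic deterministic evaluator, a dev/test split, and a distinction between real and synthetic tasks, and it uses behavioral analysis to explain why models fail. (ar5iv)

Your paper should borrow its "evaluation framework", not its task format: every task in SpeechHealth-Eval also needs a well-defined input domain, train/test distribution, permitted adaptation methods, prediction format, deterministic evaluator, interpretable claim boundary, and leaderboard.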

I. Repositioning the paper's contributions

Titles better suited to the NeurIPS E&D Track

I recommend the first one most.

1. SpeechHealth-Eval: A Multi-Axis Evaluation Framework for Generalization, Reliability, and Open-Set Recognition in Speech-Based Disease Assessment
2. Beyond Within-Dataset Accuracy: Evaluating Cross-Disease Generalization and Reliability in Speech Health Models
3. VocalHealthBench: A Unified Benchmark for Robust, Calibrated, and Open-Set Speech-Based Disease Evaluation
4. SpeechDx-Eval: Stress-Testing Speech Disease Models Across Datasets, Diseases, and Clinical Distribution Shifts
5. Are Speech Disease Models Ready for Foundation-Scale Health Monitoring? A Unified Evaluation Framework Across 27 Public Datasets

I suggest using "diagnosis" in the title with caution. In a medical setting, NeurIPS reviewers will prefer more responsible wording such as assessment / screening / monitoring / evaluation. The main text can explain that the tasks come from the speech-based disease diagnosis literature, but the benchmark does not claim to replace clinical diagnosis.

The core contribution to lead with

Do not lead with "27 datasets + a few baselines". That only answers "who scores highest on this data". The E&D Track wants to see: which new evaluation question did you define? Which established conclusions in the field can this evaluation overturn or correct? How do your tools let future models be compared fairly, reproducibly, and auditably?

The new core contribution should be:

A unified, executable, multi-axis evaluation framework that tests whether speech health models generalize beyond dataset-specific shortcuts and remain reliable under clinically realistic distribution shifts.

One-sentence thesis:

Current speech disease models appear strong under within-dataset evaluation, but their claims are fragile under cross-dataset, cross-disease, open-set, calibration, robustness, and protocol-sensitivity tests; therefore we introduce SpeechHealth-Eval, a multi-axis evaluation framework for assessing whether models are truly ready for speech health foundation modeling.

Three to four revised contribution statements

1. Evaluation formulation contribution

   We formalize speech-based disease assessment as a set of evaluative claims rather than a single classification task: known-disease recognition, cross-dataset generalization, cross-disease transfer, unknown-disease rejection, clinical reliability, robustness, fairness, and protocol stability.

2. Benchmark artifact contribution

   We release a documented benchmark suite over 27 public datasets with a harmonized disease ontology, dataset index, license/access table, metadata schema, split manifests, preprocessing scripts, Croissant metadata, Responsible AI fields, and an executable evaluation harness.

3. Empirical evaluation science contribution

   We show that standard within-dataset evaluation can overestimate deployment readiness, and that model rankings can change under cross-dataset, open-set, calibration, robustness, and protocol choices.

4. Diagnostic analysis contribution

   We audit dataset shortcuts, disease-dataset confounding, resampling sensitivity, metadata incompleteness, subgroup performance, and failure modes, turning the paper from a model bake-off into a study of how SDD evaluation should be conducted.

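II. Restructuring the paper for the NeurIPS E&D Track

1. Introduction

Do not open the introduction with "we collected 27 datasets". Open with the evaluation illusion:

Paragraph 1: the vision. Speech is a low-cost, non-invasive signal for health monitoring; speech health foundation models may emerge for continuous screening, remote health monitoring, and risk assessment for neurodegenerative disease, respiratory disease, and mental health.

Paragraph 2: the problem in the field. Most existing work evaluates on a single disease, a single dataset, and a single acquisition condition. The resulting within-dataset accuracy may well measure dataset identity, recording device, language, task type, patient population, sampling rate, or label definition rather than a transferable disease representation.

Paragraph 3: the evaluation-science question.
What does it mean to claim that a speech model can diagnose or assess disease?

A valid claim must at least answer: does it generalize across datasets? Can it handle unseen diseases? Is it calibrated? Is it robust to noise, compression, and resampling? Is it fair across age, sex, and language? Do changes in the evaluation protocol change the conclusion?

Paragraph 4: your approach.

Introduce SpeechHealth-Eval: the 27 public datasets are only the foundation; the real contribution is the multi-dimensional tasks, standard protocols, deterministic evaluator, leaderboards, release package, and systematic failure analysis.

Paragraph 5: a preview of the core findings.

For example: high within-dataset scores do not predict cross-dataset scores; dataset shortcuts are pronounced; SSL speech models do not consistently beat acoustic baselines; calibration is generally poor; open-set rejection is hard; protocol choices can reverse model rankings.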

2. Related Work

Organize this section into six groups rather than listing literature disease by disease.

1. Speech-based health and disease assessment

   Alzheimer's, Parkinson's, dysarthria, respiratory/cough, depression, speech disorders, rare diseases.

2. Medical audio and clinical acoustic benchmarks

   This covers cough/breath/lung/heart sounds. Explain here whether heart/lung sounds belong to the core speech benchmark. My suggestion is to place them in an Extended Clinical Audio Track and keep them out of the main benchmark score.

3. Foundation model evaluation for speech/audio

   wav2vec 2.0, HuBERT, WavLM, Whisper encoder, BEATs, CLAP, AudioMAE, AST, PaSST, and others, treated as representation families rather than as the paper's protagonists.

4. OOD generalization and domain shift in healthcare AI

   Shift across hospitals, devices, populations, and task types; patient-level split; site-level split; shortcut learning.

5. Calibration, selective prediction, and open-set recognition

   A medical model must not only predict but also know when it is uncertain and when to reject.

6. Benchmark design, data documentation, and responsible datasets

   Data Cards, Model Cards, Datasheets, Croissant metadata, evaluation cards, benchmark auditing, protocol sensitivity.

3. Benchmark Construction

This section should read like a benchmark paper, not like a data inventory.

3.1 Dataset registry

Build a unified registry for the 27 datasets:

| Field | Content |
|---|---|
| dataset_id | Unique ID |
| disease_family | Alzheimer's / Parkinson's / Dysarthria / Respiratory / Psychological / Speech Disorder / Rare / HLS (heart/lung sounds) |
| modality | speech / voice / cough / breath / lung sound / heart sound |
| task_type | sustained vowel / read speech / spontaneous speech / cough / breathing / auscultation |
| language | Chinese, English, Italian, Spanish, etc. |
| recording_device | phone / microphone / stethoscope / unknown |
| sampling_rate | native rate + standardized rate |
| subject_count | patient-level count; a sample count alone is not enough |
| sample_count / hours | number of samples and total duration |
| label_space | original labels and mapped labels |
| demographics | age / sex / language / region |
| license | license type, redistribution allowed?, commercial use? |
| access | direct download / gated / request / credentialed |
| checksum | reproducibility |
| missing_metadata | which fields are missing |

The old draft already has the statistical basis (8 disease categories, 27 datasets, 425,075 samples, 769.4 hours), but the new version must turn it into a machine-readable registry + claim-aware documentation.
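
As an illustration of what "machine-readable" could mean here, the sketch below models one registry entry as a Python dataclass and derives the `missing_metadata` field automatically. The class name `DatasetRecord`, the exact field names and types, the `completeness` helper, and the example record are assumptions made for illustration; they are not a schema prescribed by the plan above.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class DatasetRecord:
    """One registry row; optional fields default to None, meaning 'not documented'."""
    dataset_id: str
    disease_family: str
    modality: str
    task_type: str
    language: Optional[str] = None
    recording_device: Optional[str] = None
    native_sampling_rate_hz: Optional[int] = None
    subject_count: Optional[int] = None
    sample_count: Optional[int] = None
    hours: Optional[float] = None
    license: Optional[str] = None
    redistribution_allowed: Optional[bool] = None
    access: Optional[str] = None
    checksum: Optional[str] = None

    def missing_metadata(self) -> List[str]:
        """Names of fields that are undocumented (None) for this dataset."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


def completeness(records: List[DatasetRecord]) -> float:
    """Fraction of registry cells that are filled in, across all records."""
    total = sum(len(fields(r)) for r in records)
    missing = sum(len(r.missing_metadata()) for r in records)
    return 1.0 - missing / total


# Hypothetical entry, not one of the 27 real datasets.
example = DatasetRecord(
    dataset_id="example_pd_vowels",
    disease_family="Parkinson's",
    modality="voice",
    task_type="sustained vowel",
    license="CC BY 4.0",
    access="direct download",
)
print(example.missing_metadata())
print(round(completeness([example]), 3))
```

Deriving `missing_metadata` from the record itself, rather than maintaining it by hand, keeps the metadata completeness analysis consistent with the registry.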

3.2 Disease and label ontology

Do not use flat 68-class classification as the core task. Build hierarchical labels instead:

- Level 0: Signal modality

  speech/voice, cough/breath, auscultation, mixed/unknown.

- Level 1: Health state

  healthy/control vs abnormal/pathological.

- Level 2: Disease family

  neurodegenerative, motor speech disorder, cognitive impairment, respiratory disorder, psychological disorder, structural speech disorder, rare disease, auscultation abnormality.

- Level 3: Specific clinical condition or dataset label

  AD/MCI/control, PD/control, ALS/control, COVID/cough/control, depression/control, dysarthria severity, etc.

- Level 4: Dataset-specific labels

  Keep the original labels, so that labels with different clinical meanings are not improperly forced together.

Core principle: compare models only at levels where the clinical meaning is comparable. For example, AD vs control and depression vs control should not be crudely treated as the same binary task; heart/lung sounds are best placed in the extended track.

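3.3 Split protocol

Release at least three kinds of split:

1. Within-dataset patient-level split

   Aligns with prior literature but is not the main contribution.

2. Leave-one-dataset-out split

   Within one disease family, one dataset is held out entirely as the unseen test domain.

3. Leave-one-disease-family-out / open-set split

   Some disease families are hidden during training and treated as unknown at test time.

Every split must come with a fixed manifest; stating a random seed is not enough. The manifest contains sample_id, subject_id, dataset_id, fold_id, label, and a metadata hash.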
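
One way to obtain a fixed, patient-level manifest is to derive the fold from a hash of the subject identifier, so that every recording of a subject lands in the same fold and the assignment does not depend on a random number generator. The sketch below is a minimal illustration under that assumption. The function names, the 70/15/15 proportions (taken from the old draft's split), and the sample records are illustrative; hash-based assignment gives only approximate proportions and is not stratified by label.

```python
import hashlib
import json


def assign_fold(dataset_id: str, subject_id: str,
                train: float = 0.70, val: float = 0.15) -> str:
    """Deterministically map a subject to train/val/test via a SHA-256 hash."""
    digest = hashlib.sha256(f"{dataset_id}:{subject_id}".encode("utf-8")).hexdigest()
    u = int(digest[:8], 16) / 16 ** 8  # pseudo-uniform value in [0, 1)
    if u < train:
        return "train"
    if u < train + val:
        return "val"
    return "test"


def build_manifest(samples):
    """Turn sample records into manifest rows with the fields listed above."""
    rows = []
    for s in samples:
        meta = json.dumps(s.get("metadata", {}), sort_keys=True)
        rows.append({
            "sample_id": s["sample_id"],
            "subject_id": s["subject_id"],
            "dataset_id": s["dataset_id"],
            "fold_id": assign_fold(s["dataset_id"], s["subject_id"]),
            "label": s["label"],
            "metadata_hash": hashlib.sha256(meta.encode("utf-8")).hexdigest()[:16],
        })
    return rows


# Hypothetical samples: two recordings of subject p1, one of subject p2.
samples = [
    {"sample_id": "s1", "subject_id": "p1", "dataset_id": "d1", "label": "AD",
     "metadata": {"sex": "F"}},
    {"sample_id": "s2", "subject_id": "p1", "dataset_id": "d1", "label": "AD",
     "metadata": {"sex": "F"}},
    {"sample_id": "s3", "subject_id": "p2", "dataset_id": "d1", "label": "control",
     "metadata": {"sex": "M"}},
]
manifest = build_manifest(samples)
print(json.dumps(manifest, indent=2))
```

The released artifact would still be the manifest file itself; the hash rule only makes it reproducible and guards against sample-level leakage across folds.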

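3.4 Data license and access

NeurIPS E&D is very strict about data and code accessibility. Existing public medical speech data cannot simply be repackaged and redistributed; you should release:

- links to the original data sources;
- license terms;
- whether redistribution is allowed;
- whether commercial use is allowed;
- whether an application is required;
- download scripts;
- checksums;
- preprocessing scripts;
- small sample examples;
- Croissant collection-level metadata;
- Croissant metadata or an external link for each dataset;
- RAI fields: consent, sensitive attributes, known biases, intended use, out-of-scope use, risks.

The official FAQ also draws a boundary for work that uses existing public datasets: the original public data does not need to be re-hosted under the new-data hosting guideline, but executable code for using/modifying it and links to the public data sources must be provided, and the existing data itself cannot be presented as your new data contribution. (NeurIPS)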

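4. Evaluation Framework

This is the core of the paper. Suggested name:

SpeechHealth-Eval Task Suite

It contains nine capability axes:

1. Known-disease discrimination

   In-domain recognition ability.

2. Cross-dataset generalization

   Whether the model has learned the disease signal rather than dataset/device/language shortcuts.

3. Cross-disease transfer

   Whether a transferable speech health representation exists.

4. Hierarchical disease taxonomy prediction

   Coarse-to-fine recognition ability.

5. Open-set unknown disease detection

   Whether the model can reject unseen diseases.

6. Calibration and selective reliability

   Whether the model knows when it is uncertain.

7. Robustness under acquisition perturbations

   Noise, compression, resampling, truncation, silence, segment length variation.

8. Fairness and metadata-aware stratification

   Stability across groups defined by age, sex, language, device, task type, and so on.

9. Evaluation protocol sensitivity

   Whether the evaluation protocol itself changes the conclusions.

Axis 9 fits the E&D Track best: you are not only evaluating models, you are evaluating the reliability of the evaluation protocol.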

5. Experiments

Organize the main experiments by scientific question, not by model:

RQ1: Does within-dataset performance overestimate deployment readiness?
RQ2: Do models generalize across datasets within the same disease family?
RQ3: Do current foundation representations transfer across disease families?
RQ4: Can models detect unknown diseases?
RQ5: Are high-F1 models calibrated?
RQ6: Are models robust to acquisition perturbations?
RQ7: Are predictions fair across available subgroups?
RQ8: Do evaluation protocol choices reverse model rankings?

6. Analysis

A failure mode analysis is mandatory; without it the paper still reads as a bake-off.

Suggested contents:

- dataset-ID prediction;
- disease-dataset mutual information;
- representation UMAP colored by disease vs dataset;
- shortcut-only baseline: predict disease from dataset_id / metadata alone;
- label leakage audit;
- per-disease and per-modality error analysis;
- calibration diagrams;
- selective risk-coverage curves;
- open-set failure cases;
- protocol rank reversal analysis;
- metadata missingness heatmap.
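
The disease-dataset mutual information item in the list above can be computed directly from the split manifest, before any model is trained. The sketch below is a minimal plug-in estimate from label counts; the function names and the two toy examples are illustrative assumptions. A value close to the entropy of the disease label means that knowing the dataset nearly determines the disease, which is exactly the situation in which a shortcut-only baseline can score well.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy in bits of a list of discrete labels."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete labels."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) / (p(x) * p(y)) simplifies to c * n / (count_x * count_y)
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


# Toy case 1: each dataset contains a single disease (fully confounded).
datasets = ["A", "A", "B", "B"]
diseases_confounded = ["AD", "AD", "PD", "PD"]

# Toy case 2: every dataset has the same case/control mix (no confounding).
diseases_balanced = ["case", "control", "case", "control"]

print(mutual_information(datasets, diseases_confounded))  # equals H(disease) = 1 bit
print(mutual_information(datasets, diseases_balanced))    # 0 bits
print(entropy(diseases_confounded))
```

Reporting the ratio `mutual_information / entropy(disease)` per task would give a scale-free confounding index, though that normalization is a design choice rather than something fixed by the plan.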

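7. Limitations & Ethics

Key points to cover:

- the benchmark is not a clinical diagnostic tool;
- public datasets do not represent real clinical populations;
- disease definitions, recording tasks, devices, languages, and age distributions are inconsistent across datasets;
- some metadata is missing, so fairness conclusions have limits;
- medical speech carries privacy and re-identification risks;
- data licenses may restrict redistribution;
- speech / cough / lung / heart sounds should not be treated as one and the same biomarker;
- the benchmark may encourage overfitting to the leaderboard;
- external clinical validation is still needed in the future.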

8. Benchmark Release

The release must include:

dataset_registry.json/csv
label_ontology.yaml
license_table.md
metadata_schema.json
croissant_collection.json
croissant_per_dataset/*.json
download_scripts/
preprocess/
splits/
evaluation_harness/
metrics/
submission_format.md
leaderboard_rules.md
data_cards/
evaluation_cards/
model_cards_for_baselines/
reproducibility_checklist.md
responsible_ai_statement.md

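III. Redesigning the experimental program

| Direction | Research question and E&D value | Procedure | Metrics | Expected conclusion; interpretation if results differ |
|---|---|---|---|---|
| A. Within-dataset evaluation | How do models perform under the conventional protocol? This is the baseline, not the main contribution. | Patient-level train/val/test for each dataset; align with prior literature; avoid sample-level leakage. | macro-F1, balanced accuracy, AUROC, AUPRC, per-class recall | Scores are expected to be high. If they are not, prior work may have had split leakage, or the task may depend on specific preprocessing. |
| B. Cross-dataset generalization | Do models learn disease representations rather than dataset artifacts? One of the paper's main experiments. | Leave-one-dataset-out within the same disease family. For Alzheimer's, for example: train on Pitt + ADReSS-M, test on NCMMSC. Likewise for Respiratory. | macro-F1, AUROC, performance retention ratio = cross-dataset score / within-dataset score, worst-domain score | A clear drop is expected. If the drop is small, that disease family may have stable acoustic markers, or the datasets may share similar recording tasks. |
| C. Cross-disease transfer | Is there a transferable speech health representation? Matches foundation model evaluation. | Train the encoder on several disease families, then frozen-probe / few-shot fine-tune on an unseen family. Leave-one-family-out is also possible. | few-shot AUROC, linear probe F1, transfer gain over random init, adaptation efficiency | Pretrained speech/audio models are expected to transfer partially but unstably. If acoustic baselines are stronger, the disease signal may be low-level acoustics rather than semantic representation. |
| D. Unified disease taxonomy classification | Is flat 68-class classification reasonable? Can models diagnose coarse-to-fine? | Hierarchical prediction: healthy vs abnormal → disease family → dataset-specific label. Evaluation may be restricted to semantically comparable levels. | hierarchical F1, level-wise accuracy, hierarchical calibration, confusion distance | Coarse levels are expected to score better than fine levels. If flat 68-class scores are high but hierarchical generalization is poor, the model may be learning dataset identity. |
| E. Open-set / unknown disease detection | Deployment will encounter unseen diseases; can the model reject them? | Train on a subset of disease families; test on known + held-out disease families. Use MSP, energy score, Mahalanobis, OpenMax, deep ensembles. | AUROC-OOD, AUPR-OOD, FPR@95TPR, OSCR, unknown recall | Open-set performance is expected to be poor. If it is good, check whether the unknown datasets differ too much from the known ones in acquisition conditions, in which case the model may be detecting dataset shift rather than disease novelty. |
| F. Domain shift and dataset bias diagnosis | Is the model predicting disease, or data source / device / language / task? | Train a dataset-ID classifier; run UMAP on embeddings; compute disease-dataset confounding; metadata-only baseline; dataset-balanced evaluation. | dataset-ID accuracy, mutual information, shortcut gap, CKA/UMAP separability, domain adversarial residual score | Dataset ID is expected to be easy to predict. If it is hard to predict, preprocessing may have removed some domain artifacts, or the metadata may be insufficient. |
| G. Calibration and clinical reliability | High F1 is not enough for a medical model; confident errors are the most dangerous. | Compute calibration for all closed-set tasks; apply temperature scaling, isotonic regression, ensembles; evaluate selective prediction. | ECE, classwise ECE, Brier score, NLL, risk-coverage AUC, selective AUROC | High-F1 models are expected to remain poorly calibrated. If calibration is good, check whether the model is overly conservative or the classes are easy. |
| H. Few-shot / low-resource adaptation | With only a few labels for a new disease or hospital, can the model adapt quickly? | Provide only 1%, 5%, 10% labeled data from the unseen dataset/family; compare frozen probe, full fine-tune, LoRA/adapters, prototype classifier. | few-shot macro-F1, AUROC, sample efficiency curve, variance across seeds | Foundation encoders are expected to have an edge in low-resource settings, possibly depending on disease type. If they have no edge, existing pretraining objectives do not match pathological acoustics. |
| I. Robustness stress tests | Do acquisition perturbations break the conclusions? | Add noise, reverberation, MP3/AAC compression, resampling to 8/16/44.1 kHz, random truncation, silence insertion, segment length variation. | robustness AUC, clean-to-corrupt degradation, worst-corruption score, rank stability | Model rankings are expected to change. If they change little, the task signal is strong or the perturbations are not realistic enough. |
| J. Subgroup fairness / stratified analysis | Do age, sex, language, device, and task type affect performance? | Stratify on datasets that have metadata; where it is missing, run a metadata completeness analysis instead of forcing conclusions. | worst-group AUROC, performance gap, equal opportunity gap, calibration gap, missingness rate | Metadata is expected to be heavily missing, which limits fairness conclusions. That is itself an important E&D finding. |
| K. Evaluation protocol sensitivity | Do different splits, resampling, metrics, and aggregation change the scientific conclusions? This is the experiment most in the spirit of NeurIPS E&D. | Systematically compare random vs patient-level vs dataset-level split; oversampling vs class weighting; native sampling vs 16 kHz; macro vs weighted F1; sample-level vs subject-level aggregation. | Kendall τ rank correlation, rank reversal rate, protocol-induced variance, confidence interval overlap | Model rankings are expected to reverse. If they do not, the benchmark is stable, which is a positive contribution. |
| L. Model family comparison | Not "run more models", but a test of where each representation family applies under each evaluative claim. | Compare handcrafted + classical ML, CNN/CRNN/Transformer, SSL speech, audio foundation, time-series foundation, speech/audio LLM. Use unified frozen probe, PEFT, and fine-tuning protocols. | Rank each leaderboard separately; do not rely on average F1 alone. | No single model is expected to win everywhere. If one does, check further for shortcut use or training data contamination. |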
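
To make the open-set row (E) concrete, the sketch below scores samples with the maximum softmax probability (MSP) baseline and computes two of the listed metrics, AUROC and FPR@95TPR. It is a minimal pure-Python illustration, not the evaluation harness. Treating "unknown disease" as the positive class is an assumed convention (the OOD literature is not uniform on this), and the function names and example scores are hypothetical.

```python
import math


def msp_unknown_score(probs):
    """MSP baseline: 1 - max class probability; higher means 'more likely unknown'."""
    return 1.0 - max(probs)


def auroc(scores, is_unknown):
    """Chance that a random unknown sample outscores a random known one (ties = 0.5)."""
    pos = [s for s, u in zip(scores, is_unknown) if u]
    neg = [s for s, u in zip(scores, is_unknown) if not u]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def fpr_at_tpr(scores, is_unknown, tpr=0.95):
    """Share of known samples wrongly flagged when at least `tpr` of unknowns are flagged."""
    pos = sorted((s for s, u in zip(scores, is_unknown) if u), reverse=True)
    neg = [s for s, u in zip(scores, is_unknown) if not u]
    k = math.ceil(tpr * len(pos))   # number of unknowns that must be caught
    threshold = pos[k - 1]          # loosest threshold that still catches k of them
    return sum(s >= threshold for s in neg) / len(neg)


# Hypothetical unknown-ness scores: two known-disease and two unknown-disease samples.
is_unknown = [False, False, True, True]
separable = [0.10, 0.20, 0.60, 0.66]     # unknowns clearly score higher
overlapping = [0.10, 0.70, 0.60, 0.66]   # one known sample looks 'unknown'

print(auroc(separable, is_unknown), fpr_at_tpr(separable, is_unknown))
print(auroc(overlapping, is_unknown), fpr_at_tpr(overlapping, is_unknown))
print(msp_unknown_score([0.5, 0.25, 0.25]))
```

The other detectors named in the table (energy score, Mahalanobis, OpenMax, deep ensembles) would only change how the per-sample score is produced; the two metric functions stay the same.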

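Model family design

Five core families are recommended, plus an optional sixth, but the main text should not be swamped by models.

1. Handcrafted acoustic + classical ML

   MFCC/eGeMAPS/openSMILE + logistic regression / SVM / random forest / XGBoost.

   Role: a strong, interpretable baseline; tests whether deep models are actually necessary.

2. Spectrogram neural networks

   CNN, CRNN, AST, Conformer-lite.

   Role: conventional deep acoustic baseline.

3. Self-supervised speech models

   wav2vec 2.0, HuBERT, WavLM, Whisper encoder.

   Protocol: three tiers, frozen probing + PEFT + full fine-tuning.

4. General audio foundation models

   BEATs, CLAP, AudioMAE, PaSST.

   Role: tests whether audio representations not specific to speech are stronger on cough/breath/lung sounds.

5. Time-series / waveform foundation models

   Mantis, TS2Vec, TimesNet/PatchTST-style methods.

   Role: tests whether the pathological signal behaves more like a temporal physiological pattern than like linguistic content.

6. Speech/audio LLM or multimodal models

   Bonus only. Zero-shot label description, few-shot prompting, and embedding extraction can be evaluated, but the paper should not depend on these models.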

IV. A genuinely novel benchmark design

Benchmark name

My suggestion:

SpeechHealth-Eval

Subtitle:

A Multi-Axis Evaluation Suite for Generalization, Reliability, and Open-Set Speech-Based Disease Assessment

This name suits NeurIPS better than "UniSpeechDisease" because it emphasizes health evaluation instead of confining the paper to disease classification.

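Task Suite

| Task | Name | Scientific claim supported |
|---|---|---|
| T1 | In-Domain Disease Recognition | Whether the model recognizes known labels within the same dataset and distribution |
| T2 | Cross-Dataset Generalization | Whether the model generalizes across acquisition conditions, languages, devices, and sites |
| T3 | Cross-Disease Transfer | Whether representations transfer to new diseases |
| T4 | Hierarchical Taxonomy Prediction | Whether the model has coarse-to-fine speech health reasoning |
| T5 | Open-Set Disease Rejection | Whether the model can reject diseases unseen in training |
| T6 | Calibration & Selective Prediction | Whether the model knows when it is uncertain |
| T7 | Robustness Stress Test | Whether the model is stable under realistic acquisition perturbations |
| T8 | Subgroup Fairness | Whether performance is stable across age/sex/language/device groups |
| T9 | Protocol Sensitivity Audit | Whether changes in the evaluation protocol change the conclusions |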

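Overall scoring scheme

Do not average all metrics crudely. Treat the overall score (SHGS, the Speech Health Generalization Score) as a secondary summary score; the main leaderboard should still report the multi-dimensional vector.

Definition: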

$$
\text{SHGS} = 0.20\,S_{\text{ID}} + 0.25\,S_{\text{Gen}} + 0.15\,S_{\text{Open}} + 0.15\,S_{\text{Rel}} + 0.15\,S_{\text{Rob}} + 0.10\,S_{\text{Fair}}
$$

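where:

- $S_{\text{ID}}$: within-dataset known disease score. Weighted low because it most easily overestimates real deployment performance.
- $S_{\text{Gen}}$: cross-dataset + cross-disease generalization. Weighted highest because this is the core capability of a speech health foundation model.
- $S_{\text{Open}}$: unknown disease detection. Represents the ability to reject safely.
- $S_{\text{Rel}}$: calibration + selective prediction. Represents clinical reliability.
- $S_{\text{Rob}}$: performance retained under noise/compression/resampling/length perturbation.
- $S_{\text{Fair}}$: worst-group performance + calibration gap. Computed only on datasets with sufficient metadata; where metadata is insufficient, report "fairness reportability" separately.

A safer way to phrase it: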

We do not claim that a single scalar score fully characterizes clinical readiness. We therefore report a primary multi-axis scorecard and provide SHGS only as a secondary summary for leaderboard navigation.
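
The weighted sum above can be sketched in a few lines. The snippet assumes that every axis score has already been normalized to [0, 1] with higher meaning better; that normalization is not specified in the plan and would need its own definition. The function name `shgs` and the example scores are hypothetical. The function refuses to compute a score when an axis is missing rather than silently renormalizing the weights, which mirrors the rule that fairness is reported separately when metadata is insufficient.

```python
# Axis weights from the SHGS definition; they sum to 1.0.
WEIGHTS = {
    "ID": 0.20,    # within-dataset known disease score
    "Gen": 0.25,   # cross-dataset + cross-disease generalization
    "Open": 0.15,  # unknown disease detection
    "Rel": 0.15,   # calibration + selective prediction
    "Rob": 0.15,   # performance retained under perturbation
    "Fair": 0.10,  # worst-group performance + calibration gap
}


def shgs(axis_scores, weights=WEIGHTS):
    """Weighted sum of normalized axis scores (each assumed to lie in [0, 1])."""
    missing = set(weights) - set(axis_scores)
    if missing:
        raise ValueError(f"cannot compute SHGS, missing axes: {sorted(missing)}")
    for name in weights:
        if not 0.0 <= axis_scores[name] <= 1.0:
            raise ValueError(f"axis {name} must be normalized to [0, 1]")
    return sum(weights[name] * axis_scores[name] for name in weights)


# Hypothetical scorecard for one model.
scorecard = {"ID": 0.9, "Gen": 0.6, "Open": 0.5, "Rel": 0.7, "Rob": 0.8, "Fair": 0.6}
print(round(shgs(scorecard), 4))
```

In this example a model with a strong in-domain score (0.9) still ends at 0.69 overall, because the generalization and open-set axes carry 40% of the weight between them.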

Multiple leaderboards

1. Standard Supervised Leaderboard

   Training with full supervision on the training set is allowed.

2. Frozen Foundation Model Leaderboard

   The encoder is frozen and only a linear / shallow head is trained. Tests representation quality.

3. Cross-Dataset Generalization Leaderboard

   Leave-one-dataset-out; this is the main leaderboard.

4. Open-Set Detection Leaderboard

   Known/unknown split; reports AUROC-OOD, FPR@95TPR, OSCR.

5. Low-Resource Adaptation Leaderboard

   Adaptation with 1%, 5%, 10% labeled data.

6. Robustness & Calibration Leaderboard

   Reports clean score, robustness AUC, ECE, Brier, risk-coverage.

7. Protocol-Stability Leaderboard

   Ranking is based not on F1 but on a model's rank stability across evaluation protocols.

The most novel leaderboard is the seventh. It closely matches "evaluation itself as a scientific object of study".
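
The two ranking statistics that a protocol-stability analysis relies on, Kendall τ and the rank reversal rate, can be computed from two score tables over the same models. The sketch below implements Kendall's τ-a (tied pairs count as neither concordant nor discordant) in pure Python; the model names and scores are hypothetical and the function names are illustrative.

```python
from itertools import combinations


def _pair_counts(scores_a, scores_b):
    """Count concordant and discordant model pairs between two protocols."""
    if set(scores_a) != set(scores_b):
        raise ValueError("both protocols must score the same set of models")
    concordant = discordant = 0
    for m, n in combinations(sorted(scores_a), 2):
        sign = (scores_a[m] - scores_a[n]) * (scores_b[m] - scores_b[n])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total = len(scores_a) * (len(scores_a) - 1) // 2
    return concordant, discordant, total


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a: +1 for identical rankings, -1 for fully reversed ones."""
    concordant, discordant, total = _pair_counts(scores_a, scores_b)
    return (concordant - discordant) / total


def rank_reversal_rate(scores_a, scores_b):
    """Fraction of model pairs whose order flips between the two protocols."""
    _, discordant, total = _pair_counts(scores_a, scores_b)
    return discordant / total


# Hypothetical macro-F1 under two protocols (random split vs patient-level split).
random_split = {"model_a": 0.90, "model_b": 0.80, "model_c": 0.70}
patient_split = {"model_a": 0.60, "model_b": 0.70, "model_c": 0.50}

print(kendall_tau(random_split, patient_split))
print(rank_reversal_rate(random_split, patient_split))
```

Here one of the three model pairs (model_a vs model_b) changes order, so the reversal rate is 1/3 and τ is 1/3. With many protocol configurations, the same functions can be applied to every pair of protocols and summarized.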

V. Answering the old reviews point by point

| Old review comment | Design principle for the new version | How to phrase it in the paper |
|---|---|---|
| The methods are conventional and lack novelty | This is not a model paper but an evaluation science paper. NeurIPS E&D explicitly allows papers that propose no new model and beat no baseline, as long as they advance meaningful evaluation. (NeurIPS) | "Our contribution is not a new classifier, but a claim-aware evaluation framework for speech health models." |
| It is only a results display / bake-off | Turn the results table into an evaluation study: cross-dataset, open-set, calibration, robustness, fairness, protocol sensitivity, failure analysis. | "Model rankings are used as probes to reveal which evaluative claims are fragile." |
| No data download, licenses, or standard evaluation code | Release the registry, license table, download scripts, Croissant, RAI fields, split manifests, evaluation harness, and submission format. | "Every score in the paper can be reproduced from a fixed split manifest and deterministic evaluator." |
| The motivation for merging datasets is unclear | Stop saying "merged to get bigger"; say "merged to construct clinical distribution shift". | "The collection is organized as a set of deployment-relevant shifts: across datasets, disease families, modalities, languages, devices, and tasks." |
| Diseases differ widely, so comparing them uniformly with a few models is unfair | Provide leaderboards by disease family / modality / task; do not claim one model is best everywhere. | "The benchmark is not designed to declare a universal winner, but to expose capability profiles." |
| Resampling may have side effects | Promote resampling from an implementation detail to a core experiment: class resampling sensitivity + audio resampling sensitivity. | "We quantify how class balancing and audio resampling alter calibration, minority performance, and model rankings." |
| The meaning of Table 2 is unclear | Demote flat 68-class classification to a diagnostic task rather than core evidence. | "Flat taxonomy classification is included only as one stress test; the primary tasks are cross-domain, open-set, and reliability evaluations." |

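VI. Key conclusions the final paper should reach

Ideally, the paper should arrive at 5 to 8 findings that "NeurIPS reviewers will remember":

1. Within-dataset performance substantially overestimates deployment readiness.

   A high single-dataset score does not imply generalization across hospitals, devices, languages, or tasks.

2. Cross-dataset generalization is the central bottleneck for speech health models.

   Leave-one-dataset-out within the same disease family still causes a clear drop.

3. Many models encode dataset identity more strongly than disease identity.

   Dataset-ID prediction, UMAP, and metadata-only baselines can expose the shortcut.

4. Speech SSL models are not automatically superior for disease assessment.

   wav2vec/HuBERT/WavLM/Whisper and similar models may be good at linguistic content, but their pretraining objectives do not necessarily cover pathological acoustic signals.

5. High F1 does not imply clinical reliability.

   Many models have poor ECE/Brier and stay overconfident when wrong; selective prediction can expose the risk.

6. Open-set disease rejection remains unsolved.

   When a disease family unseen in training appears, models tend to force it into a known class.

7. Low-resource adaptation reveals whether representations are clinically reusable.

   Learning curves at 1%, 5%, 10% labels show the value of a foundation representation better than full fine-tuning does.

8. Evaluation protocol choices can reverse model rankings.

   Split, resampling, metric aggregation, and audio sampling rate can all change the conclusion, so the field needs a standardized evaluation harness.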
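
The claim that high F1 does not imply reliability rests on calibration metrics such as ECE and the Brier score. The sketch below gives minimal reference implementations: equal-width-bin ECE over top-label confidences and the binary Brier score. The bin count of 10 and the toy inputs are assumptions; classwise ECE, NLL, and risk-coverage curves from the experiment table are not covered here.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: weighted mean of |avg confidence - accuracy| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


def brier_score(prob_positive, labels):
    """Binary Brier score: mean squared error between P(positive) and the 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(prob_positive, labels)) / len(labels)


# Toy case: the model is 95% confident on every prediction but right only half the time.
overconfident_conf = [0.95, 0.95, 0.95, 0.95]
overconfident_correct = [1, 1, 0, 0]
print(expected_calibration_error(overconfident_conf, overconfident_correct))

# Toy case: an uninformative predictor that always outputs 0.5.
print(brier_score([0.5, 0.5], [1, 0]))
```

In the first toy case accuracy is 0.5 while average confidence is 0.95, so the ECE is 0.45: the kind of gap that a reliability diagram makes visible even when headline accuracy looks acceptable.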

VII. Executable experiment plan

P0: Must-run experiments that make the paper stand

Experiment / artifact	Data processing	Model training	Metrics	Figures and tables
Dataset registry + ontology + license table	Compile metadata, license, access, modality, task_type, language, device, and subject_id for the 27 datasets	None	metadata completeness, license/access coverage	Dataset statistics table, metadata missingness heatmap
Patient-level within-dataset baseline	Fix splits; remove duplicates; aggregate at the subject level	Start with 4 to 6 representative models: MFCC+SVM/MLP, CNN/CRNN, wav2vec2/WavLM frozen, BEATs/CLAP frozen	macro-F1, AUROC, AUPRC, balanced accuracy	Main baseline table
Cross-dataset leave-one-dataset-out	Within each disease family, construct the training datasets and one unseen test dataset	Prefer frozen probes to limit compute; then apply PEFT to the top models	performance retention, worst-domain score	Cross-dataset heatmap
Open-set unknown disease detection	Construct known/unknown disease splits	Use the scores output by existing closed-set models; add energy/Mahalanobis/ensembles	AUROC-OOD, FPR@95TPR, OSCR	Open-set result table / ROC
Calibration and selective prediction	Save logits/probabilities	temperature scaling + baseline logits	ECE, Brier, NLL, risk-coverage AUC	Reliability diagram, risk-coverage curve
Protocol sensitivity	Generate multiple split/resampling/metric configs	No need to run every model on every config; run the full sensitivity sweep on 4 to 6 representative models	Kendall τ, rank reversal rate, score variance	Protocol ranking plot
Evaluation harness release	Fix the prediction schema, metrics scripts, and split manifests	baseline wrappers	reproducibility checklist	Release overview figure
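
As one way to make the leave-one-dataset-out row concrete, the sketch below enumerates folds within each disease family and computes the two metrics named in that row. It is a minimal illustration under stated assumptions: the registry contents (`ds_a`, `family_1`, and so on) are placeholders, and "performance retention" is taken here to mean the ratio of cross-dataset score to within-dataset score, which the plan itself does not define.

```python
def leave_one_dataset_out(registry):
    """Yield (family, train_datasets, held_out_dataset) within each disease family.

    `registry` maps a dataset name to its disease family (hypothetical schema).
    """
    by_family = {}
    for dataset, family in registry.items():
        by_family.setdefault(family, []).append(dataset)
    for family, datasets in sorted(by_family.items()):
        if len(datasets) < 2:
            continue  # a family with a single dataset leaves nothing to train on
        for held_out in sorted(datasets):
            train = sorted(d for d in datasets if d != held_out)
            yield family, train, held_out


def performance_retention(within_score, cross_score):
    """Assumed definition: share of the within-dataset score kept on an unseen dataset."""
    return cross_score / within_score


def worst_domain_score(cross_scores):
    """Lowest score across held-out datasets, given {dataset: score}."""
    return min(cross_scores.values())


# Placeholder registry; real entries would come from the dataset registry table.
registry = {"ds_a": "family_1", "ds_b": "family_1", "ds_c": "family_1", "ds_d": "family_2"}
for family, train, held_out in leave_one_dataset_out(registry):
    print(family, "train on", train, "test on", held_out)
```

Writing the folds out once as fixed split manifests, rather than regenerating them per run, keeps the cross-dataset heatmap reproducible across model families.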

P1: High-novelty experiments that raise the odds at NeurIPS

Experiment	Data processing	Model training	Metrics	Figures and tables
Dataset shortcut diagnosis	Construct dataset-ID labels and a metadata-only feature table	dataset-ID classifier; disease classifier with/without domain balancing	dataset-ID accuracy, mutual information, shortcut gap	UMAP disease vs dataset, shortcut bar plot
Few-shot adaptation	Sample 1/5/10% labeled data from the unseen dataset/family; multiple seeds	frozen probe, LoRA/adapters, full fine-tune	sample efficiency, variance	Few-shot learning curve
Robustness stress tests	Generate corrupted test sets: noise, compression, resampling, truncation, silence	Usually no retraining; test only	robustness AUC, degradation, rank stability	Robustness curves
Subgroup fairness	Clean age/sex/language/device/task metadata	Evaluate already-trained models	worst-group score, gap, calibration gap	Fairness table, metadata coverage chart
Hierarchical taxonomy classification	Construct a Level 1 to 4 label ontology	hierarchical head or multi-head classification	hierarchical F1, level-wise ECE	Taxonomy confusion matrix
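
For the robustness stress tests, the noise corruption is the simplest to pin down. The following sketch shows one possible way to build a noisy test clip at a target SNR using only the standard library. White Gaussian noise, the fixed seed, and the 440 Hz test tone are assumptions for illustration; a real harness might instead mix in recorded ambient noise and operate on arrays.

```python
import math
import random


def add_noise_at_snr(signal, snr_db, seed=0):
    """Return `signal` plus white Gaussian noise scaled to the requested SNR in dB."""
    signal_power = sum(x * x for x in signal) / len(signal)
    if signal_power == 0:
        raise ValueError("SNR is undefined for an all-zero signal")
    rng = random.Random(seed)  # fixed seed so the corrupted test set is reproducible
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    noise_power = sum(n * n for n in noise) / len(noise)
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [x + scale * n for x, n in zip(signal, noise)]


def measured_snr_db(clean, noisy):
    """SNR in dB recovered from a clean clip and its corrupted version."""
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum((y - x) ** 2 for x, y in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# One second of a 440 Hz tone at 16 kHz stands in for a real recording.
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
for snr in (20, 10, 0):
    noisy = add_noise_at_snr(clean, snr)
    print(snr, "dB requested, measured:", round(measured_snr_db(clean, noisy), 3))
```

Because the corruption is applied only at test time with a fixed seed, the same corrupted set can be reused across all models, which is what makes the resulting degradation curves comparable.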

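P2: Bonus experiments, if time permits

Experiment	Value
Prospective-style external holdout	If a completely unseen new public dataset can be found, using it as a final hidden-like external test is a strong plus.
Representation probing	Analyze whether model embeddings encode acoustic biomarkers such as pitch, jitter, shimmer, speech rate, pause, and cough burst.
Label noise / annotation uncertainty	Run sensitivity analyses on datasets whose label sources are unstable, to show the limits of the medical labels themselves.
Leaderboard demo	Build a lightweight web page or Hugging Face Space that shows the submission format and scorecard.
Private holdout design	If a challenge is organized later, part of the test labels can be withheld; the paper should describe how the public validation set and the hidden test set are governed.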

VIII. Figure and table design: 12 key figures/tables

Figure 1: Benchmark overview
From 27 datasets → ontology → task suite → evaluation harness → leaderboards → findings. Replaces the generic schematic on page 1 of the old draft.

Figure 2: Disease taxonomy and modality ontology
Shows the hierarchy of speech/voice, cough/breath, and auscultation; emphasizes the core speech track versus the extended clinical audio track.

Table 1: Dataset registry summary
dataset, disease family, modality, language, task, subjects, samples, hours, license, access, metadata completeness.

Table 2: Evaluation task definitions
For each task: train/test split, allowed training, prediction format, metrics, claim supported.

Figure 3: Evaluation framework diagram
Depicts a multi-axis scorecard rather than a single pipeline: ID, Gen, Open, Rel, Rob, Fair, Protocol.

Table 3: Main multi-axis scorecard
Rows are model families; columns are the scores on the individual leaderboards. Do not report only average F1.

Figure 4: Cross-dataset generalization heatmap
train datasets × test datasets, faceted by disease family; this is the figure most likely to persuade reviewers.

Figure 5: Within vs cross-dataset performance gap
One point per model, x = within, y = cross; y far below x indicates an evaluation illusion.

Figure 6: Dataset shortcut analysis
UMAP/t-SNE colored by disease label and by dataset_id; if the dataset clusters are the clearer of the two, the case is compelling.

Figure 7: Calibration reliability diagrams
Contrast models that have high F1 but also high ECE; this matters a great deal for medical evaluation.

Figure 8: Open-set detection results
ROC / FPR@95TPR / OSCR; shows that unknown disease detection is an unsolved problem.

Figure 9: Protocol sensitivity ranking plot
Changes in model ranking under different split/resampling/metric choices. A bump chart or a Kendall τ matrix would work.

Figure 10: Robustness degradation curves
Performance curves as noise SNR, compression rate, and segment length vary.

Table 4: Release and reproducibility checklist
Whether data, code, Croissant, RAI, license, splits, seeds, model cards, and evaluation cards are all in place.
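
The reliability diagrams and risk-coverage curves listed above both start from saved per-sample confidences and correctness flags. The sketch below is a minimal standard-library version of the two computations, assuming top-label confidences in [0, 1], equal-width binning, and 10 bins by default; these are common choices but are assumptions here, and the example values are made up.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE over top-label confidences in [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 falls in the last bin
        bins[idx].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece


def risk_coverage_curve(confidences, correct):
    """(coverage, error rate) pairs when keeping the k most confident predictions."""
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    curve, errors = [], 0
    for k, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        curve.append((k / len(order), errors / k))
    return curve


# Made-up predictions: two confident hits, two fairly confident misses.
confidences = [0.95, 0.92, 0.85, 0.65]
correct = [1, 1, 0, 0]
print("ECE:", expected_calibration_error(confidences, correct))
print("Risk-coverage:", risk_coverage_curve(confidences, correct))
```

The per-bin (average confidence, accuracy) pairs computed inside the ECE loop are the same points a reliability diagram plots, so one pass over the saved predictions can feed both the figure and the scorecard entry.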

IX. Final writing strategy

How the abstract should read

Avoid:

We collect 27 datasets and benchmark several models.

Use instead:

Speech-based disease assessment is increasingly studied as a non-invasive health monitoring technology, yet progress is difficult to interpret because most models are evaluated within a single disease dataset. Such evaluations cannot determine whether models learn disease-related acoustic biomarkers or dataset-specific shortcuts. We introduce SpeechHealth-Eval, a multi-axis evaluation framework over 27 public datasets spanning eight disease families, designed to test clinically relevant claims about generalization, open-set recognition, calibration, robustness, fairness, and protocol stability. Rather than proposing a new classifier, SpeechHealth-Eval provides a harmonized disease ontology, documented dataset registry, fixed split manifests, executable evaluators, and multiple leaderboards for known-disease, cross-dataset, low-resource, open-set, and reliability evaluation. Our audit shows that within-dataset performance often overestimates deployment readiness, model rankings are sensitive to evaluation protocols, and current speech/audio foundation models remain poorly calibrated and fragile under unseen disease and dataset shifts. SpeechHealth-Eval establishes a reproducible evaluation foundation for future speech health foundation models.

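Three signals reviewers should see on the first screen of the introduction

This is not an ordinary application paper.
The object of study is not a particular disease classifier but evaluation validity in the SDD field.

This is not data stacking.
The 27 datasets are there to construct deployment-relevant shifts, not to inflate the sample count.

This is not a model leaderboard.
Model results are used to answer three questions: which claims are valid, which are evaluation illusions, and which protocol choices change the conclusions.

Suggested wording for the contributions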

Our contributions are:

We formulate speech-based disease assessment as a claim-aware evaluation problem and define a multi-axis task suite covering in-domain recognition, cross-dataset generalization, cross-disease transfer, open-set rejection, calibration, robustness, fairness, and protocol sensitivity.

We construct SpeechHealth-Eval, a documented and executable benchmark suite over 27 public speech/clinical audio health datasets, with a harmonized disease ontology, license-aware dataset registry, Croissant metadata, Responsible AI fields, fixed split manifests, preprocessing scripts, and deterministic evaluation harness.

We conduct a systematic evaluation audit of representative acoustic, neural, speech self-supervised, audio foundation, and time-series foundation models, showing that standard within-dataset scores can overstate real-world readiness and that model rankings are sensitive to dataset shift and evaluation protocol choices.

We provide failure analyses of dataset shortcuts, calibration errors, open-set failures, robustness degradation, and subgroup/metadata limitations, offering concrete guidance for future speech health foundation model evaluation.

Phrasing to avoid in the main text

Do not say: "we solve speech disease diagnosis"
Say instead: "we evaluate claims about speech-based disease assessment."

Do not say: "our unified dataset enables all disease classification"
Say instead: "our benchmark organizes heterogeneous datasets into claim-specific evaluation settings."

Do not say: "model X is best"
Say instead: "model X is strong under in-domain recognition but fails under cross-dataset/open-set/calibration tests."

Do not fill the main text with long derivations of MLP/CNN formulas.
Move model architectures to the appendix; give the main-text space to evaluation design, protocol, artifacts, and failure findings.

X. Concise version

1. Recommended paper title

SpeechHealth-Eval: A Multi-Axis Evaluation Framework for Generalization, Reliability, and Open-Set Recognition in Speech-Based Disease Assessment

2. One-sentence thesis

Current speech disease models appear strong under within-dataset evaluation but fail or become unreliable under clinically realistic cross-dataset, cross-disease, open-set, calibration, robustness, and protocol-sensitivity evaluation; SpeechHealth-Eval provides the evaluation framework needed to measure readiness for speech health foundation models.

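3. Three core contributions

Propose a claim-aware multi-axis evaluation framework that upgrades SDD from a single classification task to an evaluation problem spanning generalization, rejection, calibration, robustness, fairness, and protocol robustness.
Release a reproducible benchmark artifact: a 27-dataset registry, label ontology, license table, download scripts, split manifests, Croissant/RAI metadata, evaluation harness, and leaderboards.
Systematically audit the failure modes of existing models and protocols: within-dataset overestimation, dataset shortcuts, open-set failure, poor calibration, and ranking reversals caused by resampling/split/metric choices.

4. Main paper structure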

Introduction
Related Work
Dataset Registry and Label Ontology
SpeechHealth-Eval Task Suite
Evaluation Protocols and Metrics
Model Families and Submission Rules
Main Results
Failure Mode and Protocol Sensitivity Analysis
Benchmark Release and Leaderboards
Limitations, Ethics, and Responsible AI

5. Must-run experiment checklist

patient-level within-dataset baseline
leave-one-dataset-out cross-dataset generalization
hierarchical taxonomy classification
open-set unknown disease detection
calibration + selective prediction
dataset shortcut diagnosis
resampling / split / metric protocol sensitivity
robustness stress tests
metadata completeness + subgroup analysis
model family comparison under frozen / PEFT / fine-tuning protocols

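6. The most compelling novelty

Evaluation Protocol Sensitivity Audit: systematically demonstrate that different choices of split, resampling, sampling rate, metric aggregation, and subject-level vs sample-level evaluation change model rankings. This experiment is the closest match to the NeurIPS E&D notion of "evaluation itself as a scientific object of study".

7. Why the paper fits the NeurIPS E&D Track

This paper is no longer "collect data and run models"; it studies what kind of evaluation protocol can support the scientific claims made for speech health models. It provides an executable benchmark, clear assumptions, claim boundaries, reproducibility artifacts, Responsible AI documentation, failure analysis, and negative findings, which map directly onto the NeurIPS 2026 E&D Track requirements for evaluation science, dataset documentation, benchmark auditing, stress testing, and reproducible tooling.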
