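A Simple Android Voice Assistant, and a Discussion of How Voice Assistants "Execute Tasks"

Android voice assistant demo (Chinese ASR → command parsing → open app), plus an extended discussion (UI automation / deep links / partner interfaces)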

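This article has two parts:

- Part A, the demo: runnable code, the required configuration, and the run flow, organized as a reusable minimal project write-up.
- Part B, a digression for discussion: how "executing a task" is typically engineered in a voice assistant, using deep links/Intents, partner interfaces, and UI automation (accessibility) as a fallback.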

Part A: Android voice assistant demo

A1. Goal and scope

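Goal: implement a minimal closed loop on an Android phone:

1. The user taps a button to start speaking (push-to-talk).
2. The system's Chinese speech recognition (ASR) converts the speech to text.
3. The name of the app to open is extracted from the text.
4. The target app is launched with startActivity(), using its package name.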

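Note: the demo uses the system RecognizerIntent, so a system speech-recognition panel pops up during recognition. This is the fastest way to validate the pipeline end to end; you can swap it for another recognizer (for example SpeechRecognizer) if you want a panel-free experience.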
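
For readers who want to drop the system panel, the following is a minimal sketch built on the platform's SpeechRecognizer API. The class name PanelFreeRecognizer and its two callbacks are assumptions made for illustration and are not part of the demo. It assumes RECORD_AUDIO has already been granted and that the object is created and used on the main thread; on Android 11+ the manifest may additionally need a `<queries>` entry for the `android.speech.RecognitionService` intent action.

```kotlin
import android.content.Context
import android.content.Intent
import android.os.Bundle
import android.speech.RecognitionListener
import android.speech.RecognizerIntent
import android.speech.SpeechRecognizer

// Hypothetical helper (not part of the demo): recognition without the system panel.
class PanelFreeRecognizer(
    context: Context,
    private val onText: (String) -> Unit,   // receives the top hypothesis ("" if none)
    private val onFailure: (Int) -> Unit    // receives a SpeechRecognizer.ERROR_* code
) {
    private val recognizer = SpeechRecognizer.createSpeechRecognizer(context).apply {
        setRecognitionListener(object : RecognitionListener {
            override fun onResults(results: Bundle?) {
                val text = results
                    ?.getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION)
                    ?.firstOrNull()
                    .orEmpty()
                onText(text)
            }

            override fun onError(error: Int) = onFailure(error)

            // The remaining callbacks are not needed for this sketch.
            override fun onReadyForSpeech(params: Bundle?) {}
            override fun onBeginningOfSpeech() {}
            override fun onRmsChanged(rmsdB: Float) {}
            override fun onBufferReceived(buffer: ByteArray?) {}
            override fun onEndOfSpeech() {}
            override fun onPartialResults(partialResults: Bundle?) {}
            override fun onEvent(eventType: Int, params: Bundle?) {}
        })
    }

    // Starts one push-to-talk session in Chinese.
    fun start() {
        val intent = Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH).apply {
            putExtra(
                RecognizerIntent.EXTRA_LANGUAGE_MODEL,
                RecognizerIntent.LANGUAGE_MODEL_FREE_FORM
            )
            putExtra(RecognizerIntent.EXTRA_LANGUAGE, "zh-CN")
        }
        recognizer.startListening(intent)
    }

    // Call from onDestroy() to free the recognition service connection.
    fun release() = recognizer.destroy()
}
```

The trade-off is that the app now owns the listening UI (prompting, level meter, error display), which the system panel otherwise provides for free.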

A2. Project structure (suggested)

```text
app/
  src/main/
    AndroidManifest.xml
    java/com/example/myapplication/
      OpenAppActivity.kt
    res/layout/
      activity_open_app.xml
```

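A3. Manifest and permission configuration (AndroidManifest.xml)

This demo declares the microphone permission: android.permission.RECORD_AUDIO

(It is a dangerous permission, so it must also be requested at runtime.)

Strictly speaking, the RecognizerIntent panel captures audio inside the recognizer app, so the calling app typically does not need this permission for the panel flow itself. The demo requests it anyway, which keeps it ready for a switch to SpeechRecognizer, where the permission is required.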

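```xml
<manifest xmlns:android="http://schemas.android.com/apk/res/android">

    <uses-permission android:name="android.permission.RECORD_AUDIO"/>

    <!-- Package visibility: when targeting Android 11 (API 30) or higher,
         getLaunchIntentForPackage() returns null for apps not listed here.
         Keep this list in sync with appMap in OpenAppActivity.kt. -->
    <queries>
        <package android:name="com.tencent.qqlive"/>
        <package android:name="com.android.settings"/>
        <package android:name="com.tencent.mm"/>
        <package android:name="com.ss.android.ugc.aweme"/>
        <package android:name="com.google.android.youtube"/>
        <package android:name="com.youku.phone"/>
        <package android:name="com.qiyi.video"/>
    </queries>

    <application
        android:allowBackup="true"
        android:label="@string/app_name"
        android:supportsRtl="true">

        <activity
            android:name=".OpenAppActivity"
            android:exported="true">
            <intent-filter>
                <action android:name="android.intent.action.MAIN"/>
                <category android:name="android.intent.category.LAUNCHER"/>
            </intent-filter>
        </activity>

    </application>
</manifest>
```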

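Does RECORD_AUDIO need to be enabled manually?

- Android 6.0+: it is a dangerous permission and must be requested through a runtime dialog.
- After the user denies it: guide the user to system settings to enable it manually.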
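
Guiding the user to system settings can be done with a standard Intent. The sketch below opens this app's details page, where the permission toggle lives; the function name openAppSettings is an assumption for illustration and does not appear in the demo.

```kotlin
import android.content.Context
import android.content.Intent
import android.net.Uri
import android.provider.Settings

// Hypothetical helper: opens this app's details page in system settings,
// where the user can grant the microphone permission manually.
fun openAppSettings(context: Context) {
    val intent = Intent(
        Settings.ACTION_APPLICATION_DETAILS_SETTINGS,
        Uri.fromParts("package", context.packageName, null)
    ).addFlags(Intent.FLAG_ACTIVITY_NEW_TASK) // needed when context is not an Activity
    context.startActivity(intent)
}
```

One possible place to call it is the denied branch of the permission callback, for example from a dialog button rather than automatically, so the user is not moved out of the app without warning.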

A4. Page layout (activity_open_app.xml)

A single button is enough:

```xml
<?xml version="1.0" encoding="utf-8"?>
<LinearLayout xmlns:android="http://schemas.android.com/apk/res/android"
    android:layout_width="match_parent"
    android:layout_height="match_parent"
    android:gravity="center"
    android:orientation="vertical"
    android:padding="24dp">

    <Button
        android:id="@+id/btnStartSpeech"
        android:layout_width="wrap_content"
        android:layout_height="wrap_content"
        android:text="点击说话" />
</LinearLayout>
```

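A5. Complete demo code (OpenAppActivity.kt)

Features:

- Requests RECORD_AUDIO at runtime
- Invokes system speech recognition (Chinese, zh-CN)
- Shows the recognition result in a Toast
- Rule-based parsing: "打开/启动/进入 (open/launch/enter) + app name"
- Looks up the package name in the appMap table and opens the app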

```kotlin
package com.example.myapplication

import android.Manifest
import android.content.ActivityNotFoundException
import android.content.Intent
import android.content.pm.PackageManager
import android.os.Bundle
import android.speech.RecognizerIntent
import android.widget.Button
import android.widget.Toast
import androidx.activity.ComponentActivity
import androidx.activity.result.contract.ActivityResultContracts
import androidx.core.content.ContextCompat

// Result of parsing one utterance: the command type plus the spoken app name.
data class ParsedIntent(val type: String, val appName: String)

// User-facing strings and command keywords stay in Chinese because the demo
// recognizes zh-CN speech.
class OpenAppActivity : ComponentActivity() {

    private lateinit var btnStartSpeech: Button

    // "Chinese app name / alias" -> package name (extend as needed).
    // Package names may differ across app versions and regions.
    private val appMap: Map<String, String> = mapOf(
        "腾讯视频" to "com.tencent.qqlive",
        "腾讯" to "com.tencent.qqlive",
        "qqlive" to "com.tencent.qqlive",

        "设置" to "com.android.settings",

        "微信" to "com.tencent.mm",
        "抖音" to "com.ss.android.ugc.aweme",

        "youtube" to "com.google.android.youtube",
        "优酷" to "com.youku.phone",
        "爱奇艺" to "com.qiyi.video"
    )

    // Runtime permission request: RECORD_AUDIO
    private val requestAudioPermission = registerForActivityResult(
        ActivityResultContracts.RequestPermission()
    ) { granted ->
        if (granted) {
            startSpeech()
        } else {
            Toast.makeText(this, "需要麦克风权限才能语音识别，请在设置中开启。", Toast.LENGTH_LONG).show()
        }
    }

    // Callback for the speech recognition result (system recognition panel)
    private val speechLauncher = registerForActivityResult(
        ActivityResultContracts.StartActivityForResult()
    ) { result ->
        if (result.resultCode != RESULT_OK) {
            Toast.makeText(this, "未获取到语音结果", Toast.LENGTH_SHORT).show()
            return@registerForActivityResult
        }

        // EXTRA_RESULTS is ordered by confidence; take the top hypothesis.
        val data = result.data
        val text = data?.getStringArrayListExtra(RecognizerIntent.EXTRA_RESULTS)
            ?.firstOrNull()
            ?.trim()
            .orEmpty()

        if (text.isEmpty()) {
            Toast.makeText(this, "没听清楚，请再说一次", Toast.LENGTH_SHORT).show()
            return@registerForActivityResult
        }

        Toast.makeText(this, "识别结果：$text", Toast.LENGTH_SHORT).show()

        // Core: parse the recognized text -> perform the action (open the app)
        handleSpeechText(text)
    }

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        setContentView(R.layout.activity_open_app)

        btnStartSpeech = findViewById(R.id.btnStartSpeech)
        btnStartSpeech.text = "点击说话"
        btnStartSpeech.setOnClickListener { ensureAudioPermissionAndStart() }
    }

    // Starts recognition right away if the permission is held, otherwise asks for it first.
    private fun ensureAudioPermissionAndStart() {
        val granted = ContextCompat.checkSelfPermission(
            this, Manifest.permission.RECORD_AUDIO
        ) == PackageManager.PERMISSION_GRANTED

        if (granted) startSpeech()
        else requestAudioPermission.launch(Manifest.permission.RECORD_AUDIO)
    }

    // === Speech recognition entry point ===
    private fun startSpeech() {
        val intent = Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH).apply {
            putExtra(
                RecognizerIntent.EXTRA_LANGUAGE_MODEL,
                RecognizerIntent.LANGUAGE_MODEL_FREE_FORM
            )
            putExtra(RecognizerIntent.EXTRA_LANGUAGE, "zh-CN") // Chinese recognition
            putExtra(RecognizerIntent.EXTRA_PROMPT, "请说：打开腾讯视频 / 打开设置 / 打开微信")
        }

        try {
            speechLauncher.launch(intent)
        } catch (e: ActivityNotFoundException) {
            // No activity on this device handles ACTION_RECOGNIZE_SPEECH.
            Toast.makeText(this, "当前设备不支持语音识别", Toast.LENGTH_LONG).show()
        }
    }

    // === Core: recognized text -> parse -> open app ===
    private fun handleSpeechText(text: String) {
        val parsed = parseCommand(text)
        if (parsed == null) {
            Toast.makeText(this, "我没听懂。你可以说：打开腾讯视频", Toast.LENGTH_SHORT).show()
            return
        }

        when (parsed.type) {
            "OPEN_APP" -> {
                val ok = openAppByName(parsed.appName)
                if (!ok) {
                    Toast.makeText(this, "未找到应用：${parsed.appName}", Toast.LENGTH_LONG).show()
                }
            }
            else -> Toast.makeText(this, "暂不支持该指令：${parsed.type}", Toast.LENGTH_SHORT).show()
        }
    }

    // Rule: supports "打开/启动/进入 (open/launch/enter) + app name"
    private fun parseCommand(text: String): ParsedIntent? {
        val t = text.trim()
        val patterns = listOf("打开", "启动", "进入")
        val p = patterns.firstOrNull { t.startsWith(it) } ?: return null

        val app = t.removePrefix(p).trim()
        if (app.isEmpty()) return null

        return ParsedIntent(type = "OPEN_APP", appName = app)
    }

    // App name -> package name -> startActivity
    private fun openAppByName(appNameRaw: String): Boolean {
        val key = appNameRaw.trim().lowercase() // handles YouTube/youtube
        val pkg = appMap[key] ?: appMap[appNameRaw.trim()] ?: return false

        // Returns null if the app is not installed, or (when targeting Android 11+)
        // if the package is not declared under <queries> in the manifest.
        val launchIntent = packageManager.getLaunchIntentForPackage(pkg) ?: return false
        launchIntent.addFlags(Intent.FLAG_ACTIVITY_NEW_TASK)
        startActivity(launchIntent)
        return true
    }
}
```
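
The prefix rule in parseCommand is strict: it fails if the recognizer appends punctuation or the user adds a polite lead-in. Whether a given recognizer does so is an assumption here, not something the demo verifies. The sketch below is one tolerant variant in plain Kotlin; OpenAppCommand, parseOpenApp, and the filler list are illustrative names and values, not part of the demo.

```kotlin
// Hypothetical, more tolerant variant of parseCommand (plain Kotlin, no Android APIs).
data class OpenAppCommand(val appName: String)

val COMMAND_VERBS = listOf("打开", "启动", "进入")
// Longer fillers first, so "请帮我" is not cut short by matching "请".
val POLITE_FILLERS = listOf("请帮我", "帮我", "请")
val TRAILING_PUNCTUATION = "。，！？.,!?".toSet()

fun parseOpenApp(raw: String): OpenAppCommand? {
    // Drop surrounding whitespace and any trailing punctuation.
    var t = raw.trim().trimEnd { it in TRAILING_PUNCTUATION }
    // Drop at most one leading filler phrase.
    POLITE_FILLERS.firstOrNull { t.startsWith(it) }?.let { t = t.removePrefix(it).trim() }
    // The remainder must start with a known verb.
    val verb = COMMAND_VERBS.firstOrNull { t.startsWith(it) } ?: return null
    // Lowercase so Latin-script names such as "YouTube" match lowercase map keys.
    val app = t.removePrefix(verb).trim().lowercase()
    return if (app.isEmpty()) null else OpenAppCommand(app)
}
```

With this variant, "请帮我打开微信。" yields the app name "微信", while an utterance that does not begin with one of the verbs still yields null.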

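A6. Run flow (you can follow this for a live demonstration)

1. Install and launch the demo (OpenAppActivity).
2. Tap "点击说话" ("Tap to speak").
3. When the system speech panel appears, say in Chinese: "打开腾讯视频" ("open Tencent Video").
4. The demo shows the recognized text in a Toast.
5. The demo opens Tencent Video (or any other app configured in appMap).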

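Part B: A digression (for discussion): how a voice assistant's "task execution" gets implemented

When the requirement grows from "open an app" to "execute a task" (search, play, navigate, order food, send a message, and so on), it is advisable to split the system into three layers:

- ASR: speech → text
- NLU: text → structured intent
- Executor: carries the intent out on the target app (multiple strategies, layered)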
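
The three layers can be sketched as three small interfaces wired together. Everything below (VoiceIntent, Asr, Nlu, Executor, Assistant) is a hypothetical naming chosen for illustration; the stand-in lambdas exist only so the wiring can be exercised without a device.

```kotlin
// Structured intents produced by the NLU layer (hypothetical names).
sealed interface VoiceIntent {
    data class OpenApp(val appName: String) : VoiceIntent
    data class Search(val appName: String, val query: String) : VoiceIntent
}

fun interface Asr { fun transcribe(audio: ByteArray): String }
fun interface Nlu { fun parse(text: String): VoiceIntent? }
fun interface Executor { fun execute(intent: VoiceIntent): Boolean }

class Assistant(
    private val asr: Asr,
    private val nlu: Nlu,
    private val executor: Executor
) {
    // Returns true only if the text was understood and the executor succeeded.
    fun handle(audio: ByteArray): Boolean {
        val text = asr.transcribe(audio)
        val intent = nlu.parse(text) ?: return false
        return executor.execute(intent)
    }
}

// Usage with stand-in layers (no real recognition or app launching happens here).
val assistant = Assistant(
    asr = Asr { "打开微信" },
    nlu = Nlu { text ->
        if (text.startsWith("打开")) VoiceIntent.OpenApp(text.removePrefix("打开")) else null
    },
    executor = Executor { intent -> intent is VoiceIntent.OpenApp }
)
```

Keeping the layers behind interfaces means the ASR engine or the execution strategy can be replaced without touching the other two.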

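B1. Why the "executor" is the key

Example: "打开腾讯视频，搜索流浪地球并播放" ("Open Tencent Video, search for The Wandering Earth, and play it")

Execution requires at least the following:

1. Open the target app
2. Navigate to the search entry point
3. Find the input field and type "流浪地球"
4. Tap search
5. Go to the results page and tap a result (or play it)

The hard part is steps 2 to 5: a third-party app does not necessarily offer you a standard interface for them.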
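
To make the decomposition concrete, the five steps can be modeled as data that an executor later walks through. This is a sketch under assumed names (Step, planSearchAndPlay); it only builds the plan and does not touch any app.

```kotlin
// One UI-level step of a task (hypothetical model).
sealed interface Step {
    data class OpenApp(val appName: String) : Step
    object OpenSearchEntry : Step
    data class TypeText(val text: String) : Step
    object SubmitSearch : Step
    object TapFirstResult : Step
}

// Expands "open <app>, search <title>, play" into the five steps listed above.
fun planSearchAndPlay(appName: String, title: String): List<Step> = listOf(
    Step.OpenApp(appName),
    Step.OpenSearchEntry,
    Step.TypeText(title),
    Step.SubmitSearch,
    Step.TapFirstResult
)

val plan = planSearchAndPlay("腾讯视频", "流浪地球")
```

Only the first step maps onto a stable platform API (launching by package name); each of the others needs one of the strategies discussed next.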

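B2. Recommended layered execution strategy (from most to least stable)

1）Deep Link / Intent / App Link (preferred)

When an app exposes a URI scheme or an App Link, use a direct jump to cover as large a portion of the task as possible.

- Pros: stable, low maintenance cost
- Limitation: many third-party apps do not publish a "search by title and play" capability; a content ID or a short link is often required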
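
A minimal sketch of the deep-link attempt follows. The function name tryDeepLink is an assumption, and the URI `exampleapp://search?q=...` is a made-up placeholder: it is not the scheme of any real app, and a real integration would need a URI the target app actually documents.

```kotlin
import android.content.ActivityNotFoundException
import android.content.Context
import android.content.Intent
import android.net.Uri

// Tries to hand a task to the target app via a deep link.
// Returns false when nothing on the device handles the URI, so the caller
// can fall back to the next strategy.
fun tryDeepLink(context: Context, uri: Uri, targetPackage: String? = null): Boolean {
    val intent = Intent(Intent.ACTION_VIEW, uri).apply {
        addFlags(Intent.FLAG_ACTIVITY_NEW_TASK)
        // Restricting to one package avoids the system chooser dialog.
        targetPackage?.let { setPackage(it) }
    }
    return try {
        context.startActivity(intent)
        true
    } catch (e: ActivityNotFoundException) {
        false
    }
}

// HYPOTHETICAL URI: "exampleapp://search" is a placeholder, not a real scheme.
val searchUri: Uri = Uri.Builder()
    .scheme("exampleapp")
    .authority("search")
    .appendQueryParameter("q", "流浪地球")
    .build()
```

Catching ActivityNotFoundException instead of pre-checking with resolveActivity() keeps the sketch independent of package-visibility declarations.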

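2）Partner interface / SDK (best results, but requires a partnership)

A device vendor aiming for a high success rate and a polished experience will most likely end up cooperating closely with the leading apps:

- The app exposes well-defined capabilities (open the search page, run a search, open a detail page, go to a confirmation page)
- The vendor side mainly handles intent routing and parameter filling

Pros: the highest success rate; the "automation" experience can be promised at product level

Cons: depends on pushing ecosystem partnerships forward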

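3）UI automation (Accessibility) as a fallback

Only when there is neither a deep link nor a partner interface should accessibility be considered, and then only for low-risk, generic actions, for example:

- Open the app
- Find the search entry point
- Type the keyword
- Tap search
- Enter the list/detail page (optional)

For high-risk actions (sending messages, placing orders, paying), the recommendation is: stop at the confirmation page and let the user confirm.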
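
The layering described in this section amounts to an ordered fallback chain with a confirmation gate for high-risk tasks. The sketch below is one way to express it in plain Kotlin; Risk, Task, Strategy, and LayeredExecutor are hypothetical names, and the gate here asks for confirmation before any strategy runs, which is a simplification of "stop at the confirmation page".

```kotlin
enum class Risk { LOW, HIGH }

data class Task(val description: String, val risk: Risk)

// A strategy returns true if it completed the task, false to let the next one try.
fun interface Strategy { fun tryExecute(task: Task): Boolean }

class LayeredExecutor(
    private val strategies: List<Pair<String, Strategy>>, // ordered: most stable first
    private val confirm: (Task) -> Boolean                // asks the user; true = approved
) {
    // Returns the name of the strategy that succeeded,
    // or null if every strategy failed or the user declined.
    fun execute(task: Task): String? {
        if (task.risk == Risk.HIGH && !confirm(task)) return null
        return strategies.firstOrNull { (_, strategy) -> strategy.tryExecute(task) }?.first
    }
}

// Usage with stand-in strategies: only the accessibility fallback "works",
// and it refuses anything that is not low risk.
val layered = LayeredExecutor(
    strategies = listOf(
        "deep_link" to Strategy { false },     // pretend no deep link exists
        "partner_api" to Strategy { false },   // pretend no partnership exists
        "accessibility" to Strategy { it.risk == Risk.LOW }
    ),
    confirm = { false }                        // pretend the user declines
)
```

Returning the strategy name rather than a bare Boolean makes it easy to log which layer handled each task.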

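B3. The underlying mechanics of UI automation (Accessibility)

An AccessibilityService can:

- Read the foreground UI's AccessibilityNodeInfo tree through rootInActiveWindow
- Locate controls in the node tree by features such as text / content-desc / viewId / class
- Perform actions:
  - Click: ACTION_CLICK
  - Text input: ACTION_SET_TEXT
  - Scroll: ACTION_SCROLL_FORWARD
  - Back: performGlobalAction(GLOBAL_ACTION_BACK)
  - Gestures: dispatchGesture(...) (a degraded option for when the node tree is unusable)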
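
The calls listed above can be combined as in the following sketch. The class name AssistAccessibilityService and its helper methods are assumptions for illustration. A real service must also be declared in the manifest with the BIND_ACCESSIBILITY_SERVICE permission and enabled by the user in system settings, which is omitted here.

```kotlin
import android.accessibilityservice.AccessibilityService
import android.os.Bundle
import android.view.accessibility.AccessibilityEvent
import android.view.accessibility.AccessibilityNodeInfo

// Hypothetical service showing the basic node operations.
class AssistAccessibilityService : AccessibilityService() {

    override fun onAccessibilityEvent(event: AccessibilityEvent?) { /* unused in this sketch */ }
    override fun onInterrupt() {}

    // Clicks the first node matched by a text search for [label].
    fun clickByText(label: String): Boolean {
        val root = rootInActiveWindow ?: return false
        val node = root.findAccessibilityNodeInfosByText(label).firstOrNull() ?: return false
        return clickSelfOrAncestor(node)
    }

    // The matched node is often a non-clickable label; walk up to a clickable ancestor.
    private fun clickSelfOrAncestor(node: AccessibilityNodeInfo): Boolean {
        var current: AccessibilityNodeInfo? = node
        while (current != null) {
            if (current.isClickable) {
                return current.performAction(AccessibilityNodeInfo.ACTION_CLICK)
            }
            current = current.parent
        }
        return false
    }

    // Types [text] into the first editable node found in the active window.
    fun setTextInFirstEditable(text: String): Boolean {
        val root = rootInActiveWindow ?: return false
        val field = findFirst(root) { it.isEditable } ?: return false
        val args = Bundle().apply {
            putCharSequence(AccessibilityNodeInfo.ACTION_ARGUMENT_SET_TEXT_CHARSEQUENCE, text)
        }
        return field.performAction(AccessibilityNodeInfo.ACTION_SET_TEXT, args)
    }

    // Depth-first search over the node tree.
    private fun findFirst(
        node: AccessibilityNodeInfo,
        match: (AccessibilityNodeInfo) -> Boolean
    ): AccessibilityNodeInfo? {
        if (match(node)) return node
        for (i in 0 until node.childCount) {
            val child = node.getChild(i) ?: continue
            findFirst(child, match)?.let { return it }
        }
        return null
    }

    fun goBack(): Boolean = performGlobalAction(GLOBAL_ACTION_BACK)
}
```

Every helper returns a Boolean so that a caller can detect failure at each step and degrade instead of continuing blindly.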

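Why it is hard to make "fully generic":

- UI redesigns, A/B tests, and resolution changes can invalidate the locating rules
- Custom-drawn UIs, games, Canvas, and WebView scenes expose very little information in the node tree
- The target app may apply anti-automation risk controls (detecting enabled accessibility services, restricting flows)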

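Mature products typically do the following:

- A generic engine covers a small set of generic actions (open / search / input / submit)
- A small number of App Profiles are maintained for the leading apps (configuration of key node features)
- Failures degrade aggressively (prompt the user to finish or confirm manually, keeping the experience predictable)
- Risk control: capability whitelists, action confirmation, and audit logging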
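
An App Profile can be as simple as a per-package table of node selectors guarded by a whitelist. The sketch below is one hypothetical shape for it (UiAction, AppProfile, ProfileRegistry are illustrative names); the package name and view ID in the usage are made-up placeholders, not taken from any real app.

```kotlin
enum class UiAction { OPEN_SEARCH, SUBMIT_SEARCH }

// Per-app hints for locating key nodes.
data class AppProfile(val packageName: String, val viewIds: Map<UiAction, String>)

class ProfileRegistry(profiles: List<AppProfile>, private val allowed: Set<UiAction>) {
    private val byPackage = profiles.associateBy { it.packageName }

    // Returns the view ID to target, or null, which the caller should treat
    // as "degrade to manual handling".
    fun selectorFor(packageName: String, action: UiAction): String? {
        if (action !in allowed) return null // whitelist check comes first
        return byPackage[packageName]?.viewIds?.get(action)
    }
}

// Usage with placeholder values (com.example.video is not a real target app).
val registry = ProfileRegistry(
    profiles = listOf(
        AppProfile(
            packageName = "com.example.video",
            viewIds = mapOf(UiAction.OPEN_SEARCH to "com.example.video:id/search_entry")
        )
    ),
    allowed = setOf(UiAction.OPEN_SEARCH)
)
```

Because profiles are plain data, they could be shipped and updated separately from the app binary when a target app changes its UI.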

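B4. Practical conclusions for external/internal communication (suggested wording)

- "Opening an app by voice" is a low-risk, most-stable baseline capability
- "Precisely executing complex tasks across apps" is best handled with a layered strategy:
  - Use a deep link where one exists
  - Use a partner interface where a partnership is possible
  - Use accessibility only as a low-risk fallback, with strict degradation and confirmation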