[Part 05] Python Dictionaries and Sets

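Dictionaries and sets are two of Python's most elegant data structures: dictionaries use key-value pairs for very fast lookups, and sets use a hash table for efficient deduplication and mathematical set operations. Master them and you can write code that is both fast and Pythonic.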

📑 Contents

1. Creating, Accessing, and Modifying Dictionaries
2. Dictionary Methods: get / setdefault / update / pop / items
3. Dictionary Comprehensions and OrderedDict
4. Set Operations: Intersection / Union / Difference / Symmetric Difference
5. frozenset and Hashability
6. Nested Dictionaries and defaultdict / Counter
7. Hands-on Demos: Word Frequency Counter and Data Deduplication Tool

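1. Creating, Accessing, and Modifying Dictionaries

A dictionary (dict) is one of the most important data structures in Python. It stores data as key-value pairs, and lookups take O(1) time on average. If a list is an "ordered shelf", a dictionary is a "chest of labeled drawers": there is no rummaging, you name the label and get the item.

1.1 Five Ways to Create a Dictionary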

```python
# Option 1: curly-brace literal (the most common)
person = {"name": "Alice", "age": 28, "city": "Beijing"}

# Option 2: the dict() constructor
config = dict(host="localhost", port=8080, debug=True)

# Option 3: from a sequence of key-value pairs
pairs = [("x", 1), ("y", 2), ("z", 3)]
coords = dict(pairs)

# Option 4: fromkeys() creates many keys that share one value
defaults = dict.fromkeys(["name", "email", "phone"], "N/A")

# Option 5: dictionary comprehension (covered in detail below)
squares = {x: x**2 for x in range(6)}

print(person)    # {'name': 'Alice', 'age': 28, 'city': 'Beijing'}
print(config)    # {'host': 'localhost', 'port': 8080, 'debug': True}
print(defaults)  # {'name': 'N/A', 'email': 'N/A', 'phone': 'N/A'}
print(squares)   # {0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25}
```

1.2 Accessing Dictionary Values

```python
user = {"name": "Bob", "age": 25, "skills": ["Python", "Go"]}

# Bracket access: raises KeyError if the key is missing
print(user["name"])       # Bob

# get(): returns None (or a default) if the key is missing, which is safer
print(user.get("email"))                  # None
print(user.get("email", "not provided"))  # not provided

# Membership test: use in to check whether a key exists
print("age" in user)     # True
print("email" in user)   # False
```

⚠️ The KeyError trap: accessing a missing key with d[key] raises KeyError. When you are not sure a key exists, prefer d.get(key), or check with key in d first.
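The three ways of coping with a possibly missing key can be compared side by side. This is only a minimal sketch: the `settings` dictionary, the `timeout` key, and the fallback of 30 are made up for illustration.

```python
settings = {"host": "localhost"}  # hypothetical example data

# 1. Bracket access: catch KeyError when a missing key is an exceptional case.
try:
    timeout = settings["timeout"]
except KeyError:
    timeout = 30

# 2. get(): a missing key is expected and a fallback value is good enough.
timeout_via_get = settings.get("timeout", 30)

# 3. in: branch on presence without reading the value.
has_timeout = "timeout" in settings

print(timeout, timeout_via_get, has_timeout)  # 30 30 False
```

None of the three approaches adds `timeout` to `settings`; they only differ in how the absence is reported.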

1.3 Modifying a Dictionary

```python
profile = {"name": "Carol", "level": 3}

# Add or modify key-value pairs
profile["level"] = 4             # modify an existing key
profile["email"] = "c@test.com"  # add a new key

# Remove key-value pairs
del profile["level"]                # delete the given key
removed = profile.pop("email")      # delete and return the value

print(profile)    # {'name': 'Carol'}
print(removed)    # c@test.com
```

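2. Dictionary Methods: get / setdefault / update / pop / items

Dictionaries come with a rich set of built-in methods, and knowing them well makes code shorter and more Pythonic. The five most common ones are covered below.

2.1 get(key, default)

Safe access: returns the default value instead of raising an exception when the key is missing.

```python
scores = {"math": 95, "english": 88}

# Bracket access raises on a missing key
# val = scores["physics"]  ❌ KeyError!

# Safe access
val = scores.get("physics", 0)   # returns 0
print(val)  # 0
```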

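2.2 setdefault(key, default)

If the key is missing, setdefault inserts the default and returns it; if the key already exists, it returns the existing value unchanged. This makes it handy for building nested structures.

```python
# Scenario: group words by their first letter
words = ["apple", "apricot", "banana", "blueberry", "cherry"]
groups = {}

for word in words:
    key = word[0]
    groups.setdefault(key, []).append(word)

print(groups)
# {'a': ['apple', 'apricot'], 'b': ['banana', 'blueberry'], 'c': ['cherry']}
```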

💡 setdefault vs get: get() only reads and never modifies the dictionary, whereas setdefault() also inserts the default when the key is missing. Use setdefault when the key must exist afterwards.
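The difference is easiest to see by inspecting the dictionary after each call. A small sketch, with `inventory` as hypothetical example data:

```python
inventory = {"apple": 3}  # hypothetical example data

# get() returns the fallback but leaves the dictionary untouched.
pears = inventory.get("pear", 0)
after_get = dict(inventory)

# setdefault() returns the fallback and also stores it under the key.
pears_again = inventory.setdefault("pear", 0)
after_setdefault = dict(inventory)

# For an existing key, setdefault() ignores the default and returns the stored value.
apples = inventory.setdefault("apple", 99)

print(after_get)         # {'apple': 3}
print(after_setdefault)  # {'apple': 3, 'pear': 0}
print(apples)            # 3
```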

2.3 update(other)

Merges another dictionary (or an iterable of key-value pairs) into this one in place; existing keys are overwritten.

```python
base = {"host": "localhost", "port": 80}
override = {"port": 8080, "debug": True}

base.update(override)
print(base)
# {'host': 'localhost', 'port': 8080, 'debug': True}

# Keyword arguments work as well
base.update(port=3000, log_level="INFO")
print(base)
# {'host': 'localhost', 'port': 3000, 'debug': True, 'log_level': 'INFO'}
```
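Because update() modifies the dictionary it is called on, it may help to see how a merge can be done without touching the original. The sketch below shows two common options, dictionary unpacking and copy-then-update; the names `defaults_cfg` and `user_cfg` are illustrative only.

```python
defaults_cfg = {"host": "localhost", "port": 80}   # illustrative names
user_cfg = {"port": 8080, "debug": True}

# Unpacking builds a new dict; later entries win on duplicate keys.
merged = {**defaults_cfg, **user_cfg}

# Equivalent two-step form: copy first, then update the copy.
merged_copy = defaults_cfg.copy()
merged_copy.update(user_cfg)

print(merged)        # {'host': 'localhost', 'port': 8080, 'debug': True}
print(defaults_cfg)  # {'host': 'localhost', 'port': 80}
```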

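2.4 pop(key, default)

Removes the given key and returns its value. If the key is missing, pop returns the default when one is supplied and raises KeyError otherwise.

```python
cache = {"a": 1, "b": 2, "c": 3}

val = cache.pop("b")          # removes the key and returns 2
val2 = cache.pop("z", None)   # key is missing, returns None

print(cache)   # {'a': 1, 'c': 3}

# popitem(): removes and returns the most recently inserted pair (LIFO order)
last = cache.popitem()
print(last)    # ('c', 3)
```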

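2.5 items() / keys() / values()

These three methods return dynamic views of the key-value pairs, the keys, and the values. A view does not copy data; it is a "window" onto the original dictionary, so when the dictionary changes, the view changes with it.

```python
fruit = {"apple": 5, "banana": 3, "cherry": 8}

# Iterating over key-value pairs: the most common pattern
for k, v in fruit.items():
    print(f"{k}: {v}")

# Views are dynamic
keys_view = fruit.keys()    # dict_keys(['apple', 'banana', 'cherry'])
fruit["date"] = 2
print(list(keys_view))     # ['apple', 'banana', 'cherry', 'date'] ← updated automatically!
```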
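Key views also behave like sets, which ties in with the set operations covered later in this article. The sketch below compares the keys of two dictionaries without converting them to sets first; `old_cfg` and `new_cfg` are hypothetical example data.

```python
old_cfg = {"host": "localhost", "port": 80}   # hypothetical example data
new_cfg = {"port": 8080, "debug": True}

shared = old_cfg.keys() & new_cfg.keys()    # keys present in both
dropped = old_cfg.keys() - new_cfg.keys()   # keys only in old_cfg

print(shared)    # {'port'}
print(dropped)   # {'host'}
```

The results are ordinary set objects, so they no longer track later changes to the dictionaries.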

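Method quick reference

| Method | Purpose | When the key is missing |
| --- | --- | --- |
| `get(k, d)` | Get a value | Returns d (None by default) |
| `setdefault(k, d)` | Get a value, inserting it if absent | Inserts d and returns it |
| `update(other)` | Bulk merge | Adds the new key-value pair |
| `pop(k, d)` | Remove and return a value | Returns d, or raises KeyError if d is not given |
| `items()` | View of key-value pairs | n/a |
| `keys()` | View of keys | n/a |
| `values()` | View of values | n/a |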

3. Dictionary Comprehensions and OrderedDict

3.1 Dictionary Comprehensions

Like list comprehensions, dictionary comprehensions use the {k: v for ...} syntax to transform, filter, and build a dictionary in a single line.

```python
# Basics: map numbers to their squares
squares = {x: x**2 for x in range(1, 6)}
print(squares)  # {1: 1, 2: 4, 3: 9, 4: 16, 5: 25}

# In practice: swap keys and values
mapping = {"id_1": "Alice", "id_2": "Bob", "id_3": "Carol"}
reverse = {v: k for k, v in mapping.items()}
print(reverse)  # {'Alice': 'id_1', 'Bob': 'id_2', 'Carol': 'id_3'}

# With a filter: keep only passing scores
raw_scores = {"Alice": 85, "Bob": 42, "Carol": 91, "Dave": 58}
passed = {name: score for name, score in raw_scores.items() if score >= 60}
print(passed)  # {'Alice': 85, 'Carol': 91}

# Pairing with zip(): build a dictionary from two sequences
keys = ["a", "b", "c"]
vals = [1, 2, 3]
merged = {k: v for k, v in zip(keys, vals)}
print(merged)  # {'a': 1, 'b': 2, 'c': 3}
```

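📖 The dict(zip()) shortcut: if you only need to combine two sequences into a dictionary, with no filtering or transformation, dict(zip(keys, vals)) is more concise than a comprehension.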
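A short comparison may make the trade-off concrete. The sketch below also shows one behavior worth keeping in mind: zip() stops at the shorter input. The variable names are illustrative.

```python
field_names = ["a", "b", "c"]   # illustrative names
field_vals = [1, 2, 3]

direct = dict(zip(field_names, field_vals))
via_comprehension = {k: v for k, v in zip(field_names, field_vals)}

# zip() stops at the shorter input, so the surplus key "c" is silently dropped.
short = dict(zip(field_names, [10, 20]))

# The comprehension earns its keep once a transform or filter is involved.
doubled_odd = {k: v * 2 for k, v in zip(field_names, field_vals) if v % 2 == 1}

print(direct)       # {'a': 1, 'b': 2, 'c': 3}
print(short)        # {'a': 10, 'b': 20}
print(doubled_odd)  # {'a': 2, 'c': 6}
```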

3.2 OrderedDict

Since Python 3.7, the standard dict is guaranteed to preserve insertion order. collections.OrderedDict still has its uses, though:

```python
from collections import OrderedDict

# 1. Stating explicitly that order matters
od = OrderedDict()
od["first"] = 1
od["second"] = 2
od["third"] = 3

# 2. move_to_end(): move a key to the end (the core operation of an LRU cache)
od.move_to_end("first")      # "first" is now last
print(list(od.keys()))       # ['second', 'third', 'first']

# 3. Order-sensitive equality
d1 = OrderedDict([("a", 1), ("b", 2)])
d2 = OrderedDict([("b", 2), ("a", 1)])
print(d1 == d2)  # False: different order means not equal

# A plain dict only compares contents
print({"a": 1, "b": 2} == {"b": 2, "a": 1})  # True
```

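💡 When to use OrderedDict? When you need order operations such as move_to_end() or popitem(last=False), when equality comparisons should be order-sensitive, or when you want the code to **state explicitly that order matters**. Otherwise a plain dict is enough.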
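To show how move_to_end() and popitem(last=False) fit together, here is a minimal LRU cache sketch. The `LRUCache` class and its method names are hypothetical and written only for this illustration; it is not thread-safe, and for caching function results the standard library's functools.lru_cache is usually the simpler choice.

```python
from collections import OrderedDict

class LRUCache:
    """Tiny least-recently-used cache (illustrative sketch, not thread-safe)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least recently used entry

    def keys(self):
        return list(self._data)

lru = LRUCache(capacity=2)
lru.put("a", 1)
lru.put("b", 2)
lru.get("a")        # "a" becomes the most recently used key
lru.put("c", 3)     # capacity exceeded, so "b" is evicted
print(lru.keys())   # ['a', 'c']
```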

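4. Set Operations: Intersection / Union / Difference / Symmetric Difference

A set is Python's built-in unordered container of unique elements (frozenset, covered in section 5, is its immutable counterpart). Like dictionary keys, it is backed by a hash table, so lookup, insertion, and deletion are O(1) on average. Sets excel at two things: deduplication and mathematical set operations.

4.1 Creating Sets

```python
# Curly-brace literal (note: an empty set must be written set(), because {} is an empty dict)
colors = {"red", "green", "blue"}
empty = set()            # ✅ empty set
# empty = {}              ❌ this is an empty dict!

# From a list (duplicates are removed automatically)
nums = set([1, 2, 2, 3, 3, 3])
print(nums)  # {1, 2, 3}

# Set comprehension
evens = {x for x in range(10) if x % 2 == 0}
print(evens)  # {0, 2, 4, 6, 8}
```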
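Converting to a set removes duplicates but also discards the original order. Where order matters, a set can still do the fast membership checks while a list keeps the order. The helper below, `dedupe_keep_order`, is a hypothetical name used for this sketch; the last line shows an alternative that relies on dict keys preserving insertion order.

```python
def dedupe_keep_order(items):
    """Return items without duplicates, keeping first occurrences in order."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:   # average O(1) membership test
            seen.add(item)
            result.append(item)
    return result

tags = ["b", "a", "b", "c", "a"]
print(dedupe_keep_order(tags))     # ['b', 'a', 'c']
print(list(dict.fromkeys(tags)))   # ['b', 'a', 'c'], same result via dict keys
```

Both variants require the items to be hashable.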

4.2 The Four Core Set Operations

```
        A = {1, 2, 3, 4, 5}        B = {4, 5, 6, 7, 8}

   ┌─────────────┬─────────────┬─────────────┐
   │   1  2  3   │    4  5     │   6  7  8   │
   │    A - B    │    A & B    │    B - A    │
   └─────────────┴─────────────┴─────────────┘

   A | B = all three regions      A ^ B = the two outer regions
```

```python
A = {1, 2, 3, 4, 5}
B = {4, 5, 6, 7, 8}

# ━━━ Intersection ━━━ elements in both A and B
print(A & B)              # {4, 5}
print(A.intersection(B))  # {4, 5}

# ━━━ Union ━━━ all elements of A and B (duplicates removed)
print(A | B)        # {1, 2, 3, 4, 5, 6, 7, 8}
print(A.union(B))   # {1, 2, 3, 4, 5, 6, 7, 8}

# ━━━ Difference ━━━ elements in A but not in B
print(A - B)            # {1, 2, 3}
print(A.difference(B))  # {1, 2, 3}

# ━━━ Symmetric difference ━━━ elements in exactly one of A and B
print(A ^ B)                      # {1, 2, 3, 6, 7, 8}
print(A.symmetric_difference(B))  # {1, 2, 3, 6, 7, 8}
```

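4.3 Operators vs Method Names

| Operation | Operator | Method | In-place version (modifies the set) |
| --- | --- | --- | --- |
| Intersection | `A & B` | `A.intersection(B)` | `A.intersection_update(B)` |
| Union | `A \| B` | `A.union(B)` | `A.update(B)` |
| Difference | `A - B` | `A.difference(B)` | `A.difference_update(B)` |
| Symmetric difference | `A ^ B` | `A.symmetric_difference(B)` | `A.symmetric_difference_update(B)` |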

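📖 Operators vs methods: the operators require both operands to be sets (set or frozenset), while the methods accept any iterable, such as a list or tuple. For example, A.intersection([4, 5, 6]) accepts the list directly, whereas A & [4, 5, 6] raises TypeError.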
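The following sketch runs both forms against the same inputs so the difference is visible. `base_set` and `operator_error` are names chosen for the example.

```python
base_set = {1, 2, 3, 4, 5}

# Methods accept any iterable.
via_method = base_set.intersection([4, 5, 6])
print(sorted(via_method))                  # [4, 5]
print(sorted(base_set.union((6, 7))))      # [1, 2, 3, 4, 5, 6, 7]

# Operators insist on set operands and raise TypeError otherwise.
try:
    base_set & [4, 5, 6]
    operator_error = None
except TypeError as exc:
    operator_error = type(exc).__name__

print(operator_error)                      # TypeError

# Converting the list first makes the operator form work.
via_operator = base_set & set([4, 5, 6])
print(sorted(via_operator))                # [4, 5]
```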

4.4 Subset and Superset Tests

```python
small = {1, 2}
big = {1, 2, 3, 4}

print(small <= big)             # True: small is a subset of big
print(small.issubset(big))      # True
print(big >= small)             # True: big is a superset of small
print(big.issuperset(small))    # True
print(small < big)              # True: proper subset (subset and not equal)
```
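Subset tests map naturally onto "does this have everything that is required?" questions. The sketch below uses made-up permission names to show the subset test together with a difference that reports what is missing.

```python
required = {"read", "write"}             # hypothetical permission names
granted = {"read", "write", "delete"}

can_proceed = required <= granted        # every required permission is granted
missing = required - granted             # empty set when nothing is missing

guest = {"read"}
guest_can_proceed = required <= guest
guest_missing = required - guest

print(can_proceed)         # True
print(missing)             # set()
print(guest_can_proceed)   # False
print(guest_missing)       # {'write'}
```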

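5. frozenset and Hashability

5.1 What Is Immutability?

A set is mutable: you can add or remove elements at any time. Python's built-in mutable containers (list, set, dict) are not hashable, so a set cannot be used as a dictionary key or placed inside another set. frozenset is the immutable version of a set: once created, it cannot be modified.

```python
# A set cannot be used as a dictionary key
# d = {{1, 2}: "value"}  ❌ TypeError: unhashable type: 'set'

# A frozenset can!
fs = frozenset([1, 2, 3])
d = {fs: "hello"}
print(d[fs])  # hello

# A frozenset can also be placed inside a set
s = {frozenset([1, 2]), frozenset([3, 4])}
print(s)  # {frozenset({1, 2}), frozenset({3, 4})}

# frozenset does not support adding or removing elements
# fs.add(4)  ❌ AttributeError

# All read-only operations still work
fs1 = frozenset([1, 2, 3])
fs2 = frozenset([2, 3, 4])
print(fs1 & fs2)   # frozenset({2, 3})
print(fs1 | fs2)   # frozenset({1, 2, 3, 4})
```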
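5.2 Hashability Rules

⚠️ Core rule: an object is hashable if its hash value never changes during its lifetime and it can be compared for equality with other objects. Objects that compare equal must have the same hash value.

| Type | Hashable? | Usable as a dict key / set element? |
| --- | --- | --- |
| `int`, `float`, `str`, `bool` | ✅ Yes | ✅ Yes |
| `tuple` (all elements hashable) | ✅ Yes | ✅ Yes |
| `frozenset` | ✅ Yes | ✅ Yes |
| `list` | ❌ No | ❌ No |
| `set` | ❌ No | ❌ No |
| `dict` | ❌ No | ❌ No |
| `tuple` (containing a list) | ❌ No | ❌ No |

```python
# Classic trap: a tuple that contains a list is not hashable either
t = (1, [2, 3], "hello")
# hash(t)  ❌ TypeError: unhashable type: 'list'

# Fix: convert the inner list to a tuple
t_safe = (1, (2, 3), "hello")
print(hash(t_safe))  # ✅ the hash is computed without error
```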
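When the nesting is deeper than one level, the conversion can be done recursively. The helper below, `to_hashable`, is a hypothetical function written for this sketch; it assumes the data consists only of lists, tuples, sets, dicts, and already-hashable leaves.

```python
def to_hashable(value):
    """Recursively convert lists, sets, and dicts into hashable equivalents."""
    if isinstance(value, (list, tuple)):
        return tuple(to_hashable(v) for v in value)
    if isinstance(value, (set, frozenset)):
        return frozenset(to_hashable(v) for v in value)
    if isinstance(value, dict):
        # A frozenset of (key, value) pairs ignores key order, like dict equality.
        return frozenset((k, to_hashable(v)) for k, v in value.items())
    return value

record = (1, [2, 3], {"tags": {"x", "y"}})
frozen_key = to_hashable(record)
seen_records = {frozen_key}

print(frozen_key[1])  # (2, 3)
print(to_hashable((1, [2, 3], {"tags": {"y", "x"}})) in seen_records)  # True
```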

6. Nested Dictionaries and defaultdict / Counter

6.1 Nested Dictionaries

In real projects dictionaries are often nested: in a user record, for instance, the address is itself a dictionary. Values are reached one level at a time.

```python
users = {
    "alice": {
        "age": 28,
        "address": {"city": "Beijing", "zip": "100000"},
        "skills": ["Python", "SQL"]
    },
    "bob": {
        "age": 32,
        "address": {"city": "Shanghai", "zip": "200000"},
        "skills": ["Go", "Rust"]
    }
}

# Level-by-level access
print(users["alice"]["address"]["city"])  # Beijing

# ⚠️ A missing key at any level raises KeyError
# users["charlie"]["age"]  ❌ KeyError

# Safe access: chained get()
city = users.get("charlie", {}).get("address", {}).get("city", "unknown")
print(city)  # unknown
```
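Chained get() calls become noisy as the nesting grows, so the same idea can be wrapped in a small helper. `deep_get` below is a hypothetical function for illustration, not a standard library API, and `nested` is example data.

```python
def deep_get(data, *keys, default=None):
    """Walk nested dicts along keys; return default if any step is missing."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

nested = {"alice": {"address": {"city": "Beijing"}}}  # example data

print(deep_get(nested, "alice", "address", "city"))                       # Beijing
print(deep_get(nested, "charlie", "address", "city", default="unknown"))  # unknown
print(deep_get(nested, "alice", "address", "city", "district"))           # None
```

Unlike the chained form, the helper also copes with a non-dict value in the middle of the path, as the last call shows.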

6.2 defaultdict: A Self-Initializing Dictionary

Whenever a missing key is accessed with d[key], defaultdict calls its factory function to create a default value, which removes the need for manual initialization.

```python
from collections import defaultdict

# ━━━ Classic use: grouping ━━━
words = ["apple", "avocado", "banana", "blueberry", "cherry", "coconut"]

by_letter = defaultdict(list)  # the factory is list, so the default is []
for w in words:
    by_letter[w[0]].append(w)

print(dict(by_letter))
# {'a': ['apple', 'avocado'], 'b': ['banana', 'blueberry'], 'c': ['cherry', 'coconut']}

# ━━━ Counting pattern ━━━
counter = defaultdict(int)   # the factory is int, so the default is 0
for ch in "abracadabra":
    counter[ch] += 1

print(dict(counter))
# {'a': 5, 'b': 2, 'r': 2, 'c': 1, 'd': 1}

# ━━━ Nested defaultdict ━━━
# A sparse 2D matrix whose cells default to 0
matrix = defaultdict(lambda: defaultdict(int))
matrix[0][0] = 1
matrix[1][2] = 5
print(dict(matrix[0]))  # {0: 1}
```
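One consequence of this automatic initialization is easy to overlook: merely reading a missing key with brackets inserts it. The sketch below shows the effect and the two lookups that avoid it; `visits` and the page names are made up.

```python
from collections import defaultdict

visits = defaultdict(int)   # made-up example data
visits["home"] += 1

# Merely reading a missing key through [] creates it.
_ = visits["about"]
keys_after_read = sorted(visits)

# get() and in do not trigger the factory.
contact_count = visits.get("contact")
contact_present = "contact" in visits

print(keys_after_read)   # ['about', 'home']
print(contact_count)     # None
print(contact_present)   # False
```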

6.3 Counter: A Purpose-Built Counter

defaultdict(int) can count too, but Counter adds methods designed specifically for counting.

```python
from collections import Counter

# Create a Counter
c = Counter("abracadabra")
print(c)  # Counter({'a': 5, 'b': 2, 'r': 2, 'c': 1, 'd': 1})

# The N most common elements
print(c.most_common(3))
# [('a', 5), ('b', 2), ('r', 2)]

# Counter supports arithmetic!
c1 = Counter("aabbcc")
c2 = Counter("bbccdd")

print(c1 + c2)   # Counter({'b': 4, 'c': 4, 'a': 2, 'd': 2})  add counts
print(c1 - c2)   # Counter({'a': 2})  keep only positive differences
print(c1 & c2)   # Counter({'b': 2, 'c': 2})  minimum of each count (intersection)
print(c1 | c2)   # Counter({'a': 2, 'b': 2, 'c': 2, 'd': 2})  maximum of each count (union)

# elements(): expand the counts into an iterator
c3 = Counter(a=2, b=3)
print(list(c3.elements()))  # ['a', 'a', 'b', 'b', 'b']

# update(): add further counts
c = Counter(["a", "b"])
c.update(["a", "c"])
print(c)  # Counter({'a': 2, 'b': 1, 'c': 1})
```

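💡 defaultdict or Counter? For pure counting use Counter, which offers most_common(), arithmetic operators, and other dedicated methods. When you need a more flexible default value (a list, a custom object), use defaultdict.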
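The two also differ in how they treat a lookup of a key that was never counted. The sketch below counts the same string both ways and then reads a missing key from each; the variable names are illustrative.

```python
from collections import Counter, defaultdict

letters = "banana"
char_counts = Counter(letters)
manual_counts = defaultdict(int)
for ch in letters:
    manual_counts[ch] += 1

same_counts = dict(char_counts) == dict(manual_counts)

# Reading a missing key gives 0 in both cases, but only defaultdict stores it.
counter_zero = char_counts["z"]
manual_zero = manual_counts["z"]

print(same_counts)                                  # True
print(counter_zero, manual_zero)                    # 0 0
print("z" in char_counts, "z" in manual_counts)     # False True
```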

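7. Hands-on Demos: Word Frequency Counter and Data Deduplication Tool

Now let's combine everything covered so far into two practical tools.

Demo 1: Word Frequency Counter

Given a piece of text, count how often each word occurs, ignoring case, filtering out stop words, and printing the results sorted by frequency.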

```python
from collections import Counter
import re

def word_frequency(text, top_n=10, stopwords=None):
    """Word frequency counter.

    Args:
        text:      input text
        top_n:     number of most frequent words to return
        stopwords: collection of stop words to ignore

    Returns:
        A list of (word, count) tuples in descending order of frequency
    """
    # 1. Clean the text: lowercase it and extract purely alphabetic words
    words = re.findall(r'[a-z]+', text.lower())

    # 2. Filter out stop words
    if stopwords:
        stop_set = set(stopwords)
        words = [w for w in words if w not in stop_set]

    # 3. Count frequencies with Counter
    freq = Counter(words)

    # 4. Return the top N
    return freq.most_common(top_n)


# ━━━ Test ━━━
sample = """
Python is a great programming language. Python is widely used
for web development, data science, and machine learning.
Many developers love Python because Python is easy to learn
and Python has a rich ecosystem of libraries.
"""

stops = {"a", "an", "the", "is", "and", "of", "to", "for", "has"}

result = word_frequency(sample, top_n=8, stopwords=stops)

print("📊 Word frequency results:")
for rank, (word, count) in enumerate(result, 1):
    bar = "█" * count
    print(f"  {rank:2d}. {word:12s} {count:2d} {bar}")
```

Output ("is" is a stop word, so it does not appear; words with equal counts keep their order of first appearance):

```
📊 Word frequency results:
   1. python        5 █████
   2. great         1 █
   3. programming   1 █
   4. language      1 █
   5. widely        1 █
   6. used          1 █
   7. web           1 █
   8. development   1 █
```

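Demo 2: Data Deduplication and Comparison Tool

This demo simulates user records from two data sources, using sets to deduplicate quickly and find differences, and dictionaries to build a fast index.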

```python
# ━━━ Sample data ━━━
source_a = [
    {"id": 1001, "name": "Alice",  "dept": "Engineering"},
    {"id": 1002, "name": "Bob",    "dept": "Marketing"},
    {"id": 1003, "name": "Carol",  "dept": "Engineering"},
    {"id": 1004, "name": "Dave",   "dept": "Sales"},
    {"id": 1002, "name": "Bob",    "dept": "Marketing"},  # duplicate
]

source_b = [
    {"id": 1002, "name": "Bob",    "dept": "Marketing"},
    {"id": 1003, "name": "Carol",  "dept": "Design"},        # different department!
    {"id": 1005, "name": "Eve",    "dept": "Engineering"},
]

# ━━━ Step 1: deduplicate a single source (by id) ━━━
def deduplicate(records, key="id"):
    """Deduplicate by the given key, keeping the last record seen for each key."""
    index = {r[key]: r for r in records}  # later records overwrite earlier ones
    return list(index.values())

unique_a = deduplicate(source_a)
print(f"Source A deduplicated: {len(source_a)} → {len(unique_a)} records")
# Source A deduplicated: 5 → 4 records

# ━━━ Step 2: compare two sources ━━━
def diff_sources(a, b, key="id"):
    """Compare two sources and report added, removed, and modified records."""
    map_a = {r[key]: r for r in a}
    map_b = {r[key]: r for r in b}

    ids_a = set(map_a.keys())
    ids_b = set(map_b.keys())

    # sorted() makes the report order deterministic (sets are unordered)
    result = {
        "only_in_a":  [map_a[i] for i in sorted(ids_a - ids_b)],  # difference: in A, not in B
        "only_in_b":  [map_b[i] for i in sorted(ids_b - ids_a)],  # difference: in B, not in A
        "in_both":    sorted(ids_a & ids_b),                      # intersection
        "modified":   [],                                         # content changed
    }

    # Check shared records for content changes
    for i in result["in_both"]:
        if map_a[i] != map_b[i]:
            result["modified"].append({
                "id": i,
                "old": map_a[i],
                "new": map_b[i],
            })

    return result

diff = diff_sources(unique_a, source_b)

print("\n📊 Data diff report:")
print(f"  Only in A: {[r['name'] for r in diff['only_in_a']]}")
print(f"  Only in B: {[r['name'] for r in diff['only_in_b']]}")
print(f"  Shared IDs: {diff['in_both']}")
print(f"  Modified:   {[(m['id'], m['old']['dept'], '→', m['new']['dept']) for m in diff['modified']]}")
```

Output:

```
Source A deduplicated: 5 → 4 records

📊 Data diff report:
  Only in A: ['Alice', 'Dave']
  Only in B: ['Eve']
  Shared IDs: [1002, 1003]
  Modified:   [(1003, 'Engineering', '→', 'Design')]
```

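Demo 3: The Complete Tool Script

The two features above are combined into a single script that can be run from the command line and accepts piped input: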

```python
#!/usr/bin/env python3
"""dict_set_toolkit.py: a hands-on toolkit for dictionaries and sets."""

from collections import Counter
import re
import sys

# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
#  Tool 1: word frequency analyzer
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

STOPWORDS_EN = {
    "the", "a", "an", "is", "are", "was", "were",
    "be", "been", "being", "have", "has", "had",
    "do", "does", "did", "will", "would",
    "could", "should", "may", "might",
    "shall", "can", "need", "dare",
    "and", "or", "but", "if", "of", "at",
    "by", "for", "with", "about",
    "to", "from", "in", "on", "it",
    "its", "that", "this", "these", "those",
}

def analyze_word_frequency(text, top_n=20, min_len=2,
                           stopwords=None):
    """Advanced word frequency analysis.

    Args:
        text:      input text
        top_n:     number of top words to report
        min_len:   minimum word length (filters out single letters and similar noise)
        stopwords: collection of stop words (defaults to STOPWORDS_EN)
    """
    words = re.findall(r'[a-z]+', text.lower())

    # Use a set to filter out stop words and short words
    skip = STOPWORDS_EN if stopwords is None else set(stopwords)
    filtered = [w for w in words if len(w) >= min_len and w not in skip]

    # Count with Counter; most_common() handles the sorting
    freq = Counter(filtered)
    total = sum(freq.values())
    unique = len(freq)

    if total == 0:  # avoid dividing by zero on empty input
        print("\n📝 No words left to analyze after filtering.")
        return

    print(f"\n📝 Text overview: {total} words in total, {unique} distinct")
    print(f"   Distinct ratio: {unique/total*100:.1f}%\n")

    print(f"  {'Rank':<6}{'Word':<16}{'Count':<8}{'Share':<8}{'Bar'}")
    print("  " + "─" * 55)

    for rank, (word, count) in enumerate(freq.most_common(top_n), 1):
        pct = count / total * 100
        bar = "▓" * round(pct / 2)
        print(f"  {rank:<6}{word:<16}{count:<8}{pct:.1f}%  {bar}")


# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
#  Tool 2: record deduplicator
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

def dedup_records(records, key="id", keep="last"):
    """Deduplicate a list of records by key.

    Args:
        records: list of dictionaries
        key:     name of the field to deduplicate on
        keep:    'first' keeps the first record | 'last' keeps the last one
    """
    if keep == "first":
        seen = set()
        unique = []
        for r in records:
            k = r[key]
            if k not in seen:
                seen.add(k)
                unique.append(r)
        return unique
    else:  # keep == "last"
        index = {r[key]: r for r in records}
        return list(index.values())


def compare_sources(a, b, key="id"):
    """Compare two data sets and return a diff report."""
    map_a = {r[key]: r for r in a}
    map_b = {r[key]: r for r in b}

    ids_a = set(map_a)
    ids_b = set(map_b)

    added   = ids_b - ids_a        # new in B
    removed = ids_a - ids_b        # missing from B
    common  = ids_a & ids_b        # present in both

    # Check shared records for content differences
    changed = []
    for i in sorted(common):
        if map_a[i] != map_b[i]:
            # Work out which fields changed; .get() tolerates a field
            # that exists on only one side
            diffs = {
                k: ("old", map_a[i].get(k), "new", map_b[i].get(k))
                for k in set(map_a[i]) | set(map_b[i])
                if map_a[i].get(k) != map_b[i].get(k)
            }
            changed.append({"id": i, "diff": diffs})

    return {
        "added":   sorted(added),
        "removed": sorted(removed),
        "common":  sorted(common),
        "changed": changed,
        "summary": f"added {len(added)} | removed {len(removed)} | changed {len(changed)} | unchanged {len(common)-len(changed)}",
    }


# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
#  Entry point
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

if __name__ == "__main__":
    # Demo: word frequency analysis
    sample_text = """
    Python is an interpreted, high-level and general-purpose
    programming language. Python's design philosophy emphasizes
    code readability with its notable use of significant
    indentation. Python is dynamically typed and
    garbage-collected. It supports multiple programming paradigms,
    including structured, object-oriented and functional programming.
    Python was conceived in the late 1980s by Guido van Rossum.
    """
    # Use piped text when there is any; otherwise fall back to the sample.
    piped = "" if sys.stdin.isatty() else sys.stdin.read()
    analyze_word_frequency(piped.strip() or sample_text, top_n=12)

    print("\n" + "=" * 60)

    # Demo: deduplication and comparison
    a = [
        {"id": 1, "name": "Alice",  "dept": "Eng"},
        {"id": 2, "name": "Bob",    "dept": "Mkt"},
        {"id": 3, "name": "Carol",  "dept": "Eng"},
        {"id": 2, "name": "Bob",    "dept": "Mkt"},  # duplicate
    ]
    b = [
        {"id": 2, "name": "Bob",    "dept": "Mkt"},
        {"id": 3, "name": "Carol",  "dept": "Design"},
        {"id": 4, "name": "Eve",    "dept": "Eng"},
    ]

    clean_a = dedup_records(a)
    report = compare_sources(clean_a, b)

    print("\n📊 Data comparison report:")
    print(f"   {report['summary']}")
    print(f"   Added IDs: {report['added']}")
    print(f"   Removed IDs: {report['removed']}")
    print("   Change details:")
    for c in report["changed"]:
        print(f"     ID={c['id']}: {c['diff']}")
```

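★ Summary

| Topic | Key points | One-line takeaway |
| --- | --- | --- |
| Dictionary basics | Five ways to create; safe access with get; key tests with in | A dictionary is a "chest of labeled drawers" |
| Dictionary methods | get reads safely, setdefault writes safely, update merges in bulk, pop removes and returns | When unsure, use get; to insert, use setdefault |
| Dictionary comprehensions | {k: v for ...} transforms, filters, and builds in one line | The key-value version of a list comprehension |
| OrderedDict | move_to_end, order-sensitive comparison | A plain dict is usually enough; reach for it when you need LRU-style reordering |
| Set operations | & intersection, \| union, - difference, ^ symmetric difference | Four operators = four regions of a Venn diagram |
| frozenset | Immutable set, hashable, usable as a dict key | The read-only set: freeze it to use it as a key |
| defaultdict | Initializes default values automatically, no if checks needed | A dictionary with a built-in "vending machine" |
| Counter | Purpose-built counter with most_common and arithmetic | The Swiss Army knife of counting |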

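💡 Frequently asked interview questions

How is dict implemented? In CPython, as a hash table with open addressing.
Why is dict ordered since Python 3.7? CPython 3.6 switched to a compact layout in which entries are stored in an array in insertion order and a separate sparse index table points into it; Python 3.7 made insertion order a language guarantee.
How does a set deduplicate? Elements must be hashable; two elements count as duplicates when their hash values match and they also compare equal with ==.
defaultdict vs dict.setdefault()? The former initializes missing keys automatically on access; the latter must be called explicitly.
Where is frozenset useful? As a dict key, as an element of another set, and as a constant set that must stay read-only.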
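The deduplication rule (same hash and equal under ==) can be checked directly. In the sketch below, `SameHash` is a hypothetical class whose instances all share one hash value, which shows that a hash collision alone does not make two elements duplicates.

```python
# 1, 1.0 and True compare equal and hash alike, so a set keeps only one of them.
mixed = {1, 1.0, True}
print(len(mixed))   # 1

class SameHash:
    """Hypothetical class: every instance hashes to 42; equality compares labels."""

    def __init__(self, label):
        self.label = label

    def __hash__(self):
        return 42

    def __eq__(self, other):
        return isinstance(other, SameHash) and self.label == other.label

# Equal hashes alone do not make duplicates; __eq__ has the final say.
collided = {SameHash("x"), SameHash("y"), SameHash("x")}
print(len(collided))  # 2
```

Forcing every instance into the same hash bucket is fine for a demonstration, but it degrades lookups toward linear time, so real classes should spread their hash values.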

Coming up next: Part 06, Functions and Scope: From Parameters to Closures