政安晨: [Keras Machine Learning Examples Walkthrough] (49): Semantic Similarity with KerasNLP

Column	Description	Feature Type
Age	Age in years	Numerical
Sex	(1 = male; 0 = female)	Categorical
CP	Chest pain type (0, 1, 2, 3, 4)	Categorical
Trestbps	Resting blood pressure (in mm Hg on admission)	Numerical
Chol	Serum cholesterol in mg/dl	Numerical
FBS	Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)	Categorical
RestECG	Resting electrocardiogram results (0, 1, 2)	Categorical
Thalach	Maximum heart rate achieved	Numerical
Exang	Exercise induced angina (1 = yes; 0 = no)	Categorical
Oldpeak	ST depression induced by exercise relative to rest	Numerical
Slope	Slope of the peak exercise ST segment	Numerical
CA	Number of major vessels (0-3) colored by fluoroscopy	Both numerical & categorical
Thal	3 = normal; 6 = fixed defect; 7 = reversible defect	Categorical
Target	Diagnosis of heart disease (1 = true; 0 = false)	Target

Setup

import os

os.environ["KERAS_BACKEND"] = "tensorflow"

import tensorflow as tf
import pandas as pd
import keras
from keras.utils import FeatureSpace

Preparing the data

Let's download the data and load it into a Pandas dataframe:

file_url = "http://storage.googleapis.com/download.tensorflow.org/data/heart.csv"
dataframe = pd.read_csv(file_url)

The dataset includes 303 samples with 14 columns per sample (13 features, plus the target label):

print(dataframe.shape)

(303, 14)

Here's a preview of a few samples:

dataframe.head()

| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
| 0 | 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0 | fixed | 0 |
| 1 | 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3 | normal | 1 |
| 2 | 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2 | reversible | 0 |
| 3 | 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0 | normal | 0 |
| 4 | 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0 | normal | 0 |

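The last column, "target", indicates whether the patient has heart disease (1) or not (0).

Let's split the data into a training set and a validation set: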

val_dataframe = dataframe.sample(frac=0.2, random_state=1337)
train_dataframe = dataframe.drop(val_dataframe.index)

print(
    "Using %d samples for training and %d for validation"
    % (len(train_dataframe), len(val_dataframe))
)

Using 242 samples for training and 61 for validation
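The sizes printed above can be checked by hand. The sketch below assumes that a fractional sample is sized as round(frac * n), which is how pandas is understood to behave here. The variable names are illustrative only.

```python
# Reproduce the split sizes reported above with plain arithmetic.
n_total = 303          # number of rows in the dataset
frac = 0.2             # fraction held out for validation

# Assumption: the fractional sample size is rounded to the nearest integer.
n_val = round(frac * n_total)
n_train = n_total - n_val

print("train:", n_train, "validation:", n_val)
```

Because the validation rows are dropped from the original dataframe by index, the two sets are disjoint and their sizes always sum to the total.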

Let's generate tf.data.Dataset objects for each dataframe:

def dataframe_to_dataset(dataframe):
    dataframe = dataframe.copy()
    labels = dataframe.pop("target")
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    ds = ds.shuffle(buffer_size=len(dataframe))
    return ds


train_ds = dataframe_to_dataset(train_dataframe)
val_ds = dataframe_to_dataset(val_dataframe)

Each Dataset yields a tuple (input, target), where input is a dictionary of features and target is the value 0 or 1:

for x, y in train_ds.take(1):
    print("Input:", x)
    print("Target:", y)

Input: {'age': <tf.Tensor: shape=(), dtype=int64, numpy=65>, 'sex': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'cp': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'trestbps': <tf.Tensor: shape=(), dtype=int64, numpy=138>, 'chol': <tf.Tensor: shape=(), dtype=int64, numpy=282>, 'fbs': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'restecg': <tf.Tensor: shape=(), dtype=int64, numpy=2>, 'thalach': <tf.Tensor: shape=(), dtype=int64, numpy=174>, 'exang': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'oldpeak': <tf.Tensor: shape=(), dtype=float64, numpy=1.4>, 'slope': <tf.Tensor: shape=(), dtype=int64, numpy=2>, 'ca': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'thal': <tf.Tensor: shape=(), dtype=string, numpy=b'normal'>}
Target: tf.Tensor(0, shape=(), dtype=int64)

Let's batch the datasets:

train_ds = train_ds.batch(32)
val_ds = val_ds.batch(32)

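Configuring a FeatureSpace

To configure how each feature should be preprocessed, we instantiate a keras.utils.FeatureSpace and pass it a dictionary that maps each feature name to a string describing the feature type.

We have a few "integer categorical" features such as "fbs", one "string categorical" feature ("thal"), and a few numerical features, which we'd like to normalize, except "age", which we'd like to discretize into a number of bins.

We also use the crosses argument to capture feature interactions for some categorical features, that is to say, to create additional features that represent value co-occurrences for these categorical features. You can compute feature crosses like this for arbitrary sets of categorical features, not just tuples of two features. Because the resulting co-occurrences are hashed into a fixed-size vector, you don't need to worry about whether the co-occurrence space is too large.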

feature_space = FeatureSpace(
    features={
        # Categorical features encoded as integers
        "sex": "integer_categorical",
        "cp": "integer_categorical",
        "fbs": "integer_categorical",
        "restecg": "integer_categorical",
        "exang": "integer_categorical",
        "ca": "integer_categorical",
        # Categorical feature encoded as string
        "thal": "string_categorical",
        # Numerical features to discretize
        "age": "float_discretized",
        # Numerical features to normalize
        "trestbps": "float_normalized",
        "chol": "float_normalized",
        "thalach": "float_normalized",
        "oldpeak": "float_normalized",
        "slope": "float_normalized",
    },
    # We create additional features by hashing
    # value co-occurrences for the
    # following groups of categorical features.
    crosses=[("sex", "age"), ("thal", "ca")],
    # The hashing space for these co-occurrences
    # will be 32-dimensional.
    crossing_dim=32,
    # Our utility will one-hot encode all categorical
    # features and concat all features into a single
    # vector (one vector per sample).
    output_mode="concat",
)
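To make the idea of a hashed feature cross more concrete, here is a minimal pure-Python sketch. It is not how Keras implements crossing internally: the helper name hashed_cross, the "_X_" separator, and the use of MD5 are all assumptions chosen for illustration. The point is only that any combination of values lands in one of crossing_dim buckets, so the output width stays fixed no matter how many combinations exist.

```python
import hashlib


def hashed_cross(values, crossing_dim):
    """Map a tuple of categorical values to (bucket, one-hot vector).

    Hypothetical helper: the real layer uses its own hash function.
    """
    # Join the values into a single key describing the co-occurrence.
    key = "_X_".join(str(v) for v in values).encode("utf-8")
    # Use a stable hash so the same combination always hits the same bucket.
    digest = hashlib.md5(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % crossing_dim
    one_hot = [0.0] * crossing_dim
    one_hot[bucket] = 1.0
    return bucket, one_hot


# The same ("thal", "ca") combination is always encoded the same way.
bucket_a, vec_a = hashed_cross(("fixed", 0), crossing_dim=32)
bucket_b, vec_b = hashed_cross(("fixed", 0), crossing_dim=32)
print(bucket_a == bucket_b, len(vec_a))
```

Different combinations can collide in the same bucket. A larger crossing_dim makes collisions less likely at the cost of a wider output vector.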

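Further customizing a FeatureSpace

Specifying the feature type via a string name is quick and easy, but sometimes you may want to further configure the preprocessing of each feature. For instance, in our case, our categorical features don't have a large set of possible values: each has only a handful (e.g. 1 and 0 for the feature "fbs"), and all possible values are represented in the training set. As a result, we don't need to reserve an index to represent "out of vocabulary" values for these features, which would have been the default behavior. Below, we just specify num_oov_indices=0 in each of these features to tell the feature preprocessor to skip the "out of vocabulary" index.

Other customizations you have access to include specifying the number of bins for discretizing features of type "float_discretized", or the dimensionality of the hashing space for feature crossing.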

feature_space = FeatureSpace(
    features={
        # Categorical features encoded as integers
        "sex": FeatureSpace.integer_categorical(num_oov_indices=0),
        "cp": FeatureSpace.integer_categorical(num_oov_indices=0),
        "fbs": FeatureSpace.integer_categorical(num_oov_indices=0),
        "restecg": FeatureSpace.integer_categorical(num_oov_indices=0),
        "exang": FeatureSpace.integer_categorical(num_oov_indices=0),
        "ca": FeatureSpace.integer_categorical(num_oov_indices=0),
        # Categorical feature encoded as string
        "thal": FeatureSpace.string_categorical(num_oov_indices=0),
        # Numerical features to discretize
        "age": FeatureSpace.float_discretized(num_bins=30),
        # Numerical features to normalize
        "trestbps": FeatureSpace.float_normalized(),
        "chol": FeatureSpace.float_normalized(),
        "thalach": FeatureSpace.float_normalized(),
        "oldpeak": FeatureSpace.float_normalized(),
        "slope": FeatureSpace.float_normalized(),
    },
    # Specify feature cross with a custom crossing dim.
    crosses=[
        FeatureSpace.cross(feature_names=("sex", "age"), crossing_dim=64),
        FeatureSpace.cross(
            feature_names=("thal", "ca"),
            crossing_dim=16,
        ),
    ],
    output_mode="concat",
)
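The effect of num_oov_indices on the encoded width can be sketched in plain Python. The helpers build_index and one_hot below are hypothetical and only mimic the idea: the real lookup layers order their vocabulary differently and may handle unknown values in their own way. The sketch assumes a single reserved slot at position 0 when out-of-vocabulary handling is enabled.

```python
def build_index(vocabulary, num_oov_indices):
    """Assign each known value a slot, placed after any reserved OOV slots."""
    ordered = sorted(vocabulary)
    return {value: i + num_oov_indices for i, value in enumerate(ordered)}


def one_hot(value, index, num_oov_indices):
    """One-hot encode a value, sending unseen values to the reserved slot."""
    vec = [0.0] * (len(index) + num_oov_indices)
    if value in index:
        vec[index[value]] = 1.0
    elif num_oov_indices > 0:
        vec[0] = 1.0  # all unseen values share the reserved slot
    else:
        raise KeyError(f"value {value!r} is not in the vocabulary")
    return vec


fbs_values = {0, 1}  # the two values "fbs" takes

with_oov = build_index(fbs_values, num_oov_indices=1)
without_oov = build_index(fbs_values, num_oov_indices=0)

print(one_hot(1, with_oov, 1))     # three slots: OOV, 0, 1
print(one_hot(1, without_oov, 0))  # two slots: 0, 1
```

Dropping the reserved slot saves one dimension per categorical feature, which is safe here only because every possible value already appears in the training set.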

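Adapting the FeatureSpace to the training data

Before we start using the FeatureSpace to build a model, we have to adapt it to the training data. During adapt(), the FeatureSpace will:

------ Index the set of possible values for categorical features.

------ Compute the mean and variance of numerical features to normalize.

------ Compute the value boundaries of the different bins for numerical features to discretize.

Note that adapt() should be called on a tf.data.Dataset that yields dicts of feature values (no labels).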

train_ds_with_no_labels = train_ds.map(lambda x, _: x)
feature_space.adapt(train_ds_with_no_labels)
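The three kinds of statistics gathered during adapt() can be imitated with the standard library. This is only a rough sketch on the five preview rows shown earlier, not the actual implementation: using quantiles as bin boundaries and population variance for normalization are assumptions made for illustration.

```python
import bisect
import statistics

# Values taken from the five preview rows shown earlier.
thal = ["fixed", "normal", "reversible", "normal", "normal"]
trestbps = [145, 160, 120, 130, 130]
ages = [63, 67, 67, 37, 41]

# 1. Categorical feature: index the set of observed values.
vocabulary = sorted(set(thal))

# 2. Normalized feature: mean and variance, then (x - mean) / std.
mean = statistics.fmean(trestbps)
variance = statistics.pvariance(trestbps)
normalized = [(x - mean) / variance ** 0.5 for x in trestbps]

# 3. Discretized feature: 3 bins need 2 inner boundaries (assumed quantiles).
boundaries = statistics.quantiles(ages, n=3)
bins = [bisect.bisect_right(boundaries, age) for age in ages]

print(vocabulary)
print(mean, variance)
print(boundaries, bins)
```

Everything computed here depends only on the feature values, which is why the labels are stripped from the dataset before adapting.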

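At this point, the FeatureSpace can be called on a dict of raw feature values, and it will return a single concatenated vector for each sample, combining encoded features and feature crosses.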

for x, _ in train_ds.take(1):
    preprocessed_x = feature_space(x)
    print("preprocessed_x.shape:", preprocessed_x.shape)
    print("preprocessed_x.dtype:", preprocessed_x.dtype)

preprocessed_x.shape: (32, 138)
preprocessed_x.dtype: <dtype: 'float32'>
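The width of 138 can be reconstructed by adding up the per-feature widths. The sketch below is a hedged reconstruction, not output from the library. The category counts for "cp", "restecg" and "ca" are read off the column table; the count of 5 for "thal" is an assumption (the table lists only three categories, but five is the value that reconciles with the reported 138). It also assumes no out-of-vocabulary slots, as configured above.

```python
# Per-feature output widths under output_mode="concat" (reconstruction).
widths = {
    # One-hot categorical features, no OOV slot.
    "sex": 2,
    "cp": 5,        # values 0-4 per the column table
    "fbs": 2,
    "restecg": 3,   # values 0-2 per the column table
    "exang": 2,
    "ca": 4,        # values 0-3 per the column table
    "thal": 5,      # ASSUMPTION: chosen to reconcile with the reported total
    # Discretized feature: one slot per bin.
    "age": 30,
    # Normalized features: one float each.
    "trestbps": 1,
    "chol": 1,
    "thalach": 1,
    "oldpeak": 1,
    "slope": 1,
    # Hashed crosses: one slot per hash bucket.
    "sex_X_age": 64,
    "thal_X_ca": 16,
}

total_width = sum(widths.values())
print(total_width)
```

The two crosses account for 80 of the 138 dimensions, so the choice of crossing_dim dominates the size of the encoded vector in this configuration.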

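Two ways to manage preprocessing: as part of the tf.data pipeline, or in the model itself

There are two ways in which you can leverage your FeatureSpace:

Asynchronous preprocessing in tf.data

You can make it part of your data pipeline, before the model. This enables asynchronous, parallel preprocessing of the data on CPU before it hits the model. Do this if you're training on GPU or TPU, or if you want to speed up preprocessing. This is usually the right choice during training.

Synchronous preprocessing in the model

You can make it part of your model. This means the model will expect dicts of raw feature values, and the preprocessing of each batch will be done synchronously (in a blocking manner) before the rest of the forward pass. Do this if you want an end-to-end model that can process raw feature values, but keep in mind that your model will only be able to run on CPU, since most types of feature preprocessing (e.g. string preprocessing) are not GPU or TPU compatible.

Do not do this on GPU / TPU or in performance-sensitive settings. In general, you want to do in-model preprocessing when you do inference on CPU.

In our case, we will apply the FeatureSpace in the tf.data pipeline during training, but we will do inference with an end-to-end model that includes the FeatureSpace.

Let's create training and validation datasets of preprocessed batches: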

preprocessed_train_ds = train_ds.map(
    lambda x, y: (feature_space(x), y), num_parallel_calls=tf.data.AUTOTUNE
)
preprocessed_train_ds = preprocessed_train_ds.prefetch(tf.data.AUTOTUNE)

preprocessed_val_ds = val_ds.map(
    lambda x, y: (feature_space(x), y), num_parallel_calls=tf.data.AUTOTUNE
)
preprocessed_val_ds = preprocessed_val_ds.prefetch(tf.data.AUTOTUNE)

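Building a model

Time to build a model, or rather two models:

A training model that expects preprocessed features (one sample = one vector)
An inference model that expects raw features (one sample = a dict of raw feature values)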

dict_inputs = feature_space.get_inputs()
encoded_features = feature_space.get_encoded_features()

x = keras.layers.Dense(32, activation="relu")(encoded_features)
x = keras.layers.Dropout(0.5)(x)
predictions = keras.layers.Dense(1, activation="sigmoid")(x)

training_model = keras.Model(inputs=encoded_features, outputs=predictions)
training_model.compile(
    optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"]
)

inference_model = keras.Model(inputs=dict_inputs, outputs=predictions)

Training the model

Let's train our model for 20 epochs. Note that feature preprocessing is happening as part of the tf.data pipeline, not as part of the model.

training_model.fit(
    preprocessed_train_ds,
    epochs=20,
    validation_data=preprocessed_val_ds,
    verbose=2,
)

Epoch 1/20
8/8 - 3s - 352ms/step - accuracy: 0.5200 - loss: 0.7407 - val_accuracy: 0.6196 - val_loss: 0.6663
Epoch 2/20
8/8 - 0s - 20ms/step - accuracy: 0.5881 - loss: 0.6874 - val_accuracy: 0.7732 - val_loss: 0.6015
Epoch 3/20
8/8 - 0s - 19ms/step - accuracy: 0.6580 - loss: 0.6192 - val_accuracy: 0.7839 - val_loss: 0.5577
Epoch 4/20
8/8 - 0s - 19ms/step - accuracy: 0.7096 - loss: 0.5721 - val_accuracy: 0.7856 - val_loss: 0.5200
Epoch 5/20
8/8 - 0s - 18ms/step - accuracy: 0.7292 - loss: 0.5553 - val_accuracy: 0.7764 - val_loss: 0.4853
Epoch 6/20
8/8 - 0s - 19ms/step - accuracy: 0.7561 - loss: 0.5103 - val_accuracy: 0.7732 - val_loss: 0.4627
Epoch 7/20
8/8 - 0s - 19ms/step - accuracy: 0.7231 - loss: 0.5374 - val_accuracy: 0.7764 - val_loss: 0.4413
Epoch 8/20
8/8 - 0s - 19ms/step - accuracy: 0.7769 - loss: 0.4564 - val_accuracy: 0.7683 - val_loss: 0.4320
Epoch 9/20
8/8 - 0s - 18ms/step - accuracy: 0.7769 - loss: 0.4324 - val_accuracy: 0.7856 - val_loss: 0.4191
Epoch 10/20
8/8 - 0s - 19ms/step - accuracy: 0.7778 - loss: 0.4340 - val_accuracy: 0.7888 - val_loss: 0.4084
Epoch 11/20
8/8 - 0s - 19ms/step - accuracy: 0.7760 - loss: 0.4124 - val_accuracy: 0.7716 - val_loss: 0.3977
Epoch 12/20
8/8 - 0s - 19ms/step - accuracy: 0.7964 - loss: 0.4125 - val_accuracy: 0.7667 - val_loss: 0.3959
Epoch 13/20
8/8 - 0s - 18ms/step - accuracy: 0.8051 - loss: 0.3979 - val_accuracy: 0.7856 - val_loss: 0.3891
Epoch 14/20
8/8 - 0s - 19ms/step - accuracy: 0.8043 - loss: 0.3891 - val_accuracy: 0.7856 - val_loss: 0.3840
Epoch 15/20
8/8 - 0s - 18ms/step - accuracy: 0.8633 - loss: 0.3571 - val_accuracy: 0.7872 - val_loss: 0.3764
Epoch 16/20
8/8 - 0s - 19ms/step - accuracy: 0.8728 - loss: 0.3548 - val_accuracy: 0.7888 - val_loss: 0.3699
Epoch 17/20
8/8 - 0s - 19ms/step - accuracy: 0.8698 - loss: 0.3171 - val_accuracy: 0.7872 - val_loss: 0.3727
Epoch 18/20
8/8 - 0s - 18ms/step - accuracy: 0.8529 - loss: 0.3454 - val_accuracy: 0.7904 - val_loss: 0.3669
Epoch 19/20
8/8 - 0s - 17ms/step - accuracy: 0.8589 - loss: 0.3359 - val_accuracy: 0.7980 - val_loss: 0.3770
Epoch 20/20
8/8 - 0s - 17ms/step - accuracy: 0.8455 - loss: 0.3113 - val_accuracy: 0.8044 - val_loss: 0.3684

<keras.src.callbacks.history.History at 0x7f139bb4ed10>

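We quickly get to 80% validation accuracy.

Running inference on new data with the end-to-end model

Now, we can use our inference model (which includes the FeatureSpace) to make predictions based on dicts of raw feature values, as follows: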

sample = {
    "age": 60,
    "sex": 1,
    "cp": 1,
    "trestbps": 145,
    "chol": 233,
    "fbs": 1,
    "restecg": 2,
    "thalach": 150,
    "exang": 0,
    "oldpeak": 2.3,
    "slope": 3,
    "ca": 0,
    "thal": "fixed",
}

input_dict = {name: tf.convert_to_tensor([value]) for name, value in sample.items()}
predictions = inference_model.predict(input_dict)

print(
    f"This particular patient had a {100 * predictions[0][0]:.2f}% probability "
    "of having a heart disease, as evaluated by our model."
)

 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 273ms/step
This particular patient had a 43.13% probability of having a heart disease, as evaluated by our model.
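The model outputs a probability from its sigmoid layer, not a class label. If a hard 0/1 decision is needed, one common approach is to compare the probability against a cutoff. The sketch below assumes a cutoff of 0.5, which is a conventional default and not something the example above prescribes.

```python
# Turn the predicted probability into a class label.
probability = 0.4313   # value reported in the output above
threshold = 0.5        # ASSUMPTION: conventional default cutoff

predicted_class = int(probability >= threshold)
print("predicted class:", predicted_class)
```

In a medical screening setting a lower threshold might be preferred to reduce missed cases, at the cost of more false positives.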