Deep Learning 7: Hyperparameter Optimization

Deep Learning - Lecture 7 Hyperparameter Optimization

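Introduction
Hyperparameter search
Bayesian optimization for hyperparameter selection
- Motivating example
- Bayesian optimization
References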

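Objectives of this section:
Explain and implement the different hyperparameter optimization methods used in deep learning, including:

Manual selection
Grid search
Random search
Bayesian optimization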

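Introduction

There are two basic approaches to choosing hyperparameters: manual selection and automatic selection.

Manual hyperparameter selection requires a deeper understanding of the model and of deep learning.
Automatic hyperparameter selection usually requires more computation time.

The main goals of manual hyperparameter search are to:

Adjust the effective capacity of the model so that it matches the complexity of the task;
Find the lowest generalization error subject to a given runtime and memory budget (which limits model complexity).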

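Hyperparameter search

Grid search and random search

A simple way to choose hyperparameters is to automate the search, for example with grid search or random search (random search often converges faster).

In grid search, we test combinations of hyperparameters laid out on a grid (linearly or logarithmically spaced).
In random search, we sample the hyperparameters at random, which avoids evaluating the same value of a hyperparameter repeatedly.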
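To make the difference concrete, the sketch below draws the same budget of nine trials with both strategies. The two hyperparameters (a learning rate and a dropout rate) and their ranges are assumptions chosen purely for illustration; they are not the ones used in the lecture code.

```python
import itertools
import random

# Hypothetical search ranges, chosen only for illustration.
learning_rates = [1e-4, 1e-3, 1e-2]   # log-spaced grid axis
dropout_rates = [0.1, 0.3, 0.5]       # linearly spaced grid axis

# Grid search: every combination of the axis values (3 x 3 = 9 trials).
grid_trials = list(itertools.product(learning_rates, dropout_rates))

# Random search: the same budget of 9 trials, each drawn independently.
rng = random.Random(0)
random_trials = [
    (10 ** rng.uniform(-4, -2),   # log-uniform learning rate
     rng.uniform(0.1, 0.5))       # uniform dropout rate
    for _ in range(len(grid_trials))
]

# Count how many distinct learning rates each strategy actually tried.
grid_distinct_lr = len({lr for lr, _ in grid_trials})
random_distinct_lr = len({lr for lr, _ in random_trials})
print(grid_distinct_lr, random_distinct_lr)
```

With the same nine trials, the grid only ever tests three distinct learning rates, whereas random sampling tests a new value in essentially every trial. This is one plausible reason why random search tends to find good settings sooner when only a few hyperparameters really matter.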

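Computational speed-ups

Grid search and random search can be computationally expensive, which is bad for both you and the environment.

We can therefore use a few tricks to speed things up and avoid a lot of wasted computation:

Use a subset of the data.
Use early stopping.
Terminate poorly performing models (see the Successive Halving and Hyperband methods).
Use parallel processing.

Figure: accuracy (vertical axis, "Accuracy") plotted against training epochs (horizontal axis, "Training Epochs"), with one coloured curve per model. Methods such as these terminate poorly performing models as early as possible.

Figure source: Li, et al. (2018). Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18(185), 1-52.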
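The idea of terminating poorly performing models early can be sketched as a simplified successive halving loop. Everything here is a toy under stated assumptions: `simulated_accuracy` is a made-up learning curve standing in for real training, and each round retrains from scratch rather than resuming, which a practical implementation would probably avoid.

```python
import math
import random

def simulated_accuracy(config_quality, epochs):
    """Toy learning curve (an assumption, not a real model): accuracy
    rises with the number of epochs and saturates at config_quality."""
    return config_quality * (1.0 - math.exp(-epochs / 5.0))

def successive_halving(qualities, min_epochs=1, eta=2):
    """Each round, keep the best 1/eta of the candidates and give the
    survivors eta times more epochs. Returns (best index, epochs used)."""
    survivors = list(range(len(qualities)))
    epochs = min_epochs
    total_epochs = 0
    while len(survivors) > 1:
        # Evaluate every surviving configuration at the current budget.
        scores = {i: simulated_accuracy(qualities[i], epochs) for i in survivors}
        total_epochs += epochs * len(survivors)
        # Keep only the top fraction and increase the budget.
        survivors.sort(key=lambda i: scores[i], reverse=True)
        survivors = survivors[:max(1, len(survivors) // eta)]
        epochs *= eta
    return survivors[0], total_epochs

rng = random.Random(0)
qualities = [rng.uniform(0.5, 0.99) for _ in range(16)]  # 16 hypothetical configs
best, cost = successive_halving(qualities)
print(f"best config: {best}, total epochs spent: {cost}")
```

In this toy run the 16 candidates cost 64 epochs in total, compared with 256 if every candidate were trained for 16 epochs. Real learning curves can cross, so an aggressive schedule may discard a configuration that would have won later; Hyperband addresses this by running several such schedules with different starting budgets.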

Grid search example in MATLAB

matlab

% define the grid of hyperparameters
num_layers = [2, 3, 4]; % number of layers
num_filters = [32, 64, 128]; % number of filters per layer 
% initialise best solution
best_accuracy = 0;
best_params = [];
% auxiliary parameters 
aux_params{1} = num_classes;
aux_params{2} = image_size;
% loop over the grid of hyperparameters
for i = 1:length(num_layers)
	for j = 1:length(num_filters)
		% current hyperparameters
		hyper_params = [num_layers(i), num_filters(j)];
		% create and train model with current hyperparams
		layers = create_model(hyper_params,aux_params);
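		% train model; validation data is needed for info.ValidationAccuracy
		% (imdsVal is an assumed validation datastore, not defined in this snippet)
		options = trainingOptions("adam","ValidationData",imdsVal);
		[model,info] = trainNetwork(imdsTrain,layers,options);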
		% extract validation accuracy for current model
		accuracy(i,j) = info.ValidationAccuracy(end);
		% store parameters if they are better than previous
		if accuracy(i,j) > best_accuracy
			best_accuracy = accuracy(i,j);
			best_params = hyper_params;
		end
	end
end
% define a function to create a model
function layers = create_model(hyper_params,aux_params)
% unpack hyperparameter values under test
num_layers = hyper_params(1);
num_filters = hyper_params(2);
% unpack auxiliary parameters needed to build network
num_classes = aux_params{1};
image_size = aux_params{2};
% create input layer
layers = [
imageInputLayer([image_size])
];
% create blocks of conv -> batch norm -> relu -> max pool layers
for i = 1:num_layers
	layers = [layers
	convolution2dLayer(3,num_filters,'Padding','same')
	batchNormalizationLayer
	reluLayer
	maxPooling2dLayer(2,'Stride',2)
	];
end
% output layers
layers = [layers
fullyConnectedLayer(num_classes)
softmaxLayer
classificationLayer];
end

Grid search in Python with Keras and TensorFlow

Keras can run a grid search automatically with the help of the Scikit-learn (sklearn) machine learning library.

python

# import libraries
import pandas as pd
from sklearn.model_selection import GridSearchCV
from scikeras.wrappers import KerasClassifier
from keras.models import Sequential
from keras.layers import (Activation, BatchNormalization, Conv2D,
	Dense, Flatten, MaxPooling2D)
from keras.optimizers import Adam
######## define data etc. here (X_train, y_train, num_classes) #######
# function to build the model (defined before the wrapper that uses it)
def create_model(num_layers, num_filters):
	model = Sequential()
	# input layer
	model.add(Conv2D(num_filters, kernel_size=(3,3),
	padding='same', input_shape=(28, 28, 1)))
	model.add(BatchNormalization())
	model.add(Activation('relu'))
	# create blocks of conv --> batch norm --> relu --> max pool
	for i in range(num_layers-1):
		model.add(Conv2D(num_filters, kernel_size=(3,3),
		padding='same'))
		model.add(BatchNormalization())
		model.add(Activation('relu'))
		model.add(MaxPooling2D(pool_size=(2,2)))
	# output layers
	model.add(Flatten())
	model.add(Dense(num_classes, activation='softmax'))
	model.compile(optimizer=Adam(),
	loss='categorical_crossentropy', metrics=['accuracy'])
	return model
# create keras sklearn wrapper
# (scikeras takes the build function as `model`; `build_fn` is the older, deprecated name)
model = KerasClassifier(model=create_model)
# define the grid search parameters
# (the `model__` prefix routes each value to the matching create_model argument)
param_grid = {'model__num_layers': [2, 3, 4],
'model__num_filters': [32, 64, 128]}
# perform the grid search
grid = GridSearchCV(estimator=model,
param_grid=param_grid,
n_jobs=-1, cv=2, verbose=1)
grid_result = grid.fit(X_train, y_train)
# first, output the best performing parameters
print(grid.best_params_)
# the full results are in cv_results_; convert them to a DataFrame for inspection
df = pd.DataFrame(grid.cv_results_)

You can also read up on RandomizedSearchCV, HalvingGridSearchCV and HalvingRandomSearchCV in Scikit-learn.

(Respectively: randomized search with cross-validation, successive-halving grid search with cross-validation, and successive-halving random search with cross-validation.)

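Bayesian optimization for hyperparameter selection

The basic problem

We can use a search algorithm to find the optimal hyperparameters $x^*$ automatically, i.e. the values that maximise a given objective function: $x^* = \underset{x}{\arg\max}\, J(x)$

Figure: the objective function $J(x)$ (vertical axis, "Objective function $J(x)$") plotted against the hyperparameter $x$ (horizontal axis, "Hyperparameter $x$"). The red arrow and annotation indicate that we want to find the maximum of $J(x)$, or rather the parameter $x^*$ at which the function attains its maximum. A further annotation notes that the objective function can be the validation accuracy, $R^{2}$ (R-squared), and so on.

Bayesian optimization is one particular search algorithm for finding the minimum or maximum of a function.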

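Motivating example

Suppose we are trying to optimise the number of layers in a deep network.

The figure below shows the starting point of the search algorithm: we have no idea where to look (there is no prior information)!
Now suppose we have evaluated a few points at random and obtained the data shown below. Where should we look next?
Answer:
- We can fit a model that interpolates between the known data points and then choose the peak of that model as the next point to evaluate.
- This is called "exploitation" (we are exploiting what we already know).
Question: what about the regions we have not explored yet? There might be a better answer there!
- We need to explore!

The core question is: how do we quantify what we know about the function $J(x)$ so as to encourage exploration?

We need a model that can quantify uncertainty: a Bayesian model.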

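Figure: the red line is the mean of the function, $\mu$ ("Mean of the function $\mu$"), and the pink band is the uncertainty region, e.g. $2\sigma$ ("Uncertainty region, e.g. $2\sigma$"). The annotation on the right says that the uncertainty region suggests a better maximum might lie here.

Why might the uncertainty region be $2\sigma$?

In statistics and probabilistic modelling, the standard deviation $\sigma$ is commonly used to measure how spread out the data are. For normally distributed data, about 95% of the values fall within $\pm 2\sigma$ of the mean.

In settings such as Bayesian optimization, defining the uncertainty region as $2\sigma$ is a fairly common way of quantifying uncertainty. It says that the function value is very likely to lie inside this region, which in turn means a better extremum (for example a maximum) may be found there. The choice of $2\sigma$ is not mandatory, but it strikes a reasonable balance between the width of the uncertainty estimate and its confidence level, which is why it is often used.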
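The 95% figure can be checked directly. For a normal variable, the probability of falling within $k$ standard deviations of the mean is $\operatorname{erf}(k/\sqrt{2})$; the short sketch below evaluates it for a few values of $k$ (the helper name `normal_coverage` is just an illustrative choice).

```python
import math

def normal_coverage(k):
    """Probability that a normal variable lies within k standard
    deviations of its mean: P(|X - mu| < k*sigma) = erf(k / sqrt(2))."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3):
    print(f"within {k} sigma: {normal_coverage(k):.4f}")
# within 1 sigma: 0.6827
# within 2 sigma: 0.9545
# within 3 sigma: 0.9973
```

A wider band (larger $k$) makes the model look more uncertain everywhere and so pushes the search towards exploration; a narrower band favours exploitation.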

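We now need a way to choose the next location to evaluate. A good choice is the point where the upper uncertainty bound is largest.
In the figure, the red curve is the mean of the function, $\mu$, and the pink band is the uncertainty region, e.g. $2\sigma$.
The blue curve is labelled "Acquisition function" and is given by $\alpha = \mu + 2\sigma$.
The yellow star marks the maximum of the acquisition function.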
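As a small numerical illustration of this rule, the sketch below compares pure exploitation (pick the highest mean) with the upper-bound rule $\alpha = \mu + 2\sigma$. The mean and standard deviation values are invented for the example and do not come from a fitted model.

```python
def ucb(mu, sigma, kappa=2.0):
    """Upper confidence bound acquisition: alpha = mu + kappa * sigma."""
    return [m + kappa * s for m, s in zip(mu, sigma)]

# Hypothetical model predictions at five candidate numbers of layers.
candidates = [1, 2, 3, 4, 5]
mu =    [0.80, 0.85, 0.83, 0.78, 0.75]   # predicted mean accuracy
sigma = [0.01, 0.01, 0.02, 0.06, 0.03]   # predictive standard deviation

alpha = ucb(mu, sigma)
# Pure exploitation: the candidate with the highest predicted mean.
exploit_choice = candidates[max(range(len(mu)), key=lambda i: mu[i])]
# UCB: the candidate with the highest optimistic estimate.
ucb_choice = candidates[max(range(len(alpha)), key=lambda i: alpha[i])]
print(exploit_choice, ucb_choice)  # 2 4
```

Exploitation alone would re-test 2 layers, where the model is already confident. The acquisition function instead picks 4 layers: its mean is lower, but its large uncertainty leaves room for a better result.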

Bayesian optimization

We model the objective function with a Bayesian model $f$ and choose the next search location with an acquisition function $\alpha$.

Bayesian optimization rests on two equations:

$$f(\boldsymbol{x}) \approx J(\boldsymbol{x}) \qquad (1)$$

$$\boldsymbol{x}_{t+1} = \underset{\boldsymbol{x}}{\arg\max}\ \alpha(\boldsymbol{x}, f(\boldsymbol{x})) \qquad (2)$$

Equation (1) states that the Bayesian model $f$ approximates the objective function $J(\boldsymbol{x})$.

Equation (2) states that the next search point $\boldsymbol{x}_{t+1}$ is the $\boldsymbol{x}$ that maximises the acquisition function $\alpha$, which depends on $\boldsymbol{x}$ and on the model $f(\boldsymbol{x})$.

The Bayesian optimization procedure

Model the objective function using the data collected so far: $f(\boldsymbol{x}) \approx J(\boldsymbol{x})$, i.e. the Bayesian model $f$ approximates the objective function $J(\boldsymbol{x})$. $f$ is usually a Gaussian process (GP), whose prediction at each point is normally distributed, $f \sim N(\mu, \sigma^{2})$, with mean $\mu$ and variance $\sigma^{2}$.
Choose the next point as the one that maximises the acquisition function $\alpha$: $\boldsymbol{x}_{t+1} = \underset{\boldsymbol{x}}{\arg\max}\ \alpha(\boldsymbol{x}, f)$. Taking the upper confidence bound (UCB) as an example, the acquisition function is $\alpha(\boldsymbol{x}, f) = \mu + 2\sigma$.

The Bayesian optimization search algorithm trades off exploration against exploitation.

Initialise the data set $\mathcal{D}_{0}$, for example as the empty set.
Initialise the decision variable $\boldsymbol{x}_{1}$ at some random starting point.
Loop for $t = 1, 2, \ldots$:
- Query the objective function to obtain $J_{t} = J(\boldsymbol{x}_{t})$.
- Augment the data set: $\mathcal{D}_{t} = \{\mathcal{D}_{t-1}, (\boldsymbol{x}_{t}, J_{t})\}$.
- Update the Bayesian model $f$ of the objective function using $\mathcal{D}_{t}$.
- Choose the new $\boldsymbol{x}_{t+1}$ by optimising the acquisition function $\alpha$: $\boldsymbol{x}_{t+1} = \underset{\boldsymbol{x}}{\arg\max}\ \alpha(\boldsymbol{x}, f)$.
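The loop above can be sketched in plain Python with a very small hand-written Gaussian process. This is only an illustration under several assumptions: the kernel is a squared-exponential with hand-picked `scale`, `length` and `noise` values that are not fitted to the data, the GP has zero prior mean, and the acquisition function is maximised by brute force over a grid. A real GP library would normally estimate these settings for you.

```python
import math
import random

def objective(x):
    """Toy objective J(x) to maximise."""
    return math.sin(x) * x ** 2 + 2.0 * x

def kernel(a, b, scale=50.0, length=1.0):
    """Squared-exponential covariance between two scalar inputs."""
    return scale ** 2 * math.exp(-0.5 * ((a - b) / length) ** 2)

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def gp_predict(xs, ys, grid, noise=1.0):
    """Posterior mean and standard deviation of a zero-mean GP on the grid."""
    n = len(xs)
    K = [[kernel(xs[i], xs[j]) + (noise if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    K_inv = invert(K)
    weights = [sum(K_inv[i][j] * ys[j] for j in range(n)) for i in range(n)]
    mu, sigma = [], []
    for g in grid:
        k_star = [kernel(g, x) for x in xs]
        mu.append(sum(k * w for k, w in zip(k_star, weights)))
        v = [sum(K_inv[i][j] * k_star[j] for j in range(n)) for i in range(n)]
        var = kernel(g, g) - sum(k * vi for k, vi in zip(k_star, v))
        sigma.append(math.sqrt(max(var, 0.0)))
    return mu, sigma

grid = [-10.0 + 0.1 * i for i in range(201)]   # candidate decision variables
rng = random.Random(1)
x_next = rng.choice(grid)                      # random starting point x_1
xs, ys = [], []                                # data set D_0 is empty
for t in range(20):
    xs.append(x_next)                          # query the objective, augment D_t
    ys.append(objective(x_next))
    mu, sigma = gp_predict(xs, ys, grid)       # update the Bayesian model f
    alpha = [m + 2.0 * s for m, s in zip(mu, sigma)]   # UCB acquisition
    x_next = grid[max(range(len(grid)), key=lambda i: alpha[i])]

best_y = max(ys)
best_x = xs[ys.index(best_y)]
print(f"best observed x = {best_x:.1f}, J(x) = {best_y:.2f}")
```

Early iterations tend to spread out across the range, because the unexplored regions have the largest $\sigma$; later iterations concentrate where the posterior mean is high. How quickly that switch happens depends strongly on the assumed kernel settings.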

Bayesian optimization in MATLAB

matlab

%% Bayesian optimization demo
% initialise a toy function to maximise over
x = [-10:0.1:10]'; % grid of decision variables
y = sin(x).*x.^2 + 2.*x; % objective function to maximise wrt x
N = length(x); % number of samples
rng(1); % initialise random seed
% initialise algorithm
xstar = x(randi(N,1)); % initial decision variable
D = []; % initial data set
% Bayesian optimization loop
for k = 1:30
	ystar = sin(xstar).*xstar.^2 + 2.*xstar; % obj fn
	D = [D; xstar ystar]; % augment data
	gp = fitrgp(D(:,1),D(:,2),'Sigma',30); % fit GP model
	[mu,sigma,yint1] = predict(gp,x); % predict GP
	alpha = mu + 2*sigma; % UCB acq. fn
	[bestalpha,idxstar] = max(alpha); % max. acq. fn
	xstar = x(idxstar); % update x
end
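% plot the best observed point, the true objective and the final GP fit
[ybest,ibest] = max(D(:,2)); % best sample found (xstar now holds the next, unevaluated candidate)
figure; hold on; plot(D(ibest,1),ybest,'.m','markersize',50); 
plot(x,y,'r'); plot(x,mu,'--k'); 
xlabel('x'); ylabel('Objective Value')
patch([x;flipud(x)],[yint1(:,1);flipud(yint1(:,2))], ...
'k','FaceAlpha',0.1);
% Bayesian optimization using in-built Matlab function
% (objectiveFunction and decisionVariables are placeholders that must be defined first;
% note that bayesopt minimises, so return the negative of a score you want to maximise)
results = bayesopt(objectiveFunction,decisionVariables);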

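Bayesian optimization in Python

Python offers more advanced tools for Bayesian optimization:

hyperopt is an open-source Python library for Bayesian optimization, developed by Bergstra.
hyperopt can use the Tree-structured Parzen Estimator (TPE) algorithm.
TPE can optimise continuous and discrete hyperparameters together.

python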

from hyperopt import fmin, tpe, hp
from keras.models import Sequential
from keras.layers import Dense
from keras.datasets import mnist
from keras.utils import to_categorical
# Load the MNIST dataset and preprocess it
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape(-1, 784).astype('float32') / 255.0
X_test = X_test.reshape(-1, 784).astype('float32') / 255.0
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
# Define the search space for hyperparameters
space = {'num_hidden_layers': hp.choice('num_hidden_layers', [1, 2, 3]),
'num_hidden_units': hp.choice('num_hidden_units', [32, 64, 128, 256]),}
# Define the objective function to minimize
def objective(params):
	# Build the Keras model with the given hyperparameters
	model = Sequential()
	model.add(Dense(params['num_hidden_units'],
	activation='relu', input_shape=(784,)))
	for i in range(params['num_hidden_layers']-1):
		model.add(Dense(params['num_hidden_units'], activation='relu'))
	model.add(Dense(10, activation='softmax'))
	model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
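	# Train the model and evaluate on held-out validation data
	# (tuning against the test set would bias the final test estimate)
	history = model.fit(X_train, y_train, epochs=5, batch_size=32,
	validation_split=0.1, verbose=0)
	accuracy = history.history['val_accuracy'][-1]
	# hyperopt minimises, so return the negative accuracy as the loss
	return {'loss': -accuracy, 'status': 'ok'}
# Run the optimization
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=20)
# fmin reports hp.choice parameters as list indices; space_eval maps them back to values
from hyperopt import space_eval
print(space_eval(space, best))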

References

(Computational speed-ups) Li, et al. (2018). Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18(185), 1-52.
(Computational speed-ups, terminating poorly performing models) https://keras.io/api/keras_tuner/tuners/hyperband/
(Bayesian optimization algorithm) Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., & De Freitas, N. (2015). Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1), 148-175.
(Bayesian optimization example in Python) Bergstra, J., Bardenet, R., Bengio, Y., & Kégl, B. (2011). Algorithms for hyper-parameter optimization. Advances in Neural Information Processing Systems, 24.