Data Mining 2 Review Notes 6 - Optimization & Hyperparameter Tuning

6. Optimization & Hyperparameter Tuning

Why Hyperparameter Tuning?

There are many learning algorithms for classification, regression, ... Many of them have hyperparameters: k and the distance function in k-nearest neighbors, splitting and pruning options in decision tree learning, ...

But what is their effect?
Hard to tell in general, and rules of thumb are rare.

Parameters vs. Hyperparameters
Parameters are learned during training

Typical examples: Coefficients in (linear) regression, Weights in neural networks, ...

Training: find a set of parameters such that the objective function is minimized/maximized (on the training data)

Hyperparameters are fixed before training

Typical examples: Network layout and learning rate in neural networks, k in kNN, ...

Training: find a set of parameters such that the objective function is minimized/maximized (on a holdout set), given a previously fixed set of hyperparameters

Hyperparameter Tuning -- a Naive Approach

  1. run classification/regression algorithm
  2. look at the results (e.g., accuracy, RMSE, ...)
  3. choose different parameter settings, go to 1

Questions: when to stop? how to select the next parameter setting to test?
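The naive loop above can be sketched in a few lines. Here `evaluate` is a hypothetical stand-in for training and scoring a learner; its accuracy curve (peaking at k = 5) is invented for illustration:

```python
# Naive manual tuning: run, look at the result, pick a new setting.
# `evaluate` is a made-up stand-in for training and scoring a learner.
def evaluate(k):
    return 1.0 - abs(k - 5) * 0.05   # invented accuracy curve, peak at k=5

best_k, best_acc = None, -1.0
for k in (1, 3, 5, 7, 9):            # hand-picked candidate settings
    acc = evaluate(k)                # steps 1 and 2: run and inspect
    if acc > best_acc:               # step 3: keep the better setting
        best_k, best_acc = k, acc

print(best_k, best_acc)
```

Both open questions remain visible in the sketch: the candidate list is arbitrary, and the loop simply stops when the list runs out.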

Hyperparameter Tuning -- Avoid Overfitting!

Recap overfitting: classifiers may over-adapt to the training data. The same holds for hyperparameter settings

Possible danger: finding parameters that work well on the training set but not on the test set

Remedy: train / validation / test split
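A minimal sketch of such a three-way split on a stand-in dataset; the 60/20/20 proportions are just an example:

```python
import random

# Three-way split: fit parameters on the training part, compare
# hyperparameter settings on the validation part, and touch the test
# part only once for the final performance estimate.
random.seed(0)
data = list(range(100))              # stand-in for a labeled dataset
random.shuffle(data)

train = data[:60]
validation = data[60:80]
test = data[80:]

print(len(train), len(validation), len(test))
```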

6.1 Hyperparameter Tuning: Brute Force

The brute-force approach tries every parameter combination that exists. For all but the smallest search spaces, we need a better strategy than brute force!
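To get a feel for the combinatorics, a sketch that enumerates a small grid with `itertools.product`; the parameter names and value lists are invented:

```python
from itertools import product

# Brute force enumerates every grid combination: n hyperparameters with
# m options each means m**n runs of the learner.
grid = {
    "k": [1, 3, 5, 7],                       # e.g. neighbours in kNN
    "metric": ["euclidean", "manhattan"],    # e.g. distance function
    "weighted": [False, True],
}

combos = list(product(*grid.values()))
print(len(combos))   # 4 * 2 * 2 = 16 evaluations
```

Adding one more four-valued parameter already quadruples the count, which is why the number of runs explodes.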

Hyperparameter tuning is an optimization problem
Finding optimal values for N variables

Properties of the problem:

  • the underlying model is unknown, i.e., we do not know how changing a variable will influence the results
  • we can tell how good a solution is when we see it, i.e., by running a classifier with the given parameter set
  • but looking at each solution is costly

Related problem: feature subset selection

Given n features, brute force requires 2^n evaluations

e.g., for 20 features, that is already about one million → ten million with 10-fold cross validation

Knapsack problem

Given a maximum weight you can carry and a set of items with different weights and monetary values, pack those items that maximize the total monetary value

The problem is NP hard -- i.e., all known deterministic algorithms require exponential time in the worst case
Note: feature subset selection for n features requires 2^n evaluations

Many optimization problems are NP hard

Routing problems (Traveling Salesman Problem)

Integer factorization: hard enough to be used for cryptography

Resource use optimization. e.g., minimizing cutoff waste

Chip design - minimizing chip sizes

Properties of Brute Force search

guaranteed to find the best parameter setting in the grid, but too slow in most practical cases

Grid search performs a brute-force search with equal-width steps on non-discrete numerical attributes

(e.g., 10, 20, 30, ..., 100)

For a hyperparameter with a wide range (e.g., 0.0001 to 1,000,000):

with ten equal-width steps, each step is already about 100,000 wide

but what if the optimum is around 0.1?
Logarithmic steps (e.g., 0.0001, 0.001, 0.01, ...) may perform better for such parameters
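A small sketch contrasting the two step schemes for the 0.0001 to 1,000,000 example:

```python
# Ten equal-width steps over 0.0001 ... 1,000,000 are each about
# 111,000 wide and never come near 0.1; logarithmic steps place one
# point per order of magnitude instead.
lo, hi, n = 1e-4, 1e6, 10
linear = [lo + i * (hi - lo) / (n - 1) for i in range(n)]
logarithmic = [10.0 ** e for e in range(-4, 7)]   # 1e-4 ... 1e6

print(linear[:2])        # starts at 0.0001, second point already > 111,000
print(logarithmic[3])    # 0.1 lies on the logarithmic grid
```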

Needed :

solutions that take less time/computation and often find the best parameter setting or find a near-optimal parameter setting

6.2 Hyperparameter Tuning: One After Another

Given n parameters with m possible values each -- brute force takes m^n runs of the base classifier

Simple tweak:

  1. start with default settings
  2. try all options for the first parameter
    2a. fix best setting for first parameter
  3. try all options for the second parameter
    3a. fix best setting for second parameter
  4. ...

This reduces the runtime from m^n to n*m

i.e., no longer exponential -- but we may miss the best solution
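The one-after-another scheme can be sketched as follows; `score` is a made-up objective standing in for running the learner with a given setting:

```python
# Greedy tuning: optimize each hyperparameter in turn while the others
# stay fixed, for n*m evaluations instead of m**n. `score` is an
# invented stand-in for running the learner.
def score(setting):
    return -(setting["a"] - 2) ** 2 - (setting["b"] - 1) ** 2

grid = {"a": [0, 1, 2, 3], "b": [0, 1, 2, 3]}
current = {name: values[0] for name, values in grid.items()}  # defaults

evaluations = 0
for name, values in grid.items():
    best_value, best_score = current[name], float("-inf")
    for v in values:
        evaluations += 1
        s = score(dict(current, **{name: v}))   # vary one parameter only
        if s > best_score:
            best_value, best_score = v, s
    current[name] = best_value                  # fix it, move to the next

print(current, evaluations)   # 8 evaluations instead of 16
```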

6.2.1 Interaction Effects

Interaction effects make parameter tuning hard. i.e., changing one parameter may change the optimal settings for another one

Example: two parameters p and q, each with values 0, 1, and 2; a table (not reproduced here) depicts the classification accuracy for each combination. If we optimize one parameter after the other (first p, then q), we end at p=0, q=0 in six out of nine cases, investigating 2.3 solutions on average.

(p=0, q=0 is a local optimum with accuracy 0.5; the global optimum has accuracy 0.7)
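To make the interaction effect concrete, a sketch with an invented 3x3 accuracy table (not the lecture's actual numbers), constructed so that greedy search ends in the 0.5 local optimum while the global optimum is 0.7:

```python
# Invented accuracy table acc[p][q] with a local optimum 0.5 at
# (p=0, q=0) and the global optimum 0.7 at (p=2, q=2).
acc = [
    [0.5, 0.2, 0.3],
    [0.2, 0.1, 0.2],
    [0.3, 0.2, 0.7],
]

# greedy: tune p first (q fixed at 0), then q (p fixed at its best)
best_p = max(range(3), key=lambda p: acc[p][0])
best_q = max(range(3), key=lambda q: acc[best_p][q])

print(best_p, best_q, acc[best_p][best_q])   # stuck at (0, 0) with 0.5
print(max(max(row) for row in acc))          # global optimum: 0.7
```

Because p and q interact, the best value of p at q=0 gives no hint that the globally best combination lies at p=2, q=2.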

6.3 Hill climbing with variations

"Like climbing Everest in thick fog with amnesia": always search in the direction of the steepest ascent.

  • Stochastic hill climbing
    random selection among the uphill moves
    the selection probability can vary with the steepness of the uphill move
  • First-choice hill climbing
    generating successors randomly until a better one is found, then pick
    that one
  • Random-restart hill climbing
    run hill climbing with different seeds
    tries to avoid getting stuck in local maxima
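A sketch of random-restart hill climbing on a made-up two-peak objective; the restart seeds are fixed here to keep the run deterministic. Starts left of the valley get stuck on the lower peak, and the restarts recover the higher one:

```python
# Random-restart hill climbing on an invented two-peak 1-D objective:
# a single run can get stuck on the lower peak at x=1; restarting from
# several seeds usually finds the higher peak at x=7.
def f(x):
    return -abs(x - 7) if x > 4 else -abs(x - 1) - 4   # peaks at x=1, x=7

def hill_climb(x):
    while True:
        best = max((x - 1, x + 1), key=f)   # steepest-ascent step
        if f(best) <= f(x):
            return x                        # local maximum reached
        x = best

starts = [-8, 0, 4, 12, 18]                 # fixed "random" restart seeds
results = [hill_climb(x0) for x0 in starts]
best = max(results, key=f)
print(results, best)                        # some runs stop at 1, best is 7
```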

Local Beam Search

Keep track of k states rather than just one

Start with k randomly generated states

At each iteration, all the successors of all k states are generated

Select the k best successors from the complete list and repeat
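The steps above can be sketched for k = 2 on a made-up two-peak objective; the beam keeps the two best of all successors in every iteration:

```python
# Local beam search with k = 2 states on an integer line: generate all
# successors of all current states, keep the k best, and remember the
# best state seen so far. The two-peak objective is invented.
def f(x):
    return -abs(x - 7) if x > 4 else -abs(x - 1) - 4   # peaks at x=1, x=7

k = 2
states = [-5, 12]                            # start with k states
best = max(states, key=f)
for _ in range(20):
    successors = {s + d for s in states for d in (-1, 1)}
    states = sorted(successors, key=f, reverse=True)[:k]
    best = max(states + [best], key=f)

print(best, f(best))                         # the beam finds the higher peak
```

Unlike k independent hill climbers, the k best successors are picked from the pooled list, so the beam concentrates on the most promising region.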

Grid Search vs. Random Search

All the examples discussed so far use fixed grids

Challenges: some hyperparameters are quite sensitive

e.g., 0.02 is a good value, but 0 and 0.05 are not -- while others have little influence

and it is hard to know upfront which is which
Grid search may easily miss the best parameters, whereas random search often yields better results
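A sketch of why a fixed grid can miss a narrow good region while random trials can hit it; the objective and the good band around 0.02 are invented for illustration:

```python
import random

# Invented objective: good only in a narrow band around x = 0.02,
# while hyperparameter y has no influence at all.
def score(x, y):
    return 1.0 if 0.015 < x < 0.025 else 0.0

# A 3x3 grid over [0, 0.05] x [0, 1]: the x values {0, 0.025, 0.05}
# step right over the narrow good band.
grid_scores = [score(gx * 0.025, gy * 0.5)
               for gx in range(3) for gy in range(3)]

# Nine random trials sample x continuously; they often (though not
# always) land inside the band at least once.
random.seed(4)
random_scores = [score(random.uniform(0, 0.05), random.uniform(0, 1))
                 for _ in range(9)]

print(max(grid_scores), max(random_scores))
```

The grid spends a third of its budget varying the irrelevant y at each x; random search effectively gives every trial a fresh x value.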

6.6 Genetic Programming

Genetic algorithms are inspired by evolution:

use a population of individuals (solutions) -> create new individuals by crossover -> introduce random mutations -> from each generation, keep only the best solutions ("survival of the fittest")
Standard Genetic Algorithm (SGA)

6.6.1 SGA

Basic ingredients:

  • individuals: the solutions
    hyperparameter tuning: a hyperparameter setting
  • a fitness function
    hyperparameter tuning: performance of a hyperparameter setting (i.e., run learner with those parameters)
  • a crossover method
    hyperparameter tuning: create a new setting from two others
  • a mutation method
    hyperparameter tuning: change one parameter
  • survivor selection
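These ingredients can be sketched as a minimal GA for two hypothetical hyperparameters; the fitness function is an invented stand-in for running the learner with a given setting:

```python
import random

# Minimal standard GA for two made-up hyperparameters (k and a learning
# rate). `fitness` stands in for "run the learner and measure quality".
random.seed(0)

def fitness(individual):
    k, lr = individual
    return -(k - 5) ** 2 - (lr - 0.1) ** 2     # invented optimum: k=5, lr=0.1

def crossover(a, b):
    return (a[0], b[1])                        # child mixes both parents

def mutate(individual):
    k, lr = individual
    return (k + random.choice((-1, 1)), lr)    # nudge one parameter

population = [(random.randint(1, 15), random.uniform(0, 1))
              for _ in range(10)]              # individuals: settings

for _ in range(30):
    parents = random.sample(population, 2)
    child = mutate(crossover(*parents))
    population.append(child)
    population.sort(key=fitness, reverse=True) # survivor selection:
    population = population[:10]               # "survival of the fittest"

print(population[0], fitness(population[0]))
```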

Crossover OR Mutation?

A decades-long debate: which one is better / necessary?

Answer (at least, fairly wide agreement): it depends on the problem, but

in general it is good to have both -- each plays a different role
a mutation-only EA is possible; a crossover-only EA would not work

Exploration: discovering promising areas in the search space, i.e., gaining information about the problem
Exploitation: optimizing within a promising area, i.e., using information

There is co-operation AND competition between the two.

Crossover is explorative: it makes a big jump to an area somewhere "in between" two (parent) areas.
Mutation is exploitative: it creates small random diversions, thereby staying near (in the area of) the parent.

Only crossover can combine information from two parents (remember: it samples from the entire value range).
Only mutation can introduce new information (alleles) -- to hit the optimum you often need a 'lucky' mutation.

6.6.2 Genetic Feature Subset Selection

Feature Subset Selection can also be solved by Genetic Programming

Individuals: feature subsets

Representation: binary -- 1 = feature is included; 0 = feature is not included

Fitness: classification performance

Crossover: combine selections of two subsets

Mutation: flip bits
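A sketch of this bit-string encoding; the set of "useful" features and the fitness function are invented stand-ins for real classification performance:

```python
import random

# Feature subsets as bit strings: 1 = included, 0 = not included.
# Crossover splices two subsets, mutation flips one bit. The fitness
# function is invented: it rewards three "useful" features and
# penalizes all others, standing in for classifier accuracy.
random.seed(2)
N_FEATURES = 6
USEFUL = {0, 2, 3}                     # hypothetical informative features

def fitness(bits):
    hits = sum(bits[i] for i in range(N_FEATURES) if i in USEFUL)
    noise = sum(bits[i] for i in range(N_FEATURES) if i not in USEFUL)
    return hits - 0.5 * noise

def crossover(a, b):
    cut = random.randrange(1, N_FEATURES)      # single-point crossover
    return a[:cut] + b[cut:]

def mutate(bits):
    i = random.randrange(N_FEATURES)           # flip one random bit
    return bits[:i] + [1 - bits[i]] + bits[i + 1:]

population = [[random.randint(0, 1) for _ in range(N_FEATURES)]
              for _ in range(8)]
for _ in range(50):
    child = mutate(crossover(*random.sample(population, 2)))
    population.append(child)
    population.sort(key=fitness, reverse=True)
    population = population[:8]

print(population[0], fitness(population[0]))   # best subset found
```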

6.6.3 Selecting a Learner by Meta Learning

So far, we have looked at finding good parameters for a learner -- the learner was always fixed

A similar problem is selecting a learner for the task at hand

Again, we could go with search. Another approach is meta learning

Meta learning, i.e., learning about learning

Goal: learn how well a learner will perform on a given dataset

Features: dataset characteristics and the learning algorithm; prediction target: accuracy, RMSE, ...

Also known as AutoML

Basic idea: train a regression model

  • data points: individual datasets plus ML approach
  • target: expected accuracy/RMSE etc.

Examples for features: number of instances/attributes, fraction of nominal/numerical attributes, min/max/average entropy of attributes, skewness of classes, ...


Recap: search heuristics are good for problems where finding an optimal solution is difficult, evaluating a solution candidate is easy, and the search space of possible solutions is large

Possible solution: genetic programming

We have encountered such problems quite frequently

Example: learning an optimal decision tree from data

6.6.4 Genetic Decision Tree Learning

Population: candidate decision trees (initialization: e.g., trained on small subsets of data)
Create new decision trees by means of crossover & mutation

Fitness function: e.g., accuracy

The crossover swap can happen at any level of the trees, chosen at random

6.6.5 Combination of GP with other Learning Methods

Rule Learning ("Learning Classifier Systems")

Population: set of rule sets (!)

Crossover: combining rules from two sets

Mutation: changing a rule

Artificial Neural Networks

Easiest solution: fixed network layout

The network is then represented as an ordered set (vector) of weights

e.g., [0.8, 0.2, 0.5, 0.1, 0.1, 0.2]
Crossover and mutation are straightforward

Variant: AutoMLP - Searches for best combination of hidden layers and learning rate

6.7 Hyperparameter learning

Hyperparameter tuning as a learning problem : Given a set of hyperparameters H, predict performance p of model. The prediction model is referred to as a surrogate model or oracle

Rationale:

Training and evaluating an actual model is costly

Learning and predicting with the surrogate model is fast

Note:

we want to avoid too many runs of the actual model, i.e., the surrogate model will have only a few training points -- so use a simple model.
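A minimal sketch of the surrogate idea (deliberately not full Bayesian optimization): run the expensive objective only a few times, fit a cheap surrogate (here a 1-nearest-neighbour lookup) on the observed points, and let the surrogate pick the next candidate. The objective and its optimum at x = 0.3 are invented:

```python
import random

# Surrogate-model sketch: the surrogate scores many cheap candidates,
# and only its favourite is evaluated with the expensive objective.
random.seed(3)

def expensive(x):                       # stand-in for training a model
    return -(x - 0.3) ** 2              # invented optimum at x = 0.3

observed = {x: expensive(x) for x in (0.0, 0.5, 1.0)}   # initial real runs

def surrogate(x):                       # 1-nearest-neighbour prediction
    nearest = min(observed, key=lambda seen: abs(seen - x))
    return observed[nearest]

for _ in range(10):
    candidates = [random.random() for _ in range(100)]  # cheap to score
    pick = max(candidates, key=surrogate)               # surrogate's choice
    observed[pick] = expensive(pick)                    # one real run only

best = max(observed, key=observed.get)
print(best, observed[best])
```

Each iteration costs 100 surrogate predictions but only one expensive evaluation, which is the whole point of the approach.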

Most well-known approach: Bayesian optimization

Summary: Grid Search, Random Search, Learning hyperparameters / Bayesian optimization

Grid search

Inefficient

Fixed grid sizes may miss good parameters (Smaller grid sizes would be even less efficient!)

Random search

Often finds good solutions in less time

Learning hyperparameters / Bayesian optimization

Successively tests hyperparameters close to local optima

Similar to hill climbing

Difference: explicit surrogate model

6.8 Summary
