What is Hyperopt?

Hyperopt is a way to search through an hyperparameter space. For example, it can use the Tree-structured Parzen Estimator (TPE) algorithm, which explore intelligently the search space while narrowing down to the estimated best parameters.

It is hence a good method for meta-optimizing a neural network which is itself an optimisation problem: tuning a neural network uses gradient descent methods, and tuning the hyperparameters needs to be done differently since gradient descent can’t apply. Therefore, Hyperopt can be useful not only for tuning hyperparameters such as the learning rate, but also to tune more fancy parameters in a flexible way, such as changing the number of layers of certain types, or the number of neurons in a layer, or even the type of layer to use at a certain place in the network given an array of choices, each with nested tunable hyperparameters.

This is an oriented random search, in contrast with a Grid Search where hyperparameters are pre-established with fixed steps increase. Random Search for Hyper-Parameter Optimization (such as what Hyperopt do) has proven to be an effective search technique. The paper about this technique sits among the most cited deep learning papers. To sum up, it is more efficient to search randomly through values and to intelligently narrow the search space rather than looping on fixed sets of values for the hyperparameters.

Note that this blog post is also available on our GitHub as a Notebook. It contains code that can be run with Jupyter.

Star  Fork

How to define Hyperopt parameters?

A parameter is defined with a certain uniformrange or else a probability distribution, such as:

  • hp.randint(label, upper)
  • hp.uniform(label, low, high)
  • hp.loguniform(label, low, high)
  • hp.normal(label, mu, sigma)
  • hp.lognormal(label, mu, sigma)

There is also a few quantized versions of those functions, which rounds the generated values at each step of “q”:

  • hp.quniform(label, low, high, q)
  • hp.qloguniform(label, low, high, q)
  • hp.qnormal(label, mu, sigma, q)
  • hp.qlognormal(label, mu, sigma, q)

It is also possible to use a “choice” which can lead to hyperparameter nesting:

  • hp.choice(label, ["list", "of", "potential", "choices"])
  • hp.choice(label, [hp.uniform(sub_label_1, low, high), hp.normal(sub_label_2, mu, sigma), None, 0, 1, "anything"])

Visualisations of the parameters for probability distributions can be found below. Then, more details on choices and parameter nesting will come.

In [1]:
# The "%reset" ipython magic command will reset the kernel upon being called.
# (thus flushing loaded variables and previous imports from the current
# notebook session, if there were any)
%reset -f

from hyperopt import pyll, hp
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats.kde import gaussian_kde

# Let's plot the result of sampling from many different probability distributions:
hyperparam_generators = {
    'randint': hp.randint('randint', 5),
    'uniform': hp.uniform('uniform', -1, 3),
    'loguniform': hp.loguniform('loguniform', -0.3, 0.3),
    'normal': hp.normal('normal', 1, 2),
    'lognormal': hp.lognormal('lognormal', 0, 0.3)
}

n_samples = 5000

for title, space in hyperparam_generators.items():
    evaluated = [
        pyll.stochastic.sample(space) for _ in range(n_samples)
    ]
    x_domain = np.linspace(min(evaluated), max(evaluated), n_samples)

    plt.figure(figsize=(18,6))

    hist = gaussian_kde(evaluated, 0.001)
    plt.plot(x_domain, hist(x_domain), label="True Histogram")

    blurred_hist = gaussian_kde(evaluated, 0.1)
    plt.plot(x_domain, blurred_hist(x_domain), label="Smoothed Histogram")

    plt.title("Histogram (pdf) for a {} distribution".format(title))
    plt.legend()
    plt.show()
Histogram for an Hyperopt integer distribution
Histogram for an Hyperopt uniform distribution
Histogram for an Hyperopt loguniform distribution
Histogram for an Hyperopt normal distribution
Histogram for an Hyperopt lognormal distribution

Note on the above charts (especially for the loguniform and uniform distributions): the blurred line averaging the values fades out toward the ends of the signal since it is zero-padded. The line ideally would not fade out by using techniques such as mirror-padding.

On the loguniform and lognormal distributions

Those are the best distributions for modeling the values a learning rate. That’s because we want to observe changes in the learning rate according to changing it with multiplications rather than additions, e.g.: when adjusting the learning rate, we’ll want to try to divide it or multiply it by 2 rather than adding and substracting a finite value.

To proove this, let’s generate a loguniform distribution for a multiplier of the learning rate, centered at 1.0. Dividing 1 by those values should yield the same distribution.

In [2]:
log_hyperparam_generators = {
    'loguniform': hp.loguniform('loguniform', -0.3, 0.3),
    'lognormal': hp.lognormal('lognormal', 0, 0.3)
}
# For more info about the lognormal distribution, see: 
# https://www.wolframalpha.com/input/?i=y%3D2%5E(+(-log4(x))%5E0.5+),+y%3D2%5E(-+(-log4(x))%5E0.5+)+from+0+to+1
# https://www.wolframalpha.com/input/?i=y%3D4%5E-(log2(x)%5E2)+from+0+to+5

n_samples = 5000

for title, space in log_hyperparam_generators.items():
    evaluated = [
        pyll.stochastic.sample(space) for _ in range(n_samples)
    ]
    inverse_evaluated = [1.0 / y for y in evaluated]
    x_domain = np.linspace(min(evaluated), max(evaluated), n_samples)

    plt.figure(figsize=(18,6))

    hist = gaussian_kde(evaluated, 0.001)
    plt.plot(x_domain, hist(x_domain), label="True Histogram")

    inverse_hist = gaussian_kde(inverse_evaluated, 0.001)
    plt.plot(x_domain, inverse_hist(x_domain), label="1 / True Histogram")

    blurred_hist = gaussian_kde(evaluated, 0.1)
    plt.plot(x_domain, blurred_hist(x_domain), label="Smoothed Histogram")

    blurred_inverse_hist = gaussian_kde(inverse_evaluated, 0.1)
    plt.plot(x_domain, blurred_inverse_hist(x_domain), label="1 / Smoothed Histogram")

    plt.title("Histogram (pdf) comparing a {} distribution and the distribution of the inverse of all its values".format(title))
    plt.legend()
    plt.show()
Histogram for an Hyperopt loguniform distribution, and its inverse
Histogram for an Hyperopt lognormal distribution, and its inverse

Example – optimizing for finding the minimum of:
f(x) = x^2 - x + 1

Let’s now define a simple search space and solve for f(x) = x^2 - x + 1, where x is an hyperparameter.

In [3]:
%reset -f

from hyperopt import fmin, tpe, hp
import matplotlib.pyplot as plt

def f(x):
    return x**2 - x + 1

plt.plot(range(-5, 5), [f(x) for x in range(-5, 5)])
plt.title("Function to optimize: f(x) = x^2 - x + 1")
plt.show()

space = hp.uniform('x', -5, 5)

best = fmin(
    fn=f,  # "Loss" function to minimize
    space=space,  # Hyperparameter space
    algo=tpe.suggest,  # Tree-structured Parzen Estimator (TPE)
    max_evals=1000  # Perform 1000 trials
)

print("Found minimum after 1000 trials:")
print(best)
Some quadratic function f(x)=x^2-x+1
Found minimum after 1000 trials:
{'x': 0.500084824485627}

Example with a dict hyperparameter space

Let’s solve for minimizing f(x, y) = x^2 + y^2 using a space using a python dict as structure. Later, this will neable us to nest hyperparameters with choices in a clean way.

In [4]:
%reset -f

from hyperopt import fmin, tpe, hp

def f(space):
    x = space['x']
    y = space['y']
    return x**2 + y**2

space = {
    'x': hp.uniform('x', -5, 5),
    'y': hp.uniform('y', -5, 5)
}

best = fmin(
    fn=f,
    space=space,
    algo=tpe.suggest,
    max_evals=1000
)

print("Found minimum after 1000 trials:")
print(best)
Found minimum after 1000 trials:
{'x': 0.013181950926553512, 'y': 0.0364742933684085}

With choices, Hyperopt hyperspaces can be represented as nested data structures, too

Yet, we have defined spaces as a single parameter. But that is 1D. Normally, spaces contain many parameters. Let’s define a more complex one and with one nested hyperparameter choice for an uniform float:

In [5]:
%reset -f

from hyperopt import pyll, hp

import pprint

pp = pprint.PrettyPrinter(indent=4, width=100)

# Define a complete space: 
space = {
    'x': hp.normal('x', 0, 2),
    'y': hp.uniform('y', 0, 1),
    'use_float_param_or_not': hp.choice('use_float_param_or_not', [
        None, hp.uniform('float', 0, 1),
    ]),
    'my_abc_other_params_list': [
        hp.normal('a', 0, 2), hp.uniform('b', 0, 3), hp.choice('c', [False, True]),
    ],
    'yet_another_dict_recursive': {
        'u': hp.uniform('u', 0, 3),
        'v': hp.uniform('v', 0, 3),
        'u': hp.uniform('w', -3, 0)
    }
}

# Print a few random (stochastic) samples from the space: 
for _ in range(10):
    pp.pprint(pyll.stochastic.sample(space))
{   'my_abc_other_params_list': (-0.9011698405784833, 0.04240518887511624, False),
    'use_float_param_or_not': None,
    'x': 3.285084743824567,
    'y': 0.4111558924009172,
    'yet_another_dict_recursive': {'u': -1.5170149298252174, 'v': 2.1058221185350656}}
{   'my_abc_other_params_list': (-2.1813950266716633, 1.8724130869362874, False),
    'use_float_param_or_not': 0.11139197358152442,
    'x': -0.6396356374539179,
    'y': 0.944179731788648,
    'yet_another_dict_recursive': {'u': -0.8219112779877897, 'v': 2.481889167917366}}
{   'my_abc_other_params_list': (-1.9378270981764674, 0.23818419773638277, True),
    'use_float_param_or_not': 0.055164365429011486,
    'x': -0.722483828902152,
    'y': 0.16571995577005316,
    'yet_another_dict_recursive': {'u': -1.5321114862383185, 'v': 2.7355357236685687}}
{   'my_abc_other_params_list': (-6.277744702434964, 2.78891176300414, True),
    'use_float_param_or_not': 0.8465512535129484,
    'x': -2.6743650912469983,
    'y': 0.6015479182494206,
    'yet_another_dict_recursive': {'u': -0.05083517047457731, 'v': 0.33898515479785407}}
{   'my_abc_other_params_list': (0.14853623708214062, 0.9928986807814948, False),
    'use_float_param_or_not': None,
    'x': -1.5562156256416555,
    'y': 0.21144118616438012,
    'yet_another_dict_recursive': {'u': -0.14514878554310773, 'v': 1.5459430173391977}}
{   'my_abc_other_params_list': (0.11632043580442206, 0.08582655498150193, True),
    'use_float_param_or_not': None,
    'x': -4.261629893779373,
    'y': 0.24843664789625597,
    'yet_another_dict_recursive': {'u': -1.2130224897824347, 'v': 1.9784014363068374}}
{   'my_abc_other_params_list': (-0.7262319170213638, 0.5228563659772797, True),
    'use_float_param_or_not': None,
    'x': 0.4587527247586972,
    'y': 0.7533386636633489,
    'yet_another_dict_recursive': {'u': -0.7483556533657412, 'v': 1.2820200216468183}}
{   'my_abc_other_params_list': (0.7776940586049558, 2.4228455311249997, True),
    'use_float_param_or_not': 0.1547203762368643,
    'x': 0.09432801775550438,
    'y': 0.18913970646641654,
    'yet_another_dict_recursive': {'u': -2.0359310700722295, 'v': 1.450071647308347}}
{   'my_abc_other_params_list': (0.3023988415655776, 0.7548445351596126, False),
    'use_float_param_or_not': None,
    'x': 0.9896658847831072,
    'y': 0.24496113805354003,
    'yet_another_dict_recursive': {'u': -1.0461122235433225, 'v': 0.48828176621712693}}
{   'my_abc_other_params_list': (-2.103811439799546, 1.8384233211718317, False),
    'use_float_param_or_not': None,
    'x': 1.481404813165046,
    'y': 0.970300705213251,
    'yet_another_dict_recursive': {'u': -2.0565596833626976, 'v': 1.0738132202239536}}

Let’s now record the history of every trial

This will require us to import a few more things, and return the results with a dict that has a “status” and “loss” key at least. Let’s keep in our return dict the evaluated space too as this may come in handy if we save results to disk.

In [6]:
%reset -f

from hyperopt import fmin, tpe, hp, Trials, STATUS_OK, STATUS_FAIL

import pprint

pp = pprint.PrettyPrinter(indent=4)

def f(space):
    x = space['x']
    y = space['y']

    if y > 1:
        # Make use of status fail as an example of skipping on error
        result = {
            "loss": -1,
            "status": STATUS_FAIL,
            "space": space
        }
        return result

    loss = x**2 + y**2
    result = {
        "loss": loss,
        "status": STATUS_OK,
        "space": space
    }
    return result

space = {
    'x': hp.uniform('x', -5, 5),
    'y': hp.uniform('y', -5, 5)
}

trials = Trials()

best = fmin(
    fn=f,
    space=space,
    algo=tpe.suggest,
    trials=trials,
    max_evals=1000
)

print("Found minimum after 1000 trials:")
print(best)
print("")

print("Here are the space and results of the 3 first trials (out of a total of 1000):")
pp.pprint(trials.trials[0])
pp.pprint(trials.trials[1])
pp.pprint(trials.trials[2])
# pp.pprint(trials.trials[...])
# pp.pprint(trials.trials[999])
print("")

print("What interests us most is the 'result' key of each trial (here, we show 7):")
pp.pprint(trials.trials[0]["result"])
pp.pprint(trials.trials[1]["result"])
pp.pprint(trials.trials[2]["result"])
pp.pprint(trials.trials[3]["result"])
pp.pprint(trials.trials[4]["result"])
pp.pprint(trials.trials[5]["result"])
pp.pprint(trials.trials[6]["result"])
# pp.pprint(trials.trials[...]["result"])
# pp.pprint(trials.trials[999]["result"])
Found minimum after 1000 trials:
{'x': 0.1330891919905135, 'y': -0.22753380990535327}

Here are the space and results of the 3 first trials (out of a total of 1000):
{   'book_time': datetime.datetime(2017, 7, 19, 19, 25, 42, 14000),
    'exp_key': None,
    'misc': {   'cmd': ('domain_attachment', 'FMinIter_Domain'),
                'idxs': {'x': [0], 'y': [0]},
                'tid': 0,
                'vals': {   'x': [-0.2682564852440139],
                            'y': [-3.6433914388359234]},
                'workdir': None},
    'owner': None,
    'refresh_time': datetime.datetime(2017, 7, 19, 19, 25, 42, 14000),
    'result': {   'loss': 13.346262718458373,
                  'space': {'x': -0.2682564852440139, 'y': -3.6433914388359234},
                  'status': 'ok'},
    'spec': None,
    'state': 2,
    'tid': 0,
    'version': 0}
{   'book_time': datetime.datetime(2017, 7, 19, 19, 25, 42, 16000),
    'exp_key': None,
    'misc': {   'cmd': ('domain_attachment', 'FMinIter_Domain'),
                'idxs': {'x': [1], 'y': [1]},
                'tid': 1,
                'vals': {'x': [1.1193079488707536], 'y': [4.941591090140912]},
                'workdir': None},
    'owner': None,
    'refresh_time': datetime.datetime(2017, 7, 19, 19, 25, 42, 16000),
    'result': {   'loss': -1,
                  'space': {'x': 1.1193079488707536, 'y': 4.941591090140912},
                  'status': 'fail'},
    'spec': None,
    'state': 2,
    'tid': 1,
    'version': 0}
{   'book_time': datetime.datetime(2017, 7, 19, 19, 25, 42, 18000),
    'exp_key': None,
    'misc': {   'cmd': ('domain_attachment', 'FMinIter_Domain'),
                'idxs': {'x': [2], 'y': [2]},
                'tid': 2,
                'vals': {'x': [-0.9841176113491965], 'y': [4.084615269991156]},
                'workdir': None},
    'owner': None,
    'refresh_time': datetime.datetime(2017, 7, 19, 19, 25, 42, 18000),
    'result': {   'loss': -1,
                  'space': {'x': -0.9841176113491965, 'y': 4.084615269991156},
                  'status': 'fail'},
    'spec': None,
    'state': 2,
    'tid': 2,
    'version': 0}

What interests us most is the 'result' key of each trial (here, we show 7):
{   'loss': 13.346262718458373,
    'space': {'x': -0.2682564852440139, 'y': -3.6433914388359234},
    'status': 'ok'}
{   'loss': -1,
    'space': {'x': 1.1193079488707536, 'y': 4.941591090140912},
    'status': 'fail'}
{   'loss': -1,
    'space': {'x': -0.9841176113491965, 'y': 4.084615269991156},
    'status': 'fail'}
{   'loss': 41.242387285099674,
    'space': {'x': -4.347394824005317, 'y': -4.726790193070923},
    'status': 'ok'}
{   'loss': -1,
    'space': {'x': -0.7484035055375422, 'y': 2.0659168406990114},
    'status': 'fail'}
{   'loss': -1,
    'space': {'x': 4.7582574160719595, 'y': 4.497709800735624},
    'status': 'fail'}
{   'loss': -1,
    'space': {'x': 1.9594920113855077, 'y': 1.3312270068517922},
    'status': 'fail'}

Note that the optimization could be parallelized by using MongoDB and storing the trials’ state here. Althought this is a built-in feature of hyperopt, let’s keep things simple for our examples here.

Indeed, the TPE algorithm used by the fmin function has state which is stored in the trials and which is useful to narrow the search space dynamically once we have a few trials. It is then interesting to pause and resume a training, and to apply that to a real problem.

This is what’s done inside the hyperopt_optimize.py file of the GitHub repository for this project. There, as an example, we optimize a convolutional neural network for solving the CIFAR-100 problem.

It is also possible to glance at the results and the effect of each hyperparameter on the accuracy:


Hyperspace Scatterplot Matrix

You might as well like this other blog post of mine on how to use Git Large File Storage (Git LFS) to handle the versioning of huge files when working with machine learning projects.

Star  Fork

Share on

Where to
find us

510-1015 av. Wilfrid-Pelletier
Quebec, QC, Canada
G1W 0C4

418 800.0027
TOLL FREE: 1 844 800.0027
info@vooban.com

Directions

Our newsletter