Advanced

This page details the more advanced HyperOptimizer options.

First it might be useful to sketch the underlying process for performing a contraction path search:

  1. A driver (such as 'greedy' or 'kahypar'), along with a set of heuristic parameters (the ‘hyper-parameters’), is sampled from an ‘optimizer’.

  2. These are used to build a single ContractionTree.

  3. Optionally, some modification of this tree is then performed - either slicing, subtree reconfiguration, or some combination such as dynamic slicing.

  4. A score is then generated for the tree, for example the contraction cost.

  5. This score, the driver and hyper-parameters are fed back to the optimizer to update its search space.

  6. The process is repeated, optionally in parallel, for a set number of repeats or amount of time, with the best-scoring tree returned at the end.

import cotengra as ctg

Drivers

The following are valid methods to supply as HyperOptimizer(methods=[...]); the optimizer will tune each one and ultimately select the best performer:

  • 'kahypar' (default)

  • 'greedy' (default)

  • 'labels' (default if kahypar is not installed) - a pure Python implementation of hypergraph community detection

  • igraph partition based

    • 'spinglass'

    • 'labelprop'

  • igraph dendrogram based (yields the entire community structure tree at once)

    • 'betweenness'

    • 'walktrap'

  • linegraph tree decomposition based:

    • 'quickbb'

    • 'flowcutter'

You can also register your own methods to add to the mix with register_hyper_function.
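For example, you could register a toy path finder along the following lines. This is a minimal sketch: the function is assumed to take (inputs, output, size_dict, **kwargs) and return a path in SSA format, the jitter hyper-parameter is purely illustrative, and the search-space format matches that shown in the ‘Optimization Space’ section below.

def contract_in_order(inputs, output, size_dict, jitter=0.0):
    # toy 'driver': contract the input tensors pairwise in the order given
    # (a real method would make use of `jitter` and the sizes in `size_dict`)
    n = len(inputs)
    ssa_path = []
    prev = 0
    for i in range(1, n):
        ssa_path.append((prev, i))
        prev = n + i - 1  # SSA id of the intermediate just formed
    return ssa_path

ctg.register_hyper_function(
    'in-order',
    contract_in_order,
    {'jitter': {'type': 'FLOAT', 'min': 0.0, 'max': 1.0}},
)

opt = ctg.HyperOptimizer(methods=['greedy', 'in-order'])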

Slicing and Subtree Reconfiguration

There are three types of ‘post-processing’ of trees that the hyper optimizer can perform. For details of what each of these does, or how to perform them outside of the hyper optimization loop, see the ‘Tree Surgery’ sections:

Basic slicing - slicing_opts

This removes indices from the contraction greedily, one by one, until the specified options are met, without changing the tree structure at all. This is cheap to perform but can incur significant ‘slicing overhead’ for complicated contractions. To turn it on, supply a dict to slicing_opts, which will be passed as kwargs to ContractionTree.slice:

opt = ctg.HyperOptimizer(slicing_opts={'target_size': 2**28})

Or if we want paths with at least 1024 independent contractions we could use:

opt = ctg.HyperOptimizer(slicing_opts={'target_slices': 2**10})

Note that you need to work with the ContractionTree object rather than a raw path (i.e. call opt.search or quimb.tensor.TensorNetwork.contraction_tree) if you want to access the sliced indices or perform the sliced contraction (although a tree reconstructed from the raw path will be ‘sliceable’ in the sense that the sliced indices are easily found again).
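
For example, a minimal sketch (assuming inputs, output and size_dict describe your contraction, and that the sliced indices are exposed on the tree as sliced_inds):

opt = ctg.HyperOptimizer(slicing_opts={'target_size': 2**28})
tree = opt.search(inputs, output, size_dict)  # a ContractionTree, not a raw path
tree.sliced_inds                              # the indices that were sliced
# x = tree.contract(arrays)                   # perform the sliced contraction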

Dynamic slicing - slicing_reconf_opts

This removes indices one by one, but also performs subtree reconfiguration between removals in order to modify the tree to take account of the slicing. Slower than basic slicing but can be much higher quality, especially when slicing many indices of a complicated contraction. This is turned on by supplying a dict to slicing_reconf_opts, which will be supplied as kwargs to ContractionTree.slice_and_reconfigure. If both basic and dynamic slicing are specified then the basic slicing is performed first.

opt = ctg.HyperOptimizer(slicing_reconf_opts={'target_size': 2**28})

Subtree reconfiguration - reconf_opts

Finally, one can perform subtree reconfiguration alone - ranging over subtrees of the main tree optimally reconfiguring them locally to lower the target score globally. This is turned on by supplying a dict to reconf_opts, which will be supplied as kwargs to ContractionTree.subtree_reconfigure. The ‘reconfing’ will be performed after any slicing.

opt = ctg.HyperOptimizer(reconf_opts={})

Note that you can just supply an empty dict to use the default options. Some parameters, such as the score function to minimize (see next section), will be set for you to match the global objective if not specified. You can supply reconf_opts={'forested': True, ...} to use forested subtree reconfiguration.

Note

All three of these can be applied successively:

opt = ctg.HyperOptimizer(
    slicing_opts={'target_size': 2**40},         # first do basic slicing
    slicing_reconf_opts={'target_size': 2**28},  # then advanced slicing with reconfiguring
    reconf_opts={'subtree_size': 14},            # then finally just higher quality reconfiguring
)

Simulated Annealing - simulated_annealing_opts

cotengra has an implementation of simulated annealing along the lines of arXiv:2108.05665, the core function of which is simulated_anneal_tree.

You can turn this on by supplying a dict to simulated_annealing_opts, which will be supplied as kwargs to the above function:

opt = ctg.HyperOptimizer(simulated_annealing_opts={})

There is support for slicing by supplying target_size in this dict. The simulated anneal is performed immediately after generating the initial tree, and can still be combined with any of the other post-processing steps above.
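
For example, to anneal while also slicing down to a target intermediate size (value illustrative):

opt = ctg.HyperOptimizer(simulated_annealing_opts={'target_size': 2**28})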

Objective

The objective function - what the optimizer (and subtree reconfiguration) considers the ‘best’ tree - is set with the minimize kwarg.

  • minimize='flops' - minimize the total number of scalar operations,

  • minimize='size' - minimize the size of the largest intermediate tensor,

Hint

Generally, to control the memory usage of a contraction it is better to use slicing rather than targeting the size directly.

  • minimize='write' - minimize the total amount of ‘memory written’, i.e. the sum of sizes of all intermediate tensors. This targets memory speed, but is best weighted with the flops cost as well - see below. However, this score is also relevant for automatically differentiating a contraction, where naively all intermediates must be kept.

  • minimize='combo' - minimize the sum of \(FLOPS + \alpha \times WRITE\) using the default \(\alpha=64\)

  • minimize=f'combo-{alpha}' - use a custom \(\alpha\) for the combo cost

For real world contractions, where both clock speed and memory speed are important, using:

opt = ctg.HyperOptimizer(minimize='combo')

should provide decent baseline performance for typical CPUs and GPUs.
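
If memory bandwidth is the dominant concern, you might weight the write term more heavily with a custom \(\alpha\) (the value here is purely illustrative):

opt = ctg.HyperOptimizer(minimize='combo-256')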

Optimization Library

Optimizing the hyper-parameters of the drivers which generate trees is done by one of many backend ‘ask-and-tell’ optimization libraries. The cost of asking for a new sample point and telling the resulting score should be much smaller than actually generating the trees themselves. However, for Gaussian process Bayesian optimization, for example, the cost can become considerable once a few hundred trials have been fitted. As such, you may want to use a different library or algorithm. The options are:

  • optuna - Tree of Parzen Estimators used by default

  • baytune - Bayesian Tuning and Bandits - Gaussian Processes used by default

  • chocolate - the CMAES optimization algorithm is used by default (sampler='QuasiRandom' also useful)

  • skopt - random forest as well as Gaussian process regressors (high quality but slow)

  • nevergrad - various population and evolutionary algorithms (very fast and suitable for highly parallel path finding)

These are specified like:

opt = ctg.HyperOptimizer(optlib='nevergrad')

To select which algorithm the optimizer uses, supply the sampler= kwarg; any other options needed to instantiate the optimizer are also passed in as kwargs.
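
For example, to use chocolate with the quasi-random sampler mentioned above (assuming chocolate is installed):

opt = ctg.HyperOptimizer(optlib='chocolate', sampler='QuasiRandom')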

If you don’t want to use optimization (either because no library is installed or for the least overhead), you can supply:

opt = ctg.HyperOptimizer(optlib='random')

which will just randomly sample the space of hyper parameters.

Optimization Space

You can view (and modify) the actual hyper parameter space of each driver with the following call:

ctg.get_hyper_space()['greedy']
{'random_strength': {'type': 'FLOAT_EXP', 'min': 0.001, 'max': 1.0},
 'temperature': {'type': 'FLOAT_EXP', 'min': 0.001, 'max': 1.0},
 'costmod': {'type': 'FLOAT', 'min': 0.0, 'max': 50.0}}
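
The returned space can also be tweaked before constructing an optimizer - a sketch, assuming the dict returned is the live search space used by subsequent HyperOptimizer instances:

space = ctg.get_hyper_space()
space['greedy']['temperature']['max'] = 0.1  # narrow the temperature range explored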

Parallelization

cotengra can fairly naturally parallelize over the trials of the hyper optimization, and does so by default with HyperOptimizer(parallel='auto'). The basic parallel options are:

  • parallel=False - no parallelization

  • parallel=True - alias for using default parallelization and number of workers

  • parallel=X:int - alias for using the default parallelization but with a specified number of workers

  • parallel=pool - supply your own concurrent.futures.ProcessPoolExecutor compatible pool
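
For example, supplying your own pool directly (a minimal sketch using the standard library):

from concurrent.futures import ProcessPoolExecutor

pool = ProcessPoolExecutor(max_workers=4)
opt = ctg.HyperOptimizer(parallel=pool)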

For non-distributed pools the default number of workers is computed, in order of preference, as:

  1. COTENGRA_NUM_WORKERS environment variable

  2. OMP_NUM_THREADS environment variable

  3. os.cpu_count()
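
For example, the default worker count can be pinned via the environment variable (illustrative value):

import os
os.environ['COTENGRA_NUM_WORKERS'] = '4'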

parallel='concurrent.futures'

The default parallelization uses the Python standard library: a cached ProcessPoolExecutor.

parallel='loky'

Use the loky reusable pool (falling back to joblib.externals.loky if needed).

parallel='dask'

Use the dask distributed scheduler. If a dask.distributed.Client already exists then it will be used, otherwise a new one will be created. You can also supply a dask.distributed.Client object directly.
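
For example, supplying an existing client directly (assuming dask.distributed is installed):

from dask.distributed import Client

client = Client(n_workers=4)
opt = ctg.HyperOptimizer(parallel=client)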

parallel='ray'

Use the ray distributed scheduler, with potentially better performance than dask. A ray runtime will be created if not already running.

Others such as mpi4py

Other libraries that conform to the concurrent.futures interface can be used, for example the mpi4py library.
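
A sketch of this, assuming mpi4py is installed and the script is launched with an appropriate MPI runner:

from mpi4py.futures import MPIPoolExecutor

pool = MPIPoolExecutor(max_workers=8)
opt = ctg.HyperOptimizer(parallel=pool)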

Warning

There are two caveats to parallelization to consider:

  1. Since trials have to be pre-dispatched, a delay is introduced between sampling and reporting trials to the underlying optimizer.

  2. If the underlying optimizer cannot suggest trials quickly enough, or the pre-dispatch is not large enough, workers can become idle.

Consider using optlib='nevergrad' or optlib='random' if you are optimizing with many tens of workers or more, to avoid being bottlenecked by, for example, the default Bayesian optimization.

parallel='threads'

You can specify a ThreadPoolExecutor for parallelization, but this is only useful for tasks where the underlying driver releases the GIL, e.g. RandomGreedyOptimizer with accel=True, for which threads are used by default.
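
A sketch of this, assuming RandomGreedyOptimizer accepts the same parallel= option as the other optimizers:

opt = ctg.RandomGreedyOptimizer(max_repeats=256, accel=True, parallel='threads')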

Termination

The HyperOptimizer can be terminated in several ways.

  • max_repeats=X:int - terminate after this many trees have been generated. The default of 128 is reasonable for simple to moderately complex contractions using a Bayesian optimizer.

  • max_time=T - terminate after T seconds have elapsed

  • max_time=f'rate:{ops_per_second}' - terminate once the time taken searching is greater than if the best contraction found so far was performed at the rate ops_per_second. This can be used to avoid spending a long time on cheap contractions. If the contraction will be performed many times, divide the rate by the commensurate factor. For example, if we have a CPU with very approximately a billion OPS/s but we expect to perform the contraction a hundred times we could use max_time='rate:1e7'.

  • max_time=f'equil:{X}' (with X an integer) - terminate if no better contraction has been found in the last X trials. This can be used to avoid spending a long time on a contraction search that has already converged.
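
For example, combining a generous repeat cap with an equilibration condition (values illustrative):

opt = ctg.HyperOptimizer(max_repeats=1024, max_time='equil:128')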

Note that the HyperOptimizer is stateful; each time it is supplied to the same contraction it will reset the above conditions and continue.

Hint

The ReusableHyperOptimizer, on the other hand, only runs once for any given contraction.

Large graphs - RandomGreedyOptimizer

The HyperOptimizer can handle medium-sized graphs (up to 1000s of tensors), but the hyper-optimization and explicit ContractionTree construction of every trial can become slow for very large contractions. An alternative, more suited to very lightweight path finding and able to handle 10,000s of tensors, is the RandomGreedyOptimizer. This only samples random greedy paths, simultaneously calculating each flops score, thus avoiding the heavier machinery of the hyper optimizer. Moreover, the core function optimize_random_greedy_track_flops has an accelerated Rust counterpart in cotengrust, which is both faster and allows threaded parallelization.

inputs, output, _, size_dict = ctg.utils.lattice_equation([100, 100])

opt = ctg.RandomGreedyOptimizer(
    max_repeats=32,
    temperature=0.01,
    seed=42,
)

It has the same call methods as all cotengra optimizers:

  • path = opt(inputs, output, size_dict): return the best contraction path

  • ssa_path = opt.ssa_path(inputs, output, size_dict): return the best contraction path in SSA format

  • tree = opt.search(inputs, output, size_dict): return the best ContractionTree

The marginally cheapest of these is the SSA format:

%%time
ssa_path = opt.ssa_path(inputs, output, size_dict)
CPU times: user 3.21 s, sys: 58 ms, total: 3.26 s
Wall time: 613 ms

Once the optimizer has been run, the corresponding flops (log10) can be retrieved with:

opt.best_flops
65.02261352539062