# Advanced¶

This page details the more advanced `HyperOptimizer`

options.

First it might be useful to sketch the underlying process for performing a contraction path search:

A driver (such as

`'greedy'`

or`'kahypar'`

) along with a set of heuristic parameters (the ‘hyper-parameters’) are sampled from an ‘optimizer’.These are used to build a single

`ContractionTree`

Optionally, some modification of this tree is then performed - either slicing, subtree reconfiguration, or some combination such as dynamic slicing

A score is then generated for the tree, for example the contraction cost

This score, the driver and hyper-parameters are fed back to the optimizer to update its search space

The process is repeated, optionally in parallel, for a set number of repeats or time, with the best scoring tree being returned finally.

```
import cotengra as ctg
```

## Drivers¶

The following are valid initial methods to `HyperOptimizer`

(`methods=[...])`

,
which will tune each one and ultimately select the best performer:

`'kahypar'`

*(default)*`'greedy'`

*(default)*`'labels'`

(*default if*) - a pure python implementation of hypergraph community detection`kahypar`

not installedigraph

**partition**based`'spinglass'`

`'labelprop'`

igraph

**dendrogram**based (yield entire community structure tree at once)`'betweenness'`

`'walktrap'`

**linegraph tree decomposition**based:`'quickbb'`

`'flowcutter'`

You can also register your own methods to add to the mix with
`register_hyper_function`

.

## Slicing and Subtree Reconfiguration¶

There are three types of ‘post-processing’ of trees that the hyper optimizer can perform. For details of what each of these does, or how to perform them outside of the hyper optimization loop, see the ‘Tree Surgery’ sections:

**Basic slicing** - `slicing_opts`

¶

This removes indices from the contraction greedily one by one until the
specified options are met, without changing the tree structure at all. This is cheap
to perform but can incur significant ‘slicing overhead’ for complicated
contractions. To turn it on supply a `dict`

to `slicing_opts`

, which will be
supplied as kwargs to
`ContractionTree.slice`

```
opt = ctg.HyperOptimizer(slicing_opts={'target_size': 2**28})
```

Or if we want paths with at least 1024 independent contractions we could use:

```
opt = ctg.HyperOptimizer(slicing_opts={'target_slices': 2**10})
```

Note that you need to work with the `ContractionTree`

object rather than a raw
path (i.e. call `opt.search`

or `quimb.tensor.TensorNetwork.contraction_tree`

)
if you want to access the sliced indices or perform the sliced contraction
(although a tree reconstructed from the raw path will be ‘sliceable’ in the
sense that the indices will be easily refound).

**Dynamic slicing** - `slicing_reconf_opts`

¶

This removes indices one by one, but also performs subtree reconfiguration
between removals in order to modify the tree to take account of the slicing.
Slower than basic slicing but can be much higher quality, especially when
slicing many indices of a complicated contraction. This is turned on by
supplying a `dict`

to `slicing_reconf_opts`

, which will be supplied as kwargs to
`ContractionTree.slice_and_reconfigure`

.
If both basic and dynamic slicing are specified then the basic slicing is
performed first.

```
opt = ctg.HyperOptimizer(slicing_reconf_opts={'target_size': 2**28})
```

**Subtree reconfiguration** - `reconf_opts`

¶

Finally, one can perform subtree reconfiguration alone - ranging over subtrees
of the main tree optimally reconfiguring them locally to lower the target
score globally. This is turned on by supplying a `dict`

to `reconf_opts`

, which
will be supplied as kwargs to
`ContractionTree.subtree_reconfigure`

.
The ‘reconfing’ will be performed after any slicing.

```
opt = ctg.HyperOptimizer(reconf_opts={})
```

Note you just supply an empty dict if you want to use the default options. Some
parameters, such as the score function to minimize (see next section), will be
set for you to match the global objective if not specified. You can supply
`reconf_opts={'forested' :True, ...}`

to use forested subtree reconfiguration.

Note

Note all three of these can be applied successively:

```
opt = ctg.HyperOptimizer(
slicing_opts={'target_size': 2**40}, # first do basic slicing
slicing_reconf_opts={'target_size': 2**28}, # then advanced slicing with reconfiguring
reconf_opts={'subtree_size': 14}, # then finally just higher quality reconfiguring
)
```

**Simulated Annealing** - `simulated_annealing_opts`

¶

`cotengra`

has an implementation of simulated annealing along the lines of
arXiv:2108.05665, the core function of which is
`simulated_anneal_tree`

.

You can turn this on by supplying a `dict`

to `simulated_annealing_opts`

, which
will be supplied as kwargs to the above function:

```
opt = ctg.HyperOptimizer(simulated_annealing_opts={})
```

There is support for slicing by supplying the `target_size`

in this dict.
The simulated anneal is performed *first-thing* after generating the initial
tree, and can still be combined with any of the other processing steps above.

## Objective¶

What the optimizer (and subtree reconfiguration) considers the ‘best’ tree or
objective function is set with the `minimize`

kwarg.

`minimize='flops'`

- minimize the total number of scalar operations,`minimize='size'`

- minimize the size of the largest intermediate tensor,

Hint

Generally to control the memory usage of a contraction it is better to use slicing rather than targetting the size directly.

`minimize='write'`

- minimize the total amount of ‘memory written’, i.e. the sum of sizes of all intermediate tensors. This targets memory speed, but is best weighted with the flops cost as well - see below. However this score is also relevant for automatic differentiating a contraction, where naively all intermediates must be kept.`minimize='combo'`

- minimize the sum of \(FLOPS + \alpha \times WRITE\) using the default \(\alpha=64\)`minimize=f'combo-{alpha}'`

- use a custom \(\alpha\) for the combo cost

For real world contractions, where both clock speed and memory speed are important, using:

```
opt = ctg.HyperOptimizer(minimize='combo')
```

should provide decent baseline performance for typical CPUs and GPUs.

## Optimization Library¶

Optimizing the hyper parameters of the drivers which generate trees is done by one of many backend ‘ask-and-tell’ optimization libraries. The cost of asking for a new sample point and telling the resulting score should be much smaller than actually generating the trees themselves. However, for Gaussian process bayesian optimization, for example, the cost can become considerable once a few hundred trials have been fitted. As such, you may want to use a different library or algorithm. The options are:

optuna -

**Tree of Parzen Estimators**used by defaultbaytune -

*Bayesian Tuning and Bandits*-**Gaussian Processes**used by defaultchocolate - the

**CMAES**optimization algorithm is used by default (`sampler='QuasiRandom'`

also useful)skopt - random forest as well as Gaussian process regressors (high quality but slow)

nevergrad - various population and evolutionary algorithms (v fast & suitable for highly parallel path findings)

These are specified like:

```
opt = ctg.HyperOptimizer(optlib='nevergrad')
```

To select which algorithm the optimizer uses supply the `sampler=`

kwarg, other
options to instantiate the optimizer are also passed in as kwargs.

If you don’t want to use optimization (either because no library is installed or for the least overhead), you can supply:

```
opt = ctg.HyperOptimizer(optlib='random')
```

which will just randomly sample the space of hyper parameters.

### Optimization Space¶

You can view (and modify) the actual hyper parameter space of each driver with the following call:

```
ctg.get_hyper_space()['greedy']
```

```
{'random_strength': {'type': 'FLOAT_EXP', 'min': 0.001, 'max': 1.0},
'temperature': {'type': 'FLOAT_EXP', 'min': 0.001, 'max': 1.0},
'costmod': {'type': 'FLOAT', 'min': 0.0, 'max': 50.0}}
```

## Parallelization¶

`cotengra`

can fairly naturally parallelize over the trials of the hyper
optimization, and does so by default `HyperOptimizer(parallel='auto')`

.
The basic `parallel`

options are:

`parallel=False`

- no parallelization`parallel=True`

- alias for using default parallelization and number of workers`parallel=X:int`

- alias for using default parallelization but specify number of workers`parallel=pool`

- supply your own`concurrent.futures.ProcessPoolExecutor`

compatible pool

For non-distributed pools the default number of workers is computed, in order of preference, as:

`COTENGRA_NUM_WORKERS`

environment variable`OMP_NUM_THREADS`

environment variable`os.cpu_count()`

`parallel='concurrent.futures'`

¶

The default parallelization is the python standard library.
A cached `ProcessPoolExecutor`

is used.

`parallel='loky'`

¶

Use the `loky`

(also fallback to
`joblib.externals.loky`

) reusable pool.

`parallel='dask'`

¶

Use the `dask`

distributed scheduler. If a `dask.distributed.Client`

already exists then it will be used, otherwise a new one will be created. You can
also supply a `dask.distributed.Client`

object directly.

`parallel='ray'`

¶

Use the `ray`

distributed scheduler, with
potentially better performance than `dask`

. A `ray`

runtime will be created if
not already running.

### Others such as `mpi4py`

¶

Other libraries that conform to the `concurrent.futures`

interface can be used,
for example the `mpi4py`

library.

Warning

There are two caveats to parallelization to consider:

Since trials have to be pre dispatched, a delay is introduced between sampling and reporting trials to the underlying optimizer

If the underlying optimizer cannot suggest trials quickly enough, or the pre dispatch is not large enough, workers can become idle.

Consider using `optlib='nevergrad'`

or `optlib='random'`

if you are optimizing
with many tens or more workers to avoid being bottlenecked by the default Bayesian
optimization for example.

`parallel='threads'`

¶

You can specify a `ThreadPoolExecutor`

for parallelization, but this is only
useful tasks where the underlying driver releases the GIL, e.g.
`RandomGreedyOptimizer`

,
with `accel=True`

, where threads are used by default.

## Termination¶

The `HyperOptimizer`

can be terminated in
several ways.

`max_repeats=X:int`

- terminate after this many trees have been generated. The default of 128 is reasonable for simple to moderatetely complex contractions using a Bayesian optimizer.`max_time=T`

- terminate after`T`

seconds have elapsed`max_time=f'rate:{ops_per_second}'`

- terminate once the time taken searching is greater than if the best contraction found so far was performed at the rate`ops_per_second`

. This can be used to avoid spending a long time on cheap contractions. If the contraction will be performed many times, divide the rate by the commensurate factor. For example, if we have a CPU with very approximately a billion OPS/s but we expect to perform the contraction a hundred times we could use`max_time='rate:1e7'`

.`max_time='equil:{X:int}'`

- terminate if no better contraction has been found in the last`X`

trials. This can be used to avoid spending a long time on a contraction that is already converged.

Note that the `HyperOptimizer`

is stateful;
each time it is supplied to the same contraction it will reset the above
conditions and continue.

Hint

The `ReusableHyperOptimizer`

on the
other hand only runs once.

## Large graphs - `RandomGreedyOptimizer`

¶

The `HyperOptimizer`

can handle medium sizes of graph (up to 1000s of tensors), but the hyper-optimization and explicit `ContractionTree`

construction of every trial can become slow for very large contractions. An alternative more suited to very lightweight path finding that can handle 10,000s of tensors is the `RandomGreedyOptimizer`

. This only samples random greedy paths whilst simultaneously calculating each flops score, thus avoiding the heavier machinery of the hyper optimizer. Moreover, the core function `optimize_random_greedy_track_flops`

has an accelerated rust counterpart in cotengrust, that is both faster and allows threaded parallelization.

```
inputs, output, _, size_dict = ctg.utils.lattice_equation([100, 100])
opt = ctg.RandomGreedyOptimizer(
max_repeats=32,
temperature=(0.01, 0.1), # sample a range
seed=42,
)
```

It has the same call methods as all `cotengra`

optimizers:

`path = opt(inputs, output, size_dict)`

: return the best contraction path`ssa_path = opt.ssa_path(inputs, output, size_dict)`

: return the best contraction path in SSA format`tree = opt.search(inputs, output, size_dict)`

: return the best`ContractionTree`

The marginally cheapest of these is the SSA format:

```
%%time
ssa_path = opt.ssa_path(inputs, output, size_dict)
```

```
CPU times: user 3.21 s, sys: 58 ms, total: 3.26 s
Wall time: 613 ms
```

Once the optimizer has been run, the corresponding flops (log10) can be retrieved with:

```
opt.best_flops
```

```
65.02261352539062
```