Advanced¶
This page details the more advanced HyperOptimizer options.
First it might be useful to sketch the underlying process for performing a contraction path search:
1. A driver (such as 'greedy' or 'kahypar') along with a set of heuristic parameters (the ‘hyper-parameters’) are sampled from an ‘optimizer’.
2. These are used to build a single ContractionTree.
3. Optionally, some modification of this tree is then performed - either slicing, subtree reconfiguration, or some combination such as dynamic slicing.
4. A score is then generated for the tree, for example the contraction cost.
5. This score, the driver and hyper-parameters are fed back to the optimizer to update its search space.
6. The process is repeated, optionally in parallel, for a set number of repeats or time, with the best scoring tree being returned finally.
import cotengra as ctg
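As a minimal sketch of this loop in code (the small lattice example and the repeat count are just illustrative choices, reusing the ctg.utils helper that also appears at the end of this page):
# build a small test contraction
inputs, output, _, size_dict = ctg.utils.lattice_equation([8, 8])
# sample, build and score trees for 32 trials, returning the best ContractionTree
opt = ctg.HyperOptimizer(max_repeats=32)
tree = opt.search(inputs, output, size_dict)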
Drivers¶
The following are valid initial methods to HyperOptimizer(methods=[...]), which will tune each one and ultimately select the best performer (see the example after this list for restricting the search to a subset):
- 'kahypar' (default)
- 'greedy' (default)
- 'labels' (default if kahypar not installed) - a pure python implementation of hypergraph community detection
igraph partition based:
- 'spinglass'
- 'labelprop'
igraph dendrogram based (yields the entire community structure tree at once):
- 'betweenness'
- 'walktrap'
linegraph tree decomposition based:
- 'quickbb'
- 'flowcutter'
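For example, to restrict the search to just a subset of these drivers (an arbitrary choice of two shown here):
opt = ctg.HyperOptimizer(methods=['greedy', 'labels'])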
You can also register your own methods to add to the mix with register_hyper_function.
Slicing and Subtree Reconfiguration¶
There are three types of ‘post-processing’ of trees that the hyper optimizer can perform. For details of what each of these does, or how to perform them outside of the hyper optimization loop, see the ‘Tree Surgery’ sections:
Basic slicing - slicing_opts¶
This removes indices from the contraction greedily one by one until the
specified options are met, without changing the tree structure at all. This is cheap
to perform but can incur significant ‘slicing overhead’ for complicated
contractions. To turn it on supply a dict to slicing_opts, which will be supplied as kwargs to ContractionTree.slice:
opt = ctg.HyperOptimizer(slicing_opts={"target_size": 2**28})
Or if we want paths with at least 1024 independent contractions we could use:
opt = ctg.HyperOptimizer(slicing_opts={"target_slices": 2**10})
Note that you need to work with the ContractionTree object rather than a raw path (i.e. call opt.search or quimb.tensor.TensorNetwork.contraction_tree) if you want to access the sliced indices or perform the sliced contraction (although a tree reconstructed from the raw path will be ‘sliceable’ in the sense that the indices will be easily refound).
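For instance, a sketch of retrieving the sliced indices from the tree (the sliced_inds attribute is assumed here, and inputs, output and size_dict are reused from a contraction such as the one defined at the end of this page):
opt = ctg.HyperOptimizer(slicing_opts={"target_size": 2**28})
tree = opt.search(inputs, output, size_dict)
# which indices were removed by slicing (attribute name assumed)
tree.sliced_inds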
Dynamic slicing - slicing_reconf_opts¶
This removes indices one by one, but also performs subtree reconfiguration
between removals in order to modify the tree to take account of the slicing.
Slower than basic slicing but can be much higher quality, especially when
slicing many indices of a complicated contraction. This is turned on by
supplying a dict to slicing_reconf_opts, which will be supplied as kwargs to ContractionTree.slice_and_reconfigure.
If both basic and dynamic slicing are specified then the basic slicing is
performed first.
opt = ctg.HyperOptimizer(slicing_reconf_opts={"target_size": 2**28})
Subtree reconfiguration - reconf_opts¶
Finally, one can perform subtree reconfiguration alone - ranging over subtrees
of the main tree optimally reconfiguring them locally to lower the target
score globally. This is turned on by supplying a dict
to reconf_opts, which will be supplied as kwargs to ContractionTree.subtree_reconfigure.
The ‘reconfing’ will be performed after any slicing.
opt = ctg.HyperOptimizer(reconf_opts={})
Note you just supply an empty dict if you want to use the default options. Some
parameters, such as the score function to minimize (see next section), will be
set for you to match the global objective if not specified. You can supply
reconf_opts={'forested': True, ...} to use forested subtree reconfiguration.
Note
All three of these can be applied successively:
opt = ctg.HyperOptimizer(
slicing_opts={'target_size': 2**40}, # first do basic slicing
slicing_reconf_opts={'target_size': 2**28}, # then advanced slicing with reconfiguring
reconf_opts={'subtree_size': 14}, # then finally just higher quality reconfiguring
)
Simulated Annealing - simulated_annealing_opts¶
cotengra has an implementation of simulated annealing along the lines of arXiv:2108.05665, the core function of which is simulated_anneal_tree. You can turn this on by supplying a dict to simulated_annealing_opts, which will be supplied as kwargs to the above function:
opt = ctg.HyperOptimizer(simulated_annealing_opts={})
There is support for slicing by supplying the target_size in this dict. The simulated anneal is performed first thing after generating the initial tree, and can still be combined with any of the other processing steps above.
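For example, to combine the anneal with slicing to an (illustrative) target size:
opt = ctg.HyperOptimizer(simulated_annealing_opts={"target_size": 2**28})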
Objective¶
What the optimizer (and subtree reconfiguration) considers the ‘best’ tree or objective function is set with the minimize kwarg.
- minimize='flops' - minimize the total number of scalar operations.
- minimize='size' - minimize the size of the largest intermediate tensor.
Hint
Generally, to control the memory usage of a contraction it is better to use slicing rather than targeting the size directly.
- minimize='write' - minimize the total amount of ‘memory written’, i.e. the sum of sizes of all intermediate tensors. This targets memory speed, but is best weighted with the flops cost as well - see below. This score is also relevant for automatically differentiating a contraction, where naively all intermediates must be kept.
- minimize='combo' - minimize the sum of \(FLOPS + \alpha \times WRITE\) using the default \(\alpha=64\).
- minimize=f'combo-{alpha}' - use a custom \(\alpha\) for the combo cost.
For real world contractions, where both clock speed and memory speed are important, using:
opt = ctg.HyperOptimizer(minimize="combo")
should provide decent baseline performance for typical CPUs and GPUs.
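If memory bandwidth is more of a bottleneck for your hardware you can weight the write term more heavily, using the 'combo-{alpha}' format above (the value 256 is just an illustration):
opt = ctg.HyperOptimizer(minimize='combo-256')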
Optimization Library¶
Optimizing the hyper parameters of the drivers which generate trees is done by one of many backend ‘ask-and-tell’ optimization libraries. The cost of asking for a new sample point and telling the resulting score should be much smaller than actually generating the trees themselves. However, for Gaussian process Bayesian optimization, for example, the cost can become considerable once a few hundred trials have been fitted. As such, you may want to use a different library or algorithm. The main options are:
- optuna - Tree of Parzen Estimators, used by default (high quality but only medium fast)
- cmaes - CMAES optimization algorithm (very fast and suitable for highly parallel path finding)
- nevergrad - various population and evolutionary algorithms (very fast and suitable for highly parallel path finding)
- skopt - random forest as well as Gaussian process regressors (very high quality reference but very slow)
These are specified like:
opt = ctg.HyperOptimizer(optlib="cmaes")
To select which algorithm the optimizer library uses, supply the sampler= kwarg; other options used to instantiate the optimizer are also passed in as kwargs.
If you don’t want to use optimization (either because no library is installed or for the least overhead), you can supply:
opt = ctg.HyperOptimizer(optlib="random")
which will just randomly sample the space of hyper parameters.
Optimization Space¶
You can view (and modify) the actual hyper parameter space of each driver with the following call:
ctg.get_hyper_space()["greedy"]
{'random_strength': {'type': 'FLOAT_EXP', 'min': 0.001, 'max': 1.0},
'temperature': {'type': 'FLOAT_EXP', 'min': 0.001, 'max': 1.0},
'costmod': {'type': 'FLOAT', 'min': 0.0, 'max': 50.0}}
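For example, assuming the returned mapping is the live search space used by subsequently constructed optimizers, you could narrow one of these ranges in place:
space = ctg.get_hyper_space()
# e.g. restrict the greedy 'temperature' parameter to a narrower range
space["greedy"]["temperature"]["max"] = 0.1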
Parallelization¶
cotengra can fairly naturally parallelize over the trials of the hyper optimization, and does so by default (HyperOptimizer(parallel='auto')).
The basic parallel options are:
- parallel=False - no parallelization
- parallel=True - alias for using default parallelization and number of workers
- parallel=X:int - alias for using default parallelization but specify number of workers
- parallel=pool - supply your own concurrent.futures.ProcessPoolExecutor compatible pool
For non-distributed pools the default number of workers is computed, in order of preference, as:
1. the COTENGRA_NUM_WORKERS environment variable
2. the OMP_NUM_THREADS environment variable
3. os.cpu_count()
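For example, to supply your own pool explicitly (a sketch using the standard library executor):
from concurrent.futures import ProcessPoolExecutor
pool = ProcessPoolExecutor(max_workers=4)
opt = ctg.HyperOptimizer(parallel=pool)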
parallel='concurrent.futures'¶
The default parallelization uses the Python standard library; a cached ProcessPoolExecutor is used.
parallel='loky'¶
Use the loky (with fallback to joblib.externals.loky) reusable pool.
parallel='dask'¶
Use the dask distributed scheduler. If a dask.distributed.Client already exists then it will be used, otherwise a new one will be created. You can also supply a dask.distributed.Client object directly.
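For example, a sketch of supplying a client directly (which here also starts a local cluster):
from dask.distributed import Client
client = Client(n_workers=4)
opt = ctg.HyperOptimizer(parallel=client)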
parallel='ray'¶
Use the ray distributed scheduler, with potentially better performance than dask. A ray runtime will be created if not already running.
Others such as mpi4py¶
Other libraries that conform to the concurrent.futures interface can be used, for example the mpi4py library.
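A sketch using the mpi4py futures pool (assuming an MPI environment is available):
from mpi4py.futures import MPIPoolExecutor
opt = ctg.HyperOptimizer(parallel=MPIPoolExecutor(max_workers=4))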
Warning
There are two caveats to parallelization to consider:
1. Since trials have to be pre-dispatched, a delay is introduced between sampling and reporting trials to the underlying optimizer.
2. If the underlying optimizer cannot suggest trials quickly enough, or the pre-dispatch is not large enough, workers can become idle.
Consider using optlib='cmaes' or optlib='random' if you are optimizing with many tens or more workers, to avoid being bottlenecked by the default Bayesian optimization, for example.
parallel='threads'¶
You can specify a ThreadPoolExecutor for parallelization, but this is only useful for tasks where the underlying driver releases the GIL, e.g. RandomGreedyOptimizer with accel=True, where threads are used by default.
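For example (a sketch; accel=True relies on the accelerated cotengrust counterpart mentioned in the last section of this page):
# threads are then used by default for the random greedy repeats
opt = ctg.RandomGreedyOptimizer(max_repeats=256, accel=True)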
Termination¶
The HyperOptimizer can be terminated in several ways.
- max_repeats=X:int - terminate after this many trees have been generated. The default of 128 is reasonable for simple to moderately complex contractions using a Bayesian optimizer.
- max_time=T - terminate after T seconds have elapsed.
- max_time=f'rate:{ops_per_second}' - terminate once the time taken searching is greater than if the best contraction found so far was performed at the rate ops_per_second. This can be used to avoid spending a long time on cheap contractions. If the contraction will be performed many times, divide the rate by the commensurate factor. For example, if we have a CPU with very approximately a billion OPS/s but we expect to perform the contraction a hundred times we could use max_time='rate:1e7'.
- max_time='equil:{X:int}' - terminate if no better contraction has been found in the last X trials. This can be used to avoid spending a long time on a contraction that is already converged.
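For example, a sketch combining two of these conditions (assuming the search stops as soon as either one is met):
opt = ctg.HyperOptimizer(max_repeats=512, max_time='equil:64')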
Note that the HyperOptimizer is stateful; each time it is supplied to the same contraction it will reset the above conditions and continue.
Hint
The ReusableHyperOptimizer, on the other hand, only runs once.
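A sketch of constructing one (assuming it accepts the same keyword arguments as HyperOptimizer):
opt = ctg.ReusableHyperOptimizer(max_repeats=128)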
Large graphs - RandomGreedyOptimizer¶
The HyperOptimizer can handle medium-sized graphs (up to 1000s of tensors), but the hyper-optimization and explicit ContractionTree construction of every trial can become slow for very large contractions. An alternative more suited to very lightweight path finding, which can handle 10,000s of tensors, is the RandomGreedyOptimizer. This only samples random greedy paths whilst simultaneously calculating each flops score, thus avoiding the heavier machinery of the hyper optimizer. Moreover, the core function optimize_random_greedy_track_flops has an accelerated rust counterpart in cotengrust, which is both faster and allows threaded parallelization.
inputs, output, _, size_dict = ctg.utils.lattice_equation([100, 100])
opt = ctg.RandomGreedyOptimizer(
max_repeats=32,
temperature=(0.01, 0.1), # sample a range
seed=42,
)
It has the same call methods as all cotengra optimizers:
- path = opt(inputs, output, size_dict): return the best contraction path
- ssa_path = opt.ssa_path(inputs, output, size_dict): return the best contraction path in SSA format
- tree = opt.search(inputs, output, size_dict): return the best ContractionTree
The marginally cheapest of these is the SSA format:
%%time
ssa_path = opt.ssa_path(inputs, output, size_dict)
CPU times: user 3.21 s, sys: 58 ms, total: 3.26 s
Wall time: 613 ms
Once the optimizer has been run, the corresponding flops (log10) can be retrieved with:
opt.best_flops
65.02261352539062