.. _petsc-advanced:

=======================================
GGCE-PETSc interface: advanced features
=======================================

In this tutorial we go over two advanced features enabled by the GGCE-PETSc interface: the double-parallel scheme and the basis-matrix separation, which together unlock large-scale phonon configuration calculations.

Double-parallel scheme
----------------------

In Sec. :ref:`parallel`, we alluded to the double-parallel scheme of running GGCE. To recap: in the double-parallel scheme, the calculation is MPI-parallelized BOTH across different :math:`(k,\omega)` points, AND within the solution of the matrix equation at each individual point.

To enable and control this feature when running GGCE-PETSc, we use the ``brigade_size`` keyword when instantiating the Solver.

.. code-block:: python

    COMM = MPI.COMM_WORLD
    solver = MassSolverMUMPS(system=system, mpi_comm=COMM, brigade_size=1)

This is the only syntactic change required to enable this scheme.

By a "brigade", we mean a group of MPI processes that collectively work on a single :math:`(k,\omega)` point. The keyword ``brigade_size`` is used in the code to split all MPI processes into brigades of **equal** size ``brigade_size``. Each brigade works collectively on the matrix for a single :math:`(k,\omega)` point, and together the brigades cover the entire :math:`(k,\omega)` space. By default, ``brigade_size=WORLD_SIZE``, meaning that there is only one brigade, and the Solver proceeds sequentially through all :math:`(k,\omega)` points whilst maximally distributing the matrix of equations across all available MPI processes. The main limitation of this keyword is that the number of MPI processes must be evenly divisible by ``brigade_size``; otherwise, an error is raised and the calculation is terminated.

.. note::

    Due to the specifics of the PETSc API, the number of "jobs" fed into ``Solver.greens_function(k, w, eta)`` -- that is, ``len(k)*len(w)`` -- is required to be evenly divisible by the number of brigades (otherwise the PETSc solvers hang indefinitely). The Solver automatically extends (pads) the :math:`k, \omega` arrays with the minimum number of extra points needed to prevent this. The padding is applied to the :math:`\omega` array by default, unless it is of length 1: in that case, the :math:`k` array is padded instead. The padding is always done outside the range specified by the user, so the input :math:`k, \omega` arrays are NOT redefined but merely extended, and the user retains full control over where the calculations are carried out. All calculations, including those at the padded points, are returned by ``greens_function``. This is why the output array of ``greens_function()`` may sometimes have slightly larger dimensions than the original input arrays. This is easily fixed by indexing the output array up to the lengths of the original :math:`k, \omega` arrays:

    .. code-block:: python

        result = solver.greens_function(k, w, eta)
        result_notpadded = result[:len(k), :len(w)]

The optimal choice of ``brigade_size`` is ultimately up to the user. For small systems where the matrix can easily be stored and solved by a single CPU, the optimal speed-up is achieved with maximum splitting, i.e. ``brigade_size=1``. When dealing with larger matrices (say of dimension more than 100,000), the optimal speed-up will have to be determined by experimentation. In practice, ``brigade_size`` should be just large enough for the matrix solution to be feasible within MUMPS in terms of memory and time, so that the largest possible number of independent brigades can be formed. For the calculation in the previous section, the matrices being solved are quite small (merely hundreds of equations). This allows us to set ``brigade_size=1`` and enjoy linear speed-up in the calculation, as in the sketch below.
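Putting the pieces together, a minimal sketch of a double-parallel run might look as follows. The import path for ``MassSolverMUMPS``, the ``system`` object, and the :math:`k, \omega, \eta` grids below are assumptions carried over from the previous tutorials and should be adapted to your own setup.

.. code-block:: python

    import numpy as np
    from mpi4py import MPI

    # Assumed import path -- adjust to wherever MassSolverMUMPS lives in your
    # GGCE installation
    from ggce.executors.petsc4py.solvers import MassSolverMUMPS

    COMM = MPI.COMM_WORLD

    # Hypothetical choice: brigades of 4 ranks each; the world size must be
    # evenly divisible by brigade_size, otherwise the Solver raises an error
    brigade_size = 4
    assert COMM.Get_size() % brigade_size == 0

    # `system` is assumed to have been constructed as in the previous tutorials
    solver = MassSolverMUMPS(system=system, mpi_comm=COMM, brigade_size=brigade_size)

    k = np.linspace(0.0, np.pi, 20)    # illustrative momentum grid
    w = np.linspace(-3.0, -1.0, 101)   # illustrative frequency grid
    eta = 0.005                        # illustrative broadening

    result = solver.greens_function(k, w, eta)

    # Discard any padded points appended by the Solver (see the note above)
    result = result[:len(k), :len(w)]

Launched with, e.g., ``mpirun -n 16 python script.py``, this splits the 16 ranks into four brigades of four processes each, with the brigades sweeping the :math:`(k,\omega)` grid in parallel.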
Basis-matrix separation
-----------------------

Without the double-parallel scheme, GGCE is limited to phonon configurations whose matrices can be solved reasonably quickly by SciPy's sparse direct solver (dimensions up to roughly 100,000). Using the double-parallel scheme removes this restriction -- if a suitable number of MPI processes (CPUs) is available, the matrix can be appropriately segmented, stored and solved in parallel, allowing much larger matrices to be handled.

However, one memory bottleneck still remains in GGCE: the basis. The basis, as computed and stored inside the ``System`` object, is at its core a Python ``dict`` that encodes the structure of the equations induced by the chosen phonon cloud variational configuration. It does this agnostically of the particular values of :math:`k, \omega, \eta`, which is a great advantage: the basis only needs to be computed once and can then be used to generate the matrices for an entire range of :math:`k, \omega` values.

And yet the basis itself constitutes a problem for calculations with large configurations. Being a ``dict``, the basis is inherently not parallelizable: it cannot be shared across MPI processes. Instead, each process **has an individual copy** of the entire basis, resulting in enormous redundancy and memory usage.

One solution to this problem is to change the basis generation and storage approach: this is one of the goals of the next release. In the meantime, another solution is to separate the computation of the basis from the actual calculation of the Green's function, by pre-computing and storing on disk the matrices to be solved by the sparse direct solver.

This is accomplished within the ``MassSolverMUMPS`` in two stages. First, we define the directories where the matrices to be solved and the ``Model`` and ``System`` objects will be saved:

.. code-block:: python

    matr_dir = p["root_matr_dir"]
    root_dir = p["root_sys"]

    model = Model.from_parameters(...)
    model.add_(...)

Next, we create the first ``MassSolverMUMPS`` instance, the purpose of which is to create the matrices and save them to disk.

.. code-block:: python

    sysgen_petsc = MassSolverMUMPS(
        system=System(model),
        root=root_dir,
        mpi_comm=COMM,
        matr_dir=matr_dir,
        brigade_size=1,
    )

Setting the ``matr_dir`` keyword determines the location where the matrices will be saved. Notice that the creation of the matrices is in principle parallelizable: while the basis is not shareable, in the sense that each MPI process will still have a redundant copy of it, groups of MPI processes can still work on generating and writing different matrices to disk. In especially dire cases, to save memory, a smaller number of MPI ranks should be used while keeping the allocated memory constant.

To prepare the matrices and save them to disk, we use the following command:

.. code-block:: python

    sysgen_petsc.prepare_greens_function(k, w, eta)
    del sysgen_petsc  # free up the memory used to store the basis

The matrices are dumped to disk using ``pickle`` in sparse matrix format, meaning that we save three arrays -- row indices, column indices, and values -- for nonzero entries only. They are saved with filenames formatted as ``matr_at_k_{kval:.10f}_w_{wval:.10f}_e_{etaval:.10f}.pkl``.
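For inspection or debugging, a stored matrix can be loaded back by hand. The sketch below assumes the pickle holds exactly the three arrays (row indices, column indices, values) described above -- the actual on-disk layout may differ, so adapt the unpacking accordingly -- and uses hypothetical values of ``kval``, ``wval`` and ``etaval``.

.. code-block:: python

    import pickle
    from pathlib import Path

    from scipy.sparse import coo_matrix

    # Hypothetical (k, w, eta) point at which a matrix was prepared
    kval, wval, etaval = 0.5, -2.0, 0.005

    fname = Path(matr_dir) / f"matr_at_k_{kval:.10f}_w_{wval:.10f}_e_{etaval:.10f}.pkl"
    with open(fname, "rb") as f:
        row, col, vals = pickle.load(f)  # assumed layout: three COO-style arrays

    matrix = coo_matrix((vals, (row, col)))
    print(f"matrix shape: {matrix.shape}, nonzero entries: {matrix.nnz}")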
Once the matrices are written to disk, the solver object, which contains the basis, is deleted to free up memory.

Next, we create the Solver that will actually do the solving. Notice that we need to create a new ``System`` object with the ``autoprime=False`` keyword AND to pass this keyword to ``MassSolverMUMPS``, to make sure that the basis is not generated when we pass the ``System`` to the Solver.

.. code-block:: python

    system_unprimed = System(model, autoprime=False)

    executor_petsc = MassSolverMUMPS(
        system=system_unprimed,
        root=root_dir,
        mpi_comm=COMM,
        matr_dir=matr_dir,
        autoprime=False,
        brigade_size=1,
    )

    results_petsc = executor_petsc.greens_function(k, w, eta)

This second Solver then looks in the ``matr_dir`` directory and loads the matrices from there, constructing the filenames from the ``k, w, eta`` values it was given.

We have shown how to run a basis-matrix-separated calculation within a single Python script. Alternatively, it can be run as two separate, sequential calculations (i.e. with different batch job scripts), using different memory and CPU counts for each stage to optimize resource usage, as sketched below.
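As a rough illustration of the two-script workflow -- the file name and the ``prepare``/``solve`` command-line flag below are our own conventions, not part of the GGCE API -- both stages can live in a single driver that is submitted twice with different resources. The ``model``, ``root_dir``, ``matr_dir`` and ``k, w, eta`` definitions are assumed to be the same as in the snippets above.

.. code-block:: python

    # run_ggce.py -- illustrative two-stage driver
    # MassSolverMUMPS, System, model, root_dir, matr_dir, k, w, eta are
    # assumed to be defined/imported as in the snippets above
    import sys

    from mpi4py import MPI

    COMM = MPI.COMM_WORLD
    stage = sys.argv[1]  # "prepare" or "solve"

    if stage == "prepare":
        # Stage 1: the basis lives on every rank, so this stage is memory-heavy;
        # run it with fewer ranks and more memory per rank
        sysgen_petsc = MassSolverMUMPS(
            system=System(model),
            root=root_dir,
            mpi_comm=COMM,
            matr_dir=matr_dir,
            brigade_size=1,
        )
        sysgen_petsc.prepare_greens_function(k, w, eta)
    else:
        # Stage 2: the matrices are read back from matr_dir and no basis is
        # built; run it with as many ranks as the (k, w) grid warrants
        system_unprimed = System(model, autoprime=False)
        executor_petsc = MassSolverMUMPS(
            system=system_unprimed,
            root=root_dir,
            mpi_comm=COMM,
            matr_dir=matr_dir,
            autoprime=False,
            brigade_size=1,
        )
        results_petsc = executor_petsc.greens_function(k, w, eta)

For example, one batch job could run ``mpirun -n 8 python run_ggce.py prepare`` on a high-memory node, followed by a second job running ``mpirun -n 128 python run_ggce.py solve`` on ordinary compute nodes.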