Code Optimization and Distribution
There are several reasons to optimize, parallelize, or distribute code:
- Efficiency: minimize the time to result (operational constraint) and the carbon footprint of resource usage
- Handling huge data volumes and memory usage (hardware constraint): EO data is often big, and might need to be processed tile by tile, or on a distributed system
Optimization and/or parallelization should not be done if not needed, especially at the cost of code readability: unreadable code might incur more carbon impact in itself than the efficiency gains save.
As we often deal with big data products or even huge datasets, another important point to consider is how the data itself is stored: IO efficiency and disk volume are often critical parts of optimizing a processing chain.
Monitoring and profiling
The first step of optimization is to understand how our program uses resources over its execution:
- Quantify resource usage across time, which is referred to as Monitoring,
- Quantify resource usage by code parts, referred to as Profiling.
Monitoring
The questions we are trying to answer here are: which resources is our program using, at which point, and where are the bottlenecks? A program can be IO bound, CPU bound, memory bound, or, more often, a mix of all three.
Several tools can help us understand our code's behavior:
- Low-level command line tools: iostat, top, dstat, which mainly print resource usage (optionally to a file) at a given time step. There is also nvidia-smi for GPU usage.
- Higher-level tools, like the monitoring module provided by the CNES HPC team, which makes use of low-level tools and Python scripting to produce ready-to-use plots.
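If you need a quick, scriptable alternative to these tools, the sketch below samples CPU and memory usage at a fixed time step. It relies on the psutil library, which is this example's assumption, not one of the tools mentioned above.

```python
import time

import psutil  # assumption: a common Python library for system metrics

def monitor(period_s: float = 5.0) -> None:
    """Print CPU and memory usage at a fixed time step (stop with Ctrl+C)."""
    while True:
        cpu = psutil.cpu_percent(interval=period_s)  # averaged over the period
        mem = psutil.virtual_memory().percent
        print(f"{time.strftime('%H:%M:%S')} cpu={cpu:5.1f}% mem={mem:5.1f}%")

if __name__ == "__main__":
    monitor()
```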
Profiling
Profiling is achieved by instrumenting either the program source code or its binary executable form using a tool called a profiler (Wikipedia). It allows a finer grain of detail than monitoring, linking resource usage to specific parts of your code.
Several tools exist for such measurements:
- Intel VTune is a reference for Fortran or C code. It can also be used for high-level Python profiling through the Application Performance Report.
- Valgrind is another well-known profiler for Fortran and C code.
- Python also provides a set of tools for profiling:
  - The cProfile module is included in Python's standard library, and its output can then be visualized using SnakeViz (see the sketch after this list)
  - memory-profiler is a library that can be installed and used to watch a Python program's memory usage. It is not maintained anymore though.
  - Memray seems like a good alternative!
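As an illustration of the cProfile and SnakeViz workflow mentioned above, here is a minimal sketch (the profiled functions are hypothetical stand-ins):

```python
import cProfile
import pstats

def load(n):
    # Hypothetical IO-heavy step
    return list(range(n))

def process(data):
    # Hypothetical CPU-heavy step
    return sum(x * x for x in data)

def main():
    return process(load(1_000_000))

if __name__ == "__main__":
    cProfile.run("main()", "profile.out")  # dump profiling stats to a file
    # Print the 10 most time-consuming functions...
    pstats.Stats("profile.out").sort_stats("cumulative").print_stats(10)
    # ...or explore the same file interactively with: snakeviz profile.out
```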
Basic code optimization
Once you've identified the bottlenecks of your code, and you are sure it needs to be optimized (otherwise you should spend the time improving code readability or test coverage, and after that, functionality), you can try to optimize it. A few things are always true, and should always be kept in mind:
- Loop optimization: even in compiled code, beware of your loops, especially over multi-dimensional arrays. Loop order, or how your iteration is done, can make a huge difference. In Python, it's even worse: you want to avoid looping altogether, except over really high-level and time-consuming functions.
- Vectorize: again, this is true in most languages: every modern CPU can vectorize operations, meaning it performs the same operation on several inputs at the same time. You must make sure the way you write your code allows this (see the sketch after this list).
- IOs: more on this in the next chapter, but reading data from disk, from memory, or from CPU cache happens on totally different time scales. While some programs will go as far as avoiding cache misses, reading input data appropriately from disk is already a good step!
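To make the loop and vectorization points concrete in Python, here is a minimal NumPy sketch (function names are illustrative):

```python
import numpy as np

a = np.random.random(1_000_000)

def total_loop(values):
    # Pure-Python loop: one interpreted iteration per element
    total = 0.0
    for v in values:
        total += v
    return total

def total_vectorized(values):
    # A single NumPy call: the loop runs in compiled code and can use
    # the CPU's vector (SIMD) instructions
    return values.sum()

assert np.isclose(total_loop(a), total_vectorized(a))
```

On a typical machine the vectorized version is orders of magnitude faster; compare both with timeit if you are curious.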
Data and IO optimization
In the Big Data era, which satellite datasets have entered since the early 2000s, disk IOs are crucial, and so are file formats. A couple of reasons for this:
- Optimized file formats use (a lot) less space, saving money and energy
- They are faster to read and write, improving the overall processing time
A good scientific file format for processing must offer at least the following characteristics:
- Compression: the CPU time spent uncompressing is almost always cheaper than the storage space saved, both in terms of volume and in terms of speed.
- Chunking: you must be able to read (and if possible write) data piece by piece; in the case of imagery, typically a square portion of the image. This improves compression, but also partial reading time.
This is even more true in the Cloud era, and so a consortium made a really [good guide on Cloud Optimized Geospatial Formats](https://guide.cloudnativegeo.org/)!
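As an illustration of compression and chunking together, the sketch below writes a tiled, compressed GeoTIFF. The choice of rasterio and all parameter values are this example's assumptions, not recommendations from the guide above.

```python
import numpy as np
import rasterio
from rasterio.windows import Window

# A dummy single-band 8-bit image standing in for real EO data
data = (np.random.random((4096, 4096)) * 255).astype("uint8")

with rasterio.open(
    "example.tif", "w",
    driver="GTiff",
    height=data.shape[0], width=data.shape[1],
    count=1, dtype="uint8",
    tiled=True, blockxsize=256, blockysize=256,  # chunking: 256x256 tiles
    compress="deflate",                          # lossless compression
) as dst:
    dst.write(data, 1)

# A reader can now fetch a single tile without decoding the whole image
with rasterio.open("example.tif") as src:
    tile = src.read(1, window=Window(0, 0, 256, 256))
```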
Parallelization and distribution
Parallelization (on a single machine), and even more so distribution (across several nodes), must be considered only after all other possibilities, and only if really needed. Amdahl's law must always be kept in mind, and the cost of parallelization on a program's maintenance can be heavy. Thus, it is always better to rely on already-built libraries, like OpenBLAS, the Math Kernel Library, or Boost, to do the work for you.
Compiled languages
Close-to-hardware languages like Fortran, C, or even Java first offer multithreading, allowing a computation to be divided between several threads of a program that share the same memory space. The reference tool for this in these languages is OpenMP. Note that frameworks like ITK, used by the OrfeoToolBox software, already offer multithreading capabilities.
Multiprocessing or distributed processing with these languages often relies on the Message Passing Interface (MPI) API and the OpenMPI library, although other implementations exist.
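The message-passing model is the same whatever the language. Purely for illustration, and to keep this guide's sketches in Python, here is a minimal scatter/gather example using mpi4py (Python bindings to MPI, an assumption of this example), launched with e.g. mpirun -n 4 python script.py:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this process's id
size = comm.Get_size()   # total number of processes

# Rank 0 distributes one work item to each process...
data = list(range(size)) if rank == 0 else None
item = comm.scatter(data, root=0)

# ...each process works on its own item...
result = item ** 2

# ...and rank 0 gathers all the results back
results = comm.gather(result, root=0)
if rank == 0:
    print(results)
```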
Python solutions
Python, by its interpreted design, comes with a big limitation: the dreaded Global Interpreter Lock (GIL). The GIL often prevents multithreading from being efficient in Python, leading to heavy use of multiprocessing solutions. The big drawback of multiprocessing is that processes do not easily share a single memory space, so in-memory data might be duplicated between processes.
So the solutions that can be used in Python are the following (minimal sketches of several of them follow the list):
- Concurrent programming, using asyncio. Concurrent execution is not the same as multithreading or multiprocessing, but it can help a lot in having different parts of the code executed at the same time, typically IO operations (waiting on an http call, or reading a file).
- Writing part of the code in C or Fortran, with Cython or F2PY for example. Numba can also be used to translate Python functions to optimized machine code at runtime and approach the speeds of C or FORTRAN.
- Multiprocessing, using the accordingly named module, can be a basic but powerful solution in some cases.
- Dask is one of the reference distributed processing Python frameworks for scientific computing, allowing you to scale NumPy, Pandas, or any Python code to bigger-than-memory datasets on a single computer, or up to a cluster of nodes.
- Other open source solutions like PySpark or Ray also exist.
- EOScale has been built at CNES to provide a tool for parallelizing Python image processing on a single node using the shared_memory module.
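Below are a few minimal sketches of these options; all names, sizes, and parameters are illustrative. First, concurrent IO with asyncio:

```python
import asyncio

async def download(name: str) -> str:
    # Stand-in for an IO-bound operation (http call, file read, ...)
    await asyncio.sleep(1)
    return f"{name} done"

async def main():
    # The three "downloads" wait concurrently: ~1s total instead of ~3s
    results = await asyncio.gather(*(download(n) for n in "abc"))
    print(results)

asyncio.run(main())
```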
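A Numba sketch: the decorated function is compiled to machine code on its first call, so the explicit loop runs at native speed:

```python
import numpy as np
from numba import njit

@njit
def count_above(values, threshold):
    # An explicit loop that would be slow in pure Python
    count = 0
    for v in values:
        if v > threshold:
            count += 1
    return count

print(count_above(np.random.random(1_000_000), 0.5))
```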
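A basic multiprocessing sketch, spreading a CPU-bound function over a pool of worker processes:

```python
from multiprocessing import Pool

def expensive(x: int) -> int:
    # Stand-in for a CPU-bound task
    return sum(i * i for i in range(x))

if __name__ == "__main__":  # required: child processes re-import this module
    with Pool(processes=4) as pool:
        results = pool.map(expensive, range(10_000, 10_008))
    print(results)
```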
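And finally a Dask sketch, scaling a NumPy-style computation to an array that never has to fit in memory all at once:

```python
import dask.array as da

# A 20000x20000 float64 array (~3.2 GB) split into 2000x2000 chunks
x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
result = (x - x.mean()).std()  # lazy: only builds a task graph
print(result.compute())        # executes chunk by chunk, in parallel
```

The same code can run on a laptop or, with a distributed scheduler, on a cluster of nodes.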
At CNES
Plenty of documentation covering this whole domain is available in the CNES Computing Center documentation.