
Python stack

Python is not the only language for scientific computing or data processing, but it has become the de facto standard in the domain: it is simple to use and easy to install, thanks in part to its interpreted nature, but above all it comes with a wide ecosystem of powerful libraries. You don't want to write plain Python: Python is slow when it comes to loops or per-element processing, but its scientific libraries are well optimized and rely on low-level standard implementations. So when using it, just remember one thing: don't write loops, use NumPy, Pandas, Scikit-learn, or any of the other well-implemented libraries that exist! The planet, your HPC center, or your cloud bill will thank you for it.
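
As a quick illustration of that advice, here is a minimal sketch (the array size is arbitrary) comparing a plain Python loop with the equivalent vectorized NumPy call; the exact speedup depends on your data and hardware, but the vectorized version is typically orders of magnitude faster:

```python
import numpy as np

data = np.random.rand(1_000_000)

# Plain Python loop: interpreted, one element at a time -- slow.
total = 0.0
for x in data:
    total += x * x

# Vectorized NumPy: the same sum of squares, computed in optimized C code.
total_np = np.dot(data, data)

assert np.isclose(total, total_np)
```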

Also keep in mind that although we focus on Python here because of its standard position in the scientific domain, there are other good languages to consider: C++ or Rust for performance, or Julia to get the best of both the compiled and interpreted worlds.

Here is a selection of the most common packages that we recommend for scientific computing and data science. All of these tools are well-known open-source libraries, and they come with plenty of good resources for learning how to use them.

Python scientific core stack

These are the base libraries that anyone doing scientific Python should know and use.

  • NumPy is the fundamental package for scientific computing with Python: Getting started. It focuses on N-dimensional array processing and is the basis of almost all the tools listed below. When dealing with EO data like rasters, the NumPy array is the structure you will use!
  • Pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool: Pandas cheat sheet. It is built to handle DataFrames: tabular datasets coming from CSV or Parquet files, for example.
  • SciPy provides fundamental algorithms for scientific computing in Python (optimization, integration, interpolation, signal processing, statistics, and more), built on top of NumPy.
  • Xarray makes working with labelled multi-dimensional arrays in Python simple: Tutorials. Xarray combines NumPy and Pandas to handle N-dimensional arrays using coordinates and dimensions. Built for NetCDF-like datasets (climate, oceanography), it can also be used for image stacks; a short example follows this list.
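
To make these labelled arrays concrete, here is a minimal sketch (the dimension names, sizes, and dates are purely illustrative) that wraps a NumPy array in an Xarray DataArray, the kind of structure you would use for a small image stack:

```python
import numpy as np
import pandas as pd
import xarray as xr

# A fake stack of 3 images of 4x5 pixels, as a plain NumPy array.
stack = np.random.rand(3, 4, 5)

# Wrap it with named dimensions and coordinates: this is what Xarray adds.
da = xr.DataArray(
    stack,
    dims=("time", "y", "x"),
    coords={"time": pd.date_range("2024-01-01", periods=3)},
    name="reflectance",
)

# Label-based operations instead of positional indexing.
mean_image = da.mean(dim="time")   # temporal mean, shape (4, 5)
first = da.sel(time="2024-01-01")  # select by coordinate value
```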

Geospatial and image data handling

Higher-level packages, often built on top of the base libraries, for handling geospatial or imagery datasets.

Libraries for image processing and machine learning

Python is the reference language for machine learning and deep learning libraries: most of the efficient, well-known tools offer a Python implementation or API.

  • Scikit-learn provides simple and efficient tools for predictive data analysis: Home page. A short example follows this list.
  • PyTorch is an optimized tensor library for deep learning using GPUs and CPUs.
  • TensorFlow is an end-to-end platform for machine learning, and especially deep learning.
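
As a quick taste of this ecosystem, here is a minimal sketch of scikit-learn's fit/predict estimator API on a synthetic dataset; the model choice and parameters are purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic classification data: 500 samples, 10 features.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The common scikit-learn pattern: instantiate, fit, predict.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```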

Code optimization and distribution

Again, the first step to optimizing Python code is to use well-built libraries like NumPy. But that is not always enough, and there are other solutions. Python multithreading is often bound by the GIL, so when it comes to parallelization, multiprocessing and frameworks like Dask are the correct approach, though this can come at the price of moving data between processes (see the sketch after the list below). For more information on optimization and distribution, see the associated page.

  • Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love: https://www.dask.org/
  • Multiprocessing is a package that supports spawning processes using an API similar to the threading module.
  • Numba translates Python functions to optimized machine code at runtime using the industry-standard LLVM compiler library. Numba-compiled numerical algorithms in Python can approach the speeds of C or FORTRAN.
  • You can also have a look at EOScale, a Python module built at CNES for the specific case of image processing using shared memory.
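
As a minimal sketch of the Dask approach mentioned above (the array size and chunking are purely illustrative), a dask.array computation mirrors the NumPy API while splitting the work into chunks that can be processed in parallel:

```python
import dask.array as da

# A 20000x20000 array split into 2000x2000 chunks; nothing is computed yet.
x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))

# Same expression as with NumPy, but built as a lazy task graph.
centered_std = (x - x.mean(axis=0)).std(axis=0)

# Trigger the actual parallel computation.
result = centered_std.compute()
print(result.shape)  # (20000,)
```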