TOPAL Working Group

Table des matières

1 Every Thursday at 09:30

2 Schedule 2024

  • 27/06/2024 : [Ana/Abel]
  • 20/06/2024 : [Alycia/Raphael]
  • 13/06/2024 : canceled
  • 06/06/2024 : [Lionel]
  • 30/06/2024 : [Philippe]
  • 28/05/2024 : [Erik]
  • 23/05/2024 : [Hayfa]
  • 21/05/2024 : [Esragul]
  • 02/05/2024 : [SOLHARIS/NUMPEX]
  • 15/04/2024 : [Jean-François]
  • 11/04/2024 : Journée EDMI
  • 04/04/2024 : canceled
  • 28/03/2024 : [Somesh]
  • 21/03/2024 : [internships]
  • 14/03/2024 : [C++ part3]
  • 08/02/2024 : [Alycia and Thomas]
  • 06/02/2024 : [Mohamed]
  • 01/02/2024 : [C++ part2]
  • 18/01/2024 : [C++ part1]
  • 11/01/2024 : [Hayfa]

2.1 Thursday 27 June 2024

2.1.1 Speakers: Ana Hourcau

2.1.2 Title: Exploitation de la multi-précision pour accélérer les calculs

2.1.3 Summary:

L'objectif de cette présentation est d'évaluer l'implémentation multi-précision d'opérations sur des matrices, telles que la factorisation de Cholesky, pour en étudier les gains en optimisation selon plusieurs critères. L'utilisation de précisions faibles peut permettre l'accélération des calculs, notamment sur les architectures avec des unités tensorielles qui exploitent largement la demi précision. L'impact numérique doit toutefois pouvoir être maitrisé pour garantir une cohérence du résultat final. Ces travaux ont été intégrés dans l'application ExaGeoStat développée par KAUST, permettant de recueillir des données de climat et météo pour prédire d'éventuelles observations manquantes. Ces prédictions nécessitent des résolutions de matrices de covariances définies positives et symétriques. Une première implémentation avait déjà été réalisée, se basant sur un critère algébrique, en définissant des bandes de régions le long de la diagonale pour appliquer la précision correspondante. En effet, sur la diagonale de la matrice, les variances sont les plus importantes, la précision appliquée doit être la plus forte. Plus on s'éloigne de la diagonale et plus la précision est faible. Les conversions sont gérées en manipulant une version de la donnée pour chaque précision, impliquant un coût mémoire à prendre en compte. Une autre implémentation utilisant le support d'exécution PaRSEC propose de diminuer le surcoût mémoire au prix de plus de conversions, intervenant à chaque appel de noyaux de calcul. On propose d'utiliser le support d'exécution StarPU afin de rendre la gestion plus dynamique et éviter d'avoir plusieurs versions des données. La conversion se fait à l'exécution, en choisissant le noyau correspondant à la précision voulue. De plus, afin de mieux adapter la précision des données, la précision de chaque tuile est basée sur un critère numérique en fonction de la norme de Frobenius de cette tuile par rapport à celle de la matrice. Enfin, lorsque la bibliothèque BLAS le permet, on utilisera le noyau GEMM étendu qui propose d'effectuer les calculs avec des matrices dans des précisions différentes, évitant ainsi le surcoût de la conversion. C'est le cas notamment de la bibliothèque cuBLAS, qui permet d'exploiter la demi précision, qui n'est disponible que sur GPU.

2.1.4 Slides: Ana-27-06-24

2.1.5 Speakers: Abel Calluaud

2.1.6 Title: Vers un solveur hiérarchique performant parallélisé par les tâches

2.1.7 Summary:

La recherche de performance pour la résolution de grands systèmes d'équations linéaires est un enjeu majeur pour de nombreuses applications scientifiques et industrielles. Le paradigme de tâches Sequential Task Flow (STF) facilite l'expression de telles opérations d'algèbres linéaires en soumettant des tâches à un support exécutif généraliste qui les ordonnance efficacement sur les ressources disponibles. L'état de l'art tire parti d'un découpage en tuiles des matrices opérandes qui permet d'extraire des tâches indépendantes exploitant efficacement les unités de calculs et mémoires caches de l'architecture matérielle. La complexité algorithmique cubique de la factorisation directe reste cependant prohibitive pour de grandes tailles de matrice. Les approximations de rang faible permettent de réduire cette complexité en exploitant la structure de rang faible inhérente à certains problèmes comme ceux rencontrés en électromagnétisme ou en géophysique. Ainsi, la structure de rang permet de compresser certaines tuiles sous forme \(B=UV^t\), c'est le format Tile Low Rank (TLR) que nous avons introduit dans la bibliothèque CHAMELEON. Pour ce faire, nous avons intégré les noyaux de calcul nécessaires dans la librairie RAPACK pour manipuler des combinaisons de matrices stockées sous forme compressée ou dense. En réduisant le coût mémoire et calculatoire, les résultats préliminaires ont montré que l'approche améliore le passage à l'échelle pour la factorisation de matrices de grande taille. Notre implémentation repose sur les algorithmes tuilés existants de CHAMELEON et du support exécutif basé sur les tâches StarPU. Le modèle de programmation a récemment été étendu avec le support de tâches récursives, qui permet d'exprimer des tâches qui peuvent dynamiquement soumettre un sous-graphe. C'est un verrou important pour l'extension prévue de notre travail au format Tile Hierarchical Low Rank (THLR) qui gagne un ordre de grandeur supplémentaire par rapport à TLR en appliquant récursivement le découpage en tuiles. La librairie CHAMELEON-HMAT fournit une telle implémentation, mais le grain des tâches est fixé statiquement par la taille de tuiles, limitant sévèrement le parallélisme exploitable. Notre projet CHAMELEON-RAPACK vise à exprimer les opérations via les tâches récursives, rendant possibles de nouvelles optimisations pour adapter l'exécution à des ressources matérielles hétérogènes.

2.1.8 Slides: Abel-27-06-24

2.2 Thursday 20 June 2024

2.2.1 Speakers: Alycia Lisito

2.2.2 Title: Efficient HPL on top of runtime systems

2.2.3 Summary:

De nos jours, les machines sont de plus en plus grosses, complexes et hétérogènes. Cela rend plus difficile d'avoir un code générique qui soit perfomant, indifféremment de l'architecture de la machine. Afin d'atteindre cet objectif, de plus en plus d'algorithmes sont implémentés sur des supports d'exécution à base de tâches. Dans ce papier, nous proposons une implémentation de HPL sur ces supports d'execution à base de taches. Pour cela, nous avons implémenté l'algorithme dans Chameleon, qui est une bibliothèque déjà optimisée et performante, avec le support d'exécution StarPU. Nous expliquons comment arriver à une implementation performance et montrons comment nous arrivons à obtenir 93% en pic de performance par rapport au code de référence sans pivotage.

2.2.4 Slides: Alycia-20-06-24

2.2.5 Speakers: Raphael Bourgouin

2.2.6 Title: Parallelism in mixture of experts decoder models

2.2.7 Summary:

Nowadays, the amount of digital data is increasing exponentially, and so is the importance of machine learning, especially natural language processing models. However, training these models requires an increasing quantity of resources, which poses a great challenge. This talk aims to improve the efficiency of parallel training by utilizing the possibilities offered by mixture of experts in decoder models.

2.2.8 Slides: Raphael-20-06-24

2.3 Thursday 06 June 2024

2.3.1 Speakers: Lionel Eyraud-Dubois

2.3.2 Title: Tightening I/O lower bounds through the hourglass dependency pattern

2.3.3 Summary:

When designing an algorithm, one cares about arithmetic/computational complexity, but data movement (I/O) complexity plays an increasingly important role that highly impacts performance and energy consumption. For a given algorithm and a given I/O model, scheduling strategies such as loop tiling can reduce the required I/O down to a limit, called the I/O complexity, inherent to the algorithm itself. The objective of I/O complexity analysis is to compute, for a given program, its minimal I/O requirement among all valid schedules. We consider a sequential execution model with two memories, an infinite one, and a small one of size \(S\) on which the computations retrieve and produce data. The I/O is the number of reads and writes between the two memories. We identify a common “hourglass pattern” in the dependency graphs of several common linear algebra kernels. Using the properties of this pattern, we mathematically prove tighter lower bounds on their I/O complexity, which improves the previous state-of-the-art bound by a parametric ratio. This proof was integrated inside the IOLB automatic lower bound derivation tool.

2.3.4 Slides: Lionel-06-06-24

2.4 Thursday 30 May 2024

2.4.1 Speakers: Philippe Swartvagher

2.4.2 Title: Making reproducible and publishable large-scale HPC experiments

2.4.3 Summary:

For a long time, scientific publications focused only on experimental results, ignoring how, concretely, the results were obtained, making difficult for readers, but also for the author, to reproduce the experiments. Things are slowly changing: publication of so-called "artifacts" are encouraged by journals and conferences. However, releasing scripts and programs used for experiments can be challenging: how to organize the material? how to clearly document the instructions? how to ensure reproducibility of the experiments? how to ensure long-term availability? Several answers are possible to all these questions. In this talk, I will try to summarize how and why my methodology to build reproducible artifacts evolved over several years in the research area.

2.4.4 Slides: Philippe-30-05-24

2.5 Tuesday 28 May 2024

2.5.1 Speakers: Erik Saule

2.5.2 Title: Optimizing Distributed Dataflow Graph Algorithms

2.5.3 Summary:

Dataflow graph algorithms are algorithms where the calculation flow on vertices based on a total order. These dataflow algorithms can be used to compute graph colorings, maximum independent sets, matchings, and other. One of the bottleneck in these algorithms is the longest path induced by the total order. And these total orders are often derived from a random draw. We show statistically that we can derive better orders on some graphs (which exhibit small world properties) by changing the random order generation method at no additional runtime cost. We provide the basis for analyzing formally the length of the longest path of a category of Erdos-Renyi graphs.

2.5.4 Slides: Erik-28-05-24

2.6 Thursday 23 May 2024

2.6.1 Speakers: Hayfa Tayeb

2.6.2 Title: Dynamic Tasks Scheduling with Multiple Priorities on Heterogeneous Computing Systems

2.6.3 Summary:

The efficient utilization of heterogeneous computing systems is crucial for scientists and industrial organizations to execute computationally intensive applications. Task-based programming has emerged as an effective approach for harnessing the processing power of these systems. However, effective scheduling of task-based applications is critical for achieving high performance. Typically, these applications are represented as directed acyclic graphs (DAGs), which can be optimized through careful scheduling to minimize execution time and maximize resource utilization. In this paper, we introduce MultiPrio, a dynamic task scheduler that aims to minimize the overall completion time of parallelized task-based applications. The goal is to find a trade-off between resource affinity, task criticality, and workload balancing on the resources. To this end, we compute scores for each task and manage the available tasks in the system with a data structure based on a set of priority queues. Tasks are assigned to available resources according to these scores, which are dynamically computed by heuristics based on task affinity and criticality. We also consider workload balancing across resources and data locality awareness. To evaluate the scheduler, we study the performance of dense and sparse linear algebra task-based applications and task-based FMM application using the StarPU runtime system on heterogeneous nodes. Our scheduler shows interesting results compared to other state-of-the-art schedulers in StarPU for regular applications, and excels at optimizing irregular workloads, improving performance by up to 31%.

2.7 Tuesday 21 May 2024

2.7.1 Speakers: Esragul Korkmaz

2.7.2 Title: $1.25(1+ε)$-Approximation Algorithm for Scheduling with Rejection

2.7.3 Summary:

We address an offline job scheduling problem where jobs can either be processed on a limited supply of energy-efficient machines, or offloaded to energy-inefficient machines (with an unlimited supply), and the goal is to minimize the total energy consumed in processing all tasks. This scheduling problem can be formulated as a problem of scheduling with rejection, where rejecting a job corresponds to process it on an energy-inefficient machine and has a cost directly proportional to the processing time of the job. To solve this scheduling problem, we introduce a novel \(\frac{5}{4}(1+\epsilon)\) approximation algorithm (BEKP) by associating it to a Multiple Subset Sum problem. Our algorithm is an improvement over the existing literature, which provides a (\(\frac{3}{2} - \frac{1}{2m}\)) approximation for scheduling with arbitrary rejection costs. We evaluate and discuss the effectiveness of our approach through a series of experiments, comparing it to existing algorithms.

2.7.4 Slides: Esra-20-05-24

2.8 Thuesday 02 May 2024

2.8.1 ExaSoft 02/05

  • 10h-10h30 : Présentation générale des WP3 (supports d'exécution) et WP4 (bibliothèques numériques)
  • 10h30-11h : Dynamic Task Graph Adaptation with Recursive Tasks (Thomas Morin)
  • 11h-11h30 : Integration of asynchronous network communications scheduling and local task scheduling (Alexandre Denis)
  • 11h30-12h : titre à venir (Bérenger Bramas ou Samuel Thibault)
  • 14h-14h30 : Towards scalable dense and sparse linear algebra using task-based programming models (Antoine Jego)
  • 14h30-15h : Efficient implementation of approximate computing algorithms (Alfredo Buttari)
  • 15h30-16h : Celeste: A task-parallel framework for tensor computations (Oguz Kaya)
  • 16h-16h30 : Extension of Chameleon to small dimension tensors for large distributed systems with applications to deep neural networks (Brieuc Nicolas)

2.8.2 Solharis 03/05

  • 9:00-9:25 : Algorithmes d'allocation statique pour la planification d'applications haute performance (Lionel Eyraud-Dubois)
  • 9:25-9:50 : Batching small tasks on top of runtime systems for HPL (Alycia Lisito)
  • 9:50-10:15 : Ordonnancement sous contrainte mémoire dans un modèle de programmation à base de tâches (Loris Marchal)
  • 10:30-10:55 : De l'interaction entre les supports d'exécution à tâches HPC et les bibliothèques de communications (Philippe Swartvagher)
  • 10:55-11:20 : Programmation des architectures hétérogènes à l'aide de tâches divisibles (Pierre-André Wacrenier)
  • 11:20-11:55 : Solveurs rapides pour l'aéroacoustique haute fréquence (Emmanuel Agullo)

2.9 Monday 15 April 2024

2.9.1 Speakers: Jean-François

2.9.2 Title: StarONNX: a Dynamic Scheduler for Low Latency and High Throughput Inference on Heterogeneous Resources

2.9.3 Summary:

Efficient execution of Deep Neural Network (DNN) models on heterogeneous processors is challenging, not only because of the heterogeneity of CPUs and hardware accelerators, but also because the problem is fundamentally bi-objective in many contexts, since both latency (time to perform an inference) and throughput (number of inferences per unit time) need to be optimized. We present StarONNX, a solution based on integrating ONNX Runtime in StarPU, which aims to optimize the distribution of inference tasks and resource management on heterogeneous architectures. This strategy relies on (i) the efficient execution of deep learning models by ONNX Runtime to maximize efficiency, and (ii) the orchestration of heterogeneous resources by StarPU to provide sophisticated scheduling and overlapping strategies for computation and communication. An essential point of the framework is the ability to split a CNN into two parts, one running on the GPU and the other on the CPU, thus increasing throughput by using all possible resources without compromising latency. We show that integrating the ONNX runtime into StarPU does not introduce significant overhead. We also evaluated our approach against the Triton Inference Server and showed a significant improvement in resource utilization and reduced latency.

2.10 Thursday 28 March 2024

2.10.1 Speakers: Somesh Singh

2.10.2 Title: High-performance sparse tensor computations

2.11 Thursday 21 March 2024

2.11.1 Speaker: Brieuc Nicolas

2.11.2 Title: Extending Chameleon to tensors contraction

2.11.3 Slides: Brieuc-21-03-24

2.11.4 Speaker: Adrien Aguila-Multner

2.11.5 Title: Efficient training of Neural Networks

2.11.6 Slides: Adrien-21-03-24

2.11.7 Speaker: Ana Hourcau

2.11.8 Title: Exploiting mixed-precision to speed-up applications

2.11.9 Slides: Ana-21-03-24

2.12 Thursday 14 March 2024

2.12.1 Speakers: Pierre Esterie

2.12.2 Title: Existing features and new features to come from the C++ standard that can enhance HPC software (Part 3/3)

2.12.3 Summary:

Open discussion about the language, design methods and its challenges. Let's discuss on the language, what would you like/need? What's blocking or scary? Useful ressources.

2.12.4 Slides: Pierre-14-03-24

2.12.5 Video: Pierre-14-03-24

2.13 Thursday 08 February 2024

2.13.1 Speaker 1 : Alycia Lisito

2.13.2 Title: Enhancing sparse direct solver scalability through runtime system automatic data partition

2.13.3 Summary:

With the ever-growing number of cores per node, it is critical for runtime systems and applications to adapt the task granularity to scale on recent architectures. Among applications, sparse direct solvers are a time-consuming step and the task granularity is rarely adapted to large many-core systems. In this paper, we investigate the use of runtime systems to automatically partition tasks in order to achieve more parallelism and refine the task granularity. Experiments are conducted on the new version of the PaStiX solver, which has been completely rewritten to better integrate modern task-based runtime systems. The results demonstrate the increase in scalability achieved by the solver thanks to the adaptive task granularity provided by the StarPU runtime system.

2.13.4 Slides: Alycia-08-02-24

2.13.5 Speaker 2 : Thomas Morin

2.13.6 Title: Optimizing Parallel System Efficiency: Dynamic Task Graph Adaptation with Recursive Tasks

2.13.7 Summary:

Task-based programming models significantly improve the efficiency of parallel systems. The Sequential Task Flow (STF) model focuses on static task sizes within task graphs, but determining optimal granularity during graph submission is tedious. To overcome this, we extend StarPU’s STF recursive tasks model, enabling dynamic transformation of tasks into subgraphs. Early evaluations on homogeneous shared memory reveal that this just-in-time adaptation enhances performance.

2.13.8 Slides: Thomas-08-02-24

2.14 Thuesday 06 February 2024

2.14.1 Speakers: Mohamed Kherraz

2.14.2 Title: PyTorch Geometric framework with tutorials

2.14.3 Slides: Mohamed-06-02-24

2.14.4 Video: Mohamed-06-02-24

2.14.5 Notebook: ./notebooks/node_classification_with_PyG.ipynb[Mohamed-06-02-24]

2.15 Thursday 01 February 2024

2.15.1 Speakers: Pierre Esterie

2.15.2 Title: Existing features and new features to come from the C++ standard that can enhance HPC software (Part 2/3)

2.15.3 Summary:

Parallelism and asynchrony support. Concurrency library additions : an attempt to a unified interface.

2.15.4 Slides: Pierre-01-02-24

2.15.5 Video: Pierre-01-02-24

2.16 Thursday 18 January 2024

2.16.1 Speakers: Pierre Esterie

2.16.2 Title: Existing features and new features to come from the C++ standard that can enhance HPC software (Part 1/3)

2.16.3 Summary:

Containers, views and algorithms. Memory handling and abstractions, linear algebra support.

2.16.4 Slides: Pierre-18-01-24

2.16.5 Video: Pierre-18-01-24

2.17 Thursday 11 January 2024

2.17.1 Speakers: Hayfa Tayeb

2.17.2 Title: Autovesk: Automatic vectorized code generation from unstructured static kernels using graph transformations

2.17.3 Summary:

Leveraging the SIMD capability of modern CPU architectures is mandatory to take full advantage of their increased performance. To exploit this capability, binary executables must be vectorized, either manually by developers or automatically by a tool. For this reason, the compilation research community has developed several strategies for transforming scalar code into a vectorized implementation. However, most existing automatic vectorization techniques in modern compilers are designed for regular codes, leaving irregular applications with non-contiguous data access patterns at a disadvantage. In this article, we present a new tool, Autovesk, that automatically generates vectorized code from scalar code, specifically targeting irregular data access patterns. We describe how our method transforms a graph of scalar instructions into a vectorized one, using different heuristics to reduce the number or cost of instructions. Finally, we demonstrate the effectiveness of our approach on various computational kernels using Intel AVX-512 and ARM SVE. We compare the speedups of Autovesk vectorized code over GCC, Clang LLVM, and Intel automatic vectorization optimizations. We achieve competitive results on linear kernels and up to 11x speedups on irregular kernels.

2.17.4 Slides: Hayfa-11-01-24

3 Schedule 2023

  • 21/12/2023 : Portes fermees Inria
  • 14/12/2023 : canceled
  • 07/12/2023 : [Somesh]
  • 30/11/2023 : round table
  • 23/11/2023 : [Abel]
  • 16/11/2023 : canceled
  • 09/11/2023 : [Radja] (STORM talk)
  • 02/11/2023 : canceled (autumn holidays)
  • 26/10/2023 : canceled
  • 19/10/2023 : [Philippe]
  • 11/10/2023 : [Thomas + HPC Days]
  • 05/10/2023 : [Ahmed]
  • 28/09/2023 : [Kickoff Exasoft 28 et 29]
  • 22/06/2023 : [Abel]
  • 15/06/2023 : [Alycia]
  • 08/06/2023 : [Julia]
  • 01/06/2023 : canceled
  • 28/04/2023 : [Jean]
  • 27/04/2023 : [Alexandre]
  • 20/04/2023 : canceled (spring holidays)
  • 13/04/2023 : [Xunyi]
  • 06/04/2023 : Day of the doctoral school
  • 30/03/2023 : SOLHARIS workshop
  • 23/03/2023 : JLESC workshop
  • 16/03/2023 : canceled
  • 09/03/2023 : canceled
  • 02/03/2022 : [Aurore]
  • 23/02/2022 : [Clément]
  • 16/02/2022 : canceled (winter holidays)
  • 09/02/2023 : [Valentin]
  • 02/02/2023 : canceled
  • 24/01/2023 : [Quantic ATOS]
  • 19/01/2023 : postpone
  • 12/01/2023 : [Brieuc]

3.1 Thursday 07 December 2023

3.1.1 Speakers: Somesh Singh

3.1.2 Title: High-performance sparse tensor computations

3.1.3 Summary:

Tensors arise in several application domains such as scientific computing, machine learning, and data analytics. The talk will focus on two operations on sparse tensors:

  • tensor contraction and
  • querying for nonzeros in a tensor.

Tensor contraction is a higher-dimensional analog of matrix-matrix multiplication. Querying for nonzeros in a tensor is a subproblem in a recently developed tensor decomposition method. We investigate hashing-based methods for improving the performance of sparse tensor computations on shared-memory systems.

3.1.4 Slides: Somesh-07-12-23

3.2 Thursday 23 November 2023

3.2.1 Speakers: Abel Calluaud

3.2.2 Title: State of the art on H-matrix libraries

3.2.3 Summary:

3.2.4 Slides: Abel-23-11-23

3.3 Thursday 09 November 2023

3.3.1 Speakers: Radjasouria Vinaygame

3.3.2 Title: Rethinking Data Race Detection in MPI-RMA Programs

3.3.3 Summary:

Supercomputers are capable of increasingly more computations, and nodes forming them need to communicate even more efficiently with each other. The Message Passing Interface (MPI) proposes a communication model based on one-sided communications called the MPI Remote Memory Access (MPI-RMA). Thanks to these operations, applications can improve the overlap of communications with computations. However, one-sided communications are complex to write since they are subject to data races. This paper rethinks an existing on-the-fly data race detection algorithm for MPI-RMA programs by improving the storage of memory accesses in a Binary Search Tree using a new insertion algorithm based on fragmentation and merging algorithms. Thus, experimental results on real-life applications show that this new insertion algorithm improves the accuracy of the data race detection and can reduce the overhead of the analysis at runtime by a factor up to two.

3.4 Thursday 19 October 2023

3.4.1 Speakers: Philippe Swartvagher

3.4.2 Title: Using Mixed-Radix Decomposition to Enumerate Computational Resources of Deeply Hierarchical Architecture

3.4.3 Summary:

Considering the increasing number of hierarchy levels (sockets, NUMA domain, L3 cache, …) in HPC systems and the importance of process mapping for MPI application performance, we propose a procedure for enumerating the computational elements in the hierarchy in different orders. We explore two use cases: MPI rank reordering for applications using subcommunicators and core se- lection for applications that do not use all cores on a node. Results of micro-benchmarks executing collective operations in subcommunicators show a performance difference up to a factor 4 between the best and the worst rank orderings. By changing the rank orders, we observe an impact on the performance of the Splatt application. The evaluation of the strong scalability of a conjugate gradient benchmark showed that considering all hierarchy levels in the core selection policy can give better performance than using only options available with common MPI application launchers.

3.5 Wednesday 11 October 2023

3.5.1 Speakers: Samuel Thibault, Vincenç Beltran and Thomas Hérault.

3.5.2 Title: HPC Days following Gwenole'PhD defense

3.5.3 Summary: (Amphi LaBRI)

The schedule is the following:

  • 9:00-9h45 : Samuel Thibault "Task scheduling in StarPU"
  • 9h45:10h30 : Vincenç Beltran "OmpSs-2: A Programming Model For Distributed and Heterogeneous Systems"
  • 10h30-11h: Coffee break
  • 11h-11h45 : Thomas Herault "Template Task Graphs: a composable task-based programming interface for distributed hybrid systems"

3.6 Thursday 05 October 2023

3.6.1 Speakers: Ahmed Abdourahman Mahamoud

3.6.2 Title: Combine ROTOR with Pipe and DDP

3.6.3 Summary:

Scaling up deep neural network capacity has been known as an effective approach to improving model quality for several different machine learning tasks. ROTOR is an activation checkpointing method that helps scaling models for a single accelerator by significantly decreasing the memory usage when training sequential models. Pipe and DDP are distributed training methods in which the deep learning model is partitioned across multiple devices for Pipe or copied on multiple devices for DDP. We combined ROTOR with Pipe and DDP in order to use it in a multi-device architecture. During this project, we worked on the adaptation of the algorithms of ROTOR for the pipelining context and we analyzed ROTOR’s performances in a multi-device architecture with sequential models.

3.6.4 Slides: Ahmed-05-10-23

3.7 Thursday 22 June 2023

3.7.1 Speakers: Abel Calluaud

3.7.2 Title: Analyse de performance d’un solveur direct hiérarchique

3.7.3 Summary:

Talk for COMPAS 2023.

La recherche de performance pour la résolution de systèmes linéaires de grande taille est la clé pour de nombreuses applications de simulation numérique. Ainsi, le code de simulation ARLENE du CEA s’intéresse à la résolution des équations de Maxwell en fréquence pour étudier les problèmes de furtivité électromagnétique. Il repose sur une méthode de factorisation directe extrêmement gourmande en mémoire et en temps de calcul, limitant la taille des problèmes qui peuvent être étudiés. Les besoins industriels pour la résolution de systèmes toujours plus grands ont motivé l’utilisation de techniques de compression hiérarchique pour réduire les besoins en termes de mémoire et de puissance de calcul. Les approximations H-matrices ont rendu atteignable en temps raisonnable la résolution sur de très grands systèmes en ramenant la complexité de la factorisation à O(nlog(n)) contre O(n3) pour un algorithme dense. L’introduction de H-matrices introduit toutefois des difficultés pour une implémentation efficace de par la grande combinatoire de noyaux de calculs qui varient en taille et en intensité arithmétique. Les algorithmes sur ce format hiérarchique s’expriment naturellement à l’aide d’un paradigme de tâches dont l’ordonnancement est délégué à un support d’exécution dédié. Basé sur le modèle Séquential Task Flow (STF), celui-ci bénéficie d’extensions permettant d’exploiter les particularités du domaine, notamment la prise en compte des dépendances hiérarchiques de données à grain fin afin d’éviter des synchronisations excessives. Cette approche a permis de factoriser efficacement des problèmes qui comportent jusqu’à des dizaines de millions d’inconnues en quelques heures sur quelques dizaines de milliers de cœurs. Cependant, l’irrégularité des calculs laisse une question de recherche ouverte sur l’adaptation de la granularité des tâches à l’architecture matérielle. Les petits noyaux peuvent d’une part souffrir d’inefficacités et les larges tâches gagneraient à être parallélisées. Dans ce contexte, nous proposons une méthodologie et des métriques afin d’évaluer la performance des noyaux par rapport aux limites matérielles. Nous présentons les résultats de performance des noyaux et en faisons une analyse détaillée. Nous mettons en évidence les problèmes de granularité et faisons une estimation du gain potentiel d’une stratégie de fusion pour le grain faible et de parallélisation pour le gros grain.

3.8 Thursday 15 June 2023

3.8.1 Speakers: Alycia Lisito

3.8.2 Title: New parallel features in the sparse solver PaStiX

3.8.3 Summary:

Talk for Sparse Days 2023.

In this talk, we will present new parallel features in the sparse solver PaStiX. We will start by introducing the distributed MPI version of the initialisation of the factorisation step and solve step. We will then discuss about the new multi-threaded and distributed versions of PaStiX's solve (with its internal schedulers). Finally, we will present a left looking version of the task submission in the POTRF and GETRF functions of the scheduler StarPU.

3.8.4 Slides: Alycia-15-06-23

3.9 Thursday 08 June 2023

3.9.1 Speakers:

3.9.2 Title: ELF (Efficient deep Learning Frameworks) associate team

3.9.3 Summary:

Defriefing of the ELF workshop

3.10 Friday 28 April 2023

3.10.1 Speakers: Jean Kossaifi

3.10.2 Title: ELF (Efficient deep Learning Frameworks) associate team

3.10.3 Summary:

First workshop organizing in the framework of ELF (Efficient deep Learning Frameworks) associate team between Topal and Anima AI + Science Lab from Caltech (http://tensorlab.cms.caltech.edu/users/anima/).

The schedule is the following:

Jean will give an overview of tensor methods for deep learning, starting with a short introduction to tensor decomposition, how to leverage tensor methods to design better deep models, for improved performance or speed, model compression and robustness and finally practical implementation using TensorLy and TensorLy-Torch. He will also touch on how to leverage tensor methods for improving quantum circuit simulations and neural operators.

  • 10.30 to 12.00 : Discussion and brainstorming on efficient training of large deep learning models using re-materialization, offloading, pipelining techniques
  • 14.30 to 16.00 : Discussion and brainstorming on the high-performance computing aspects of tensor computations (dense, sparse, compressed approaches, low-level kernel optimization, runtime systems considerations)

3.11 Thursday 27 April 2023

3.11.1 Speakers: Alexandre Honorat

3.11.2 Title: Sequential Scheduling of Dataflow Graphs for Memory Peak Minimization

3.11.3 Summary:

Many computing systems are constrained by their fixed amount of shared memory. Modeling applications with task or Synchronous DataFlow (SDF) graphs makes it possible to analyze and optimize their memory peak. The problem studied here is the memory peak minimization of such graphs when scheduled sequentially. Regarding task graphs, former work has focused on the Series-Parallel Directed Acyclic Graph (SP-DAG) subclass and proposed techniques to find the optimal sequential algorithm w.r.t. memory peak. As main contributions, we propose task graph transformations and an optimized branch and bound algorithm to solve the problem on a larger class of task graphs. The approach also applies to SDF graphs after converting them to task graphs. However, contributions about SDF graphs will be skipped for this presentation. We evaluate our approach on classic benchmarks, on which we always outperforms the state-of-the-art.

3.11.4 Slides: Alexandre-27-04-23

3.12 Thursday 13 April 2023

3.12.1 Speakers: Xunyi

3.12.2 Title: Rockmate: an Efficient, Fast, Automatic and Generic Tool for Re-materialization in PyTorch

3.12.3 Summary:

In recent years, very large networks have emerged. These networks induce huge memory requirements both because of the number of parameters and the size of the activations that must be kept in memory to perform back-propagation. Memory issues for training have been identified for a long time. Indeed, training is usually performed on computing resources such as GPUs or TPUs, on which memory is limited. Therefore, different approaches have been proposed.

3.13 Thursday 30 March 2023

3.13.1 Speakers:

3.13.2 Title: SOLHARIS workshop

3.13.3 Summary:

The working group will be held jointly with the HPC-SCALABLE-ECOSYSTEM project.

3.14 Thursday 23 March 2023

3.14.1 Speakers:

3.14.2 Title: JLESC workshop

3.14.3 Summary:

The 15th JLESC Workshop gathers leading researchers in high-performance computing from the JLESC partners INRIA, the University of Illinois, Argonne National Laboratory, Barcelona Supercomputing Center, Jülich Supercomputing Centre, RIKEN Center for Computational Science and the University of Tennessee to explore the most recent and critical issues in advancing the field of HPC from petascale to the extreme scale era.

3.15 Thursday 02 March 2023

3.15.1 Speakers: Aurore Li

3.15.2 Title: Some words about the ELFAM project

3.15.3 Summary:

There are two parts in this presentation:

  • I would like to talk about some of my professional experiences before joining Topal team
  • I'll present the projet ELFRAM collaborating with Julia Gusak. I'll talk about some ongoning works and our perspectives.

3.16 Thursday 23 Febuary 2023

3.16.1 Speakers: Clément Richefort

3.16.2 Title: Toward a multilevel method for the Helmholtz equation

3.16.3 Summary:

It is well known that multigrid methods are very competitive in solving a wide range of SPD problems. However achieving such performance for non-SPD matrices remains an open problem. In particular, two main issues may arise when solving a Helmholtz problem. Some eigenvalues become negative or even complex, requiring the choice of an adapted smoothing method for capturing them. Moreover, since the near-kernel space is oscillatory, the geometric smoothness assumption cannot be used to build efficient interpolation rules. We present some investigations about designing a method that converges in a constant number of iterations with respect to the wavenumber. The method builds on an ideal reduction-based framework and related theory for SPD matrices to correct an initial least squares minimization coarse selection operator formed from a set of smoothed random vectors.

3.16.4 Slides: Clement-23-02-23

3.17 Thursday 09 Febuary 2023

3.17.1 Speakers: Valentin Lefevre

3.17.2 Title: Efficient Execution of SpGEMM on Long Vector Architectures

3.17.3 Summary:

In this talk, we will look at how long vector architectures are used in the context of the European Processor Initiative (EPI) and we will then focus on optimizing SpGEMM in this context. The Sparse GEneral Matrix-Matrix multiplication (SpGEMM) \(C = A \times B\) is a fundamental routine extensively used in domains like machine learning or graph analytics. Despite its relevance, the efficient execution of SpGEMM on vector architectures is a relatively unexplored topic. The most recent algorithm to run SpGEMM on these architectures is based on the SParse Accumulator (SPA) approach, and it is relatively efficient for sparse matrices featuring several tens of non-zero coefficients per column as it computes C columns one by one. However, when dealing with matrices containing just a few non-zero coefficients per column, the state-of-the-art algorithm is not able to fully exploit long vector architectures when computing the SpGEMM kernel. To overcome this issue we propose the SPA paRallel with Sorting (SPARS) algorithm, which computes in parallel several C columns among other optimizations, and the HASH algorithm, which uses dynamically sized hash tables to store intermediate output values. To combine the efficiency of SPA for relatively dense matrix blocks with the high performance that SPARS and HASH deliver for very sparse matrix blocks we propose H-SPA(t) and H-HASH(t), which dynamically switch between different algorithms. H-SPA(t) and H-HASH(t) obtain \(1.24 \times\) and \(1.57 \times\) average speed-ups with respect to SPA respectively, over a set of 40 sparse matrices obtained from the SuiteSparse Matrix Collection. For the 22 most sparse matrices, H-SPA(t) and H-HASH(t) deliver \(1.42 \times\) and \(1.99 \times\) average speed-ups respectively.

3.18 Tuesday 24 January 2023

3.18.1 Speakers: Simon Martiel

3.18.2 Title: Calcul Quantique et HPC

3.18.3 Summary:

Après une rapide présentation des activités R&D d’Atos sur le calcul quantique, nous développerons deux sujets à la frontière en calcul hautes performances et calcul quantique. Nous présenterons les récents efforts pour la définition d’un framework basé sur C++ pour l’intégration de kernel quantiques au sein d’un flot de calcul classique. Nous survolerons ensuite certaines problématiques liées à la définition de kernels quantiques pour l’algèbre linéaire. Dans un second temps, nous aborderons le sujet de la simulation de processeurs quantiques via un système HPC classique. Nous présenterons le problème de la contraction de réseaux de tenseurs, ses liens avec le calcul hautes performances, ainsi que des résultats récents basés sur des techniques de méthodes formelles quantiques.

3.19 Thursday 12 January 2023

3.19.1 Speakers: Brieuc Nicolas

3.19.2 Title: Comparing mixed-precision solving with low-rank compression

3.19.3 Summary:

Comparisons between mixed precision and low rank compression in the sparse direct solver PaStiX.

3.19.4 Slides: Brieuc-12-01-23

4 Schedule 2022

  • 15/12/2022 : [Abel]
  • 08/12/2022 : [Hayfa]
  • 01/12/2022 : [Anne-Cecile]
  • 24/11/2022 : [Alycia]
  • 17/11/2022 : [Mathieu]
  • 10/11/2022 : canceled
  • 03/11/2022 : SBAC-PAD 2022 conference (https://project.inria.fr/sbac2022/)
  • 27/20/2022 : canceled
  • 20/20/2022 : canceled
  • 13/10/2022 : [Yanfei]
  • 06/10/2022 : [Lucas]
  • 29/09/2022 : [Alycia]
  • 22/09/2022 : [Xunyi]
  • 30/06/2022 : internship presentations
  • 23/06/2022 : Sparse Days (canceled)
  • 16/06/2022 : [Nathanaël]
  • 09/06/2022 : SOLHARIS (canceled)
  • 02/06/2022 : lecture for "Graph Partitioning and Sparse Matrix Ordering using Reinforcement Learning and Graph Neural Networks"
  • 26/05/2022 : canceled (public holiday)
  • 19/05/2022 : canceled
  • 12/05/2022 : auditions PR UB (Pierre + Abdou)
  • 05/05/2022 : [Clément]
  • 28/04/2022 : canceled (spring holidays)
  • 21/04/2022 : introduction à l'apprentissage par renforcement (part3)
  • 14/04/2022 : [Alycia]
  • 07/04/2022 : canceled
  • 31/03/2022 : introduction à l'apprentissage par renforcement (part2)
  • 24/03/2022 : introduction à l'apprentissage par renforcement (part1)
  • 17/03/2022 : canceled
  • 10/03/2022 : canceled
  • 03/03/2022 : [Julia]
  • 24/02/2022 : canceled (winter holidays)
  • 17/02/2022 : [Julia]
  • 10/02/2022 : SOLHARIS and HPC Scalabe Ecosystem half-days
  • 03/02/2022 : [Grégoire]
  • 27/01/2022 : [Valentin]
  • 20/01/2022 : canceled
  • 13/01/2022 : canceled
  • 06/01/2022 : [Cristina]

4.1 Thursday 15 December 2022

4.1.1 Speakers: Abel Calluaud

4.1.2 Title: Combined runtime system and compiler for direct hierarchical solver

4.1.3 Summary:

Previous work on the boundary element method for solving Maxwell's equation in frequency has pushed the limits in terms of problems reachable by direct approaches using compression techniques. The distributed direct solver developed at CEA relies on the approximation by hierarchical matrices to reduce both computational and memory costs. Although these developments have met a growing demand for increased simulation accuracy, there are still open problems to pursue these research efforts in an HPC context. We propose to develop and compare several approaches to adapt the granularity of hierarchical tasks and extract parallelism to exploit the multicore computational nodes associated with massively parallel architectures such as GPUs.

4.1.4 Slides: Abel-15-12-22

4.2 Thursday 08 December 2022

4.2.1 Speakers: Hayfa Tayeb

4.2.2 Title: MulTreePrio scheduler for task-based applications

4.2.3 Summary:

Effective scheduling is crucial for task-based applications to achieve high performance in heterogeneous computing systems. We propose MulTreePrio, a scheduler based on trees. Tasks are assigned to available resources based on priority scores. These scores are computed using heuristics built according to a set of rules that the scheduler must respect.

4.3 Thursday 01 December 2022

4.3.1 Speakers: Anne-Cecile Orgerie (common seminar with STORM)

4.3.2 Title: Some wrong ideas about energy consumption of distributed systems

4.3.3 Summary

Distributed systems, such as Cloud computing, are increasingly spanning worldwide, with digital services hosted all around the globe and often belonging to complex systems, utilizing many other services and hardware resources themselves. Along with this increase comes an alarming growth of Cloud devices and their related energy consumption. Despite the distributed systems’ complexity, understanding how they consume energy is important in order to hunt wasted Joules and reduce their environmental impact. This talk will deal with measuring the energy consumption of distributed systems and wrong ideas about this consumption that can be found in literature.

4.4 Thursday 24 November 2022

4.4.1 Speakers: Alycia Lisito

4.4.2 Title: Sparse direct solver analysis and optimization for MHD simulation

4.4.3 Summary: Pre-defense of Internship

4.4.4 Slides: Alycia-24-11-22

4.5 Thursday 17 November 2022

4.5.1 Speakers: Mathieu Vérité

4.5.2 Title: A Data Distribution Scheme for Dense Cholesky Factorization on Any Number of Nodes

4.5.3 Summary:

This work focuses on the case of the parallel and distributed execution of the Cholesky factorization on a dense matrix using identical computing resources. We consider the problem of distributing a matrix divided into tiles to a set of nodes in order to reduce the overall volume of communication. The recently introduced Symmetric Block Cyclic (SBC) distribution is a pattern-based distribution that takes advantage of the symmetry of the input matrix \(A\) to reduce the volume of communication generated when performing Cholesky factorization compared to classical 2D Block Cyclic. Experimental results show that the overall communication reduction allows to achieve higher performance. However, SBC distribution scheme is only available for specific values of the number of nodes \(P\). In this work, we propose a greedy algorithm, Greedy ColRow & Matching (GCR&M), which enables to define pattern-based distributions with a structure similar to SBC and which achieves the same communication reduction but for any \(P\). It thus allows a generalization of distributions specifically adapted to the Cholesky factorization and more generally to kernels using a symmetric input. We present a theoretical evaluation of the distributions provided by the GCR&M algorithm. The flexibility and ease of programming induced by task-based runtime systems allowed us to carry out experiments to test out those distributions using Chameleon and StarPU. Results show that the solutions provided by GCR&M algorithm manage to achieve performance similar to SBC while being available for any \(P\).

4.5.4 Slides: Mathieu-17-11-22

4.6 Thursday 13 October 2022

4.6.1 Speakers: Yanfei Xiang

4.6.2 Title: Hybridizing deep learning solver and numerical linear algebra for the Helmholtz equations

4.6.3 Summary:

In the recent years, research on scientific machine learning based on deep learning solvers has been increasingly applied to scientific computing and computational engineering. However, even though these newly data-driven deep learning solvers could be very effective as soon as they have been properly trained, they can (in general) provide us with a solution of limited accuracy. Furthermore, the computational cost in the training phase can be extremely expensive. In this talk, we presents some ways of hybridizing the newly emerging deep learning solvers and the more traditional numerical linear algebra techniques to let them benefit from each other. In the context of solving a heterogeneous Helmholtz equation, we first focus on introducing some mathematical ingredients from classical iterative solver into the training phase of a recently proposed deep neural network solver. The main benefit is a significant improvement in the training phase that is more robust and faster, which turns out to be applicable to the testing process as well. Furthermore, once the network solvers have been properly trained, their inferences can be applied as a nonlinear preconditioner in the traditional flexible GMRES and flexible FOM methods. This part demonstrates that these hybrid variants have clear advantages over both the newly emerging deep neural network approach and the classical iterative Krylov solver in terms of both computational cost and accuracy of the computed solution.

4.7 Thursday 06 October 2022

4.7.1 Speakers: Lucas Nesi (common seminar with STORM)

4.7.2 Title: Exploiting system-level heterogeneity to improve the performance of multi-phase task-based applications

4.7.3 Summary:

HPC infrastructures often present intra-node (multi-core CPUs and multiple GPUs) and system-level heterogeneity (different nodes arranged into partitions). This heterogeneity provides opportunities for the applications to improve performance, especially in task-based multi-phase applications where each phase has different needs. Improving phases overlap and asynchronously, taking advantage of inter-node heterogeneity to better distribute the load, and adequately selecting the number of resources to use are examples of such opportunities. In this talk, we present strategies for (1) improving applications phase overlap by optimizing runtime and scheduling decisions; (2) computing a distribution considering phases' suitability over heterogeneous nodes while reducing the redistribution overhead; (3) strategies based on the Gaussian Process for the application to dynamically learn during execution time and adapt to the best set of heterogeneous nodes it has access.

4.8 Thursday 29 September 2022

4.8.1 Speakers: Alycia Lisito

4.8.2 Title: Distributed interface for PaStiX

4.8.3 Slides: Alycia-29-09-22

4.9 Thursday 22 September 2022

4.9.1 Speakers: Xunyi Zhao

4.9.2 Title: Rockmate: to save memory by combining Rotor and Checkmate

4.10 Thursday 30 June 2022

4.10.1 Speakers:

  • Guillaume Bienfait : QemuNet to QemuWeb
  • Jean-Alexandre Collin : Reducing communications on a dense Cholesky factorization
  • Théotime Le Hellard : Rematerializing Optimally with pyTORch
  • Brieuc Nicolas : Mixed precision in PaStiX solver

4.10.2 Title: Internship presentations

4.10.3 Summary:

4.10.4 Slides:

4.11 Thursday 16 June 2022

4.11.1 Speakers: Nathanaël Fijalkow

4.11.2 Title: Gragh Neural Networks

4.11.3 Summary:

4.11.4 Slides: Nathanael-16-06-22

4.12 Thursday 02 June 2022

4.12.1 Speakers: Alice Gatti (video)

4.12.2 Title: Graph Partitioning and Sparse Matrix Ordering using Reinforcement Learning and Graph Neural Networks

4.13 Thursday 05 May 2022

4.13.1 Speakers: Clément Richefort

4.13.2 Title: Interpolator built from local near-kernel spaces in multigrid methods applied to Helmholtz equation

4.13.3 Summary:

Numerical simulations of various physical phenomenons give rise to the resolution of a potentially very large system of linear equations Ax=b. Multigrid methods are known to be scalable and quasi-optimal for solving such a system in many classes of problems. In this method, computation of x is accelerated thanks to a collection of coarse problems. The solution on the coarsest level is computed by a direct method, while few iterations of a given smoother is used on others to refine the approximation made by restriction and interpolation. Such a complementarity principle between a smoother targeting residual information and a direct method solving the near-kernel space of A guarantees the convergence of the method. In elliptic problems, it is easy to find appropriates smoother and interpolators to reach this complementarity since A is positive and has a smooth near-kernel space. However for indefinite problems like Helmholtz, smoothness property of near-kernel space and positiveness of resulting matrix can not be called to reach complementarity : usual smoothers amplify negative components while usual interpolators are not designed to catch wave-like near-kernel space anymore. Working on normal equations or executing Krylov iterations are good alternatives to usual smoothers in indefinite context, however finding appropriate operators to target near-kernel space is still an open question. From definition a theoretical Ideal Interpolator, this presentation will expose a new type of interpolator permitting to reach convergence in a constant number of iterations independently of matrix size with 2 or 3 multigrid levels. This interpolator works by combining a smoothing matrix to damp oscillatory components, an approximation of the Fine-Relaxation operator used in Ideal Interpolator, and a tentative interpolator built in order to target local near-kernel spaces computed from blocks of A.

4.13.4 Slides: Clement-05-05-22

4.14 Thursday 21 Appril 2022

4.14.1 Speakers: Grégoire Passault (Robhan team)

4.14.2 Title: Introduction à l'apprentissage par renforcement (part3)

4.14.3 Summary:

Deep Q-learning

4.14.4 Slides: Gregoire-24-03-22

4.15 Thursday 14 Appril 2022

4.15.1 Speakers: Alycia Lisito

4.15.2 Title: Validation and evaluation of the Chameleon Lapack interface

4.15.3 Summary:

Chameleon is a dense linear algebra library that aims at replacing the standard APIs Lapack and Sclapack for distributed and heterogeneous architectures. It relies on tiled algorithms implemented on top of various runtime systems such as StarPU, Quark, Parsec, or OpenMP. The Chameleon library provides three interfaces to the user for more flexibility: an interface close to Lapack API to ease the switch from Lapack to Chameleon to the users, a tile interface to match the algorithm structure, and a tile asynchronous interface to enable pipelined algorithms. Over the time, support of the Lapack interface was not sustained and leftover from the conversion from Plasma was still impacting the performance of this interface. The objective of my internship was twofold, add checking functionnality to fix potential issues and ease its support, and evaluate the performance impact of the out-of-place copy to tile layout with respect to an in-place solution. During this talk, I will recall the principles of the tile algorithms in Chameleon, the respective data layout of Lapack and Chameleon and how we can implement a Lapack interface on top of the chameleon algorithm. I'll finish by presenting the evaluation of the performance of the library on multiple architectures from plafrim.

4.15.4 Slides: Alycia-14-04-22

4.16 Thursday 31 March 2022

4.16.1 Speakers: Grégoire Passault (Robhan team)

4.16.2 Title: Introduction à l'apprentissage par renforcement (part2)

4.16.3 Summary:

Q-learning

4.16.4 Slides: Gregoire-24-03-22

4.17 Thursday 24 March 2022

4.17.1 Speakers: Grégoire Passault (Robhan team)

4.17.2 Title: Introduction à l'apprentissage par renforcement (part1)

4.17.3 Summary:

4.17.4 Slides: Gregoire-24-03-22

4.18 Thursday 03 March 2022

4.18.1 Speakers: Julia Gusak - Skoltech

4.18.2 Title: Neural Networks Memory Footprint Reduction during Training

4.18.3 Summary:

Modern Deep Neural Networks (DNNs) require significant memory to store weights, activations, and other intermediate tensors during training. Hence, many models do not fit in the memory of one GPU device or can be trained using only a small per-GPU batch size. We will discuss two new approaches two reduce memory consumption during training by performing activation quantization or approximate matrix multiplication during backward pass.

4.18.4 Slides: Julia-03-03-22

4.19 Thursday 17 February 2022

4.19.1 Speakers: Julia Gusak - Skoltech

4.19.2 Title: Neural Networks Speed-up and Compression

4.19.3 Summary:

Most modern neural networks are overparameterized and exhibit high computational costs. As a result, they can’t be efficiently deployed on embedded systems and mobile devices due to power and memory limitations. We present several approaches based on low-rank approximations to accelerate and compress neural networks during inference. Also, the memory footprint is one of the main limiting factors for large neural network training. And we discuss several techniques to reduce memory costs during the training phase based on approximate gradients computation during backpropagation.

4.19.4 Slides: Julia-17-02-22

4.20 Thursday 10 February 2022

4.20.1 Title: SOLHARIS and HPC Scalabe Ecosystem half-days

4.20.2 Wednesday 9/2 (9:45 - 12:45)

  • Scheduling algorithms 10:00 - 11:00 [chair: Loris Marchal]
    • Maxime Gonthier "Memory-Aware Scheduling of Tasks Sharing Data on Multiple GPUs"
    • Mathieu Verité / Lionel Eyraud-Dubois "I/O-Optimal Algorithms for Symmetric Linear Algebra Kernels"
  • Programming models and tools 11:15 - 12:45 [chair: Olivier Aumage]
    • Marek Felsoci "An energy consumption study of coupled solvers for FEM/BEM linear systems: preliminary results"
    • Philippe Swartvagher "Recent progress on starpu+nmad"
    • Antoine Jego "Task-based programming models for scalable algorithms"

4.20.3 Thursday 10/2 (9:45 - 11:30)

  • Application 10:00 - 11:30 [chair: Vincent Perrier]
    • Grégoire Pichon / Mathieu Faverge "Deciding Non-Compressible Blocks in Sparse Direct Solvers using Incomplete Factorization"
    • Sangeeth Simon "Task Based Parallelization Of A Finite-Volume Code For Hyperbolic Conservation Laws"
    • Romain Peressoni "Approximating MDS point clouds shape using only part of the distance matrix"

4.21 Thursday 03 February 2022

4.21.1 Speakers: Grégoire Pichon

4.21.2 Title: Trading Performance for Memory in Sparse Direct Solvers using Low-rank Compression

4.21.3 Summary:

Sparse direct solvers using Block Low-Rank compression have been proven efficient to solve problems arising in many real-life applications. Improving those solvers is crucial for being able to 1) solve larger problems and 2) speed up computations. A main characteristic of a sparse direct solver using low-rank compression is at what point in the algorithm the compression is performed. There are two distinct approaches: (1) all blocks are compressed before starting the factorization, which reduces the memory as much as possible, or (2) each block is compressed as late as possible, which usually leads to better speedup. Approach 1 reaches a very small memory footprint generally at the expense of a greater execution time. Approach 2 achieves a smaller execution time but requires more memory. In this talk, we design a composite approach, to speedup computations while staying under a given memory limit. This should allow to solve large problems that cannot be solved with Approach 2 while reducing the execution time compared to Approach 1. We propose a memory-aware strategy where each block can be compressed either at the beginning or as late as possible. We first consider the problem of choosing when to compress each block, under the assumption that all information on blocks is perfectly known, i.e., memory requirement and execution time of a block when compressed or not. We show that this problem is a variant of the NP-complete Knapsack problem, and adapt an existing approximation algorithm for our problem. Unfortunately, the required information on blocks depends on numerical properties and in practice cannot be known in advance. We thus introduce models to estimate those values. Experiments on the PaStiX solver demonstrate that our new approach can achieve an excellent trade-off between memory consumption and computational cost. For instance on matrix Geo1438, Approach 2 uses three times as much memory as Approach 1 while being three times faster. Our new approach leads to an execution time only 30% larger than Approach 2 when given a memory 30% larger than the one needed by Approach 1.

4.21.4 Slides: Gregoire-03-02-22

4.22 Thursday 27 January 2022

4.22.1 Speakers: Valentin Le Fèvre

4.22.2 Title: Resilient algorithms in HPC and linear algebra for new architectures

4.22.3 Summary:

In this talk, I will present various research topics on which I was involved during my PhD (at ENS de Lyon) or my post-doc position (at BSC). The first axis will be resilience for large-scale platforms. We will review different techniques for resilience, mainly checkpointing and replication, with a focus on process replication. Process replication allows an application to survive many errors, allowing to use checkpointing less frequently. We will study two different strategies: one which restarts processes at each checkpoint and one where processes are restarted only after a crash. The second axis will be sparse linear algebra. I will present a recent work on sparse Cholesky factorization, where the application is represented as a task-graph. We define thresholds to automatically adapt the granularity of the tasks depending on the input matrix, and present experimental results using the recent A64FX processor and OmpSs runtime. I will conclude with some on-going perspectives for processors based on the Risc-V architecture, in the context of the European Processor Initiative (EPI).

4.22.4 Slides: Valentin-27-01-22

4.23 Thursday 06 January 2022

4.23.1 Speakers: Dr. Cristina Boeres, Instituto de Computação, Universidade Federal Fluminense, Brazil

4.23.2 Title: Towards Analyzing Computational Costs of Spark for SARS-CoV-2 Sequences Comparisons on a Commercial Cloud

4.23.3 Summary:

This talk introduces the development of a Spark application, named Diff Sequences Spark which compares 540 SARS-CoV-2 sequences from South America in Amazon EC2 Cloud. The work analyzed the performance of the proposed application on selected memory and storage EC2 virtual machines instances (VMs) at on-demand and spot markets. Some preliminary conclusions were drawn: the execution times and financial costs of the memory-optimized VMs outperformed the storage optimized ones and the average execution times and monetary costs were reduced on spot VMs compared to their respective on-demand even in scenarios with several spot revocations.

4.23.4 Bio:

Cristina Boeres is an associate professor at the Instituto de Computação (IC) of the Universidade Federal Fluminense (UFF), in Brazil. Her current research interests focus on various aspects of parallel and distributed computing in grids and clouds, including autonomic computing, scientific applications, resource management, scheduling, and fault tolerance. She has been involved with the organization of conferences such as the International Symposium on Computer Architecture and High Performance Computing, sponsored by IEEE and SBC, the Brazilian Symposium on High Performance Computational Systems, sponsored by SBC, and also, IPDPS 2019 held in Rio de Janeiro, Brazil.

4.23.5 Slides: Cristina-06-01-22

5 Schedule 2021

  • 16/12/2021 : [Clement]
  • 09/12/2021 : [Vinod]
  • 02/12/2021 : canceled (talk by Karim replanned in January, common WG with HiePACS)
  • 24/11/2021 : canceled (replaced by the overview of the Energyscope tool)
  • 18/11/2021 : Informal discussions around energy saving
  • 11/11/2021 : canceled (public holiday)
  • 04/11/2021 : canceled (university break)
  • 28/10/2021 : [Mathieu]
  • 21/10/2021 : [Alena]
  • 14/10/2021 : [Jean-François]
  • 07/10/2021 : [Xunyi]
  • 30/09/2021 : [Clement]
  • 23/09/2021 : Internship defenses (Nolan Bredel & Tom Moënne-Loccoz)
  • 16/09/2021 : canceled
  • 09/09/2021 : Coffee-break / Go-around
  • 08/07/2021 : moliere / skoltech-inria meeting
  • 02/07/2021 : solharis / hpc-scalable-ecosystem day
  • 24/06/2021 : [Olivier]
  • 17/06/2021 : [Lionel]
  • 10/06/2021 : [Nolan]
  • 04/06/2021 : [Clement]
  • 27/05/2021 : [Pavel]
  • 20/05/2021 : [Suraj]
  • 06/05/2021 : [Olivier]
  • 12/04/2021 : [Gwenole]
  • 08/04/2021 : Data Parallelism - Model Parallelism - Hybrid Parallelism (Fidle team, CNRS)
  • 01/04/2021 : Keras and Convolutional Neural Networks (Fidle team, CNRS)
  • 25/03/2021 : History and basic concepts of neural networks (Fidle team, CNRS)
  • 18/03/2021 : [Mathieu]
  • 11/03/2021 : [Hervé]
  • 04/03/2021 : Présentation de PlaFRIM
  • 25/02/2021 : [Mathieu]
  • 18/02/2021 : Winter Holidays
  • 11/02/2021 : [Tony]
  • 04/02/2021 : [Alena]
  • 28/01/2021 : [Esra]

5.1 Thursday 16 December 2021

5.1.1 Speakers: Clement Richefort

5.1.2 Title: Towards approximating ideal interpolator for multigrid methods applied to indefinite Helmholtz equation

5.1.3 Summary:

The indefiniteness of Helmholtz equation makes usual multi-level schemes unable to converge. Geometric and algebraic interpolators permitting to propagate the solution between each space don't give a good approximation of near-kernel components (NKC). A first idea is to build iterative methods combining local near-kernel spaces for targeting the global one, and use them as smoothers between each level for correcting the bad approximation of usual interpolators. Another approach tries to approximate the so-called "ideal interpolator" which gives good convergence results but requires to inverse a potentially large sub-matrix of the system, making it hard to compute and very dense. Theoretical attempts of this "ideal interpolator" approximation will be presented.

5.2 Thursday 09 December 2021

5.2.1 Speakers: Dr. Vinod Rebello, Instituto de Computação, Universidade Federal Fluminense, Brazil

5.2.2 Title: Managing Vertical Memory Elasticity in Containers

5.2.3 Summary:

Efficient resource utilization and throughput maximization are just two important objectives for service providers trying to reduce operating costs, for example in clusters, cloud data centers, and even cloudlets at the edge. While containers consume CPU, memory, and I/O resources elastically, orchestration frameworks must still allocate and schedule containers according to resource availability and limit the amount of resources that each can use to avoid adverse interference. In this short talk, I will present a tool to autonomically manage container memory allocations dynamically in order to increase the average number of containers that can be hosted on a server. Evaluations show that through the careful adjustments of memory limits, manipulation of pages between memory and swap, and the use of container preemption, improvements in memory utilization, cloud costs, and job throughput can be achieved without prejudicing container performance.

5.2.4 Bio:

Vinod Rebello is an associate professor at the Instituto de Computação (IC) of the Universidade Federal Fluminense (UFF), in Brazil. His current research interests focus on various aspects of parallel and distributed computing in grids and clouds, including autonomic computing, scientific applications, resource management, scheduling and fault tolerance, and cybersecurity. He has experience in high performance computing, having been responsible for HPC services at IC-UFF and participated in several research projects funded through the European Union-Brazil Cooperation Program in ICT. Dr Rebello is an Associate Editor of the Journal of Parallel and Distributed Computing, a member of the ACM, IEEE and SBC (Brazilian Computing Society) and has been involved in the organisation of a number of their conferences, including acting as General Chair of IPDPS 2019.

5.2.5 Slides: Vinod-08-12-21

5.3 Thursday 18 Novembre 2021

5.3.1 Speakers: /all

5.3.2 Title: Informal discussions around energy saving

5.3.3 Slides: Energy-18-11-21

5.4 Thursday 28 October 2021

5.4.1 Speakers: Mathieu Vérité

5.4.2 Title: Parallel distributed Cholesky factorization

5.4.3 Summary:

In this work, we consider the parallel distributed Cholesky factorization on a set of homogeneous nodes. Our contribution is of two different natures. First, we provide a new parametric lower bound with an improved scaling factor on the amount of data that must be exchanged during the factorization, for a fixed number of nodes and a fixed memory size. This improvement of the bound is based on a specific treatment of data reuse in symmetric kernels such as the Cholesky factorization. Then, based on this work on lower bounds, we propose a new data distribution scheme, called Symmetric Block Cyclic, in replacement of the 2D and 3D Block Cyclic distributions for symmetric dense matrices. The Symmetric Block Cyclic distribution reduces the communication volume while keeping an excellent load balancing during the computation. Experiments with multicore nodes using the chameleon library show that this reduction of the communication volume effectively induces a reduction of the computing time both for the parallel distributed Cholesky factorization, as well as for subsequent Cholesky solves or the Cholesky inversion.

5.5 Thursday 21 October 2021

5.5.1 Speakers: Alena Shilova

5.5.2 Title: Efficient Combination of Rematerialization and Offloading for Training DNNs

5.5.3 Summary:

Rematerialization and offloading are two well-known strategies to save memory during the training phase of deep neural networks, allowing data scientists to consider larger models, batch sizes or higher resolution data. Rematerialization trades memory for computation time, whereas Offloading trades memory for data movements. As these two resources are independent, it is appealing to consider the simultaneous combination of both strategies to save even more memory. We precisely model the costs and constraints corresponding to Deep Learning frameworks such as PyTorch or Tensorflow, we propose optimal algorithms to find a valid sequence of memory-constrained operations and finally, we evaluate the performance of proposed algorithms on realistic networks and computation platforms. Our experiments show that the possibility to offload can remove one third of the overhead of rematerialization, and that together they can reduce the memory used for activations by a factor 4 to 6, with an overhead below 20%.

5.6 Thursday 14 October 2021

5.6.1 Speakers: Jean-François David

5.6.2 Title: Realize an adaptive sampling algorithm for TFM imaging on GPU

5.6.3 Summary:

Non-destructive testing (NDT) is a set of methods to characterize the state of integrity of structures or materials without having recourse to their degradation. These verifications are generally performed by an operator using various instruments including ultrasonic and electromagnetic translators. These are characterized by the emission of a signal that interacts with the part to be inspected. The signal refracted towards the translator carries the signature of the specificities of the structure and the defects that may be present. This last signal must then be interpreted in order to make a decision as to the presence or absence of an or not of an anomaly. The non-destructive testing by ultrasound linked to the use of multi-element translators allows very efficient reconstruction methods such as the Total Focusing Method (TFM), which remains however greedy in computing time. In order to overcome this problem, we propose a method based on barycentric interpolation: from a mesh, we build an uneven mesh (called adaptive) that adapts to the topology of the area to be reconstructed, thus minimizing the interpolation errors.

5.6.4 Slides: Jean-Francois-14-10-21

5.7 Thursday 7 October 2021

5.7.1 Speakers: Xunyi Zhao

5.7.2 Title: Regularization Effect of Dropout

5.7.3 Summary:

As one of the most important regularization methods in deep learning, dropout has never been fully understood. The result that dropout does not work on the small datasets motivates us to study the regularization behavior of dropout. In this project, we try to understand how dropout regularizes deep learning models, especially what makes dropout not work with small datasets. Our focus lies in the classification task with fully connected deep neural networks, where standard dropout is applied in the training stage.

5.7.4 Slides: Xunyi-07-10-21

5.8 Thursday 30 September 2021

5.8.1 Speakers: Clement Richefort

5.8.2 Title: Multigrid methods applied to the Helmholtz equation

5.8.3 Summary:

Afin d'améliorer la surface équivalente radar de différentes plateformes, le CEA développe des codes de calcul simulant le comportement électromagnétique d’objets 3D complexes. L’un de ces codes couple une méthode éléments finis à une équation intégrale (utilisée comme condition de rayonnement exacte). La partie éléments finis est aujourd’hui traitée par une méthode de décomposition de domaine, chaque sous-domaine étant résolu à l’aide d’un solveur direct pour matrices creuses. Cependant, avec l’augmentation des capacités de calcul du CEA, les limitations de cette méthode apparaissent : l’augmentation du nombre de sous-domaines tend à dégrader la convergence de la méthode, et l’augmentation des tailles des sous-domaines rend l’utilisation d’un solveur direct compliquée, à cause du manque de scalabilité de ce type de solveur.

Ce travail s'inscrit dans une première étape de recherche d'une méthode alternative à la décomposition de domaine : les méthodes multigrilles. Le principe de cette méthode est l’utilisation d’une collection de problèmes grossiers qui permettent d’accélérer le calcul de la solution fine. C’est une méthode itérative, qui est éprouvée pour des problèmes elliptiques, et ayant une scalabilité optimale (la résolution est en \(O(n)\), \(n\) étant le nombre d’inconnues du problème). Cependant, il est connu que les méthodes multigrilles ne sont pas performantes pour des problèmes à noyaux oscillants tels que l’électromagnétisme ou l’acoustique (ces équations conduisant à des problèmes indéfinis). L'objectif du stage est donc de faire une première analyse de la méthode pour l'équation de Helmholtz (cas acoustique).

5.8.4 Slides: Clement-30-09-21

5.9 Thursday 23 September 2021

5.9.1 Speaker : Nolan Bredel

5.9.2 Title: Integration of low rank support in distributed memory in PaStiX

5.9.3 Summary:

At the beginning of the internship, the low rank support in PaStiX was not available in distributed memory. We will present how it has been implemented for the static and dynamic schedulers and for StarPU.

5.9.4 Slides: Nolan-23-09-21

5.9.5 Speaker : Tom Moënne-Loccoz

5.9.6 Title: Integration of Heteroprio scheduler for GPU performance into PaStiX-StarPU.

5.9.7 Summary:

Performance comparisons between PaStiX-StarPU and PaStiX-PaRSEC with GPUs show a possibility for improvement on StarPU's side. The objective of the internship is to increase these performances through the integration of the Heteroprio scheduler into PaStiX-StarPU.

5.9.8 Slides: Tom-23-09-21

5.10 Thursday 09 September 2021

5.10.1 Title: Coffee-break - Go-around the table

5.11 Thursday 08 July 2021

5.11.1 Title: Inria Skoltech Moliere Associated Team

5.12 Friday 02 July 2021

5.12.1 Title: SOLHARIS and HPC Scalabe Ecosystem half-days

5.12.2 Middleware 9:30 - 10:30 - chair Bora Uçar

  • Philippe Swartavagher: Interferences between Communications and Computations in Distributed HPC Systems
  • Antoine Jego: Task-based programming models for scalable distributed memory parallel algorithms
  • Arthur Chevalier: HPC-Big Data Convergence: A library to apply highly parallel algorithms on big data clusters

5.12.3 Scheduling: 11:00-11:40 - chair Lionel Eyraud-Dubois

  • Maxime Gonthier: Locality-Aware Scheduling of Independent Tasks for Runtime Systems
  • Mathieu Vérité: Makespan lower bound combining communication and computation for dynamic task allocation on homogeneous resource: dense Cholesky factorization test case

5.12.4 Applications: 12:00-13:00 - chair Cédric Augonnet

  • Marek Felsoci: Towards memory-aware multi-solve two-stage solver for coupled FEM/BEM systems
  • Sangeeth Simon: Task-based parallelization of a multi-dimensional, higher order, finite volume code for the Euler flows
  • Romain Peressoni: Folding as a fast heuristic to compare point clouds

5.13 Thursday 24 June 2021

5.13.1 Speaker: Olivier Beaumont [with Thomas Lambert, Loris Marchal, Bastien Thomas]

5.13.2 Title: Data-Locality Aware Dynamic Schedulers for Independent Tasks with Replicated Inputs

5.13.3 Summary:

Might be related to Qarnot Computing problems !

5.14 Thursday 17 June 2021

5.14.1 Speaker: Lionel Eyraud-Dubois

5.14.2 Title: Rotor Tutorial

5.14.3 Summary:

I will present a tutorial on how to use Rotor to perform Deep Learning with PyTorch, while limiting the memory usage.

5.14.4 Slides: Lionel-17-06-21

5.15 Thursday 10 June 2021

5.15.1 Speaker: Nolan Bredel

5.15.2 Title: Présentation des nouvelles fonctionnalités dans ViTE

5.15.3 Summary:

Un PFA sur ViTE a été réalisé cette année à l'Enseirb dans le but de le mettre à jour et d'ajouter plusieurs fonctionnalités. Ce travail s'est articulé autour de 3 axes :

  • intégration du moteur de rendu Vulkan,
  • intégration d'un plugin de statistique sur la trace,
  • mise à jour du plugin de visualisation de matrice.

5.15.4 Slides: Nolan-10-06-21

5.16 Friday 04 June 2021

5.16.1 Speaker: Clément Gavoille (TADaaM Talk)

5.16.2 Title: Modélisation et projection de performances d’applications parallèles sur environnement ARM

5.16.3 Summary:

L’évolution des architectures de processeurs rend la prédiction et l’évaluation des performances d’une application parallèle complexes. En effet, l’augmentation du nombre de cœurs de calcul, la multiplication des unités vectorielles et l’organisation interne du réseau en maillage influencent grandement le comportement des processeurs. Néanmoins, afin de préparer l’arrivée de futures machines, il est nécessaire d’avoir une idée des performances (e.g., temps d’exécution ou nombre d’opérations flottantes par seconde) des applications parallèles actuelles sur les futures générations de processeurs. Dans ce cadre, le CEA et ARM souhaitent développer une méthodologie de prédiction de performance : étant donné le comportement d’une application parallèle sur un processeur existant et les caractéristiques d’un processeur hypothétique, le but est de prédire les performances de cette application sur ce dernier. L’objectif de cette thèse est de définir un modèle de projection de performance d’applications parallèles basé sur l’évolution des processeurs incluant les changements des unités de calcul (par exemple, jeu d’instructions SVE), l’évolution du nombre de cœurs et l’augmentation des capacités mémoire (DDR, HBM, …). Ceci passera par l’expérimentation sur des machines existantes afin de valider le modèle et sur des variations de processeurs connus (changement de fréquence, changement du nombre de cœurs…). Cette phase permettra alors d’étudier l’impact architectural sur les performances. Une fois cette première étude effectuée, il sera alors possible de faire évoluer ce modèle pour raffiner les prédictions de performances notamment en cas de changements micro-architecturaux (ce qui est le cas lors d’un changement entre générations quasi-identique).

5.17 Thursday 27 May 2021

5.17.1 Speaker: Pavel Kus, Berenger Bramas

5.17.2 Title: Towards a parallel domain decomposition solver for immersed boundary finite element method

5.17.3 Slides: Kus-27-05-21

5.18 Thursday 20 May 2021

5.18.1 Speaker: Suraj Kumar

5.18.2 Title: Parallel Tensor Train through Hierarchical Decomposition

5.18.3 Summary:

We consider the problem of developing parallel decomposition and approximation algorithms for high dimensional tensors. We focus on a tensor representation named Tensor Train (TT). It stores a d-dimensional tensor in O(ndr2), much less than the O(nd) entries in the original tensor, where 'r' is usually a very small number and depends on the application. Sequential algorithms to compute TT decomposition and TT approximation of a tensor have been proposed in the literature. We propose a parallel algorithm to compute TT decomposition of a tensor. We prove that the ranks of TT-representation produced by our algorithm are bounded by the ranks of unfolding matrices of the tensor. Additionally, we propose a parallel algorithm to compute approximation of a tensor in TT-representation. Our algorithm relies on a hierarchical partitioning of the dimensions of the tensor in a balanced binary tree shape and transmission of leading singular values of associated unfolding matrix from the parent to its children. We consider several approaches on the basis of how leading singular values are transmitted in the tree. We present an in-depth experimental analysis of our approaches for different low rank tensors and also assess them for tensors obtained from quantum chemistry simulations. Our results show that the approach which transmits leading singular values to both of its children performs better in practice. Compression ratios and accuracies of the approximations obtained by our approaches are comparable with the sequential algorithm and, in some cases, even better than that.

5.18.4 Slides: Suraj-20-05-21

5.19 Thursday 6 May 2021

5.19.1 Speaker: Olivier Beaumont, Lionel Eyraud-Dubois, Alena Shilova

5.19.2 Title: Pipelined Model Parallelism: Complexity Results and Memory Considerations

5.19.3 Summary:

The training phase in Deep Neural Networks has become an important source of computing resource usage and because of the resulting volume of computation, it is crucial to perform it efficiently on parallel architectures. Even today, data parallelism is the most widely used method, but the associated requirement to replicate all the weights on the totality of computation resources poses problems of memory at the level of each node and of collective communications at the level of the platform. In this context, the model parallelism, which consists in distributing the different layers of the network over the computing nodes, is an attractive alternative. Indeed, it is expected to better distribute weights (to cope with memory problems) and it does not imply large collective communications since only forward activations are communicated. However, to be efficient, it must be combined with a pipelined / streaming approach, which leads in turn to new memory costs. The goal of this paper is to model these memory costs in detail, to analyze the complexity of the associated throughput optimization problem under memory constraints and to show that it is possible to formalize this optimization problem as an Integer Linear Program (ILP).

5.19.4 Slides: Olivier-06-05-21

5.19.5 Ressources:

5.20 Thursday 12 April 2021

5.20.1 Speaker: Gwenolé Lucas (STORM Talk)

5.20.2 Title: Programming heterogeneous architectures using divisible tasks

5.20.3 Summary:

Task-based runtime systems are used to face the prevalence of heterogeneous architectures and exploit their computational power. As such, StarPU handles the communications and scheduling operations of the task-graphs users provide. However, it is confronted to some limitations, as it works on graphs that are entirely and sequentially submitted at startup, leading to a significant overhead and a static graph. To address these problems, we introduce hierarchical tasks to StarPU, which can insert subgraphs when they are executed. This approach allows for more dynamic graphs, that can evolve at runtime as subgraphs are inserted, for example to modify its granularity locally in order to meet the optimal grain of the different nodes. Similarly, submitting the initial graph with a large grain (and therefore fewer tasks) and then refining it to a smaller grain at runtime by adding subgraphs, could help reduce the aforementioned overhead, as the subgraphs of independent hierarchical tasks can be submitted in parallel.

5.21 Thursday 8 April 2021

5.21.1 Speaker: Bertrand Cabot (IDRIS)

5.21.2 Title: Data Parallelism - Model Parallelism - Hybrid Parallelism

5.21.3 Summary:

  • We will continue the activity on Deep Neural Networks and we will watch a part of the Video from Bertrand Cabot about parallelism in DNN.
  • Then, Rémi will summarize the beginning of the session 7 with information about the CNRS/IDRIS Jean Zay calculation plateform.

5.21.4 Ressources:

5.22 Thursday 1 April 2021

5.22.1 Speaker: Jean-Luc Parouty

5.22.2 Title: Keras and Convolutional Neural Networks

5.22.3 Summary:

  • We will continue the activity on Deep Neural Networks and we will watch a video from Jean-Luc Parouty about Convolutional Neural Networks (sequence 2.2) with practical examples using Keras.
  • Then, Rémi will summarize the end of the session 2 with information about how to explore hyper-parameters with Keras.
  • It is not absolutely necessary, but everyone should be able to watch the introduction on CNN (sequence 2.1).

5.22.4 Ressources:

5.23 Thursday 25 March 2021

5.23.1 Speaker: Jean-Luc Parouty

5.23.2 Title: History and basic concepts of neural networks

5.23.3 Summary:

  • As a prerequisite, everyone should be able to watch the introduction of the class (7min) and the one on the tools (18min).
  • The video 1.2 "De la régression linéaire au neurone artificiel" (22min) will be played at the beginning of the working group for the first reactions or questions.
  • Depending on the remaining time we will also watch the following 1.3 "La controverse des neurones".

5.23.4 Ressources:

5.24 Thursday 18 March 2021

5.24.1 Speaker: Mathieu Faverge

5.24.2 Title: Large scale SVD using polar decomposition

5.24.3 Summary:

5.24.4 Slides: Mathieu-18-03-21

5.25 Thursday 11 March 2021

5.25.1 Speaker: Hervé Mathieu

5.25.2 Title: Efficacité énergétique d'un programme en HPC

5.25.3 Summary:

La page donne des éléments sur ce qui a amené à (et ce qu'est) l'outil `energyscope`.

5.25.4 Slides: Herve-11-03-21

5.26 Thursday 4 March 2021

5.26.1 Speaker: Julien Lelaurain

5.26.2 Title: Connaissez vous la Plateforme Fédérative pour la Recherche en Informatique et Mathématiques (PlaFRIM) ?

5.26.3 Summary:

Au programme :

  • Présentation générale
  • Présentation technique
  • Prise en main de SLURM et des modules
  • Prise en main de GUIX

Pour en savoir plus sur la plateforme, c'est ici https://www.plafrim.fr/ !

5.27 Thursday 25 February 2021

5.27.1 Speaker: Mathieu Vérité

5.27.2 Title: Effect Of Communications On Dynamic Allocation For Distributed Memory Execution Of Dense Tiled Linear Algebra Operations: Cholesky Factorization

5.27.3 Summary:

In the context of large scale linear algebra operations, distributed execution on several computation resources require careful distribution of data since transfer over the network connecting those resources is often more costly than computation tasks. Though static allocations currently is the only practical implementation for distributed resolution, one can consider the potential benefit of applying dynamic runtime scheduler strategies at resource level to simultaneously handle the concurrent difficulty of balancing load between resources while minimizing data transfers under contention constraint. Using Cholesky factorization as a study case, we propose to investigate this question by testing out StarPU dynamic allocation and scheduling strategies in simple configurations mimicking typical architectures of parallel computation platform at node or global level. The performance of the strategies are evaluated in comparison with various bounds based on considerations about load balancing and minimum necessary data transfers. Results tend to show that even straightforward adaptations of StarPU allocation and scheduling strategies can outperform the classical 2D block cyclic method which is yet the default data allocation strategy widely used for dense tiled cases. The good performance of some dynamic schedulers may help lead the way to better designed 2.5D or 3D static allocation strategies. On a side notice, the experiments conducted in this study put some focus on the lack of a theoretical bound taking into account both workload and data transfers in the case of distributed execution on several homogeneous resources sharing a communication medium.

5.27.4 Slides: Mathieu-25-02-21

5.28 Thursday 11 February 2021

5.28.1 Speaker: Tony Delarue

5.28.2 Title: Hands-on tutorial on PaStiX with GPU

5.28.3 Summary:

As the core of a large number of simulation tools, the resolution of large linear systems often represents the dominant part of the computing time. Massively parallel versions are needed to maintain advances in multi-physics and multi-scale simulations, especially when targeting exascale platforms. The aim is therefore to address the major challenge of designing and building numerically robust solvers on runtime systems that can scale up and push back the limits of existing industrial codes of the EoCoE project. In this talk, we will present a hands-on tutorial on the PaStiX solver about the capacity to exploit modern GPU accelerators through runtime systems.

5.28.4 Slides: Tony-11-02-21

5.29 Thursday 4 February 2021

5.29.1 Speaker: Alena Shilova

5.29.2 Title: Memory Saving Strategies for Deep Neural Network Training

5.29.3 Summary:

Training Deep Neural Networks is known to be an expensive operation, both in terms of computational cost and memory load. Indeed, during training, all intermediate layer outputs (called activations) computed during the forward phase must be stored until the corresponding gradient has been computed in the backward phase. These memory requirements sometimes prevent to consider larger batch sizes and deeper networks, so that they can limit both convergence speed and accuracy. Recent works have proposed several ways to deal with this issue. The first one consists in discarding some activations and then recompute them later when needed, which is also known as Rematerialization. The second one is Offloading some of the computed forward activations from the memory of the GPU to the memory of the CPU. Both approaches require a careful choice of candidates among activations, which should be recomputed/offloaded and then scheduling operations/transfers so that the overhead was small. For both problems we present our solutions and we compare our solutions with previous state-of-the-art, showing that they achieve much better performance in a wide variety of situations.

5.29.4 Slides: Alena-04-02-21

5.30 Thursday 28 January 2021

5.30.1 Speaker: Esragul Korkmaz

5.30.2 Title: Deciding Non-Compressible Blocks in Sparse Direct Solvers using Incomplete Factorization

5.30.3 Summary:

In the recent years, many works have tried to enhance sparse direct solvers by introducing computation using low-rank compression formats. This approach has been shown very promising for reducing memory footprint and execution time on a large spectrum of solvers. Among them, sparse direct solvers based on supernodal approaches despite providing a better scalability, suffer from the low-rank update step. For this family of solver, the low-rank format, while reducing greatly the memory footprint, does not have the expected impact on the execution time. In this paper, we study two solutions to overcome this issue. The first one studies the impact of K-Way partitioning to reorder the unknowns in this context. The second solution tackles the overprice of the low-rank updates by identifying the blocks that have poor compression rates. It is expected that a good identification of these blocks would reduce the flops overhead that may appear in the Block Low-Rank updates. We show that block incomplete LU factorization allows identifying at low cost most of the non-compressible blocks by their fill-in levels. This identification enables to postpone the block compression step to trade small extra memory consumption for a better time to solution. Both solutions are validated within the PaStiX library with a large set of application matrices and demonstrate sequential and multi-threaded speedup up to 84%, for small memory overhead of less than 3% with respect to the original version.

5.30.4 Slides: Esra-28-01-21

Date: 27/06/2024

Auteur: Pierre Ramet

Created: 2024-06-27 Thu 11:35

Validate