Author: Jorge Ejarque (BSC)
High-Performance Computing (HPC) has been used to provide computational resources, software environments and programming models to enable the execution of large-scale e-science applications, i.e., applications whose objective is to generate predictions of real processes such as weather forecasting or protein interaction modelling. Recently, with the introduction of Big Data and Machine Learning (ML) technologies, e-science applications have evolved into more complex workflows where traditional HPC simulations are combined with Data Analytics (DA) and ML algorithms.
However, implementing such combined applications requires a considerable engineering effort in terms of deployment and integration of the HPC, DA and ML parts of the application. First, each part uses its own frameworks and libraries: HPC simulators are implemented with parallel programming models, DA algorithms are mainly expressed as transformations and actions provided by specific DA frameworks, and ML models are created and applied with yet other dedicated frameworks.
Therefore, to integrate them into a single application, developers have to spend considerable effort implementing glue code to coordinate the execution and to exchange data between the application components. At the deployment and operation phase, the effort is duplicated because the HPC, Big Data and ML environments must all be installed, configured and run at the same time; this becomes a hard task when the process has to be repeated manually on several supercomputers, each with its own architecture and restrictions.
The eFlows4HPC project proposes a software stack and a methodology to improve this situation. It aims at simplifying the development, deployment and execution of these complex workflows on federated computing infrastructures. The eFlows4HPC software stack will be composed of existing software components, integrated and organised in different layers as shown in the figure above. The first layer consists of a set of services, repositories, catalogues and registries to facilitate the accessibility and re-usability of the implemented workflows (Workflow Registry), of their core software components such as HPC libraries and DA/ML frameworks (Software Catalog), and of their data sources and results such as ML models (Data Registry and Model Repository).
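As an illustration of how these catalogues fit together, the snippet below sketches the kind of metadata record a workflow entry might carry, linking it to its software components and data sources. This is a purely hypothetical record expressed as a Python dictionary; the field names and identifiers are illustrative and do not reflect the actual registry schema.

```python
# Hypothetical metadata record for a workflow entry; field names and values
# are illustrative only, not the actual eFlows4HPC registry schema.
workflow_entry = {
    "name": "climate-simulation-and-analysis",       # entry in the Workflow Registry
    "software": [                                     # references into the Software Catalog
        "hpc-simulator:2.1",
        "pycompss-runtime:3.0",
        "ml-training-framework:1.4",
    ],
    "data_sources": ["climate-observations-2020"],    # references into the Data Registry
    "outputs": ["trained-surrogate-model"],           # published to the Model Repository
}
```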
The second layer provides the syntax and programming models to implement these complex workflows combining typical HPC simulations with HPDA and ML. A workflow implementation consists of three main parts: a description of how the software components are deployed in the infrastructure (provided by an extended TOSCA definition); the functional programming of the parallel workflow (provided by the PyCOMPSs programming model); and data logistics pipelines that describe the data movements needed to ensure that the workflow data is available in the computing infrastructure when required.
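To give a flavour of the second part, the following is a minimal PyCOMPSs-style sketch of a workflow that chains an HPC simulation step with an ML training step. The task names, file names and function bodies are illustrative placeholders rather than an actual eFlows4HPC workflow; in a real application the simulation task would typically wrap an MPI application or binary (e.g. with PyCOMPSs' @mpi or @binary decorators).

```python
from pycompss.api.task import task
from pycompss.api.parameter import FILE_IN, FILE_OUT
from pycompss.api.api import compss_wait_on


@task(input_cfg=FILE_IN, result=FILE_OUT)
def run_simulation(input_cfg, result):
    # Placeholder for the HPC simulation part; a real task would typically
    # wrap an MPI application or a binary simulator here.
    with open(result, "w") as out:
        out.write("simulation results for {}\n".format(input_cfg))


@task(sim_output=FILE_IN, returns=1)
def train_model(sim_output):
    # Placeholder for the DA/ML part, consuming the simulation output.
    with open(sim_output) as data:
        return {"model": "trained on {} bytes".format(len(data.read()))}


def workflow(configs):
    models = []
    for i, cfg in enumerate(configs):
        sim_out = "sim_output_{}.dat".format(i)
        run_simulation(cfg, sim_out)         # tasks are spawned asynchronously
        models.append(train_model(sim_out))  # dependency on sim_out is tracked by the runtime
    return compss_wait_on(models)            # synchronise and collect the trained models
```

The PyCOMPSs runtime builds the task dependency graph from the parameter directions (FILE_IN/FILE_OUT) and executes independent tasks in parallel on the available computing resources.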
Finally, the lowest layer provides the functionality to deploy and execute the workflow according to the provided description. On the computing side, it offers the components to orchestrate the deployment and to coordinate the execution of the workflow components across federated computing infrastructures. On the data management side, it provides a set of components to manage and simplify the integration of large volumes of data from different sources and locations with the workflow execution.
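To illustrate the data management side, the data logistics pipelines mentioned above can be thought of as DAGs of data-movement steps run before and after the workflow execution. The sketch below uses Apache Airflow purely as an example of a pipeline engine; the task names, transfer mechanisms and trigger mode are placeholders, and the actual eFlows4HPC data logistics tooling may differ.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def stage_in(**context):
    # Placeholder: fetch the input datasets from their remote sources and
    # place them on the target site's storage before the workflow runs.
    pass


def stage_out(**context):
    # Placeholder: publish the results (e.g. trained ML models) back to the
    # corresponding repositories after the workflow has finished.
    pass


with DAG(
    dag_id="example_data_logistics",  # hypothetical pipeline name
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,           # triggered on demand rather than periodically
    catchup=False,
) as dag:
    move_inputs = PythonOperator(task_id="stage_in", python_callable=stage_in)
    publish_results = PythonOperator(task_id="stage_out", python_callable=stage_out)

    move_inputs >> publish_results    # stage-out only runs after stage-in in this simplified pipeline
```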