Devito Cluster - the whys and the whats
By Devito Codes
One of Devito’s core missions is performance portability. Our current roadmap targets the matrix of architectures and parallel programming models below.
| Architecture | CUDA | HIP | MPI | OpenACC | OpenMP |
|---|---|---|---|---|---|
| AMD/CPU | | | Yes | | Yes |
| AMD/GPU | | Nov'22 | Yes | N/A | |
| ARM | | | Yes | N/A | Yes |
| Intel/CPU | | | Yes | N/A | Yes |
| Intel/KNC,KNL | | | Yes | N/A | Yes |
| Intel/GPU | | | TBA | | TBA |
| NVidia | Yes | | Yes | Yes | Yes |
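To make the matrix above concrete, here is a minimal sketch (illustrative, not from the original post) of how the same Devito code retargets across platforms: the symbolic problem stays unchanged, while settings such as `configuration['language']` select the parallel programming model used by the generated code.

```python
# Minimal sketch: one symbolic problem, retargeted via configuration.
from devito import Grid, TimeFunction, Eq, Operator, configuration

# Select shared-memory CPU parallelism; choosing 'openacc' instead
# (with a GPU platform and a suitable compiler configured) retargets
# the same operator to a GPU.
configuration['language'] = 'openmp'

grid = Grid(shape=(128, 128))
u = TimeFunction(name='u', grid=grid, space_order=2)

# A simple diffusion-like stencil; Devito generates and JIT-compiles
# the corresponding C code for the selected target.
op = Operator(Eq(u.forward, u + 0.1 * u.laplace))
op.apply(time_M=10)
```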
To support this vision, Devito Codes has, since January 2020, maintained a distributed cluster as part of our DevOps infrastructure for both open-source Devito and DevitoPRO. The cluster provides services such as:
- Testing (i.e. CI) 1.
- Performance optimization on ‘bare metal’ target platforms.
- Performance benchmarking 2.
- Deployment (i.e. CD via Docker and release mirrors).
While Devito Codes invested directly in hardware at the outset, long-term loans and donations now make up the majority of the cluster 3. At the time of writing, the cluster comprises:
| Supplier | Units | CPU | GPU | Sponsor |
|---|---|---|---|---|
| Dell | 1 | AMD | 4 x A100 (80G) | NVidia & Dell |
| HP | 1 | Intel Xeon | | Devito Codes |
| Fujitsu | 1 | ARM64 | | Fujitsu |
| self (custom build) | 1 | AMD | 2 x MI50 | Devito Codes (server) & AMD (GPUs) |
| self (custom build) | 4 | Intel/AMD PC CPUs | 3 x RTX3090, RTX3080 | Devito Codes |
| Supermicro | 4 | Intel Xeon Gold | 4 x V100 | NVidia & Supermicro |
While we were optimizing for the Intel KNC and KNL, [DUG](https://dug.com/) provided us with access to nodes for DevOps. Today they run their own DevOps on KNC and KNL for DUG Wave, which depends upon Devito. So far this has caused no problems, so long as we monitor for performance regressions on Intel Xeons running with OpenMP.
We do all our deployment on the cluster using Docker containers, which ensures that testing and deployment use the same environment. This has also proved useful for performance debugging on Cloud platforms, such as AWS and Azure, because we can differentiate between Docker-related issues and issues with the underlying platform.
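As a sketch of that idea (using the docker-py SDK; the image tag and test command below are assumptions for illustration, not necessarily what we publish), a CI job can run the test suite inside the very image that is later deployed:

```python
# Hypothetical sketch: test inside the same image that will be deployed,
# so CI and deployment share one environment.
import docker

client = docker.from_env()

# Run pytest inside the candidate image; a non-zero exit status raises
# ContainerError, failing the pipeline before the image is released.
logs = client.containers.run(
    "devitocodes/devito:latest",   # assumed image tag, for illustration
    "python -m pytest tests/",     # the same suite CI runs on bare metal
    remove=True,                   # clean up the container afterwards
)
print(logs.decode())
```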
Early on we relied on Cloud computing nodes rather than our own hardware. However, we ran into a number of issues:
- Sponsored Cloud credits severely limited which architectures we were able to test on - in particular, we could not use any modern GPUs.
- It was difficult to control costs. We make heavy use of GitHub Actions for automation, and self-hosted runners require an always-on client on each machine. In the Cloud, this always-on model meant we racked up an eye-watering bill at the end of each month - in fact, this is what provided the initial motivation to build our own cluster.
There is no doubt that we will revisit the Cloud in the future with a smarter strategy to control costs. However, we are committed to continuing with our strategy of maintaining our own hardware so we can always drill down to bare metal when we are looking for maximum performance.