About this role
HPC Lead / HPC Architect (Linux, GPU, Cloud HPC) – 8+ Years Experience

Role Overview
We are looking for an experienced HPC Lead / HPC Architect to design, implement, and manage large-scale High Performance Computing (HPC) environments. This role will drive architecture, performance optimization, and transformation initiatives across compute (CPU/GPU), storage, and high-speed networking, supporting AI/ML, research, and enterprise workloads.

Key Responsibilities
• Lead architecture, design, and deployment of HPC clusters (CPU & GPU computing environments)
• Own the end-to-end HPC lifecycle: design, build, deployment, operations, and optimization
• Define and manage job scheduling systems (Slurm, PBS, LSF) and workload orchestration
• Drive performance tuning, benchmarking, and optimization for compute-intensive workloads (AI/ML, simulation, analytics)
• Architect and manage high-performance storage systems (Lustre, GPFS/IBM Spectrum Scale, BeeGFS)
• Design and implement low-latency, high-throughput networking (InfiniBand, RDMA, high-speed Ethernet)
• Lead hybrid and cloud HPC integration (AWS, Azure, GCP HPC solutions)
• Build and maintain automation frameworks (Ansible, Terraform, Infrastructure as Code, scripting)
• Implement monitoring, observability, logging, and capacity planning (Prometheus, Grafana, ELK, AIOps tools)
• Ensure security, compliance, identity and access management (IAM, LDAP/AD)
• Collaborate with research teams, data scientists, application owners, and business stakeholders
• Mentor and lead HPC engineers, system administrators, and infrastructure teams
• Drive innovation in GPU computing, AI/ML infrastructure, and advanced automation

Required Skills & Experience
• 8+ years of experience in HPC, Infrastructure Engineering, or Linux System Engineering
• Proven expertise in:
  ◦ HPC cluster architecture, deployment, and operations
  ◦ Linux/Unix systems administration at scale (RHEL, CentOS, Ubuntu)
  ◦ CPU and GPU computing environments (NVIDIA GPU, CUDA preferred)
• Strong hands-on experience with:
  ◦ Job schedulers (Slurm, PBS, LSF)
  ◦ High-performance distributed storage (Lustre, GPFS, BeeGFS)
  ◦ Networking (TCP/IP, DNS, InfiniBand, RDMA, low-latency fabrics)
• Experience in automation and scripting (Python, Bash/Shell, Ansible, Terraform)
• Knowledge of cloud HPC architectures (AWS ParallelCluster, Azure CycleCloud, GCP HPC)

Preferred / Nice-to-Have Skills
• Experience with AI/ML infrastructure, deep learning workloads, or research computing
• Familiarity with containerization and orchestration (Docker, Kubernetes for HPC workloads)
• Exposure to observability platforms, AIOps, and predictive monitoring
• Experience in large-scale enterprise, research institutes, or university HPC environments
• Knowledge of DevOps / Platform Engineering practices

Leadership & Profile
• Strong stakeholder management and cross-functional collaboration skills
• Ability to translate business or research requirements into scalable HPC architecture
• Proven experience in team leadership, mentoring, and technical decision-making
• Strategic mindset with hands-on technical depth in HPC systems and infrastructure