About this role
Key Responsibilities
• Design and develop compute cluster architectures optimized for performance, reliability, scalability, and serviceability within KLA systems.
• Define and validate server hardware configurations, including CPUs, GPUs, memory subsystems, storage, networking, and specialized accelerators.
• Analyze and optimize system-level performance across hardware and software layers, including CPU/GPU utilization, memory bandwidth, PCIe topology, NUMA architecture, and I/O performance.
• Collaborate with hardware, software, firmware, and systems engineering teams to ensure seamless integration of compute clusters into broader system architectures.
• Support server bring-up, hardware integration, diagnostics, benchmarking, stress testing, and root-cause analysis activities.
• Manage and troubleshoot enterprise server platforms, including BIOS/firmware configuration, BMC/IPMI management, thermal and power optimization, and hardware health monitoring.
• Participate in architecture reviews, integration planning, technical discussions, and cross-functional problem-solving sessions.
• Create and maintain technical documentation for hardware design decisions, validation procedures, deployment standards, and troubleshooting workflows.

Required Skills & Qualifications
• Strong experience in computer hardware and system architecture design, particularly in compute clusters, HPC environments, or enterprise server platforms.
• Deep understanding of modern CPU and GPU architectures, including multicore processing, NUMA, PCIe, memory hierarchy, and hardware-software interactions.
• Experience with GPU-accelerated systems and accelerator integration (e.g., NVIDIA GPU platforms, CUDA environments, or similar technologies).
• Hands-on experience with Linux system administration and OS customization (preferably SUSE Linux Enterprise Server).
• Familiarity with enterprise server management technologies such as BIOS/UEFI, BMC, IPMI, iDRAC, or similar remote management tools.
• Understanding of distributed systems, high-performance networking, and cluster infrastructure technologies such as InfiniBand, RDMA, or high-speed Ethernet.
• Experience with system performance tuning, hardware validation, benchmarking, and low-level troubleshooting.
• Strong analytical, documentation, and communication skills.

Preferred Qualifications
• Experience in high-performance computing (HPC), AI/ML infrastructure, or large-scale distributed compute environments.
• Familiarity with server hardware bring-up, failure analysis, thermal/power optimization, and reliability engineering.
• Exposure to hardware diagnostic and monitoring tools for server and cluster environments.
• Understanding of storage architectures, parallel file systems, and distributed storage solutions.
• Experience working in cross-functional engineering teams across hardware, firmware, and software domains.
• Test-driven and detail-oriented engineering mindset with strong problem-solving skills.
• Self-motivated individual with a proactive approach to continuous improvement and technical innovation.