About this role
Skills:
Job Scheduling: PBS Professional, Slurm
Monitoring: Grafana, Nagios, Prometheus, Ganglia
Cluster Management: Bright Cluster Manager, xCAT, Puppet (for IaC cluster management)
Networking: InfiniBand
Profiling and Debugging Tools: Intel VTune, Valgrind, gprof
Application Support: GNU, Intel, and CUDA compilers; MKL libraries; MPI and OpenMP libraries
Virtualization: Proxmox, FlexLM
Storage: Parallel filesystems, enterprise object storage
Operating Systems: Red Hat Linux
Cloud: Provisioning skills in AWS, Azure, and GCP
GPU Support: CUDA, ROCm, and OpenCL GPU acceleration
Containerization: Docker, Kubernetes (K8s), Singularity
AI/ML Libraries: TensorFlow, PyTorch, scikit-learn, MXNet
MLOps: MLflow, Kubeflow, AI/ML pipelines
Infrastructure as Code: Terraform, Pulumi, CloudFormation

Technical Skills:

High-Performance Computing
Hands-on experience managing HPC clusters with job schedulers, cluster management tools, parallel programming libraries, and parallel filesystems. Knowledge of resource scheduling and job optimization for efficient workload management.

InfiniBand (Networking)
Hands-on experience with high-throughput, low-latency interconnect technologies such as InfiniBand. Ability to design, configure, and troubleshoot interconnects in HPC or distributed environments.

Operating Systems and Environments
Administration and configuration of RHEL-based systems. Performance tuning, package management, and security hardening. Knowledge of Red Hat Satellite and Ansible for automation.

Job Scheduling with PBS Professional
Experience deploying and managing PBS Professional for scheduling and workload management in HPC environments. Customizing job submission scripts and optimizing job queues.

Parallel Programming Libraries
MPI (Message Passing Interface) and OpenMP (Open Multi-Processing): Proficiency in writing, debugging, and optimizing parallelized code. Experience scaling applications across HPC systems. Understanding of the distributed-memory (MPI) and shared-memory (OpenMP) paradigms.

Cloud Platforms
AWS, Azure, Google Cloud: Expertise in provisioning, configuring, and managing services on all three platforms. Knowledge of cross-platform migration and hybrid cloud solutions. Proficiency in managing high-performance computing (HPC) clusters in the cloud. Deep understanding of cost optimization, security, and cloud-native development tools (e.g., Kubernetes, Terraform).

Infrastructure as Code (IaC)
Ability to design, deploy, and maintain infrastructure using automation and configuration management tools. CI/CD pipeline integration for IaC workflows.

GPU & AI Libraries and Tools
Hands-on experience with container technologies. Hands-on experience with TensorFlow, PyTorch, scikit-learn, Keras, or MXNet. Familiarity with AI/ML pipelines, model training, and optimization. Knowledge of MLOps tools for deploying and monitoring models.