About this role
Job Description

We are seeking a highly skilled engineer to design, build, and operate production-grade Agentic AI and Generative AI systems end-to-end.

This role focuses on delivering scalable, secure, and observable services (not prototypes) by combining strong software engineering principles with modern AI practices such as RAG, multi-agent orchestration, evaluation frameworks, and policy-driven architectures.

You will work on building robust APIs, reusable components, and enterprise-grade pipelines that integrate LLMs, tools, and business systems to deliver measurable outcomes.

Key Responsibilities

1. Agent & Application Engineering
· Design and implement multi-agent systems (MAS) with planning, tool usage, and delegation (e.g., LangGraph, Semantic Kernel).
· Develop and expose services via REST/gRPC APIs using frameworks such as FastAPI, Express, Java, or Go.
· Build secure tool adapters (SQL, search, document stores, APIs, code execution) with strict type contracts and sandboxing.
· Integrate LLM gateways (OpenAI, Azure OpenAI, Bedrock, Vertex AI, or self-hosted models like vLLM/TGI) with routing, retries, rate-limiting, and fallback strategies.

2. Retrieval, Data & Knowledge Systems
· Build and optimize RAG pipelines: chunking, embedding, indexing, and hybrid/vector search.
· Implement data ingestion pipelines using Airflow, Prefect, Celery, or Ray.
· Process structured and unstructured data sources (documents, chat logs, CRM/ERP systems).
· Apply PII redaction, metadata governance, and compliance controls.
· Improve retrieval quality using re-ranking, query rewriting, and evaluation frameworks.

3. Quality, Testing & Evaluation
· Treat prompts and workflows as code: versioning, testing, and regression management.
· Develop evaluation frameworks covering latency, cost, accuracy, hallucination, and safety metrics.
· Build automated CI-integrated test harnesses (golden datasets, regression suites).
· Implement drift detection, rollback mechanisms, and fail-safe controls.

4. Platform Engineering & Operations
· Containerize services and deploy to Kubernetes using Helm and Argo CD.
· Configure autoscaling (HPA/VPA) and resource optimization.
· Implement policy enforcement and guardrails (OPA/Gatekeeper, Presidio, Trivy).
· Establish deep observability using OpenTelemetry, Prometheus, ELK/OpenSearch.
· Monitor cost, performance, and system health across models and services.

5. Security & Compliance
· Manage secrets using Vault/KMS and enforce least-privilege access.
· Ensure secure software supply chain (image signing, SBOMs).
· Design multi-tenant architectures with data isolation and residency controls.
· Implement AI safety mechanisms, including jailbreak defense and adversarial testing.

6. Integration & Enterprise Workflows
· Develop integrations with SAP, CRM, ITSM platforms, and event-driven systems (Kafka).
· Build idempotent, resilient processors for distributed systems.
· Automate business workflows with scalable, pro-code solutions.

7. Collaboration & Leadership
· Partner with Product, Data, and Platform teams to define SLAs, SLOs, and success metrics.
· Lead design reviews, architecture discussions, and postmortems.
· Mentor engineers on production-grade AI engineering practices.

Qualifications

Required
· 6+ years of software engineering experience building production-grade systems.
· Proven experience leading small projects or technical initiatives.
· Strong proficiency in at least one language: Python, TypeScript, Go, or Java.
· Hands-on experience with:
  · Containers & Kubernetes
  · CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins)
  · Infrastructure as Code (Terraform)
  · Cloud platforms (AWS, Azure, or GCP)
· Practical experience building LLM-based applications (RAG, agents, prompt engineering).
· Solid data engineering skills: SQL, pipelines, search indexes, vector databases.
· Strong testing mindset: unit, integration, and performance testing.
· Understanding of security fundamentals (IAM, secrets management, policy enforcement).

Preferred
· Production experience with LangGraph, Semantic Kernel, or similar orchestration frameworks.
· Experience with distributed compute frameworks (e.g., Ray).
· Expertise in search optimization (BM25 + vector hybrid, re-ranking strategies).
· Strong background in observability and performance tuning.
· Experience in regulated or high-scale industries (finance, telco, healthcare).
· Knowledge of event-driven architectures (Kafka, Debezium).
· Familiarity with OPA/Gatekeeper and enterprise policy frameworks.
· Exposure to TM Forum APIs / BSS-OSS architectures (for telco environments).

Technology Stack (Indicative)
· Languages: Python, TypeScript/Node.js (Go/Java as a plus)
· Frameworks: FastAPI, Express, LangGraph, Semantic Kernel, Ray, Celery, Airflow, Prefect
· Data & Search: PostgreSQL, Redis, S3/Blob Storage, pgvector, Pinecone, Weaviate, OpenSearch
· LLM Platforms: OpenAI, Azure OpenAI, AWS Bedrock, Google Vertex AI, vLLM, TGI
· Platform & DevOps: Docker, Kubernetes, Helm, Argo CD, Terraform, Vault, Istio
· Observability: OpenTelemetry, Prometheus, Grafana, ELK/OpenSearch
· Quality & Safety: pytest, Jest, prompt testing frameworks, Presidio, Trivy