About this role
Responsibilities

Incident & Application Support
● Provide second-line (L2) support for production and staging systems, handling escalations from L1 Support.
● Investigate application errors, system alerts, performance degradation, and integration issues.
● Restore services within agreed SLA/OLA timelines and ensure proper incident closure.

Troubleshooting & Root Cause Analysis
● Perform in-depth troubleshooting using logs, metrics, and monitoring tools.
● Conduct root cause analysis (RCA) for recurring or high-impact incidents.
● Propose and implement corrective and preventive actions to reduce incident recurrence.

Collaboration & Escalation
● Work closely with L3 engineers, DevOps, and vendors to resolve complex technical issues.
● Provide clear technical findings, logs, and evidence when escalating issues.
● Participate in incident bridges, post-incident reviews, and operational discussions.

Operational Excellence
● Monitor system health, alerts, dashboards, and logs to proactively identify issues.
● Execute approved configuration changes, patches, and operational fixes.
● Support deployment, release, and maintenance activities when required.
● Contribute to automation of operational tasks, monitoring, and alerting where applicable.
● Identify gaps in runbooks, SOPs, and operational processes and drive improvements.

Documentation
● Maintain and update runbooks, troubleshooting guides, and knowledge base articles.
● Document incident resolutions and operational procedures clearly and accurately.

Security & Compliance
● Adhere to security, access control, and compliance requirements.
● Handle sensitive information in logs, tickets, and systems appropriately.
● Support audits, vulnerability remediation, and compliance checks when required.

Key Experiences and Qualifications We Seek

Educational Background:
● Diploma or higher in Computer Science, Information Technology, or a related field.
Professional Experience:
● 3–5+ years of relevant experience in application support, systems support, or operations roles.
● Experience supporting production systems in a high-availability or mission-critical environment.

Technical Expertise:
● Strong hands-on experience with:
  ○ Application log analysis and monitoring tools (e.g. AWS CloudWatch, Grafana, ELK, Google Analytics)
  ○ Linux/Unix environments
● Working knowledge of cloud platforms (e.g. AWS services such as ECS, Lambda, S3, RDS).
● Basic database knowledge (MySQL, PostgreSQL) for health checks and simple queries.
● Basic knowledge of REST APIs, system integrations, and authentication design.
● Understanding of incident, problem, and change management processes.

Problem-Solving Skills:
● Strong analytical and troubleshooting skills.
● Ability to break down complex incidents into clear, actionable steps.
● Calm and methodical approach when handling production issues under pressure.

Operational Practices:
● Familiarity with ticketing and incident management tools (e.g. Jira, PagerDuty).
● Experience working with runbooks, SOPs, and on-call support rotations (if applicable).

Additional Skills (Bonus Points):
● Experience supporting cloud-native or microservices-based systems.
● Basic scripting skills (e.g. Bash, Python) for automation.
● Experience working in government, regulated, or large-scale enterprise environments.
● Knowledge of disaster recovery and business continuity planning.

Character Traits We Look Out For
● Team player with a collaborative mindset
● Strong sense of ownership and accountability for system reliability
● Proactive in identifying and addressing operational issues
● Willingness and ability to learn and adapt to new systems and tools
● Openness to sharing knowledge and improving team capability
● Clear verbal and written communication skills, including incident reporting