Senior SRE (On-Premise)
We are currently seeking a Site Reliability Engineer to join a reliability-first platform team. This role offers you the opportunity to be at the forefront of stabilising Windows-based services, enhancing observability, and driving containerisation into Kubernetes.
What you'll do:
As a Site Reliability Engineer based in Taiwan, you will immerse yourself in a dynamic environment focused on building a reliability-first platform. Your daily activities will revolve around automating operational tasks for Windows services using tools like AWX or Rundeck, implementing robust workflows with Ansible or PowerShell DSC, and standardising observability practices through Prometheus, Grafana, OpenTelemetry, ELK or Loki. You will participate in on-call rotations to manage incidents efficiently while leading post-incident reviews to foster continuous improvement. Your expertise will be crucial in designing resilient backup strategies and disaster recovery plans that safeguard business operations. Over time, you will drive the adoption of Kubernetes technologies for service containerisation with an unwavering commitment to security and compliance. Success in this role requires proactive collaboration with diverse teams to embed reliability engineering principles throughout the software delivery process while championing automation practices that elevate operational excellence.
- Develop comprehensive self-service runbooks for Windows services using AWX or Rundeck to streamline operational processes and empower teams with automated solutions.
- Implement Ansible or PowerShell DSC workflows that facilitate health checks, safe rollbacks, and efficient incident response mechanisms across critical systems.
- Standardise metrics, logs, and traces utilising Prometheus, Grafana, windows_exporter, OpenTelemetry, ELK, or Loki to create actionable dashboards and alerts that drive informed decision-making.
- Participate actively in on-call rotations to handle incidents promptly, conduct thorough post-incident reviews (PIR), and lead game days to institutionalise standard operating procedures (SOPs).
- Design and execute backup strategies and disaster recovery plans that ensure business continuity while optimising capacity planning and performance tuning for all services.
- Drive long-term service containerisation initiatives on Kubernetes, adopting tooling such as Helm, Kustomize, Argo CD and Flux and managing configuration through ConfigMaps and Secrets, with a strong emphasis on security and compliance.
- Collaborate closely with cross-functional teams to embed reliability engineering principles throughout the software delivery lifecycle.
- Champion automation practices that reduce manual intervention and enhance operational efficiency across the platform.
- Contribute to the development of golden-signal dashboards that provide real-time visibility into system health and performance.
- Support the integration of compliance requirements into delivery pipelines by understanding RBAC principles and least privilege access models.
What you bring:
To excel as a Site Reliability Engineer in this organisation’s Taiwan office, you bring proven experience from SRE or DevOps roles where you managed production on-call duties for critical systems. Your technical proficiency covers Windows/Linux administration alongside deep knowledge of networking fundamentals such as DNS, TCP/IP, TLS and load balancing. You are adept at deploying Ansible or PowerShell DSC workflows within CI/CD environments powered by Jenkins or Argo CD. Monitoring system health using Prometheus/Grafana/OpenTelemetry combined with centralised logging via ELK/Loki forms part of your toolkit. Containerisation expertise is vital; you have built Kubernetes environments from scratch and operated Helm/Kustomize/Argo CD/Flux in production, managing configuration securely through ConfigMaps and Secrets. You collaborate effectively with colleagues, take accountability for outcomes, and apply a meticulous approach that keeps automation reliable. Additional strengths include securing secrets with Vault/Key Vault, working with cloud DNS and CDN services such as Cloudflare/AWS Route 53/CloudFront, participating in structured incident command and post-incident reviews (IMOC/PIR), and integrating RBAC/compliance controls into delivery pipelines.
- Mandatory: You have over four years of experience in SRE, DevOps or Platform Engineering roles with at least two years spent managing production on-call responsibilities for mission-critical systems.
- Mandatory: Your hands-on expertise spans Windows/Linux administration as well as core networking concepts including DNS, TCP/IP protocols, TLS encryption standards and load balancing techniques.
- Mandatory: You possess practical experience implementing Ansible or PowerShell DSC workflows alongside CI/CD pipelines using Jenkins or Argo CD.
- Mandatory: You demonstrate solid understanding of monitoring frameworks such as Prometheus, Grafana or OpenTelemetry coupled with centralised logging solutions like ELK or Loki.
- Mandatory: Familiarity with containers/Kubernetes is essential; you have built up environments from scratch and operated tooling including Helm/Kustomize/Argo CD/Flux within production settings.
- Mandatory: Your communication skills are exceptional; you collaborate effectively within teams while maintaining accountability for deliverables through attention to detail and a bias towards automation.
- Nice to have: Experience managing secrets using Vault or Key Vault enhances your ability to secure sensitive information within distributed systems.
- Nice to have: Exposure to cloud DNS and CDN platforms such as Cloudflare/AWS Route 53/CloudFront (or equivalent stacks) supports scalable infrastructure design.
- Nice to have: Familiarity with incident command structures (IMOC) and conducting post-incident reviews (PIR) strengthens your incident response capabilities.
- Nice to have: Understanding RBAC models, least privilege access controls and embedding compliance/audit requirements into delivery pipelines demonstrates your commitment to secure operations.
About the job
Contract Type: Perm
Specialism: IT & Digital Transformation
Focus: Infra/Network/System
Industry: IT
Salary: Negotiable
Workplace Type: On-site
Experience Level: Associate
Location: Taipei
Employment Type: Full-time
Job Reference: HJ2KKJ-3B397C92
Date posted: 20 March 2026
Consultant: Amy Lin
Recruiter: Robert Walters (https://www.robertwalters.com.tw)