Senior SRE (On-Premise)

We are currently seeking a Site Reliability Engineer to join a reliability-first platform team. This role puts you at the forefront of stabilising Windows-based services, enhancing observability, and driving the containerisation of services onto Kubernetes.

Key responsibilities:

As a Site Reliability Engineer based in Taiwan, you will immerse yourself in a dynamic environment focused on building a reliability-first platform. Your daily activities will revolve around automating operational tasks for Windows services using tools like AWX or Rundeck, implementing robust workflows with Ansible or PowerShell DSC, and standardising observability practices through Prometheus, Grafana, OpenTelemetry, ELK or Loki.

Develop comprehensive self-service runbooks for Windows services using AWX or Rundeck to streamline operational processes and empower teams with automated solutions.
Implement Ansible or PowerShell DSC workflows that facilitate health checks, safe rollbacks, and efficient incident response mechanisms across critical systems.
Standardise metrics, logs, and traces utilising Prometheus, Grafana, windows_exporter, OpenTelemetry, ELK, or Loki to create actionable dashboards and alerts that drive informed decision-making.
Participate actively in on-call rotations to handle incidents promptly, conduct thorough post-incident reviews (PIR), and lead game days to institutionalise standard operating procedures (SOPs).
Design and execute backup strategies and disaster recovery plans that ensure business continuity while optimising capacity planning and performance tuning for all services.
Drive long-term service containerisation initiatives by adopting Kubernetes tooling and resources such as Helm, Kustomize, Argo CD, Flux, ConfigMaps, and Secrets, with a strong emphasis on security and compliance.
Collaborate closely with cross-functional teams to embed reliability engineering principles throughout the software delivery lifecycle.
Champion automation practices that reduce manual intervention and enhance operational efficiency across the platform.
Contribute to the development of golden-signal dashboards that provide real-time visibility into system health and performance.
Support the integration of compliance requirements into delivery pipelines by understanding RBAC principles and least privilege access models.

Candidate profile:

To excel as a Site Reliability Engineer in this organisation’s Taiwan office, you bring proven experience from SRE or DevOps roles where you managed production on-call duties for critical systems. Your technical proficiency covers Windows/Linux administration alongside deep knowledge of networking fundamentals such as DNS configuration, TCP/IP protocol management, TLS implementation and load balancing strategies. You are adept at deploying Ansible or PowerShell DSC workflows within CI/CD environments powered by Jenkins or Argo CD.

Mandatory: You have over four years of experience in SRE, DevOps or Platform Engineering roles with at least two years spent managing production on-call responsibilities for mission-critical systems.
Mandatory: Your hands-on expertise spans Windows/Linux administration as well as core networking concepts including DNS, TCP/IP protocols, TLS encryption standards and load balancing techniques.
Mandatory: You possess practical experience implementing Ansible or PowerShell DSC workflows alongside CI/CD pipelines using Jenkins or Argo CD.
Mandatory: You demonstrate solid understanding of monitoring frameworks such as Prometheus, Grafana or OpenTelemetry coupled with centralised logging solutions like ELK or Loki.
Mandatory: Familiarity with containers/Kubernetes is essential; you have built up environments from scratch and operated tooling including Helm/Kustomize/Argo CD/Flux within production settings.
Mandatory: Your communication skills are exceptional; you collaborate effectively within teams while maintaining accountability for deliverables through attention to detail and a bias towards automation.
Nice to have: Experience managing secrets using Vault or Key Vault enhances your ability to secure sensitive information within distributed systems.
Nice to have: Exposure to cloud-based DNS and CDN platforms such as Cloudflare, AWS Route 53 and CloudFront (or equivalent stacks) supports scalable infrastructure design.
Nice to have: Familiarity with incident command structures (IMOC) and conducting post-incident reviews (PIR) strengthens your incident response capabilities.
Nice to have: Understanding RBAC models, least privilege access controls and embedding compliance/audit requirements into delivery pipelines demonstrates your commitment to secure operations.
