SRE Lead

A fast-growing fintech firm in Taiwan seeks a Site Reliability Engineering Lead to build and scale reliable global systems for a blockchain-focused platform in a VC-backed environment.

Key responsibilities:

As Site Reliability Engineering Lead, you will oversee AWS/Kubernetes reliability, automation, and incident management while driving SRE best practices.

Lead and manage the SRE team to ensure high availability, scalability, and reliability of production systems
Own AWS cloud infrastructure operations including monitoring, security, resource management, and cost optimisation in a 24/7 environment
Lead incident management, troubleshooting, RCA, post-incident reviews, and continuous service improvements
Ensure compliance with security, audit, and regulatory standards (e.g., MAS TRM, ISO 27001) across infrastructure and operations
Drive SRE best practices including observability, alerting, SLA/SLO/SLI management, capacity planning, DR, and high availability
Improve system performance and operational efficiency through automation, CI/CD, IaC, and Kubernetes/EKS optimisation
Collaborate with cross-functional teams (Backend, Data, Security, Product) while mentoring engineers and strengthening operational maturity

Candidate profile:

To excel in this role, you bring expertise in Linux, AWS, and Kubernetes, with strong experience in automation, CI/CD, observability, and SRE practices.

8+ years of Linux system administration and large-scale infrastructure experience, with 2+ years in Tech Lead or team management roles
Hands-on experience operating high-availability, 24/7 cloud platforms with strong AWS expertise (EC2, VPC, IAM, Lambda, EKS, CloudWatch, etc.)
Strong Kubernetes and container orchestration experience, including EKS administration, troubleshooting, and scaling
Experience with Infrastructure as Code and CI/CD pipelines using tools such as Terraform, Helm, Kustomize, Jenkins, GitHub Actions, and ArgoCD
Strong observability and monitoring expertise using tools like Grafana, ELK, Zabbix, and Nagios
Experience with distributed systems (e.g., Kafka, MongoDB) and SRE/DevOps practices including incident management, DR, capacity planning, and SLO/SLA design
Proficient in scripting/programming (Bash, Python, or Golang), with strong knowledge of cloud security and the ability to collaborate effectively in fast-paced production environments

About the company:

The company is a fintech organisation in the digital asset space that builds secure, scalable financial platforms. It offers a fast-paced, collaborative environment focused on innovation, engineering excellence, and high-impact growth.

Keywords: site reliability engineering, blockchain-focused platform, AWS/Kubernetes, incident management, regulatory compliance, fintech

What’s next:

Learn more and apply today!
