Senior SRE
A leading technology organisation in Taipei is seeking a Senior Site Reliability Engineer to join their collaborative and supportive team.
As a Senior Site Reliability Engineer based in Taipei, you will maintain the reliability of cloud-based products used by customers worldwide. Day to day, you will operate Kubernetes clusters on Google Cloud Platform, use monitoring tools to keep systems running smoothly, and automate deployment workflows through CI/CD pipelines. You will also design resilient infrastructure using Infrastructure as Code principles and take an active part in disaster recovery planning. By managing custom CDN nodes and optimising network connectivity, you will help ensure that users in China have uninterrupted access to services. You will collaborate closely with other engineers and engage with technical communities, contributing to the stability and growth of the organisation’s digital platforms.
What you’ll do:
- Operate and maintain Google Kubernetes Engine environments along with other products hosted on Google Cloud Platform, ensuring high availability and performance.
- Monitor product environments using Prometheus, Grafana, and Sentry to proactively identify issues and optimise system health.
- Design and build automated CI/CD pipelines with GitLab to streamline validation and deployment processes across multiple environments.
- Manage infrastructure using Ansible, Terraform, Vault, and ArgoCD to ensure consistency, security, and scalability of all systems.
- Plan and implement backup strategies as well as disaster recovery exercises to safeguard data integrity and business continuity.
- Ensure stable access for users in China by managing custom CDN nodes and optimising network routes for seamless service delivery.
- Collaborate with cross-functional teams to troubleshoot complex networking issues related to TCP/IP stack layers and provide detailed root cause analysis.
- Develop scripts in Bash for Linux system administration tasks to automate routine operations and improve efficiency.
- Participate in the planning and execution of redundancy mechanisms to minimise downtime during unexpected events or failures.
- Engage with information technology communities to share best practices, learn new techniques, and contribute to industry knowledge.
What you bring:
To excel in this role, you should have a proven track record of hands-on Kubernetes cluster management in cloud environments and deep familiarity with Infrastructure as Code. Programming skills in Python, Go, PHP, JavaScript or Rust will let you build solutions tailored to operational needs. Experience administering Linux systems is essential, including Bash scripting to automate critical tasks. A thorough grasp of networking is required so that you can diagnose intricate TCP/IP issues. Experience with major cloud providers such as AWS, Azure, GCP or Alibaba Cloud will help you navigate complex deployments, and the ability to plan redundancy will support business continuity. A background in CDN management, especially for the China market, together with proficiency in monitoring tools and CI/CD automation, will help you maintain service levels and deliver reliable software updates.
- Demonstrated experience managing Kubernetes environments at scale within production settings.
- Proven ability to understand and implement Infrastructure as Code (IaC) using tools such as Terraform or Ansible.
- Hands-on development experience with at least one programming language among Python, Go, PHP, JavaScript, or Rust.
- Comprehensive background in Linux system administration including writing effective Bash scripts for automation.
- In-depth understanding of TCP/IP stack layers with strong troubleshooting skills for complex networking issues.
- Practical experience utilising major cloud platforms such as AWS, Azure, GCP or Alibaba Cloud for enterprise solutions.
- Capability to plan redundancy mechanisms that enhance system resilience against outages or failures.
- Experience operating large-scale CDN infrastructures tailored for specific regional requirements such as China.
- Familiarity with monitoring tools such as Prometheus, Grafana, and Sentry for proactive system management.
- Knowledge of CI/CD pipeline automation using GitLab or similar platforms.
About the job
Contract Type: Perm
Specialism: IT & Digital Transformation
Focus: Infrastructure, Network & System
Industry: IT
Salary: Negotiable
Workplace Type: Hybrid
Experience Level: Mid Management
Location: Taipei
Employment Type: Full-time
Job Reference: FNYOPY-7AC2F137
Date posted: 20 November 2025
Consultant: Amy Lin
Recruiter: Robert Walters (https://www.robertwalters.com.tw)