Infrastructure Senior SRE Engineer
at OKX
Posted 6 hours ago
Design and lead the stability architecture for large-scale distributed systems, including big data platforms and core middleware. Develop and optimize stability strategies covering capacity planning, performance tuning, fault prevention, and disaster recovery. Drive chaos engineering, monitoring, and incident response with rapid detection and recovery, and advance automation through AIOps. Collaborate across product, development, and operations to embed stability requirements throughout the product lifecycle.
Who We Are
About the Team
Responsibilities
- Design and lead the stability architecture for large-scale distributed systems, including big data platforms, data warehouses, and core middleware infrastructure.
- Develop and optimize comprehensive stability strategies covering capacity planning, performance optimization, fault prevention, and disaster recovery.
- Spearhead chaos engineering practices, designing complex fault injection scenarios to validate system resilience and self-healing capabilities.
- Build and refine comprehensive monitoring and alerting systems for rapid fault detection, localization, and recovery.
- Lead root cause analysis for major incidents and formulate long-term improvement plans to continuously enhance system availability and reliability.
- Drive infrastructure intelligence and automation, designing and implementing AIOps solutions.
- Collaborate closely with product, development, and operations teams to integrate stability requirements throughout the product lifecycle.
- Lead the development of stability-related technical standards and best practices, promoting their adoption across the organization.
Requirements
- Bachelor's degree or above in Computer Science or a related field, with 10+ years of architectural design experience on large-scale internet or cloud computing platforms.
- Expert knowledge of distributed system architectures, with deep understanding of and extensive hands-on experience in big data, cloud-native, and microservice technologies.
- In-depth understanding of infrastructure components (e.g., Kubernetes, Kafka, databases) and the ability to perform advanced tuning on them.
- Strong systems thinking capability, able to analyze and solve complex stability issues from a holistic perspective.
- Extensive experience in handling large-scale system failures, with the ability to quickly locate and resolve challenging problems.
- Mastery of Linux systems and network technologies, and familiarity with the architecture and services of mainstream cloud platforms (e.g., Alibaba Cloud, AWS).
- Excellent technical leadership skills, able to guide teams and drive cross-departmental collaboration.
- Strong communication and documentation skills, with the ability to engage in technical discussions in both Chinese and English.
- Passion for continuous learning, able to quickly grasp new technologies and apply them in practical work scenarios.
Perks & Benefits
- Competitive total compensation
- Comprehensive insurance coverage for employees and their dependants
- More that we'd love to tell you about along the way!

