
Lead Site Reliability Engineer
at J.P. Morgan
Posted 17 days ago
Compensation: Not specified
City: Palo Alto
Country: United States
Currency: Not specified
Lead Site Reliability Engineer in JPMorgan Chase's Commercial & Investment Bank, responsible for designing and operating highly available cloud-native platforms. You will design and maintain AWS infrastructure with Terraform, manage production Kubernetes clusters, and build robust CI/CD pipelines while acting as technical lead for medium to large products. The role includes mentoring engineers, conducting resiliency design reviews, defining SLIs, SLOs, and error budgets, and leading incident response to minimize business impact.
Location: Palo Alto, CA, United States
Assume a critical role in defining the future of a globally recognized firm, with a direct and significant impact in a space designed for top achievers in site reliability.
Job responsibilities
- Design and maintain highly available AWS infrastructure using Terraform
- Design, implement, and maintain robust CI/CD pipelines using tools such as Jenkins and Spinnaker
- Manage production Kubernetes clusters and container orchestration
- Demonstrate and champion site reliability culture and practices, and exert technical influence throughout your team
- Collaborate with team members to identify comprehensive service level indicators, and with stakeholders to establish reasonable service level objectives and error budgets with customers
- Demonstrate a high level of technical expertise within one or more technical domains, and proactively identify and solve technology-related bottlenecks in your areas of expertise
- Act as the main point of contact during major incidents for your application, identifying and resolving issues quickly to avoid financial losses
- Document and share knowledge within your organization via internal forums and communities of practice
Required qualifications, capabilities, and skills
- 7+ years of experience in SRE, DevOps, or Platform Engineering
- Deep proficiency in reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices, with the ability to implement these practices within an application or platform
- Fluency in at least one programming language (e.g., Python, Java Spring Boot, .NET)
- Expert-level AWS experience (ECS, EC2, EKS, RDS, Lambda, VPC, IAM)
- Production Kubernetes experience at scale
- Strong Terraform skills (modules, state management, best practices)
- Hands-on Jenkins experience (pipelines, shared libraries)
- Spinnaker experience for continuous delivery
- Python/Bash/Go scripting abilities
- Deep understanding of SRE principles (SLOs, SLIs, error budgets)
- Drive to self-educate and evaluate new technology
- AWS or Kubernetes certifications
- Experience with Prometheus, Grafana, ELK Stack
- Service mesh knowledge (Istio, Linkerd)
- GitOps experience (ArgoCD, Flux)
- Ability to teach new programming languages to team members
- Ability to expand and collaborate across different levels and stakeholder groups




