
Senior Manager of Site Reliability Engineering - Data Protection and Recovery
at J.P. Morgan
Posted a month ago
- Compensation: Not specified
- City: Houston
- Country: United States
Senior Manager of Site Reliability Engineering responsible for leading SRE practices for data protection and recovery platforms. Acts as the non-functional requirement owner, driving resiliency, scalability, monitoring, automation, and security across applications and infrastructure. Leads and coaches technologists, influences strategic planning, and fosters continual improvement through metrics, blameless post-mortems, and knowledge sharing. Collaborates across teams to avoid duplication and implements enterprise-level solutions for reliability and performance.
Location: Houston, TX, United States
Guide and shape the future of technology at a globally recognized firm, driven by pride in ownership.
Job responsibilities
- Demonstrates expertise in site reliability principles and an understanding of the fine balance between features, efficiency, and stability
- Effectively negotiates with peers and executive partners to ensure optimal outcomes for all
- Drives the adoption of site reliability practices throughout the organization
- Ensures your teams apply site reliability best practices and can demonstrate this empirically through stability and reliability metrics
- Drives a culture of continual improvement and solicits real-time feedback to improve the customer’s experience and product line services
- Ensures your team collaborates with other teams within your group’s specialization and avoids duplication of work where possible
- Follows blameless, data-driven, post-mortem strategies and conducts regular team debriefs to enable learning from both successes and mistakes
- Provides personalized coaching for entry to mid-level team members
- Ensures your team documents and shares their knowledge and innovations via internal forums, communities of practice, guilds, and conferences
Required qualifications, capabilities, and skills
- Formal training or certification on infrastructure engineering concepts and 5+ years applied experience. In addition, 2+ years of experience leading technologists to manage and solve complex technical items within your domain of expertise.
- 7+ years of experience in Infrastructure Operations, driving site reliability and performance engineering
- Advanced proficiency in site reliability culture and principles, with the ability to demonstrate how to implement site reliability across platform teams while avoiding common pitfalls
- Experience leading technologists to manage and solve complex technological issues at a firmwide level
- Ability to influence the team’s culture by championing innovation and change for success
- Experience hiring, developing, and recognizing talent
- Proficiency in at least one programming language (e.g., Python)
- Demonstrated proficiency in technical processes
- Proficiency in continuous integration and continuous delivery tools (e.g., Jenkins, GitLab, Terraform)
- Experience with containers and container orchestration (e.g., ECS, Kubernetes, Docker)
- Experience with troubleshooting common compute, storage, and networking technologies and hardware issues
Preferred qualifications, capabilities, and skills
- Demonstrated data fluency
- 3+ years of experience with enterprise data protection products such as Cohesity or Commvault




