
Site Reliability Engineer III
at J.P. Morgan
Posted 17 days ago
- Compensation: Not specified
- City: Dallas
- Country: United States
- Currency: Not specified
Join JPMorgan Chase as a Site Reliability Engineer focused on modernizing platform operations through SRE principles and AI-driven automation. You will build observability, resilience, and automated incident management for critical financial systems, collaborating with stakeholders to set SLOs and error budgets. The role requires hands-on experience with cloud, Kubernetes, Terraform, coding, and observability stacks, and includes mentoring and knowledge-sharing responsibilities. Based in Plano, TX, you will help drive proactive, customer-focused reliability improvements across the enterprise.
Location: Plano, TX, United States
There’s nothing more exciting than being at the center of a rapidly growing field in technology and applying your skillsets to drive innovation and modernize the world's most complex and mission-critical systems.
Take on a critical role in defining the future of a globally recognized firm, with direct and significant impact in a field built for top achievers in site reliability.
As a Site Reliability Engineer at JPMorgan Chase within the Corporate Sector, Enterprise Technology team, you will be instrumental in enhancing intelligent and resilient platform operations for a global financial institution. You will drive the integration of traditional support with modern Site Reliability Engineering (SRE) principles, utilizing agentic AI as a core capability to achieve our vision of a proactive, automated, and customer-centric reliability function. This role demands a blend of deep technical expertise, a growth-oriented mindset, and a strong dedication to operational excellence. You will excel in modern infrastructure and observability, promoting AI-powered incident management, autonomous runbooks, and support intelligence initiatives.
Job responsibilities
- Advocate and embody site reliability principles, fostering a culture of excellence and technical influence within your team.
- Leverage AI tools to enhance operational effectiveness and automate processes, ensuring high-quality customer service.
- Spearhead projects aimed at enhancing the reliability and stability of applications and platforms.
- Utilize data-driven analytics and AI technologies to automate detection, diagnosis, and resolution processes, elevate service levels, and drive continuous improvement.
- Engage stakeholders to establish realistic service level objectives and error budgets, ensuring alignment with customer expectations.
- Exhibit technical proficiency in one or more domains, proactively addressing technology-related bottlenecks.
- Employ AI-driven solutions to streamline processes and enhance operational efficiency.
- Participate in troubleshooting during incidents, demonstrating the ability to swiftly identify and resolve issues to prevent financial losses.
- Act as a culture carrier by documenting learnings and disseminating knowledge through internal forums and communities of practice.
- Mentor team members, guiding them in the strategic adoption of AI technologies to enhance operational effectiveness and customer service.
Required qualifications, capabilities, and skills
- Formal training or certification in site reliability engineering concepts and 2+ years of applied experience in areas such as resiliency, scalability, performance, and security.
- Proven success in an SRE or DevOps role, with knowledge of service level indicators/objectives (SLIs/SLOs), incident management, blameless postmortem analysis, and systems reliability.
- Expertise with observability stacks (e.g., Prometheus, Grafana, Splunk, OpenTelemetry), including deep experience correlating telemetry across services and time.
- Hands-on skills in coding (at least one high-level programming language), cloud platforms (AWS or GCP), container orchestration (Kubernetes), infrastructure as code (Terraform), and resilient CI/CD pipelines.
- Active experience or deep curiosity in applying AI to operations—such as LLM-based copilots, anomaly detection, automated runbooks, autonomous agents (e.g. CrewAI, LangGraph), or Retrieval-Augmented Generation (RAG) workflows for support.
- A track record of delivering under pressure. You finish what you start, adapt to uncertainty, and thrive in high-accountability environments.
- You deconstruct complexity, organize effectively, and drive clarity into ambiguous operational environments. Documentation and design are second nature.
- Outstanding communication, empathy, and professionalism—especially during incidents. You recognize that great systems serve real people.
- Experience with operational and compliance rigor in banking, fintech, or similar.
- Practical use of LLM frameworks (e.g. LangChain, Semantic Kernel), AI orchestration tools, vector databases, or custom agents supporting reliability workflows.
- Experience with game days, chaos experiments, or failure-mode analysis to improve service robustness.
- A background in mentoring engineers or leading technical knowledge-sharing, especially around AI and SRE best practices.





