Lead Site Reliability Engineer, AI/ML Platform

at J.P. Morgan

Bulge Bracket Investment Banks

Mid Level | No visa sponsorship | AWS/GCP/Azure DevOps

Posted a month ago

Compensation: Not specified
Currency: Not specified
City: Jersey City
Country: United States

Lead SRE for AI/ML platform responsible for designing and implementing solutions to improve reliability, scalability, and performance of AI/ML systems. Partner with product engineering teams to build observability, security, automation, and fin-ops tooling, and define reliability and automation architecture standards. Own production incident debugging, participate in on-call rotations, and mentor junior engineers while driving cross-functional engagements. Provide strategic technical leadership to ensure cloud-native, distributed system reliability at scale.

Location: Jersey City, NJ, United States

Responsibilities:

  • Design and implement solutions to enhance the reliability and scalability of AI/ML platforms and applications to accommodate fast-growing demand.
  • Partner with product engineering teams to ensure the AI/ML systems are reliable and high-performing.
  • Develop observability, security, automation and fin-ops tools and orchestration.
  • Provide strategic technology leadership by defining and evaluating standards and architecture for reliability, observability and automation frameworks.
  • Build strong cross-functional relationships that foster engagements across the organization and deliver solutions to user problems.
  • Debug and resolve issues in a production environment, identify root causes, and remediate them.
  • Participate in on-call rotations, incident management, and escalation workflows.
  • Take full ownership of problems, develop solutions, and acquire new knowledge to complete the task.
  • Mentor and guide junior engineers.

Required Qualifications:

  • Bachelor’s degree in Computer Science, Information Technology, or an equivalent technical qualification, with 5+ years of professional experience.
  • Expertise in SRE principles and in the reliability, scalability, and performance of applications and infrastructure.
  • Hands-on experience with cloud platforms (AWS, GCP, Azure) and IaC tools (Terraform, Ansible).
  • Extensive experience implementing advanced observability using tools like OpenTelemetry, Dynatrace, Grafana, and/or cloud-native services.
  • Experience in architecting distributed systems and cloud-native architecture in AWS.
  • Systematic problem-solving and troubleshooting skills in complex systems.
  • Excellent communication skills and the ability to present business and technical concepts to stakeholders.
  • Self-managed and self-motivated, with a strong sense of ownership, urgency, and drive.

Good to have:

  • Prior experience working in AI, ML, or data engineering.
  • Prior experience developing AIOps tooling or AI agents.
  • Multi-cloud experience (AWS, GCP, Azure).

As an SRE in AI Machine Learning and Data platforms LOB, you will implement solutions to enhance the reliability and observability