Tech Job Finder - Find Software, Technology Sales and Product Manager Jobs.
Senior Lead Site Reliability Engineer

at J.P. Morgan

Bulge Bracket Investment Banks

Tech Lead, No visa sponsorship, AWS/GCP/Azure DevOps

Posted a month ago

Compensation: Not specified
Currency: Not specified
City: Hyderabad
Country: India

Senior Lead SRE on the AI/ML & Data platform team responsible for defining non-functional requirements, availability targets, SLOs, and service-level indicators for applications and product lines. You will own incident management, root-cause analysis, production changes, and observability practices while mentoring team members. The role emphasizes automation to reduce toil, building tooling, and applying AI/ML techniques for troubleshooting and incident resolution. Technologies include Databricks, Snowflake, AWS, Kubernetes, and observability tools such as Grafana, Dynatrace, Prometheus, Datadog, and Splunk.

Location: Hyderabad, Telangana, India

Elevate your engineering prowess to unprecedented levels by joining a team of exceptionally gifted professionals and position yourself among the top echelon in site reliability.

As a Senior Lead Site Reliability Engineer at JPMorgan Chase within the AI/ML & Data platform team, you work with your fellow stakeholders to define non-functional requirements (NFRs) and availability targets for the services in your application and product lines. You will ensure those NFRs are accounted for in your products’ design and test phases, that your service level indicators are effectively measuring customer experience, and that service level objectives are defined with stakeholders and implemented in production.

Job responsibilities

  • Demonstrate expertise in application development and support with multiple technologies such as Databricks, Snowflake, AWS, Kubernetes, etc.
  • Coordinate incident management coverage to ensure effective resolution of application issues.
  • Collaborate with cross-functional teams to perform root cause analysis and implement production changes.
  • Mentor and guide team members to foster innovation and strategic change.
  • Develop and support AI/ML solutions for troubleshooting and incident resolution.

Required qualifications, capabilities, and skills

  • Formal training or certification in SRE concepts and 5+ years of applied experience.
  • Proficiency in site reliability culture and principles, and familiarity with implementing site reliability within an application or platform.
  • Proficiency in running production incident calls and managing incident resolution.
  • Experience in observability, such as white-box and black-box monitoring, service level objective alerting, and telemetry collection, using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, and others.
  • Strong understanding of SLOs, SLAs, and error budgets.
  • Proficiency in Python or PySpark for AI/ML modeling.
  • Must be able to reduce toil by building new tools to automate repeated tasks.
  • Hands-on experience in system design, resiliency, testing, operational stability, and disaster recovery.
  • Understanding of network topologies, load balancing, and content delivery networks.
  • Awareness of risk controls and compliance with departmental and company-wide standards.
  • Ability to work collaboratively in teams and build meaningful relationships to achieve common goals.

Preferred qualifications, capabilities, and skills

  • Experience in an SRE or production support role with AWS Cloud, Databricks, Snowflake, or similar technologies.
  • AWS and Databricks certifications.
