Investment Banking

Associate Director, Software Engineering Specialist

at HSBC

Tech Lead, No visa sponsorship, Python

Posted a day ago

Compensation: Not specified
City: Bengaluru
Country: India

The Associate Director, Software Engineering Specialist will lead the team responsible for designing and operating the intelligence layer of HSBC's Internal Developer Platform. The role focuses on building a proactive, data-driven reliability approach using SRE principles, AIOps, and a unified telemetry platform. This position requires extensive experience in technology leadership, especially in Site Reliability Engineering and observability.

Some careers shine brighter than others. If you’re looking for a career that will help you stand out, join HSBC and fulfil your potential. Whether you want a career that could take you to the top, or simply take you in an exciting new direction, HSBC offers opportunities, support and rewards that will take you further.

HSBC is one of the largest banking and financial services organisations in the world, with operations in 64 countries and territories. We aim to be where the growth is, enabling businesses to thrive and economies to prosper, and, ultimately, helping people to fulfil their hopes and realise their ambitions.

We are currently seeking an experienced professional to join our team in the role of Associate Director, Software Engineering Specialist.

In this role:

Your mission is to build and lead the team that designs, builds, and operates the "Intelligence Layer" for our entire Internal Developer Platform. You will transform our approach to reliability from a reactive, tool-based discipline to a proactive, data-driven science. By treating operational telemetry as a first-class data product, you will provide the unified observability and predictive insights that empower thousands of engineers to build faster, safer, and more resilient services.

Strategic Leadership & Vision: Define and execute the comprehensive SRE and Observability strategy for a mission-critical Internal Developer Platform serving thousands of engineers. You will champion a culture of data-driven reliability, proactive incident management, and blameless post-mortems, making reliability a core, measurable feature of our platform.

Observability as a Platform: Spearhead the design, development, and operation of a unified, enterprise-grade telemetry platform. This platform will ingest metrics, logs, and traces via a standardized OpenTelemetry (OTel) pipeline or similar, leveraging Kafka or similar for resilient data streaming and Datadog or similar for advanced visualization and analytics.

AIOps & Predictive Analytics: Move beyond reactive monitoring to build a true AIOps capability. You will lead the effort to correlate low-level infrastructure signals with high-level business KPIs, creating predictive models that identify potential incidents before they impact customers and drive automated remediation.

Reliability Engineering Excellence: Lead the definition and implementation of Service Level Objectives (SLOs) and Error Budgets across all platform services. You will partner with product and engineering teams to embed reliability targets directly into the development lifecycle, ensuring a healthy balance between feature velocity and operational stability. Take part in the post-mortem process to ensure that every incident results in concrete, automated improvements that systematically reduce Mean Time to Detection (MTTD) and Mean Time to Recovery (MTTR).

Team Leadership & Development: Manage and mentor a specialized team of Site Reliability Engineers and Data Analysts. Foster a culture of curiosity, automation, and continuous learning, providing career development opportunities in the highly sought-after fields of SRE and AIOps.

Customer Focus & Partnership: Act as the primary SRE partner to our internal customers, the application development teams. Provide them with the tools, dashboards, and insights they need to understand the performance and reliability of their own applications, empowering them to build more resilient services from day one.

Launch a Predictive AIOps Capability: Move beyond reactive dashboards to deliver a true AIOps engine that can automatically correlate business-level KPIs with infrastructure events, predict failures before they occur, and trigger automated remediation.

To be successful in this role, you should meet the following requirements:

10+ years of experience in technology leadership, with a proven track record in Site Reliability Engineering, Observability, or a similar data-intensive operational domain. Extensive experience in a leadership role is required.

Core Technology Stack: Subject matter expertise in OpenTelemetry (OTel) is highly desirable. Hands-on experience with data streaming technologies like Kafka and advanced monitoring platforms such as Datadog or New Relic is critical.

SRE Mastery: A demonstrated history of implementing SRE principles in a complex environment, including defining and managing SLOs/SLIs and Error Budgets. Experience as an Incident Commander for production outages is required.

Automation & Coding: Strong scripting and automation skills, preferably in Python. The ability to write code to automate operational tasks and analyse telemetry data is essential. Experience with Infrastructure as Code (e.g., Terraform) is a plus.

Data & Analytics Mindset: A deeply ingrained belief in using data to drive decisions. Experience with AIOps concepts and the ability to correlate disparate data sources to find meaningful insights are a significant advantage.

Establish a Unified Observability Standard: Drive the enterprise-wide adoption of OpenTelemetry (OTel) as the single, vendor-neutral standard for all logs, metrics, and traces, completely eliminating proprietary agent lock-in.

Build the Central Telemetry Pipeline: Architect and operate a highly available, real-time data pipeline using Kafka or similar as the central buffer, capable of processing terabytes of operational data daily from our entire hybrid-cloud estate.

Deliver "Reliability as a Service": Implement a mature SRE practice built on SLOs and Error Budgets, transforming reliability from an abstract goal into a quantifiable metric that directly informs product development and deployment decisions.

You’ll achieve more when you join HSBC.

HSBC is committed to building a culture where all employees are valued and respected, and where their opinions count. We take pride in providing a workplace that fosters continuous professional development, flexible working and opportunities to grow within an inclusive and diverse environment.

Personal data held by the Bank relating to employment applications will be used in accordance with our Privacy Statement, which is available on our website.

Issued by – HSBC Software Development India
