Hedge Funds

Platform Reliability Engineer

at Millennium

Junior | No visa sponsorship | AWS/GCP/Azure DevOps

Posted 4 hours ago

Compensation: Not specified
City: Not specified
Country: Not specified

The Platform Reliability Engineer will manage and optimize server and network infrastructure for a large financial buy-side organization, focusing on reducing operational overhead, improving system reliability, and automating operations. Responsibilities include ensuring production reliability of Linux-based research and trading platforms, emergency response, building observability, participating in on-call rotations, and developing maintainable automation and documentation.

Platform Reliability Engineer

Millennium's Infrastructure organization is dedicated to designing, engineering, supporting, and managing a robust server estate, systems virtualization, and core enterprise services. We are seeking a Platform Reliability Engineer to join a highly specialized team of exceptionally talented yet refreshingly humble individuals from diverse disciplines. We believe that delivering exceptional services requires the ability to make meaningful changes across the entire stack. Our mission is to solve real business challenges, reduce operational complexities, and foster a collaborative, team-driven environment that promotes mutual growth and success.

As a Platform Reliability Engineer, you will play a key role in managing and optimizing the operational aspects of the server and network infrastructure for a large financial buy-side organization. Your primary focus will be on reducing operational overhead, optimizing systems, managing configurations, and ensuring the reliability and performance of critical infrastructure.

Key Responsibilities:

Ensure the production reliability of the firm’s Linux-based research and trading platform as part of a globally distributed engineering team.
Provide rapid emergency response to production infrastructure issues.
Proactively understand internal clients’ needs and effectively communicate them to leadership at both regional and global levels.
Identify risks, develop contingency plans, and implement solutions to mitigate them.
Develop and enhance the observability platform to monitor the performance and health of critical computing environments.
Participate in occasional (monthly) on-call rotations and support on-call staff during their shifts.
Contribute to organizational knowledge through documentation, education, and writing maintainable code.

Qualifications/Skills: We are looking for individuals with experience in at least some of the following areas:

2+ years of experience in SRE, DevOps, or other infrastructure engineering roles, preferably within the financial industry.
Strong understanding of Linux system internals, including kernel operations, memory management, and performance optimization.
In-depth knowledge of storage technologies, particularly those used in high-performance computing (GPFS experience is a plus).
Broad understanding of IT infrastructure components, such as networking, DNS, NTP/PTP, and NIS.
Proficiency in system automation, monitoring, and self-healing (experience with Salt is a plus).
Experience with container orchestration and virtualization technologies (e.g., Kubernetes, Nomad, VMware).
Familiarity with on-premises and cloud-based HPC infrastructure (operational knowledge of Slurm and GPUs is a plus).
Understanding of AI technologies and their applications in infrastructure automation and management. Experience with or a strong interest in implementing AI/ML solutions for infrastructure optimization, anomaly detection, or predictive analytics.
A passion for technology and automation, with a deep sense of curiosity and ownership.
A hands-on approach to problem-solving and a demonstrable enthusiasm for technology.
Excellent verbal and written communication skills.
