
Software Engineer - AI Platform Services
at Millennium
Posted a month ago
Compensation: Not specified
City: Not specified
Country: Not specified
Currency: Not specified
Senior engineering role to design, build, and operate the core service layer for an internal AI/agent platform, including an AI gateway, model/provider routing, policy/guardrails, tool-execution interfaces, high-throughput async APIs, and observability. You will implement MCP capabilities to enable secure, governed connectivity between agent runtimes and enterprise tools/data and partner closely with AI Engineers to support agent workflows. Responsibilities include production-grade Kubernetes operations, autoscaling, SLO/SLI and incident management, CI/CD and IaC-driven delivery, and influencing platform roadmap and technical strategy. The role emphasizes reliability, security, scalability, and developer experience for firm-wide internal AI capabilities.
We’re a high-impact platform team building the firm’s internal AI platform that bridges traditional enterprise platforms (identity, data, workflow, governance) with GenAI tools (agents, copilots, model providers).
This is a senior engineering role focused on designing and owning the core service layer that agentic tools run on: an AI gateway, model/provider routing, policy/guardrails, tool-execution interfaces, high-throughput async APIs, and production-grade observability. MCP (Model Context Protocol) services are part of the platform portfolio, enabling secure, governed connectivity between agent runtimes and enterprise tools/data. You’ll partner closely with AI Engineers building agent workflows; your focus is to make the underlying platform fast, reliable, secure, and easy to build on.
Key Responsibilities
Design, build, and operate core platform services (Python; REST + async; streaming where appropriate) powering firm-wide internal AI/agentic capabilities.
Own gateway/platform concerns end-to-end: routing, timeouts/retries, streaming, request shaping, rate limits/quotas, multi-tenancy, policy enforcement, provider abstraction, safe degradation, and robust client experience.
Build and operate MCP capabilities as part of the platform.
Build for scale and availability on Kubernetes: autoscaling, rollout strategies, capacity planning, performance tuning, and production debugging.
Raise reliability practices: define and manage SLOs/SLIs, instrumentation standards, incident response/runbooks, post-incident follow-ups, load/resilience testing, and operational excellence.
Improve delivery safety: CI/CD, environment promotion, IaC-driven repeatability, and secure SDLC practices.
Influence roadmap and technical strategy: prioritize foundational investments and reduce platform risk for a business-critical internal platform.
Required Qualifications
7+ years of professional software engineering experience (or equivalent practical experience).
Strong expertise in Python, Java, or Go, including async patterns, concurrency, and building high-throughput services (FastAPI or similar).
Solid distributed systems fundamentals: idempotency, backpressure, failure isolation, consistency tradeoffs, rate limiting, retries/timeouts.
Production experience operating services on Kubernetes (deployments, autoscaling, debugging, observability, performance).
Basic familiarity with LLM integration patterns (streaming responses, tool/function calling).
Demonstrated design leadership (RFCs, architecture reviews, leading cross-team initiatives).
Excellent communication skills—able to translate technical tradeoffs to stakeholders and partner teams.
Preferred Qualifications
Experience with service-to-service authentication patterns (API keys, OAuth/JWT, mTLS concepts).
Familiarity with observability tooling (structured logs, metrics, tracing; Datadog or OpenTelemetry a plus).
Strong fundamentals in AWS (or GCP/Azure) relevant to secure platforms (IAM, networking basics, compute, logging/monitoring patterns).
Working proficiency with Terraform and automation-first operations (repeatable environments, policy checks, safe rollouts).
Comfort using AI dev tools (Claude Code, Cursor, Gemini CLI) responsibly (tests, validation, secure coding).




