Join Docusign's Observability team as a Senior Software Engineer, owning end-to-end pipelines for ingestion, feature engineering, training, and real-time/batch inference of operational telemetry. Design and operate anomaly-detection and forecasting services to improve SLAs and reduce alert fatigue, and build scalable ingestion frameworks with governance. Collaborate with Applied Scientists to productionize models and optimize time-series storage and compute, while delivering reliable, high-throughput data pipelines with strong observability. Mentor early-career engineers and contribute to platform ergonomics and documentation.
Company Overview

Docusign brings agreements to life. Over 1.5 million customers and more than a billion people in over 180 countries use Docusign solutions to accelerate the process of doing business and simplify people’s lives. With intelligent agreement management, Docusign unleashes business-critical data that is trapped inside of documents. Until now, these were disconnected from business systems of record, costing businesses time, money, and opportunity. Using Docusign’s Intelligent Agreement Management platform, companies can create, commit, and manage agreements with solutions created by the #1 company in e-signature and contract lifecycle management (CLM).

What you'll do

Operating a reliable global service requires robust observability and automation. The Observability team builds the real-time telemetry and AI/ML capabilities that power how engineers measure, visualize, investigate, and improve customer experience at scale.

This position is an individual contributor role reporting to the Senior Director, Software Engineering.
Responsibilities

- Own end-to-end pipelines, from ingestion and feature engineering to training and real-time/batch inference, for operational time-series use cases (metrics, logs, events)
- Design and operate anomaly-detection and forecasting services (statistical/ML) that improve SLAs, reduce alert fatigue, and accelerate incident triage
- Build internal SDKs, metadata catalogs, and reusable ingestion frameworks that standardize access, enforce governance, and accelerate adoption across product and platform teams
- Harden data quality (sanity scoring, validation, drift checks), backfills, and replay strategies, and define SLOs for data and models
- Partner with Applied Scientists to productionize models with pragmatic algorithms (e.g., tree-based methods, classical time-series techniques) and selectively introduce deep learning where it pays off
- Optimize time-series storage and compute (e.g., Prometheus, ClickHouse, Azure Data Explorer, columnar stores): partitioning, rollups, retention, and cost controls
- Integrate with the observability stack (OpenTelemetry for signals; dashboards/alerts) and collaborate with SRE/infra to ensure performance and resilience
- Explore LLM-assisted on-call (RAG over logs/runbooks) to improve diagnosis and guidance; manage prompt safety, evals, and latency budgets
- Deliver reliable, high-throughput real-time and batch data pipelines for billions of telemetry points per day
- Ship production-ready services (APIs, jobs, workers) with clean contracts and strong observability
- Define the data/model lifecycle: versioning, lineage, reproducibility, and automated evaluations
- Mentor early-career engineers; champion platform ergonomics and documentation

Job Designation

Hybrid: Employees divide their time between in-office and remote work. Access to an office location is required. (Frequency: Minimum 2 days per week; may vary by team, but there will be a weekly in-office expectation.)

Positions at Docusign are assigned a job designation of either In Office, Hybrid or Remote and are specific to the role/job.
Preferred job designations are not guaranteed when changing positions within Docusign. Docusign reserves the right to change a position's job designation depending on business needs and as permitted by local law.

What you bring

Basic

- 8+ years of software engineering experience, with deep Python and SQL in production
- Experience building real-time and batch pipelines with stringent SLAs and data quality controls
- Experience in time-series analysis, anomaly detection, and forecasting for operational systems
- Experience deploying containerized services (Docker), Linux fundamentals, and CI/CD
- Experience with at least one time-series/analytical store (ClickHouse, Postgres/TimescaleDB, columnar stores), and with caching/NoSQL where appropriate
- Experience with workflow orchestration (Prefect, Airflow, or Dagster)

Preferred

- Streaming platforms (Kafka/Redpanda/Pulsar) or equivalent messaging; schema management and idempotent/exactly-once strategies
- Model serving and monitoring (custom FastAPI/Flask services, MLflow, KServe/Seldon); drift detection; shadow/canary rollouts
- Familiarity with observability tooling (OpenTelemetry, Prometheus, Grafana) and alerting best practices (SLOs, MTTR/MTTA)
- Exposure to LLMs and Hugging Face; interest in applying LLMs to ops guidance (RAG over telemetry/runbooks)
- Distributed systems fundamentals; cloud experience (GCP/AWS/Azure) and IaC
- Kubernetes experience (or willingness to ramp quickly)

Wage Transparency

Pay for this position is based on a number of factors, including geographic location, and may vary depending on job-related knowledge, skills, and experience. Based on applicable legislation, the below details pay ranges in the following locations:

Washington, Maryland, New Jersey and New York (including NYC metro area): $151,200.00 - $222,450.00 base salary

This role is also eligible for the following:

Bonus: Sales personnel are eligible for variable incentive pay dependent on their achievement of pre-established