The Managed Services team is a newly formed squad within the Databases department. It owns and operates shared, production-critical infrastructure that powers Grafana Cloud's next-generation database products (Mimir, Loki, and Tempo). Today, this includes operating 100+ WarpStream clusters across multiple cloud providers and regions, with continued growth anticipated. WarpStream acts as the streaming backbone for ingestion and read/write decoupling across databases. It sits directly on the hot path for metrics, logs, and traces, handling high-throughput, multi-consumer workloads at massive scale.
In addition to streaming infrastructure, the team works closely with high-volume analytical and storage systems that power query-heavy and aggregation-heavy workloads, where latency, compression behavior, storage layout, and scaling characteristics matter deeply.
What You'll Be Doing
- Operating and evolving 100+ multi-cloud streaming clusters and related database infrastructure
- Diagnosing and eliminating cross-layer failure modes (e.g., object storage latency, noisy neighbors, control-plane bottlenecks, and query performance regressions)
- Designing safe upgrade and rollout strategies at scale
- Improving observability, automation, and operational ergonomics
- Partnering closely with database and platform teams to ensure safe scaling, partitioning, consumer fan-out, and query performance
- Working directly with distributed systems behavior, Kubernetes scheduling dynamics, storage engines, and compression trade-offs
- Serving as a primary escalation point and on-call for relevant incidents
- Owning the relationship with all system vendors, including WarpStream Labs
- Help define and evolve the technical direction for operating WarpStream and adjacent shared database systems at scale
- Lead complex initiatives such as migrations, rollout improvements, and reliability investments
- Establish best practices around SLOs, scaling limits, failure isolation, and change safety
- Investigate and drive resolution of multi-layer incidents spanning storage, compute, networking, and control-plane dependencies
- Identify systemic risks across 100+ clusters and contribute architectural improvements that reduce recurring issues
- Reduce systems toil and improve operational ergonomics through automation
- Partner with database and platform teams to align on strategy and long-term scalability
- Mentor and support engineers as the team matures
- Work remotely with an independent attitude and good communication skills
- Participate in an on-call rotation aligned to approximately 12 daylight hours per day, collaborating globally for balanced coverage and shared ownership
- Blend deep distributed systems work with influencing team approaches to reliability, scaling, and operational excellence
- Use modern AI coding assistants to improve developer productivity within security guidelines
What Makes You a Great Fit
- Regular 1:1s with your manager and close collaboration with teammates across regions, helping shape how the team operates and matures
- Defining and evolving SLO strategy for shared database infrastructure, identifying systemic reliability gaps and driving long-term error budget improvements
- Setting standards for diagnosability across core streaming and database systems in production
- Leading complex initiatives across high-throughput, multi-cloud infrastructure
- Designing and promoting fault-tolerant architectural patterns that address distributed system realities such as storage latency, partition imbalance, noisy neighbors, and control-plane dependencies
- Defining rollout, migration, and upgrade safety practices used across dozens of production clusters
- Partnering with database and platform engineering leaders to influence architecture decisions, roadmap prioritization, and long-term scalability strategy
- Leading design discussions and reviewing PRs with a focus on reducing operational risk and increasing system resilience
- Raising the bar for practices across teams by mentoring engineers and sharing distributed systems knowledge
- Playing a key role in high-impact incident response, guiding investigation, driving root cause analysis, and ensuring durable remediation through strong post-incident reviews
Requirements
- 8+ years of engineering experience, including meaningful time in SRE, platform engineering, production engineering, infrastructure engineering, or distributed systems roles.
- Experience with high-throughput streaming systems, analytical or storage backends, or large-scale database infrastructure. Examples of these include Kafka, Redpanda, WarpStream, Postgres, ClickHouse, Snowflake, or Cassandra.
- Strong Kubernetes experience in AWS, GCP, or Azure, and familiarity with infrastructure-as-code tooling (Helm, Terraform, Jsonnet, etc.).
- Experience leading or driving complex technical efforts, even without formal management responsibilities
- Ability to influence technical direction and align teams around reliability improvements
- Strong understanding of distributed systems failure modes in multi-cloud environments.
- Proficiency in at least one systems-oriented language (Go preferred, but not required).
- Working knowledge of Linux internals, networking, cloud storage, and performance/scaling behavior.
- Experience participating in blameless incident response and writing high-quality post-incident reviews.
- Clear communicator who can collaborate across teams and work autonomously.
- Intellectually curious