Senior Site Reliability Engineer

About Root

Root is building the future of online community: a next-gen communication platform where developers create fully custom apps and experiences that run entirely within Root. Our goal is to empower all the world’s communities with technology.

We're early-stage, product-driven, and running a modern distributed infrastructure that powers real-time communication for communities worldwide.

About the Role

We’re hiring a Senior Site Reliability Engineer to own deployment safety and release reliability across Root’s platform, including how deployments happen end-to-end. You’ll design and enforce the deployment systems, reliability standards, and incident practices that allow engineers to ship frequently and safely. You’ll collaborate across infrastructure and client teams to embed reliability by default, and you’ll have the authority to evolve these systems when needed.

What You'll Do

Own Deployment Safety & Release Systems

Own deployment safety for backend services and core platform infrastructure
Design and operate CI/CD systems that make production changes safe, observable, and reversible
Implement progressive delivery patterns (canaries, automated rollbacks, feature gating) as defaults
Eliminate manual, high-risk deployment paths through automation and clear standards

Set and Enforce Reliability Standards

Define SLOs/SLIs and enforce error budgets for critical services
Establish explicit release and reliability policies that balance velocity and uptime
Identify and eliminate systemic failure modes before they reach users
Reduce MTTR through improved detection, ownership clarity, and response discipline

Lead Incident Response & Operational Maturity

Lead high-severity production incidents and drive durable systemic fixes
Run blameless post-incident reviews that result in measurable improvement
Own on-call standards, escalation paths, and operational readiness
Make on-call sustainable through automation, signal quality, and clear accountability

Enable Teams Through Operational Patterns

Partner with infrastructure and product engineers to embed reliability into development workflows
Provide reference implementations and tooling rather than one-off fixes
Diagnose issues across application and infrastructure boundaries when needed
Design systems that scale reliability ownership across engineering

What Success Looks Like

Within 6–12 months:

Production deployments are fast, predictable, and fully reversible with minimal manual intervention
Engineers ship confidently, with clear release ownership and no ambiguity around risk
SLOs are defined, measured, and actively used to guide engineering tradeoffs
High-severity incidents are detected quickly and resolved with minimal user impact
Post-incident reviews produce measurable systemic improvements
On-call is sustainable, high-signal, and operationally mature
Recurring deployment and operational failure modes are eliminated through automation and standardization
Operational knowledge is documented; no critical systems depend on a single individual
The SRE function increases engineering velocity without becoming a bottleneck

Qualifications

Required:

5+ years owning production reliability and deployment systems for customer-facing software
Direct accountability for production deploys and incident response
Experience designing and operating CI/CD systems where failures had real user impact
Strong understanding of distributed systems and common backend failure modes
Ability to diagnose issues across application and infrastructure boundaries
Experience defining and enforcing SLOs, error budgets, and release policies
Comfort acting as the final gate for production changes when necessary
Sound judgment balancing speed, risk, and reliability

Preferred (not required):

Deep experience operating Kubernetes-based production systems
Experience with progressive delivery patterns (canaries, automated rollbacks, feature flags)
Experience with GitOps or declarative deployment workflows
Experience building scalable observability systems with high-signal alerting
Experience with high-throughput or real-time backend systems
Experience documenting and operationalizing complex infrastructure
Experience mentoring engineers on reliability and operational discipline

How to Apply

Submit your resume or LinkedIn profile.

In your application, briefly include:

A production deployment or release system you’ve owned
A high-severity incident you led or significantly improved
The scale and type of systems you’ve operated (e.g., traffic volume, infra footprint, real-time constraints)

Optional but helpful:

Links to public writing, talks, or open-source contributions

