About Root
Root is building the future of online community: a next-gen communication platform where developers create fully custom apps and experiences that run entirely within Root. Our goal is to empower all the world’s communities with technology.
We're early-stage, product-driven, and running a modern distributed infrastructure that powers real-time communication for communities worldwide.
About the Role
We’re hiring a Senior Site Reliability Engineer to own deployment safety and release reliability across Root’s platform, including how deployments happen end-to-end. You’ll design and enforce the deployment systems, reliability standards, and incident practices that allow engineers to ship frequently and safely. You’ll collaborate across infrastructure and client teams to embed reliability by default, and you’ll have the authority to evolve these systems when needed.
What You'll Do
Own Deployment Safety & Release Systems
- Own deployment safety for backend services and core platform infrastructure
- Design and operate CI/CD systems that make production changes safe, observable, and reversible
- Implement progressive delivery patterns (canaries, automated rollbacks, feature gating) as defaults
- Eliminate manual, high-risk deployment paths through automation and clear standards
Set and Enforce Reliability Standards
- Define SLOs/SLIs and enforce error budgets for critical services
- Establish explicit release and reliability policies that balance velocity and uptime
- Identify and eliminate systemic failure modes before they reach users
- Reduce MTTR through improved detection, ownership clarity, and response discipline
Lead Incident Response & Operational Maturity
- Lead high-severity production incidents and drive durable systemic fixes
- Run blameless post-incident reviews that result in measurable improvement
- Own on-call standards, escalation paths, and operational readiness
- Make on-call sustainable through automation, signal quality, and clear accountability
Enable Teams Through Operational Patterns
- Partner with infrastructure and product engineers to embed reliability into development workflows
- Provide reference implementations and tooling rather than one-off fixes
- Diagnose issues across application and infrastructure boundaries when needed
- Design systems that scale reliability ownership across engineering
What Success Looks Like
Within 6–12 months:
- Production deployments are fast, predictable, and fully reversible with minimal manual intervention
- Engineers ship confidently, with clear release ownership and no ambiguity around risk
- SLOs are defined, measured, and actively used to guide engineering tradeoffs
- High-severity incidents are detected quickly and resolved with minimal user impact
- Post-incident reviews produce measurable systemic improvements
- On-call is sustainable, high-signal, and operationally mature
- Deployment and operational failure modes are eliminated through automation and standardization
- Operational knowledge is documented; no critical systems depend on a single individual
- The SRE function increases engineering velocity without becoming a bottleneck
Qualifications
Required:
- 5+ years owning production reliability and deployment systems for customer-facing software
- Direct accountability for production deploys and incident response
- Experience designing and operating CI/CD systems where failures had real user impact
- Strong understanding of distributed systems and common backend failure modes
- Ability to diagnose issues across application and infrastructure boundaries
- Experience defining and enforcing SLOs, error budgets, and release policies
- Comfort acting as the final gate for production changes when necessary
- Sound judgment balancing speed, risk, and reliability
Preferred:
- Deep experience operating Kubernetes-based production systems
- Experience with progressive delivery patterns (canaries, automated rollbacks, feature flags)
- Experience with GitOps or declarative deployment workflows
- Experience building scalable observability systems with high-signal alerting
- Experience with high-throughput or real-time backend systems
- Experience documenting and operationalizing complex infrastructure
- Experience mentoring engineers on reliability and operational discipline
How to Apply
Submit your resume or LinkedIn profile.
In your application, briefly include:
- A production deployment or release system you’ve owned
- A high-severity incident where you led the response or significantly improved the outcome
- The scale and type of systems you’ve operated (e.g., traffic volume, infra footprint, real-time constraints)
Optional but helpful:
- Links to public writing, talks, or open-source contributions