Staff Software Engineer - Reliability (US Citizen Only)
Rubrik, Inc. · San Francisco Bay Area
📍 Palo Alto, CA · 💰 $218,300 · via Greenhouse · Posted 2026-06-25
About the Team & Role
The Site Reliability Engineering (SRE) team at Rubrik ensures the absolute reliability, availability, performance, and security of our enterprise infrastructure services, spanning both global SaaS platforms and government-compliant environments. We operate at the intersection of software development and systems engineering, prioritizing hyperscale platform automation, self-healing architectures, and structural resiliency. As a Staff Site Reliability Engineer, you will serve as a primary technical leader and architect across our broader distributed cloud systems. You will drive long-term technical roadmaps, establish cross-organizational reliability standards, and solve complex distributed systems challenges that safeguard both enterprise and public sector environments.
Beyond the core SRE charter, this Staff role also leads the Application-SRE team — a US-based group that partners closely with engineering, Sales, and Support to unblock POCs, drive complex customer escalations to resolution, and convert recurring field signals into engineering and reliability roadmap items. You will be the technical leader and project owner for Application-SRE: setting direction, tracking commitments, and ensuring the team operates as a high-leverage bridge between the field and the broader engineering org.
What You'll Do
As a Staff Site Reliability Engineer, you will have engineering-wide influence and own the following critical areas:
Infrastructure Strategy & Architecture: Formulate and execute the architectural vision for Rubrik's Cloud Platform, optimizing backend infrastructure systems like Kubernetes, MySQL, and cloud-native services for performance, security, and multi-region scale.
Hyperscale Automation & Platform Tooling: Build, scale, and maintain sophisticated custom internal tools, platform controllers, and automation frameworks in Go or Python to systematically eliminate operational toil.
AI Infrastructure for SaaS: Deploy, scale, and operate the AI infrastructure that powers Rubrik's SaaS offerings, owning the reliability, performance, cost, and security controls required to run AI workloads in multi-tenant, compliance-bound environments.
AI for SRE & Engineering Productivity: Drive the adoption of AI-driven solutions across the SRE charter to reduce toil and multiply the organization's output, applying agentic and LLM-based approaches to automated triage, incident response, operational analysis, and developer productivity.
AI Adoption Guardrails for SaaS Reliability: Build the guardrails, controls, and platform patterns that keep Rubrik's SaaS reliable as AI adoption accelerates across product and engineering, ensuring new AI capabilities ship without eroding availability, performance, security, or cost posture.
Cross-Functional Leadership: Use engineering-wide influence to build technical consensus among component, platform, and security engineering teams, "shifting left" to embed structural resilience, capacity guards, and compliance into initial feature designs.
Reliability Governance: Define, audit, and enforce robust Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets across all critical enterprise platform services, translating telemetry insights into actionable product roadmaps during executive reviews.
Incident Command & Operations Review: Serve as a primary Incident Commander for high-severity cloud outages, establishing roles, directing mitigation vectors under pressure, and orchestrating comprehensive, blameless post-mortems that drive durable systemic fixes.
Cost Governance & Capacity Modeling: Architect cost-observability tools and attribution frameworks, leading cloud infrastructure capacity forecasting, resource quota optimization, and vendor SLA management.
Application-SRE Leadership: Set the technical direction for the Application-SRE team, raising the bar on how the team diagnoses, mitigates, and durably resolves the most complex customer-impacting issues across our platform.
Technical Multiplier & Mentorship: Champion SRE best practices, mentoring senior and junior individual contributors across the organization, participating in interview frameworks, and actively raising the collective technical bar.
On-Call Rotations: Participate in on-call rotations.
Experience You'll Need
Citizenship & Residency: Must be a US Citizen currently residing in the contiguous United States (CONUS). This is a strict regulatory requirement to enable support for federal and FedRAMP environments when required.
Education: BS, MS, or PhD in Computer Science, Computer Engineering, or a highly related technical discipline.
Industry Experience: 8–12+ years of software engineering and production cloud infrastructure experience, including 5+ years in a formal SRE, DevOps, or Platform engineering role operating hyperscale SaaS products.
Technical Depth: Comprehensive, hands-on programming expertise in Golang, Python, or Java with a deep grasp of concurrency models, data structures, and test-driven software design patterns.
Distributed Systems Expertise: Proven proficiency designing, deploying, analyzing, and auditing complex, large-scale distributed systems, database topologies, and high-availability public cloud meshes.
Systems Internals: Authoritative operational command of Unix/Linux operating system environments (process models, file systems, kernels), systems administration, and advanced L4/L7 networking protocols.
AI Systems Fluency: Working knowledge of operating AI systems in production, including model serving, cost trade-offs, and the reliability and safety considerations of LLM- and agent-based workloads. Practical judgment on when AI is the right tool versus deterministic automation.
Field-to-Product Feedback Loop: Institutionalize the channel that converts patterns from customer escalations and POCs into prioritized product and reliability feedback, partnering directly