CareerRiver

Senior Platform & Reliability Engineer (SRE)

Vizcom · San Francisco Bay Area

📍 San Francisco · 💰 $200,000 – $250,000 · via Ashby · Posted 2026-02-24
CareerRiver pulls this listing straight from the employer's hiring system — no recruiter middleman, no reposts. Applying takes you directly to Vizcom.
Agency Notice: We are not currently working with recruiting agencies for this role. Please do not contact Vizcom employees regarding this position. Any resumes submitted without a prior agreement will be considered unsolicited.

About Vizcom

Vizcom is a visual creation platform that combines modern web tooling with AI-powered workflows. Our stack includes a React/TypeScript frontend, Node/Koa + PostGraphile API services, PostgreSQL, Redis, BullMQ queues, and Kubernetes-based production infrastructure. We’re hiring a senior owner of stability and infrastructure to ensure the platform is reliable, fast, and resilient as we scale.

Role Mission

Own service reliability end-to-end: prevent incidents, reduce blast radius when failures happen, and lead fast, high-quality recovery when production degrades. This is a hands-on technical leadership role with authority to set reliability standards and enforce production guardrails.

Compensation

$200,000 – $250,000 base salary + meaningful equity

What You’ll Own

- Reliability bar: Set and enforce SLIs/SLOs/error budgets for critical user flows.
- Production architecture resilience: Drive failure isolation across API, workers, queues, and dependencies so one subsystem cannot take down core access.
- Kubernetes runtime reliability: Define probe contracts, rollout/rollback standards, graceful shutdown behavior, scaling/resource policies, and startup safety.
- Queue + job safety (BullMQ/Redis): Own poison pill containment and workload isolation.
- Incident command quality: Lead Sev1/Sev2 response end-to-end (containment, communications, technical direction, RCA, corrective action execution).
- Reliability operating system: Own observability quality (signals over noise), on-call effectiveness, runbooks, and postmortem discipline.
- Release safety authority: Gate risky deploys and enforce reliability guardrails when production health is at risk.

Traits We’re Looking For

- Calm, structured incident commander under pressure.
- Thinks in failure modes and blast radius by default.
- Pragmatic: can stabilize quickly, then implement durable fixes.
- High ownership and strong written communication.

First 90 Days

- Establish baseline reliability metrics and identify top platform risks.
- Tighten incident response mechanics (roles, comms cadence, runbooks, status updates).
- Deliver high-impact hardening fixes across probes/startup paths/queue safety.
- Publish a prioritized 6–12 month reliability roadmap with clear ownership and milestones.

If possible, please write up one incident you personally led and send it to [email protected], covering: 1) what failed, 2) how you contained it, and 3) what permanent fixes you shipped and measured.
