Staff Site Reliability Engineer

Thrive Market · Remote

📍 Playa Vista, CA or Remote · 💰 $180,000 - $225,000 · via greenhouse · Posted 2026-04-17


ABOUT THRIVE MARKET

Thrive Market was founded in 2014 with a mission to make healthy and sustainable living easy and affordable for everyone. As an online, membership-based market, we deliver the highest-quality healthy and sustainable products at member-only prices, while matching every paid membership with a free one for someone in need. Every day, we leverage innovative technology and member-first thinking to help our 1,700,000+ members find better products, support better brands, and build a better world in the process. We are also a Certified B Corporation, a Public Benefit Corporation, and a Climate Neutral Certified company. Join us as we bring healthy and sustainable living to millions of Americans in the years to come.

THE ROLE

We’re looking for a Staff Site Reliability Engineer to help define and build the reliability foundation for Thrive Market’s platform. You’ll be working with a first-class group of engineers to establish our SRE practice from the ground up: defining SLOs, SLIs, and error budgets, building observability into everything we do, and creating the frameworks that ensure our systems scale reliably during our company’s rapid growth.

This is a high-impact role at an exciting inflection point. We’ve recently containerized our entire platform on Kubernetes, and we’re evaluating a potential migration to a next-generation e-commerce platform. You’ll be balancing hands-on reliability work with the strategic thinking needed to build systems that self-heal and get better over time. If you’ve read books like The Google SRE Handbook, The Phoenix Project, Accelerate, The DevOps Handbook, etc., this is the right place for you!

RESPONSIBILITIES

Reliability & Observability
Define, implement, and own Service Level Objectives (SLOs) and Service Level Indicators (SLIs) across critical platform services
Build and maintain comprehensive monitoring, alerting, and observability systems using tools like Datadog, Prometheus, Grafana, or similar platforms
Establish error budgets and use them to balance feature velocity with reliability investments
Lead incident response efforts, conduct blameless postmortems, and drive systemic improvements that prevent recurrence
Design and implement chaos engineering practices to proactively identify failure modes before they impact members

Infrastructure & Platform
Architect and optimize our Kubernetes-based container orchestration platform for reliability, performance, and cost efficiency
Support large infrastructure migrations, ensuring a smooth transition with minimal disruption to business operations
Contribute to the evaluation and execution of potential platform migrations, with a focus on reliability planning and risk mitigation
Design and implement automated deployment pipelines that enable rapid, error-free releases with feature flags and built-in rollback/roll-forward capabilities
Develop and own disaster recovery plans, capacity planning models, and system hardening initiatives
Collaborate closely with product engineering teams to help them scale their infrastructure in AWS and adopt SRE best practices

Culture & Process
Help establish SRE as a practice at Thrive Market, defining the team’s charter, processes, and engagement model with product engineering teams
Champion a culture of operational excellence, continuous improvement, and data-driven reliability decisions
Create and maintain technical documentation covering architecture decisions, runbooks, incident response procedures, and operational playbooks
Participate in weekly on-call rotations and help build sustainable on-call practices that avoid burnout
Identify systemic problems and inefficiencies across the engineering organization and make strategic recommendations for improvement

QUALIFICATIONS

Required
B.S. in Computer Science or equivalent professional experience
7+ years of hands-on experience in SRE, DevOps, or Infrastructure Engineering, with a proven track record of improving reliability at rapidly growing companies
Deep expertise in Kubernetes (K8s), including cluster management, Helm charts, service meshes, and production-grade container orchestration
Strong systems engineering background with advanced proficiency in Linux administration
Advanced scripting and automation skills in Bash, Python, Golang, Ruby, or similar languages
Extensive experience with core AWS services including EC2, ECS/EKS, S3, VPC, IAM, CloudWatch, Route 53, RDS, and Lambda
Strong experience with Infrastructure as Code tools (Terraform, CloudFormation, Pulumi, or similar)
Hands-on experience defining and implementing SLOs, SLIs, and error budgets in production environments
Deep understanding of CI/CD pipelines and deployment strategies (blue-green, canary, rolling deployments)
Expertise in monitoring and observability platforms (Datadog, Prometheus, Grafana, New Relic, or similar)
Strong knowledge of web application infrastructure, networking, load balancing, and security best practices
Excellent communication skills with the ability to lead incident response and facilitate blameless postmortems

Preferred
Experience with e-commerce platforms (Magento, Shopify, or comparable) and the unique reliability challenges they present at scale
Experience with ConcourseCI, GitHub Actions (GHA), or similar deployment frameworks
Experience with chaos engineering tools and practices (Gremlin, Litmus, Chaos Monkey, or similar)
Familiarity with GitOps workflows (ArgoCD, Flux) and service mesh technologies (Istio, Linkerd)
Experience building and managing cost-optimization strategies for cloud infrastructure
Background in establishing SRE practices in organizations transitioning from traditional DevOps models
Experience with configuration management tools (Ansible, Chef, Puppet, or similar)

BELONG TO A BETTER COMPANY

Comprehensive health benefits (medical, dental, vision, life and disability)
