Site Reliability Engineer, Observability

Ripple Labs Inc · Chicago, IL

📍 Chicago, Illinois, United States · 💰 $160,000 · via Greenhouse · Posted 2026-06-18


At Ripple, we’re building a world where value moves like information does today. It’s big, it’s bold, and we’re already doing it. Through our crypto solutions for financial institutions, businesses, governments and developers, we are improving the global financial system and creating greater economic fairness and opportunity for more people, in more places around the world. And we get to do the best work of our career and grow our skills surrounded by colleagues who have our backs. If you’re ready to see your impact and unlock incredible career growth opportunities, join us, and build real world value.

Ripple Treasury, now a Ripple solution, was acquired by Ripple in 2025, marking a significant expansion into the multi-trillion-dollar corporate finance arena. Ripple Treasury has more than 40 years of experience supporting some of the world’s largest and most sophisticated companies. Integrating its treasury command center into Ripple’s technology stack gives corporates the ability to move, manage and optimize liquidity in real-time, across traditional and digital assets, under one expanded umbrella. Join us to build the future of corporate treasury and the infrastructure that powers the Internet of Value.

THE WORK:

As a Site Reliability Engineer you will be a force multiplier elevating engineering capabilities across observability and incident management. You will empower Ripple's stream-aligned engineering teams to detect, diagnose, and resolve production issues quickly and effectively, helping keep our products highly available, performant, and resilient at scale for customers managing trillions in annual payment volume.

You will be part of Ripple's Technical Operations team, coaching teams to build comprehensive monitoring, effective alerting, and mature incident response practices. Through workshops, consultation, and hands-on guidance, you'll help teams achieve operational excellence and self-sufficiency.

If you're passionate about building capabilities in others and creating lasting impact through observability and incident management, this is the opportunity for you.

WHAT YOU’LL DO:

Observability Enablement
- Coach teams on instrumenting applications with structured logs, metrics, and distributed traces using New Relic and OpenTelemetry
- Guide teams in creating effective dashboards, alerts, and SLOs/SLIs that provide actionable insights into system health and reduce Mean Time to Detection (MTTD)
- Teach teams to define and track error budgets, using them to balance feature velocity with reliability
- Provide hands-on guidance during production incidents to coach real-time troubleshooting using observability data
- Develop golden path examples for instrumentation patterns, dashboard templates, and alert configurations that teams can adopt independently
- Help teams optimize their use of New Relic (APM, Infrastructure, Logs, Synthetics) across Azure and AWS multi-cloud environments
- Build team capability to identify and resolve performance bottlenecks, resource constraints, and degradation patterns

Incident Management Administration & Enablement
- Administer and configure the Incident.IO platform, ensuring it supports effective incident response workflows across all engineering teams
- Coach teams on incident response best practices: classification, escalation, communication, coordination, and resolution
- Help teams establish on-call rotation schedules, runbooks, and escalation policies that ensure appropriate incident coverage
- Facilitate post-incident review (PIR) processes, teaching teams to identify root causes, document learnings, and implement preventive measures
- Guide teams in defining incident severity levels and response procedures aligned with business impact
- Integrate observability tooling (New Relic) with incident management (Incident.IO) to enable rapid detection and diagnosis
- Track and report on incident metrics (MTTR, MTTD, incident frequency) and help teams drive continuous improvement
- Facilitate incident management simulations (game days, failure injection exercises) to build team readiness

Cross-Functional Impact
- Enable 4-6 teams per quarter to successfully adopt improved observability or incident management practices through workshops, consultation, and hands-on guidance
- Identify and remove operational bottlenecks in monitoring and incident response, helping teams reduce MTTR and improve reliability
- Collaborate with the Subsystems Platform Team to translate common needs into self-service observability and incident management capabilities
- Facilitate knowledge sharing through documentation, training materials, and communities of practice that build lasting team competence
- Measure and track team progress on observability maturity and incident management effectiveness, demonstrating measurable improvement
- Work across Azure (80%) and AWS (20%) environments, supporting teams operating on both Windows (80%) and Linux (20%) infrastructure

WHAT YOU'LL BRING:

Core SRE Experience
- 5+ years of experience in Site Reliability Engineering, DevOps, or Platform Engineering with strong focus on observability and production operations
- Proven ability to coach and mentor engineering teams with excellent communication and teaching skills across technical and non-technical audiences
- Consultative mindset with the ability to influence and guide teams without direct authority
- Experience working in Agile/Scrum environments and collaborating with cross-functional teams

Observability Expertise (Required)
- Expert-level hands-on experience with New Relic (APM, Infrastructure Monitoring, Logs, Synthetics, Alerts) and strong proficiency writing NRQL queries for troubleshooting
- Proven experience implementing instrumentation in application code (OpenTelemetry, Serilog, or similar frameworks)
- Deep understanding of structured logging, metrics collection (RED/USE methods), distributed tracing, and creating effective dashboards
