Site Reliability Engineer, Observability
Ripple Labs Inc · Chicago, IL
📍 Chicago, Illinois, United States · 💰 $160,000 · via Greenhouse · Posted 2026-06-18
At Ripple, we’re building a world where value moves like information does today. It’s big, it’s bold, and we’re already doing it. Through our crypto solutions for financial institutions, businesses, governments and developers, we are improving the global financial system and creating greater economic fairness and opportunity for more people, in more places around the world. And we get to do the best work of our careers and grow our skills surrounded by colleagues who have our backs.
If you’re ready to see your impact and unlock incredible career growth opportunities, join us, and build real-world value.
Ripple Treasury, now a Ripple solution, joined Ripple through a 2025 acquisition that marked a significant expansion into the multi-trillion-dollar corporate finance arena.
Ripple Treasury has more than 40 years of experience supporting some of the world’s largest and most sophisticated companies. Integrating its treasury command center into Ripple’s technology stack gives corporates the ability to move, manage and optimize liquidity in real time, across traditional and digital assets, under one expanded umbrella.
Join us to build the future of corporate treasury and the infrastructure that powers the Internet of Value.
THE WORK:
As a Site Reliability Engineer, you will be a force multiplier, elevating engineering capabilities across observability and incident management. You will empower Ripple's stream-aligned engineering teams to detect, diagnose, and resolve production issues quickly and effectively, helping keep our products highly available, performant, and resilient at scale for customers managing trillions in annual payment volume. You will be part of Ripple's Technical Operations team, coaching teams to build comprehensive monitoring, effective alerting, and mature incident response practices. Through workshops, consultation, and hands-on guidance, you'll help teams achieve operational excellence and self-sufficiency. If you're passionate about building capabilities in others and creating lasting impact through observability and incident management, this is the opportunity for you.
WHAT YOU’LL DO:
Observability Enablement
Coach teams on instrumenting applications with structured logs, metrics, and distributed traces using New Relic and OpenTelemetry
Guide teams in creating effective dashboards, alerts, and SLOs/SLIs that provide actionable insights into system health and reduce Mean Time to Detection (MTTD)
Teach teams to define and track error budgets, using them to balance feature velocity with reliability
Provide hands-on guidance during production incidents to coach real-time troubleshooting using observability data
Develop golden path examples for instrumentation patterns, dashboard templates, and alert configurations that teams can adopt independently
Help teams optimize their use of New Relic (APM, Infrastructure, Logs, Synthetics) across Azure and AWS multi-cloud environments
Build team capability to identify and resolve performance bottlenecks, resource constraints, and degradation patterns
Incident Management Administration & Enablement
Administer and configure the Incident.IO platform, ensuring it supports effective incident response workflows across all engineering teams
Coach teams on incident response best practices: classification, escalation, communication, coordination, and resolution
Help teams establish on-call rotation schedules, runbooks, and escalation policies that ensure appropriate incident coverage
Facilitate post-incident review (PIR) processes, teaching teams to identify root causes, document learnings, and implement preventive measures
Guide teams in defining incident severity levels and response procedures aligned with business impact
Integrate observability tooling (New Relic) with incident management (Incident.IO) to enable rapid detection and diagnosis
Track and report on incident metrics (MTTR, MTTD, incident frequency) and help teams drive continuous improvement
Facilitate incident management simulations (game days, failure injection exercises) to build team readiness
Cross-Functional Impact
Enable 4-6 teams per quarter to successfully adopt improved observability or incident management practices through workshops, consultation, and hands-on guidance
Identify and remove operational bottlenecks in monitoring and incident response, helping teams reduce MTTR and improve reliability
Collaborate with the Subsystems Platform Team to translate common needs into self-service observability and incident management capabilities
Facilitate knowledge sharing through documentation, training materials, and communities of practice that build lasting team competence
Measure and track team progress on observability maturity and incident management effectiveness, demonstrating measurable improvement
Work across Azure (80%) and AWS (20%) environments, supporting teams operating on both Windows (80%) and Linux (20%) infrastructure
WHAT YOU'LL BRING:
Core SRE Experience
5+ years of experience in Site Reliability Engineering, DevOps, or Platform Engineering with a strong focus on observability and production operations
Proven ability to coach and mentor engineering teams with excellent communication and teaching skills across technical and non-technical audiences
Consultative mindset with the ability to influence and guide teams without direct authority
Experience working in Agile/Scrum environments and collaborating with cross-functional teams
Observability Expertise (Required)
Expert-level hands-on experience with New Relic (APM, Infrastructure Monitoring, Logs, Synthetics, Alerts) and strong proficiency writing NRQL queries for troubleshooting
Proven experience implementing instrumentation in application code (OpenTelemetry, Serilog, or similar frameworks)
Deep understanding of structured logging, metrics collection (RED/USE methods), distributed tracing, and creating effective dashboards