CareerRiver

System Development Engineer, Elastic Disaster Recovery, AWS Elastic Disaster Recovery

Amazon · San Francisco Bay Area

📍 Santa Clara, California, USA · via Amazon · Posted June 29, 2026
We are looking for a Systems Development Engineer to build the automation, tooling, and operational infrastructure that keep this large-scale, mission-critical service reliable, secure, and efficient. In this role you will treat operations as a software problem: eliminating manual toil, hardening our deployment and monitoring systems, and ensuring our replication and recovery fleet runs flawlessly across a broad and heterogeneous environment. A key dimension of this role is breadth: DRS supports a wide range of operating systems (multiple Linux distributions and Windows versions) and both x86/64 and ARM64 (Graviton) architectures, so your automation and tooling must be robust across diverse OS and hardware combinations.

Key job responsibilities

* Operational automation: Design and build software that automates infrastructure provisioning, deployments, and recurring operational workflows, reducing manual effort and on-call burden across the DRS fleet.
* CI/CD and deployment safety: Build and improve pipelines, deployment guardrails, and rollback mechanisms to ship changes safely across all regions and platform variants.
* Cross-platform support: Develop and maintain tooling that works reliably across a wide range of operating systems (various Linux distributions and Windows) and both x86/64 and ARM64 (Graviton) architectures.
* Monitoring and resilience: Implement monitoring, alarming, and self-healing systems to detect and remediate issues before they impact customers' replication and recovery operations.
* Scaling and performance: Tune and scale the systems behind continuous replication, capacity management, and recovery orchestration to handle growth gracefully.
* Operational excellence: Drive down ticket and incident volume through durable, programmatic fixes; lead root-cause analysis and contribute to runbooks and operational best practices.
* Security and compliance: Partner with security teams to harden the service and remediate findings, ensuring fixes are deployed consistently across the fleet.
* Cross-team leverage: Build automation and tooling that serves multiple teams and raises the operational bar across DRS.

About the team

AWS Elastic Disaster Recovery (DRS) is a disaster recovery service provided by AWS that enables organizations to minimize downtime and data loss with fast, reliable recovery of on-premises and cloud-based applications. DRS uses cost-effective AWS resources to maintain an up-to-date copy of source servers on AWS, allowing for point-in-time recovery and failback to the primary site after an issue is resolved.
