Senior AIOps Engineer I

Porch

via Workday


Porch Group is a leading vertical software and insurance platform and is positioned to be the best partner to help homebuyers move, maintain, and fully protect their homes. We offer differentiated products and services, with homeowners insurance at the center of this relationship. We differentiate and look to win in the massive and growing homeowners insurance opportunity by 1) providing the best services for homebuyers, 2) led by advantaged underwriting in insurance, 3) to protect the whole home.

As a leader in the home services software-as-a-service (“SaaS”) space, we’ve built deep relationships with approximately 30 thousand companies that are key to the home-buying transaction, such as home inspectors, mortgage companies, and title companies. In 2020, Porch Group rang the Nasdaq bell and began trading under the ticker symbol PRCH. We are looking to build a truly great company and are JUST GETTING STARTED.

Job Title: Senior AIOps Engineer I
Location: India
Workplace Type: Remote

Job Summary

The future is bright for the Porch Group, and we’d love for you to be a part of it as our Senior AIOps Engineer I.

We are looking for a Senior AIOps Engineer I who will partner with product managers, platform engineers, data scientists, and machine learning engineers to ensure our AI and ML-powered systems are reliable, observable, secure, and cost-efficient in production. You will focus on how AI systems run in real-world environments: monitoring model performance and drift, ensuring robust deployment pipelines, managing incidents, standing up new AI infrastructure, and improving the stability and scalability of our AI platform. You'll help evolve our AI & ML Ops stack and operational processes so teams can ship AI features quickly and safely.

Our AI/ML stack is based on Python and runs on Kubernetes (GKE) and Google Cloud Platform.
We use tools such as Union Cloud (Flyte) for ML workflow orchestration, BentoML for model serving, Feast for feature stores, Label Studio for data annotation, BigQuery as our central data warehouse, and Dataflow for streaming/batch data pipelines. On the GenAI side, we operate a centralized LLM routing/gateway service across providers, batch prediction services for large-scale LLM inference, and are building out RAG infrastructure. You will maintain and harden this ecosystem, and stand up new infrastructure components as we expand our AI platform capabilities.

What You Will Do As A Senior AIOps Engineer I

Own production reliability for AI/ML services
- Monitor and improve the reliability, availability, and performance of AI/ML-powered services running in production.
- Define and maintain SLOs/SLIs for critical AI systems (e.g., latency, error rates, model performance), tying them to user experience and business impact where possible.
- Own recurring model refresh cycles: coordinate retraining, validation, and redeployment of production models to prevent staleness and drift.

Build and improve AI observability
- Design and implement monitoring, logging, and alerting for models and data pipelines in partnership with AI Engineers and Data Scientists.
- Integrate model and system metrics with existing observability stacks (Datadog, Opik, etc.) and dashboards used by engineering and operations teams.
- Build and maintain monitoring workflows for pipeline health.

Support scalable, safe deployment of models
- Collaborate with data scientists and ML engineers to streamline deployment workflows for models and related services (blue/green, canary, A/B, shadow deployments).
- Support the productionization of image-based ML models, including batch prediction workflows, model performance monitoring, and data pipeline integration.
- Improve CI/CD pipelines and release processes for AI services to reduce risk and increase deployment frequency.

Stand up and operate AI infrastructure
- Provision, deploy, configure, and maintain new AI infrastructure components as they are adopted across the organization, including AI gateways, RAG platforms, LLM observability tools, agentic workflows, and no-code agent builders.
- Utilize and improve existing frameworks and tools (e.g., Union Cloud, BentoML, Feast, Kubernetes, Terraform, and GCP services) to support robust and maintainable AI infrastructure.
- Build automation and tooling to reduce manual operational work, especially around model promotion, configuration, environment management, and Docker image maintenance.
- Support multi-BU infrastructure provisioning: create and manage separate environments, projects, roles, and CI/CD integrations for different business units.

Operate and maintain Label Studio
- Own the operational health of Label Studio, our production data annotation platform used for ground truth collection, model evaluation, and ML training dataset creation.
- Maintain the supporting infrastructure around Label Studio, including GCS storage buckets and BigQuery data pipelines that feed annotation projects; coordinate with Platform/IT partners for database and SSO dependencies.
- Support bi-weekly annotation project cycles and ensure platform availability for labeling specialists.

Build and maintain RAG and vector database infrastructure
- Stand up and operate RAG (Retrieval-Augmented Generation) platforms, including vector databases, embedding pipelines, and context retrieval APIs.
- Collaborate with AI Engineers on data source connectors, sync schedules, retention policies, and access control models for RAG infrastructure.
- Optimize embedding storage and retrieval performance at scale.

Optimize LLM costs
- Monitor and optimize LLM token usage and costs across providers, leveraging batch vs. real-time inference strategies to reduce spend.
- Implement and maintain centralized cost tracking dashboards and alerting for LLM consumption across business units.
- Evaluate and recommend cost-efficient model routing and provider selection strategies.

Manage AI-related incidents and post-incident learning
