Site Reliability Engineering Manager

CAE Inc.

Posted: 4 hours ago

City: Montevideo, Montevideo

Contract type: Full-time

Behind every success is a team of dedicated experts driving us forward. Our corporate functions don’t just support — they lead, shaping the company’s path and preparing clients and employees for the moments that matter. Be part of a team where your work makes a difference, with opportunities to grow, collaborate, and thrive.

Site Reliability Engineering Manager (SRE)

Roles & Responsibilities

We are seeking a Site Reliability Engineering (SRE) Manager with strong SRE and database experience to lead the reliability, scalability, performance, observability, and operational excellence of critical production platforms. This role combines deep database expertise with broader service reliability ownership, ensuring that databases and platform services are engineered as highly available, measurable, automated, and resilient systems aligned with business and customer needs.

The Role We Are Offering You

Leadership & Team Management:

Lead and develop a team of reliability engineers across database and platform domains
Define the reliability strategy, operating model, and roadmap for critical production services
Build a culture of automation, ownership, resilience, and continuous improvement

Service & Database Reliability:

Own the availability, performance, scalability, and resilience of production databases and dependent platform services
Define and enforce SLIs, SLOs, and error budgets for critical services and database platforms
Lead incident management, postmortems, and remediation efforts to reduce recurrence and improve recovery

Observability & Monitoring:

Define observability strategy across databases, services, infrastructure, and dependencies
Ensure effective metrics, logs, traces, dashboards, and alerting for proactive detection and response

Automation & Platform Engineering:

Drive Infrastructure as Code, operational automation, and self-healing mechanisms
Reduce manual toil through standardized workflows for provisioning, scaling, backup, recovery, and routine operations

Performance, Capacity & Cost Efficiency:

Improve system and database performance through tuning, capacity planning, and architecture reviews
Partner on cost optimization initiatives across cloud infrastructure, database services, storage, and licensing

Disaster Recovery & Operational Readiness:

Own RTO/RPO alignment, resilience planning, backup strategy, and disaster recovery exercises
Improve production readiness for releases, migrations, and major operational events

Cross-Functional Collaboration:

Partner with Development, Platform Engineering, Infrastructure, Security, and Product teams to improve service reliability end to end
Act as a technical leader and escalation point for reliability concerns across the stack

Minimum Qualifications

To succeed in this role, you bring:

7+ years of experience in Site Reliability Engineering, Database Engineering, or related production engineering roles
2+ years of leadership experience managing engineers or technical teams
Strong expertise in relational database platforms such as PostgreSQL, Oracle, or SQL Server
Experience operating reliable services in cloud or hybrid environments, including AWS-managed or self-managed platforms
Strongly analytical, operational, and performance-focused mindset
Ability to balance service reliability, engineering velocity, and cost efficiency
Strong background in automation, observability, incident response, performance tuning, and high availability/disaster recovery practices
Knowledge of Kubernetes, containerized workloads, and platform engineering practices
Experience defining SLOs, SLIs, alerting strategies, and operational readiness standards
Strong communication and cross-functional collaboration skills across engineering and business stakeholders

Preferred Qualifications

Experience with distributed systems, large-scale production environments, or customer-facing platforms
Experience with automation frameworks, Infrastructure as Code, and CI/CD pipelines
Exposure to FinOps, cloud cost optimization, or capacity modeling initiatives
Familiarity with modern observability platforms and reliability review processes

About CAE

At CAE, our mission is clear: to help make the world a safer place. For nearly 80 years, we’ve driven innovation in simulation, training, and mission readiness to support critical operations worldwide. By leveraging advanced technologies, we empower our customers to operate smarter, faster, and more sustainably. Join a purpose-driven organization where bold ideas are encouraged, collaboration drives progress, and your growth fuels our shared success.

Position Type

Regular

Equal Opportunity & Accommodations

CAE is committed to providing equal opportunities to all applicants, regardless of race, nationality, color, religion, sex, gender identity or expression, sexual orientation, disability, neurodiversity, veteran status, age, or other characteristics protected by law. We encourage applicants who may not meet every qualification to apply. Reasonable accommodations are available—contact your recruiter or email [email protected] if needed.

Data Privacy

Privacy Statement | CAE

As part of our process, we may use AI‑supported tools to help review applications, with human decision‑making at every step. CAE thanks all applicants for their interest. However, only those whose background and experience match the requirements of the role will be contacted.
