Site Reliability Engineering Manager
CAE Inc.
Roles & Responsibilities
We are seeking a Site Reliability Engineering (SRE) Manager with a strong mix of SRE and database experience to lead the reliability, scalability, performance, observability, and operational excellence of critical production platforms. This role combines deep database expertise with broader ownership of service reliability, ensuring that databases and platform services are engineered as highly available, measurable, automated, and resilient systems aligned with business and customer needs.
The Role We Are Offering You
Leadership & Team Management:
- Lead and develop a team of reliability engineers across database and platform domains
- Define the reliability strategy, operating model, and roadmap for critical production services
- Build a culture of automation, ownership, resilience, and continuous improvement
Service & Database Reliability:
- Own the availability, performance, scalability, and resilience of production databases and dependent platform services
- Define and enforce SLIs, SLOs, and error budgets for critical services and database platforms
- Lead incident management, postmortems, and remediation efforts to reduce recurrence and improve recovery
Observability & Monitoring:
- Define observability strategy across databases, services, infrastructure, and dependencies
- Ensure effective metrics, logs, traces, dashboards, and alerting for proactive detection and response
Automation & Platform Engineering:
- Drive Infrastructure as Code, operational automation, and self-healing mechanisms
- Reduce manual toil through standardized workflows for provisioning, scaling, backup, recovery, and routine operations
Performance, Capacity & Cost Efficiency:
- Improve system and database performance through tuning, capacity planning, and architecture reviews
- Partner on cost optimization initiatives across cloud infrastructure, database services, storage, and licensing
Disaster Recovery & Operational Readiness:
- Own RTO/RPO alignment, resilience planning, backup strategy, and disaster recovery exercises
- Improve production readiness for releases, migrations, and major operational events
Cross-Functional Collaboration:
- Partner with Development, Platform Engineering, Infrastructure, Security, and Product teams to improve service reliability end to end
- Act as a technical leader and escalation point for reliability concerns across the stack
Minimum Qualifications
To succeed in this role, you bring:
- 7+ years of experience in Site Reliability Engineering, Database Engineering, or related production engineering roles
- 2+ years of leadership experience managing engineers or technical teams
- Strong expertise in relational database platforms such as PostgreSQL, Oracle, or SQL Server
- Experience operating reliable services in cloud or hybrid environments, including AWS-managed or self-managed platforms
- Strong analytical, operational, and performance-focused mindset
- Ability to balance service reliability, engineering velocity, and cost efficiency
- Strong background in automation, observability, incident response, performance tuning, and high availability/disaster recovery practices
- Knowledge of Kubernetes, containerized workloads, and platform engineering practices
- Experience defining SLOs, SLIs, alerting strategies, and operational readiness standards
- Strong communication and cross-functional collaboration skills across engineering and business stakeholders
Preferred Qualifications
- Experience with distributed systems, large-scale production environments, or customer-facing platforms
- Experience with automation frameworks, Infrastructure as Code, and CI/CD pipelines
- Exposure to FinOps, cloud cost optimization, or capacity modeling initiatives
- Familiarity with modern observability platforms and reliability review processes
About CAE
At CAE, our mission is clear: to help make the world a safer place. For nearly 80 years, we’ve driven innovation in simulation, training, and mission readiness to support critical operations worldwide. By leveraging advanced technologies, we empower our customers to operate smarter, faster, and more sustainably. Join a purpose-driven organization where bold ideas are encouraged, collaboration drives progress, and your growth fuels our shared success.
Position Type
Regular
Equal Opportunity & Accommodations
CAE is committed to providing equal opportunities to all applicants, regardless of race, nationality, color, religion, sex, gender identity or expression, sexual orientation, disability, neurodiversity, veteran status, age, or other characteristics protected by law. We encourage applicants who may not meet every qualification to apply. Reasonable accommodations are available—contact your recruiter or email [email protected] if needed.
Data Privacy
Privacy Statement | CAE
As part of our process, we may use AI‑supported tools to help review applications, with human decision‑making at every step. CAE thanks all applicants for their interest. However, only those whose background and experience match the requirements of the role will be contacted.
How to Apply
To apply for this job, you must log in on our website. If you do not have an account yet, please register.