Site Reliability Engineer
We're working with a global technology consultancy that designs, builds, and supports modern software platforms for enterprise customers worldwide. They partner closely with clients to deliver reliable, scalable, cloud-native solutions.
The Role
As an SRE, you'll play a key role in ensuring the availability, performance, and scalability of production systems, supporting customers across the EMEA region, while helping to build, mature, and enhance the SRE function. This is a hands-on, technical role focused on reliability, automation, and operational excellence across a distributed, cloud-based platform.
Key Responsibilities
- Platform Reliability: Deploy, operate, and improve Kubernetes clusters across multiple cloud environments.
- Service Performance: Design and implement processes to enhance system reliability, availability, and scalability.
- CI/CD Enablement: Build and optimise CI/CD pipelines to support safe, repeatable deployments.
- Observability & Incidents: Own monitoring, alerting, and incident response to minimise downtime and speed recovery.
- Root Cause Analysis: Lead post-incident reviews and implement long-term preventative improvements.
- Automation: Reduce operational toil through automation and performance optimisation.
- On-Call: Participate in weekday coverage and a once-monthly weekend rota.
Collaboration & Stakeholder Engagement
- Work closely with engineering, infrastructure, and product teams to embed SRE best practices.
- Advocate for reliability, resilience, and operational excellence across teams.
- Collaborate with a globally distributed engineering function.
- Engage directly with customers to resolve incidents and improve user experience.
Skills & Experience
- 5+ years of proven experience as an SRE or in a similar role, supporting complex distributed systems.
- Strong Kubernetes experience (AKS, EKS, GKE, or similar).
- Hands-on with observability tools such as Prometheus, Grafana, Kibana, Vector, or Superset.
- Experience with at least one major cloud platform: AWS, Azure, GCP, or Linode.
- SQL database experience (PostgreSQL beneficial but not essential).
- Proficiency in Python, Go, or Rust.
- Strong Linux expertise, including performance tuning and troubleshooting.
- Excellent communication skills, with the ability to work effectively with engineers, customers, and cross-functional teams.
Please apply now if you meet the above criteria, or contact Andrew Harrison directly.
