Site Reliability Engineer (SRE) - Remote
Project assessment
The posting gives a very detailed view of the requirements and expectations for the SRE role, with a clear focus on coaching, knowledge transfer, and building up SRE practices in a cloud-native environment. Unfortunately, the hourly rate is not specified.
Site Reliability Engineer (SRE)
Start: ASAP
Duration: 6 months +++
Location: remote
Description:
We are building up Site Reliability Engineering (SRE) practices for our mission-critical Customer Portal, a cloud-native, self-service, and transactional platform that is central to our digital business. The portal is delivered by an Agile Release Train (ART) with 15 teams, responsible for the platform and cross-cutting functions. In addition, external business feature teams outside the ART also contribute functionality to the portal through a shared contribution model.
To accelerate this journey, one internal team member will take the lead for SRE in a “lift & shift” approach. As this person is new to SRE, we are looking for an experienced SRE Champion (external engagement) who can provide hands-on guidance and structured coaching.
This is a transitional role: the Champion will introduce best practices, establish core reliability processes, and enable the internal lead and product teams to independently run and evolve SRE capabilities after the engagement ends.
Responsibilities:
• Act as coach and mentor for the internal SRE lead, ensuring structured knowledge transfer.
• Establish and pilot SRE foundations for the Customer Portal: SLO/SLI framework, error budgets, incident/post-mortem processes, and runbooks.
• Guide the setup of observability, monitoring, and alerting aligned with business reliability needs.
• Promote a cultural shift toward “you build it, you run it” across teams delivering to the portal.
• Define a handover roadmap and playbook to secure sustainable ownership post-engagement.
• Collaborate with both ART teams and external business feature teams to align responsibilities and reliability goals.
• Ensure SRE practices are included in the onboarding process for new ART-external feature teams, providing guardrails and playbooks for reliability.
• Identify the skills and roles needed for an SRE team.
Required Skills & Experience:
• 5+ years establishing or scaling SRE practices for complex, high-traffic, cloud-native products.
• Experience introducing SRE in organizations without an existing SRE structure.
• Expertise with observability and monitoring tooling (e.g., Dynatrace, Prometheus, Grafana, ELK/OpenSearch, or similar).
• Proven track record implementing SLO/SLI/error budget frameworks.
• Hands-on experience with incident response, root cause analysis, and automation for reliability.
• Solid understanding of DevOps practices, CI/CD, and infrastructure-as-code.
• Strong communication and coaching skills to upskill less experienced colleagues.
Nice to Have:
• Familiarity with AIOps and reliability automation.
• Background in compliance and governance in regulated industries.