Site Reliability Engineer (SRE) - Remote

Project Assessment

The posting gives a very detailed picture of the requirements and expectations for the SRE role, with a clear focus on coaching, knowledge transfer, and building SRE practices in a cloud-native environment. Unfortunately, the hourly rate is not specified.

Site Reliability Engineer (SRE)

Start: ASAP

Duration: 6 months +++

Location: remote

Description:

We are building up Site Reliability Engineering (SRE) practices for our mission-critical Customer Portal, a cloud-native, self-service, and transactional platform that is central to our digital business. The portal is delivered by an Agile Release Train (ART) with 15 teams, responsible for the platform and cross-cutting functions. In addition, external business feature teams outside the ART also contribute functionality to the portal through a shared contribution model.

To accelerate this journey, one internal team member will take the lead for SRE in a “lift & shift” approach. As this person is new to SRE, we are looking for an experienced SRE Champion (external engagement) who can provide hands-on guidance and structured coaching.

This is a transitional role: the Champion will introduce best practices, establish core reliability processes, and enable the internal lead and product teams to independently run and evolve SRE capabilities after the engagement ends.

Responsibilities:

• Act as coach and mentor for the internal SRE lead, ensuring structured knowledge transfer.

• Establish and pilot SRE foundations for the Customer Portal: SLO/SLI framework, error budgets, incident/post-mortem processes, and runbooks.

• Guide the setup of observability, monitoring, and alerting aligned with business reliability needs.

• Promote a cultural shift toward “you build it, you run it” across teams delivering to the portal.

• Define a handover roadmap and playbook to secure sustainable ownership post-engagement.

• Collaborate with both ART teams and external business feature teams to align responsibilities and reliability goals.

• Ensure SRE practices are included in the onboarding process for new ART-external feature teams, providing guardrails and playbooks for reliability.

• Identify the skills and roles needed for an SRE team.

Required Skills & Experience:

• 5+ years establishing or scaling SRE practices for complex, high-traffic, cloud-native products.

• Experience introducing SRE in organizations without an existing SRE structure.

• Expertise with observability and monitoring tooling (e.g., Dynatrace, Prometheus, Grafana, ELK/OpenSearch, or similar).

• Proven track record implementing SLO/SLI/error budget frameworks.

• Hands-on experience with incident response, root cause analysis, and automation for reliability.

• Solid understanding of DevOps practices, CI/CD, and infrastructure-as-code.

• Strong communication and coaching skills to upskill less experienced colleagues.

Nice to Have:

• Familiarity with AIOps and reliability automation.

• Background in compliance and governance in regulated industries.

Automation, Agile Methodology, Continuous Integration, DevOps, Incident Response, Governance, Reliability Engineering, Prometheus, Root Cause Analysis, Grafana

Type of employment

contracting

Posted on

18 September 2025

Offered by:

Freelancermap
