Senior Site Reliability Engineer (SRE)

Permanent employee, Full-time · Berlin / Frankfurt (Main)

50,000 - 100,000 € per year

Your Hundertserver mission:

As a Site Reliability Engineer (SRE) at Hundertserver, you are responsible for the stable, high-performing, and secure operation of modern cloud platforms. Through automation, monitoring, SLAs, and incident response, you ensure that our systems not only run – but continuously improve. You work closely with customers, development, and infrastructure teams, bring clarity to complex operational issues, and create sustainable solutions – hands-on, pragmatic, and with a high degree of ownership.

The Main Tasks:

Key Responsibilities
Availability & Stability
• Ensuring platform availability according to defined SLOs / SLAs
• Analyzing and resolving incidents & performance issues (including on-call duties)
• Building and maintaining robust alerting, logging, and monitoring setups
• Root cause analysis & implementation of preventive measures

Automation & Infrastructure
• Automating provisioning, scaling, and maintenance (IaC with Terraform, Ansible, etc.)
• Operating and enhancing Kubernetes environments (cloud & on-prem)
• Developing and maintaining self-healing and auto-scaling mechanisms
• Creating and maintaining runbooks & playbooks

Monitoring, Observability & Performance
• End-to-end monitoring with tools like Prometheus, Grafana, Loki, ELK
• Defining and maintaining SLIs and SLOs for data-driven platform management
• Performing performance analyses (workloads, traffic, databases) and ongoing optimization
• Setting up & maintaining distributed tracing and logging systems

Security & Operational Hygiene
• Implementing and enforcing security standards (least privilege, TLS, secrets management)
• Regular health checks, updates, and patching
• Ensuring availability through established backup & disaster recovery processes

Collaboration & Consulting
• Close collaboration with development, support, and platform teams
• Consulting customers on operating models, platform metrics & architectural decisions
• Training internal teams on topics such as monitoring, SRE basics & troubleshooting

You are a good fit for our team if:

What You Should Bring
Technical Profile
• Linux expertise (Debian, Ubuntu, RHEL)
• Deep knowledge of Kubernetes – clusters, ingress, operators, Helm, etc.
• Experience with cloud platforms (AWS, Azure, GCP)
• Strong expertise in monitoring stacks (Prometheus, Grafana, Loki, ELK)
• Proficiency in Infrastructure-as-Code (Terraform, Ansible, Puppet)
• Scripting and automation skills (Bash, Python, Go)
• Familiarity with logging, tracing & incident management processes

Soft Skills & Working Style
• Proactive troubleshooting & a strong commitment to quality
• Structured, analytical thinking – solution-oriented and pragmatic
• Excellent communication skills (with customers, developers, and operations)
• Focus on sustainability & automation rather than firefighting
• Willingness to participate in on-call rotations (standby, SLA windows)

Nice to Have
• Certifications such as CKA / CKS / AWS DevOps or equivalent
• Experience with GitOps, ArgoCD, or Policy-as-Code
• Knowledge of FinOps / cost optimization in cloud platforms

What we offer:

What You Can Expect at Hundertserver
• Real professional growth – in technology, methodology & culture
• Modern platforms & tools – with room for your own ideas
• Ownership & trust – we work in partnership, not through hierarchy
• Flexible working hours & a remote-first culture
• Hands-on mentality & direct customer impact

Apply for this job

About us

ONEHUNDRED / Hundertserver is the cloud service provider that doesn’t just support digital transformation – we actively shape it. Based in the heart of Berlin and trusted by clients such as Gründerszene, Edelman, and Prognos, we develop innovative, secure, and sovereign cloud solutions for a connected future.
Our team lives and breathes technology, thrives on challenges, and is always pushing the boundaries of what cloud can do. With over 20 years of experience, deep open-source expertise, and a strong focus on data sovereignty, efficiency, and quality, we guide organizations on their journey into the multi-cloud world.
What defines us? Integrity, team spirit, a passion for learning, and the courage to break new ground. We’re open, agile, and driven by progress – and we’re looking for people who share that mindset.
Join our team and help shape the future of cloud with us.

Your application!

Thank you for your interest in Hundertserver. Please fill out the following short form. If you have any difficulties uploading your data, please contact us via email at info@hundertserver.de.