SR SITE RELIABILITY ENGINEER

XTN-37D6360

City: Makati City, Philippines
Schedule

Office Location

OFFSITE

Make your next big career move by applying as KMC Solutions' next SR SITE RELIABILITY ENGINEER

Role Summary
We are building a 24×7 senior platform operations team to expand our Infrastructure & Operations
function. Six Senior Site Reliability Engineers will join our existing EU-based SREs to provide
round-the-clock L3 platform engineering coverage across a cloud platform and our wider customer
estate.
This is a senior individual contributor role. You will not be ramping on Linux. You will be running
production platforms at scale, owning the automation that keeps them clean, and acting as the
senior on-shift technical authority during your coverage window. The role combines engineering
work (Ansible, Python, Terraform/OpenTofu, automation in code) with platform operations (L3
escalation, vendor case management, incident response, change implementation). Both halves
matter, and the role is structured so neither degrades into ticket-clearing.
This is one of the largest parallel openings Gatti has run. We are aiming for top-tier Linux and
platform engineers. We do not expect every candidate to arrive with the complete skill stack on day
one. We are willing to invest meaningfully in training the right people. We offer cutting-edge data
centre work at scale, generous


Scope of Support
The team is responsible for L3 platform support, day-to-day operations, and engineering
improvements across the following layers of a cloud platform:
Storage. Pure Storage arrays, VAST arrays, the internal storage proxy, VMS proxies for customer-facing
access to VMS, the Image Builder, and the Image Publisher.
Supporting services and infrastructure. Support servers (DNS64, NAT64, internal Git mirrors,
Prometheus and its exporters, Vector, ISO-server mirrors); Management Gateways (DNS, DHCP,
TFTP, WireGuard tunnels); Nautobot; GitHub repositories; monitoring stack (Grafana, Prometheus
and adjacent tooling); vendor support portals (case open and monitor across the relevant vendors);
Authentication services from Azure; the NVIDIA Enterprise Support Portal accounts.
Server platforms. BlueField DPUs, InfiniBand HCAs, GPUs and NVSwitch, and local storage.
Networking. InfiniBand switches and subnet managers; Ethernet switches and routers; internal
DNS.
Software. Operating system and node images, NVIDIA drivers, the AI software stack (SLURM,
plus the Kubernetes layer: ingress-nginx, cert-manager, VictoriaMetrics, OpenBao, Harbor,
Portworx CSI, Defguard, ArgoCD, Kustomize).
You will not be specialised on a single sliver of it. You will be expected to develop working depth
across this stack over your first 9 to 12 months.

The main responsibilities of a SR SITE RELIABILITY ENGINEER include:

What You'll Do
Run L3 platform operations during your shift. Be the on-shift senior for all platform escalations
in your coverage window. Own incidents end-to-end from triage to root cause, vendor case
management, and post-incident write-up. Hand over cleanly to the next shift.
Bare-metal provisioning and configuration management. Build and evolve the Ansible roles,
playbooks, and templates that bring new Debian-based hosts from BMC power-on to fully
configured production nodes. Own the lifecycle of network-based installation methods (PXE, iPXE,
preseed, kickstart) used to roll out hundreds of nodes per site.
Infrastructure as code. Develop and maintain Terraform / OpenTofu modules for provisioning
network, compute, and storage primitives across our environments. Keep state hygiene, module
reuse, and CI/CD discipline at a senior standard.
Tooling and automation in Python. Write and maintain production-quality Python for cluster
lifecycle operations, inventory reconciliation, DCIM integrations, monitoring exporters, and incident
response automation. This is not throw-away scripting. It is code that runs in production and that
other engineers will read and reuse.
Operations on real data centre hardware. Troubleshoot at the intersection of hardware,
BMC/IPMI, kernel, network fabric, and application layers. Drive failure analyses that the rest of the
team learns from. Document what you find.
Change implementation and platform hygiene. Run scheduled changes, software upgrades,
image rollouts, and configuration drift remediation across the fleet. Make recurring operational
issues go away in code, not in heroics.
Knowledge transfer and peer review. Contribute to runbooks. Review pull requests. Raise the
engineering bar on the team. Onboard new joiners as the team scales.
Cross-shift and cross-site collaboration. Work daily with EU-based SREs, with the wider
Infrastructure & Operations function, and with peers on other shifts. Maintain clean handovers and
a shared operational picture across the rotation.

To apply, you must be an expert on the following requirements:

What We're Looking For
We are hiring for capability, not for ticking every box on the list. Candidates who bring a strong
subset of the hard requirements and the right disposition will be considered, with a structured
ramp-up plan in place for the rest.
Hard requirements (the core skills profile):
• 5+ years of hands-on Site Reliability, Platform, or Infrastructure Engineering experience.
• Deep working knowledge of Linux on bare metal, not only in cloud environments.
Comfortable with Debian-based distributions (Debian, Ubuntu) at the level of kernel
parameters, networking, storage, systemd internals.
• Strong shell scripting (Bash) for operational work and automation glue.
• Production-grade Python: reading, writing, packaging, testing, and reviewing real code.
• Deep Ansible experience: idempotent role design, inventory patterns, secrets management,
scale considerations.
• Strong Terraform or OpenTofu experience, including module design and state management
discipline.
• Hands-on experience working in a data centre environment, including racking,
BMC/iLO/iDRAC, structured cabling adjacency, and working with onsite teams on physical
issues.
• Strong written and spoken English. You will operate in a fully English-language
environment with European colleagues; the role requires clear, concise written
communication and confident verbal handovers.
Nice to have (we will train where needed):
• Network-based installation methods (PXE, iPXE, preseed, kickstart, FAI, MAAS).
• Zero-touch Linux provisioning at fleet scale.
• Virtualisation: Proxmox VE and / or OpenShift Virtualization (KubeVirt).
• Deep Linux kernel knowledge (modules, networking stack, performance tuning, kernel
debugging).
• SCM with Git at a senior level: branching strategy, code review discipline, CI integration.
• Experience with DCIM tools and their APIs (Nautobot, NetBox, or comparable).
• Exposure to HPC, AI/ML compute, or large GPU clusters.
• Familiarity with InfiniBand, RoCE, or other low-latency fabrics.
• Operational experience with any of: SLURM, Kubernetes (ingress-nginx, cert-manager,
ArgoCD, Kustomize), Prometheus / VictoriaMetrics, Harbor, OpenBao, Portworx, Pure
Storage, VAST, NVIDIA BlueField.


The successful candidate must submit the following pre-employment requirements:

Scanned copy of valid NBI Clearance
Accomplished Medical or PEME Slip (covered by KMC)
2x2 & Half body picture with white background
Proof of government numbers (TIN, SSS, Pag-IBIG, and PhilHealth)
Photocopy of 2 valid IDs – front & back (government-issued)
Clear copy of your Birth Certificate (PSA or NSO)
Accomplished HR Forms & Promissory Note (will be provided by KMC’s Onboarding Team)


KMC Careers

If you're a rockstar at what you do and looking to be a part of our amazing story, we want to hear from you!

We offer attractive salaries and benefits, plus you get to work in some of the Philippines' best flexible workspaces. Our employees also enjoy exclusive discounts, rewards and freebies, and invites to our monthly events. We are always recruiting for roles in IT & Development, Marketing, Business Administration, HR & Recruitment, and Legal & Finance.

KMC provides quality employment opportunities for job-seekers looking for a career that is both challenging and fulfilling. We are also committed to providing equal opportunities at every selection stage. We do not discriminate on the basis of age, gender, sexual orientation, ethnicity, nationality, or religion.

Work with Us. Grow with Us.

KMC Solutions offers a variety of career opportunities in Metro Manila, Cebu, Clark, and Iloilo. We are always looking for talented and enthusiastic individuals who are ready to make their next big career move.

Our Culture

At KMC, we foster an inclusive and positive workplace for all. We push our members to succeed in everything they do through our collaborative work environment. We encourage our community to work hard and reach their full potential while delivering results that matter for our members and you as professionals.

We host high-quality events and implement people-centric policies that support flexible work. We ensure that everyone in our expansive network is engaged, from our internal employees to those who work on behalf of our offshore partners.

Life within KMC: Work Hard Party Harder

At KMC, we work hard and we are committed to putting our best foot forward in everything we do. Everyone is encouraged to be an individual while also working for the collective good of the KMC Community. We believe mistakes are opportunities and that you should not present a solution without a problem.

We also know when hard work deserves to be recognized, so we reward our employees with monthly parties, free trips, and much, much more!
