SITE RELIABILITY ENGINEER

XTN-DAD3686

City: NA, Philippines
Schedule:
Office Location: OFFSITE

Make your next big career move by applying to be KMC Solutions' next SITE RELIABILITY ENGINEER.

You will validate and test GPU clusters prior to production release, ensuring hardware integrity, system reliability, and optimal performance. This role involves provisioning clusters, executing performance benchmarks, maintaining automated validation frameworks, and troubleshooting Linux-based systems in high-performance compute environments. You will collaborate closely with engineering and operations teams to ensure seamless handovers and production readiness.

On top of your salary, here are the exciting benefits you can look forward to:

•  Health Insurance/HMO
•  Enjoy unlimited MadMax Coffee
•  Diverse learning & growth opportunities
•  Accessible Cloud HR platform (Sprout)
•  Above-standard leave allowances

The main responsibilities of a SITE RELIABILITY ENGINEER include:

Cluster Validation & Testing

  • Validate GPU clusters of varying sizes to ensure hardware and system integrity prior to production release

  • Perform functional and reliability testing of GPUs, servers, and associated components

  • Verify network connectivity and performance, including InfiniBand where applicable

Orchestration & Benchmarking

  • Provision and configure GPU clusters using automated workflows

  • Execute and analyse performance and stability benchmarks orchestrated via Slurm

  • Validate results against expected performance and reliability thresholds

Test Framework & Automation

  • Maintain and extend the automated validation framework built using Python and Ansible

  • Integrate new test cases to support additional hardware platforms and GPU generations

  • Improve test reliability, coverage, and execution efficiency

Remediation & System Integrity

  • Diagnose and remediate unhealthy nodes through configuration changes or software fixes

  • Coordinate with on-site support and Smart Hands teams for hardware replacements when required

  • Ensure all issues are resolved and documented prior to handover to production operations

Documentation & Handover

  • Produce clear, accurate documentation of test results, hardware states, and remediation actions

  • Ensure smooth handovers to operations and engineering teams

  • Maintain up-to-date runbooks and validation procedures

To apply, you must meet the following requirements:

Essential
• Strong hands-on experience administering and troubleshooting Linux systems (priority)
• Confident use of CLI tools for diagnostics, including analysis of kernel logs, drivers, and system services
• Excellent written and verbal English communication skills
• High standards for system reliability, consistency, and documentation

Preferred / Desirable
• Experience working with GPU-based or high-performance compute environments
• Familiarity with Slurm or other workload schedulers
• Understanding of datacenter hardware lifecycle and server validation processes
• Exposure to InfiniBand or high-speed networking technologies
• Experience working with distributed or remote infrastructure teams
• Proficiency in Python for automation, test execution, and parsing results
• Proven experience writing and maintaining Ansible playbooks

The successful candidate must submit the following pre-employment requirements:

  • Scanned copy of valid NBI Clearance
  • Accomplished Medical or PEME Slip (covered by KMC)
  • 2x2 and half-body pictures with white background
  • Proof of government numbers (TIN, SSS, Pag-IBIG, and PhilHealth)
  • Photocopy of 2 valid IDs – front & back (government-issued)
  • Clear copy of your Birth Certificate (PSA or NSO)
  • Accomplished HR Forms & Promissory Note (will be provided by KMC’s Onboarding Team)

KMC Careers

If you're a rockstar at what you do and looking to be a part of our amazing story, we want to hear from you!

We offer attractive salaries and benefits, plus you get to work in some of the Philippines' best flexible workspaces. Our employees also enjoy exclusive discounts, rewards and freebies, and invites to our monthly events. We are always recruiting for roles in IT & Development, Marketing, Business Administration, HR & Recruitment, and Legal & Finance.

KMC provides quality employment opportunities for job-seekers looking for a career that is both challenging and fulfilling. We are also committed to providing equal opportunities at every selection stage. We do not discriminate on the basis of age, gender, sexual orientation, ethnicity, nationality, or religion.

Work with Us. Grow with Us.

KMC Solutions offers a variety of career opportunities in Metro Manila, Cebu, Clark, and Iloilo. We are always looking for talented and enthusiastic individuals who are ready to make their next big career move.

Our Culture

At KMC, we foster an inclusive and positive workplace for all. We push our members to succeed in everything they do through our collaborative work environment. We encourage our community to work hard and reach their full potential while delivering results that matter for our members and you as professionals.

We host quality events and implement people-centric policies that support flexible work. We ensure that everyone in our expansive network is engaged, from our internal employees to those who work on behalf of our offshore partners.

Life within KMC: Work Hard, Party Harder

At KMC, we work hard and we are committed to putting our best foot forward in everything we do. Everyone is encouraged to be an individual while also working for the collective good of the KMC Community. We believe mistakes are opportunities and that you should not present a solution without a problem.

We also know when hard work deserves to be recognized, so we reward our employees with monthly parties, free trips, and much, much more!
