Booking NL
Senior Software Engineer (for independent contractors)
Senior Software Engineer I for the ML Production team
About the role
As a Senior Software Engineer I in the ML Production (RS) team, you will design, build and operate the core backend services that power Booking.com's ML inference platform, which is used by the entire company. You’ll work mainly with JVM-based services, on-prem Kubernetes and AWS EKS, and a Graphite/Grafana-based observability stack, ensuring our platform is reliable, efficient and easy for other teams to integrate with.
You’ll collaborate closely with ML engineers, scientists and other teams to:
build and evolve high‑throughput, low‑latency ML model serving services,
modernize our infrastructure towards a cloud‑native and hybrid setup,
improve the developer experience, reliability and performance of ML serving across Booking.com.
Key responsibilities
Design, implement and operate scalable, low‑latency backend services in Scala/Java and other JVM-based languages.
Profile and optimize CPU and memory usage of services; run benchmarks, load tests and capacity experiments to keep latency and cost under control.
Build and maintain distributed systems and APIs used for online and offline predictions, including async/batch flows and client libraries.
Develop and run services on Kubernetes (AWS EKS/BKS), including containerization, deployment pipelines, autoscaling and safe rollout strategies (canaries, staged rollouts, rollbacks).
Own and improve cron‑based and scheduled jobs (e.g. housekeeping, maintenance, batch processing, data migrations) running in Kubernetes / cloud environments.
Implement robust observability (Graphite metrics, logs, alerts and Grafana dashboards) for core services and critical platform components.
Contribute to engineering best practices: automated testing, CI/CD pipelines, code reviews, documentation and design reviews.
Participate in on‑call and incident response, drive root‑cause analysis and implement long‑term reliability and resilience improvements.
Work closely with ML practitioners and product teams to understand their requirements and translate them into robust, easy‑to‑use platform capabilities.
Contribute to technical design docs, runbooks and standards, and share knowledge through reviews, mentoring and internal talks.
Required qualifications
Solid professional experience (typically 5+ years) as a Senior Software / Backend Engineer building and operating production services.
Strong programming skills in Java and/or Scala (or another JVM language) and good knowledge of concurrent programming and performance tuning.
Proven experience with distributed systems (e.g. microservices, RPC, caching) and designing for reliability and scalability.
Good understanding of CPU/memory constraints, profiling, and performance benchmarking of server-side applications.
Hands-on experience running services on Kubernetes (preferably AWS EKS or similar managed K8s): containerization, deployments, rollbacks, autoscaling.
Experience writing and maintaining cron jobs / scheduled workloads (e.g. for batch work, housekeeping, data pipelines) in a production environment.
Practical experience with observability tooling, ideally Graphite for metrics and Grafana for dashboards and alerting.
Comfortable working in a Linux-based environment and with common cloud primitives (networking, load balancers, IAM, storage).
Strong communication skills and ability to collaborate with engineers, ML practitioners and product stakeholders.
A habit of continuously looking for opportunities to improve how we and our users work, from tooling and automation to processes and workflows, and of driving those improvements with conviction.
Nice to have
Experience building platform or infrastructure services used by other engineering teams (internal platforms, SDKs, libraries).
Exposure to ML serving or data-intensive systems (e.g. model APIs, feature services, streaming pipelines).
Experience with hybrid cloud setups or migrations between bare‑metal and cloud environments.
Familiarity with Spark or other distributed compute engines used for batch or async processing.
Background in Site Reliability Engineering or strong interest in reliability, capacity planning and incident management.