Booking NL
Site Reliability Engineer (For independent contractors)
The core premise of SRE is to treat operations as a software problem, where operations are concerned with the availability, scalability, latency, and efficiency of Booking.com’s systems & services. At its core, the SRE is tasked with engineering efforts to solve complex problems, which requires a strong aptitude for developing software systems that minimize human labor through automation and increase system & service reliability.
A Booking Reliability Engineering team has full vertical ownership of a system, from the server configuration up to the application interfaces. This enables the team to have full control of a service, avoiding situations where different teams own different areas of a system, causing some parts to fall between the cracks. SREs can wear several hats; at times an SRE is embedded in a product development team, and at other times acts as a consultant who supports a product development team in implementing Booking Reliability Engineering best practices.
As systems & services grow in size and complexity, so too does the operational overhead. It is a fundamental principle of SRE to break this relationship between operational toil, system size, and complexity. This also requires the team to limit operations work, enforcing engineering development efforts that are at the heart of Booking Reliability Engineering. Ultimately, fundamental software engineering skills coupled with strong systems and networking knowledge will guide the SRE to create more reliable systems & services that are highly available, scale with growth, and are efficient and latency-sensitive.
Requirements:
In-depth knowledge, understanding, and experience (minimum 3 years) of Apache Kafka administration.
Strong software engineering skills with the ability to write robust code.
Decent experience with Java.
Decent experience with Kubernetes (Docker, Helm, Argo).
Experience building and using monitoring components for distributed systems.
Experience building and maintaining distributed multi-tenant systems.
Oriented towards automating tasks and working closely with the team.
Expected to participate in operational shifts during the day (reacting to outages) and to provide customer support (ticket work).
Proven problem-solving capabilities in complex distributed environments.
Understanding of Confluent Platform and Confluent Cloud is a plus.
Experience with databases is a plus.
Bachelor's degree.
Key Responsibilities:
Building Software Applications:
Responsible for building software applications using relevant development languages and applying knowledge of systems, services, and tools appropriate for the business area.
Responsible for writing readable and reusable code by applying standard patterns and using standard libraries.
Responsible for refactoring and simplifying code by introducing design patterns when necessary.
Responsible for ensuring the quality of the application by following standard testing techniques and methods that adhere to the test strategy.
Responsible for maintaining data security, integrity, and quality by effectively following company standards and best practices.
End-to-End System Ownership:
Responsible for owning a service end-to-end by actively monitoring application health and performance, setting and monitoring relevant metrics, and acting accordingly when their thresholds are violated.
Responsible for reducing business continuity risks and bus-factor risk by applying state-of-the-art practices and tools, and writing appropriate documentation such as runbooks and OpDocs.
Responsible for reducing risk and obtaining customer feedback by using continuous delivery and experimentation frameworks.
Responsible for independently managing an application or service by working through deployment and operations in production.
Software Systems Design:
Possesses sufficient knowledge to evaluate possible architectural solutions by taking into account cost, business requirements, technology requirements, and emerging technologies.
Possesses sufficient knowledge to describe the implications of changing an existing system or adding a new system to a specific area, by having a broad, high-level understanding of the infrastructure and architecture of our systems.
Possesses sufficient knowledge to help grow the business and/or accelerate software development by applying engineering techniques (e.g., prototyping, spiking, and vendor evaluation) and standards.
Possesses sufficient knowledge to meet business needs by designing solutions that meet current requirements and are adaptable for future enhancements.
Technical Incident Management:
Responsible for addressing and resolving live production issues by mitigating customer impact within the SLA.
Responsible for improving the overall reliability of systems by producing long-term solutions through root cause analysis.
Responsible for keeping track of incidents by contributing to postmortem processes and logging live issues.
Automation and Toil Reduction:
Responsible for ensuring that infrastructure stays current by reducing technical debt, searching for bottlenecks, and preparing for scaling.
Responsible for reducing the cost of operations and maintenance by leveraging new technologies and automation, and by partnering with vendors to ensure we stay current.
Responsible for reducing human labor by writing small software features that address availability, scalability, latency, and efficiency.
Monitoring and Alerting Improvements:
Responsible for reviewing and verifying the performance of production systems and network infrastructure by continuously monitoring appropriate observability metrics and business KPIs, and by planning capacity.
Responsible for improving application reliability by partnering with development teams to advise on setting appropriate observability metrics.
Architectural Guidance:
Possesses basic knowledge to advise product teams towards a technical solution that meets the functional, non-functional, and architectural requirements by challenging the rationale for an application design and providing context in the wider architectural landscape.
Possesses basic knowledge to set a clear direction for a technical capability by evaluating and aligning target architecture improvements, and by reframing architectural designs and decisions for varied stakeholders.
Critical Thinking:
Responsible for systematically identifying patterns and underlying issues in complex situations, and for finding solutions by applying logical and analytical thinking.
Responsible for constructively evaluating and developing ideas, plans, and solutions by reviewing them, objectively taking into account external knowledge, initiating 'SMART' improvements, and articulating their rationale.
Continuous Quality and Process Improvement:
Responsible for identifying opportunities for process, system, and structural improvements (e.g., performance gains) by examining and evaluating current process flows, methods, and standards.
Responsible for designing and implementing relevant improvements by defining adapted/new process flows, standards, and practices that enable business performance.
Effective Communication:
Responsible for delivering clear, well-structured, and meaningful information to a target audience by using suitable communication mediums and language tailored to the audience.
Responsible for achieving mutually agreeable solutions by staying adaptable, communicating ideas in clear, coherent language, and practicing active listening.
Responsible for asking relevant (follow-up) questions to properly engage with the speaker and truly understand what they are saying, by applying listening and reflection techniques.
Responsible for technical implementation and maintenance of data processing services and storage systems in line with the Data Governance Framework.