Booking NL
Site Reliability Engineer (For independent contractors)
A Senior Site Reliability Engineer I (Senior SRE I) is an expert in treating operations as a software problem. They focus on the reliability of systems and services: availability, performance, scalability, latency, observability, and efficiency. They maintain key components and develop systems that minimize human labor (through automation) and increase system reliability, with the end goal of breaking the relationship between system size, operational toil, and complexity. A Senior SRE I is responsible for the design, prioritization, and implementation of complex technical solutions. They can accurately estimate or forecast the effort and impact of the items they work on, and they show a high quality of craft in what they deliver. They are expected to lead incident response for issues affecting their team.
Key Responsibilities
Supporting the team's capabilities during office hours: Configuration Management, Secret Management, Certificate Management, and Runtime Configuration.
Providing support to internal clients.
Responding to alerts and troubleshooting incidents.
Automating repetitive operations and maintaining that automation.
Running periodic operations.
Common Responsibilities
End-to-End System Ownership
Responsible to own a service end to end by actively monitoring application health and performance, setting and monitoring relevant metrics, and acting accordingly when thresholds are breached.
Responsible to reduce business continuity risks and bus factor by applying state-of-the-art practices and tools, and writing the appropriate documentation such as runbooks and OpDocs.
Responsible to reduce risk and obtain customer feedback by using continuous delivery and experimentation frameworks.
Responsible to independently manage an application or service by working through deployment and operations in production.
Responsible to maintain data security, integrity, and quality by effectively following company standards and best practices.
Technical Incident Management
Responsible to address and resolve live production issues by mitigating the customer impact within SLA.
Responsible to improve the overall reliability of systems by producing long-term solutions through root cause analysis.
Responsible to keep track of incidents by contributing to postmortem processes and logging live issues.
Automation and Toil Reduction
Responsible to ensure that infrastructure stays current by reducing technical debt, searching for bottlenecks, and preparing for scaling.
Responsible to reduce the cost of operations and maintenance by leveraging new technologies and automation, and by partnering with vendors to ensure we stay current.
Responsible to reduce human labor by writing small software features that address availability, scalability, latency, and efficiency.
Monitoring and Alerting Improvements
Responsible to review and verify the performance of production systems and network infrastructure by continuously monitoring appropriate observability metrics and business KPIs, and by planning capacity.
Responsible to improve application reliability by partnering with development teams to advise on setting appropriate observability metrics.
Critical Thinking
Responsible to systematically identify patterns and underlying issues in complex situations, and to find solutions by applying logical and analytical thinking.
Responsible to constructively evaluate and develop ideas, plans, and solutions by reviewing them, objectively taking into account external knowledge, initiating 'SMART' improvements, and articulating their rationale.
Continuous Quality and Process Improvement
Responsible to identify opportunities for process, system, and structural improvements (e.g. performance gains) by examining and evaluating current process flows, methods, and standards.
Responsible to design and implement relevant improvements by defining adapted/new process flows, standards, and practices that enable business performance.
Effective Communication
Has sufficient knowledge to deliver clear, well-structured, and meaningful information to a target audience by using suitable communication mediums and language tailored to the audience.
Has sufficient knowledge to achieve mutually agreeable solutions by staying adaptable, communicating ideas in clear coherent language, and practicing active listening.
Has sufficient knowledge to ask relevant (follow-up) questions to properly engage with the speaker and really understand what they are saying, by applying listening and reflection techniques.
Building Software Applications
Responsible to build software applications by using relevant development languages and applying knowledge of systems, services, and tools appropriate for the business area, and to guide more junior members of the team in this topic.
Responsible to refactor and simplify code by introducing design patterns when necessary, and to guide more junior members of the team in this topic.
Responsible to ensure the quality of the application by following standard testing techniques and methods that adhere to the test strategy.
Responsible to write readable and reusable code by applying standard patterns and using standard libraries.
Responsible to maintain data security, integrity and quality by effectively following company standards and best practices.
Communication with
Internal Clients
Track Members
Product Stakeholders
Peers
Requirements of special knowledge/skills:
Advanced Knowledge (5 - 8 years)
Skills in troubleshooting complex, highly available systems
Development experience in Go and Python
Kubernetes, AWS and bare metal
Certificate management, Public Key Infrastructure
Technical knowledge of the following technologies would be helpful:
Vault
Puppet
PostgreSQL