Site Reliability Engineering (SRE) is what you get when you treat operations as if it’s a software problem. Our client’s mission is to progress, protect, and provide for the software and systems behind their services with an ever-watchful eye on their availability, latency, performance, and capacity. As an SRE, you will use your experience in software development to implement the platforms that allow our systems to run smoothly with minimal intervention.
Successful SREs Exhibit the Following:
- Obsessive desire to automate everything and build self-healing systems.
- Passion and skill in software development and an ability to apply software development concepts to the maintenance, observability, and scalability of production systems.
- Uncanny ability to quickly troubleshoot interdependent services and applications in distributed environments.
- Comfort in transitioning between languages and platforms.
- Willingness to communicate and eagerness to share ideas or lessons learned with others.
The SRE Team is Responsible for:
- Designing, writing, and delivering software to improve the availability, scalability, latency, and efficiency of the client’s services.
- Solving problems relating to mission-critical services and building automation to prevent their recurrence, with the goal of automating the response to all non-exceptional service conditions.
- Influencing and creating new designs, architectures, standards, and methods for large-scale distributed systems.
- Engaging in service capacity planning and demand forecasting, software performance analysis, and system tuning.
- Taking periodic on-call shifts to respond to production incidents.
- Performing other related duties as assigned by management.
The SRE Team Works with the Following Technologies:
- OS platforms: Windows Server, CentOS, Ubuntu
- Languages: Python, Bash, PowerShell, C#, Java, TypeScript, Go
- Workload management: Hyper-V, Docker, Rancher, Azure, Kubernetes, HAProxy
- CI/CD: Azure DevOps, Git
- Databases: MS SQL Server, MariaDB / MySQL, MongoDB
- Infrastructure as code: Terraform, Puppet, Ansible
- Monitoring: Splunk, Icinga
Our SREs Typically Already Have the Following Skills:
- Knowledge and understanding of network theory and the ability to design and troubleshoot network infrastructure.
- Understanding of Windows and Unix/Linux systems from kernel to shell and beyond, taking in system libraries, file systems, and client-server protocols along the way.
- Expertise in designing, analyzing, and troubleshooting large-scale distributed systems.
- Familiarity with running microservices at scale including containerization and orchestration.
- Systematic problem-solving approach, coupled with a strong sense of ownership and drive.
- Professional experience in backend software development with C# or Java.
- Production experience with Windows, Linux, scripted automation, Docker, HAProxy, and Puppet or Ansible.
- Experience with development and, more importantly, automation.
- Experience with Azure, DevOps, and Docker.
- Experience with Windows Server 2016/2019.
- Some Linux experience.