Dutech’s Job
Senior Site Reliability Engineer (SRE) – Cloud & Distributed Systems
Austin, TX
DatePosted : 4/1/2026 3:21:48 PM
JobNumber : DTS1017187676
JobType : Contract
Skills: SRE, DevOps, AWS, GCP, Kubernetes, Docker, Python, Go, Linux, Distributed Systems, Monitoring, Logging, SLIs, SLOs, CI/CD, Observability
Job Description
We are seeking an experienced Senior Site Reliability Engineer (SRE) to design, build, and operate highly scalable and reliable cloud-based systems. The ideal candidate will have a strong background in DevOps, distributed systems, and cloud infrastructure, with a focus on automation, observability, and system reliability.
This role involves working in a fast-paced environment to ensure system uptime, performance, and operational excellence.
Key Responsibilities:
- Design, implement, and manage highly available, distributed systems
- Maintain and optimize cloud infrastructure (AWS/GCP)
- Develop automation scripts using Python, Go, Java, or Bash
- Manage containerized environments using Docker and Kubernetes
- Define and monitor SLIs, SLOs, and error budgets
- Implement monitoring, logging, and alerting solutions
- Lead incident management, root cause analysis (RCA), and postmortems
- Ensure system security and compliance within operational workflows
- Improve system reliability through performance tuning and optimization
- Collaborate with engineering teams to enhance deployment and release processes
- Create and maintain runbooks, dashboards, and operational documentation
Required Qualifications:
- 8+ years of experience in SRE, DevOps, or Systems Engineering
- Strong expertise in Linux/Unix systems and system internals
- Proficiency in at least one programming/scripting language (Python, Go, Java, Bash)
- Experience designing and operating distributed systems
- Hands-on experience with cloud platforms (AWS or GCP)
- Experience with Docker and Kubernetes
- Strong understanding of monitoring, alerting, and logging concepts
- Experience managing SLIs, SLOs, and error budgets
- Experience with incident management and RCA processes
Preferred Qualifications:
- Experience with observability tools (Prometheus, Grafana, Datadog, Splunk, Application Insights)
- Experience supporting 24x7 production environments and on-call rotations
- Knowledge of chaos engineering and resiliency testing
- Experience with canary deployments, feature flags, and progressive delivery
- Strong documentation and communication skills