Dutech’s Job

Senior Site Reliability Engineer (SRE) – Cloud & Distributed Systems

Austin, TX

Date Posted: 4/1/2026 3:21:48 PM

Job Number: DTS1017187676
Job Type: Contract
Skills: SRE, DevOps, AWS, GCP, Kubernetes, Docker, Python, Go, Linux, Distributed Systems, Monitoring, Logging, SLIs, SLOs, CI/CD, Observability
Job Description

We are seeking an experienced Senior Site Reliability Engineer (SRE) to design, build, and operate highly scalable and reliable cloud-based systems. The ideal candidate will have a strong background in DevOps, distributed systems, and cloud infrastructure, with a focus on automation, observability, and system reliability.

This role involves working in a fast-paced environment to ensure system uptime, performance, and operational excellence.
Key Responsibilities:

  • Design, implement, and manage highly available, distributed systems
  • Maintain and optimize cloud infrastructure (AWS/GCP)
  • Develop automation scripts using Python, Go, Java, or Bash
  • Manage containerized environments using Docker and Kubernetes
  • Define and monitor SLIs, SLOs, and error budgets
  • Implement monitoring, logging, and alerting solutions
  • Lead incident management, root cause analysis (RCA), and postmortems
  • Ensure system security and compliance within operational workflows
  • Improve system reliability through performance tuning and optimization
  • Collaborate with engineering teams to enhance deployment and release processes
  • Create and maintain runbooks, dashboards, and operational documentation

Required Qualifications:

  • 8+ years of experience in SRE, DevOps, or Systems Engineering
  • Strong expertise in Linux/Unix systems and system internals
  • Proficiency in at least one programming/scripting language (Python, Go, Java, Bash)
  • Experience designing and operating distributed systems
  • Hands-on experience with cloud platforms (AWS or GCP)
  • Experience with Docker and Kubernetes
  • Strong understanding of monitoring, alerting, and logging concepts
  • Experience managing SLIs, SLOs, and error budgets
  • Experience with incident management and RCA processes

Preferred Qualifications:

  • Experience with observability tools (Prometheus, Grafana, Datadog, Splunk, Application Insights)
  • Experience supporting 24x7 production environments and on-call rotations
  • Knowledge of chaos engineering and resiliency testing
  • Experience with canary deployments, feature flags, and progressive delivery
  • Strong documentation and communication skills
