
SRE C2C jobs
Contract
Location: Onsite in Buffalo, NY
Key Responsibilities
Reliability & Operations
Define, measure, and enforce SLOs, SLAs, and error budgets across all critical services.
Own incident management end-to-end: detection, triage, escalation, mitigation, and post-mortems.
Build and maintain runbooks, playbooks, and on-call rotation schedules.
Conduct blameless post-incident reviews and drive systemic improvements to prevent recurrence.
Java Development & Automation
Design and maintain Java-based microservices, reliability tooling, and automation frameworks.
Collaborate with development teams to embed SRE best practices throughout the SDLC (code reviews, CI/CD gates).
Implement chaos engineering experiments to proactively surface weaknesses in production systems.
Develop self-healing automation scripts and toil-reduction tooling in Java and Python/Shell.
Observability — Kibana & Dynatrace
Design and own the full observability strategy: structured logging, distributed tracing, and real-time metrics.
Build, maintain, and optimize Kibana dashboards, index patterns, and Elasticsearch pipelines for log analytics.
Configure and manage Dynatrace monitoring — including OneAgent deployment, Davis AI problem detection, service flows, and synthetic monitoring.
Create alerting rules and anomaly detection policies in Dynatrace aligned to SLO thresholds.
Correlate signals across Kibana and Dynatrace to enable rapid root cause analysis during incidents.
Infrastructure & Platform
Manage and improve CI/CD pipelines (Jenkins, GitHub Actions, or similar) for reliability-focused deployments.
Support Kubernetes/Docker-based infrastructure; optimize resource utilization and autoscaling policies.
Collaborate with Security and Compliance on hardening production environments and managing vulnerabilities.
Drive capacity planning, load testing, and performance benchmarking initiatives.
Required Skills & Qualifications
Core Technical Requirements
10+ years of hands-on SRE, DevOps, or backend engineering experience in production environments.
Strong Java development proficiency — Spring Boot, microservices architecture, REST APIs, JVM tuning.
Demonstrable expertise with Kibana — dashboard creation, KQL queries, index lifecycle management (ILM), Beats/Logstash pipelines.
Hands-on Dynatrace experience — OneAgent, Smartscape topology, Davis AI, SLO configuration, Synthetic monitoring, DQL/USQL.
Solid understanding of distributed systems, fault tolerance patterns, and the CAP theorem.
Proficiency in at least one scripting language: Python, Bash, or Groovy.
Experience with containerization (Docker) and orchestration (Kubernetes / OpenShift).
Familiarity with cloud platforms — AWS, GCP, or Azure — and IaC tools such as Terraform or Ansible.
SRE / Operations Skills
Experience defining and managing SLIs, SLOs, SLAs, and error budgets.
Proficiency with incident management workflows and tools (PagerDuty, OpsGenie, or similar).
Knowledge of capacity planning, load testing tools (JMeter, Gatling, k6), and performance optimization.
Strong understanding of networking fundamentals: DNS, TCP/IP, HTTP/HTTPS, TLS, load balancing.
To apply for this job, email your details to aditya@hgtechinc.net