About us:
Innervate is a leading software company specializing in developing dynamic customer experience systems for enterprise users. We create advanced software solutions that empower marketers to provide personalized, intelligent experiences to their audiences across any channel.
About the job:
Innervate is hiring a DevOps Engineer with strong expertise in Linux, virtualization, and hardware, who is excited about managing applications in a highly available environment as part of our Site Reliability Engineering (SRE) team. This role offers the chance to work with talented professionals in a fast-paced, growing environment, allowing you to sharpen your observability and automation skills while ensuring the reliability of a critical platform with 99.99% uptime, and contributing to modernization initiatives.
The SRE team manages all hardware and software within both production and SDLC environments. As a DevOps Infrastructure Engineer, you will play an essential role within the SRE team, collaborating with infrastructure, networking, and DevOps specialists.
The Ideal Candidate
- Has proven experience in implementing automation and reducing manual effort in operational or administrative tasks that are traditionally labor-intensive.
- Possesses a strong technical background and can effectively collaborate with engineers on the intricacies of complex systems, while also being able to step back and understand the broader context.
- Thrives on tackling challenges that require creative solutions, leveraging both legacy and modern technologies. In other words, this role involves maintaining and improving legacy systems while exploring new tools and technologies to enhance resilience, performance, ease of management, and monitoring.
- Enjoys collaborating with a globally distributed team of equally skilled professionals and is comfortable building strong relationships with teammates located across different time zones.
- Believes no service or appliance is truly production-ready until it has the necessary metrics and alerts in place to ensure performance and reliability.
Key Duties
- Administration - Participate in maintenance and operations of our production environment, including patching, deployment, server administration, and troubleshooting, either manually or using configuration-as-code tooling.
- Reliability & Performance - Ensure reliability, availability, and performance of services. Respond to incidents and resolve them before they become customer-impacting.
- Projects - Deliver complex solutions that traverse all layers of the technology stack: Operating System, Virtualisation, Network, Storage & Cloud.
- Collaboration - Work closely with teammates and with the software and security teams to rapidly meet customer, business, and compliance needs.
- Automation - Drive the automation of operational tasks, and ensure our infrastructure is more like cattle than pets.
- Observability - Develop and maintain internal and commercial or OSS tools to improve system health, performance, and deployment.
- Continuous Improvement - Drive never-ending improvement in SRE processes, tools, and methodologies. Take a leading role in blameless post-mortems to avoid repeat issues or mistakes and clearly document all lessons learned for others. If you love writing actionable documentation, we’d love to set up an interview.
This is a fully remote role!
Essential Skills & Experience
- 4+ years of experience in roles such as Site Reliability Engineering, DevOps, System Administration, or similar positions.
- Knowledge of Nginx, HAProxy, or Envoy for load balancing, reverse proxying, and traffic management in large-scale environments.
- Extensive experience with bare-metal facilities. This is a hybrid environment, so candidates who have only worked in public cloud environments would not be a strong fit.
- Hands-on experience with configuration management tools like Chef, Ansible, or similar, for system deployment and configuration.
- Strong expertise in Linux systems (any distribution is acceptable, but we prefer Debian-based systems).
- Experience with Kubernetes, LXC, or Docker for managing large-scale production environments.
- Proficiency in scripting or programming languages such as Python, Go, or Bash.
- Several years of experience responding to and proactively managing alerts for infrastructure, applications, and networks using tools like Grafana, Nagios, or custom toolchains.
- Strong problem-solving skills and experience working in high-availability production environments (99.95% uptime or greater) with demanding performance requirements.
- Extensive experience managing databases, including MySQL, PostgreSQL, and Cassandra.
- Expertise in managing message brokers such as Kafka and RabbitMQ.
- Understanding of networking fundamentals (TCP/IP, DNS, VPNs, firewalls, load balancing) to troubleshoot, optimize, and secure infrastructure.
Preferred Skills And Experience
- Proficiency in Git for version control, and familiarity with GitOps practices for automating infrastructure and application deployments.
- Expertise in ArgoCD or Flux for continuous delivery of Kubernetes applications using GitOps workflows.
- Experience with Helm for packaging and managing Kubernetes applications efficiently.
- Knowledge of Linux kernel tuning and optimization for high-performance and low-latency applications.
- Exposure to cloud platforms such as AWS, Google Cloud, or Azure, with a focus on hybrid cloud integration.
- Understanding of Java resource consumption and performance optimization, including memory management, garbage collection, and profiling tools to effectively work with Java applications in a production environment.
Languages:
- Upper-intermediate Russian
- Upper-intermediate English
Key Skills
- Kubernetes
- Python
- ELK
- Linux
- HTTP
- TCP/IP
Vacancy posted on 18 February 2025 in Baku