Man Group

Site Reliability Engineer

    Tech Stack / Requirements

    About Man Group

    Man Group is a global alternative investment management firm focused on pursuing outperformance for sophisticated clients via our Systematic, Discretionary and Solutions offerings. Powered by talent and advanced technology, our single and multi-manager investment strategies are underpinned by deep research and span public and private markets, across all major asset classes, with a significant focus on alternatives. Man Group takes a partnership approach to working with clients, establishing deep connections and creating tailored solutions to meet their investment goals and those of the millions of retirees and savers they represent.

    Headquartered in London, we manage $213.9 billion* and operate across multiple offices globally. Man Group plc is listed on the London Stock Exchange under the ticker EMG.LN and is a constituent of the FTSE 250 Index. Further information can be found at www.man.com

    * As at 30 September 2025

    Purpose of the Role

    Join our high-performing Site Reliability Engineering (SRE) team and play a pivotal role in ensuring the reliability, scalability, and performance of the technology powering Man Group’s hedge funds. You’ll have the autonomy, tools, and support to innovate and shape the future of our platform. This is an opportunity to work on cutting-edge projects, gain mentorship from senior leaders, and develop a deep understanding of both technology and the business.

    As an SRE, you’ll take ownership of service reliability and deliver solutions that make a real impact. Your initial focus will include using AI to accelerate incident diagnosis and resolution, and improving observability, capacity planning, and automation. Over time, you’ll work across our entire infrastructure stack, operating at scale and driving continuous improvement.

     

    Specific responsibilities

    • Ensure reliability and performance of critical systems across global infrastructure through proactive monitoring and rapid incident response.
    • Design and implement observability solutions using tools like Prometheus, Grafana, ELK, and Loki to provide deep insights into system health.
    • Automate operational tasks and build self-service capabilities to eliminate toil and improve efficiency.
    • Develop and maintain SLIs, SLOs, and error budgets to guide reliability improvements and inform engineering priorities.
    • Participate in incident response and blameless post-mortems, and implement preventive measures to reduce recurrence.
    • Collaborate with development teams to improve system design, deployment practices, and operational excellence.
    • Operate at scale, managing petabyte-level storage, large CPU/GPU deployments, and high-throughput distributed systems.
    • Contribute to capacity planning and performance tuning, ensuring systems meet business demands.
    • Manage multiple ELK clusters hosting hundreds of terabytes of logs, telemetry, and APM data.

     

    Key competencies

    • Strong understanding of SRE principles, including SLIs, SLOs, error budgets, and reliability best practices.
    • Hands-on experience with observability and monitoring tools (Prometheus, Grafana, ELK, Loki, or similar).
    • Proficiency with automation tools (Ansible, Terraform) and scripting/programming languages (Python, Go, PowerShell).
    • Strong troubleshooting and debugging skills across distributed systems, with the ability to diagnose complex production issues under pressure.
    • Experience with incident management, on-call rotations, and post-incident reviews.
    • Familiarity with Kubernetes and container orchestration.
    • A proactive mindset and ability to take ownership of reliability initiatives.

     

    Advantageous

    • Experience with CI/CD pipelines and source control workflows (Git, Jenkins, TeamCity).
    • Administration of Linux and Windows systems and exposure to cloud technologies (AWS/Azure).
    • Understanding of networking concepts, load balancing, and distributed architectures.
    • Knowledge of AI/LLM concepts (context windows, prompt tuning, MCP servers).
    • Interest in FinOps principles and a desire to understand the true cost of our decisions.
    • Excellent communication and collaboration skills.

     

    Benefits

    • Modern office located in the OfficeX campus with easy access to transport and amenities.
    • Hybrid working model
    • Competitive compensation package
    • 25 days holiday allowance
    • Premium health insurance
    • Employee assistance program
    • Referral bonus
    • Additional days off for long service and volunteering
    • Multisport card
    • Opportunities for professional development, including internal tech talks, conference attendance, and engagement with the open-source community.