Generative AI Infrastructure Engineer

Permanent contract - Full-time

Job Description

We are looking for a technical professional to support the development and evolution of the generative infrastructure of Nexum products. The person will work on local and cloud inference, open-weight model deployment, runtime management, caching, batching, containerization, observability and performance optimization.

Tech Stack: Python, Docker, Kubernetes, vLLM, Ollama, llama.cpp, MLX, PostgreSQL, Redis, OpenTelemetry, Llama, Qwen, Mistral, Gemma, DeepSeek, Phi.

Key Responsibilities:
- Manage and optimize LLM model deployment in production environments
- Configure inference runtimes (vLLM, Ollama, llama.cpp, MLX)
- Implement caching, batching and load balancing strategies
- Containerize and orchestrate AI services with Docker and Kubernetes
- Monitor performance with OpenTelemetry and observability tools
- Optimize GPU/CPU resource usage for inference

This job posting is addressed to both genders, in compliance with laws 903/77 and 125/91 on equal treatment in the workplace and against gender discrimination. We welcome candidates of all ages and nationalities, in accordance with legislative decrees 215/03 and 216/03. Nexum also encourages applications from people with disabilities, in compliance with current regulations.

Requirements

AI Infrastructure
Experience with inference runtimes such as vLLM, Ollama, llama.cpp and MLX
Containers & Orchestration
Docker, Kubernetes, containerization and model deployment
Observability
OpenTelemetry, monitoring, logging and performance optimization
Database & Caching
PostgreSQL, Redis, caching and batching strategies

Benefits

Professional Growth
"Continuous Improvement" program for technical and professional empowerment
Innovative Infrastructure
Work on the frontier of AI inference and model deployment
Competitive Compensation
Salary aligned with current IT sector standards
Flexible Work
Hybrid or full-remote options available