Show HN: FlexLLama – Run multiple local LLMs at once with a simple dashboard

After playing around with local AI setups for a while, I kept getting annoyed at having to juggle a separate llama.cpp server for each model. Switching between them was a pain, and I always had to restart things just to load a new model. So I ended up building something to fix that: FlexLLama - https://github.com/yazon/flexllama

Basically, it's a tool that lets you run multiple llama.cpp instances easily, spread across CPU and GPUs if you've got them. Everything sits behind a single OpenAI-compatible API, so you can run chat models, embeddings, and rerankers all at once. Models assigned to the runners are reloaded on the fly, and there's a little web dashboard to monitor and manage the runners.

It's super easy to get started: just pip install from the repo, or grab the Docker image for a speedy setup.
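To make the "single OpenAI-compatible API" part concrete, here's a rough sketch of what a client can look like - the base URL/port and the model names are just placeholders standing in for whatever you set in your own FlexLLama config, not built-in defaults:

    # Rough sketch: talking to FlexLLama through its OpenAI-compatible endpoint.
    # The base_url, port, and model names below are placeholders - use whatever
    # you configured for your runners.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8080/v1",  # FlexLLama endpoint (placeholder port)
        api_key="not-needed-locally",         # local server, no real key required
    )

    # A chat model served by one runner
    chat = client.chat.completions.create(
        model="qwen2.5-7b-instruct",  # placeholder: the name given to the model in the config
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(chat.choices[0].message.content)

    # An embedding model served behind the same API at the same time
    emb = client.embeddings.create(
        model="nomic-embed-text",  # placeholder embedding model name
        input="FlexLLama routes this to the right runner.",
    )
    print(len(emb.data[0].embedding))

The point is that OpenWebUI, Roo Code, Cline, etc. only ever see one endpoint; FlexLLama routes each request to the runner that holds the requested model.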

I've been using it myself with things like OpenWebUI and some VS Code extensions (Roo Code, Cline, Continue.dev), and it works flawlessly.