Hi all, I have never touched any tools for local inference and barely know anything about the landscape. Additionally, the only hardware I have available is an 8C/16T Zen 3 CPU and 48GB of RAM. I have many years' experience running Linux as a daily driver and doing small-network sysadmin work.
I am well aware this is extreme challenge mode, but it’s what I have to work with for now, and my main goal is more to do with learning the ecosystem than with getting highly usable results.
I decided for various reasons that my first project would be to get a model which I can feed an image, and have it output a caption.
If I have to quantize a model to make it fit into my available RAM then I am willing to learn that too.
I am looking for basic pointers on where to get started, such as “read this guide,” “watch this video,” or “look into this software package.”
I am not looking for solutions which involve using an API where inference happens on a machine which is not my own.
Hey there ThorrJo, welcome to our community.
I recommend kobold.cpp as your first inference engine, as it's very easy to get running, especially on Linux. Since you have no GPU, you don't need to worry about CUDA or Vulkan for offloading.
https://github.com/LostRuins/koboldcpp/
Read the kobold wiki section on vision model projection. For the image recognition model itself, I recommend NVIDIA's Cosmos-Reason1 finetune of Qwen2.5-VL. Make sure to load the qwen2.5vl mmproj projector (linked from the kobold repo) alongside the model.
https://github.com/LostRuins/koboldcpp/wiki#what-is-llava-and-mmproj
https://huggingface.co/koboldcpp/mmproj/tree/main
https://huggingface.co/mradermacher/Cosmos-Reason1-7B-i1-GGUF
The GGUFs I linked are already pre-quantized. You should be able to load the biggest quant available, plus the f16 mmproj, in your 48GB of RAM easily, with lots of room left over for context allocation.
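If you'd rather script the downloads than click through the browser, here's a minimal sketch using huggingface_hub. The exact quant filenames below are assumptions on my part; check the Files tab of each repo for the real names.

```python
# Minimal sketch: fetch a pre-quantized GGUF plus the mmproj projector.
# The filenames are guesses at the repos' naming conventions -- verify
# them on each Hugging Face repo's "Files" tab before running.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="mradermacher/Cosmos-Reason1-7B-i1-GGUF",
    filename="Cosmos-Reason1-7B.i1-Q6_K.gguf",  # hypothetical quant name
)
mmproj_path = hf_hub_download(
    repo_id="koboldcpp/mmproj",
    filename="qwen2.5-vl-7b-mmproj-f16.gguf",  # hypothetical filename
)
print(model_path, mmproj_path)
```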
Allocate as much context as you can; larger, higher-resolution images take more input context to process.
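Once you have both files, launching is one command. A minimal sketch, assuming you run KoboldCpp from its source checkout; the flag names are my recollection of its CLI, so verify them against --help, and the paths are hypothetical placeholders:

```python
# Minimal sketch: launch KoboldCpp headless on CPU with a vision model loaded.
# Flag names are assumptions from memory -- check `python koboldcpp.py --help`.
import subprocess

subprocess.run([
    "python", "koboldcpp.py",
    "--model", "/models/Cosmos-Reason1-7B.i1-Q6_K.gguf",  # hypothetical path
    "--mmproj", "/models/qwen2.5-vl-7b-mmproj-f16.gguf",  # hypothetical path
    "--contextsize", "16384",  # as large as your 48GB of RAM comfortably allows
    "--threads", "8",          # one per physical core on an 8C/16T CPU
])
```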
For troubleshooting: if its replies are wonky, try changing the chat template first (I forget if it's ChatML or something else). You can try adjusting the sampler settings too.
Kobold.CPP runs a web interface you can connect to through the browser from multiple devices. It also exposes its backend through an OpenAI-compatible API, so you can write your own custom apps to send and receive requests, or pair kobold with other frontend software that's compatible with corporate APIs, like tinychat, if you want to go further.
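To give you an idea of what a custom app looks like, here's a minimal captioning sketch against that OpenAI-compatible endpoint. I'm assuming KoboldCpp's default port (5001) and the standard OpenAI vision message format; adjust the host/port and image filename to your setup.

```python
# Minimal sketch: ask a locally running KoboldCpp instance to caption an image
# via its OpenAI-compatible chat completions endpoint. Port 5001 is KoboldCpp's
# default; change it if you launched with a different one.
import base64
import requests

with open("photo.jpg", "rb") as f:  # any local image file
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    "http://localhost:5001/v1/chat/completions",
    json={
        "max_tokens": 200,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    },
    timeout=300,  # CPU-only inference can be slow, so be generous here
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

The same request works from any OpenAI-compatible client library if you point its base URL at http://localhost:5001/v1.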
If you have any specific questions or need help feel free to reach out :)
Chiming in to say this is a very reasonable starting place, and I wanted to highlight to OP that this solution is 100% self-hosted.
I'm a beginner myself, and while I do have a GPU (unsure how much that speeds things up), I have found Qwen3-Coder to be almost a cheat code when problem-solving the various issues that would otherwise have me searching different forums for hours.
Thank you so much for this detailed starting point!! ❤️