Qwen 3.5 9B for Open Source LLM and VLM
Run Qwen 3.5 9B in ComfyUI as a text-only LLM or as a vision language model. Attach an image or a video, write your prompt, and get text back.
image to text
llm
open source
qwen
text generation
vlm
Nodes & Models
CLIPLoader
qwen3.5_9b_bf16.safetensors
PrimitiveStringMultiline
LoadImage
TextGenerate
PreviewAny
Run Qwen 3.5 9B inside ComfyUI. Use it as a text-only LLM, or turn on the image or video input and use it as a vision language model.
Write a prompt and get text back. Attach an image and the model can describe what's in it, answer questions about it, or read text out of it. Attach a video and it can describe the action, summarize what happens across the clip, or answer questions about specific moments.
Output is plain text. Use it on its own, or feed the result into another step in your workflow.
Image or video (optional): Off by default. Want pure LLM mode? Leave it bypassed. Want to analyze a picture? Enable the image input and upload your file. Want to describe or summarize a clip? Enable the video input and drop in a short video. Same model, three modes.
Load the Qwen 3.5 9B model, write your prompt, and run. The image and video inputs are optional and bypassed by default. Skip them for code, writing, or any text-only task. Enable the image input for vision Q&A, or the video input to describe what's happening in a clip. Adjust temperature and sampling if you want a different output style.
Prompt: Write what you want. "Write Python code that..." for code. "Describe this image in detail" with an image attached for captions. "What's happening in this video?" with a clip attached for video summaries. "What's in this picture?" for visual Q&A. The example shipped with the workflow asks for Java calculator code, so swap it out for whatever you need.
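If you would rather swap the prompt from a script than in the editor, here is a minimal sketch. It assumes you have exported the workflow in ComfyUI's API format (a dict of node id to `class_type` and `inputs`) and that the PrimitiveStringMultiline node keeps its text under an input named `value`. The two-node fragment and the `source` input name are hypothetical stand-ins, not the shipped workflow. The function only builds the JSON body that ComfyUI's `POST /prompt` endpoint accepts and does not send anything.

```python
import copy
import json


def build_prompt_request(workflow: dict, prompt_text: str) -> bytes:
    """Return a JSON body for POST /prompt with the prompt text swapped in."""
    patched = copy.deepcopy(workflow)  # leave the caller's workflow untouched
    hits = 0
    for node in patched.values():
        if node.get("class_type") == "PrimitiveStringMultiline":
            # Assumption: the multiline string node stores its text under "value".
            node.setdefault("inputs", {})["value"] = prompt_text
            hits += 1
    if hits == 0:
        raise ValueError("no PrimitiveStringMultiline node found in workflow")
    return json.dumps({"prompt": patched}).encode("utf-8")


# Hypothetical two-node fragment standing in for the exported workflow.
example_workflow = {
    "1": {"class_type": "PrimitiveStringMultiline", "inputs": {"value": "old prompt"}},
    "2": {"class_type": "PreviewAny", "inputs": {"source": ["1", 0]}},
}
body = build_prompt_request(example_workflow, "Describe this image in detail")
```

The resulting bytes can be posted to a running ComfyUI server's `/prompt` route with any HTTP client. Check the node ids and input names against your own export first.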
What is Qwen 3.5 9B good for in ComfyUI?
The vision mode is the more interesting half. Caption a folder of training images for a LoRA. Read text out of a screenshot. Describe a reference photo and use the result as the prompt for an image model later in the chain. Feed it a short clip and get a written description of the action, the scene, or specific moments. Useful for video tagging, draft captions, content review, or pulling prompts out of existing footage.
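For the LoRA captioning case, one way to store the results is a `.txt` file next to each image, a layout many LoRA trainers read. The sketch below assumes that convention. `caption_fn` is a hypothetical callable standing in for however you get text out of the workflow (for example, a wrapper around a queued ComfyUI run); it is not part of ComfyUI.

```python
from pathlib import Path
from typing import Callable

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def write_caption_sidecars(folder: Path, caption_fn: Callable[[Path], str]) -> list[Path]:
    """Write one .txt caption per image in folder and return the files written."""
    written = []
    for image_path in sorted(folder.iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip non-image files, including existing captions
        # caption_fn is hypothetical: it should return the model's text for this image.
        caption = caption_fn(image_path).strip()
        sidecar = image_path.with_suffix(".txt")  # photo.png -> photo.txt
        sidecar.write_text(caption + "\n", encoding="utf-8")
        written.append(sidecar)
    return written
```

Existing sidecar files are overwritten, so run it on a copy of the folder if you have hand-edited captions you want to keep.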
FAQ
What's the difference between LLM and VLM mode in Qwen 3.5 9B?
In LLM mode the model only sees text: you write a prompt, it writes a response. VLM mode adds image or video input, so the same Qwen 3.5 9B can see what's in a picture or clip and answer questions about it, describe it, or read text from it. Switch to VLM mode by enabling the LoadImage or video node.
Can Qwen 3.5 9B describe a video?
Yes. Enable the video input, upload a short clip, and prompt with something like "Describe what's happening in this video" or "Summarize the main action." The model samples frames and reads them as a sequence. Good for tagging clips, drafting captions, or pulling prompts out of reference footage.
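To give a feel for what "samples frames" can mean, here is a sketch of evenly spaced sampling, which picks the middle frame of each equal-width segment of the clip. This is an illustrative assumption only. The page does not say which sampling strategy the workflow or the model uses, and `sample_frame_indices` is a hypothetical helper, not a ComfyUI function.

```python
def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick up to num_samples frame indices spread evenly across a clip."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= total_frames:
        return list(range(total_frames))  # short clip: keep every frame
    step = total_frames / num_samples
    # Take the midpoint of each of the num_samples equal-width segments.
    return [int(step * i + step / 2) for i in range(num_samples)]


indices = sample_frame_indices(100, 4)  # [12, 37, 62, 87]
```

Under a scheme like this, a longer clip leaves wider gaps between sampled frames, which is one reason a short clip tends to be described in more detail.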

