
Microsoft's Phi-4 Multimodal – Best Small Model Out There?
Walkthrough and thorough test of the recently released Phi-4 multimodal model (5.6B parameters).
Code here – https://github.com/designingwithml/blogposts/blob/main/notebooks/modelevals/phi4/phi-4-multimodal.ipynb
0:00 Introduction
01:21 Setup
02:30 Text Conversation
03:53 Image Description
05:42 OCR
06:42 Tool Calling
07:32 Audio…
Reviews
Got it up and running on my somewhat outdated local PC. It was a bit of a struggle, but the model is impressive, and I'll check it out further. Thank you very much for your video, the creative use cases, and the snippets to run them. Subscribed.
You’ve got a great personality for this type of content.
Can't wait to see a Phi-4-mini video too! Great job!
Short story: the model's great at text generation (e.g., "summarize x"), multimodal understanding ("what does the author talk about in this audio file, and how is it related to the image provided?"), audio transcription ("give me a verbatim transcription of this audio file"), OCR ("give me ALL the text in this image as a tidy markdown file"), and function calling.
If you are doing any of this and would like a small/local model (e.g., for latency, privacy, or compliance reasons), definitely try Phi-4 multimodal.
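For anyone trying these use cases, here is a minimal sketch of the single-turn prompt layout Phi-4 multimodal expects, with numbered media placeholders ahead of the text. The special tokens (`<|user|>`, `<|end|>`, `<|assistant|>`, `<|image_1|>`, `<|audio_1|>`) are taken from memory of the model card, so verify them against the official examples or the notebook linked above before relying on this:

```python
def build_phi4_prompt(user_text: str, n_images: int = 0, n_audios: int = 0) -> str:
    """Assemble a single-turn prompt string in the Phi-4-multimodal style.

    Numbered media placeholders (<|image_1|>, <|audio_1|>, ...) come before
    the user text; the assistant tag at the end cues the model to respond.
    Token names here are assumptions -- check them against the model card.
    """
    media = "".join(f"<|image_{i}|>" for i in range(1, n_images + 1))
    media += "".join(f"<|audio_{i}|>" for i in range(1, n_audios + 1))
    return f"<|user|>{media}{user_text}<|end|><|assistant|>"

# Example: an OCR-style request over one image
prompt = build_phi4_prompt(
    "Give me ALL the text in this image as a tidy markdown file.", n_images=1
)
```

The resulting string would then be passed to the model's processor along with the actual image/audio data; the placeholder indices tell the model which attachment each reference points to.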
Why is nobody talking about Qwen 2.5 VL? It's quite impressive, even the 3B-parameter model.
Not bad