TogetherAI has introduced Dragonfly, a new family of multimodal models, including Llama 3 8B Dragonfly, which outperforms existing models in medical image understanding. The Dragonfly architecture leverages multi-resolution vision encoding and zoomed-in patch selection in a zoom-and-select VLM design to enhance vision-language capabilities. It was trained on 5.5 million image-instruction samples, with an additional 1.4 million medical-domain samples used for fine-tuning. The model sets a new state-of-the-art (SOTA) in medical captioning, surpassing models such as LLaVA, Qwen-VL, and Med-Gemini. Additionally, it incorporates CLIP for improved performance.
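The multi-resolution idea in the summary above can be sketched numerically: the same image is encoded at several resolutions, and coarser views contribute far fewer patch tokens, which is what keeps the context budget manageable. The resolutions, patch size, and pooling below are illustrative placeholders, not Dragonfly's actual configuration.

```python
import numpy as np

def downsample(image, factor):
    # Average pooling as a simple stand-in for image resizing.
    h, w = image.shape
    return image[:h - h % factor, :w - w % factor] \
        .reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def num_patch_tokens(image, patch=14):
    # A ViT-style encoder emits one token per non-overlapping patch.
    h, w = image.shape
    return (h // patch) * (w // patch)

img = np.zeros((448, 448))           # stand-in for an input image
for factor in (1, 2, 4):             # full, half, quarter resolution
    view = downsample(img, factor)
    print(view.shape, num_patch_tokens(view))
# -> (448, 448) 1024 / (224, 224) 256 / (112, 112) 64
```

The quarter-resolution view costs 1/16 the tokens of the full view, so low-resolution global context is cheap while high-resolution detail is reserved for selected regions.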
AI Agents: Key Concepts and How They Overcome LLM Limitations https://t.co/jkyU7CZwEK @janakiramm #AIAgents #AI #LLM #Limitations
Very interesting paper - "Mixture-of-Agents (MoA) Enhances Large Language Model Capabilities": MoA using only open-source LLMs leads AlpacaEval 2.0 by a substantial gap, achieving a score of 65.1% compared to 57.5% for GPT-4 Omni. 🔥 📌 The paper introduces the… https://t.co/P09kddjZMt
Mixture-of-Agents Enhances Large Language Model Capabilities "In our approach, we construct a layered MoA architecture wherein each layer comprises multiple LLM agents. Each agent takes all the outputs from agents in the previous layer as auxiliary information in generating its… https://t.co/Vo5OvK7NwZ
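The layered MoA construction quoted above can be sketched as a small driver loop: each layer's agents all receive the original prompt plus every output from the previous layer as auxiliary context, and the final layer aggregates. The agents here are toy string transforms standing in for real LLM calls.

```python
def run_moa(prompt, layers):
    """Each layer is a list of agent callables. Every agent in layer i
    receives the prompt plus all outputs from layer i-1 as context."""
    previous_outputs = []
    for layer in layers:
        current_outputs = []
        for agent in layer:
            # Concatenate prior-layer outputs as auxiliary information.
            context = "\n".join(previous_outputs)
            current_outputs.append(agent(prompt, context))
        previous_outputs = current_outputs
    # By convention, the last layer holds a single aggregator agent.
    return previous_outputs[-1]

# Hypothetical toy agents standing in for LLM API calls.
agent_a = lambda p, ctx: f"A({p})"
agent_b = lambda p, ctx: f"B({p})"
aggregator = lambda p, ctx: f"final[{ctx.replace(chr(10), '+')}]"

print(run_moa("question", [[agent_a, agent_b], [aggregator]]))
# -> final[A(question)+B(question)]
```

In the actual paper the "agents" are prompted open-source LLMs and the aggregator synthesizes the candidate responses into one answer; the control flow, though, is just this layered fan-out and fan-in.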
Very exciting to see that the open-source Llama-3-70B-Instruct performs so well on the multimodal agentic benchmark Visual Web Arena (VWA), despite being a text-only model, even beating the multimodal Gemini-Flash-1.5 and coming very close to the performance of Gemini-Pro-1.5.… https://t.co/jFVq9g5gVn
NEW VLM: Llama 3 8B Dragonfly by TogetherAI 🐲 > Beats LLaVA, Qwen-VL, and Med-Gemini ⚡️ > Leverages multi-resolution vision encoding & zoom-in patch resolution > Trained on 5.5M image-instruction samples > Additional 1.4M medical domain samples for fine-tuned model > CLIP… https://t.co/FFIn3JhKZ3
Dragonfly is a family of new multimodal models from @togethercompute, including one that excels at medical image understanding. https://t.co/TjQmDVAI4a
A key to enhancing vision-language models is increasing image resolution, but how do you do this w/o blowing up the context? Excited to introduce Dragonfly, our new zoom-and-select VLM architecture!🔍 It encodes images at multiple resolutions and picks salient patches for the LM. Medical captioning SOTA🧵 https://t.co/gWdCfdT1Af
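The "pick salient patches" step above can be illustrated with a minimal sketch: split a high-resolution view into patches, score each one, and keep only the top-k so the LM sees a bounded number of visual tokens. Pixel variance is used here as a placeholder saliency score; the real model's selection mechanism is learned, not this heuristic.

```python
import numpy as np

def split_patches(image, patch):
    # Tile the image into non-overlapping patch x patch blocks.
    h, w = image.shape
    return [image[i:i + patch, j:j + patch]
            for i in range(0, h, patch)
            for j in range(0, w, patch)]

def zoom_and_select(image, patch=8, k=4):
    patches = split_patches(image, patch)
    # Placeholder saliency: patches with more pixel variance are "interesting".
    scores = [float(p.var()) for p in patches]
    top = sorted(range(len(patches)), key=lambda i: scores[i], reverse=True)[:k]
    return [patches[i] for i in sorted(top)]  # keep original spatial order

rng = np.random.default_rng(0)
img = rng.random((32, 32))           # stand-in for a high-res image crop
selected = zoom_and_select(img)
print(len(selected))                 # 4 patches kept instead of all 16
```

Only the selected patches would be passed to the vision encoder at full resolution, which is how the architecture raises effective resolution without a proportional growth in context length.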