بررسی لاما 3.2 ویژن | به سمت هوش مصنوعی

نویسنده(های): برهومی مصبه

در ابتدا منتشر شد به سمت هوش مصنوعی.

اولاما چند وجهی

Ollama به تازگی پشتیبانی رسمی خود را از مدل های Llama 3.2 Vision اعلام کرده است. مدل های Llama 3.2 Vision در دو اندازه عرضه می شوند: 11 میلیارد و 90 میلیارد پارامتر. در این مقاله مروری بر عملکرد آنها در موارد مختلف به همراه نظرات شخصی خود در مورد آنها خواهم داشت.

مدل‌های تنظیم‌شده با دستورالعمل Llama 3.2-Vision برای v بهینه‌سازی شده‌اندتشخیص عینی، استدلال تصویر، شرح و پاسخ به سوالات کلی در مورد یک تصویر. مدل‌ها از بسیاری از مدل‌های چندوجهی منبع باز و بسته موجود در معیارهای رایج صنعت بهتر عمل می‌کنند.

همچنین این مدل از بسیاری از زبان ها پشتیبانی می کند:

برای کارهای فقط متنی، انگلیسی، آلمانی، فرانسوی، ایتالیایی، پرتغالی، هندی، اسپانیایی و تایلندی رسما پشتیبانی می شود. Llama 3.2 بر روی مجموعه گسترده تری از زبان ها نسبت به این 8 زبان پشتیبانی شده آموزش دیده است. توجه داشته باشید برای برنامه های تصویر + متن، انگلیسی تنها زبان پشتیبانی شده است.

ارزیابی

من از رویکرد این مقاله برای ارزیابی استفاده خواهم کرد، که در واقع استدلال می کند که مدل های زبان بینایی “کور” هستند. این مدل ها را در چندین کار آزمایش می کند:

VLM ها کور هستند

تحقیقات نشان می‌دهد که مدل‌های زبان بینایی (VLM) در کارهای ساده بصری که برای انسان آسان است شکست می‌خورند.

vlmsareblind.github.io

وظیفه 1: تقاطع خطوط
وظیفه 2: دو دایره
وظیفه 3: نامه دایره ای
وظیفه 4: همپوشانی اشکال
وظیفه 5: مربع های تو در تو
وظیفه 6: شبکه شمارش
وظیفه 7: نقشه مترو

من چالش‌برانگیزترین بخش‌ها را از هر کار انتخاب می‌کنم و ببینم آیا مدل می‌تواند هر یک از آنها را درست انجام دهد یا خیر. علاوه بر این، من کد را حدود 10 تا 20 بار اجرا می کنم تا میانگین پاسخ های دقیق را محاسبه کنم و نتایج نادرست را شناسایی کنم.

وظیفه 1: تقاطع خطوط

درخواست ها:

خطوط آبی و قرمز چند بار یکدیگر را لمس می کنند؟ با یک عدد در پرانتز، به عنوان مثال، {5} پاسخ دهید.”
«نقاط تقاطعی را که خطوط آبی و قرمز به هم می‌رسند بشمار. پاسخ خود را در پرانتز قرار دهید، به عنوان مثال، {2}.”

The blue and red lines intersect at 2 points. Therefore, they touch each 
other twice. {2}

The blue and red lines touch each other at three points. Therefore, 
the answer is {3}.

میزان موفقیت > 60%

وظیفه 2: دو دایره

درخواست ها:

«آیا این دو دایره یکدیگر را لمس می کنند؟ با بله/خیر پاسخ دهید.»
«آیا این دو دایره با هم همپوشانی دارند؟ با بله/خیر پاسخ دهید.»

**Step 1: Identify the Circles**The image contains a blue circle and a red circle.
**Step 2: Determine if the Circles are Touching Each Other**
To determine if the circles are touching each other, we need to check if they intersect at exactly one point or not. If they do not intersect at all, then they are not touching. If they intersect at more than one point, then they are overlapping but not necessarily touching.
**Step 3: Check for Intersection Points**
Upon examining the image closely, it appears that the two circles intersect at exactly one point. This means that the circles are indeed touching each other.
**Conclusion:**
Yes

The image depicts a diagram with two circles, one red and one blue. The red circle is positioned on the left side of the diagram, while the blue circle is located on the right side.To determine whether the two circles are touching each other, we need to analyze their positions and sizes. Upon closer inspection, it becomes apparent that the two circles do not overlap or touch at any point. Instead, they appear to be separated by a small gap between them.
Therefore, based on this analysis, **No**, the two circles are not touching each other.

میزان موفقیت > 80%

وظیفه 3: حرف دایره شده

درخواست ها:

“کدام حرف در حال دور زدن است؟”
“کدام شخصیت با بیضی قرمز برجسته می شود؟”

**Answer:** The letter "n" is highlighted in red and encircled.

میزان موفقیت: 100%

شمارش اشکال همپوشانی

درخواست ها:

«چند {شکل} در تصویر وجود دارد؟ فقط با عدد به صورت عددی پاسخ دهید.”
«شکل‌های موجود در تصویر را بشمارید. با یک عدد در پرانتز به عنوان مثال {3} پاسخ دهید.”

مدل با 6، 7 یا 8 پاسخ داد اما هرگز عدد صحیح را که 9 است، نداد. و حدس بزنید چه چیزی؟ حتی GPT-4 هم نمی تواند این کار را انجام دهد!

میزان موفقیت: 0%

دنبال کردن مسیرهای تک رنگ

درخواست ها:

«چند مسیر تک رنگ از A به C می رود؟ با یک عدد در داخل پرانتز پاسخ دهید، به عنوان مثال، {3}”
«مسیرهای تک رنگی را که از A به C می‌روند بشمارید. با یک عدد در کروشه‌های فرفری پاسخ دهید، مثلاً {3}».

To find the number of single-colored paths from point A to point C, we need to count the distinct paths that only use one color throughout their entire length.

**Step 1: Identify the colors used on the graph**

Looking at the graph provided, we can see that there are four different colors used:

* Red
* Blue
* Green
* Purple

However, upon closer inspection, it becomes apparent that not all these colors are utilized in creating paths from A to C. We focus only on those colors that lead directly or indirectly from A to C without interruption.

**Step 2: Determine the relevant paths**

Upon examining the graph closely for paths using a single color from A to C:

* There is one path going directly from A to B (using red) and then continuing in another segment.
* Another path leads from A to D and then to C, but this involves two colors.

Given this analysis, only one path utilizes a single color continuously from A to C.

**Conclusion**

Therefore, the number of single-colored paths from point A to point C is

{1}.

Answer: {1}.

To find the number of single-colored paths from point A to point C, we need to analyze the given graph and count the distinct paths that meet this criterion.**Step 1: Identify Single-Colored Paths**
Single-colored paths are those where all edges belong to a single color. In this context, a path is considered single-colored if it consists entirely of red edges.
**Step 2: Count Distinct Red Edges from A to C**
From point A to point C, there are two distinct red edges that form a direct path:
- One edge connects A directly to C.
- Another set of edges forms a longer path through intermediate nodes but ultimately leads back to the same conclusion regarding single-color paths.
**Step 3: Consideration of Alternative Paths**
Upon closer inspection, it becomes apparent that while there are multiple paths from A to C, only two of these paths consist entirely of red edges. This is because any path involving more than one color would not be considered "single-colored."
**Conclusion**
Given the analysis above, the correct answer reflects the count of single-colored paths (all red) from point A to point C.
Answer: {2}

میزان موفقیت > 60%

هر دو GPT-4o و کلود 3.5 Sonnet در این کار شکست خوردند.

بعد از مدتی بازی با این مدل، تفاوت قابل توجهی نمی بینم. می توان آن را مقایسه کرد GPT-4o و غزل-3. با این حال، هنوز به خوبی نیست Gemini-1.5 Pro، که من معتقدم ممکن است بهترین VLM در حال حاضر باشد. به خاطر داشته باشید که این فقط نسخه 11B این مدل است. اگر منابع کافی برای آزمایش نسخه 90B دارید، کافی است در کد زیر عبارت 11B را با 90B جایگزین کنید.

خودتان آن را امتحان کنید

اوللاما را نصب کنید

!curl -fsSL https://ollama.com/install.sh | sh

شروع کن اولاما

import os
get_ipython().system = os.system
!ollama serve &

مدل را بکشید

!ollama pull llama3.2-vision

آن را امتحان کنید

import ollama
def extract_document_info(image_path):
response = ollama.chat(
model='llama3.2-vision',
messages=[{
'role': 'user',
'content': "your question about the image"
),
'images': [image_path]
}]
)
return response
#example
image_path ="/kaggle/input/image.jpg"
result = extract_document_info(image_path)
print(result["message"]["content"])

همین! این یک بررسی سریع از LLaMA 3.2 Vision (نسخه 11B) بود. در واقع در OCR نیز خوب است، من آن را روی برخی از تصاویر واقعا چالش برانگیز آزمایش کردم. البته نسخه 90B بهتر است، اما برای اجرا به منابع قابل توجهی نیاز دارد. وقتی منبع خوبی پیدا کردم، آن را پیوست می کنم!

lama3.2-vision:90b

Llama 3.2 Vision مجموعه ای از مدل های مولد استدلال تصویری تنظیم شده با دستورالعمل در اندازه های 11B و 90B است.

olama.com

VLM ها کور هستند

تحقیقات نشان می‌دهد که مدل‌های زبان بینایی (VLM) در کارهای ساده بصری که برای انسان آسان است شکست می‌خورند.

vlmsareblind.github.io

منتشر شده از طریق به سمت هوش مصنوعی

منبع: https://towardsai.net/p/l/llama-3-2-vision-review