Image & Video Understanding API | Vision AI Analysis

Overview

LaoZhang API provides powerful image and video understanding capabilities through multiple advanced AI models. Using the unified OpenAI API format, you can easily implement image recognition, scene description, OCR text extraction, video content analysis, and more.

🔍 Intelligent Visual Analysis: Supports object recognition, scene understanding, text extraction, sentiment analysis, video content understanding, and other vision tasks, letting AI truly “see” and understand images and videos.

🌟 Key Features

🎯 Multiple Models: GPT-5, Gemini 2.5 Pro/Flash and other top vision models
📸 Flexible Input: Support for URL links and Base64-encoded images and videos
🎬 Video Understanding: Gemini series supports video content analysis (up to several minutes)
⚡ Fast Response: High-performance inference with sub-second results
💰 Cost Control: Multiple model options to fit different budgets
🔄 OpenAI Compatible: Drop-in replacement for OpenAI vision APIs

📋 Supported Vision Models

Model Name	Model ID	Image	Video	Features
GPT-5 ⭐	`gpt-5`	✅	❌	Latest model, highly detailed image recognition
Gemini 2.5 Pro ⭐	`gemini-2.5-pro`	✅	✅	Ultra-long context, supports video analysis
Gemini 2.5 Flash ⭐	`gemini-2.5-flash`	✅	✅	Extremely fast, best value, supports video
GPT-4.1 Mini	`gpt-4.1-mini`	✅	❌	Lightweight, fast, low cost
Claude 3.5 Sonnet	`claude-3-5-sonnet`	✅	❌	Deep understanding, accurate descriptions

💡 Video Analysis: Currently only Gemini series models support video content understanding. GPT-5, GPT-4.1 Mini, and Claude 3.5 Sonnet support image analysis only.

🚀 Quick Start

1. Basic Example - Image URL

import requests

url = "https://api.yelinai.com/v1/chat/completions"
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
}

payload = {
    "model": "gpt-4o",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Please describe this image in detail"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image.jpg"
                    }
                }
            ]
        }
    ]
}

response = requests.post(url, headers=headers, json=payload)
result = response.json()
print(result['choices'][0]['message']['content'])

2. Local Image Example - Base64 Encoding

import base64
import requests

def image_to_base64(image_path):
    """Convert local image to base64 encoding"""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

# Read local image
base64_image = image_to_base64("path/to/your/image.jpg")

url = "https://api.yelinai.com/v1/chat/completions"
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
}

payload = {
    "model": "gemini-2.5-pro",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Extract all text content from this image"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64_image}"
                    }
                }
            ]
        }
    ]
}

response = requests.post(url, headers=headers, json=payload)
print(response.json()['choices'][0]['message']['content'])
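The example above hard-codes the `data:image/jpeg` prefix. If you also send PNG, GIF, or WebP files, it is safer for the prefix to match the actual format. A minimal sketch that picks the MIME type from the file extension; the helper names `to_data_url` and `image_to_data_url` and the extension table are illustrative assumptions, not part of the API.

```python
import base64
from pathlib import Path

# Illustrative extension-to-MIME table for the image formats this page lists
# (JPEG, PNG, GIF, WebP); extend it if you need other formats.
IMAGE_MIME_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
    ".webp": "image/webp",
}


def to_data_url(data: bytes, filename: str) -> str:
    """Base64-encode raw image bytes into a data URL, choosing the MIME type by extension."""
    suffix = Path(filename).suffix.lower()
    mime_type = IMAGE_MIME_TYPES.get(suffix)
    if mime_type is None:
        raise ValueError(f"Unsupported image extension: {suffix!r}")
    encoded = base64.b64encode(data).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


def image_to_data_url(image_path: str) -> str:
    """Read a local image file and return it as a data URL."""
    with open(image_path, "rb") as image_file:
        return to_data_url(image_file.read(), image_path)
```

The returned string, e.g. from `image_to_data_url("path/to/scan.png")`, goes directly into the `"url"` field of an `image_url` content part.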

3. Advanced Example - Multi-Image Comparison

import requests

url = "https://api.yelinai.com/v1/chat/completions"
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
}

payload = {
    "model": "gpt-4o",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Compare the differences between these two images:"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/image1.jpg"}
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/image2.jpg"}
                }
            ]
        }
    ],
    "max_tokens": 1000
}

response = requests.post(url, headers=headers, json=payload)
print(response.json()['choices'][0]['message']['content'])

🎬 Video Content Analysis

Supported Video Models

Currently only Gemini series models support video analysis:

gemini-2.5-pro - Detailed and accurate, recommended for complex video analysis
gemini-2.5-flash - Fast and cost-effective, suitable for batch processing

1. Basic Video Analysis - URL Method

import requests

url = "https://api.yelinai.com/v1/chat/completions"
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
}

payload = {
    "model": "gemini-2.5-pro",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Please describe the content of this video in detail"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/video.mp4"
                    }
                }
            ]
        }
    ],
    "max_tokens": 1000
}

response = requests.post(url, headers=headers, json=payload)
print(response.json()['choices'][0]['message']['content'])

2. Using OpenAI SDK

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.yelinai.com/v1"
)

response = client.chat.completions.create(
    model="gemini-2.5-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Analyze this video content, including scenes, people, and actions"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/video.mp4"
                    }
                }
            ]
        }
    ],
    max_tokens=1000
)

print(response.choices[0].message.content)

3. cURL Command Example

curl -X POST "https://api.yelinai.com/v1/chat/completions" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-2.5-pro",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe the content of this video"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://example.com/video.mp4"
            }
          }
        ]
      }
    ],
    "max_tokens": 1000
  }'

4. Video + Image Mixed Analysis

Gemini supports analyzing videos and images in the same request:

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.yelinai.com/v1"
)

response = client.chat.completions.create(
    model="gemini-2.5-pro",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Compare the differences between this video and this image"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/video.mp4"}
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/image.jpg"}
                }
            ]
        }
    ],
    max_tokens=1500
)

print(response.choices[0].message.content)

5. Local Video Analysis - Base64 Encoding

For local video files, you can use Base64 encoding to upload:

import base64
from openai import OpenAI

def video_to_base64(video_path):
    """Convert local video to base64 encoding"""
    with open(video_path, "rb") as video_file:
        return base64.b64encode(video_file.read()).decode('utf-8')

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.yelinai.com/v1"
)

# Encode local video
base64_video = video_to_base64("path/to/your/video.mp4")

response = client.chat.completions.create(
    model="gemini-2.5-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Please describe the content of this video in detail"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:video/mp4;base64,{base64_video}"
                    }
                }
            ]
        }
    ],
    max_tokens=1000
)

print(response.choices[0].message.content)

MIME Type Reference: Different video formats require corresponding MIME types:

MP4: data:video/mp4;base64,...
WebM: data:video/webm;base64,...
MOV: data:video/quicktime;base64,...
AVI: data:video/x-msvideo;base64,...
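The reference list above can be turned into a small lookup so the data URL prefix always matches the file. This is a sketch that assumes the file extension reliably reflects the container format; `VIDEO_MIME_TYPES`, `video_mime_type`, and `video_to_data_url` are hypothetical helper names.

```python
import base64
from pathlib import Path

# MIME types from the reference list above, keyed by file extension
VIDEO_MIME_TYPES = {
    ".mp4": "video/mp4",
    ".webm": "video/webm",
    ".mov": "video/quicktime",
    ".avi": "video/x-msvideo",
}


def video_mime_type(filename: str) -> str:
    """Return the MIME type for a video file based on its extension."""
    suffix = Path(filename).suffix.lower()
    try:
        return VIDEO_MIME_TYPES[suffix]
    except KeyError:
        raise ValueError(f"No MIME type configured for {suffix!r}") from None


def video_to_data_url(video_path: str) -> str:
    """Read a local video and return a data URL with the matching MIME type."""
    mime_type = video_mime_type(video_path)  # fail fast before reading the file
    with open(video_path, "rb") as video_file:
        encoded = base64.b64encode(video_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```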

💡 Best Practice: Base64 encoding increases data size by approximately 33%. For large video files (> 10MB), prefer URL method. For small videos (< 5MB), Base64 is more convenient.
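The 33% figure comes from Base64 emitting 4 output characters for every 3 input bytes, rounded up to a whole group: an n-byte file becomes 4 × ceil(n / 3) bytes. The sketch below applies that arithmetic to the thresholds in the tip. It assumes "MB" means 1024 × 1024 bytes, and because the tip gives no recommendation between 5 and 10 MB, that range is returned as `"either"`; the function names are hypothetical.

```python
MB = 1024 * 1024  # assumption: "MB" in the tip means mebibytes


def base64_size(num_bytes: int) -> int:
    """Length of the padded Base64 encoding of num_bytes bytes: 4 * ceil(n / 3)."""
    return 4 * ((num_bytes + 2) // 3)


def suggest_upload_method(num_bytes: int) -> str:
    """Apply the tip's thresholds: Base64 under 5 MB, URL over 10 MB."""
    if num_bytes < 5 * MB:
        return "base64"
    if num_bytes > 10 * MB:
        return "url"
    return "either"  # the tip does not cover the 5 to 10 MB range
```

For example, a 9 MB video becomes exactly 12 MB of Base64 text (`base64_size(9 * MB) == 12 * MB`), before any JSON overhead in the request body.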

Video Analysis Best Practices

File Size: Recommended maximum of 20 MB per video; larger files may take longer to process
Video Formats: Supports MP4, WebM, MOV, AVI and other mainstream formats
Video Duration: Short videos (< 5 minutes) work best - consider splitting longer videos
Resolution: Higher resolution videos provide better recognition but take longer to process
Prompt Optimization: Be specific about what to analyze (e.g., “analyze character movements”, “extract dialogue”)
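For the advice to split longer videos, one common approach is FFmpeg's segment muxer, which cuts a file into fixed-length pieces without re-encoding. The sketch below only builds the command; run it with `subprocess.run(cmd, check=True)` if FFmpeg is installed. The 5-minute default mirrors the duration guideline above, and the function name and output naming pattern are illustrative assumptions. Because `-c copy` can only cut at keyframes, segment lengths are approximate.

```python
from pathlib import Path


def build_split_command(video_path: str, segment_seconds: int = 300) -> list[str]:
    """Build an FFmpeg command that splits a video into ~segment_seconds chunks.

    Uses stream copy (-c copy), so nothing is re-encoded; cuts land on
    keyframes, so actual segment lengths may differ slightly.
    """
    src = Path(video_path)
    # e.g. talk.mp4 -> talk_part000.mp4, talk_part001.mp4, ...
    output_pattern = str(src.with_name(f"{src.stem}_part%03d{src.suffix}"))
    return [
        "ffmpeg",
        "-i", str(src),
        "-c", "copy",
        "-map", "0",
        "-f", "segment",
        "-segment_time", str(segment_seconds),
        "-reset_timestamps", "1",
        output_pattern,
    ]
```

Each resulting segment can then be analyzed in its own request and the answers combined afterwards.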

Video Analysis Use Cases

📹 Content Moderation: Automatically identify inappropriate content in videos
🎓 Educational Video Analysis: Extract key points and subtitles
🛡️ Surveillance Video Understanding: Detect abnormal behaviors and identify events
🎬 Ad Creative Analysis: Evaluate creative elements and emotional impact
📊 Sports Analysis: Recognize athlete movements and key moments in games

⚠️ Important Notes:

Video processing typically takes longer than image processing, depending on video length and complexity
Use max_tokens parameter to limit output length and avoid excessive costs
Be mindful of data security when processing privacy-sensitive video content
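Assuming the gateway returns OpenAI-style responses, you can tell when `max_tokens` cut an answer short: the choice's `finish_reason` is `"length"` rather than `"stop"`, and the optional `usage` object reports token counts for cost tracking. `read_completion` is a hypothetical helper that works on the parsed JSON (`response.json()`).

```python
def read_completion(response_json):
    """Return (text, truncated, usage) from a parsed chat-completions response.

    truncated is True when generation stopped because max_tokens was reached.
    """
    choice = response_json["choices"][0]
    text = choice["message"].get("content") or ""
    truncated = choice.get("finish_reason") == "length"
    # Token counts (prompt_tokens, completion_tokens, total_tokens), if reported
    usage = response_json.get("usage") or {}
    return text, truncated, usage
```

If `truncated` is true, either raise `max_tokens` or ask the model for a more concise answer.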

🎯 Common Use Cases

1. Product Recognition & Analysis

prompt = """
Analyze this product image, including:
1. Product type and brand
2. Key features and selling points
3. Target audience
4. Suggested marketing copy
"""

2. Document OCR Recognition

prompt = """
Extract all text content from the image and organize it in the original format.
If there are tables, present them in Markdown table format.
"""

3. Medical Imaging Assistance

prompt = """
This is a medical image, please:
1. Describe basic information (imaging type, body part, etc.)
2. Identify visible anatomical structures
3. Note: For reference only, not for diagnostic purposes
"""

4. Security Monitoring Scene Analysis

prompt = """
Analyze the surveillance footage, identifying:
1. Number and location of people in the scene
2. Any abnormal behaviors
3. Environmental safety hazards
4. Timestamp information (if visible)
"""

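Any of these prompts can be dropped into the request format used throughout this page. The helper below is a hypothetical convenience wrapper (not part of the API) that assembles the payload from a prompt and one or more image or video URLs; the default model and `max_tokens` value are placeholders to adjust.

```python
def build_vision_payload(prompt, media_urls, model="gemini-2.5-flash", max_tokens=1000):
    """Build a chat-completions payload with one text part followed by media parts.

    Images and videos both use the "image_url" content type, as in the
    examples on this page; media_urls may be http(s) URLs or data URLs.
    """
    if isinstance(media_urls, str):
        media_urls = [media_urls]  # accept a single URL as well as a list
    content = [{"type": "text", "text": prompt}]
    for media_url in media_urls:
        content.append({"type": "image_url", "image_url": {"url": media_url}})
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }
```

For example, `requests.post(url, headers=headers, json=build_vision_payload(prompt, ["https://example.com/receipt.jpg"]))` sends the OCR prompt above with a single image.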
💡 Best Practices

Image Preprocessing Guidelines

Format Support: JPEG, PNG, GIF, WebP and other mainstream formats
Size Limit: Recommended maximum of 20 MB per image
Resolution: Higher resolution images get better recognition results
Compression: Moderate compression improves transmission speed
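These guidelines can be checked before a request is sent. A minimal sketch, assuming the format is judged by file extension and that the 20 MB recommendation means 20 × 1024 × 1024 bytes; `check_image` is a hypothetical helper, the extension set covers only the formats named above, and the server may enforce different limits.

```python
import os
from pathlib import Path

# Formats named in the guidelines; add other mainstream formats as needed
SUPPORTED_IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}
MAX_IMAGE_BYTES = 20 * 1024 * 1024  # recommended maximum from the guidelines


def check_image(filename: str, size_bytes: int) -> list[str]:
    """Return a list of guideline violations (empty if the image looks fine)."""
    problems = []
    suffix = Path(filename).suffix.lower()
    if suffix not in SUPPORTED_IMAGE_EXTENSIONS:
        problems.append(f"unrecognized image format {suffix!r}")
    if size_bytes > MAX_IMAGE_BYTES:
        problems.append(f"{size_bytes / (1024 * 1024):.1f} MB exceeds the recommended 20 MB")
    return problems


def check_image_file(path: str) -> list[str]:
    """Convenience wrapper that reads the size of a local file."""
    return check_image(path, os.path.getsize(path))
```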

Prompt Optimization

# ❌ Not Recommended: Vague prompts
prompt = "What is this"

# ✅ Recommended: Specific and clear prompts
prompt = """
Analyze this image from the following aspects:
1. Main objects: Identify primary objects or people
2. Scene environment: Describe location and environmental features
3. Color composition: Analyze color scheme and composition
4. Emotional atmosphere: Mood or atmosphere conveyed
5. Potential uses: Suitable scenarios for this image
"""

Error Handling

import time

import requests
from requests.exceptions import RequestException

API_KEY = "YOUR_API_KEY"

def analyze_image_with_retry(image_url, prompt, max_retries=3):
    """Image analysis function with retries and exponential backoff"""
    for attempt in range(max_retries):
        try:
            response = requests.post(
                "https://api.yelinai.com/v1/chat/completions",
                headers={
                    "Authorization": f"Bearer {API_KEY}",
                    "Content-Type": "application/json"
                },
                json={
                    "model": "gpt-4o",
                    "messages": [{
                        "role": "user",
                        "content": [
                            {"type": "text", "text": prompt},
                            {"type": "image_url", "image_url": {"url": image_url}}
                        ]
                    }]
                },
                timeout=30
            )

            if response.status_code == 200:
                return response.json()
            elif response.status_code == 429 or response.status_code >= 500:
                # Rate limit or server error: worth retrying
                print(f"Retryable error {response.status_code} (attempt {attempt + 1}/{max_retries})")
            else:
                # Other client errors (e.g. invalid key or request) will not succeed on retry
                print(f"Error: {response.status_code} - {response.text}")
                return None

        except RequestException as e:
            # Network errors and timeouts are also retried
            print(f"Request exception: {e} (attempt {attempt + 1}/{max_retries})")

        # Exponential backoff before the next attempt: 1s, 2s, 4s, ...
        if attempt < max_retries - 1:
            time.sleep(2 ** attempt)

    return None

🔧 Advanced Features

1. Streaming Output

For long analyses, use streaming for better user experience:

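import json
import requests

payload = {
    "model": "gpt-4o",
    "messages": messages,  # a messages list like those in the examples above
    "stream": True
}

response = requests.post(url, headers=headers, json=payload, stream=True)
for line in response.iter_lines():
    # Streamed chunks arrive as server-sent events: b"data: {...}"
    if not line.startswith(b"data:"):
        continue
    data = line[len(b"data:"):].strip()
    if data == b"[DONE]":
        break
    choices = json.loads(data).get("choices") or []
    if choices and (choices[0].get("delta") or {}).get("content"):
        print(choices[0]["delta"]["content"], end="", flush=True)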

2. Multi-turn Dialogue

Maintain context for in-depth analysis:

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What animal is this?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/animal.jpg"}}
        ]
    },
    {
        "role": "assistant",
        "content": "This is a Golden Retriever."
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": "How old does it look? What's its health condition?"}]
    }
]
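Assuming the usual chat-completions behavior, where the server keeps no conversation state, context is maintained by resending the whole history on every turn. The sketch below appends the next question, sends the full list, and records the model's reply; `ask` is a hypothetical helper, and `client` is an `OpenAI` client configured as in the SDK examples above.

```python
def ask(client, messages, question, model="gemini-2.5-pro"):
    """Append a follow-up question, send the full history, and record the reply.

    `messages` is modified in place so later turns keep the earlier context,
    including the image sent in the first turn.
    """
    messages.append({"role": "user", "content": [{"type": "text", "text": question}]})
    response = client.chat.completions.create(model=model, messages=messages)
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    return reply
```

Because the image is resent with every turn, long conversations about large images or videos also resend (and are billed for) that input each time.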

3. Function Calling Integration

tools = [
    {
        "type": "function",
        "function": {
            "name": "save_image_analysis",
            "description": "Save image analysis results to database",
            "parameters": {
                "type": "object",
                "properties": {
                    "objects": {"type": "array", "items": {"type": "string"}},
                    "scene": {"type": "string"},
                    "text_content": {"type": "string"}
                }
            }
        }
    }
]

payload = {
    "model": "gpt-4o",
    "messages": messages,
    "tools": tools,
    "tool_choice": "auto"
}
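When the model decides to call the function, an OpenAI-style response carries the call in `choices[0].message.tool_calls`, with `function.arguments` as a JSON-encoded string. A minimal sketch of extracting it from the parsed response (`response.json()`), assuming this gateway mirrors OpenAI's response shape; `extract_tool_calls` is a hypothetical helper, and writing the results to a database is left to your own code.

```python
import json


def extract_tool_calls(response_json):
    """Return (function_name, arguments_dict) pairs from a chat-completions response."""
    message = response_json["choices"][0]["message"]
    calls = []
    # tool_calls is absent (or null) when the model answered in plain text
    for tool_call in message.get("tool_calls") or []:
        function = tool_call["function"]
        # arguments arrive as a JSON string, not as an object
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls
```

A caller would loop over the result and dispatch on the name, e.g. handling `"save_image_analysis"` by storing the `objects`, `scene`, and `text_content` fields.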

📊 Performance Comparison

Model	Image Support	Video Support	Response Speed	Recognition Accuracy	Price
GPT-5	✅	❌	⭐⭐⭐⭐	⭐⭐⭐⭐⭐	$$$
Gemini 2.5 Pro	✅	✅	⭐⭐⭐⭐	⭐⭐⭐⭐⭐	$$
Gemini 2.5 Flash	✅	✅	⭐⭐⭐⭐⭐	⭐⭐⭐⭐	$
GPT-4.1 Mini	✅	❌	⭐⭐⭐⭐⭐	⭐⭐⭐	$

🚨 Important Notes

Privacy Protection: Don’t upload images or videos containing sensitive information
Compliance: Follow relevant laws and regulations, avoid illegal uses
Result Verification: AI analysis results are for reference only - verify for critical decisions
Cost Control: Choose appropriate models to avoid unnecessary expenses
Video Limitations: Video analysis is currently supported only by the Gemini series; other models do not support it yet

🔗 Related Resources

Chat Completions API - Learn more about the Chat API
Pricing Information - View model pricing details

💡 Tip: Start with Gemini 2.5 Flash or GPT-4.1 Mini for testing, then upgrade to premium models for production deployment.
