Gemini Embedding 2: Google’s First Multimodal Embedding Model

Google DeepMind has released Gemini Embedding 2 in public preview, its first fully multimodal embedding model built on the Gemini architecture. Unlike previous text-only models, it maps text, images, video, audio, and documents into a single unified embedding space, capturing semantic intent across more than 100 languages.
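To make the idea of a unified embedding space concrete, here is a minimal sketch of cross-modal ranking by cosine similarity. The 3-dimensional vectors are toy values invented for illustration, not real model output, and the item names and helper functions are hypothetical. The point is only that once every modality lands in the same space, one similarity function can compare a text query against images, audio, or documents.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query, corpus):
    """Return corpus keys sorted from most to least similar to the query."""
    return sorted(
        corpus,
        key=lambda name: cosine_similarity(query, corpus[name]),
        reverse=True,
    )

# Toy vectors standing in for embeddings of items from different modalities.
corpus = {
    "image:dog": [0.9, 0.1, 0.0],
    "audio:podcast": [0.1, 0.9, 0.2],
    "pdf:manual": [0.0, 0.2, 0.9],
}
# Toy vector standing in for the embedding of a text query.
query = [0.8, 0.2, 0.1]

ranking = rank(query, corpus)
print(ranking)
```

In a real pipeline the vectors would come from the embedding endpoint and the ranking would usually be delegated to a vector index rather than a full sort.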

Key Technical Details

The model is available through the Gemini API and Vertex AI, and supports these specific capabilities:

Text: Supports up to 8,192 input tokens of context
Images: Processes up to 6 images per request (PNG and JPEG formats)
Videos: Supports up to 120 seconds of video input (MP4 and MOV formats)
Audio: Natively ingests and embeds audio without needing text transcriptions
Documents: Directly embeds PDFs up to 6 pages long
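
The limits listed above can be checked on the client before a request is sent. The sketch below is one way to do that. `EmbedRequest` and `check_limits` are hypothetical names, not part of any Google SDK, and the constants simply restate the figures from the list. No audio check is included because the list gives no audio limit.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Limits restated from the list above; the constant names are illustrative.
MAX_TEXT_TOKENS = 8192
MAX_IMAGES = 6
MAX_VIDEO_SECONDS = 120
MAX_PDF_PAGES = 6
IMAGE_FORMATS = {"png", "jpeg"}
VIDEO_FORMATS = {"mp4", "mov"}

@dataclass
class EmbedRequest:
    """Hypothetical summary of one embedding request (not an SDK type)."""
    text_tokens: int = 0
    image_formats: List[str] = field(default_factory=list)
    video_seconds: float = 0.0
    video_format: Optional[str] = None
    pdf_pages: int = 0

def check_limits(req):
    """Return a list of limit violations; an empty list means the request fits."""
    problems = []
    if req.text_tokens > MAX_TEXT_TOKENS:
        problems.append(f"text has {req.text_tokens} tokens, limit is {MAX_TEXT_TOKENS}")
    if len(req.image_formats) > MAX_IMAGES:
        problems.append(f"{len(req.image_formats)} images, limit is {MAX_IMAGES}")
    unsupported = sorted({f.lower() for f in req.image_formats} - IMAGE_FORMATS)
    if unsupported:
        problems.append(f"unsupported image formats: {unsupported}")
    if req.video_seconds > MAX_VIDEO_SECONDS:
        problems.append(f"video is {req.video_seconds}s, limit is {MAX_VIDEO_SECONDS}s")
    if req.video_format is not None and req.video_format.lower() not in VIDEO_FORMATS:
        problems.append(f"unsupported video format: {req.video_format}")
    if req.pdf_pages > MAX_PDF_PAGES:
        problems.append(f"PDF has {req.pdf_pages} pages, limit is {MAX_PDF_PAGES}")
    return problems
```

For example, `check_limits(EmbedRequest(text_tokens=9000))` returns a single message about the token limit, which a caller could use to chunk the text before embedding.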

Beyond processing single modalities, the model natively understands interleaved input, allowing you to pass multiple modalities (e.g., image + text) in a single request to capture nuanced relationships between different media types.

Flexible Output Dimensions

Gemini Embedding 2 incorporates Matryoshka Representation Learning (MRL), which lets output vectors be scaled down from the default 3072 dimensions to smaller sizes. This lets developers balance embedding quality against storage costs. Google recommends using 3072, 1536, or 768 dimensions for the highest quality.
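
As a rough illustration of the trade-off, the sketch below truncates a vector to its first `dim` components and rescales it to unit length, then estimates raw storage. It assumes the common MRL practice of keeping the leading components and renormalizing before cosine or dot-product comparison, and it assumes 4-byte float32 values. The function names are hypothetical, and the API may offer its own way to request a smaller output size.

```python
import math

# Output sizes the article says Google recommends.
RECOMMENDED_DIMS = (3072, 1536, 768)

def truncate_embedding(vec, dim):
    """Keep the first `dim` components and rescale to unit L2 norm."""
    if not 0 < dim <= len(vec):
        raise ValueError(f"dim must be between 1 and {len(vec)}")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero prefix cannot be normalized
    return [x / norm for x in head]

def storage_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for `num_vectors` embeddings, ignoring index overhead."""
    return num_vectors * dim * bytes_per_value

if __name__ == "__main__":
    for dim in RECOMMENDED_DIMS:
        gigabytes = storage_bytes(1_000_000, dim) / 1e9
        print(f"{dim} dims: {gigabytes:.2f} GB per million vectors")
```

Under these assumptions, one million vectors take about 12.29 GB at 3072 dimensions and about 3.07 GB at 768, a fourfold reduction before any quantization or index overhead.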

Integration and Use Cases

The model is designed for multimodal downstream tasks including Retrieval-Augmented Generation (RAG), semantic search, sentiment analysis, and data clustering. It's available through multiple platforms: