SentrySearch: Gemini Embedding 2 Brings Sub-Second Semantic Video Search to Dashcam Footage
Google's Gemini Embedding 2 model now projects raw video directly into vector space, eliminating the need for transcription or frame-by-frame captioning. A new open-source tool called SentrySearch demonstrates this capability by enabling natural language search over hours of dashcam footage in seconds, with indexing costs around $2.84 per hour of video.
How Native Video Embedding Changes Everything
Traditional video search has always required an intermediate step—transcribing audio, captioning frames, or extracting metadata—that bridges the gap between visual content and text queries. Gemini Embedding 2 eliminates this bottleneck by projecting raw video pixels directly into the same 768-dimensional vector space as text queries.
This means a query like "red truck running a stop sign" can be directly compared to 30-second video clips at the vector level, without any text conversion. The semantic similarity between the query and video content is computed natively in embedding space, enabling matches based on visual concepts rather than just keywords.
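At query time this reduces to a nearest-neighbor lookup in the shared embedding space. A minimal sketch of that comparison, using toy 4-dimensional vectors as stand-ins for the 768-dimensional embeddings the API would return:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins: in practice both come from the embedding API as 768-dim vectors.
query_vec = [0.9, 0.1, 0.0, 0.4]                      # "red truck running a stop sign"
chunk_vecs = {
    "clip_0012.mp4 @ 00:30": [0.8, 0.2, 0.1, 0.5],    # red truck at intersection
    "clip_0047.mp4 @ 02:00": [0.1, 0.9, 0.7, 0.0],    # empty parking lot
}

best = max(chunk_vecs, key=lambda k: cosine_similarity(query_vec, chunk_vecs[k]))
print(best)   # → clip_0012.mp4 @ 00:30
```

Because both query and clip live in the same space, no text ever has to be extracted from the video before the comparison happens.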
The implications extend far beyond convenience. Previous approaches required expensive transcription services, computer vision pipelines, or manual tagging. Native video embedding reduces this to a single API call per video chunk, democratizing access to enterprise-grade video search for indie developers and small teams.
SentrySearch: A Practical Implementation
SentrySearch is a CLI tool built specifically for Tesla Sentry Mode footage, though it works with any MP4 video files. The tool splits videos into overlapping chunks, embeds each chunk using Gemini's video embedding API, and stores vectors in a local ChromaDB database for fast similarity search.
The workflow is straightforward. First, initialize the tool with your Gemini API key. Then index your footage directory—the tool recursively finds all MP4 files and processes them in batches. Once indexed, search using natural language queries like "green car cutting me off" or "person approaching vehicle." The top match is automatically trimmed from the original file and saved as a standalone clip.
The project includes intelligent optimizations for real-world usage. Still-frame detection skips chunks with no visual change, dramatically reducing costs for security camera footage that spends hours recording empty parking lots. Preprocessing downscales chunks to 480p at 5fps to minimize upload time and prevent API timeouts, though the Gemini API processes at 1fps regardless of input frame rate.
Cost Analysis: Enterprise Video Search for $2.84/Hour
The economics of native video embedding are compelling. The Gemini API extracts and tokenizes exactly one frame per second from uploaded video, regardless of the file's actual frame rate. For one hour of footage (3,600 seconds) at $0.00079 per frame, indexing costs approximately $2.84. Search queries use text embeddings only, so they're essentially free.
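The arithmetic is simple enough to verify directly, using the per-frame rate cited above:

```python
PRICE_PER_FRAME = 0.00079      # USD per frame, the rate cited above
BILLED_FPS = 1                 # Gemini samples 1 fps regardless of input frame rate

def indexing_cost(video_seconds: float) -> float:
    """Estimated embedding cost in USD for a stretch of footage."""
    return video_seconds * BILLED_FPS * PRICE_PER_FRAME

print(f"${indexing_cost(3600):.2f} per hour")   # → $2.84 per hour
```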
Still-frame skipping provides significant savings for security footage. A typical Sentry Mode recording might show hours of an empty parking lot with only brief moments of activity. The tool's heuristic detection—comparing JPEG file sizes across sampled frames—skips chunks with no meaningful visual change, potentially reducing costs by 50-80% for low-activity footage.
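The file-size heuristic works because JPEG compression produces near-identical byte counts for near-identical frames, so low variation across sampled frames suggests a static scene. A sketch of the idea; the threshold and sampling details here are assumptions, not the tool's actual values:

```python
def is_still_chunk(jpeg_sizes: list[int], tolerance: float = 0.02) -> bool:
    """Heuristic: treat a chunk as static if sampled JPEG frame sizes
    vary by less than `tolerance` relative to their mean."""
    if len(jpeg_sizes) < 2:
        return True
    mean = sum(jpeg_sizes) / len(jpeg_sizes)
    max_dev = max(abs(s - mean) for s in jpeg_sizes)
    return max_dev / mean < tolerance

# Empty parking lot: compressed frame sizes barely move.
print(is_still_chunk([48_210, 48_190, 48_305]))   # → True
# A car drives through: compressed size jumps.
print(is_still_chunk([48_210, 61_990, 70_312]))   # → False
```

The trade-off the README acknowledges is visible here: subtle motion that barely changes compressed size will slip under the threshold.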
For comparison, traditional video transcription services typically charge $1-2 per hour just for audio processing, while computer vision APIs for frame analysis can run $5-10 per hour depending on complexity. Native video embedding delivers full semantic visual search at a price comparable to audio transcription alone, and at roughly half the cost of frame-level vision analysis.
Technical Architecture
SentrySearch's architecture reflects practical constraints of working with video at scale. The chunking system splits videos into overlapping segments (default 30 seconds with 5 seconds of overlap) to ensure events aren't missed at chunk boundaries. Overlap helps when an incident spans two chunks—the semantic match will be stronger for the chunk containing more of the event.
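With a 30-second chunk and 5 seconds of overlap, each window starts 25 seconds after the previous one. The window arithmetic can be sketched as:

```python
def chunk_windows(duration: float, chunk: float = 30.0, overlap: float = 5.0):
    """Yield (start, end) boundaries for overlapping chunks."""
    stride = chunk - overlap          # 25 s between successive chunk starts
    start = 0.0
    while start < duration:
        yield (start, min(start + chunk, duration))
        start += stride

print(list(chunk_windows(70.0)))
# → [(0.0, 30.0), (25.0, 55.0), (50.0, 70.0)]
```

Each window shares its last 5 seconds with the next one's first 5 seconds, so an event straddling a boundary appears in at least one chunk in full or near-full.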
Preprocessing via ffmpeg downscales video to 480p at 5fps before uploading. This reduces payload sizes from potentially hundreds of megabytes to manageable chunks that won't timeout during API transmission. Since Gemini processes at 1fps regardless of input, this preprocessing doesn't affect the number of frames billed—it only improves reliability and speed.
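A sketch of that preprocessing step, building (not running) the ffmpeg argument list; the exact filter chain SentrySearch uses is an assumption here:

```python
def preprocess_cmd(src: str, dst: str, height: int = 480, fps: int = 5) -> list[str]:
    """ffmpeg invocation downscaling to `height`p at `fps` before upload.
    scale=-2:480 preserves aspect ratio while keeping the width even."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-vf", f"scale=-2:{height},fps={fps}",
        "-an",                 # drop audio (assumption: embedding here is visual-only)
        dst,
    ]

print(" ".join(preprocess_cmd("chunk_0001.mp4", "chunk_0001_small.mp4")))
```

The 5 fps output leaves headroom above the 1 fps the API actually samples, so the downscale affects upload size and reliability but never the billed frame count.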
Vector storage uses ChromaDB, a lightweight embedding database that runs locally. This keeps search latency low and ensures privacy—your video embeddings never leave your machine after initial processing. The database stores chunk metadata including source file, timestamps, and similarity scores for result ranking.
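The role ChromaDB plays can be illustrated with a few lines of pure Python: a toy in-memory store that keeps vectors alongside chunk metadata and returns the closest matches. ChromaDB adds persistence and indexing on top of essentially this contract.

```python
import math

class ToyVectorStore:
    """Minimal stand-in for what ChromaDB provides: vectors + metadata + top-k query."""
    def __init__(self):
        self.items = []   # (embedding, metadata) pairs

    def add(self, embedding, metadata):
        self.items.append((embedding, metadata))

    def query(self, query_vec, k=3):
        def sim(v):
            dot = sum(a * b for a, b in zip(query_vec, v))
            return dot / (math.hypot(*query_vec) * math.hypot(*v))
        ranked = sorted(self.items, key=lambda it: sim(it[0]), reverse=True)
        return [(round(sim(v), 3), meta) for v, meta in ranked[:k]]

store = ToyVectorStore()
store.add([0.9, 0.1], {"file": "front_cam.mp4", "start": 30, "end": 60})
store.add([0.1, 0.9], {"file": "rear_cam.mp4", "start": 0, "end": 30})
print(store.query([1.0, 0.0], k=1))
```

Because only these vectors and metadata rows are kept after indexing, searching never touches the original footage or the network.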
Use Cases Beyond Dashcams
While SentrySearch targets Tesla footage, the underlying approach applies to any video search use case:
Security Camera Systems: Index weeks of security footage and search for specific events—"person wearing red jacket," "delivery truck arriving," or "suspicious activity near back door." The still-frame optimization makes this economically viable for continuous recording.
Content Creation Archives: Video creators with terabytes of B-roll can finally search their libraries semantically. Find that shot of "city skyline at sunset" or "dog playing in park" without manual tagging or browsing endless folders.
Sports Analysis: Coaches and analysts can search game footage for specific plays—"corner kick resulting in goal" or "fast break transition"—without watching hours of video.
Drone Footage: Search aerial surveys for specific features—"construction equipment," "flooded areas," or "wildlife sightings"—directly from raw footage.
Current Limitations and Future Directions
The README acknowledges several current limitations. Still-frame detection is heuristic: it may occasionally skip chunks with subtle motion or embed truly static ones. The Gemini Embedding 2 API is still in preview, so pricing and behavior may change.
Search quality depends on chunk boundaries—if an event spans two chunks, the overlapping window helps but isn't perfect. Scene detection-based chunking (detecting cuts and transitions) could improve this by aligning chunks with natural video boundaries.
The Hacker News discussion around this project highlighted interest in extending the approach to other video types. Commenters suggested applications for bodycam footage, wildlife monitoring, and automated video editing workflows. The low cost and simple architecture make experimentation accessible.
Getting Started
SentrySearch requires Python 3.10+, ffmpeg (or the bundled imageio-ffmpeg), and a free Gemini API key from Google AI Studio. Installation is via pip after cloning the repository.
The tool provides three main commands: init for API key configuration, index for processing video directories, and search for querying the indexed database. Optional flags control chunk duration, overlap, preprocessing quality, and still-frame skipping behavior.
Results include similarity scores and timestamps, with the top match automatically extracted as a standalone clip using ffmpeg. This eliminates the need to manually scrub through original files to find the relevant segment.
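Extracting the matched segment is a plain ffmpeg trim. A sketch of the command the tool would construct; stream-copy keeps it fast and lossless, at the cost of keyframe-aligned cut points:

```python
def trim_cmd(src: str, dst: str, start: float, duration: float) -> list[str]:
    """ffmpeg invocation extracting `duration` seconds starting at `start`."""
    return [
        "ffmpeg", "-y",
        "-ss", str(start),      # seek to the matched chunk's start time
        "-i", src,
        "-t", str(duration),    # chunk length, e.g. 30 s
        "-c", "copy",           # stream copy: no re-encode
        dst,
    ]

print(" ".join(trim_cmd("front_cam.mp4", "match_001.mp4", 125.0, 30.0)))
```

Placing -ss before -i makes ffmpeg seek in the input rather than decode up to the cut point, which matters when the source file is an hour long.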
FAQ
Does this work with non-Tesla video files?
Yes, SentrySearch works with any MP4 files. The tool recursively scans directories for .mp4 files regardless of folder structure or naming conventions. It was built for Tesla Sentry Mode but handles any video content.
How accurate is the semantic search?
Accuracy depends on the specificity of your query and the visual clarity of your footage. The demo shows queries like "red truck running a stop sign" returning relevant results with similarity scores above 0.85. Broad visual concepts match better than fine-grained details: "car accident" will return stronger matches than "blue Honda Civic with license plate ABC123."
Is my video data sent to Google?
Video chunks are sent to Google's Gemini API for embedding generation during the indexing phase. The resulting embeddings—768-dimensional vectors—are stored locally in ChromaDB. Original video files remain on your machine. If privacy is a concern, you should review Google's API terms and data handling policies.
Can I search multiple videos at once?
Yes, the indexing process handles entire directory trees. Once indexed, search queries automatically scan across all processed videos. Results are ranked by similarity score regardless of which source file they came from.
What happens if I reach my API quota?
The tool processes files sequentially and saves progress incrementally. If you hit a quota limit, you can resume indexing later without reprocessing already-completed chunks. The README recommends monitoring your Google AI Studio quota dashboard during large indexing operations.
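Resumability can be as simple as recording finished chunk IDs and skipping them on the next run. A sketch of the pattern; the actual bookkeeping SentrySearch uses isn't specified in the README:

```python
import json
import pathlib

PROGRESS = pathlib.Path("index_progress.json")

def load_done() -> set[str]:
    """Chunk IDs already embedded on a previous run."""
    return set(json.loads(PROGRESS.read_text())) if PROGRESS.exists() else set()

def index_chunks(chunk_ids, embed_fn):
    """Embed only chunks not yet recorded as done; persist after each one."""
    done = load_done()
    for cid in chunk_ids:
        if cid in done:
            continue            # already embedded on a previous run
        embed_fn(cid)           # may raise when the API quota is exhausted
        done.add(cid)
        PROGRESS.write_text(json.dumps(sorted(done)))

calls = []
index_chunks(["a", "b"], calls.append)          # first run embeds a and b
index_chunks(["a", "b", "c"], calls.append)     # resumed run skips a and b
print(calls)                                    # → ['a', 'b', 'c']
PROGRESS.unlink()                               # clean up the demo progress file
```

Persisting after every chunk, rather than at the end, is what makes a mid-run quota failure cheap to recover from.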
How does this compare to traditional video search solutions?
Traditional solutions typically require either expensive transcription services, custom computer vision models, or manual tagging. SentrySearch provides semantic search at roughly $2.84 per hour of footage (less with still-frame skipping) with no infrastructure setup. For small to medium video libraries, this is significantly cheaper and faster than building custom pipelines or using enterprise video analysis services.