ai-multimodal by mrgoonie

Process and generate multimedia content using the Google Gemini API. Capabilities include analyzing audio files (transcription with timestamps, summarization, speech understanding, music/sound analysis, up to 9.5 hours), understanding images (captioning, object detection, OCR, visual Q&A, segmentation), processing videos (scene detection, Q&A, temporal analysis, YouTube URLs, up to 6 hours), extracting data from documents (PDF tables, forms, charts, diagrams, multi-page), and generating images (text-to-image, editing, composition, refinement). Use when working with audio/video files, analyzing images or screenshots, processing PDF documents, extracting structured data from media, creating images from text prompts, or implementing multimodal AI features. Supports multiple models (Gemini 2.5/2.0) with context windows up to 2M tokens.
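The duration ceilings quoted above (9.5 hours for audio, 6 hours for video) lend themselves to a quick pre-flight check before a file is sent anywhere. The following is a minimal sketch under that assumption only: the helper names `within_quoted_limit` and `pieces_needed` are hypothetical, are not part of the skill or of any Gemini SDK, and the numbers are simply the ones stated in the description.

```python
import math

# Duration ceilings quoted in the skill description, in hours.
# Hypothetical helpers for illustration; not part of the skill itself.
MAX_HOURS = {"audio": 9.5, "video": 6.0}


def within_quoted_limit(kind: str, duration_seconds: float) -> bool:
    """Return True if a file of this kind and length is within the quoted ceiling."""
    if kind not in MAX_HOURS:
        raise ValueError(f"no duration limit known for {kind!r}")
    if duration_seconds < 0:
        raise ValueError("duration must be non-negative")
    return duration_seconds <= MAX_HOURS[kind] * 3600


def pieces_needed(kind: str, duration_seconds: float) -> int:
    """Smallest number of pieces such that each piece fits under the ceiling."""
    if kind not in MAX_HOURS:
        raise ValueError(f"no duration limit known for {kind!r}")
    limit_seconds = MAX_HOURS[kind] * 3600
    # Always report at least one piece, even for a zero-length file.
    return max(1, math.ceil(duration_seconds / limit_seconds))
```

For instance, a 13-hour video exceeds the 6-hour figure, so `pieces_needed("video", 13 * 3600)` reports that it would have to be split into three parts.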

Coding
1.3K Stars
249 Forks
Updated Dec 30, 2025, 02:08 PM

Why Use This

This skill provides multimodal processing through the Google Gemini API: audio, image, video, and document analysis, plus image generation.

Use Cases

  • Developing new features in the mrgoonie repository
  • Refactoring existing code to follow mrgoonie standards
  • Understanding and working with mrgoonie's codebase structure

Skill Snapshot

Auto scan of skill assets. Informational only.

Valid SKILL.md

Checks against SKILL.md specification

Source & Community

Repository claudekit-skills
Skill Version main
Community 1.3K Stars, 249 Forks
Updated At Dec 30, 2025, 02:08 PM

Skill Stats

SKILL.md 358 Lines
Total Files 1
Total Size 0 B
License MIT