Papers
Topics
Authors
Recent
Search
2000 character limit reached

OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens

Published 2 Mar 2026 in cs.CV | (2603.02138v1)

Abstract: OmniLottie is a versatile framework that generates high quality vector animations from multi-modal instructions. For flexible motion and visual content control, we focus on Lottie, a light weight JSON formatting for both shapes and animation behaviors representation. However, the raw Lottie JSON files contain extensive invariant structural metadata and formatting tokens, posing significant challenges for learning vector animation generation. Therefore, we introduce a well designed Lottie tokenizer that transforms JSON files into structured sequences of commands and parameters representing shapes, animation functions and control parameters. Such tokenizer enables us to build OmniLottie upon pretrained vision LLMs to follow multi-modal interleaved instructions and generate high quality vector animations. To further advance research in vector animation generation, we curate MMLottie-2M, a large scale dataset of professionally designed vector animations paired with textual and visual annotations. With extensive experiments, we validate that OmniLottie can produce vivid and semantically aligned vector animations that adhere closely to multi modal human instructions.

Summary

  • The paper introduces a tokenization strategy for Lottie JSON that efficiently linearizes nested animation data for high-fidelity auto-regressive generation.
  • It presents MMLottie-2M, a large-scale, multi-modal vector animation dataset with rigorous benchmarks for semantic alignment and motion fidelity.
  • Experimental results show improved rendering reliability and controllability across text, image, and video modalities compared to existing methods.

OmniLottie: A Unified Framework for Multi-Modal Vector Animation Generation

Introduction

The synthesis of vector animations from multi-modal prompts constitutes a key challenge in content creation, enabling resolution-independent graphics with explicit temporal control. Traditional approaches address either static vector graphic synthesis or post-hoc animation via keyframe manipulation, rarely integrating appearance, effects, and motion in an end-to-end manner. "OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens" (2603.02138) proposes a paradigm shift by introducing a tokenization framework for Lottie's hierarchical JSON format, establishing an auto-regressive generative model conditioned on arbitrary input modalities (text, images, videos) and yielding expressive, editable vector animations. The paper also contributes MMLottie-2M, the first large-scale, multi-modal vector animation dataset, and MMLottie-Bench, a rigorous evaluation protocol for multi-modal fidelity and motion correctness.

Vector Animation Modeling and Lottie Tokenization

Direct generation of standard Lottie JSON is both inefficient and unreliable due to the verbosity and redundancy of its structure—models trained on the raw representation waste token budget on irrelevant formatting and frequently produce invalid or unrenderable outputs. The central innovation is a Lottie tokenizer formalizing a transformation from nested JSON structures into a linearized token sequence that encodes metadata, hierarchical layering, commands, and continuous parameters via discrete tokens with optimized offset mapping. This process abstracts away schema-induced noise, focusing capacity on the underlying parametric representation of animation primitives and effects. Figure 1

Figure 1: Overview of the Vector Animation Data Construction Pipeline, detailing data collection, conversion, processing, video rendering, and annotation to yield a large-scale, well-annotated dataset.

This tokenization preserves all generative flexibility, supports lossless round-tripping (detokenization), and admits type-specific discretization, offset allocation informed by dataset-based parameter value distributions (see the parameter distribution figure in the appendix), and explicit end/command markers to define sequence boundaries. Notably, by reserving dedicated token subspaces for categories (binary flags, categorical variables, continuous scales, text identifiers) and maintaining structural markers, the model enables autoregressive generation with fine-grained semantic control. Figure 2

Figure 2: Parameter Distribution Analysis for Vocabulary Offset Design, highlighting dense concentration areas and informing the tokenizer’s range allocation for efficient code space usage.

Figure 3

Figure 3: Token Efficiency Through Progressive Abstraction; token sequence length is reduced \sim5.3×\times from raw JSON, with semantic content preserved for high-fidelity generation.

MMLottie-2M Dataset and Standardized Benchmarking

To facilitate large-scale, instruction-conditioned training, the authors curate MMLottie-2M—a composition of two million Lottie animation samples, accompanied by rich text captions (coarse-to-fine, object- and motion-grounded), reference images, and rendered videos. The data sources span both web-crawled professional Lottie files (after strict cleaning/normalization) and animated SVG-conversions with diversified synthesized motions, ensuring architectural, temporal, and compositional diversity for robust learning.

A modular annotation pipeline, leveraging VLMs for semantic grounding and conformance to strict labeling guidelines, provides multi-granularity annotation tailored to appearance, color, geometry, and motion. The dataset includes detailed statistics—layer counts, duration, category distributions, and hierarchical complexity—that expose the non-trivial structure and diversity, positioning it as a foundation for vector animation research. Figure 4

Figure 4: Visual summary of dataset composition, temporal and spatial statistics pre- and post-normalization, emphasizing diversity and cross-domain coverage.

The authors propose MMLottie-Bench, a dual-track benchmark partitioned into real (hold-out, professionally authored) and synthetic (prompt-based, model-generated) splits, ensuring train–test separation and robustness to future data leakage. The protocol scores both visual fidelity (FVD, PSNR, SSIM, DINO) and semantic alignment (using LLM-based object/motion alignment metrics), supporting automation, reproducibility, and human consistency validation.

The OmniLottie Model: Architecture and Training

OmniLottie builds on a pretrained vision-LLM (VLM)—specifically, Qwen2.5-VL—augmented with embedding tables for the new Lottie token vocabulary. The model consumes interleaved multi-modal instructions and generates Lottie token sequences autoregressively. The discrete token stream is subsequently detokenized into valid Lottie JSON, ensuring renderable, highly-editable vector animation outputs.

The training objective is standard next-token prediction under multi-modal conditioning, optimized with cross-entropy. Ablation studies confirm that fine-tuning directly on raw JSON is significantly inferior to training with the proposed tokenization, both in deterministically improving rendering reliability and boosting all fidelity/alignment metrics.

Experimental Results

Quantitative Benchmarks

Across all modalities (Text-to-Lottie, Text-Image-to-Lottie, Video-to-Lottie), OmniLottie exhibits superior performance versus LLMs, VLMs, and optimization-based/decomposition baselines:

  • Text-to-Lottie: Success rate 88.3% (Real), FVD 202.14, best or second-best semantic and visual alignment scores—significantly ahead of Qwen2.5-VL, DeepSeekV3, and even the commercial Recraft system.
  • Text-Image-to-Lottie: Success rate 93.3%, FVD 180.27, object and motion alignment at or slightly below AniClipart/Livesketch, but with an order-of-magnitude lower runtime and higher validity.
  • Video-to-Lottie: Success rate 88.1%, best FVD (227.11), PSNR, SSIM, DINO, and far higher consistency than GPT-5/Gemini3.1.

These numerical results decisively establish the advantage of structured tokenization and unified training for generative controllability, compactness, and success.

Qualitative Analysis

Rendering samples demonstrate (1) precise adherence to input prompts, (2) plausible, varied, and visually coherent motions, and (3) robust handling of compositionally complex or multimodal conditioning. Comparison with VLM/LLM generators reveals chronic failures in the latter: high rates of schema violations, empty/malformed renderings, and inability to reliably link content structure with temporal dynamics. Figure 5

Figure 5: Qualitative comparison of Text-to-Lottie outputs, revealing precise motion, object integrity, and clear semantic adherence from OmniLottie while other systems fail in prompt alignment or structural validity.

Figure 6

Figure 6: Text-Image-to-Lottie results; only OmniLottie consistently achieves both visual and semantic alignment with input images and captions.

Figure 7

Figure 7: Video-to-Lottie examples highlight the efficacy of parameterized tokenization and multi-modal inference—baseline LLM/VLMs fail to reconstruct valid outputs in most cases.

Analyses of ablations on (1) data mixing (SVG proportion) reveal that moderate inclusion of synthetic SVG-animated sequences enhances geometric coverage without sacrificing realistic motion, and (2) the tokenization strategy verifies it as indispensable to reliability and performance. Figure 8

Figure 8: Qualitative impact of data composition on animation fidelity and prompt alignment.

Figure 9

Figure 9: Lottie Tokenizer ablation study with visualizations and qualitative metrics, illustrating failure rates without structured tokenization.

Failure Analysis

Failure mode taxonomy identifies the dominant error types for each class of method:

  • LLM/VLMs: Specification/structural errors dominate (schema hallucination, attribute misplacement), reflecting lack of format-induced priors.
  • Optimization-based: Input-dependency and convergence pathologies, due to multi-stage pipelines and keypoint/feature extraction failures.
  • OmniLottie: Residual errors are confined to rendering-level (missing styles, timing errors, opacity/scale mispredictions), a significantly more tractable domain for improvement.

Implications, Limitations, and Future Prospects

From a practical standpoint, OmniLottie enables one-shot multi-modal animation synthesis, significantly reducing latency and token cost associated with template-based design or iterative optimization pipelines. Its outputs preserve editability, scalability, and cross-platform compatibility, in contrast to raster or non-vector approaches. Theoretically, the work illustrates that format-aware tokenization is essential for generative modeling of structured, parameter-rich domains, especially where schema conformance and renderability are non-negotiable.

Remaining limitations reside primarily in residual invalid sequences from autoregressive decoding, limited context in extreme compositional depth scenarios, and the brittleness of direct parameter regression under adversarial or ambiguous input conditions. Directions for future work include constrained decoding schemes, reinforcement/active learning-based reward shaping (renderability and perceptual feedback), and agentic integrations with professional DCC pipelines to close the designer-in-the-loop gap.

Conclusion

OmniLottie establishes a new foundation for multi-modal vector animation generation by introducing a robust tokenization and representation framework tailored to Lottie. The model’s auto-regressive design, enabled by large-scale, richly-annotated data and formalized benchmarks, yields marked improvements in reliability, semantic control, and generative quality across all major modalities of vector animation synthesis. This work sets a methodological precedent for future research on high-level scene parameterization and editable, instruction-driven creative toolchains in vision and graphics.

Whiteboard

Explain it Like I'm 14

What is this paper about?

This paper introduces OmniLottie, a smart system that can create high-quality vector animations (the kind that stay sharp when you zoom in) from different kinds of inputs—like text, images, and videos. It focuses on the Lottie format, which stores animations in a lightweight text file called JSON. The team also built a huge dataset (MMLottie-2M) and a fair test benchmark (MMLottie-Bench) to train and evaluate the system.

What questions did the researchers ask?

The researchers wanted to find out:

  • How can we automatically generate editable, scalable vector animations from text, images, or videos?
  • How do we make these animations follow the user’s instructions closely (for example, “a red rocket flies up and fades out”)?
  • Can we simplify Lottie’s complex JSON files so a model learns the important parts (shapes and motion) instead of wasting effort on formatting?
  • Can a single model handle multiple tasks (text-to-animation, image+text-to-animation, and video-to-animation) reliably?

How did they do it?

Lottie and why it matters

Think of “vector animations” like drawing instructions: “draw a circle at this position, move it to the right, change its color,” and so on. This is different from normal videos (called raster videos), which are like a big grid of colored dots. Vector animations:

  • Stay sharp at any size (no blurriness when you zoom in),
  • Are easy to edit (change color, motion, timing),
  • Work well across different apps and devices.

Lottie is a popular vector animation format that stores all shapes and animation steps in a single JSON file. But Lottie JSON can be very long and full of technical details, which makes it hard for a model to learn from directly.

Turning Lottie into simple tokens (the tokenizer)

To solve the “too much messy text” problem, the authors created a Lottie tokenizer. A tokenizer turns complicated files into short, meaningful codes (tokens). Imagine replacing a long recipe with neat instructions like:

  • “Start layer: shape”
  • “Position: x=120, y=80”
  • “Scale: 100%”
  • “Opacity: 100”
  • “Animate: move up, 2 seconds, ease-out”

This tokenizer:

  • Removes repetitive or unimportant formatting,
  • Keeps only the useful commands (shapes, effects, keyframes, positions),
  • Converts numbers to discrete tokens using a simple formula (scale and offset), so the model can predict them more easily,
  • Makes a clean sequence the model can read and write like a sentence.

In short: instead of making the model write long, fragile JSON text, it teaches the model to write a compact “command list” that can be turned back into a valid Lottie file.

The model that writes animations

OmniLottie uses a pretrained vision–LLM (VLM) as its “brain” (specifically Qwen2.5-VL). A VLM can understand text, images, and videos together (this is called “multi-modal”). The model then generates the animation tokens step by step—like writing one word at a time in a sentence (this is called “auto-regressive” generation).

Analogy:

  • You give the model a prompt (text) and possibly an image or video.
  • The model “thinks,” then writes a sequence of animation commands.
  • Those commands are turned back into a valid Lottie JSON file.
  • The JSON file plays as a vector animation.

Building a big training set: MMLottie-2M

Good models need lots of good examples. The team built MMLottie-2M, a dataset with 2 million vector animations. They:

  • Collected professional Lottie files from several websites,
  • Cleaned and standardized them to a common canvas size and time range,
  • Generated extra animated examples by adding motion templates to static SVG graphics (like “spin,” “zoom,” “fade,” “slide”),
  • Added helpful captions and annotations for what’s in the animation and how it moves,
  • Rendered preview videos and keyframes to help the model learn connections between text, image, video, and vector commands.

Testing with a fair benchmark: MMLottie-Bench

To measure progress, they created MMLottie-Bench, a test set with “real” examples (not seen during training) and “synthetic” ones (carefully generated with other tools). They measured:

  • Visual quality (does the animation look smooth and natural?),
  • Instruction following (does it contain the right objects and move the right way?),
  • Reliability (does it produce a valid animation file often and quickly?).

They used metrics like:

  • FVD (how natural the motion looks across frames),
  • CLIP similarity (how well images/frames match the text prompt),
  • Object Alignment and Motion Alignment (LLM judge scores: did the animation include the right stuff, moving in the right way?),
  • PSNR/SSIM/DINO (measures for matching videos, clarity, and structure).

What did they find?

Here are the main results explained simply:

  • OmniLottie generated valid, high-quality vector animations more often than other methods and commercial tools.
  • It followed instructions more accurately: the right objects appeared and moved in the right direction, speed, and style.
  • It achieved the best or near-best scores on most metrics (FVD, alignment, PSNR/SSIM/DINO), meaning smoother motion, clearer visuals, and better match to the inputs.
  • The Lottie tokenizer made a big difference. Training on raw Lottie JSON was much less reliable. With the tokenizer, success rates and quality jumped.
  • Mixing training data worked best when they combined mostly real Lottie animations with some animated SVGs. Too many SVGs made motions too simple; a balanced mix improved both shapes and motion.
  • Compared to models that tried to write raw JSON directly, OmniLottie was both more robust and more efficient—because it focused on meaningful tokens rather than formatting.

Why does this matter?

This work helps make animation creation faster, easier, and more accessible:

  • For students and hobbyists: you could describe an idea (“a paper plane glides upward while fading in”) and get a clean, editable vector animation.
  • For designers and developers: it speeds up workflows, produces cross-platform animations, and keeps everything editable and scalable.
  • For creative tools: it enables smarter “co-pilot” features that understand text, example images, or even short videos, then output native vector animations.

The research also shows a general strategy: when a format is too complex (like Lottie JSON), designing the right tokenizer can unlock strong models. It reduces confusion and lets the model learn the important parts—shapes, effects, and motion.

Limitations and future ideas:

  • Sometimes the model still writes invalid or empty sequences. The team suggests adding safety checks (“constrained decoding”), training with rewards for files that render correctly, or combining the model with pro tools (like After Effects) to fix mistakes.
  • Extremely long or complex animations remain challenging (context length limits). Future work could handle longer timelines or more layers more reliably.

Overall, OmniLottie shows that multi-modal, token-based generation is a powerful way to create editable, high-quality vector animations from simple instructions.

Knowledge Gaps

Knowledge Gaps, Limitations, and Open Questions

Below is a concise list of concrete gaps and unresolved questions that emerge from the paper, intended to guide future research.

  • Formal validity guarantees: No grammar-constrained or schema-aware decoding is implemented to prevent invalid token sequences; explore constrained/typed decoding, finite-state automata, or structured decoders to guarantee renderable Lottie output.
  • Quantization fidelity: The impact of the offset/scale quantization on geometric accuracy, timing, easing curves, and render equivalence is not analyzed; determine error bounds, adaptive precision schemes, and trade-offs between token budget and fidelity.
  • Round-trip integrity: Despite claims of lossless structural compression, numerical round-trips after quantization may alter renders; quantify visual and semantic drift after tokenization–detokenization cycles.
  • Tokenizer coverage gaps: The tokenizer supports only five layer types (ty ∈ {0,1,3,4,5}); handling of image (ty=2), camera, audio layers, gradient fills, track mattes, masks, blend modes, and advanced effects is incompletely specified and not benchmarked.
  • Expressions and procedural logic: After Effects expressions and other non-parameterizable behaviors are excluded during data curation; investigate representing, constraining, or compiling expressions into parameter sequences.
  • 3D support: The ddd flag and camera-like behaviors are acknowledged but not supported or evaluated; assess feasibility of 3D/transformed layers and camera animations within the tokenizer and model.
  • Cross-engine robustness: Outputs are evaluated with a single rendering stack; test cross-platform consistency and failure rates across Lottie players (Web, iOS, Android) and AE import.
  • Success rate definition and diagnostics: “Success rate” lacks a formal, standardized validity protocol; define and publish schema checks, renderer logs, and multi-engine validation criteria.
  • Vector-native metrics: Evaluation relies on raster proxies (FVD, CLIP) and LLM-as-judge; develop vector-specific metrics for path topology similarity, boolean ops, hierarchy/parenting accuracy, keyframe timing, easing curve matching, and parameter distribution alignment.
  • LLM-as-judge reliability: Object/motion alignment is scored by a single closed-source LLM; quantify inter-judge reliability, calibration, and robustness to prompt phrasing, and release gold-standard human annotations for a subset.
  • Dataset bias and coverage: Web-sourced assets may overrepresent certain styles/regions and underrepresent enterprise UI/UX or scientific visuals; audit style/content/motion diversity, language coverage, and introduce balancing/long-tail splits.
  • Potential train–test contamination: Real subset is sourced from similar platforms as training data; provide near-duplicate and stylistic similarity checks, with leakage reports and de-duplication thresholds.
  • Annotation noise: Multimodal captions generated by VLMs may be noisy or biased; measure caption accuracy, release human-verified subsets, and study the effect of annotation noise on instruction following.
  • Motion template dependence: The motion library synthesized from clustering may bias the model toward canonical patterns; quantify diversity/novelty, coverage of non-templated motions, and mitigation strategies against template overuse.
  • Variable duration and timing control: Training normalizes time to 0–16; study conditioning and generation for arbitrary timelines, fps, looping behaviors, and multi-scene sequences.
  • Controllability and editability: Fine-grained controls (easing types, motion paths, color palettes, z-order, layer parenting, complexity budgets) are not exposed; design controllable tokens or constraint interfaces for professional workflows.
  • Editing existing Lotties: The framework does not support instruction-conditioned editing or partial preservation of given layers; add masked/anchored generation and diff-style Lottie editing.
  • Complexity scaling and context limits: The model’s maximum scene complexity vs. context length is not characterized; provide scaling laws for token length, memory use, and performance, and explore hierarchical or chunked generation.
  • Efficiency and decoding strategy: Inference times are high and decoding strategies are not optimized; evaluate grammar-guided pruning, speculative decoding, cache-based prompting, and parallelized sampling to reduce latency.
  • Reliability under stress: Failure taxonomy is provided, but robustness under adversarial/OOD prompts, unusually long sequences, or extreme parameter ranges is not quantified; benchmark stress tests and fallback generation strategies.
  • Video-to-Lottie generalization: Video inputs used for supervision are renders of Lotties; assess performance on real-world raster videos (camera footage, motion graphics) and define conversion quality metrics beyond PSNR/SSIM.
  • Fonts and multilingual text: Support for fonts, glyphs, kerning, and non-Latin scripts is not evaluated; measure multilingual prompt following, font substitution strategies, and text layout fidelity.
  • File size and performance constraints: Effects of generation choices on file weight, runtime performance, and mobile playback are not analyzed; introduce metrics and constraints for deployability.
  • Comparative structured baselines: No baseline with schema-constrained JSON/function-calling LLMs is included; compare against grammar-constrained JSON generation to isolate tokenizer benefits from structural constraints.
  • Training details and reproducibility: Compute budgets, optimization schedules, and full hyperparameters are not disclosed; publish training recipes and seeds to enable faithful replication.
  • Differentiable render feedback or RL: Training optimizes token likelihood only; investigate differentiable vector rendering and reinforcement learning with renderability and perceptual rewards.
  • Safety/IP filtering: The dataset disclaimer notes third-party rights, but pipelines for license classification, brand/logo filtering, and takedown are not described; formalize compliance workflows.
  • Audio/beat-synchronous animation: Audio layers and synchronization are out of scope; explore audio-conditioned Lottie generation and alignment metrics.

Practical Applications

Overview

OmniLottie introduces three core innovations with direct practical impact: (1) a Lottie tokenizer that converts complex JSON animations into compact, structured command–parameter tokens; (2) OmniLottie, an autoregressive VLM-based generator that creates Lottie vector animations from text, image, and video instructions; and (3) the MMLottie-2M dataset and MMLottie-Bench for training and benchmarking. Together, they enable editable, resolution-independent, cross-platform animations that can be generated and iterated far faster than traditional workflows.

Below are concrete applications, grouped by deployment horizon. Each item includes sectors, likely tools/products/workflows, and key assumptions/dependencies that affect feasibility.

Immediate Applications

  • Text-to-animation copilot for designers (software, creative tools, advertising/marketing, product)
    • What: Prompt-based generation of microinteractions, icon animations, loaders, banners, and branded motion assets as native Lottie.
    • Tools/workflows: Plugins/extensions for Figma, Adobe After Effects, LottieFiles; “Prompt → Generate → Preview → Edit parameters → Export” flows; batch generation for A/B tests.
    • Assumptions/dependencies: Stable integration via LottieFiles/AE scripting; human-in-the-loop review for brand/quality; model and base VLM licenses compatible with commercial deployment.
  • Auto-animating static icon/SVG libraries (software, product design, e-commerce)
    • What: Apply learned motion templates to static vector icon sets for consistent, on-brand motion systems.
    • Tools/workflows: Ingest SVG library → text or preset motion prompts → Lottie output per icon → asset library publish.
    • Assumptions/dependencies: SVG-to-Lottie conversion quality; motion template selection aligned with brand guidelines.
  • Rapid onboarding and tutorial animation from UI captures (software, education, SaaS)
    • What: Video-to-Lottie converts screen recordings or annotated screenshots into step-by-step vector animations for user onboarding or help centers.
    • Tools/workflows: Product teams upload short screen capture clips → automatic Lottie tutorials → embed in in-app help.
    • Assumptions/dependencies: Works best with clean UI footage and vector-friendly content; human QA to refine steps.
  • Marketing and growth content at scale (advertising/marketing, social media)
    • What: Generate campaign-specific animations (text-to-Lottie) for social posts, ads, emails; variant generation for personalization and A/B testing.
    • Tools/workflows: Campaign brief → prompt templates → batch generation → automated CLIP/LLM checks for prompt adherence → deploy.
    • Assumptions/dependencies: Brand governance rules layered on top; content moderation/safety filters; compute budget for batch runs.
  • Localization and theming of motion assets (software, media localization)
    • What: Adjust text layers, directionality (LTR/RTL), colors, and pacing across locales via parameter editing rather than re-animating.
    • Tools/workflows: Tokenized Lottie → scripted updates to text/font/color/transform tokens → localized builds.
    • Assumptions/dependencies: Font availability and layout policies per locale; UI integration pipelines.
  • Accessibility variants for reduced motion and safety (software, public sector, education, healthcare apps)
    • What: Produce reduced-motion or high-contrast versions by editing opacity, speed, and effect parameters; avoid flashing patterns.
    • Tools/workflows: Accessibility profile presets → batch adjustments via tokenizer (e.g., lower amplitude/speed, disable flicker).
    • Assumptions/dependencies: Rule sets aligned to WCAG; QA for edge cases; constrained decoding or rule-based post-processing.
  • Lottie linting, validation, and minification in CI/CD (software tooling)
    • What: Use the tokenizer for schema validation, renderability checks, parameter normalization, and diff/merge-friendly representations.
    • Tools/workflows: Git pre-commit hooks; CI tasks that validate token sequences, normalize timelines/canvas; minify to reduce payload size.
    • Assumptions/dependencies: Stable tokenizer API; coverage for common AE/Lottie features used by team.
  • Classroom and MOOC content creation (education)
    • What: Teachers generate vector animations of diagrams, processes, and concepts from prompts for lectures and slides.
    • Tools/workflows: Text/image prompts → render Lottie → export to presentation or web player.
    • Assumptions/dependencies: Educator prompts well-specified; optional human edits for accuracy; institution policy on generative assets.
  • Documentation and internal training animations (enterprise, finance, software)
    • What: Convert SOPs, release notes, and internal guides into short vector explainers (text-to-/image-to-Lottie).
    • Tools/workflows: Docs → prompt templates → Lottie embeds in wikis/portals.
    • Assumptions/dependencies: Review for compliance (especially in regulated industries); keep visuals abstract to avoid sensitive data.
  • Research and benchmarking (academia, open-source)
    • What: Use MMLottie-2M and MMLottie-Bench to study multimodal instruction following, tokenization for structured graphics, and evaluation protocols.
    • Tools/workflows: Train/fine-tune models; reproduce benchmarks; ablation studies on tokenization and data mixing.
    • Assumptions/dependencies: Dataset is research-only; respect IP and usage restrictions; compute resources for training.

Long-Term Applications

  • End-to-end “brief-to-production” motion design pipeline (software, creative tools, enterprise)
    • What: Agents that turn creative briefs and style guides into production-ready Lottie with constrained decoding and brand rules enforcement.
    • Tools/workflows: Design system + brand tokens → generative agent → reviewer loop → shipment to web/mobile.
    • Assumptions/dependencies: Robust constrained decoding; rule engines for brand/accessibility; integration with design systems.
  • Cross-format generative motion (SVG/CSS, Rive, After Effects) from a unified token space (software ecosystem)
    • What: Extend tokenization to other vector animation formats for multi-target exports.
    • Tools/workflows: Abstract token spec → exporters for Lottie/SVG-CSS/Rive/AE → multi-platform distribution.
    • Assumptions/dependencies: Formalized mappings between specs; vendor cooperation; performance testing across players.
  • Real-time or on-device generative motion (mobile, gaming, AR/VR)
    • What: Context-aware microinteractions generated at runtime to match user state/theme.
    • Tools/workflows: Distilled/lightweight models → on-device inference or low-latency edge services → dynamic Lottie injection.
    • Assumptions/dependencies: Significant model compression; latency budgets; privacy and energy constraints.
  • Automated conversion of legacy raster motion to vector (media production, education)
    • What: Video-to-Lottie for explainer videos, infographics, and training clips, preserving motion while vectorizing look.
    • Tools/workflows: Batch “raster to vector motion” pipelines → human cleanup → re-usable vector libraries.
    • Assumptions/dependencies: Best for stylized or flat-motion sources; complex photorealistic footage remains challenging.
  • Personalized learning animations (education, healthcare)
    • What: Generate tailored vector animations explaining medical procedures, benefits, or coursework from patient/student profiles and prompts.
    • Tools/workflows: Template prompts + structured data → animation variants → deliver via portals/apps.
    • Assumptions/dependencies: Privacy-preserving data pipelines; clinical/pedagogical review; alignment with accessibility needs.
  • Financial and enterprise explainers at scale (finance, enterprise L&D)
    • What: Turn research notes, policy changes, or compliance updates into branded micro-animations for employees or customers.
    • Tools/workflows: Prompt templates with risk keywords → generation → automated compliance checks → distribution.
    • Assumptions/dependencies: Strong governance for factuality and compliance; content approval workflows.
  • Data-driven motion graphics from telemetry (IoT, robotics, operations)
    • What: Animate device or robot logs and trajectories as vector dashboards for debugging and executive reporting.
    • Tools/workflows: Telemetry parsers → tokenized motion templates → Lottie dashboards.
    • Assumptions/dependencies: Stable mapping from numeric streams to motion grammar; domain customization.
  • Motion accessibility policy auditors (public sector, large enterprises)
    • What: Static analysis of Lottie token sequences to flag violations (e.g., excessive flicker, speed, opacity changes).
    • Tools/workflows: CI policies → reports and autofixes → certification gates.
    • Assumptions/dependencies: Standardized policy definitions; buy-in from accessibility teams; robust heuristics.
  • Interactive co-creation agents with RL for renderability (software, creative tools)
    • What: Agents that iteratively propose, render-check, and fix animations using “renderability rewards,” improving reliability without heavy manual QA.
    • Tools/workflows: RL fine-tuning + sandbox renderer → constrained decoding → user feedback loops.
    • Assumptions/dependencies: Efficient render-in-the-loop training; scalable evaluation metrics.
  • Sector-specific motion libraries and generators (healthcare, education, fintech, e-commerce)
    • What: Curated, domain-tuned motion packs (e.g., anatomy, lab workflows, fintech icons) with specialized prompts and safety filters.
    • Tools/workflows: Domain curation → fine-tuned models → governed generation portals/APIs.
    • Assumptions/dependencies: Licensed, domain-specific datasets; SME review; ongoing maintenance.
  • Standards and procurement guidance for vector animation generation (policy, industry consortia)
    • What: Use MMLottie-Bench-like protocols to set evaluation standards for generative motion quality, safety, and compliance in RFPs.
    • Tools/workflows: Public benchmarks; certification suites; disclosure and reproducibility checklists.
    • Assumptions/dependencies: Consensus among stakeholders; updates as models evolve; transparency requirements.

Notes on feasibility across applications:

  • Licensing: MMLottie-2M is research-only; commercial systems may require re-training on licensed assets or rights-cleared subsets.
  • Reliability: Although OmniLottie improves success rates, production use benefits from constrained decoding, validators, and human review.
  • Performance: Current generation times (tens of seconds) and token lengths may need optimization for real-time or large-scale personalization.
  • Content safety and brand alignment: Additional filters, rule-based post-processing, and style-token constraints are advisable for enterprise deployments.
  • Format coverage: Some After Effects/Lottie edge features may remain unsupported without further tokenizer/decoder extensions.

Glossary

  • After Effects expressions: scripting expressions in Adobe After Effects that dynamically modify layer properties during rendering. "After Effects expressions that dynamically modify properties."
  • Autoregressive models: sequence models that generate content by predicting each next token conditioned on previously generated tokens. "Autoregressive models treat visual content as token sequences"
  • Blend mode: a rendering control that determines how a layer’s pixels are composited with layers beneath it. "blend mode (bm)"
  • CLIP similarity: a text–image embedding metric from CLIP used to assess semantic alignment between prompts and visuals. "CLIP similarity"
  • Collapse transformations (ct flag): a Lottie setting that determines whether grouped transformations are collapsed for rendering. "The ct flag controls transformation collapsing."
  • Cross-entropy loss: a standard training objective for probabilistic sequence prediction that penalizes divergence from the target token distribution. "with the standard cross-entropy loss."
  • Detokenize: convert discrete tokens or command sequences back into structured Lottie JSON. "detokenized to Lottie JSONs"
  • DINO: a self-supervised visual representation metric/model used to measure structural similarity. "and DINO"
  • Diffusion-based video generation models: generative models that synthesize videos via iterative denoising (diffusion) processes. "the diffusion-based video generation models"
  • FVD (Fréchet Video Distance): a metric that evaluates video quality by measuring distributional distance between generated and reference features. "We report FVD"
  • Hierarchical representation: a nested data organization with parent–child relations used to structure layers and effects. "Lottie is a complex hierarchical representation"
  • Keyframe: a frame marking a change in animation parameters (e.g., position, scale), enabling interpolation across time. "random keyframe extraction."
  • LLMs: very large neural LLMs trained on massive corpora and capable of complex reasoning and generation. "With the rapid advancement of LLMs"
  • LLM judge: using an LLM to automatically assess output quality or alignment as part of evaluation. "LLM judge"
  • Lottie: a JSON-based vector animation format that stores shape, effect, and motion parameters in a single file. "we focus on Lottie, a light-weight JSON formatting for both shapes and animation behaviors representation."
  • Lottie tokenizer: a converter that serializes Lottie JSON into compact command–parameter sequences for efficient generation. "we propose a Lottie tokenizer that converts raw JSON into concise command sequences"
  • Matte relations: layer masking relationships that define how one layer reveals or hides another (tt, tp, td). "matte relations (tt, tp, td)"
  • MMLottie-2M: a large-scale dataset of professionally designed vector animations with textual and visual annotations. "MMLottie-2M"
  • Motion Alignment: an evaluation metric (0–10) measuring how well generated motion matches type, direction, magnitude, targets, and smoothness. "Motion Alignment (Motion Align) (0–10)"
  • Motion priors: pre-existing motion patterns derived from sources like reference GIFs or text-to-video models to guide animation. "applying motion priors derived from either a reference GIF"
  • Motion signatures: encoded transform trajectories summarizing temporal motion patterns for clustering and transfer. "generating motion signatures that encode temporal patterns"
  • Motion templates: canonical motion patterns extracted via clustering and applied to assets for augmentation. "extract numerous canonical motion templates."
  • Null layer: a non-visual Lottie layer type used for transforms or hierarchy without rendering. "Null (ty=3)"
  • Object Alignment: an evaluation metric (0–10) measuring correctness of object presence, type, count, visual traits, and spatial relations. "Object Alignment (Obj. Align) (0–10)"
  • Offset-based tokenization: mapping continuous parameters to discrete tokens using type-specific scales and vocabulary offsets. "Our offset-based tokenization maps continuous Lottie parameters to discrete tokens:"
  • Opacity: the transparency level of a layer (0–100 in Lottie) affecting visibility. "rotation, scale, position, opacity"
  • Parent-child relationships: hierarchical links that reconstruct the scene structure and inheritance among layers. "through parent-child relationships"
  • Precomposition: a Lottie layer type referencing nested compositions (sub-scenes) used for grouping and reuse. "Precomposition (ty=0)"
  • PSNR (Peak Signal-to-Noise Ratio): a reconstruction fidelity metric comparing generated frames to references. "PSNR"
  • Raster videos: pixel-based video representations that lack editability and resolution scalability compared to vector formats. "Compared to raster videos"
  • Resolution-independent representation: graphics that can be scaled to different sizes without loss of quality. "resolution-independent representation"
  • Sequence modeling: treating complex content (images, animations) as token sequences for autoregressive generation. "amenable to sequence modeling"
  • Shape layer: a Lottie layer type containing vector shapes and paths for rendering geometry. "Shape (ty=4)"
  • Solid layer: a Lottie layer type representing a solid-color rectangle used as a background or element. "Solid (ty=1)"
  • Spatio-temporal normalization: standardizing animations to fixed spatial (canvas) and temporal (timeline) ranges. "spatio-temporal normalization"
  • SSIM (Structural Similarity Index Measure): a perceptual quality metric assessing structural similarity between images/videos. "SSIM"
  • SVG (Scalable Vector Graphics): a vector image format that stacks shapes and paths and can embed CSS animations. "SVG stacks rects, circles, and paths for shapes and embeds CSS codes for animation."
  • Text layer: a Lottie layer type containing textual content with style and animation parameters. "Text (ty=5)"
  • Time remapping: altering a layer’s playback timing by remapping frames to different timestamps. "tm for time remapping"
  • Token efficiency: an efficiency measure based on average token length of generated outputs. "token efficiency average token length of generated Lottie JSON"
  • Transform trajectories: time-series of transforms (rotation, scale, position, opacity) across keyframes describing motion. "extract transform trajectories (rotation, scale, position, opacity) across keyframes"
  • Vector animation: animation created using vector graphics rather than rasterized video frames. "Vector animation is a type of computer animation"
  • Vision-LLM (VLM): models that jointly process visual and textual inputs for multimodal understanding and generation. "we adopt a pre-trained VLM, specifically Qwen2.5-VL"
  • Video-to-Lottie: the task of converting input videos into vector animations in the Lottie format. "For Video-to-Lottie, we report FVD, PSNR, SSIM, and DINO"

Open Problems

We found no open problems mentioned in this paper.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 9 tweets with 1519 likes about this paper.