Fara-7B Model Architecture

Updated 9 January 2026
  • Fara-7B is a 7-billion-parameter multimodal transformer model designed for pixel-to-action tasks.
  • It employs 32 decoder blocks with high-dimensional embeddings, GeLU activations, and pre-norm LayerNorm for robust performance.
  • The architecture integrates a patch embedding pipeline and unified token output mechanism to enable efficient UI control in web applications.

Fara-7B is a 7 billion-parameter, multimodal, transformer-based agentic model designed for interpreting computer screen images and outputting interaction actions such as clicks or text entry. Developed by supervised fine-tuning of the Qwen2.5-VL-7B backbone on highly curated pixel-to-action datasets, Fara-7B targets efficient, real-world web-based Computer Use Agent (CUA) tasks, achieving notable performance advantages in both scale and sample efficiency relative to comparable or larger baselines (Awadallah et al., 24 Nov 2025).

1. Core Architectural Topology

Fara-7B employs a standard transformer-decoder architecture, inheriting all essential macro- and micro-structural characteristics from Qwen2.5-VL-7B:

  • Layer count: 32 transformer decoder blocks ($N = 32$)
  • Model (hidden) dimension: $d_\mathrm{model} = 4096$
  • Attention heads: 32 ($h = 32$), with per-head $d_k = d_v = 128$
  • Feed-forward (MLP) inner dimension: $d_\mathrm{ff} = 16384$
  • Activation: Gaussian Error Linear Unit (GeLU)
  • Normalization: Pre-norm LayerNorm for every sublayer

Each architectural block processes input $x_{\ell-1}$ as follows:

  • Apply LayerNorm, then Multi-Head Self-Attention (MHSA), adding the result back through a residual connection.
  • Apply LayerNorm, then a two-layer FFN with GeLU, again followed by a residual connection.

This configuration supports efficient training and inference under the transformer paradigm and allows the model to scale favorably for agentic deployments.
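For concreteness, the dimensions listed above can be captured in a small configuration sketch (field names and the parameter-count estimate are ours, not from the paper); a rough count of decoder weights lands near the stated 7B once embeddings and the vision tower are added:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fara7BConfig:
    """Decoder hyperparameters as listed above; field names are ours."""
    n_layers: int = 32      # N
    d_model: int = 4096
    n_heads: int = 32       # h
    d_head: int = 128       # d_k = d_v
    d_ff: int = 16384

    def approx_block_params(self) -> int:
        """Rough weights per block: four d_model x d_model attention projections
        plus the two FFN matrices (biases and LayerNorm parameters omitted)."""
        return 4 * self.d_model ** 2 + 2 * self.d_model * self.d_ff

cfg = Fara7BConfig()
assert cfg.n_heads * cfg.d_head == cfg.d_model   # heads tile the hidden dimension
total = cfg.n_layers * cfg.approx_block_params()
# total is about 6.4e9; embeddings and the vision encoder make up the rest of ~7B.
```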

2. Mathematical Operations and Normalization

The per-layer computation in Fara-7B is fully specified using canonical transformer notation, instantiated as follows:

Multi-Head Self-Attention (MHSA):

$\mathrm{head}_i = \mathrm{softmax}\left(\frac{QW_i^Q\,(KW_i^K)^{\mathsf T}}{\sqrt{d_k}}\right) VW_i^V,\quad \mathrm{MultiHead}(Q,K,V)=\mathrm{Concat}(\mathrm{head}_1,\dots,\mathrm{head}_h)W^O$

where, consistent with pre-norm, $Q = \mathrm{LayerNorm}(x_{\ell-1})W^Q$, $K = \mathrm{LayerNorm}(x_{\ell-1})W^K$, $V = \mathrm{LayerNorm}(x_{\ell-1})W^V$.

Residual and Pre-Norm:

$x'_\ell = x_{\ell-1} + \mathrm{MultiHead}(Q,K,V),\quad x_\ell = x'_\ell + \mathrm{FFN}(\mathrm{LayerNorm}(x'_\ell))$

Feed-Forward Network (FFN):

$\mathrm{FFN}(x) = \mathrm{GeLU}(xW_1 + b_1)\,W_2 + b_2$

LayerNorm formulation:

$\mathrm{LayerNorm}(x) = \gamma\,\frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$

with $\mu, \sigma^2$ the per-component mean and variance.

The model uses pre-norm LayerNorm, which is standard in contemporary transformer LLMs and stabilizes deep architectures (Awadallah et al., 24 Nov 2025).
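The per-layer computation above can be sketched in NumPy at toy dimensions (a minimal illustration under the stated pre-norm/GeLU design, not the actual implementation; causal masking and the real weight shapes are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, h, d_ff = 5, 16, 4, 64   # toy sizes; Fara-7B uses 4096 / 32 / 16384

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def gelu(x):
    # tanh approximation of the Gaussian Error Linear Unit
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def mhsa(x, Wq, Wk, Wv, Wo):
    dk = d // h
    Q, K, V = x @ Wq, x @ Wk, x @ Wv                       # each (T, d)
    heads = [softmax(Q[:, i*dk:(i+1)*dk] @ K[:, i*dk:(i+1)*dk].T / np.sqrt(dk))
             @ V[:, i*dk:(i+1)*dk] for i in range(h)]      # per-head attention
    return np.concatenate(heads, axis=-1) @ Wo             # Concat(head_1..h) W^O

def decoder_block(x, p):
    # Pre-norm: LayerNorm feeds each sublayer; the residual bypasses it.
    # (Causal masking omitted for brevity.)
    x = x + mhsa(layer_norm(x, p["g1"], p["b1"]),
                 p["Wq"], p["Wk"], p["Wv"], p["Wo"])
    y = layer_norm(x, p["g2"], p["b2"])
    return x + gelu(y @ p["W1"] + p["c1"]) @ p["W2"] + p["c2"]   # FFN sublayer

p = {k: rng.standard_normal((d, d)) * 0.02 for k in ("Wq", "Wk", "Wv", "Wo")}
p.update(W1=rng.standard_normal((d, d_ff)) * 0.02, c1=np.zeros(d_ff),
         W2=rng.standard_normal((d_ff, d)) * 0.02, c2=np.zeros(d),
         g1=np.ones(d), b1=np.zeros(d), g2=np.ones(d), b2=np.zeros(d))
out = decoder_block(rng.standard_normal((T, d)), p)
assert out.shape == (T, d)   # the block preserves the sequence shape
```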

3. Multimodal Input Pipeline

Fara-7B processes composite sequences that fuse rasterized image data and associated text:

  • Screenshot rasterization: Each environment observation $o_t$ includes a browser viewport screenshot at $224 \times 224$ pixels.
  • Patch embedding: The screenshot is divided into $16 \times 16$ patches (yielding $L = 196$ visual tokens). Each patch, flattened, is projected via a learnable embedding matrix $E_\mathrm{img} \in \mathbb{R}^{(P^2 \cdot 3) \times d_\mathrm{model}}$.
  • 2D positional encoding: Fixed 2D sinusoidal positional embeddings $E_\mathrm{pos}^{2D} \in \mathbb{R}^{L \times d_\mathrm{model}}$ are added to the patch embeddings.
  • Input concatenation: Visual tokens are concatenated with browser metadata tokens (e.g., URLs), embedded via a standard byte-pair encoding, as well as any previously generated tokens.
  • Decoding: This concatenated sequence is input to the transformer’s first decoder layer.

This multimodal, patch-embedding pipeline enables efficient joint processing of perception (screenshots) and symbolic context (URLs, history) (Awadallah et al., 24 Nov 2025).
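A minimal sketch of the patch-embedding step, assuming a random projection in place of the learned $E_\mathrm{img}$, a shrunken model dimension, and a standard 2D sinusoidal construction (the paper does not spell out the exact sinusoid layout):

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 224; P = 16; d_model = 64           # d_model shrunk from 4096 for the demo
L = (H // P) * (W // P)                     # 196 visual tokens

def patchify(img):
    # (224, 224, 3) -> (196, 16*16*3): non-overlapping patches in row-major order
    g = img.reshape(H // P, P, W // P, P, 3).transpose(0, 2, 1, 3, 4)
    return g.reshape(L, P * P * 3)

def sincos_2d(n_side, dim):
    # Fixed 2D sinusoids: half the channels encode the row, half the column.
    def sincos_1d(pos, dsub):
        freqs = 1.0 / (10000 ** (np.arange(0, dsub, 2) / dsub))
        ang = pos[:, None] * freqs[None, :]
        return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)
    rows = np.repeat(np.arange(n_side), n_side)
    cols = np.tile(np.arange(n_side), n_side)
    return np.concatenate([sincos_1d(rows, dim // 2),
                           sincos_1d(cols, dim // 2)], axis=-1)

E_img = rng.standard_normal((P * P * 3, d_model)) * 0.02  # learned in the real model
screenshot = rng.random((H, W, 3))
tokens = patchify(screenshot) @ E_img + sincos_2d(H // P, d_model)
assert tokens.shape == (196, d_model)   # ready to concatenate with text tokens
```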

4. Action Output and Tokenization

Fara-7B's agentic capabilities derive from a token-based action output mechanism specialized for UI control:

  • Output token sequence: At each time-step $t$, the model generates a “Thought” $r_t$ (reasoning in natural language), followed by an “Action” $a_t$ (e.g., $\mathrm{CLICK}(x,y)$, $\mathrm{TYPE}(\cdots)$, $\mathrm{SCROLL}(\cdots)$).
  • Coordinate encoding: Pixel coordinates $(x, y)$ are represented as digit-string tokens (e.g., “(123,456)”).
  • Unified output head: The model employs a single linear + softmax head over the full action-and-language vocabulary, including numeric, punctuation, and symbolic tokens for coordinates.
  • Generation as classification: Action prediction is treated as sequential classification: $P(a_t \mid \mathrm{context}) = \prod_{i=1}^{|a_t|} P(w_i \mid w_{<i}, \mathrm{context})$
  • Inference: Output token chains are greedily or stochastically decoded, yielding the (possibly multi-part) actions for execution.

This approach permits seamless integration of natural-language reasoning, environmental perception, and action execution in a unified generative framework (Awadallah et al., 24 Nov 2025).
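The generation-as-classification view can be illustrated with a toy character-level tokenization (the real model uses BPE tokens; `step_probs` is a hypothetical stand-in for the model's per-step softmax outputs):

```python
import math

def score_action(step_probs, action_tokens):
    """P(a_t | context) = prod_i P(w_i | w_<i, context), accumulated in log space.
    step_probs[i] is a hypothetical stand-in for the softmax at position i."""
    logp = sum(math.log(step_probs[i][tok]) for i, tok in enumerate(action_tokens))
    return math.exp(logp)

def parse_click(action):
    """Recover pixel coordinates from a digit-string action like CLICK(123,456)."""
    inner = action[action.index("(") + 1 : action.rindex(")")]
    x, y = inner.split(",")
    return int(x), int(y)

# Toy example: each step assigns probability 0.5 to the emitted character.
action = "CLICK(1,2)"
step_probs = [{tok: 0.5} for tok in action]
p = score_action(step_probs, action)
assert abs(p - 0.5 ** len(action)) < 1e-12
assert parse_click("CLICK(123,456)") == (123, 456)
```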

5. Training Regimen and Data Integration

Fara-7B was fine-tuned on a large, highly diversified corpus mediated by the FaraGen synthetic data framework:

  • Training samples: 1.8 million total, including 1.23M CUA trajectories, 0.56M grounding samples (patch-to-click), 3K refusal samples, and 1.8K screenshot-based QA/captioning.
  • Optimization: Cross-entropy loss over concatenated token streams (“thought” + “action”) for each step: $\mathcal{L} = -\sum_{t=1}^{T}\sum_{i=1}^{|r_t|+|a_t|} \log P(w_{t,i}^\star \mid w_{t,<i}, o_{\leq t})$
  • Hyperparameters: AdamW optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.95$, no weight decay), cosine LR schedule with linear warmup (10% of steps), peak LR $5 \times 10^{-6}$, gradient clipping at norm 1.0, bf16 precision, batch size 128, 28,000 steps (2 epochs).
  • Distributed training: DeepSpeed ZeRO-3, 64 NVIDIA H100 GPUs.

FaraGen data is critically post-processed: trajectories are decomposed into atomic steps, and under-represented skill types (e.g., multi-item shopping) are aggressively up-sampled. Additional auxiliary signals, synthesized QA, and explicit refusals are included to encode safety and action grounding (Awadallah et al., 24 Nov 2025).
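The learning-rate schedule described above can be sketched as follows (a minimal illustration; the decay target of 0 at the final step is our assumption, since only the cosine shape, warmup fraction, and peak are given):

```python
import math

def lr_at(step, total_steps=28_000, peak=5e-6, warmup_frac=0.10):
    """Cosine decay with linear warmup over the first 10% of steps."""
    warmup = int(total_steps * warmup_frac)
    if step < warmup:
        return peak * step / warmup                        # linear ramp 0 -> peak
    t = (step - warmup) / (total_steps - warmup)
    return peak * 0.5 * (1.0 + math.cos(math.pi * t))      # cosine peak -> ~0

assert lr_at(0) == 0.0
assert abs(lr_at(2_800) - 5e-6) < 1e-15   # peak reached at the end of warmup
assert lr_at(28_000) < 1e-12              # decayed to ~0 at the final step
```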

6. Empirical Performance, Ablations, and Significance

Comprehensive ablations demonstrate the efficacy of both the architecture and the training corpus:

  • Data scaling: Training on 1%, 10%, and 100% of FaraGen yields WebVoyager accuracies of approximately 48%, 65%, and 73.5%, respectively, evidencing strong performance returns from scalable synthetic data integration.
  • Model comparison: Fara-7B surpasses other 7B CUA baselines and is competitive with substantially larger models (e.g., GPT-4o SoM agents) in aggregate agentic performance benchmarks, while sustaining on-device deployability.
  • Training pipeline ablations: Subsystem improvements (e.g., seeing full action history, error retries, “BrowserBase” abstractions) more than doubled successful trajectory yields in WebTailBench evaluation.
  • Action planning and safety: Model behaviors such as stopping at critical steps and refusing inappropriate requests are explicitly learned from orchestrator-labeled trajectories and refusal samples.

These results position Fara-7B as a highly efficient, agentic, and robust transformer baseline for pixel-to-action applications in human-computer interaction (Awadallah et al., 24 Nov 2025).

7. Relation to Adjacent Architectures and Implications

Fara-7B is architecturally distinct from standard text-only LLMs (e.g., LLaMA, OLMo, Fanar Star) due to its explicit pixel embedding pipeline, multimodal input space, and uniquely agentic token output interface. Unlike models such as Fanar Star, which employs decoder-only transformers and pre-layer RMSNorm with SwiGLU activations (Team et al., 18 Jan 2025), Fara-7B adopts standard pre-norm LayerNorm and GeLU.

A plausible implication is that Fara-7B’s design reflects a growing trend toward multimodal, action-generative transformers for real-world agent deployments, emphasizing both data curation and modular architecture inheritance. The model leverages the extensibility of existing large multimodal transformer backbones, while carefully curating task-specific data flows and construction of pixel-token pipelines. This pattern suggests that future agentic transformers will likely be distinguished less by bespoke architectural variation and more by corpus scale, pipeline efficiency, and interface design (Awadallah et al., 24 Nov 2025, Team et al., 18 Jan 2025).
