NEMO-4-PAYPAL: Leveraging NVIDIA's Nemo Framework for empowering PayPal's Commerce Agent

Published 25 Dec 2025 in cs.AI | (2512.21578v1)

Abstract: We present the development and optimization of PayPal's Commerce Agent, powered by NEMO-4-PAYPAL, a multi-agent system designed to revolutionize agentic commerce on the PayPal platform. Through our strategic partnership with NVIDIA, we leveraged the NeMo Framework for LLM model fine-tuning to enhance agent performance. Specifically, we optimized the Search and Discovery agent by replacing our base model with a fine-tuned Nemotron small LLM (SLM). We conducted comprehensive experiments using the llama3.1-nemotron-nano-8B-v1 architecture, training LoRA-based models through systematic hyperparameter sweeps across learning rates, optimizers (Adam, AdamW), cosine annealing schedules, and LoRA ranks. Our contributions include: (1) the first application of NVIDIA's NeMo Framework to commerce-specific agent optimization, (2) LLM powered fine-tuning strategy for retrieval-focused commerce tasks, (3) demonstration of significant improvements in latency and cost while maintaining agent quality, and (4) a scalable framework for multi-agent system optimization in production e-commerce environments. Our results demonstrate that the fine-tuned Nemotron SLM effectively resolves the key performance issue in the retrieval component, which represents over 50% of total agent response time, while maintaining or enhancing overall system performance.

Summary

  • The paper’s main contribution is the development of a scalable agentic commerce system leveraging NVIDIA NeMo to optimize search, recommendation, and customer interactions.
  • It employs a four-stage LLM pipeline—query processing, semantic retrieval, ranking, and evaluation—to achieve significant latency and cost reductions.
  • Extensive fine-tuning using SFT and DPO on the Nemotron model yields measurable performance gains, setting a blueprint for enterprise-grade AI in e-commerce.

NEMO-4-PAYPAL: Scalable Agentic Commerce Optimization with NVIDIA NeMo

Overview of NEMO-4-PAYPAL and System Transformation

This paper presents the architectural and empirical advances underlying PayPal's transition from a payments service to a commerce-centric platform, centered on its Commerce Agent framework empowered by AI and LLMs. Leveraging NVIDIA's NeMo Framework, the system is refactored as a scalable multi-agent ensemble capable of handling complex search, recommendation, and customer-interaction cycles. The redesign addresses latency, cost, throughput, and domain-specific model adaptation challenges, establishing a state-of-the-art agentic commerce ecosystem.

PayPal's strategic infrastructure pivot (Figure 1) from payments to a full-spectrum commerce platform underscores the system's multi-agent orchestration, which coordinates LLM-based retrieval, ranking, and evaluation to deliver contextually relevant shopping experiences at scale. This transformation is not merely architectural; it is fundamentally a performance paradigm shift aligning agentic AI with industrial e-commerce operational requirements.

Figure 1: PayPal's shift from payments provider to agentic commerce platform, powered by NVIDIA NeMo-driven LLMs.

Commerce Search and Recommendation Pipeline

At the core of the framework is a highly modular Search and Recommendation pipeline. The system is decomposed into four salient components:

  1. Query Understanding/Expansion/Formulation: LLMs preprocess user queries, extracting attribute-value pairs and generating refined, structured queries.
  2. Retrieval: The LLM, using the HyDE methodology, generates hypothetical product embeddings to guide semantic retrieval over vast catalogs.
  3. Ranking: A secondary LLM sorts retrieved results based on personalized criteria.
  4. LLM Evaluator: Meta-evaluation functions arbitrate quality and consistency of agent output, with supervision over system metrics.

This hierarchical pipeline is built to ingest user session data, reformulate queries in line with user intent, execute efficient semantic search, and return precision-ranked recommendations, all under agentic orchestration.

Figure 2: Four-stage architecture for PayPal's commerce search/recommendation, highlighting LLM-mediated attribute extraction, retrieval, ranking, and evaluation.
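The four stages listed above can be sketched as a chain of plain functions. This is only a toy illustration under stated assumptions: the catalog, the attribute list, and every function name are hypothetical, and simple word matching stands in for the LLM calls that the real system makes at each stage.

```python
from dataclasses import dataclass, field


@dataclass
class Product:
    title: str
    tags: set


@dataclass
class StructuredQuery:
    text: str
    attributes: dict = field(default_factory=dict)


# Toy catalog; all entries are invented for illustration.
CATALOG = [
    Product("Heated ski gloves", {"ski", "heated", "gloves"}),
    Product("Waterproof phone pouch", {"ski", "waterproof", "phone"}),
    Product("Desk lamp", {"home", "lamp"}),
]

# Hypothetical attribute vocabulary used by the toy extractor.
KNOWN_ATTRIBUTES = {"heated", "waterproof"}


def understand_query(raw: str) -> StructuredQuery:
    """Stage 1: extract attribute-value pairs (an LLM call in the real system)."""
    words = raw.lower().split()
    attrs = {w: True for w in words if w in KNOWN_ATTRIBUTES}
    return StructuredQuery(text=raw.lower(), attributes=attrs)


def retrieve(query: StructuredQuery, catalog: list) -> list:
    """Stage 2: keep products that share at least one word with the query."""
    words = set(query.text.split())
    return [p for p in catalog if p.tags & words]


def rank(query: StructuredQuery, candidates: list) -> list:
    """Stage 3: order candidates by the number of requested attributes they match."""
    wanted = set(query.attributes)
    return sorted(candidates, key=lambda p: len(p.tags & wanted), reverse=True)


def evaluate(query: StructuredQuery, ranked: list) -> bool:
    """Stage 4: a trivial relevance check standing in for the LLM evaluator."""
    words = set(query.text.split())
    return bool(ranked) and all(p.tags & words for p in ranked)


def run_pipeline(raw: str) -> list:
    """Run all four stages and return product titles, or [] if evaluation fails."""
    query = understand_query(raw)
    ranked = rank(query, retrieve(query, CATALOG))
    if not evaluate(query, ranked):
        return []
    return [p.title for p in ranked]
```

Calling `run_pipeline("heated ski accessories")` walks a query through all four stages. Keeping each stage behind its own function is what would let one stage (for example, query understanding) be swapped for a smaller fine-tuned model without touching the others.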

Agentic System Architecture and Orchestration

The agentic framework is underpinned by backend agent orchestration using LangChain, integrating multiple LLMs with PayPal's proprietary infrastructure and cloud services. The orchestration module includes a Generic Agent Orchestrator, LLM Strategy selector, Conversation Planning, Task Planner, Agent Memory, Unified Session Management, and feedback-driven continual learning loops.

Personalization is achieved through offline user profile construction, session-aware preference extraction, and multi-turn dialogue management. The orchestration synchronizes agent components, ensuring real-time constraint satisfaction and contextual grounding across all e-commerce interactions. Figure 3

Figure 3: Agentic Commerce AI system, integrating commerce platform, agent orchestration, LLMs, and personalization.

Model Fine-Tuning and Optimization

The crux of agentic commerce scaling in production is fine-tuning foundational LLMs for low-latency, cost-efficient operation. PayPal's system exclusively employs the llama3.1-nemotron-nano-8B-v1 architecture, selecting it for its adaptability in commerce retrieval tasks.

Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) are the principal model adaptation methods. Twenty LoRA-based SLM variants are trained via extensive hyperparameter sweeps (learning rates, Adam/AdamW optimizers, cosine annealing, LoRA ranks), with evaluation on both synthetic and real-world commerce dialogues. The SFT-optimized Nemotron delivers:

  • Retrieval latency reduced by 57.9%
  • Agent latency reduced by 48.9%
  • GPU cost lowered by 45.5%
  • Quality score degradation of 21.7%, with tradeoffs managed via variant selection

Notably, DPO mitigates quality loss while sustaining latency improvements, showing only a 6.8% drop in quality and 2.0% in E2E metrics, outperforming non-fine-tuned baselines.
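A hyperparameter sweep of the kind described in this section can be sketched as a grid over learning rates, optimizers, and LoRA ranks, with a cosine annealing schedule applied to each run. The grid values below are hypothetical placeholders, not the paper's settings, and the 18 combinations they produce are not meant to reproduce the twenty variants reported above.

```python
import math
from itertools import product

# Hypothetical grid values; the actual settings are not given in this summary.
LEARNING_RATES = [1e-5, 5e-5, 1e-4]
OPTIMIZERS = ["adam", "adamw"]
LORA_RANKS = [8, 16, 32]


def cosine_annealing(step: int, total_steps: int, lr_max: float, lr_min: float = 0.0) -> float:
    """Cosine annealing: the learning rate decays from lr_max to lr_min over total_steps."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


def build_sweep() -> list:
    """Enumerate every combination in the grid as a config dict."""
    return [
        {"lr": lr, "optimizer": opt, "lora_rank": rank}
        for lr, opt, rank in product(LEARNING_RATES, OPTIMIZERS, LORA_RANKS)
    ]
```

Each config returned by `build_sweep()` would correspond to one training run; `cosine_annealing(step, total_steps, config["lr"])` gives the learning rate to use at a given step of that run.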

Figure 4: Nemotron-8B achieves marked latency/cost reduction compared to the baseline, with quantifiable quality tradeoff.
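As a rough consistency check on the reported figures, the arithmetic below relates the retrieval speed-up to the end-to-end speed-up. It assumes, hypothetically, that retrieval is the only stage whose latency changed; the summary does not confirm this, so the result is a back-of-the-envelope reading rather than a finding of the paper.

```python
def total_reduction(stage_share: float, stage_reduction: float) -> float:
    """Fractional drop in total latency when only one stage speeds up."""
    return stage_share * stage_reduction


def implied_share(total_drop: float, stage_reduction: float) -> float:
    """Stage share that would explain total_drop if nothing else changed."""
    return total_drop / stage_reduction


RETRIEVAL_REDUCTION = 0.579  # reported retrieval latency reduction
AGENT_REDUCTION = 0.489      # reported agent latency reduction

# Assumption: retrieval takes exactly half of the total response time.
at_half = total_reduction(0.5, RETRIEVAL_REDUCTION)
# Share of total time that retrieval would need for the reported agent-level drop.
share = implied_share(AGENT_REDUCTION, RETRIEVAL_REDUCTION)

print(f"retrieval at 50% of total -> total drop {at_half:.1%}")
print(f"share needed for a 48.9% total drop -> {share:.1%}")
```

Under this single-stage assumption, a retrieval share of exactly 50% would account for a total drop of roughly 29%, while the reported 48.9% would require retrieval to take roughly 84% of the response time. Either retrieval's share is well above 50%, which is compatible with the abstract's "over 50%", or other stages and the inference stack also contributed.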

Figure 5: GPU inference latency analysis showing B200 Blackwell's superiority over H100, with tensor parallelism scaling.

Figure 6: Fine-tuning with NeMo Framework yields improved alignment and agent performance.

Figure 7: Comparative quality and E2E score degradation for 8B-NFT and resilience for 8B-DPO.

Empirical Evaluation and Deployment Insights

Using finely annotated chat and prompt-response datasets—augmented for edge-case coverage—models are benchmarked under both rubric-based and pairwise LLM-as-a-Judge annotation. Deployment on NVIDIA NIM microservices enables exploitation of TensorRT-LLM, vLLM, and SGLang, with automatic hardware/configuration adaptation.

B200 Blackwell delivers a 35% latency improvement over H100 on a single GPU, and tensor parallelism yields a nearly 50% improvement when scaling from TP=1 to TP=4. The NeMo SDK facilitates robust configuration management, checkpointing, PEFT integration, and data format versatility, directly accelerating experimentation and system adaptation.

Theoretical and Practical Implications

This work demonstrates that domain-specific agentic optimization—utilizing PEFT, SFT, DPO methods—can achieve near real-time performance with controlled quality tradeoff in large-scale commerce environments. The combination of modular agent orchestration (LangChain), advanced hardware acceleration (B200), and deep NeMo integration creates a best-practice blueprint for production multi-agent AI deployments in e-commerce.

Theoretically, the research substantiates structured agentic frameworks and compromise-based model fine-tuning as viable solutions for latency-cost-quality tradeoffs. Practically, the Commerce Agent system sets new benchmarks for enterprise-grade AI search/recommendation pipelines, validating LLM-centric approaches over conventional IR methods.

Future Directions

Ongoing directions include further scaling multi-agent architectures, refining user personalization granularity, exploring multi-modal retrieval, expanding RL-based post-training, and integrating open-source/third-party LLM ensembles for enhanced coverage. Emphasis will remain on operational reliability, explainability, and continually adaptive optimization to maintain strong e-commerce KPIs.

Conclusion

NEMO-4-PAYPAL exemplifies effective large-scale agentic commerce optimization, merging advanced LLMs, fine-tuning strategies, and GPU acceleration to realize measurable performance and cost efficiencies in production settings. The synthesis of modular orchestration, rigorous model adaptation, and hardware tuning offers an extensible paradigm for future intelligent e-commerce agent systems.

Explain it Like I'm 14

What is this paper about?

This paper explains how PayPal made its shopping assistant faster and cheaper to run by teaming up with NVIDIA. They used NVIDIA’s NeMo tools to fine-tune a smaller AI model so the assistant could understand what shoppers want and search for products more quickly, without hurting the overall quality of its answers.

What were the goals?

Here are the goals in simple terms:

  • Make PayPal’s shopping assistant respond much faster, especially during product search.
  • Cut computing costs while keeping the assistant’s suggestions helpful and accurate.
  • Test a method for tuning smaller AI models to do specific e-commerce tasks well.
  • Build a setup that can improve different parts of a multi-agent system (many small AIs working together) in a real, large-scale online shopping environment.

How did they do it?

To keep things clear, think of a giant store and a helpful “digital librarian” who finds the right items when you ask for something. The paper describes how they trained and organized this librarian to work faster and smarter.

The Commerce Agent (many helpers working together)

PayPal’s assistant is made of several specialized mini-AIs, each with a job:

  • One understands your question and turns it into a “search-ready” format.
  • One finds matching products (retrieval).
  • One orders the results from best to okay (ranking).
  • One checks quality and keeps the conversation on track.

These mini-AIs are coordinated by an orchestration framework that manages memory, tools, and product data.

Training smaller models to do a focused job

Instead of using a big, general AI model for everything, they trained a Small LLM (SLM) from the Nemotron family (about 8 billion parameters) for the specific task of query understanding and search. They fine-tuned this model using NVIDIA’s NeMo Framework and a method called LoRA, which lets you adapt a model by training a small number of added parameters—like teaching a skilled worker a new specialty without retraining their entire education.
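To make the LoRA idea concrete, here is a tiny sketch in plain Python. It uses the standard LoRA formulation, where a frozen weight matrix W is adjusted by a small low-rank product, W + (alpha / r) * B * A. The matrix sizes, values, and function names are made up for illustration and have nothing to do with the real 8-billion-parameter model.

```python
def matmul(a: list, b: list) -> list:
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w: list, a: list, b: list, alpha: float) -> list:
    """Return W + (alpha / r) * B @ A, where r is the LoRA rank."""
    rank = len(a)  # A has shape (r, d_in); B has shape (d_out, r)
    scale = alpha / rank
    delta = matmul(b, a)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def count_params(matrix: list) -> int:
    """Number of entries in a matrix."""
    return sum(len(row) for row in matrix)


# Frozen 4x4 base weight (identity here) and a rank-1 adapter.
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
A = [[0.5, 0.0, 0.0, 0.0]]          # shape (1, 4), trainable
B = [[0.0], [0.0], [0.0], [0.0]]    # shape (4, 1), trainable, zero-initialised

W_start = lora_weight(W, A, B, alpha=2.0)   # equals W because B is all zeros
B[1][0] = 1.0                               # pretend one training update happened
W_tuned = lora_weight(W, A, B, alpha=2.0)
```

Here the adapter has 8 trainable numbers (`A` plus `B`) against 16 in the frozen `W`. For large matrices and small ranks, that gap becomes very wide, which is why LoRA fine-tuning is cheap compared with retraining the whole model.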

Making search smarter with HyDE

They used a technique called HyDE (Hypothetical Document Embeddings). Imagine you ask, “Suggest tech accessories for skiing.” The model:

  • Extracts key details (category, attributes like “heated,” “waterproof,” etc.).
  • Writes a short “hypothetical” product description that captures what you really want.
  • Uses this to find similar real products in the catalog.

It’s like describing your dream item so the search engine can find the closest real matches.
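The HyDE steps above can be sketched in a few lines. Everything here is a simplified stand-in: a word-count vector plays the role of a real embedding model, a canned lookup plays the role of the LLM that writes the hypothetical description, and the catalog entries are invented.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real encoder)."""
    return Counter(text.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two count vectors."""
    dot = sum(u[t] * v[t] for t in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def write_hypothetical_product(query: str) -> str:
    """Stand-in for the LLM call that drafts an ideal product description."""
    canned = {
        "tech accessories for skiing":
            "heated waterproof gloves with touchscreen fingertips for skiing",
    }
    return canned.get(query.lower(), query)


def hyde_search(query: str, catalog: list, k: int = 2) -> list:
    """Embed the hypothetical description, not the raw query, then rank the catalog."""
    probe = embed(write_hypothetical_product(query))
    scored = sorted(catalog, key=lambda doc: cosine(probe, embed(doc)), reverse=True)
    return scored[:k]


# Invented catalog entries for illustration.
CATALOG = [
    "heated gloves with touchscreen fingertips",
    "waterproof action camera for skiing",
    "leather office chair",
]
```

With this toy catalog, `hyde_search("tech accessories for skiing", CATALOG)` ranks the heated gloves first. Embedding the raw query instead would give the gloves a similarity of zero, because the query and that product share no words; the hypothetical description is what bridges the gap.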

Measuring performance and quality

They ran many experiments, changing training settings (like learning rates and optimizers) to find the best model. To judge quality, they sometimes used another AI to rate the results (LLM-as-a-Judge), comparing pairs of outputs and using a rubric to score them.

They also tried Direct Preference Optimization (DPO), which teaches the model to prefer better answers by showing it pairs: a “chosen” response and a “rejected” one. It’s like learning taste by comparing examples instead of memorizing rigid rules.
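The "chosen versus rejected" idea can be written down as the standard DPO loss for a single pair. The sketch below assumes you already have the log-probability that the model being trained (the policy) and a frozen reference model assign to each response; the parameter names and the default `beta` are illustrative choices, not values from the paper.

```python
import math


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """DPO loss for one preference pair, from sequence log-probabilities."""
    chosen_gain = policy_chosen - ref_chosen        # log-ratio for the chosen response
    rejected_gain = policy_rejected - ref_rejected  # log-ratio for the rejected response
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))
```

When the policy agrees exactly with the reference, the loss is log(2). It falls below that as the policy raises the probability of the chosen response relative to the rejected one, and rises above it when the policy moves the wrong way.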

Speeding up with GPUs and smart deployment

They tested different NVIDIA GPUs (H100 and the newer B200 “Blackwell”) and used parallel processing (splitting work across chips) to cut waiting time (“latency”). They deployed with NVIDIA NIM, which picks the fastest inference engine automatically. This made the system even quicker.

What did they find?

Here are the main results:

  • The fine-tuned Nemotron small model sped up the slowest part: turning a user’s message into a smart search query. That area used to take more than half of the assistant’s total time.
  • Overall response time went down by roughly half. Retrieval got faster by about 58%, and the assistant’s total latency dropped by around 49%.
  • Monthly GPU costs dropped by about 45%.
  • Quality stayed competitive. In some tests, the fine-tuned model matched or slightly improved the quality compared to the base setup. DPO helped minimize quality drops when pushing for more speed.
  • Newer GPUs (B200) and running tasks in parallel gave noticeable extra speed-ups.

In short: the assistant became faster and cheaper, and its recommendations stayed helpful.

Why does it matter?

  • Better shopping experience: Faster replies mean smoother conversations and less waiting when you search for products.
  • Smarter search: The assistant understands your intent, not just keywords, so it finds products that fit your needs more closely.
  • Lower costs at scale: Cutting compute costs makes it easier to offer these benefits to millions of users.
  • Practical blueprint: The paper shows a clear, repeatable way to improve specific parts of a large AI system, which other companies can follow.
  • Future-ready: Using smaller, fine-tuned models for focused tasks can be more efficient than relying on one giant model for everything.

Takeaway

By fine-tuning a small, focused AI model with NVIDIA’s NeMo tools and smart training methods like LoRA and DPO, PayPal made its commerce assistant much faster and cheaper while keeping its quality strong. This approach is a practical path for building quick, reliable, and personalized shopping experiences at a massive scale.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

The following list captures what remains missing, uncertain, or unexplored in the paper. Each item is phrased to be concrete and actionable for future research or engineering work.

  • Baseline clarity: The “base model” and baseline retrieval stack are insufficiently described (architecture, parameter count, inference engine, quantization settings, tokenizer, prompt template); provide a fully specified baseline to enable apples-to-apples comparisons.
  • Evaluation metrics coverage: The study relies heavily on LLM-as-a-Judge scores; add standard IR metrics (Recall@k, MRR, NDCG@k), grounding accuracy (catalog match rate), and schema adherence rates to quantify retrieval and ranking effectiveness.
  • Human and online evaluation: No human-in-the-loop assessments or production A/B experiments are reported; measure business metrics (conversion rate, CTR, add-to-cart, dwell time, bounce rate) and conduct controlled A/B tests to validate offline gains translate into user impact.
  • Judge model reliability: The paper does not specify the judge model(s), calibration procedures, or agreement metrics; assess inter-annotator reliability (Krippendorff’s alpha) and cross-judge consistency, and report bias checks for judge models.
  • Dataset transparency: The composition of the 10k synthetic + real chat dataset lacks detail (category distribution, language mix, geography, session length, query types, long-tail coverage, label provenance); publish a data card covering sampling, annotation guidelines, and known biases.
  • Synthetic data effects: The impact of synthetic data on model behavior and generalization is not analyzed; run controlled ablations on varying synthetic-to-real ratios and measure downstream retrieval/ranking performance, hallucination rates, and domain overfitting.
  • Privacy and compliance: Using customer chats to build profiles raises privacy and regulatory concerns; document PII handling, consent, minimization, retention, GDPR/CCPA compliance, DSAR workflows, and any differential privacy or de-identification techniques.
  • Personalization cold start: The approach depends on purchase history; develop and evaluate strategies for new users (contextual bandits, demographic priors, session-based signals) and for new products (item cold start).
  • Fairness and inclusivity: Gender-aligned recommendations are enforced without evaluation of fairness or inclusivity; assess performance across genders, non-binary users, regions, and merchants, and measure stereotype amplification and disparate impact.
  • Safety/guardrail robustness: Safety policies are described but not stress-tested; evaluate against prompt injection, jailbreaks, adversarial queries, illegal or harmful item requests, and measure guardrail precision/recall and false positives.
  • Retrieval method ablation: HyDE is adopted without comparative baselines; compare HyDE against BM25, static/dynamic query expansion, dual-encoder dense retrieval (e.g., E5, Contriever), hybrid sparse+dense, and multi-vector encoders; report IR metrics per method.
  • Embedding and index details: The embedding models, vector store type, ANN algorithm (e.g., HNSW, IVF-PQ), and index parameters are unspecified; detail and ablate encoder choices, dimensionality, quantization, recall-latency trade-offs, and re-ranking strategies.
  • Ranking module evaluation: Ranking is delegated to another LLM with no comparison to classic learning-to-rank (LambdaMART, XGBoost LTR, neural LTR); benchmark accuracy (NDCG@k), calibration, and consistency across queries/catalog segments.
  • Pipeline-level ablations: Query understanding, retrieval, and ranking contributions are confounded; isolate and quantify each stage’s effect via ablations (e.g., swapping only query formulation vs. only retrieval vs. only ranking) on both latency and quality.
  • Hardware vs. model improvements: Latency gains mix hardware (B200 vs. H100), engine, and fine-tuning effects; decompose contributions via controlled experiments to attribute improvements to model fine-tuning versus inference stack versus GPU architecture.
  • Load and tail latency: Only average latencies are discussed; report p90/p95/p99 latencies, throughput (req/s), concurrency scaling, queueing effects, and backpressure behavior under production-like load.
  • Quantization and memory: Inference quantization settings (INT8/FP8/FP16), KV cache strategies, and memory footprints are unreported; evaluate quantization impacts on latency, cost, and quality, including stability across diverse query lengths.
  • Guided JSON reliability: Guided JSON is mentioned without error metrics; measure schema adherence, parse failure rates, recovery strategies, and downstream impacts on tool calls and retrieval consistency.
  • Hallucination grounding: HyDE may hallucinate product attributes; quantify hallucination rates, catalog grounding accuracy (SKU match, attribute correctness), and the effectiveness of filters to suppress non-existent products or brands.
  • Catalog dynamism: No evaluation of model robustness to inventory changes, out-of-stock items, or rapid catalog drift; build tests for freshness handling, reindexing schedules, and stale recommendation suppression.
  • Multi-lingual and regional support: Global PayPal use is implied but language coverage (non-English), locale formatting (currency, units), and regional regulatory constraints are not addressed; evaluate multilingual performance and localization.
  • Tool success and orchestration: The agent’s tool invocation reliability, error handling, recovery policies, and orchestration metrics (tool success rate, plan execution success, loop detection) are not reported; instrument and evaluate these systematically.
  • DPO methodology detail: Preference data sources (human vs. model-generated), rubric, sampling, and label quality are unclear; expand DPO dataset, report hyperparameters, training stability, and compare SFT vs. DPO on identical setups and metrics.
  • Reproducibility: Training seeds, durations, compute budgets, LoRA ranks, learning rates, schedulers, parameter counts, and prompt templates are not fully specified; provide a reproducibility appendix or artifacts (configs, logs, ablation results).
  • Business cost modeling: “45% monthly GPU cost reduction” lacks context (traffic volume, concurrency, instance pricing, autoscaling policies); report cost-per-query, cost-per-session, and sensitivity analyses under varying traffic profiles.
  • Energy and sustainability: Energy use and carbon footprint are not measured; estimate energy per 1k queries and evaluate sustainability improvements alongside cost and latency.
  • User experience outcomes: The trade-off between modest quality drops and latency gains is not tied to UX thresholds; study how latency vs. quality trade-offs affect user satisfaction and business KPIs.
  • Error analysis: The paper lacks qualitative error categorization; perform error analyses on failed/rejected cases (attribute extraction errors, mis-gendering, off-target categories, irrelevant products) to guide targeted fixes.
  • Security and compliance for deployment: NIM microservices and LangChain orchestration introduce supply chain and container risks; document CVE scanning, SBOM, isolation, secrets management, and runtime policy enforcement.
  • Continuous learning and monitoring: There is no described pipeline for drift detection, automated retraining triggers, rollback criteria, or gating tests; design and report a continuous evaluation/governance framework for safe iteration.
  • Generalization across scenarios: Performance is not stratified by category, seasonality, event-type queries (e.g., holidays, travel), or long-tail intents; add stratified benchmarks and stress tests across diverse commerce scenarios.
  • Clarity of reported results: Several captions state “latency up 57.9%” while implying improvement; standardize sign conventions and report absolute values, confidence intervals, and statistical significance to avoid ambiguity.
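The standard IR metrics named in the list above (Recall@k, MRR, NDCG@k) are straightforward to compute once ranked results and relevance labels are available. The sketch below shows one common formulation of each; the function names and the use of a log2 position discount with linear gains are choices made here, not details taken from the paper.

```python
import math


def recall_at_k(ranked: list, relevant: set, k: int) -> float:
    """Fraction of relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list, relevant: set) -> float:
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs: list) -> float:
    """MRR over a list of (ranked, relevant) pairs, one pair per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(ranked: list, gains: dict, k: int) -> float:
    """NDCG@k with graded gains and a log2 position discount."""
    def dcg(values):
        return sum(g / math.log2(i + 2) for i, g in enumerate(values))

    actual = dcg([gains.get(item, 0.0) for item in ranked[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal else 0.0
```

Reporting these per pipeline stage and per retrieval method would complement the LLM-as-a-Judge scores with metrics that do not depend on a judge model.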

Practical Applications

Immediate Applications

Below is a concise set of deployable applications that can be implemented now using the paper’s methods and findings.

  • Latency-optimized query formulation for e-commerce search and recommendation
    • Sector: Retail/e-commerce, Software
    • Tools: Fine-tuned Nemotron SLM (llama3.1-nemotron-nano-8B-v1) via NeMo SDK; HyDE-based retrieval; LangChain orchestration; JSON-guided outputs; NVIDIA NIM for inference
    • Dependencies: High-quality catalog metadata and embeddings; reliable vector search; consented user profiles; GPU access (H100/B200); production-grade orchestration
  • Cost reduction in LLM-powered commerce agents through SLM replacement
    • Sector: Retail/e-commerce, Finance (FinOps), Software
    • Tools: PEFT (LoRA/QLoRA) fine-tuning with NeMo; sharded checkpoints; 3D parallelism; NIM engine selection
    • Dependencies: Comparable quality to baseline LLMs; robust evaluation pipeline; workload profiling to validate the 45% GPU cost reduction claim
  • GPU capacity planning and inference tuning
    • Sector: Software/ML Ops, Energy/Green AI
    • Tools: NIM microservices with TensorRT-LLM/vLLM/SGLang; TP sweeps (TP=1..4); hardware benchmarking (H100 vs B200)
    • Dependencies: Access to Blackwell (B200) or H100; latency/throughput targets; budget constraints; driver/runtime compatibility
  • Production-grade multi-agent commerce orchestration
    • Sector: Retail/e-commerce, Software
    • Tools: LangChain-based Generic Agent Orchestrator; Conversation Planning; Task Planner; Agent Tools (search, ranking, cart); Unified Session Management
    • Dependencies: Stable domain APIs (catalog, pricing, inventory, cart); session persistence; LLM Evaluator integration; guardrails for safety
  • Personalized top‑K recommendations grounded in user profiles
    • Sector: Retail/e-commerce, Marketing
    • Tools: Offline user profile generation from purchase history; fine-tuned LLM ranking; HyDE retrieval grounding
    • Dependencies: Privacy-compliant data pipelines (consent, regional regulations); profile freshness; bias/fairness checks (e.g., gender alignment rules)
  • Synthetic data generation and preference-pair pipelines (SFT/DPO)
    • Sector: Software/ML, Academia (applied ML)
    • Tools: Synthetic prompt–response generation; DPO preference datasets; LLM-as-a-Judge rubric evaluation
    • Dependencies: Quality controls for synthetic data; judge reliability; domain-specific rubrics; reproducible evaluation
  • Schema-bound JSON generation for downstream integration
    • Sector: Software, Retail/e-commerce
    • Tools: Guided JSON outputs for category, attribute, product lists; validation and post-processing services
    • Dependencies: Strict schema definitions; error handling; catalog alignment and normalization
  • Conversational shopping assistance for web/mobile
    • Sector: Retail/e-commerce, Daily life
    • Tools: Commerce Agent front ends; multi-turn dialogue planning; memory framework; personalization engine
    • Dependencies: UX instrumentation; fallback flows; safety prompts (avoid illegal/weapons/explicit content); robust connection to inventory
  • Continuous agent monitoring with LLM-as-a-Judge
    • Sector: Software/ML Ops, Retail/e-commerce
    • Tools: Pairwise and rubric-based evaluation; end-to-end scoring (query formulation, attribute extraction, recommendation)
    • Dependencies: Calibrated rubrics; periodic human validation; drift detection; CI/CD integration
  • Enterprise dev productivity boost via NeMo training recipes
    • Sector: Software/ML, Enterprise IT
    • Tools: Hydra-first config management; PEFT recipes; distributed optimizers; checkpoint portability
    • Dependencies: Team skills on NeMo; GPU cluster availability; experiment tracking (e.g., MLflow/W&B)
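For the schema-bound JSON generation item above, the validation and error-handling step might look like the following minimal sketch. The schema fields mirror the category, attribute, and product-list outputs mentioned in that item, but the exact field names, types, and the `parse_agent_output` helper are hypothetical.

```python
import json

# Hypothetical schema: field name -> required Python type.
QUERY_SCHEMA = {"category": str, "attributes": dict, "products": list}


def parse_agent_output(raw: str, schema: dict = QUERY_SCHEMA):
    """Return (payload, errors); payload is None when validation fails."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return None, ["top-level value must be an object"]
    errors = []
    for name, expected in schema.items():
        if name not in payload:
            errors.append(f"missing field: {name}")
        elif not isinstance(payload[name], expected):
            errors.append(f"wrong type for {name}")
    return (payload if not errors else None), errors
```

Counting how often `errors` is non-empty gives the parse-failure and schema-adherence rates that the Knowledge Gaps section asks for, and the error list can drive a retry or fallback path.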

Long-Term Applications

The following applications require further research, scaling, or development before broad deployment.

  • Autonomous agentic commerce with sub‑2s end‑to‑end performance at global scale
    • Sector: Retail/e-commerce
    • Tools: Fully optimized multi-agent workflows; dynamic model selection; global session/state management; caching strategies
    • Dependencies: Robust cross-region infrastructure; multi-tenant scaling; advanced observability; SLAs
  • Cross-domain adaptation of retrieval‑optimized SLMs
    • Sector: Healthcare, Education, Finance, Enterprise Knowledge Management
    • Tools: Domain-specific SFT/DPO pipelines; HyDE-style structured retrieval over regulated corpora (e.g., formularies, curricula, product specs)
    • Dependencies: Domain data access and compliance (HIPAA, FERPA, financial regulations); ontology/taxonomy alignment; subject-matter expert evaluation
  • Privacy-preserving personalization (federated and on-device learning)
    • Sector: Retail/e-commerce, Policy/Regulatory, Software
    • Tools: Federated fine-tuning; differential privacy; secure enclaves; consent management
    • Dependencies: Legal frameworks; device heterogeneity; robust privacy guarantees; auditability
  • Multimodal agentic commerce (text + vision + speech)
    • Sector: Retail/e-commerce, Assistive technologies
    • Tools: Vision models for image-based search; speech ASR/TTS; multimodal NeMo workflows
    • Dependencies: Image metadata quality; multilingual speech support; real-time inference scaling; UX design for multimodal interactions
  • Adaptive inference planners for energy-aware cost/latency optimization
    • Sector: Energy/Green AI, Software/ML Ops
    • Tools: Automatic engine/hardware selection based on traffic and cost; smart TP/PP/DP orchestration; carbon-aware scheduling
    • Dependencies: Accurate telemetry; policy constraints; fleet diversity; dynamic workload prediction
  • Standardized evaluation and governance frameworks for agentic commerce
    • Sector: Policy/Regulatory, Industry consortia, Academia
    • Tools: Open benchmarks (latency, cost, quality, fairness, safety); red-team protocols; explainability metrics; audit trails
    • Dependencies: Stakeholder alignment (platforms, regulators, researchers); shared datasets; continuous updates to standards
  • Negotiation and post‑purchase autonomous support agents
    • Sector: Retail/e-commerce, Finance (offers, refunds), Customer service
    • Tools: Multi-step reasoning and constraints; integration with returns/exchange, dispute resolution, dynamic pricing
    • Dependencies: Clear business policies; risk controls; human-in-the-loop escalation; compliance with consumer protection laws
  • Merchant/partner onboarding assistants with catalog normalization
    • Sector: Retail/e-commerce, B2B marketplaces
    • Tools: LLM-driven attribute extraction, taxonomy mapping, deduplication; quality scoring
    • Dependencies: Merchant data variation; taxonomy governance; conflict resolution; bulk processing at scale
  • Robust safety, bias, and fairness systems for personalized commerce
    • Sector: Policy/Regulatory, Retail/e-commerce
    • Tools: Guardrailing APIs; bias audits (e.g., gender alignment logic in the paper); content filters; fairness-adjusted ranking
    • Dependencies: Transparent criteria; legal compliance; periodic audits; diverse evaluation cohorts
  • Cross-merchant knowledge graphs and session-level reasoning
    • Sector: Retail/e-commerce, AdTech/MarTech
    • Tools: Unified product graph; persistent memory across sessions; intent evolution tracking; cross-sell/upsell reasoning
    • Dependencies: Data-sharing agreements; identity resolution; privacy controls; graph maintenance at scale
  • Academic research programs on agentic planning and retrieval
    • Sector: Academia
    • Tools: Public variants of SFT/DPO workflows; ablation studies of PEFT ranks, schedulers, optimizers; LLM-as-a-Judge reliability research
    • Dependencies: Accessible datasets; ethical review; reproducibility infrastructure; funding and compute access
  • Integrated commerce + payments agents (PayPal-native)
    • Sector: Finance, Retail/e-commerce
    • Tools: End-to-end flows from discovery → cart → checkout → financing options; risk signals; offer recommendation
    • Dependencies: Secure payments integration; fraud detection alignment; regulatory compliance (KYC/AML); latency budgets

Notes on Key Assumptions and Dependencies Across Applications

  • Data and privacy: Applications rely on consented, compliant use of user profiles and purchase histories; privacy-preserving techniques become critical as personalization deepens.
  • Catalog/embedding quality: Effective retrieval (especially HyDE) depends on accurate product embeddings, up-to-date inventory, and clean attribute taxonomies.
  • Evaluation reliability: LLM-as-a-Judge should be supplemented with human audits to mitigate rubric bias and ensure real-world quality.
  • Hardware/runtime: Performance gains (e.g., B200 > H100) assume access to newer GPU architectures and compatible inference stacks (NIM/TensorRT-LLM/vLLM).
  • Trade-offs: The paper shows immediate latency/cost improvements may reduce some quality metrics; DPO helps but requires careful tuning and domain-specific evaluation.
  • Safety/fairness: Guardrails must be maintained and extended (e.g., gender alignment rules, illegal content avoidance) as the agent scales to new domains and modalities.

Glossary

  • 3D parallelism: A training approach that combines tensor, pipeline, and data parallelism to scale model training across devices. "3D parallelism (TP/PP/DP) with sequence and context parallelism maintained stable, composable training;"
  • Agent Memory Framework: A component that maintains conversation context and user state across sessions for an agent. "the Agent Memory Framework to maintain conversation context and user state across sessions,"
  • Agent Orchestration Framework: An infrastructure layer that coordinates multiple agents and tools to execute complex workflows. "Agentic Commerce AI system: (1) Commerce Platform (2) Agent Orchestration Framework, (3) LLMs Integration (4) Personalization Engine."
  • Agent Task Planner: A module that decomposes complex user requests into actionable sub-tasks. "and the Agent Task Planner decomposes complex user requests into executable sub-tasks."
  • agentic artificial intelligence: AI systems that autonomously plan and act to achieve goals with minimal human intervention. "The emergence of agentic artificial intelligence represents a paradigm shift in how consumers interact with e-commerce platforms"
  • agentic commerce: Commerce interactions powered by autonomous, goal-directed agents. "a multi-agent system designed to revolutionize agentic commerce on the PayPal platform."
  • Blackwell architecture: NVIDIA’s GPU architecture (e.g., B200) designed for high-performance AI inference and training. "the NVIDIA B200 Blackwell architecture consistently outperforms the H100 across all tensor parallelism configurations"
  • Conversation Planning Framework: A module that manages multi-turn dialogue structure and flow. "the Conversation Planning Framework manages multi-turn dialogue flow,"
  • cosine annealing schedules: Learning rate schedules that decay the learning rate from its initial value toward a minimum along a cosine curve, which tends to improve optimization stability. "cosine annealing schedules, and LoRA ranks."
  • Direct Preference Optimization (DPO): A training method that learns from pairs of preferred versus rejected outputs to align with human preferences. "Direct Preference Optimization (DPO) is a training method that learns from relative preferences rather than absolute labels."
  • distributed optimizers: Optimization algorithms that run across multiple devices or nodes to scale training. "combined with sharded checkpoints and distributed optimizers, further simplified experimentation and scale-out operations."
  • Domain Knowledge Center: A service that provides standardized access to product catalogs and business logic. "while the Domain Knowledge Center provides access to product catalogs and business logic through standardized protocols (API, MCP, A2A)."
  • embedding space: The vector space in which items (e.g., queries and products) are represented for similarity-based retrieval. "by identifying similar real products in the embedding space, effectively grounding the generated hypothetical description to actual inventory..."
  • E2E Latency: The total end-to-end time from user input to agent response. "achieves a substantial reduction in E2E Latency (+17.9%), indicating faster processing speed."
  • E2E Score: A composite metric evaluating overall system quality across multiple components. "and E2E Score (-17.4%)"
  • Generic Agent Orchestrator: The core controller that coordinates specialized agent components and tools. "At its core, the Generic Agent Orchestrator coordinates between multiple specialized components:"
  • guardrailing: Safety mechanisms that constrain or monitor model outputs to prevent harmful or undesired behavior. "retrieval-augmented generation, guardrailing, data curation, and pretrained models."
  • guided JSON: Constraining model outputs to structured JSON formats to streamline downstream processing. "guided JSON methods for streamlined post-processing operations."
  • Hydra: A configuration management framework for organizing experiments via YAML/CLI. "Configuration management: Hydra-first YAML/CLI system enabled parallel tracking of 20 LoRA variants without manual scripting;"
  • Hyperparameter sweeps: Systematic exploration of hyperparameter combinations to optimize model performance. "systematic hyperparameter sweeps across learning rates, optimizers (Adam, AdamW), cosine annealing schedules, and LoRA ranks."
  • Hypothetical Document Embeddings (HyDE): A retrieval technique where an LLM generates hypothetical documents that are then used for dense retrieval. "we adopt the Hypothetical Document Embeddings (HyDE) approach for product retrieval."
  • LangChain: A framework for building LLM-powered applications and agent orchestration. "utilizing the LangChain \cite{mavroudis2024langchain} agent orchestration framework to develop intelligent products"
  • LLM Evaluator: A component that continuously monitors and assesses agent performance. "the LLM Evaluator continuously monitors agent performance,"
  • LLM Model Gardens: Curated collections of models from which an agent can select the most suitable one. "the LLM Strategy module selects optimal models from the LLM Model Gardens (for instance: NVIDIA NIM Cloud),"
  • LLM Strategy: A module responsible for selecting and configuring the optimal LLM for a task. "the LLM Strategy module selects optimal models from the LLM Model Gardens (for instance: NVIDIA NIM Cloud),"
  • LLM-as-a-Judge: An evaluation paradigm where an LLM is used to assess the quality of model outputs. "Our evaluation employs LLM-as-a-Judge"
  • LoRA: A parameter-efficient fine-tuning method that injects low-rank adapters into model layers. "training 20 LoRA-based model variants"
  • llama3.1-nemotron-nano-8B-v1: NVIDIA's 8B-parameter Nemotron Nano model, derived from Llama 3.1, used as the base model in the experiments. "focusing on the llama3.1-nemotron-nano-8B-v1 architecture."
  • Nemotron: NVIDIA’s family of LLMs used as a base for fine-tuning. "fine-tuned Nemotron SLM effectively resolves the key performance issue in the retrieval component,"
  • NeMo Framework: NVIDIA’s end-to-end platform for data preparation, training, fine-tuning, and deployment of LLMs. "We collaborated with NVIDIA to solve these performance challenges by leveraging the NeMo Framework"
  • NIM: NVIDIA Inference Microservices—containerized, pre-optimized inference services integrating engines like TensorRT-LLM and vLLM. "NVIDIA NIM is a containerized deployment solution that provides pre-optimized inference microservices"
  • Parameter-Efficient Fine-Tuning (PEFT): Techniques that fine-tune large models by learning a small number of additional parameters. "NeMo's Parameter-Efficient Fine-Tuning (PEFT) support for LoRA/QLoRA techniques"
  • QLoRA: A PEFT method that combines quantization with LoRA to reduce memory footprint during fine-tuning. "LoRA/QLoRA techniques"
  • Retrieval-Augmented Generation: A method that integrates retrieved documents into the generation process to ground responses. "retrieval-augmented generation, guardrailing, data curation, and pretrained models."
  • sequence parallelism: Splitting activations along the sequence dimension across devices to scale training/inference. "sequence and context parallelism maintained stable, composable training;"
  • sharded checkpoints: Checkpoints saved as multiple partitions to facilitate distributed training and flexible loading. "combined with sharded checkpoints and distributed optimizers"
  • SGLang: A high-performance LLM inference engine. "including TensorRT-LLM, vLLM, and SGLang."
  • Small LLM (SLM): A compact LLM tailored to specific tasks for lower latency and cost. "small LLMs (SLMs) optimized for commerce retrieval tasks."
  • Supervised Fine Tuning (SFT): Fine-tuning a model using labeled examples with explicit targets. "The SFT champion model achieved a quality score of 2.49 out of 5"
  • Tensor parallelism: Distributing model tensors across multiple devices to accelerate inference and training. "Tensor parallelism configurations (TP=1, TP=2, TP=4)"
  • TensorRT-LLM: NVIDIA’s optimized inference engine for LLMs. "including TensorRT-LLM, vLLM, and SGLang."
  • Top-K: Selecting the top K items (e.g., recommendations) based on a scoring function. "create personalized Top-K recommendations"
  • vLLM: An LLM serving system optimized for high-throughput inference. "including TensorRT-LLM, vLLM, and SGLang."
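
To illustrate the Direct Preference Optimization (DPO) entry above, the following is a minimal sketch of the standard per-pair DPO loss. It is not the paper's training code: the function name, the argument names, and the default `beta=0.1` are assumptions chosen for the example, and the inputs are summed log-probabilities of the preferred ("chosen") and rejected responses under the policy being trained and under a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_logratio - rejected_logratio)))."""
    # How much more (or less) likely each response is under the policy
    # than under the reference model, in log space.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) equals log(1 + exp(-x)); computed in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: no preference learned yet.
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy raises the chosen response and lowers the rejected one: loss drops.
aligned = dpo_loss(-4.0, -6.0, -5.0, -5.0)
# Policy moves the wrong way: loss rises.
misaligned = dpo_loss(-6.0, -4.0, -5.0, -5.0)
```

The loss depends only on the relative margin between the two responses, which is why DPO can learn from preference pairs without absolute quality labels.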

Open Problems

We found no open problems mentioned in this paper.
