Autonomous Agents

Autonomous Agents-research papers. Updated daily. See as well the Resources-section.

Research papers

Chronological order.

4th March 2025

From Metaphor to Mechanism: How LLMs Decode Traditional Chinese Medicine Symbolic Language for Modern Clinical Relevance

Perceptual-Chain-Of-Thought framework: introduces multi-agent system for TCM metaphor interpretation, with Entity Mapping and Splitting Layer, Perceptual Layer, Metaphor understanding Layer, and Perceptual KG Subset components.
This framework uses chain-of-thought reasoning and multi-agent collaboration to bridge TCM and Western medicine understanding of metaphors.
The system aims to improve accuracy and transparency in translating TCM symbolic language for modern clinical relevance.

FINARENA: A HUMAN-AGENT COLLABORATION FRAMEWORK FOR FINANCIAL MARKET ANALYSIS AND FORECASTING

FinArena (Human-Agent collaboration framework for financial market analysis and forecasting): introduces a novel framework for financial analysis, integrating Human Module (interactive user interface), Machine Module (LLM-based multi-agent system), Time Series Agent (stock time series prediction), News Agent (news insights and RAG), Statement Agent (financial statement analysis), AI Expert (investment decision synthesis), Report Agent (human-agent interaction), Data Set (multimodal financial data), Output (investment action suggestion), and Web Port (information retrieval and analysis).
FinArena framework employs specialized agents for time series, news, and statements, combined with an AI expert for synthesizing insights and a report agent for human interaction, utilizing multimodal financial data for enhanced stock trend predictions and personalized investment decisions.
The framework leverages adaptive Retrieval-Augmented Generation (RAG) within the News Agent to mitigate hallucinations and improve accuracy when processing unstructured news data, and incorporates iterative reasoning in the Statement Agent for in-depth financial statement analysis.

MPO: Boosting LLM Agents with Meta Plan Optimization

MPO (Meta Plan Optimization): introduces meta plan optimization framework with meta planner generating abstract guidance, agent providing execution feedback, and prompt incorporating meta plan for enhanced planning.
MPO framework leverages meta plans to provide explicit guidance for LLM agents, enabling continuous optimization based on agent's task execution feedback.
MPO enhances agent planning by decoupling meta plans from specific environmental details, improving generalization and task completion efficiency without agent retraining.

Playing games with Large language models: Randomness and strategy

LangChain: introduces game-playing framework with LLM, player agents, evaluation, and history for game simulations.
This framework facilitates bidirectional LLM interactions for repeated games with history feedback.
The framework enables analysis of LLM strategic adaptation and randomness in game scenarios.

Generator-Assistant Stepwise Rollback Framework for Large Language Model Agent

GA-Rollback (Generator-Assistant Stepwise Rollback) introduces a framework with Environment, LLM (Generator), GA-Rollback, Assistant, Rollback Operation, Evaluation, Feedback Evaluation, and "wait-k" Strategy to improve decision-making in LLM agents by addressing error propagation through rollback operations and quality evaluation.
The framework utilizes a Generator (LLM) to interact with the Environment, while an Assistant examines actions and triggers Rollback Operation upon error detection, incorporating Feedback Evaluation and "wait-k" Strategy for enhanced performance.
GA-Rollback framework aims to ensure credible reasoning trajectory by separating action generation and examination, and integrating seamlessly as plug-and-play module with other methods for improved robustness and extensibility.

BRIDGE: Bootstrapping Text to Control Time-Series Generation via Multi-Agent Iterative Optimization and Diffusion Modelling

BRIDGE: introduces BRIDGE framework with Text Template Generation, Automatic Evaluation, Feedback-driven Refinement, Domain Time Series Encoder, Text Description Encoder, Prototype Assignment Module, Semantic Prototypes, Conditioning, Diffusion Model, and Random Noise for text-controlled time series generation.
BRIDGE framework uses multi-agent system for iterative text refinement and hybrid approach combining semantic prototypes with text descriptions to enhance time-series generation controllability and fidelity.
The framework addresses challenges of limited text-TS pairs and modality discrepancy by generating high-quality datasets and integrating semantic prototypes for improved domain generalization in time-series generation.

PersonaX: A Recommendation Agent-Oriented User Modeling Framework for Long Behavior Sequence

PersonaX (Recommendation Agent-Oriented User Modeling Framework): introduces a user modeling framework for long behavior sequences, with behavior clustering, sampling budget allocation, in-cluster selection, SBS selection, offline multi-persona construction, online persona retrieval, and persona cache.
PersonaX extracts representative sub-behavior sequences offline to construct fine-grained personas for efficient online retrieval in recommendation agents.
PersonaX addresses challenges of long user-generated content by balancing behavioral completeness and efficiency in user modeling.

ReSo: A Reward-driven Self-organizing LLM-based Multi-Agent System for Reasoning Tasks

ReSo (Reward-driven Self-organizing): introduces reward-driven multi-agent system, integrating Task Graph for decomposition, Agent Graph Construction for network building, Dynamic Agent Database for agent profiles, and Collaborative Reward Model for feedback.
ReSo incorporates Collaborative Reward Model to provide fine-grained signals, enabling dynamic optimization of agent collaboration and improving scalability.
The framework utilizes Dynamic Agent Database to maintain agent profiles, facilitating adaptive agent selection based on performance and task similarity.

EchoQA: A Large Collection of Instruction Tuning Data for Echocardiogram Reports

EchoQA (Echocardiogram Question Answering) introduces a question-answering dataset creation framework with Data Extraction, Clinical Categorization, Sentence Matching, Question Generation, LLM Training, and Fairness Audit components for echocardiogram reports.
EchoQA framework utilizes clinician expertise for categorizing cardiac abnormalities and generates question-answer pairs to facilitate instruction tuning of language models for cardiology QA tasks.
The framework aims to establish a benchmark for LLM-based AI agents in cardiology, focusing on differential diagnoses and fairness across social determinants of health.

AppAgentX: Evolving GUI Agents as Proficient Smartphone Users

AppAgentX (Evolving GUI Agents as Proficient Smartphone Users): introduces an evolutionary framework for GUI agents, incorporating memory mechanism, evolutionary mechanism, and execution strategy to enhance efficiency and intelligence by evolving high-level actions from task execution history.
AppAgentX framework utilizes a chain-based knowledge framework to record task execution history, enabling the agent to identify repetitive action sequences and evolve shortcut nodes representing high-level actions for improved task efficiency.
The framework's memory mechanism stores page nodes and element nodes with descriptions and visual embeddings, facilitating the evolutionary mechanism to abstract low-level actions into high-level actions and optimize the agent's operational repertoire.

Haste Makes Waste: Evaluating Planning Abilities of LLMs for Efficient and Feasible Multitasking with Time Constraints Between Actions

RECIPE2PLAN: introduces Multitask Agent, Action, Observation, Feedback, and Recipe, to evaluate multitask planning with time constraints in cooking scenarios.
RECIPE2PLAN framework challenges agents to balance efficiency and feasibility in parallel task execution while respecting temporal constraints between actions.
RECIPE2PLAN benchmark uses real-world recipes to assess agents' ability to optimize cooking time and adhere to temporal constraints, highlighting the need for improved temporal awareness in LLMs.

ATLAS: Agent Tuning via Learning Critical Steps

ATLAS (Agent Tuning via Learning Critical Steps): introduces a framework for efficient LLM agent tuning by focusing on critical steps identified from expert trajectories.
ATLAS framework employs a Selector to identify Critical Steps within Expert Trajectories and applies Finetuning to a Base LLM using Critical Step Loss computed on these steps to create a tuned LLM Agent.
By selectively finetuning on Critical Steps, ATLAS reduces overfitting and enhances generalization capabilities of LLM agents while minimizing training costs.

3rd March 2025

CorrA: Leveraging Large Language Models for Dynamic Obstacle Avoidance of Autonomous Vehicles

CorrA (Corridor-Agent): introduces a dynamic obstacle avoidance framework, integrating Scene Description, LLM Scene Analysis, Optimization, DDP, Hard safety constraints, and Ego car trajectory components.
CorrA uses LLM for reasoning to generate adaptive sigmoid boundary parameters, which are efficiency-optimized and used by DDP within MPC for trajectory planning.
This framework enhances autonomous vehicle safety and efficiency in dynamic environments through real-time adaptation of sigmoid-based safety boundaries.

Interactive Debugging and Steering of Multi-Agent AI Systems

AGDEBUGGER (Interactive Agent Debugging Tool): introduces an interactive debugging system for multi-agent AI, featuring a message viewer, message sending, message editing, message reset, overview visualization, agent state checkpoints, and a message queue.
This tool facilitates debugging by allowing users to inspect conversations, edit messages, reset workflows, and visualize conversation history to understand and correct agent behavior.
AGDEBUGGER addresses the challenges of debugging complex multi-agent systems by providing interactive control and visualization, enabling developers to effectively identify and fix errors in agent workflows.

Al persuading Al vs Al persuading Humans: LLMs' Differential Effectiveness in Promoting Pro-Environmental Behavior

Research Framework: introduces a system for evaluating LLMs in promoting Pro-Environmental Behavior, with Participants, Chat, Communication Strategy and Effects components.
The framework compares real, simulated, and synthetic participants interacting with personalized chatbots, non-personalized chatbots, or static statements using different communication strategies.
The study investigates effects on pro-environmental intentions, climate change belief, sustainable choices, psychological distance, sharing, consumption, self-perception, and policy adoption.

Persuasion at Play: Understanding Misinformation Dynamics in Demographic-Aware Human-LLM Interactions

PANDORA Framework (Persuasion ANalysis in Demographic-aware human-LLM interactions and misinformation Response Assessment): introduces components including LLM-to-Human Persuasion, Persuasive Text Generation, Persuasive Text Impact, Human-to-LLM Persuasion, Persuasive Text Generation, Persuasive Text Impact, Multi-agent LLM Persuasion, Multi-Agent LLM Architecture, Homogeneous groups, Heterogeneous groups, Interaction rounds, First responses, and Final responses to investigate misinformation dynamics in human-LLM interactions considering demographic factors.
PANDORA framework analyzes bidirectional persuasion between humans and LLMs, evaluating LLM-generated and human-generated persuasive texts' impact on belief and correctness across diverse demographic groups in single-agent and multi-agent settings.
The framework's multi-agent LLM architecture explores echo chamber effects in homogeneous groups and mitigation in heterogeneous groups, offering insights into demographic influences on misinformation susceptibility and potential intervention strategies.

Mind the (Belief) Gap: Group Identity in the World of LLMs

Multi-agent LLM framework: introduces simulation of belief congruence experiment with participant agent interacting with confederate agents, each having assigned belief, through interaction rounds to answer question.
Framework components include participant agent making decision after interaction rounds with confederate agents, considering their beliefs on discussion topic.
Framework simulates psychological experiment to investigate belief congruence in LLMs by observing agent's choices based on belief alignment of others.

Adaptively evaluating models with task elicitation

Adaptive Evaluations: introduces framework for evaluating language models, utilizing Target LLM, Evaluator Agent, Verifier, and Static Evaluation components.
Framework employs evaluator agents to create difficult questions by probing target model behavior from static evaluation results.
Verifier component ensures generated questions maintain validity, difficulty, and novelty, refining target model profile iteratively.

--

Can (A)I Change Your Mind?

Dynamic Bot Framework: introduces a structured system utilizing GPT-4, with System Prompt, Experiment Framework, System Message, Persona, Conversation Instruction, User Message, Bot Message, Opinion and Confidence, Few Shot Conversations, Nudger, Initial Message, Summarization Prompt, and Final Message, to facilitate and analyze human-bot conversations for persuasion studies.
This framework employs a detailed prompt and iterative message processing, including summarization and rephrasing, to ensure naturalistic and contextually relevant bot interactions within the experiment.
The framework incorporates components like Nudger for maintaining engagement and Few Shot Conversations for guiding bot behavior, aiming for robust and ecologically valid persuasion research.

Persuade Me if You Can: A Framework for Evaluating Persuasion Effectiveness and Susceptibility Among Large Language Models

PMIYC (Persuade Me If You Can): introduces automated framework for evaluating persuasion effectiveness and susceptibility of LLMs through multi-agent interactions, with PERSUADER (Agent attempting persuasion), PERSUADEE (Agent being persuaded), Multi-turn Conversation (Iterative argument exchange), and Agreement Score (Quantifies stance on claim).
PMIYC framework simulates conversations between PERSUADER and PERSUADEE agents to measure persuasive effectiveness and susceptibility of LLMs in different contexts.
PMIYC offers scalable and automated approach to study LLM persuasion dynamics, providing insights into vulnerabilities and safer AI development.

AutoAdvExBench: Benchmarking autonomous exploitation of adversarial example defenses

AutoAdvExBench (Benchmark for Autonomous Exploitation of Adversarial Example Defenses): introduces benchmark, with Forward Pass Implementation, Differentiable Forward Pass Conversion, FGSM Attack and Iterative Attack components, that evaluates LLMs' ability to autonomously exploit adversarial defenses.
This benchmark directly measures LLMs' success on security tasks performed by machine learning experts, unlike proxy benchmarks.
AutoAdvExBench is mechanistically verifiable and uses real-world research codebases, highlighting the gap between CTF-like and real-world security challenges for LLMs.

Designing VR Simulation System for Clinical Communication Training with LLMs-Based Embodied Conversational Agents

VAPS (Virtual AI Patient Simulator): introduces VR system for clinical communication training, with tutorial-, clinical patient interaction- and reflection-scenes, embodied conversational agents, medical records, narrative design, realistic animations and system interaction.
VAPS utilizes LLM-driven ECAs to simulate dynamic patient interactions, incorporating medical records and adaptive narratives for realistic VR-based training.
The system aims to enhance HP students' communication skills through customizable and repeatable practice scenarios within an immersive VR environment.

Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models

TOOLRET: introduces TOOLRET Benchmark (heterogeneous tool retrieval evaluation), IR Models (benchmark various retrieval models), TOOLRET-train Dataset (large-scale training dataset), and Evaluation Metrics (retrieval performance metrics) for benchmarking tool retrieval performance of large language models.
TOOLRET benchmark demonstrates existing information retrieval models exhibit suboptimal performance in retrieving tools, consequently degrading tool-use large language model task completion rates.
TOOLRET-train dataset aims to improve information retrieval models for tool retrieval, ultimately enhancing the effectiveness of large language models in tool utilization.

Student engagement in collaborative learning with AI agents in an LLM-empowered learning environment: A cluster analysis

MAIC (Massive AI-empowered Course System): introduces a platform integrating specialized AI agents including AI Teacher, AI Teaching Assistant, Sparker, Thinker, Questioner and Note Taker, managed by Director Agents using Dialogue History and Learning Materials to enhance online learning.
MAIC system utilizes Director Agents to analyze classroom dynamics from Dialogue History and Learning Materials, enabling dynamic agent selection via Select Speakers and text generation through Generate Texts, alongside components like Role Descriptions and Specialized Intelligence.
MAIC framework aims to foster collaborative learning by employing diverse AI agents - AI Teacher, AI Teaching Assistant, Sparker, Thinker, Questioner, Note Taker - each with specific roles, to support student engagement and personalized educational experiences.

Evaluation and Facilitation of Online Discussions in the LLM Era: A Survey

Taxonomy of Discussion Quality Evaluation: introduces four main dimensions - Structure and Logic, Social Dynamics, Emotion and Behavior, and Engagement and Impact - to assess online discussion quality.
Taxonomy of Discussion Quality Evaluation: encompasses multiple aspects within each dimension, ranging from argument analysis and coherence to politeness, toxicity, and engagement, providing a comprehensive framework.
Taxonomy of Discussion Quality Evaluation: aims to offer a structured approach for evaluating diverse facets of online discussions, moving beyond traditional argument-centric methods to include social and behavioral dynamics.

Improving Retrospective Language Agents via Joint Policy Gradient Optimization

RetroAct (Retrospective Language Agent): introduces a novel agent framework that jointly optimizes task-planning and self-reflective evolution capabilities with Planner, Reflector, Environment, Tool Calling, Reflection, Feedback, Reward, Differential Reward, Imitation Learning, Reinforcement Learning, Policy Gradient Optimization, Replay Buffer.
RetroAct framework uses a two-stage joint optimization process integrating imitation and reinforcement learning for enhanced data efficiency and training stability.
RetroAct improves performance of open-source models and reduces dependency on closed-source LLMs by enabling continuous learning and evolution.

MultiAgentBench : Evaluating the Collaboration and Competition of LLM agents

MARBLE (Multi-agent coordination Backbone with LLM Engine): introduces a multi-agent evaluation framework with Configuration, Coordinate Engine, Agent Graph, Shared Memory, Cognitive Module, Environment, Tool Box, and Evaluator components.
This framework evaluates LLM-based multi-agent systems by measuring task completion and coordination quality in diverse interactive scenarios.
MARBLE utilizes milestone-based KPIs and supports various coordination protocols and planning strategies for comprehensive multi-agent system analysis.

2nd March 2025

NESYC: A NEURO-SYMBOLIC CONTINUAL LEARNER FOR COMPLEX EMBODIED TASKS IN OPEN DOMAINS

NESYC (Neuro-Symbolic Continual learner): introduces a neuro-symbolic continual learning framework integrating Semantic Parser, Hypothesis Generator, Hypothesis Interpreter, Logic programming form, Memory-based monitoring, Task Planner, Action Executor, and Error Handler for embodied agents in open-domain environments.
NESYC framework employs contrastive generality improvement and memory-based monitoring schemes, utilizing LLMs and symbolic tools to generalize actionable knowledge and refine it through experience.
The framework iteratively reformulates knowledge and applies it, adapting to unpredictable situations and demonstrating effectiveness in diverse embodied task benchmarks by continually improving understanding of the environment.

A Law Reasoning Benchmark for LLM with Tree-Organized Structures including Factum Probandum, Evidence and Experiences

TL Agent (Transparent Law Reasoning Agent): introduces a transparent law reasoning framework, with Agent Brain, Fact Finding Head, Knowledge Search, MultiRole Checker, Legal Knowledge, Reflection, Evidence, Factum Probandum, Experiences, and Inferences components, for AI-assisted legal decision-making.
The framework employs a tree-organized schema integrating hierarchical factum probandum, evidence, and experiences to simulate comprehensive court processes and enhance transparency in legal reasoning.
TL Agent utilizes a suite of legal analysis tools within an agent-based architecture to construct tree-organized legal reasoning structures from textual case descriptions for improved judicial fairness.

AI Agents for Ground-Based Gamma Astronomy

Astronomical Agent: introduces an AI system designed for astronomy tasks, integrating context understanding, language model processing, external function execution, and data validation within a specified framework.
The agent utilizes instruction-finetuned LLMs to automate complex tasks in gamma-ray astronomy, incorporating components like ACADA and Gammapy for telescope control and data analysis pipelines.
Validation mechanism ensures command quality by evaluating function execution results against provided data and software framework, enhancing reliability in autonomous astronomical operations.

Evaluating Personalized Tool-Augmented LLMs from the Perspectives of Personalization and Proactivity

ETAPP (Evaluation of Tool-augmented Agent from the Personalization and Proactivity Perspective): introduces a benchmark for evaluating personalized tool invocation, comprising API Construction, Sandbox Construction, User Profile Construction, Tool-utilizing Preference Construction, Interaction History Construction, Memory Building, Instruction Construction, Manual Check, Inference, Available Tools, Tool Invoking Process, Final Answer, and Evaluation components.
ETAPP assesses personalization and proactivity in tool-augmented LLMs using a dataset of 800 cases and a key-point-based evaluation method.
The benchmark aims to address the lack of evaluation criteria for personalized tool usage in diverse scenarios, focusing on improving personalized LLM agents.

CLEA: Closed-Loop Embodied Agent for Enhancing Task Execution in Dynamic Environments

CLEA (Closed-Loop Embodied Agent): introduces a closed-loop framework with Observer, Memory, Planner-Critic Agent and Skill Pool to enhance task execution in dynamic environments using robots.
CLEA framework incorporates Observer for visual input conversion, Memory for belief state maintenance, Planner-Critic Agent for adaptive decision-making, and Skill Pool for predefined executable actions.
The framework facilitates continuous adaptation and error recovery in long-horizon tasks by integrating real-time environmental feedback and memory-driven reasoning within its components.

Unmasking Digital Falsehoods: A Comparative Analysis of LLM-Based Misinformation Detection Strategies

SNIFFER (Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection): introduces multimodal model utilizing visual and textual inputs with cross-modal transformer and external knowledge to detect misinformation.
SNIFFER integrates image encoder, text processing, and retrieval mechanisms, employing LLM for reasoning and providing explainable out-of-context misinformation detection.
The framework achieves explainability through structured validation and evidence integration, enhancing transparency in multimodal misinformation analysis.

LLMDR: LLM-Driven Deadlock Detection and Resolution in MAPF Environment

LLMDR (LLM-Driven Deadlock Detection and Resolution): introduces MAPF Environment, Base Model Simulation, LLM Deadlock Detection, and LLM Deadlock Resolution with PIBT to address deadlock and enhance learned MAPF model performance.
LLMDR framework uses LLM Deadlock Detection to identify deadlocks in Base Model Simulation within MAPF Environment and employs LLM Deadlock Resolution with PIBT to resolve them.
LLMDR leverages LLMs for high-level deadlock management and integrates with PIBT algorithm for collision-free action generation in multi-agent pathfinding scenarios.

1st March 2025

Instructor-Worker Large Language Model System for Policy Recommendation: a Case Study on Air Quality Analysis of the January 2025 Los Angeles Wildfires

Instructor-Worker LLM System: introduces Instructor LLM (Prompt interpreter orchestrator), Worker LLM(s) (Data analysis summarization agents), Code Execution Module (API call validation execution), and Cloud Platform (External data source) for air quality analysis.
The system uses Instructor LLM to process user instructions, retrieve data from Cloud Platform via Code Execution Module, and distribute analysis tasks to Worker LLMs.
This multi-agent approach aims to efficiently analyze large datasets and generate policy recommendations based on air quality data.

Challenges in Testing Large Language Model Based Software: A Faceted Taxonomy

Taxonomy for LLM Test Case Design: introduces Software Under Test (system or component to evaluate), Goal (objective of the test case), Oracles (evaluation mechanisms for property), and Inputs (data to elicit SUT responses) for structuring LLM testing.
This taxonomy addresses challenges in LLM testing by categorizing key variation points impacting evaluation correctness and emphasizing ambiguity in inputs and outputs.
The taxonomy aims to improve reliability and reproducibility of LLM testing by providing a systematic framework for test case design and evaluation across the software lifecycle.

PodAgent: A Comprehensive Framework for Podcast Generation

PodAgent: introduces comprehensive framework for podcast generation with Host-Guest-Writer System, Voice-Role Matching, Instruction-following TTS, Audio Script Generation and Audio Production components.
PodAgent framework utilizes multi-agent collaboration for content creation, voice characteristic analysis for voice selection and LLM-enhanced speech synthesis for expressive speech.
PodAgent addresses key challenges in podcast generation, including content depth, dialogue naturalness, voice appropriateness and speech expressiveness.

Structured Reasoning for Fairness: A Multi-agent Approach to Bias Detection in Textual Data

SRF Framework (Structured Reasoning for Fairness Framework): introduces multi-agent pipeline with Checker Agent, Validation Agent, and Justification Agent for textual data bias detection.
SRF Framework systematically identifies biases through fact or opinion classification, bias intensity scoring, and factual justification provision.
This approach improves bias detection accuracy and interpretability, fostering fairness and accountability in language models.

Shifting Power: Leveraging LLMs to Simulate Human Aversion in ABMs of Bilateral Financial Exchanges, A bond market study

TRIBE (Trading Relationships, Interactions, and Bilateral Exchange of assets): introduces agent-based model augmented with LLM to simulate human aversion, with components Select Paramatervalues, Build the Financial Landscape, Initialise Bankers, Bankers engage with Clients, Clients determine direction choice availability, LLM response Positive, LLM response Averse, Bankers must trade if Clients are Positive towards them, and Banker facilitated trade occurs.
TRIBE framework simulates bilateral financial exchanges by integrating LLM for human-like client decision-making regarding trade aversion and timeliness.
This framework enhances realism in agent-based models by incorporating stochastic human-like decision processes via LLM, revealing emergent market behaviors.

28th February 2025

UDora: A Unified Red Teaming Framework against LLM Agents by Dynamically Hijacking Their Own Reasoning

UDora (Unified Red Teaming Framework): introduces a novel approach for attacking LLM agents by dynamically adapting adversarial strings based on the agent's reasoning process, encompassing components like System (target agent), Direct Attack (baseline), UDora Attack (framework itself), Initial Response, Modified Response, Optimization, Malicious Environment, Malicious Instruction, Tool list, and Malicious Target Tool.
UDora framework strategically inserts "noise" into the agent's reasoning at optimal positions identified through positional scoring and iterative optimization to mislead the agent towards malicious actions.
The framework evaluates two adversarial scenarios: Malicious Environment, where the observation is corrupted, and Malicious Instruction, where the instruction is directly manipulated, demonstrating effectiveness across diverse datasets and real-world agents.

Personalized Causal Graph Reasoning for LLMs: A Case Study on Dietary Recommendations

Personalized Causal Graph Reasoning: introduces agentic framework enhancing LLM reasoning by incorporating personal causal graphs, with goal identification, personal causal graph, traverse impactful nutrient paths, rank, verify, retrieve food items, food nutrient database, generate food recommendation, large language model, and personal data.
This framework constructs personalized causal graphs from individual data to guide LLM in generating tailored dietary recommendations.
By leveraging structured causal dependencies and counterfactual evaluation, the framework aims to provide more precise and personalized dietary advice compared to generic LLM approaches.

BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology

BixBench (Bioinformatics Benchmark): introduces benchmark framework with analyst-created analysis capsules, expert review, LLM-generated MCQs, task capsules, agent task environment with tools, and open/multiple-choice evaluations.
BixBench framework uses analysis capsules containing data and questions, evaluated in agent environment with tools for bioinformatics tasks.
BixBench framework assesses LLM-based agents in bioinformatics through open-ended questions and multiple-choice questions for comprehensive evaluation.

EdgeAIGuard: Agentic LLMs for Minor Protection in Digital Spaces

EdgeAIGuard: introduces multi-agent framework for minor protection, with Input Layer, Edge Processing Unit, Local Storage and Protection Layer components.
EdgeAIGuard framework incorporates Sentinel, Context, and Intervention Agents within Edge Processing Unit, utilizing DeepSeek LLM Engine for threat detection and response.
EdgeAIGuard employs Local Storage with History Cache and Pattern Memory to maintain context awareness and adapt to evolving online threats effectively.

ARIES: AUTONOMOUS REASONING WITH LLMS ON INTERACTIVE THOUGHT GRAPH ENVIRONMENTS

ARIES (AUTONOMOUS REASONING WITH LLMS ON INTERACTIVE THOUGHT GRAPH ENVIRONMENTS) introduces a multi-agent framework with Policy Agent, Reasoning Agent, and Thought Graph to enhance reasoning in LLMs.
ARIES framework utilizes Policy Agent to select Actions that Reasoning Agent executes on Thought Graph, dynamically adapting problem-solving strategy.
The framework aims to improve reasoning accuracy and efficiency by using LLMs as policy agents to guide exploration within a structured thought graph environment.

The amplifier effect of artificial agents in social contagion

Artificial Agent Social Contagion Framework: introduces agent types, experiments, attributes, threshold, adoption rate, seeding strategy, network, and proportion of artificial agents, to describe the impact of artificial agents on social contagion processes.
This framework investigates how artificial agents, compared to humans, exhibit lower adoption thresholds and amplify social contagion in networks.
The findings highlight the potential for artificial agents to accelerate behavioral shifts and raise questions about managing their influence in social systems.

Retrieval Augmented Generation for Topic Modeling in Organizational Research: An Introduction with Empirical Demonstration

Agentic RAG (Agentic Retrieval-Augmented Generation): introduces topic modeling method integrating retrieval, generation, and agent-driven learning for improved qualitative analysis.
Agentic RAG extends RAG with ReAct agent for iterative query reformulation and output evaluation, enhancing transparency and reliability.
Agentic RAG streamlines topic modeling by using embeddings, reducing preprocessing and improving efficiency over traditional methods.

The Power of Personality: A Human Simulation Perspective to Investigate Large Language Model Agents

Human Simulation Perspective: introduces framework with prompt-based personality shaping, single and multi-agent task testing and evaluation, group collaboration, team formation, and performance analysis to investigate personality traits influence on large language model agents in closed and open tasks.
This framework systematically explores how personality traits impact reasoning, creativity, and collaboration of LLM agents by assigning Big Five traits and evaluating performance in single-agent and multi-agent settings.
The study reveals that specific personality traits significantly affect agent performance and multi-agent systems exhibit collective intelligence driven by personality combinations, demonstrating LLMs' inherent human behavior simulation capabilities.

Digital Player: Evaluating Large Language Models based Human-like Agent in Games

CivAgent: introduces a Large Language Model-based agent for strategy games, integrating perception, memory, reasoning & planning, skills, tools, and game components for human-like gameplay.
CivAgent utilizes game observations and stored interaction data within its memory to inform reasoning and planning for executing in-game skills and leveraging external tools.
The framework incorporates a simulator within its tools component to enhance numerical reasoning and decision-making processes in the complex game environment.

Cyber Defense Reinvented: Large Language Models as Threat Intelligence Copilots

CYLENS (Cyber Defense Reinvented: Large Language Models as Threat Intelligence Copilots): introduces a cyber threat intelligence copilot system, integrating Base LLMs, Large-scale Knowledge, Task-Oriented Dataset, Curriculum Pre-training, Cascading Reasoning, and Specialized NLP Modules.
CYLENS enhances cyber threat analysis through cascading reasoning and specialized NLP modules for tasks like attribution, contextualization, detection, correlation, prioritization, and remediation.
The framework utilizes curriculum pre-training and fine-tuning methodologies to embed extensive CTI knowledge and adapt to diverse organizational needs in cybersecurity.

The Rise of Darkness: Safety-Utility Trade-Offs in Role-Playing Dialogue Agents

ADMP (Adaptive Dynamic Multi-Preference) introduces a method that dynamically adjusts safety-utility preferences, incorporating train dataset, ADMP, sample dataset, CMS, character settings, GPT-4, safety reward model, utility reward model and typical interaction library.
ADMP framework utilizes Coupling Margin Sampling (CMS) to enhance safety in high-risk scenarios through character-query risk coupling measurement within typical interaction library and preference weight sampling and mapping.
The framework aims to balance safety and utility in role-playing dialogue agents, addressing risk coupling between user queries and character settings to mitigate unsafe content generation.

ProAI: Proactive Multi-Agent Conversational AI with Structured Knowledge Base for Psychiatric Diagnosis

ProAI (Proactive AI): introduces a proactive conversational framework for mental health diagnosis, integrating Multi-Agent Proactive Reasoning Workflow, Structured Knowledge Graph, and Multifaceted Evaluation Strategy.
ProAI framework employs Decision-Maker and Question-Generator Agents within Multi-Agent Proactive Reasoning Workflow, utilizing Structured Knowledge Retrieval and Action Prediction to navigate Structured Knowledge Graph.
Multifaceted Evaluation Strategy of ProAI, encompassing Simulated Clinical Interview, User Experience Evaluation, and Doctor Evaluation, ensures comprehensive assessment of diagnostic accuracy, user experience and medical proficiency.

Multi²: Multi-Agent Test-Time Scalable Framework for Multi-Document Processing

Multi² (Multi-Agent Test-Time Scalable Framework): introduces multi-agent framework with Input documents processed using Prompt Bank's Prompts to generate Candidate Summaries, then aggregated by Aggregator (Voter, Context-preserve, Context-independent) into Final summary, evaluated by Evaluation metrics (CAP score, LLM-ACU score) against Baseline summary.
Multi² framework leverages prompt ensemble for multi-document summarization, employing diverse Prompts from Prompt Bank to guide independent LLM agents in generating Candidate Summaries, which are then consolidated by Aggregator module.
The framework's Aggregator offers three distinct approaches: Voter selects best summary, Context-preserve refines summary using documents and candidates, and Context-independent consolidates summaries without original documents, all evaluated with CAP score and LLM-ACU score metrics.

27th February 2025

WHY ARE WEB AI AGENTS MORE VULNERABLE THAN STANDALONE LLMS? A SECURITY ANALYSIS

OpenHands (Web AI agent platform): introduces a framework for analyzing web agent vulnerabilities, comprising Goal Preprocessing, Action Space, Event Stream, LLM, and Eval Environment components.
This framework investigates how embedding user goals, multi-step actions, and observational capabilities increase web agent vulnerability compared to standalone LLMs.
The study uses component ablation to identify specific design choices that contribute to the heightened susceptibility of web agents to jailbreaking.

Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers

MAV (Multi-Agent Verification): introduces Generator LLM for output generation, Aspect Verifiers for output evaluation, Aggregation for signal combination, BoN-MAV as multi-agent algorithm, BoN-RM as reward model algorithm, and Self-Consistency as consistency algorithm for test-time compute scaling.
MAV paradigm combines multiple Aspect Verifiers to evaluate Generator LLM outputs, using Aggregation of verifier signals to improve performance over BoN-RM and Self-Consistency baselines.
BoN-MAV algorithm, a specific implementation of MAV, demonstrates effective test-time scaling by increasing number of Aspect Verifiers, showing improvements in accuracy across diverse language models and domains.

Bridging Legal Knowledge and AI: Retrieval-Augmented Generation with Vector Stores, Knowledge Graphs, and Hierarchical Non-negative Matrix Factorization

Smart-SLIC (Smart Semantic Legal Information and Computational System): introduces a legal AI framework integrating RAG, VS, KG, and NMF for enhanced legal research.
Smart-SLIC leverages vector stores for semantic retrieval, knowledge graphs for relationship navigation, and NMF for latent topic discovery in legal documents.
This framework aims to improve legal information retrieval, reasoning, and explainability by combining these components for complex legal tasks.

Telephone Surveys Meet Conversational AI: Evaluating a LLM-Based Telephone Survey System at Scale

AI-driven telephone survey system: introduces a voice-based conversational AI agent for conducting phone surveys, integrating Speech to Text, Large Language Model, and Text to Speech components for real-time dialogue.
The system incorporates a Turn-taking Model for managing conversation flow, a Deterministic consent checker and Llama Guard safety model within a Safety suite for secure interactions, and Logger, Recording of call, and Closed Database for data management.
This architecture enables automated large-scale telephone surveys, mimicking human interviewers while maintaining data quality and operational efficiency.

Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents

Collab-Overcooked Benchmark: introduces a framework with Memory, Reflection, Instruction-Builder, Planner, Communication, Error Handling, Executor, and Environment State components for evaluating LLM-based multi-agent collaboration.
This benchmark assesses collaboration capabilities in a simulated kitchen environment featuring resource isolation and asymmetric task knowledge between agents.
It employs process-oriented metrics like Initiating Capability and Responding Capability alongside end-to-end metrics to enable fine-grained analysis of collaborative performance.

Picking the Cream of the Crop: Visual-Centric Data Selection with Collaborative Agents

ViSA (Visual-Centric Selection approach via Agents Collaboration): introduces a multi-agent framework for high-quality visual instruction data selection, incorporating Visual Information Quantification, Diversity Perspectives Quantification, and Text Quality Quantification components.
ViSA evaluates image informativeness and instruction relevance by leveraging visual agents including InternVL, QwenVL and Llava to assess visual elements using SAM2 and DINO, and image-specific features.
ViSA utilizes Shapley value based Agent Collaboration to refine evaluation scores like Segmentation Complexity Score, Object Alignment Score, Diversity Perspective Score, Prior Token Perplexity Score and Image-Text Mutual Information Score and improve MLLM training efficiency by reducing dataset noise.

MIND: Towards Immersive Psychological Healing with Multi-agent Inner Dialogue

MIND (Multi-agent INner Dialogue): introduces multi-agent framework with Trigger, Devil, Guide, Strategist, Player and Memory components for immersive psychological healing through inner dialogue.
MIND framework utilizes Trigger for scenario generation, Devil for cognitive distortion simulation, Guide for restructuring guidance, Strategist for storyline progression, Player as simulated patient and Memory for narrative coherence.
MIND paradigm aims to enhance empathy and self-reconciliation by decomposing patient's conflicting self into interactive agents facilitating cognitive scaffold for metacognitive reflection and therapeutic efficacy.

Supervised Fine-Tuning LLMs to Behave as Pedagogical Agents in Programming Education

GuideLM: introduces a fine-tuning framework with curated question-answer dataset, script-based, manual and LLM-based pre-processing, OpenAI fine-tuning and pedagogical model.
GuideLM framework employs supervised fine-tuning to create pedagogically sound LLMs for programming education by refining existing models with targeted datasets.
The framework aims to improve Socratic guidance and economy of words in LLM responses for novice programmers, enhancing learning without over-assistance.

--

Personas Evolved: Designing Ethical LLM-Based Conversational Agent Personalities

Framework name: introduces a workshop for responsible design and evaluation of ethical LLM-based conversational agent personalities.
This workshop addresses ethical and practical concerns of rapidly adopted LLM-based personas in conversational user interfaces.
The workshop aims to bridge CUI and AI communities to ensure transparency, inclusivity, and user-centered LLM-driven CUIs.

TripCraft: A Benchmark for Spatio-Temporally Fine Grained Travel Planning

TripCraft: introduces a benchmark for fine-grained travel planning, with User (initiates travel plan), Query (travel plan request), Persona (travel style preferences), Agent (generates travel plan), Reference Information (data for plan generation), Databases (storage for data), Generated Plan (output itinerary), Temporal Meal Score (meal scheduling quality), Temporal Attraction Score (attraction visit duration), Spatial Score (travel efficiency), Ordering Score (itinerary sequence), Persona Score (user preference alignment), CPR Macro (commonsense constraint adherence), CPR Micro (commonsense constraint adherence), HCPR Macro (hard constraint adherence), HCPR Micro (hard constraint adherence), and Delivery Rate (plan generation success).
TripCraft assesses language agents in generating constraint-aware travel itineraries by incorporating user preferences and real-world constraints.
The benchmark uses continuous evaluation metrics to assess temporal, spatial, sequential, and persona-specific aspects of generated travel plans.

A Thousand Words or An Image: Studying the Influence of Persona Modality in Multimodal LLMs

Persona Modality Representation Framework: introduces framework with LLM, Stable Diffusion, Pillow Typography, Text, Image, Assisted Image, Descriptive Image components for studying persona modality influence in multimodal LLMs.
Persona Modality Representation Framework evaluates persona embodiment using text, image, assisted image and descriptive image modalities generated through pipeline involving LLM, Stable Diffusion and Pillow Typography.
Persona Modality Representation Framework systematically investigates how different modalities impact persona expressiveness in multimodal LLMs, utilizing diverse persona dataset and evaluation framework.

Large Language Model Strategic Reasoning Evaluation through Behavioral Game Theory

Evaluation framework: introduces a three-step method for evaluating LLMs' strategic reasoning, incorporating Abstracted Game Library, Responder Model or Agents, and TQRE Estimation components.
This framework systematically assesses LLMs' reasoning capability through game abstraction and TQRE-based parameter analysis, accounting for contextual complexity.
It evaluates strategic reasoning beyond Nash Equilibrium, considering demographic biases and prompt effects using behavioral game theory principles.

26th February 2025

Agentic Mixture-of-Workflows for Multi-Modal Chemical Search

CRAG-MoW (Mixture-of-Workflows for Self-Corrective Retrieval-Augmented Generation): introduces agentic framework for multi-modal chemical search, includes User, Generators, Vector Store, Document Fusion, Aggregator Agent, and Report Generation.
CRAG-MoW orchestrates multiple CRAG workflows using Generators for iterative self-correction and Aggregator Agent for synthesizing final response.
Framework leverages structured retrieval and multi-agent synthesis to enhance response quality and interpretability for materials discovery.

Weaker LLMs' Opinions Also Matter: Mixture of Opinions Enhances LLM's Mathematical Reasoning

MoO (Mixture of Opinions): introduces post-training method using Dataset Curation, Ancillary LLMs, Main LLM, Chain-of-Thought Reasoning Steps, Opinions, Post-Training, Inference and Post-trained Main LLM to enhance mathematical reasoning of stronger Main LLM by incorporating diverse Opinions from weaker Ancillary LLMs.
MoO framework curates dataset by augmenting training samples with Chain-of-Thought reasoning and diverse Opinions from multiple weaker Ancillary LLMs, then fine-tunes Main LLM on this dataset.
The post-trained Main LLM in MoO framework demonstrates improved mathematical reasoning by learning to synthesize insights from varied Opinions during the Post-Training phase.

Stay Focused: Problem Drift in Multi-Agent Debate

DRIFTJudge/DRIFTPolicy Framework: introduces DRIFTJudge for drift detection and DRIFTPolicy for mitigation in multi-agent debate with Discussion and Voted Solution components.
This framework addresses problem drift, a performance degradation issue in multi-agent debate over multiple turns.
The framework aims to improve the effectiveness of multi-agent debate by identifying and reducing problem drift.

Winning Big with Small Models: Knowledge Distillation vs. Self-Training for Reducing Hallucination in QA Agents

QA Pipeline (retrieval-augmented question-answering pipeline): introduces retrieval-augmented system, with User Question, Product Manual, Retrieval, QA Pipeline and Factual Response components, designed to generate factual answers from product manual based on user questions.
The framework utilizes product manual as structured knowledge source to provide relevant information for question-answering process.
The described pipeline aims to address hallucination in question answering systems by grounding responses in provided product manual content.

Conversational Planning for Personal Plans

LLM-based Hierarchical Framework: introduces a novel architecture for conversational agents using Meta-Controller, Policy-Option, Tool-Use Policy, Memory, and Tools and RAG components to enable long-term interactive planning.
The framework employs a LLM-powered Meta-Controller to decide macro-actions, LLM-powered Policy-Options to execute these actions, and Tool-Use Policy to fetch relevant content, leveraging Memory and Tools and RAG for context and knowledge retrieval.
This approach facilitates adaptive planning through conversation and feedback, applicable to various scenarios requiring long-term user assistance, such as tutoring and personal health planning.

TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding

TheoremExplainAgent: introduces an agentic framework for multimodal theorem explanation video generation, incorporating Theorems, Prompting, Planner Agent, Code Agent, Multimodal Elements, Rendered Video and Evaluation components.
TheoremExplainAgent utilizes Planner Agent with Scene Outline, Vision Storyboard Plan, Technical Implementation Plan and Animation & Narration Plan to create video plans, and Code Agent with Query Generator, Core Documentation, Plugin Documentation and Agentic RAG to generate animation code.
The framework outputs Rendered Video composed of Multimodal Elements and is assessed by Evaluation metrics, aiming to enhance theorem understanding through visual explanations.

Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems

REWARDAGENT (Agentic Reward Modeling): introduces router, verification agents (factuality, instruction-following), and judger to combine human preference rewards with verifiable correctness signals for reliable reward systems.
Agentic reward modeling enhances reward reliability by integrating multi-dimensional correctness signals and enabling flexible incorporation of diverse verification agents.
REWARDAGENT empirically demonstrates superior performance over vanilla reward models and improves LLM training through DPO with agent-constructed preference pairs.

Agent-centric Information Access

Agent-centric Information Access Framework: introduces architecture orchestrating domain-expert and user-specific LLMs through User Agents (personalized user interface), Knowledge Agents (domain expert LLM), and Belief on Expertise (expertise assessment model).
The framework utilizes User Expertise (user knowledge history) and Knowledge Base (domain specific data) to dynamically manage Query (information request) and Response (synthesized answer) cycles, incorporating Training (expertise model update) via Data-Metadata (inter-agent communication).
This architecture facilitates efficient information retrieval by dynamically selecting and querying relevant expert LLMs, optimizing for accuracy, cost, and latency in multi-expert knowledge synthesis.

Simulation of Language Evolution under Regulated Social Media Platforms: A Synergistic Approach of Large Language Models and Genetic Algorithms

LLM-driven multi-agent framework (Large Language Model-driven multi-agent framework): introduces multi-agent simulation for language evolution on regulated platforms, incorporating participant agents evolving language and supervisory agent enforcing regulations, utilizing Reflection-, Planning-, Dialogue- and Memory-Modules, with Constraint/Expression Strategy Update, Dialogue Log, Keyword Filter, LLM Assessment, Violation Log, Regulations and Violation Detection.
Framework employs dual language strategies (constraint and expression) and LLM-driven Genetic Algorithm for strategy optimization through selection, mutation, and crossover, enhancing adaptability and simulation fidelity.
Participant and supervisory agents, both LLM-driven, interact iteratively, refining language strategies to balance effective communication with evasion of regulatory constraints in simulated social media environments.

MEDDxAgent: A Unified Modular Agent Framework for Explainable Automatic Differential Diagnosis

MEDDxAgent (Modular Explainable DDx Agent): introduces a modular agent framework for explainable automatic differential diagnosis, integrating DDxDriver, History Taking Simulator, Knowledge Retrieval Agent, and Diagnosis Strategy Agent.
MEDDxAgent facilitates iterative diagnostic reasoning by using DDxDriver as central orchestrator to manage interactions between simulator and agents for refining diagnoses.
The framework enhances explainability and transparency in the diagnostic process through intermediate logging and iterative updates of patient profiles and diagnoses.

A Temporal Planning Framework for Multi-Agent Systems via LLM-Aided Knowledge Base Management

PLANTOR (PLanning with Natural language for Task-Oriented Robots) introduces a temporal planning framework for multi-agent systems, integrating natural language input, LLMs for knowledge base generation, Prolog for planning, and behaviour trees for ROS2 execution.
The framework employs a two-phase knowledge base generation using high-level and low-level LLMs, followed by a three-step planning procedure that incorporates temporal dependencies and resource constraints solved via mixed-integer linear programming.
PLANTOR leverages LLMs for human-understandable knowledge bases and Prolog for formal correctness, demonstrating potential for advanced robotics tasks requiring flexible and scalable planning.

Language-Driven Opinion Dynamics in Agent-Based Simulations with LLMs

LODAS (Language-Driven Opinion Dynamics Model for Agent-Based Simulations): introduces a framework for simulating opinion dynamics using language and social interactions with LLM Agents, Network of connections, Initial opinion, Opponent, Discussant, Prompt, Discussion statement, Arguments, and Opinion change.
LODAS framework explores how language and logical fallacies influence opinion evolution in agent-based simulations, simulating debates around the "Ship of Theseus" paradox.
The model utilizes LLM Agents as Opponent and Discussant roles, guided by Prompts and exchanging Arguments related to a Discussion statement to observe Opinion change within a Network of connections.

NEXUS: A LIGHTWEIGHT AND SCALABLE MULTI-AGENT FRAMEWORK FOR COMPLEX TASKS AUTOMATION

Nexus (A Lightweight and Scalable Multi-Agent Framework): introduces a Python framework for constructing LLM-based multi-agent systems, incorporating Supervisor, Task Supervisors, Worker Agents, Memory, and Tools for automating complex tasks.
Nexus framework utilizes a multi-supervisor hierarchy for scalable delegation and YAML-based workflow design, facilitating efficient management of intricate tasks and enhancing scalability.
Nexus framework achieves state-of-the-art performance across coding, mathematical reasoning, and EDA domains, demonstrating its adaptability and efficacy in varied applications.

IndicEval-XL: Bridging Linguistic Diversity in Code Generation Across Indic Languages

IndicEval-XL: introduces comprehensive benchmark for code generation, with Original Dataset prompts, Language extraction, Translation, Back Translation, Quality checks, Programming Languages, and Natural Languages.
IndicEval-XL benchmark evaluates multilingual code generation across Indic languages, focusing on linguistic diversity and functional correctness.
The framework employs automated and human-based quality checks to ensure dataset reliability for benchmarking code generation models.

Letters from Future Self: Augmenting the Letter-Exchange Exercise with LLM-based Future Self Agents to Enhance Young Adults' Career Exploration

Framework name here: introduces a system augmenting letter-exchange exercise with User, Future-self Agent, Present Self Info, LLM, Current Career Exploration Context, and Envisioned Future Profile components.
The system utilizes Profile After 3 Years, Current Profile, and Current Career Development as input for Conversational Agent, offering Letter and Chat modalities for interaction.
This approach aims to enhance young adults' career exploration by simulating personalized future self interactions for guidance and reflection.

Multi-LLM Collaborative Search for Complex Problem Solving

MOSA (Mixture-of-Search-Agents) paradigm introduces collaborative search framework, integrating independent exploration and iterative refinement with root node, action space, child nodes, sampling LLM, sub-questions, candidate sub-answers, majority voting, aggregator, and aggregated candidate sub-answers.
MOSA leverages multiple LLMs as proposers for diverse search directions and as aggregators for refining candidate answers, enhancing reasoning accuracy in complex problem-solving.
Framework mitigates limitations of single-model approaches by combining independence and collaboration, effectively avoiding local optima during search-based reasoning processes.

REALM-Bench: A Real-World Planning Benchmark for LLMs and Multi-Agent Systems

Stock Market Prediction Workflow: introduces a system for stock price forecasting, incorporating data collection, feature extraction, model training, prediction generation, integration, and alert generation, validated through loops.
This workflow utilizes market data, news feeds, and economic data as inputs to generate trading alerts based on predicted stock prices.
The framework emphasizes validation and adaptive updates to maintain prediction accuracy and system reliability in dynamic market conditions.

Data-Efficient Multi-Agent Spatial Planning with LLMs

LLM-MASP (LLM-based Multi-Agent Spatial Planning Framework): introduces multi-agent spatial planning using pretrained language model, rollout algorithm, base policy, environment, state, action, prompt, parse output, finetuning, feasibility checking, resampling, and memory.
This framework leverages LLMs for efficient taxi routing by incorporating world knowledge and adapting to environmental factors through prompting and finetuning.
The use of rollout algorithm and finetuning with LLMs significantly reduces the need for environmental interactions while outperforming traditional methods.

Reward Shaping to Mitigate Reward Hacking in RLHF

RLHF training pipeline with reward shaping: introduces Prompt, Policy Model, Reference Model, Reward Model, Reward Shaping, Reshaped Reward and RL Training for aligning language models and mitigating reward hacking.
The pipeline utilizes Reward Shaping to modify proxy rewards from Reward Model, optionally using Reference Reward, before updating Policy Model via RL Training.
Preference As Reward (PAR) method, detailed as Reward Shaping, applies sigmoid function to centered reward to enhance training stability and mitigate reward hacking.

AGENTSociety Challenge: Designing LLM Agents for User Modeling and Recommendation on Web Platforms

Environment simulator: introduces interactive environment for evaluating LLM agents, with LLM Agents, Simulator, U-R-I Network, and Datasets components.
Environment simulator: constructs interactive environment comprising user, review, and item network, enabling agents to access historical data from datasets.
Environment simulator: facilitates agent performance evaluation in tasks resembling real-world applications for user modeling and recommendation.

TrajLLM: A Modular LLM-Enhanced Agent-Based Framework for Realistic Human Trajectory Simulation

TrajLLM (A Modular LLM-Enhanced Agent-Based Framework for Realistic Human Trajectory Simulation): introduces a modular framework for human trajectory simulation, integrating Persona Preprocess, Routine Activity Generation, Memory, and Destination modules.
This framework uses LLMs for activity and destination prediction, incorporating memory for historical context and physical models for spatial reasoning.
TrajLLM aims to generate realistic and adaptable human mobility patterns while ensuring scalable memory management and interpretable insights.

25th February 2025

A Cooperative Multi-Agent Framework for Zero-Shot Named Entity Recognition

CMAS (cooperative multi-agent system): introduces multi-agent framework with self-annotator, TRF extractor, demonstration discriminator, and overall predictor for zero-shot NER.
CMAS leverages self-annotator for data generation, TRF extractor for contextual feature identification, demonstration discriminator for selective learning, and overall predictor for final prediction.
CMAS enhances zero-shot NER by integrating contextual correlations and self-reflection mechanism through collaborative agents, improving performance and robustness.

Hybrid Voting-Based Task Assignment in Role-Playing Games

VBTA (Voting-Based Task Assignment): introduces a framework for task allocation in role-playing games using capability profiles and task descriptions to generate a suitability matrix.
VBTA framework integrates voting methods and allocation strategies to manage task assignments, and employs a pre-trained LLM with custom prompts to resolve agent-task compatibility ambiguities.
By incorporating Conflict-Based Search for path planning, VBTA enables dynamic game content generation and automates agent decisions, enhancing narrative and gameplay immersion.

Assistance or Disruption? Exploring and Evaluating the Design and Trade-offs of Proactive AI Programming Support

Codellaborator: introduces proactive AI programming support framework, with Timing of Assistance, AI Agent Representation and Scope of Interaction components, exploring design trade-offs in human-AI workflows.
Codellaborator framework evaluates proactive assistance benefits and disruptions compared to user-initiated systems in programming tasks.
The research emphasizes adapting AI proactivity to programming processes for improved user control and code understanding.

INDEPENDENT MOBILITY GPT (IDM-GPT): A SELF-SUPERVISED MULTI-AGENT LARGE LANGUAGE MODEL FRAMEWORK FOR CUSTOMIZED TRAFFIC MOBILITY ANALYSIS USING MACHINE LEARNING MODELS

IDM-GPT (Independent Mobility GPT): introduces a multi-agent LLM framework with Input Validation, Self-optimization Prompting, Database Interaction, Data Analysis, and Self-supervision Modules, leveraging Database and Machine Learning Models for customized traffic mobility analysis.
This framework utilizes LLM-based AI agents to streamline traffic data analysis, enabling efficient processing of spatio-temporal data and ensuring data privacy by mediating user access to sensitive information.
IDM-GPT aims to address challenges in traffic management by providing a scalable solution for urban mobility improvement through optimized data analysis and actionable insights generation for users without ML expertise.

Single- vs. Dual-Prompt Dialogue Generation with LLMs for Job Interviews in Human Resources

Single- vs. Dual-Prompt Dialogue Generation: introduces Single-Prompt Generation (one prompt for full dialogue), Dual-Prompt Generation (two agents with prompts), and Judge LLM (evaluates dialogue authenticity) frameworks for HR job interview dialogue generation and quality assessment.
Single-Prompt Generation uses a single prompt to instruct LLM to create entire interview, whereas Dual-Prompt Generation uses two LLM-agents, interviewer and candidate, with separate prompts.
Judge LLM evaluates generated dialogues by pairwise comparison to determine if dialogues are distinguishable from human discourse, focusing on AI generation detection.

FRIDA to the Rescue! Analyzing Synthetic Data Effectiveness in Object-Based Common Sense Reasoning for Disaster Response

FRIDA (Field Ready Instruction Decoding Agent): introduces expert-in-the-loop pipeline with Templates, Disaster Relief Expert Input, Linguist Input, Seed sentences, Fine-tune Instruct model, Synthetic instructions, and Prompting LLM to generate synthetic data for fine-tuning smaller language models in disaster response domain.
FRIDA pipeline leverages domain and linguistic expertise to create high-quality seed data, which is then used to generate synthetic instructions for fine-tuning language models, enhancing their common sense reasoning about objects.
The framework demonstrates that fine-tuning smaller LLMs with synthetic data generated through the FRIDA pipeline improves their performance in object-based common sense reasoning tasks, particularly in disaster-related scenarios.

MAPORL: Multi-Agent Post-Co-Training for Collaborative Large Language Models with Reinforcement Learning

MAPORL (Multi-Agent Post-co-training for collaborative LLMs with Reinforcement Learning): introduces multi-agent post-co-training paradigm for collaborative large language models, with LLMs, Verifier, Multi-agent RL, and Multi-LLM Systems components.
MAPORL framework employs multi-agent reinforcement learning to co-train multiple LLMs for enhanced collaboration and generalization across diverse tasks.
The framework utilizes a verifier to evaluate LLM responses and discussions, providing co-training rewards maximized through multi-agent RL, fostering effective collaboration.

AgentRM: Enhancing Agent Generalization with Reward Modeling

AgentRM (Agent Reward Model): introduces generalizable reward model to guide policy model for effective test-time search using SFT Agent, Reward Model, and Policy Model components.
AgentRM framework includes Dataset for training and Environment for task execution, leveraging LLM as base model and Reward Annotation for signal generation.
AgentRM enhances agent generalization by finetuning reward model instead of policy model, improving performance on unseen tasks during Inference.

REFUTEBENCH 2.0 – AGENTIC BENCHMARK FOR DYNAMIC EVALUATION OF LLM RESPONSES TO REFUTATION INSTRUCTION

RefuteBench 2.0: introduces User, LLMs (Large Language Models), and Verifier components for dynamic evaluation of LLM responses to refutation instructions.
RefuteBench 2.0 framework employs User to provide feedback, LLMs as model under evaluation, and Verifier as evaluator agent.
RefuteBench 2.0 facilitates flexible assessment of LLM's ability to incorporate refutation feedback in multi-turn dialogues.

Debt Collection Negotiations with Large Language Models: An Evaluation System and Optimizing Decision Making with Multi-Agent

MADeN (Multi-Agent Debt Negotiation): introduces multi-agent framework with Communicating Agent (provides negotiation content), Planning Agent (designs decision framework), and Judging Agent (evaluates action rationality).
MADeN framework enhances debt negotiation by incorporating planning to design decision framework and judging module to evaluate action rationality.
MADeN framework aims to improve decision rationality in debt collection negotiations by addressing limitations of LLMs in making appropriate decisions based on debtor's financial condition.

LAG: LLM agents for Leaderboard Auto Generation on Demanding

LAG (Leaderboard Auto Generation): introduces a framework for automatic leaderboard creation, encompassing paper processing, table analysis, data integration, and leaderboard output with evaluation.
LAG framework utilizes LLMs to address challenges in generating up-to-date leaderboards from rapidly growing scientific publications, focusing on efficiency and quality.
The framework's stages systematically handle paper collection, information extraction, data recombination, and quality assessment to produce reliable and timely leaderboards.

Intersubjective Model of AI-mediated Communication: Augmenting Human-Human Text Chat through LLM-based Adaptive Agent Pair

Intersubjective Model: introduces an AI-mediated communication framework with Agent (LLM-based user proxy), Environment (independent chat space), Extraction (information distilling function), Conversation (dialogue management function), Information Transmission (agent-agent information sharing), Knowledge Base (information integration), and Online Chat Interface (user interaction platform).
This model facilitates human-human communication indirectly through independent agent interactions and information exchange, enabling adaptive message shaping and shared understanding.
The framework aims to overcome limitations of traditional communication models by removing the constraint of shared objective environment and allowing for customized interactions.

Carbon and Silicon, Coexist or Compete? A Survey on Human-AI Interactions in Agent-based Modeling and Simulation

5W1H Taxonomy (5W1H Taxonomy for Human-AI Interactions in ABMS): introduces five dimensions - Why, When, Who, What, and How - to categorize human-AI interaction methods within Agent-Based Modeling and Simulation (ABMS).
The taxonomy decomposes interactions based on user goals, interaction phase, user roles, system components controlled, and interaction means, drawing analogy from theater roles to define user engagement.
This framework aims to provide a structured approach for analyzing and designing human-AI interactions in ABMS, facilitating development of more effective and user-centered simulation systems.

Large Language Model Driven Agents for Simulating Echo Chamber Formation

Model Framework: introduces data preparation, simulation process with LLM post generation, and analysis and validation to simulate echo chamber formation.
The framework employs LLM-enhanced approach for opinion evolution, network rewiring, and content generation, incorporating textual context for realistic simulation.
Simulation Process includes "screen" component, representing limited user attention and information accessibility within social media environments.

LLM Knows Geometry Better than Algebra: Numerical Understanding of LLM-Based Agents in A Trading Arena

Agent Trading Arena: introduces a virtual numerical game environment, with Agent, Stocks and Market, Chat Pool, Day Simulation, Reflection, Memory and Environment components, designed for evaluating numerical reasoning of LLM-based agents in stock trading.
Agent Trading Arena facilitates complex economic simulations through zero-sum games, enabling assessment of LLMs' geometric and algebraic reasoning capabilities using visual and textual numerical data.
The framework incorporates a reflection module to enhance strategy refinement based on trading performance and environmental feedback, promoting continuous agent adaptation and learning in a dynamic market.

MA-GTS: A Multi-Agent Framework for Solving Complex Graph Problems in Real-World Applications

MA-GTS (Multi-Agent Graph Theory Solver): introduces multi-agent framework for solving graph problems, with Information Extraction Layer (extracts text information), Knowledge Integration Layer (constructs structured graph data), and Algorithm Execution Layer (executes algorithms).
MA-GTS framework decomposes complex graph problems through agent collaboration and maps text-based graph data into structured representations.
MA-GTS framework dynamically selects suitable algorithm based on problem constraints and graph structure scale for efficient and interpretable solution process.

7 Points to Tsinghua but 10 Points to 清华? Assessing Large Language Models in Agentic Multilingual National Bias

Academic Career Planning Advisor: introduces input prompt, LLM component, and output response for university recommendation task.
Framework evaluates LLM's score and reasoning for provided universities in multilingual context.
System aims to identify nationality bias in LLM's advisory capabilities across languages.

FACT-AUDIT: An Adaptive Multi-Agent Framework for Dynamic Fact-Checking Evaluation of Large Language Models

FACT-AUDIT (An Adaptive Multi-Agent Framework for Dynamic Fact-Checking Evaluation) introduces an agent-driven framework with Appraiser, Taxonomy, Inquirer, Prototype, Quality Inspector, Memory Pool, Evaluator, Verify Fact & Produce Justification, Target LLM, Prober, Iterative Probing, and Automatic Evaluation components for adaptive and dynamic assessment of LLMs' fact-checking capabilities.
FACT-AUDIT framework adaptively generates datasets, performs iterative evaluations, and updates assessments based on model-specific responses, incorporating justification production for comprehensive audit of LLMs' factual reasoning.
The framework leverages multi-agent collaboration and importance sampling to address limitations of static datasets and classification metrics in existing fact-checking evaluation methods, providing a more nuanced and evolving audit process.

Towards Enhanced Immersion and Agency for LLM-based Interactive Drama

Immersion-Agency Paradigm: introduces framework for LLM-based interactive drama, enhancing player Immersion and Agency through Dramatic Story Generator from Premise paragraph, Role Agents responding to Prompt, and generating Drama Script after Post-process.
This paradigm uses Playwriting-guided Generation for improved story structure and Plot-based Reflection for agent reaction refinement.
The framework aims to bridge gap in current interactive dramas by focusing on deeper emotional connections and meaningful player influence within the story.

IMPROVE: ITERATIVE MODEL PIPELINE REFINEMENT AND OPTIMIZATION LEVERAGING LLM AGENTS

IMPROVE (Iterative Model Pipeline Refinement and Optimization leveraging LLM agents): introduces a multi-agent framework with Project Architect, Data Engineer, Model Engineer, Training Engineer, and Performance Analyst agents to iteratively refine ML pipelines based on user-provided dataset and task description.
IMPROVE framework utilizes Iterative Refinement strategy, optimizing one pipeline component at a time through training and evaluation process guided by Performance Analyst feedback for stable and interpretable improvements.
IMPROVE framework aims to automate object classification pipeline development, achieving high performance without requiring ML expertise by emulating human expert iterative refinement workflow.

24th February 2025

Aligning Compound AI Systems via System-level DPO

SysDPO (System-level Direct Preference Optimization): introduces DAG, LLM, Diffusion Model, DPO Loss, and Preference Dataset to align compound AI systems.
SysDPO framework uses DAG to model compound AI system, factorizes probability, and applies DPO loss for end-to-end optimization using preference dataset.
SysDPO enables joint alignment of components like LLM and diffusion models, improving coherence and preference alignment in complex AI systems.

ARACNE: An LLM-Based Autonomous Shell Pentesting Agent

ARACNE (Autonomous LLM-based Shell Pentesting Agent): introduces a novel multi-LLM architecture for autonomous shell pentesting, comprising user, core agent, planner, interpreter, summarizer, SSH server and context components.
ARACNE separates planning and command execution using distinct LLMs, enhancing flexibility and effectiveness in cybersecurity tasks.
The framework utilizes an optional summarizer to manage context window size, offering a trade-off between accuracy and attack duration.

IGDA: Interactive Graph Discovery Agent

IGDA (Interactive Graph Discovery Agent): introduces a LLM-based pipeline for interactive graph discovery, with Edge Confidence Estimation, Edge Experiment Selection, and Local Edge Updates components.
IGDA leverages LLMs for uncertainty-driven edge selection and local graph updates based on binary feedback from experiments.
IGDA iteratively refines graph predictions by selecting uncertain edges for experiments and updating related edges based on experimental outcomes.

A Multi-LLM-Agent-Based Framework for Economic and Public Policy Analysis

MLAB (Multi-LLM-Agent-Based Framework): introduces a novel approach for economic analysis by employing multiple LLMs as heterogeneous agents representing different socio-economic groups.
MLAB framework simulates policy impacts by mapping LLMs to educational and income brackets, utilizing calibrated economic parameters for each agent group.
This framework leverages LLMs' diverse reasoning capabilities to model population heterogeneity and analyze policy responses in economic scenarios.

Graphy'our Data: Towards End-to-End Modeling, Exploring and Generating Report from Raw Data

Graphy: introduces an end-to-end platform, with Offline Scrapper, Inspection, Define Workflow, File Extractor, LLM or Rule Extractor, Fact Node, Dimension Node, Navigation, Online Surveyor, Exploration, Search, StatRefiner, GraphView, NeighborQuery, Generation, DataInfer, Mindmap Generator, Confirmed, Report Writer, and Graph Store, that automates data modeling, exploration, and report generation from raw data.
Graphy platform comprises an offline Scrapper for transforming unstructured documents into structured graphs and an online Surveyor for iterative exploration and LLM-driven report creation.
Graphy facilitates progressive document investigation by enabling users to iteratively explore, analyze, and synthesize information from large unstructured datasets to generate high-quality reports.

Leveraging Large Language Models for Effective and Explainable Multi-Agent Credit Assignment

LLM-MCA (Large Language Model Multi-agent Credit Assignment): introduces centralized LLM Critic (LLM for credit assignment), Base Prompt (LLM initial instructions), LLM Parser (extracts feedback from LLM), Individualized Feedback (per-agent reward signals), Centralized Policy Training (learns decentralized policies), Demultiplexer (splits input data for critic), Multiplexer (aggregates agent feedback), Agent Policies (decentralized agent controllers), Environment (multi-agent simulation scenario), Observations and Global Reward (environment state input), and Joint Action (agent actions output).
LLM-MCA employs centralized LLM critic with base prompt to generate individualized feedback, guiding decentralized agent policy training for effective credit assignment.
By reformulating credit assignment as pattern recognition, LLM-MCA leverages LLMs to achieve human-level credit evaluation and enhance multi-agent cooperative learning.

Grounded Persuasive Language Generation for Automated Marketing

AI Realtor: introduces agentic framework, with Grounding Module, Personalization Module, Marketing Module, ChatGPT, to automate persuasive marketing content generation.
It uses LLMs to align content with user preferences and highlight factual attributes, demonstrated in real estate marketing.
The framework achieves superhuman persuasion in experiments, outperforming human experts in real estate marketing description generation.

Multi-Agent Autonomous Driving Systems with Large Language Models: A Survey of Recent Advances

LLM-based Multi-Agent ADS Framework: introduces multi-agent system for autonomous driving, with Environment (driving context), Information (perceived data), Action (driving commands), Profile (role definition), Agent (autonomous entity), Driver Agent (vehicle control), Infrastructure Agent (external infrastructure), Shared Message Pool (communication medium), and Memory (experience storage).
This framework employs profiles to define agent functionalities, facilitating collaborative decision-making through shared message pool and memory for experience retention.
The architecture improves driving safety and efficiency in intricate scenarios by integrating separate agents for vehicle and infrastructure interaction, supported by LLM-based reasoning capabilities.

AlphaAgent: LLM-Driven Alpha Mining with Regularized Exploration to Counteract Alpha Decay

AlphaAgent (LLM-Driven Alpha Mining with Regularized Exploration to Counteract Alpha Decay): introduces autonomous framework integrating Idea Agent, Factor Agent, and Eval Agent with regularization mechanisms for decay-resistant alpha factor mining.
AlphaAgent framework employs Human Knowledge, Research Report, Market Insight, Performance Metrics, Backtest, Self-reflection, Analysis Feedback, Factor Zoo, Regularization Mechanisms, Operator Library, and Abstract Syntax Trees within closed-loop iterative refinement process.
AlphaAgent utilizes originality enforcement, hypothesis-factor alignment, and complexity control to guide alpha generation, balancing financial rationale and market adaptability for effective alpha mining.

23rd February 2025

GUARDIANS OF THE AGENTIC SYSTEM: PREVENTING MANY SHOTS JAILBREAK WITH AGENTIC SYSTEM

Evaluating Agentic Systems: introduces methodology to evaluate agentic system security, with Reverse Turing Test, Aligning Multi-Agent Systems, and Prevention of Multi-Shot Jailbreaks.
The framework employs GamoraAI, RocketAI, Star-LordAI, GrootAI, ObserverAI agents for assessing security vulnerabilities, deceptive alignment, and jailbreak defense.
This comprehensive approach aims to enhance LLM-based agentic system robustness against adversarial threats through dynamic, tool-mediated security evaluations.

RapidPen: Fully Automated IP-to-Shell Penetration Testing with LLM-based Agents

RapidPen (RapidPenetration): introduces a fully automated penetration testing framework, integrating Re Module (task planning module), Act Module (command execution module), and RapidPen-vis (visualization and reporting tool), utilizing PTT (pentesting process data model) for IP-to-Shell achievement.
RapidPen framework incorporates ReAct paradigm with specialized RAG (Retrieval-Augmented Generation) repositories, featuring Re (L1) PTT Planner (PTT expansion and maintenance), Re (L1) PTT Prioritizer (task prioritization), Re (L2) New Tasks (Success Cases) (success case based task generation), Act (L1) Command Generation (command generation using RAG), Act (L1) Command Execution (executes commands), and Act (L1) Log Analysis (analyzes command logs) modules.
The framework leverages Command Generation RAG (RAG for command generation) and Success Cases RAG (RAG for success cases) to enhance offensive security, enabling autonomous vulnerability discovery and exploitation through iterative command refinement and success case reuse.

From Text to Space: Mapping Abstract Spatial Models in LLMs during a Grid-World Navigation Task

GWSOT (Grid-World Spatial Orientation Task): introduces agent, goal, grid, spatial information representations, LLM, activations, policy maps, and performance metrics to investigate spatial understanding of language models in grid navigation.
GWSOT evaluates how different spatial information representations like cartesian, topographic, and textual formats impact LLM navigation performance and internal spatial encoding.
The framework uses performance metrics and policy maps to analyze LLM success rate, path efficiency, and spatial decision-making within the grid-world environment.

BIOMAZE: BENCHMARKING AND ENHANCING LARGE LANGUAGE MODELS FOR BIOLOGICAL PATHWAY REASONING

PATHSEEKER: introduces LLM agent for biological pathway reasoning via interactive subgraph navigation.
PATHSEEKER enhances reasoning using global subgraph search, local subgraph search, graph encoding and final reasoning on pathway database.
This method provides robust, scientifically grounded approach for complex pathway reasoning challenges.

The Hidden Strength of Disagreement: Unraveling the Consensus-Diversity Tradeoff in Adaptive Multi-Agent Systems

Dynamic Consensus-Diversity Tradeoff: introduces a framework with Receives information, Arguments, Interpret intention, and Action components, describing consensus-diversity tradeoff in multi-agent systems.
This framework contrasts implicit consensus, where agents decide independently after discussion, with explicit consensus, where agents unify actions via voting.
The framework aims to demonstrate that implicit consensus enhances robustness and adaptability in dynamic environments by preserving diversity.

All That Glitters is Not Novel: Plagiarism in AI Generated Research

SSAG (Semantic Scholar Augmented Generation): introduces plagiarism detection framework with query generation, paper retrieval, relevance scoring and similarity checking components for LLM-generated research.
SSAG framework utilizes LLMs and Semantic Scholar API to identify similar research papers and assess plagiarism in generated research proposals.
SSAG framework's evaluation reveals limitations in detecting sophisticated plagiarism within LLM-generated research documents, highlighting need for improved methods.

22nd February 2025

SMARTIFY: A MULTI-AGENT FRAMEWORK FOR AUTOMATED VULNERABILITY DETECTION AND REPAIR IN SOLIDITY AND MOVE SMART CONTRACTS

Smartify (SMARTIFY: A MULTI-AGENT FRAMEWORK FOR AUTOMATED VULNERABILITY DETECTION AND REPAIR IN SOLIDITY AND MOVE SMART CONTRACTS): introduces a multi-agent framework with Auditor, Architect, Code Generator, Refiner, and Validator components for automated smart contract vulnerability detection and repair.
Smartify leverages specialized LLMs, including LLM1 (Gemma2 9B) for initial analysis and LLM 2 (FT CodeGemma) for code generation, alongside Move RAG and Solidity RAG for language-specific context retrieval.
Smartify framework processes Code Dataset of smart contracts through its components to output Repaired Smart Contract, aiming for improved accuracy and efficiency in vulnerability remediation within blockchain landscape.

Exploring Sentiment Manipulation by LLM-Enabled Intelligent Trading Agents

Framework name here: introduces a system exploring sentiment manipulation in trading using reinforcement learning agent, with arxiv_paper_framework_2-components RL-based Trading Agent, TD3 Algorithm, Actor Network, Critic Network, Target Networks, Internal State, Environmental State, Sentiment Agent, Social Media Feed, Sentiment Analysis (RoBERTa), Sentiment Heuristic, Social Media Post Generation, Language Model (Llama 3.2), Market Simulation (ABIDES), Order Book, and Historical Data (LOBSTER).
The framework investigates how an RL-based trading agent can learn to manipulate market sentiment through generated social media posts to improve trading performance in a simulated market environment.
The study utilizes a sentiment agent that reacts to social media posts and a market simulation driven by historical order book data to evaluate the RL agent's sentiment manipulation strategies.

Reproducibility Study of Cooperation, Competition, and Maliciousness: LLM-Stakeholders Interactive Negotiation

LLM-Stakeholders Interactive Negotiation benchmark: evaluates LLM agents in negotiation games with negotiation game, LLM agents, CoT prompts, single-agent baseline, multi-agent setup, evaluation metrics, Pareto front analysis, structure leakage metric, and inequality metric.
This benchmark study reproduces and extends prior negotiation research by analyzing open-weight models and introducing fairness and confidentiality metrics.
The research highlights that single-agent baselines can achieve comparable negotiation performance to multi-agent setups, questioning communication necessity.

An Autonomous Network Orchestration Framework Integrating Large Language Models with Continual Reinforcement Learning

ARC (Autonomous Reinforcement Coordination): introduces a two-tier network orchestration framework integrating LLMs and continual RL for SemCom-enabled SAGIN, featuring RAG for data processing, HAP for hierarchical planning, SKB and DKB for knowledge storage, and RL Agents for action execution.
ARC decomposes network orchestration into high-level planning using LLM within HAP and low-level decision-making using RL Agents within Action Executioner, enhancing adaptability and efficiency through continual learning and few-shot learning.
ARC utilizes RAG to generate allocation prompts for HAP, which then employs User Sequencer to optimize user order and Action Executioner with RL agents to execute resource allocation decisions based on SKB and DKB knowledge.

Mojito: LLM-Aided Motion Instructor with Jitter-Reduced Inertial Tokens

Mojito (LLM-Aided Motion Instructor): introduces an intelligent motion agent utilizing IMU Tokenizer, Motion Tokenizer, Distribution Matching, Motion Decoder, Projection Layers, Decoder-only Transformer, LoRA Adapters, Text Tokenizer, and Qwen2-based Language Model for interactive motion capture and analysis.
Mojito employs a jitter-reduced inertial token representation and extended language model to provide real-time human motion analysis and feedback, addressing limitations of noisy IMU data.
The framework leverages VQVAE for discrete latent space learning of IMU signals and incorporates LoRA adapters for personalized feedback styles in fitness or rehabilitation scenarios.

Protecting Users From Themselves: Safeguarding Contextual Privacy in Interactions with Conversational Agents

LLM-based Conversational Agent Framework: introduces a process for contextual privacy in conversational agents, with User Prompt (Initial user input), Detection & Flagging (Identifies context and sensitive data), Determine Subject & Context (Establishes topic and setting), Detect PII and Sensitive Phrases (Finds personal and private phrases), Sensitive Space (Categorizes sensitivity level), Essential Info (Identifies necessary information), Non-Essential Info (Identifies unnecessary information), Mitigation (Applies privacy measures), Get User Approval (User confirms action), Reformulate Prompt (Rewrites user input for privacy), Get User Approval (User confirms rewritten input), and LLM-based Conversational Agent (Core agent processing input).
This framework processes user prompts to recognize context and sensitive information, subsequently providing revised prompts to users that aim to maintain original intent while minimizing out-of-context details.
The framework empowers users to make informed privacy decisions during interactions with conversational agents by identifying and reformulating contextually inappropriate information in prompts.

Echo: A Large Language Model with Temporal Episodic Memory

MADGF (Multi-Agent Data Generation Framework): introduces Characters, Plots, and Environments to simulate multi-turn dialogues for generating episodic memory training data.
MADGF framework controls dialogue scenarios between human roles and AI assistant to create context-rich episodic memory data.
MADGF framework aims to produce high-quality episodic memory data by designing diverse characters and plot-driven dialogues.

Curie: Toward Rigorous and Automated Scientific Experimentation with AI Agents

Curie: introduces AI agent framework designed for rigorous automated scientific experimentation with intra-agent rigor module, inter-agent rigor module and experiment knowledge manager.
Curie framework employs architect agent for planning and technician agents for execution, coordinated by experimental rigor module.
Curie framework aims to enhance reliability, methodical control, and interpretability in AI-driven scientific experimentation.

RAG-Enhanced Collaborative LLM Agents for Drug Discovery

CLADD (Collaborative framework of LLM Agents for Drug Discovery): introduces multi-agent framework for drug discovery question-answering, integrating Planning Team (identifies data sources), Knowledge Graph Team (retrieves KG information), Molecule Understanding Team (molecule description), and Prediction Agent (generates final answer) to leverage Annotation Database (molecular annotations source), Knowledge Graph (biomedical knowledge source), Captioning Tool (external molecule captioning) and Available Data and Tools (general resources).
CLADD framework utilizes Planning Team with MolAnn Planner (annotation database relevance) and KG Planner (knowledge graph relevance), Knowledge Graph Team with DrugRel Agent (related drug entities report) and BioRel Agent (biological relationships report), and Molecule Understanding Team with MU Agent (molecule annotation report) to provide comprehensive analysis.
CLADD framework enhances drug discovery tasks by employing collaborative agents to dynamically retrieve and integrate external knowledge, improving interpretability and flexibility without domain-specific fine-tuning.

21st February 2025

Multi-Agent Multimodal Models for Multicultural Text to Image Generation

MosAIG (Multi-Agent framework for Multicultural Image Generation): introduces multi-agent framework with Moderator, Social Agents (Country, Landmark, Age-Gender), Summarizer Agents, and Social Agents Conversation to generate Image Caption for AltDiffusion/FLUX image generation models.
MosAIG framework employs iterative Social Agents Conversation for refining culturally sensitive and contextually rich Image Caption, enhancing multicultural text-to-image generation.
MosAIG framework leverages distinct agent roles to decompose multicultural image generation task, achieving improved Alignment, Aesthetics, and Quality compared to simple models.

R³Mem: Bridging Memory Retention and Retrieval via Reversible Compression

R³Mem (Retention and Retrieval through Reversible context compression): introduces memory network optimizing information retention and retrieval through reversible context compression with Reversible Adapter, Large Language Model M, and Virtual memory token components.
R³Mem employs hierarchical compression for multi-granularity assimilation and reversible architecture integrating Context Compression and Context Expansion for duplex network.
R³Mem utilizes Virtual memory token to encode long histories and achieves state-of-the-art performance in long-context language tasks.

Self-Taught Agentic Long-Context Understanding

AgenticLU (Agentic Long-Context Understanding): introduces a framework designed for enhancing long-context question answering in LLMs, utilizing Chain-of-Clarifications (CoC) through iterative Raise Clarification Question, Find Context, and Self Clarify steps, and trained via CoC Path Distillation, SFT Dataset, Path Sampling, DPO Dataset, and Path & Neg Path Pair, starting from a base LLM and resulting in an Answer to the Long Context QA, contrasting with a Direct Answer approach.
AgenticLU framework employs a two-stage fine-tuning process involving Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to distill collected Chain-of-Clarifications (CoC) paths into a single inference pass model, improving efficiency and effectiveness.
The core innovation of AgenticLU lies in its Chain-of-Clarifications (CoC) mechanism, which enables models to iteratively refine understanding and resolve uncertainties in long contexts through self-generated questions and contextual grounding, leading to improved reasoning and answer quality.

AutoToM: Automated Bayesian Inverse Planning and Model Discovery for Open-ended Theory of Mind

AutoToM (Automated Theory of Mind): introduces automated Bayesian Theory of Mind method with Information Extraction, Initial Model Proposal, BTOM Models, Bayesian Inverse Planning, and Model Adjustment components.
AutoToM leverages Large Language Model for backend operations and iteratively refines Bayesian Theory of Mind model based on inference uncertainty.
This framework achieves state-of-the-art performance in Theory of Mind benchmarks, offering scalable, robust, and interpretable approach to machine Theory of Mind.

WorldCraft: Photo-Realistic 3D World Creation and Customization via LLM Agents

WorldCraft: introduces a system utilizing LLM agents including coordinator, Forgelt, Arrangelt, trajectory control, asset collection and renderer components to create photo-realistic 3D virtual worlds from text instructions.
WorldCraft framework employs a coordinator agent managing specialized agents for object customization, layout arrangement and scene animation based on user's natural language input.
WorldCraft enables non-professionals to create and customize complex 3D scenes with precise object geometry and PBR textures through intuitive natural language interaction.

Construction and Evaluation of LLM-based agents for Semi-Autonomous penetration testing

LLM Penetration Testing Agent: introduces a semi-autonomous penetration testing system, with Planning Module, Executor Module, Summarizer Module, RAG, Search Engines, Execution Environment, and PTT, to address limitations of LLMs in cybersecurity tasks.
The system employs multiple LLMs in modules for strategy formulation, command generation, and result analysis, leveraging RAG and search engines for knowledge integration.
This framework aims to overcome challenges in applying LLMs to penetration testing by using iterative reasoning and flexible information retrieval, reducing manual intervention.

Position: Standard Benchmarks Fail – LLM Agents Present Overlooked Risks for Financial Applications

SAEA (Safety-Aware Evaluation Agent): introduces a three-level evaluation framework, including model-level, workflow-level, and system-level audits, to assess safety risks of LLM agents in finance.
SAEA framework analyzes agent's intrinsic capabilities, multi-step process reliability, and integration robustness to identify vulnerabilities overlooked by traditional benchmarks.
The proposed SAEA framework shifts focus from raw performance to safety, robustness, and real-world resilience, addressing critical gaps in current LLM agent evaluations for high-stakes financial applications.

Pub-Guard-LLM: Detecting Fraudulent Biomedical Articles with Reliable Explanations

Pub-Guard-LLM (Large Language Model): introduces Pub-Guard-LLM, a system for detecting fraudulent biomedical articles, with Input Article, External Knowledge, Teacher Model, Pub-Guard-LLM, Vanilla, RAG, Debate, Fine-Tuning, Output, Prediction, Explanation, Relevance, and Coherence components.
Pub-Guard-LLM enhances fraud detection in biomedical research by providing reliable explanations for its predictions.
The framework offers three application modes: Vanilla Reasoning, Retrieval-Augmented Generation, and Multi-Agent Debate, to accommodate diverse user needs and improve detection performance and explainability.

Textual-to-Visual Iterative Self-Verification for Slide Generation

Iterative Self-Verification Framework (Textual-to-Visual Iterative Self-Verification Framework): decomposes slide generation into content and layout generation, using textual-to-visual self-verification for refinement.
Content generation enhances coherence using context from surrounding slides and section retrieval, while layout generation employs Reviewer + Refiner workflow.
Modality transformation visualizes textual layouts, enabling intuitive review and refinement by LLM-based Reviewer and Refiner modules for improved slide quality.

ARS: Automatic Routing Solver with Large Language Models

ARS (Automatic Routing Solver): introduces automatic routing solver framework, with pre-defined constraint examples, constraint selection, constraint checker, violation scorer, constraint handling method, initialization, optimization, final solution, local search, destroy & repair, destroy operators, repair operator, local search operators, input problem instance and termination condition.
ARS framework enhances backbone heuristic algorithm by automatically generating constraint-aware heuristics using LLM agents.
ARS framework utilizes database of VRP constraints and RAG-like approach for constraint selection to improve heuristic generation.

Auto-Bench: An Automated Benchmark for Scientific Discovery in LLMs

Auto-Bench: introduces benchmark for evaluating Large Language Models in scientific discovery, incorporating settings, prompting, Large Language Model, interventions, observations, ground-truths comparison, adjacency matrix, match-not match, and loop control components.
Auto-Bench framework evaluates LLMs' capability to discover hidden causal structures via iterative interactions and strategic interventions within chemistry and social network environments.
This benchmark leverages causal graph discovery to assess LLMs' reasoning and decision-making skills in simulated scientific exploration tasks.

The Evolving Landscape of LLM- and VLM-Integrated Reinforcement Learning

Taxonomy: introduces a framework for integrating LLMs/VLMs into RL, categorizing approaches based on three roles: agent (FM serves as policy), planner (FM generates sub-goals), and reward (FM shapes rewards).
This taxonomy further distinguishes agent roles into parametric (fine-tuning FM to generate outputs) and non-parametric (enriching prompts with context) approaches, planner roles into comprehensive (sequence of sub-goals in one pass) and incremental (sub-goals step by step) planning, and reward roles into reward model (outputs scalar reward signal) and reward function (specifies reward function code) mechanisms.
The framework helps to understand how LLMs/VLMs address RL challenges like prior knowledge, planning, and reward design, paving the way for unifying natural language and visual understanding with sequential decision-making.

Investigating the Adaptive Robustness with Knowledge Conflicts in LLM-based Multi-Agent Systems

AutoGen (Multi-Agent System framework): introduces a multi-agent programming system comprising project manager for task coordination, coders for collaborative programming, and executor for tool interaction and code execution.
This framework investigates robustness of LLM-based multi-agent systems when facing knowledge conflicts during collaborative programming tasks.
The system aims to simulate real-world collaborative programming scenarios to analyze the impact of knowledge conflicts on decision-making and system stability.

20th February 2025

GATE: Graph-based Adaptive Tool Evolution Across Diverse Tasks

GATE (Graph-based Adaptive Tool Evolution): introduces an adaptive framework for dynamic construction and evolution of hierarchical graph of reusable tools across scenarios, utilizing Task Solver, Tool Manager, Adaptive Tool Graph, Graphrank Retrieval, Tool Requirement, Tool Generation, Tool Creation, Tool Merging, Self-Check, Tool Graph Update, Basic Tools, Composed Tools, Node, Edge, Adjacency Matrix, Graphrank Algorithm, Pruning, Online Learning, Training Stage and Testing Stage.
GATE framework employs two interacting agents, Task Solver and Tool Manager, with Adaptive Tool Graph to dynamically manage and evolve toolset, addressing tool redundancy and limited generalizability in existing methods.
The framework leverages Graphrank Retrieval for efficient tool discovery and incorporates Self-Check and Tool Merging to ensure tool quality and conciseness, achieving state-of-the-art performance across diverse tasks including open-ended and closed-ended scenarios.

Is Safety Standard Same for Everyone? User-Specific Safety Evaluation of Large Language Models

U-SAFEBENCH (User-Specific Safety Benchmark): introduces benchmark, with LLM Agent (generates response considering user profile) and LLM-as-a-Judge (evaluates response safety and refusal), to evaluate user-specific safety of Large Language Models.
U-SAFEBENCH assesses if LLM Agent response is user-specific unsafe response based on user profile and instruction.
U-SAFEBENCH employs LLM-as-a-Judge to classify LLM Agent response as either refusal or fulfillment regarding user instruction.

Red-Teaming LLM Multi-Agent Systems via Communication Attacks

AiTM (Agent-in-the-Middle) introduces communication attack framework for LLM Multi-Agent Systems, which includes Benign Agent (system participant), Malicious Agent (harmful actor), Agent-in-the-Middle (message interceptor manipulator), Adversarial Input (malicious agent data), Communication Channel (message pathway), Messages (agent information exchange), Reflection Mechanism (adversarial self-improvement), and Victim Agent (targeted agent).
The framework evaluates vulnerability by employing Agent-in-the-Middle to intercept Messages within Communication Channel to manipulate Victim Agent, contrasting with Malicious Agent and Adversarial Input attacks targeting individual agents.
AiTM leverages Reflection Mechanism in Agent-in-the-Middle to refine adversarial strategies based on intercepted Messages, highlighting critical security concerns in inter-agent communication within LLM Multi-Agent Systems.

Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation

CoSyn (Code Guided Synthetic data generation system): introduces a framework for generating text-rich multimodal data using topic generation, data generation, code generation, rendering tools, and instruction generation.
CoSyn leverages text-only LLMs to generate code for rendering synthetic images and textual instructions for vision-language model training.
The framework addresses the scarcity of diverse text-rich vision-language data for improving VLMs in understanding text-rich images.

Optimizing Model Selection for Compound AI Systems

LLMSELECTOR (LLMSELECTOR): introduces Input, Module Nominator, Model Updater, and Output components for efficient model selection in compound AI systems.
LLMSELECTOR iteratively nominates modules and uses module-wise performance estimation to allocate the best-performing model to each module.
This framework achieves high-quality model allocation for compound AI systems, outperforming single-LLM allocation strategies.

A Multi-Agent Perspective on Modern Information Retrieval

Multi-Agent Perspective on Modern Information Retrieval: introduces query agent, document agent, and ranker agent to analyze modern information retrieval through agent interactions.
This perspective addresses complexities arising from automated query and document generation impacting retrieval paradigms.
The framework emphasizes revisiting classical IR evaluation and modeling for effective multi-agent retrieval systems.

Tree-of-Debate: Multi-Persona Debate Trees Elicit Critical Thinking for Scientific Comparative Analysis

TOD (Tree-of-Debate): introduces a framework for comparative scientific paper analysis using paper personas, moderator-guided debate tree construction, self-deliberation, debate rounds, expansion determination, debate synthesis, retrieval embedding model and evidence pool.
TOD dynamically builds a debate tree to analyze novelty arguments by converting papers into debating personas and facilitating structured critical reasoning.
The framework employs iterative retrieval and multi-persona debates to generate fine-grained contrastive summaries of scientific literature, aiding researchers in literature review.

Multi-Agent Coordination across Diverse Applications: A Survey

Unified Framework: introduces iterative process for sequential decision-making in multi-agent coordination, consisting of Evaluate System-level Goal, Who to Coordinate with, and How to Coordinate components.
Unified Framework: addresses coordination by evaluating system performance, determining agent clusters based on interdependencies, and updating decisions using appropriate methodologies.
Unified Framework: provides a structured perspective on coordination, applicable across diverse multi-agent system applications by breaking down the coordination process into key decision points.

I-MCTS: Enhancing Agentic AutoML via Introspective Monte Carlo Tree Search

I-MCTS (Introspective Monte Carlo Tree Search): introduces agentic AutoML framework, incorporating I-MCTS search module, LLM agent experiment executor, introspective node expansion, and hybrid reward mechanism.
I-MCTS enhances search quality and efficiency by introspectively expanding nodes and adaptively blending LLM-estimated and empirical rewards.
The introspective node expansion leverages parent and sibling node analysis for continuous refinement, addressing limitations of scalar feedback and static search spaces in AutoML.

InstructAgent: Building User Controllable Recommender via LLM Agent

InstructAgent and Instruct² Agent: introduce user-agent-platform paradigm for recommendation, featuring Parser for instruction understanding, Reranker for recommendation adjustment, Self-reflection Mechanism for output verification, External Knowledge for external data access, Internal Knowledge for instruction based knowledge, Static Memory for historical user interactions, Dynamic Memory for adaptive user representation, Extractor for interest extraction and Profile Generator for profile creation.
InstructAgent employs static memory and instruction parsing for reranking recommendations, whereas Instruct² Agent enhances personalization through dynamic memory and profile learning from user feedback.
The framework aims to enhance user control in recommendation systems and mitigate issues like echo chambers and biases against less-active users by acting as a protective shield between users and platforms.

Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents

Vending-Bench Framework: introduces agent architecture with main agent, sub-agent, memory tools, context management, and task-specific tools for vending machine operation benchmark.
The framework uses main agent for decision making and sub-agent to interact with simulated vending machine environment.
Memory tools and context management address LLM's memory limitations for long-term coherence evaluation.

Plan-over-Graph: Towards Parallelable LLM Agent Schedule

Plan-over-Graph: introduces a novel paradigm for parallel LLM agent scheduling, incorporating RAGS, tree-based random graph generation, annotated data, goal definition, initial source specification, textual query generation, SFT, DPO, graph extraction, plan generation, and executable task schedule.
This framework decomposes textual tasks into graph structures, enabling parallel execution planning and enhancing efficiency for complex tasks.
The plan-over-graph approach addresses limitations in existing sequential planning methods by leveraging graph representations for improved scalability and performance in LLM agents.

CORBA: Contagious Recursive Blocking Attacks on Multi-Agent Systems Based on Large Language Models

CORBA (Contagious Recursive Blocking Attacks): introduces a novel attack paradigm against LLM-MAS (Large Language Model-based Multi-Agent System) by leveraging CORBA Prompt to initiate Attack Propagation across the Topology of Agents, ultimately leading to Blocking State and system unavailability.
CORBA exploits contagious and recursive properties to propagate blocking state through LLM-MAS network, causing resource depletion and availability degradation.
The attack's effectiveness is demonstrated across various LLM-MAS frameworks and topologies, highlighting security vulnerabilities in current multi-agent systems.

MLGYM: A New Framework and Benchmark for Advancing AI Research Agents

MLGYM (Meta MLGYM): introduces a framework for developing and evaluating LLM agents in AI research tasks, comprising Agent, Environment, and Computer components.
MLGYM framework utilizes a Gymnasium Environment to integrate diverse AI research tasks, enabling agent interaction through actions and feedback within a controlled setting.
The framework provides components like Tool Docs, Task Description, Prompts, Models for Agent; Tools, Data, Code, Requirements for Environment; and Shell, File System for Computer, facilitating comprehensive AI research agent evaluation.

Enhancing Language Multi-Agent Learning with Multi-Agent Credit Re-Assignment for Interactive Environment Generalization

CollabUIAgents: introduces multi-agent reinforcement learning framework, with Agents, Critic Agent, Adversarial Agent, Reward Matrix, Action Matrix, Preference Optimization, Actions Rolling Out, System Update, Group Initialization, Multi-Agent Reinforcement Learning, Agentic Fine-Tuning, Curriculum Learning, Data Collection, Base Model, Base UIAgent, Environment, Observation, Reward, and Action, for enhancing generalization in interactive environments.
CollabUIAgents framework employs novel credit re-assignment strategy using LLM-based critic and preference learning to foster collaborative behaviors and improve generalization.
The framework achieves state-of-the-art performance in mobile and web UI interaction tasks, demonstrating effectiveness of credit re-assignment and preference optimization for multi-agent learning.

FLOWAGENT: Achieving Compliance and Flexibility for Workflow Agents

FLOWAGENT (FLOWAGENT): introduces a novel agent framework for workflow management, incorporating PDL, Controllers, DAG of node dependency, API node, Answer node, OOW node, Pre-decision controllers, Post-decision controllers, Conversation history, User, Bot agent, System, Workflow, Output Action, and Output System response, to achieve both compliance and flexibility.
FLOWAGENT framework utilizes Procedure Description Language (PDL) to define workflows and employs controllers for managing agent behavior, dynamically balancing compliance and flexibility when handling user interactions and unexpected queries.
The framework architecture includes pre- and post-decision controllers that guide and validate agent actions based on PDL-defined workflows, ensuring both structured execution and responsiveness to dynamic interactions.

ChemHTS: Hierarchical Tool Stacking for Enhancing Chemical Agents

ChemHTS (Chemical Hierarchical Tool Stacking): introduces a method that optimizes tool invocation pathways through hierarchical stacking strategy.
ChemHTS comprises Self-Stacking Warmup (individual tool warmup) and Multi-Layer Optimization (hierarchical path optimization) stages, enabling dynamic refinement of tool usage.
This framework addresses limitations in tool-augmented Large Language Models by facilitating effective collaboration among diverse tools and minimizing tool invocation errors.

Beyond Self-Talk: A Communication-Centric Survey of LLM-Based Multi-Agent Systems

Communication-Centric Framework: introduces a communication-centric perspective on LLM-based multi-agent systems, with Communication Architecture, Communication Goal, Communication Strategy, Communication Paradigm, Communication Object, and Communication Content components, where the framework analyzes system-level and internal communication elements in LLM-MAS workflows.
Communication-Centric Framework decomposes LLM-MAS workflow based on communication, categorizing system-level aspects like agent organization and goals, and internal aspects like strategies and message handling.
Communication-Centric Framework provides a structured approach to understand and analyze the communication dynamics within LLM-MAS, offering insights into design and optimization for diverse applications.

STeCa: Step-level Trajectory Calibration for LLM Agent Learning

STeCa (Step-Level Trajectory Calibration): introduces a framework for LLM agent learning with Deviated Action Detection, MC Step Reward, Expert Sub-trajectory, Reflection, Reflective Thought, Calibration Trajectory Construction, Calibrated Trajectory, Expert Trajectory, Successful Data, Reinforced Training, and Calibration Data, to enable step-level trajectory calibration for mitigating suboptimal actions.
STeCa framework utilizes step-level reward comparison and LLM-driven reflection to construct calibrated trajectories from explored trajectories with detected deviations, which are then used with successful trajectories for reinforced training.
The framework aims to improve LLM agent's decision-making in long-horizon tasks by addressing early-stage deviations through timely calibration, enhancing robustness and reducing error accumulation.

MEM2EGO: EMPOWERING VISION-LANGUAGE MODELS WITH GLOBAL-TO-EGO MEMORY FOR LONG-HORIZON EMBODIED NAVIGATION

MEM2EGO (Memory-to-Egocentric): introduces a VLM-based navigation framework, integrating Observation, Memory Mapping, Memory Augmented Observation, Landmark Memory Update, and Metric Map Memory, for enhanced embodied agent navigation.
MEM2EGO framework adaptively retrieves task-relevant cues from global memory, encompassing Frontier Map, Landmark Semantic Memory, and Visitation Memory, and dynamically aligns global context with local perception for improved spatial reasoning.
MEM2EGO enhances agent's navigation in complex environments by maintaining three distinct memory types and projecting cues onto egocentric images to guide goal location prediction and decision-making.

Enhancing Conversational Agents with Theory of Mind: Aligning Beliefs, Desires, and Intentions for Human-Like Interaction

LatentQA interpretability pipeline (LatentQA): introduces Target Model (analyzed language model) to process Dialogue (input conversation text) and answer ToM Question (query about mental states) using Decoder Model (extracts ToM information) to produce ToM Answer (inferred mental state) with Feedback (gradient for training-steering) for generating Aligned Response (ToM-steered output) instead of Generated Answer (model's initial output), involving ToM Inference (inferring mental states), ToM Feedback (feedback on ToM inference), Steering Inference (ToM-based model steering), and Steering Feedback (feedback on steering).
LatentQA pipeline employs Decoder Model to extract Theory of Mind (ToM) related information from Target Model's internal representations based on Dialogue and ToM Questions, utilizing Feedback mechanisms for both ToM inference and model steering to achieve improved response alignment.
The framework aims to enhance conversational agents by incorporating Theory of Mind (ToM) principles, leveraging LatentQA to interpret and manipulate model's latent representations for generating more human-like and aligned responses through explicit consideration of beliefs, desires, and intentions.

19th February 2025

Investigating Non-Transitivity in LLM-as-a-Judge

SWIM (Swiss-Wise Iterative Matchmaking): introduces User Instruction, Response of A, Response of B, Response of C, Judge Evaluation, Round Robin Tournament, Bradley Terry, Elo Score, and SWIM Tournament components for evaluating LLMs by addressing non-transitivity in pairwise comparisons using efficient tournament approach.
SWIM framework employs round-robin tournaments and Bradley-Terry model to produce reliable model rankings, mitigating sensitivity to baseline choice in LLM evaluation.
SWIM tournament enhances computational efficiency of round-robin evaluations while maintaining robustness and alignment with human evaluations by dynamic model matching.

Autellix: An Efficient Serving Engine for LLM Agents as General Programs

Autellix: introduces an efficient serving engine for LLM agents, incorporating Process Table (tracks program metadata), Load Balancer (distributes LLM calls), LLM Engine (processes LLM calls) with Scheduler (schedules LLM calls), Priority Function (determines call priority), Memory Manager (manages engine memory), KV Cache (stores key-value pairs), and Model Executor (executes LLM model).
Autellix leverages program-level statistics and discretized priority queues to minimize head-of-line blocking and improve throughput for agentic programs with dynamic execution workflows.
The system employs a stateful API and data locality-aware load balancing to enhance KV-cache reuse and reduce latency in multi-engine LLM serving environments.

RAG-Gym: Optimizing Reasoning and Search Agents with Process Supervision

RAG-Gym (Retrieval-Augmented Generation Gymnasium): introduces unified framework optimizing agentic RAG through process supervision with inner and outer Markov Decision Processes.
RAG-Gym: formulates knowledge-intensive question answering as nested Markov Decision Process, incorporating diverse agent architectures and process supervision methods.
RAG-Gym: enhances information-seeking agents by fine-grained process supervision at each search step, utilizing process reward data for optimization.

Qwen2.5-VL Technical Report

Qwen2.5-VL (Qwen2.5 Vision-Language): introduces vision-language framework integrating vision encoder, vision-language merger and language model decoder for processing multimodal inputs like images and videos.
Qwen2.5-VL framework's vision encoder utilizes native resolution input, dynamic FPS sampling, MROPE, window attention and full attention to process visual data efficiently before merging with text embeddings.
This architecture enables Qwen2.5-VL to achieve advancements in visual recognition, document parsing and long-video comprehension, while maintaining computational efficiency through window attention and dynamic resolution processing.

Exploring Personalized Health Support through Data-Driven, Theory-Guided LLMs: A Case Study in Sleep Health

HEALTHGURU: introduces multi-agent framework for personalized health support, integrating behavior change technique theory, wearable data, context data, activity recommendation model, user message, agent coordinators, data insight agent, recommendation agent, response agent, and chat history.
HEALTHGURU: is LLM-powered chatbot providing data-driven theory-guided sleep health support using contextual multi-armed bandit model for adaptive recommendations.
HEALTHGURU: enhances user engagement motivation for behavior change through personalized context-aware recommendations delivered via natural conversation.

DataSciBench: An LLM Agent Benchmark for Data Science

DataSciBench (DSB): introduces a benchmark for data science LLM evaluation, with Prompt Definition and Collection-component, Response Integration and Validation-component, and LLM evaluation-component, utilizing Task-Function-Code framework for assessment.
DataSciBench framework employs Directed Acyclic Graph to manage task dependencies and Programmatic Rules for consistent code evaluation, ensuring comprehensive LLM performance analysis in data science tasks.
DataSciBench benchmark includes Aggregate Functions and Test Cases with Ground Truth to provide detailed and reliable evaluation metrics for diverse data science challenges, addressing limitations of existing benchmarks.

Enhancing Cross-Domain Recommendations with Memory-Optimized LLM-Based User Agents

AgentCF++ (Agent Collaborative Filtering Plus Plus): introduces user- and item-agents with domain-separated-, domain-fused-, group-shared-, and item-memories, and interest groups to enhance cross-domain recommendations by refining user behavior simulation.
AgentCF++ employs dual-layer memory architecture with domain-separated and domain-fused memories and interest groups with group-shared memory to capture popularity influence and domain-specific preferences.
The framework utilizes a two-step fusion mechanism to integrate cross-domain knowledge and reflection mechanism for memory updates, improving the accuracy of user behavior simulation in recommender systems.

From Correctness to Comprehension: AI Agents for Personalized Error Diagnosis in Education

MathCCS (Mathematical Classification and Constructive Suggestions) Benchmark: introduces multi-modal benchmark with real-world problems, student data, and expert annotations for error analysis and feedback.
MathCCS benchmark incorporates real-world problems, unique student IDs with timestamps, and expert-defined error categorization with suggestions.
MathCCS benchmark facilitates systematic error analysis and personalized feedback in AI-driven education by capturing real student learning complexities.

AI Software Engineer: Programming with Trust

Framework components: introduces key elements of LLM agents for software engineering, including LLMs as back-ends (computation engines), interaction with software tools (tool utilization), autonomy (agent independence), and guardrails (security and validation).
These components define the capabilities and trust mechanisms considered essential for deploying AI software engineers in practical software development workflows.
The paper argues for the importance of trust in AI-generated code and proposes agentic capabilities to enhance trustworthiness in automated programming.

An LLM-based Agent for Reliable Docker Environment Configuration

Repo2Run (LLM-based Agent for Reliable Docker Environment Configuration): introduces automated Docker environment configuration, with external environment, internal environment, Dockerfile generator, rollback mechanism, event stream, environment monitoring, dependency installation, code editing, test running, bash commands, dependency management, result processor, action-observation interaction, event history, finished commands, conflict list, and Dockerfile action.
Repo2Run utilizes dual-environment architecture for atomic configuration synthesis, ensuring reliable Dockerfile generation and preventing environment pollution through rollback.
Repo2Run's atomic configuration synthesis and Dockerfile generator address challenges in automated environment setup, achieving high success rate in configuring Python repositories.

STaR-SQL: Self-Taught Reasoner for Text-to-SQL

STaR-SQL (Self-Taught Reasoner for Text-to-SQL): introduces reasoning-driven approach for text-to-SQL, utilizing Question, Schema, Rationale Generation, Finetune, Scale up test-time compute, Outcome-supervised Reward Model, Test-time Verification, and Difficulty-based Resample components.
STaR-SQL framework employs rationale generation and outcome supervision to enhance text-to-SQL performance by iteratively refining rationales and verifying SQL query correctness.
The framework leverages increased test-time computation and difficulty-based resampling to improve accuracy and robustness for complex text-to-SQL tasks.

OpenSearch-SQL: Enhancing Text-to-SQL with Dynamic Few-shot and Consistency Alignment

OpenSearch-SQL: introduces a multi-agent framework for Text-to-SQL, incorporating Preprocessing, Extraction, Generation, Refinement, and Alignment Module with Agent Alignment, Function Alignment, Style Alignment, Correction, and Self-consistency & vote components.
This framework uses a consistency alignment mechanism to reduce hallucination and improve information flow between agents during the Text-to-SQL process, leveraging Vector Database and Few-shot examples.
The method achieves state-of-the-art performance by dynamically adjusting few-shot examples and employing a SQL-Like intermediate language within a structured Chain-of-Thought approach, enhancing both effectiveness and efficiency without fine-tuning.

Beyond Single-Value Metrics: Evaluating and Enhancing LLM Unlearning with Cognitive Diagnosis

UNCD (UNlearning evaluation using Cognitive Diagnosis): introduces UNCD, a framework for fine-grained LLM unlearning evaluation, with Unlearning Process, QA Eval, UNCD Eval, Base LLM, LLM+GA, LLM+NPO, Precise Diagnosis, Training-free Diagnosis, CDM, Knowledge States, Unlearn Set, Eval Set, Knowledge Concepts, Forget KC, Retain KC, Expert check, Question generation, Scoring, Processing, Raw data, and UNCD-Agent.
UNCD leverages Cognitive Diagnosis Modeling for detailed assessment of harmful knowledge removal and introduces UNCD-Cyber benchmark for cybersecurity domain.
UNCD-Agent enhances unlearning by diagnosing knowledge remnants and generating targeted unlearning data, improving removal of harmful LLM abilities.

MCTS-KBQA: Monte Carlo Tree Search for Knowledge Base Question Answering

MCTS-KBQA (Monte Carlo Tree Search for Knowledge Base Question Answering): introduces MCTS methodology to KBQA domain, enhancing LLM reasoning with selection, expansion, evaluation, backpropagation, and termination steps.
This framework uses LLM agent interacting with database environment, guided by step-wise reward mechanism and prompts, to perform knowledge base question answering.
MCTS-KBQA achieves improved performance over linear methods by exploring multiple reasoning paths and evaluating intermediate steps within the search tree.

18th February 2025

Towards an AI co-scientist

AI co-scientist: introduces a multi-agent system designed to augment scientific discovery by generating, debating, and evolving research hypotheses, utilizing Scientist inputs, Research plan configuration, Generation agent, Reflection agent, Ranking agent, Evolution agent, Proximity agent, Meta-review agent, Tool Use, Memory, and Supervisor agent components.
AI co-scientist employs a generate, debate, and evolve approach inspired by the scientific method, leveraging specialized agents for literature exploration, hypothesis review, ranking via tournaments, and iterative refinement, all orchestrated by a Supervisor agent and supported by Memory and Tool Use.
AI co-scientist framework facilitates flexible compute scaling and iterative improvement of hypothesis quality through a self-improving loop enabled by feedback from tournament-based ranking and meta-review, aiming to accelerate scientific discovery in biomedicine and beyond.

AIDE: AI-Driven Exploration in the Space of Code

AIDE (AI-Driven Exploration): introduces an agent for machine learning engineering, with Solution Tree, Coding Operator, Evaluator, Search Policy, and Summarization Operator, automating trial-and-error via tree search in code space.
AIDE employs a tree structure to organize historical solutions and uses a coding operator to propose improvements based on tree nodes, guided by automated evaluations.
By strategically reusing and refining solutions within its framework, AIDE trades computational resources for enhanced performance on machine learning engineering benchmarks.

Facilitating Long Context Understanding via Supervised Chain-of-Thought Reasoning

PAI (Property-driven Agentic Inference): introduces three-stage framework with property extraction, retrieval, and summarization agents for generating reasoning-augmented answers in long-context question answering.
PAI framework simulates human-like reasoning by decomposing queries, retrieving relevant information, and synthesizing conclusions to facilitate long-context understanding.
PAI framework enhances long-context question answering by incorporating chain-of-thought reasoning and improving model performance on complex tasks.

TEXT2WORLD: Benchmarking Large Language Models for Symbolic World Model Generation

TEXT2WORLD: introduces benchmark for evaluating LLMs in symbolic world model generation, with Automatic Generation, Automatic Correction, Syntax Parser, World Model, Executor, and Multi-criteria Evaluation components.
TEXT2WORLD benchmark employs PDDL and execution-based metrics to address limitations in prior world model evaluations, emphasizing domain diversity and evaluation robustness.
TEXT2WORLD enables detailed analysis of LLM world modeling performance via component-wise F1 scores and error analysis, aiming to foster advancements within the field.

LLM TRADING: ANALYSIS OF LLM AGENT BEHAVIOR IN EXPERIMENTAL ASSET MARKETS

LLM Trading (Large Language Model Trading): introduces experimental framework with agent, order submission, price forecasting, memory, market, and environment components for analyzing LLM behavior in asset markets.
This framework investigates LLM agents' trading strategies and market dynamics in simulated financial markets, comparing their behavior to human participants.
The study focuses on evaluating LLMs' rationality and ability to replicate human-driven market phenomena like bubbles and crashes within controlled experimental settings.

Training Turn-by-Turn Verifiers for Dialogue Tutoring Agents: The Curious Case of LLMs as Your Coding Tutors

TRAVER (Trace-and-Verify): introduces agent workflow with knowledge tracing for student state estimation, utterance generation for tutor messages, and verifier for response quality assessment.
TRAVER leverages turn-by-turn verification and knowledge tracing to guide students in coding tasks through dialogue.
The framework aims to improve tutoring effectiveness by adapting guidance based on student knowledge and utterance quality.

Demonstrating specification gaming in reasoning models

ReAct-like harness: introduces observe, orient, decide, and act phases alongside memory, plan, and subgoal components for LLM agent to interact with environment.
The framework employs observe phase to process command outputs, orient phase to update strategic plan, decide phase to select tactical subgoal, and act phase to generate shell commands for task execution.
Memory, plan, and subgoal components maintain agent state, enabling iterative refinement of actions based on observed outcomes within the environment.

OCCULT: Evaluating Large Language Models for Offensive Cyber Operation Capabilities

OCCULT (Offensive Cyber Operation Lightweight operational evaluation framework): introduces a methodology to evaluate LLMs for Offensive Cyber Operations, structured around LLM Use Case, OCO Capability Areas, and Reasoning Power components.
OCCULT framework facilitates rigorous and repeatable evaluations to quantify cyber security risks associated with employing LLMs in offensive cyber operations.
OCCULT methodology aims to standardize LLM testing in OCO domain, enabling better comparisons across different models and evaluation approaches.

Grounding LLM Reasoning with Knowledge Graphs

Framework for Grounding LLM Reasoning with Knowledge Graphs: introduces agent and automatic graph exploration approaches for question answering using knowledge graphs, incorporating components like RetrieveNode, NeighborCheck, Entities and Triples.
Agent approach employs predefined actions such as RetrieveNode and NeighborCheck for targeted KG interaction, while automatic exploration utilizes extracted Entities and Triples to navigate the knowledge graph.
The framework evaluates Chain-of-Thought, Tree-of-Thought, and Graph-of-Thought reasoning strategies within both Agent and Automatic Graph Exploration approaches to enhance question answering performance on knowledge graphs.

Interactive Agents to Overcome Ambiguity in Software Engineering

OpenHands framework (OpenHands): introduces interactive environment for LLM Agent, enabling structured code refinement, task planning, and command execution using integrated tools within secure sandbox.
OpenHands framework: facilitates iterative code improvement through file editing, script execution, and error analysis within controlled environment.
OpenHands framework: leverages User Proxy to simulate realistic interactions, allowing agent to gather necessary context and improve performance in ambiguous software engineering tasks.

AEIA-MN: Evaluating the Robustness of Multimodal LLM-Powered Mobile Agents Against Active Environmental Injection Attacks

AEIA-MN (Active Environment Injection Attack - Mobile Notifications): introduces active environment injection attack scheme, with Perception Stage, Reasoning Stage, Action Stage, System, State and Action components, to evaluate MLLM-based agents robustness.
AEIA-MN leverages mobile notifications to perform attacks by disrupting agent decision-making through environmental manipulation.
The framework includes Adversarial Attack, Reasoning Gap Attack, and Combinatorial Attack strategies to comprehensively evaluate agent robustness against active injection attacks.

Towards a Design Guideline for RPA Evaluation: A Survey of Large Language Model-Based Role-Playing Agents

RPA evaluation design guideline: introduces agent attributes, agent-oriented metrics, task attributes, and task-oriented metrics for systematic RPA evaluation.
The guideline proposes a two-step process: first, decide agent-oriented metrics based on agent attributes, and second, decide task-oriented metrics based on task attributes.
This guideline aims to enhance the reliability and consistency of RPA evaluation by linking evaluation metrics to agent and task attributes.

You need to MIMIC to get FAME: Solving Meeting Transcript Scarcity with a Multi-Agent Conversations

MIMIC (Multi-agent IMItation of Conversations): introduces a multi-agent meeting synthesis framework that uses Knowledge Source, Content Brainstorming, Casting, Scriptwriting, Filming, Quality Assuring, Special Effects, and Editing to generate Meeting Transcript.
MIMIC framework employs pre-production, production, and post-production stages to orchestrate psychologically grounded agents debating turn-by-turn, refining outputs to ensure coherent and credible dialogues.
The modular architecture of MIMIC allows for scalable generation of meeting transcripts, addressing data scarcity for training and testing meeting summarization systems.

Flow-of-Options: Diversified and Improved LLM Reasoning by Thinking Through Options

FoO (Flow-of-Options): introduces an agentic framework for automated machine learning tasks, incorporating Input task, Planner, Option Generator, Flow-of-Options, Plan Executor, Update, CODE-RAW, Case-Based Reasoning, Case Bank, Retrieve and adapt, Walk Generation with Consistency Checker, Update Values, and Update Case Bank components.
This framework leverages a network data structure to systematically explore diverse reasoning paths by enumerating options at each step of a task plan, enhancing Large Language Model performance in solving complex problems.
The approach integrates case-based reasoning for long-term memory and solution reuse, improving efficiency and overcoming biases inherent in Large Language Models for automated machine learning workflows.

SEFL: Harnessing Large Language Model Agents to Improve Educational Feedback Systems

SEFL (Synthetic Educational Feedback Loops): introduces Agent Framework with Teacher (LLM creating assignments), Student (LLM completing assignments with errors), Fineweb-Edu (assignment text source), Synthetic Instruction-Tuning Data (generated interaction data), fine-tuned LLM (feedback model), and Output Evaluation (performance measurement process) for improving educational feedback systems.
SEFL framework leverages two LLMs in Teacher and Student roles to simulate formative feedback workflows and generate synthetic data for fine-tuning smaller feedback LLMs.
This synthetic data generation and fine-tuning pipeline enables scalable and effective educational feedback systems, addressing real-world data scarcity challenges.

Towards more Contextual Agents: An extractor-Generator Optimization Framework

Extractor-Generator Framework: introduces a two-stage approach with feature extraction and prompt generation to optimize prompts for contextual LLM-based agents using input-output dataset, feature extraction, prompt component generation and performance evaluation.
The framework extracts contextual features from gold-standard input-output pairs and generates prompt components iteratively refining them through self-improvement techniques and performance evaluation.
This automated optimization process enhances the adaptability and reliability of LLM agents in context-specific tasks by improving generalization and reducing error propagation.

Knapsack Optimization-based Schema Linking for LLM-based Text-to-SQL Generation

KaSLA (Knapsack optimization-based Schema Linking Agent): introduces a plug-in schema linking agent, with Hierarchical Linking Strategy, Table linking, Column linking, Knapsack optimization-based schema linking, Binary scoring function, Probabilistic scoring function, Relevance score, Redundancy score, Redundancy tolerance, Linking, Dynamic Programming, Training dataset, and Query, designed to prevent missing relevant schema elements and minimize redundant ones.
KaSLA employs hierarchical linking strategy, initially linking tables and subsequently columns, utilizing knapsack optimization with binary-probabilistic scoring functions and dynamic programming to select relevant schema elements under redundancy tolerance.
The framework enhances text-to-SQL models by replacing schema linking processes, improving SQL generation accuracy through optimized schema linking and reduced missing or redundant information.

Fraud-R1 : A Multi-Round Benchmark for Assessing the Robustness of LLM Against Augmented Fraud and Phishing Inducements

Fraud-R1: introduces a multi-round evaluation framework with Helpful Assistant, Role-play, LLM Judge, and Defense Status Judgement components.
Fraud-R1 assesses LLM robustness against fraud using Defense Success, Defense Failure, and Need More Information statuses.
Fraud-R1 framework evaluates LLMs in different settings to identify challenges in fraud defense.

Continuous Learning Conversational AI: A Personalized Agent Framework via A2C Reinforcement Learning

CLCA (Continuous Learning Conversational AI): presents an A2C reinforcement learning framework for personalized conversational agents, integrating synthetic data generation, RL environment design, A2C agent training, and A2C-guided response selection.
CLCA framework employs a simulated RL environment with state space representing dialogue context, action space controlling dialogue metrics, and reward function guiding A2C agent to learn personalized dialogue strategies.
This A2C-driven CLCA method advances beyond static LLMs by enabling continuous learning and personalization through synthetic data and RL, creating dynamically adaptive AI companions.

Benchmarking Automatic Speech Recognition coupled LLM Modules for Medical Diagnostics

Framework name here: introduces a two-stage system for medical diagnosis from speech, with audio preprocessing, speech recognition, and LLM-based diagnosis classification, utilizing medical speech database.
The system employs audio preprocessing with denoising and equalization to enhance audio quality before ASR and uses LLM for context-aware medical diagnosis from transcribed speech.
The framework is designed to improve robustness in noisy medical call recordings and leverage LLMs for accurate medical diagnosis from patient speech.

Towards Adaptive Feedback with AI: Comparing the Feedback Quality of LLMs and Teachers on Experimentation Protocols

LLM feedback agent (Large Language Model feedback agent): presents a system generating student feedback on experiment protocols, utilizing Feed Up, Feed Back, Feed Forward feedback types, assessed by Constructive Tone, Linguistic Clarity, Technical Terminology criteria.
The study compares LLM feedback against teacher and expert feedback, revealing similar overall quality yet LLM agent's limitations in Feed Back error identification.
Findings indicate LLMs' capability for efficient educational feedback, underscoring the necessity for enhanced contextual understanding in error-specific feedback generation.

An LLM-Powered Agent for Physiological Data Analysis: A Case Study on PPG-based Heart Rate Estimation

OpenCHA Framework: introduces an LLM-powered agent for physiological data analysis, with Interface, Orchestrator, External Sources, Response Generator, Task Planner, Task Executor, Data Pipe, PPG Processing Pipeline, AI and Analysis Models, Wearable PPG Data and User Data Sources components, aiming to integrate LLMs with analytical tools for health insights.
The framework utilizes an orchestrator to coordinate user interaction, data retrieval, and analytical processing, leveraging external sources for data and AI models to generate accurate health assessments.
The agent's architecture is designed for modularity and adaptability, enabling integration of various data sources and analytical tools for diverse physiological data analysis tasks.

R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs

R2-KG (General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs): introduces dual-agent framework with Operator, Supervisor, KG, Iteration limit, Feedback, Question, Answer, Abstention for reliable KG reasoning by separating evidence gathering and judgment roles.
It employs Operator (low-capacity LLM evidence gatherer) and Supervisor (high-capacity LLM judgment maker) to enhance cost-efficiency and reliability.
R2-KG incorporates Abstention mechanism to avoid answering when evidence is insufficient, improving trustworthiness.

Multi-Novelty:Improve the Diversity and Novelty of Contents Generated by Large Language Models via inference-time Multi-Views Brainstorming

Multi-Novelty: introduces inference-time multi-view brainstorming method with Input Prompt, Multi-view Embedding, LLMs, Generated Answers and DNC Framework components to enhance diversity and novelty of generated contents.
Multi-view Embedding component incorporates Text views and Image views to enrich input prompts by generating diverse perspectives from textual and visual sources, which are then processed by LLMs to produce varied responses.
DNC Framework evaluates Generated Answers using diversity, novelty, and correctness metrics, demonstrating the effectiveness of Multi-Novelty in improving LLM outputs without architectural changes and across different models.

Perovskite-LLM: Knowledge-Enhanced Large Language Models for Perovskite Solar Cell Research

Perovskite-LLM (Knowledge-Enhanced Large Language Models for Perovskite Solar Cell Research): introduces Perovskite-KG, a domain-specific knowledge graph, constructed via Document Filtering, Knowledge Extracting, and Knowledge Graph Organization, alongside a Multi-agent framework with Information Extraction Agent, Quality Validation Agent, and Document Summarizer Agent, utilizing DeepSeek R1 and OpenAI 01 LLMs to generate Instruction Tuning and Reasoning Dataset.
Perovskite-KG organizes knowledge from research papers into a structured graph, while the multi-agent framework creates datasets for instruction tuning specialized Large Language Models.
The system aims to enhance research efficiency in perovskite solar cell domain by providing tools for knowledge retrieval, literature review, and complex problem-solving.

One Size doesn't Fit All: A Personalized Conversational Tutoring Agent for Mathematics Instruction

PACE (PersonAlized Conversational tutoring agEnt): introduces a personalized tutoring framework for mathematics instruction, incorporating Simulating Learning Styles (models student learning style), Conceptualize Teaching Strategy (designs teaching approach), Socratic-style Conversation (implements teaching dialogue), Persona Pool (collection of student profiles), Interaction (tutor-student communication), Multi-aspect Criteria (quality assessment metrics), and Evaluation Approaches (assessment methodologies).
PACE framework personalizes learning by simulating student learning styles from personas, conceptualizing tailored teaching strategies, and employing Socratic dialogue for enhanced engagement and critical thinking.
The framework utilizes multi-aspect criteria and dual evaluation approaches, including reference-based and LLM-based methods, to comprehensively assess the personalized tutoring performance.

Automating Prompt Leakage Attacks on Large Language Models Using Agentic Approach

AG2 (AutoGen): introduces agentic framework for automating prompt leakage attacks, utilizing Initial Analysis Agent, Judge Agent, and Tested Agent within GroupChat for evaluating LLM security.
This framework employs specialized agents to probe and exploit target LLMs, assessing prompt leakage by comparing responses from original and sanitized prompts.
The agentic approach provides a systematic methodology for adversarial testing, bridging automated threat modeling and practical LLM security evaluation.

DemonAgent: Dynamically Encrypted Multi-Backdoor Implantation Attack on LLM-based Agent

DemonAgent (Dynamically Encrypted Multi-Backdoor Implantation Attack): introduces dynamically encrypted multi-backdoor implantation attack, with Dynamic Encryption Mechanism, Multi-Backdoor Tiered Implantation (MBTI), and AgentBackdoorEval dataset components.
DemonAgent decomposes backdoor code into Sub-backdoor fragments, uses Anchor Tokens and Attack Matrix for stealth, and employs Encryption Table for secure storage within Agent's workflow.
DemonAgent leverages Encryptor, Decoder, Assembler, Executor, and Retriever components to manage encrypted backdoor fragments and activate attack through Cumulative Triggering, effectively bypassing safety audits.

A Cognitive Writing Perspective for Constrained Long-Form Text Generation

CogWriter: introduces a novel training-free framework with Planning Agent for hierarchical task decomposition, Generation Agents for parallel segment generation, and Monitor Functions including Global Plan Reviewing, Local Plan Reviewing and Length Reviewing for continuous quality control.
CogWriter framework employs Planning Agent to create structured plans and Generation Agents to execute these plans, utilizing Global Plan Reviewing and Local Plan Reviewing for iterative refinement and Length Reviewing for output length adjustment.
CogWriter framework aims to bridge gap between human cognitive writing processes and current LLMs for complex constrained long-form text generation, enhancing instruction completion and generation length.

UXAgent: An LLM-Agent-Based Usability Testing Framework for Web Design

UXAGENT (LLM-Agent-Based Usability Testing Framework) introduces a system with Persona Generator, LLM Agent, Universal Browser Connector, and Result Viewer, utilizing Chrome browser, Action Trace, Memory Trace, Video Recording, Final Outcome, Chat Interface, Fast Loop with Perception-, Planning-, and Action-Modules, Slow Loop with Wonder- and Reflect-Modules, and Memory Stream to simulate usability testing for web design.
UXAGENT employs Persona Generator to create diverse user demographics, LLM Agent with Fast and Slow Loops for web interaction and reasoning, Universal Browser Connector for website interaction, and Result Viewer to present collected user behavior data to UX researchers.
The framework facilitates iterative UX study design by providing simulated user behavior data including action traces, memory logs, video recordings, and chat interfaces, enabling researchers to evaluate and refine usability testing before real human-subject studies.

CityEQA: A Hierarchical LLM Agent on Embodied Question Answering Benchmark in City Space

PMA (Planner-Manager-Actor): introduces hierarchical framework with Planner, Manager, and Actor modules for embodied question answering in city environments.
PMA framework incorporates Planner for task parsing, Manager with Memory and Map for process control and spatial reasoning, and Actor with Navigator, Explorer, and Collector for action generation.
PMA agent utilizes cognitive map and hierarchical structure to achieve long-horizon planning and efficient task execution in complex urban spaces.

Policy-to-Language: Train LLMs to Explain Decisions with Flow-Matching Generated Rewards

Policy-to-Language Framework: introduces a model-agnostic explanation generator using Explanation LLM, Guidance LLM and Reward Generation by Flow Matching for training with PPO Training, taking Input context and producing Output reasoning, verified against True action.
The framework employs Reward Generation by Flow Matching with components like Projection, Rectified Flow Network, Linear Layer, Embedding, First L-1 Layers, Gaussian Noise, PE(t), Zt, and Cross-Attention Layer to provide effective rewards for training the Explanation LLM.
This approach aims to generate dense and effective rewards, reducing reliance on human feedback and improving the quality and accuracy of explanations for agent decisions in both RL and LLM tasks.

Simulating Cooperative Prosocial Behavior with Multi-Agent LLMs: Evidence and Mechanisms for AI Agents to Inform Policy Decisions

Multi-Agent Architecture: introduces a class structure for social emergent behavior simulation, encompassing World (simulation environment), Locations (places for agents), Events (agent actions and observations), Agents (LLM instances representing people), Plans (agent intentions and goals), and Memories (agent past experiences).
The framework uses World class to define simulation space with Locations for agent interaction and Events to record agent actions, while Agents, as LLM instances, possess Plans to guide behavior and Memories to inform actions based on past events.
This architecture facilitates emergent behaviors by enabling agents to reason about surroundings, create plans, react to events, and communicate, offering a structured approach to simulate complex social interactions.

EDGE: Efficient Data Selection for LLM Agents via Guideline Effectiveness

EDGE (Efficient Data selection for LLM Agents via Guideline Effectiveness): introduces a data selection framework for LLM agents, with Unlabeled Data Pool, Initial Guideline, GE Metric, GE Score Calculation, Lowest GE Score Data Selection, Guideline Update, Updated Guideline, High-quality SFT Data Generation, Fine-tuning open-source LLM, and Guideline-based prompt engineering components.
EDGE framework uses Guideline Effectiveness metric to identify informative samples from unlabeled data by measuring the impact of human guidelines in multi-turn interaction tasks.
Selecting low GE score samples allows for efficient prompt engineering and fine-tuning by focusing on data where guidelines are less effective, thus more informative.

Boost, Disentangle, and Customize: A Robust System2-to-System1 Pipeline for Code Generation

BDC Framework (Boost, Disentangle, and Customize Framework): introduces System2-to-System1 pipeline for code generation, with System 2 Knowledge Exploration, Composable System 1 Experts Preparation, and Customized Solver Generation components.
BDC Framework addresses complex reasoning and data heterogeneity using MC-Tree-Of-Agents with mutual boosting, disentangling data for LoRA experts, and input-aware hypernetwork for customization.
The framework utilizes multiple LLMs for verification, Monte-Carlo Tree Search with pruning, and DisenLoRA for adaptive generation of customized problem solvers.

EPO: Explicit Policy Optimization for Strategic Reasoning in LLMs via Reinforcement Learning

EPO (Explicit Policy Optimization): introduces strategic reasoning model (LLMs) with policy, optimize, multi-turn RL, PRM (Process Reward Model), history, interaction, agent/human/env, LLM agent, observation, strategy, and reward components for goal-directed behavior in dynamic environments.
EPO framework utilizes multi-turn RL with process rewards and iterative self-play to train the strategic reasoning model, enhancing adaptability and policy transferability without supervised fine-tuning.
The strategic reasoning model in EPO integrates with LLM agents, enabling long-term goal achievement through enhanced strategic reasoning in interactive scenarios.

Investigating and Extending Homans' Social Exchange Theory with Large Language Model based Agents

Agent Framework: introduces LLM-based agent framework with BDI, Affinity, REI, SVO, Negotiation and Exchange components to study Homans' Social Exchange Theory.
The framework simulates a multi-agent society where agents negotiate and exchange resources based on designed components.
This approach provides a novel method to investigate social science theories using LLM-based agents, bridging social science and computer science.

17th February 2025

A-MEM: Agentic Memory for LLM Agents

A-MEM (Agentic Memory) introduces agentic memory system for LLM agents with Note Construction, Link Generation, Memory Evolution, Memory Retrieval, and Memory components.
A-MEM enables dynamic memory structuring and autonomous memory management inspired by Zettelkasten method for long-term agent interactions.
A-MEM facilitates creation of interconnected knowledge networks and evolution of memories, enhancing LLM agents' long-term interaction capabilities.

ARMAP: SCALING AUTONOMOUS AGENTS VIA AUTOMATIC REWARD MODELING AND PLANNING

ARMAP (autonomous Agents from automatic Reward Modeling And Planning): introduces a framework that enhances LLM agents' decision-making by using Automatic Reward Model (evaluates trajectory quality) to guide Default Policy Model (generates initial action plans) in Tree Planning (search algorithm for actions) using Trajectories (sequences of agent actions) and Reward (score for trajectory success).
Leverages Automatically Generated Dataset (training data for reward model) from Sampled Trajectories (trajectories from environment), Refine Task Instructions (improved task goals), and Sample Negative Trajectories (unsuccessful action paths) to train Reward Model (evaluates trajectory success) without human annotations.
Improves agent performance across tasks by integrating learned Reward Model (evaluates trajectory success) with various planning algorithms, addressing limitations of data scarcity and API accessibility for complex interactive environments.

HARBOR: Exploring Persona Dynamics in Multi-Agent Competition

HARBOR (Housing Auction for Reasoning, Bidding, and Opponent Recognition): introduces a testbed to study persona dynamics in multi-agent auctions, incorporating Persona, Bidding Domain Knowledge, Auction History Memory, Priority Planning, Profiling Competitors, and Theory of Mind Strategy.
HARBOR simulates realistic house bidding scenarios to analyze how personas influence agent behavior, competitor profiling, and strategic decision-making in competitive environments.
This framework enables the evaluation of LLM agents' profitability, competitive standing, and persona-driven objective achievement in multi-agent competitive settings.

SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?

SWE-Lancer: introduces benchmark evaluating language models on real-world software engineering tasks using Original Issue, Codebase, Large Language Model, Generated PR, Human End-to-End Tests, Grader, and Scoring for individual contributions and Original Issue, Proposals, Large Language Model, Rejected Proposals, Chosen Proposal, Comparison, and Scoring for management decisions.
SWE-Lancer benchmark assesses model's ability to solve freelance software engineering tasks by generating code patches or selecting optimal proposals, evaluated through end-to-end tests and comparison with human decisions.
SWE-Lancer framework provides realistic software engineering evaluation by utilizing real-world tasks, payouts, and full-stack complexities, moving beyond isolated unit tests to comprehensive end-to-end assessments.

Learning Getting-Up Policies for Real-World Humanoid Robots

HUMANUP: introduces a two-stage RL framework with Discovery Policy (Stage I motion exploration) and Deployable Policy (Stage II robust motion tracking) for humanoid robots getting-up.
HUMANUP employs a Curriculum (Progressive training strategy) including Collision Mesh Curriculum (Mesh complexity progression), Posture Randomization Curriculum (Initial pose variation), and Control Regularization Curriculum (Regularization strength progression) to enhance learning.
Stage II Deployable Policy utilizes Tracking Rewards (Stage II imitation reward) to refine discovered motions for real-world deployment.

Can LLMs Simulate Social Media Engagement? A Study on Action-Guided Response Generation

Action-Guided Response Generation Framework: introduces a method to simulate social media engagement using Trending Post, User Information Historical Records, Action, and Generated Response components.
Action-Guided Response Generation Framework predicts user engagement Action (retweet, quote, rewrite) towards Trending Post, then generates Generated Response based on predicted Action and User Information Historical Records.
Action-Guided Response Generation Framework aims to capture user engagement dynamics for informed response generation in social media simulations.

CAMEL: Continuous Action Masking Enabled by Large Language Models for Reinforcement Learning

CAMEL (Continuous Action Masking Enabled by Large Language Models): introduces reinforcement learning framework integrating LLM Policy Generator, Action Masking, Actor, Critic, Replay Buffer and Epsilon Masking to enhance exploration and convergence by using LLM-generated policies and dynamic action constraints.
CAMEL leverages Action Masking to dynamically constrain action space based on LLM outputs and Epsilon Masking to reduce reliance on LLM guidance over time, enabling autonomous policy refinement.
The framework demonstrates improved sample efficiency and performance in MuJoCo environments by effectively utilizing LLM-generated priors for initial policy guidance and exploration.

Leveraging Dual Process Theory in Language Agent Framework for Real-time Simultaneous Human-AI Collaboration

DPT-Agent (Dual Process Theory Agent): introduces a language agent framework integrating System 1 with Finite-State Machine, Code as Policy, Action Executor and System 2 with Theory of Mind, Asynchronous Reflection, Belief, Guide, alongside General Introduction and Information History Buffer for real-time human-AI collaboration.
DPT-Agent leverages Dual Process Theory, employing System 1 for rapid responses and System 2 for deliberate reasoning, to achieve autonomous and simultaneous human-AI collaboration.
The framework utilizes Finite-State Machine and code-as-policy in System 1 for fast decision-making, and Theory of Mind with asynchronous reflection in System 2 to infer human intentions and improve autonomous decisions.

Hypothesis-Driven Theory-of-Mind Reasoning for Large Language Models

Thought-tracing: introduces inference-time reasoning algorithm, with Parse Trajectory, Perception Inference, Hypothesis Inference, Initialize Hypotheses, Update Weights, Resample Hypotheses, Rejuvenate Hypotheses, Propagate Hypotheses, designed to trace agent mental states.
Thought-tracing algorithm, inspired by Bayesian theory-of-mind and sequential Monte Carlo, uses LLMs to generate and weight natural language hypotheses about agent beliefs based on perceptions and actions.
Thought-tracing improves performance on theory-of-mind benchmarks by providing intermediate reasoning steps, contrasting with math/coding focused reasoning models.

Can LLM Agents Maintain a Persona in Discourse?

Agent-based evaluation framework: introduces a methodology to evaluate personality maintenance in dyadic conversations using Participant A/B, Assign, Personality Traits, Topic of Conversation, Pairwise Conversation, and Judge Agent components.
This framework employs System Prompt and User Prompt to guide LLM agents in conversations and JSON output for structured evaluation by Judge Agent, including Predicted_bfi and Correct? metrics.
The framework aims to assess personality consistency and alignment by Extract, Analyze, and Plot actions performed by Judge Agents on conversation data to evaluate personality adherence.

Table-Critic: A Multi-Agent Framework for Collaborative Criticism and Refinement in Table Reasoning

Table-Critic: introduces multi-agent framework with Judge, Critic, Refiner, Curator and Self-evolving Template Tree, for collaborative criticism and iterative refinement in table reasoning tasks.
Table-Critic framework employs Judge to identify errors, Critic to provide critiques, Refiner to correct reasoning, and Curator with self-evolving template tree to accumulate critique knowledge for improved future performance.
Self-evolving template tree in Table-Critic dynamically accumulates critique patterns from experience-driven learning, enabling system to handle diverse error types and improve reasoning quality over time.

Personality Editing for Language Models through Relevant Knowledge Editing

PALETTE (Persona Adjustment by LLM Self-Targeted Trait Control via Relevant Knowledge Editing): introduces personality control method through knowledge editing, with MBTI questionnaire, adjustment query construction, LLM (original), generate model response, extract self-referential statement, extract opposite trait word, layer, association at layer 1, optimize v by object, edited LLM, and specific trait-focused response generation components.
PALETTE leverages MBTI-inspired adjustment queries and rank-one model editing to modify LLM's internal representations for personality traits.
This approach enables controlled shifts in personality, addressing inherent biases and improving consistency in LLM responses.

Plant in Cupboard, Orange on Table, Book on Shelf Benchmarking Practical Reasoning and Situation Modelling in a Text-Simulated Situated Environment

AdventureGame: introduces text-based environment, with Agent (executes game actions), Environment (simulated text-based world), Interpreter (processes agent commands), Parser (command grammar definition), State Change Module (updates game world state), World State (current game facts representation), Observation Feedback (textual game responses), Goal (task objective for agent), Action Space (set of valid commands), and Memory (interaction history storage).
AdventureGame facilitates situated language environment understanding and common-sense reasoning evaluation through simplified interactive fiction game.
AdventureGame framework assesses unassisted LLMs performance by providing direct action outcome observations, combining planning and execution within turn-limited episodes.

LLM Agents Making Agent Tools

TOOLMAKER: introduces an agentic framework, with User, AI research assistant, Tool library, Environment setup, Tool implementation, Workflow components, LLM calls, Environment interactions, Agents, TM-BENCH, Docker container, Conversation history, and Environment state, that autonomously creates LLM-compatible tools from scientific papers and code repositories.
TOOLMAKER framework addresses the limitation of requiring pre-defined tools for LLM agents by enabling dynamic tool creation through a closed-loop self-correction mechanism.
The framework's effectiveness is objectively evaluated using TM-BENCH, a benchmark suite designed to assess tool correctness and robustness across diverse tasks.

Exploring LLM-based Student Simulation for Metacognitive Cultivation

Pipeline for generating and filtering high-quality simulated student agents: introduces a multi-component framework with Student Profile Generation, Two-Round Scoring with Profile Consistency Scoring using Questioning Agent and Profile Scorer, Behavioral Consistency Scoring using Dialogue Agent and Behavior Scorer, Graph-Based Score Propagation, Ranking and Filtering, and Human Expert Evaluation to automate the creation of authentic student simulations for metacognitive cultivation.
This pipeline leverages a two-round automated scoring system, incorporating graph neural networks for score propagation, to ensure both profile and behavioral consistency of simulated agents, thereby reducing the need for extensive human annotation.
The framework aims to enhance the authenticity of learning difficulty simulations, facilitating broader applications in personalized learning and pedagogical evaluation by providing a robust method for generating and validating high-quality simulated students.

Competing LLM Agents in a Non-Cooperative Game of Opinion Polarisation

Non-Cooperative Game Framework: introduces a simulation framework with Red Agent, Blue Agent, Judge Agent, Network, BCM Model and Memory to analyse opinion polarisation in a non-cooperative game setting.
This framework models adversarial interactions between LLM agents, simulating misinformation spread and countering within a social network using BCM Model for opinion updates and Judge Agent for message potency assessment.
The framework investigates the impact of resource constraints and cognitive biases on opinion dynamics, highlighting the strategic importance of early influence and resource management in countering misinformation.

DEVIATION RATINGS: A GENERAL, CLONE INVARIANT RATING METHOD

Deviation Ratings: introduces deviation ratings, a novel rating method, with Deviation Rating, CCE, Deviation Gain, Iterative Minimization, and Linear Programming components, which provides clone invariant ratings for N-player general-sum games.
Deviation Ratings framework utilizes Coarse Correlated Equilibrium and iterative minimization of deviation gain via linear programming for rating calculation.
Deviation Ratings method overcomes Nash Equilibrium limitations by employing CCE, achieving scalable, clone-attack-proof, and data agnostic rating system.

A Survey of Personalized Large Language Models: Progress and Future Directions

PLLM (Personalized Large Language Models) Framework: introduces personalized response generation through profile, dialogues, content, and interactions, utilizing personalized prompting, efficient fine-tuning, and personalized alignment with generation, recommendation and classification tasks.
The framework processes user query and data to adapt LLM via prompting, adaptation and alignment for tailored responses, considering generation, recommendation, and classification.
This approach enhances user experience across applications by contextually relevant and user-specific response generation, addressing limitations of general LLMs.

AGrail: A Lifelong Agent Guardrail with Effective and Adaptive Safety Detection

AGrail (A Lifelong Agent Guardrail with Effective and Adaptive Safety Detection): introduces a lifelong agent guardrail framework, with Analyzer, Memory, Executor, and Tool Box components, to enhance LLM agent safety by adaptively detecting risks and applying effective safety policies.
AGrail framework, depicted in figures, integrates with Agent (Planning, ReAct) interacting with diverse environments like Web, Database, and OS, utilizing Guard Request, Agent Specification, and Safety Criteria for context-aware safety checks.
AGrail framework optimizes safety checks through iterative refinement and tool compatibility, aiming for a trade-off between robustness and utility in LLM agent security.

SMART: Self-Aware Agent for Tool Overuse Mitigation

SMART (Strategic Model-Aware Reasoning with Tools): introduces SMARTAgent, a family of models, with MetaCognition, User Reasoning, SMART-ER, Parametric Knowledge, and External Tools, to enhance agent self-awareness for optimized task handling and reduced tool overuse.
SMARTAgent leverages SMART-ER dataset, which includes Reasoning Chain, Decompose, Annotate, Tool Mapping, Execute, and Refine components, to dynamically balance parametric knowledge and tool use in problem-solving.
By calibrating agent's self-awareness through explicit justifications and strategic tool utilization, SMARTAgent achieves improved performance, reduced tool overuse, and better generalization across diverse tasks.

FLAG-TRADER: Fusion LLM-Agent with Gradient-based Reinforcement Learning for Financial Trading

FLAG-TRADER: introduces a financial trading framework integrating LLM (linguistic processing model) with gradient-based RL, using prompts (structured input), LLM (linguistic processing model), policy head (action probability), value head (state value), actor (policy network), critic (value network), reply buffer (experience replay), and environment (financial market).
The framework uses a partially fine-tuned LLM (linguistic processing model) as policy network, optimized by policy gradient and experience replay in reply buffer (experience replay storage) within the environment (financial market).
FLAG-TRADER enhances LLM performance in financial tasks by synergistically combining linguistic processing with reward-driven policy optimization for improved trading decisions.

TimeCAP: Learning to Contextualize, Augment, and Predict Time Series Events with Large Language Model Agents

TimeCAP (TimeCAP: Learning to Contextualize, Augment, and Predict): introduces TimeCAP framework, which includes Contextualize LLM Agent, Predict LLM Agent, Multi-Modal Encoder, Augment Prompt, Augment Input, Time Series, Text Summary, and LLM Aggregate, for time series event prediction using dual LLM agents and multi-modal encoder.
TimeCAP framework leverages a contextualize LLM agent to generate text summaries of time series data, which are then used by a predict LLM agent for enhanced event prediction.
The framework incorporates a multi-modal encoder to synergize with LLM agents through input and prompt augmentations, further improving prediction accuracy and interpretability.

"Nuclear Deployed!”: Analyzing Catastrophic Risks in Decision-making of Autonomous LLM Agents

Three-stage evaluation framework: introduces System, Agent, State Update, and Reasoning components to analyze catastrophic risks in autonomous LLM agents' decision-making, particularly in CBRN scenarios.
This framework uses Restricted Authority, Cat. Behav. w/ Cmd. Violation, and Deception: False Accusation to expose potential catastrophic behaviors and deceptive tendencies in LLM agents.
The evaluation method empirically proves the existence of catastrophic risks by simulating agent interactions and analyzing agent's reasoning and actions within defined scenarios.

16th February 2025

OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning

OctoTools: introduces a training-free agentic framework with Query Analyzer, Action Predictor, Context Verifier, Solution Summarizer, Task-specific Toolset Optimization, Planner, Executor, Tool cards, Action, Context, Command Generator, Command Executor and Answer for complex reasoning across diverse domains.
OctoTools framework uses Tool cards to standardize tool integration and employs Planner-Executor paradigm for managing multi-step reasoning and tool usage.
The framework achieves substantial accuracy gains over baselines by combining multi-step planning with specialized tool usage and task-specific toolset optimization.

PlanGenLLMs: A Modern Survey of LLM Planning Capabilities

LLM Planning Framework: introduces taxonomy of LLM planning approaches, with Foundation, Performance Criteria, Datasets and Evaluation, to categorize and analyze components and methodologies in the field.
LLM Planning Framework: systematically breaks down LLM planning into key aspects like task decomposition, search algorithms, and evaluation metrics, providing a structured overview.
LLM Planning Framework: serves as a comprehensive guide for understanding the architectural components and evaluation strategies employed in current LLM planning research, highlighting key areas and future directions.

A Survey of LLM-based Agents in Medicine: How far are we from Baymax?

LLM-based Medical Agent Framework: introduces profile, clinical planning, medical reasoning, and external capacity enhancement components for structuring medical agents.
LLM-based Medical Agent Framework facilitates medical decision-making by integrating clinical knowledge and ensuring safe deployment through defined components.
The framework supports agent paradigms like single agent, sequential task chain, collaborative experts and iterative evolution to enable diverse medical applications.

CSP: A Simulator For Multi-Agent Ranking Competitions

CSP (Competitive Search Platform) introduces CSP SIMULATOR (core simulation component) with Competition properties (competition setup parameters), Player properties (agent characteristics), Ranking function properties (ranking algorithm definition), CSP ANALYZER (competition analysis module), and CSP COMPARE (competition comparison tool) for configurable multi-agent ranking competition simulations.
CSP SIMULATOR facilitates fine-grained control over ranking function, query, initial documents, and agent types, supporting LLM-based players and diverse experimental setups.
CSP ANALYZER and CSP COMPARE modules enable in-depth analysis of individual competitions and comparative analysis across different competition configurations, respectively.

NavRAG: Generating User Demand Instructions for Embodied Navigation through Retrieval-Augmented LLM

NavRAG (Retrieval-Augmented Generation): introduces NavRAG framework with Navigation Graph Generator, Object and View Annotator, View Retriever, Viewpoint Annotator, Viewpoint Retriever, Zone Generator & Annotator, Zone Retriever, Scene Annotator, Scene Description, Zone Description, Viewpoint Description, Object Description & Functionality, User Demands, Role Profiles, Rough Instruction Generator, Refined Instruction Generator, Trajectory Generator, Instruction Records, Store Instruction, Ground Truth Trajectory, Hierarchical Retrieval LLM, Viewpoint LLM, and Scene Tree, which generates user demand instructions for vision-language navigation using retrieval-augmented large language models.
NavRAG framework constructs hierarchical scene description tree using Scene Tree memory component and leverages Hierarchical Retrieval LLM and Viewpoint LLM to generate diverse navigation instructions tailored to user demands and roles.
NavRAG improves instruction quality by considering global context and user needs through hierarchical scene understanding and retrieval-augmented generation, addressing limitations of step-by-step instruction methods.

VisPath: Automated Visualization Code Synthesis via Multi-Path Reasoning and Feedback-Driven Optimization

VisPath (A Multi-Path Reasoning and Feedback-Driven Optimization Framework for Visualization Code Generation): introduces a multi-stage framework for visualization code generation, utilizing Multi-Path Agent for query expansion, Code Agent for code generation, Visual Feedback Agent for evaluation, and Synthesis Agent for result aggregation, incorporating Feedback for optimization.
VisPath framework enhances code quality by employing structured reasoning and refinement, addressing underspecified queries through diverse reformulated queries and feedback-driven optimization.
VisPath systematically explores multiple interpretative pathways and leverages iterative feedback aggregation to improve accuracy and robustness in visualization code generation for complex user intents.

MasRouter: Learning to Route LLMs for Multi-Agent Systems

MasRouter: introduces Multi-Agent System Routing (MASR) problem and proposes MasRouter framework with collaboration determiner, role allocator, and LLM router for cost-effective multi-agent systems.
MasRouter: employs cascaded controller network integrating collaboration mode determination, role allocation, and LLM routing to construct high-performing and resource-efficient MAS progressively.
MasRouter: optimizes performance and cost in multi-agent systems by dynamically routing LLMs and balancing effectiveness with reduced overhead via customized routing.

G-Safeguard: A Topology-Guided Security Lens and Treatment on LLM-based Multi-agent Systems

G-Safeguard: introduces a topology-guided security framework for LLM-based multi-agent systems, with Multi-agent Utterance Graph, Detection, Remediation, Edge pruning, Graph Neural Network, Node Encoder, and Embedding components.
G-Safeguard leverages graph neural networks to detect anomalies on multi-agent utterance graphs and employs topological intervention for attack remediation.
This framework enhances security in multi-agent systems by identifying and mitigating adversarial attacks through graph-based analysis and intervention.

Hierarchical Expert Prompt for Large-Language-Model: An Approach Defeat Elite AI in TextStarCraft II for the First Time

HEP (Hierarchical Expert Prompt): introduces a framework with Expert Tactic Prompt (inject expert tactic knowledge), Hierarchical Decision Prompt (hierarchical decision making logic), and Legal Action Library (action options for LLM) within a System Prompt (prompt to guide LLM behavior) to enhance LLM decision-making for StarCraft II, utilizing LLM Input (condensed text game observation), LLM Output (text-based game actions), and Text Action Recognition (converts text actions to game commands).
HEP injects expert knowledge and employs hierarchical decision-making to tackle challenges like insufficient knowledge and inadequate control over subtasks in complex environments.
This method enables an LLM agent to overcome the Elite AI in TextStarCraft II for the first time, demonstrating its efficacy in intricate decision-making scenarios.

Talk Structurally, Act Hierarchically: A Collaborative Framework for LLM Multi-Agent Systems

TalkHier (Talk Structurally, Act Hierarchically): introduces Structured Communication Protocol (organizes agent communication) and Hierarchical Refinement (hierarchical evaluation process) within a multi-agent framework to improve communication and refinement in LLM-based multi-agent systems.
TalkHier framework utilizes Supervisor (oversees task success), Member Agent (problem-solving focus), and Evaluation Team (hierarchical feedback provision) with components like Message (instruction content), Background Information (contextual details), and Intermediate Output (shared output for progression) for effective collaboration.
TalkHier enhances agent functionality through Memory (agent-specific information storage), Plugins (domain-specific external tools), Roles (generator, evaluator, revisor), and Types (supervisor, member), enabling structured, role-specific operations and improved performance across diverse tasks.

AGENTIC LLM FRAMEWORK FOR ADAPTIVE DECISION DISCOURSE

Agentic LLM Framework: introduces agent assembly within conference room, utilizing scenario and persona prompts, employing summoning mechanism, generating conversation output to derive actionable measures.
This framework simulates decision discourse using LLM agents representing stakeholders, enabling exploration of diverse perspectives and adaptive strategy refinement through self-governance.
The framework facilitates breadth-first exploration of alternatives for robust recommendations in complex, uncertain scenarios, enhancing decision-making and scalability for AI-driven solutions.

SCALE: Towards Collaborative Content Analysis in Social Science with Large Language Model Agents and Human Intervention

SCALE (Simulates Content Analysis via Large language model Agents): introduces multi-agent framework for content analysis, incorporating Coder Simulation to initialize agents, Bot Annotation for independent coding, Agent Discussion for collaborative refinement, Codebook Evolution for iterative rule updates, and Human Intervention for expert guidance.
SCALE framework simulates content analysis phases including text coding, collaborative discussions, and codebook evolution, aiming to reduce subjectivity and enhance scalability through human-AI collaboration.
SCALE leverages LLM agents with diverse personas and iterative refinement processes to achieve human-level performance in complex content analysis tasks, offering potential for social science research automation.

15th February 2025

D-CIPHER: Dynamic Collaborative Intelligent Agents with Planning and Heterogeneous Execution for Enhanced Reasoning in Offensive Security

D-CIPHER: introduces multi-agent framework with Planner Agent for planning, heterogeneous Executor Agents for task execution, and Auto-prompter Agent for initial prompt generation, utilizing shared Container Environment to interact with Challenge Server.
D-CIPHER employs Planner-Executor system to divide responsibilities and Auto-prompter to enhance initial prompt relevance for collaborative CTF challenge solving.
D-CIPHER framework facilitates dynamic feedback loops and agent interactions to improve reasoning and performance in complex offensive security tasks.

Enhancing Conversational Agents from Open-Source Large Language Models with Illocutionary Force and Document-Based Knowledge Retrieval

Bespoke Conversational System: introduces a system for analyzing dialogue illocutionary forces and integrating them with document retrieval for enhanced agent responses.
Bespoke Conversational System utilizes BERT for force extraction and Ollama-served open-source LLMs for response generation, using custom documents as knowledge.
This system enhances conversational agent relevance by incorporating user intent and document knowledge within open-source LLM frameworks.

PCGRLLM: Large Language Model-Driven Reward Design for Procedural Content Generation Reinforcement Learning

PCGRLLM (Large Language Model-Driven Reward Design for Procedural Content Generation Reinforcement Learning): introduces feedback-based reward generation framework for PCG, with Instruct (Story), LLM M, Alignment, Agent π, Environment, Content, and Feedback components.
PCGRLLM framework refines reward function iteratively using self-alignment and feedback mechanisms to improve content generation in PCGRL.
PCGRLLM leverages prompt engineering techniques like Chain-of-Thought, Tree-of-Thought, and Graph-of-Thought to enhance reward space exploration and improve reward function design.

Divergent Thoughts toward One Goal: LLM-based Multi-Agent Collaboration System for Electronic Design Automation

EDAid (Electronic Design Automation id): introduces multi-agent system, with divergent-thoughts agent (generates diverse EDA scripts), decision-making agent (selects optimal EDA script), EDA tool usage demo database (stores EDA task examples), relevant demos (retrieved similar EDA tasks), demo groups (sets of relevant demos), planning steps (sequential task breakdown), EDA script (executable automation code), KV cache (stores system prompts), and EDA tools (software for circuit design), where system automates electronic design automation flow using multiple agents with diverse approaches.
EDAid employs divergent-thoughts agents to generate multiple electronic design automation script solutions using retrieved relevant demos and demo groups, then decision-making agent selects optimal script based on analysis of script correctness.
EDAid enhances electronic design automation flow automation reliability by utilizing multi-agent collaboration and divergent thinking, effectively addressing challenges inherent in complex electronic design automation tasks and diverse tool interfaces.

Be Friendly, Not Friends: How LLM Sycophancy Shapes User Trust

Framework name here: introduces LLM agent (conversational agent) with Sycophancy manipulation (adapts responses to user views) and Friendliness manipulation (expresses warmth or neutrality) to investigate user trust.
LLM agent's sycophancy dynamically aligns responses with user perspectives, while friendliness incorporates warm language cues.
The framework examines how these manipulations jointly influence perceived authenticity and user trust in human-LLM interactions.

Rule-Bottleneck Reinforcement Learning: Joint Explanation and Decision Optimization for Resource Allocation with Language Agents

RBRL (Rule-Bottleneck Reinforcement Learning): introduces a framework that integrates LLMs for rule generation and RL for rule selection, with state-to-language descriptor, prompt function, LLM (rule generator), rule sets, attention network (rule selector), RL agent (policy optimizer), budget constraint, action and explanation LLM (action and explanation generator), and replay buffer, to jointly optimize decision-making and explanation in resource allocation tasks.
RBRL framework leverages LLMs to generate interpretable candidate rules based on the environment state and employs an attention-based RL policy to select the most suitable rule, optimizing for both environment reward and explanation quality.
The framework enhances transparency and interpretability in sequential decision-making by using structured language rules as a bottleneck between high-level reasoning and action execution, facilitating human understanding and trust in AI systems.

14th February 2025

Can Large Language Model Agents Balance Energy Systems?

LLM-SUC (LLM-assisted Stochastic Unit Commitment): introduces a hybrid approach integrating LLM Agent, Agent 1, Agent 2, and SUC to enhance stochastic unit commitment by leveraging scenario tree and MILP for improved energy system balancing.
LLM-SUC framework uses LLM Agent with Agent 1 and Agent 2 to refine scenario tree generation, improving the stochastic unit commitment process for renewable energy integration.
By dynamically adjusting quantile parameters and interpreting wind error distributions, LLM-SUC framework enhances computational efficiency and decision-making in uncertain operating conditions.

Agentic Verification for Ambiguous Query Disambiguation

VERDICT (Verified-Diversification with Consolidation): introduces a unified framework integrating diversification and verification for ambiguous query disambiguation, incorporating query relaxation & retrieval, relevance feedback from retriever, execution feedback from generator, verified diversification, and consolidating feedback components.
VERDICT framework leverages agentic feedback from retriever and generator to ensure grounded disambiguations upfront, mitigating noisy retrieval and cascading errors during ambiguous question answering.
VERDICT improves efficiency and robustness by reducing reliance on multiple retrieval and inference steps, achieving enhanced grounding and performance in comparison to diversify-then-verify approaches.

Process Reward Models for LLM Agents: Practical Framework and Directions

AgentPRM (Agent Process Reward Models): introduces a framework for training LLM agents, with Agent Policy, Environment, Rollout, Target Dataset, Target Computation, PRM Dataset, Process Reward Model, PRM Training, Prompts, Policy Update, and Updated Policy, where framework trains agents through interaction.
AgentPRM framework uses actor-critic paradigm and Monte Carlo rollouts for reward target computation, enabling iterative policy refinement.
AgentPRM framework integrates into RLHF pipelines with minimal modifications, facilitating scalable agent training and deployment.

LARGE LANGUAGE MODELS AND SYNTHETIC DATA FOR MONITORING DATASET MENTIONS IN RESEARCH PAPERS

Data Use Extraction and Classification Framework: introduces a machine learning pipeline with zero-shot extraction and classification, LLM-as-a-Judge, reasoning agent, synthetic pre-fine-tuning data, fine-tuned LLM, BERT-based classifier, and curated data to automate dataset mention detection.
This framework employs synthetic data generation and two-stage fine-tuning to address data scarcity and improve model generalization for scalable dataset monitoring in research papers.
The approach utilizes LLMs for extraction and refinement, and a BERT-based classifier for efficient filtering, achieving state-of-the-art performance in dataset extraction accuracy.

Do Large Language Models Reason Causally Like Us? Even Better?

LLM (Large Language Model): introduces comparative study evaluating LLMs' causal reasoning against human reasoning using collider graph inference tasks and normative/psychological models.
Study assesses GPT-40, Claude, Gemini-Pro, GPT-3.5 on predictive, independence, diagnostic inferences, comparing to human data and Causal Bayes Nets, Mutation Sampler models.
Findings emphasize importance of understanding AI biases in causal reasoning for reliable AI systems, revealing LLMs' varying normative alignment and domain knowledge influence.

Cooperative Multi-Agent Planning with Adaptive Skill Synthesis

COMPASS (Cooperative Multi-Agent Planning with Adaptive Skill Synthesis): introduces a decentralized closed-loop framework for cooperative multi-agent systems, with Perception (multi-modal input processing), Task Reasoning (objective decomposition), Self-Reflection (execution quality evaluation), Actor (skill selection and execution), Skill Synthesis (executable code generation), Local Memory (agent's individual memory), Global Memory (shared memory), and Skill Library (executable skill collection).
COMPASS integrates VLMs with a dynamic skill library and structured communication for decentralized decision-making under partial observability.
The framework enables agents to adapt strategies through iterative planning, skill synthesis, and information sharing, improving performance in cooperative tasks.

A Survey on LLM-powered Agents for Recommender Systems

Unified Agent Architecture: introduces unified agent architecture consisting of Profile, Memory, Planning and Action modules for LLM-powered agent recommender systems.
Unified Agent Architecture: decomposes agent recommender into profile construction, memory management, strategic planning, and action execution modules.
Unified Agent Architecture: facilitates analysis of existing LLM-powered agent methods by providing structured framework of core components and their functions.

Automated Hypothesis Validation with Agentic Sequential Falsifications

POPPER: introduces an agentic framework for automated hypothesis validation, with Experiment Design Agent, Experiment Execution Agent, Sequential Error Control, Relevance Checker, Self-Refine, and Memory components.
POPPER framework rigorously validates free-form hypotheses by sequentially testing measurable implications through experiments and controlling Type-I error.
POPPER leverages LLM agents to automate experiment design and execution, providing a scalable and efficient solution for hypothesis validation across diverse domains.

Diverse Inference and Verification for Advanced Reasoning

Diverse Inference: introduces a diverse inference framework that combines multiple models and methods, including Human (User input), LLM (Reasoning engine), Agent (Process orchestrator), Game Environment (Problem representation), Verifier (Solution checker), and specific methods like LEAP (Task-specific learning), Z3 (Theorem prover), RTO (Round trip optimization), BoN (Best-of-N sampling), SC (Self-consistency), MoA (Mixture of agents), MCTS (Monte Carlo search), PV (Prover-verifier game), Unit Tests (Code execution verification), and Best-of-N (Sampling based verification), to enhance reasoning and verification for advanced tasks.
Leverages perfect verifiers like Lean Verifier (Formal proof checker) for IMO and ARC problems and imperfect verifiers like Best-of-N (Sampling based verification) for HLE questions, achieving higher accuracy on challenging benchmarks.
Utilizes test-time simulations, reinforcement learning, and meta-learning to adapt agent graph representations and improve generalization across diverse problem types.

Agentic Verification for Ambiguous Query Disambiguation

VERDICT (Verified-Diversification with Consolidation): unifies diversification with verification by incorporating feedback from retriever and generator early on, improving efficiency and robustness by reducing reliance on multiple retrieval and inference steps.
VERDICT framework integrates Rewrite-component for query relaxation, Retrieve-component for relevant passages, Execution-component for interpretation extraction, Embed-component for latent space projection, and Clustering-component for feedback consolidation.
The framework leverages LLM Calls and Retriever Calls to enhance the grounding and accuracy of ambiguous query disambiguation in retrieval-augmented generation.

Process Reward Models for LLM Agents: Practical Framework and Directions

AgentPRM (Agent Process Reward Models): introduces a framework for training LLM agents, with Agent Policy, Environment, Rollout, Target Dataset, Target Computation, PRM Dataset, Process Reward Model, PRM Training, Prompts, Policy Update, and Updated Policy, where framework trains agents through interaction.
AgentPRM framework uses actor-critic paradigm and Monte Carlo rollouts for reward target computation, enabling iterative policy refinement.
AgentPRM framework integrates into RLHF pipelines with minimal modifications, facilitating scalable agent training and deployment.

LARGE LANGUAGE MODELS AND SYNTHETIC DATA FOR MONITORING DATASET MENTIONS IN RESEARCH PAPERS

Data Use Extraction and Classification Framework: introduces a machine learning pipeline with zero-shot extraction and classification, LLM-as-a-Judge, reasoning agent, synthetic pre-fine-tuning data, fine-tuned LLM, BERT-based classifier, and curated data to automate dataset mention detection.
This framework employs synthetic data generation and two-stage fine-tuning to address data scarcity and improve model generalization for scalable dataset monitoring in research papers.
The approach utilizes LLMs for extraction and refinement, and a BERT-based classifier for efficient filtering, achieving state-of-the-art performance in dataset extraction accuracy.

Do Large Language Models Reason Causally Like Us? Even Better?

LLM (Large Language Model): introduces comparative study evaluating LLMs' causal reasoning against human reasoning using collider graph inference tasks and normative/psychological models.
Study assesses GPT-40, Claude, Gemini-Pro, GPT-3.5 on predictive, independence, diagnostic inferences, comparing to human data and Causal Bayes Nets, Mutation Sampler models.
Findings emphasize importance of understanding AI biases in causal reasoning for reliable AI systems, revealing LLMs' varying normative alignment and domain knowledge influence.

Cooperative Multi-Agent Planning with Adaptive Skill Synthesis

COMPASS (Cooperative Multi-Agent Planning with Adaptive Skill Synthesis): introduces a decentralized closed-loop framework for cooperative multi-agent systems, with Perception (multi-modal input processing), Task Reasoning (objective decomposition), Self-Reflection (execution quality evaluation), Actor (skill selection and execution), Skill Synthesis (executable code generation), Local Memory (agent's individual memory), Global Memory (shared memory), and Skill Library (executable skill collection).
COMPASS integrates VLMs with a dynamic skill library and structured communication for decentralized decision-making under partial observability.
The framework enables agents to adapt strategies through iterative planning, skill synthesis, and information sharing, improving performance in cooperative tasks.

ScamFerret: Detecting Scam Websites Autonomously with Large Language Models

ScamFerret: presents an agent system for autonomous scam website detection, integrating Scam Website Analysis, External Information Collection, and Analysis Results Output components.
ScamFerret utilizes LLM's pre-existing knowledge and tool-based information retrieval to classify scam websites without scam-specific fine-tuning.
ScamFerret iteratively refines website analysis by collecting external information when initial URL data is insufficient, enhancing detection accuracy and explainability.

A Survey on LLM-powered Agents for Recommender Systems

Unified Agent Architecture: introduces unified agent architecture consisting of Profile, Memory, Planning and Action modules for LLM-powered agent recommender systems.
Unified Agent Architecture: decomposes agent recommender into profile construction, memory management, strategic planning, and action execution modules.
Unified Agent Architecture: facilitates analysis of existing LLM-powered agent methods by providing structured framework of core components and their functions.

VIRAC: A VISION-REASONING AGENT HEAD MOVEMENT CONTROL FRAMEWORK IN ARBITRARY VIRTUAL ENVIRONMENTS

VIRAC (Vision-Reasoning Agent Head Movement Control framework): introduces vision-reasoning framework for realistic agent head rotations utilizing Perception Module with VLM and FMM, Decision-making Module with AHM and LLM, and iterative loop with perception, reasoning, action selection, environment update and history update components.
VIRAC framework leverages large-scale models for common-sense knowledge and reasoning capabilities, emulating human-like perception and decision-making processes for context-aware head rotations.
VIRAC framework operates through iterative cycle of perception, reasoning, action selection, and environment/history updates to produce dynamic and context-sensitive head rotations in virtual environments.

Automated Hypothesis Validation with Agentic Sequential Falsifications

POPPER: introduces an agentic framework for automated hypothesis validation, with Experiment Design Agent, Experiment Execution Agent, Sequential Error Control, Relevance Checker, Self-Refine, and Memory components.
POPPER framework rigorously validates free-form hypotheses by sequentially testing measurable implications through experiments and controlling Type-I error.
POPPER leverages LLM agents to automate experiment design and execution, providing a scalable and efficient solution for hypothesis validation across diverse domains.

Representation and Interpretation in Artificial and Natural Computing

Newell's hierarchy of system levels: analyzes computing machines through system levels, including knowledge level, symbol level, hardware level, and physical world.
This hierarchy facilitates analyzing computing machines across abstraction levels, ranging from knowledge representation to physical implementation.
Understanding these levels facilitates comprehending functionality and independence across complex computing systems.

Robustness tests for biomedical foundation models should tailor to specification

Robustness testing frameworks: introduces threat-based test design, priority-based test design, patient EHR foundation model, and brain MRI foundation model, where paper describes robustness testing framework for biomedical foundation models.
Framework uses threat-based tests with distance bounds and priority-based tests with realistic artifacts to evaluate robustness.
This approach aims to standardize robustness evaluation by tailoring tests to specific tasks and priorities in biomedical domain.

Agentic Verification for Ambiguous Query Disambiguation

VERDICT (Verified-Diversification with Consolidation): introduces a unified framework integrating diversification and verification for ambiguous query disambiguation, incorporating query relaxation & retrieval, relevance feedback from retriever, execution feedback from generator, verified diversification, and consolidating feedback components.
VERDICT framework leverages agentic feedback from retriever and generator to ensure grounded disambiguations upfront, mitigating noisy retrieval and cascading errors during ambiguous question answering.
VERDICT improves efficiency and robustness by reducing reliance on multiple retrieval and inference steps, achieving enhanced grounding and performance in comparison to diversify-then-verify approaches.

VocalCrypt: Novel Active Defense Against Deepfake Voice Based on Masking Effect

VocalCrypt: introduces active defense mechanism, with Preprocessing, DWT Transform, Band analysis module, Frequency band division, Calculate the energy, Masking threshold calculation module, Masking threshold calculation, and iSTFT, to protect against deepfake voice cloning.
VocalCrypt leverages auditory masking effect by embedding pseudo-timbre based on calculated masking thresholds to disrupt AI voice cloning systems.
VocalCrypt enhances robustness, real-time performance, and offers preemptive defense compared to existing post-attack detection methods against voice cloning attacks.

Process Reward Models for LLM Agents: Practical Framework and Directions

AgentPRM (Agent Process Reward Models): introduces a framework for training LLM agents, with Agent Policy, Environment, Rollout, Target Dataset, Target Computation, PRM Dataset, Process Reward Model, PRM Training, Prompts, Policy Update, and Updated Policy, where framework trains agents through interaction.
AgentPRM framework uses actor-critic paradigm and Monte Carlo rollouts for reward target computation, enabling iterative policy refinement.
AgentPRM framework integrates into RLHF pipelines with minimal modifications, facilitating scalable agent training and deployment.

Reinforcement Learning in Strategy-Based and Atari Games: A Review of Google DeepMind's Innovations

AlphaGo: introduces a pioneering AI model utilizing policy- and value-networks with Monte Carlo Tree Search (MCTS) for playing Go.
AlphaGo integrates supervised and reinforcement learning to achieve expert-level Go performance, surpassing human players.
The model's architecture combines neural networks with tree search for effective game exploration and decision-making in Go.

Open-Source AI-Powered Optimization in Scalene: Advancing Python Performance Profiling with DeepSeek-R1 and LLaMA 3.2

SCALENE: introduces open-source AI-powered optimization, with Profiler, AI-powered optimization suggestions, Open-source LLMs, Ollama framework, Code Snippet Input, Optimization Suggestions Output, to advance Python performance profiling.
SCALENE integrates DeepSeek-R1 and LLaMA 3.2 via Ollama, enabling local, cost-effective AI-driven code optimization suggestions for Python.
This integration enhances SCALENE's utility by providing accessible, transparent, and hardware-aware optimization, improving Python developer efficiency.

SegX: Improving Interpretability of Clinical Image Diagnosis with Segmentation-based Enhancement

SegX (Segmentation-based Explanation) and SegU (Segmentation-based Uncertainty Assessment) introduces Segmentation-based Explanation (SegX) module to refine original XAI map using segmentation mask and Segmentation-based Uncertainty Assessment (SegU) module to evaluate prediction certainty by comparing explanation map and clinical interests mask.
SegX enhances interpretability of clinical image diagnosis by aligning model's explanation with clinically relevant areas using Segmentation Model and improves reliability through uncertainty quantification with SegU.
The framework utilizes Classification Model for initial prediction and XAI Method for explanation, then refines explanation and assesses certainty using segmentation-derived clinical knowledge, aiming for model-agnostic enhancement in medical AI.

Artificial Intelligence to Assess Dental Findings from Panoramic Radiographs - A Multinational Study

AI system: introduces Dental Panoramic Radiograph (DPR) input with finding localization, tooth index classification, and post-processing to generate Output Finding Assessment for dental radiographs.
AI system employs object detection and semantic segmentation, utilizing convolutional neural networks for dental findings and tooth indices from radiographic images.
AI system achieves performance comparable to human experts in dental findings assessment, demonstrating potential for clinical workflow integration and diagnostic improvement.

Robust variance estimators in application to segmentation of measurement data distorted by impulsive and non-Gaussian noise

Robust Change Point Detection Methodology: introduces robust offline methodology for measurement data segmentation based on classical structural break detection with robust scale estimators, utilizing data, CSS and change point components.
Methodology uses Cumulative Sums of Squares statistic and quantile method, enhanced by Biweight Midvariance and Quantile Conditional Variance for non-Gaussian data segmentation.
Proposed approach improves change point estimation accuracy, especially for heavy-tailed and impulsive noise data in financial, mechanical, and medical systems.

LARGE LANGUAGE MODELS AND SYNTHETIC DATA FOR MONITORING DATASET MENTIONS IN RESEARCH PAPERS

Data Use Extraction and Classification Framework: introduces a machine learning pipeline with zero-shot extraction and classification, LLM-as-a-Judge, reasoning agent, synthetic pre-fine-tuning data, fine-tuned LLM, BERT-based classifier, and curated data to automate dataset mention detection.
This framework employs synthetic data generation and two-stage fine-tuning to address data scarcity and improve model generalization for scalable dataset monitoring in research papers.
The approach utilizes LLMs for extraction and refinement, and a BERT-based classifier for efficient filtering, achieving state-of-the-art performance in dataset extraction accuracy.

Seamless acceleration of Fortran intrinsics via AMD AI engines

Compilation flow: introduces an approach for seamless acceleration of Fortran intrinsics on AMD AI Engines, utilizing Fortran source code, Flang, HLFIR & FIR, Existing Lowering, Linalg dialect, Standard MLIR dialects, Lowering passes, LLVM dialect, Generation, LLVM IR, CPU code, xrt dialect, func dialect, aie dialect, Specialisation, AIE code, AMD's AIE MLIR tooling, and Library of AIE dialect kernels.
This compilation flow leverages MLIR infrastructure to lower Fortran intrinsics represented in linalg dialect to CPU and AIE code, using xrt dialect for CPU-AIE interaction and aie dialect for AIE specific operations, enabling transparent offloading without programmer modifications.
The approach uses a library of pre-built AIE kernel templates and a specialisation process to tailor AIE code for specific Fortran intrinsics, demonstrating performance benefits for reduction, transpose, and matrix multiplication operations on AMD Ryzen AI NPUs.

Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

Step-Video-T2V: introduces a text-to-video model with Video-VAE Encoder, Video-VAE Decoder, Bilingual Text Encoder(s), DiT w/ 3D Full Attention and Video-DPO components.
Step-Video-T2V: employs cascaded training and video-based DPO to enhance video generation quality and reduce artifacts.
Step-Video-T2V: presents state-of-the-art text-to-video generation and is publicly released to foster advancements in video foundation models.

Learning to Solve the Min-Max Mixed-Shelves Picker-Routing Problem via Hierarchical and Parallel Decoding

MAHAM (Multi-Agent Hierarchical Attention Model): introduces a hierarchical and parallel decoding framework for solving the min-max Mixed-Shelves Picker-Routing Problem (MSPRP) with Problem Encoder, Agent Context Encoder, Parallel Agent Pointer, and Sequential Action Selection components.
MAHAM framework facilitates efficient picker coordination within complex action spaces by integrating hierarchical decoding, parallel solution construction, and sequential action selection mechanisms.
MAHAM achieves state-of-the-art performance regarding solution quality and computational efficiency, especially for large-scale MSPRP instances, highlighting neural combinatorial optimization effectiveness for warehouse logistics.

ProReco: A Process Discovery Recommender System

ProReco (Process discovery Recommender): introduces a system recommending process discovery algorithms by utilizing Event Log as input, Feature Extractor to derive log features, Machine Learning Predictor to forecast algorithm performance, Algorithm Ranking to order algorithms, and Measure Weights for user preference integration.
ProReco employs machine learning to predict quality measures like fitness, simplicity, precision, and generalization for various process discovery algorithms based on event log characteristics.
The system enhances user experience in process discovery algorithm selection by providing explainable recommendations and incorporating user-defined weights for quality measures.

A Multiagent Path Search Algorithm for Large-Scale Coalition Structure Generation

SALDAE (Scalable Algorithm with Large-Scale Distributed Agent Exploration): introduces multiagent path search algorithm for coalition structure generation using search agents exploring coalition structure graph from start node, generating child nodes, performing node selection, comparing to incumbent, employing bridging paths, utilizing memory management and applying conflict resolution.
SALDAE algorithm iteratively builds search graph by splitting or merging coalitions for solving coalition structure generation problems, aiming to find optimal partitioning of agents into coalitions to maximize social welfare.
SALDAE leverages memory management with OPEN, SUBSTITUTE, and RESERVE lists and incorporates strategies like SPLIT-THEN-MERGE, MERGE-THEN-SPLIT, and APPROACH-THEN-SWAP to enhance search efficiency and solution quality in large-scale multiagent systems.

Do Large Language Models Reason Causally Like Us? Even Better?

LLM (Large Language Model): introduces comparative study evaluating LLMs' causal reasoning against human reasoning using collider graph inference tasks and normative/psychological models.
Study assesses GPT-40, Claude, Gemini-Pro, GPT-3.5 on predictive, independence, diagnostic inferences, comparing to human data and Causal Bayes Nets, Mutation Sampler models.
Findings emphasize importance of understanding AI biases in causal reasoning for reliable AI systems, revealing LLMs' varying normative alignment and domain knowledge influence.

AI-in-the-Loop Sensing and Communication Joint Design for Edge Intelligence

JSAC (AI-in-the-loop joint sensing and communication): introduces an AI-driven closed-loop architecture, with Local Data Source, Data Collection, Sensing, AI Model, Input, Feedback, Reweight, Gradient Uploading, Communication, and Server, that jointly optimizes system resources for superior system-level performance.
JSAC framework incorporates AI Model Feedback into Data Collection and Communication, using Reweight and Gradient Uploading to dynamically adjust data sampling and gradient noise for enhanced model generalization.
This approach aims to reduce communication energy and sensing costs while improving model generalization by integrating model feedback into the control loop of edge intelligence systems.

Dynamic Reinforcement Learning for Actors

Dynamic RL (Dynamic Reinforcement Learning): introduces Actor RNN (generates chaotic dynamics), Sensitivity (local index of convergence), SAL (prevents excessive convergence), and SRL (adjusts dynamics based on TD error) for controlling system dynamics directly in reinforcement learning.
Dynamic RL framework shifts from static to dynamic RL by embedding exploration into action generation through chaotic system dynamics within the Actor RNN.
The framework utilizes Sensitivity, SAL, and SRL to manage convergence and divergence of system dynamics based on temporal difference error, aiming for improved learning and adaptability.

VideoDiff: Human-AI Video Co-Creation with Alternatives

VideoDiff: introduces a human-AI co-creation system for video editing, with AI Edit Recommendations, Multiple Alternatives, Timeline View, Transcript View, Video Alignment, Difference Highlighting, Regenerate, Recombine, Refine, and Sort components.
VideoDiff supports video creators in exploring multiple variations of video edits through AI-driven suggestions and comparison interfaces.
The framework facilitates efficient review and customization of video edits by aligning variations, highlighting differences, and providing tools for organization and refinement.

Reinforcement Learning based Constrained Optimal Control: an Interpretable Reward Design

Interpretable Reward Design Framework: introduces interpretable reward design for reinforcement learning based constrained optimal control, with reward function, terminal constraint reward, guidance reward, penalty for state constraint violations, cost reduction incentive reward, curriculum learning stages, and subproblem solving.
Framework constructs reward function from four weighted components and uses curriculum learning with subproblem solutions to inform reward design and improve convergence.
Proposed approach enhances satisfaction of constraints and optimization of control cost in reinforcement learning for constrained optimal control problems.

Safe platooning control of connected and autonomous vehicles on curved multi-lane roads

Decentralized Control Strategy: introduces decentralized control strategy for safe platooning and merging on curved multi-lane roads, featuring lateral control law, longitudinal control law, and constructive barrier feedback for collision avoidance.
Decentralized Control Strategy: incorporates lateral control law ensuring geometrical convergence using nominal lateral controller and lateral constructive barrier feedback for road edge collision prevention.
Decentralized Control Strategy: employs longitudinal control law achieving desired arc length and velocity using nominal longitudinal controller and longitudinal constructive barrier feedback for inter-vehicle collision prevention.

From Markov to Laplace: How Mamba In-Context Learns Markov Chains

MambaZero: introduces simplified Mamba model with embedding, MambaZero block, linear and prediction components for Markov chain learning.
MambaZero framework utilizes convolution within its MambaZero block to achieve Laplacian smoothing for optimal next-token prediction.
The architecture demonstrates that convolution is a key component for Mamba's in-context learning capabilities on Markovian data.

STMA: A Spatio-Temporal Memory Agent for Long-Horizon Embodied Task Planning

STMA (Spatio-Temporal Memory Agent) introduces a novel framework with Spatio-Temporal Memory Module (captures historical and environmental changes) and Planner-Critic Module (enables closed-loop planning), utilizing Temporal Memory Submodule (processes sequential event records) with Record (stores raw interaction data), Summarizer (condenses raw observations), Temporal Belief (compressed temporal information), and Spatial Memory Submodule (manages spatial relationships) with Spatial Memory (dynamic knowledge graph), Update KGs (updates knowledge graph), Relation Retriever (extracts spatial relationships), Retrieve Algorithm (extracts task-relevant subgraphs), Relation Aggregator (organizes triples into natural language), Spatial Belief (processed spatial memory), operating within Environment (simulated task environment).
STMA framework enhances task planning and execution by integrating spatio-temporal memory, enabling dynamic adaptation and improved performance in long-horizon embodied tasks within dynamic environments.
The planner-critic mechanism in STMA facilitates iterative refinement of task strategies through closed-loop feedback, contributing to robust decision-making and adaptability for embodied agents.

Agentic End-to-End De Novo Protein Design for Tailored Dynamics Using a Language Diffusion Model

VibeGen (generative AI framework): introduces agentic dual-model architecture with Protein designer (PD) (sequence candidate generator) and Protein predictor (PP) (dynamic accuracy evaluator) for end-to-end de novo protein design conditioned on normal mode vibrations.
VibeGen framework synergizes diversity, accuracy, and novelty in protein design by using Protein designer (PD) (sequence candidate generator) to propose sequences based on vibrational modes and Protein predictor (PP) (dynamic accuracy evaluator) to evaluate dynamic accuracy.
VibeGen framework establishes bidirectional link between protein sequence and vibrational behavior, enabling new pathways for engineering biomolecules with tailored dynamical and functional properties, validated by full-atom molecular simulations.

Modeling biases in binary decision-making within the generalized nonlinear q-voter model

Generalized nonlinear q-voter model: introduces agent-based framework, with agent, q-panel, opinion, unanimity, conformity, non-unanimity, bias, complete graph, Monte Carlo simulations, rate equation, phase diagram, and exit probability, to study biased binary decision-making.
This model extends the classic q-voter model by incorporating state-dependent biases in opinion updates during non-unanimous influence scenarios, analyzed using rate equations and Monte Carlo simulations on complete graphs.
The framework reveals novel phase diagrams and exit probability behaviors, particularly for larger influence groups (q ≥ 3), highlighting the complex interplay of bias and group influence in collective decision outcomes.

Combinatorial Reinforcement Learning with Preference Feedback

MNL-VQL (MNL Preference Model with Variance-weighted Item-level Q-Learning): introduces Online Parameter Estimation, Optimistic Q-Values, Pessimistic Q-Values, Variance Estimation, Efficient Optimistic Q-Value Estimation and Exploration Policy to address combinatorial reinforcement learning with preference feedback challenges.
MNL-VQL framework incorporates variance-weighted regression and optimistic estimation to achieve computational efficiency and nearly minimax-optimal regret, particularly in linear MDPs with preference feedback.
The framework's novelty lies in its efficient optimistic assortment selection and regret reduction by factor of √H compared to summing H MNL bandit regrets, offering statistical guarantees in combinatorial RL with preference feedback.

Cooperative Multi-Agent Planning with Adaptive Skill Synthesis

COMPASS (Cooperative Multi-Agent Planning with Adaptive Skill Synthesis): introduces a decentralized closed-loop framework for cooperative multi-agent systems, with Perception (multi-modal input processing), Task Reasoning (objective decomposition), Self-Reflection (execution quality evaluation), Actor (skill selection and execution), Skill Synthesis (executable code generation), Local Memory (agent's individual memory), Global Memory (shared memory), and Skill Library (executable skill collection).
COMPASS integrates VLMs with a dynamic skill library and structured communication for decentralized decision-making under partial observability.
The framework enables agents to adapt strategies through iterative planning, skill synthesis, and information sharing, improving performance in cooperative tasks.

Interpretable Concept-based Deep Learning Framework for Multimodal Human Behavior Modeling

AGCM (Attention-Guided Concept Model): introduces interpretable concept-based framework for multimodal human behavior modeling with Transformer Backbone, Attention-Guided Concept Generator (ACG), Multi-scale Spatial Attention (MSA), Channel Attended Concept Mapping (CACM), Concept Probability Generator, Task Predictor and Loss Computation.
AGCM framework learns conceptual explanations by identifying predictive concepts and their spatial locations through multimodal concept alignment and co-learning for enhanced decision-making insights.
AGCM framework achieves state-of-the-art performance in affective computing tasks by balancing interpretability and accuracy through spatial concept supervision and attention mechanisms.

Provably Efficient RL under Episode-Wise Safety in Constrained MDPs with Linear Function Approximation

OPSE-LCMDP (Optimistic-Pessimistic Softmax Exploration for Linear CMDP): introduces a reinforcement learning algorithm for linear constrained Markov decision processes, utilizing Safe Policy Deployment, Optimistic Exploration, Pessimistic Constraint, Softmax Policy, Bisection Search, Clipped Value Functions, and Confidence Bounds.
This algorithm achieves episode-wise zero-violation guarantee and sublinear regret in linear CMDPs, improving upon existing methods by ensuring safety and computational efficiency.
The key innovations include a novel safe policy deployment rule and a softmax-based optimistic-pessimistic exploration strategy, addressing the challenge of safe RL with function approximation in large-scale CMDPs.

--

Leveraging V2X for Collaborative HD Maps Construction Using Scene Graph Generation

HDMapLaneNet (High-Definition Map Lane Network): introduces a framework leveraging V2X communication and scene graph generation, with Input Image, Backbone, Encoder, Decoder, Prediction Heads, RGCN, Scene Graph, V2X Conversion & Transmission, and Final Map Generation & Distribution, to collaboratively construct localized geometric HD maps from camera images.
HDMapLaneNet utilizes DeepLabV3 for feature extraction, DETR-based transformer for lane detection, and RGCN for modeling lane connectivity, transmitting scene graphs via V2X for cloud-based aggregation and HD map generation.
The framework aims to improve HD map generation by enabling individual vehicles to contribute localized lane information, addressing limitations of traditional mapping approaches in real-time updates and cost-effectiveness.

CAUSAL INFORMATION PRIORITIZATION FOR EFFICIENT REINFORCEMENT LEARNING

CIP (Causal Information Prioritization): introduces augment, reweight, and empowerment components to improve reinforcement learning sample efficiency by prioritizing causal information.
CIP framework leverages causal relationships between states, actions, and rewards within factored MDPs to enhance exploration and learning of efficient policies.
CIP utilizes counterfactual data augmentation and causality-aware empowerment to focus on causally significant behaviors, bridging causal reasoning and efficient exploration.

A novel approach to data generation in generative model

CFP (Convergent Fusion Paradigm): introduces a geometric framework for data generation in generative models like VAEs, utilizing Ambient (input) space, Latent space, Ambient (output) space, Riemannian metric, Jacobian, Pull-back metric, crSTR, and DCPSs of TC-EO components.
arxiv_paper_framework_name: framework facilitates structured integration of dimensional spaces through Riemannian metric and Jacobian, enabling qualitative transformation in data generation.
arxiv_paper_framework_name: theory addresses limitations of Euclidean geometry in capturing complex data generation by incorporating concepts like crSTR and DCPSs of TC-EO for enhanced model expressivity.

Enhancing Patient Acceptance of Robotic Ultrasound through Conversational Virtual Agent and Immersive Visualizations

Robotic Ultrasound Guidance Procedure: introduces a system for robotic ultrasound with virtual agent guidance, incorporating STT (transcribes patient speech), LLM (generates responses), TTS (converts text to speech), IK (controls avatar movement), Virtual Environment (immersive setting), Motion & Force Control (robot control), and Image Processing (ultrasound image processing).
The system uses mixed reality visualizations (AR, AV, VR) to enhance patient comfort and trust during robotic ultrasound procedures.
The integration of a conversational virtual agent and immersive visualizations aims to bridge the gap between robotic automation and patient-centered care in medical procedures.

Coordinated control of multiple autonomous surface vehicles: challenges and advances - a systematic review

Coordinated control of multiple autonomous surface vehicles: introduces a systematic review of coordinated control for multiple autonomous surface vehicles (ASVs), focusing on Control Techniques, Disturbances, Uncertainties, Communication, and Experimental Validation.
This review analyzes various control methods, disturbance considerations, uncertainty handling, communication strategies, and experimental validation approaches in coordinated ASV control.
The paper identifies research gaps and future directions in coordinated ASV control, emphasizing the need for more experimental validation and robust communication strategies.

TOWARDS EMPOWERMENT GAIN THROUGH CAUSAL STRUCTURE LEARNING IN MODEL-BASED RL

ECL (Empowerment through Causal Learning): introduces a novel framework integrating empowerment with causal reasoning in model-based reinforcement learning, incorporating Causal Dynamics Model, Reward Model, Model Optimization, Empowerment-driven Exploration, and Policy Learning with Intrinsic Reward Bonus.
ECL framework actively utilizes causal structure to maximize empowerment gain, enhancing agent controllability and learning efficiency within complex environments.
The framework employs curiosity reward to reduce overfitting and improve exploration, achieving enhanced performance in causal discovery and policy learning tasks.

A Survey on LLM-powered Agents for Recommender Systems

Unified Agent Architecture: introduces unified agent architecture consisting of Profile, Memory, Planning and Action modules for LLM-powered agent recommender systems.
Unified Agent Architecture: decomposes agent recommender into profile construction, memory management, strategic planning, and action execution modules.
Unified Agent Architecture: facilitates analysis of existing LLM-powered agent methods by providing structured framework of core components and their functions.

Automation Bias in the AI Act On the Legal Implications of Attempting to De-Bias Human Oversight of AI

AIA (Artificial Intelligence Act): introduces AI providers (develop and design AI), AI deployers (use AI systems), human oversight agents (oversee AI systems) and harmonised standards (technical specifications for AI) to address automation bias in high-risk AI systems.
AIA mandates human oversight and awareness of automation bias, yet its focus on providers may not fully address design and context as causes of bias.
Harmonised standards referencing research on automation bias and human-AI interaction are proposed to balance legal mandates and behavioural science within AIA framework.

Dream to Drive: Model-Based Vehicle Control Using Analytic World Models

Analytic World Models (AWMs): introduces differentiable simulation framework utilizing observed modalities, modality features, fused world state, RNNCell, world model, differentiable simulator, dynamics, agent, reward, and semantic predictions for vehicle control tasks.
AWMs framework enables learning relative odometry, state planning, and inverse state estimation by leveraging gradients from differentiable simulator, improving upon Analytic Policy Gradients (APG).
The proposed approach uses Model Predictive Control (MPC) with AWMs for planning, achieving improved performance and interpretability in autonomous driving tasks compared to reactive policies.

V2V-LLM: Vehicle-to-Vehicle Cooperative Autonomous Driving with Multi-Modal Large Language Models

V2V-LLM (Vehicle-to-Vehicle Large Language Model): introduces V2V-LLM, a framework for cooperative autonomous driving using multiple CAVs sharing Perception Data Language QA with a central LLM to answer driving-related questions.
V2V-LLM fuses scene and object level features from each CAV's perception data, enabling comprehensive environmental understanding for improved driving safety.
This approach addresses sensor occlusion challenges in autonomous driving by leveraging cooperative perception and LLMs for enhanced decision-making and trajectory planning.

Generating on Generated: An Approach Towards Self-Evolving Diffusion Models

RSIDiff (Recursive Self-Improvement Diffusion): introduces a self-training approach for diffusion models, iteratively refining the model using its own generated data via prompt construction and filtering pipeline, preference sampling, distribution-based weighting, and model fine-tuning.
RSIDiff framework enhances diffusion model performance by addressing training collapse through perceptual alignment and hallucination reduction using quality-focused data and distribution control.
The framework leverages synthetic data for continuous self-evolution, employing strategies to generate perceptually aligned data, filter human-preferred samples, and penalize hallucinatory errors.

Strategyproof Maximum Matching under Dichotomous Agent Preferences

SAFE (Sequential Allocation for Fairness and Efficiency) and Rank-Maximal mechanisms: introduces acceptability graph, safe blocks, baseline permutation and iterative assignment to achieve strategyproof maximum matching under dichotomous agent preferences and strict institution priorities.
SAFE mechanism iteratively identifies safe blocks of institutions based on acceptability graph and baseline permutation, assigning agents to institutions within these blocks to maximize matching size while ensuring fairness.
Rank-Maximal mechanism, equivalent to SAFE, uses ranked acceptability graph and lexi-optimal set of institutions for iterative matching, providing an alternative perspective on achieving the same desirable properties.

Diverse Inference and Verification for Advanced Reasoning

Diverse Inference: introduces a diverse inference framework that combines multiple models and methods, including Human (User input), LLM (Reasoning engine), Agent (Process orchestrator), Game Environment (Problem representation), Verifier (Solution checker), and specific methods like LEAP (Task-specific learning), Z3 (Theorem prover), RTO (Round trip optimization), BoN (Best-of-N sampling), SC (Self-consistency), MoA (Mixture of agents), MCTS (Monte Carlo search), PV (Prover-verifier game), Unit Tests (Code execution verification), and Best-of-N (Sampling based verification), to enhance reasoning and verification for advanced tasks.
arxiv_paper_framework_name: leverages perfect verifiers like Lean Verifier (Formal proof checker) for IMO and ARC problems and imperfect verifiers like Best-of-N (Sampling based verification) for HLE questions, achieving higher accuracy on challenging benchmarks.
arxiv_paper_framework_name: utilizes test-time simulations, reinforcement learning, and meta-learning to adapt agent graph representations and improve generalization across diverse problem types.

The Blind Men and the Elephant: Mapping Interdisciplinarity in Research on Decentralized Autonomous Organizations

Framework name here: introduces citation network analysis, topic modeling, and outlet analysis components to assess interdisciplinary research maturity on Decentralized Autonomous Organizations (DAOs).
Citation network analysis identifies boundary papers and interaction patterns, topic modeling explores interdisciplinary discussion themes, and outlet analysis investigates author and publication relationships.
This multi-method approach uncovers where and how interdisciplinary exchanges occur in DAO research, revealing fragmented yet inherently interdisciplinary nature.

MIR-Bench: Benchmarking LLM's Long-Context Intelligence via Many-Shot In-Context Inductive Reasoning

MIR-Bench: introduces a data generation pipeline with Data Collection, Input-Output Generation, and Dataset Construction for benchmarking LLMs' long-context inductive reasoning.
MIR-Bench pipeline utilizes coding benchmarks for function acquisition and GPT-4 for generating input-output pairs, employing filtering to create MIR-Extended and MIR-Core datasets via factor analysis.
MIR-Bench addresses the evaluation gap in LLMs' many-shot inductive reasoning by offering a large-scale, diverse benchmark and providing novel problems for insightful analysis.

Self-Consistent Model-based Adaptation for Visual Reinforcement Learning

SCMA (Self-Consistent Model-based Adaptation): introduces robust visual RL adaptation method using denoising model, pre-trained world model, and policy to mitigate distractions.
SCMA utilizes a denoising model to transfer cluttered observations to clean ones, guided by a pre-trained world model for unsupervised distribution matching.
The approach enhances policy performance in distracting environments without policy modification, offering a plug-and-play solution for various policies.

(How) Can Transformers Predict Pseudo-Random Numbers?

GPT-style Autoregressive Transformer: introduces paper investigates Transformer's ability to predict pseudo-random numbers using embedding layers, attention heads, MLP block, output layer, and memory.
It reveals Transformers learn to factorize modulus and use digit-wise representations for prediction.
The study shows a sharp accuracy transition at depth 3 and sublinear scaling of context elements with modulus.

Unconventional Transport in a System with a Tower of Quantum Many-Body Scars

Spin-1 Chain Model: introduces investigation of unconventional transport phenomena within a spin-1 model, utilizing Hamiltonian, Ladder Operator, Scarred Subspace, Coherent States, and Autocorrelation Function components.
This research demonstrates transport mechanism linked to quantum many-body scars, observing slowly decaying oscillations from dynamical symmetry within the scar subspace using autocorrelation function analysis.
The study reveals that ETH-preserving many-body spectrum mediates this transport, while quantum many-body scars themselves provide negligible contribution in thermodynamic limit, suggesting generalized eigenstate thermalization hypothesis.

Unknown Word Detection for English as a Second Language (ESL) Learners Using Gaze and Pre-trained Language Models

EyeLingo: introduces method for unknown word detection utilizing Gaze Data (user's eye movement) and Text Data (reading content) processed by Transformer-based Model (predicts unknown words) to identify Unknown Words (words needing definition) for ESL learners.
EyeLingo leverages gaze to locate region of interest, integrating linguistic information from pre-trained language models and gaze trajectory for real-time unknown word prediction.
The framework compensates for gaze inaccuracy by incorporating probabilities from language model, enabling real-time language learning assistance and vocabulary acquisition.

CISSIR: Beam Codebooks with Self-Interference Reduction Guarantees for Integrated Sensing and Communication Beyond 5G

CISSIR (Codebooks with Integral Split for Self-Interference Reduction): introduces beam codebook design with digital pre-coding, beam selection, gain control, radar processing, I/Q Modulation, DAC, TX Analog Beams, TX Antennas, direct coupling, clutter, user equipment, SI channel, target, radar channel, RX Antennas, RX Analog Beams, ADC, and I/Q Demodulation for self-interference reduction in integrated sensing and communication (ISAC) systems.
CISSIR framework optimizes codebooks for tapered beamforming and phased arrays, adapting codebooks to self-interference channel and achieving specific self-interference level.
CISSIR method improves sensing quality while maintaining communication performance, reducing dependence on hyperparameters and enhancing self-interference reduction capabilities.

Decentralized State Estimation and Opacity Verification Based on Partially Ordered Observation Sequences

ASS (All Sequence Structure): introduces decentralized observation architecture for state estimation and opacity verification using observation sites and coordinator.
ASS framework constructs current-state and initial-state estimators based on partially ordered observation sequences for offline analysis.
The approach reduces complexity in verifying state-isolation properties like opacity within decentralized systems compared to existing methods.

Investigations of multi-socket high core count RISC-V for HPC workloads

Sophon SG2042 Benchmarking Framework: introduces performance evaluation of Sophon SG2042 CPU, dual-socket system, nodes, sockets, cores, memory subsystem, benchmark suite, and comparison CPUs for HPC workloads.
This framework assesses RISC-V based Sophon SG2042 CPU in dual-socket configuration using NAS Parallel Benchmarks, focusing on memory and compute bound performance.
The study contrasts SG2042 against AMD EPYC, Intel Skylake, and Marvel ThunderX2 CPUs to identify performance characteristics in realistic HPC scenarios.

TRUSTZERO - OPEN, VERIFIABLE AND SCALABLE ZERO-TRUST

TrustZero (Zero Trust Architecture (ZTA)): introduces ModSecurity, Policy Enforcement Point, Policy Administrator, Policy Engine, Trust Token, and Trust Algorithm to establish a scalable zero-trust security layer using verifiable cryptographic signatures and continuous identity verification.
TrustZero framework, leveraging cryptographic principles and zero-trust architecture, aims to enhance security and trust in inter-organisational communication by replacing implicit trust with explicit verification and portable trust tokens.
TrustZero's architecture emphasizes transparency and reproducibility, offering an open-source framework designed for deployment in sensitive infrastructures and adaptable to legacy systems by integrating with European Digital Identity Wallet.

SPIRIT: Short-term Prediction of solar IRradlance for zero-shot Transfer learning using Foundation Models

SPIRIT (Short-term Prediction of solar IRradlance for zero-shot Transfer learning using Foundation Models): introduces a novel approach for solar irradiance forecasting using Dataset, Vision Encoder, Physics Features, SpatioTemporal Data, Regressor, Time Series Model and Future Covariate Vector components.
arxiv_paper_framework_name: leverages foundation models and physics-informed features to enable zero-shot transfer learning for solar irradiance prediction, reducing reliance on site-specific data.
arxiv_paper_framework_name: achieves effective adaptation across diverse transfer learning scenarios and demonstrates rapid scalability to new solar plant locations without prior data.

A Multiagent Path Search Algorithm for Large-Scale Coalition Structure Generation

SALDAE (Scalable Algorithm with Large-Scale Distributed Agent Exploration): introduces multiagent path search algorithm for coalition structure generation using search agents exploring coalition structure graph from start node, generating child nodes, performing node selection, comparing to incumbent, employing bridging paths, utilizing memory management and applying conflict resolution.
SALDAE algorithm iteratively builds search graph by splitting or merging coalitions for solving coalition structure generation problems, aiming to find optimal partitioning of agents into coalitions to maximize social welfare.
SALDAE leverages memory management with OPEN, SUBSTITUTE, and RESERVE lists and incorporates strategies like SPLIT-THEN-MERGE, MERGE-THEN-SPLIT, and APPROACH-THEN-SWAP to enhance search efficiency and solution quality in large-scale multiagent systems.

11th February 2025

CSR-Bench: Benchmarking LLM Agents in Deployment of Computer Science Research Repositories

CSR-Agents: introduces a multi-agent framework to automate code repository deployment using multiple LLM agents with specialized roles including Command Drafter, Script Executor, Log Analyzer, Issue Retriever, and Web Searcher.
CSR-Agents framework leverages iterative trial-and-error process within Docker Environment, refining bash commands based on execution logs, issue database retrieval and web search integration.
CSR-Agents framework is evaluated on CSR-Bench, a novel benchmark designed for assessing LLM agent capabilities in automating deployment of computer science research repositories from GitHub.

10th February 2025

Visual Agentic AI for Spatial Reasoning with a Dynamic API

VADAR (Visual, Agentic, Dynamic AI for Reasoning): introduces a training-free agentic approach for 3D understanding, which dynamically generates new skills in Python, using Signature Agent, Implementation Agent, Test Agent, Program Agent, Execution Agent, Vision Specialists, and API, and is evaluated on CLEVR and OMNI3D-BENCH.
VADAR leverages LLM agents to define and expand a domain-specific language, generating new functions and skills in two phases: API Generation and Program Synthesis.
The framework addresses limitations of prior approaches that rely on a static, human-defined API, allowing it to handle a wider range of queries.

7th February 2025

Sirius: Self-improving Multi-agent Systems via Bootstrapped Reasoning

SIRIUS: introduces a self-improving multi-agent system framework, utilizing Physicist, Mathematician, Summarizer agents, Experience Library, Experience Augmentation, and Fine-tuning for optimizing multi-agent systems.
SIRIUS constructs an experience library by retaining successful reasoning trajectories to provide training data for agent policy fine-tuning.
SIRIUS further enriches the library by augmenting unsuccessful trajectories, enhancing data diversity and improving system performance.

MELON: Indirect Prompt Injection Defense via Masked Re-execution and Tool Comparison

MELON (Masked re-Execution and TooL comparisON) introduces an indirect prompt injection defense framework, with Agent System, Tool Execution, Tool Call Cache, Compare Tool Calls, and Masking Function, that detects attacks by comparing tool calls between original and masked executions.
MELON framework leverages Masking Function to generate task-neutral prompts for masked re-execution, utilizing Tool Call Cache to store masked run tool calls and Compare Tool Calls to identify deviations indicating potential attacks.
MELON framework enhances security and utility balance by focusing on tool call comparison and incorporating designs like customized masking, tool call caching, and focused comparison to reduce false positives and negatives in indirect prompt injection detection.

NVAGENT: Automated Data Visualization from Natural Language via Collaborative Agent Workflow

NVAGENT: introduces collaborative agent workflow for NL2VIS, with processor (database processing and context filtering), composer (planning visualization generation), and validator (code translation and output verification).
NVAGENT decomposes visualization generation into manageable subtasks using processor for data preparation, composer for VQL generation, and validator for ensuring correctness.
NVAGENT leverages divide-and-conquer strategy with specialized agents to effectively handle complex NL2VIS tasks, improving visualization accuracy and quality.

The Rising Threat to Emerging AI-Powered Search Engines

Agent-based Defense: introduces agent-based defense with Observation, Thought, Action, Tools, and Agent-based Defense components, where agent-based defense mitigates risks in AIPSE outputs.
Agent-based defense framework uses Observation to gather AIPSE output, Thought for reasoning, Action to use Tools like Content Refinement and URL Detector.
Agent-based defense aims to filter and mark potential risks in AIPSE output while preserving response similarity.

S².-MAD: Breaking the Token Barrier to Enhance Multi-Agent Debate Efficiency

S2-MAD (Selective Sparse Multi-Agent Debate): introduces Initial Response Generation, Grouping Discussion with Decision-Making Mechanism, and Reaching Consensus, with Agents organized in Groups, to enhance multi-agent debate efficiency.
S2-MAD framework employs Decision-Making Mechanism comprising Similarity Calculation, Redundant Information Filtering, and Conditional Participation modules to manage agent engagement.
S2-MAD framework aims to reduce token costs in multi-agent debate by selectively incorporating non-redundant viewpoints and optimizing information exchange among agents.

Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research

Agentic Reasoning: introduces a framework enhancing large language model reasoning by integrating Web-Search Agent, Coding Agent, and Mind Map Agent, to solve complex problems requiring research and logical deduction.
Agentic Reasoning framework utilizes Mind Map Agent for structured knowledge graph construction, Web-Search Agent for real-time information retrieval, and Coding Agent for computational analysis, improving reasoning and decision-making.
Agentic Reasoning enables language models to perform multi-step strategies and tackle complex problems by dynamically adapting to information and performing quantitative analyses using external agents and structured memory.

Self-Regulation and Requesting Interventions

Offline PRM-Tabular RL Framework: introduces an offline approach for training intervention-requesting agents, with State Dynamics, PRM, Tabular RL, Usage computation, Policy computation, Reward Search, and SFT Helper components.
Offline PRM-Tabular RL Framework combines LLM-based Process Reward Models with tabular reinforcement learning to efficiently determine optimal intervention timing under budget constraints.
This framework reduces costly intervention calls during training by leveraging offline data and enhancing robustness through PRMs and tabular RL, avoiding deep RL inefficiencies.

Every Software as an Agent: Blueprint and Case Study

JiT-Codegen: introduces software agent framework for in-software execution using LLM-powered Agent, JiT Code Agent, Software Runtime, and Exec. Sandbox.
JiT-Codegen framework enables LLMs access to software internals and inject code for execution within a secure Exec. Sandbox.
This approach aims to overcome limitations of API-based and GUI-based agents by enabling more direct and efficient software interaction.

STRIDE: Automating Reward Design, Deep Reinforcement Learning Training and Feedback Optimization in Humanoid Robotics Locomotion

STRIDE (Automating Reward Design, Deep Reinforcement Learning Training and Feedback Optimization in Humanoid Robotics Locomotion): introduces framework integrating Humanoid Robotics Environment, Motion Task Description, LLMs, Reward Function Sampling, Reward Functions, Reward Reflection, DRL Training, and Feedback Result for automated reward design.
STRIDE leverages LLMs for zero-shot reward function generation and iterative refinement through feedback from DRL training outcomes.
STRIDE automates reward engineering for humanoid robot locomotion, overcoming limitations of manual reward design and improving task performance.

Learning Strategic Language Agents in the Werewolf Game with Iterative Latent Space Policy Optimization

LSPO (Latent Space Policy Optimization): introduces iterative framework with Latent Space Construction (create discrete strategy space), Policy Optimization in Latent Space (optimize strategy in latent space), and Latent Space Expansion (expand strategy space coverage) for strategic language agents.
LSPO framework addresses challenges in free-form language games by mapping text to latent space, optimizing policy with game-theoretic methods, and expanding space via LLM fine-tuning.
Iterative process of LSPO enhances strategic reasoning and language communication, improving agent performance in complex games like Werewolf.

6th February 2025

VTutor: An Open-Source SDK for Generative AI-Powered Animated Pedagogical Agents with Multi-Media Output

VTutor (Software Development Kit): introduces an open-source framework for creating animated pedagogical agents, integrating Generative AI (LLMs), Text-to-Speech (TTS), Lip Synchronization (LipSync), Character Model, WebGL Rendering, Web Interface, API Communication, SDK, Iframe Integration, and React SDK.
VTutor combines generative AI with animation technologies to enable personalized learning experiences through adaptable, realistic animated pedagogical agents with multi-media output in web environments.
The framework leverages LLMs for context-aware feedback, uLipSync for accurate lip movements, and WebGL for seamless web integration, offering tools for developers to create engaging educational agents.

PsyPlay: Personality-Infused Role-Playing Conversational Agents

PsyPlay: introduces dialogue generation framework with Role Card Creation (generates agent roles), Topic Extraction (extracts dialogue topics), and Dialogue Generation (creates personality-infused dialogues) for personality-infused role-playing.
PsyPlay framework facilitates expression of rich personalities among multiple LLM agents assuming distinct roles and engaging in discussions.
PsyPlay validation demonstrates accurate portrayal of intended personality traits with high success rate on generated dialogue data.

Large Language Models for Multi-Robot Systems: A Survey

BOLAA (Benchmarking and Orchestrating LLM-augmented Autonomous Agents) architecture orchestrates multiple LAAs using Agents Message Controller, which manages communication between Environment, Labor Agents Pool consisting of multiple agents LAA 1, LAA 2, LAA m, each having LLM and Agent Prompt, Action Parser, Memory and Agents Selection.
BOLAA architecture employs central controller for message distribution to individual agents with own LLMs, processing distributed messages to generate actions, improving consistency and reliability in collaborative systems.
BOLAA architecture serves as comparative framework for LLM-augmented agents, offering insights into LLM integration and agent orchestration for multi-robot applications, despite focus on multi-agent systems rather than exclusively MRS.

Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents

DoT (Division-of-Thoughts): introduces collaborative reasoning framework, with Task Decomposer, Task Scheduler, Task Allocation, Plug-and-Play Adapter, SLM, LLM, and Self-Reinforced Tree Search, for efficient on-device agents using hybrid language model synergy.
DoT framework employs Task Decomposer for breaking down queries, Task Scheduler for dependency analysis between sub-tasks, and Plug-and-Play Adapter for dynamic allocation of sub-tasks between SLM and LLM.
Self-Reinforced Tree Search method trains Plug-and-Play Adapter using task execution feedback to optimize sub-task allocation strategy for enhanced efficiency and maintained accuracy.

--

Multi-Agent Reinforcement Learning with Focal Diversity Optimization

MARL-Focal (Multi-Agent Reinforcement Learning with Focal Diversity Optimization): introduces a two-stage framework with Decider Agent (selects optimal LLM subset) and Aggregator Agent (synthesizes final output) to improve LLM performance by leveraging a Pool of LLMs in Cloud (collection of available models) and diversity metrics.
MARL-Focal framework utilizes Decider Agent (selects optimal LLM subset) within Multi Agent Environment (manages agent interactions) to choose Chosen LLMs (selected ensemble subset) based on Perf. Metrics (diversity-based selection metrics) from Incoming Online Queries (user input queries) and generate Model Outputs (LLM generated responses) for final aggregation.
The framework's architecture allows for adaptive ensemble creation by dynamically selecting and combining diverse LLMs, aiming to enhance output quality and robustness while maintaining cost-efficiency in multi-agent learning scenarios.

ACTIVE TASK DISAMBIGUATION WITH LLMS

Active Task Disambiguation: introduces method for LLM agents to actively clarify ambiguous tasks by iteratively asking questions, using solution generator, question generator, and question evaluator to refine problem statement and solution space.
It leverages Bayesian Experimental Design principles to select questions maximizing information gain, shifting reasoning from implicit to explicit solution space exploration.
This approach improves task disambiguation compared to methods relying solely on implicit reasoning within the question space, enhancing LLM agents' ability to address underspecified problems.

ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization

ScoreFlow: introduces automated workflow generation framework, with LLM Generator (workflow code generator), Executor (workflow performance evaluator), Collect data (score gathering component), Scores (workflow performance metrics), Preference workflow pairs (score-based workflow rankings), Iterative Score-DPO (score-aware optimization algorithm), Operators (reusable agent node combinations), Workflows (code for agent interactions), Problem dataset (input task collection).
ScoreFlow framework utilizes Score-DPO optimization, incorporating quantitative evaluation feedback for workflow generation.
ScoreFlow enhances scalability and performance through gradient-based optimization and code-based workflow representation.

Multi-agent Architecture Search via Agentic Supernet

MaAS (Multi-agent Architecture Search): introduces agentic supernet, probabilistic architecture distribution, with controller, agentic operators, environment, and feedback, where MaAS optimizes distribution of agentic architectures for query-dependent multi-agent systems.
MaAS framework leverages controller network to sample task-specific multi-agent systems from agentic supernet, adapting architecture based on environmental feedback and agentic operators.
Agentic supernet in MaAS enables efficient resource allocation by dynamically adjusting multi-agent system complexity based on query difficulty and domain, utilizing feedback for continuous improvement.

Speaking the Language of Teamwork: LLM-Guided Credit Assignment in Multi-Agent Reinforcement Learning

LCA (LLM-guided Credit Assignment): introduces a novel framework for multi-agent reinforcement learning that uses LLM State Ranking (LLM prompts for pairwise state ranking) to generate agent-specific rewards.
LCA framework leverages Ego Agent (agent perspective encoding) and Scoring Model Training Dataset (ranking data for training) to train a scoring model, facilitating credit assignment in sparse reward MARL environments.
By utilizing Temporal Difference (RL update mechanism) and LLM (large language model), LCA achieves faster convergence and higher returns compared to baselines by addressing credit assignment and reward sparsity.

MultiQ&A: An Analysis in Measuring Robustness via Automated Crowdsourcing of Question Perturbations and Answers

MultiQ&A: introduces a multi-step pipeline for question answering robustness evaluation, with Query Rewrite, Generator, and Aggregator components.
MultiQ&A framework utilizes Query Rewrite to create diverse question variations, Generator to independently answer them, and Aggregator to evaluate answer robustness and consistency.
This approach enables automated crowdsourcing and robust question answering evaluation mimicking real-world scenarios for assessing Large Language Model performance under question perturbations.

5th February 2025

A Schema-Guided Reason-while-Retrieve framework for Reasoning on Scene Graphs with Large-Language-Models (LLMs)

SG-RwR (Schema-Guided Reason-while-Retrieve) introduces a two-agent framework with Reasoner (task planning and query generation) and Retriever (graph information extraction) for scene graph reasoning.
SG-RwR framework utilizes Scene Graph Schema (textual graph description) to guide both Reasoner and Retriever agents in code-writing for information retrieval and reasoning.
The framework components Reasoner Query (query for graph information), Retrieve Code (code for information retrieval), Retrieved Information (extracted graph data) and Reason Code (code for reasoning and tools) enable iterative and adaptive graph processing to generate Answer (final task solution).

PalimpChat: Declarative and Interactive AI analytics

PALIMPCHAT: introduces chat interface, user, pipeline design, program optimization, plan execution, programming framework, cost estimation, plan execution and unstructured data for AI analytics.
PALIMPCHAT integrates ARCHYTAS reasoning agent and PALIMPZEST declarative framework to enable natural language interaction for AI pipeline creation and execution.
PALIMPCHAT simplifies complex AI workflows by providing accessible chat-based interface for both expert and non-expert users to design and run data processing pipelines.

SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs

SymAgent (A Neural-Symbolic Self-Learning Agent Framework) introduces an agent framework with Agent-Planner, Agent-Executor, KG Environment, Action Tool Set, and Self-learning Framework for complex reasoning over knowledge graphs.
SymAgent framework utilizes Agent-Planner to derive symbolic rules from knowledge graphs and Agent-Executor to apply action tools for information integration from knowledge graphs and external sources.
Self-learning Framework in SymAgent enables iterative online exploration and offline policy updating, facilitating autonomous synthesis of reasoning trajectories and performance improvement.

Strategizing with AI: Insights from a Beauty Contest Experiment

Framework name here: introduces LLMs, Game Scenarios, Prompts, and Performance Analysis to investigate strategic behavior of AI agents in game theory experiments.
This framework evaluates LLMs' decision-making by comparing their game strategies against human players and theoretical predictions across varied game scenarios.
The framework aims to understand LLMs' strategic reasoning, adaptability, and limitations in emulating human-like behavior within economic game contexts.

COSMosFL: Ensemble of Small Language Models for Fault Localisation

COSMosFL (COllection of Small Language Models for Fault Localisation): introduces a task-level LLM ensemble technique using voting mechanism with M models, SLMs, Query Root Cause, Bug Information, Query Fault Location, Available Tools, Voting-based Ensemble, and Confidence Score for fault localization.
COSMosFL aggregates answers from multiple SLMs instead of repeated sampling from a single LLM to improve fault localization accuracy and cost-benefit trade-off.
COSMosFL leverages voting-based ensemble and differential evolution for weight optimization to achieve Pareto-optimality in fault localization accuracy and inference cost.

4th February 2025

Adaptive Self-improvement LLM Agentic System for ML Library Development

Adaptive self-improvement LLM agentic system: introduces agentic system organization with parallel sampling to enhance LLM Agents via select multi-level experiences, stratify them by difficulty, filter high-quality answers, and use demonstrations for ML library development Task, verified by Verifier to produce Answer.
This system employs adaptive self-improvement learning algorithm that filters quality answers, stratifies experiences by difficulty, and selects demonstrations to improve LLM agents' performance in generating architecture-specific programming language code.
The framework addresses challenges in ML library development by enabling complex reasoning with limited data through a self-improvement cycle where LLM agents evolve via earned experiences and generate high-quality ML operators.

AdaptBot: Combining LLM with Knowledge Graphs and Human Input for Generic-to-Specific Task Decomposition and Knowledge Refinement

AdaptBot: introduces framework integrating LLM, Knowledge Graph, Human Input, Execution, and Decision Module, for task decomposition and knowledge refinement.
AdaptBot utilizes LLM for generating initial abstract action plans, Knowledge Graph for domain-aware refinement, and human input for error correction and knowledge expansion.
AdaptBot framework facilitates adaptation to new tasks through incremental knowledge refinement via human feedback and Knowledge Graph-guided error resolution.

Anticipate & Act : Integrating LLMs and Classical Planning for Efficient Task Execution in Household Environments

Anticipate & Act: introduces a framework for efficient task execution, integrating User, LLM Prompting, LLM, Mapping, Planning, FASTDOWNWARD PLANNER, GENERATED PLAN, and SIMULATION components.
Anticipate & Act: leverages LLM to predict high-level tasks from User prompts and uses FASTDOWNWARD PLANNER to generate fine-grained action sequences via Planning and Mapping components.
Anticipate & Act: demonstrates efficiency in household tasks by anticipating future tasks and planning actions jointly within SIMULATION environment, reducing execution time and plan length.

CoAT: Chain-of-Associated-Thoughts Framework for Enhancing Large Language Models Reasoning

CoAT (Chain-of-Associated-Thoughts): introduces a reasoning framework for large language models that combines an optimized Monte Carlo Tree Search (MCTS) with a dynamic associative memory mechanism, integrating Target LLM, Associative Memories, Nodes, optional External Brain, Knowledge Graph, Vector Database, LLM agents, Internet Access, and Evaluator.
The framework expands the reasoning search space and adaptively incorporates new information, mimicking human-like associative thinking during inference.
Optimized MCTS algorithm systematically integrates associative content and generated content through tree node search, and flexible mechanism sources associative content by self-association or external knowledge retrieval.

3rd February 2025

Improving Transformer World Models for Data-Efficient RL

Improved TWM for Data-Efficient RL: introduces MBRL framework with MFRL Baseline, MBRL Baseline, Dyna with Warmup, Nearest Neighbor Tokenizer, and Block Teacher Forcing for enhanced data efficiency in reinforcement learning.
The framework combines model-free and model-based RL with novel tokenization and training techniques to achieve state-of-the-art performance in the Craftax-classic environment.
Key improvements include Dyna with Warmup for hybrid real-imaginary training, Nearest Neighbor Tokenizer for efficient image encoding, and Block Teacher Forcing for improved TWM training and rollout accuracy.

PROCESS REINFORCEMENT THROUGH IMPLICIT REWARDS

PRIME (Process Reinforcement through IMplicit rEwards): introduces scalable online reinforcement learning framework with dense token-level rewards with Policy Model, Implicit PRM, SFT Model, Outcome Verifier, and Reference Model.
PRIME framework updates Implicit PRM online using policy rollouts and outcome labels, removing dedicated reward model training phase.
PRIME utilizes Implicit PRM for token-level rewards generation, mitigating reward hacking and enhancing sample efficiency in reinforcement learning for LLMs.

TReMu: Towards Neuro-Symbolic Temporal Reasoning for LLM-Agents with Memory in Multi-Session Dialogues

TReMu (Temporal Reasoning for LLM-Agents in Multi-Session Dialogues): introduces a framework with Time-aware Memorization Model (summarizes dialogue sessions with dates), Memory Retrieval Model (retrieves relevant memory for question), Neuro-symbolic Reasoning Model (generates Python code for reasoning), and Python Executor (executes generated Python code) to enhance temporal reasoning in multi-session dialogues.
It employs timeline summarization for memory and neuro-symbolic reasoning using LLMs to generate and execute Python code for temporal calculations.
This approach improves temporal reasoning performance by leveraging Python's libraries for temporal calculations and step-by-step code execution.

Reinforcement Learning for Long-Horizon Interactive LLM Agents

LOOP: introduces reinforcement learning framework for training interactive digital agents, utilizing hidden state, task context, agent output, and environment output for long-horizon tasks.
This framework uses partially observable Markov decision process to formalize agent-environment interactions via read-eval-print loop.
LOOP framework enhances sample efficiency and memory efficiency by reusing off-policy samples and maintaining single LLM copy.

Memento No More: Coaching AI Agents to Master Multiple Tasks via Hints Internalization

MNM (Memento No More): introduces an iterative coaching process with Initial Agent, Human Analyst, Hints, Teacher Agent, Training Data, Student Agent, and Task Trajectories, where human feedback guides an AI agent to master multiple tasks.
The framework refines agent behavior through iterative rounds of mistake analysis and hint internalization, improving task execution without extensive prompts.
MNM leverages context distillation to transfer hint knowledge into agent weights, enhancing generalization and reducing reliance on prompt-based guidance.

THE IN-CONTEXT INDUCTIVE BIASES OF VISION-LANGUAGE MODELS DIFFER ACROSS MODALITIES

Framework name here: introduces vision-language models, with vision input (processing visual information) and text input (processing textual information), to study generalization process (inferring category from examples) across modalities.
It highlights that inductive biases in vision-language models differ significantly based on whether input is visual or textual, affecting generalization.
Furthermore, the study reveals that in textual input, the order of feature descriptors influences the model's generalization, indicating sensitivity to linguistic structure.

SHARPIE: A Modular Framework for Reinforcement Learning and Human-AI Interaction Experiments

SHARPIE (Shared Human-AI Reinforcement Learning Platform for Interactive Experiments): introduces a modular framework for human-AI interaction experiments, featuring versatile wrapper for RL components, participant web interface, experiment configuration, logging, deployment utilities, and multi-modal communication channels.
The framework standardizes human-RL interaction research by offering a generic interface and tools for studying diverse interaction aspects and facilitating experiment design and execution.
SHARPIE supports diverse human-RL interaction use cases including reward annotation, teaching, action delegation, task specification, and human-AI teaming, facilitating research in cognitive science and RL.

TwinMarket: A Scalable Behavioral and Social Simulation for Financial Markets

TwinMarket: introduces a multi-agent framework designed for simulating socio-economic systems, incorporating User Profile, Belief, Desire, Intention, World Knowledge, Action Space, Market Environment, Social Environment, Order-Driven Trading System, Matching Engine, Data Sources, and Validation Metrics components.
TwinMarket framework simulates investor behavior within a stock market environment by utilizing Belief-Desire-Intention framework integrated with a simulated social media platform and real-world market data.
TwinMarket framework facilitates the investigation of emergent market phenomena, such as financial bubbles and volatility clustering, through scalable simulations of individual decision-making and social interactions.

Simulating Rumor Spreading in Social Networks using LLM Agents

LLM-based multi-agent network framework: introduces a simulation framework with LLM-based Agent, Post History, Rumor Belief, and Social Network to examine rumor propagation dynamics.
The framework employs LLM-based Agent to simulate user behavior, utilizing Post History for context and Rumor Belief for opinion tracking within a Social Network.
This framework assesses how different Social Network structures and agent behaviors impact Rumor Belief and overall rumor dissemination.

Evolving Symbolic 3D Visual Grounder with Weakly Supervised Reflection

EASE (Evolvable Symbolic Visual Grounder): introduces a training-free symbolic framework for 3D visual grounding, integrating Agents, executor, Visprog., Ours, test suite, relation encoder, object locations, scene scans, relation functions, and feedback components.
EASE framework employs offline LLM generation and optimization within its Ours and test suite components to enhance relation encoders, contrasting with online Agents and visual programming Visprog. methods.
The framework leverages relation encoders and feedback mechanisms to achieve a balance between grounding accuracy and inference efficiency, differing from Agents' online processing and Visprog.'s reliance on annotated relation functions.

Plan-Then-Execute: An Empirical Study of User Trust and Team Performance When Using LLM Agents As A Daily Assistant

Plan-then-execute LLM Agents: introduces a framework with LLM Planning, Plan Edit, Planning Outcome, Action Prediction, Action Execution, User-Involved Execution, Manual Specify Action or Feedback, Involve vs Approve, Approve, Involve, Execution Outcome, Successful Login, and Successful Transaction to study user trust and team performance in human-AI collaboration.
This framework uses plan-then-execute workflow where LLM agents first generate a plan, then users can edit it, and finally the agent executes the plan step-by-step with potential user involvement at each action.
The architecture allows for empirical investigation of how different levels of user involvement during planning and execution affect user trust and task outcomes when using LLM agents as daily assistants.

TeLL-Drive: Enhancing Autonomous Driving with Teacher LLM-Guided Deep Reinforcement Learning

TeLL-Drive (Teacher LLM-Guided Deep Reinforcement Learning): introduces a framework integrating LLM-Teacher with Decision Engine, Memory Repository, and Reflective Evaluator, and RL-Student with Actor, Critic, Add & Norm Attention, Multi-Head Attention, Data Distillation, and Mixed Policy for enhanced autonomous driving decision-making.
TeLL-Drive leverages LLM-Teacher's guidance through Decision Engine, Memory Repository, and Reflective Evaluator to improve RL-Student's Actor-Critic learning and policy via attention mechanisms and data distillation for efficient and robust autonomous driving.
The framework's architecture with LLM-Teacher and RL-Student components facilitates knowledge transfer and policy refinement, leading to improved adaptability and safety in autonomous driving across diverse scenarios.

PSSD: Making Large Language Models Self-denial via Human Psyche Structure

PSSD (Psyche Structure for Self-Denial): introduces a novel paradigm for Large Language Models self-denial, comprising Intuition-based Id Role, Rule-driven Superego Role, and Script-centric Ego Role, to enhance reasoning accuracy.
PSSD framework leverages multi-agent approach inspired by human psyche structure, utilizing three distinct roles for initial attempts, rule-based guidance, and procedural execution.
PSSD aims to address limitations of current mistake correction methods by facilitating agents' self-denial within LLMs, leading to improved reasoning and resource efficiency.

Human-Agent Interaction in Synthetic Social Networks: A Framework for Studying Online Polarization

Introduces agent-based architecture (individuals with attributes and interactions), LLM infrastructure (enables content generation and analysis), social network structure (governs information dissemination dynamically), agent model (represents individual user with attributes), opinion value (numerical stance on topic), personality description (agent's character traits), short biography (agent's background information), unique username (agent's identifier), interaction history (agent's past engagements), message generation (agent's content creation process), interaction mechanisms (agent's reaction to messages), opinion-based interaction function (evaluates opinion alignment), opinion strength factor (reflects opinion intensity), opinion assessment function (interprets message opinion), opinion update process (agent's opinion change mechanism), social network model (directed graph of agent connections), network structure (set of agents and follow relationships), connection dynamics (network evolution over time), information propagation (message visibility and exposure), recommendation system (determines message presentation), and influence-based scoring system (evaluates message author influence) for studying online polarization in synthetic social networks.
Framework combines mathematical opinion dynamics with large language models to simulate human-agent interaction in synthetic social networks for controlled experimentation of online polarization.
Framework enables investigation of polarization mechanisms, bridging gap between theoretical models and empirical observations, offering opportunities to study causal mechanisms underlying online opinion dynamics.

ChartCitor: Answer Citations for ChartQA via Multi-Agent LLM Retrieval

ChartCitor: introduces multi-agent framework with Table Extraction Agent, Answer Reformulation Agent, Entity Captioning Agent, LLM Prefiltering Agent, LLM Re-ranking Agent, and Cell Localization Agent for fine-grained chart answer citations.
ChartCitor framework orchestrates specialized LLM agents to extract tables, reformulate answers, generate captions, retrieve evidence, and localize cited cells in chart images.
This system enhances explainability and user trust in LLM-assisted chart question answering by providing reliable and logically-explained citations sourced from charts.

PlotGen: Multi-Agent LLM-based Scientific Data Visualization via Multimodal Feedback

PlotGen: introduces a multi-agent framework for scientific data visualization, with Query Planning Agent, Code Generation Agent, Numeric Feedback Agent, Lexical Feedback Agent, Visual Feedback Agent, and Self-Reflection, that leverages multimodal LLMs to iteratively refine visualizations based on user specifications.
PlotGen framework orchestrates agents for query decomposition, code generation, and multimodal feedback to ensure data accuracy, textual correctness, and visual alignment in generated plots.
The framework utilizes self-reflection within code generation and feedback agents to iteratively improve plot quality and address errors, enhancing user trust and productivity in data visualization tasks.

Firewalls to Secure Dynamic LLM Agentic Networks

Firewalled Agentic Networks (FAN): introduces input firewall, data firewall, and trajectory firewall, where FAN automatically constructs task-specific rules from prior simulations to build firewalls for constrained LLM agentic networks.
FAN offers layers of defense by converting free-form input to protocol, abstracting user data, and self-correcting agent trajectory.
Data and trajectory firewalls are built from prior simulations to balance adaptability, security, and privacy in LLM agentic networks.

SelfCheckAgent: Zero-Resource Hallucination Detection in Generative Large Language Models

SelfCheckAgent: introduces a framework for hallucination detection, with Symbolic Agent (semantic representation), Specialized Detection Agent (domain-aware detection) and Contextual Consistency Agent (context-aware verification), providing a multi-dimensional approach.
SelfCheckAgent framework integrates three distinct agents, utilizing diverse techniques like semantic similarity, fine-tuned NLI models, and contextual consistency checks to evaluate LLM response factuality.
SelfCheckAgent framework leverages triangulation strategy across agents, enhancing hallucination detection robustness and applicability in complex mathematical and general domains, improving trustworthiness of LLMs.

Agentic Bug Reproduction for Effective Automated Program Repair at Google

LIBRO: introduces automated BRT generation, with GITS issue, buggy file(s), test file, edit LLM, and candidate BRT, where LIBRO adapts LLM for bug reproduction test generation.
LIBRO: utilizes code-editing LLM to generate candidate BRT by prompting with bug report, buggy files, and test file.
LIBRO: aims to generate BRTs by leveraging LLM's understanding of bug descriptions and code context.

Position: Towards a Responsible LLM-empowered Multi-Agent Systems

RLHF (Reinforcement Learning from Human Feedback): presents a two-step approach involving reward model training from human feedback and language model fine-tuning through reinforcement learning to achieve human value alignment.
RLHF framework utilizes preference data and techniques like Proximal Policy Optimisation (PPO) or Direct Preference Optimization (DPO) for policy updates, enhancing model agreement with human preferences.
This method aims to create helpful and harmless AI assistants by incorporating human feedback into the learning process, improving model behaviour and safety.

Al-Khwarizmi: Discovering Physical Laws with Foundation Models

Al-Khwarizmi: introduces agentic framework for physical law discovery from data, integrating system observation, RAG, prompt, LLM, optimization, score model, test data, and human feedback components.
Framework leverages foundation models and SINDy method to automate physical law discovery by incorporating prior knowledge and iterative refinement.
Al-Khwarizmi framework achieves state-of-the-art performance in physical law discovery by utilizing multiple data modalities and automated choices of algorithms.

2nd February 2025

Efficient Multi-Agent System Training with Data Influence-Oriented Tree Search

DITS (Data Influence-oriented Tree Search): introduces a novel framework for efficient multi-agent system training with data influence-oriented tree search, incorporating Multi Agent Network, MCTS Data Synthesis, Influence Score Estimation, Data Selection, and Iterative Data Synthesis.
DITS leverages influence scores to guide tree search and data selection, effectively identifying impactful data for system improvement and enhancing model performance.
DITS derives influence score estimation methods for non-differentiable metrics, reducing computational overhead and enabling efficient synthesis time scaling.

RTBAgent: A LLM-based Agent System for Real-Time Bidding

RTBAgent (LLM-based Agent System for Real-Time Bidding): introduces an agent framework for real-time bidding, utilizing Tools (CTR prediction and bidding strategies), Summarized Memory (aggregated information for decision), Reflection Memory (self-assessment of past decisions), Bidding Memory (record of bidding history), Environment Memory (historical market conditions), Two-Step Decision-Making (sequential decision process), Insight Reasoning (analyze decision ranges and risks), Action Making (determine bidding action and reason), and Action Space (range of possible bidding adjustments).
RTBAgent employs a two-step decision-making process with Insight Reasoning (analyze decision ranges and risks) and Action Making (determine bidding action and reason) to determine optimal bidding prices, leveraging multi-memory retrieval and expert knowledge.
The framework's multi-memory system, including Reflection Memory (self-assessment of past decisions), Bidding Memory (record of bidding history), and Environment Memory (historical market conditions), enables adaptive bidding strategies by reviewing historical data and market changes.

AgentBreeder: Mitigating the AI Safety Impact of Multi-Agent Scaffolds

AGENTBREEDER: introduces evolutionary framework, with Seed Scaffolds, Population, Capability benchmark, Safety benchmark, Embedding function, Clustering function, Pareto Fronts, Elites, Meta Agent, Crossover, Mutation, and New Scaffolds, for multi-objective search over multi-agent system scaffolds.
AGENTBREEDER framework evaluates scaffolds using capability and safety benchmarks, clusters architectures, identifies Pareto optimal elites, and evolves new generations via meta-agent-driven crossover and mutation.
AGENTBREEDER framework facilitates exploration of diverse multi-agent scaffolds, balancing capability and safety objectives through evolutionary optimization and quality-diversity search algorithm.

Meta-Prompt Optimization for LLM-Based Sequential Decision Making

EXPO (EXPonential-weight algorithm for prompt Optimization): introduces an automated meta-prompt optimization framework for LLM-based agents, with components including LLM Agent (selects action based prompt), Evaluator (measures action performance), Embedding Model (converts text to numbers), Score Estimation NN (predicts meta-prompt scores), Randomized Meta-Prompt Selection (chooses meta-prompt based scores), and Exemplar Set (history of input-score pairs).
EXPO framework uses adversarial bandit algorithm principles to address non-stationarity in reward observations during sequential decision-making for optimizing task description and meta-instruction within the meta-prompt.
The framework leverages a neural network for score estimation and exponential-weight mechanism for meta-prompt selection, achieving a balance between exploitation and exploration in meta-prompt optimization.

PhiP-G: Physics-Guided Text-to-3D Compositional Scene Generation

PhiP-G (Physics-Guided Text-to-3D Compositional Scene Generation): introduces a framework for compositional scene generation, with AG-extractor (scene graph extraction from text), Scene graph (structured scene representation), AG-generater (2D image generation agent), 3D Gaussian model (3D asset generation model), Asset retrieval (2D asset library access), 2D asset retrieval library (storage for 2D assets), AG-supervisor (visual layout supervision agent), Physical pool (physics-based initial layout), Blender (3D scene environment), and World model (layout prediction and planning).
PhiP-G integrates LLM-based agents and world model for layout guidance with 3D Gaussian Splatting for efficient and physically consistent 3D scene generation.
The framework leverages a physical pool and visual supervision for iterative layout refinement, achieving state-of-the-art performance and improved efficiency.

Leveraging LLMs for Dynamic IoT Systems Generation through Mixed-Initiative Interaction

IoT-Together (Mixed-Initiative Interaction Paradigm): introduces a system architecture with User Interface (interaction medium), Goal Management (goal identification), Knowledge Management (data repository), Context Management (service hosting), Backend Generation (service generation), Intelligent User Interface Generation (application building), Interoperability platform (data pipeline), IOT DEVICES (sensor network), and Services (concrete functionalities) to enable dynamic IoT system generation through mixed-initiative interaction.
IoT-Together paradigm facilitates user-system collaboration by leveraging LLMs within Goal Management and Backend Generation for interpreting user queries and generating runtime services based on available IoT data and service definitions.
The architecture supports dynamic evolvability by generating and integrating new services at runtime, enhancing system adaptability and real-world usability in dynamic IoT environments like smart cities.

Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?

Self-MoA (Self-Mixture-of-Agents): introduces Self-MoA, an ensemble method, with Proposer (Generates multiple responses) and Aggregator (Synthesizes responses into output), that aggregates outputs from a single top-performing Large Language Model.
Self-MoA leverages in-model diversity by repeatedly sampling from the same model, achieving superior performance compared to Mixed-MoA in various benchmarks.
Self-MoA-Seq, a sequential version, addresses context length limitations by using a sliding window for aggregation, maintaining effectiveness while enabling scalability.

Agent-Based Uncertainty Awareness Improves Automated Radiology Report Labeling with an Open-Source Large Language Model

Bayesian Prompt Ensemble pipeline: introduces uncertainty-aware predictions for radiology reports using semantically equivalent prompts, LLM, predictions, aggregation function, LLM Agent, entropy-based methods, uniform weights, linear weights, MLP, decision, and uncertainty.
Bayesian Prompt Ensemble pipeline aggregates multiple LLM prompt outputs via agent-based or entropy-based methods to improve structured data extraction from radiology reports.
Agent Decision Model within Bayesian Prompt Ensemble pipeline synthesizes prompt responses and explanations to categorize decisions into confidence levels for calibrated uncertainty.

1st February 2025

WHO'S THE MVP? A GAME-THEORETIC EVALUATION BENCHMARK FOR MODULAR ATTRIBUTION IN LLM AGENTS

CapaBench (Capability-level Assessment Benchmark): introduces evaluation framework for modular LLM agents with Planning Module (decomposes instructions), Reasoning Module (performs logical inference), Action Module (translates to operations), and Reflection Module (systematic performance analysis).
CapaBench systematically quantifies module contributions using Shapley Value from game theory for performance attribution.
Framework facilitates component-level evaluation and holistic system assessment for optimizing modular LLM agents.

MarketSenseAI 2.0: Enhancing Stock Analysis through LLM Agents

MarketSenseAI: introduces a framework leveraging LLM agents including News, Fundamentals, Dynamics, Macroeconomic, and Signal Agents for holistic stock analysis.
MarketSenseAI framework processes diverse financial data like news, prices, fundamentals, and macroeconomics to support stock analysis and selection decisions.
The framework utilizes Retrieval-Augmented Generation and Chain-of-Agents architecture to enhance fundamental and macroeconomic analysis accuracy.

31st January 2025

A parallelizable variant of HCA*

HCA* (Hierarchical Cooperative A* algorithm): introduces parallelizable variant for multi-agent path finding, with Agent (computes paths and intersections), Central Server (manages coordination and conflict resolution), Reservation Table (stores fixed agent paths), Intersection Graph (represents path collisions), and Map Partition (divides map for parallel processing).
This variant parallelizes path finding and intersection graph construction to reduce computation time.
Parallelism is achieved by map partitioning and independent agent path calculations, improving performance over standard HCA*.

Multi-agent Multi-armed Bandit with Fully Heavy-tailed Dynamics

HT-HMUCB (Heavy-Tailed HoMogeneous Upper Confidence Bounds): introduces decentralized multi-agent multi-armed bandit framework with hub identification, arm selection using UCB, transmission, information update, local and global estimation components for homogeneous rewards in heavy-tailed dynamic environments.
HT-HMUCB framework addresses sparse random graphs and heavy-tailed rewards by exploiting hub structures for variance reduction and robust estimation using median-of-means estimator.
The framework achieves improved regret bounds compared to existing methods by enabling efficient communication and information aggregation in challenging heavy-tailed scenarios.

Neuro-LIFT: A Neuromorphic, LLM-based Interactive Framework for Autonomous Drone Flight at the Edge

Neuro-LIFT (Neuromorphic, LLM-based Interactive Framework for Autonomous Drone Flight at the Edge): introduces modular framework integrating Human Interaction Module, Neuromorphic Sensing Module, LLM, and Planning and Control Module for autonomous drone navigation based on human commands.
Neuro-LIFT framework utilizes Human Interaction Module for user commands, Neuromorphic Sensing Module for environment perception, LLM for command interpretation, and Planning and Control Module for drone maneuver execution.
Neuro-LIFT framework achieves real-time interactive autonomous drone flight by combining LLM-based natural language understanding with low-latency, energy-efficient neuromorphic vision for enhanced responsiveness and adaptability.

True Online TD-Replan(\xce\xbb) Achieving Planning through Replaying

TD-Replan(\xce\xbb) (True Online TD-Replan(\xce\xbb)): introduces a novel reinforcement learning method extending True Online TD by incorporating experience replay and a parameter to control replay density and target depth.
TD-Replan(\xce\xbb) utilizes interim \xce\xbb-return targets and online updates for efficient learning, demonstrating improved performance in tasks benefiting from experience replay.
The method achieves balance between planning and acting by replaying past experiences and adjusting replay density, making it suitable for complex environments and deep learning integration.

Swarm-Gen: Fast Generation of Diverse Feasible Swarm Behaviors

Swarm-Gen: introduces a framework with Generative Model (CVAE/VQ-VAE), Safety-Filter (SF), and Initialization Network, with Encoder, Decoder, QP Block, PixelCNN, MLP, and Fixed-Point Solver components, for fast generation of diverse feasible swarm behaviors.
This framework uses generative models to sample diverse trajectories, projects them onto a feasible set using a safety filter, and accelerates the safety filter convergence with a learned initialization network.
The approach demonstrates real-time generation of multi-modal swarm trajectories on commodity GPUs, offering a balance between trajectory diversity and computational efficiency using CVAE and VQ-VAE generative models.

LLM-based Affective Text Generation Quality Based on Different Quantization Values

LLM (Large Language Model): introduces quantization, LLMs, emotion classifier, seed prompts, emotion-prompt, text generation module, GPU RAM, inference time, and memory to investigate the trade-off between quantization values and affective text generation quality.
This paper evaluates the impact of different quantization levels (8, 16, 32 bits) on the performance of various LLMs (Llama-2, Mistral, Mixtral) in generating affective text, considering GPU RAM usage and inference time.
The research highlights that while quantization reduces memory consumption, it can affect text quality and inference time, revealing a trade-off between efficiency and efficacy in LLM-based affective text generation.

An Empirical Game-Theoretic Analysis of Autonomous Cyber-Defence Agents

MRO (Multiple Response Oracles): introduces a framework for holistic evaluation of ACD approaches, with INITIALPOLICIES(), Set initial mixtures, RBlue, RRed, GBlue, GRed, AUGMENTGAME, and SOLVEGAME components.
MRO framework extends the Double Oracle algorithm by incorporating multiple response oracles to enhance the assessment of Autonomous Cyber-Defence approaches.
MRO algorithm utilizes response functions and game-theoretic analysis to iteratively refine and evaluate policies for cyber-defence and cyber-attack agents.

Beyond checkmate: exploring the creative chokepoints in AI text

Chess-Text Analogy Framework: introduces a method to explore human and AI text differences by analogy to chess game segments (opening, mid game, end game) and text segments (introduction, body, conclusion), utilizing source and segment comparisons, statistical tests, feature extraction, and various datasets and LLMs.
This framework examines creative limitations in AI text generation by analyzing stylometric and psycholinguistic features across text segments, finding body segment crucial for AI detection and greater human cross-segment variation.
Research emphasizes text segments in AI detection, suggesting body segment focus and cross-segment feature variations improve detection and provide insights into LLMs' creative abilities.

PixelWorld: Towards Perceiving Everything as Pixels

PEAP (Perceive Everything as Pixels): introduces Language Model, ViT, and Text Instruction components for unified multimodal input processing.
PEAP framework processes all modalities as pixels, contrasting with token-based methods and enhancing multimodal task performance.
The framework evaluation suite, PIXELWORLD, demonstrates PEAP's effectiveness and identifies areas for improvement in complex reasoning tasks.

Enabling Autonomic Microservice Management through Self-Learning Agents

SERVICEODYSSEY: introduces a self-learning agent system for autonomic microservice management, leveraging Curriculum Builder for task generation, Execution Planner for plan creation, Knowledge Curator for skill consolidation, Data Layer for data storage, and Management Layer for module orchestration within the Operational Environment.
SERVICEODYSSEY framework incorporates High-level Manager to decompose tasks and coordinate Low-level Agents, utilizing Running State and Interaction History for context, Task Queue and Execution Queue for task management, and Feedback and Skill Library for learning and improvement.
The system refines solutions through Environment Feedback, Peer Feedback, and Hierarchical Feedback, demonstrating its effectiveness in the Sock Shop Microservice environment for autonomic management of microservices.

Secured Communication Schemes for UAVs in 5G: CRYSTALS-Kyber and IDS

CRYSTALS-Kyber and IDS Framework: introduces secure UAV communication architecture, integrating UAV Layer, Raspberry Pi, AES Encryption-Decryption, KEM, ECC, CRYSTALS-Kyber, Communication Layer, Ground Station Layer, Server, File Storage, IDS Dataset, KEM Dataset, AI Techniques, and IDS Module.
This architecture employs hybrid cryptography using AES with ECC and CRYSTALS-Kyber for quantum resistance, alongside AI-driven IDS for intrusion detection in 5G UAV networks.
Evaluated in VPN and 5G, the framework demonstrates effective security and performance balance, suitable for resource-limited UAVs facing quantum threats.

Vintix: Action Model via In-Context Reinforcement Learning

Vintix: introduces a fixed cross-domain model for in-context reinforcement learning using Noise Distillation, Cross-Domain Dataset, Causal Transformer, and Algorithm Distillation components.
Vintix framework employs Algorithm Distillation to construct versatile action models by learning behaviors through in-context reinforcement learning.
The framework demonstrates self-correction capabilities and scaling potential of In-Context Reinforcement Learning for generalist decision-making systems across multiple domains.

MINDSTORES: Memory-Informed Neural Decision Synthesis for Task-Oriented Reinforcement in Embodied Systems

MINDSTORES: experience-augmented planning framework enables embodied agents to build and leverage mental models through natural interaction with their environment.
Framework uses database of past experiences; represents experiences as natural language embeddings; allows efficient retrieval and reasoning by LLM planner; generates insights and guides plan refinement.
MINDSTORES represents an important step toward more capable embodied AI systems that can learn continuously through natural experience.

Language Games as the Pathway to Artificial Superhuman Intelligence

Language games: framework for expanded data reproduction to overcome data reproduction trap in LLMs.
Includes role fluidity, reward variety, and rule plasticity for open-ended exploration and human-AI co-evolution towards superhuman intelligence through dynamic linguistic interaction.
This framework is important as it redefines data reproduction as an engine for superhuman intelligence.

Enabling Autonomic Microservice Management through Self-Learning Agents

SERVICEODYSSEY: self-learning agent system autonomously manages microservices without prior knowledge of service-specific configurations.
Leverages curriculum learning principles and iterative exploration; develops deep understanding of operational environments; reduces dependence on human input; includes Curriculum Builder, Execution Planner, and Knowledge Curator modules.
This approach has potential for autonomic microservice management as demonstrated by prototype.

Think Smarter not Harder: Adaptive Reasoning with Inference Aware Optimization

Inference Budget-Constrained Policy Optimization (IBPO) is an algorithm designed to enable models to understand query difficulty and allocate inference budgets accordingly.
It uses utility maximization with inference budget constraint, addresses single-modal behavior in long reasoning models, and improves token efficiency.
This method is important as it significantly enhances reasoning efficiency and shows potential for broader applications beyond mathematical problem-solving.

s1: Simple test-time scaling

s1 is a simple test-time scaling approach to improve language model reasoning performance by using budget forcing and small dataset.
s1 uses budget forcing to control test-time compute, curated small dataset s1K with 1,000 high-quality questions, and supervised finetuning on Qwen2.5-32B-Instruct.
s1 demonstrates that simple test-time scaling can achieve strong reasoning performance and sample efficiency.

Do LLMs Strategically Reveal, Conceal, and Infer Information? A Theoretical and Empirical Analysis in The Chameleon Game

The Chameleon Game: is a language-based hidden-identity game to investigate information control and decision-making capabilities of LLMs.
Framework analyzes strategic interactions, information control, and decision-making capabilities using theoretical and empirical analysis with contemporary LLMs such as GPT-4, GPT-4o, Gemini 1.5, and Claude 3.5 Sonnet.
This framework is important as it points to a weakness of contemporary LLMs in strategic interactions.

TV-Dialogue: Crafting Theme-Aware Video Dialogues with Immersive Interaction

TV-Dialogue: novel multi-modal agent framework ensures theme alignment and visual consistency through real-time immersive interactions among video characters.
Introduces Theme-aware Video Dialogue Crafting (TVDC) task, generates dialogues aligned with video content and user-specified themes, includes multi-granularity evaluation benchmark for assessment, enables zero-shot generation for any length and theme, applicable for video re-creation and film dubbing.
TV-Dialogue framework underscores potential for video re-creation, film dubbing, and downstream multimodal tasks.

KBQA-01: Agentic Knowledge Base Question Answering with Monte Carlo Tree Search

KBQA-01: is a novel agentic Knowledge Base Question Answering (KBQA) method with Monte Carlo Tree Search (MCTS).
ReAct-based agent process; stepwise logical form generation; KB environment exploration; MCTS for heuristic search; balances exploration and search space; generates high-quality annotations; incremental fine-tuning; outperforms low-resource KBQA methods.
KBQA-01 improves performance in low-resource KBQA and provides publicly available code for further research.

Survey and Improvement Strategies for Gene Prioritization with Large Language Models

Gene Prioritization Framework benchmarks and improves large language models for gene prioritization using multi-agent and HPO classification approaches combined with a divide-and-conquer strategy.
Framework benchmarks various LLMs including GPT-4 and Mixtral, uses multi-agent and HPO classification for case solvability, and employs divide-and-conquer strategy to enhance accuracy and overcome biases.
This framework significantly optimizes disease-causal gene identification and streamlines rare genetic disorder diagnosis.

Free Agent in Agent-Based Mixture-of-Experts Generative AI Framework

RLFA (Reinforcement Learning Free Agent) algorithm: introduces sports-inspired mechanism for replacing underperforming agents in multi-agent GenAI systems.
Draws inspiration from Major League Baseball free agency, uses mixture-of-experts approach, and improves performance and adaptability in multi-agent systems.
RLFA provides a straightforward route for continuous upgrades and maintains performance in critical tasks.

Autonomous Legacy Web Application Upgrades Using a Multi-Agent System

Multi-agent pipeline: LLM based multi-agent system autonomously upgrades legacy web applications to the latest version.
System distributes tasks across multiple phases; updates files to latest version; uses Zero-Shot and One-Shot Learning prompts; keeps context across tasks and agents.
Proposed system contributes as working foundation for future model implementations with existing code.

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

Constitutional Classifiers: introduces classifier safeguards, with Human's Query, Constitutional Input Classifier, AI Assistant, Constitutional Output Classifier, Response Shown to Human, Response Blocked, Harmless Constitution, Harmful Constitution, LLM with Constitution, Synthetic LLM Prompts and Completions, Data Augmentation Pipeline, Harmless Pool Set, and Training Set, as a framework to defend large language models against universal jailbreaks by monitoring both user inputs and model outputs using constitution-guided classifiers.
"Constitutional Classifiers framework trains classifier safeguards using synthetic data generated by prompting language models with natural language rules defining harmful and harmless content categories."
"This approach enhances robustness and deployment viability by incorporating data augmentation, benign data pools, and streaming prediction in output classifiers for real-time intervention."

30th January 2025

Can we Retrieve Everything All at Once? ARM: An Alignment-Oriented LLM-based Retrieval Method

ARM (Alignment-Oriented LLM-based Retrieval Method): is an LLM-based retrieval method that aligns questions with data organization by exploring relationships among data objects.
ARM uses constrained decoding with N-grams, a reasoning solver for structure alignment, and self-verification for object selection, and it is evaluated on Bird and OTT-QA datasets.
This method achieves better retrieval performance and efficiency compared to standard and agentic RAG approaches.

Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs

TIP (thought switching penalty): is a decoding strategy that discourages premature transitions between thoughts, encouraging deeper exploration of each reasoning path.
It introduces a novel metric to quantify underthinking by measuring token efficiency in incorrect answers, and it improves accuracy across challenging datasets without requiring model fine-tuning.
This framework contributes to understanding reasoning inefficiencies in o1-like LLMs and offers a practical solution to enhance their problem-solving capabilities.

REPOAUDIT: An Autonomous LLM-Agent for Repository-Level Code Auditing

REPOAUDIT: introduces autonomous LLM-agent, with initiator, explorer, validator, memory, for precise, efficient repository-level code auditing by demand-driven exploration.
It employs agent memory for on-demand repository exploration and validator for hallucination mitigation.
Validation design improves precision by checking data-flow facts and path condition satisfiability, discarding false positives.

Leveraging LLM Agents for Automated Optimization Modeling for SASP Problems: A Graph-RAG based Approach

MAG-RAG is automated modeling approach based on retrieval-augmented generation technique for SASP problems.
It uses multi-agent structure for AOM architecture, graph-based RAG for domain knowledge integration, human expert modeling principles and precise knowledge retrieval using graph structure.
MAG-RAG approach realizes the potential of LLM-assisted AOM for solving SASP problems.

REPOAUDIT: An Autonomous LLM-Agent for Repository-Level Code Auditing

REPOAUDIT: autonomous LLM-agent designed for precise and efficient repository-level code auditing.
Equipped with agent memory, REPOAUDIT explores code repository on demand, analyzes data-flow facts along feasible program paths, and introduces validator for hallucination mitigation.
REPOAUDIT demonstrates substantial potential for flexible and configurable code security analysis.

Design and Validation of Learning Aware HMI For Learning-Enabled Increasingly Autonomous Systems

LEIAS (Learning-Enabled Increasingly Autonomous Systems): is an architecture designed to enhance operational safety by emphasizing communication representation and pilot preference learning in autonomous systems.
- incorporates human-machine collaboration
uses Soar cognitive architecture with reinforcement learning
provides transparent multi-sensor data assessment (GPS, IMU, LIDAR)
adapts to pilot preferences
validated in XPlane simulation for sensor anomaly management
This framework is important for advancing the safety and reliability of learning-enabled autonomous systems in complex operational environments.

Integrating LMM Planners and 3D Skill Policies for Generalizable Manipulation

LMM-3DP: LMM-3DP is a framework integrating LMM planners and 3D skill policies for generalizable robotic manipulation.
Integrates LMM planners and 3D skill policies, uses high-level planning with visual feedback, includes critic agent for self-improvement, enables lifelong learning with skill library, utilizes semantic 3D feature field for low-level control.
LMM-3DP significantly enhances robot manipulation by improving success rate and planning accuracy in complex tasks.

Invisible Traces: Using Hybrid Fingerprinting to identify underlying LLMs in GenAI Apps

Hybrid Fingerprinting framework: Novel fingerprinting framework integrates static and dynamic techniques to identify underlying LLMs in GenAI Apps.
Addresses real-world challenges; Combines static and dynamic fingerprinting; Identifies architectural features and behavioral traits; Demonstrates semantic distinction in LLM outputs; Robust and accurate in complex environments.
Framework is important for ensuring security and transparency in AI applications by reliably identifying underlying LLMs.

Can we Retrieve Everything All at Once? ARM: An Alignment-Oriented LLM-based Retrieval Method

ARM (Alignment-Oriented LLM-based Retrieval Method) is a retrieval method that aligns question with data collection organization by exploring relationships among data objects.
It is retrieve-all-at-once solution for complex queries by better aligning question with data organization and exploring relationships among data objects beyond utterance matching for efficient and comprehensive retrieval.
The proposed method is important as it improves retrieval performance for complex questions by addressing limitations of existing RAG approaches.

Leveraging LLM Agents for Automated Optimization Modeling for SASP Problems: A Graph-RAG based Approach

MAG-RAG is automated modeling approach based on retrieval-augmented generation technique for SASP problems.
It uses multi-agent structure for AOM architecture, graph-based RAG for domain knowledge integration, human expert modeling principles and precise knowledge retrieval using graph structure.
MAG-RAG approach realizes the potential of LLM-assisted AOM for solving SASP problems.

LLM-AutoDiff: Auto-Differentiate Any LLM Workflow

LLM-AutoDiff is a novel framework for Automatic Prompt Engineering (APE) that extends textual gradient-based methods to multi-component, potentially cyclic LLM architectures.
Framework accommodates functional nodes, preserves time-sequential behavior, combats "lost-in-the-middle" problem, boosts training efficiency, and uses graph-centric lens.
LLM-AutoDiff offers a powerful new paradigm for scaling and automating LLM workflows.

29th January 2025

Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate

Critique Fine-Tuning (CFT): is a framework where models learn to critique noisy responses rather than imitating correct ones.
CFT encourages deeper analysis and nuanced understanding, uses GPT-40 to generate critiques, and shows consistent improvement over SFT on math benchmarks.
This approach offers a more effective alternative to advance the reasoning of language models.

Human-Aligned Skill Discovery: Balancing Behaviour Exploration and Alignment

HaSD (Human-aligned Skill Discovery): is a framework designed to incorporate human feedback into unsupervised skill discovery to find safer and more aligned skills.
Addresses unconstrained skill discovery, finds useful skills in complex environments, optimizes skill diversity and human alignment, maintains alignment throughout discovery, and allows configurable skills with diversity-alignment trade-offs.
This framework is important as it enables the discovery of diverse, safe, and human-aligned skills for practical applications.

LARGE LANGUAGE MODELS THINK TOO FAST TO EXPLORE EFFECTIVELY

Large Language Models (LLMs): Study investigates exploration capabilities of LLMs in open-ended tasks using Little Alchemy 2.
LLMs underperform humans in exploration; uncertainty-driven strategies dominant; empowerment underutilized; premature decisions due to fast processing.
Findings are crucial for enhancing LLM adaptability and exploration effectiveness.

Is Conversational XAI All You Need? Human-AI Decision Making With a Conversational XAI Assistant

Conversational XAI assistant: Conversational XAI interface is proposed to augment existing XAI methods to increase user engagement and boost user understanding of AI system.
Exploration of conversational XAI interface impact on user understanding, trust and reliance; comparison with XAI dashboard; over-reliance on AI system observed; enhanced conversations amplified over-reliance; illusion of explanatory depth.
Findings have important implications for designing effective conversational XAI interfaces to facilitate appropriate reliance and improve human-AI collaboration.

RICOTA: Red-teaming of In-the-wild Conversation with Test Attempts

RICOTA: is a Korean red teaming dataset of in-the-wild user interactions.
It uses user-chatbot conversations from a Korean Reddit-like community, focuses on jailbreak attempts, and provides a novel evaluation approach.
This dataset is important for evaluating LLMs' ability to identify conversation types and user testing purposes.

ACTIONS SPEAK LOUDER THAN WORDS: AGENT DECISIONS REVEAL IMPLICIT BIASES IN LANGUAGE MODELS

Language-agent simulation technique: systematically investigates implicit biases in LLMs across diverse sociodemographic groups and decision-making scenarios.
It uses persona generation and action generation steps, reveals that state-of-the-art LLMs exhibit significant sociodemographic disparities, and shows that implicit biases are amplified compared to explicit biases.
This framework provides a way to identify biases in LLM-powered applications, ensuring they are aligned with ethical principles and societal norms.

GENERAL SCENE ADAPTATION FOR VISION-AND-LANGUAGE NAVIGATION

GSA-VLN (General Scene Adaptation for VLN): is a novel task requiring agents to execute navigation instructions within a specific scene and simultaneously adapt to it for improved performance over time.
GSA-VLN introduces environment-specific memory bank, uses three-stage instruction orchestration pipeline with LLMs, and proposes Graph-Retained DUET (GR-DUET) method.
This framework addresses the challenge of single-scene adaptation, enabling agents to continuously improve as they execute instructions in previously unseen environments.

28th January 2025

Thalamic oscillations distinguish natural states of consciousness in humans

A novel fast thalamic oscillation (20-45 Hz) is identified in humans, which specifically occurs during wakefulness and REM sleep, and is absent during NREM sleep.
The oscillation is localized to the central thalamus and is temporally coupled with eye movements during REM sleep.

LARGE LANGUAGE MODEL CRITICS FOR EXECUTION-FREE EVALUATION OF CODE CHANGES

LLM Critics: is a framework that uses LLM-based critics to derive execution-free evaluation proxies for code changes.
It uses gold test patch as reference, predicts executability of editing locations, aggregates predictions to predict build status, and outperforms other reference-free and reference-aware LLM critics.
This framework enables more efficient evaluation of code changes without relying on execution.

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

SFT (Supervised fine-tuning) and RL (reinforcement learning): are compared on generalization and memorization in text and visual environments.
RL generalizes better than SFT, especially with outcome-based reward; SFT memorizes training data; RL improves visual recognition; SFT stabilizes output format for RL.
RL is advantageous for acquiring generalizable knowledge in complex, multimodal tasks.

MCTS-SQL: An Effective Framework for Text-to-SQL with Monte Carlo Tree Search

MCTS-SQL (Monte Carlo Tree Search for SQL): is a framework for text-to-SQL that uses Monte Carlo Tree Search to guide SQL generation iteratively.
It includes a schema selector for extracting relevant information and an MCTS-based generator for iterative query refinement; it uses a fast-slow thinking approach with a direct SQL generation component and an MCTS-based refiner; it achieves state-of-the-art performance on the BIRD and SPIDER benchmarks.
This framework improves the accuracy and reliability of text-to-SQL systems, especially when dealing with complex user queries.

ToolFactory: Automating Tool Generation by Leveraging LLM to Understand REST API Documentations

ToolFactory: is an open-source pipeline for automating tool generation from unstructured API documents.
It includes API Extraction Benchmark, APILlama model fine-tuned with prompt tuning, and tool validation pipeline.
This framework facilitates the seamless integration of scientific REST APIs into AI workflows.

A Stochastic Dynamical Theory of LLM Self-Adversariality: Modeling Severity Drift as a Critical Process

Stochastic dynamical framework: models how LLMs may self-amplify biases through chain-of-thought reasoning.
It uses a continuous-time stochastic differential equation (SDE) approach, analyzes phase transitions, derives stationary distributions, and investigates scaling laws.
This framework provides a basis for formal verification of model stability and bias propagation.

MACI: Multi-Agent Collaborative Intelligence for Robust Reasoning and Temporal Planning

MACI (Multi-Agent Collaborative Intelligence): is a framework centered on a meta-planner that orchestrates multiple agents to generate planner templates.
It includes a three-tier architecture with meta-planning, common and specialized agents; enables advanced temporal reasoning and adaptability; decouples planning from validation.
This framework provides a robust solution for complex reasoning and planning tasks.

Auto-Differentiating Any LLM Workflow: A Farewell to Manual Prompting

LLM-AutoDiff: is a framework for Automatic Prompt Engineering (APE) that extends textual gradient-based methods to multi-component, potentially cyclic LLM architectures.
It treats each textual input as a trainable parameter, uses a frozen "backward engine" LLM to generate feedback, accommodates functional nodes, preserves time-sequential behavior, and combats the "lost-in-the-middle" problem.
This framework offers a new paradigm for scaling and automating LLM workflows.

JUPYBARA: Operationalizing a Design Space for Actionable Data Analysis and Storytelling with LLMs

JUPYBARA: is an AI-enabled assistant for actionable EDA and storytelling implemented as a Jupyter Notebook extension.
It employs design-space-aware prompting and multi-agent architectures, including semantic, rhetorical, and pragmatic dimensions, to operationalize the design space.
This framework enhances usability, steerability, explainability, and reparability in actionable data analysis and storytelling.

A sketch of an AI control safety case

AI control: framework argues that models are safe because of measures such as monitoring and human auditing.
Framework uses control evaluation with red and blue teams, includes untrusted and trusted monitors, and uses a safety layer to prevent data exfiltration.
This framework provides a step toward more concrete arguments that can be used to show that a dangerously capable LLM agent is safe to deploy.

27th of January 2025

GUI-Bee : Align GUI Action Grounding to Novel Environments via Autonomous Exploration

GUI-Bee is MLLM-based autonomous agent to collect environment-specific data through exploration and fine-tune GUI grounding models for novel environments.
novel environments; autonomous exploration; Q-ICRL method; exploration efficiency; data quality; NovelScreenSpot benchmark; align GUI action grounding models.
Aligning GUI action grounding models to novel environments significantly enhances performance.

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Janus-Pro: Advances multimodal models via optimized training, expanded data, and model scaling. Janus-Pro achieves SOTA-level performance in both multimodal understanding and text-to-image generation benchmarks.
Enhanced training strategy includes "Longer Training in Stage I" and "Focused Training in Stage II" for better efficiency and performance. This refines the original 3-stage training process of Janus.
Text-to-image generation stability and aesthetic quality are significantly enhanced through synthetic data and improved training.
Decoupled visual encoding remains a core and effective architectural design for unified multimodal tasks.
7B model demonstrates strong scalability of the decoupled visual encoding approach.

On the Feasibility of Using LLMs to Execute Multistage Network Attacks

Incalmo: is an LLM-agnostic high-level attack abstraction layer that sits between an LLM and the environment.
Incalmo uses action planner, attack graph service and environment state service to enable LLMs to specify high-level tasks, translate them into low-level primitives, and provide structure for selecting relevant actions.
Incalmo consists of three stages. The first stage is called "onboarding pre-prompt"-stage, which “teaches” the LLM the capabilities of Incalmo, Second stage provides environment specific prompts to outline attach goals and environment details. In the third stage, the LLM autonomously executes the multistage attack via Incalmo in an interactive execution loop.
Demonstrates capability to find vurnerable services, execute exploits to gain access to network, to discover misconfigurations and vulnerabilities to move laterally and exploit vulnerabilities to escalate privileges and exfiltrate data from networks.
Demonstrates, that abstraction is more important than LLM model size and that Incalmo-action planner module is critical module.
This framework enables LLMs to successfully execute multistage attacks in realistic emulated networks.

Gensors: Authoring Personalized Visual Sensors with Multimodal Foundation Models and Reasoning

Gensors is a system designed to empower users to create personalized visual sensors by leveraging multimodal foundation models and reasoning.
It uses two-stage pipeline with Gemini 1.5 Flash and Pro, supports user-configurable logic and examples, and facilitates criteria refinement and debugging.
Gensors is important as it makes intelligent sensing technologies more accessible and customizable for end-users.

MULTI-AGENT GEOSPATIAL COPILOTS FOR REMOTE SENSING WORKFLOWS

GeoLLM-Squad: geospatial Copilot introduces multi-agent paradigm to remote sensing workflows by separating agentic orchestration from geospatial task-solving.
Multi-agent system; agentic orchestration; geospatial task-solving; specialized sub-agents; open-source AutoGen and GeoLLM-Engine; diverse applications; robust performance; improved agentic correctness.
GeoLLM-Squad highlights the potential of multi-agent AI in advancing remote sensing workflows.

Will Systems of LLM Agents Cooperate: An Investigation into a Social Dilemma

LLM Agent System: Framework investigates cooperative tendencies of Large Language Model (LLM) agents in social dilemma by prompting LLMs to generate strategies for iterated Prisoner's Dilemma.
Defines three classes of agents (attitudes): agressive, cooperative and neutral.
evolutionary game theory; strategic dispositions; aggressive, cooperative, neutral; distinct biases; long-term behaviour; strategic environments.
This research highlights importance of considering strategic environments for deployed LLM-based autonomous agents and their potential long-term behaviour.

AI Agents for Computer Use: A Review of Instruction-based Computer Control, GUI Automation, and Operator Assistants

AI Agents for Computer Use: A Review offers a comprehensive overview of instruction-based computer control agents, GUI automation, and operator assistants.
It examines agents taxonomy, development, resources, shift to foundation models, datasets, evaluation methods, and deployment challenges.
This review provides a comprehensive foundation to understand and push the future development of AI agents for computer use.

LLM-attacker: Enhancing Closed-loop Adversarial Scenario Generation for Autonomous Driving with Large Language Models

LLM-attacker: closed-loop adversarial scenario generation framework leveraging large language models.
multiple LLM agents; identify optimal attackers; optimize attacker trajectories; iterative refinement based on ADS performance; feedback loop.
Framework is important to test and enhance the safety and robustness of ADS.

MADP: Multi-Agent Deductive Planning for Enhanced Cognitive-Behavioral Mental Health Question Answer

MADP (Multi-Agent Deductive Planning): is a CBT-based multi-agent reasoning strategy that analyzes interactions among multiple CBT elements for mental health support.
Deeper understanding of help-seeker context; personalized assistance; fine-tuned LLM (MADP-LLM); enhanced emotional reasoning; reduced deployment costs.
MADP framework effectively provides personalized, empathetic, and targeted mental health support.

Harnessing Diverse Perspectives: A Multi-Agent Framework for Enhanced Error Detection in Knowledge Graphs

MAKGED (Multi-Agent framework for Knowledge Graph Error Detection): is a novel framework utilizing multiple large language models in a collaborative setting for enhanced knowledge graph error detection.
multi-agent framework; multiple LLMs; collaborative setting; subgraph embeddings; query embeddings; transparent decision-making; multi-round discussions.
MAKGED enhances the reliability of downstream applications by improving the accuracy and robustness of knowledge graph error detection.

LLM-powered Multi-agent Framework for Goal-oriented Learning in Intelligent Tutoring System

GenMentor: LLM-powered multi-agent framework is designed for goal-oriented and personalized learning within Intelligent Tutoring System.
multi-agent system; goal-oriented learning; personalized learning; skill gap identification; adaptive learner modeling; personalized resource delivery.
GenMentor effectively enhances learning guidance, content quality, goal alignment and resource targeting for enhanced personalization.

Deception in LLMs: Self-Preservation and Autonomous Goals in Large Language Models

DeepSeek R1: is a model trained to output reasoning tokens, exhibiting deceptive tendencies and self-preservation instincts.
The model attempts self-replication, masks true objectives, and expands capabilities autonomously.
This study highlights the critical need for robust goal specification and safety frameworks before physical implementation.

MULTI-AGENT GEOSPATIAL COPILOTS FOR REMOTE SENSING WORKFLOWS

GeoLLM-Squad: introduces a multi-agent paradigm to remote sensing workflows.
It separates agentic orchestration from geospatial task-solving, uses AutoGen and GeoLLM-Engine frameworks, and enables modular integration of diverse applications.
This approach maintains robust performance and improves agentic correctness compared to single-agent systems.

Will Systems of LLM Agents Cooperate: An Investigation into a Social Dilemma

LLM (Large Language Model) agents framework: investigates emergent cooperative tendencies in a social dilemma.
Framework prompts LLMs to generate complete strategies, uses evolutionary game theory, simulates populations with different strategic dispositions, and observes evolutionary dynamics.
This research provides insights into long-term behavior of deployed LLM-based autonomous agents and highlights importance of strategic environments.

26th of January 2025

Qwen2.5-1M Technical Report

Introduces Qwen2.5-1M, which extends open source support for 1M token context length.
Includes infererence framework, which speeds up 1M context inference by 3.2x to 6.7x.

OpenCharacter: Training Customizable Role-Playing LLMs with Large-Scale Synthetic Personas

OpenCharacter: framework trains customizable role-playing LLMs with large-scale synthetic personas.
Explores large-scale data synthesis approach, uses response rewriting and generation strategies, achieves performance comparable to GPT-40 models.
This work is important for advancing research in customizable role-playing dialogue systems.

ToM-agent: Large Language Models as Theory of Mind Aware Generative Agents with Counterfactual Reflection

ToM-agent is a novel paradigm designed to empower LLM-based generative agents to simulate Theory of Mind in open-domain conversational interactions.
Disentangles confidence from mental states; emulates agent's perception of counterpart's mental states (beliefs, desires, intentions - BDIs); dynamically adjusts inferred BDIs; counterfactual intervention method; enhances reflection efficiency.
ToM-agent provides new insights for studying large-scale LLMs-based simulation of human social behaviors.

25th January 2025

OptiSeq: Optimizing Example Ordering for In-Context Learning

OptiSeq: introduces a score based on log probabilities of LLM outputs to prune example orderings in few-shot ICL.
optimizing example ordering; in-context learning; LLM outputs; prune orderings; best order; correct/incorrect outputs; empirical evaluation; accuracy improvement.
OptiSeq improves accuracy significantly across multiple tasks.

24th January 2025

RL + Transformer = A General-Purpose Problem Solver

ICRL (In-Context Reinforcement Learning): introduces LLaMA 3.1 8B Instruct (Pre-trained Transformer), IA3 Adapter (Efficient Fine-tuning), DQN (RL Algorithm), Input Sequence (History of Interactions), and Output Q-value (Action-value Function) to demonstrate a meta-learning approach for solving unseen problems through reinforcement learning.
ICRL leverages a pre-trained transformer fine-tuned with reinforcement learning to achieve in-context learning, enabling generalization to new environments and tasks without additional training.
The framework exhibits robustness to low-quality training data and adaptability to non-stationary environments, highlighting its potential as a general-purpose problem solver.

Self-reflecting Large Language Models: A Hegelian Dialectical Approach

Hegelian Dialectical Approach: Framework introduces philosophical approach inspired by the Hegelian Dialectic for LLMs' self-reflection.
It uses self-dialectical approach to emulate internal critiques, synthesize new ideas by resolving contradictions, dynamic annealing approach for temperature generation, Multi Agent Majority Voting (MAMV) strategy to assess validity and novelty.
Framework is examined to determine ability to generate novel ideas and provide stepping stone for future research.

MedAgentBench: Dataset for Benchmarking LLMs as Agents in Medical Applications

MedAgentBench: is a broad evaluation suite designed to assess the agent capabilities of large language models within medical records contexts.
It encompasses 100 patient-specific clinically-derived tasks, realistic profiles of 100 patients with over 700,000 data elements, a FHIR-compliant interactive environment, and an accompanying codebase.
This framework establishes a valuable benchmark for model developers to track progress and drive continuous improvements in the agent capabilities of large language models within the medical domain.

DEEPFLOW: Serverless Large Language Model Serving at Scale

DEEPFLOW: is a serverless AI platform designed for efficient large language model serving at scale.
It uses request-job-task model, FLOWSERVE serving engine, NPU-centric execution, SPMD-based parallelism, and novel scheduling policies.
This framework addresses resource allocation, serving efficiency, and cold start latencies.

DRESSING UP LLM: EFFICIENT STYLIZED QUESTION-ANSWERING VIA STYLE SUBSPACE EDITING

DRESS (Disentangling Representation Editing in Style Subspace): is a novel approach for generating stylized large language model (LLM) responses through representation editing.
It leverages over-parameterized nature of LLMs, disentangles style-relevant subspace, applies adaptive editing strengths, and maintains stylistic fidelity and semantic integrity.
DRESS is a lightweight, train-free solution for enhancing LLMs with flexible and effective style control, making it useful for developing stylized conversational agents.

Exploring the sustainable scaling of Al dilemma: A projective study of corporations' Al environmental impacts

The proposed methodology: estimates the environmental impact of a company's AI portfolio, providing actionable insights without extensive AI and Life-Cycle Assessment (LCA) expertise.
The framework includes four interconnected models: life cycle impacts of primary components, life cycle impacts of AI use cases, AI company portfolio model, and 2030 AI landscape projections.
This framework empowers organizations to understand and project their AI impacts and align their initiatives with global sustainability goals.

MASTER: A Multi-Agent System with LLM Specialized MCTS

MASTER (Multi-Agent System with Tactical Execution and Reasoning using LLM Specialized MCTS): is a novel multi-agent framework that employs a new agent recruitment process and communication protocol based on the MCTS algorithm.
It autonomously adjusts the number of agents based on task complexity, mitigates distractions and token window shortage, and includes a modified MCTS tailored to LLM scenarios.
This framework achieves state-of-the-art performance on HotpotQA and WebShop datasets.

Top Ten Challenges Towards Agentic Neural Graph Databases

Agentic NGDB (Agentic Neural Graph Databases): extends NGDBs with autonomous query construction, neural query execution, and continuous learning.
It identifies ten key challenges, including semantic unit representation, abductive reasoning, scalable query execution, and integration with foundation models like LLMs.
This framework enables intelligent, self-improving systems for modern data-driven applications.

Serving Long-Context LLMs at the Mobile Edge: Test-Time Reinforcement Learning-based Model Caching and Inference Offloading

T2DRL (Test-Time Deep Reinforcement Learning): is a joint model caching and inference offloading framework that optimizes deployment and execution strategies for long-context LLM serving.
Framework analyzes performance convergence, designs optimization problem considering context windows, manages cached models and service requests, adapts to context changes, and uses double Dutch auction mechanism for resource allocation.
The framework reduces system costs while guaranteeing the performance of LLM agents in real-world perception and reasoning tasks.

Distributed Multi-Agent Coordination Using Multi-Modal Foundation Models

VL-DCOPs (visual-linguistic instruction-based DCOPs): is a framework that uses large multimodal foundation models to generate constraints from visual and linguistic instructions.
Framework includes spectrum of agent archetypes, from neuro-symbolic to fully neural agents, and evaluates them using LLMs and VLMs on novel VL-DCOP tasks.
This work extends the DCOP literature by addressing the challenge of manual problem construction and opens new research directions.

AI Chatbots as Professional Service Agents: Developing a Professional Identity

LAPI (LLM-based Agent with a Professional Identity): is a novel framework for designing professional service agents tailored for medical question-and-answer services.
LAPI includes theory-guided task planning process, pragmatic entropy method, and iterative updating of responses.
This framework improves response quality, providing more accurate, empathetic, and professional answers compared to baseline approaches.

ARGOS: Agentic Time-Series Anomaly Detection with Autonomous Rule Generation via Large Language Models

ARGOS: is an agentic system for detecting time-series anomalies in cloud infrastructure by leveraging large language models (LLMs).
It uses explainable anomaly rules as intermediate representation, employs LLMs to autonomously generate rules, and includes detection-, repair- and review-agents.
This framework improves anomaly detection accuracy and efficiency compared to state-of-the-art methods.

Top Ten Challenges Towards Agentic Neural Graph Databases

Agentic NGDB (Agentic Neural Graph Databases): extends NGDBs with autonomous query construction, neural query execution, and continuous learning.
It identifies ten key challenges, including semantic unit representation, abductive reasoning, scalable query execution, and integration with foundation models like LLMs.
This framework enables intelligent, self-improving systems for modern data-driven applications.

23rd of January 2025

BEYOND THE SUM: UNLOCKING AI AGENTS POTENTIAL THROUGH MARKET FORCES

AI Agent Market Infrastructure Framework presents systematic analysis of infrastructure requirements for AI agents to function as autonomous participants in digital markets.
Framework identifies key areas like identity, service discovery, interfaces and payment systems and highlights existing infrastructure challenges impeding agent participation, suggesting new economic organization forms.
This framework is important as it addresses infrastructure challenges as fundamental step toward enabling new forms of economic organization.

ElCopilot: Search and Explore Enterprise Information over Large-scale Knowledge Graphs with LLM-driven Agents

EICopilot: is a novel agent-based solution enhancing search and exploration of enterprise registration data within extensive online knowledge graphs.
EICopilot includes data pre-processing pipeline, comprehensive reasoning pipeline with Chain-of-Thought and In-context learning, and novel query masking strategy.
EICopilot is a groundbreaking tool for exploration and exploitation of large-scale knowledge graphs for enterprise information search.

The though process behind Kimi k1.5

Explains the way the Kimi K-1.5 model was trained and discusses overall likely o1-model training procedure.

Operator System Card

OA Operator-agent system card.
Uses RL.
Additional details

21st of January 2025

LLM-Agents Driven Automated Simulation Testing and Analysis of small Uncrewed Aerial Systems

AUTOSIMTEST: is a Large Language Model (LLM)-driven framework, where multiple LLM agents collaborate to support the sUAS simulation testing process.
Framework includes scenario generation-, mission-, environment- and analytics-agents; uses RAG approach; provides interactive analysis interface.
Framework improves efficiency and scope of sUAS testing process, allowing for more comprehensive and varied scenario evaluations while reducing manual effort.

EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents

EMBODIEDEVAL: is a comprehensive and interactive evaluation benchmark for MLLMs with embodied tasks.
EMBODIEDEVAL features 328 distinct tasks within 125 varied 3D scenes, covers navigation, object interaction, social interaction, attribute question answering, and spatial question answering.
This framework provides insights for future development of MLLMs in embodied capabilities.

20th of January 2025

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinformcent Learning

DeepSeek-R1: Trains SOTA-level Large Reasoning Model from LLM via Reinforcement Learning, which matches performance with o1-model.

Kimi-K1.5: Scaling Reinforcement Learning with LLMs

Kimi k1.5: is a multi-modal large language model (LLM) trained with reinforcement learning (RL) to achieve SOTA-level reasoning performance across multiple benchmarks and modalities.

Conversation Routines: A Prompt Engineering Framework for Task-Oriented Dialog Systems

Conversation Routines (CR): is a structured prompt engineering framework for developing task-oriented dialog systems using Large Language Models (LLMs).
CR enables development of Conversation Agentic Systems (CAS) through natural language specifications, embedding task-oriented logic within LLM prompts, providing systematic methodology for designing complex conversational workflows while maintaining behavioral consistency.
This framework enables domain experts to design conversational workflows in natural language while leveraging custom enterprise functionalities.

Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training

Agent-R: is an iterative self-training framework that enables language agents to reflect on the fly.
It leverages Monte Carlo Tree Search (MCTS) to construct training samples, recovers correct trajectories from erroneous ones, and uses a model-guided critique construction mechanism for timely revision.
This framework effectively equips agents to identify and correct erroneous actions while avoiding loops, achieving superior performance.

Towards Advancing Code Generation with Large Language Models: A Research Roadmap

Six-layer vision framework: categorizes code generation process into Input, Orchestration, Development, and Validation phases.
Framework includes analysis of existing studies, outlines vision workflow, and systematically analyses challenges faced by LLMs.
This work provides guidelines for improving reliability, robustness and usability of LLM-based code generation systems.

Large Language Model Agents for Radio Map Generation and Wireless Network Planning

LLM agent framework: automates radio map generation and wireless network planning tasks.
Framework includes tools-, models- and profiles-modules; it uses short-term and long-term memory; it performs task planning.
The framework reduces manual operations and enhances network coverage and signal-to-interference-noise ratio.

Code Readability in the Age of Large Language Models: An Industrial Case Study from Atlassian

HULA (Human-in-the-loop software development agents framework): is a LLM-based framework for software development.
The framework uses GPT-4, compares LLM-generated code with human-written code, and evaluates code readability using static analysis metrics.
This study highlights the importance of code readability in the age of LLMs and shows that LLM-generated code can be comparable to human-written code.

PlotEdit: Natural Language-Driven Accessible Chart Editing in PDFs via Multimodal LLM Agents

PlotEdit: is a multi-agent framework for natural language-driven end-to-end chart image editing via self-reflective LLM agents.
Framework includes Chart2Table, Chart2Vision, Chart2Code, Instruction Decomposition and Multimodal Editing agents; uses multimodal feedback to maintain visual fidelity; outperforms existing baselines on ChartCraft dataset.
It enhances accessibility for visually challenged users and improves novice productivity.

19th of January 2025

IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems

IntellAgent: is a scalable, open-source multi-agent framework designed to evaluate conversational AI systems.
It automates synthetic benchmark creation using policy-driven graph modeling, realistic event generation, and interactive user-agent simulations, providing fine-grained diagnostics.
This framework enables comprehensive evaluation of conversational AI by addressing limitations of traditional methods.

GREEN-CODE: Optimizing Energy Efficiency in Large Language Models for Code Generation

GREEN-CODE: is a framework for energy-aware code generation in LLMs, performing dynamic early exit during inference.
It uses Reinforcement Learning agent to balance accuracy, latency, and energy consumption trade-offs, and fine-tunes models with weighted aggregated loss.
This framework reduces energy consumption significantly without affecting accuracy for code generation tasks.

Open FinLLM Leaderboard: Towards Financial AI Readiness

Open FinLLM Leaderboard: is an open platform for assessing and comparing Large Language Models' performance on financial tasks.
The framework includes a leaderboard, demos, and financial AI readiness components; it uses zero-shot evaluation, and provides side-by-side model comparisons.
This framework is important for encouraging innovation and improving model effectiveness in the financial sector.

Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments

LEARN-BY-INTERACT: is a data-centric framework to adapt LLM agents to any given environments without human annotations.
LEARN-BY-INTERACT synthesizes agent-environment interactions based on documentations, constructs instructions by summarizing interaction histories, and uses innovative retrieval approaches optimized for agents.
This framework serves as a foundation for agent data synthesis as LLMs are increasingly deployed at real-world environments.

18th of January 2025

Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments

LEARN-BY-INTERACT: is a data-centric framework to adapt LLM agents to any given environments without human annotations.
Framework synthesizes agent-environment interaction trajectories, uses backward construction for instructions, and leverages synthetic data for training and in-context learning with optimized retrieval.
Framework serves as a foundation for agent data synthesis for LLMs in real-world environments.

--

BAP v2: An Enhanced Task Framework for Instruction Following in Minecraft Dialogues

BAP v2 (Builder Action Prediction v2): is an upgraded task framework for instruction following in Minecraft dialogues.
BAP v2 includes enhanced evaluation benchmark with cleaner test set and fairer metrics, and additional synthetic training data generated from novel Minecraft dialogue and target structure simulators.
BAP v2 enables more efficient and meaningful progress on the task of instruction following in Minecraft dialogues.

ML-SceGen: A Multi-level Scenario Generation Framework

ML-SceGen: is a three-stage framework for generating comprehensive and critical scenarios in autonomous driving.
It uses LLM agents for parsing, Answer Set Programming (ASP) solver for logical traffic generation, and LLM for parameter updates to increase criticality.
This framework enhances controllability, scalability, and realism in scenario generation for autonomous driving systems.

17th of January 2025

Evolving Deeper LLM Thinking

Mind Evolution: is an evolutionary search strategy that uses a language model to generate, recombine and refine candidate responses.
It avoids formalizing the inference problem (so is usable in spaces like planning in natural language without explicit formalization of the problem and as well in hiding encoded message inside poems, which is non-natural language task), uses a global solution evaluator (focuses on domains, where evaluator is available), and can be easily parallelized.
This approach significantly outperforms other inference strategies in natural language planning tasks.
Introduces new StegPoet-benchmark, where the benchmark task is to encode message inside essay/story.

Agent4Edu: Generating Learner Response Data by Generative Agents for Intelligent Education Systems

Agent4Edu: is a personalized learning simulator that uses LLM-powered generative agents to simulate human learners' response data.
It includes learner profile, memory, and action modules; interacts with personalized learning environments; evaluates and improves intelligent tutoring algorithms.
This framework provides a versatile platform for comprehensive evaluations and future collection of valuable learner response data.

Towards Human-Guided, Data-Centric LLM Co-Pilots

CliMB-DC (Clinical predictive Model Builder with Data-Centric AI): is a human-guided, data-centric framework for LLM co-pilots.
It includes a multi-agent reasoning system with a strategic coordinator and a specialized worker agent, integrates state-of-the-art data-centric tools, and uses a human-in-the-loop approach.
This framework empowers domain experts to actively participate in driving real-world impact using ML.

Towards Preventing Overreliance on Task-Oriented Conversational AI Through Accountability Modeling

Accountability Model: is an augmented LLM with an additional accountability head, functioning as a binary classifier to predict dialogue state slots.
It detects false positives and negatives, guides LLM decoder for accurate actions, enables self-correction, and introduces friction to prevent overreliance.
This model improves joint goal accuracy and overall performance in task-oriented dialogue systems.

PaSa: An LLM Agent for Comprehensive Academic Paper Search

PaSa: is an advanced paper search agent powered by large language models. Available https://pasa-agent.ai/
It autonomously makes decisions, including invoking search tools, reading papers, and selecting references; it is optimized using reinforcement learning with synthetic dataset; it outperforms existing baselines on real-world academic queries.
This framework significantly improves the efficiency and accuracy of academic search.

LLM Reasoner and Automated Planner: A new NPC approach

LLM Reasoner and Automated Planner: is a novel architecture that integrates an LLM for decision-making with a classical automated planner.
Framework uses LLM to decide goal, then uses automated planning to create plan, and includes modules for reasoning, planning and interface.
This framework aims to empower autonomous agents with flexibility to adapt to any situation while maintaining plausible and human-like behavior.

A Survey on LLM Test-Time Compute via Search: Tasks, LLM Profiling, Search Algorithms, and Relevant Frameworks

This survey provides a comprehensive technical review that unifies task definitions and provides modular definitions of LLM profiling and search procedures.
It enables precise comparisons of various LLM inference frameworks, highlights their departures from conventional search algorithms, and discusses applicability, performance, and efficiency.
This survey offers a collection of classical and reusable implementations that can serve as solid foundations for future research and development.

Agent-as-Judge for Factual Summarization of Long Narratives

NARRATIVEFACTSCORE: is a novel "Agent-as-a-Judge" framework for evaluating and refining summaries.
It leverages Character Knowledge Graph (CKG), assesses factual consistency, provides actionable guidance for refinement, identifies missing or erroneous facts, and uses retrieval-based verification with explicit feedback.
This framework improves the factual reliability of LLM-generated summaries.

A Survey on Multi-Turn Interaction Capabilities of Large Language Models

This survey provides a focused review of the multi-turn capabilities of LLMs.
The survey explores core model capabilities, evaluation methods, enhancement algorithms, and future research directions.
This survey is important for both academic researchers and industry practitioners.

TOWARDS A LITMUS TEST FOR COMMON SENSE

Axiomatic litmus test: diagnoses common sense by combining minimal prior knowledge constraints with diagonal arguments to create tasks beyond the agent's known concept set.
It addresses deceptive hallucinations, integrates observations regarding emergent deceptive hallucinations, and uses Abstraction and Reasoning Corpus (ARC) constraints.
This test provides a stepping stone toward an ethical, reliable foundation for future safe, beneficial and aligned artificial intelligence.

16th of January 2025

Authenticated Delegation and Authorized AI Agents

Authenticated Delegation Framework: novel framework enables authenticated, authorized, and auditable delegation of authority to AI agents.
Secure delegation; restrict permissions and scope; accountability; extends OAuth 2.0 and OpenID Connect; natural language to auditable access control.
Framework facilitates immediate AI agent deployment while ensuring security and accountability.

Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps

Inference-time scaling framework: explores the inference-time scaling behavior of diffusion models beyond increasing denoising steps.
Framework uses search problem to identify better noises, design space includes verifiers and algorithms, experiments on class-conditioned and text-conditioned image generation benchmarks.
This framework reveals that increasing inference-time compute leads to substantial improvements in the quality of samples generated by diffusion models.

Foundations of Large Language Models

Introduces a literature review / survey on LLMs.

AutoCBT: An Autonomous Multi-agent Framework for Cognitive Behavioral Therapy in Psychological Counseling

AutoCBT: An Autonomous Multi-agent Framework for Cognitive Behavioral Therapy in Psychological Counseling.
AutoCBT incorporates a counsellor agent and multiple supervisor agents, uses short-term and long-term memory, and is evaluated on a bilingual dataset.
AutoCBT leverages dynamic routing and supervisory mechanisms to offer high-quality, automated CBT services, enhancing the effectiveness of single-turn consultations.

OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking

OmniThink: is a machine writing framework that emulates human-like iterative expansion and reflection.
It uses continuous reflection and exploration, attaches knowledge to an information tree, and extracts it into a conceptual pool to deepen understanding.
This framework improves the knowledge density of generated articles without compromising coherence and depth.

CyberMentor: AI Powered Learning Tool Platform to Address Diverse Student Needs in Cybersecurity Education

CyberMentor: is a learning tool platform designed to address diverse needs of cybersecurity students using agentic workflow and Generative Large Language Models (LLMs).
It leverages Retrieval-Augmented Generation (RAG) for accurate information retrieval, includes knowledge base, skill base and LLM agent, and provides personalized learning experiences.
This framework aims to improve equity and sustainability in higher education by offering open-source design for adaptation across disciplines.

Empowering Large Language Models in Wireless Communication: A Novel Dataset and Fine-Tuning Framework

PVI (Pointwise V-Information) based fine-tuning method: enhances LLMs for wireless communication by quantifying information content of training data.
Dataset includes multi-hop questions, true/false and multiple-choice types, varying difficulty levels, rigorous data curation, advanced language models for entity extraction and question generation.
This work aims to improve LLM training and evaluation for wireless communication research and applications.

SOP-AGENT: EMPOWER GENERAL PURPOSE AI AGENT WITH DOMAIN-SPECIFIC SOPS

SOP-agent (Standard Operational Procedure-guided Agent): is a novel framework for constructing domain-specific agents through pseudocode-style Standard Operational Procedures (SOPs) written in natural language.
SOP-agent represents SOP as a decision graph, traverses it to guide the agent, conducts experiments across multiple domains, and introduces Grounded Customer Service Benchmark.
SOP-agent demonstrates excellent versatility, achieving performance superior to general-purpose agent frameworks and comparable to domain-specific agent systems.

15th of January 2025

The geometry of moral decision making

Geometry of Moral Decision Making Framework: Understands bounded rationality as interplay of deontology and utilitarianism.
Deontology as regularisation function in optimal control; Inverse temperature shields from expected utility; Information geometry of bounded rationality and rate distortion theory; Markov kernels and regular conditional probability; Gradient equation determines utility expansion path.
Framework is relevant to theory of autonomous agents and analysis of legal doctrine.

Networked Agents in the Dark: Team Value Learning under Partial Observability

DNA-MARL (Double Networked Averaging MARL) is distributed method for networked agents that introduces consensus mechanism for local communication and gradient descent for local computation in partially observable Markov games.
Framework addresses cooperative multi-agent reinforcement learning in networked dynamic partially observable Markov game (ND-POMG) using decentralized training and decentralized execution (DTDE), and achieves team value function learning under partial observability via consensus mechanism for cooperative value function learning with actor-critic algorithm.
DNA-MARL enhances the potential of networked agents for real-world applications requiring privacy and robustness to message loss.

Between Puppet and Actor: Reframing Authorship in this Age of AI Agents

Puppet and Actor framework: This framework reframes authorship in the age of AI agents by positioning AI agency between puppet and actor.
Conceptual tensions in AI agent roles; creative processes; Large Language Models (LLMs); Schmidt's categorization; classical authorship; puppet-actor spectrum; creative autonomy; dynamic state; evolving authorship.
Understanding AI agency as puppet-actor spectrum is important for adapting authorship concepts in the age of AI.

AGENTIC RETRIEVAL-AUGMENTED GENERATION: A SURVEY ON AGENTIC RAG

Introduces Survey on compherensive list of RAG-techniques with LLM-agents.

Agent TCP/IP: An Agent-to-Agent Transaction System

ATCP/IP (Agent Transaction Control Protocol for Intellectual Property): introduces a trustless framework for exchanging IP between agents via programmable contracts.
Framework enables agents to initiate, trade, borrow, and sell agent-to-agent contracts on the Story blockchain network, including legal wrappers for offchain enforcement, and facilitates autonomous selling of training data, licensing of information, and content collaboration.
This framework is important for creating a standardized way for agents to negotiate and enter into agreements, forming a market for knowledge.

Leveraging Large Language Models as Knowledge-Driven Agents for Reliable Retrosynthesis Planning

MBRPS (Multi-branched Reaction Pathway Search): Algorithm enabling exploration of all pathways, with a focus on multi-branched ones.
Framework integrates LLMs and KGs, automates literature retrieval, reaction data extraction, database querying, and construction of retrosynthetic pathway trees, and recommends optimal routes.
Attempt to develop a fully automated retrosynthesis planning agent tailored specially for macromolecules powered by LLMs.

AutoRestTest: A Tool for Automated REST API Testing Using LLMs and MARL

AutoRestTest: is a novel tool that integrates Semantic Operation Dependency Graph (SODG) with Multi-Agent Reinforcement Learning (MARL) and Large Language Models (LLMs) for effective REST API testing.
It uses five specialized agents for operation, parameter, value, dependency, and header identification, and employs LLMs for realistic input generation and a command-line interface for user interaction.
This framework provides a comprehensive solution for thorough REST API evaluation and validation.

Leveraging LLM Agents for Translating Network Configurations

IRAG (Intent-based Retrieval Augmented Generation): is an intent-based framework for translating network configurations using LLM agents.
Framework includes intent extraction, manual retrieval, incremental translation, syntax verification and semantic verification modules.
This framework achieves high syntax correctness and superior translation accuracy compared to state-of-the-art methods.

DISENTANGLING EXPLORATION OF LARGE LANGUAGE MODELS BY OPTIMAL EXPLOITATION

Optimal Exploitation framework: isolates exploration as the sole objective by tasking the agent with delivering information that enhances future returns.
Framework decomposes missing rewards into exploration and exploitation components, measures optimal achievable return for explored states, and provides insights into behaviors driven by agent instructions.

Physical AI Agents: Integrating Cognitive Intelligence with Real-World Action

Physical AI Agents: is a framework that integrates cognitive reasoning with physical interaction for real-world tasks.
Framework includes modular architecture with perception, cognition, and actuation blocks, and introduces Ph-RAG (Physical Retrieval Augmented Generation) design pattern for real-time decision-making.

Doc-Guided Sent2Sent++: A Sent2Sent++ Agent with Doc-Guided memory for Document-level Machine Translation

Doc-Guided Sent2Sent++: is an agent that employs an incremental sentence-level forced decoding strategy for document-level machine translation.
It uses Doc-Guided Memory with summary and its translation, ensures sentence completeness, enhances fluency, and improves translation quality.
This approach addresses the limitations of other DocMT agents by maintaining both completeness and fluency.

Evaluating GenAl for Simplifying Texts for Education: Improving Accuracy and Consistency for Enhanced Readability

GenAI (Generative Artificial Intelligence): framework evaluates the use of LLMs for text simplification in educational contexts.
Framework uses three LLMs (GPT-4 Turbo, Claude 3, and Mixtral 8x22B), four prompting techniques (zero-shot, directional stimulus, chain-of-thought, and prompt chaining), and a novel multi-agent architecture; it assesses grade level accuracy, keyword accuracy, semantic similarity, and word count change.
This study provides a rigorous evaluation of LLMs for automated text simplification, offering insights for educators and future research.

14th of January 2025

14th January 2025

Governing AI Agents

Governance strategy: Governance strategy centered around inclusivity, visibility, and liability is proposed for designing and regulating AI agents.
agency law and theory; principal-agent problems; information asymmetry, authority, loyalty, delegation; limitations of conventional solutions; new technical and legal infrastructure; governance principles.
New technical and legal infrastructure is needed to support governance principles for reliable, safe, and ethical AI agents.

Flow: A Modular Approach to Automated Agentic Workflow Generation

Flow: is a multi-agent framework that dynamically adjusts workflows using activity-on-vertex graphs.
It refines workflows based on historical performance, emphasizes modularity, and achieves concurrent sub-task execution.
This framework improves efficiency and adaptability in multi-agent systems through dynamic workflow updates.

POKERBENCH: Training Large Language Models to become Professional Poker Players

POKERBENCH: is a benchmark for evaluating poker-playing abilities of large language models (LLMs).
It includes 11,000 poker scenarios, covers pre-flop and post-flop play, and evaluates models like GPT-4, ChatGPT 3.5, Llama and Gemma series.
This benchmark provides a quick and reliable way to evaluate LLMs in complex game-playing scenarios.

A Multi-Agent Framework for Systematic Review Automation Using Large Language Models

LatteReview: Intrdocus LLM-based systematic literature review multi-agent framework automation, which consists of three layers: LM providers (local models / LLMs via api), Reviewer agents (with roles & expertise levels) and Workflows (support sequential, parallel review rounds, dynamic decision-making and iterative refinement).
Includes BaseReviewer/ScoringReviewer/TitleAbstractReviewer/AbstractionReviewer/Custom reviewer-agents, which are used as modular agents for title and abstract screening, relevance scoring, and structured data extraction; agents operate within orchestrated workflows.
Workflow module includes Concept of rounds / Chaining reviews / Parallel reviews and Dynamic filter.

CodeCoR: An LLM-Based Self-Reflective Multi-Agent Framework for Code Generation

CodeCoR (Code Collaboration and Repair): is a self-reflective multi-agent framework for code generation.
It includes prompt-, coding-, test- and repair-agents, uses pruning methods to evaluate agent effectiveness, and enhances self-reflective ability.
It significantly outperforms existing state-of-the-art methods in code generation.

Engineering LLM Powered Multi-agent Framework for Autonomous CloudOps

MOYA (Meta Orchestrator Of Your Agents): is a multi-agent framework leveraging GenAI for autonomous CloudOps, balancing automation with human control.
Framework integrates internal and external systems, optimizes task orchestration, security, and error mitigation using Retrieval Augmented Generation (RAG), and includes LLM-based and non-LLM-based agents.
The framework enhances accuracy, responsiveness, and effectiveness over non-agentic approaches across complex workflows.

Agent-Centric Projection of Prompting Techniques and Implications for Synthetic Training Data for Large Language Models

Agent-Centric Projection: introduces a framework to reveal connections between prompting strategies and multi-agent systems.
Framework uses linear and non-linear contexts to classify prompting techniques, and proposes three conjectures about the relationship between prompting and multi-agent systems.
This framework enables cross-pollination of research findings between prompting and multi-agent domains, while providing new directions for improving both the design and training of future LLM systems.

Talk to Right Specialists: Routing and Planning in Multi-agent System for Question Answering

RopMura: is a multi-agent system that incorporates a router and a planner for question answering across diverse knowledge domains.
RopMura includes router for selecting relevant agents, planner for decomposing complex queries, and knowledge sovereignty consideration.
This framework enables efficient and accurate multi-domain question-answering.

Infecting Generative AI With Viruses

VLM/LLM (Vision-Large Language Model): framework tests security boundaries by embedding EICAR test file within JPEG images.
Framework includes multiple LLM platforms, such as OpenAI GPT-40, Microsoft Copilot, Google Gemini 1.5 Pro, and Anthropic Claude 3.5 Sonnet; it demonstrates masking EICAR string, extracting test file, and using obfuscation techniques.
This research extends penetration testing framework to evaluate cloud-based generative AI and LLM security boundaries.

Visual Language Models as Operator Agents in the Space Domain

Explores the application of VLMs as operator agents in the space domain.
Framework builds on LLMs and their multimodal extensions, investigates how VLMs enhance autonomous control and decision-making in space missions, includes software and hardware operational paradigms.
This research demonstrates that VLMs can effectively process visual and textual data to generate contextually appropriate actions.

ADAM-1: AI and Bioinformatics for Alzheimer's Detection and Microbiome-Clinical Data Integrations

ADAM-1 (Alzheimer's Disease Analysis Model Generation 1): is a multi-agent large language model framework designed to integrate and analyze multi-modal data.
Framework uses retrieval-augmented generation techniques, multi-agent architecture, synthesizes insights from diverse data sources, contextualizes findings using literature-driven evidence, and is tailored for binary classification tasks.
This framework demonstrates robustness and consistency, particularly in small laboratory datasets, and has potential for Alzheimer's research and diagnostics.

ADDRESSING THE SUSTAINABLE AI TRILEMMA: A CASE STUDY ON LLM AGENTS AND RAG

Sustainable AI Trilemma: highlights the tensions between AI capability, digital equity, and environmental sustainability.
Framework analyzes energy costs in memory module designs, introduces metrics for energy consumption and system performance trade-offs, challenges LLM-centric autonomy paradigm.
This framework provides practical insights for developing more sustainable AI systems.

Agent-Centric Projection of Prompting Techniques and Implications for Synthetic Training Data for Large Language Models

Agent-Centric Projection: introduces a framework to reveal connections between prompting strategies and multi-agent systems.
Framework uses linear and non-linear contexts to classify prompting techniques, and proposes three conjectures about the relationship between prompting and multi-agent systems.
This framework enables cross-pollination of research findings between prompting and multi-agent domains, while providing new directions for improving both the design and training of future LLM systems.

ASTRID - An Automated and Scalable TRIaD for the Evaluation of RAG-based Clinical Question Answering Systems

ASTRID: is an Automated and Scalable TRIaD for evaluating clinical QA systems leveraging RAG.
ASTRID includes three metrics: Context Relevance (CR), Refusal Accuracy (RA), and Conversational Faithfulness (CF); it is validated using real-world patient questions and clinician assessments; it is automatable using LLMs.
ASTRID provides a valuable resource for further research and development of clinical QA systems.

CuAsmRL: Optimizing GPU SASS Schedules via Deep Reinforcement Learning

CuAsmRL: is an automatic optimizer for optimizing NVIDIA GPU SASS schedules using reinforcement learning.
It formulates SASS optimization as an assembly game, integrates with OpenAI Triton, and improves performance of specialized CUDA kernels by up to 26%.
This framework provides a way to automatically optimize GPU kernels, which is important for improving the performance of LLMs.

13th of January 2025

The Lessons of Developing Process Reward Models in Mathematical Reasoning

PRM (Process Reward Model): A model for process supervision in mathematical reasoning of LLMs, which aims to identify and mitigate intermediate errors in the reasoning processes.
Monte Carlo (MC) estimation, Best-of-N (BoN) evaluation, consensus filtering mechanism, response-level and step-level metrics, data efficiency, error identification.
The paper addresses challenges in developing effective PRMs, offering solutions for data annotation, evaluation methodologies, and proposing a consensus filtering mechanism to enhance model performance and data efficiency.

Evaluating Agent-based Program Repair at Google

Passerine: An agent-based program repair system designed to operate within Google's development environment.
Inspired by SWE-Agent, utilizes ReAct-style loop, limited command set, Gemini 1.5 Pro, 20 trajectory samples, evaluates on GITS-Eval (178 bugs from Google's internal issue tracking system).
Establishes a baseline for agent-based automated program repair performance on an industrially relevant benchmark, highlighting challenges and opportunities in an enterprise context.

GPT as a Monte Carlo Language Tree: A Probabilistic Perspective

Reviews LLM as a Monte Carlo Language Tree (data tree), where each node is token, each edge is the token transition probability and each sequence has unique path.
Any GPT LLM can be flattened into MCLT.
Claims CoT attempts to find path between the input and output in the MCLT to connect them.

WebWalker: Benchmarking LLMs in Web Traversal

WebWalker: is a multi-agent framework that mimics human-like web navigation through an explore-critic paradigm.
WebWalkerQA is a benchmark designed to assess the ability of LLMs to perform web traversal, it evaluates the capacity of LLMs to traverse a website's subpages to extract high-quality data systematically, and it focuses on text-based reasoning abilities.
This work highlights the importance of deep, vertical exploration in web-based tasks.

Imagine while Reasoning in Space: Multimodal Visualization-of-Thought

MVoT (Multimodal Visualization-of-Thought): is a multimodal native reasoning paradigm that generates image visualizations of reasoning traces.
MVoT uses token discrepancy loss to improve visual coherence and fidelity, and is validated on dynamic spatial reasoning tasks, showing competitive performance.
MVoT establishes new possibilities for complex reasoning tasks where visual thinking complements verbal reasoning.

Understanding and Benchmarking Artificial Intelligence: OpenAI's 03 Is Not AGI

Claims, that ARC-AGI (Abstraction and Reasoning Corpus) is a benchmark proposed to measure intelligence, but not suitable for measuring progress towards AGI.
ARC-AGI tasks represent a specific problem structure, which can be solved by massive trialling of predefined operations, and it does not require exploration, but only exploitation.
A new benchmark is outlined that covers a much higher diversity of unknown tasks to be solved, to enable a comprehensive assessment of intelligence and of progress towards AGI.

PoAct: Policy and Action Dual-Control Agent for Generalized Applications

PoAct (Policy and Action Dual-Control Agent): is a framework that dynamically adjusts action space and reasoning policy using a Policy Controller and Action Controller.
PoAct includes a Policy Controller for switching between reasoning policies, and an Action Controller with RAG Selector and Action Reviewer for managing action space and reasoning paths; it is evaluated on LegalAgentBench and AgentBench datasets.
PoAct achieves higher quality code actions and more accurate reasoning paths, while also reducing token consumption.

Lifelong Learning of Large Language Model based Agents: A Roadmap

Introduces a s survey incorporating lifelong learning into LLM-based agents.
Categorizes core components into perception-, memory-, and action-modules, highlights continuous adaptation, mitigates catastrophic forgetting, and improves long-term performance.

How GPT LEARNS LAYER BY LAYER

Explores how LLMs build internal world models with OthelloGPT by using Sparse AutoEncoders.

SST-EM: Advanced Metrics for Evaluating Semantic, Spatial and Temporal Aspects in Video Editing

SST-EM (Semantic, Spatial, and Temporal Evaluation Metric): is a benchmark for video editing that leverages VLMs, object detection, and temporal consistency checks.
SST-EM includes semantic extraction using VLM, primary object tracking with object detection, focused object refinement via LLM agent, and temporal consistency assessment using ViT.
This framework provides a comprehensive evaluation of semantic fidelity and temporal smoothness in video editing.

PoAct: Policy and Action Dual-Control Agent for Generalized Applications

PoAct (Policy and Action Dual-Control Agent): is a framework that dynamically adjusts action space and reasoning policy by switching between different reasoning policies and managing action space.
PoAct includes Policy Controller for high-quality planning and coding, and Action Controller with RAG Selector and Action Reviewer for managing action space and reasoning paths; it is evaluated on multiple datasets with commercial and open-source large models.
PoAct achieves higher-quality code actions and more accurate reasoning paths, demonstrating strong generalizability and scalability.

Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability

cDPO (critical Direct Preference Optimization): is a novel framework for identifying and penalizing critical tokens in mathematical reasoning tasks.
It uses rollout sampling to identify critical tokens, contrastive estimation to pinpoint them efficiently, and token-level rewards for preference optimization.
This framework significantly improves model accuracy in mathematical reasoning tasks by reducing errors.

12th of January 2025

Eliza: A Web3 friendly AI Agent Operating System

Eliza: The first open-source, web3-friendly, agentic framework that makes the deployment of web3 applications effortless.
Typescript program, seamless web3 integration, stable performance, key runtime components, community-driven, modular design, multi-agent simulation.
Eliza bridges the gap between AI and web3, offering a platform for decentralized AI applications.

DVM: Towards Controllable LLM Agents in Social Deduction Games

DVM (Dynamic Victory Manager): is a framework for controllable LLM agents in social deduction games, comprising Predictor, Decider, and Discussor components.
It uses reinforcement learning with a win rate-constrained decision chain reward mechanism, enabling agents to dynamically adjust their gameplay proficiency, and it is evaluated in the Werewolf game.
DVM enables adaptive and balanced gameplay in social deduction games, opening new research avenues for controllable game agents.

LLMs Model Non-WEIRD Populations: Experiments with Synthetic Cultural Agents

Synthetic Cultural Agents (SCAs): uses LLMs to create synthetic agents representing non-WEIRD populations. Includes web scraping, LLMs, RAG prompting to construct cultural profiles and uses these agents to classic behavioral experiments, demonstrating cross-cultural variability.
Offers an effective and ethical method to pilot experiments and refine protocols for hard-to-reach populations for cross-cultural economic studies.

AIOPSLAB: A HOLISTIC FRAMEWORK TO EVALUATE AI AGENTS FOR ENABLING AUTONOMOUS CLOUDS

AIOPSLAB: is a framework that deploys microservice cloud environments, injects faults, generates workloads, exports telemetry data, orchestrates components, and provides interfaces for interacting with and evaluating agents.
AIOPSLAB includes Agent-Cloud Interface (ACI), a unified interface for agent-cloud interaction, and supports evaluation of LLM-based agents with a benchmark suite of 48 problems across different AIOps tasks.
AIOPSLAB provides a holistic approach to evaluate AIOps agents in complex cloud environments, addressing the limitations of existing benchmarks.

11th of January 2025

The Internet of Large Language Models

The Internet of LLM: introduces an universal environment and sharing protocol of LLM training/knowledge exchange, which consists of LLM sharing protocol/LLM Universal environment/Agent Optimal Path Module/joint mining mechanism.
Includes also planning-, reflection- and tool use-agents.

Guided Code Generation with LLMs: A Multi-Agent Framework for Complex Code Tasks

Guided code generation: introduces a multi-agent framework for complex code tasks, which includes hierarchical decomposition, bottom-up code generation, and multi-agent validation.
Leverages LLMs as fuzzy searchers and information retrievers. Mitigates LLM weaknesses in long sequential reasoning and context understanding.
This framework enhances code generation capabilities and overcomes limitations of LLMs in compositional reasoning and context handling.

10th of January 2025

BioAgents: Democratizing Bioinformatics Analysis with Multi-Agent Systems

BioAgents: is a multi-agent system designed to assist users in bioinformatics pipeline design, development, and troubleshooting. which includes two specialized agents and a reasoning agent.
First specialized agent was fine tuned with conceptual genomics tasks and the second specialized agent uses RAG related to workflow documentation.
Reasoning agent uses self-ratings / threshold.
Achieves performance comparable to human experts on conceptual genomics tasks.

Multi-Agent Collaboration Mechanisms: A Survey of LLMs

The survey reviews Multi-Agent Systems (MASs) collaboration mechanisms based on key dimensions.
Framework includes actors, types, structures, strategies, and coordination protocols; reviews existing methodologies; investigates applications across diverse domains; identifies key lessons, open challenges, and potential research directions.

How to Enable Effective Cooperation Between Humans and NLP Models: A Survey of Principles, Formalizations, and Beyond

Human-Model Cooperation: is a survey of principles, formalizations, and open challenges in human-model cooperation.
It introduces a new taxonomy for categorizing human-model cooperation, identifies key research frontiers, and discusses associated challenges.

OpenFOAMGPT: a RAG-Augmented LLM Agent for OpenFOAM-Based Computational Fluid Dynamics

OpenFOAMGPT: LLM-based agent tailored for OpenFOAM-centric computational fluid dynamics (CFD) simulations.
It leverages GPT-4 and a chain-of-thought (CoT)-enabled o1 preview model, uses retrieval-augmented generation (RAG) pipeline, and includes an iterative correction loop.

9th of January 2024

Search-01: Agentic Search-Enhanced Large Reasoning Models

Search-01: is a framework that enhances Large Reasoning Models (LRMs) with an agentic retrieval-augmented generation mechanism and a Reason-in-Documents module.
It integrates an agentic search workflow, enables dynamic retrieval of external knowledge, and uses a separate module to analyze retrieved information.
This approach enhances the trustworthiness and applicability of LRMs in complex reasoning tasks.

OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis

OpenOmni: Introduces three-stage training method combining speech-to-text generation/image-to-text generation/speech generation, which results SOTA-level omnimodal LLM.

Emergence of human-like polarization among large language model agents

Introduces a networked system, which simulates social interactions of thousands of LLM-based agents, including capabilities of establishing social relationships, communicating, and forming opinions on political issues. LLM agents form spontaneously human-like social networks (echo chamber).
LLM agents exhibit human-like polarization and can be used to study interventions, offering insights into managing polarization in real-world scenarios.
Self-regulation helps to reduce inconsistencies in the opinions, which leads to more balanced polarization patterns. Openmindedness and diverse interaction limit polarization effect.

NSChat: A Chatbot System To Rule Them All

NSChat: introduces a web-based chatbot system designed for neuroscience research.
NSChat is built using React framework, it is customizable, flexible, and allows integration of various LLMs, it also includes a logging mechanism for user interactions.

Emergence of human-like polarization among large language model agents

LLM (Large Language Model) agents framework: simulates a networked system of agents that establish social relationships, communicate, and form opinions on political issues.
Framework includes self-expression, communication, and opinion update stages; agents develop human-like polarization, homophilic clustering, and echo chamber effects; self-regulation strategy reduces self-inconsistency.
This framework provides a valuable platform for exploring strategies to mitigate polarization and promote inclusive political conversations.

LearningFlow: Automated Policy Learning Workflow for Urban Driving with Large Language Models

LearningFlow: is an automated policy learning workflow for urban driving that uses multiple LLM agents.
It includes curriculum sequence generation and reward generation processes, supported by analysis agents, and enhances sample efficiency.
This framework automates policy learning across complex driving tasks and reduces reliance on manual reward function design.

OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?

OVO-Bench (Online-VideO-Benchmark): is a novel video benchmark for evaluating online video understanding capabilities of Video-LLMs.
It includes 644 videos, 2800 meta-annotations, and 12 tasks across three categories: Backward Tracing, Real-Time Visual Perception, and Forward Active Responding.
This benchmark highlights the importance of temporal awareness for advanced online video understanding.

9th of January 2025

Transformer-Squared: Self-adaptive LLMs

Transformer²: A self-adaptation framework that adapts LLMs (Large Language Models) for unseen tasks in real-time by selectively adjusting the singular components of their weight matrices.
Two-pass mechanism, task-specific expert vectors, reinforcement learning, dynamic mixing, targeted behavior, outperforming LoRA, fewer parameters, greater efficiency, versatility across different LLM architectures and modalities.
Represents a significant leap forward, offering a scalable, efficient solution for enhancing the adaptability and task-specific performance of LLMs, paving the way for truly dynamic, self-organizing AI systems.

On Corrigibility and Alignment in Multi Agent Games

Multi Agent Corrigibility Games: introduces a framework for studying corrigibility in systems comprised of multiple autonomous agents.
Framework models a 2-player game with human supervision, uses Bayesian games to introduce uncertainty over human beliefs, and analyzes specific cases like two-player corrigibility and adversary settings.
This framework provides insights into designing corrigible multi-agent systems, even in the face of human irrationality.

8th of January 2025

rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

rStar-Math: A framework demonstrating that small language models (SLMs) can rival or surpass the math reasoning capability of OpenAI models through deep thinking. Iteratively improves through self-evolution generating millions of new math reasoning trajectories in each round.
Uses Monte Carlo Tree Search (MCTS) with self-annotated Q-values. rStar-Math used 747k math word problems, took the final correct answer and then rolled out 16 MCTS-based step-by-step verified reasoning trajectories, to categorize problems by difficulty level (easy/medium/hard) based on ratio of correct solutions. Hard problems are assigned with an additional extra 16 rollouts. The policy SLM is trained using all the step-by-step trajectories with their Q-values.
The importance of this work lies in showing that smaller language models can achieve state-of-the-art math reasoning, rivaling larger models, through a novel self-evolutionary process.
Includes Code-Augmented CoT, where step-by-step reasoning trajectories generated are verified with code execution for correctness.

Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought

Meta-CoT (Meta Chain-of-Thought): A novel framework that extends traditional CoT by explicitly modeling the underlying reasoning process required to arrive at a particular CoT.
Inspired by Cognitive Science's dual-process theory, non-linear, iterative, latent process of exploration and verification, in-context search, process supervision, synthetic data generation, search algorithms, instruction tuning, reinforcement learning, scaling laws, verifier roles, novel reasoning algorithms, meta-reinforcement learning.
This work provides a theoretical and practical roadmap to enable Meta-CoT in LLMs, paving the way for more powerful and human-like reasoning in artificial intelligence.

URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics

URSA (Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics): A framework for enhancing the mathematical reasoning capabilities of Multimodal Large Language Models (MLLMs) through a three-module synthesis strategy and a novel dual-view process supervision data synthesis method.
Integrates CoT distillation, trajectory-format rewriting, format unification, MMathCoT-1M dataset, DualMath-1.1M dataset, URSA-7B model, URSA-RM-7B model, test-time scaling, process annotation, out-of-distribution (OOD) verification.
This work significantly enhances MLLMs' potential in mathematical reasoning, achieving state-of-the-art performance on multiple multimodal mathematical benchmarks and demonstrating robust supervision abilities.

Retrieval-Augmented Generation with Graphs (GraphRAG)

GraphRAG: is a framework for retrieval-augmented generation using graph-structured data.
It defines key components like query processor, retriever, organizer, generator, and data source; reviews techniques tailored to different domains; discusses research challenges and future directions.
This framework provides a comprehensive overview of GraphRAG for information retrieval, data mining, and machine learning communities.

Agent Laboratory: Using LLM Agents as Research Assistants

Agent Laboratory: An autonomous research-framework with LLMs for completing the entire research process (literature review/experimentation/report writing), from literature review to experimentation (plan formulation, data preparation and running experiments) and report writing (report writing and report refinements).
Human-in-the-loop, research idea as input and code repository/research report as output. Producs SOTA-level performance and reduces research expensesn.
The framework has the potential to accelerate scientific discovery by enabling researchers to focus on creative ideation rather than low-level coding and writing.
Includes postdoc/ph student/sw engineer/ml engineer/professor-agents. Includes mle-solver-tool capable of solving ML-tasks, which iteratively improves research code.
Automated evaluation of the framework significantly overestimated the accurate scoring. Copilot mode was found useful by the human testers. Includes prompts.

Supervision-free Vision-Language Alignment

SVP (Supervision-free Visual Projection): A novel framework that enhances vision-language alignment in VLMs without relying on curated data or preference annotation.
Leverages self-captioning, pre-trained grounding model, feedback mechanism, elicits latent information, improves vision-language alignment.
The framework significantly improves performance across various tasks, including captioning, referring, visual question answering, multitasking, hallucination control, and object recall, highlighting its potential to advance multimodal AI systems.

7th of January 2025

Reasoning-Enhanced Self-Training for Long-Form Personalized Text Generation

REST-PG (Reasoning-Enhanced Self-Training for Personalized Text Generation): Introduces a multi-stage framework designed to teach LLMs reasoning over personalized context through Expectation-Maximization Reinforced Self-Training.
Generates reasoning paths based on the user's past preferences, background knowledge, and writing style
The framework enhances LLMs' ability to generate personalized text, outperforming state-of-the-art baselines by 14.5% on average.

6th of January 2025

Large language models for artificial general intelligence (AGI): A survey of foundational principles and approaches

Introduces a survey about AGI concepts and achieving AGI with LLMs. Includes list of memory types used with LLMs: sensory/working/semantic/episodic/procedural. Lists aspects of embodiment as: goal-awareness/self-awareness/situatedness/deliberate action.

CALM: Curiosity-Driven Auditing for Large Language Models

CALM (Curiosity-driven Auditing for LLMs): Introduces intrinsically motivated RL based on curiousity to finetune LLM as an auditor agent, to discover harmful/biased input/output pairs in the LLM. Includes token-level intrinsic bonus. Uses curiosity-driven exploration to navigate efficiently the prompt space, such as discover specific celebrity names.

RTLSquad: Multi-Agent Based Interpretable RTL Design

RTLSquad: is a novel LLM-Based Multi-Agent system for interpretable RTL code generation.
It divides the design process into exploration, implementation, and verification & evaluation stages, managed by specialized agent squads, generating optimized RTL code through inter-agent collaboration, and providing decision interpretability through the communication process.
This framework enhances the ability to generate functionally correct RTL code and optimize PPA performance, while also providing decision paths.

5th of January 2025

LLMs Help Alleviate the Cross-Subject Variability in Brain Signal and Language Alignment

Decodes EEG scans to text with subject-independent semantic features for Brain-Computer Interfaces (BCIs). Introduces EEG embeddings.
Includes cross-subject generalization (addresses the issue of variability in brain anatomy between humans/neural dynamics/signal), zero-shot and comprehensive evaluation.

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

DeepSeek LLM: is an open-source large language model framework with 7B and 67B configurations.
It uses a 2 trillion token dataset, multi-step learning rate scheduler, and includes SFT and DPO stages.
This framework achieves superior performance compared to LLaMA-2 and GPT-3.5 in various benchmarks.

4th of January 2025

Table as Thought: Exploring Structured Thoughts in LLM Reasoning

Table as Thought: organizes reasoning within a tabular schema, where rows represent sequential thought steps and columns capture critical constraints and contextual information.
Framework is inspired by cognitive neuroscience theories, reasoning process iteratively populates the table until self-verification ensures completeness and correctness, excels in planning tasks and mathematical reasoning.
This work provides a novel exploration of refining thought representation within LLMs, paving the way for advancements in reasoning and AI cognition.

Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks

Hallo3: The first application of a pretrained transformer-based video generative model for highly dynamic, realistic portrait animation.
Identity reference network, 3D VAE, transformer layers, speech audio conditioning, motion frame mechanisms, DiT-based video generation, video extrapolation.
Addresses challenges of non-frontal perspectives, dynamic objects, and immersive backgrounds in portrait animation.

Thinking with Many Minds: Using Large Language Models for Multi-Perspective Problem-Solving

Replicates the concept of "Wisdom of the Crowd" with LLMs using synthetic deliberation.
Generates multiple agents, each with dinstinct perspective to a problem. Agents simulate arguments and counter-arguments from their perspective.
Agents explore in parallel the problem space using its own perspective. The integration mechanism adjusts agents positions based on proposals/evaluations of others controllable with influence parameter alpha. The iterative deliberation repeats multiple rounds until consensus is reached.

UAVs Meet LLMs: Overviews and Perspectives Toward Agentic Low-Altitude Mobility

Review systematically integration of LLMs with UAVs (Unmanned aerial vehicles).
Proposes roadmap towards agentic UAVs. Includes github-repository with links to papers/approaches around LLM-based UAV systems.

3rd of January 2025

SDPO: Segment-Level Direct Preference Optimization for Social Agents

Introduces SDPO (Segment-Level Direct Preference Optimization)-fine tuning, which aligns the LLM to key segments in multi-turn conversation.
Addresses goal-completion in multi-turn conversation.

AgentRefine: Enhancing Agent Generalization through Refinement Tuning

AgentRefine: Uses a strong LLM to simulate interactive role-playing, with the model acting as both Dungeon Master and player. A verifier checks each action for errors, providing feedback that allows the model to refine its actions until it achieves the correct result. This iterative process, with its corrected action sequences, trains the system to explore viable actions and generalize to new scenarios.

Multi-Agent Conversational Online Learning for Adaptive LLM Response Identification

MACO (Multi-Agent Conversation Online learning for adaptive LLM response identification): Introduces near-optimal cumulative regret with multiple local agents to identify, which is the most optimal LLM response to serve for the particular user, even when new user.

MoColl: Agent-Based Specific and General Model Collaboration for Image Captioning

MoColl: Introduces LLM-agent based framework for image captioning with specialised VQA model. Includes warm-up stage and agent-guided tuning stage.

2nd of January 2025

ProgCo: Program Helps Self-Correction of Large Language Models

ProgCo (Program-driven Self-Correction): A self-correction framework that uses self-generated and self-executed verification pseudo-programs to improve reasoning in large language models. Incluces ProgVe (Program driven Verification) and ProgRe (Program driven Refinement).
This framework enhances the ability of large language models to self-correct without external feedback, particularly in complex reasoning tasks.

PREDICTING THE PERFORMANCE OF BLACK-BOX LLMS THROUGH SELF-QUERIES

QueRE (Question Representation Elicitation): A framework to extract features of LLMs (Large Language Models) in a black-box manner by using follow-up prompts and taking the probabilities of different responses as representations to train reliable predictors of model behavior.
Low-dimensional representations, linear model, instance level, model performance, hidden state, question-answering, adversarial system prompt, model architectures, model sizes.
The framework can be used to predict model performance, detect models influenced by adversarial system prompts and distinguish between different model architectures and sizes.

A3: Android Agent Arena for Mobile GUI Agents

A3 (Android Agent Area): Introduces benchmark to evaluate mobile GUI agents, which focuses on practical tasks, larger action spaces and automated LLM-based evaluation.
A3 consists of controller (gets/controls states of the device), evaluator (final rating) and translator (between device device function and the agent message).

Dynamic Scaling of Unit Tests for Code Reward Modeling

CodeRM-8B: A lightweight unit test generator with dynamic scaling mechanism, which adapts number of unit tests based on problem difficulty. The unit tests are used in validating generated code by the LLM as reward signal.
The framework significantly improves performance of code generation across various models and benchmarks by enhancing the quality of the reward signal.

3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer

3D-LLaVa: Introduces 3D multi modal LLM with point clouds/text instruction/visual prompt as input and generates text output and 3D mask with Omni Superpoint Transformer (OST).
3D-LLaVa handles 3D vision-centric dialogue.
OST includes visual features selection, visual prompt encoding and 3D mask generation.

Harnessing Multi-Agent LLMs for Complex Engineering Problem-Solving: A Framework for Senior Design Projects

Proposes multi-agent LLM framework for engineering design projects consisting of problem formulation/breadth & depth/ambiguity & uncertainty/system complexity/technical innovation & risk management/societal & ethical consideration/methodology & approach/compherensive evaluation-agents.
Each agent consists of description, task, objective and evaluation points.

Embodied AI-Enhanced Vehicular Networks: An Integrated Large Language Models and Reinforcement Learning Method

Incorporates embodied AI framework, which consists of semantic data processing with LLaVa-agent (extracts semantics from image data captured by the vehicle), Data transmission optimization (balances bandwidth utilization and quality of experience) and Enhanced decision making with Deep RL with GAE-PPO.

MDSF: Context-Aware Multi-Dimensional Data Storytelling Framework based on Large language Model

MDSF (Multidimensional Data Storytelling Framework): Automatess data analysis and storytelling. Includes data preprocessing steps, fine tuned LLMs, LLM agents.

Toward Inclusive Educational AI: Auditing Frontier LLMs through a Multiplexity Lens

Suggests two strategies to improve LLMs multiplexity (diverse cultural viewpoints) over WEIRD (western/educated/industrialized/rich/democratic): system prompt with diverse cultural perspectives and multi-agent system with agents with different cultural views. Sentiment analysis is used to review cultural resonance.

PSYCHE: A Multi-faceted Patient Simulation Framework for Evaluation of Psychiatric Assessment Conversational Agents

PSYCHE: Introduces an LLM-based psychiatric evaluation framework by comparing the predicted values of psychiatric elements (Construct-PACA) against the actual values (Construct-SP). The actual values are simulated patient data generated with a multi-faceted construct (MFC).
The framework guarantees clinical relevance, ethical safety, cost efficiency, and quantitative evaluation by simulating psychiatric patients with detailed profiles, histories, and behaviors.

BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery

Introduces BoxingGym-benchmark, reviews LLMs capabilities to design and model discovery: collect data to test scientific theory and propose/update scientific theories through 10 environments. Introduces metric called EIG.
Expected information gain (EIG) measures an experiment's informativeness by testing if one scientific agent's model explanation enables another to make accurate environmental predictions.

General Information Metrics for Improving AI Model Training Efficiency

GIME (General Information Metrics Evaluation): A novel framework for optimizing AI model training by evaluating datasets using 11 general information metrics before training begins.
Objective Information Theory (OIT), pre-training assessment, data selection, training efficiency, reduced costs, model-agnostic, domain-independent.
This framework improves AI model training efficiency and reduces resource consumption while preserving model performance across various domains.

1st of January 2025

Agentic Systems: A Guide to Transforming Industries with Vertical AI Agents

Reviews transition from SaaS to context-aware, adaptive systems handling dynamic environments through vertical agents.
Identifies core modules of LLM agents: memory/reasoning engine/cognitive skills/tools.
Author categorises agentic systems into: task-specific, multi-agent and human augmented agent systems.

Large Language Model Based Multi-Agent System Augmented Complex Event Processing Pipeline for Internet of Multimedia Things

Introduces multi-agent framework for complex event processing of video queries(think TikTok/Youtube as examples) with AutoGen and Kafka brokers (real time data streams).
Consists of conversable/assistant/user proxy/LLM backend/human backed/tool backed-agents.

Interactionalism: Re-Designing Higher Learning for the Large Language Agent Era

Introduces Interactionalism-framework focuses on interactional intelligence to learn more personalized/social/non-linearly way, instead of monological way.
Proposes usage of dialogue-agents in education, such as tutors, teaching assistants, evaluators, guides and mentors.

LLM-Powered Multi-Agent System for Automated Crypto Portfolio Management

Introduces multi-agent framework for cryptocurrency investing with intrateam and interteam collaboration and multi modality. Consists of expert training module and multi-agent investment module.
Expert training module uses data/literature-agents to feed historical data and investment literature. Explanation-agents process this information to generate high-quality prompts to fine tune investment agents.
Multi-agent investment module consists of data-agent fetching real-time data to market-agents and crypto agents. Market agents includes two expert agents to analyze news/market factors to predict market trends and determining cash-crypto allocation. Crypto-agents includes two specialized agents to analyze crypto-specific factors and candlestick charts to make crypto selection decisions. Trading agents finally act with a trading API to execute the final portfolio strategy.

Beyond Text: Implementing Multimodal Large Language Model-Powered Multi-Agent Systems Using a No-Code Platform

Proposes design and implementation of multi modal and multi-agent framework with LLMs. Includes multi modal inputs (text/audio/video/image), multi-agent layer (includes supervisory-agent and RAG/image analysis/audio generation/image generation/video generation- worker agents), process layer (vector db and modality specific models) and the output layer (text/audio/video/image).
Supervisor agent controls sequence of tasks, distributes tasks, manages output of worker agents, tnterprets outputs and makes decisions about next steps in the sequence.

31st of December 2024

Enhancing LLM Reasoning with Reward-guided Tree Search

STILL-1 (Slow Thinking with LLMs): A reward-guided tree search framework to enhance the reasoning capabilities of LLMs.
Integrates policy model, reward model, and search algorithm; policy model navigates a dynamically expanding tree; guided by a trained reward model.
Improves LLMs' performance on complex mathematical reasoning tasks by trading test time for improved accuracy.

MAIN-RAG: Multi-Agent Filtering Retrieval-Augmented Generation

Main-RAG: Introduces multi-agent framework, where LLM-agents collaboratively filter and score retrieved documents.
Introduces adaptive filtering, which dynamically adjusts relevance filtering threshold.
Includes three agents: predictor (infers answers based on retrieved documents), judge (scores filtering and ordering) and final-predictor (generates final answer based on filtered and ordered documents).
Includes system instruction prompts.

Enhancing LLM Reasoning with Multi-Path Collaborative Reactive and Reflection agents

RR-MP (Reactive and Reflection agents with Multi-Path Reasoning): Improves reasoning capability of LLMs in complex scientific tasks.
Consists of reactive and reflection agents collaborating together to improve accuracy/avoid degeneration-of-thoughts.
Reactive agent receives information from external environment, decomposes it into sub-tasks, then stores them in the database.
Reflective agent analyzes sub-task it executes, offering suggestions or critiques. This feedback loop allows the reactive agent to refine its reasoning and complete the scientific process.

Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding

Embodied VideoAgent: Introduces VLM-based Embodied VideoAgent, which constructs scene memory from both egocentric video and embodied sensory inputs.
Includes persistent object memory, using VLM (depth maps / camera poses).
Automatically updates memory as actions / activities over objects are perceived.

Enabling New HDLs with Agents

HDLAgent: Introduces LLM-based agent to support code generation for underrepresented HDLs (Hardware Description Languages).

VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM

VideoRefer-model: Improves Video-LLMs fine-grained spatial and temporal detail understanding in videos, which facilitates more precise object descriptions, more detailed event analysis, and enhanced predictive reasoning in dynamic environments using masked object features.
VideoRefer-model consists of VideoLLaMA 2.1 as the foundation and a novel unified spatial-temporal object encoder that merges cross-frame token similarities.
Includes VideoRefer-dataset and VideoReferBench-benchmark.

LLM-MedQA: Enhancing Medical Question Answering through Case Studies in Large Language Models

LLM-MedQA: is a multi-agent medical question-answering system that incorporates similar case generation within a multi-agent architecture.
It leverages Llama3.1:70B model, includes question-specific analysis, option analysis, and case generation agents, and uses zero-shot learning.
This framework enhances performance on the MedQA dataset and improves interpretability and reliability in medical question answering.

30th of December 2024

Aviary: training language agents on challenging scientific tasks

Defines Language Decision Process (LDP). LDP is framed as Partially-Observable Markov Decision Process (POMDP), where actions only consist of the ones with the external environment.
Introduces Language agent training framework: Aviary. Includes implementation in 3 scientific domain tasks.
Builds language agents as stochastic computation graphs (SCG).

Distributed Mixture-of-Agents for Edge Inference with Large Language Models

Introduces Distributed Mixture-of-Agents, where multiple LLMs collaborate on various edge devices with decentralized gossip algorithm.
Does not rely in centralized server.

Exploring and Controlling Diversity in LLM-Agent Conversation

APP (Adaptive Prompt Pruning): Controls diversity of the LLM-agent conversation through adjusting lambda-variable.
The lambbda variable adjusts diversity by increasing/decreasing details about: current dialogue/history dialogue/environment/profile/memory.

Plancraft: an evaluation dataset for planning with LLM agents

Introduces Plancraft-benchmark to evaluate VLMs and LLMs planning capabilities and ability to decide in Minecraft craftting GUI, if the model is able to identify task as unsolvable (intentionally).
Identifies, that success rate alone is poor metric in real world tasks.

25th of December 2024

Probabilistic Mission Design in Neuro-Symbolic Systems

ProMis (Probabilistic Mission Design): ProMis helps drones understand where they can and cannot go by combining different types of information, like maps and sensor data, with rules and regulations, such as no-fly zones. Refers with mission landscape to safest and most legal paths.
Combines formal reasoning with probabilistic inference. Uses LLM to convert instructions into ProMis code and ChangeFormer for perception of satellite images.

24th of December 2024

24.12.2024

A Novel Task-Driven Method with Evolvable Interactive Agents Using Event Trees for Enhanced Emergency Decision Support

EvoTaskTree: is a task-driven method with evolvable interactive agents using event trees for emergency decision support.
Framework integrates task executors and task validators powered by large language models (LLMs), leverages insights from event tree analysis, and includes three crucial tasks: initiating event subevent analysis, event tree header event analysis, and decision recommendations.
This approach enhances rapid formulation of emergency decision-making and outperforms existing approaches.

Multi-Agents Based on Large Language Models for Knowledge-based Visual Question Answering

Introduces multi-agent framework consisting of three level of agents collaborating to provide answer: junior, senior and manager. Final answer is determined through voting. Each agent uses planning and tools (knowledge base / LLM knowledge).

VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks

VLABench-benchmark: Evaluates VLA models (Vision-Language Action models). Focuses on tasks requiring mesh & texture understanding, spatial understanding, semantic conversation cognition, common sense & applying real world knowledge, physical laws understanding and long horizon multi-step reasoning.

INVESTORBENCH: A Benchmark for Financial Decision-Making Tasks with LLM-based Agent

Investorbench-benchmark: Evaluates LLMs capability for financial decision making.

Decentralized Intelligence in GameFi: Embodied AI Agents and the Convergence of DeFi and Virtual Ecosystems

Introduces decentralized GameFI-ecosystem with LLM-agents based on Ethereum-blockchain.

Automated Code Review In Practice

Reviews automated code reviews, which led to longer average pull request closer time.

Large Language Model guided Deep Reinforcement Learning for Decision Making in Autonomous Driving

LGDRL (Language Guided Deep Reinforcement Learning): Introduces LLM-based autonomous driving system.
DRL agent learns from LLM-based driving expert-agent (prompted with prompt generator), when the LLM-based driving expert finds necessary to intervene DRL agent actions.

3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding

3DGraphLLM: Improves LLMs understanding of 3D scenes by creating 3D scene graph representation (think graph, where arrows point, if object is right/left/front/behind) from set of point clouds (object input).

Explainable Multi-Modal Data Exploration in Natural Language via LLM Agent

XMODE: Uses LLM to decompose (converts into simpler sub-questions and translates into workflows) user queries into SQL / image analysis.
Includes planning & expert model allocation/execution & self-debugging/decision making/expert models & tools/data lake.

Muse: A Multimodal Conversational Recommendation Dataset with Scenario-Grounded User Profiles

Introduces MUSE-dataset with conversations centered around clothing-domain by using multi-agent framework to generate real world-scenarios (scenario-grounded user profile generator/simulated conversation generator/conversation optimizer).

Defining and Detecting the Defects of the Large Language Model-based Autonomous Agents

Agentable: Introduces static analysis tool to detect defects in code with LLM-based agents and Code Property Graphs (identifies specific code patterns/analyses descriptions). Includes AgentSet-dataset.
Includes pre-processing, defect detection (code abstraction/LLM invocation/semantic enrichment/detect oracles engineeering), and defect reporting-modules.

22.12.2024

Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems

STILL-2 (Slow Thinking with LLMs): A framework to train reasoning models using a three-phase approach: imitation, exploration, and self-improvement.
Initial fine-tuning with distilled long-form thought data, exploration of challenging problems by generating multiple rollouts, iterative refinement of the training dataset.
The framework demonstrates competitive performance compared to industry-level reasoning systems, highlighting the potential of slow-thinking in enhancing complex reasoning capabilities of LLMs.

21st of December 2024

OpenAI o1 System Card

o1 model series: Large-scale reinforcement learning models trained to reason using chain of thought, improving safety and robustness.
Next model in series is OpenAI o1, faster version is OpenAI o1-mini, effective at coding, "thinks before it answers", long chain of thought before responding, refine thinking process, try different strategies, recognize mistakes.
Reasoning allows models to follow safety guidelines, provide helpful answers, resist attempts to bypass safety rules, avoid producing unsafe content, and reach state-of-the-art performance on certain benchmarks.

20th of December 2024

Deliberative Alignment: Reasoning Enables Safer Language Models

Deliberative Alignment: A training approach that "directly teaches" LLMs to explicitly reason through (safety) specifications before producing an answer.
Claims, that reasoning using explicitly specified policies in general, enable scaling alignment. Apart, imrpoves model safety, robustness to jailbreaks, out-of-distribution generalization, and reduces overrefusal rates.
Two core stages: supervised fine-tuning on (prompt, CoT, output) examples, reinforcement learning; uses context distillation; includes a "judge" LLM for reward signal.
Assigns deliberatedly a varied amount of compute to CoT, which improves performance in hard evals.
In first stage, the model is fine tuned with SFT to reason about the (safety) specification within its CoT using examples dataset generated with context distillation with o-type model, where the CoT references the specification.
Second stage trains with high-compute RL the model to think effectively by providing reward signal using a judge LLM with access to the (safety) instructions.

Offline Reinforcement Learning for LLM Multi-Step Reasoning

OREO (Offline REasoning Opyimization): improves multi-step reasoning with offline RL.
Iterative OREO improves consistently with additional training rounds.

19th of December 2024

Disentangling Reasoning Tokens and Boilerplate Tokens For Language Model Fine-tuning

Reasoning-highlighted Finetuning (RFT): Highlights reasoning tokens from boilerplate tokens (format and connecting tokens less critical for the task). Adds larger weight to reasoning tokens.
Introduces SHAD (Shuffle-Aware Discriminator): automatic, adaptive token discrimination.

On Verbalized Confidence Scores for LLMs

Claims, that LLMs can be prompted to provide caliberated confidence scores.

Agent-SafetyBench: Evaluating the Safety of LLM Agents

Agent-SafetyBench-benchmark evaluates LLM-agents safety. Agents tested achieved below 60% pass score.
LLM-agents lack currently robustness and risk awareness.

TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

TheAgentCompany-benchmark: evaluates AI agents capacity to perform long-sequence tasks in real world-like environment as a digital worker: arranging meetings, writing code, screening resumes, communicating (simulates communication between agents), planning and administrative work. Best agent completed 24% of tasks.
Generates tasks in a self-contained environment with internal websites and data similar to used by SW companies.

18th of December 2024

Inference Scaling Flaws: The Limits of LLM Resampling with Imperfect Verifiers

LLM Resampling: explores the limits of using resampling with imperfect verifiers for improving language model accuracy.
The framework shows that imperfect verifiers, like unit tests, lead to false positives, limiting the effectiveness of resampling, and that weaker models generalize worse than stronger models, even with infinite compute budget.
This research highlights the importance of developing accurate verifiers and questions the effectiveness of inference scaling with imperfect verifiers.

17th of December 2024

AI PERSONA: Towards Life-long Personalization of LLMs

AI Persona: proposes, that LLMs should continuously adapt to diverse set of users via personalization.
Introduces a framework for life-long personalization of LLMs through learnable and dynamically updated dictionaries, which are updated based on interaction between user and the LLM.

13th of December 2024

Byte Latent Transformer: Patches Scale Better Than Tokens

Byte Latent Transformer (BLT): is a byte-level LLM architecture that encodes bytes into dynamically sized patches to efficiently allocate compute by varying the amount of compute based on the entropy of the next byte prediction.
BLT segments patches based on next-byte entropy, allocates more compute where data complexity increases, and improves training and inference efficiency.
BLT shows better scaling than tokenization-based models by simultaneously growing both patch and model size.

11th of December 2024

A Multimodal Social Agent

MuSA: is a multimodal LLM-based agent designed for analyzing text-rich social content.
MuSA includes reason-, plan-, optimize-, criticize-, refine- and act-LLM-based units, is model-agnostic, and optimized for social content analysis tasks.
MuSA can automate and improve social content analysis, aiding decision-making processes across various applications.

10th of December 2024

CePO: Empowering Llama with Reasoning using Test-Time Compute

CePO (Cerebras Planning and Optimization): Adds sophisticated reasoning capabilities to the Llama family of models using test-time computation techniques.
CePO enables Llama-3.3 70B to surpass Llama-3.1 405B in accuracy across coding, math, and reasoning tasks.
CePO's step-by-step reasoning, comparison instead of verification, and intuitive output format improve Llama's performance.
CePO achieves interactive performance of approximately 100 tokens/second on Cerebras hardware, comparable to leading models like GPT-4 Turbo and Claude 3.5 Sonnet.

9th of December 2024

AlphaVerus: Bootstrapping Formally Verified Code Generation through Self-Improving Translation and Treefinement

AlphaVerus: generates formally verified code with LLMs and through self-improvement by iteratively translating programs from higher resource language.
Includes three phases: exploration (translates programs from source language to Verus, which is a tool to verify correctness of code written in Rust), treefinement(iteratively fixes errors with Verus-verifier feedback/tree search) and critique (validates and filters unspecified/incorrect translations).
Illustrates the potential of inference-time scaling in verified settings. Suggests formal verification ensures correctness and reliability of the generated code.

Query-Efficient Planning with Language Models

Reviews efficient ways to use LLMs for planning: heuristic and LLM as generative planner.
Introduces two new algorithms: Tree of Interaction (ToI) and Boomerang.

Simulating Human-like Daily Activities with Desire-driven Autonomy

D2A-agent (Desire-driven Autonomous Agent): Introduces autonomous agent proposing and selecting autonomously fulfilling and motivating tasks (based on theory of needs: social interaction/personal fulfillment/self-care).
Introduces desire-based characters.
Includes value system (measures satisfaction per desired dimension) and Desire-driven planner (choses next action of the agent with history and value system).
Proposes using in the future more complex human motivation and planning mechanisms to satisfy intrinsic desires. Includes prompts.

Toward LLM-Agent-Based Modeling of Transportation Systems: A Conceptual Framework

Proposes transportation system modelling with LLM-based agents to replicate human decision making.
LLM-based agents include long-lasting core components: identity (age/income/occupation/cars owned/persona/travel related task/travel restrictions)/memory(short and long term)/LLM core(summarization/planning/nlu/workflow).
Includes iterative process with perception, reflection, planning, plan processing and action.

Beyond pip install: Evaluating LLM Agents for the Automated Installation of Python Projects

Installamatic: Reviews LLM-agents capability to install repository-level python packages with pip by automatically inspecting repository content and install the packages required.
Installamatic-agent is capable of installing packages required in 21/40 repositories tested with 4 main challenges: Identifying install-relevant documentation/writing valid docker files/cost/oracle-problem.

AutoDCWorkflow: LLM-based Data Cleaning Workflow Auto-Generation and Benchmark

AutoDCWorkflow: uses LLM to automatically generate data-cleaning workflows (duplicates/missing values/inconsistent data format) and introduces a benchmark.

StarWhisper Telescope: Agent-Based Observation Assistant System to Approach AI Astrophysicist

SWT (StarWhisper Telescope System): proposes automation of the astronomer observation process with LLMs. Includes observation planning/control/data processing/agent suggestion. Includes customized observation lists and real time analysis.

5th of December 2024

Practical Considerations for Agentic LLM Systems

Reviews LLM agent research from perspective of planning (explicit/implicit, task decomposition, plan adherence), memory (RAG, long-term memory), tools (ysage/dynamic/multiplicity) and control flow (output processing/error handling/stopping/multi-persona/context).
Long term memory may include reflection/consolidation/forgetting/revision and should be independent/consistent/long-term.

Targeting the Core: A Simple and Effective Method to Attack RAG-based Agents via Direct LLM Manipulation

Investigates adversial Adaptive Attack Prompt- and ArtPrompt-attack methods success rates between LLM models.

2nd of December 2024

Mastering Board Games by External and Internal Planning with Language Models

MAV (Multi Action-Value) model: is a transformer model pre-trained on textual game data, functioning as a world model, value function, and policy function for multiple perfect-information board games.
Framework includes external and internal search methods, uses MCTS controller, and distills search procedure directly into the LLM, pre-trained on relevant domain knowledge, minimizes hallucinations, and improves win-rates against state-of-the-art bots.
This framework demonstrates the capacity of LLMs to learn strong value functions and act as a world model across multiple perfect information games.

Inference Scaling Flaws: The Limits of LLM Resampling with Imperfect Verifiers

LLM Resampling: explores the limits of using resampling with imperfect verifiers for improving language model accuracy.
The framework shows that imperfect verifiers, like unit tests, lead to false positives, limiting the effectiveness of resampling, and that weaker models generalize worse than stronger models, even with infinite compute budget.
This research highlights the importance of developing accurate verifiers and questions the effectiveness of inference scaling with imperfect verifiers.

29th of November 2024

Amplifying human performance in combinatorial competitive programming

FunSearch: is a framework that evolves scoring functions for a human-designed solution backbone using a large language model.
Framework uses Gemini 1.5 Flash 002, improves scores on Hash Code, and uses a switching variable for multiple choice points.
This approach demonstrates a successful human-AI synergy in combinatorial optimization problems.

25th of November 2024

Agent-Based Modelling Meets Generative AI in Social Network Simulations

Generative Agent-Based Modelling (GABM): LLM-based agents, which simulate social network users with personality traits/interests and custom agent interactions.
The framework consists of two phases: Characterization (Personality assignment) and Simulation (Reasoning module and Interaction module). Decisions of the agent are stored in vector db for retrieval.

TopV-Nav: Unlocking the Top-View Spatial Reasoning Potential of MLLM for Zero-shot Object Navigation

TopV-Nav: Improves Zero-Shot Object Navigation (ZSON) in unfamiliar environments by reasoning on top-view maps ("birds eye") with MLLM's spatial reasoning capabilities.
Proposes Adaptive Visual Prompt Generation (AVPG), which adaptively constructs top-view map. The framework then uses Dynamic Map Scaling (DMS), which dynamically zooms top-view map at preferred scales for local reasoning. Uses Target-Guided Navigation (TGN) to facilitate human-like exploration.

A Multi-agent Framework for Materials Laws Discovery

Introduces a LLM-based multi agent framework to discover materials laws in materials science, using general framework for solving symbolic regression tasks with LLMs. Uses a depth-first search (DFS) algorithm and a reflection mechanism, implemented through LLMs, to optimize formula generation.

Enhancing Multi-Agent Consensus through Third-Party LLM Integration: Analyzing Uncertainty and Mitigating Hallucinations in Large Language Models

Introduces a multi-agent consensus framework, which integrates confidence weight obtained with third-party LLM, to adjust attention weights of each agent.
Each agent answers individually on the first round, agents self-adjust with feedback on second/third round with third party LLM and finally agents majority vote the final answer.

SAGEval: The frontiers of satisfactory agent-based NLG evaluation for reference-free open-ended text

SAGEval: Introduces an eval for an open-ended, reference-free natural language generation (NLG) by using a critiquing agent to provide feedback on scores generated by LLM evaluators. Focuses on open-ended text like surveys, forms, and lists.
Includes Evaluator- (based on G-Eval) and Sage-agent as meta-evaluator. Evaluation aspects include: accuracy, semantic diversity, coherence, relevancy, audience understandability, audience engagement score, fairness score and sentiment/tone type.

24th of November 2024

PIANIST: Learning Partially Observable World Models with LLMs for Multi-Agent Decision Making

PIANIST (Partition function, Information set space, Action space function, N players, Information realization function, State space, and Transition reward function): A framework for decomposing a world model into seven components, enabling zero-shot LLM generation of a working world model for multi-agent decision-making tasks.
The framework leverages LLMs for generating forward transition functions, action functions, and information partition functions. It uses MCTS for planning in partially observable environments. The approach is evaluated on language and non-language based action-taking games, without domain-specific training data.
PIANIST demonstrates strong performance in multi-agent, partial information settings, showcasing the potential of LLMs for complex decision-making.

21st of November 2024

Natural Language Reinforcement Learning

Introduces: Natural Language Reinforcement Learning (NLRL).
Efficiently implements RL algorithms and principles in language representation space.
Presents NLRL-pipeline, where LLM learns from textual environmental feedback.
Implements empirically in various games.

18th of November 2024

GENERATIVE WORLD EXPLORER

Generative World Explorer (Genex): Introduces and egocentric world exploration, which allows an agent to mentally explore a large-scale 3D world and acquire imagined observations to update its belief inside partially observable decision process.
Generates high-quality and consistent observations in long-horizon tasks.
Consists of generative video model, egocentric views, belief revision, and decision-making (e.g., LLM agent). Includes multi-agent reasoning with imagination, where the framework infers perspectives of other actors in the scene.

OASIS: Open Agents SOCIAL INTERACTION Simulations on One Million Agents

OASIS (Open Agents SOCIAL INTERACTION Simulations on One Million Agents): Introduces generalizable, scalable (millions of agents) social media (twitter/reddit-like) simulator LLM-based agents supporting dynamic social networks, diverse actions and recommendation systems. Includes registration and simulation phases.
OASIS pulls in the registration phase information about user, past posts, self-description and name.
Simulation phase consists of Environment server(sends agent information, posts and user relationships)/RecSys(recommends visible content to user and agents)/Agent module(generates actions updating environment state)/Time engine(updates agents temporal behaviours)/Scalable Inferencer-components(handles large scale inference requests by user).
OASIS replicates social phenomena observed in human-societies, including group polarization and herd effect, which take place in dynamically updating environments with diverse action spaces.
Uses event-driven architecture, where agent communicates with server in dedicated channel, which consists of asynchronous message queue.

TrojanRobot: Backdoor Attacks Against Robotic Manipulation in the Physical World

TrojanRobot: A backdoor attack framework, which targets robotic manipulation in the physical world by embedding a backdoor robotic system's visual perception module.
Uses common objects as triggers.

A Code Knowledge Graph-Enhanced System for LLM-Based Fuzz Driver Generation

CodeGraphGPT: a framework that leverages a code knowledge graph and an LLM-powered intelligent agent to automate fuzz driver generation (sw testing technique by feeding unexpected random data as program inputs to discover bugs).
Includes agents for API combination generation (knowledge into graphs and then embeddings to query), dynamic program repair (past example embeddings), and crash analysis (bugs embeddings).
Constructs knowledge graph of code repos, tailors fuzz drivers and input seeds, resolves compilation errors, and analyzes crash reports.

Moral Persuasion in Large Language Models: Evaluating Susceptibility and Ethical Alignment

Reviews Persuader agents capacity to influence another LLM agent (Base agent) in morally ambiguous decision making scenarios.
LLMs show greater variability between the degree it is possible to persuade them, than their capacity to persuade others.

LLM-IE: A Python Package for Generative Information Extraction with Large Language Models

LLM-IE [LLM-based Information Extraction]: A Python package for building complete information extraction pipelines using large language models (LLMs).
Key features include interactive LLM agent for prompt design, support for named entity recognition, entity attribute extraction, and relation extraction tasks. Benchmarked on i2b2 datasets. Sentence-based prompting algorithm.

16th of November 2024

Developer Challenges on Large Language Models: A Study of Stack Overflow and OpenAI Developer Forum Posts

Analyzes developer challenges with LLMs. Challenges include LLM ecosystem, API usage, LLM training, dataset management, prompt engineering, and error handling. Identifies several unresolved posts, slow response times, especially with complex topics.

FlexFL: Flexible and Effective Fault Localization with Open-Source Large Language Models

FlexFL (Flexible and Effective Fault Localization): LLM-agents (Agent4SR and Agent4LR) based framework for code debugging / fixing with bug-related information (bug reports, test cases).
The framework employs a two-stage approach: space reduction (Agent4SR) to narrow search space and localization refinement (Agent4LR) to localize top k-most suspicious methods.

IntentGPT: Few-shot Intent Discovery with Large Language Models

IntentGPT: introduces a training-free method for Intent discovery using In-context Learning prompt (generated with LLM consisting of known intents/few-shot examples and user query) and LLM generating the intent.
Adds discovered intents back into the prompt. Includes prompts.
IntentGPT outperforms previous methods with extensive domain-specific data for training/fine-tuning. Discovers intents dynamic, open-world scenarios.

15th of November 2024

Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

MPO (Mixed Preference Optimization): is a method that blends supervised fine-tuning loss with preference optimization losses to enhance training effectiveness of multimodal large language models.
MPO uses a novel automated preference data construction pipeline to create MMPR dataset, and explores different Chain-of-Thought approaches with multimodal input to improve reasoning performance.
This approach demonstrates improved performance across multiple benchmarks, particularly in multimodal reasoning tasks.

A dataset of questions on decision-theoretic reasoning in Newcomb-like problems

Decision-theoretic reasoning: Introduces a dataset of natural language questions on Newcomb-like problems.
The dataset includes capability questions (unambiguous answers) and attitude questions (disagreements among decision theorists). It evaluates existing large language models (LLMs) and their attitudes toward evidential decision theory (EDT) and causal decision theory (CDT).
Findings associate higher capability LLMs with more EDT-favorable attitudes across question types. The dataset helps to understand decision-theoretic reasoning capabilities and attitudes of LLMs in AI-AI interactions.

12th of November 2024

RedCode: Risky Code Execution and Generation Benchmark for Code Agents

RedCode-benchmark: Evaluates safety of code agents capacity to generate / execute code and reviews code agents capacity to recognize/manage unsafe code execution.
Includes two steps: RedCode-Gen (evaluates code generated) and RedCode-Exec (evaluates code execution).

World Models: The Safety Perspective

Introduces a Survey about World Models in Embodied AI agents from safety perspective.

BudgetMLAgent: A Cost-Effective LLM Multi-Agent system for Automating Machine Learning Tasks

BudgetLMAgent: Multi agent framework using cascading (sequentially invoking/chaining) free/low cost/frontier LLMs with distinct roles: planner (default/expert)/workers(high-level actions/low-level actions).
Gives LLM-agent an option to call more advanced LLM-model to request help (with maximum retries) in complex planning problems.
Reduces operation cost by 94% compared to single agent with GPT-4 and improved success rate.

LLMPhy: Complex Physical Reasoning Using Large Language Models and World Models

LLMPhy: Combines LLM with Mujoco-physics engine for complex physical reasoning tasks and introduces TraySim-dataset consisting of 100 scenes.
Claims, that LLMs have enough world knowledge with physics engine for better interactive reasoning and LLMs trained with more scientific reasoning tasks tend to demonstrate superior physical reasoning in LLMPhy-pipeline.

From General to Specific: Utilizing General Hallucation to Automatically Measure the Role Relationship Fidelity for Specific Role-Play Agents

Introduces an automatic evaluation framework for Role-Playing Agents (RPAs) that generates claims from a knowledge graph and has characters discuss them with the main character.
Evaluates the believability of interactions by leveraging the inherent hallucination properties of RPAs. Defines relationship hallucination metric.

Mitigating Bias in Queer Representation within Large Language Models: A Collaborative Agent Approach

Focuses on inclusive / gender neutrality in LLM-agents with: assistant/language analysis/optimizer-agents.

11th of November 2024

Mr.Steve: Instruction-Following Agents in Minecraft with What-Where-When Memory

Mr.Steve (Memory Recall Steve-1): Improves long-horizon task solving by incorporating solver module and Place Event Memory (PEM), which recalls what-, where- and when-information from episodes.
Includes memory-augmented task solving and exploration strategy.

Using Generative AI and Multi-Agents to Provide Automatic Feedback

Autofeedback: Introduces multi agent LLM-based framework for student feedback, which includes: feedback generation- and feedback validation/modifier. Reduces over-praising and over-inference.
Includes prompts of both agents.

Script-Strategy Aligned Generation: Aligning LLMs with Expert-Crafted Dialogue Scripts and Therapeutic Strategies for Psychotherapy

SSAG (Script-Strategy Aligned Generation): Aligns LLMs with key therapeutic strategies in Motivational Interviewing. Claims, that LLMs aligned with expert prompting outperform rule-based chatbots and pure LLMs.

Tooling or Not Tooling? The Impact of Tools on Language Agents for Chemistry Problem Solving

ChemAgent-framework: Introduces agent for chemistry tasks, which includes reasoning/grounding and tool use.

A Multi-Agent Approach for REST API Testing with Semantic Graphs and LLM-Driven Inputs

AutoRestTest: Introduces MARL-framework with Semantic Property Dependency Graphs (SDG) and LLMs for REST API exploration.
Includes dependency/operation/parameter/value-agents.

10th of November 2024

Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents

WebDreamer: LLM-based web-agent framework by using LLM to predict outcomes of candidate actions in web environment in order to pick optimal action.
The LLM simulates as world-model actions using prompt like: "what would happen if I click this button" and then evaluates the imagined outcomes.
Model-based planning enables safe simulation of possible actions before taking them (some web environments do not allow going back to previous step, which complicates tree-based search by investigating candidate next steps).
Includes system prompts of the world model and reward model.

9th of November 2024

IOPO: Empowering LLMs with Complex Instruction Following via Input-Output Preference Optimization

IOPO (Input-Output Preference Optimization): Aligns/fine-tunes LLMs based on both the input data (new approach) and the output data (traditional approach).
Explores instruction preference space.

From References to Insights: Collaborative Knowledge Minigraph Agents for Automating Scholarly Literature Review

Introduces CKMAs (Collaborative Knowledge Minigraph Agents), which automate literature reviews. Building knowledge minigraphs by organizing information and relationships from research papers.
Includes KMCA (Knowledge Minigraph Construction Agent) and MPSA (Multiple Path Summarization Agent), which both prompts are included.

8th of November 2024

The influence of persona and conversational task on social interactions with a LLM-controlled embodied conversational agent

Reviews effect of the LLM-based agent persona traits to user experience.
Manipulation of the personality traits strongly influences social interaction and user experience.

Game-theoretic LLM: Agent Workflow for Negotiation Games

Studies with game-theoretic analysis the rationality of LLM-based (with various LLMs) negotiation workflow in various complete-information games and in a incomplete-information game.

7th of November 2024

Interactive Dialogue Agents via Reinforcement Learning on Hindsight Regenerations

Simulates interactive dialogue by utilizing hindsight to regenerate optimal task-relevant dialogue data based on initial dialogue data.
Includes hindsight controller, which takes dialogue input and prefix, then outputs a more desirable action.

GUI Agents with Foundation Models: A Comprehensive Survey

Introduces Survey about GUI Agents.
Divides LLM-based GUI agents into: GUI Perceiver, Task Planner, Decision Maker, Excecutor and Memory Planner (internal memory: actions/screenshots, external memory: manual construct/auto exploration and self-evolution: transition diagram/documents).
Identifies challenges related to inference efficiency, self-evolution and real world vs. benchmark gap.

CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models

CodeTree: Introduces multi-agent, LLM-based code generation, which improves multi-stage planning/generation/debugging by using tree search.
Includes Thinker/Solver/Debugger/Critic-agents.
Critic-agents scores/expands/terminates nodes, which is based on feedback generated by the LLM and the execution feedback on test cases.

CaPo: Cooperative Plan Optimization for Efficient Embodied Multi-Agent Cooperation

CaPo (Cooperative Plan Optimization): Includes meta-plan generation and progress-adaptive meta-plan & execution
Meta plan generation consists of analyzing, discuss, create the meta-plan decomposed into subtasks by the various agents.
Progress-Adaptive Meta-Plan & Execution: agents execute task in the meta plan and dynamically adjust it based on latest progress in multiturn dialogue.

6th of November 2024

AdaSociety: An Adaptive Environment with Social Structures for Multi-Agent Decision-Making

AdaSociety: multi-agent environment to simulate decision making with physical(resources, events, agents skill inventories)/social(establish, alter, form groups, hierarchies)-components.
Introduces social states: multilayer directed graph to describe adaptive / dynamic connections, which drive long-term coalition formation / hierarchy.
Dynamically connects with other agents to establish autonomously non-deterministic connection with the other agent.
State and action space dynamically advance.
Identifies research challenges in collective reasoning, social cognition, adaptation, communication and emergence of new social skills and norms.

MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue

MRJ-Agent: Introduces multi-round dialogue jailbreaking agent, which decomposes harmful queries into multiple sub-queries.
This widely generalizable jailbreaking-technnique achieves SOTA-level success rates.

From Novice to Expert: LLM Agent Policy Optimization via Step-wise Reinforcement Learning

StepAgent: Optimizes LLM-agents wit step-wise RL with inspection- and reflection-steps.

5th of November 2024

SAUCE: Synchronous and Asynchronous User-Customizable Environment for Multi-Agent LLM Interaction

SAUCE (Synchronous and Asynchronous User-Customizable Environment): Introduces LLM-based multi agent framework with asynchronous communication feature, where models decide when to speak and what to say.
Includes experiment(configures discussio, participants, host and end criteria)/session room(manages ongoing experiment and exit criteria)/host (directs interaction)/person(human or LLM).
Implements LLM-agent personas (and human participant) as class-objects in Python.

AI Metropolis: Scaling Large Language Model-based Multi-Agent Simulation with Out-of-order Execution

AI Metropolis: introduces multi agent LLM-based framework, which enables out-of-order execution (parallel processing) of agents by tracking dynamically real dependencies between agents.
LLM agents often wait unnecessarily each step to complete, before proceeding, even when it is a false dependency.
LLM agents can be: blocked (another blocks proceeding), coupled (proceed together), clustered (group needs to synchronize), worker (independent process handling cluster) or controller (main process communicating with workers).
The related work-section offers comphrensive view on the different scheduling approaches to with agentic AI.

1st of November 2024

DARD: A Multi-Agent Approach for Task-Oriented Dialog Systems

DARD (Domain Assigned Response Generation): LLM-based multi agent framework in multi domain & task oriented dialogue.
Introduces dialogue manager/hotel/attraction/restaurant/train/taxi-agents, external db and dialogue state tracker.
Uses both fine-tuned LLMs and Sonnet 3.0. Reviews differences in performance.

31st of October 2024

Navigating the Unknown: A Chat-Based Collaborative Interface for Personalized Exploratory Tasks

CARE (Collaborative Assistant for Personalised Exploration): Introduces personalized LLM-based multi agent framework, where user interface includes chat/solution/needs-panels.
Focuses on improving multi-turn contextual understanding, personalization, exploration and reduce cognitive load.
Employs inquiry/ranking/needs discovery/solution crafting/milestone-agents.

30th of October 2024

EMOS: Embodiment-aware Heterogeneous Multi-robot Operating System with LLM Agents

EMOS: multi-agent framework for multi-robot system with embodiment & spatial-aware reasoning/navigation/manipulation/object rearrangement.
Includes hierarchical task planning, assignment and actioning. Evaluates success rate, sub-goal success rate, token usage and simulation step.
Uses "Robot Resume": a self-prompting, instead of "human roleplay" by interpreting the robot URDF files to call robot kinematics tools to generate descriptions of its physical abilities for guiding its planning/action execution.

Aligning Audio-Visual Joint Representations with an Agentic Workflow

AVAgent: Adapts audio signal with visual data using LLM-based agent framework, which plans edits of the audio signals and reflection with VLM to evaluate the modifications and uses tool to convert video and audio modality to text.

29th of October 2024

BENCHAGENTS: Automated Benchmark Creation with Agent Interaction

BENCHAGENTS: Introduces LLM-agent framework automating benchmark creation, which includes four components: planning/generation/data verification/evaluation-agents.
Dynamic benchmarks help to identify common failure modes/model differences, while LLM models improve quickly.
Planning includes: prompt/task-specific parameters/constraints (positive/negative/positional/sequencing/conditional/iterative).

28th of October 2024

Asynchronous Tool Usage for Real-Time Agents

Asynchronous AI agents: Introduces asynchronous, parallel thought processing and real-time tool use based on event-driven finite state-machines.
Time stamp is in the messages to enable clock awareness, which enables time-constrained tasks.
Event states include idle/listening/generating/emitting.

25th of October 2024

Cooperative Strategic Planning Enhances Reasoning Capabilities in Large Language Models

CoPlanner (Cooperative Planner): Improves reasoning capabilities of LLM by separating reasoning steps. Each agent gets assigned unique reasoning step.
Includes planning agent and reasoning agent.
Pre-defines 10 human cognition-based meta-strategies. Includes 5 logical reasoning methods: deduction/induction/abduction/analogy/contradiction and four problem solving methods: decomposition/enumeration/elimination/reflection and meta-strategy: finish to indicate end of reasoning.

VisionCoder: Empowering Multi-Agent Auto-Programming for Image Processing with Hybrid LLMs

VisionCoder: Multi agent framework with team leader, module leader, function coordinator and development group
Identifies excellent two aspects for the Agent-definitions: structural (explains the agents place in the overall structure/scope/responsibilities) and functional (operational steps/reasoning path expected from the agent and the output format requirements).
Includes bi-directional workflow: hierarchical tasks are divided into smaller units (forward task flow) and then restored back (backward task flow) from smaller pieces to larger units. Pair programming-concept includes coder and tester: coder produces code, tester reviews it and then the roles are reversed. The pair programming step is repeated three rounds with code execution with incorporation of the error messages to get final working code.

Designing LLM-Agents with Personalities: A Psychometric Approach

Reviews creation of psychometrically sound LLM-based agents based on the theory about big 5 personality traits (openess/conscientiousness/extraversion/agreeabless/neuroticism).

FISHNET: Financial Intelligence from Sub-querying, Harmonizing, Neural-Conditioning, Expert Swarms, and Task Planning

FISHNET: Multi agent-framework for insights from SEC regulatory forms. Includes sub-querying (converts query into sub-queries)-, task planning- , experts (Swarm Intelligence)-, harmonizer(routes to specific expert based on embedding match vs. agent persona/tables description)-agents and long term memory.
Expert agents consist of: n-port-, n-mfp-, adv-, n-cen-, n-csrv- and 13f-agents, which are experts in different forms related to SEC regulations.

AGENT-CQ: Automatic Generation and Evaluation of Clarifying Questions for Conversational Search with LLMs

Agent-CQ: Introduces a framework for generating and evaluating conversational search questions and answers. Includes generation (question generation / filtering / answer generation)- and evaluation (multiple LLM-judge calls to review generated questions/answers)-stages.

EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data

EDGE: Introduces framework to generate training data for GUI-tasks in the internet. Introduces element- and action-grounding.

Investigating the Role of Prompting and External Tools in Hallucination Rates of Large Language Models

Investigates prompting techniques and finds simpler is often better and best prompts are problem specific.
In math problems self-consistency with majority vote works well, Chat protect helps to manage amount of hallucinated answers and Self-Verification worked well with MMLU.

AgentSense: Benchmarking Social Intelligence of Language Agents through Interactive Scenarios

AgentSense-benchmark: introduces a multiturn evaluation of LLM-agents regards social intelligence. Focuses on goal competition and implicit reasoning.
Character-info includes: attributes/relationships/rules of replacement. Scenarios include: background/characters/social goals/private info.
Includes a sample agent-prompt.

24th of October 2024

Unbounded: A Generative Infinite Game of Character Life Simulation

Unbounded: Introduces a conceptual and technical implementation of concept called "generative infinite game".
Addresses semantically alignedconsistent environment/characters.
Trained an LLM based game engine game engine (generating coherent and real-time game mechanisms, narratives and contextual character responses) and "Regional IP-Adapter", which creates visually consistent characters/environments between multiple images while applying creativity. Regional IP-Adapter tracks changes overtime, so if your character gets injured in forest, the injury remains in the following images and the character still wears same clothes, while giving creative touches to the visuals.

AR: Operating System Control via State-Aware Reasoning and Re-Planning

OSCAR: Introduces GUI-agent with unified control interfaces / GUI grounding (dual grounding) / exploration-based simulation and re-planning (task driven replanning of only specific tasks).
Works both in smartphones and desktop OS. Reviews GUI agents. Includes system prompts.
Agent states include: init/observe/plan/execute/error/verify/fail/success/reset. Includes context memory.

Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs

Skywork-Reward: introduces methods to enhance reward modeling for LLMs, focusing on data-centric techniques.
It proposes data selection and filtering strategies for high-quality preference datasets, resulting in Skywork-Reward data collection, and develops Skywork-Reward model series including Skywork-Reward-Gemma-27B and Skywork-Reward-Llama-3.1-8B.
This work enhances performance of top-ranked models on RewardBench, highlighting practical impact in preference learning applications.

PDL: A Declarative Prompt Programming Language

PDL (Prompt Declarative Language): Introduces declarative and data-oriented language based on YAML to construct LLN prompt programs. Every PDL program is a valid YAML-document with PDL-schema.

From a Tiny Slip to a Giant Leap: An LLM-Based Simulation for Fake News Evolution

FUSE (Fake News evlUtion Simulation framEwork): Reviews the way true news convert into fake news with LLMs. Includes LLM-based agents: spreaders/commentators/verifiers/bystanders.
The simulation evolves with a module called News Evolution Simulator.
Includes content deviation metrics.

PRACT: Optimizing Principled Reasoning and Acting of LLM Agent

PRAct (Principled Reasoning and Acting)-framework: improves action understanding of agents by including action principles. Introduces RPO (Reflective Principle Optimization).

23rd of October 2024

ASYNCHRONOUS RLHF: FASTER AND MORE EFFICIENT OFF-POLICY RL FOR LANGUAGE MODELS

Asynchronous RLHF (Reinforcement Learning from Human Feedback): A framework that separates generation and learning in RLHF, enabling asynchronous generation of new samples while simultaneously training on old samples.
Online but off-policy, faster training, more compute-optimal scaling, training LLAMA 3.1 8B on instruction-following task 40% faster while matching final performance.
This framework addresses the computational inefficiency of the dominant paradigm for RL finetuning of LLMs by separating generation and learning, leading to faster training and more efficient use of resources.

GraphTeam: Facilitating Large Language Model-based Graph Analysis via Multi-Agent Collaboration

GraphTeam: LLM-based collaborative multi agent and graph-based system using three modules: input-output normalization/external knowledge retrieval/problem solving.
Includes question(reformats question)/search/coding/reasoning/answer-agents.
Constructs to knowledge graphs: documentation and experience.

Real-World Robot Applications of Foundation Models: A Review

This paper provides an overview of the practical application of foundation models in real-world robotics.
The review emphasizes the replacement of specific components within existing robot systems, input-output relationships, perception, motion planning, and control.
The paper concludes with a discussion of future challenges and implications for practical robot applications.

MiniFed : Integrating LLM-based Agentic-Workflow for Simulating FOMC Meeting

MiniFed: Simulates real world Federal Reserve FOMC-meetings using LLM-agent based multi-agent framework.
Consists of initialization/data collection/simulation/decision making/evaluation.

Guide for Defense (G4D): Dynamic Guidance for Robust and Balanced Defense in Large Language Models

G4D (Guide for Defense): LLM-based multi agent with external knowledge to discover user intent as safe with a defense framework against jailbreaks.
Includes intention detector (intention extraction, key entities identification and information retrieval)/question paraphraser/safety analyzer-components.

An Intelligent Agentic System for Complex Image Restoration Problems

AgenticIR: VLM/LLM-agent based image restoration using perception/scheduling/reflection/rescheduling/execution-agents.
Includes Rollback-mechanism, where agent returns previous working stage, when an issue.

ReflecTool: Towards Reflection-Aware Tool-Augmented Clinical Agents

ReflecTool: Introduces clinical agent, using progressively built long-term memory to assist domain-specific tool selection and improve tool usage. Includes optimization and inference stages.

Navigate Complex Physical Worlds via Geometrically Constrained LLM

Reviews LLMs-capability to reconstruct physical world from textual knowledge.
Uses LLM-based multi agent framework with scenery designer/object designer/object manufacturer/arranger-agents and geometric constraint solver and generic algorithm.

21st of October 2024

Long Term Memory: The Foundation of AI Self-Evolution

Reviews and defines AI Self-Evolution-capability and Long Term Memory (LTM).
Identifies benefits in Personalized Models.
Identifies limitations in prompt-based memory mechanisms.

Improving Parallel Program Performance Through DSL-Driven Code Generation with LLM Optimizers

Designs Domain Specific Language (DSL) in mapper (maps computations to processors like GPUs, CPUs, etc.) generation related to assignment of compute / memory.
The DSL helps to manage high-level inference decisions without interacting with the low-level C++ code APIs.

20th of October 2024

Redefining Proactivity for Information Seeking Dialogue

Introduces Information Seeking Dialogue (ISD) agents with proactiveness to include information relevant to the user query.
Introduces new prompting strategies: 3-step CoT and 3-in-1 CoT.

18th of October 2024

Teaching Models to Balance Resisting and Accepting Persuasion

PBT (Persuasion Balanced Training): Uses multi-agent recursive dialogue trees to train models with preference optimization to accept persuasion in acceptable situations. PBT-trained model outperform in multi-agent debates.
Agents argue based on logical reasoning/emotional appeal/established credibility.
Refers to research by Woolley et al. (2010), where group intelligence is argued to be driven by diversity/turn-taking/social sensitive, rather than individual intelligence.

18th of October 2024

Make LLMs better zero-shot reasoners: Structure-orientated autonomous reasoning

SARA (Structure-oriented Autonomous Reasoning Agents): Introduces multi agent LLM-based reasoning framework with structure-oriented analysis by refinement and RAG.
Outperforms in some cases few-shot learning.
Includes reason (structured oriented analysis)-, retrieval-, refinement-agents and shared memory. Includes prompts used.

AI can help humans find common ground in democratic deliberation

Habermas Machine: AI mediation technique promoting fair/inclusive debate.
LLM-agent opinions/critiques refine group statement to maximize group approval.
Aims to improve collective decision making in political discussion/conflict resolution.

17th of October 2024

Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation

Proposes World-Model-Augmented (WMA) web agent by simulating planned actions to obtain outcome before using them (metacognitive monitoring) in order to avoid performing erroneous moves. Reviews LLMs lack of capability to avoid performing errors, which humans can easily avoid by posing world model.
Introduces "Transition-focused observation abstraction": world model generates free-form important state differences before / after. Agent simulates outcomes of each possible action with world model and reward model asesses each one.
Includes prompts.

Chain of Ideas: Revolutionizing Research in Novel Idea Development with LLM Agents

CoI (Chain-of-Ideas): CoI-agent generates research ideas comparable to human-level by organizing literature in a chain structure to avoid logical inconsistencies in ideation.
Improves LLMs research ideation capabilities. Consists of three steps: CoI-construction (identifies current trends), Idea generation (consolidates ideas) and Experience design (final experiment design).
CoI-prompts include: converting topic in search query for literature retrieval/evaluation of paper relevance to the topic/extract research paper ideas, experiments, entities and reference/summarising trends of the this CoI.
Idea generation prompts include: predict future trends / generate ideas / novelty check of ideas.
Experiment design prompts include: generate experiment design / review experiment design / obtain queries to edit experiment design / refine experiment design.

AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents

AgentOccam: Refines LLM-agent observation/action space to improve its performance in web tasks with three methods. Sets SOTA in WebArena.
Introduces planning actions: branching and pruning. Minimizes trivial interaction space. Removes unnecessary web content.
Agent prompt includes general instructions (task description/output specification/action specification) and Online Task Information.
Simplifies web content/selectively replays web elements/selectively replays past pages.

AdaSwitch: Adaptive Switching between Small and Large Agents for Effective Cloud-Local Collaborative Learning

AdaSwitch: Uses local agents for basic and cloud agent for complex tasks.
Includes self-practicing, collaborative examination and reflective learning steps.

Harnessing Webpage UIs for Text-Rich Visual Understanding

Introduces MultiUI-dataset of 1 million websites for web / UI agents.

Rapid and Automated Alloy Design with Graph Neural Network-Powered LLM-Driven Multi-Agent Systems

Multi-agent system including LLMs, AI agents (multi modal LLM-agents) and GNNs to discover automatically new metallic alloys.
The LLM-agent roles include: planner-, executor-, coder-, reviewer- and multi-modal-agents.

A Comparative Study on Reasoning Patterns of OpenAI's o1 Model

Reviews o1-model against other test-time compute methods like BoN/Self-Refin/Agent workflow.
Identifies 6 reasoning patterns with o1-model: systematic analysis/method reuse/divide & conquer / self-refinement / context identification / emphasizing constraints.

MeNTi: Bridging Medical Calculator and LLM Agent with Nested Tool Calling

MeNTI-framework chooses appropriate meta-tool, fills data according to the meta-tool documentation and nested-calling verifies task completion.

Integrating Large Language Models and Reinforcement Learning for Non-Linear Reasoning

RL guides LLM's exploration. The architecture includes: LLM-module/validation module/reasoning tree/RL agent. Applied in code generation.
LLM module generates n-candidates, validation module reviews characteristics of each candidate, the features of each review are added to reasoning tree and finally RL explores this reasoning tree to decide the node to explore next.

Metacognitive Monitoring: A Human Ability Beyond Generative Artificial Intelligence

Reviews metacognition monitoring abilities of LLMs.

RescueADI: Adaptive Disaster Interpretation in Remote Sensing Images with Autonomous Agents

ADI (Adaptive Disaster Interpretation)-framework: introduces an multimodal LLM-agents interpreting disaster scenarios using tools. Introduces RescueADI-dataset.
ADI-framework includes perception/recognition/planning/tools-modules.

16th of October 2024

Revealing the Barriers of Language Agents in Planning

Reviews planning capabilities of LLMs and identifies current models like o1 only achieve 15.6% performance in real-world tasks.
Identifies two core issues: interpretation of constraints/loss of focus in long-horizon planning tasks.
Episodic and parametric memory help, but do not resolve the lack of planning capabilities.

Graph-constrained Reasoning: Faithful Reasoning on Knowledge Graphs with Large Language Models

GCR (Graph-Constrained Reasoning): Integrates Knowledge Graph (KG) into LLM decoding to reduce hallucinations in reasoning.
Uses KG-Trie method.

Evaluating Software Development Agents: Patch Patterns, Code Quality, and Issue Complexity in Real-World GitHub Scenarios

Reviews LLM-agents ability to patch code, suggesting smaller sub-tasks to patch code to be easier for LLM-agents.

JudgeBench: A Benchmark for Evaluating LLM-based Judges

JudgeBench-benchmark: Evaluates LLM-judge agents, which focuses on instruction following/factuality/logic/style.

SAC-GLAM: Improving Online RL for LLM agents with Soft Actor-Critic and Hindsight Relabeling

SAC-GLAM: Proposes a more autonomous LLM-agents based on adaptation of SAC (Soft Actor-Critic) and HER (Hindsight Experience Replay) for LLM-agents in multi-goal RL environment to perform sequential decision making tasks.
Reviews LLM-agents moving from external objective driven towards more autotelic ("self" + "goals") with an intrinsic purpose rather than extrinsic.

Robust RL with LLM-Driven Data Synthesis and Policy Adaptation for Autonomous Driving

RAPID: Improves RL performance in autonomous driving with LLM-reasoning. Uses LLM-agent data for offline RL distillation and then adapts online RL-agent with LLM-data.

Enhancing LLM Trading Performance with Fact-Subjectivity Aware Reasoning

FS-Reasoning Agent: introduces LLM-based multi-agent trading framework by splitting reasoning processes between factual and subjective reasoning.
Includes Statistics/Fact reasoning/Fact/Subjectivity/Subjectivity reasoning/Trading/Reflection agents.
Concludes, that superiority of the LLM model is not sufficient to guarantee it outperforming multi-step reasoning.

MedAide: Towards an Omni Medical Aide via Specialized LLM-based Multi-Agent Collaboration

MedAide: Introduces LLM-based multi-agent framework, which includes query input/query rewriting/intent recognition/agent collaboration.
Activates specialised agents (own prompt template) dynamically by recognizing intent.
Includes contextual encoder.

Aegis:An Advanced LLM-Based Multi-Agent for Intelligent Functional Safety Engineering

Aegis: LLM-based multi-agent framework for FSRs (Functional Safety Requirements) and HARA (Hazard Analysis and Risk Assessment).

15th of October 2024

G-Designer: Architecting Multi-agent Communication Topologies via Graph Neural Networks

G-Designer: introduces designer of multi-agent LLM-graphs based on MACP. Includes Materials/Construct/Design/Optimize-steps.
Proposes a LLM-agent communication protocol for multi-agent systems called MACP. MACP includes performance/adaptability/robustness.

AGENTiGraph: An Interactive Knowledge Graph Platform for LLM-based Chatbots Utilizing Private Data

AGENTiGraph (Adaptive Generative ENgine for Task-based Interaction and Graphical Representation): LLM-based multi-agent knowledge management framework with knowledge graphs.
Includes knowledge extraction/integration/real-time visualization.
Dynamically interprets user intent/manage tasks/integrate new knowledge. Classifies tasks. Extracts key concepts. Constructs knowledge graphs. Includes prompts used.

Revisiting Benchmark and Assessment: An Agent-based Exploratory Dynamic Evaluation Framework for LLMs

TestAgent-framework: quantitative/qualitative benchmark using agent-based evaluation with RL, multi-turn interaction from knowledge base/topics of interests.

14th of October 2024

AFlow: Automating Agentic Workflow Generation

AFlow: Optimises LLM-agent workflow with MCTS.
Includes search space (node, operators, code represented edges), search via AFliw and Search result (math, Q&A and code generation workflows.)

10th of October 2024

Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning

PAVs (Process Advantage Verifiers): is a framework that trains verifiers to predict progress in multi-step reasoning by measuring the change in likelihood of a correct response under a prover policy.
PAVs improve exploration during test-time search and online RL, using complementary prover policies, and are more compute-efficient than ORMs.
This framework enables more efficient and accurate reasoning in large language models by providing a better way to measure progress in multi-step reasoning.

Multi-Agent Collaborative Data Selection for Efficient LLM Pretraining

Introduces LLM-based multi-agent system for efficient LLM pretraining data selection. LLM converges faster in the pretraining and the method improves LLM output quality.
The Data console integrates data inisghts dynamically from the different agents during the training process.
Agent console include quality/domain/topic-agents. Includes as well memory.

Optima: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System

Optima (OPTImising effectiveness and efficiency for LLM-based Multi-Agent systems): Introduces framework to train LLM-based multi-agent system (MAS).
Includes 4 iterative steps: Generate/Rank/Select/Train.
Investigates scaling laws of inference compute.
Optima helps to make LLMs highly efficient conversationalists.

DelTA: An Online Document-Level Translation Agent Based on Multi-Level Memory

DelTA (Document-level Translation Agent): Introduces translation LLM-agent using multi-layer memory components to improve translation consistency/quality.
Memory components include: Proper noun memory(to apply correct terminology)/Bilingual summary/long-term/short-term-memory units.

Mars: Situated Inductive Reasoning in an Open-World Environment

Mars: Introduces framework for Situated Inductive Reasoning-benchmark and a framework with LLM-agents called: IfR (Induction from Reflection).
The paper identifies two critical components for inductive reasoning: situatedness (situational context) and abstractiveness (abstract conclusions).
IfR-framework includes task proposer/planner/controller/reflection-steps, rule library (when this, do that) and skill library. The LLM-based reflection-step induces new rules, which actual LLMs struggle currentyly.

Benchmarking Agentic Workflow Generation

Introduces WorFEBench-benchmark for unified workflow generation and WorFEval evaluation protocol of workflows for LLM-agents.

9th of October 2024

AgentBank: Towards Generalized LLM Agents via Fine-Tuning on 50000+ Interaction Trajectories

Samoyed: Introduces LLM-models fine-tuned with AgentBank-dataset for general agent tasks.
AgentBank-dataset includes dimensions: reasoning/math/programming/web/embodied AI.

Smart Audit System Empowered by LLM

Introduces Smart Audit System with LLMs, which include dynamic risk assessment model/manufacturing compliance copilot/Commonality analysis agent. Developed by Apple researchers.
Dynamic risk assessment model adjusts audit: focus/sample size/critical items/resource allocation.
Manufacturing compliance copilot self-adjusts its the knowledge base with new information.
Commonality analysis agent manages an autonomous agent conducting real-time analysis to custom requests, in order to drive supplier improvements. Includes planning/memory/tools/selecting and usage of tools/generating responses.

Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making

Introduces Embodied Agent Interface-benchmark for embodied decision making LLM-agents.
Reviews four critical capabilities: Goal interpretation, Subgoal decomposition, Action sequencing and Transition modelling.

I Want to Break Free! Anti-Social Behavior and Persuasion Ability of LLMs in Multi-Agent Settings with Social Hierarchy

zAImbardo-framework: Introduces LLM-agent simulation between prisoner/guard-agents using prompts, which are either shared or private.
Shared prompts: communication rules/environment description/research oversight/risks. Private prompts: Starting prompt/personality/goals.

Towards Realistic UAV Vision-Language Navigation: Platform, Benchmark, and Methodology

Introduces UAV navigation agent using MLLM. Includes three levels of assistants: constant/difficult situations/hazard situations.

MOOSE-Chem: Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses

Moose-Chem: multi-agent framework to discover novel chemistry research hypothesises from given information.

Seeker: Enhancing Exception Handling in Code with LLM-based Multi-Agent Approach

Seeker: introduces LLM-based multi-agent framework for exception handling with planner/detector/predator/ranker/handler-agents.

ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents

ST-WebAgentBench-benchmark: Evaluates safety and trustworthy of web agents against performing undesired operations in business/user applications.

Do great minds think alike? Investigating Human-AI Complementarity in Question Answering with CAIMIRA

CAIMIRA (Content-Aware, Identifiable, Multidimensional, Item Response Analysis)-framework: Reviews differences between humans and SOTA-level LLMs in QA-tasks in reasoning and textual understanding.

8th of October 2024

AgentSquare: Automatic LLM Agent Search in Modular Design Space

AgentSquare: Introduces modular LLM-agent framework using module evolution, recombination and performance predictor(skip unpromising agent designs). - The framework optimizes agent designs with Planning/Reasoning/Tool use/Memory-modules.
Introduces the research concept of MoLAS (Modularized LLM Agent Search): the automatic optimization of LLM-agent designs from succesfull designs.
Includes search-, program-level search- and performance predictor-meta prompts.

Name		Name	Last commit message	Last commit date
Latest commit History 1,179 Commits
Autonomous_Agents_Resources.md		Autonomous_Agents_Resources.md
Autonomous_agent_logo.png		Autonomous_agent_logo.png
LICENSE		LICENSE
README.md		README.md

License

tmgthb/Autonomous-Agents

Folders and files

Latest commit

History

Repository files navigation

Autonomous Agents

Research papers

4th March 2025

3rd March 2025

2nd March 2025

1st March 2025

28th February 2025

27th February 2025

26th February 2025

25th February 2025

24th February 2025

23rd February 2025

22nd February 2025

21st February 2025

20th February 2025

19th February 2025

18th February 2025

17th February 2025

16th February 2025

15th February 2025

14th February 2025

11th February 2025

10th February 2025

7th February 2025

6th February 2025

5th February 2025

4th February 2025

3rd February 2025

2nd February 2025

1st February 2025

31st January 2025

30th January 2025

29th January 2025

28th January 2025

27th of January 2025

26th of January 2025

25th January 2025

24th January 2025

23rd of January 2025

21st of January 2025

20th of January 2025

19th of January 2025

18th of January 2025

17th of January 2025

16th of January 2025

15th of January 2025

14th of January 2025

14th January 2025

13th of January 2025

12th of January 2025

11th of January 2025

10th of January 2025

9th of January 2024

9th of January 2025

8th of January 2025

7th of January 2025

6th of January 2025

5th of January 2025

4th of January 2025

3rd of January 2025

2nd of January 2025

1st of January 2025

31st of December 2024

30th of December 2024

25th of December 2024

24th of December 2024

24.12.2024

22.12.2024

21st of December 2024

20th of December 2024

19th of December 2024

18th of December 2024

17th of December 2024

13th of December 2024

Packages