10 items · 9 from notable labs/scholars · via sonar-pro
Evaluating LLM Agents for Forward-Looking AI Research Judgment
Sam Toyer, Rylan Schaeffer, Siddharth Reddy, Noah Goodman et al. — Anthropic · Stanford University · University of California, Berkeley
★ Notable lab / scholar · Reasoning, tool use, and multi-agent systems · arXiv · 2026-06-02
This paper evaluates several LLM-based research agents on tasks that require forward-looking judgment about AI research directions and potential impact.[4] The authors compare native LLM baselines, hybrid RAG setups, and three agentic adaptations across four backbone models, analyzing how explicit evidence organization and tool-assisted reasoning affect performance.[4] They introduce task formulations that mirror real research decision-making, such as choosing promising research directions and forecasting impact from incomplete information.[4]
Why it matters. Forward-looking research judgment is a core capability for using LLM agents as collaborators in scientific and engineering workflows, beyond static Q&A. The work provides an early systematic evaluation of agentic setups for research decision tasks, highlighting where structured evidence handling yields concrete gains and where current agents still fail.
★ Notable lab / scholar · LLM evaluation and AI-assisted writing · arXiv · 2026-06-02
The authors build AI-Paper-Review, a web-UI-integrated tool that generates structured reviews of draft research papers and aligns its comments with human reviewer feedback.[8] They run a case study on 20 computer architecture papers of varying topic and submission status, comparing AI-identified issues to those raised by human reviewers.[8] Results show that the AI review covers a substantial fraction of human-raised issues and also surfaces additional problems that humans did not mention, though its comments are limited in depth and nuance.[8]
Why it matters. This work offers a concrete, domain-specific evaluation of LLMs as writing and reviewing assistants in technical research, moving beyond anecdotal use. It is particularly relevant for ML researchers considering integrating AI-based review into their drafting workflows or conference toolchains.
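The comparison described above (how many human-raised issues the AI review also finds, and how many issues only the AI raises) can be sketched as simple set arithmetic. This is a minimal illustration, not the paper's method: it assumes issues have already been matched to shared labels, and the function name `review_overlap` and all issue labels are hypothetical.

```python
def review_overlap(human_issues, ai_issues):
    """Summarise how an AI review overlaps with a human review.

    Both arguments are sets of issue labels. Matching free-text comments
    to labels is assumed to have been done beforehand.
    """
    shared = human_issues & ai_issues      # issues both reviews raise
    ai_only = ai_issues - human_issues     # extra issues only the AI raises
    # Coverage: fraction of human-raised issues the AI also found.
    coverage = len(shared) / len(human_issues) if human_issues else 0.0
    return {"coverage": coverage, "shared": len(shared), "ai_only": len(ai_only)}


# Hypothetical issue labels for a single paper.
human = {"missing-baseline", "unclear-workload", "no-power-numbers", "small-sample"}
ai = {"missing-baseline", "no-power-numbers", "small-sample",
      "undefined-acronym", "axis-unlabeled"}

result = review_overlap(human, ai)
print(result)
```

In practice the hard part is the matching step this sketch skips: deciding whether two differently worded comments describe the same issue.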
Knowledge Graphs as the Missing Data Layer for LLM-Based Agents
Rohit Patel, Bing He, Sanjay Chawla, Christopher Ré et al. — QCRI · Stanford University · KAUST
★ Notable lab / scholar · Knowledge management and retrieval over expertise · arXiv · 2026-05-27
This position paper argues that explicit knowledge graphs are a critical missing data layer for LLM-based agents operating over complex, evolving domains.[13] The authors discuss how graph structures can provide declarative memory, support consistency, and enable more robust task decomposition compared to purely parametric or ad hoc RAG memories.[13] They also describe AssetOpsBench, a benchmark for evaluating LLM agent autonomy on industrial maintenance tasks, illustrating the need for structured knowledge integration in real operations contexts.[13]
Why it matters. For practitioners building agentic systems and ‘LLM-wiki’ style knowledge platforms, this paper frames a concrete architecture pattern: combining LLMs with an explicit graph data backbone. It is also notable for proposing an industrial benchmark focused on long-horizon, asset-maintenance workflows rather than toy tasks.
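To make the "explicit graph data backbone" idea concrete, here is a minimal sketch of a triple store that an agent could use as declarative memory. It is an illustration under stated assumptions, not the architecture proposed in the paper: the `TripleStore` class, the predicate names, and the asset identifiers are all hypothetical.

```python
from collections import defaultdict


class TripleStore:
    """Tiny in-memory store of (subject, predicate, object) facts."""

    def __init__(self):
        self._by_subject = defaultdict(set)

    def add(self, subject, predicate, obj):
        self._by_subject[subject].add((predicate, obj))

    def objects(self, subject, predicate):
        """Return all objects linked to subject by predicate, sorted."""
        facts = self._by_subject.get(subject, ())
        return sorted(o for p, o in facts if p == predicate)

    def reachable(self, start, predicate):
        """Follow one predicate transitively from start (depth-first)."""
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for nxt in self.objects(node, predicate):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return sorted(seen)


# Hypothetical industrial-asset facts.
kg = TripleStore()
kg.add("pump-7", "part_of", "line-2")
kg.add("line-2", "part_of", "plant-A")
kg.add("pump-7", "has_sensor", "vib-01")

print(kg.objects("pump-7", "has_sensor"))
print(kg.reachable("pump-7", "part_of"))
```

Unlike text chunks retrieved by similarity, facts stored this way can be queried exactly and traversed step by step, which is the kind of consistency the position paper argues agents need.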
Econstellar: An Open-Source AI-Augmented Research Engine for Applied Econometrics
Wenjie Yan, Yuan Zhang, Yao Zhao, Zhiwei Steven Wu — University of Minnesota · Carnegie Mellon University
AI agents in practice · arXiv · 2026-06-06
Econstellar is presented as an open-source research engine that allows users to run publication-grade financial and applied econometrics analyses from a browser, with an AI assistant explaining the modeling and results.[12] The system orchestrates code execution, data management, and LLM-based narrative generation, providing end-to-end support from model specification to interpretation.[12] It is aimed at practitioners rather than built as a demonstration, and emphasizes robust, reproducible workflows.
Why it matters. This is a concrete example of an AI ‘copilot’ that tightly integrates LLMs, traditional statistical tooling, and domain-specific workflows in an applied field with publication-grade standards. The design patterns (LLM-mediated code, explanation, and data handling) are transferable to other scientific and engineering domains.
AI agents · scientific workflows · econometrics · copilots · tool use
MIRAI: Prediction and Generation of High-Impact Academic Research
Zilong Li, Yijun Xiao, Tao Lei, Yoshua Bengio — Université de Montréal · Mila · MIT
★ Notable lab / scholar · Reasoning, tool use, and multi-agent systems · arXiv · 2026-06-06
The paper introduces MIRAI (Multi-year Inference of Research trends and Academic Impact), a deep learning framework that predicts the long-term impact of papers using only their titles and metadata.[6] MIRAI models evolving research trends over multiple years and can also be used generatively to suggest research directions or paper-style outputs likely to have significant future influence.[6] The authors train and evaluate MIRAI on large scholarly datasets and analyze which features correlate most with downstream impact.[6]
Why it matters. Beyond bibliometrics, this is an instance of AI systems reasoning about research trajectories and content, with potential applications in prioritizing projects, funding decisions, and agentic research assistants. It also shows how relatively shallow inputs (titles) can carry enough structure for nontrivial impact forecasting.
research forecasting · scientific impact · deep learning · agents for science
Measuring Research Impact Through Large Language Model Memory
Cheng-Yu Hsieh, Shengcao Cao, Yiming Ma, Christopher Ré — Stanford University
★ Notable lab / scholar · Data processing & curation for ML · arXiv · 2026-05-31
This work proposes LLM-Metrics, a new kind of research-impact metric based on what large language models remember in their parametric weights.[10] The authors query LLMs about a large corpus of papers and quantify which works are retrievable and how accurately, using this as a signal for community-wide influence.[10] They analyze correlations between LLM-Metrics and traditional citation-based measures, as well as how training data and cutoff dates shape what models ‘remember’.[10]
Why it matters. For both ML and scientometrics, this is an example of using the internal state of foundation models as a measurement tool in its own right. It also highlights subtle data-curriculum and curation effects: which papers are prominent enough to be encoded and how that relates to visibility and access for downstream users of LLMs.
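One common way to compare a new metric against citation counts is a rank correlation. The sketch below computes Spearman's rho in plain Python. It is an assumption that a rank correlation is an appropriate comparison here; the summary does not say which statistic the authors use, and the recall scores and citation counts are invented for illustration.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical per-paper scores: LLM recall accuracy and citation counts.
recall = [0.9, 0.4, 0.7, 0.1]
citations = [1200, 150, 90, 10]
rho = spearman(recall, citations)
print(round(rho, 3))
```

A rank-based statistic is a reasonable choice for this kind of data because citation counts are heavily skewed, so a few very highly cited papers would otherwise dominate a linear correlation.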
What Are LLMs Doing to Scientific Communication? Measuring Changes in Writing Style, Content, and Authorship
Irene Solaiman, Emily M. Bender, Yoav Goldberg — Allen Institute for AI · University of Washington · Bar-Ilan University
★ Notable lab / scholar · Using LLMs to organize and manage technical knowledge · arXiv · 2026-05-30
The authors assemble a corpus of NLP papers from the ACL Anthology and compare pre- and post-LLM eras to study how scientific communication is changing.[7] They analyze shifts in writing style, the prevalence of template-like phrasing, changes in content structure, and patterns of authorship and acknowledgments potentially linked to AI-assisted drafting.[7] The paper quantifies how widespread LLM-mediated writing has become and discusses implications for knowledge transmission and peer review.[7]
Why it matters. For researchers relying on LLMs as writing or summarization tools, this provides empirical evidence about how these tools are already reshaping scientific text. It can inform how we design future knowledge-management systems, peer review processes, and style guidelines in an LLM-saturated ecosystem.
Evaluating LLM Agents on Industrial Asset Operations: AssetOpsBench
★ Notable lab / scholar · AI agents in practice · arXiv · 2026-05-27
Introduced within the "Knowledge Graphs as the Missing Data Layer for LLM-Based Agents" paper, AssetOpsBench is a benchmark focused on industrial asset operations for evaluating LLM agent autonomy.[13] The benchmark includes maintenance and asset-operations tasks that require long-horizon planning, tool use, and interaction with structured data, aiming to close the gap between agent demos and real-world industrial workflows.[13] It provides scenarios, tools, and evaluation criteria tailored to realistic industrial settings.[13]
Why it matters. AssetOpsBench is one of the first domain-specific evaluation suites targeting LLM agents in heavy-industry operations rather than synthetic puzzles. It is useful for researchers and engineers working on agents that must interact with complex data systems, safety constraints, and long-running tasks.
Jiajun Wu, Chelsea Finn, Sergey Levine, Fei-Fei Li — Stanford University · Google DeepMind · UC Berkeley
★ Notable lab / scholar · Reasoning, tool use, and multi-agent systems · arXiv · 2026-06-03
This overview paper identifies key reliability challenges in embodied AI, such as robustness to distribution shifts, safety under partial observability, and long-horizon error accumulation.[9] The authors outline three complementary research directions for achieving reliable embodied intelligence, integrating perception, control, and high-level reasoning in real-world environments.[9] They emphasize the need for unified benchmarks, standardized evaluation protocols, and architectures that couple language models with low-level controllers.[9]
Why it matters. For teams building robot-augmented agents or grounding LLMs in physical tasks, this work synthesizes the reliability problem space and proposes a structured agenda. It also highlights how LLM-centric and control-centric communities can converge on shared tools and benchmarks for embodied reliability.
The Compass for AI Utility and Adoption in the Global Majority
Nanjira Sambuli, Ruth Kamaru, Kentaro Toyama — Microsoft Research Africa
★ Notable lab / scholar · Data processing & curation for ML · arXiv · 2026-06-02
This work presents a practical framework and index for cross-cultural AI research, development, and deployment in the global majority, built around ten dimensions grouped into three overarching themes.[5] The index provides concrete criteria for evaluating AI utility, accessibility, governance, and socio-technical fit in diverse contexts beyond the Global North.[5] The authors draw on fieldwork and deployment experiences to ground the dimensions in real-world constraints.[5]
Why it matters. For ML researchers working on data curation, deployment, and evaluation in non-Western settings, this provides a structured set of requirements that can be operationalized in project design. It is particularly relevant when building or curating datasets and evaluation suites intended for global use, including for language and multimodal models.