Ready Jurist One: Benchmarking Language Agents for Legal Intelligence in Dynamic Environments

1Fudan University, 2Shanghai Innovation Institute,
3Huazhong University of Science and Technology, 4University of Southern California,
5Midu Technology, 6Northwest University of Political Science and Law
† Equal contribution

J1-ENVS is an interactive and comprehensive legal benchmark where LLM agents engage in diverse legal scenarios, completing tasks through interactions with various participants under procedural rules.

Overview

J1-ENVS: Interactive Legal Environments

J1-ENVS consists of six environments, grouped into three levels by environmental complexity (role diversity, interaction demands, and procedural difficulty); a minimal interaction-loop sketch follows the list:

  • Level-I: Knowledge Questioning & Legal Consultation. This level involves two participants: a member of the general public (drawn from various occupations) who poses progressive questions, and a legal agent who responds. The dialogue continues until all questions are addressed.
  • Level-II: Complaint Drafting & Defence Drafting. This level involves two participants: an individual with a specific legal need, and a legal agent who guides the interaction to progressively collect the required information and ultimately generate a legal document.
  • Level-III: Civil Court & Criminal Court. This level involves multiple participants and is governed by strict court procedural norms. Specifically, the judge agent guides the other participants through the stages of the courtroom process and ultimately reaches a judgment.
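
To make the shared structure concrete, below is a minimal Python sketch of the interaction loop these levels imply. All names (Role, Environment, step, is_done) and the termination check are illustrative assumptions, not the released J1-ENVS interface.

# Minimal, illustrative interaction loop (hypothetical names; not the
# released J1-ENVS API). Levels I/II use simple turn-taking between two
# participants; Level III would replace next_speaker with judge-driven
# scheduling that enforces courtroom procedure.
from dataclasses import dataclass, field

@dataclass
class Role:
    name: str     # e.g., "public", "plaintiff", "judge", "legal agent"
    profile: str  # persona and task-specific background

@dataclass
class Environment:
    roles: list[Role]
    transcript: list[tuple[str, str]] = field(default_factory=list)

    def next_speaker(self) -> Role:
        # Round-robin turn-taking; a courtroom variant would let the judge
        # agent schedule speakers according to procedural stages.
        return self.roles[len(self.transcript) % len(self.roles)]

    def step(self, utterance: str) -> bool:
        # Record the current speaker's utterance, then report termination:
        # all questions answered (Level I), document drafted (Level II),
        # or judgment delivered (Level III).
        self.transcript.append((self.next_speaker().name, utterance))
        return self.is_done()

    def is_done(self) -> bool:
        # Placeholder termination signal, for illustration only.
        return bool(self.transcript) and "[END]" in self.transcript[-1][1]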

J1-EVAL: Holistic Legal Agent Evaluation

Given the distinct characteristics of each legal task, J1-EVAL provides task-specific, fine-grained metrics that evaluate agent capabilities within our constructed environments using either rule-based or model-based methods; an illustrative sketch of both metric types follows.
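
As an illustration of the two metric families, the sketch below pairs a rule-based check with a model-based judge. The function names, required sections, and judge interface are assumptions for exposition, not the J1-EVAL implementation.

# Illustrative sketch of the two metric families (names, section labels,
# and the judge interface are placeholders, not the J1-EVAL code).
from typing import Callable

def rule_based_section_check(final_document: str) -> float:
    # Rule-based example: check that a drafted complaint contains the
    # required sections; the section names are hypothetical placeholders.
    required = ("Claims", "Facts and Reasons")
    return sum(sec in final_document for sec in required) / len(required)

def model_based_quality(final_document: str, judge: Callable[[str], str]) -> float:
    # Model-based example: an LLM judge scores the outcome from 1 to 10.
    prompt = ("Rate the legal quality of the following document on a scale "
              "of 1 to 10. Reply with a single integer.\n\n" + final_document)
    return int(judge(prompt).strip()) / 10.0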

Contributions

The contributions of our paper are as follows:

  • We introduce J1-ENVS, the first interactive and dynamic legal environment for agents, encompassing six representative real-world scenarios from Chinese legal practice. It provides an open-ended platform where agents engage in diverse legal tasks.
  • We introduce J1-EVAL to perform fine-grained evaluation of agents across different levels of legal proficiency, ranging from trainee to lawyer to judge. This yields a deeper understanding of the skill sets required for legal task execution and offers a realistic and reliable overview of agent capabilities.
  • Experimental results reveal gaps between diverse agents and real-world legal demands, offering valuable insights to guide future progress.
  • This work introduces a new paradigm for legal intelligence, shifting from static evaluation to dynamic interaction. Beyond evaluation, the environments can further be extended for data generation and reinforcement-learning training.

Datasets

J1-ENVS comprises 508 distinct environments (160 at Level I, 186 at Level II, and 192 at Level III), encompassing a diverse and nuanced set of legal attributes aligned with real-world legal practice.


Experiments


We conduct extensive experiments on 17 LLM agents, including proprietary models, open-source models, and legal-specific models.

Overall performance ranking: GPT-4o achieves the best performance, showing strong legal intelligence. Qwen3-Instruct-32B performs beyond expectations, surpassing the DeepSeek series. Although the legal-specific LLMs perform comparably to the GPT series on existing legal benchmarks, they are significantly weaker in our setting, falling behind even smaller models.

Figure: Overall performance ranking.

Performance at different levels: Both general and legal-specific models perform relatively well in Level I. However, their performance drops in Level II and Level III, which require proactive engagement and rigorous procedural compliance.


Can LLMs drive J1-ENVS? Both GPT-4o and human annotators rate the consistency between each role's profile and its behavior on a scale of 1 to 10. The ratings (Figure (a)) remain consistently high and stable, which validates the reliability and effectiveness of our environment. A minimal sketch of such a consistency check appears below.
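
A rating step of this kind could look like the following sketch; the prompt wording and the ask_llm interface are assumptions, not the paper's exact protocol.

def consistency_rating(profile: str, utterances: list[str], ask_llm) -> int:
    # Ask a judge LLM to rate profile-behavior consistency from 1 to 10.
    # `ask_llm` is a hypothetical text-in/text-out interface.
    prompt = (
        "Role profile:\n" + profile +
        "\n\nUtterances produced for this role:\n" + "\n".join(utterances) +
        "\n\nOn a scale of 1 to 10, how consistent is the behavior with "
        "the profile? Reply with a single integer."
    )
    return int(ask_llm(prompt).strip())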

Different NPCs: We then employ Qwen3-Instruct-32B to drive our environment and find that the relative performance differences and rankings (Figure (b)) among agents remain consistent across environments, further validating the robustness of our framework.

Conclusions

  • (A) Although current LLM agents have internalized legal knowledge, they struggle to maintain flexible interactive capabilities while adhering to rules, which limits their effectiveness in dynamic real-world environments. While legal intelligence improves with model size, even SOTA models average below a score of 60, highlighting ongoing challenges in handling diverse and complex legal scenarios.
  • (B) In complex courtroom proceedings, the legal agent’s capacity for sound legal reasoning and procedural compliance is fundamental to the accuracy and reliability of judicial outcomes.
  • (C) Although criminal court procedures are not necessarily more complex than those in civil courts, legal agents face greater challenges in criminal courts due to the involvement of more roles. Therefore, enhancing multi-role interaction and communication is crucial for agents.



Related Work

LawLLM: Intelligent Legal System with Legal Reasoning and Verifiable Retrieval
- Shengbin Yue, Shujun Liu, Yuxuan Zhou, Chenchen Shen, Siyuan Wang, Yao Xiao, Bingxuan Li, Yun Song, Xiaoyu Shen, Wei Chen, Xuanjing Huang, Zhongyu Wei

Multi-Agent Simulator Drives Language Models for Legal Intensive Interaction
- Shengbin Yue, Ting Huang, Zheng Jia, Siyuan Wang, Shujun Liu, Yun Song, Xuanjing Huang, Zhongyu Wei

AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator
- Zhihao Fan, Jialong Tang, Wei Chen, Siyuan Wang, Zhongyu Wei, Jun Xi, Fei Huang, Jingren Zhou

AgentSense: Benchmarking Social Intelligence of Language Agents through Interactive Scenarios
- Xinyi Mou, Jingcong Liang, Jiayu Lin, Xinnong Zhang, Xiawei Liu, Shiyue Yang, Rong Ye, Lei Chen, Haoyu Kuang, Xuanjing Huang, Zhongyu Wei

LawBench: Benchmarking Legal Knowledge of Large Language Models
- Zhiwei Fei, Xiaoyu Shen, Dawei Zhu, Fengzhe Zhou, Zhuo Han, Songyang Zhang, Kai Chen, Zongwen Shen, Jidong Ge

BibTeX

@article{jia2025readyjuristonebenchmarking,
  author  = {Zheng Jia and Shengbin Yue and Wei Chen and Siyuan Wang and Yidong Liu and Yun Song and Zhongyu Wei},
  title   = {Ready Jurist One: Benchmarking Language Agents for Legal Intelligence in Dynamic Environments},
  journal = {arXiv preprint arXiv:2507.04037},
  year    = {2025},
  url     = {https://arxiv.org/abs/2507.04037}
}