With the rapid progress in artificial intelligence and deep learning, methodological advances in single-cell analysis have undergone a notable shift from traditional statistical techniques to specialized deep learning models, and more recently, to pre-trained foundation models. While these developments have led to significant improvements in performance and scalability, several inherent limitations remain unresolved:
(1) Lack of Unification. Existing paradigms typically require separately designed models for different types of omics data and downstream tasks, with no unified approach capable of simultaneously handling multi-omics and multi-task scenarios.
(2) Limited User-Friendliness. Applying these methods effectively to single-cell analysis often requires both domain expertise in biology and proficiency in programming. Moreover, the lack of user-centric interaction design in current models poses a significant barrier to adoption by non-expert users.
(3) Poor Interpretability. Most existing data-driven black-box models directly learn the mapping from input (e.g., gene expression) to output (e.g., cell type), without interpretable intermediate steps. As a result, users often cannot understand the rationale behind a model's decisions. To address these limitations, we seek to establish a unified, user-friendly, and interpretable paradigm for single-cell analysis.
[Figure: An illustration of traditional single-cell analysis and language-centric single-cell analysis.]
Specifically, we introduce CellVerse, a unified language-centric benchmark dataset for evaluating the capabilities of LLMs in single-cell analysis. We begin by curating five sub-datasets spanning four types of single-cell multi-omics data (scRNA-seq, CITE-seq, ASAP-seq, and scATAC-seq) and translating them into natural language. We then select the three most representative single-cell analysis tasks—cell type annotation (cell-level), drug response prediction (drug-level), and perturbation analysis (gene-level)—and reformulate them as question-answering (QA) problems by integrating each with the natural language-formatted single-cell data.
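To make the QA reformulation concrete, below is a minimal sketch of how a cell's expression profile might be serialized into a natural-language prompt for the cell type annotation task. Ranking genes by expression and listing the top symbols is a Cell2Sentence-style serialization scheme; the prompt template, helper functions, and gene values here are illustrative assumptions, not CellVerse's actual format.

```python
# A minimal sketch of turning a cell's expression profile into a
# natural-language QA prompt. The template and helpers are illustrative
# assumptions, not CellVerse's actual format.

def cell_to_sentence(expr: dict[str, float], top_k: int = 20) -> str:
    """Rank genes by expression and list the top-k symbols,
    in the spirit of Cell2Sentence-style serialization."""
    ranked = sorted(expr, key=expr.get, reverse=True)
    return " ".join(ranked[:top_k])

def build_annotation_prompt(expr: dict[str, float], candidate_types: list[str]) -> str:
    """Wrap the serialized cell in a multiple-choice QA question
    for the cell type annotation task."""
    sentence = cell_to_sentence(expr)
    options = "; ".join(candidate_types)
    return (
        f"The following genes are the most highly expressed in a cell, "
        f"in descending order: {sentence}.\n"
        f"Which cell type is this most likely to be? Options: {options}.\n"
        f"Answer with one option."
    )

# Example usage with toy expression values.
expr = {"CD3D": 9.1, "CD3E": 8.7, "IL7R": 7.9, "GNLY": 0.2, "MS4A1": 0.0}
print(build_annotation_prompt(expr, ["T cell", "B cell", "NK cell"]))
```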
Next, we conduct a comprehensive and systematic evaluation of 14 advanced LLMs on CellVerse. The evaluated models include open-source LLMs such as C2S-Pythia (160M, 410M, and 1B), Qwen-2.5 (7B, 32B, and 72B), Llama-3.3-70B, and DeepSeek (V3 and R1), as well as closed-source models including GPT-4, GPT-4o-mini, GPT-4o, GPT-4.1-mini, and GPT-4.1.
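As a sketch of the evaluation protocol, the loop below scores a model's exact-match accuracy over QA pairs; `query_llm` is a hypothetical stand-in for any of the open- or closed-source model APIs above, and the string normalization is an assumption, not the benchmark's actual scoring rule.

```python
# A minimal evaluation-loop sketch over (prompt, gold answer) pairs.
# `query_llm` is a hypothetical callable wrapping whichever LLM is evaluated.
from typing import Callable

def evaluate(qa_pairs: list[tuple[str, str]], query_llm: Callable[[str], str]) -> float:
    """Return the model's exact-match accuracy over the QA pairs."""
    correct = 0
    for prompt, gold in qa_pairs:
        prediction = query_llm(prompt).strip().lower()
        correct += prediction == gold.strip().lower()
    return correct / len(qa_pairs)
```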
Through this large-scale empirical assessment, we uncover several key insights:
(1) Generalist models outperform specialist models: DeepSeek-family and GPT-family models demonstrate emergent reasoning capabilities in cell biology, whereas C2S-Pythia fails completely across all sub-tasks;
(2) Model performance correlates positively with parameter scale, as evidenced by the Qwen-2.5 series, whose performance follows model size: 72B > 32B > 7B;
(3) Current LLMs demonstrate limited understanding of cell biology, particularly in drug response prediction and perturbation analysis, where most LLMs fail to surpass random-guessing baselines with statistical significance (see the significance-test sketch below).
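For intuition, comparing a model's accuracy against random guessing on an n-way multiple-choice task can be done with a one-sided binomial test, as in the minimal sketch below; the question count, correct-answer count, and number of options are illustrative numbers, not results from CellVerse.

```python
# A minimal sketch of testing whether accuracy beats random guessing on an
# n-way multiple-choice task; all numbers here are illustrative assumptions.
from scipy.stats import binomtest

n_questions = 500          # hypothetical number of QA items
n_correct = 270            # hypothetical number answered correctly
n_options = 2              # e.g., a binary drug response question
p_random = 1 / n_options   # accuracy expected from random guessing

# One-sided test: is the observed accuracy significantly above chance?
result = binomtest(n_correct, n_questions, p_random, alternative="greater")
print(f"accuracy = {n_correct / n_questions:.3f}, p = {result.pvalue:.4g}")
```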