Application of GPT-4 for RNA sequencing

Single-cell RNA sequencing (scRNA-seq) is a method for studying expression profiles at the level of individual cells, that is, determining which RNAs are present in each cell and in what quantities. This allows scientists to understand how each cell works and what role it plays.

Put simply, this method shows which genes in a cell are “on” and which are “off” at a given moment. This matters because active genes determine how the cell behaves: whether it stays healthy, turns cancerous, helps the immune system fight infection, and so on. RNA sequencing is therefore used to develop drugs, to study and treat diseases, and to understand how living organisms develop and function at the cellular level.

The whole process is quite complex, so how does GPT-4 help with it? I will explain this clearly and in detail in this article!

Happy reading! 🙂

Introduction to the Study

This research is based on using the large language model GPT-4 to automate cell type annotation in single-cell RNA sequencing data.

Single-cell RNA sequencing is a high-tech research method that allows scientists to look inside individual cells and find out which genes are active in them. Every gene that is “turned on” produces RNA, and it is this RNA that scientists “read” using sequencing. This is done by decoding nucleotide sequences!

You can imagine that inside each cell there is a small factory where genes are workers that perform different tasks. Some workers are active at certain times, while others rest. Single-cell RNA sequencing allows us to find out which workers are currently “on shift.” This is very important because different types of cells perform different functions in the body, and gene activity reflects these functions. For example, liver cells will activate one set of genes, and brain cells will activate a completely different one 😛

Now about cell type annotation. Once scientists have their data from single-cell RNA-seq, they are faced with the challenge of understanding which cells they were studying. After all, samples for research are often taken from tissues that contain many different cells. Annotation is a process in which scientists match groups of cells to already known cell types based on their gene activity. If we return to the analogy with the factory, it is as if you determine what kind of products the factory produces by looking at which workers are currently on shift 🙂

GPT-4 is able to recognize and classify different cell types based on information about marker genes, that is, genes that are specific to certain cell types.

The effectiveness of GPT-4 has been tested on a large number of tissue and cell types, and the results produced by the model have shown a high degree of agreement with manual annotations by human experts. This means that GPT-4 can accurately identify cell types within complex biological samples, which typically requires extensive knowledge and time with the traditional approach.

A dedicated software package for the R programming language, GPTCelltype, was also created.

It lets researchers use GPT-4 to annotate cell types automatically, which simplifies and speeds up the process.

Thus, the application of GPT-4 in single-cell RNA-seq analysis offers opportunities to reduce the workload of researchers and simplify the annotation process, making it more accessible and less time-consuming.

Research methods and results

Comparison of cell type annotation by human experts, GPT-4 and other automated methods

GPT-4 is cost-effective and integrates seamlessly into existing single-cell analysis pipelines such as Seurat, so there is no need to build additional pipelines or collect high-quality reference datasets. Because GPT-4 was trained on a very large corpus, it can be applied across a wide variety of tissues and cell types, and its chatbot interface lets users refine annotations interactively.

Seurat is a software package for R designed specifically for analyzing single-cell RNA-seq data. It provides tools for data quality control, cell type identification and classification, molecular pathway analysis, and other common single-cell genomics tasks. With it, researchers can process complex single-cell sequencing datasets to reveal cellular heterogeneity and the molecular mechanisms behind cell function and interaction.

The researchers systematically assessed the performance of GPT-4 on ten datasets spanning five species and hundreds of tissue and cell types, including both normal and cancer samples.

Queries to GPT-4 were made using GPTCelltype. For comparison, the researchers also evaluated GPT-3.5, the predecessor of GPT-4, as well as CellMarker2.0, SingleR, and ScType, which are automated cell type annotation methods that provide reference data applicable to a wide range of tissues.

Cell type annotations produced by GPT-4 or the competing methods were assessed by how well they agreed with the manual annotations provided by the original studies. The degree of agreement was measured with a numerical score. A supplementary table in the original study provides an example of evaluating GPT-4 cell type annotations in human prostate tissue.
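To make the idea of a numerical agreement score more concrete, here is a minimal Python sketch. The three labels and the values 1, 0.5 and 0 are an assumption made for illustration (this article does not state the exact scale), and the function name `average_agreement` is hypothetical.

```python
# Assumed scoring scheme, for illustration only: the article does not give the exact values.
AGREEMENT_SCORES = {"full": 1.0, "partial": 0.5, "mismatch": 0.0}


def average_agreement(labels):
    """Return the mean agreement score for a list of per-cell-type labels.

    Each label says how a model annotation compares with the manual one:
    "full", "partial" or "mismatch".
    """
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(AGREEMENT_SCORES[label] for label in labels) / len(labels)


# Toy example: four cell types from one tissue, compared by a human reviewer.
example = ["full", "full", "partial", "mismatch"]
print(average_agreement(example))  # 0.625
```

Averaging such scores over all cell types in a tissue gives one number per method, which makes it possible to compare GPT-4 with the other tools.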

The scientists also examined various factors that may affect the accuracy of GPT-4 annotation.

The figure below summarizes the various databases, projects, and cancer types that are relevant to research aimed at understanding the complexity of cell populations in healthy and diseased tissues:

Azimuth is a tool or platform for annotating cell types in scRNA-seq data. It uses extensive reference datasets to identify cell types based on their gene expression profiles.

Colon Cancer refers to a dataset of colon cancer samples. In the context of cell type annotation, such studies aim to identify and classify the different cell types in tissue affected by colon cancer to better understand its pathogenesis.

HCL (Human Cell Landscape) is a project designed to map cell types and states in various human tissues and organs. It provides rich information that can be used to annotate and compare cell types in scRNA-seq data.

Lung Cancer refers to a dataset of lung cancer samples. As with colon cancer, these studies aim to distinguish cell types in cancerous and normal lung samples.

BCL (B-cell lymphoma) is a cancer of the blood that arises from B lymphocytes. In cell type studies, this dataset relates to the analysis of cellular heterogeneity in lymphoma samples.

GTEx (Genotype-Tissue Expression) is a project and database that contains information about gene expression in various tissues of the normal human body. These data can serve as important reference information for cell type annotation and comparison.

MCA (Mouse Cell Atlas) is an atlas of mouse cells, similar to Human Cell Landscape, but for the mouse model organism.

TS (Tabula Sapiens) is a reference atlas of human cells covering many tissues and organs. Like the Human Cell Landscape, it provides data that can be used to annotate and compare cell types.

Average agreement scores for different numbers of top differential genes, statistical tests for differential analysis, and prompt strategies.

In summary, the researchers found that GPT-4 performed best when given the top 10 differentially expressed genes (compared to the top 20 or 30) and when the differential genes were obtained with the two-sided Wilcoxon test.

The two-sided Wilcoxon test is a nonparametric statistical test that compares two samples to determine whether they differ significantly. It is often used when the data do not follow a normal distribution, which makes it an alternative to Student's t-test, a test that assumes normally distributed data.

In the context of genetic data, such as gene expression in different cell types, the Wilcoxon test can be used to identify genes that are differentially expressed between two groups, for example between healthy and diseased samples or between different cell types.
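As a rough illustration of how such a test works, here is a self-contained Python sketch of the two-sided Wilcoxon rank-sum test for two independent groups. It uses the normal approximation with a tie correction, which is an assumption of this sketch and may differ from what Seurat or other tools do internally; the function names and the expression values are made up.

```python
import math


def rank_with_ties(values):
    """Return 1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of sorted positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def wilcoxon_rank_sum(x, y):
    """Two-sided rank-sum test; returns (U statistic of x, approximate p-value)."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups must contain at least one value")
    combined = list(x) + list(y)
    n = n1 + n2
    ranks = rank_with_ties(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    # Tie correction: each group of t tied values contributes t^3 - t.
    counts = {}
    for value in combined:
        counts[value] = counts.get(value, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance <= 0:  # all values identical: no evidence of a difference
        return u1, 1.0
    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u1, p_value


# Toy expression values of one gene in two groups of cells (invented numbers).
cluster_cells = [5.1, 4.8, 6.0, 5.5, 4.9]
other_cells = [0.2, 0.0, 1.1, 0.4, 0.3]
u, p = wilcoxon_rank_sum(cluster_cells, other_cells)
print(u, p)
```

In a real analysis this comparison is repeated for every gene, and the genes with the strongest differences become the marker genes for the cluster.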

GPT-4's accuracy was similar across the prompt strategies tested: a basic prompt strategy, a prompt strategy inspired by the chain-of-thought method, which includes reasoning steps, and a repeated prompt strategy (also shown in the figure above).

Basic prompt strategy: the information was given to GPT-4 in a simple, direct format, without further explanation or guidance.

Chain-of-thought-inspired prompt strategy: successive reasoning steps were written into the query, so the model received an explicit hint about how to reason its way to the conclusion.

Repeated prompt strategy: the same or a modified query was sent several times, perhaps to deepen the analysis or clarify the responses.

In the subsequent analyses, both GPT-4 and GPT-3.5 were given the basic prompt strategy, with the top ten differentially expressed genes from the Wilcoxon test as input where applicable.
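To show what a basic prompt built from marker genes might look like, here is a small Python sketch. The wording of the prompt and the function name `build_basic_prompt` are hypothetical; this is not the exact text that GPTCelltype sends, and the sketch only assembles the text without calling any API.

```python
def build_basic_prompt(tissue, markers_by_cluster, top_n=10):
    """Assemble one plain-text prompt listing the top marker genes per cluster.

    The sentence wording is an illustrative assumption, not GPTCelltype's prompt.
    """
    lines = [
        f"Identify the cell types of {tissue} cells using the following marker genes.",
        "Each row is one cell cluster. Give only the cell type name for each row.",
        "",
    ]
    for cluster, genes in markers_by_cluster.items():
        # Keep only the first top_n genes, assumed to be sorted by significance.
        lines.append(f"{cluster}: {', '.join(genes[:top_n])}")
    return "\n".join(lines)


# Toy input: well-known T cell and B cell marker genes.
markers = {
    "Cluster 0": ["CD3D", "CD3E", "CD2"],
    "Cluster 1": ["MS4A1", "CD79A", "CD79B"],
}
print(build_basic_prompt("human blood", markers))
```

The resulting text can be pasted into the chat window or sent through the API, and the reply is expected to contain one cell type name per row.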

GPT-4 annotations match fully or partially with manual annotations for more than 75% of cell types in most studies and tissues, demonstrating its competence in generating cell type annotations comparable to expert ones.

Proportion of cell types with different levels of agreement in each study and tissue, most abundant common cell types, malignant cells, different cell population sizes, and major cell types versus cell subtypes.

This agreement is particularly high for marker genes found in the literature, with at least 70% complete agreement in most tissues. Agreement is lower for genes identified by differential analysis, but still high. However, results from datasets published before September 2021 should be interpreted with caution, because they predate GPT-4's training data cutoff and may therefore have been part of its training data.

Main conclusions:

The researchers concluded that GPT-4 handles immune cells, such as granulocytes, better than other cell types. It identifies malignant cells in the colon and lung cancer datasets, but has difficulty with B-cell lymphoma, possibly due to a lack of clear gene sets. Identification of malignant cells could benefit from other approaches, such as analysis of gene copy number variation.

Performance also decreases slightly in small cell populations of no more than ten cells, possibly due to the limited amount of information available.

GPT-4 annotations are more often fully consistent with manual annotations for major cell types (for example, T cells) than for subtypes (for example, memory CD4 T cells), although more than 75% of subtypes still achieve complete or partial agreement.

T cells, also known as T lymphocytes, are a key component of the adaptive immune system in mammals, including humans. These cells develop from stem cells in the bone marrow and mature in the thymus (hence the name “T”), a small organ located in the upper chest, behind the breastbone.

Depending on their function and the type of antigen receptor on their surface, T cells are divided into several main types. T helper cells (CD4+ T cells), mentioned above, help activate other immune cells, including B cells to produce antibodies and killer T cells to destroy infected cells.

The low level of agreement between GPT-4 annotations and manual annotations in some cell types does not necessarily mean that the GPT-4 annotations are incorrect.

For example, the cell types classified as stromal cells include fibroblasts and osteoblasts, which express type I collagen genes, and chondrocytes, which express type II collagen genes.

Fibroblasts are the most common cell type in connective tissue. They play a key role in wound healing and tissue repair by producing important extracellular matrix components such as collagen, fibronectin and glycosaminoglycans. Fibroblasts provide mechanical strength to tissue and participate in the creation of the extracellular matrix.

Osteoblasts are cells that are responsible for bone formation. They produce and secrete bone matrix, which then mineralizes into hard bone tissue. Osteoblasts also play an important role in the process of bone remodeling, that is, in the constant renewal and restoration of bone tissue.

Chondrocytes are the only cells found in cartilage tissue. They are responsible for synthesizing and maintaining the extracellular matrix of cartilage, which includes collagen (mainly type II), proteoglycans and other components. Unlike osteoblasts, which form bone tissue, chondrocytes do not produce a hard, mineralizing matrix; instead, they create a flexible and resilient matrix that allows cartilage to perform its functions.

For cells manually annotated as stromal cells, GPT-4 assigns cell type annotations at higher granularity (e.g., fibroblasts and osteoblasts), resulting in partial matches and lower levels of agreement. For cell types that are manually annotated as stromal cells but defined by GPT-4 as fibroblasts or osteoblasts, type I collagen genes show significantly higher expression than type II collagen genes. This is consistent with the observed pattern in cells manually annotated as chondrocytes, fibroblasts, and osteoblasts, suggesting that GPT-4 provides more accurate cell type annotations for stromal cells.

GPT-4 significantly outperforms other methods based on average agreement scores.

Using GPTCelltype as an interface, GPT-4 is also noticeably faster, in part due to the use of differential genes from standard single-cell analysis pipelines such as Seurat.

Given the integral role of these pipelines, the researchers view differential genes as directly accessible to GPT-4. In contrast, other methods such as SingleR and ScType require additional steps to reprocess gene expression matrices.

In addition, unlike the other methods, which are free, GPT-4 costs $20 per month through the web portal. The cost of using the GPT-4 API grows linearly with the number of cell types queried and did not exceed $0.10 for all queries in this study.

Financial cost of GPT-4 API query versus cell type numbers

The researchers also assessed how robust GPT-4 is on complex real-world data by using simulated datasets. GPT-4 can distinguish pure and mixed cell types with 93% accuracy, and known and unknown cell types with 99% accuracy. When the input gene set contains fewer genes or contains noise, GPT-4 performance decreases but remains high. These results demonstrate the robustness of GPT-4 in different scenarios.
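The idea of subsampling and noise can be sketched in a few lines of Python. This is only a guess at what such a simulation might look like: the function `perturb_markers`, its parameters, and the way noise genes are chosen are all assumptions made for illustration, not the procedure used in the study.

```python
import random


def perturb_markers(genes, keep_fraction, noise_fraction, background, seed=0):
    """Return a degraded marker list: fewer genes, some replaced by unrelated ones.

    keep_fraction  - share of the original genes to keep (subsampling)
    noise_fraction - share of the kept genes to swap for background genes (noise)
    background     - pool of unrelated genes; it must contain enough genes
                     that are not in the original marker list
    """
    rng = random.Random(seed)
    n_keep = max(1, round(len(genes) * keep_fraction))
    kept = rng.sample(genes, n_keep)
    n_noise = round(n_keep * noise_fraction)
    noise_positions = rng.sample(range(n_keep), n_noise)
    candidates = [g for g in background if g not in genes]
    replacements = rng.sample(candidates, n_noise)
    for position, gene in zip(noise_positions, replacements):
        kept[position] = gene
    return kept


# Toy example: ten T cell related genes, with unrelated genes used as noise.
t_cell_markers = ["CD3D", "CD3E", "CD2", "IL7R", "CCR7",
                  "LTB", "TRAC", "CD7", "CD5", "CD28"]
unrelated = ["ALB", "INS", "MBP", "TTN", "KRT5", "HBB"]
print(perturb_markers(t_cell_markers, keep_fraction=0.5,
                      noise_fraction=0.4, background=unrelated))
```

Sending both the original and the degraded lists to the model and comparing the answers shows how much the annotation depends on the quality of the input.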

Performance of GPT-4 in identifying mixed/single cell types and known/unknown cell types, and under varying levels of subsampling and noise across multiple rounds of simulation.

Finally, the researchers assessed the reproducibility of GPT-4 annotations using the earlier simulation studies. GPT-4 generated identical annotations for the same marker genes 85% of the time, indicating high reproducibility.

The annotations from two versions of GPT-4 had identical agreement scores in most cases, with a Cohen's kappa coefficient of 0.67, indicating substantial agreement.

Cohen's kappa coefficient (Cohen's κ) is a statistical measure of agreement between two raters who independently assign each item to one of several categories. Unlike a simple percentage of matches, it corrects for the agreement expected by chance.
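The coefficient is computed as κ = (p_o − p_e) / (1 − p_e), where p_o is the observed share of items on which the two raters agree and p_e is the share of agreement expected by chance. Here is a minimal Python sketch; the function name and the toy labels are made up for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equally long lists of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' label frequencies, summed over labels.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:  # both raters used one and the same label for every item
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Toy example: two sets of annotations for six cell clusters.
first = ["T cell", "T cell", "B cell", "B cell", "NK cell", "T cell"]
second = ["T cell", "T cell", "B cell", "NK cell", "NK cell", "T cell"]
print(round(cohens_kappa(first, second), 3))  # 0.739
```

In this toy example the raters agree on five of six clusters (p_o ≈ 0.83), but part of that agreement would be expected by chance (p_e ≈ 0.36), so kappa comes out lower than the raw percentage.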

Although GPT-4 is superior to existing cell type annotation methods, there are limitations that should be considered. First, the undisclosed nature of the GPT-4 training corpus makes it difficult to validate the basis of its annotations, which requires human evaluation to ensure quality and reliability. Second, human involvement in possible additional tuning of the model may affect reproducibility due to subjectivity and may limit the scalability of the model in large data sets. Third, high noise levels in scRNA-seq data and unreliable differential genes may adversely affect GPT-4 annotations.

The researchers recommend that GPT-4 cell type annotations be reviewed by human experts before subsequent analyses.

Although the study focuses on the standard version of GPT-4, additional customization of GPT-4 using high-quality reference marker gene lists can further improve cell type annotation performance using services such as “GPTs” provided by OpenAI.

Conclusions

It is important to note that relying exclusively on AI and GPT-4 is not acceptable at this stage, because the approach still needs refinement and additional testing, but it has excellent prospects for the future!

The authors of the main study are Wenpin Hou (Department of Biostatistics, Columbia University School of Public Health) and Zhicheng Ji (Department of Biostatistics and Bioinformatics, Duke University School of Medicine).

Thanks for reading! We will be waiting for you in the comments 🙂
