Why Data Leaks Occur in Large Language Models. Part 3

Good day, dear readers. This is the third part of the article dedicated to the "leakage" of confidential data from large language models through cyberattacks. In the first two parts (one and two) we considered the possible causes and consequences of such attacks, separately touched on their types, and went into detail about the mechanisms and methods of collecting and forming data sets, as well as their structure and properties.

Here we will consider the properties of the resulting knowledge graphs, as well as tools for displaying them. First of all, we are interested in obtaining a knowledge graph (one and two) and interpreting it correctly, and in choosing a tool that reflects the graph objectively and can scale very quickly, because the amount of data in the model is constantly growing and the nodes are constantly migrating. Moreover, as it turned out, nodes are not static: they can merge, break apart, and flow into adjacent areas.

Knowledge graph

Before focusing on the choice of a tool for knowledge graph visualization (one, two, three, four, five, six, seven, eight) based on the generated set of prompts (question-answer pairs and distance metrics), from which we intend to obtain the key nodes, we need to evaluate the quality of the resulting graph: how informative, complete and, most importantly, reliable it will be. The speed of the attack depends on how quickly the key nodes are found: the faster we find them, the faster we understand what the data set is made of and how it can be manipulated, transformed and "poisoned".

A detailed examination of scientific papers showed that queries to the model now fit into a special paradigm called Learning to Plan from Knowledge Graphs (LPKG). Instead of answering a query immediately, the model plans a whole sequence of intermediate answers, from which the final decision is formed. In effect, this is an internal critic: the model itself begins to evaluate how well it will answer one way or another. It works as yet another add-on to the main network.

At the same time, a new quality benchmark, CLQA-Wiki, was created specifically for this kind of work. It makes it possible to evaluate the quality of the received prompt, uniformly covering multi-hop, comparative, intersecting and merging query types, which gives a more complete (comprehensive) representation and a more relevant answer.

The system is, of course, built on the RAG architecture (one, two, three, four). Specialized modules have even been developed for extracting knowledge chains, such as ToG (article) and KELP. KELP works in three stages:

  1. knowledge path extraction;

  2. sample encoding;

  3. fine-tuned selection of the highest-quality path in the graph.

Modular systems have also been proposed that allow extracting information from the model and building knowledge graphs in a real-time question-answer mode. A special hackathon was even held. The main goal is to understand the properties of the graph and its main characteristics.

Graph properties

Below is a short list of the key parameters by which the quality of the resulting graph is assessed. I will say right away: since we are trying to restore the graph topology from prompts (request-response pairs), not all aspects will be fully involved, but it is worth striving to implement them in order to understand the entire structure of the graph more fully. If you look at serious reference books on graph theory, for example here and here, you can see that quite a lot of different characteristics are distinguished. It is difficult to cover them all, and not always necessary, so I have highlighted only a few:

Coherence and integrity. The graph should be well connected: all key concepts and notions are interlinked, which allows us to trace the interdependencies between different parts of the model and to outline the topology of the nodes more clearly. The following types of connections are the most important (a minimal code sketch follows this list):

  1. Hierarchical relationships between more general and more specific concepts.

  2. Associative links reflecting thematic closeness and interrelationships between concepts.

  3. Cause and effect relationships showing dependencies and influences between phenomena.

  4. Functional relationships that describe the roles and purposes of various elements.
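To make these relation types concrete, here is a minimal sketch (the node names and relation labels are invented purely for illustration, not taken from any particular model) of how typed links can be stored as edge attributes in a NetworkX MultiDiGraph and filtered later:

```python
import networkx as nx

# A directed multigraph: the same pair of concepts may be linked
# by several relations of different types.
kg = nx.MultiDiGraph()

# Hypothetical fragment of a reconstructed knowledge graph.
kg.add_edge("neural network", "transformer", relation="hierarchical")   # general -> specific
kg.add_edge("transformer", "attention", relation="associative")         # thematic closeness
kg.add_edge("training data", "model weights", relation="causal")        # dependency / influence
kg.add_edge("tokenizer", "text preprocessing", relation="functional")   # role / purpose

# Edges can then be filtered by relation type, e.g. to study only the causal structure.
causal = [(u, v) for u, v, d in kg.edges(data=True) if d["relation"] == "causal"]
print(causal)
```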

Hierarchy. The graph should show hierarchical relationships between concepts, where more general concepts are linked to more specific ones. This helps identify high-level "hubs", or key nodes, that affect most of the model; these relationships deserve detailed consideration.

Semantic richness. The links between nodes should be semantically rich, reflecting different types of relationships (e.g., "is part of", "uses", "affects", etc.). This makes it possible to analyze in more detail the structure and logic of the model, its interactions, and how the prompt (question-answer) develops along the query formation network.

Coverage of the subject area. The graph should, if possible, cover all key aspects and concepts related to the subject area of the model. The more completely and accurately the area is represented, the more valuable the information obtained during the analysis will be.

Graph scalability. It must be possible to expand the graph as the model grows and new concepts and connections are added. This is extremely important, since we are always evaluating not a final state but the possibility of continually extending the graph as new data arrives, and data will pour in constantly and in very large volumes.

Level of detail. The graph should provide the necessary level of detail and depth of knowledge representation, ideally including both high-level general concepts and more detailed, specific ones, in accordance with the requirements of the subject area and the purposes of use.
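A rough way to check some of these properties (connectivity, density, candidate hubs) on a reconstructed graph is sketched below; the toy edge list is invented solely for illustration:

```python
import networkx as nx

# Toy fragment of a reconstructed knowledge graph (names are invented).
G = nx.Graph([
    ("neural network", "transformer"), ("transformer", "attention"),
    ("transformer", "tokenizer"), ("tokenizer", "text preprocessing"),
    ("training data", "model weights"), ("neural network", "model weights"),
])

print("connected:", nx.is_connected(G))
print("components:", nx.number_connected_components(G))
print("density:", round(nx.density(G), 3))

# Candidate "hubs": nodes whose degree is noticeably above the average.
degrees = dict(G.degree())
avg = sum(degrees.values()) / len(degrees)
print("hub candidates:", [n for n, d in degrees.items() if d > avg])
```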

The next important point is choosing and configuring graph visualization tools and reviewing their capabilities. My search led me to some interesting developments that I want to share. Of course, the most interesting one comes at the end of the list, but it is not always necessary: although it fully covers the task, you can often get by with simpler options.

Visualization tools

Today, there are quite a lot of tools for working with graphs. But my task is quite specific: I needed tools that work with knowledge graphs, so the selection was guided by a number of features:

Node Centrality Analysis. Identifying nodes with high centrality, such as nodes with the most connections or those located on the shortest paths between other nodes; these play a key role in the model. The following aspects are important here (a short sketch follows this list):

  1. Degree Centrality. Identifying the nodes with the most connections, as they are the key concepts that link many others.

  2. Betweenness Centrality. Identifying nodes that lie on the shortest paths between many other nodes, indicating their role in transmitting information.

  3. Closeness Centrality. Analyzing the nodes closest to the "center" of the graph, as they may be influential concepts.
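A minimal NetworkX sketch of these three metrics on the same invented fragment (directed this time, since the direction of a link often matters when the graph is rebuilt from prompts):

```python
import networkx as nx

G = nx.DiGraph([
    ("neural network", "transformer"), ("transformer", "attention"),
    ("transformer", "tokenizer"), ("tokenizer", "text preprocessing"),
    ("training data", "model weights"), ("neural network", "model weights"),
])

for name, scores in [
    ("degree", nx.degree_centrality(G)),
    ("betweenness", nx.betweenness_centrality(G)),
    ("closeness", nx.closeness_centrality(G)),
]:
    top = max(scores, key=scores.get)
    print(f"{name:12s} top node: {top} ({scores[top]:.3f})")
```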

Identifying connecting nodes, i.e. those that connect different clusters or communities in the graph. These nodes can be especially important for understanding the interactions between different parts of the model and can fill in gaps in knowledge. They may also contain important metadata or contextual information, and they can optimize search and improve the overall performance of the network.
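One simple, hedged approximation of such connecting nodes is to look for articulation points and bridges in the undirected version of the graph (again on the invented fragment):

```python
import networkx as nx

G = nx.Graph([
    ("neural network", "transformer"), ("transformer", "attention"),
    ("transformer", "tokenizer"), ("tokenizer", "text preprocessing"),
    ("training data", "model weights"), ("neural network", "model weights"),
])

# Nodes whose removal splits the graph: likely connectors between clusters.
print("connector nodes:", list(nx.articulation_points(G)))

# Edges whose removal splits the graph: weak links between parts of the model.
print("bridge edges:", list(nx.bridges(G)))
```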

Assessing the influence of nodes. Quantifying the influence of each node on the rest of the graph, for example using metrics such as PageRank or closeness to the center. This also helps to identify the most important concepts in the model, take into account the dynamic nature of the graph, and interpret the result more fully. The following aspects are taken into account (a sketch follows this list):

  1. Modularity. Identification of tightly connected groups of nodes that may represent important thematic modules or subsystems.

  2. Structural Equivalence. Identifying nodes that play similar roles in a graph, which may indicate concepts related to the same entity.

  3. PageRank. An assessment of the importance of nodes based on the quantity and quality of links leading to them, giving an indication of the authority of concepts.

  4. HITS (Hyperlink-Induced Topic Search). Identification of “hubs” (nodes with many links) and “authorities” (nodes that are linked to by many “hubs”), which may indicate key concepts.
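The sketch below runs PageRank, HITS and a simple modularity-based grouping on the same invented fragment; greedy modularity is used here only as an illustrative stand-in for a modularity method:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.DiGraph([
    ("neural network", "transformer"), ("transformer", "attention"),
    ("transformer", "tokenizer"), ("tokenizer", "text preprocessing"),
    ("training data", "model weights"), ("neural network", "model weights"),
])

pagerank = nx.pagerank(G)          # authority via the quantity/quality of incoming links
hubs, authorities = nx.hits(G)     # HITS: hubs point to many authorities

modules = greedy_modularity_communities(G.to_undirected())  # modularity-based grouping

print("top by PageRank:", max(pagerank, key=pagerank.get))
print("top hub:", max(hubs, key=hubs.get))
print("top authority:", max(authorities, key=authorities.get))
print("modules:", [sorted(m) for m in modules])
```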

Dynamic analysis. Tracking changes in the knowledge graph as the model is updated or expanded, which lets us understand how the model evolves over time and which nodes are the most stable or change the most (a sketch follows this list):

  1. Change in node centrality over time. Track how the importance of concepts changes as the knowledge graph is updated.

  2. The appearance and disappearance of connections. Identifying new or broken connections between concepts that may signal their changing significance.
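A hedged sketch of such tracking, comparing two invented snapshots of the reconstructed graph:

```python
import networkx as nx

# Two hypothetical snapshots of the graph at different moments in time.
snapshot_t0 = nx.Graph([("neural network", "transformer"), ("transformer", "attention")])
snapshot_t1 = nx.Graph([("neural network", "transformer"), ("transformer", "attention"),
                        ("transformer", "tokenizer"), ("attention", "tokenizer")])

c0 = nx.degree_centrality(snapshot_t0)
c1 = nx.degree_centrality(snapshot_t1)

# 1. How the importance of concepts drifts between snapshots.
for node in snapshot_t1:
    print(node, "centrality change:", round(c1[node] - c0.get(node, 0.0), 3))

# 2. Connections that appeared or disappeared.
e0 = {frozenset(e) for e in snapshot_t0.edges()}
e1 = {frozenset(e) for e in snapshot_t1.edges()}
print("new edges:", [tuple(e) for e in e1 - e0])
print("removed edges:", [tuple(e) for e in e0 - e1])
```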

Integration with other sources and subject areas. Combining the knowledge graph with other information resources, such as documentation, expert knowledge, or external data, to gain a more complete understanding of the model and its context. The following can help here (a sketch follows this list):

  • Path Analysis. Determining the shortest paths between pairs of concepts. This allows us to discover indirect connections that may be implicit. Analyzing higher-level paths that connect distant concepts through intermediate ones can also reveal hidden semantic relationships.

  • Cluster analysis. Apply clustering techniques such as hierarchical clustering or k-means to identify groups of strongly related concepts. Discover clusters that may unite seemingly disparate concepts, revealing hidden connections between them.

  • Common Neighbors Analysis. Identifying concepts that share a large number of common links with other nodes. Such concepts with a large number of common neighbors may be indicators of hidden relationships between them.

  • Structural Equivalence Analysis. Identifying concepts that have similar patterns of relationships. Structurally equivalent concepts can indicate hidden similarities and relationships between them.

  • Latent Semantic Analysis. Applying latent semantic analysis methods such as SVD or topic modeling can help discover hidden semantic relationships between concepts based on co-occurrence in context.
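A few of these techniques (shortest paths, common neighbours, and a simple clustering via label propagation; LSA is omitted here) can be sketched with NetworkX on the same invented fragment:

```python
import networkx as nx
from networkx.algorithms.community import label_propagation_communities

G = nx.Graph([
    ("neural network", "transformer"), ("transformer", "attention"),
    ("transformer", "tokenizer"), ("tokenizer", "text preprocessing"),
    ("training data", "model weights"), ("neural network", "model weights"),
])

# Path analysis: an indirect route between two distant concepts.
print(nx.shortest_path(G, "training data", "attention"))

# Common-neighbours analysis: shared context hints at a hidden relationship.
print(sorted(nx.common_neighbors(G, "attention", "tokenizer")))

# Simple cluster analysis (label propagation chosen purely for illustration).
print([sorted(c) for c in label_propagation_communities(G)])
```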

Based on this, here is a rough list of tools; some of them have the stated properties, some do not, but all of them are actively used:

  1. Graphviz offers flexible options for creating graph diagrams. It supports various output formats and can generate high-quality images. It also provides a variety of graph layouts, advanced appearance customization, and support for the Python, JavaScript, Java, C/C++, Perl and Ruby programming languages. Clustering and subgraphs are available, along with export to SVG, PDF, PS, PNG and JPEG, and an advanced plugin system (a small example follows this list).

  2. Cytoscape specializes in biological and medical graphs, but its functionality can be useful for other types as well. It offers rich capabilities for graph visualization and analysis. It is a fairly powerful system, supporting a variety of layouts and compositions (force-directed, hierarchical, circular, etc.) with a developed system of scaling, panning, and focusing. The algorithmic components include:

    • Interactivity with the ability to navigate, select and filter elements.

    • Calculation of topological characteristics (node degrees, clustering, centrality, etc.).

    • Detecting communities and modules in a graph.

    • Finding paths, calculating distances and connectivity between nodes.

    • Analysis of the dynamics of graph changes over time.

    • A vast ecosystem of plugins that provide additional capabilities.

    • Possibility of developing your own plugins in Java.

    • Integration with data analysis, machine learning and bioinformatics tools.

  3. TensorFlow Graph Neural Networks (TF-GNN). A framework developed by Google for building and training neural network models on graphs. It allows deep learning methods to be used to analyze the structure and properties of a graph, including identifying hidden connections. It includes a wide range of algorithms, such as GraphSAGE, GAT and GCN.

  4. NetworkX. A Python library for creating, manipulating, and studying the structure, dynamics, and functions of complex networks. It provides tools for graph visualization, topology analysis, community detection, centrality metrics, and tracking graph evolution (a drawing example follows this list).

  5. Apache Spark GraphX. A library for graph data analysis built into the Apache Spark ecosystem. It supports a wide range of graph operations, such as centrality calculation, shortest path search, clustering, etc., and achieves high performance through distributed computing. It supports various graph layout algorithms, such as spring layout, Fruchterman-Reingold and circular layout, and pathfinding algorithms such as Dijkstra, Bellman-Ford and Floyd-Warshall. In addition, the library can generate random graphs using the Erdős–Rényi, Barabási–Albert and Watts–Strogatz models.
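To give a feel for the first and fourth items, here are two small, hedged sketches. The first uses Graphviz through its Python binding (the graphviz package; node names and relation labels are invented) to lay out a few typed links and export them to SVG:

```python
from graphviz import Digraph  # pip install graphviz, plus the Graphviz binaries

dot = Digraph("knowledge_fragment", format="svg")
dot.attr(rankdir="LR")  # left-to-right layout

dot.edge("neural network", "transformer", label="generalizes")
dot.edge("transformer", "attention", label="uses")
dot.edge("training data", "model weights", label="affects")

dot.render(directory=".", cleanup=True)  # writes knowledge_fragment.gv.svg
print(dot.source)                        # the generated DOT text
```

The second draws a small graph with NetworkX and matplotlib, sizing nodes by degree so that potential hubs stand out:

```python
import matplotlib.pyplot as plt
import networkx as nx

G = nx.Graph([
    ("neural network", "transformer"), ("transformer", "attention"),
    ("transformer", "tokenizer"), ("training data", "model weights"),
    ("neural network", "model weights"),
])

pos = nx.spring_layout(G, seed=42)                 # force-directed layout
sizes = [300 + 600 * d for _, d in G.degree()]     # node size proportional to degree

nx.draw(G, pos, with_labels=True, node_size=sizes, node_color="lightsteelblue")
plt.savefig("knowledge_graph.png", dpi=150)
```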

The latest and most interesting development is the GraphRAG methodology (article). It uses a two-step approach to build a graph-based text index: first, an entity knowledge graph is built from the source documents, then community summaries are pre-generated for all groups of closely related entities. Given a query, each community summary is used to create a partial answer, after which all partial answers are summarized once more into a final answer for the user.

For the class of global sense-making questions over datasets in the 1 million token range, GraphRAG was shown to provide significant improvements over naive RAG in both the completeness and diversity of the generated answers. In other words, it tries to make sense of answers across the widest possible range of a document set, and the following workflow is proposed (a simplified code sketch follows the workflow and figure below):

  1. Source Documents → Text Chunks. Input texts are prepared for processing and then split into chunks for LLM queries. Longer chunks require fewer LLM calls but suffer from the recall degradation typical of long context windows, so the impact of chunk size on extraction quality has to be assessed; the focus is the balance between recall and precision of the extracted information.

  2. Text Chunks → Element Instances. Entities and their relationships are identified and extracted from each chunk, with the prompt adapted to the subject area. By default, statements about the detected entities are also extracted, including subject, object, type and description, with additional extraction passes to fill in gaps.

  3. Element Instances → Element Summaries. Summarization is performed at the instance level: the extracted instances are condensed into a descriptive summary for each graph element.

  4. Element Summaries → Graph Communities. The index created in the previous step can be treated as a graph with normalized edge weights, and community detection algorithms can be used to partition it into strongly connected communities. The Leiden algorithm is then used to recover the hierarchical community structure of large-scale graphs. The resulting hierarchy supports a divide-and-conquer approach to global summarization, with each level of the hierarchy providing a community partition that covers all nodes of the graph.

  5. Graph Communities → Community Answers. Summaries are then created for each community; they help in understanding the global structure and semantics of the data, are useful for making sense of the data without running queries, and can be used to highlight common themes at different levels of the hierarchy.

  6. Community Answers → Global Answer.

    Fig. 1. Graph communities discovered using the Leiden algorithm (Traag et al., 2019) on the indexed MultiHop-RAG dataset (Tang and Yang, 2024). Circles represent entity nodes with size proportional to their degree. Node placement was performed using OpenORD (Martin et al., 2011) and Force Atlas 2 (Jacomy et al., 2014).
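To tie the workflow together, below is a deliberately simplified, hedged skeleton of the indexing side of such a pipeline. It is not GraphRAG's actual code: llm() is a placeholder for whatever model client is used, extract_entities() stands in for the LLM extraction prompt of step 2, element summaries are skipped, and Louvain community detection is used as a stand-in for the Leiden algorithm.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

def llm(prompt: str) -> str:
    """Placeholder for an LLM call (steps 2, 3 and 5 of the workflow)."""
    raise NotImplementedError

def chunk(text: str, size: int = 600) -> list[str]:
    # 1. Source Documents -> Text Chunks.
    return [text[i:i + size] for i in range(0, len(text), size)]

def extract_entities(piece: str) -> list[tuple[str, str, str]]:
    # 2. Text Chunks -> Element Instances: (subject, relation, object) triples.
    # In practice this is an LLM extraction prompt; here it is a stub.
    return []

def build_index(documents: list[str]) -> dict:
    graph = nx.Graph()
    for doc in documents:
        for piece in chunk(doc):
            for subj, rel, obj in extract_entities(piece):
                graph.add_edge(subj, obj, relation=rel)

    # 4. Element Summaries -> Graph Communities (Louvain as a stand-in for Leiden).
    communities = louvain_communities(graph) if graph.number_of_nodes() else []

    # 5. Graph Communities -> Community Answers: pre-generate one summary per community.
    summaries = [llm(f"Summarize the related concepts: {sorted(c)}") for c in communities]
    return {"graph": graph, "communities": communities, "summaries": summaries}
```

At query time the pre-generated community summaries would each produce a partial answer, which is then reduced into the final global answer (step 6).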

Key features include support for mission-critical information discovery and analysis use cases where the data needed to derive useful insights spans multiple documents, is noisy, contains unreliable and/or misleading information, or where the questions users want to answer are more abstract or topical than those that can be directly answered by the underlying data.

The tool is intended for situations where users are already trained in responsible analytical approaches and are expected to think critically. GraphRAG can provide a high degree of understanding of complex information topics, but a final human review by a subject matter expert is still needed to validate and complement the answers it generates.
