Prompt Me One More Time. We teach LLMs to build knowledge graphs from texts
Over the past few years, large language models (LLMs) have demonstrated impressive potential in natural language processing. They generate text, answer questions, and even write creatively. However, despite their ability to process vast amounts of data and capture context, LLMs often struggle with factual accuracy. This is where knowledge graphs come to their aid: they organize data in a structured form and can accurately model complex relationships between objects.
Knowledge graphs (KGs) are directed relational graphs in which entities are represented as vertices and relationships between entities as edges. A knowledge graph is described by a set of triplets of the form (subject, relation, object), or (s, r, o). Such graphs provide a structured representation of facts about both real-world objects and abstract concepts.
However, keeping knowledge graphs up to date requires a significant amount of manual labor. For example, Wikidata, one of the largest open knowledge graphs, contains more than 1.54 billion elements and is maintained by the collaborative effort of volunteers.
At the same time, a huge amount of information today is stored in unstructured text: books, articles, news reports, blog posts, and social media. This creates a clear demand for automating the construction and population of knowledge graphs with information extracted from such texts. Automation would go a long way toward keeping KGs up to date while greatly reducing the burden on the community that maintains them.
Unfortunately, it is not as simple as we would like. To train a model that automatically populates knowledge graphs with triplets extracted from text, you need datasets, and those still have to be collected manually. Moreover, annotators must be familiar with all the entities and relationships in the graph in order to identify every fact mentioned in the text. For large-scale KGs such as Wikidata, this complicates matters significantly. As a result, there are practically no high-quality datasets for this task, only sparse or noisy ones with low-quality annotations.
Although the main methods for extracting information from text rely on training language models, recent research suggests that graph triplets can be extracted with an LLM using in-context learning and instruction following. These methods can be more cost-effective, since there is no longer a need to train models and, consequently, to collect large annotated datasets for them.
Despite this alternative, there are few papers on applying LLMs to the problem of constructing knowledge graphs. On the contrary, previous works either claim that LLMs are unable to extract structured information from text, or do not evaluate the quality of the constructed graph itself, as in GraphRAG, for example. My colleagues and I held a different opinion and decided to prove it in practice.
Pipeline and data
To extract triplets from text with an LLM, we propose a method consisting of several steps.
The first stage is the extraction of candidate triplets. These candidates may include entity and relation names that do not conform to the formats used in the Wikidata KG. Therefore, in the second stage, the candidates are adjusted based on similar entities and relations that already exist in the KG. In the last stage, we check the normalized triplets and filter out those that do not conform to the KG ontology; this involves verifying the relation constraints and ensuring that the entity types are compatible with them. For all steps except the last one, we used the OpenAI gpt-3.5-turbo model. The resulting approach is called Prompt Me One More Time.
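Schematically, the whole pipeline is a composition of three functions. Below is a minimal sketch of how the stages could be wired together; the stage implementations (the LLM calls, the similarity search, the ontology check) are passed in as callables and are assumptions for illustration rather than our actual code.

```python
from typing import Callable, Iterable

Triplet = tuple[str, str, str]  # (subject, relation, object)

def extract_kg_triplets(
    text: str,
    extract_candidates: Callable[[str], Iterable[Triplet]],  # stage 1: few-shot LLM call
    normalize: Callable[[Triplet, str], Triplet],            # stage 2: map to canonical KG names
    is_ontology_valid: Callable[[Triplet], bool],            # stage 3: relation-constraint check
) -> list[Triplet]:
    """Sketch of a three-stage extract -> normalize -> verify pipeline."""
    candidates = extract_candidates(text)
    normalized = (normalize(t, text) for t in candidates)
    return [t for t in normalized if is_ontology_valid(t)]
```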
As mentioned above, there are practically no high-quality datasets for evaluating triplet extraction from text. Working with existing datasets would give a highly misleading estimate of model performance. Therefore, we used SynthIE, a synthetic but higher-quality dataset created by Martin Josifoski and his colleagues. These, by the way, are the same authors who argued that LLMs cannot solve the information extraction problem. They considered the task asymmetric: according to their findings, LLMs can generate text from KG triplets, but not KG triplets from text.
Let's look at each step in more detail.
Step 1: Extract candidates from text
In the first step, the LLM extracts candidate triplets from the input text. In the prompt, we described the essence of the task and used three demonstration examples from the training split of the Wiki‑cIE Code dataset.
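The exact prompt is not reproduced here; a simplified illustration of its structure (the wording and the single demonstration below are ours, not taken from Wiki‑cIE Code) might look like this:

```python
# Illustrative first-stage prompt; the wording and the demonstration are
# hypothetical, not the actual prompt or training examples from Wiki-cIE Code.
EXTRACTION_PROMPT = """\
Extract all (subject, relation, object) triplets mentioned in the text.

Text: "Breakfast at Tiffany's" is a 1961 film directed by Blake Edwards.
Triplets: ("Breakfast at Tiffany's", "director", "Blake Edwards")

Text: {input_text}
Triplets:"""
```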
At this stage, for example, from the text
“Tahiti Honey” is a film written by Frederik Kohnner
the LLM extracts the triplet (“Tahiti Honey”, “written by”, “Frederik Kohnner”).
However, the subject, object, and relation in the extracted triplet are not necessarily normalized, i.e., they may not correspond to the names used in the original KG. In the example above, the relation “written by” and the misspelled surname “Kohnner” are such non-normalized names.
Step 2: Triplet Normalization
To solve the problem of non-normalized names in a triplet, we used a two-stage prompting strategy. After extracting information from the text in the first stage, we used a FAISS search index to find canonical names from the Wikidata KG, ranked by cosine distance to the entities and relations in the extracted triplets. The index was built using pre-trained Contriever embeddings, which have demonstrated strong performance in various retrieval scenarios.
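A minimal sketch of such an index is shown below. It assumes the facebook/contriever checkpoint from Hugging Face and a toy list of relation labels; the real index is built over the full sets of Wikidata entity and relation names.

```python
import faiss
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

# Sketch: build a FAISS index over canonical KG labels using Contriever
# embeddings and retrieve the most similar labels by cosine similarity.
tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
model = AutoModel.from_pretrained("facebook/contriever")

def embed(texts: list[str]) -> np.ndarray:
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    mask = inputs["attention_mask"].unsqueeze(-1)
    emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # mean pooling over tokens
    emb = emb.numpy().astype("float32")
    faiss.normalize_L2(emb)  # with unit vectors, inner product == cosine similarity
    return emb

labels = ["screenwriter", "lyrics by", "produced by", "author"]  # toy relation labels
index = faiss.IndexFlatIP(model.config.hidden_size)
index.add(embed(labels))

def top_k(query: str, k: int = 3) -> list[str]:
    _, ids = index.search(embed([query]), k)
    return [labels[i] for i in ids[0]]

print(top_k("written by"))  # candidate canonical names for the extracted relation
```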
Given the top 5 retrieved canonical entity and relation names similar to those extracted in the first step, the prompt to the LLM describes the task of selecting the names that fit the context of the text and the triplet itself.
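The exact wording is again omitted; an illustrative template of the second-stage prompt (the field names are ours) could look like this:

```python
# Illustrative second-stage prompt template; the wording is hypothetical.
REFINEMENT_PROMPT = """\
Text: {text}
Extracted triplet: {triplet}
Candidate canonical relation names: {relation_candidates}
Candidate canonical subject names: {subject_candidates}
Candidate canonical object names: {object_candidates}

Pick the canonical subject, relation and object names that best match the text
and return the corrected triplet in the form (subject, relation, object)."""
```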
For example, for the triplet above, the second step retrieves 5 canonical names ranked by similarity for the extracted relation, subject, and object, respectively:
“written by”: [“lyrics by”, “adapted by”, “produced by”, “screenwriter”, “author”],
“Tahiti Honey”: [“Tahiti Honey”, “Honey”, “Honey Chile”, “Celtic Honey”, “Tahitipresse”],
“Frederik Kohnner”: [“Frederick Kohner”, “Paul Kohner”, “Adolf Kohner”, “Susan Kohner”, “Henry Rohner”].
As a result, the triplet corrected by the LLM in the second step looks like this:
(“Tahiti Honey”, “screenwriter”, “Frederick Kohner”).
Step 3: Verify the Logical Consistency of Extracted Triplets
To make the method more robust to possible LLM hallucinations, we introduced automatic verification of the generated triplets against the KG ontology as a final step. An ontology is a semantic model for representing knowledge in a particular domain: it defines the types of entities in that domain and the constraints that govern how these entities can interact through relationships.
The output triplets from the second step were checked against rules predefined in Wikidata. Each rule is attached to a particular relation and determines what types of entities the relation can connect as subject and object in a triplet.
Next, the classes of the subject and object in the generated triplet were retrieved from the Wikidata query service by following the “instance of” (P31) and “subclass of” (P279) properties up to the top of their subclass hierarchy.
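A minimal sketch of such a lookup against the public Wikidata SPARQL endpoint might look like this (the helper name is ours, and the example call is only indicative):

```python
import requests

WDQS_URL = "https://query.wikidata.org/sparql"

def entity_classes(qid: str) -> set[str]:
    """Return all classes of an entity: 'instance of' (P31) followed by any
    chain of 'subclass of' (P279) links up the hierarchy."""
    query = f"""
    SELECT DISTINCT ?class WHERE {{
      wd:{qid} wdt:P31/wdt:P279* ?class .
    }}"""
    resp = requests.get(WDQS_URL, params={"query": query, "format": "json"},
                        headers={"User-Agent": "kg-triplet-checker/0.1"})
    resp.raise_for_status()
    bindings = resp.json()["results"]["bindings"]
    return {b["class"]["value"].rsplit("/", 1)[-1] for b in bindings}

# Example (Q42 is Douglas Adams): the returned set includes Q5, "human".
# print(entity_classes("Q42"))
```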
A triplet is considered valid if the types of its subject and object match the types specified in the rules for the relation in that triplet.
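The check itself then reduces to a set intersection. Below is a sketch under the assumption that the relation constraints have already been collected into a dictionary; its structure and the sample rule are our own simplification, not Wikidata's native constraint format.

```python
from typing import Callable

# Hypothetical rule format: allowed subject/object class sets per relation.
RELATION_RULES = {
    "screenwriter": {
        "subject_types": {"film", "television series", "written work"},
        "object_types": {"human"},
    },
}

def is_valid(triplet: tuple[str, str, str],
             classes_of: Callable[[str], set[str]]) -> bool:
    subject, relation, obj = triplet
    rule = RELATION_RULES.get(relation)
    if rule is None:
        return False  # unknown relation: reject the triplet
    return bool(classes_of(subject) & rule["subject_types"]) and \
           bool(classes_of(obj) & rule["object_types"])
```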
Thus, the logical soundness of the extracted information was ensured. An example of the verification process is shown schematically in the figure.
Experiments
To assess the contribution of each step of the proposed method, we compared approaches that use only some of the steps. The table below shows the performance of the full and partial triplet-extraction methods.
What conclusions did we draw based on this test:
The most important part of the method is providing the LLM with representative example triplets. The two-step method with demonstrations and triplet verification achieves a five-times-better F1 score than the same method without demonstrations.
Adding the triplet refinement step significantly improves recall compared to the single-query approach.
In turn, the verification stage is needed to keep precision after triplet refinement comparable to the precision obtained after the first stage alone.
In the single-query approach, verification does not bring a significant improvement. This is because after the first step all triplets whose names are not in the KG are automatically filtered out, leaving only those that were mentioned in the text precisely enough for the LLM to identify them. In the second step, however, the LLM may not know the specific KG ontology well enough to compose new triplets with the exact names of entities and relations that were mentioned only loosely in the text, so we used post-processing filtering to maintain higher precision without sacrificing recall.
Next, we compared our method with the existing SOTA. The left side of the table below shows the results of our method based on step-by-step LLM queries, compared with SynthIE T5-large, the best model trained on the SynthIE dataset by the same Josifoski and colleagues.
Although the two-stage approach scores lower than the pretrained model, this is explained by the quality of the dataset used to train and evaluate that model.
While manually reviewing the synthetic dataset, we noticed frequent inconsistencies between the texts and the triplets in the annotation. To demonstrate these shortcomings, we selected 100 random examples from WikicIE-test-small. Using the reference entities from which each text in the original article was generated, we manually matched them with natural-language Wikipedia paragraphs about the same entities. This gave us a set of natural-language texts that mention the KG entities from the original dataset.
Next, both the model trained on synthetic data and our LLM-based method were used to extract triplets from these texts, and the outputs of both were manually evaluated by three annotators. The results of the manual assessment are presented on the right side of the table.
We see that in the human evaluation the LLM-based method significantly outperforms SynthIE T5-large. The pre-trained model tends to generate triplets that are irrelevant to the text, reproducing memorized triplets instead of extracting the relevant ones from the context.
In addition, the correct triplets predicted by the two methods differ noticeably from each other. The intersection-over-union (IoU, or Jaccard index) of the two prediction sets shows that only 8% of correct triplets are predicted by both methods. This suggests that each model has its own specialization and is strong in different aspects of information extraction, so their predictions could be combined for further gains in overall performance.
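For clarity, the overlap metric is simply the Jaccard index computed over the two sets of correct triplets; a toy illustration (the triplets below are made up) is shown here.

```python
# Toy illustration of the overlap metric (triplets here are made up).
ours = {("Tahiti Honey", "screenwriter", "Frederick Kohner"),
        ("Tahiti Honey", "instance of", "film")}
synthie = {("Tahiti Honey", "screenwriter", "Frederick Kohner"),
           ("Tahiti Honey", "genre", "musical film")}

iou = len(ours & synthie) / len(ours | synthie)
print(f"IoU = {iou:.2f}")  # 1 shared triplet out of 3 distinct ones -> 0.33
```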
So what next?
In summary, we have shown that LLMs, combined with a two-step strategy of triplet correction and ontology verification, are a promising alternative to trained models for extracting triplets from text, despite past claims that LLMs are unsuited to the task. Moreover, our method performs better on non-synthetic data than the model trained on the SynthIE dataset. Along the way, we found that the SynthIE dataset, previously considered the highest-quality benchmark for evaluating triplet extraction from text, has a number of unpleasant shortcomings.
But there is still a lot of work ahead. We recognize that the effectiveness of the LLM+KG combination is domain-sensitive: if the domain of the input text differs greatly from what the model saw during training, domain adaptation may be required. Even so, some specialized models may still be more accurate than ours. Finally, a well-formulated prompt plays a big role here.