Finding synonyms for terms using Wikidata (Python)


Wikidata screenshot

Many services can suggest synonyms, but they rarely cope with terms that contain more than one word. For such multi-word expressions, Wikidata can help. Alongside the familiar Wikipedia, the Wikimedia Foundation maintains a less widely known database called Wikidata, which is a knowledge graph. It is now integrated into Wikipedia itself, so for many articles the left-hand menu contains a "Wikidata item" link.

Wikidata follows the RDF model: information is stored as triples that describe an entity. Each triple is a statement of the form subject – predicate – object. For the entity England, for example, one such triple is: England – has its capital – London.

One of the predicates (link types) is altLabel, written skos:altLabel in queries. It holds the alternative names of an entity, and these are what we will use as synonyms.
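To make the triple model concrete, here is a minimal sketch of a toy in-memory triple list with a pattern matcher in which None plays the role of a query variable. The `Triple` class, the `match` helper and the sample values are illustrative assumptions: the strings are plain labels typed by hand, not data fetched from Wikidata.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Toy values for illustration only; they were not fetched from Wikidata.
TRIPLES = [
    Triple("England", "capital", "London"),
    Triple("United Kingdom", "altLabel", "UK"),
    Triple("United Kingdom", "altLabel", "Britain"),
]


def match(triples: List[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# "Which alternative names does United Kingdom have?"
aliases = [t.obj for t in match(TRIPLES, subject="United Kingdom", predicate="altLabel")]
```

Fixing the subject and the predicate and leaving the object open is the same shape of question that the synonym search asks of the real knowledge base.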

Keep in mind that Wikidata is a very extensive knowledge base, but it is not complete. No synonyms will be found for terms that have no entity there, or whose entity has no alternative names entered.

Finding an item in the knowledge base

The first step is to find the Wikidata entity that represents the given term, that is, its unique identifier (Q_id), which has the form Q followed by digits. This can be done by sending a search request to the Wikidata API.

Full API documentation can be found at https://www.mediawiki.org/wiki/API:Main_page

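import requests

session = requests.Session()
URL = 'https://www.wikidata.org/w/api.php'

def wbgetentities(name):
    """Return the Q_id of the first search hit for name, or None if there is none."""
    # Despite the function's name, the API action used here is wbsearchentities.
    res = session.post(URL, data={
        'action': 'wbsearchentities',
        'search': name,
        'language': 'ru',
        'format': 'json',
    })
    try:
        return res.json()['search'][0]['id']
    except (KeyError, IndexError, ValueError):
        # No hits, an API error payload, or a response that is not JSON.
        return None

term = 'Англия'  # example term ("England"); replace with your own
Q_id = wbgetentities(term)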
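Taking the first hit is convenient, but for ambiguous terms it may not be the intended entity. The sketch below lists several candidates with their labels and descriptions so that one can be chosen by eye. The helper name `list_candidates` and the sample response are assumptions made for illustration; the sample is hand-written in the shape of a wbsearchentities response, and its values are not real API output.

```python
def list_candidates(search_json, limit=5):
    """Return (id, label, description) tuples from a wbsearchentities response."""
    candidates = []
    for hit in search_json.get("search", [])[:limit]:
        # label and description can be absent for some hits, so fall back to "".
        candidates.append((hit["id"], hit.get("label", ""), hit.get("description", "")))
    return candidates


# Hand-written sample with illustrative values, shaped like the API response.
sample_response = {
    "search": [
        {"id": "Q21", "label": "England", "description": "illustrative description"},
        {"id": "Q999999999", "label": "England (hypothetical second hit)"},
    ]
}

for qid, label, description in list_candidates(sample_response):
    print(qid, label, description, sep=" | ")
```

In real use, the argument would be the parsed JSON of the search request shown above, for example `res.json()`.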

Search for synonyms

Synonyms are retrieved with SPARQL, a query language for RDF data. It lets us request the alternative names of our entity through the altLabel predicate.

The sparql-client library is used to send the SPARQL queries.

import sparql

def create_query(first_id):
    """Build a SPARQL query that selects the Russian alternative labels of an entity."""
    q = ('''
    PREFIX entity: <http://www.wikidata.org/entity/>
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    SELECT DISTINCT ?syno
    WHERE {
      VALUES ?id {entity:''' + first_id + '''}
      ?id skos:altLabel ?syno .
      FILTER (lang(?syno) = 'ru')
    }''')
    return q

synonyms = []
if Q_id is not None:  # skip the query when the search step found no entity
    query = create_query(Q_id)
    result = sparql.query('https://query.wikidata.org/sparql', query)
    for r in result:
        values = sparql.unpack_row(r)
        if values[0] not in synonyms:
            synonyms.append(values[0])

print(synonyms)
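As a variation, the same lookup can be sketched without the sparql-client dependency by asking the query service for JSON results over plain HTTP. The sketch also validates the identifier and language code before placing them in the query text. The function names, the two regular expressions and the `session` parameter (expected to be a `requests.Session` such as the one created earlier) are assumptions of this sketch, not part of the original code.

```python
import re

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"
QID_PATTERN = re.compile(r"Q[1-9]\d*")
LANG_PATTERN = re.compile(r"[a-z][a-z0-9-]*")


def build_alias_query(qid, lang="ru"):
    """Build the altLabel query, refusing values that are not a plain Q_id or language code."""
    if not QID_PATTERN.fullmatch(qid or ""):
        raise ValueError(f"not a Wikidata item id: {qid!r}")
    if not LANG_PATTERN.fullmatch(lang or ""):
        raise ValueError(f"not a language code: {lang!r}")
    return (
        "PREFIX wd: <http://www.wikidata.org/entity/>\n"
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
        "SELECT DISTINCT ?syno WHERE {\n"
        f"  wd:{qid} skos:altLabel ?syno .\n"
        f"  FILTER (lang(?syno) = '{lang}')\n"
        "}"
    )


def parse_bindings(result_json, var="syno"):
    """Extract plain string values from a SPARQL JSON results document."""
    return [b[var]["value"] for b in result_json["results"]["bindings"] if var in b]


def fetch_synonyms(session, qid, lang="ru"):
    """Run the query through the given HTTP session and return the list of aliases."""
    res = session.get(SPARQL_ENDPOINT, params={
        "query": build_alias_query(qid, lang),
        "format": "json",
    })
    res.raise_for_status()
    return parse_bindings(res.json())
```

With the session from the first snippet, a call would look like `fetch_synonyms(session, Q_id)`. Keeping query building and result parsing in separate functions makes both easy to check without network access.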

Thus, the code returns a list of synonyms for the term, if Wikidata has any. To find synonyms in another language, replace the code 'ru' in both the API request and the SPARQL query with another language code from the list at https://www.wikidata.org/wiki/Help:Wikimedia_language_codes/lists/all
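Since the language code appears in more than one place, it may be easier to pass it as a parameter. The sketch below does that and, as an alternative to SPARQL, reads the aliases through the API's wbgetentities action with props=aliases. The function names and the `session` argument (expected to be a `requests.Session` like the one used for the search step) are assumptions of this sketch.

```python
API_URL = "https://www.wikidata.org/w/api.php"


def extract_aliases(entities_json, qid, lang):
    """Pull the alias strings for one entity and one language out of the response."""
    entity = entities_json.get("entities", {}).get(qid, {})
    return [alias["value"] for alias in entity.get("aliases", {}).get(lang, [])]


def fetch_aliases(session, qid, lang="ru"):
    """Request only the aliases of the entity in the given language."""
    res = session.get(API_URL, params={
        "action": "wbgetentities",
        "ids": qid,
        "props": "aliases",
        "languages": lang,
        "format": "json",
    })
    res.raise_for_status()
    return extract_aliases(res.json(), qid, lang)
```

For example, `fetch_aliases(session, Q_id, "en")` would ask for English aliases. An entity with no aliases in that language simply yields an empty list.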
