A simple way to write Wikidata bots
Introduction:
Sulab has proven its love for Wikipedia by creating the GeneWiki[1]. Wikipedia is primarily a collection of free-text pages, which can also contain some structured data in the form of infoboxes. To improve the handling and representation of structured data in the MediaWiki universe, Wikidata was conceived and rolled out to the community in late 2012. It has now reached an advanced stage, including its integration with Wikipedia and the availability of query tools, so it is ready to be populated with biological data as well.
Wikidata is a document-oriented database consisting primarily of items and their properties. Items can be created by any user, whereas new properties must be proposed to the community and are approved only after community discussion and consensus. Wikidata items also carry labels, descriptions and links to other MediaWiki projects in different languages. Furthermore, property values should have references stating the source of the data, and they can have qualifiers that state more precisely under which conditions a certain value is valid. Each property has a certain data type, e.g. string, date, Wikidata item ID or quantity. Information on the Wikidata data model can be found here.
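The data model described above can be pictured with a few plain Python classes. This is only an illustrative sketch, not the actual Wikidata or PBB_Core implementation: the class and field names (PropertyValue, Statement, Item) are made up for this example, and the values are taken from the bot examples in this post.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PropertyValue:
    """A single property/value pair, used here for references and qualifiers."""
    prop_nr: str   # property number, e.g. 'P248' (stated in)
    value: str
    datatype: str  # e.g. 'string', 'wikibase-item', 'time', 'quantity'


@dataclass
class Statement:
    """A property value on an item, with optional references and qualifiers."""
    prop_nr: str
    value: str
    datatype: str
    references: List[List[PropertyValue]] = field(default_factory=list)
    qualifiers: List[PropertyValue] = field(default_factory=list)


@dataclass
class Item:
    """An item with labels and descriptions per language, plus its statements."""
    labels: Dict[str, str] = field(default_factory=dict)
    descriptions: Dict[str, str] = field(default_factory=dict)
    statements: List[Statement] = field(default_factory=list)


# One referenced statement on one item (hypothetical in-memory objects only)
stated_in = PropertyValue(prop_nr='P248', value='Q1122544', datatype='wikibase-item')
drugbank_id = Statement(prop_nr='P715', value='08881', datatype='string',
                        references=[[stated_in]])
item = Item(labels={'en': 'Vemurafenib'}, statements=[drugbank_id])
```

A reference is modelled as a list of property/value pairs because a single source is often described by more than one property; a statement can in turn carry several such references.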
As with Wikipedia, Wikidata can be edited by anybody, making it the Wikipedia for structured data. Editing can be done either through the Wikidata website or through the MediaWiki API, which has been extended specifically for use with structured data. This Wikidata API is a RESTful API, allowing queries to be composed as URLs. To use the Wikidata API efficiently, the API calls should be made from a program, also known as a bot. For this, the MediaWiki community offers a set of language-specific, low-level API wrappers, e.g. pywikibot, a MediaWiki/Wikidata API wrapper written in Python.
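To illustrate what composing a query as a URL looks like, the following minimal sketch builds a read request for the wbgetentities action of the Wikidata API. It only constructs the URL and sends nothing over the network; the helper name build_get_entities_url is an assumption made for this example and is not part of any library.

```python
from urllib.parse import urlencode

API_ENDPOINT = 'https://www.wikidata.org/w/api.php'


def build_get_entities_url(item_ids, languages=('en',)):
    """Return a URL that asks the API for the given items as JSON."""
    params = {
        'action': 'wbgetentities',          # read one or more entities
        'ids': '|'.join(item_ids),          # several IDs are separated by '|'
        'languages': '|'.join(languages),   # restrict labels/descriptions
        'format': 'json',
    }
    # urlencode takes care of escaping, e.g. '|' becomes '%7C'
    return API_ENDPOINT + '?' + urlencode(params)


url = build_get_entities_url(['Q12140'])
print(url)
```

Pasting the printed URL into a browser returns the item as JSON. Write requests additionally need a login and an edit token, which is one reason to use a wrapper instead of hand-built URLs.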
The PBB_Core:
When writing a bot for Wikidata, simple low-level functions for reading and writing items are not sufficient for more complex tasks like importing all human genes or diseases. It is crucial that an existing item is updated rather than duplicated, so the bot first needs to search for the item and, if it is found, update it. Biological data may stem from several distributed sources, so the data also needs to be aggregated and prepared in a form that can be written to Wikidata. We tackled this problem by creating an API termed PBB_Core. PBB_Core is essentially a collection of Python classes which take any type of structured data prepared for Wikidata and write it to Wikidata. The preparation of the data (e.g. mapping to the correct identifiers, addition of qualifiers and references) needs to happen in the resource-specific part of the bot, while searching and writing happen in the resource-independent part, the PBB_Core.
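The search-then-update-or-create logic can be sketched with a toy in-memory store standing in for Wikidata. This is not how PBB_Core is implemented; the store, the toy item IDs and the function names find_item and upsert_item are assumptions made purely to show the control flow.

```python
import itertools

# Toy stand-in for Wikidata: item ID -> label and claims (prop_nr -> value)
store = {
    'Q1': {'label': 'Vemurafenib', 'claims': {'P715': '08881'}},
}
_new_ids = itertools.count(2)  # numbers for newly created toy items


def find_item(prop_nr, value):
    """Return the ID of the item carrying this identifier, or None."""
    for item_id, item in store.items():
        if item['claims'].get(prop_nr) == value:
            return item_id
    return None


def upsert_item(label, core_prop, core_value, claims):
    """Update the item identified by core_prop/core_value, or create it."""
    item_id = find_item(core_prop, core_value)
    if item_id is None:
        # Not found: create a new item that carries the identifier
        item_id = 'Q{}'.format(next(_new_ids))
        store[item_id] = {'label': label, 'claims': {core_prop: core_value}}
    # Found or freshly created: add or overwrite the remaining claims
    store[item_id]['claims'].update(claims)
    return item_id


# The existing item is found via its identifier and updated in place
first = upsert_item('Vemurafenib', 'P715', '08881', {'P31': 'Q12140'})
# An unknown identifier leads to a new item instead
second = upsert_item('Example drug', 'P715', '99999', {'P31': 'Q12140'})
```

Running the same import twice leaves the store unchanged after the first run, which is the property that prevents duplicate items.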
Using PBB_Core is simple:
Requirements:
-A Wikidata user account
-A clone of the PBB_Core repository:
git clone https://bitbucket.org/sulab/wikidatabots.git
-An entry for each core property in wd_property_store.py. This is required because only a core set of properties can be used to identify an item in Wikidata for a certain domain of data (e.g. human genes, drugs).
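To make the idea of a core property entry more concrete, here is a purely hypothetical layout for such a store. The real structure of wd_property_store.py may differ, so check the file in the repository; the dictionary name, its keys and the helper core_properties_for are all assumptions made for this sketch.

```python
# Hypothetical layout, NOT the actual content of wd_property_store.py
core_properties = {
    'P715': {'name': 'Drugbank ID', 'datatype': 'string', 'domain': ['drugs']},
    'P351': {'name': 'Entrez Gene ID', 'datatype': 'string', 'domain': ['genes']},
}


def core_properties_for(domain):
    """Return the property numbers that may identify an item in this domain."""
    return [prop_nr for prop_nr, entry in core_properties.items()
            if domain in entry['domain']]


print(core_properties_for('drugs'))
```

The point of such a table is that the item search only considers identifiers that are meaningful for the domain being imported.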
An example of a basic bot:
# Import PBB_Core and PBB_login
import PBB_Core
import PBB_login
# Create a login object with Wikidata user credentials (user and pwd must hold your user name and password)
login = PBB_login.WDLogin(user, pwd)
# Create a Wikidata value of a certain data type. The value and the property number are required (P715 is the Drugbank ID property).
item_name = 'Vemurafenib'
value = PBB_Core.WDString('08881', 'P715')
# With one value created, we can try to write to Wikidata. An instance of WDItemEngine is created, which searches for an item with property 'P715' and value '08881'. If found, the item is loaded; otherwise, a new item is created.
wd_item = PBB_Core.WDItemEngine(item_name=item_name, domain='drugs', data=[value])
# The line above only loads the item or creates a new one locally. To write it to Wikidata, an explicit write call needs to be issued.
wd_item.write(login)
All data types currently available in Wikidata are implemented and can be instantiated:
-PBB_Core.WDItemID
-PBB_Core.WDUrl
-PBB_Core.WDCommonsMedia
-PBB_Core.WDString
-PBB_Core.WDMonolingualText
-PBB_Core.WDGlobeCoordinate
-PBB_Core.WDQuantity
-PBB_Core.WDTime
A slightly more complex example using two values and also a reference:
import PBB_Core
import PBB_login
login = PBB_login.WDLogin(user, pwd)
# Create a reference (P248 is the "stated in" property)
ref1 = PBB_Core.WDItemID(value='Q1122544', prop_nr='P248', is_reference=True)
item_name = 'Vemurafenib'
value1 = PBB_Core.WDString(value='08881', prop_nr='P715', references=[[ref1]])
value2 = PBB_Core.WDItemID(value='Q12140', prop_nr='P31')
# Two values and one reference have now been created, and the reference has been attached to value1. Adding qualifiers works the same way.
wd_item = PBB_Core.WDItemEngine(item_name=item_name, domain='drugs', data=[value1, value2])
wd_item.write(login)