Help:Dataset sizing
Purpose
This page aims to list and define a few standard metrics that can be computed on a subset of Wikidata items.
For metrics used elsewhere, it attempts to provide queries that can be run on the Wikidata Query Service.
Version
This is the version as of 20200823075556. Please use the "permanent link" on the left side when quoting this page.
Introduction
Sample queries to select the items:
- sleds:
SELECT ?item WHERE { ?item wdt:P279* wd:Q181388 }
- tennis:
SELECT ?item WHERE { ?item wdt:P641 wd:Q847 }
Knowledge Graphs on the Web -- an Overview (Q86997852) proposes a few metrics:
- a. # instances
- b. # assertions
- c. average linking degree
- d. median ingoing edges
- e. median outgoing edges
- f. # classes
- g. # relations
- h. average depth of class tree
- i. average branching factor of class tree (average width of class tree)
- j. ontological complexity
They are described at "3. Comparison of Knowledge Graphs" in the paper.
Discussion at Wikidata:Request a query#Dataset sizing.
The queries below are mostly based on truthy main statements (wdt:), not on qualifiers (pq:), references (pr:), sitelinks, or labels/descriptions/aliases. Please help expand them or add alternative ways to calculate the metrics.
A few other metrics are included as well.
Basic metrics
number of instances
- definition
- number of distinct items
# a. # instances
SELECT (COUNT(DISTINCT ?item) as ?nb_instance)
WHERE
{
  ?item wdt:P279* wd:Q181388 .     # sleds selection
  # ?item wdt:P641 wd:Q847 .       # alternative: tennis selection
}
number of assertions
# b. # assertions
# TBD: include sitelinks?
SELECT (SUM(?st) as ?nb_assertions)
WITH
{
  SELECT DISTINCT ?item ?st
  WHERE
  {
    ?item wdt:P279* wd:Q181388 .
    # ?item wdt:P641 wd:Q847 .
    ?item wikibase:statements ?st .   # number of statements on the item
  }
} as %a
{
  INCLUDE %a
}
average linking degree
# c. average linking degree
# TBD: include incoming links?
SELECT (AVG(?st) as ?avg_linking_degree)
WITH
{
  SELECT DISTINCT ?item ?st
  WHERE
  {
    ?item wdt:P279* wd:Q181388 .
    # ?item wdt:P641 wd:Q847 .
    ?item wikibase:statements ?st .   # number of statements on the item
  }
} as %a
{
  INCLUDE %a
}
median ingoing edges
# d. median ingoing edges: number of ingoing edges per item
# after the below, calculate the median of ?nb_ingoing_edges
# note: items without any ingoing edge produce no row (there is no count of 0)
SELECT ?item (COUNT(?wdt) as ?nb_ingoing_edges)
WITH
{
  SELECT DISTINCT ?item
  WHERE
  {
    ?item wdt:P279* wd:Q181388 .
    # ?item wdt:P641 wd:Q847 .
  }
} as %a
{
  INCLUDE %a
  # only properties with item datatype
  ?p wikibase:directClaim ?wdt ; wikibase:propertyType wikibase:WikibaseItem .
  [] ?wdt ?item
}
GROUP BY ?item
median outgoing edges
# e. median outgoing edges: number of outgoing edges per item
# after the below, calculate the median of ?nb_outgoing_edges
# alternative method: include external id properties
# note: items without any outgoing edge produce no row (there is no count of 0)
SELECT ?item (COUNT(?wdt) as ?nb_outgoing_edges)
WITH
{
  SELECT DISTINCT ?item
  WHERE
  {
    ?item wdt:P279* wd:Q181388 .
    # ?item wdt:P641 wd:Q847 .
  }
} as %a
{
  INCLUDE %a
  # only properties with item datatype
  ?p wikibase:directClaim ?wdt ; wikibase:propertyType wikibase:WikibaseItem .
  ?item ?wdt []
}
GROUP BY ?item
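The alternative method mentioned in the comments of the query above (also counting external-id properties) could be sketched as follows. This variant is an illustration only and was not part of the original page; it assumes that external identifiers use the property type wikibase:ExternalId and, like the query above, returns no row for items without any matching statement.

```sparql
# e. (variant) number of outgoing edges, also counting external-id properties
# sketch only; the VALUES list is an assumption about which datatypes to count
SELECT ?item (COUNT(?wdt) as ?nb_outgoing_edges)
WITH
{
  SELECT DISTINCT ?item
  WHERE
  {
    ?item wdt:P279* wd:Q181388 .
    # ?item wdt:P641 wd:Q847 .
  }
} as %a
{
  INCLUDE %a
  VALUES ?type { wikibase:WikibaseItem wikibase:ExternalId }
  ?p wikibase:directClaim ?wdt ; wikibase:propertyType ?type .
  ?item ?wdt []
}
GROUP BY ?item
```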
number of relations
# g. # relations
# currently counts properties only; could be expanded to other relations
SELECT (COUNT(DISTINCT ?wdt) as ?nb_relations)
WITH
{
  SELECT DISTINCT ?item
  WHERE
  {
    ?item wdt:P279* wd:Q181388 .
    # ?item wdt:P641 wd:Q847 .
  }
} as %a
{
  INCLUDE %a
  ?p wikibase:directClaim ?wdt .
  # properties used on the items (outgoing) or pointing to them (ingoing)
  { ?item ?wdt [] } UNION { [] ?wdt ?item }
}
number of classes (types)
- definition
- number of distinct values used with instance of (P31) or subclass of (P279)
- query
# f. # classes
SELECT (COUNT(DISTINCT ?class) as ?nb_classes)
WITH
{
  SELECT DISTINCT ?item
  WHERE
  {
    ?item wdt:P279* wd:Q181388 .
    # ?item wdt:P641 wd:Q847 .
  }
} as %a
{
  INCLUDE %a
  ?item (wdt:P31|wdt:P279) ?class
}
Most frequent
most frequently used properties
- definition
- properties most frequently used as main values (truthy values)
- query
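One possible query, sketched in the style of the other queries on this page; it was not part of the original page. It counts truthy main statements per property for the sleds selection. The variable names and the LIMIT of 20 are arbitrary choices for illustration.

```sparql
# most frequently used properties (truthy main statements)
# sketch only; the LIMIT is an arbitrary assumption
SELECT ?p (COUNT(?value) as ?nb_uses)
WITH
{
  SELECT DISTINCT ?item
  WHERE
  {
    ?item wdt:P279* wd:Q181388 .
    # ?item wdt:P641 wd:Q847 .
  }
} as %a
{
  INCLUDE %a
  ?p wikibase:directClaim ?wdt .
  ?item ?wdt ?value
}
GROUP BY ?p
ORDER BY DESC(?nb_uses)
LIMIT 20
```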
most frequent sitelinks
- definition
- most frequently linked WMF sites (Wikipedia, Commons, Wikisource, etc.)
- query
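One possible query, given as a hedged sketch that was not part of the original page. It assumes sitelinks are represented as in the Wikidata RDF export, where each linked page has schema:about pointing to the item and schema:isPartOf pointing to the site. The result gives, per site, the number of linked pages for the selected items.

```sparql
# most frequent sitelinks: number of linked pages per WMF site
# sketch only; relies on the schema:about / schema:isPartOf sitelink modelling
SELECT ?site (COUNT(DISTINCT ?article) as ?nb_sitelinks)
WITH
{
  SELECT DISTINCT ?item
  WHERE
  {
    ?item wdt:P279* wd:Q181388 .
    # ?item wdt:P641 wd:Q847 .
  }
} as %a
{
  INCLUDE %a
  ?article schema:about ?item ;
           schema:isPartOf ?site .
}
GROUP BY ?site
ORDER BY DESC(?nb_sitelinks)
```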
most frequently used classes (types)
- definition
- most frequent values used with instance of (P31) or subclass of (P279). Sometimes limited to P31.
- query
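One possible query, sketched by analogy with the "number of classes" query on this page; it was not part of the original page. It counts, for each class, the selected items that use it as a value of instance of (P31) or subclass of (P279). To limit the count to P31, the property path could be replaced with wdt:P31 alone.

```sparql
# most frequently used classes (types)
# sketch only; replace the property path with wdt:P31 to limit to instance of
SELECT ?class (COUNT(DISTINCT ?item) as ?nb_items)
WITH
{
  SELECT DISTINCT ?item
  WHERE
  {
    ?item wdt:P279* wd:Q181388 .
    # ?item wdt:P641 wd:Q847 .
  }
} as %a
{
  INCLUDE %a
  ?item (wdt:P31|wdt:P279) ?class
}
GROUP BY ?class
ORDER BY DESC(?nb_items)
```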