Wikidata:WikiProject Data Quality

WikiProject on data quality (Q1757694) in Wikidata.

Picture taken at the Workshop on Data Quality Management in Wikidata (Q59426297) at Wikimedia Deutschland.

Motivation

Data quality is central to Wikidata: its goal is to “give more people more access to knowledge”, so the data must fit the needs of its consumers. The Wikidata community has already developed methods and tools that monitor relative completeness (e.g., the Recoin gadget), encourage link validation and correction (e.g., Mix’n’match), and help editors observe recent changes and identify vandalism. The community has also started a global discussion of the relevant dimensions of data quality in a recent RFC. The RFC uses a survey of Linked Data quality methods as its starting point to better describe and categorize quality issues and to add further quality aspects and dimensions, with the goal of developing a data quality framework for Wikidata (RfC:DataQualityFramework).

Motivated by this, we organized a workshop on "Data Quality Management in Wikidata", which took place on January 18, 2019, at Wikimedia Deutschland.

In this workshop, we discussed existing challenges of data quality in Wikidata and derived possible solutions for data quality monitoring and data quality assurance in the context of Wikidata.

Resources

Queries

People where an alias is equal to the label

The following query uses these:

  • Properties: instance of (P31)

    # People (instance of human, Q5) with an English alias identical to the English label
    SELECT ?item ?label WHERE {
      ?item wdt:P31 wd:Q5 .
      ?item rdfs:label ?label .
      FILTER(LANG(?label) = 'en')
      ?item skos:altLabel ?alias .
      FILTER(LANG(?alias) = 'en')
      FILTER(?label = ?alias)
    } LIMIT 100
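
The comparison made by the query can also be illustrated locally. The following is a minimal Python sketch, assuming the labels and aliases of an item have already been loaded into two plain dictionaries keyed by language code; the function name, the data shape, and the sample values are hypothetical and not part of any Wikidata tool.

```python
def aliases_equal_to_label(labels, aliases, lang="en"):
    """Return the aliases in `lang` that exactly duplicate the label in `lang`.

    labels:  {language code: label}            (assumed shape)
    aliases: {language code: [alias, ...]}     (assumed shape)
    """
    label = labels.get(lang)
    if label is None:
        # No label in this language, so nothing can duplicate it.
        return []
    # Exact string comparison, like FILTER(?label = ?alias) on same-language literals.
    return [alias for alias in aliases.get(lang, []) if alias == label]


# Hypothetical item data for illustration only.
labels = {"en": "Douglas Adams", "de": "Douglas Adams"}
aliases = {"en": ["Douglas Adams", "Douglas Noel Adams"]}
print(aliases_equal_to_label(labels, aliases))
```

Any alias returned by this check is redundant and is a candidate for removal from the item.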
    

Labels containing markup

The following query uses these:

  • Properties: instance of (P31)

    # Labels of scholarly articles (Q13442814) containing markup tags
    SELECT DISTINCT ?item ?label
    WHERE
    {
      ?item wdt:P31 wd:Q13442814;
            rdfs:label ?label.
      FILTER(CONTAINS(?label, "</"))
    }
    LIMIT 100
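
Once candidate labels have been retrieved, they can be checked more strictly offline. The sketch below is a hedged Python illustration: it assumes that only simple tags without attributes (such as `<i>` or `</sub>`) are of interest, which is narrower than the query's plain test for `</`. The function names are hypothetical.

```python
import re

# A closing tag such as </i> or </sub>. Stricter than the query, which matches any "</".
CLOSING_TAG = re.compile(r"</[A-Za-z][A-Za-z0-9]*\s*>")
# Any simple opening or closing tag without attributes (an assumption of this sketch).
SIMPLE_TAG = re.compile(r"</?[A-Za-z][A-Za-z0-9]*\s*>")


def has_closing_tag(label):
    """Return True if the label contains something that looks like a closing tag."""
    return CLOSING_TAG.search(label) is not None


def strip_simple_tags(label):
    """Remove simple tags and keep the text between them."""
    return SIMPLE_TAG.sub("", label)


label = "Effects of <i>E. coli</i> on mice"
if has_closing_tag(label):
    print(strip_simple_tags(label))
```

A cleaned label should still be reviewed by hand before it is saved, since a character sequence such as `</` can also occur legitimately in a title.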
    

Titles containing HTML escape sequences

The following query uses these:

PubMed article titles

PubMed encloses a title in square brackets to indicate that it has been translated into English from the original title.

The following query uses these:

  • Properties: instance of (P31), PubMed publication ID (P698)

    SELECT ?paper ?paperLabel ?pmid WHERE {
      ?paper wdt:P31 wd:Q13442814;
             wdt:P698 ?pmid;
             rdfs:label ?paperLabel.
      FILTER(LANG(?paperLabel) = "en")
      FILTER(STRSTARTS(?paperLabel, "["))
      FILTER(STRENDS(?paperLabel, "]"))
    } LIMIT 100
    

https://www.nlm.nih.gov/bsd/licensee/elements_descriptions.html#articletitle explains the brackets and notes that titles may include other words that are not part of the title:

  • A non-English title that has not yet been translated appears as just: [In Process Citation]
  • Explanatory information is enclosed in parentheses, e.g.: (author's transl)
  • Corporate/collective authors may appear at the end, e.g.: GISU. Interdisciplinary Group for the Study of Ulcer.
  • Records in the OLDMEDLINE subset (<CitationSubset> = OM) may have: Not Available

https://www.nlm.nih.gov/bsd/licensee/elements_descriptions.html#verniculartitle: <VernacularTitle> holds the original, untranslated title.
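
The title conventions listed above can be sketched as a small classifier. This is a minimal Python illustration, not an NLM or Wikidata tool: the function names and category names are hypothetical, and the parenthetical and corporate-author cases are deliberately not handled.

```python
def classify_pubmed_title(title):
    """Classify a PubMed article title by the bracket conventions described above.

    The returned category names are hypothetical labels chosen for this sketch.
    """
    text = title.strip()
    # Check the placeholder first: it also starts with "[" and ends with "]".
    if text == "[In Process Citation]":
        return "untranslated"
    if text == "Not Available":
        return "not_available"
    # Same test as the query: starts with "[" and ends with "]".
    if text.startswith("[") and text.endswith("]"):
        return "translated"
    return "original"


def strip_translation_brackets(title):
    """Remove the enclosing brackets from a translated title; leave others unchanged."""
    text = title.strip()
    if classify_pubmed_title(text) == "translated":
        return text[1:-1].strip()
    return text


for sample in ["[Treatment of gastric ulcer]", "[In Process Citation]", "Not Available"]:
    print(classify_pubmed_title(sample), "|", strip_translation_brackets(sample))
```

Testing for the placeholder before the bracket check matters, because `[In Process Citation]` would otherwise be treated as a translated title.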

Tools

Publications

See Scholia on data quality for publications. Please add missing publications to Wikidata as publication items and set main subject (P921) to data quality (Q1757694) so that they are included in this list.

Events

Tracking errors in people

These should be searched weekly.

Tracking errors in Identifiers

These are intended for the external sources, so that they can monitor the errors we have found.

Subpages

Participants

The participants listed below can be notified using the following template in discussions:
{{Ping project|Data Quality}}


See also

  • Wikidata:WikiProject Virtual Twins: can be seen as a data quality project, since a pair of twins mostly arises from chance and from the biases that exist in Wikidata's coverage.