Research:Topical coverage of Edit Wars
Latest revision as of 14:36, 16 April 2018
This project is run by the Research Team as part of the Community health initiative. Here, we take a language-agnostic approach that focuses on edit wars and tries to differentiate between topic-centered and person-centered conflicts.
The main outputs of this work are:
- We found that about 52% of users are topic-focused (they edit within a single topic), while about 42% of contributors make big jumps across topics. The distribution of topic focus can therefore be described as bimodal.
- We have found that just 7% of edit wars are cross-topic.
- However, users involved in these cross-topic edit wars are generally very active, which makes it difficult to assume that those wars are due to person-centered conflicts.
Code (Jupyter notebooks) and technical details about this project can be found in this repository.
Introduction
Background
An edit war occurs when editors who disagree about the content of a page repeatedly override each other's contributions. When edit wars occur across multiple topics, they might indicate that a person-centered (rather than topic-centered) conflict is occurring. This behavior might be categorized as wikihounding.
Proposal
Edit wars on Wikipedia have been widely studied. We know they can be detected in a content-agnostic way (without analyzing the text). We propose studying the topical span of edit wars and characterizing usual and unusual (potentially toxic) behaviors.
The main tasks to develop such models are:
- Define and implement a robust topic model.
- Define a distance metric for topics (e.g., Geography is N steps away from Politics and M steps away from Sports).
- Generate a representative dataset of edit wars in Wikipedia.
- Detect pairs or groups of users involved in more than X controversies. X will be defined as part of the study.
- Apply an outlier detection mechanism to find potential cases of harassment.
Methodology
- Define a topic model that allows us to measure topic distance between Wikipedia pages.
- Characterize a user's behavior according to the topics that the user edits and the number of reverts the user makes.
- Compute the probability of a pair of users co-revising a page and the probability that this co-revision is a revert.
- Based on the aforementioned co-revision probability, identify anomalous behaviors that are potentially related to stalking or wikihounding behavior.
Topic Model
- The Topic Model described in this section is based on previous work on a Wikiproject-based model developed in Research:Automatic_new_article_topics_suggestion.
Mapping pages to topic
- Mapping Wikipedia pages to Wikiprojects can be done with this query: https://quarry.wmflabs.org/query/23214
Note
- We use wikiprojects as proxies for topic assignments.
- Many Wikipedia pages can be considered relevant to more than one wikiproject. For example, the page about Diego_Maradona is relevant to Wikipedia:WikiProject_Biography/Sports_and_games, Wikipedia:WikiProject_Football and Wikipedia:WikiProject_Argentina.
Topic distance
- Wikiprojects can be represented as a graph.
- We use this library: https://github.com/wiki-ai/drafttopic/tree/master/drafttopic/utilities
- Given that each page can belong to more than one wikiproject, we define the distance between two pages as the minimum shortest-path length among all pairs of their wikiprojects on the Wikiprojects graph.
- Example: Given pages X and Y, with X in Wikiprojects a and b and Y in Wikiprojects c and d, we compute the length of the shortest path between (a,c), (a,d), (b,c) and (b,d), and return the minimum of these values. In Python:
```python
from itertools import product

import networkx as nx


def distancePages(Graph, page1, page2):
    """
    Graph: the wikiprojects graph
    return -2 if error (page without wikiproject)
    return -1 if the two pages are the same
    else return shortest path
    """
    # Mapping page -> list of wikiprojects; expected to be defined at module level.
    global pagesToWikiprojects
    if page1 == page2:
        return -1
    pages1Projects = pagesToWikiprojects.get(page1, [])
    pages2Projects = pagesToWikiprojects.get(page2, [])
    results = []
    for x, y in product(pages1Projects, pages2Projects):
        try:
            results.append(nx.shortest_path_length(Graph, x, y))
        except nx.NetworkXException:
            # Skip pairs that are disconnected or missing from the graph.
            pass
    if not results:
        return -2
    return min(results)
```
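The minimum-over-pairs distance can be tried out on a toy graph. The sketch below is only an illustration: the graph, the project names and the page names are invented, and a plain breadth-first search stands in for the networkx call so that the example has no external dependencies. Unlike the function above, it returns None (rather than -2) when no pair of projects is connected.

```python
from collections import deque
from itertools import product

# Hypothetical toy wikiprojects graph as an adjacency mapping (not the real graph).
toy_graph = {
    "Sports": {"Football", "Culture"},
    "Football": {"Sports"},
    "Culture": {"Sports", "Geography"},
    "Geography": {"Culture", "Argentina"},
    "Argentina": {"Geography"},
}

# Hypothetical page -> wikiprojects mapping.
toy_pages = {
    "Page X": ["Football", "Argentina"],
    "Page Y": ["Culture"],
    "Page Z": [],  # page without any wikiproject
}


def bfs_distance(graph, source, target):
    """Number of edges on a shortest path, or None if unreachable or unknown."""
    if source not in graph or target not in graph:
        return None
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def min_project_distance(graph, projects_a, projects_b):
    """Minimum shortest-path length over all project pairs, or None if none exists."""
    lengths = []
    for a, b in product(projects_a, projects_b):
        dist = bfs_distance(graph, a, b)
        if dist is not None:
            lengths.append(dist)
    return min(lengths) if lengths else None


distance_xy = min_project_distance(toy_graph, toy_pages["Page X"], toy_pages["Page Y"])
distance_xz = min_project_distance(toy_graph, toy_pages["Page X"], toy_pages["Page Z"])
```

In this toy setup both Football and Argentina are two steps from Culture, so Page X and Page Y are at distance 2, while Page Z has no wikiproject and yields no distance.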
User behavior
Topical coverage
- Topical stability (us): For each user U, we obtain the distribution of topical distances over the user's 'next' revisions.
- For example, if user U makes three revisions, the first one in the topic 'Sports', the second one in the same topic, and the third one in the topic 'Biology' (at distance 4 from Sports), the probability that user U edits at topical distance 0 is 2/3, the probability at topical distance 4 is 1/3, and the probability is 0 for all other distances. This metric gives an idea of user stability in terms of topics.
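The example above can be reproduced with a small sketch. It assumes the per-revision topical distances for a user have already been computed and are passed in as a list; the function name and the input layout are hypothetical and not taken from the project notebooks.

```python
from collections import Counter
from fractions import Fraction


def topical_stability(distances):
    """Empirical distribution of topical distances for one user.

    distances: list of topical distances, one per revision (assumed input layout).
    Returns a dict mapping each observed distance to its probability.
    """
    counts = Counter(distances)
    total = len(distances)
    return {dist: Fraction(count, total) for dist, count in counts.items()}


# The example user: two revisions at distance 0 and one at distance 4.
example = topical_stability([0, 0, 4])
```

Distances that never occur are simply absent from the result, which corresponds to probability 0.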
Reverting behavior
- For each user we compute the proportion of reverts (within the dataset) relative to their total number of revisions. Our dataset contains only revisions by users who have made 10 or more revisions.
Note: We only consider the reverts among these users.
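A minimal sketch of this computation follows. The input layout, a list of (user, is_revert) pairs, and the function name are assumptions made for illustration; the 10-revision threshold comes from the description above.

```python
def revert_proportions(revisions, min_revisions=10):
    """Share of each user's revisions that are reverts.

    revisions: iterable of (user, is_revert) pairs (hypothetical layout).
    Users with fewer than min_revisions revisions are left out.
    """
    totals = {}
    reverts = {}
    for user, is_revert in revisions:
        totals[user] = totals.get(user, 0) + 1
        if is_revert:
            reverts[user] = reverts.get(user, 0) + 1
    return {
        user: reverts.get(user, 0) / count
        for user, count in totals.items()
        if count >= min_revisions
    }


# Invented data: alice makes 10 revisions (2 of them reverts), bob only 3.
sample = [("alice", i < 2) for i in range(10)] + [("bob", True)] * 3
proportions = revert_proportions(sample)
```

Here bob is excluded because he falls below the activity threshold, even though all of his revisions are reverts.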
Wikiprojects graph
- We use the Wikiproject graph generated by this script for parsing the wikiprojects graph: https://github.com/wiki-ai/drafttopic/tree/master/drafttopic/utilities
- Pages to Wikiprojects: https://quarry.wmflabs.org/query/23214
- The Wikiproject graph (in networkx format) can be downloaded here: [1]
Terminology / conventions
- We use wikiprojects as proxies for topics.
Results
Reverting behavior
- We found a strong correlation between the number of revisions and the number of reverts. This suggests that reverting is part of the task of active users.
- We also found that reverting behavior changes with the date of user registration (account age/tenure): as expected, older users tend to be reverted less and to revert more.
A detailed analysis of reverting behavior can be found here: [2]
Characterization of user topic-focus
- Following our definition of topical stability, we see that 83.95% of 'next' revisions happen on the same page, and 99.25% on the same topic.
- Moreover, 52.88% of users never jump out of the same topic. However, 41.88% of users jump more than 4 steps at least once.
More details can be found here: [3]
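The two user-level shares reported above can be sketched as follows, assuming each user's topical distances are available as a list, with -1 meaning the same page and 0 the same topic (the convention of the distance function). The function name, the input layout and the sample data are hypothetical.

```python
def summarize_jumps(user_distances, far=4):
    """Share of users who never leave their topic and share who jump far.

    user_distances: dict user -> list of topical distances (hypothetical layout).
    far: a user counts as a far jumper if any distance exceeds this value.
    """
    n_users = len(user_distances)
    if n_users == 0:
        return 0.0, 0.0
    # max(..., default=-1) treats a user with no recorded distances as never jumping.
    never_jump = sum(1 for d in user_distances.values() if max(d, default=-1) <= 0)
    far_jump = sum(1 for d in user_distances.values() if max(d, default=-1) > far)
    return never_jump / n_users, far_jump / n_users


# Invented data for four users.
sample_users = {
    "u1": [-1, 0, 0],
    "u2": [0, 5],
    "u3": [-1, -1],
    "u4": [0, 2],
}
share_never, share_far = summarize_jumps(sample_users)
```

In this sample, u1 and u3 never leave their topic and only u2 jumps more than 4 steps, so the two shares are 0.5 and 0.25.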
Characterization of topical distance in multipage edit wars
- To get a sense of how frequent reverts and edit wars are across multiple topics, we considered all pairs of users U and V where U has reverted V more than 2 times, and computed the topical distance between all pairs of reverted pages.
- Next, we computed the mode (the most frequent value) of these distances for each pair of users and reported the frequency of those values. As expected, most wars, 71%, focus on a single page (distance -1) and 22% are in the same topic but on a different page (distance 0). The remaining 7% are cross-topic reverts. This reinforces our intuition that cross-topic edit wars are rare.
Distance | Fraction of user pairs |
---|---|
-1 | 0.7107 |
0 | 0.2297 |
4 | 0.0143 |
5 | 0.0121 |
2 | 0.0103 |
3 | 0.0074 |
6 | 0.0056 |
7 | 0.0034 |
8 | 0.0027 |
1 | 0.0016 |
9 | 0.0012 |
10 | 0.0007 |
11 | 0.0002 |
12 | 0.0001 |
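The per-pair mode and the resulting frequency table can be sketched as below. The input layout (a dict from user pair to the list of topical distances between their reverted pages) and all names are assumptions for illustration. Ties for the mode are broken in favor of the value seen first, which is a choice of this sketch and not something the study specifies.

```python
from collections import Counter


def mode_distance_frequencies(pair_distances):
    """Frequency of the most common topical distance across user pairs.

    pair_distances: dict (reverting_user, reverted_user) -> list of distances
    between the pages involved (hypothetical layout). Pairs with no distances
    are skipped.
    """
    modes = [
        Counter(distances).most_common(1)[0][0]
        for distances in pair_distances.values()
        if distances
    ]
    total = len(modes)
    return {dist: count / total for dist, count in Counter(modes).items()}


# Invented data for four user pairs.
sample_pairs = {
    ("U1", "V1"): [-1, -1, -1],
    ("U2", "V2"): [-1, 0, 0],
    ("U3", "V3"): [4, 4, -1],
    ("U4", "V4"): [-1, -1, 0],
}
frequencies = mode_distance_frequencies(sample_pairs)
```

With this sample, half of the pairs have mode -1 (same page), a quarter have mode 0 (same topic) and a quarter have mode 4 (cross-topic).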
Activity and Editwars
- We found that almost 40% of users jump 4 or more steps in the Wikiprojects graph, which makes it difficult to predict the likelihood of two users co-editing the same page.
- Moreover, we found that users involved in cross-topic edit wars also tend to be the most active users, so we cannot tell whether their behavior is related to person-centered problems or is just a consequence of their high activity.
Details about this study can be found here: https://github.com/digitalTranshumant/TopicalSpanofEditWars/blob/master/EditWars.ipynb
Conclusions and Main Outputs
- Based on previous work [5], we have implemented and released a model that allows us to measure topical distance between Wikipedia pages.
- We have found that around 99% of 'next revisions' are done within the same topic.
- We have found that just 7% of edit wars are cross-topic.
- However, users involved in these cross-topic edit wars are generally very active, which makes it difficult to assume that those wars are due to person-centered conflicts.
Resources
Raw data
- We considered all users with more than 10 revisions in the selected period (1/Jan/2017 to 16/Nov/2017)
- The revisions from those users can be obtained from this query: https://quarry.wmflabs.org/query/23424
- A list of possible bots was obtained using this approach: Research_talk:Identifying_bot_accounts
Reverts dataset
- The interactions dataset can be downloaded here: [6]
Format: interactions[user1][user2]: [[pageid, timestamp, deltatime, revision_id_reverted, revision_id_reverting], another revert, etc.]
- user1: user reverting
- user2: user reverted
- pageid: page_id
- timestamp: timestamp when the reverted version was created (done by user2)
- deltatime: delta time from the reverted version to the reverting revision (done by user1)
- revision_id_reverted: revision_id reverted (by user2)
- revision_id_reverting: revision_id reverting (by user1)
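As a rough illustration of how this nested structure can be walked, the sketch below selects the user pairs where one user reverted the other more than 2 times and lists the pages involved. The sample values (user names, page ids, timestamps, revision ids) are invented, and the timestamp and deltatime representations in the real dataset may differ.

```python
# Invented sample following the documented layout:
# interactions[user1][user2] -> list of
# [pageid, timestamp, deltatime, revision_id_reverted, revision_id_reverting]
interactions = {
    "UserA": {
        "UserB": [
            [101, "2017-03-01T10:00:00Z", 120, 9001, 9002],
            [101, "2017-03-02T11:00:00Z", 60, 9003, 9004],
            [205, "2017-03-05T09:30:00Z", 300, 9010, 9011],
        ],
        "UserC": [
            [101, "2017-04-01T08:00:00Z", 30, 9020, 9021],
        ],
    },
}


def frequent_pairs(interactions, more_than=2):
    """Pairs (reverting, reverted) with more than `more_than` reverts.

    Returns a dict mapping each such pair to the sorted list of distinct
    page ids on which the reverts happened.
    """
    result = {}
    for reverting, targets in interactions.items():
        for reverted, reverts in targets.items():
            if len(reverts) > more_than:
                result[(reverting, reverted)] = sorted({r[0] for r in reverts})
    return result


pairs = frequent_pairs(interactions)
```

The page lists obtained this way are the input needed to compute topical distances between the pages of a multipage edit war.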
Wikiprojects graph
- Using the Wikiproject graph generated from this script
- Pages to Wikiprojects: https://quarry.wmflabs.org/query/23214
- The Wikiproject graph (in networkx format) can be downloaded here: https://github.com/digitalTranshumant/Wiki-examples/blob/master/WikiProjectGraphNetworkx.pickle
Code
- Find all the code used in this study here:
See Also
- A project focused on the usage of toxic language.
- Research:Wikihounding_and_Machine_Learning_Analysis
- Wikiproject topic model developed here: Automatic_new_article_topics_suggestion