Research:Understanding Curious and Critical Readers/Knowledge Networks of Wikipedia Readers

A first exploratory analysis of knowledge networks of Wikipedia readers.

Summary of findings:

The knowledge networks of Wikipedia readers are similar to the knowledge networks of participants browsing Wikipedia in a controlled study (KNOT-data). This similarity is not coincidental, as a large set of knowledge networks generated from different null models cannot reproduce these observations. This adds confidence that the previous approaches and findings generalize to the general population of Wikipedia readers. It also raises the question of how robust these knowledge networks are for readers of other language versions of Wikipedia, which have different underlying link structures.
We reproduce previous findings about the existence of two types of curiosity denoted hunter and busybody. The prevalence of these types is similar for Wikipedia and the KNOT-data.
We find wide variation in the structure of the knowledge networks, forming a continuum with no clearly defined clusters. One principal component of this variation is captured by the hunter/busybody dichotomy. This raises the question of how to identify and interpret other styles of curiosity encoded in these knowledge networks.

Background and motivation

Previous research has found two types of curious behavior among readers of Wikipedia, so-called hunters and busybodies^[1]. The authors showed that the two groups are characterized by knowledge networks that differ in clustering, characteristic path length, modularity, etc. Furthermore, the structure of the networks correlated with some dimensions of a 5-dimensional curiosity scale (specifically deprivation sensitivity), where aggregate scores were inferred from responses to a questionnaire.

However, the study has some limitations with respect to its generalizability. Specifically, the dataset used in the study, the so-called KNOT-data (knowledge networks over time), had the following characteristics:

The sample size was small (149 participants)
The sociodemographic composition of the participants in the study is likely not representative of the general reader population of Wikipedia. The participants were recruited through poster, Facebook, Craigslist, and university research site advertisements in Philadelphia and the surrounding university community. Gender: 121 female, 26 male, 2 other gender. Age: between 18.21 and 65.24 years (M= 25.05, SD= 6.99). Ethnicity: African American/Black (6.71%), Asian (25.50%), Hispanic/Latino (5.37%), Multiracial (5.37%), other (5.37%), White (49.66%), and missing information (2.01%).
Participants were instructed to spend 15 minutes in self-directed information seeking on Wikipedia and to explore whatever topics interested them. While this is different from targeted navigation in lab-settings such as Wikispeedia (where the target is typically pre-defined) ^[2], the very instruction to browse Wikipedia might alter the navigation of the reader. In addition, participants received incentives for completing the 15-minutes of Wikipedia browsing, raising the possibility that participants browsed to obtain the incentives rather than to satisfy their deprivation curiosity.

Therefore, the aim of this analysis was to assess to what degree the findings from the KNOT-study apply to readers of Wikipedia in a natural setting. This will help us better understand whether the findings about readers’ curiosity are likely to generalize to the general reader population.

Methods

Data

KNOT data. We use the dataset from ^[1], which consists of browsing sessions on Wikipedia of 149 participants over 21 days (see the Methods section of ^[1] for details on the data).

Wikipedia data. We use reading sessions from the Wikipedia-app. The reason is that the sessions from the KNOT-data span several days of browsing (mean: 17.9 days) with a median of 168 pageviews. In contrast, the reading sessions we could reconstruct from desktop and mobile-web browsers on Wikipedia are believed to be reliable only on a short timescale (1 day) and are thus much shorter (median: 1, mean: ~2) ^[3]. The reading sessions from the Wikipedia-app allow for aggregation of pageviews on longer timescales such that they can be compared to the KNOT-data. Details of the filtering:

Requests to pageviews of English Wikipedia during month 2022-03
Only requests in the app (access_method=mobile_app)
Pageviews to main namespace, resolving redirects
Remove bots and automated traffic (agent_type=user)
Keep only reading sessions in which all pageviews originate from country-code US (to ensure comparability with KNOT data and remove additional confounders)
Aggregate pageviews into sessions via the wmfuuid (remove id after aggregation and assign pseudo-id) keeping the time-ordering of the pageloads
We only keep the ordered sequence of page-id and page-title as well as the page-id and the page-title of the referrer (if from Wikipedia)
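
The filtering and aggregation steps above can be illustrated with a minimal sketch. It is not the pipeline used for this analysis: the records are plain dictionaries, and apart from access_method and agent_type the key names (uid, country, ts, page_id) are hypothetical. Namespace filtering and redirect resolution are assumed to have happened upstream, and the country check is assumed to apply after the app and agent filters.

```python
from collections import defaultdict


def build_sessions(pageviews):
    """Group pageview records into time-ordered sessions with pseudo-ids.

    Each record is a dict. Apart from access_method and agent_type,
    the key names (uid, country, ts, page_id) are hypothetical.
    """
    by_user = defaultdict(list)
    for pv in pageviews:
        # Keep only app requests from non-automated traffic.
        if pv["access_method"] != "mobile_app" or pv["agent_type"] != "user":
            continue
        by_user[pv["uid"]].append(pv)

    sessions = {}
    for views in by_user.values():
        # Keep a session only if every pageview originates from the US.
        if any(pv["country"] != "US" for pv in views):
            continue
        views.sort(key=lambda pv: pv["ts"])
        # The original identifier is dropped; the key is an arbitrary pseudo-id.
        sessions[len(sessions)] = [pv["page_id"] for pv in views]
    return sessions
```

The returned mapping contains only pseudo-ids and ordered page-ids, so the original identifier cannot be recovered from it.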

Knowledge networks

We construct knowledge networks from the reading sessions following a procedure similar to the one described in ^[1]:

Nodes: list of pages (without repetition) visited in the reading session.
Edges between nodes: we define undirected and binary edges between two nodes s and t with value 1 if there is a hyperlink from s to t or from t to s, otherwise 0. In practice, we consider all hyperlinks between pages in the snapshot of the corresponding month from the pagelinks-table.
Filter networks: Random sample of 1000 networks with at least 100 nodes. The reason for this sample is two-fold: i) larger networks provide more robust estimates of metrics such as clustering coefficient; ii) the sample is closer to the KNOT data in terms of network size, which is a known confounding factor when calculating network metrics.
Note: the resulting knowledge networks do not contain any personally-identifiable information. For each network, we only keep the list of nodes and edges and do not keep any other information. Since we are only interested in the structure of the network, we assign arbitrary pseudo-ids to nodes which are i) independent of the original page-ids that correspond to the nodes and ii) not consistent across networks.
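
As an illustration of the construction above, here is a minimal sketch using networkx. The in-memory representation of the hyperlinks (a set of directed source/target pairs) and the function name are assumptions made for the example; the actual analysis reads links from the pagelinks-table.

```python
import itertools

import networkx as nx


def knowledge_network(session_pages, hyperlinks):
    """Build an undirected, unweighted knowledge network for one session.

    session_pages: ordered list of visited pages (may contain repeats).
    hyperlinks: set of directed (source, target) pairs; a hypothetical
        in-memory stand-in for a pagelinks snapshot.
    """
    # Nodes: visited pages without repetition.
    pages = list(dict.fromkeys(session_pages))
    # Per-network integer ids replace the original page identifiers.
    pseudo_id = {page: i for i, page in enumerate(pages)}

    graph = nx.Graph()
    graph.add_nodes_from(pseudo_id.values())
    for s, t in itertools.combinations(pages, 2):
        # Binary edge if a hyperlink exists in either direction.
        if (s, t) in hyperlinks or (t, s) in hyperlinks:
            graph.add_edge(pseudo_id[s], pseudo_id[t])
    return graph
```

Checking all pairs of visited pages is quadratic in the session length, which is acceptable for sessions of a few hundred pages.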

Network metrics

We calculate the following characteristics of the networks:

Size of the network:
- Number of nodes, N
- Number of edges, E
Clustering coefficient, C (via networkx)
Characteristic path length, L (only for the largest connected component) (via networkx)
Global efficiency, geff (via networkx)
Core-periphery structure, cps (via bctpy)
Community structure. We fit a hierarchical degree-corrected stochastic blockmodel (via graph-tool). This yields a hierarchical grouping of nodes in which the number of levels in the hierarchy and the number of groups in each level are inferred automatically. From this fit we derive the following quantities:
- Minimum Description Length (per node), MDLn (via graph-tool)
- Modularity, Q, of the partition of nodes into groups on the lowest level of the hierarchy (via graph-tool)
- Number of groups, B, of the partition of nodes into groups on the lowest level of the hierarchy
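
The networkx-based metrics in the list above could be computed along the following lines. This is a sketch under assumptions: the text does not say which clustering variant was used, so the average local clustering coefficient is assumed here, and the core-periphery and blockmodel quantities (bctpy, graph-tool) are left out.

```python
import networkx as nx


def basic_metrics(graph):
    """Return the networkx-based metrics for one knowledge network."""
    # Characteristic path length is only defined on a connected graph,
    # so it is computed on the largest connected component.
    largest_cc = max(nx.connected_components(graph), key=len)
    lcc = graph.subgraph(largest_cc)
    return {
        "N": graph.number_of_nodes(),
        "E": graph.number_of_edges(),
        # Assumption: C is the average of the local clustering coefficients.
        "C": nx.average_clustering(graph),
        "L": nx.average_shortest_path_length(lcc),
        "geff": nx.global_efficiency(graph),
    }
```

Global efficiency is computed on the full graph because disconnected pairs simply contribute zero, whereas the path length would be undefined for them.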

Results

How similar are the knowledge networks of Wikipedia readers and readers from the KNOT-study?

We compare the network metrics for the different datasets.

The knowledge networks of the two datasets are very similar: the curves of the marginal distributions are overlapping and in the scatterplots the samples cover the same regions. This means that the two datasets yield similar populations in terms of their knowledge networks.

Distributions (diagonal) and scatterplots (off-diagonal) for different network metrics in the KNOT and Wikipedia dataset

Can we find different types of knowledge networks?

(1) Can we identify the hunter- and busybody types of knowledge networks identified in previous studies?

We assign an aggregate score that captures whether the knowledge network is more hunter-like (tight knowledge networks) or busybody-like (loose knowledge networks) and compare the different datasets. Specifically, the hunter/busybody score is defined using the static network measures used in ^[1]: edge weight, clustering coefficient, shortest path efficiency, and characteristic path length. The standardized values that characterize hunter-like exploration (edge weight, clustering coefficient, and shortest path efficiency) are summed, and the standardized value that characterizes busybody-like exploration (characteristic path length) is subtracted.
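
The aggregate score can be sketched as follows. The dictionary keys (w, C, eff, L) are hypothetical names for the four measures, and standardizing with the population standard deviation across the networks being compared is an assumption; the text does not specify either detail.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize values to zero mean and unit (population) variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def hunter_busybody_scores(networks):
    """Score each network; high is hunter-like, low is busybody-like.

    networks: list of dicts with the hypothetical keys 'w' (edge weight),
        'C' (clustering), 'eff' (shortest path efficiency) and
        'L' (characteristic path length).
    """
    hunter_keys = ["w", "C", "eff"]  # standardized values are added
    busybody_keys = ["L"]            # standardized values are subtracted
    z = {k: zscores([n[k] for n in networks])
         for k in hunter_keys + busybody_keys}
    return [
        sum(z[k][i] for k in hunter_keys) - sum(z[k][i] for k in busybody_keys)
        for i in range(len(networks))
    ]
```

A measure that is constant across all networks has zero standard deviation and would need to be excluded before standardizing.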

We see that the knowledge networks capture a wide spectrum on the hunter/busybody scale. More importantly, both datasets seem to capture a similar distribution of knowledge networks in terms of their hunter/busybody score.

Distribution of aggregate score of knowledge networks capturing hunter (high) and busybody (low) style of curiosity for Wikipedia and KNOT data

(2) Can we identify other types of curiosity in the population of knowledge networks?

We generate a 2D-embedding using umap in order to visualize the variation of knowledge networks in terms of the different network metrics.

Observations

We see that knowledge networks from Wikipedia and KNOT are overlapping and covering the same space. This again highlights that the two populations of knowledge networks are very similar.
We see there is a continuum of knowledge networks with few indications of well-defined clusters (apart from some isolated datapoints).
When looking at the hunter/busybody scores of the networks, we see that one principal component (roughly left to right) seems to capture the variation from busybodies (low scores, red) to hunters (high scores, blue). An immediate follow-up question is what types of curiosity are captured by the other regions in the space of knowledge networks.

2D-embedding of the knowledge networks from Wikipedia and KNOT

2D-embedding of the knowledge networks of Wikipedia and KNOT data with overlay of their hunter/busybody scores

How similar are the knowledge networks of simple null models?

In order to better understand the previous results, we compare the knowledge networks from readers in the Wikipedia and KNOT dataset with a range of null models:

Randomized networks. We randomize the structure of the observed network by swapping edges between pairs of nodes via degree-preserving randomization.
Canonical networks. We created 2 synthetic networks that corresponded to each Wikipedia user’s network. All networks were matched to the user’s network size.
- The Erdos-Renyi (ER) network was generated using an edge probability corresponding to the edge density of the user’s network.
- The Barabasi-Albert (BA) model generates scale-free networks with heterogeneous degree distribution. The number of new edges per new node while generating the network corresponds to the average node degree of the user’s network.
Random walks (rw_<p_link>). We create synthetic knowledge networks by performing a random walk on the hyperlink network. For each original reading session, we start from the same page and visit the same number of pages. At each step, we pick the next page as follows: i) with probability p_link, pick one of the hyperlinks on the page randomly (each hyperlink has the same weight); if there are no hyperlinks, go to the next step; ii) with probability 1-p_link, pick one page from the wiki randomly (each page has a weight according to its popularity in terms of the number of pageviews received during the respective month). We choose three different values for p_link to capture different scenarios:
- p_link = 0: picks pages randomly according to their overall popularity (ignoring the hyperlink network)
- p_link = 1: only internal navigation via the hyperlinks
- p_link = 0.375: a combination of the two extremes where we choose the value according to the fraction of pageloads with an internal referrer^[3]
Wikispeedia. A common dataset used to study navigation on Wikipedia is Wikispeedia, derived from a game in which players are tasked to navigate from a starting article to a target article (i.e. targeted navigation). Navigation takes place on a reduced network of 4,604 articles. We use a random subsample of 1000 completed sessions from the publicly available dataset.
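
The random-walk null model in the list above can be sketched as below. Several details are assumptions made for the example: the link structure and popularity counts are passed as plain dictionaries, "the same number of pages" is read as the number of pageloads in the sequence, "go to the next step" is read as drawing again, and a max_steps guard (not part of the description) prevents an endless loop when p_link = 1 and the walk reaches a page without hyperlinks.

```python
import random


def random_walk_session(start, n_pages, out_links, popularity, p_link,
                        rng=None, max_steps=None):
    """Generate one synthetic reading session of up to n_pages pageloads.

    start: first page of the original session.
    out_links: dict mapping a page to the list of pages it links to.
    popularity: dict mapping a page to its monthly pageview count.
    p_link: probability of following a hyperlink on the current page.
    max_steps: hypothetical safety cap on the number of draws.
    """
    if rng is None:
        rng = random.Random()
    if max_steps is None:
        max_steps = 10 * n_pages
    pages = list(popularity)
    weights = [popularity[p] for p in pages]

    walk = [start]
    current = start
    for _ in range(max_steps):
        if len(walk) >= n_pages:
            break
        if rng.random() < p_link:
            links = out_links.get(current, [])
            if not links:
                continue  # no hyperlinks on this page: draw again
            current = rng.choice(links)  # every hyperlink has equal weight
        else:
            # Jump to any page, weighted by its number of pageviews.
            current = rng.choices(pages, weights=weights, k=1)[0]
        walk.append(current)
    return walk
```

The resulting page sequence can then be turned into a knowledge network in the same way as an observed reading session.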

Observations:

All null model networks are markedly different from the observed networks (Wikipedia and KNOT), i.e. they are located in different regions of the embedding. This shows that the similarity between Wikipedia and KNOT is not coincidental: it is not trivial to generate knowledge networks with the specific structure found in these datasets.
The randomized networks and the canonical networks are further away from the observed networks than the random walks. This is intuitive, as the knowledge networks from a random walk on the hyperlink network should approximate the observed knowledge networks more closely.
The dimension along the y-axis (bottom to top) aligns with the variation of the random-walk null-model networks with decreasing value of p_link. That is, below the observed networks we find rw_1 (always using an internal hyperlink), above we find rw_375 (a mixture of using an internal hyperlink and randomly selecting any page), and even further above we find rw_0 (always selecting a random page). This might suggest that in the observed networks, too, the bottom-to-top direction captures different mixtures of browsing via internal links and arriving via external search engines (for more details about these two modes of navigation see ^[3]). This complements the previous finding that the left-to-right variation can be attributed to differences in the hunter/busybody scores of the networks.
The Wikispeedia knowledge networks are the furthest away, showing that the knowledge networks from targeted navigation are very different from those from natural browsing of readers. We also find that the networks from targeted navigation are very hunter-like (very tight networks), with the distribution skewed towards high values of the hunter/busybody scores (not shown here).

2D-embedding of the knowledge networks of the Wikipedia and KNOT data together with different sets of null models for comparison.

References

[1] Lydon-Staley, D. M., Zhou, D., Blevins, A. S., Zurn, P., & Bassett, D. S. (2020). Hunters, busybodies and the knowledge network building associated with deprivation curiosity. Nature Human Behaviour. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8082236/
[2] West, R., & Leskovec, J. (2012). Human wayfinding in information networks. In Proceedings of the 21st international conference on World Wide Web - WWW ’12. https://doi.org/10.1145/2187836.2187920
[3] Piccardi, T., Gerlach, M., Arora, A., & West, R. (2021). A Large-Scale Characterization of How Readers Browse Wikipedia. In arXiv [cs.CY]. arXiv. http://arxiv.org/abs/2112.11848
