Performance.wikimedia.org/Runbook
This is the runbook for deploying and monitoring webperf services.
Hosts
The puppet role for these services is role::webperf::processors_and_site.
Find the current production hosts for this role in puppet: site.pp. Find the current beta host at openstack-browser: deployment-prep.
Hosts as of Jan 2022 (T305460):
- webperf1003 (Eqiad cluster)
- webperf2003 (Codfw cluster)
- deployment-webperf21 (Beta cluster).
navtiming
The navtiming service (written in Python) consumes NavigationTiming and SaveTiming events from EventLogging (over Kafka), and after processing submits them to Graphite (over Statsd) and Prometheus, from which they can be visualised in Grafana.
The events start their life in Extension:NavigationTiming as part of MediaWiki, which beacons them to EventLogging (beacon js source).
Meta
- User documentation: Performance/Metrics.
- Source code: performance/navtiming.git (Gerrit code review activity)
- Puppet class: webperf::navtiming
- Event schema: schema.wikimedia.org/secondary: navigationtiming (Git source, Code review)
Monitor navtiming
Application logs
View and explore the logs in Logstash via the "Discover" page. Query the "logstash-" or "ecs-" index (which index depends on the service) for messages from the "webperf" host and/or specific programs like "navtiming" or "excimer".
Alternatively, you can tail them directly on a given host over SSH by running sudo journalctl -u navtiming -f -n100
Raw events
To look at the underlying Kafka stream directly, you can use Kafkacat from our webperf host (requires perf-admins shell) or from a stats host (requires analytics-privatedata-users shell):
# Read the last 1000 items and stop
webperf1003$ kafkacat -C -b 'kafka-jumbo1001.eqiad.wmnet:9092' -t eventlogging_NavigationTiming -o '-1000' | head -n1000
# Consume live, stop after 10 new items
webperf1003$ kafkacat -C -b 'kafka-jumbo1001.eqiad.wmnet:9092' -t eventlogging_NavigationTiming | head -n10
# Read the last 1000 items and stop after 10 events match the grep pattern
webperf1003$ kafkacat -C -b 'kafka-jumbo1001.eqiad.wmnet:9092' -t eventlogging_NavigationTiming -o '-1000' | grep largestContentfulPaint | head -n10
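When grep is too coarse, the kafkacat output can be piped through a small filter instead. The following is a minimal sketch, assuming each message arrives as one JSON object per line and that the payload sits under an "event" key (falling back to the top level otherwise). The function name filter_events and the sample values are hypothetical.

```python
import json
from typing import Iterable, Iterator


def filter_events(lines: Iterable[str], field: str) -> Iterator[dict]:
    """Yield decoded messages whose payload contains `field`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            message = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial or non-JSON lines
        if not isinstance(message, dict):
            continue
        # Assumption: the payload lives under "event"; otherwise use the message itself.
        payload = message.get("event", message)
        if isinstance(payload, dict) and field in payload:
            yield message


# Demo with in-memory sample lines (hypothetical values).
sample_lines = [
    '{"event": {"largestContentfulPaint": 1200}}',
    'not json',
    '{"event": {"responseStart": 80}}',
]
for msg in filter_events(sample_lines, "largestContentfulPaint"):
    print(json.dumps(msg))
```

To use it on live data, pass sys.stdin as the lines argument and pipe kafkacat into the script.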
Event validation
When our JS client submits events to the EventGate server, they are validated against our schema. Valid messages are forwarded to the Kafka topic. Rejected messages are logged for us to review in the EventGate-validation dashboard in Logstash.
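The accept/reject split can be illustrated with a toy validator. This is only a sketch and not how EventGate is implemented: the real service validates against the full event schema, whereas the REQUIRED_FIELDS mapping, the field names, and the validate and route helpers below are hypothetical stand-ins.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical schema: field name -> accepted Python type(s).
REQUIRED_FIELDS: Dict[str, Tuple[type, ...]] = {
    "responseStart": (int, float),
    "isOversample": (bool,),
}


def validate(event: Dict[str, Any]) -> List[str]:
    """Return a list of validation errors; an empty list means valid."""
    errors = []
    for name, types in REQUIRED_FIELDS.items():
        if name not in event:
            errors.append(f"missing required field: {name}")
        elif isinstance(event[name], bool) and bool not in types:
            # bool is a subclass of int in Python; reject it for numeric fields.
            errors.append(f"wrong type for field: {name}")
        elif not isinstance(event[name], types):
            errors.append(f"wrong type for field: {name}")
    return errors


def route(event: Dict[str, Any]):
    """Mimic the split: valid events go to the topic, others are logged."""
    errors = validate(event)
    if errors:
        return ("validation-log", errors)
    return ("kafka", event)
```

An event that fails here would be the kind of message that shows up in the validation dashboard rather than in the Kafka topic.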
Deploy navtiming
This service runs on the webperf*1 hosts.
To update the service on the Beta Cluster:
- Connect with ssh deployment-webperf21.deployment-prep.eqiad1.wikimedia.cloud
- Run sudo journalctl -u navtiming -f -n100 and keep this open during the following steps.
- In a new tab, connect with ssh deployment-deploy04.deployment-prep.eqiad1.wikimedia.cloud (or whatever the current deployment-deploy* host is, check).
- cd /srv/deployment/performance/navtiming
- git pull
- scap deploy
- Review the scap output (here) and the journalctl output (on the webperf server) for any errors.
To deploy a change in production:
- Before you start, open Logstash in your browser to monitor the webperf# production hosts in real-time.
- In your terminal, ssh to the deployment server: ssh deployment.eqiad.wmnet and navigate to the navtiming directory: cd /srv/deployment/performance/navtiming
- Prepare the working copy:
  - Ensure the working copy is clean: git status
  - Fetch the latest changes from the Gerrit remote: git fetch origin
  - Review the changes: git log -p HEAD..@{u}
  - Apply the changes to the working copy: git rebase
- Deploy the changes. This automatically restarts the service afterward.
  - Run scap deploy
Verify a deploy in production:
- Check the logs you are tailing in your tab and look for new errors.
- Go to the navtiming dashboard in Grafana and verify that metrics are reaching Graphite. Zoom in, refresh, and wait a couple of minutes to confirm that new metrics keep coming in after your deployment.
- Do the same for the Prometheus metrics. You can do that in the response start dashboard, following the same pattern as for Graphite.
Rollback a change in production:
- Revert the change in Gerrit: open your change set, click the revert button, and add the reason why you are reverting.
- Give the revert a +2 and wait for the code to be merged.
- Follow the instructions for deploying a change in production, and make sure your revert is included when you review the changes.
Restart navtiming
sudo systemctl restart navtiming
Check that Prometheus is running
You can check metrics and verify that Prometheus is running by using curl on the webperf host:
curl localhost:9230/metrics
This prints all the metrics collected. To measure how long it takes to fetch the metrics, use:
curl -o /dev/null -s -w 'Total: %{time_total}s\n' localhost:9230/metrics
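The same check can be scripted. The following is a minimal sketch, assuming the endpoint serves the standard Prometheus text exposition format at localhost:9230/metrics as in the curl commands above. The helper names parse_samples and fetch_metrics are hypothetical, and the parser assumes sample lines carry no trailing timestamp.

```python
import time
import urllib.request


def parse_samples(text):
    """Parse exposition-format text into {series: float}; skips comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a HELP/TYPE comment
        # Assumption: the value is the last space-separated token (no timestamp).
        series, _, value = line.rpartition(" ")
        if not series:
            continue
        try:
            samples[series.strip()] = float(value)
        except ValueError:
            continue  # not a sample line
    return samples


def fetch_metrics(url="http://localhost:9230/metrics", timeout=5.0):
    """Return (samples, elapsed_seconds) for one scrape of `url`."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        body = response.read().decode("utf-8")
    return parse_samples(body), time.monotonic() - start
```

On a webperf host, calling fetch_metrics() returns the parsed samples together with the scrape duration in seconds; an empty result or a connection error suggests the exporter is not running.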
statsv
The statsv service (written in Python) forwards data from the Kafka stream for /beacon/statsv web requests to Statsd.
- Source code: statsv.git (Gerrit)
- Source code: varnishkafka::statsv
- Source code: mw.track handler
- Deployed using Scap3.
- Puppet class: webperf::statsv.
How it works
StatsV is an HTTP beacon endpoint (/beacon/statsv) for sending data to Prometheus or Graphite from MediaWiki JavaScript.
Data pipeline (simplified):
HTTP -> Varnish -> Kafka -> statsv.py [this service]
Stats are a lightweight and productive way to aggregate simple statistics, without needing the overhead of an EventLogging schema or long-term storage and queryability.
The interface from MediaWiki is as follows:
// Graphite
mw.track( 'timing.MediaWiki.foo_bar', 1234.56 ); // milliseconds
mw.track( 'counter.MediaWiki.foo_quux' ); // defaults to increment=1
mw.track( 'counter.MediaWiki.foo_quux', 1 );
// Prometheus, optional key-value labels
mw.track( 'stats.mediawiki_foo_bar_total' ); // defaults to increment=1
mw.track( 'stats.mediawiki_foo_bar_total', 1, { something: 'quux' } );
For Graphite, metric names must include the MediaWiki. prefix (matching $wgStatsdMetricPrefix).
For Prometheus, metric names must include the mediawiki_ prefix (matching $wgStatsPrefix), and must comply with Stats Lib naming recommendations.
See also mw.track Documentation on mediawiki.org. The above mw.track topics are defined in statsd.js via the WikimediaEvents extension (source code).
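The last step of the pipeline, turning a beacon request into Statsd messages, can be sketched as follows. This is not the statsv source code. It assumes the beacon query string carries name=<value><unit> pairs (for example MediaWiki.foo_bar=1234ms), which is an assumption about the wire format, and it emits the usual StatsD line format name:value|unit. The name beacon_to_statsd is hypothetical, and the Prometheus-bound stats. topics are not covered.

```python
import re
from urllib.parse import parse_qsl, urlsplit

# <number><unit>, e.g. "1234.56ms" (timing), "1c" (counter), "5g" (gauge).
VALUE_RE = re.compile(r"^(-?\d+(?:\.\d+)?)(ms|c|g)$")


def beacon_to_statsd(url):
    """Convert a /beacon/statsv URL into StatsD protocol lines."""
    lines = []
    for name, raw in parse_qsl(urlsplit(url).query):
        match = VALUE_RE.match(raw)
        if match is None:
            continue  # drop malformed values rather than forward them
        value, unit = match.groups()
        lines.append(f"{name}:{value}|{unit}")
    return lines


print(beacon_to_statsd(
    "/beacon/statsv?MediaWiki.foo_bar=1234.56ms&MediaWiki.foo_quux=1c"
))
```

Dropping malformed values instead of forwarding them keeps one bad client request from polluting the metrics namespace.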
Logs
Application logs are kept locally, and can be read via sudo journalctl -u statsv.
Restart statsv
sudo systemctl restart statsv
perfsite
- Source code:
- Code review:
- Puppet class: profile::webperf::site.
This powers the site at https://performance.wikimedia.org/. Beta Cluster instance at https://performance.wikimedia.beta.wmflabs.org/.
Deploy the site
- Follow instructions in the README to create a commit.
- Push to Gerrit for review.
- Once merged, Puppet will update the web servers within 30 minutes.
Former services
coal and coal-web
The coal service consumed EventLogging events from Kafka and produced moving medians directly to Graphite. The coal-web service served as a fast web-accessible cache of these metrics from Graphite, to visualize on the https://performance.wikimedia.org/ homepage.
The service was created by Ori Livneh in 2015 (change 204628) to overcome the shortcomings of Statsd/Graphite (unweighted averages of per-minute averages, slow to respond to long-spanning queries, doesn't scale for high-traffic web pages) by providing reliable and statistically sound long-term moving medians per day, month, and year; with cached data served from a Python web server. The service underwent major changes in T159354 and T158837, and was eventually decom'ed in 2023 (decom task T335242) in favour of Prometheus for most use cases (T175087).
Source code: performance/coal.git