Incidents/20150422-LabsOutage
Summary
Many labs instances were migrated to new virtualization hardware. A kernel bug on the new hosts caused the guest VMs to misbehave: poor response times, network interruptions and a flurry of monitoring alerts. A kernel update on the affected hosts resolved the problem, but the accompanying reboots further interrupted many VMs.
The affected hosts were running a kernel with this bug: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1346917 (found by investigating the symptoms reported at https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1307473 on comparable Precise->Trusty upgrades).
Timeline
prehistory
- There are six new labs virtualization boxes, labvirt1001-1006. They use the same hardware as the old, tried-and-true nodes virt1010, 1011 and 1012. virt1010 and 1011 are running Ubuntu Precise; virt1012 is running Trusty with kernel 3.13.0-46. The new nodes use a stock install of Trusty, kernel version 3.13.0-24.
- Andrew migrates select instances to the new labvirt hardware. Projects 'openstack,' 'testlabs,' and a few miscellaneous instances are moved to the hardware. No ill-effects are observed.
2015-04-20
- Andrew migrates the 'cvn' and 'staging' projects to labvirt hosts.
2015-04-21
- Andrew runs a scripted migration of the deployment-prep project to labvirt hosts. This is the first large-scale migration to the new hardware.
2015-04-22
- [02:00] Shinken starts to send many, many alerts to #wikimedia-releng, reporting deployment hosts to be flapping. Page loads fail intermittently.
- [12:30] Andrew wakes up, begins a scripted migration of Tools instances to the labvirt hardware.
- [13:00] Andrew converses with Tyler Cipriani and becomes aware of the deployment-prep issues, starts debugging in earnest.
- [15:00] By this time it's clear that the issue is localized to instances on labvirt hardware. Scripted migration of tools is halted.
- [16:30] The first working theory is that there's competition for resources on labvirt1005 and 1006, as instances on those hosts are sending the most alerts. Ganglia graphs are spiky and concerning and most instances on those hosts are unresponsive, so Andrew reboots them. Symptoms are temporarily alleviated.
- [18:00] It's clear now that reboots were insufficient and we still have issues, including on labvirt1001-1004.
- [20:00] Alex Monk notes that ping times are very irregular; sometimes jumping to multiple seconds. Andrew confirms that this issue is also isolated to instances on labvirt hosts. Marc joins the debug effort.
- [21:00] Marc notices clock drift on instances and quickly locates a kernel bug that fits: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1307473
- [21:15] It's agreed to try a kernel upgrade. Andrew starts migrating tools hosts away from labvirt1001 so it can be restarted without causing further interruption.
- [21:45] Andrew upgrades the kernel of labvirt1001 to 3.16.0-34 and reboots. The instance fails to start as it is unable to mount the filesystem.
- [22:30] labvirt1001 is finally back up, running kernel 3.13.0-48. Instances seem to be running properly. Andrew migrates tools nodes away from labvirt1002.
- [23:45] Andrew upgrades labvirt1002 to 3.13.0-48, reboots, restarts all instances.
2015-04-23
- [14:00] labvirt1001 and labvirt1002 are declared healthy. Andrew migrates tools hosts away from 1003-1006, Marc drains jobs from affected tools-exec nodes.
- [14:24] labvirt1006 upgraded and rebooted
- [15:08] labvirt1005 upgraded and rebooted
- [15:42] labvirt1003 upgraded and rebooted
- [16:06] labvirt1004 upgraded and rebooted
- [16:30] All labvirt hosts up, all instances running.
Actionables
(Use https://phabricator.wikimedia.org/tag/incident-20150422-labsoutage/ for any follow up tasks)
All labvirt nodes are now upgraded and fine. When the other HP systems (virt1010, 1011, 1012) are re-imaged, it's critical that a dist-upgrade and reboot be run before any instances are migrated to them.
- Guard against nova-compute being deployed on affected kernels (T97152)
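A guard of this kind could be sketched roughly as follows. This is only an illustration, not the check implemented for T97152: the function names and the `MIN_SAFE_KERNEL` threshold are assumptions. The report does not say which kernel version first contains the fix; 3.13.0-48 is simply the version the labvirt hosts were upgraded to, so using it as the minimum is deliberately conservative (it would also reject the 3.13.0-46 kernel on virt1012, for which no problems are reported here).

```python
import platform
import re
from typing import Optional, Tuple

# Assumption: 3.13.0-48 is the version the labvirt hosts ended up on.
# The actual first fixed version is not stated in this report.
MIN_SAFE_KERNEL = (3, 13, 0, 48)


def parse_kernel_release(release: str) -> Optional[Tuple[int, ...]]:
    """Turn an Ubuntu-style release such as '3.13.0-24-generic' into (3, 13, 0, 24)."""
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def kernel_is_safe(release: str, minimum: Tuple[int, ...] = MIN_SAFE_KERNEL) -> bool:
    """Return True only if the release parses and is at least the minimum version."""
    parsed = parse_kernel_release(release)
    # Release strings that cannot be parsed are treated as unsafe.
    return parsed is not None and parsed >= minimum


def running_kernel_is_safe() -> bool:
    """Check the kernel of the host this code runs on."""
    return kernel_is_safe(platform.release())
```

A deployment step could call `running_kernel_is_safe()` and refuse to start nova-compute when it returns False. Note that a plain version comparison would accept 3.16.0-34, which failed to boot on labvirt1001 for an unrelated reason, so it is not a substitute for testing a new kernel on one host first.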
Affected Instances
c6f1aa6d-a52b-4234-9835-630658a71940 | compiler
228e1ae7-eee6-4930-b706-a5b20423cfd1 | consul1
1b7b03fb-9c28-42d1-b332-32b1847bc64d | consul2
e4091940-9eda-43c9-b888-3c89c7e26cb3 | consul3
b52e2a1e-bb61-4819-b2f5-d552c1cfc825 | cvn-apache8
e43664a5-e763-469a-9948-2f2c6c539db2 | cvn-app4
9cb1f5db-e6b0-4e47-a508-29feb705bcf2 | cvn-app5
742f631a-6bcc-4bb9-8ab1-8cacc1d376da | dashboard-sentry
fd78c4b0-353e-4b8e-a079-dd59d6232751 | deployment-apertium01
e2c147fd-dc09-417a-8d11-ce8a2a652467 | deployment-bastion
9d05dbda-4103-432c-9449-498243e10db6 | deployment-cache-bits01
aa3c3550-96a7-4f20-a1ab-c88c01a8e5e9 | deployment-cache-mobile03
94d3fa4b-5bf7-4c15-acdf-35d20bb4942d | deployment-cache-text02
5e4a6717-6db8-4033-aa2b-14282dad290e | deployment-cache-upload02
d0da5ac8-34ca-43a9-b63d-64ee438c29cc | deployment-cxserver03
aec2245c-e965-4f06-8496-a0d9abf519ee | deployment-db1
8a9c2ff4-ef08-4fd9-89aa-bc954e982c2d | deployment-db2
cacacac3-010d-4ce6-a13c-004e12a17f5b | deployment-elastic05
1411d0ec-e934-4bfa-8327-81bfbbe4df32 | deployment-elastic06
52c5fd7a-9b14-45da-b531-7c7a458be5c2 | deployment-elastic07
a9a9522f-883d-4f24-b290-960c57a91f2d | deployment-elastic08
abca73aa-4b99-4442-a662-adbfcaadd40b | deployment-eventlogging02
05cec48f-ed80-40a9-b6fc-eff9d3c40fbe | deployment-fluoride
ff9aac2d-ba32-4b86-91be-5aa4181589f3 | deployment-jobrunner01
dfacf7e3-d60c-4990-9681-30610df4ae3d | deployment-kafka02
0294f3af-eba5-4ac8-9205-7c1aba8808d9 | deployment-logstash1
012b196d-c795-4ab6-94ea-e453214d39c2 | deployment-lucid-salt
e8cdee8b-d4b9-4ccb-8be5-944093ae3bf3 | deployment-mathoid
91247f8c-e524-4d20-9c9e-2fc2ae3cdc23 | deployment-mediawiki01
beb4a87e-cb4f-4e2f-a442-fda67ce20c98 | deployment-mediawiki02
2cfaf18c-e6ea-4c2d-b96f-df7f50b6bc9a | deployment-mediawiki03
8290c03a-a64c-4d22-bbce-f7c92afe30cd | deployment-memc02
15d6d50c-aae9-4320-9e48-ea3022fba95f | deployment-memc03
507ed00f-b7fc-42e8-803c-53224646598d | deployment-memc04
811cd53f-855b-490f-b28e-c80184600dd5 | deployment-mx
cec6f6dc-5ab0-420e-8bc0-871ad3c9999b | deployment-parsoid01-test
b26e5c79-7190-431c-9fc9-e12bf05c0cd6 | deployment-parsoid05
9a284217-479f-4cab-9652-58fb3659aa66 | deployment-parsoidcache02
d46df8b9-6c41-409d-9853-b2b4dc876088 | deployment-pdf01
54c66f88-4c39-487b-802b-2eec751f4300 | deployment-pdf02
fb5507a9-6488-47b8-9737-ed739f8faff5 | deployment-redis01
e9e794b0-df85-4181-8b8a-c4a784bf11e7 | deployment-redis02
a71ec107-2c2a-4a5a-bdb5-d35d1ca95302 | deployment-restbase01
088a0575-c09b-4c42-88a5-4ef57d8705c0 | deployment-restbase02
5a4610ff-3fb1-443c-92f2-995ed63d3e79 | deployment-rsync01
abb50762-93d0-4c9d-8853-adbbb6b56e00 | deployment-salt
fcaa135e-3fce-4fae-afe0-ded789fe6f6a | deployment-sca01
ee28b0f9-7071-4724-a852-63b0a95a7416 | deployment-sentry2
70e51d3c-f898-4c87-9b3f-e11bee0087d6 | deployment-stream
ec228fb1-7cca-4c1b-9f5f-63bfc0aee45c | deployment-test
eba7ec1f-8fcf-4ab3-a616-33f486cfb099 | deployment-upload
02bed745-a849-4f95-8d0c-ef2633b19ac0 | deployment-urldownloader
4d4ee285-2eaa-4286-9e98-4fd705c50de4 | deployment-videoscaler01
1754601e-6b04-49d1-a1f9-0c85e361379b | deployment-zookeeper01
be146908-b8ad-45ee-894b-9c7c1ed983ff | deployment-zotero01
071aaf64-0da6-463f-9793-1a847774b816 | designate-devel
e7720543-b214-41ea-824e-60626717509e | etcd1
0995d912-6091-4f1e-bd82-b2e547b558fd | etcd2
939f577c-1057-4fbb-aca4-c6436ffe3130 | etcd3
f7e8f15f-d5b3-4cf7-847b-612f4443b86c | etherpadt
78c56d53-1770-466b-9ad2-6955a539561c | integration-saltmaster
a4958be9-7226-4485-b569-5dedeaccc9be | integration-slave-trusty-1021
b1af424a-e4f8-4291-8ca0-e572c4db7ff5 | labs-bootstrapvz-jessie
06893e3c-cb62-40bb-b198-ba9f1be01725 | labs-vmbuilder-trusty
c3a82ada-8d29-4915-b2ef-44d85994a7ab | otto-hadoop-master01
5853c165-347a-4597-b9ff-a80288b9332d | otto-hadoop-worker01
022b065e-af9f-4731-9dda-bdbafdf31673 | otto-impala-master01
c1036bc0-58fb-4c1e-9b4c-3fb265816af0 | puppet-andrew
7cbdffce-592c-41f4-82bc-4287ed889e9c | puppet-jmm
cbf588f5-169d-4fa9-899b-0fc5419b0630 | puppet-jmm-debian
bd22137c-3d58-4d86-9ec5-dcd104259c4a | puppet-jmm-precise
9d9666b5-5c1e-402f-81e1-dd811eff2f1c | puppet-jmm-salt-trusty
6a73ec36-5f5b-4074-9c30-128a738f91ee | puppet-jmm-salt-trusty-minion
a0451388-9e99-4620-8c23-c8d64667dd12 | puppet-jmm-trusty
2e2e7624-7264-4c28-9cf5-16ccefa794a4 | puppet-mailman
1c9789e7-2e0f-4ecb-ba70-1561112574f7 | puppet-matanya
0d61121b-5f29-4c3c-a5db-3a2b5f20ad56 | sol
2f7a4792-0e76-4e32-b53c-204d5c54c9b8 | staging-cache-text01
887d3809-46b3-4281-a301-e0a2629eb790 | staging-db01
f935259d-e57b-4afb-8e22-e83167d77be5 | staging-elastic01
cfb5bbd0-f76a-416f-8794-f33d9571b43a | staging-elastic02
43224d6d-882a-4da0-9e4b-6a594edb3901 | staging-elastic03
929c9324-b6f2-444f-9a43-a17b012d1c5c | staging-elastic04
e40ecd84-c0b2-4148-a7c8-9f94ab34f5e4 | staging-eventlogging
abcb156c-640b-4e17-a5c0-fb8eefbfbd42 | staging-mc1
00112297-e081-482d-b767-48f4359c8882 | staging-mc2
63356051-954e-4d82-965f-2718d5976fe9 | staging-ms-be01
728acb5a-72a4-41ce-90a3-83173dc7673e | staging-ms-be02
3842a06c-6b40-4bd8-845b-539adb9259df | staging-ms-be03
bf30c55f-c79f-43c5-8ead-871955a5237f | staging-ms-fe01
e2dc144b-bdb0-4833-8167-73e7c1e3aa3b | staging-mw01
8d000f33-5751-4704-b150-026b2a0e2013 | staging-mx
d35af3fe-0e9e-41e3-82df-ae5bcad08812 | staging-ocg01
45bc9d67-a7ea-41a2-bb36-aa2c496f2119 | staging-palladium
555bef0f-c83b-41ea-bf09-1e359a17f4cf | staging-rdb01
140c242c-4e96-4552-9603-6d45f2d13439 | staging-rdb1
20cd2542-f813-4f62-88cf-d2ee9b8b4632 | staging-rdb2
923b0a84-729b-4b80-9fa1-e5c9e26ee330 | staging-sca01
10d943dc-5012-4a38-9201-fb614838e0a9 | staging-stream
3d06622c-c4cb-434e-9fe4-0d47ac8b3e9d | staging-test-tin
ed5f01c6-9157-4962-8360-7e696c2fdefb | staging-tin
ca015784-def3-4231-a225-e7844f179ce4 | tools-bastion-01
dc88c3f6-b685-4e5a-8a54-436b63147497 | tools-bastion-02
086da8f4-e7d2-4541-830b-9f946510e7dc | toolsbeta-quarry-labsdebrepo-test
605caf6e-f642-4cd3-8d42-268fb5e2c612 | tools-dev
4222c0f5-b3bd-41a9-94d2-30faad4202ce | tools-exec-01
eb6e8fad-8646-4251-a706-fc90bf0be0c9 | tools-exec-02
fa611e16-6b85-4f74-92a3-2ed1635fa481 | tools-exec-04
6a1a2095-8474-4378-8290-9dece5b9c3d8 | tools-exec-05
ad12146e-b225-47b2-97f0-330527688331 | tools-exec-06
30b98f1d-1c5a-49c1-b800-f4c535addc12 | tools-exec-07
cb2940d6-2560-4dc5-9e12-f894efd33dfc | tools-exec-08
5cd684db-d0a6-4241-a11f-daf4c1b2f717 | tools-exec-09
ec414ae4-a46f-425f-b9d5-950df155f137 | tools-exec-10
a4fc3c84-bc8e-42bf-9209-0549c9872e84 | tools-exec-11
47608ad4-1adc-4104-b1c5-96281a945ff8 | tools-exec-12
dcb7a789-5c33-42c5-85ea-b8dd50dcbf1b | trusty-manual
bf212ca9-9e05-427b-ac44-7155900f7cba | trusty-medium-1429649189
3d2e226d-a74d-4188-a0b5-1c56093d599c | util-abogott
fde6e5e8-a78d-4950-8b6c-b12a84170a3d | wikidata-mobile
095763a3-84ca-4c7a-90dc-d143be64722c | wikitech-test-network
bf85a20d-beba-47db-ae38-86b149acf9a2 | wt-test-api
ea6a001c-b2a9-44b8-8c47-17a51885232d | wt-test-compute
8e592f62-3dbc-43f1-ae4c-94cc807f6418 | wt-test-controller
36de46b1-c196-4253-adf9-0bb3b036ddb3 | zk1
533c8e50-facc-4659-bb9a-b934e5585be3 | zk2
920a53e7-91a1-4a60-a8c5-92db1ab8aa55 | zk3