Extension:LocalisationUpdate/LUv2

Code: https://github.com/konarak/LUv2/tree/epicmess (please bear with me, it'd have much nicer, proper commits in about a week)

LUv2: A generic, efficient localisation update service

Public URL
https://www.mediawiki.org/wiki/Extension:LocalisationUpdate/LUv2
Bugzilla report
T48653
Announcement
Updates

Identity and Contact

Name
Konarak Ratnakar
Email
konarak.11 at google dot com
Typical working hours
0930 to 2130 UTC
IRC
kondi
Location
Ahmedabad, India

Project Outline

Wikimedia wikis use the LocalisationUpdate extension to get updated localised messages from translatewiki.net. To accomplish this, the extension downloads the translation files for MediaWiki and the installed extensions, stores them in a local cache, and then processes them to determine which messages can be updated.

The described flow is not very efficient: not all wikis require support for ~300 languages, and the extension downloads complete files instead of only the deltas, which consumes considerable time. In addition, the extension requires setting up cron jobs and other manual configuration that could be avoided.

LUv2 aims to fix this by creating a new service that keeps track of updates and makes them accessible via a push interface and a RESTful API.

Assigned mentors
Kartik Mistry, Nikerabbit, Amir Aharoni

Deliverables

The main component of the project is the LUv2 service.

Service

The service comprises a database update mechanism and an API. It will store the latest version of the translation messages and make them available to clients via the API. The service will be written in node.js and will use redis as the database.
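
As a minimal sketch of this shape, assuming Express for HTTP handling and the node-redis client (the endpoint paths, port and helper names are placeholders, not final decisions):

// service.js -- skeletal shape of the LUv2 service (illustrative only)
const express = require('express');
const { createClient } = require('redis');

const app = express();
app.use(express.json());

const db = createClient(); // redis connection holding the latest messages

// Endpoint 1: clients poll for updated messages (see the API section below).
app.get('/updates', async (req, res) => {
  // ...look up messages newer than req.query.lastupdated and respond
  res.json({ result: 'success' });
});

// Endpoint 2: the git post-receive hook triggers a database refresh (see Database Update).
app.post('/update-db', async (req, res) => {
  // ...download the changed files, parse them and update redis
  res.json({ result: 'queued' });
});

db.connect().then(() => app.listen(8080));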

API

There will be two API endpoints: one that clients will use to fetch updates, and a second that will be used to trigger the database update.

There are multiple ways in which we can know whether updates are available for the client. The basic option would be to have the client send the project_id, language_id and message_id->message pairs, and compare them with the ones in the database. This option is not the most efficient, as Niklas hinted.

Another option would be to generate hashes of the localisation files on the server on every database update, require the client to send its hash, and compare it with the server's. Yet another option would be to have clients send the timestamp of their last update and compare it with the messages' timestamps.
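
For instance, the timestamp option could look roughly like this, assuming each message is stored in a redis hash per project/language together with its last-modified timestamp (the key layout and the updatesSince name are assumptions for this sketch):

// Return all messages of a project/language that changed after `since` (a unix timestamp).
async function updatesSince(db, project, language, since) {
  const stored = await db.hGetAll(`lu:${project}:${language}`); // message_id -> JSON string
  const messages = {};
  let latest = since;
  for (const [id, raw] of Object.entries(stored)) {
    const { text, updated_at } = JSON.parse(raw);
    if (updated_at > since) {
      messages[id] = text;
      latest = Math.max(latest, updated_at);
    }
  }
  return { updated_at: latest, message_count: Object.keys(messages).length, messages };
}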

(To allow or not to allow multiple projects in a single query?)

The client would make the following POST request

api.translatewiki.net?project_id=mediawiki&language_ids=gu|hi&lastupdated=1396682587

and would receive a response such as:

{
  "result": "success",
  "project_id": "mediawiki",
  "language_ids": [
    "gu",
    "hi"
  ],
  "updates": {
    "gu": {
      "file_shasum": "shasum",
      "updated_at": 1396683192,
      "message_count": "7",
      "messages": {
        "sun": "રવિ",
        "mon": "સોમ",
        "tue": "મંગળ",
        "wed": "બુધ",
        "thu": "ગુરૂ",
        "fri": "શુક્ર",
        "sat": "શનિ"
      }
    },
    "hi": {
      "file_shasum": "shasum",
      "updated_at": 1396683705,
      "message_count": "4",
      "messages": {
        "sun": "रवि",
        "mon": "सोम",
        "tue": "मंगल",
        "wed": "बुध"
      }
    }
  }
}

The git hook would make the following POST request

api.translatewiki.net ?

{
}
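
The exact payload is still open; purely for illustration, it could carry just enough information for the service to know what changed (every field name below is hypothetical):

// Hypothetical post-receive payload -- none of these fields are decided yet
const hookPayload = {
  project_id: 'mediawiki',          // project whose repository received new commits
  changed_files: [                  // translation files touched by the push
    'languages/i18n/gu.json',
    'languages/i18n/hi.json'
  ]
};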

Database Update

To automate database updates, the post-receive hook on the projects' remote repositories will be configured to hit an API endpoint. This will trigger the database update process. The update script will download the modified/new files to a temporary directory, parse and compare them with the old files, and insert/update the messages in the database. This script will ideally support all the formats that are currently supported by the Translate extension.

(Alternatively, we could get the messages directly from the translatewiki.net Elasticsearch store; for that I need to think of a way to notify the service that updates are available. Since we can get the same messages from the Elasticsearch store in a standard format, the different file format parsers would not be required.)
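
As a rough sketch of the compare-and-update step for one downloaded JSON file, reusing the redis layout assumed earlier (applyFileUpdate is an illustrative name, not part of any existing code):

const fs = require('fs');

// Compare a freshly downloaded JSON translation file with what redis holds
// and write only the messages that actually changed.
async function applyFileUpdate(db, project, language, filePath) {
  const incoming = JSON.parse(fs.readFileSync(filePath, 'utf8')); // message_id -> text
  const key = `lu:${project}:${language}`;
  const stored = await db.hGetAll(key);
  const now = Math.floor(Date.now() / 1000);

  for (const [id, text] of Object.entries(incoming)) {
    const old = stored[id] ? JSON.parse(stored[id]).text : undefined;
    if (old !== text) {
      await db.hSet(key, id, JSON.stringify({ text, updated_at: now }));
    }
  }
}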

In case one server goes down, clients should have a secondary source they can query; a read-only replica would solve this.

The database will primarily store the following data (one possible redis layout is sketched after this list):

  • last modified timestamp
  • project identifier
  • message identifier
  • language code of the message
  • message string
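
With redis, these fields could be laid out as in the earlier sketches, for example:

// One possible redis layout for the fields above (illustrative only):
//   key:    lu:<project identifier>:<language code>    -- one hash per project/language
//   field:  <message identifier>
//   value:  JSON string { "text": <message string>, "updated_at": <last modified timestamp> }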

This service will ideally work with all projects hosted on translatewiki.net. It will primarily benefit web applications, which will be able to serve the latest translation messages instead of depending on software updates.

Cron dependency

I have left a significant part of the schedule open to research, discuss and implement the best possible solution for eliminating the cron requirement. A push-based solution would be best suited for this, and it will have to be heavily scrutinised in terms of security. One possible solution would be an event-based approach in which the client registers an endpoint with the service, which is then notified when updates are available. Another possible solution would be to use the PubSubHubbub protocol for pushing the updates. This functionality will be beneficial to MediaWiki users on shared hosting (many shared hosts don't let users set up cron jobs) and to those who are not used to setting up cron jobs.
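
As a very rough illustration of the event-based option (the endpoint path, field names and the notifySubscribers helper are all assumptions, reusing app and db from the earlier sketch; the security questions above apply to whatever is actually chosen):

// Illustrative only: a wiki registers a callback URL once instead of polling via cron.
app.post('/subscribe', async (req, res) => {
  const { project_id, language_ids, callback_url } = req.body;
  await db.sAdd(`lu:subscribers:${project_id}`,
                JSON.stringify({ language_ids, callback_url }));
  res.json({ result: 'subscribed' });
});

// After a database update, notify every registered callback.
async function notifySubscribers(projectId) {
  for (const raw of await db.sMembers(`lu:subscribers:${projectId}`)) {
    const { callback_url } = JSON.parse(raw);
    // A plain webhook POST is used here for brevity; PubSubHubbub or another
    // transport is exactly what the research phase should evaluate.
    await fetch(callback_url, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ project_id: projectId })
    });
  }
}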

Extension

The LocalisationUpdate extension will be rewritten to fetch updates from the new service.

Future

I plan on maintaining the service and the extension after the GSoC period ends. I'd also work with other projects willing to use this service.

Tentative timeline

Dates Task
May 12 - May 17 Finalize and document the flow for the database update mechanism.
May 19 - May 24 Code the initial update mechanism with support for PHP arrays, YAML and JSON.
May 26 - May 31 Code the service APIs.
Jun 02 - Jun 07 Test the code in a sandbox for MediaWiki, the rails-port and jquery.uls. Set up redis, import translations.
Jun 09 - Jun 14 Code the new LocalisationUpdate extension.
Jun 14 - Jun 19 Test all of the components in the sandbox. Gather bugs.
Jun 23 - Jun 28 Mid-term evaluations. Fix bugs.
Jun 30 - Jul 05 Write missing tests and documentation, if any. Gather and fix bugs.
Jul 07 - Jul 12 Discuss the possible solutions to tackle the cron dependency issue.
Jul 14 - Aug 02 Write and test the code according to the consensus of the discussion.
Aug 04 - Aug 09 Clean up, improve the code and documentation if needed.
Aug 11 Pencils down

About you

I'm based in Ahmedabad and am currently pursuing a diploma in information technology engineering. I've been involved in various Wikimedia activities since early 2012, mostly outreach. I've contributed to the English and Gujarati Wikipedias, Wikidata and Commons.

How did you hear about this program?

Read somewhere on the internet.

Will you have any other time commitments, such as school work, another job, planned vacation, etc., during the duration of the program?

I'll have exams in late May. Apart from that, I'll mostly be working on the project itself, if selected.

Participation

Progress tracking

I keep a log of `in progress` and `done` tasks, and I plan on doing the same for this project. I'll publish these updates on mediawiki.org.

Source code

I'll regularly publish the source code to a public git repository, either on Gerrit or GitHub.

Communication

I'm almost always online on IRC, particularly active in #wikipedia-en and #mediawiki-i18n. Emailing me would be the second-best way to get a quick response.

Past experience

Please describe your experience with any other FOSS projects as a user and as a contributor

I've mostly been a consumer of open source software to this day, and I wish to contribute code now, hopefully starting with this project. I've recently fixed a Parsoid bug and familiarised myself with MediaWiki's development process.[1]

What project(s) are you interested in (these can be in the same or different organizations)?

Other than this project, I'm very much interested in the MediaWiki API and Wikidata.