
Automatically Aligned Bilingual Stories

How Wikidata, multilingual sentence embeddings and dynamic time warping can make Chekhov and other literary classics more accessible to all of us

By Calepini
Tags: literature, bilingual, stories
An illustration combining a portrait of Anton Chekhov (following his portrait by Osip Braz), some documents and a dynamic time warping path in a similarity matrix. Image created by the author using Inkscape and a mouse.

As a language learner, I have often grappled with a difficult trade-off: the literary quality and interest of a text versus how much I can actually understand and appreciate. Should I stick to children’s books, or laboriously tackle Chekhov’s masterpieces — knowing I might grasp the gist with effort, but lose the nuances? Reading foreign literature in the original, with direct access to a professional translation for any challenging word or sentence, would be ideal. This is possible with a bilingual edition.

My hypothesis: Bilingual editions have traditionally been limited in circulation due to the mismatch between the small number of readers for each language pair and the substantial effort required for printing and editing. But we can now use computational tools to automate the preparation of bilingual editions for any work available in multiple languages.

If you just want to read Chekhov's stories in Russian-English bilingual editions, as extracted from Wikisource, visit this page (or drop me a message if your favourite short story is not there).

Read on if you want to know the recipe.

Step 1: Finding works available in two languages

Our first step is to source literary works available in multiple languages. Wikisource hosts many public domain works, including numerous short stories by Anton Chekhov.

Wikidata is like a vast catalog that links Wikipedia articles, Wikisource pages, people, literary works, animals and other entities. Among many other things, you can query Wikidata for short stories by Anton Chekhov (known to Wikidata as Q5685), sorted by the number of Wikisource pages linked to them. The list includes short stories such as At Home (Q3576783), available in Russian, English, German and Norwegian (Bokmål). See the Wikidata queries in the appendix.
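If you would rather fetch the results from code than from the web interface, a minimal sketch using Python and the requests library against the public query endpoint could look like this (the query here is abbreviated; the full versions are in the appendix):

import requests

# Public endpoint of the Wikidata Query Service
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Abbreviated query: works written by Chekhov (Q5685);
# see the appendix for the full, sorted version
QUERY = """
SELECT ?work ?workLabel WHERE {
  ?work wdt:P50 wd:Q5685 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""

response = requests.get(
    WDQS_ENDPOINT,
    params={"query": QUERY, "format": "json"},
    # WDQS asks clients to identify themselves with a user agent
    headers={"User-Agent": "bilingual-stories-example/0.1"},
)
response.raise_for_status()

for row in response.json()["results"]["bindings"]:
    print(row["work"]["value"], row["workLabel"]["value"])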

Step 2: Aligning pairs of texts

Now comes the interesting challenge. We have two texts: the Russian original and an English translation. We could put each of these in a column and hope that a Russian sentence and its translation remain close to each other across the whole text, but this hope would only rarely be fulfilled. So how do we align these two texts (one of which we do not understand perfectly) automatically?

2a: Splitting texts in chunks (sentences or paragraphs)

First, we need to break both texts into manageable chunks: sentences or paragraphs. Alignment then consists in mapping each chunk in the first text to a chunk in the second text, so that a chunk and its translation sit at roughly the same vertical position when the two texts are displayed side by side.

Splitting a text into paragraphs is straightforward if line breaks or markup are present. Dividing paragraphs into sentences is trickier (consider periods ending sentences versus those in abbreviations or decimals), but perfection is not our goal. After all, a single original sentence may be split into two in translation, or vice versa.

So we have two lists - one with Russian sentences, one with English sentences. But they are usually not the same length, and the tenth Russian sentence will typically not match the tenth English sentence. We will need some understanding of the respective sentences to match them.
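As a minimal sketch, assuming ru_text and en_text hold the plain texts retrieved from Wikisource, the chunking could look like this (real texts need more care with abbreviations and dialogue punctuation):

import re

def split_paragraphs(text: str) -> list[str]:
    """Split a text into paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_sentences(paragraph: str) -> list[str]:
    """Naive sentence splitter: cut after ., !, ? or … followed by whitespace.
    Imperfect, but the alignment step tolerates wobbly boundaries."""
    return [s.strip() for s in re.split(r"(?<=[.!?…])\s+", paragraph) if s.strip()]

ru_sentences = [s for p in split_paragraphs(ru_text) for s in split_sentences(p)]
en_sentences = [s for p in split_paragraphs(en_text) for s in split_sentences(p)]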

2b: Embedding chunks of text

This is where things get interesting. How can we compare a Russian sentence to an English one to determine if they are translations of each other?

The answer: multilingual sentence embeddings.

We use a multilingual sentence embedding model, such as "paraphrase-multilingual-MiniLM-L12-v2", which maps sentences to vectors in a 384-dimensional space. The model has been trained on texts in many languages and has learned that "The dog barked loudly" in English and "Собака громко лаяла" in Russian should be close together in embedding space, because they mean the same thing.

What are embeddings?

Think of embeddings as converting text into coordinates in a high-dimensional space. Just as any location on Earth can be represented by latitude and longitude (two numbers), embeddings represent the meaning of text with hundreds of numbers. Sentences with similar meanings end up near each other in this space, even if they are in different languages.

After running both sets of paragraphs through this model, we have two sets of coordinates, one for each language. While making sense of points in 384-dimensional space is challenging (to say the least), it is easy to calculate the distance or similarity between a pair of points, and thus to measure how similar any Russian sentence is to any English sentence.
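With the sentence-transformers library, this step fits in a few lines; a sketch, reusing the sentence lists from step 2a:

from sentence_transformers import SentenceTransformer, util

# Multilingual model mapping sentences to 384-dimensional vectors
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Encode both sentence lists into the shared embedding space
ru_embeddings = model.encode(ru_sentences, convert_to_tensor=True)
en_embeddings = model.encode(en_sentences, convert_to_tensor=True)

# Cosine similarity between every Russian and every English sentence:
# entry [i, j] compares ru_sentences[i] with en_sentences[j]
similarity = util.cos_sim(ru_embeddings, en_embeddings).cpu().numpy()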

Pairwise similarities of sentences in two languages (Russian and English) for the first 22 sentences extracted from the Wikisource texts of Chekhov's story Дома/At Home.

One can display this similarity between every Russian sentence and every English sentence in a matrix, as in the above figure. This similarity matrix is characterized by a nice diagonal pattern, corresponding to high similarity between the nth sentence in Russian and the (almost n)th sentence in English. But of course the diagonal is a bit wobbly because sentences are not always perfectly parallel; otherwise it would be too easy.

2c: Finding the best alignment

This wobbly diagonal is what we are looking for. How can we formalize it? Enter dynamic time warping (DTW).

Imagine two actors delivering the same lines, but one slightly faster and with a different rhythm. How would you align their performances? You cannot just match second 1 to second 1, because they would end up completely out of sync after a few seconds. You need something smarter that can stretch and compress the timeline and compensate for differences in rhythm.

That is exactly what DTW does for our texts. It finds the optimal path through the similarity matrix shown and described above: a path that connects Russian sentences to their English counterparts while allowing for some flexibility. Maybe two Russian sentences correspond to one long English sentence, or vice versa. DTW can handle that. The algorithm works its way through both texts, sometimes moving forward in both languages simultaneously (when sentences match one-to-one), sometimes lingering in one language (when several sentences in one language correspond to a single sentence in the other), always choosing the path that maximizes overall similarity.

Unlike naive approaches that would assume texts are the same length or that chunks match one-to-one, DTW gracefully accommodates the realities of translation: shifting sentence and paragraph boundaries, condensations or elaborations by the translator, and differing front matter between versions.
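For the curious, here is a minimal sketch of that dynamic programming in Python, operating on the similarity matrix from step 2b (in practice one might prefer an existing DTW library with more refined step patterns):

import numpy as np

def dtw_path(sim: np.ndarray) -> list[tuple[int, int]]:
    """Find the monotonic path through the similarity matrix that
    maximizes cumulative similarity (the textbook DTW recurrence)."""
    n, m = sim.shape
    cost = np.full((n + 1, m + 1), -np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Predecessors: diagonal (one-to-one match), up (extra
            # Russian sentence), left (extra English sentence)
            cost[i, j] = sim[i - 1, j - 1] + max(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]
            )
    # Backtrack from the bottom-right corner to recover the path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmax([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

alignment = dtw_path(similarity)  # list of (russian_index, english_index) pairs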

Pairwise similarities of sentences in Russian and English (as above) and the optimal alignment path found between them by dynamic time warping (DTW).

2d: Consolidating paragraphs

After experimenting with different chunking levels (sentence and paragraph), I realized that the best results could be obtained by first chunking into sentences and then aggregating sentences back into paragraphs, considering paragraph boundaries in both texts as well as some constraints on "nice" paragraph sizes.
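As an illustration of the idea (the exact constraints are a matter of taste, and this simplified sketch only respects Russian paragraph boundaries), assuming ru_paragraph_starts is the set of Russian sentence indices that open a paragraph:

def consolidate(alignment, ru_paragraph_starts, min_pairs=2):
    """Group aligned sentence pairs into bilingual paragraphs, cutting at
    Russian paragraph boundaries unless the current group is still too short."""
    groups, current = [], []
    for ru_i, en_j in alignment:
        if ru_i in ru_paragraph_starts and len(current) >= min_pairs:
            groups.append(current)
            current = []
        current.append((ru_i, en_j))
    if current:
        groups.append(current)
    return groups

bilingual_paragraphs = consolidate(alignment, ru_paragraph_starts)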

Step 3: Reading

This step is yours to explore. Some of the ways you could do it:

  • If you are proficient in the original language, focus on the original text, using the translation only to clarify or deepen your understanding of challenging phrases. As a proficient Russian speaker, you can read the authentic Chekhov, but when you hit the archaic "аршинной" (arshin-length), you can glance at the translation without losing your place.

  • If you are less proficient, start with the translation, then read the original — now equipped with the understanding the translation provided.

  • As a language student, read both versions side by side, paragraph by paragraph. Notice the translator’s choices: where they added explanations, simplified, or where nuances were inevitably lost in translation.

Conclusion

Contemporary natural language processing can do more with literature than count words and letters (as in Markov’s day) or generate text (a capability we now all know, and sometimes worry about).

By combining Wikidata’s cultural knowledge graph, the cross-lingual power of multilingual embeddings, and DTW’s flexible alignment, we can create something genuinely useful: bilingual editions for any text with translations on Wikisource. The beauty of this approach is its universality. Want to read Maupassant in French and English? Kafka in German and Spanish? The method does not care — it simply matches meanings across languages, provided both are supported by the multilingual embedding model.

Literature in translation has always involved trade-offs. We gain access to stories and ideas from other cultures, but lose the music of the original language. Bilingual editions transcend this trade-off: learn languages through the literature you actually want to read, rather than being confined to texts written for language learners.

I also have lots of ideas for more interactive and more helpful bilingual editions, from word-level alignment to personalized glossaries for each story, so stay tuned for more.

Appendix

Wikidata queries

You can execute the following queries with the Wikidata Query Service.

Listing popular works written by a given author:

# List popular works written by a given author
# Replace $author_id with the Wikidata ID of the author of interest, e.g. Q5685 for Chekhov
SELECT ?work ?workLabel (COUNT(DISTINCT ?wsSitelink) AS ?wikisourceSitelinks)
    WHERE
    {
      {
        SELECT ?work ?workLabel
        WHERE
        {
          ?work wdt:P50 wd:$author_id .  # author is ...
          ?work wikibase:sitelinks ?sitelinks.
          SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],mul,en". }
        }
        ORDER BY DESC(?sitelinks)
        LIMIT 200 # an intermediate limit to speed up the query
      }
      OPTIONAL {
        ?wsSitelink schema:about ?work ;
                    schema:isPartOf ?site .
        ?site wikibase:wikiGroup "wikisource" .
      }
    }
    GROUP BY ?work ?workLabel
    ORDER BY DESC(?wikisourceSitelinks)

Getting Wikisource sitelinks for a given work:

# Get Wikisource sitelinks for a given work
# Replace $work_id with the work you are interested in
# e.g. Q3191735 for "Kashtanka"
SELECT ?work ?workLabel ?sitelink ?siteName ?site ?wikiGroup ?languageLabel
WHERE {
  BIND(wd:$work_id AS ?work)
  
  # Get all sitelinks for this work
  ?sitelink schema:about ?work .
  ?sitelink schema:isPartOf ?site .
  ?site wikibase:wikiGroup ?wikiGroup .
  ?site wikibase:wikiGroup "wikisource" .
  
  # Get the Wikidata item for this site and its language
  OPTIONAL {
    ?siteItem wdt:P31 wd:Q15156455 .  # instance of: Wikimedia project edition
    ?siteItem wdt:P856 ?site .         # official website
    ?siteItem wdt:P407 ?language .     # language of work or name
  }
  
  SERVICE wikibase:label { 
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" . 
  }
}
ORDER BY ?languageLabel