Bertalign

Word Embedding-Based Bilingual Sentence Aligner

Bertalign is designed to facilitate the construction of bilingual parallel corpora, which have a wide range of applications in translation-related research such as corpus-based translation studies, contrastive linguistics, computer-assisted translation, translator education and machine translation.

Bertalign uses LaBSE, a multilingual BERT model supporting 109 languages and provided through sentence-transformers, to represent source and target sentences, so that semantically similar sentences in different languages are mapped to nearby points in a shared vector space. According to our experiments, Bertalign achieves more accurate results than traditional length-, dictionary-, or MT-based alignment methods such as Gale-Church, Hunalign and Bleualign. It also performs better than Vecalign on MAC, a manually aligned parallel corpus of Chinese-English literary texts.
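
For illustration, here is a minimal sketch of the embedding step, using the standard sentence-transformers API rather than Bertalign's own scripts (the example sentences are made up):

from sentence_transformers import SentenceTransformer
import numpy as np

# LaBSE maps sentences from 109 languages into one shared vector space.
model = SentenceTransformer("sentence-transformers/LaBSE")

zh_emb = model.encode(["今天天气很好。"], normalize_embeddings=True)
en_emb = model.encode(["The weather is nice today."], normalize_embeddings=True)

# With normalized embeddings, the dot product equals the cosine similarity;
# sentences that are translations of each other should score high.
print(float(np.dot(zh_emb[0], en_emb[0])))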

Installation

Bertalign is developed using Python and tested on Windows 10 and Linux systems. The following packages need to be installed before running Bertalign:

pip install numba
pip install sentence-transformers
pip install faiss-gpu   # on GPU-enabled machines
pip install faiss-cpu   # on CPU-only machines (install one of the two faiss builds, not both)

Please note that embedding sentences on a GPU-enabled machine is much faster than on a CPU-only one. The following experiments are conducted on Google Colab, which provides free GPU service.
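
You can check whether a GPU is visible before embedding; sentence-transformers places the model on CUDA automatically when one is available (a minimal check, assuming a PyTorch install):

import torch

# On Colab: Runtime > Change runtime type > Hardware accelerator > GPU.
print(torch.cuda.is_available())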

Evaluation Corpora

Bertalign is language-agnostic thanks to the cross-lingual LaBSE embedding model provided by sentence-transformers.

For now, we only use the following two Chinese-English corpora to evaluate the performance of Bertalign. Datasets with other language pairs will be added later.

MAC-Dev

MAC-Dev is the development set selected from the MAC corpus, a manually aligned corpus of Chinese-English literary texts. The sampling scheme for MAC-Dev can be found in meta_data.tsv. MAC-Dev contains 1,469 Chinese and 1,957 English sentences.

There are 4 subdirectories in MAC-Dev:

The zh and en directories contain the sentence-split and tokenized source texts, target texts, and machine translations of the source texts. Hunalign requires tokenized source and target sentences to look up corresponding entries in a bilingual dictionary, while Bleualign uses the machine translations of the source texts to compute BLEU similarity scores between source and target sentences.

We use the Moses sentence splitter and Stanford CoreNLP to split and tokenize the English sentences, while pyltp and jieba are used to split and tokenize the Chinese sentences; see the example below. The machine translations of the source texts are generated by Google Translate.
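
A minimal example of the Chinese word segmentation step (illustration only, not the full corpus-preparation pipeline described above):

import jieba

# jieba.cut returns a generator of tokens; joining them with spaces yields
# the whitespace-tokenized text that Hunalign expects, e.g. 今天 天气 很 好 。
print(" ".join(jieba.cut("今天天气很好。")))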

The auto and gold directories are for automatic and gold alignments respectively. All the gold alignments are created manually using Intertext.

Bible

The Bible corpus, consisting of 5,000 source and 6,301 target sentences, is selected from the public multilingual Bible corpus. This corpus is mainly used to evaluate the speed of Bertalign.

The directory structure is similar to MAC-Dev's, except that the gold alignments for the Bible corpus are generated automatically from the original verse-aligned Bible corpus.

In order to compare the sentence-based alignments returned by the various aligners with the verse-based gold alignments, we record the verse ID of each sentence in the files en.verse and zh.verse; these IDs are used to merge consecutive sentences in the automatic alignments when they belong to the same verse, as sketched below.
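
A hypothetical sketch of this merging step (not the actual evaluation code; the verse IDs are invented for illustration):

def sents_to_verses(sent_ids, verse_of):
    # Map a run of sentence indices to their ordered, deduplicated verse IDs,
    # so consecutive sentences from the same verse collapse into one unit.
    verses = []
    for s in sent_ids:
        v = verse_of[s]
        if not verses or verses[-1] != v:
            verses.append(v)
    return tuple(verses)

# verse_of[i] holds the verse ID on line i of en.verse or zh.verse.
print(sents_to_verses([0, 1, 2], ["GEN.1.1", "GEN.1.1", "GEN.1.2"]))
# -> ('GEN.1.1', 'GEN.1.2')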

System Comparisons

All the experiments are conducted on Google Colab.

Sentence Embeddings

We use the Python script embed_sents.py to create the sentence embeddings for the MAC-Dev and Bible corpora. This script is based on Vecalign, developed by Brian Thompson.

# Embedding MAC-Dev Chinese
!python bin/embed_sents.py \
  -i data/mac/dev/zh \
  -o data/mac/dev/zh/overlap data/mac/dev/zh/overlap.emb \
  -m data/mac/dev/meta_data.tsv \
  -n 8
  
# Embedding MAC-Dev English
!python bin/embed_sents.py \
  -i data/mac/dev/en \
  -o data/mac/dev/en/overlap data/mac/dev/en/overlap.emb \
  -m data/mac/dev/meta_data.tsv \
  -n 8

# Embedding Bible English
!python bin/embed_sents.py \
  -i data/bible/en \
  -o data/bible/en/overlap data/bible/en/overlap.emb \
  -m data/bible/meta_data.tsv \
  -n 5

# Embedding Bible Chinese
!python bin/embed_sents.py \
  -i data/bible/zh \
  -o data/bible/zh/overlap data/bible/zh/overlap.emb \
  -m data/bible/meta_data.tsv \
  -n 5

The parameter -n indicates the maximum number of overlapping sentences allowed on the source and target sides, similar to word n-grams applied at the sentence level. After running the script, the overlapping sentences (overlap) of the source and target texts and their embeddings (overlap.emb) are saved in the directories mac/dev/zh, mac/dev/en, bible/zh, and bible/en.
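
The overlap generation can be sketched as follows (modeled on Vecalign's approach; the actual embed_sents.py may differ in details such as deduplication):

def overlaps(sents, n):
    # Concatenate every run of up to n consecutive sentences, so that
    # 1-to-many and many-to-many links can be scored as single lookups.
    out = []
    for size in range(1, n + 1):
        for i in range(len(sents) - size + 1):
            out.append(" ".join(sents[i:i + size]))
    return out

print(overlaps(["A.", "B.", "C."], 2))
# -> ['A.', 'B.', 'C.', 'A. B.', 'B. C.']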

Evaluation on MAC-Dev

Bertalign

# Run Bertalign on MAC-Dev
!python bin/bert_align.py \
  -s data/mac/dev/zh \
  -t data/mac/dev/en \
  -o data/mac/dev/auto \
  -m data/mac/dev/meta_data.tsv \
  --src_embed data/mac/dev/zh/overlap data/mac/dev/zh/overlap.emb \
  --tgt_embed data/mac/dev/en/overlap data/mac/dev/en/overlap.emb \
  --max_align 8 --margin

# Evaluate Bertalign on MAC-Dev
!python bin/eval.py \
  -t data/mac/dev/auto \
  -g data/mac/dev/gold
 ---------------------------------
|             |  Strict |    Lax  |
| Precision   |   0.909 |   0.996 |
| Recall      |   0.914 |   1.000 |
| F1          |   0.911 |   0.998 |
 ---------------------------------

Please refer to Sennrich & Volk (2010) for the difference between the Strict and Lax evaluation methods. We can see that Bertalign achieves a Strict F1 of 0.911 when aligning MAC-Dev.
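
Roughly, the two methods can be sketched as follows (a hypothetical illustration, not the actual eval.py; a link is written as a (source_ids, target_ids) tuple):

def precision(test, gold, lax=False):
    # Strict: a predicted link counts only if it matches a gold link exactly.
    # Lax: it counts if it overlaps some gold link on both sides.
    def hit(link):
        if not lax:
            return link in gold
        src, tgt = link
        return any(set(src) & set(g_src) and set(tgt) & set(g_tgt)
                   for g_src, g_tgt in gold)
    return sum(map(hit, test)) / len(test)

gold = [((0,), (0, 1))]                # one 1-to-2 gold link
test = [((0,), (0,)), ((0,), (1,))]    # predicted as two 1-to-1 links
print(precision(test, gold), precision(test, gold, lax=True))
# -> 0.0 1.0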

Aligning literary texts is not an easy task, since they contain more interpretive and free translations than non-literary works; see Xu et al. (2015) for more details on sentence alignment for literary texts. Let's see how the other systems perform on MAC-Dev:

Baseline Approaches

Gale-Church

# Run Gale-Church on MAC-Dev
!python bin/gale_align.py \
  -m data/mac/dev/meta_data.tsv \
  -s data/mac/dev/zh \
  -t data/mac/dev/en \
  -o data/mac/dev/auto
  
# Evaluate Gale-Church on MAC-Dev
!python bin/eval.py \
  -t data/mac/dev/auto \
  -g data/mac/dev/gold
 ---------------------------------
|             |  Strict |    Lax  |
| Precision   |   0.472 |   0.727 |
| Recall      |   0.497 |   0.746 |
| F1          |   0.485 |   0.737 |
 ---------------------------------

Hunalign

# Run Hunalign on MAC-Dev
!python ext-lib/hunalign/hunalign.py \
  -m data/mac/dev/meta_data.tsv \
  -s data/mac/dev/zh \
  -t data/mac/dev/en \
  -o data/mac/dev/auto \
  -d ec.dic
  
# Evaluate Hunalign on MAC-Dev
!python bin/eval.py \
  -t data/mac/dev/auto \
  -g data/mac/dev/gold
 ---------------------------------
|             |  Strict |    Lax  |
| Precision   |   0.573 |   0.811 |
| Recall      |   0.653 |   0.897 |
| F1          |   0.610 |   0.852 |
 ---------------------------------

The parameter -d designates the bilingual dictionary that Hunalign uses to calculate bitext similarity. For a fair comparison with the other systems, we use a large Chinese-English dictionary with 224,000 entries, which helps Hunalign achieve higher accuracy.

Bleualign

# Run Bleualign on MAC-Dev
!python ext-lib/bleualign/bleualign.py \
  -m data/mac/dev/meta_data.tsv \
  -s data/mac/dev/zh \
  -t data/mac/dev/en \
  -o data/mac/dev/auto
  
# Evaluate Bleualign on MAC-Dev
!python bin/eval.py \
  -t data/mac/dev/auto \
  -g data/mac/dev/gold
 ---------------------------------
|             |  Strict |    Lax  |
| Precision   |   0.676 |   0.929 |
| Recall      |   0.609 |   0.823 |
| F1          |   0.641 |   0.873 |
 ---------------------------------

Vecalign

# Run Vecalign on MAC-Dev
!python ext-lib/vecalign/vecalign.py \
  -s data/mac/dev/zh \
  -t data/mac/dev/en \
  -o data/mac/dev/auto \
  -m data/mac/dev/meta_data.tsv \
  --src_embed data/mac/dev/zh/overlap data/mac/dev/zh/overlap.emb \
  --tgt_embed data/mac/dev/en/overlap data/mac/dev/en/overlap.emb \
  -a 8 -v
  
# Evaluate Vecalign on MAC-Dev
!python bin/eval.py \
  -t data/mac/dev/auto \
  -g data/mac/dev/gold
 ---------------------------------
|             |  Strict |    Lax  |
| Precision   |   0.848 |   0.995 |
| Recall      |   0.878 |   0.999 |
| F1          |   0.863 |   0.997 |
 ---------------------------------

Evaluation on Bible

Bertalign

# Run Bertalign on Bible
!python bin/bert_align.py \
  -s data/bible/en \
  -t data/bible/zh \
  -o data/bible/auto \
  -m data/bible/meta_data.tsv \
  --src_embed data/bible/en/overlap data/bible/en/overlap.emb \
  --tgt_embed data/bible/zh/overlap data/bible/zh/overlap.emb \
  --max_align 5 --margin

# Evaluate Bertalign on Bible
!python bin/eval_bible.py \
  -t data/bible/auto/001.align \
  -g data/bible/gold/001.align \
  --src_verse data/bible/en.verse \
  --tgt_verse data/bible/zh.verse
It takes 1.695 seconds to align all the sentences.
 ---------------------------------
|             |  Strict |    Lax  |
| Precision   |   0.981 |   1.000 |
| Recall      |   0.976 |   1.000 |
| F1          |   0.979 |   1.000 |
 ---------------------------------

It takes Bertalign only 1.695 seconds to align the 5,000 English and 6,301 Chinese sentences. The verse-based evaluation shows that Bertalign achieves a Strict F1 of 0.979 on the Bible corpus.

Now let's see how fast other systems can run on the Bible corpus:

Baseline Approaches

Gale-Church

# Run Gale-Church on Bible
!python bin/gale_align.py \
  -m data/bible/meta_data.tsv \
  -s data/bible/en \
  -t data/bible/zh \
  -o data/bible/auto

# Evaluate Gale-Church on Bible
!python bin/eval_bible.py \
  -t data/bible/auto/001.align \
  -g data/bible/gold/001.align \
  --src_verse data/bible/en.verse \
  --tgt_verse data/bible/zh.verse
It takes 3.742 seconds to align all the sentences.
 ---------------------------------
|             |  Strict |    Lax  |
| Precision   |   0.817 |   0.928 |
| Recall      |   0.824 |   0.934 |
| F1          |   0.821 |   0.931 |
 ---------------------------------

Hunalign

# Run Hunalign on Bible
!python ext-lib/hunalign/hunalign.py \
  -m data/bible/meta_data.tsv \
  -s data/bible/en \
  -t data/bible/zh \
  -o data/bible/auto \
  -d ce.dic
  
# Evaluate Hunalign on Bible
!python bin/eval_bible.py \
  -t data/bible/auto/001.align \
  -g data/bible/gold/001.align \
  --src_verse data/bible/en.verse \
  --tgt_verse data/bible/zh.verse
It takes 2.394 seconds to align all the sentences.
 ---------------------------------
|             |  Strict |    Lax  |
| Precision   |   0.887 |   0.964 |
| Recall      |   0.904 |   0.980 |
| F1          |   0.895 |   0.972 |
 ---------------------------------

Bleualign

# Run Bleualign on Bible
!python ext-lib/bleualign/bleualign.py \
  -m data/bible/meta_data.tsv \
  -s data/bible/en \
  -t data/bible/zh \
  -o data/bible/auto \
  --tok
  
# Evaluate Bleualign on Bible
!python bin/eval_bible.py \
  -t data/bible/auto/001.align \
  -g data/bible/gold/001.align \
  --src_verse data/bible/en.verse \
  --tgt_verse data/bible/zh.verse
It takes 78.898 seconds to align all the sentences.
 ---------------------------------
|             |  Strict |    Lax  |
| Precision   |   0.827 |   0.961 |
| Recall      |   0.753 |   0.880 |
| F1          |   0.788 |   0.919 |
 ---------------------------------

Vecalign

# Run Vecalign on Bible
!python ext-lib/vecalign/vecalign.py \
  -s data/bible/en \
  -t data/bible/zh \
  -o data/bible/auto \
  -m data/bible/meta_data.tsv \
  --src_embed data/bible/en/overlap data/bible/en/overlap.emb \
  --tgt_embed data/bible/zh/overlap data/bible/zh/overlap.emb \
  -a 5 -v
  
# Evaluate Vecalign on Bible
!python bin/eval_bible.py \
  -t data/bible/auto/001.align \
  -g data/bible/gold/001.align \
  --src_verse data/bible/en.verse \
  --tgt_verse data/bible/zh.verse
It takes 4.676 seconds to align all the sentences.
 ---------------------------------
|             |  Strict |    Lax  |
| Precision   |   0.977 |   1.000 |
| Recall      |   0.978 |   1.000 |
| F1          |   0.978 |   1.000 |
 ---------------------------------

Post-processing

Post-processing means manually correcting the erroneous alignments produced by automatic aligners. The human-validated sentence pairs can then be loaded into translation memory software (e.g. OmegaT) or a bilingual concordancer (e.g. ParaConc), enabling translators to search for corresponding translation units and improve translation quality, or helping researchers carry out corpus-based translation studies.

Bertalign supports two output formats for manual correction: LF Aligner and Intertext. For example, running the following scripts saves the converted outputs in the tsv or intertext directory, where they can be opened and edited with LF Aligner or Intertext.

# Convert automatic alignments to TSV for LF Aligner
python bin/convert_format.py \
  -p mac-dev \
  -s data/mac/dev/zh zh \
  -t data/mac/dev/en en \
  -a data/mac/dev/auto \
  -f tsv

# Convert automatic alignments to XML for Intertext
python bin/convert_format.py \
  -p mac-dev \
  -s data/mac/dev/zh zh \
  -t data/mac/dev/en en \
  -a data/mac/dev/auto \
  -f intertext
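
The TSV format itself is simple; here is a hypothetical sketch of the export (convert_format.py is the real converter, and the exact columns it writes may differ):

def write_tsv(pairs, path):
    # One aligned pair per line: source segment, tab, target segment.
    with open(path, "w", encoding="utf-8") as f:
        for src, tgt in pairs:
            f.write(f"{src}\t{tgt}\n")

write_tsv([("今天天气很好。", "The weather is nice today.")], "mac-dev.tsv")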

Prepare your own data

In order to align your own bilingual texts with Bertalign, you can run the following script to split the source and target texts into sentences:

# Splitting Chinese text
python utils/sent_splitter.py \
  -i utils/zh_raw \
  -o utils/zh \
  -l zh

# Splitting English text
python utils/sent_splitter.py \
  -i utils/en_raw \
  -o utils/en \
  -l en

This script uses the multilingual sentence splitter pySBD to split the raw Chinese and English texts into sentences. pySBD implements a rule-based algorithm for sentence boundary detection covering 23 languages. You can specify the language with the -l parameter, using its ISO 639-1 code.
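
A minimal example of pySBD's standard API (illustration only):

import pysbd

# language takes an ISO 639-1 code, matching the -l parameter above.
splitter = pysbd.Segmenter(language="en", clean=False)
print(splitter.segment("Dr. Smith arrived. He was late."))
# e.g. ['Dr. Smith arrived. ', 'He was late.']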

TODO List

Evaluate Bertalign on datasets containing language pairs other than Chinese and English.
