bertalign/README.md
2021-05-18 02:47:43 +08:00

Bertalign

Word Embedding-Based Bilingual Sentence Aligner

Evaluation Corpus

This section describes how the evaluation corpora were created: the manually aligned corpus (MAC) of Chinese-English literary texts and the Bible corpus aligned at the verse level.

MAC-Test

MAC-Test is saved in corpus/mac/test.

The sampling scheme used to build MAC-Test can be found in corpus/mac/test/meta_data.tsv.

There are 4 subdirectories in MAC-Test.

The split directory contains the sentence-split source texts, target texts and the machine translations of source texts, which are required by Bleualign to perform automatic alignment.

The inputs to Hunalign are saved in the tok directory.

The emb directory contains the overlapping sentences and their embeddings for Vecalign and Bertalign.

We use Intertext to perform the manual alignment for MAC and save the gold alignments in the intertext directory.

To facilitate system evaluation, we delete the XML tags and save the clean gold alignment files, which contain only sentence IDs, in eval/mac/test/gold.

Bible

The Bible corpus is located in corpus/bible.

The directory layout is similar to that of MAC, except that there is no intertext directory for manual alignments.

The gold alignments for the Bible corpus are generated automatically from the original verse-aligned Bible corpus and saved in eval/bible/gold.

In order to compare the sentence-based alignments returned by various aligners with the verse-based gold alignments, we put the verse ID for each sentence in the files corpus/bible/en.verse and corpus/bible/zh.verse, which are used to merge consecutive sentences in the output if they belong to the same verse.
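The merging step can be sketched as follows. This is only an illustration under stated assumptions, not the logic of utils/eval_bible.pl: the function names, the bead representation (pairs of sentence index lists) and the verse ID strings are all hypothetical, and the verse files are assumed to have been read into plain Python lists with one verse ID per sentence.

```python
from itertools import groupby

def sentences_to_verses(sent_ids, verse_of):
    """Map sentence indices to verse IDs, collapsing consecutive repeats."""
    verses = [verse_of[i] for i in sent_ids]
    return [v for v, _ in groupby(verses)]

def merge_by_verse(beads, src_verse, tgt_verse):
    """Convert sentence-level beads into verse-level beads (hypothetical helper).

    beads: list of (source sentence indices, target sentence indices)
    src_verse / tgt_verse: verse ID of each sentence, indexed by position
    Consecutive beads that continue the same verse on either side are merged.
    """
    merged = []
    for src_ids, tgt_ids in beads:
        src = sentences_to_verses(src_ids, src_verse)
        tgt = sentences_to_verses(tgt_ids, tgt_verse)
        if merged:
            prev_src, prev_tgt = merged[-1]
            shares_src = bool(src and prev_src and prev_src[-1] == src[0])
            shares_tgt = bool(tgt and prev_tgt and prev_tgt[-1] == tgt[0])
            if shares_src or shares_tgt:
                # Extend the previous bead without repeating verses it already holds.
                prev_src.extend(v for v in src if v not in prev_src)
                prev_tgt.extend(v for v in tgt if v not in prev_tgt)
                continue
        merged.append((src, tgt))
    return merged
```

After merging, each bead refers to verse IDs rather than sentence indices, so it can be compared directly with a verse-level gold alignment.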

System Comparisons

All the experiments reported in the paper were conducted on Google Colab.

Job File

Before running the automatic alignment, we create a job file for each aligner so that the texts can be processed in batch. Each row in the job file represents one alignment task and consists of three tab-separated file names: the source text, the target text and the output file.

The job files for MAC-Test and Bible are located in eval/mac/test/job and eval/bible/job.
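A job file in this three-column format could be generated with a few lines of Python. The sketch below is an assumption about how one might do it, not the script used for the paper: the function name, the directory arguments and the .align output suffix are hypothetical.

```python
from pathlib import Path

def write_job_file(job_path, src_dir, tgt_dir, out_dir, src_lang="zh", tgt_lang="en"):
    """Write one alignment task per row: source, target and output paths, tab-separated."""
    rows = []
    for src in sorted(Path(src_dir).glob(f"*.{src_lang}")):
        # Pair each source file with the target file that shares its base name.
        tgt = Path(tgt_dir) / f"{src.stem}.{tgt_lang}"
        out = Path(out_dir) / f"{src.stem}.align"  # hypothetical output suffix
        rows.append(f"{src}\t{tgt}\t{out}")
    Path(job_path).write_text("".join(row + "\n" for row in rows), encoding="utf-8")
    return rows
```

Each aligner reads its own input format, so in practice the source and target directories would differ per aligner (for example split for Bleualign and tok for Hunalign).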

Sentence Embeddings

Before embedding the source and target sentences, we use the following Python script to create the combinations of consecutive sentences:

# MAC-Test
python utils/overlap.py -i corpus/mac/test/split -o corpus/mac/test/emb/en.overlap -l en -n 8
python utils/overlap.py -i corpus/mac/test/split -o corpus/mac/test/emb/zh.overlap -l zh -n 8

# Bible
python utils/overlap.py -i corpus/bible/split -o corpus/bible/emb/en.overlap -l en -n 5
python utils/overlap.py -i corpus/bible/split -o corpus/bible/emb/zh.overlap -l zh -n 5

The parameter -i specifies the input data directory and -o the output file path.

Every file name in the input directory should end with the corresponding language code, e.g. 001.en and 001.zh, and this suffix must match the value passed to -l.

The parameter -n sets the number of overlapping sentences, analogous to word n-grams but applied to sentences.
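To make the n-gram analogy concrete, here is a minimal sketch of building combinations of consecutive sentences. It is not the code in utils/overlap.py, which may differ in details such as de-duplication or padding. The function name is hypothetical, and the sketch assumes that -n is an upper bound (runs of 1 to n sentences) and that sentences are joined with a single space.

```python
def overlaps(sentences, max_n):
    """Return every run of 1..max_n consecutive sentences, joined by spaces."""
    combos = []
    for n in range(1, max_n + 1):
        # Slide a window of n sentences over the text, as with word n-grams.
        for start in range(len(sentences) - n + 1):
            combos.append(" ".join(sentences[start:start + n]))
    return combos

print(overlaps(["A.", "B.", "C."], 2))
# ['A.', 'B.', 'C.', 'A. B.', 'B. C.']
```

Embedding these combinations ahead of time lets an aligner score many-to-many candidates by looking up a precomputed vector rather than re-encoding text during the search.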

We use Sentence Transformers to convert texts into embeddings.

To install Sentence Transformers, just run:

pip install sentence-transformers

After the installation, we run the following Python script to embed the bitexts to be aligned:

# MAC-Test
python utils/embed.py -i corpus/mac/test/emb/en.overlap -o corpus/mac/test/emb/en.overlap.emb
python utils/embed.py -i corpus/mac/test/emb/zh.overlap -o corpus/mac/test/emb/zh.overlap.emb

# Bible
python utils/embed.py -i corpus/bible/emb/en.overlap -o corpus/bible/emb/en.overlap.emb
python utils/embed.py -i corpus/bible/emb/zh.overlap -o corpus/bible/emb/zh.overlap.emb

The parameter -i specifies the file containing the sentence combinations.

We use the tofile method provided by Python's NumPy module to save the sentence embeddings in the file designated by -o.
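Because tofile writes raw binary data with no header, the file stores neither the array shape nor the dtype, so both must be known when the embeddings are read back. The round trip might look like the sketch below. The helper names are hypothetical, and the float32 dtype is an assumption; the embedding dimension depends on the model used.

```python
import numpy as np

def save_embeddings(embeddings, path):
    """Write the embedding matrix as raw float32 values (no header, no shape)."""
    np.asarray(embeddings, dtype=np.float32).tofile(path)

def load_embeddings(path, dim):
    """Read the raw float32 values back and restore the (n_sentences, dim) shape."""
    flat = np.fromfile(path, dtype=np.float32)
    return flat.reshape(-1, dim)
```

Reading with the wrong dtype or dimension does not necessarily raise an error; it can silently produce a matrix of meaningless values, so it is worth checking that the number of rows equals the number of lines in the corresponding .overlap file.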

Evaluation on MAC-Test

Gale-Church

%timeit !python bin/gale_align.py --job eval/mac/test/job/galechurch.job

perl utils/eval_mac.pl --meta corpus/mac/test/meta_data.tsv --gold eval/mac/test/gold --auto eval/mac/test/auto/galechurch \
  --by book

Hunalign

%timeit !bin/hunalign/hunalign -text -batch bin/hunalign/ec.dic eval/mac/test/job/hunalign.job

perl utils/eval_mac.pl --meta corpus/mac/test/meta_data.tsv --gold eval/mac/test/gold --auto eval/mac/test/auto/hunalign \
  --by book

Bleualign

%timeit !python bin/bleualign/batch_align.py eval/mac/test/job/bleualign.job

perl utils/eval_mac.pl --meta corpus/mac/test/meta_data.tsv --gold eval/mac/test/gold --auto eval/mac/test/auto/bleualign \
  --by book

Vecalign

%timeit !python bin/vecalign/vecalign.py --job eval/mac/test/job/vecalign.job \
  --src_embed corpus/mac/test/emb/zh.overlap corpus/mac/test/emb/zh.overlap.emb \
  --tgt_embed corpus/mac/test/emb/en.overlap corpus/mac/test/emb/en.overlap.emb \
  --alignment_max_size 8
  
perl utils/eval_mac.pl --meta corpus/mac/test/meta_data.tsv --gold eval/mac/test/gold --auto eval/mac/test/auto/vecalign \
  --by book

Bertalign (Modified Cosine)

%timeit !python bin/bert_align.py --job eval/mac/test/job/mbert.job \
  --src_embed corpus/mac/test/emb/zh.overlap corpus/mac/test/emb/zh.overlap.emb \
  --tgt_embed corpus/mac/test/emb/en.overlap corpus/mac/test/emb/en.overlap.emb \
  --margin --max_align 8
  
perl utils/eval_mac.pl --meta corpus/mac/test/meta_data.tsv --gold eval/mac/test/gold --auto eval/mac/test/auto/mbert \
  --by book

Evaluation on Bible

Gale-Church

%timeit !python bin/gale_align.py --job eval/bible/job/galechurch.job

perl utils/eval_bible.pl --meta corpus/bible/meta_data.tsv --gold eval/bible/gold --auto eval/bible/auto/galechurch \
  --src_verse corpus/bible/en.verse --tgt_verse corpus/bible/zh.verse

Hunalign

%timeit !bin/hunalign/hunalign -text -batch bin/hunalign/ce.dic eval/bible/job/hunalign.job

perl utils/eval_bible.pl --meta corpus/bible/meta_data.tsv --gold eval/bible/gold --auto eval/bible/auto/hunalign \
  --src_verse corpus/bible/en.verse --tgt_verse corpus/bible/zh.verse

Bleualign (runs out of memory on 25,000 sentences)

%timeit !python bin/bleualign/batch_align.py eval/bible/job/bleualign.job

Vecalign

%timeit !python bin/vecalign/vecalign.py --job eval/bible/job/vecalign.job \
  --src_embed corpus/bible/emb/en.overlap corpus/bible/emb/en.overlap.emb \
  --tgt_embed corpus/bible/emb/zh.overlap corpus/bible/emb/zh.overlap.emb
  
perl utils/eval_bible.pl --meta corpus/bible/meta_data.tsv --gold eval/bible/gold --auto eval/bible/auto/vecalign \
  --src_verse corpus/bible/en.verse --tgt_verse corpus/bible/zh.verse

Bertalign (Modified Cosine)

%timeit !python bin/bert_align.py --job eval/bible/job/mbert.job \
  --src_embed corpus/bible/emb/en.overlap corpus/bible/emb/en.overlap.emb \
  --tgt_embed corpus/bible/emb/zh.overlap corpus/bible/emb/zh.overlap.emb \
  --margin

perl utils/eval_bible.pl --meta corpus/bible/meta_data.tsv --gold eval/bible/gold --auto eval/bible/auto/mbert \
  --src_verse corpus/bible/en.verse --tgt_verse corpus/bible/zh.verse