Bertalign

Word Embedding-Based Bilingual Sentence Aligner

Evaluation Corpus

This section describes the procedure of creating the evaluation corpora: the manually aligned corpus (MAC) of Chinese-English literary texts and the Bible corpus aligned at the verse level.

MAC

First, we sample 5 chapters and their translations from each of the 6 novels included in MAC, yielding a corpus of 30 bitexts. We then split the corpus into MAC-Dev (6 chapters) and MAC-Test (24 chapters).

MAC-Test is saved in corpus/mac/test

The sampling schemes for building MAC-Test can be found at meta_data.tsv

There are 4 subdirectories in MAC-Test.

The split directory contains the sentence-split source texts, target texts and the machine translations of source texts, which are required by Bleualign to perform automatic alignment.

The inputs to Hunalign are saved in the tok directory.

The emb directory contains the overlapping sentences and their embeddings, which are used by Vecalign and Bertalign.

We use Intertext to perform the manual alignment for MAC and save the gold alignments in the intertext directory.

To facilitate system evaluation, we strip the XML tags and save clean gold alignment files, containing only sentence IDs, in the gold directory.

Bible

The Bible corpus is located in corpus/bible

The directory layout is similar to that of MAC, except that there is no intertext directory for manual alignments.

The gold alignments for the Bible corpus are generated automatically from the original verse-aligned Bible corpus and saved in eval/bible/gold

In order to compare the sentence-based alignments returned by various aligners with the verse-based gold alignments, we put the verse ID for each sentence in the files corpus/bible/en.verse and corpus/bible/zh.verse, which are used to merge consecutive sentences in the output if they belong to the same verse.
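The merging step can be pictured with a minimal sketch. It assumes that each .verse file holds one verse ID per line, in sentence order; the function name and the sample verse IDs are hypothetical and not taken from the repository.

```python
from itertools import groupby


def merge_by_verse(sentence_ids, verse_ids):
    """Group consecutive sentence IDs that share the same verse ID.

    verse_ids[i] is the verse ID of sentence i (assumed layout:
    one verse ID per line, in sentence order).
    """
    merged = []
    for verse, group in groupby(sentence_ids, key=lambda i: verse_ids[i]):
        merged.append((verse, list(group)))
    return merged


# Hypothetical verse IDs for five sentences.
verse_ids = ["GEN.1.1", "GEN.1.2", "GEN.1.2", "GEN.1.3", "GEN.1.3"]
merged = merge_by_verse(range(len(verse_ids)), verse_ids)
print(merged)
```

Sentences 1 and 2 collapse into one unit because they share a verse ID, so a sentence-level alignment can be compared against verse-level gold data.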

System Comparisons

All the experiments reported in the paper were run on Google Colab.

Job File

Before performing the automatic alignment, we create a job file for each aligner so that the alignment can be run in batch. Each row in the job file represents one alignment task and consists of three tab-separated file names: the source text, the target text and the output file.

The job files for MAC-Test and Bible are located in eval/mac/test/job and eval/bible/job
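As an illustration of the job file format, the sketch below parses tab-separated rows into (source, target, output) triples. The file names are hypothetical, and the helper is not part of the repository.

```python
import csv
import io


def read_jobs(handle):
    """Yield (source, target, output) triples from a tab-separated job file."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) != 3:
            raise ValueError(f"expected 3 columns, got {len(row)}: {row}")
        yield tuple(row)


# Hypothetical job file content with two alignment tasks.
job_text = (
    "split/001.zh\tsplit/001.en\tout/001.align\n"
    "split/002.zh\tsplit/002.en\tout/002.align\n"
)
jobs = list(read_jobs(io.StringIO(job_text)))
print(jobs)
```

In practice the handle would come from opening a job file on disk rather than from an in-memory string.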

Sentence Embeddings

Before embedding the source and target sentences, we use the following Python script to create the combinations of consecutive sentences:

# MAC-Test
python utils/overlap.py -i corpus/mac/test/split -o corpus/mac/test/emb/en.overlap -l en -n 8
python utils/overlap.py -i corpus/mac/test/split -o corpus/mac/test/emb/zh.overlap -l zh -n 8

# Bible
python utils/overlap.py -i corpus/bible/split -o corpus/bible/emb/en.overlap -l en -n 5
python utils/overlap.py -i corpus/bible/split -o corpus/bible/emb/zh.overlap -l zh -n 5

Use the parameter -i to specify the input data directory and -o to specify the output file path.

Every file name in the input directory should end with the corresponding language code, e.g. 001.en and 001.zh, and that code must match the value passed to -l.

The parameter -n sets the number of consecutive sentences to be combined. The combinations are analogous to word n-grams, with sentences taking the place of words.
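The n-gram analogy can be made concrete with a small sketch. It is not the code in utils/overlap.py: it assumes that every run of 1 to n consecutive sentences is produced and that the sentences are joined with a space, and the function name is hypothetical.

```python
def make_overlaps(sentences, n):
    """Return every run of 1..n consecutive sentences, joined by a space."""
    overlaps = []
    for size in range(1, n + 1):
        for start in range(len(sentences) - size + 1):
            overlaps.append(" ".join(sentences[start:start + size]))
    return overlaps


sents = ["A.", "B.", "C."]
combos = make_overlaps(sents, 2)
print(combos)  # ['A.', 'B.', 'C.', 'A. B.', 'B. C.']
```

Embedding these combinations in advance lets an aligner score many-to-many candidates, such as two source sentences against one target sentence, by looking up a vector instead of re-encoding text.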

We use Sentence Transformers to convert texts into embeddings.

To install Sentence Transformers, just run:

pip install sentence-transformers

After the installation, we run the following Python script to embed the bitexts to be aligned:

# MAC-Test
python utils/embed.py -i corpus/mac/test/emb/en.overlap -o corpus/mac/test/emb/en.overlap.emb
python utils/embed.py -i corpus/mac/test/emb/zh.overlap -o corpus/mac/test/emb/zh.overlap.emb

# Bible
python utils/embed.py -i corpus/bible/emb/en.overlap -o corpus/bible/emb/en.overlap.emb
python utils/embed.py -i corpus/bible/emb/zh.overlap -o corpus/bible/emb/zh.overlap.emb

The parameter -i indicates the file containing sentence combinations.

We use NumPy's tofile method to save the sentence embeddings to the file designated by -o.
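Because tofile writes raw array bytes with no shape or dtype header, whoever reads the file back has to supply both. The sketch below shows the round trip with random stand-in data; the float32 dtype and the dimension of 8 are assumptions made for illustration, not values taken from utils/embed.py.

```python
import os
import tempfile

import numpy as np

# Stand-in for real sentence embeddings: 4 sentences, 8 dimensions (hypothetical sizes).
dim = 8
embeddings = np.random.rand(4, dim).astype(np.float32)

path = os.path.join(tempfile.mkdtemp(), "en.overlap.emb")
embeddings.tofile(path)  # raw bytes only: no shape or dtype is stored

# Reading back requires the same dtype and the embedding width.
loaded = np.fromfile(path, dtype=np.float32).reshape(-1, dim)
print(loaded.shape)
```

If the dtype or dimension given at load time differs from what was written, the reshape either fails or silently yields wrong vectors, so both values should be recorded alongside the .emb files.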
