
Bertalign

A word embedding-based bilingual sentence aligner.

Evaluation Corpus

This section describes how the evaluation corpora were created: the manually aligned corpus (MAC) of Chinese-English literary texts and the Bible corpus aligned at the verse level.

MAC

First, 5 chapters and their translations were sampled from each of the 6 novels included in MAC, yielding a corpus of 30 bitexts. The corpus was then split into MAC-Dev (6 chapters) and MAC-Test (24 chapters).
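
The arithmetic of the sampling and split can be sketched as follows. This is only an illustration: the function name, the random sampling, the fixed seed, and the choice of one dev chapter per novel are all assumptions, not the procedure actually used for MAC. The real sampling schemes are the ones recorded in the metadata file.

```python
import random

def sample_and_split(chapters_by_novel, per_novel=5, dev_per_novel=1, seed=0):
    """Sample chapters from each novel, then split the sample into dev and test.

    Assumption: dev takes `dev_per_novel` of the sampled chapters from every
    novel; the source only states the totals (6 dev, 24 test).
    """
    rng = random.Random(seed)
    dev, test = [], []
    for novel, chapters in sorted(chapters_by_novel.items()):
        picked = rng.sample(chapters, per_novel)  # sampling without replacement
        dev.extend((novel, ch) for ch in picked[:dev_per_novel])
        test.extend((novel, ch) for ch in picked[dev_per_novel:])
    return dev, test

# Hypothetical input: 6 novels with 20 chapters each.
novels = {f"novel_{i}": list(range(1, 21)) for i in range(1, 7)}
dev, test = sample_and_split(novels)
print(len(dev), len(test))  # 6 24
```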

MAC-Test is saved in corpus/mac/test.

The sampling schemes used to build MAC-Test can be found in corpus/mac/test/meta_data.tsv.

MAC-Test has 4 subdirectories. The split directory contains the sentence-split source texts, the target texts, and machine translations of the source texts, which Bleualign requires to perform automatic alignment.

The inputs to Hunalign are saved in the tok directory.

The emb directory contains the overlapping sentences and their embeddings for Vecalign and Bertalign.
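
As a rough illustration of what "overlapping sentences" means here, assuming the term follows Vecalign's usage (concatenations of consecutive sentences, so that one-to-many alignments can be scored with a single embedding), the sketch below builds such overlaps. The function name and the max_size parameter are hypothetical, the embedding step is omitted, and joining with a space is an assumption that may not suit Chinese text.

```python
def make_overlaps(sentences, max_size=3):
    """Return unique concatenations of 1..max_size consecutive sentences."""
    overlaps = []
    seen = set()
    for size in range(1, max_size + 1):
        for start in range(len(sentences) - size + 1):
            # Assumption: sentences are joined with a single space.
            text = " ".join(sentences[start:start + size])
            if text not in seen:  # keep each overlap once, in first-seen order
                seen.add(text)
                overlaps.append(text)
    return overlaps

sents = ["A.", "B.", "C."]
print(make_overlaps(sents, max_size=2))
# ['A.', 'B.', 'C.', 'A. B.', 'B. C.']
```

Each overlap would then be embedded once, so an aligner can look up the vector for any candidate group of consecutive sentences instead of re-encoding it.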

Description
Fork of [bertalign](https://github.com/bfsujason/bertalign) with chunking