From e9ed0664da3ed708b5a5f9aacf3dca15248ddc2e Mon Sep 17 00:00:00 2001
From: bfsujason
Date: Tue, 18 May 2021 00:14:41 +0800
Subject: [PATCH 1/3] Create README.md

---
 README.md | 2 ++
 1 file changed, 2 insertions(+)
 create mode 100644 README.md

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..9d00041
--- /dev/null
+++ b/README.md
@@ -0,0 +1,2 @@
+# bertalign
+word embedding-based bilingual sentence aligner

From 0df79e31391d7d2f9a4feefd988d8c1882a16296 Mon Sep 17 00:00:00 2001
From: bfsujason
Date: Tue, 18 May 2021 00:37:52 +0800
Subject: [PATCH 2/3] Update README.md

---
 README.md | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 9d00041..e3ded8b 100644
--- a/README.md
+++ b/README.md
@@ -1,2 +1,11 @@
-# bertalign
+# Bertalign
 word embedding-based bilingual sentence aligner
+
+## Evaluation Corpus
+This section describes the procedure of creating the evaluation corpora: the manually aligned corpus (MAC) of Chinese-English literary texts and the Bible corpus aligned at the verse level.
+### MAC
+Firstly, 5 chapters and their translations are sampled from each of the 6 novels included in MAC, obtaining a corpus of 30 bitexts. We then split the corpus into **MAC-Dev** and **MAC-Test** with the former containing 6 chapters and the latter 24 chapters.
+
+The **MAC-Test** is saved in [corpus/mac/test](./corpus/mac/test)
+
+The sampling schemes for building **MAC-Test** can be found at [corpus/mac/test/meta_data.tsv](./corpus/mac/test/meta_data.tsv)

From 62e801b0bf2d7457e25931af025e712d0c39b6eb Mon Sep 17 00:00:00 2001
From: bfsujason
Date: Tue, 18 May 2021 01:00:21 +0800
Subject: [PATCH 3/3] Update README.md

---
 README.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/README.md b/README.md
index e3ded8b..d899ad0 100644
--- a/README.md
+++ b/README.md
@@ -9,3 +9,9 @@ Firstly, 5 chapters and their translations are sampled from each of the 6 novels
 The **MAC-Test** is saved in [corpus/mac/test](./corpus/mac/test)
 
 The sampling schemes for building **MAC-Test** can be found at [corpus/mac/test/meta_data.tsv](./corpus/mac/test/meta_data.tsv)
+
+There are 4 subdirectories in **MAC-Test**. The [split](/corpus/mac/test/split) directory contains the sentence-split source texts, target texts and the machine translations of source texts, which are required by **Bleualign** to perform automatic alignment.
+
+The inputs to **Hunalign** are saved in the [tok](/corpus/mac/test/tok) directory.
+
+The emb directory is made up of the overlapping sentences and their embeddings for Vecalign and BertAlign.