This commit is contained in:
nlpfun
2021-11-29 17:19:48 +08:00
2 changed files with 8 additions and 8 deletions


@@ -1,13 +1,13 @@
# Bertalign
Word Embedding-Based Bilingual Sentence Aligner
Bertalign is designed to facilitate the construction of bilingual parallel corpora, which have a wide range of applications in translation-related research such as corpus-based translation studies, contrastive linguistics, computer-assisted translation, translator education and machine translation.
Bertalign uses [cross-lingual embedding models](https://github.com/UKPLab/sentence-transformers) to represent source and target sentences so that semantically similar sentences in different languages can be mapped onto similar vector spaces. According to our experiments, Bertalign achieves more accurate results than traditional length-, dictionary-, or MT-based alignment methods such as [Gale-Church](https://aclanthology.org/J93-1004/), [Hunalign](http://mokk.bme.hu/en/resources/hunalign/) and [Bleualign](https://github.com/rsennrich/Bleualign). It also performs better than [Vecalign](https://github.com/thompsonb/vecalign) on MAC, a manually aligned parallel corpus of Chinese-English literary texts.
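The core idea — mapping source and target sentences into a shared vector space and matching them by similarity — can be sketched as follows. The embeddings here are stand-in numpy vectors, not the output of an actual cross-lingual sentence-transformers model:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in embeddings: in Bertalign these would come from a cross-lingual
# sentence embedding model, so that a Chinese sentence and its English
# translation land close together in the vector space.
src_emb = np.array([0.9, 0.1, 0.0])   # a source sentence
tgt_embs = np.array([
    [0.8, 0.2, 0.1],                  # its likely translation
    [0.0, 0.1, 0.9],                  # an unrelated sentence
])

# Pick the target sentence most similar to the source sentence.
scores = [cosine_sim(src_emb, t) for t in tgt_embs]
best = int(np.argmax(scores))
```

With real embeddings the same nearest-neighbour logic runs over every candidate pairing, which is what makes the approach robust to free translations that defeat length- or dictionary-based scoring.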
## Installation
Bertalign is developed in Python and has been tested on Windows 10 and Linux systems. You need to install the following packages before running Bertalign:
```
pip install numba
@@ -25,18 +25,18 @@ For now, we only use the following two Chinese-English corpora to evaluate the p
### MAC-Dev
[MAC-Dev](./data/mac/dev) is the development set selected from the MAC corpus, a manually aligned corpus of Chinese-English literary texts. The sampling schemes for MAC-Dev can be found at [meta_data.tsv](./data/mac/dev/meta_data.tsv).
There are 4 subdirectories in MAC-Dev:
The [zh](./data/mac/dev/zh) and [en](./data/mac/dev/en) directories contain the sentence-split and tokenized source texts, target texts and the machine translations of source texts. Hunalign requires tokenized source and target sentences for dictionary search of corresponding bilingual lexicons. Bleualign uses MT translations of source texts to compute the BLEU similarity score between source and target sentences.
We use the [Moses sentence splitter](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/ems/support/split-sentences.perl) and [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/usage.html) to split and tokenize English sentences, while [pyltp](https://github.com/HIT-SCIR/pyltp) and [jieba](https://github.com/fxsjy/jieba) are used to split and tokenize Chinese sentences. The machine translations of source texts are generated by [Google Translate](https://translate.google.cn/).
The [auto](./data/mac/dev/auto) and [gold](./data/mac/dev/gold) directories are for automatic and gold alignments. All the gold alignments are created manually using [Intertext](https://wanthalf.saga.cz/intertext).
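Bleualign's use of MT as a pivot can be illustrated with a toy unigram-precision score — a deliberate simplification of BLEU, not Bleualign's actual implementation: the MT output of a source sentence is compared against each candidate target sentence, and token overlap gives the similarity.

```python
from collections import Counter

def unigram_precision(mt_tokens, tgt_tokens):
    """Toy BLEU-like score: fraction of MT tokens also found in the target.
    Real BLEU additionally uses higher-order n-grams and a brevity penalty."""
    mt_counts, tgt_counts = Counter(mt_tokens), Counter(tgt_tokens)
    overlap = sum(min(c, tgt_counts[w]) for w, c in mt_counts.items())
    return overlap / max(len(mt_tokens), 1)

# MT translation of a source sentence vs. two English target candidates
# (the sentences here are made up for illustration).
mt = "he opened the door".split()
candidates = ["he pushed the door open".split(),
              "the weather was fine".split()]
scores = [unigram_precision(mt, c) for c in candidates]
```

The candidate sharing more tokens with the MT output scores higher, so the aligner pairs it with the source sentence; this is why Bleualign needs the MT files stored alongside the source and target texts.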
### Bible
[The Bible corpus](./data/bible), consisting of 5,000 source and 6,301 target sentences, is selected from the public [multilingual Bible corpus](https://github.com/christos-c/bible-corpus/tree/master/bibles). This corpus is mainly used to evaluate the speed of Bertalign.
The directory makeup is similar to MAC-Dev, except that the gold alignments for the Bible corpus are generated automatically from the original verse-aligned Bible corpus.
@@ -110,7 +110,7 @@ The parameter *-n* indicates the maximum number of overlapping sentences allowed
Please refer to [Sennrich &amp; Volk (2010)](https://aclanthology.org/people/r/rico-sennrich/) for the difference between the Strict and Lax evaluation methods. We can see that the F1 score is 0.91 when aligning MAC-Dev using Bertalign.
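Under the Strict method, an automatic alignment link counts as correct only if it matches a gold link exactly; precision, recall and F1 then follow in the usual way. A minimal sketch with toy links (not the evaluation script shipped with Bertalign):

```python
def strict_f1(auto_links, gold_links):
    """Strict evaluation: a link like ((1,), (1, 2)) — source sentence 1
    aligned to target sentences 1 and 2 — is correct only on exact match."""
    auto, gold = set(auto_links), set(gold_links)
    correct = len(auto & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(auto)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = [((0,), (0,)), ((1,), (1, 2)), ((2,), (3,))]
auto = [((0,), (0,)), ((1,), (1,)),    ((2,), (3,))]  # misses the 1-2 merge
f1 = strict_f1(auto, gold)
```

The Lax variant instead gives credit when a proposed link overlaps a gold link on at least one sentence, so the 1-to-1 link above would count as partially correct there.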
Aligning literary texts is not an easy task, since they contain more interpretive and free translations than non-literary works. You can refer to [Xu et al. (2015)](https://aclanthology.org/2015.lilt-12.6/) for more details about sentence alignment of literary texts. Let's see how the other systems perform on MAC-Dev:
#### Baseline Approaches


@@ -25,7 +25,7 @@ import numba as nb
def main():
    # user-defined parameters
    parser = argparse.ArgumentParser('Sentence alignment using Bertalign')
    parser.add_argument('-s', '--src', type=str, required=True, help='preprocessed source file to align')
    parser.add_argument('-t', '--tgt', type=str, required=True, help='preprocessed target file to align')
    parser.add_argument('-o', '--out', type=str, required=True, help='Output directory.')
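The parser above can be exercised by handing `parse_args` an explicit argument list, which is also handy for testing without touching `sys.argv`; the file names below are hypothetical:

```python
import argparse

# Re-create the parser from the snippet above (same flags).
parser = argparse.ArgumentParser('Sentence alignment using Bertalign')
parser.add_argument('-s', '--src', type=str, required=True, help='preprocessed source file to align')
parser.add_argument('-t', '--tgt', type=str, required=True, help='preprocessed target file to align')
parser.add_argument('-o', '--out', type=str, required=True, help='Output directory.')

# Hypothetical inputs: a preprocessed Chinese source file, its English
# counterpart, and a directory for the automatic alignments.
args = parser.parse_args(['-s', 'dev.zh', '-t', 'dev.en', '-o', 'auto'])
```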