From 4fa9a4b23cdb7de939330d01f6f4e546e6d596d5 Mon Sep 17 00:00:00 2001
From: nlpfun
Date: Tue, 30 Nov 2021 19:21:31 +0800
Subject: [PATCH] Update README.md

---
 README.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index eb5da76..8950c49 100644
--- a/README.md
+++ b/README.md
@@ -7,7 +7,7 @@ Bertalign uses [cross-lingua embedding models](https://github.com/UKPLab/sentenc
 ## Installation
-Bertalign is developed using Python and tested on Windows 10 and Linux systems. You need to install the following packages before running Bertalign:
+Bertalign is developed using Python and tested on Windows 10 and Linux systems. The following packages need to be installed before running Bertalign:
 ```
 pip install numba
@@ -19,13 +19,13 @@ pip install sentence-transformers
 Please note that embedding sentences on GPU-enabled machines is much faster than those with CPU only. The following experiments are conducted using [Google Colab](https://colab.research.google.com/) which provides free GPU service.
 ## Evaluation Corpora
-Bertalign is language-agnostic thanks to the cross-language embedding models [sentence-transformers](https://github.com/UKPLab/sentence-transformers).
+Bertalign is language-agnostic thanks to the cross-language embedding model [sentence-transformers](https://github.com/UKPLab/sentence-transformers).
 For now, we only use the following two Chinese-English corpora to evaluate the performance of Bertalign. Dataset with other language pairs are to be added later.
 ### MAC-Dev
-[MAC-Dev](./data/mac/dev) is the development set selected from the MAC corpus, a manually aligned corpus of Chinese-English literary texts. The sampling schemes for MAC-Dev can be found at [meta_data.tsv](./data/mac/dev/meta_data.tsv). MAC-Dev contains 1,469 Chinese and 1,957 English sentences.
+[MAC-Dev](./data/mac/dev) is the development set selected from the MAC corpus, a manually aligned corpus of Chinese-English literary texts. The sampling scheme for MAC-Dev can be found at [meta_data.tsv](./data/mac/dev/meta_data.tsv). MAC-Dev contains 1,469 Chinese and 1,957 English sentences.
 There are 4 subdirectories in MAC-Dev:
@@ -33,7 +33,7 @@ The [zh](./data/mac/dev/zh) and [en](./data/mac/dev/en) directories contain the
 We use [Moses sentence splitter](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/ems/support/split-sentences.perl) and [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/usage.html) to split and tokenize English sentences, while [pyltp](https://github.com/HIT-SCIR/pyltp) and [jieba](https://github.com/fxsjy/jieba) are used to split and tokenize Chinese sentences. The MT of source texts are generated by [Google Translate](https://translate.google.cn/).
-The [auto](./data/mac/dev/auto) and [gold](./data/mac/dev/gold) directories are for automatic and gold alignments. All the gold alignments are created manually using [Intertext](https://wanthalf.saga.cz/intertext).
+The [auto](./data/mac/dev/auto) and [gold](./data/mac/dev/gold) directories are for automatic and gold alignments respectively. All the gold alignments are created manually using [Intertext](https://wanthalf.saga.cz/intertext).
 ### Bible
 [The Bible corpus](./data/bible), consisting of 5,000 source and 6,301 target sentences, is selected from the public [multilingual Bible corpus](https://github.com/christos-c/bible-corpus/tree/master/bibles). This corpus is mainly used to evaluate the speed of Bertalign.