Update README.md

2021-11-30 19:21:31 +08:00
parent 94fb9812c2
commit 4fa9a4b23c
1 changed file with 4 additions and 4 deletions
--- a/README.md
+++ b/README.md
@@ -7,7 +7,7 @@ Bertalign uses [cross-lingua embedding models](https://github.com/UKPLab/sentenc

 ## Installation

-Bertalign is developed using Python and tested on Windows 10 and Linux systems. You need to install the following packages before running Bertalign:
+Bertalign is developed using Python and tested on Windows 10 and Linux systems. The following packages need to be installed before running Bertalign:

 ```
 pip install numba
@@ -19,13 +19,13 @@ pip install sentence-transformers
 Please note that embedding sentences is much faster on GPU-enabled machines than on CPU-only ones. The following experiments are conducted on [Google Colab](https://colab.research.google.com/), which provides free GPU service.

 ## Evaluation Corpora
-Bertalign is language-agnostic thanks to the cross-language embedding models  [sentence-transformers](https://github.com/UKPLab/sentence-transformers).
+Bertalign is language-agnostic thanks to the cross-language embedding model  [sentence-transformers](https://github.com/UKPLab/sentence-transformers).

 For now, we only use the following two Chinese-English corpora to evaluate the performance of Bertalign. Datasets with other language pairs will be added later.

 ### MAC-Dev

-[MAC-Dev](./data/mac/dev) is the development set selected from the MAC corpus, a manually aligned corpus of Chinese-English literary texts. The sampling schemes for MAC-Dev can be found at [meta_data.tsv](./data/mac/dev/meta_data.tsv). MAC-Dev contains 1,469 Chinese and 1,957 English sentences.
+[MAC-Dev](./data/mac/dev) is the development set selected from the MAC corpus, a manually aligned corpus of Chinese-English literary texts. The sampling scheme for MAC-Dev can be found at [meta_data.tsv](./data/mac/dev/meta_data.tsv). MAC-Dev contains 1,469 Chinese and 1,957 English sentences.

 There are 4 subdirectories in MAC-Dev:

@@ -33,7 +33,7 @@ The [zh](./data/mac/dev/zh) and [en](./data/mac/dev/en) directories contain the

 We use [Moses sentence splitter](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/ems/support/split-sentences.perl) and [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/usage.html) to split and tokenize English sentences, while [pyltp](https://github.com/HIT-SCIR/pyltp) and [jieba](https://github.com/fxsjy/jieba) are used to split and tokenize Chinese sentences. The MT of the source texts is generated by [Google Translate](https://translate.google.cn/).

-The [auto](./data/mac/dev/auto) and [gold](./data/mac/dev/gold) directories are for automatic and gold alignments. All the gold alignments are created manually using [Intertext](https://wanthalf.saga.cz/intertext).
+The [auto](./data/mac/dev/auto) and [gold](./data/mac/dev/gold) directories are for automatic and gold alignments respectively. All the gold alignments are created manually using [Intertext](https://wanthalf.saga.cz/intertext).

 ### Bible
 [The Bible corpus](./data/bible), consisting of 5,000 source and 6,301 target sentences, is selected from the public [multilingual Bible corpus](https://github.com/christos-c/bible-corpus/tree/master/bibles). This corpus is mainly used to evaluate the speed of Bertalign.