Update README.md

@@ -7,7 +7,7 @@ Bertalign uses [cross-lingual embedding models](https://github.com/UKPLab/sentenc

## Installation

-Bertalign is developed using Python and tested on Windows 10 and Linux systems. You need to install the following packages before running Bertalign:
+Bertalign is developed using Python and tested on Windows 10 and Linux systems. The following packages need to be installed before running Bertalign:

```
pip install numba
@@ -19,13 +19,13 @@ pip install sentence-transformers

Please note that embedding sentences on GPU-enabled machines is much faster than on CPU-only machines. The following experiments are conducted on [Google Colab](https://colab.research.google.com/), which provides free GPU service.

## Evaluation Corpora

-Bertalign is language-agnostic thanks to the cross-language embedding models [sentence-transformers](https://github.com/UKPLab/sentence-transformers).
+Bertalign is language-agnostic thanks to the cross-language embedding model [sentence-transformers](https://github.com/UKPLab/sentence-transformers).

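Alignment with cross-lingual embeddings comes down to vector similarity: mutual translations are mapped to nearby points in a shared embedding space, regardless of language. A minimal sketch with made-up 4-dimensional vectors (real sentence-transformers embeddings have several hundred dimensions):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings: the Chinese and English sentences are mutual
# translations, so their vectors should be close; the third is unrelated.
zh_vec = np.array([0.9, 0.1, 0.3, 0.2])   # "今天天气很好"
en_vec = np.array([0.8, 0.2, 0.3, 0.1])   # "The weather is nice today"
other  = np.array([0.1, 0.9, 0.1, 0.8])   # an unrelated sentence

assert cosine_sim(zh_vec, en_vec) > cosine_sim(zh_vec, other)
```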
For now, we only use the following two Chinese-English corpora to evaluate the performance of Bertalign. Datasets for other language pairs will be added later.

### MAC-Dev

-[MAC-Dev](./data/mac/dev) is the development set selected from the MAC corpus, a manually aligned corpus of Chinese-English literary texts. The sampling schemes for MAC-Dev can be found at [meta_data.tsv](./data/mac/dev/meta_data.tsv). MAC-Dev contains 1,469 Chinese and 1,957 English sentences.
+[MAC-Dev](./data/mac/dev) is the development set selected from the MAC corpus, a manually aligned corpus of Chinese-English literary texts. The sampling scheme for MAC-Dev can be found at [meta_data.tsv](./data/mac/dev/meta_data.tsv). MAC-Dev contains 1,469 Chinese and 1,957 English sentences.

There are 4 subdirectories in MAC-Dev:

@@ -33,7 +33,7 @@ The [zh](./data/mac/dev/zh) and [en](./data/mac/dev/en) directories contain the
We use the [Moses sentence splitter](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/ems/support/split-sentences.perl) and [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/usage.html) to split and tokenize English sentences, while [pyltp](https://github.com/HIT-SCIR/pyltp) and [jieba](https://github.com/fxsjy/jieba) are used to split and tokenize Chinese sentences. The MT versions of the source texts are generated by [Google Translate](https://translate.google.cn/).

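The splitting step can be sketched with a toy rule-based splitter that cuts after terminal punctuation. This is only an illustration, not the Moses/pyltp pipeline used here, which handles abbreviations, quotes, and many other edge cases:

```python
import re

def naive_split(text, terminals):
    """Toy sentence splitter: cut after any terminal punctuation mark.
    Illustration only -- the real pipeline uses the Moses splitter for
    English and pyltp for Chinese."""
    pattern = "([" + re.escape(terminals) + "])"
    parts = re.split(pattern, text)
    # Re-join each chunk of text with the punctuation mark that ends it.
    sents = ["".join(p).strip() for p in zip(parts[0::2], parts[1::2])]
    return [s for s in sents if s]

en = naive_split("It was cold. The wind rose.", ".!?")   # English terminals
zh = naive_split("天冷了。起风了。", "。！？")              # Chinese terminals
```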
-The [auto](./data/mac/dev/auto) and [gold](./data/mac/dev/gold) directories are for automatic and gold alignments. All the gold alignments are created manually using [Intertext](https://wanthalf.saga.cz/intertext).
+The [auto](./data/mac/dev/auto) and [gold](./data/mac/dev/gold) directories are for automatic and gold alignments respectively. All the gold alignments are created manually using [Intertext](https://wanthalf.saga.cz/intertext).

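Automatic alignments are typically scored against the gold alignments with precision, recall, and F1 over alignment beads. A simplified strict-match sketch, where each bead is a hypothetical pair of (source sentence indices, target sentence indices) rather than the actual Intertext format:

```python
def alignment_prf(auto, gold):
    """Strict-match precision/recall/F1 over alignment beads.
    A bead only counts as correct if it matches a gold bead exactly;
    the exact evaluation protocol may differ."""
    auto, gold = set(auto), set(gold)
    tp = len(auto & gold)                          # exactly matching beads
    p = tp / len(auto) if auto else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Hypothetical example: the aligner misses that gold bead ((1,), (1, 2))
# is a 1-2 alignment, so 2 of its 3 beads match the gold exactly.
gold = [((0,), (0,)), ((1,), (1, 2)), ((2,), (3,))]
auto = [((0,), (0,)), ((1,), (1,)), ((2,), (3,))]
p, r, f = alignment_prf(auto, gold)
```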
### Bible

[The Bible corpus](./data/bible), consisting of 5,000 source and 6,301 target sentences, is selected from the public [multilingual Bible corpus](https://github.com/christos-c/bible-corpus/tree/master/bibles). This corpus is mainly used to evaluate the speed of Bertalign.