Update README.md

nlpfun
2021-11-30 19:21:31 +08:00
parent 94fb9812c2
commit 4fa9a4b23c

@@ -7,7 +7,7 @@ Bertalign uses [cross-lingua embedding models](https://github.com/UKPLab/sentenc
## Installation
-Bertalign is developed using Python and tested on Windows 10 and Linux systems. You need to install the following packages before running Bertalign:
+Bertalign is developed using Python and tested on Windows 10 and Linux systems. The following packages need to be installed before running Bertalign:
```
pip install numba
@@ -19,13 +19,13 @@ pip install sentence-transformers
Please note that embedding sentences on GPU-enabled machines is much faster than those with CPU only. The following experiments are conducted using [Google Colab](https://colab.research.google.com/) which provides free GPU service.
## Evaluation Corpora
-Bertalign is language-agnostic thanks to the cross-language embedding models [sentence-transformers](https://github.com/UKPLab/sentence-transformers).
+Bertalign is language-agnostic thanks to the cross-language embedding model [sentence-transformers](https://github.com/UKPLab/sentence-transformers).
For now, we only use the following two Chinese-English corpora to evaluate the performance of Bertalign. Dataset with other language pairs are to be added later.
### MAC-Dev
-[MAC-Dev](./data/mac/dev) is the development set selected from the MAC corpus, a manually aligned corpus of Chinese-English literary texts. The sampling schemes for MAC-Dev can be found at [meta_data.tsv](./data/mac/dev/meta_data.tsv). MAC-Dev contains 1,469 Chinese and 1,957 English sentences.
+[MAC-Dev](./data/mac/dev) is the development set selected from the MAC corpus, a manually aligned corpus of Chinese-English literary texts. The sampling scheme for MAC-Dev can be found at [meta_data.tsv](./data/mac/dev/meta_data.tsv). MAC-Dev contains 1,469 Chinese and 1,957 English sentences.
There are 4 subdirectories in MAC-Dev:
@@ -33,7 +33,7 @@ The [zh](./data/mac/dev/zh) and [en](./data/mac/dev/en) directories contain the
We use [Moses sentence splitter](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/ems/support/split-sentences.perl) and [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/usage.html) to split and tokenize English sentences, while [pyltp](https://github.com/HIT-SCIR/pyltp) and [jieba](https://github.com/fxsjy/jieba) are used to split and tokenize Chinese sentences. The MT of source texts are generated by [Google Translate](https://translate.google.cn/).
-The [auto](./data/mac/dev/auto) and [gold](./data/mac/dev/gold) directories are for automatic and gold alignments. All the gold alignments are created manually using [Intertext](https://wanthalf.saga.cz/intertext).
+The [auto](./data/mac/dev/auto) and [gold](./data/mac/dev/gold) directories are for automatic and gold alignments respectively. All the gold alignments are created manually using [Intertext](https://wanthalf.saga.cz/intertext).
### Bible
[The Bible corpus](./data/bible), consisting of 5,000 source and 6,301 target sentences, is selected from the public [multilingual Bible corpus](https://github.com/christos-c/bible-corpus/tree/master/bibles). This corpus is mainly used to evaluate the speed of Bertalign.