Update README

This commit is contained in:
nlpfun
2021-11-28 14:31:46 +08:00
parent a775fce719
commit e10dfe57ab

View File

@@ -25,18 +25,18 @@ For now, we only use the following two Chinese-English corpora to evaluate the p
### MAC-Dev
[MAC-Dev](](./data/mac/dev)) is the development set selected from the MAC corpus, a manually aligned corpus of Chinese-English literary texts. The sampling schemes for MAC-Dev can be found at [meta_data.tsv](./data/mac/dev/meta_data.tsv)
[MAC-Dev](./data/mac/dev) is the development set selected from the MAC corpus, a manually aligned corpus of Chinese-English literary texts. The sampling schemes for MAC-Dev can be found at [meta_data.tsv](./data/mac/dev/meta_data.tsv)
There are 4 subdirectories in MAC-Dev:
The [zh](./data/mac/dev/zh) and [en](./data/mac/dev/en) directories contain the sentence-split and tokenized source texts, target texts and the machine translations of source texts. Hunalign requires tokenized source and target sentences for dictionary search of similar words. Bleualign uses MT translations of source texts to compute the Bleu similarity score between source and target sentences.
We use [Moses sentence splitter](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/ems/support/split-sentences.perl) and [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/usage.html) to split and tokenize English sentences, while [pyltp](https://github.com/HIT-SCIR/pyltp) and [jieba](https://github.com/fxsjy/jieba) are used to split and tokenize Chinese sentences. The MT of source texts are generated by [Google Translate](translate.google.cn).
We use [Moses sentence splitter](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/ems/support/split-sentences.perl) and [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/usage.html) to split and tokenize English sentences, while [pyltp](https://github.com/HIT-SCIR/pyltp) and [jieba](https://github.com/fxsjy/jieba) are used to split and tokenize Chinese sentences. The MT of source texts are generated by [Google Translate](https://translate.google.cn/).
The [auto](./data/mac/dev/auto) and [gold](./data/mac/dev/gold) directories are for automatic and gold alignments. All the gold alignments are created manually using [Intertext](https://wanthalf.saga.cz/intertext).
### Bible
[The Bible corpus]((./data/bible)) with 5,000 sentences on the source side is selected from the public [multilingual Bible corpus](https://github.com/christos-c/bible-corpus/tree/master/bibles). This corpus is mainly used to evaluate the speed of Bertalign.
[The Bible corpus](./data/bible) with 5,000 sentences on the source side is selected from the public [multilingual Bible corpus](https://github.com/christos-c/bible-corpus/tree/master/bibles). This corpus is mainly used to evaluate the speed of Bertalign.
The directory makeup is similar to MAC-Dev, except that the gold alignments for the Bible corpus are generated automatically from the original verse-aligned Bible corpus.