Post-processing script

nlpfun
2021-11-30 20:39:03 +08:00
parent 6c5f824686
commit 2ef398bce5
26 changed files with 6224 additions and 12 deletions


@@ -368,6 +368,30 @@ It takes 4.676 seconds to align all the sentences.
---------------------------------
```
### Post-processing
Post-processing means manually correcting the misalignments produced by automatic aligners. The human-validated sentence pairs can then be loaded into translation memory software (e.g. [OmegaT](https://omegat.org/)) or a bilingual concordancer (e.g. [ParaConc](https://paraconc.com/)), so that translators can search for corresponding translation units and improve translation quality, and researchers can carry out corpus-based translation studies.
Bertalign supports two output formats for manual correction, one for [LF Aligner](https://sourceforge.net/projects/aligner/) and one for [Intertext](https://wanthalf.saga.cz/intertext). For example, running the following commands saves the converted output in [tsv](./data/mac/dev/auto/tsv) or [intertext](./data/mac/dev/data/intertext), which can then be opened and edited in LF Aligner or Intertext.
```
# Convert automatic alignments to TSV for LF Aligner
python -p mac-dev \
    -s data/mac/dev/zh zh \
    -t data/mac/dev/en en \
    -a data/mac/dev/auto \
    -f tsv

# Convert automatic alignments to XML for Intertext
python -p mac-dev \
    -s data/mac/dev/zh zh \
    -t data/mac/dev/en en \
    -a data/mac/dev/auto \
    -f intertext
```
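Once the alignments have been corrected by hand, the sentence pairs usually need to be read back for further use. The following is a minimal sketch, not part of Bertalign, which assumes that the edited TSV file holds one alignment per line with the source text in the first column and the target text in the second. The helper names `read_tsv_pairs` and `unmatched` are hypothetical, and the actual column layout of your exported file should be checked before relying on it.

```python
import csv
import io


def read_tsv_pairs(handle):
    """Yield (source, target) tuples from a two-column TSV stream.

    Assumption: column 1 is the source sentence, column 2 the target.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        yield row[0].strip(), row[1].strip()


def unmatched(pairs):
    """Return (index, source, target) for pairs with an empty side.

    These rows have no counterpart in the other language and are
    typical candidates for a second look. Indices start at 1.
    """
    return [
        (i, src, tgt)
        for i, (src, tgt) in enumerate(pairs, start=1)
        if not src or not tgt
    ]


if __name__ == "__main__":
    # In-memory sample standing in for an edited TSV file.
    sample = "你好。\tHello.\n只有原文。\t\n"
    pairs = list(read_tsv_pairs(io.StringIO(sample)))
    for index, src, tgt in unmatched(pairs):
        print(f"Row {index} has an empty side: {src!r} / {tgt!r}")
```

To read a real file, open it with `open(path, encoding="utf-8", newline="")` and pass the handle to `read_tsv_pairs`.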
## TODO List
Evaluate Bertalign on datasets containing language pairs other than Chinese and English.