diff --git a/README.md b/README.md
index 51c9d27..5a2a1d8 100644
--- a/README.md
+++ b/README.md
@@ -6,25 +6,25 @@ Bertalign is designed to facilitate the construction of multilingual parallel co
 
 ---
 
-##### Approach
+#### Approach
 
 Bertalign uses [sentence-transformers](https://github.com/UKPLab/sentence-transformers) to represent source and target sentences so that semantically similar sentences in different languages are mapped onto similar vector spaces. Then a two-step algorithm based on dynamic programming is performed: 1) Step 1 finds the 1-1 alignments for approximate anchor points; 2) Step 2 limits the search path to the anchor points and extracts all the valid alignments with 1-many, many-1 or many-to-many relations between the source and target sentences.
 
-##### Performance
+#### Performance
 
 According to our experiments, Bertalign achieves more accurate results on [Text+Berg](./text+berg), a publicly available German-French parallel corpus, than the traditional length-, dictionary-, or MT-based alignment methods as reported in [Thompson & Koehn (2019)](https://aclanthology.org/D19-1136/)
 
-##### Languges Supported
+#### Languges Supported
 
 Alignment between 25 languages: Catalan (ca), Chinese (zh), Czech (cs), Danish (da), Dutch (nl), English(en), Finnish (fi), French (fr), German (de), Greek (el), Hungarian (hu), Icelandic (is), Italian (it), Lithuanian (lt), Latvain (lv), Norwegian (no), Polish (pl), Portuguese (pt), Romanian (ro), Russian (ru), Slovak (sk), Slovenian (sl), Spanish (es), Swedish (sv), and Trukish (tr).
 
 ---
 
-##### Installation
+#### Installation
 
 Please see [requirements.txt](./requirements.txt) for installation information. If you are running Bertalign on *GPU-enabled Linux* such as Google Colaboratory, please install *faiss-gpu* for faster processing.
 
-##### Basic example
+#### Basic example
 
 Just import *Bertalign* and initialize it with the source and target text, which will detect the source and target language automatically and split both texts into sentences. Then invoke the method *align_sents()* to align sentences and print out the result with *print_sents()*.
 
@@ -130,7 +130,7 @@ aligner.print_sents()
 
 ---
 
-##### Example with more options
+#### Example with more options
 
 The following example shows how to use Bertalign to align the Text+Berg corpus, and evaluate its performance with gold standard alignments. The evaluation script [eval.py](./bertalign/eval.py) is based on [Vecalign](https://github.com/thompsonb/vecalign).
 
@@ -236,13 +236,13 @@ log_final_scores(scores)
 
 ---
 
-##### Licence
+#### Licence
 
 Bertalign is released under the [GNU General Public License v3.0](./LICENCE)
 
-##### Credits
+#### Credits
 
-###### Main Libraries
+##### Main Libraries
 
 * [sentence-transformers](https://github.com/UKPLab/sentence-transformers)
 
@@ -250,7 +250,7 @@ Bertalign is released under the [GNU General Public License v3.0](./LICENCE)
 
 * [sentence-splitter](https://github.com/mediacloud/sentence-splitter)
 
-###### Other Sentence Aligners
+##### Other Sentence Aligners
 
 * [Hunalign](http://mokk.bme.hu/en/resources/hunalign/)
 
@@ -258,7 +258,7 @@ Bertalign is released under the [GNU General Public License v3.0](./LICENCE)
 
 * [Vecalign](https://github.com/thompsonb/vecalign)
 
-##### Todo List
+#### Todo List
 
 - Try the [CNN model](https://tfhub.dev/google/universal-sentence-encoder-multilingual/3) for sentence embeddings
 * Develop a GUI for Windows users
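
Note on the *Approach* text carried in this diff: Step 1 (finding 1-1 anchor points with dynamic programming) can be illustrated with a small sketch. This is not Bertalign's implementation. The helper names `cosine` and `find_anchors`, the `min_sim` threshold, and the toy vectors are all assumptions made for illustration; the real tool would obtain its vectors from sentence-transformers. Step 2, which searches around the anchors for 1-many, many-1 and many-to-many alignments, is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def find_anchors(src_vecs, tgt_vecs, min_sim=0.5):
    """Hypothetical helper: monotonic 1-1 alignment that maximises total similarity.

    Returns a list of (source_index, target_index) pairs in increasing order.
    Pairs whose similarity is below min_sim (an assumed threshold) are never linked.
    """
    n, m = len(src_vecs), len(tgt_vecs)
    sim = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]

    # best[i][j]: best total score using the first i source and first j target sentences.
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = sim[i - 1][j - 1]
            link = best[i - 1][j - 1] + s if s >= min_sim else float("-inf")
            # Either skip a source sentence, skip a target sentence, or link the pair.
            best[i][j] = max(best[i - 1][j], best[i][j - 1], link)

    # Walk back from the bottom-right corner to recover the linked pairs.
    anchors = []
    i, j = n, m
    while i > 0 and j > 0:
        s = sim[i - 1][j - 1]
        if s >= min_sim and best[i][j] == best[i - 1][j - 1] + s:
            anchors.append((i - 1, j - 1))
            i -= 1
            j -= 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    anchors.reverse()
    return anchors


# Toy vectors standing in for sentence embeddings: the second source sentence has no counterpart.
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
tgt = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
anchors = find_anchors(src, tgt)
print(anchors)
```

In a two-step scheme such as the one the README describes, anchors like these would then bound the region that the second pass has to search, which keeps the more expensive many-to-many search away from implausible parts of the alignment matrix.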