Baseline alignment systems

2021-11-28 13:59:28 +08:00
parent e033edad52
commit cc1ca021e8
34 changed files with 453434 additions and 0 deletions
--- a/ext-lib/bleualign/.gitignore
+++ b/ext-lib/bleualign/.gitignore
@@ -0,0 +1,5 @@
+__pycache__/
+*.pyc
+/dist
+/build
+/MANIFEST
--- a/ext-lib/bleualign/LICENSE
+++ b/ext-lib/bleualign/LICENSE
@@ -0,0 +1,339 @@
+		    GNU GENERAL PUBLIC LICENSE
+		       Version 2, June 1991
+
+ Copyright (C) 1989, 1991 Free Software Foundation, Inc.,
+ 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+			    Preamble
+
+  The licenses for most software are designed to take away your
+freedom to share and change it.  By contrast, the GNU General Public
+License is intended to guarantee your freedom to share and change free
+software--to make sure the software is free for all its users.  This
+General Public License applies to most of the Free Software
+Foundation's software and to any other program whose authors commit to
+using it.  (Some other Free Software Foundation software is covered by
+the GNU Lesser General Public License instead.)  You can apply it to
+your programs, too.
+
+  When we speak of free software, we are referring to freedom, not
+price.  Our General Public Licenses are designed to make sure that you
+have the freedom to distribute copies of free software (and charge for
+this service if you wish), that you receive source code or can get it
+if you want it, that you can change the software or use pieces of it
+in new free programs; and that you know you can do these things.
+
+  To protect your rights, we need to make restrictions that forbid
+anyone to deny you these rights or to ask you to surrender the rights.
+These restrictions translate to certain responsibilities for you if you
+distribute copies of the software, or if you modify it.
+
+  For example, if you distribute copies of such a program, whether
+gratis or for a fee, you must give the recipients all the rights that
+you have.  You must make sure that they, too, receive or can get the
+source code.  And you must show them these terms so they know their
+rights.
+
+  We protect your rights with two steps: (1) copyright the software, and
+(2) offer you this license which gives you legal permission to copy,
+distribute and/or modify the software.
+
+  Also, for each author's protection and ours, we want to make certain
+that everyone understands that there is no warranty for this free
+software.  If the software is modified by someone else and passed on, we
+want its recipients to know that what they have is not the original, so
+that any problems introduced by others will not reflect on the original
+authors' reputations.
+
+  Finally, any free program is threatened constantly by software
+patents.  We wish to avoid the danger that redistributors of a free
+program will individually obtain patent licenses, in effect making the
+program proprietary.  To prevent this, we have made it clear that any
+patent must be licensed for everyone's free use or not licensed at all.
+
+  The precise terms and conditions for copying, distribution and
+modification follow.
+
+		    GNU GENERAL PUBLIC LICENSE
+   TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
+
+  0. This License applies to any program or other work which contains
+a notice placed by the copyright holder saying it may be distributed
+under the terms of this General Public License.  The "Program", below,
+refers to any such program or work, and a "work based on the Program"
+means either the Program or any derivative work under copyright law:
+that is to say, a work containing the Program or a portion of it,
+either verbatim or with modifications and/or translated into another
+language.  (Hereinafter, translation is included without limitation in
+the term "modification".)  Each licensee is addressed as "you".
+
+Activities other than copying, distribution and modification are not
+covered by this License; they are outside its scope.  The act of
+running the Program is not restricted, and the output from the Program
+is covered only if its contents constitute a work based on the
+Program (independent of having been made by running the Program).
+Whether that is true depends on what the Program does.
+
+  1. You may copy and distribute verbatim copies of the Program's
+source code as you receive it, in any medium, provided that you
+conspicuously and appropriately publish on each copy an appropriate
+copyright notice and disclaimer of warranty; keep intact all the
+notices that refer to this License and to the absence of any warranty;
+and give any other recipients of the Program a copy of this License
+along with the Program.
+
+You may charge a fee for the physical act of transferring a copy, and
+you may at your option offer warranty protection in exchange for a fee.
+
+  2. You may modify your copy or copies of the Program or any portion
+of it, thus forming a work based on the Program, and copy and
+distribute such modifications or work under the terms of Section 1
+above, provided that you also meet all of these conditions:
+
+    a) You must cause the modified files to carry prominent notices
+    stating that you changed the files and the date of any change.
+
+    b) You must cause any work that you distribute or publish, that in
+    whole or in part contains or is derived from the Program or any
+    part thereof, to be licensed as a whole at no charge to all third
+    parties under the terms of this License.
+
+    c) If the modified program normally reads commands interactively
+    when run, you must cause it, when started running for such
+    interactive use in the most ordinary way, to print or display an
+    announcement including an appropriate copyright notice and a
+    notice that there is no warranty (or else, saying that you provide
+    a warranty) and that users may redistribute the program under
+    these conditions, and telling the user how to view a copy of this
+    License.  (Exception: if the Program itself is interactive but
+    does not normally print such an announcement, your work based on
+    the Program is not required to print an announcement.)
+
+These requirements apply to the modified work as a whole.  If
+identifiable sections of that work are not derived from the Program,
+and can be reasonably considered independent and separate works in
+themselves, then this License, and its terms, do not apply to those
+sections when you distribute them as separate works.  But when you
+distribute the same sections as part of a whole which is a work based
+on the Program, the distribution of the whole must be on the terms of
+this License, whose permissions for other licensees extend to the
+entire whole, and thus to each and every part regardless of who wrote it.
+
+Thus, it is not the intent of this section to claim rights or contest
+your rights to work written entirely by you; rather, the intent is to
+exercise the right to control the distribution of derivative or
+collective works based on the Program.
+
+In addition, mere aggregation of another work not based on the Program
+with the Program (or with a work based on the Program) on a volume of
+a storage or distribution medium does not bring the other work under
+the scope of this License.
+
+  3. You may copy and distribute the Program (or a work based on it,
+under Section 2) in object code or executable form under the terms of
+Sections 1 and 2 above provided that you also do one of the following:
+
+    a) Accompany it with the complete corresponding machine-readable
+    source code, which must be distributed under the terms of Sections
+    1 and 2 above on a medium customarily used for software interchange; or,
+
+    b) Accompany it with a written offer, valid for at least three
+    years, to give any third party, for a charge no more than your
+    cost of physically performing source distribution, a complete
+    machine-readable copy of the corresponding source code, to be
+    distributed under the terms of Sections 1 and 2 above on a medium
+    customarily used for software interchange; or,
+
+    c) Accompany it with the information you received as to the offer
+    to distribute corresponding source code.  (This alternative is
+    allowed only for noncommercial distribution and only if you
+    received the program in object code or executable form with such
+    an offer, in accord with Subsection b above.)
+
+The source code for a work means the preferred form of the work for
+making modifications to it.  For an executable work, complete source
+code means all the source code for all modules it contains, plus any
+associated interface definition files, plus the scripts used to
+control compilation and installation of the executable.  However, as a
+special exception, the source code distributed need not include
+anything that is normally distributed (in either source or binary
+form) with the major components (compiler, kernel, and so on) of the
+operating system on which the executable runs, unless that component
+itself accompanies the executable.
+
+If distribution of executable or object code is made by offering
+access to copy from a designated place, then offering equivalent
+access to copy the source code from the same place counts as
+distribution of the source code, even though third parties are not
+compelled to copy the source along with the object code.
+
+  4. You may not copy, modify, sublicense, or distribute the Program
+except as expressly provided under this License.  Any attempt
+otherwise to copy, modify, sublicense or distribute the Program is
+void, and will automatically terminate your rights under this License.
+However, parties who have received copies, or rights, from you under
+this License will not have their licenses terminated so long as such
+parties remain in full compliance.
+
+  5. You are not required to accept this License, since you have not
+signed it.  However, nothing else grants you permission to modify or
+distribute the Program or its derivative works.  These actions are
+prohibited by law if you do not accept this License.  Therefore, by
+modifying or distributing the Program (or any work based on the
+Program), you indicate your acceptance of this License to do so, and
+all its terms and conditions for copying, distributing or modifying
+the Program or works based on it.
+
+  6. Each time you redistribute the Program (or any work based on the
+Program), the recipient automatically receives a license from the
+original licensor to copy, distribute or modify the Program subject to
+these terms and conditions.  You may not impose any further
+restrictions on the recipients' exercise of the rights granted herein.
+You are not responsible for enforcing compliance by third parties to
+this License.
+
+  7. If, as a consequence of a court judgment or allegation of patent
+infringement or for any other reason (not limited to patent issues),
+conditions are imposed on you (whether by court order, agreement or
+otherwise) that contradict the conditions of this License, they do not
+excuse you from the conditions of this License.  If you cannot
+distribute so as to satisfy simultaneously your obligations under this
+License and any other pertinent obligations, then as a consequence you
+may not distribute the Program at all.  For example, if a patent
+license would not permit royalty-free redistribution of the Program by
+all those who receive copies directly or indirectly through you, then
+the only way you could satisfy both it and this License would be to
+refrain entirely from distribution of the Program.
+
+If any portion of this section is held invalid or unenforceable under
+any particular circumstance, the balance of the section is intended to
+apply and the section as a whole is intended to apply in other
+circumstances.
+
+It is not the purpose of this section to induce you to infringe any
+patents or other property right claims or to contest validity of any
+such claims; this section has the sole purpose of protecting the
+integrity of the free software distribution system, which is
+implemented by public license practices.  Many people have made
+generous contributions to the wide range of software distributed
+through that system in reliance on consistent application of that
+system; it is up to the author/donor to decide if he or she is willing
+to distribute software through any other system and a licensee cannot
+impose that choice.
+
+This section is intended to make thoroughly clear what is believed to
+be a consequence of the rest of this License.
+
+  8. If the distribution and/or use of the Program is restricted in
+certain countries either by patents or by copyrighted interfaces, the
+original copyright holder who places the Program under this License
+may add an explicit geographical distribution limitation excluding
+those countries, so that distribution is permitted only in or among
+countries not thus excluded.  In such case, this License incorporates
+the limitation as if written in the body of this License.
+
+  9. The Free Software Foundation may publish revised and/or new versions
+of the General Public License from time to time.  Such new versions will
+be similar in spirit to the present version, but may differ in detail to
+address new problems or concerns.
+
+Each version is given a distinguishing version number.  If the Program
+specifies a version number of this License which applies to it and "any
+later version", you have the option of following the terms and conditions
+either of that version or of any later version published by the Free
+Software Foundation.  If the Program does not specify a version number of
+this License, you may choose any version ever published by the Free Software
+Foundation.
+
+  10. If you wish to incorporate parts of the Program into other free
+programs whose distribution conditions are different, write to the author
+to ask for permission.  For software which is copyrighted by the Free
+Software Foundation, write to the Free Software Foundation; we sometimes
+make exceptions for this.  Our decision will be guided by the two goals
+of preserving the free status of all derivatives of our free software and
+of promoting the sharing and reuse of software generally.
+
+			    NO WARRANTY
+
+  11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
+FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW.  EXCEPT WHEN
+OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
+PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
+OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
+MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.  THE ENTIRE RISK AS
+TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU.  SHOULD THE
+PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
+REPAIR OR CORRECTION.
+
+  12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
+WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
+REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
+INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
+OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
+TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
+YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
+PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
+POSSIBILITY OF SUCH DAMAGES.
+
+		     END OF TERMS AND CONDITIONS
+
+	    How to Apply These Terms to Your New Programs
+
+  If you develop a new program, and you want it to be of the greatest
+possible use to the public, the best way to achieve this is to make it
+free software which everyone can redistribute and change under these terms.
+
+  To do so, attach the following notices to the program.  It is safest
+to attach them to the start of each source file to most effectively
+convey the exclusion of warranty; and each file should have at least
+the "copyright" line and a pointer to where the full notice is found.
+
+    <one line to give the program's name and a brief idea of what it does.>
+    Copyright (C) <year>  <name of author>
+
+    This program is free software; you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation; either version 2 of the License, or
+    (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU General Public License for more details.
+
+    You should have received a copy of the GNU General Public License along
+    with this program; if not, write to the Free Software Foundation, Inc.,
+    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+
+Also add information on how to contact you by electronic and paper mail.
+
+If the program is interactive, make it output a short notice like this
+when it starts in an interactive mode:
+
+    Gnomovision version 69, Copyright (C) year name of author
+    Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
+    This is free software, and you are welcome to redistribute it
+    under certain conditions; type `show c' for details.
+
+The hypothetical commands `show w' and `show c' should show the appropriate
+parts of the General Public License.  Of course, the commands you use may
+be called something other than `show w' and `show c'; they could even be
+mouse-clicks or menu items--whatever suits your program.
+
+You should also get your employer (if you work as a programmer) or your
+school, if any, to sign a "copyright disclaimer" for the program, if
+necessary.  Here is a sample; alter the names:
+
+  Yoyodyne, Inc., hereby disclaims all copyright interest in the program
+  `Gnomovision' (which makes passes at compilers) written by James Hacker.
+
+  <signature of Ty Coon>, 1 April 1989
+  Ty Coon, President of Vice
+
+This General Public License does not permit incorporating your program into
+proprietary programs.  If your program is a subroutine library, you may
+consider it more useful to permit linking proprietary applications with the
+library.  If this is what you want to do, use the GNU Lesser General
+Public License instead of this License.
--- a/ext-lib/bleualign/README.md
+++ b/ext-lib/bleualign/README.md
@@ -0,0 +1,105 @@
+Bleualign
+=========
+An MT-based sentence alignment tool
+
+Copyright ⓒ 2010
+Rico Sennrich <sennrich@cl.uzh.ch>
+
+A project of the Computational Linguistics Group at the University of Zurich (http://www.cl.uzh.ch).
+
+Project Homepage: http://github.com/rsennrich/bleualign
+
+This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation
+
+GENERAL INFO
+------------
+
+Bleualign is a tool to align parallel texts (i.e. a text and its translation) on a sentence level.
+In addition to the source and target text, Bleualign requires an automatic translation of at least one of the texts.
+The alignment is then performed on the basis of the similarity (modified BLEU score) between the source text sentences (translated into the target language) and the target text sentences.
+See section PUBLICATIONS for more details.
+
+Obtaining an automatic translation is up to the user. The only requirement is that the translation must correspond line-by-line to the source text (no line breaks inserted or removed).
+
+REQUIREMENTS
+------------
+
+The software was developed on Linux using Python 2.6, but should also support newer versions of Python (including 3.X) and other platforms.
+Please report any issues you encounter to sennrich@cl.uzh.ch
+
+
+USAGE INSTRUCTIONS
+------------------
+
+The input and output formats of bleualign are one sentence per line.
+A line which only contains .EOA is considered a hard delimiter (end of article).
+Sentence alignment does not cross these delimiters: reliable delimiters improve speed and performance, while wrong ones will seriously degrade performance.
+
+Given the files sourcetext.txt, targettext.txt and sourcetranslation.txt (the latter being sentence-aligned with sourcetext.txt), a sample call is
+
+    ./bleualign.py -s sourcetext.txt -t targettext.txt --srctotarget sourcetranslation.txt -o outputfile
+
+It is also possible to provide several translations and/or translations in the other translation direction.
+bleualign will run once per translation provided, the final output being the intersection of the individual runs (i.e. sentence pairs produced in each individual run).
+
+    ./bleualign.py -s sourcetext.txt -t targettext.txt --srctotarget sourcetranslation1.txt --srctotarget sourcetranslation2.txt --targettosrc targettranslation1.txt -o outputfile
+
+`./bleualign.py -h` will show more usage options.
+
+To facilitate batch processing multiple files, `batch_align.py` can be used.
+
+    python batch_align.py job_file
+
+The job file lists one alignment job per line, with four tab-separated fields: the translation of the source file into the target language, the source file, the target file, and the output file. Lines starting with `#` are skipped. Example call:
+
+    python batch_align.py bleualign.job
+
+For each job, this will produce the files `<output>-s` and `<output>-t`, where `<output>` is the output file given on that line.
+
+Input files are expected to use UTF-8 encoding.
+
+USAGE AS PYTHON MODULE
+----------------------
+
+Bleualign works as a stand-alone script, but can also be imported as a module in other Python projects.
+For code examples, see the example/ directory. For a list of all options, see the Aligner.default_options variable in bleualign/align.py.
+
+To use Bleualign as a Python module, the package needs to be installed (from a local copy) with:
+
+    python setup.py install
+
+The Bleualign package can also be installed directly from GitHub with:
+
+    pip install git+https://github.com/rsennrich/Bleualign.git
+
+EVALUATION
+----------
+
+Two hand-aligned documents are provided with the repository for development and testing.
+Evaluation is performed if you add the argument `-d` for the development set, and `-e` for the test set.
+
+An example command for aligning the development set (one long document with 468/554 sentences in DE/FR):
+
+    ./bleualign.py --source eval/eval1957.de --target eval/eval1957.fr --srctotarget eval/eval1957.europarlfull.fr -d
+
+An example command for aligning the test set (7 documents, totalling 993/1011 sentences in DE/FR):
+
+    ./bleualign.py --source eval/eval1989.de --target eval/eval1989.fr --srctotarget eval/eval1989.europarlfull.fr -e
+
+
+PUBLICATIONS
+------------
+
+The algorithm is described in
+
+Rico Sennrich, Martin Volk (2010):
+    MT-based Sentence Alignment for OCR-generated Parallel Texts. In: Proceedings of AMTA 2010, Denver, Colorado.
+
+Rico Sennrich, Martin Volk (2011):
+    Iterative, MT-based sentence alignment of parallel texts. In: NODALIDA 2011, Nordic Conference of Computational Linguistics, Riga.
+
+
+CONTACT
+-------
+
+For questions and feedback, please contact sennrich@cl.uzh.ch or use the GitHub repository.
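
Note on the README above: the `.EOA` convention it describes (one sentence per line, with a line containing only `.EOA` acting as a hard end-of-article delimiter) can be illustrated with a short sketch. `split_articles` is a hypothetical helper written for this note and is not part of Bleualign; it only shows how such a document could be divided into units that alignment never crosses.

```python
def split_articles(lines, delimiter=".EOA"):
    """Split one-sentence-per-line input into articles at hard delimiters.

    Hypothetical helper for illustration; Bleualign's own handling may differ.
    """
    articles = []
    current = []
    for line in lines:
        sentence = line.rstrip("\n")
        if sentence.strip() == delimiter:
            # A delimiter line closes the current article.
            articles.append(current)
            current = []
        else:
            current.append(sentence)
    # Keep trailing sentences that are not followed by a delimiter.
    if current:
        articles.append(current)
    return articles


doc = ["Erster Satz.\n", "Zweiter Satz.\n", ".EOA\n", "Dritter Satz.\n"]
print(split_articles(doc))
```

Because each article is aligned on its own, a misplaced delimiter in only one of the two texts shifts every later article, which is presumably why the README warns that wrong delimiters seriously degrade performance.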
--- a/ext-lib/bleualign/_bleualign.py
+++ b/ext-lib/bleualign/_bleualign.py
@@ -0,0 +1,15 @@
+#!/usr/bin/python
+# -*- coding: utf-8 -*-
+# Copyright © 2010 University of Zürich
+# Author: Rico Sennrich <sennrich@cl.uzh.ch>
+# For licensing information, see LICENSE
+
+import sys
+from command_utils import load_arguments
+from bleualign.align import Aligner
+
+if __name__ == '__main__':
+    options = load_arguments(sys.argv)
+
+    a = Aligner(options)
+    a.mainloop()
--- a/ext-lib/bleualign/batch_align.py
+++ b/ext-lib/bleualign/batch_align.py
@@ -0,0 +1,51 @@
+#!/usr/bin/python
+# -*- coding: utf-8 -*-
+# Copyright: University of Zurich
+# Author: Rico Sennrich
+
+# script to allow batch-alignment of multiple files. No multiprocessing.
+# syntax: python batch_align.py job_file
+#
+# the job file lists one alignment job per line, as four tab-separated fields:
+# translation_document, source_document, target_document, out_document
+# (the translation being the translation of the source document into the target language).
+# lines starting with "#" are skipped.
+# each job's output is written to the out_document given on its line
+
+
+import sys
+import os
+from bleualign.align import Aligner
+
+if len(sys.argv) < 2:
+    sys.stderr.write('Usage: python batch_align.py job_file\n')
+    sys.exit(1)
+
+job_fn = sys.argv[1]
+
+options = {}
+options['factored'] = False
+options['filter'] = None
+options['filterthreshold'] = 90
+options['filterlang'] = None
+options['targettosrc'] = []
+options['eval'] = None
+options['galechurch'] = None
+options['verbosity'] = 1
+options['printempty'] = False
+
+jobs = []
+with open(job_fn, 'r', encoding="utf-8") as f:
+    for line in f:
+        if not line.startswith("#"):
+            jobs.append(line.strip())
+
+for rec in jobs:
+    translation_document, source_document, target_document, out_document = rec.split("\t")
+    options['srcfile'] = source_document
+    options['targetfile'] = target_document
+    options['srctotarget'] = [translation_document]
+    options['output'] = out_document
+    a = Aligner(options)
+    a.mainloop()
+    
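
Note on `batch_align.py` above: the loop unpacks each job line into four tab-separated fields, so a blank or malformed line would fail with an unpacking error. A minimal sketch of a stricter reader follows; `parse_jobs` is a hypothetical name, and skipping blank lines is an assumption that goes beyond what the script itself does.

```python
def parse_jobs(lines):
    """Parse job-file lines into (translation, source, target, output) tuples.

    Hypothetical helper for illustration. Comment lines starting with "#"
    are skipped, as in batch_align.py; blank lines are skipped as well,
    which is an added assumption.
    """
    jobs = []
    for number, line in enumerate(lines, start=1):
        record = line.rstrip("\n")
        if not record.strip() or record.startswith("#"):
            continue
        fields = record.split("\t")
        if len(fields) != 4:
            raise ValueError(
                "line {}: expected 4 tab-separated fields, got {}".format(
                    number, len(fields)))
        jobs.append(tuple(fields))
    return jobs


sample = ["# translation\tsource\ttarget\toutput\n",
          "0.trans\t0.zh\t0.en\t0.align\n",
          "\n"]
print(parse_jobs(sample))
```

Reporting the offending line number makes a broken job file easier to repair than a bare unpacking error raised inside the alignment loop.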
--- a/ext-lib/bleualign/bleualign.py
+++ b/ext-lib/bleualign/bleualign.py
@@ -0,0 +1,110 @@
+# 2021/11/27
+# bfsujason@163.com
+
+"""
+Usage:
+
+python ext-lib/bleualign/bleualign.py \
+  -m data/mac/test/meta_data.tsv \
+  -s data/mac/test/zh \
+  -t data/mac/test/en \
+  -o data/mac/test/auto
+"""
+
+import os
+import sys
+import time
+import shutil
+import argparse
+
+def main():
+  parser = argparse.ArgumentParser(description='Sentence alignment using Bleualign')
+  parser.add_argument('-s', '--src', type=str, required=True, help='Source directory.')
+  parser.add_argument('-t', '--tgt', type=str, required=True, help='Target directory.')
+  parser.add_argument('-o', '--out', type=str, required=True, help='Output directory.')
+  parser.add_argument('-m', '--meta', type=str, required=True, help='Metadata file.')
+  parser.add_argument('--tok', action='store_true', help='Use tokenized source trans and target text.')
+  args = parser.parse_args()
+  
+  make_dir(args.out)
+  
+  jobs = create_jobs(args.meta, args.src, args.tgt, args.out, args.tok)
+  job_path = os.path.abspath(os.path.join(args.out, 'bleualign.job'))
+  write_jobs(jobs, job_path)
+  
+  bleualign_bin = os.path.abspath('ext-lib/bleualign/batch_align.py')
+  run_bleualign(bleualign_bin, job_path)
+  
+  convert_format(args.out)
+
+def convert_format(dir):
+  for file in os.listdir(dir):
+    if file.endswith('-s'):
+      file_id = file.split('.')[0]
+      src = os.path.join(dir, file)
+      tgt = os.path.join(dir, file_id + '.align-t')
+      out = os.path.join(dir, file_id + '.align')
+      _convert_format(src, tgt, out)
+      os.unlink(src)
+      os.unlink(tgt)
+
+def _convert_format(src, tgt, path):
+  src_align = read_alignment(src)
+  tgt_align = read_alignment(tgt)
+  with open(path, 'wt', encoding='utf-8') as f:
+    for x, y in zip(src_align, tgt_align):
+      f.write("{}:{}\n".format(x,y))
+
+def read_alignment(file):
+  alignment = []
+  with open(file, 'rt', encoding='utf-8') as f:
+    for line in f:
+      line = line.strip()
+      alignment.append([int(x) for x in line.split(',')])
+      
+  return alignment
+      
+def run_bleualign(bin, job):
+  cmd = "python {} {}".format(bin, job)
+  os.system(cmd)
+  os.unlink(job)
+  
+def write_jobs(jobs, path):
+  jobs = '\n'.join(jobs)
+  with open(path, 'wt', encoding='utf-8') as f:
+    f.write(jobs)
+       
+def create_jobs(meta, src, tgt, out, is_tok):
+  jobs = []
+  fns = get_fns(meta)
+  for file in fns:
+    src_path = os.path.abspath(os.path.join(src, file))
+    trans_path = os.path.abspath(os.path.join(src, file + '.trans'))
+    if is_tok:
+      tgt_path = os.path.abspath(os.path.join(tgt, file + '.tok'))
+    else:
+      tgt_path = os.path.abspath(os.path.join(tgt, file))
+    out_path = os.path.abspath(os.path.join(out, file + '.align'))
+    jobs.append('\t'.join([trans_path, src_path, tgt_path, out_path]))
+    
+  return jobs
+  
+def get_fns(meta):
+  fns = []
+  with open(meta, 'rt', encoding='utf-8') as f:
+    next(f) # skip header
+    for line in f:
+      recs = line.strip().split('\t')
+      fns.append(recs[0])
+
+  return fns
+
+def make_dir(path):
+  if os.path.isdir(path):
+    shutil.rmtree(path)
+  os.makedirs(path, exist_ok=True)
+  
+if __name__ == '__main__':
+  t_0 = time.time()
+  main()
+  print("It took {:.3f} seconds to align all the sentences.".format(time.time() - t_0))
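
Note on `bleualign.py` above: `convert_format` pairs each line of the `-s` file with the same line of the `-t` file and writes them as `source:target`. The sketch below reproduces that pairing on in-memory lines, assuming (as `read_alignment` does) that every line holds comma-separated integer sentence indices. `parse_indices` and `merge_alignments` are hypothetical names used only here.

```python
def parse_indices(line):
    """Turn a line such as "1,2" into a list of ints, as read_alignment does."""
    return [int(x) for x in line.strip().split(",")]


def merge_alignments(src_lines, tgt_lines):
    """Pair source and target index lines into "src:tgt" strings.

    Hypothetical helper mirroring _convert_format without touching files.
    """
    return ["{}:{}".format(parse_indices(s), parse_indices(t))
            for s, t in zip(src_lines, tgt_lines)]


src_lines = ["0\n", "1,2\n"]
tgt_lines = ["0\n", "1\n"]
for row in merge_alignments(src_lines, tgt_lines):
    print(row)
```

Since `zip` stops at the shorter input, a length mismatch between the two files would be silently truncated; checking that both files have the same number of lines before merging may be worthwhile.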
--- a/ext-lib/bleualign/bleualign/__init__.py
+++ b/ext-lib/bleualign/bleualign/__init__.py
--- a/ext-lib/bleualign/bleualign/align.py
+++ b/ext-lib/bleualign/bleualign/align.py
--- a/ext-lib/bleualign/bleualign/gale_church.py
+++ b/ext-lib/bleualign/bleualign/gale_church.py
@@ -0,0 +1,205 @@
+# -*- coding: utf-8 -*-
+
+import math
+
+# Based on Gale & Church 1993, 
+# "A Program for Aligning Sentences in Bilingual Corpora"
+
+infinity = float("inf")
+
+def erfcc(x):
+    """Complementary error function."""
+    z = abs(x)
+    t = 1 / (1 + 0.5 * z)
+    r = t * math.exp(-z * z -
+                     1.26551223 + t *
+                     (1.00002368 + t *
+                      (.37409196 + t *
+                       (.09678418 + t *
+                        (-.18628806 + t *
+                         (.27886807 + t *
+                          (-1.13520398 + t *
+                           (1.48851587 + t *
+                            (-.82215223 + t * .17087277)))))))))
+    if (x >= 0.):
+        return r
+    else:
+        return 2. - r
+
+
+def norm_cdf(x):
+    """Return the area under the normal distribution from M{-∞..x}."""
+    return 1 - 0.5 * erfcc(x / math.sqrt(2))
+
+
+class LanguageIndependent(object):
+    # These are the language-independent probabilities and parameters
+    # given in Gale & Church
+
+    # for the computation, l_1 is always the language with fewer characters
+    PRIORS = {
+        (1, 0): 0.0099,
+        (0, 1): 0.0099,
+        (1, 1): 0.89,
+        (2, 1): 0.089,
+        (1, 2): 0.089,
+        (2, 2): 0.011,
+    }
+
+    AVERAGE_CHARACTERS = 1
+    VARIANCE_CHARACTERS = 6.8
+
+
+def trace(backlinks, source, target):
+    links = set()
+    pos = (len(source) - 1, len(target) - 1)
+
+    #while pos != (-1, -1):
+    while pos[0] != -1 and pos[1] != -1:
+        #print(pos)
+        #print(backlinks)
+        #print(backlinks[pos])
+        s, t = backlinks[pos]
+        for i in range(s):
+            for j in range(t):
+                links.add((pos[0] - i, pos[1] - j))
+        pos = (pos[0] - s, pos[1] - t)
+
+    return links
+
+
+def align_probability(i, j, source_sentences, target_sentences, alignment, params):
+    """Returns the probability of the two sentences C{source_sentences[i]}, C{target_sentences[j]}
+    being aligned with a specific C{alignment}.
+
+    @param i: The offset of the source sentence.
+    @param j: The offset of the target sentence.
+    @param source_sentences: The list of source sentence lengths.
+    @param target_sentences: The list of target sentence lengths.
+    @param alignment: The alignment type, a tuple of two integers.
+    @param params: The sentence alignment parameters.
+
+    @returns: The probability of a specific alignment between the two sentences, given the parameters.
+    """
+    l_s = sum(source_sentences[i - offset] for offset in range(alignment[0]))
+    l_t = sum(target_sentences[j - offset] for offset in range(alignment[1]))
+    try:
+        # Gale & Church's paper uses l_s * params.VARIANCE_CHARACTERS here; this follows the C
+        # reference implementation instead. With only l_s in the denominator, insertions are impossible.
+        m = (l_s + l_t / params.AVERAGE_CHARACTERS) / 2
+        delta = (l_t - l_s * params.AVERAGE_CHARACTERS) / math.sqrt(m * params.VARIANCE_CHARACTERS)
+    except ZeroDivisionError:
+        delta = infinity
+
+    return 2 * (1 - norm_cdf(abs(delta))) * params.PRIORS[alignment]
+
+
+def align_blocks(source_sentences, target_sentences, params = LanguageIndependent):
+    """Creates the sentence alignment of two blocks of texts (usually paragraphs).
+
+    @param source_sentences: The list of source sentence lengths.
+    @param target_sentences: The list of target sentence lengths.
+    @param params: the sentence alignment parameters.
+
+    @return: The sentence alignments, a set of (source index, target index) pairs.
+    """
+    alignment_types = list(params.PRIORS.keys())
+
+    # there are always three rows in the history (with the last of them being filled)
+    # and the rows are always |target_text| + 2, so that we never have to do
+    # boundary checks
+    D = [(len(target_sentences) + 2) * [0] for x in range(2)]
+
+    # for the first sentence, only substitution, insertion or deletion are
+    # allowed, and they are all equally likely ( == 1)
+
+    D.append([0, 1])
+    try:
+      D[-2][1] = 1
+      D[-2][2] = 1
+    except IndexError:
+      pass
+
+    backlinks = {}
+
+    for i in range(len(source_sentences)):
+        for j in range(len(target_sentences)):
+            m = []
+            for a in alignment_types:
+                k = D[-(1 + a[0])][j + 2 - a[1]]
+                if k > 0:
+                    p = k * \
+                      align_probability(i, j, source_sentences, target_sentences, a, params)
+                    m.append((p, a))
+
+            if len(m) > 0:
+                v = max(m)
+                backlinks[(i, j)] = v[1]
+                D[-1].append(v[0])
+            else:
+                backlinks[(i, j)] = (1, 1)
+                D[-1].append(0)
+
+        D.pop(0)
+        D.append([0, 0])
+
+    return trace(backlinks, source_sentences, target_sentences)
+
+
+def align_texts(source_blocks, target_blocks, params = LanguageIndependent):
+    """Creates the sentence alignment of two texts.
+
+    Texts can consist of several blocks. Block boundaries cannot be crossed by sentence 
+    alignment links. 
+
+    Each block consists of a list that contains the lengths (in characters) of the sentences
+    in this block.
+    
+    @param source_blocks: The list of blocks in the source text.
+    @param target_blocks: The list of blocks in the target text.
+    @param params: the sentence alignment parameters.
+
+    @returns: A list containing one set of sentence alignment links per block.
+    """
+    if len(source_blocks) != len(target_blocks):
+        raise ValueError("Source and target texts do not have the same number of blocks.")
+    
+    return [align_blocks(source_block, target_block, params) 
+            for source_block, target_block in zip(source_blocks, target_blocks)]
+
+
+def split_at(it, split_value):
+    """Splits an iterator C{it} at values of C{split_value}. 
+
+    Each instance of C{split_value} is swallowed. The iterator produces
+    subiterators which need to be consumed fully before the next subiterator
+    can be used.
+    """
+    def _chunk_iterator(first):
+        v = first
+        while v != split_value:
+            yield v
+            v = next(it, split_value)  # exhausted input ends the chunk (avoids RuntimeError under PEP 479)
+    
+    for first in it:
+        yield _chunk_iterator(first)
+        
+
+def parse_token_stream(stream, soft_delimiter, hard_delimiter):
+    """Parses a stream of tokens and splits it into sentences (using C{soft_delimiter} tokens) 
+    and blocks (using C{hard_delimiter} tokens) for use with the L{align_texts} function.
+    """
+    return [
+        [sum(len(token) for token in sentence_it) 
+         for sentence_it in split_at(block_it, soft_delimiter)]
+        for block_it in split_at(stream, hard_delimiter)]
+
+
+if __name__ == "__main__":
+    import sys
+    # contextlib.nested no longer exists in Python 3; use a multi-item with statement.
+    
+    with open(sys.argv[1], "r") as s, open(sys.argv[2], "r") as t:
+        source = parse_token_stream((l.strip() for l in s), ".EOS", ".EOP")
+        target = parse_token_stream((l.strip() for l in t), ".EOS", ".EOP")
+        print((align_texts(source, target)))
--- a/ext-lib/bleualign/bleualign/score.py
+++ b/ext-lib/bleualign/bleualign/score.py
@@ -0,0 +1,146 @@
+#!/usr/bin/python
+# -*- coding: utf-8 -*-
+#File originally part of moses package: http://www.statmt.org/moses/ (as bleu.py)
+#Stripped of unused code to reduce number of libraries used
+
+# $Id$
+
+'''Provides:
+
+cook_refs(refs, n=4): Transform a list of reference sentences as strings into a form usable by cook_test().
+cook_test(test, refs, n=4): Transform a test sentence as a string (together with the cooked reference sentences) into a form usable by score_cooked().
+score_cooked(alltest, n=4): Score a list of cooked test sentences.
+
+cook_ref_set(ref, n=4): Like cook_refs(), but for a single reference sentence; also returns the set of its n-grams.
+
+The reason for breaking the BLEU computation into three phases cook_refs(), cook_test(), and score_cooked() is to allow the caller to calculate BLEU scores for multiple test sets as efficiently as possible.
+'''
+
+from __future__ import division, print_function
+import sys, math, re, xml.sax.saxutils
+
+# Added to bypass NIST-style pre-processing of hyp and ref files -- wade
+nonorm = 0
+
+preserve_case = False
+eff_ref_len = "shortest"
+
+normalize1 = [
+    ('<skipped>', ''),         # strip "skipped" tags
+    (r'-\n', ''),              # strip end-of-line hyphenation and join lines
+    (r'\n', ' '),              # join lines
+#    (r'(\d)\s+(?=\d)', r'\1'), # join digits
+]
+normalize1 = [(re.compile(pattern), replace) for (pattern, replace) in normalize1]
+
+normalize2 = [
+    (r'([\{-\~\[-\` -\&\(-\+\:-\@\/])',r' \1 '), # tokenize punctuation. apostrophe is missing
+    (r'([^0-9])([\.,])',r'\1 \2 '),              # tokenize period and comma unless preceded by a digit
+    (r'([\.,])([^0-9])',r' \1 \2'),              # tokenize period and comma unless followed by a digit
+    (r'([0-9])(-)',r'\1 \2 ')                    # tokenize dash when preceded by a digit
+]
+normalize2 = [(re.compile(pattern), replace) for (pattern, replace) in normalize2]
+
+#combine normalize2 into a single regex.
+normalize3 = re.compile(r'([\{-\~\[-\` -\&\(-\+\:-\@\/])|(?:(?<![0-9])([\.,]))|(?:([\.,])(?![0-9]))|(?:(?<=[0-9])(-))')
+
+def normalize(s):
+    '''Normalize and tokenize text. This is lifted from NIST mteval-v11a.pl.'''
+    # Added to bypass NIST-style pre-processing of hyp and ref files -- wade
+    if (nonorm):
+        return s.split()
+    try:
+        s.split()
+    except AttributeError:
+        s = " ".join(s)
+    # language-independent part:
+    for (pattern, replace) in normalize1:
+        s = re.sub(pattern, replace, s)
+    s = xml.sax.saxutils.unescape(s, {'&quot;':'"'})
+    # language-dependent part (assuming Western languages):
+    s = " %s " % s
+    if not preserve_case:
+        s = s.lower()         # this might not be identical to the original
+    return [tok for tok in normalize3.split(s) if tok and tok != ' ']
+
+def count_ngrams(words, n=4):
+    counts = {}
+    for k in range(1,n+1):
+        for i in range(len(words)-k+1):
+            ngram = tuple(words[i:i+k])
+            counts[ngram] = counts.get(ngram, 0)+1
+    return counts
+
+def cook_refs(refs, n=4):
+    '''Takes a list of reference sentences for a single segment
+    and returns an object that encapsulates everything that BLEU
+    needs to know about them.'''
+    
+    refs = [normalize(ref) for ref in refs]
+    maxcounts = {}
+    for ref in refs:
+        counts = count_ngrams(ref, n)
+        for (ngram,count) in list(counts.items()):
+            maxcounts[ngram] = max(maxcounts.get(ngram,0), count)
+    return ([len(ref) for ref in refs], maxcounts)
+
+def cook_ref_set(ref, n=4):
+    '''Takes a single reference sentence for a segment
+    and returns an object that encapsulates everything that BLEU
+    needs to know about it.  Also provides the set of its n-grams, which Bleualign uses.'''
+    ref = normalize(ref)
+    counts = count_ngrams(ref, n)
+    return (len(ref), counts, frozenset(counts))
+
+
+
+
+def cook_test(test, args, n=4):
+    '''Takes a test sentence and returns an object that
+    encapsulates everything that BLEU needs to know about it.'''
+    
+    reflens, refmaxcounts = args
+    test = normalize(test)
+    result = {}
+    result["testlen"] = len(test)
+
+    # Calculate effective reference sentence length.
+    
+    if eff_ref_len == "shortest":
+        result["reflen"] = min(reflens)
+    elif eff_ref_len == "average":
+        result["reflen"] = float(sum(reflens))/len(reflens)
+    elif eff_ref_len == "closest":
+        min_diff = None
+        for reflen in reflens:
+            if min_diff is None or abs(reflen-len(test)) < min_diff:
+                min_diff = abs(reflen-len(test))
+                result['reflen'] = reflen
+
+    result["guess"] = [max(len(test)-k+1,0) for k in range(1,n+1)]
+
+    result['correct'] = [0]*n
+    counts = count_ngrams(test, n)
+    for (ngram, count) in list(counts.items()):
+        result["correct"][len(ngram)-1] += min(refmaxcounts.get(ngram,0), count)
+
+    return result
+
+def score_cooked(allcomps, n=4):
+    totalcomps = {'testlen':0, 'reflen':0, 'guess':[0]*n, 'correct':[0]*n}
+    for comps in allcomps:
+        for key in ['testlen','reflen']:
+            totalcomps[key] += comps[key]
+        for key in ['guess','correct']:
+            for k in range(n):
+                totalcomps[key][k] += comps[key][k]
+    logbleu = 0.0
+    for k in range(n):
+        if totalcomps['correct'][k] == 0:
+            return 0.0
+        #log.write("%d-grams: %f\n" % (k,float(totalcomps['correct'][k])/totalcomps['guess'][k]))
+        logbleu += math.log(totalcomps['correct'][k])-math.log(totalcomps['guess'][k])
+    logbleu /= float(n)
+    #log.write("Effective reference length: %d test length: %d\n" % (totalcomps['reflen'], totalcomps['testlen']))
+    logbleu += min(0,1-float(totalcomps['reflen'])/totalcomps['testlen'])
+    return math.exp(logbleu)
--- a/ext-lib/bleualign/bleualign/utils.py
+++ b/ext-lib/bleualign/bleualign/utils.py
@@ -0,0 +1,191 @@
+#!/usr/bin/python
+# -*- coding: utf-8 -*-
+# Copyright: University of Zurich
+# Author: Rico Sennrich
+# For licensing information, see LICENSE
+
+# Evaluation functions for Bleualign
+
+
+from __future__ import division
+from operator import itemgetter
+
+
+def evaluate(options, testalign, goldalign, log_function):
+    goldalign = [(tuple(src),tuple(target)) for src,target in goldalign]
+    
+    results = {}
+    paircounts = {}
+    for pair in [(len(srclist),len(targetlist)) for srclist,targetlist in goldalign]:
+        paircounts[pair] = paircounts.get(pair,0) + 1
+    pairs_normalized = {}
+    for pair in paircounts:
+        pairs_normalized[pair] = (paircounts[pair],paircounts[pair] / float(len(goldalign)))
+    
+    log_function('\ngold alignment frequencies\n')
+    for aligntype,(abscount,relcount) in sorted(list(pairs_normalized.items()),key=itemgetter(1),reverse=True):
+        log_function(aligntype,end='')
+        log_function(' - ',end='')
+        log_function(abscount,end='')
+        log_function(' ('+str(relcount)+')')
+    
+    log_function('\ntotal recall: ',end='')
+    log_function(str(len(goldalign)) + ' pairs in gold')
+    (tpstrict,fnstrict,tplax,fnlax) = recall((0,0),goldalign,[i[0] for i in testalign],log_function)
+    results['recall'] = (tpstrict,fnstrict,tplax,fnlax)
+
+    for aligntype in set([i[1] for i in testalign]):
+        testalign_bytype = []
+        for i in testalign:
+            if i[1] == aligntype:
+                testalign_bytype.append(i)
+        log_function('precision for alignment type ' + str(aligntype) + ' ( ' + str(len(testalign_bytype)) + ' alignment pairs)')
+        precision(goldalign,testalign_bytype,log_function)
+
+    log_function('\ntotal precision:',end='')
+    log_function(str(len(testalign)) + ' alignment pairs found')
+    (tpstrict,fpstrict,tplax,fplax) = precision(goldalign,testalign,log_function)
+    results['precision'] = (tpstrict,fpstrict,tplax,fplax)
+
+    return results
+
+
+def precision(goldalign, testalign, log_function):
+    tpstrict=0
+    tplax=0
+    fpstrict=0
+    fplax=0
+    for (src,target) in [i[0] for i in testalign]:
+        if (src,target) == ((),()):
+            continue
+        if (src,target) in goldalign:
+            tpstrict +=1
+            tplax += 1
+        else:
+            srcset, targetset = set(src), set(target)
+            for srclist,targetlist in goldalign:
+                #lax condition: hypothesis and gold alignment only need to overlap
+                if srcset.intersection(set(srclist)) and targetset.intersection(set(targetlist)):
+                    fpstrict +=1
+                    tplax += 1
+                    break
+            else:
+                fpstrict +=1
+                fplax +=1
+                log_function('false positive: ',2)
+                log_function((src,target),2)
+    if tpstrict+fpstrict > 0:
+        log_function('precision strict: ',end='')
+        log_function((tpstrict/float(tpstrict+fpstrict)))
+        log_function('precision lax: ',end='')
+        log_function((tplax/float(tplax+fplax)))
+        log_function('')
+    else:
+        log_function('nothing to find')
+
+    return tpstrict,fpstrict,tplax,fplax
+
+
+def recall(aligntype, goldalign, testalign, log_function):
+
+    srclen,targetlen = aligntype
+
+    if srclen == 0 and targetlen == 0:
+        gapdists = [(0,0) for i in goldalign]
+    elif srclen == 0 or targetlen == 0:
+        log_function('nothing to find')
+        return 0,0,0,0
+    else:
+        gapdists = [(len(srclist),len(targetlist)) for srclist,targetlist in goldalign]
+
+    tpstrict=0
+    tplax=0
+    fnstrict=0
+    fnlax=0
+    for i,pair in enumerate(gapdists):
+        if aligntype == pair:
+            (srclist,targetlist) = goldalign[i]
+            if not srclist or not targetlist:
+                continue
+            elif (srclist,targetlist) in testalign:
+                tpstrict +=1
+                tplax +=1
+            else:
+                srcset, targetset = set(srclist), set(targetlist)
+                for src,target in testalign:
+                    #lax condition: hypothesis and gold alignment only need to overlap
+                    if srcset.intersection(set(src)) and targetset.intersection(set(target)):
+                        tplax +=1
+                        fnstrict+=1
+                        break
+                else:
+                    fnstrict+=1
+                    fnlax+=1
+                    log_function('not found: ',2)
+                    log_function(goldalign[i],2)
+
+    if tpstrict+fnstrict>0:
+        log_function('recall strict: ',end='')
+        log_function((tpstrict/float(tpstrict+fnstrict)))
+        log_function('recall lax: ',end='')
+        log_function((tplax/float(tplax+fnlax)))
+        log_function('')
+    else:
+        log_function('nothing to find')
+
+    return tpstrict,fnstrict,tplax,fnlax
+
+
+def finalevaluation(results, log_function):
+    recall_value = [0,0,0,0]
+    precision_value = [0,0,0,0]
+    for i,k in list(results.items()):
+        for m,j in enumerate(recall_value):
+            recall_value[m] = j+ k['recall'][m]
+        for m,j in enumerate(precision_value):
+            precision_value[m] = j+ k['precision'][m]
+
+    try:
+        pstrict = (precision_value[0]/float(precision_value[0]+precision_value[1]))
+    except ZeroDivisionError:
+        pstrict = 0
+    try:
+        plax =(precision_value[2]/float(precision_value[2]+precision_value[3]))
+    except ZeroDivisionError:
+        plax = 0
+    try:
+        rstrict= (recall_value[0]/float(recall_value[0]+recall_value[1]))
+    except ZeroDivisionError:
+        rstrict = 0
+    try:
+        rlax=(recall_value[2]/float(recall_value[2]+recall_value[3]))
+    except ZeroDivisionError:
+        rlax = 0
+    if (pstrict+rstrict) == 0:
+        fstrict = 0
+    else:
+        fstrict=2*(pstrict*rstrict)/(pstrict+rstrict)
+    if (plax+rlax) == 0:
+        flax=0
+    else:
+        flax=2*(plax*rlax)/(plax+rlax)
+
+    log_function('\n=========================\n')
+    log_function('total results:')
+    log_function('recall strict: ',end='')
+    log_function(rstrict)
+    log_function('recall lax: ',end='')
+    log_function(rlax)
+    log_function('')
+
+    log_function('precision strict: ',end='')
+    log_function(pstrict)
+    log_function('precision lax: ',end='')
+    log_function(plax)
+    log_function('')
+    
+    log_function('f1 strict: ',end='')
+    log_function(fstrict)
+    log_function('f1 lax: ',end='')
+    log_function(flax)
+    log_function('')
--- a/ext-lib/bleualign/command_utils.py
+++ b/ext-lib/bleualign/command_utils.py
@@ -0,0 +1,158 @@
+#!/usr/bin/python
+# -*- coding: utf-8 -*-
+# Copyright © 2010 University of Zürich
+# Author: Rico Sennrich <sennrich@cl.uzh.ch>
+# For licensing information, see LICENSE
+
+
+from __future__ import division, print_function
+import sys
+import os
+import getopt
+
+def usage():
+    bold = "\033[1m"
+    reset = "\033[0;0m"
+    italic = "\033[3m"
+
+    print('\n\t All files need to be one sentence per line and have .EOA as a hard delimiter. --source, --target and --output are mandatory arguments, the others are optional.')
+    print('\n\t' + bold +'--help' + reset + ', ' + bold +'-h' + reset)
+    print('\t\tprint usage information\n')
+    print('\t' + bold +'--source' + reset + ', ' + bold +'-s' + reset + ' file')
+    print('\t\tSource language text.')
+    print('\t' + bold +'--target' + reset + ', ' + bold +'-t' + reset + ' file')
+    print('\t\tTarget language text.')
+    print('\t' + bold +'--output' + reset + ', ' + bold +'-o' + reset + ' filename')
+    print('\t\tOutput file: Will create ' + 'filename' + '-s and ' + 'filename' + '-t')
+    print('\n\t' + bold +'--srctotarget' + reset + ' file')
+    print('\t\tTranslation of source language text to target language. Needs to be sentence-aligned with source language text.')
+    print('\t' + bold +'--targettosrc' + reset + ' file')
+    print('\t\tTranslation of target language text to source language. Needs to be sentence-aligned with target language text.')
+    print('\n\t' + bold +'--factored' + reset)
+    print('\t\tSource and target text can be factored (as defined by moses: | as separator of factors, space as word separator). Only first factor will be used for BLEU score.')
+    print('\n\t' + bold +'--filter' + reset + ', ' + bold +'-f' + reset + ' option')
+    print('\t\tFilters output. Possible options:')
+    print('\t\t' + bold +'sentences' + reset + '\tevaluate each sentence and filter on a per-sentence basis')
+    print('\t\t' + bold +'articles' + reset + '\tevaluate each article and filter on a per-article basis')
+    print('\n\t' + bold +'--filterthreshold' + reset + ' int')
+    print('\t\tFilters output to best XX percent. (Default: 90). Only works if --filter is set.')
+    print('\t' + bold +'--bleuthreshold' + reset + ' float')
+    print('\t\tFilters out sentence pairs with sentence-level BLEU score < XX (in range from 0 to 1). (Default: 0). Only works if --filter is set.')
+    print('\t' + bold +'--filterlang' + reset)
+    print('\t\tFilters out sentences/articles for which BLEU score between source and target is higher than that between translation and target (usually means source and target are in same language). Only works if --filter is set.')
+    print('\n\t' + bold +'--bleu_n' + reset + ' int')
+    print('\t\tConsider n-grams up to size n for BLEU. Default 2.')
+    print('\t' + bold +'--bleu_charlevel' + reset)
+    print('\t\tPerform BLEU on character level (recommended for continuous-script languages; also consider increasing bleu_n).')
+    print('\n\t' + bold +'--galechurch' + reset)
+    print('\t\tAlign the bitext using Gale and Church\'s algorithm (without BLEU comparison).')
+    print('\t' + bold +'--printempty' + reset)
+    print('\t\tAlso write unaligned sentences to file. By default, they are discarded.')
+    print('\t' + bold +'--verbosity' + reset + ', ' + bold +'-v' + reset + ' int')
+    print('\t\tVerbosity. Choose amount of debugging output. Default value 1; choose 0 for (mostly) quiet mode, 2 for verbose output')
+    print('\t' + bold +'--processes' + reset + ', ' + bold +'-p' + reset + ' int')
+    print('\t\tNumber of parallel processes. Documents are split across available processes. Default: 4.')
+
+def load_arguments(sysargv):
+    try:
+        opts, args = getopt.getopt(sysargv[1:], "def:ho:s:t:v:p:", ["factored", "filter=", "filterthreshold=", "bleuthreshold=", "filterlang", "printempty", "deveval","eval", "help", "bleu_n=", "bleu_charlevel", "galechurch", "output=", "source=", "target=", "srctotarget=", "targettosrc=", "verbosity=", "processes="])
+    except getopt.GetoptError as err:
+        # print help information and exit:
+        print(str(err)) # will print something like "option -a not recognized"
+        usage()
+        sys.exit(2)
+    options = {}
+    options['srcfile'] = None
+    options['targetfile'] = None
+    options['output'] = None
+    options['srctotarget'] = []
+    options['targettosrc'] = []
+    options['processes'] = 4
+    bold = "\033[1m"
+    reset = "\033[0;0m"
+
+    project_path = os.path.dirname(os.path.abspath(__file__))
+    for o, a in opts:
+        if o in ("-h", "--help"):
+            usage()
+            sys.exit()
+        elif o in ("-e", "--eval"):
+            options['srcfile'] = os.path.join(project_path,'eval','eval1989.de')
+            options['targetfile'] = os.path.join(project_path,'eval','eval1989.fr')
+            from eval import goldeval
+            goldalign = [None] * len(goldeval.gold1990map)
+            for index, data in list(goldeval.gold1990map.items()):
+                goldalign[index] = goldeval.gold[data]
+            options['eval'] = goldalign
+        elif o in ("-d", "--deveval"):
+            options['srcfile'] = os.path.join(project_path,'eval','eval1957.de')
+            options['targetfile'] = os.path.join(project_path,'eval','eval1957.fr')
+            from eval import golddev
+            goldalign = [golddev.goldalign]
+            options['eval'] = goldalign
+        elif o in ("-o", "--output"):
+            options['output'] = a
+        elif o == "--factored":
+            options['factored'] = True
+        elif o in ("-f", "--filter"):
+            if a in ['sentences','articles']:
+              options['filter'] = a
+            else:
+              print('\nERROR: Valid values for option ' + bold + '--filter'+ reset +' are '+ bold +'sentences '+ reset +'and ' + bold +'articles'+ reset +'.')
+              usage()
+              sys.exit(2)
+        elif o == "--filterthreshold":
+            options['filterthreshold'] = float(a)
+        elif o == "--bleuthreshold":
+            options['bleuthreshold'] = float(a)
+        elif o == "--filterlang":
+            options['filterlang'] = True
+        elif o == "--galechurch":
+            options['galechurch'] = True
+        elif o == "--bleu_n":
+            options['bleu_ngrams'] = int(a)
+        elif o == "--bleu_charlevel":
+            options['bleu_charlevel'] = True
+        elif o in ("-s", "--source"):
+            if not 'eval' in options:
+                options['srcfile'] = a
+        elif o in ("-t", "--target"):
+            if not 'eval' in options:
+                options['targetfile'] = a
+        elif o == "--srctotarget":
+            if a == '-':
+                options['no_translation_override'] = True
+            else:
+                options['srctotarget'].append(a)
+        elif o == "--targettosrc":
+            options['targettosrc'].append(a)
+        elif o == "--printempty":
+            options['printempty'] = True
+        elif o in ("-v", "--verbosity"):
+            global loglevel
+            loglevel = int(a)
+            options['loglevel'] = int(a)
+            options['verbosity'] = int(a)
+        elif o in ("-p", "--processes"):
+            options['num_processes'] = int(a)
+        else:
+            assert False, "unhandled option"
+
+    if not options['output']:
+      print('WARNING: Output not specified. Just printing debugging output.')
+    if not options['srcfile']:
+      print('\nERROR: Source file not specified.')
+      usage()
+      sys.exit(2)
+    if not options['targetfile']:
+      print('\nERROR: Target file not specified.')
+      usage()
+      sys.exit(2)
+    if options['targettosrc'] and not options['srctotarget']:
+        print('\nWARNING: Only --targettosrc specified, but expecting at least one --srctotarget. Please swap source and target side.')
+        sys.exit(2)
+    if not options['srctotarget'] and not options['targettosrc']\
+          and 'no_translation_override' not in options:
+        print("ERROR: no translation available: BLEU scores can be computed between the source and target text, but this is not the intended usage of Bleualign and may result in poor performance! If you're *really* sure that this is what you want, use the option '--srctotarget -'")
+        sys.exit(2)
+    return options
--- a/ext-lib/bleualign/setup.py
+++ b/ext-lib/bleualign/setup.py
@@ -0,0 +1,42 @@
+# -*- coding: utf-8 -*-
+import os
+import setuptools
+
+def read_file(filename):
+	return open(os.path.join(os.path.dirname(__file__), filename)).read()
+
+setuptools.setup(
+	name = 'bleualign',
+	version = '0.1.1',
+	description = 'An MT-based sentence alignment tool',
+	long_description = read_file('README.md'),
+	author = 'Rico Sennrich',
+	author_email = 'sennrich@cl.uzh.ch',
+	url = 'https://github.com/rsennrich/Bleualign',
+	download_url = 'https://github.com/rsennrich/Bleualign',
+	keywords = [
+		'Sentence Alignment',
+		'Natural Language Processing',
+		'Statistical Machine Translation',
+		'BLEU',
+		],
+	classifiers = [
+		# which Development Status?
+# 		'Development Status :: 3 - Alpha',
+		'Development Status :: 4 - Beta',
+# 		'Development Status :: 5 - Production/Stable',
+		'License :: OSI Approved :: GNU General Public License v2 (GPLv2)',
+		'Operating System :: OS Independent',
+		'Programming Language :: Python :: 2.6',
+		'Programming Language :: Python :: 2.7',
+		'Programming Language :: Python :: 3',
+		'Programming Language :: Python :: 3.2',
+		'Programming Language :: Python :: 3.3',
+		'Programming Language :: Python :: 3.4',
+		'Topic :: Scientific/Engineering',
+		'Topic :: Scientific/Engineering :: Information Analysis',
+		'Topic :: Text Processing',
+		'Topic :: Text Processing :: Linguistic',
+	],
+	packages = ['bleualign'],
+)