Baseline alignment systems
5
ext-lib/bleualign/.gitignore
vendored
Normal file
@@ -0,0 +1,5 @@
__pycache__/
*.pyc
/dist
/build
/MANIFEST
339
ext-lib/bleualign/LICENSE
Normal file
@@ -0,0 +1,339 @@
                    GNU GENERAL PUBLIC LICENSE
                       Version 2, June 1991

 Copyright (C) 1989, 1991 Free Software Foundation, Inc.,
 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
 Everyone is permitted to copy and distribute verbatim copies
 of this license document, but changing it is not allowed.

                            Preamble

  The licenses for most software are designed to take away your
freedom to share and change it. By contrast, the GNU General Public
License is intended to guarantee your freedom to share and change free
software--to make sure the software is free for all its users. This
General Public License applies to most of the Free Software
Foundation's software and to any other program whose authors commit to
using it. (Some other Free Software Foundation software is covered by
the GNU Lesser General Public License instead.) You can apply it to
your programs, too.

  When we speak of free software, we are referring to freedom, not
price. Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
this service if you wish), that you receive source code or can get it
if you want it, that you can change the software or use pieces of it
in new free programs; and that you know you can do these things.

  To protect your rights, we need to make restrictions that forbid
anyone to deny you these rights or to ask you to surrender the rights.
These restrictions translate to certain responsibilities for you if you
distribute copies of the software, or if you modify it.

  For example, if you distribute copies of such a program, whether
gratis or for a fee, you must give the recipients all the rights that
you have. You must make sure that they, too, receive or can get the
source code. And you must show them these terms so they know their
rights.

  We protect your rights with two steps: (1) copyright the software, and
(2) offer you this license which gives you legal permission to copy,
distribute and/or modify the software.

  Also, for each author's protection and ours, we want to make certain
that everyone understands that there is no warranty for this free
software. If the software is modified by someone else and passed on, we
want its recipients to know that what they have is not the original, so
that any problems introduced by others will not reflect on the original
authors' reputations.

  Finally, any free program is threatened constantly by software
patents. We wish to avoid the danger that redistributors of a free
program will individually obtain patent licenses, in effect making the
program proprietary. To prevent this, we have made it clear that any
patent must be licensed for everyone's free use or not licensed at all.

  The precise terms and conditions for copying, distribution and
modification follow.

                    GNU GENERAL PUBLIC LICENSE
   TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION

  0. This License applies to any program or other work which contains
a notice placed by the copyright holder saying it may be distributed
under the terms of this General Public License. The "Program", below,
refers to any such program or work, and a "work based on the Program"
means either the Program or any derivative work under copyright law:
that is to say, a work containing the Program or a portion of it,
either verbatim or with modifications and/or translated into another
language. (Hereinafter, translation is included without limitation in
the term "modification".) Each licensee is addressed as "you".

Activities other than copying, distribution and modification are not
covered by this License; they are outside its scope. The act of
running the Program is not restricted, and the output from the Program
is covered only if its contents constitute a work based on the
Program (independent of having been made by running the Program).
Whether that is true depends on what the Program does.

  1. You may copy and distribute verbatim copies of the Program's
source code as you receive it, in any medium, provided that you
conspicuously and appropriately publish on each copy an appropriate
copyright notice and disclaimer of warranty; keep intact all the
notices that refer to this License and to the absence of any warranty;
and give any other recipients of the Program a copy of this License
along with the Program.

You may charge a fee for the physical act of transferring a copy, and
you may at your option offer warranty protection in exchange for a fee.

  2. You may modify your copy or copies of the Program or any portion
of it, thus forming a work based on the Program, and copy and
distribute such modifications or work under the terms of Section 1
above, provided that you also meet all of these conditions:

    a) You must cause the modified files to carry prominent notices
    stating that you changed the files and the date of any change.

    b) You must cause any work that you distribute or publish, that in
    whole or in part contains or is derived from the Program or any
    part thereof, to be licensed as a whole at no charge to all third
    parties under the terms of this License.

    c) If the modified program normally reads commands interactively
    when run, you must cause it, when started running for such
    interactive use in the most ordinary way, to print or display an
    announcement including an appropriate copyright notice and a
    notice that there is no warranty (or else, saying that you provide
    a warranty) and that users may redistribute the program under
    these conditions, and telling the user how to view a copy of this
    License. (Exception: if the Program itself is interactive but
    does not normally print such an announcement, your work based on
    the Program is not required to print an announcement.)

These requirements apply to the modified work as a whole. If
identifiable sections of that work are not derived from the Program,
and can be reasonably considered independent and separate works in
themselves, then this License, and its terms, do not apply to those
sections when you distribute them as separate works. But when you
distribute the same sections as part of a whole which is a work based
on the Program, the distribution of the whole must be on the terms of
this License, whose permissions for other licensees extend to the
entire whole, and thus to each and every part regardless of who wrote it.

Thus, it is not the intent of this section to claim rights or contest
your rights to work written entirely by you; rather, the intent is to
exercise the right to control the distribution of derivative or
collective works based on the Program.

In addition, mere aggregation of another work not based on the Program
with the Program (or with a work based on the Program) on a volume of
a storage or distribution medium does not bring the other work under
the scope of this License.

  3. You may copy and distribute the Program (or a work based on it,
under Section 2) in object code or executable form under the terms of
Sections 1 and 2 above provided that you also do one of the following:

    a) Accompany it with the complete corresponding machine-readable
    source code, which must be distributed under the terms of Sections
    1 and 2 above on a medium customarily used for software interchange; or,

    b) Accompany it with a written offer, valid for at least three
    years, to give any third party, for a charge no more than your
    cost of physically performing source distribution, a complete
    machine-readable copy of the corresponding source code, to be
    distributed under the terms of Sections 1 and 2 above on a medium
    customarily used for software interchange; or,

    c) Accompany it with the information you received as to the offer
    to distribute corresponding source code. (This alternative is
    allowed only for noncommercial distribution and only if you
    received the program in object code or executable form with such
    an offer, in accord with Subsection b above.)

The source code for a work means the preferred form of the work for
making modifications to it. For an executable work, complete source
code means all the source code for all modules it contains, plus any
associated interface definition files, plus the scripts used to
control compilation and installation of the executable. However, as a
special exception, the source code distributed need not include
anything that is normally distributed (in either source or binary
form) with the major components (compiler, kernel, and so on) of the
operating system on which the executable runs, unless that component
itself accompanies the executable.

If distribution of executable or object code is made by offering
access to copy from a designated place, then offering equivalent
access to copy the source code from the same place counts as
distribution of the source code, even though third parties are not
compelled to copy the source along with the object code.

  4. You may not copy, modify, sublicense, or distribute the Program
except as expressly provided under this License. Any attempt
otherwise to copy, modify, sublicense or distribute the Program is
void, and will automatically terminate your rights under this License.
However, parties who have received copies, or rights, from you under
this License will not have their licenses terminated so long as such
parties remain in full compliance.

  5. You are not required to accept this License, since you have not
signed it. However, nothing else grants you permission to modify or
distribute the Program or its derivative works. These actions are
prohibited by law if you do not accept this License. Therefore, by
modifying or distributing the Program (or any work based on the
Program), you indicate your acceptance of this License to do so, and
all its terms and conditions for copying, distributing or modifying
the Program or works based on it.

  6. Each time you redistribute the Program (or any work based on the
Program), the recipient automatically receives a license from the
original licensor to copy, distribute or modify the Program subject to
these terms and conditions. You may not impose any further
restrictions on the recipients' exercise of the rights granted herein.
You are not responsible for enforcing compliance by third parties to
this License.

  7. If, as a consequence of a court judgment or allegation of patent
infringement or for any other reason (not limited to patent issues),
conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License. If you cannot
distribute so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you
may not distribute the Program at all. For example, if a patent
license would not permit royalty-free redistribution of the Program by
all those who receive copies directly or indirectly through you, then
the only way you could satisfy both it and this License would be to
refrain entirely from distribution of the Program.

If any portion of this section is held invalid or unenforceable under
any particular circumstance, the balance of the section is intended to
apply and the section as a whole is intended to apply in other
circumstances.

It is not the purpose of this section to induce you to infringe any
patents or other property right claims or to contest validity of any
such claims; this section has the sole purpose of protecting the
integrity of the free software distribution system, which is
implemented by public license practices. Many people have made
generous contributions to the wide range of software distributed
through that system in reliance on consistent application of that
system; it is up to the author/donor to decide if he or she is willing
to distribute software through any other system and a licensee cannot
impose that choice.

This section is intended to make thoroughly clear what is believed to
be a consequence of the rest of this License.

  8. If the distribution and/or use of the Program is restricted in
certain countries either by patents or by copyrighted interfaces, the
original copyright holder who places the Program under this License
may add an explicit geographical distribution limitation excluding
those countries, so that distribution is permitted only in or among
countries not thus excluded. In such case, this License incorporates
the limitation as if written in the body of this License.

  9. The Free Software Foundation may publish revised and/or new versions
of the General Public License from time to time. Such new versions will
be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.

Each version is given a distinguishing version number. If the Program
specifies a version number of this License which applies to it and "any
later version", you have the option of following the terms and conditions
either of that version or of any later version published by the Free
Software Foundation. If the Program does not specify a version number of
this License, you may choose any version ever published by the Free Software
Foundation.

  10. If you wish to incorporate parts of the Program into other free
programs whose distribution conditions are different, write to the author
to ask for permission. For software which is copyrighted by the Free
Software Foundation, write to the Free Software Foundation; we sometimes
make exceptions for this. Our decision will be guided by the two goals
of preserving the free status of all derivatives of our free software and
of promoting the sharing and reuse of software generally.

                            NO WARRANTY

  11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN
OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS
TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE
PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
REPAIR OR CORRECTION.

  12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
POSSIBILITY OF SUCH DAMAGES.

                     END OF TERMS AND CONDITIONS

            How to Apply These Terms to Your New Programs

  If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.

  To do so, attach the following notices to the program. It is safest
to attach them to the start of each source file to most effectively
convey the exclusion of warranty; and each file should have at least
the "copyright" line and a pointer to where the full notice is found.

    <one line to give the program's name and a brief idea of what it does.>
    Copyright (C) <year> <name of author>

    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation; either version 2 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License along
    with this program; if not, write to the Free Software Foundation, Inc.,
    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

Also add information on how to contact you by electronic and paper mail.

If the program is interactive, make it output a short notice like this
when it starts in an interactive mode:

    Gnomovision version 69, Copyright (C) year name of author
    Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
    This is free software, and you are welcome to redistribute it
    under certain conditions; type `show c' for details.

The hypothetical commands `show w' and `show c' should show the appropriate
parts of the General Public License. Of course, the commands you use may
be called something other than `show w' and `show c'; they could even be
mouse-clicks or menu items--whatever suits your program.

You should also get your employer (if you work as a programmer) or your
school, if any, to sign a "copyright disclaimer" for the program, if
necessary. Here is a sample; alter the names:

  Yoyodyne, Inc., hereby disclaims all copyright interest in the program
  `Gnomovision' (which makes passes at compilers) written by James Hacker.

  <signature of Ty Coon>, 1 April 1989
  Ty Coon, President of Vice

This General Public License does not permit incorporating your program into
proprietary programs. If your program is a subroutine library, you may
consider it more useful to permit linking proprietary applications with the
library. If this is what you want to do, use the GNU Lesser General
Public License instead of this License.
105
ext-lib/bleualign/README.md
Normal file
@@ -0,0 +1,105 @@
Bleualign
=========
An MT-based sentence alignment tool

Copyright ⓒ 2010
Rico Sennrich <sennrich@cl.uzh.ch>

A project of the Computational Linguistics Group at the University of Zurich (http://www.cl.uzh.ch).

Project Homepage: http://github.com/rsennrich/bleualign

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation.

GENERAL INFO
------------

Bleualign is a tool to align parallel texts (i.e. a text and its translation) on a sentence level.
In addition to the source and target text, Bleualign requires an automatic translation of at least one of the texts.
The alignment is then performed on the basis of the similarity (modified BLEU score) between the source text sentences (translated into the target language) and the target text sentences.
See section PUBLICATIONS for more details.

Obtaining an automatic translation is up to the user. The only requirement is that the translation must correspond line-by-line to the source text (no line breaks inserted or removed).
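
The line-by-line requirement can be checked before running an alignment. The following is a minimal sketch, not part of Bleualign; the helper names `count_lines` and `translation_matches_source` are hypothetical and only compare line counts.

```python
from pathlib import Path


def count_lines(path):
    """Count the lines of a UTF-8 text file."""
    with Path(path).open("rt", encoding="utf-8") as f:
        return sum(1 for _ in f)


def translation_matches_source(source_path, translation_path):
    """Return True if both files have the same number of lines."""
    return count_lines(source_path) == count_lines(translation_path)
```

Equal line counts are a necessary condition only: they do not show that line N of the translation actually translates line N of the source.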

REQUIREMENTS
------------

The software was developed on Linux using Python 2.6, but should also support newer versions of Python (including 3.X) and other platforms.
Please report any issues you encounter to sennrich@cl.uzh.ch


USAGE INSTRUCTIONS
------------------

The input and output formats of bleualign are one sentence per line.
A line which only contains .EOA is considered a hard delimiter (end of article).
Sentence alignment does not cross these delimiters: reliable delimiters improve speed and performance, wrong ones will seriously degrade performance.
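
To illustrate what a hard delimiter means for the input, the sketch below groups sentences into articles at `.EOA` lines. It is an illustration of the file format only, not Bleualign's own parsing code; `split_articles` is a hypothetical helper.

```python
def split_articles(lines, delimiter=".EOA"):
    """Group sentences into articles, splitting at lines equal to the delimiter."""
    articles, current = [], []
    for line in lines:
        sentence = line.strip()
        if sentence == delimiter:
            # A delimiter closes the current article; alignment never spans it.
            articles.append(current)
            current = []
        else:
            current.append(sentence)
    if current:
        articles.append(current)
    return articles
```

Because alignment is confined to each article, source and target files need the same number of delimiters in corresponding positions for the articles to pair up.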

Given the files sourcetext.txt, targettext.txt and sourcetranslation.txt (the latter being sentence-aligned with sourcetext.txt), a sample call is

    ./bleualign.py -s sourcetext.txt -t targettext.txt --srctotarget sourcetranslation.txt -o outputfile

It is also possible to provide several translations and/or translations in the other translation direction.
bleualign will run once per translation provided, the final output being the intersection of the individual runs (i.e. sentence pairs produced in each individual run).

    ./bleualign.py -s sourcetext.txt -t targettext.txt --srctotarget sourcetranslation1.txt --srctotarget sourcetranslation2.txt --targettosrc targettranslation1.txt -o outputfile
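
The intersection of several runs can be pictured as a set operation. The sketch below assumes each run is a list of sentence pairs, with every pair written as a tuple of source indices and a tuple of target indices; this representation and the name `intersect_runs` are assumptions for illustration, not Bleualign's internal data structures.

```python
def intersect_runs(runs):
    """Keep only the sentence pairs that every run produced."""
    if not runs:
        return []
    common = set(runs[0])
    for run in runs[1:]:
        common &= set(run)
    return sorted(common)


# Two runs that disagree about how source sentence 1 is aligned.
run_a = [((0,), (0,)), ((1,), (1,)), ((2,), (2, 3))]
run_b = [((0,), (0,)), ((1, 2), (1,)), ((2,), (2, 3))]
agreed = intersect_runs([run_a, run_b])
```

Only pairs on which all runs agree survive, so adding translations trades recall for precision.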

./bleualign.py -h will show more usage options

To facilitate batch processing multiple files, `batch_align.py` can be used.

    python batch_align.py directory source_suffix target_suffix translation_suffix

example: given the directory `raw_files` with the files `0.de`, `0.fr` and `0.trans` and so on, (`0.trans` being the translation of `0.de` into the target language), then this command will align all files:

    python batch_align.py raw_files de fr trans

This will produce the files `0.de.aligned` and `0.fr.aligned`

Input files are expected to use UTF-8 encoding.

USAGE AS PYTHON MODULE
----------------------

Bleualign works as a stand-alone script, but can also be imported as a module in other Python projects.
For code examples, see the example/ directory. If you want to know all options, you can see the Aligner.default_options variable in bleualign/align.py.

To use Bleualign as a Python module, the package needs to be installed (from a local copy) with:

    python setup.py install

The Bleualign package can also be installed directly from GitHub with:

    pip install git+https://github.com/rsennrich/Bleualign.git
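
As a rough sketch of module usage, the snippet below builds an options dictionary with the same keys and values that the vendored `batch_align.py` sets, then hands it to `Aligner`. The helper names `build_options` and `align` are hypothetical; whether these keys cover every option a given Bleualign version expects is an assumption, so `Aligner.default_options` remains the place to check.

```python
def build_options(source, target, translation, output):
    """Assemble an options dict with the keys batch_align.py sets."""
    return {
        'srcfile': source,
        'targetfile': target,
        'srctotarget': [translation],  # one or more source-to-target translations
        'targettosrc': [],
        'output': output,
        'factored': False,
        'filter': None,
        'filterthreshold': 90,
        'filterlang': None,
        'eval': None,
        'galechurch': None,
        'verbosity': 1,
        'printempty': False,
    }


def align(source, target, translation, output):
    """Run one alignment job (requires the bleualign package)."""
    # Imported lazily so this sketch can be loaded without Bleualign installed.
    from bleualign.align import Aligner
    aligner = Aligner(build_options(source, target, translation, output))
    aligner.mainloop()
```

Keeping option assembly separate from the call to `Aligner` makes it easy to inspect or override individual settings before running a job.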

EVALUATION
----------

Two hand-aligned documents are provided with the repository for development and testing.
Evaluation is performed if you add the argument `-d` for the development set, and `-e` for the test set.

An example command for aligning the development set (one long document with 468/554 sentences in DE/FR):

    ./bleualign.py --source eval/eval1957.de --target eval/eval1957.fr --srctotarget eval/eval1957.europarlfull.fr -d

An example command for aligning the test set (7 documents, totalling 993/1011 sentences in DE/FR):

    ./bleualign.py --source eval/eval1989.de --target eval/eval1989.fr --srctotarget eval/eval1989.europarlfull.fr -e


PUBLICATIONS
------------

The algorithm is described in

Rico Sennrich, Martin Volk (2010):
MT-based Sentence Alignment for OCR-generated Parallel Texts. In: Proceedings of AMTA 2010, Denver, Colorado.

Rico Sennrich, Martin Volk (2011):
Iterative, MT-based sentence alignment of parallel texts. In: NODALIDA 2011, Nordic Conference of Computational Linguistics, Riga.


CONTACT
-------

For questions and feedback, please contact sennrich@cl.uzh.ch or use the GitHub repository.
15
ext-lib/bleualign/_bleualign.py
Normal file
@@ -0,0 +1,15 @@
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Copyright © 2010 University of Zürich
# Author: Rico Sennrich <sennrich@cl.uzh.ch>
# For licensing information, see LICENSE

import sys
from command_utils import load_arguments
from bleualign.align import Aligner

if __name__ == '__main__':
    options = load_arguments(sys.argv)

    a = Aligner(options)
    a.mainloop()
51
ext-lib/bleualign/batch_align.py
Normal file
@@ -0,0 +1,51 @@
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Copyright: University of Zurich
# Author: Rico Sennrich

# Script to allow batch-alignment of multiple files. No multiprocessing.
# Syntax: python batch_align.py job_file
#
# The job file lists one alignment job per line as four tab-separated paths:
# translation_document, source_document, target_document, output_document
# (the translation being the source document translated into the target language).
#
# Lines starting with "#" are skipped; output_document is passed on as options['output'].


import sys
import os
from bleualign.align import Aligner

if len(sys.argv) < 2:
    sys.stderr.write('Usage: python batch_align.py job_file\n')
    sys.exit(1)

job_fn = sys.argv[1]

options = {}
options['factored'] = False
options['filter'] = None
options['filterthreshold'] = 90
options['filterlang'] = None
options['targettosrc'] = []
options['eval'] = None
options['galechurch'] = None
options['verbosity'] = 1
options['printempty'] = False

jobs = []
with open(job_fn, 'r', encoding="utf-8") as f:
    for line in f:
        if line.strip() and not line.startswith("#"):  # skip blank lines and comments
            jobs.append(line.strip())

for rec in jobs:
    translation_document, source_document, target_document, out_document = rec.split("\t")
    options['srcfile'] = source_document
    options['targetfile'] = target_document
    options['srctotarget'] = [translation_document]
    options['output'] = out_document
    a = Aligner(options)
    a.mainloop()
110
ext-lib/bleualign/bleualign.py
Normal file
@@ -0,0 +1,110 @@
|
||||
# 2021/11/27
|
||||
# bfsujason@163.com
|
||||
|
||||
"""
|
||||
Usage:
|
||||
|
||||
python ext-lib/bleualign/bleualign.py \
|
||||
-m data/mac/test/meta_data.tsv \
|
||||
-s data/mac/test/zh \
|
||||
-t data/mac/test/en \
|
||||
-o data/mac/test/auto
|
||||
"""
|
||||
|
||||
import os
|
||||
import sys
|
||||
import time
|
||||
import shutil
|
||||
import argparse
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description='Sentence alignment using Bleualign')
|
||||
parser.add_argument('-s', '--src', type=str, required=True, help='Source directory.')
|
||||
parser.add_argument('-t', '--tgt', type=str, required=True, help='Target directory.')
|
||||
parser.add_argument('-o', '--out', type=str, required=True, help='Output directory.')
|
||||
parser.add_argument('-m', '--meta', type=str, required=True, help='Metadata file.')
|
||||
parser.add_argument('--tok', action='store_true', help='Use tokenized source trans and target text.')
|
||||
args = parser.parse_args()
|
||||
|
||||
make_dir(args.out)
|
||||
|
||||
jobs = create_jobs(args.meta, args.src, args.tgt, args.out, args.tok)
|
||||
job_path = os.path.abspath(os.path.join(args.out, 'bleualign.job'))
|
||||
write_jobs(jobs, job_path)
|
||||
|
||||
bleualign_bin = os.path.abspath('ext-lib/bleualign/batch_align.py')
|
||||
run_bleualign(bleualign_bin, job_path)
|
||||
|
||||
convert_format(args.out)
|
||||
|
||||
def convert_format(dir):
|
||||
for file in os.listdir(dir):
|
||||
if file.endswith('-s'):
|
||||
file_id = file.split('.')[0]
|
||||
src = os.path.join(dir, file)
|
||||
tgt = os.path.join(dir, file_id + '.align-t')
|
||||
out = os.path.join(dir, file_id + '.align')
|
||||
_convert_format(src, tgt, out)
|
||||
os.unlink(src)
|
||||
os.unlink(tgt)
|
||||
|
||||
def _convert_format(src, tgt, path):
|
||||
src_align = read_alignment(src)
|
||||
tgt_align = read_alignment(tgt)
|
||||
with open(path, 'wt', encoding='utf-8') as f:
|
||||
for x, y in zip(src_align, tgt_align):
|
||||
f.write("{}:{}\n".format(x,y))
|
||||
|
||||
def read_alignment(file):
|
||||
alignment = []
|
||||
with open(file, 'rt', encoding='utf-8') as f:
|
||||
for line in f:
|
||||
line = line.strip()
|
||||
alignment.append([int(x) for x in line.split(',')])
|
||||
|
||||
return alignment
|
||||
|
||||
def run_bleualign(bin, job):
|
||||
cmd = "python {} {}".format(bin, job)
|
||||
os.system(cmd)
|
||||
os.unlink(job)
|
||||
|
||||
def write_jobs(jobs, path):
|
||||
jobs = '\n'.join(jobs)
|
||||
with open(path, 'wt', encoding='utf-8') as f:
|
||||
f.write(jobs)
|
||||
|
||||
def create_jobs(meta, src, tgt, out, is_tok):
|
||||
jobs = []
|
||||
fns = get_fns(meta)
|
||||
for file in fns:
|
||||
src_path = os.path.abspath(os.path.join(src, file))
|
||||
trans_path = os.path.abspath(os.path.join(src, file + '.trans'))
|
||||
if is_tok:
|
||||
tgt_path = os.path.abspath(os.path.join(tgt, file + '.tok'))
|
||||
else:
|
||||
tgt_path = os.path.abspath(os.path.join(tgt, file))
|
||||
out_path = os.path.abspath(os.path.join(out, file + '.align'))
|
||||
jobs.append('\t'.join([trans_path, src_path, tgt_path, out_path]))
|
||||
|
||||
return jobs
|
||||
|
||||
def get_fns(meta):
|
||||
fns = []
|
||||
with open(meta, 'rt', encoding='utf-8') as f:
|
||||
next(f) # skip header
|
||||
for line in f:
|
||||
recs = line.strip().split('\t')
|
||||
fns.append(recs[0])
|
||||
|
||||
return fns
|
||||
|
||||
def make_dir(path):
|
||||
if os.path.isdir(path):
|
||||
shutil.rmtree(path)
|
||||
os.makedirs(path, exist_ok=True)
|
||||
|
||||
if __name__ == '__main__':
|
||||
t_0 = time.time()
|
||||
main()
|
||||
print("It took {:.3f} seconds to align all the sentences.".format(time.time() - t_0))
|
||||
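The conversion step above pairs each line of the source-side index file with the matching line of the target-side index file. A minimal sketch of that pairing, assuming each line holds comma-separated integer indices as `read_alignment` expects; the sample data and the helper name `parse_index_lines` are invented for illustration:

```python
def parse_index_lines(text):
    """Parse one comma-separated list of integer indices per line."""
    return [[int(x) for x in line.strip().split(',')]
            for line in text.splitlines()]

# Hypothetical contents of the source-side and target-side index files.
src_side = "0\n1,2\n3\n"
tgt_side = "0\n1\n2,3\n"

pairs = zip(parse_index_lines(src_side), parse_index_lines(tgt_side))
# Same "{}:{}" formatting as _convert_format, so each side is printed as a list.
lines = ["{}:{}".format(s, t) for s, t in pairs]
print("\n".join(lines))
```

Each output line carries one alignment bead: the source indices, a colon, then the target indices.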
0
ext-lib/bleualign/bleualign/__init__.py
Normal file
1183
ext-lib/bleualign/bleualign/align.py
Normal file
File diff suppressed because it is too large
205
ext-lib/bleualign/bleualign/gale_church.py
Normal file
@@ -0,0 +1,205 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
|
||||
import math
|
||||
|
||||
# Based on Gale & Church 1993,
|
||||
# "A Program for Aligning Sentences in Bilingual Corpora"
|
||||
|
||||
infinity = float("inf")
|
||||
|
||||
def erfcc(x):
|
||||
"""Complementary error function."""
|
||||
z = abs(x)
|
||||
t = 1 / (1 + 0.5 * z)
|
||||
r = t * math.exp(-z * z -
|
||||
1.26551223 + t *
|
||||
(1.00002368 + t *
|
||||
(.37409196 + t *
|
||||
(.09678418 + t *
|
||||
(-.18628806 + t *
|
||||
(.27886807 + t *
|
||||
(-1.13520398 + t *
|
||||
(1.48851587 + t *
|
||||
(-.82215223 + t * .17087277)))))))))
|
||||
if (x >= 0.):
|
||||
return r
|
||||
else:
|
||||
return 2. - r
|
||||
|
||||
|
||||
def norm_cdf(x):
|
||||
"""Return the area under the normal distribution from M{-∞..x}."""
|
||||
return 1 - 0.5 * erfcc(x / math.sqrt(2))
|
||||
|
||||
|
||||
class LanguageIndependent(object):
|
||||
# These are the language-independent probabilities and parameters
|
||||
# given in Gale & Church
|
||||
|
||||
# for the computation, l_1 is always the language with fewer characters
|
||||
PRIORS = {
|
||||
(1, 0): 0.0099,
|
||||
(0, 1): 0.0099,
|
||||
(1, 1): 0.89,
|
||||
(2, 1): 0.089,
|
||||
(1, 2): 0.089,
|
||||
(2, 2): 0.011,
|
||||
}
|
||||
|
||||
AVERAGE_CHARACTERS = 1
|
||||
VARIANCE_CHARACTERS = 6.8
|
||||
|
||||
|
||||
def trace(backlinks, source, target):
|
||||
links = set()
|
||||
pos = (len(source) - 1, len(target) - 1)
|
||||
|
||||
#while pos != (-1, -1):
|
||||
while pos[0] != -1 and pos[1] != -1:
|
||||
#print(pos)
|
||||
#print(backlinks)
|
||||
#print(backlinks[pos])
|
||||
s, t = backlinks[pos]
|
||||
for i in range(s):
|
||||
for j in range(t):
|
||||
links.add((pos[0] - i, pos[1] - j))
|
||||
pos = (pos[0] - s, pos[1] - t)
|
||||
|
||||
return links
|
||||
|
||||
|
||||
def align_probability(i, j, source_sentences, target_sentences, alignment, params):
|
||||
"""Returns the probability of the two sentences C{source_sentences[i]}, C{target_sentences[j]}
|
||||
being aligned with a specific C{alignment}.
|
||||
|
||||
@param i: The offset of the source sentence.
|
||||
@param j: The offset of the target sentence.
|
||||
@param source_sentences: The list of source sentence lengths.
|
||||
@param target_sentences: The list of target sentence lengths.
|
||||
@param alignment: The alignment type, a tuple of two integers.
|
||||
@param params: The sentence alignment parameters.
|
||||
|
||||
@returns: The probability of a specific alignment between the two sentences, given the parameters.
|
||||
"""
|
||||
l_s = sum(source_sentences[i - offset] for offset in range(alignment[0]))
|
||||
l_t = sum(target_sentences[j - offset] for offset in range(alignment[1]))
|
||||
try:
|
||||
# actually, the paper says l_s * params.VARIANCE_CHARACTERS, this is based on the C
|
||||
# reference implementation. With l_s in the denominator, insertions are impossible.
|
||||
m = (l_s + l_t / params.AVERAGE_CHARACTERS) / 2
|
||||
delta = (l_t - l_s * params.AVERAGE_CHARACTERS) / math.sqrt(m * params.VARIANCE_CHARACTERS)
|
||||
except ZeroDivisionError:
|
||||
delta = infinity
|
||||
|
||||
return 2 * (1 - norm_cdf(abs(delta))) * params.PRIORS[alignment]
|
||||
|
||||
|
||||
def align_blocks(source_sentences, target_sentences, params = LanguageIndependent):
|
||||
"""Creates the sentence alignment of two blocks of texts (usually paragraphs).
|
||||
|
||||
@param source_sentences: The list of source sentence lengths.
|
||||
@param target_sentences: The list of target sentence lengths.
|
||||
@param params: the sentence alignment parameters.
|
||||
|
||||
@return: The sentence alignments, a list of index pairs.
|
||||
"""
|
||||
alignment_types = list(params.PRIORS.keys())
|
||||
|
||||
# there are always three rows in the history (with the last of them being filled)
|
||||
# and the rows are always |target_text| + 2, so that we never have to do
|
||||
# boundary checks
|
||||
D = [(len(target_sentences) + 2) * [0] for x in range(2)]
|
||||
|
||||
# for the first sentence, only substitution, insertion or deletion are
|
||||
# allowed, and they are all equally likely ( == 1)
|
||||
|
||||
D.append([0, 1])
|
||||
try:
|
||||
D[-2][1] = 1
|
||||
D[-2][2] = 1
|
||||
except IndexError:
|
||||
pass
|
||||
|
||||
backlinks = {}
|
||||
|
||||
for i in range(len(source_sentences)):
|
||||
for j in range(len(target_sentences)):
|
||||
m = []
|
||||
for a in alignment_types:
|
||||
k = D[-(1 + a[0])][j + 2 - a[1]]
|
||||
if k > 0:
|
||||
p = k * \
|
||||
align_probability(i, j, source_sentences, target_sentences, a, params)
|
||||
m.append((p, a))
|
||||
|
||||
if len(m) > 0:
|
||||
v = max(m)
|
||||
backlinks[(i, j)] = v[1]
|
||||
D[-1].append(v[0])
|
||||
else:
|
||||
backlinks[(i, j)] = (1, 1)
|
||||
D[-1].append(0)
|
||||
|
||||
D.pop(0)
|
||||
D.append([0, 0])
|
||||
|
||||
return trace(backlinks, source_sentences, target_sentences)
|
||||
|
||||
|
||||
def align_texts(source_blocks, target_blocks, params = LanguageIndependent):
|
||||
"""Creates the sentence alignment of two texts.
|
||||
|
||||
Texts can consist of several blocks. Block boundaries cannot be crossed by sentence
|
||||
alignment links.
|
||||
|
||||
Each block consists of a list that contains the lengths (in characters) of the sentences
|
||||
in this block.
|
||||
|
||||
@param source_blocks: The list of blocks in the source text.
|
||||
@param target_blocks: The list of blocks in the target text.
|
||||
@param params: the sentence alignment parameters.
|
||||
|
||||
@returns: A list of sentence alignment lists
|
||||
"""
|
||||
if len(source_blocks) != len(target_blocks):
|
||||
raise ValueError("Source and target texts do not have the same number of blocks.")
|
||||
|
||||
return [align_blocks(source_block, target_block, params)
|
||||
for source_block, target_block in zip(source_blocks, target_blocks)]
|
||||
|
||||
|
||||
def split_at(it, split_value):
    """Splits an iterator C{it} at values of C{split_value}.

    Each instance of C{split_value} is swallowed. The iterator produces
    subiterators which need to be consumed fully before the next subiterator
    can be used.
    """
    def _chunk_iterator(first):
        v = first
        while v != split_value:
            yield v
            try:
                v = next(it)
            except StopIteration:
                # PEP 479: StopIteration must not escape a generator body
                return

    while True:
        try:
            first = next(it)
        except StopIteration:
            return
        yield _chunk_iterator(first)
|
||||
|
||||
|
||||
def parse_token_stream(stream, soft_delimiter, hard_delimiter):
|
||||
"""Parses a stream of tokens and splits it into sentences (using C{soft_delimiter} tokens)
|
||||
and blocks (using C{hard_delimiter} tokens) for use with the L{align_texts} function.
|
||||
"""
|
||||
return [
|
||||
[sum(len(token) for token in sentence_it)
|
||||
for sentence_it in split_at(block_it, soft_delimiter)]
|
||||
for block_it in split_at(stream, hard_delimiter)]
|
||||
|
||||
|
||||
if __name__ == "__main__":
    import sys

    # contextlib.nested was removed in Python 3; a single with statement
    # (Python 2.7+ and 3.1+) manages both files.
    with open(sys.argv[1], "r") as s, open(sys.argv[2], "r") as t:
        source = parse_token_stream((l.strip() for l in s), ".EOS", ".EOP")
        target = parse_token_stream((l.strip() for l in t), ".EOS", ".EOP")
        print(align_texts(source, target))
|
||||
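The length-based score in `align_probability` can be illustrated in isolation. The following is a sketch rather than the module's code: it uses the standard library's `math.erfc` in place of the polynomial `erfcc` approximation, leaves out the alignment-type prior, and the function name `length_match_probability` is hypothetical. The default parameters mirror `AVERAGE_CHARACTERS` and `VARIANCE_CHARACTERS`.

```python
import math

def length_match_probability(l_s, l_t, mean_ratio=1.0, variance=6.8):
    """Two-tailed probability of the observed length difference.

    l_s and l_t are character lengths of the source and target side.
    """
    m = (l_s + l_t / mean_ratio) / 2
    if m == 0:
        # The module maps the division by zero to delta = infinity,
        # which yields a probability of 0.
        return 0.0
    delta = (l_t - l_s * mean_ratio) / math.sqrt(m * variance)
    # 2 * (1 - Phi(|delta|)) is the same quantity as erfc(|delta| / sqrt(2)).
    return math.erfc(abs(delta) / math.sqrt(2))

print(length_match_probability(20, 20))
print(length_match_probability(20, 22))
print(length_match_probability(20, 40))
```

Equal lengths give the highest score, and the score falls as the lengths diverge. In the module this value is additionally multiplied by the prior for the alignment type, for example 0.89 for a 1-1 bead.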
146
ext-lib/bleualign/bleualign/score.py
Normal file
@@ -0,0 +1,146 @@
|
||||
#!/usr/bin/python
|
||||
# -*- coding: utf-8 -*-
|
||||
#File originally part of moses package: http://www.statmt.org/moses/ (as bleu.py)
|
||||
#Stripped of unused code to reduce number of libraries used
|
||||
|
||||
# $Id$
|
||||
|
||||
'''Provides:
|
||||
|
||||
cook_refs(refs, n=4): Transform a list of reference sentences as strings into a form usable by cook_test().
|
||||
cook_test(test, refs, n=4): Transform a test sentence as a string (together with the cooked reference sentences) into a form usable by score_cooked().
|
||||
score_cooked(alltest, n=4): Score a list of cooked test sentences.
|
||||
|
||||
score_set(s, testid, refids, n=4): Interface with dataset.py; calculate BLEU score of testid against refids.
|
||||
|
||||
The reason for breaking the BLEU computation into three phases cook_refs(), cook_test(), and score_cooked() is to allow the caller to calculate BLEU scores for multiple test sets as efficiently as possible.
|
||||
'''
|
||||
|
||||
from __future__ import division, print_function
|
||||
import sys, math, re, xml.sax.saxutils
|
||||
|
||||
# Added to bypass NIST-style pre-processing of hyp and ref files -- wade
|
||||
nonorm = 0
|
||||
|
||||
preserve_case = False
|
||||
eff_ref_len = "shortest"
|
||||
|
||||
normalize1 = [
|
||||
('<skipped>', ''), # strip "skipped" tags
|
||||
(r'-\n', ''), # strip end-of-line hyphenation and join lines
|
||||
(r'\n', ' '), # join lines
|
||||
# (r'(\d)\s+(?=\d)', r'\1'), # join digits
|
||||
]
|
||||
normalize1 = [(re.compile(pattern), replace) for (pattern, replace) in normalize1]
|
||||
|
||||
normalize2 = [
|
||||
(r'([\{-\~\[-\` -\&\(-\+\:-\@\/])',r' \1 '), # tokenize punctuation. apostrophe is missing
|
||||
(r'([^0-9])([\.,])',r'\1 \2 '), # tokenize period and comma unless preceded by a digit
|
||||
(r'([\.,])([^0-9])',r' \1 \2'), # tokenize period and comma unless followed by a digit
|
||||
(r'([0-9])(-)',r'\1 \2 ') # tokenize dash when preceded by a digit
|
||||
]
|
||||
normalize2 = [(re.compile(pattern), replace) for (pattern, replace) in normalize2]
|
||||
|
||||
#combine normalize2 into a single regex.
|
||||
normalize3 = re.compile(r'([\{-\~\[-\` -\&\(-\+\:-\@\/])|(?:(?<![0-9])([\.,]))|(?:([\.,])(?![0-9]))|(?:(?<=[0-9])(-))')
|
||||
|
||||
def normalize(s):
|
||||
'''Normalize and tokenize text. This is lifted from NIST mteval-v11a.pl.'''
|
||||
# Added to bypass NIST-style pre-processing of hyp and ref files -- wade
|
||||
if (nonorm):
|
||||
return s.split()
|
||||
try:
|
||||
s.split()
|
||||
except AttributeError:
|
||||
s = " ".join(s)
|
||||
# language-independent part:
|
||||
for (pattern, replace) in normalize1:
|
||||
s = re.sub(pattern, replace, s)
|
||||
s = xml.sax.saxutils.unescape(s, {'&quot;':'"'})
|
||||
# language-dependent part (assuming Western languages):
|
||||
s = " %s " % s
|
||||
if not preserve_case:
|
||||
s = s.lower() # this might not be identical to the original
|
||||
return [tok for tok in normalize3.split(s) if tok and tok != ' ']
|
||||
|
||||
def count_ngrams(words, n=4):
|
||||
counts = {}
|
||||
for k in range(1,n+1):
|
||||
for i in range(len(words)-k+1):
|
||||
ngram = tuple(words[i:i+k])
|
||||
counts[ngram] = counts.get(ngram, 0)+1
|
||||
return counts
|
||||
|
||||
def cook_refs(refs, n=4):
|
||||
'''Takes a list of reference sentences for a single segment
|
||||
and returns an object that encapsulates everything that BLEU
|
||||
needs to know about them.'''
|
||||
|
||||
refs = [normalize(ref) for ref in refs]
|
||||
maxcounts = {}
|
||||
for ref in refs:
|
||||
counts = count_ngrams(ref, n)
|
||||
for (ngram,count) in list(counts.items()):
|
||||
maxcounts[ngram] = max(maxcounts.get(ngram,0), count)
|
||||
return ([len(ref) for ref in refs], maxcounts)
|
||||
|
||||
def cook_ref_set(ref, n=4):
|
||||
'''Takes a reference sentence for a single segment
and returns an object that encapsulates everything that BLEU
needs to know about it. Also provides a frozenset of the n-grams, because bleualign wants it.'''
|
||||
ref = normalize(ref)
|
||||
counts = count_ngrams(ref, n)
|
||||
return (len(ref), counts, frozenset(counts))
|
||||
|
||||
|
||||
|
||||
|
||||
def cook_test(test, args, n=4):
|
||||
'''Takes a test sentence and returns an object that
|
||||
encapsulates everything that BLEU needs to know about it.'''
|
||||
|
||||
reflens, refmaxcounts = args
|
||||
test = normalize(test)
|
||||
result = {}
|
||||
result["testlen"] = len(test)
|
||||
|
||||
# Calculate effective reference sentence length.
|
||||
|
||||
if eff_ref_len == "shortest":
|
||||
result["reflen"] = min(reflens)
|
||||
elif eff_ref_len == "average":
|
||||
result["reflen"] = float(sum(reflens))/len(reflens)
|
||||
elif eff_ref_len == "closest":
|
||||
min_diff = None
|
||||
for reflen in reflens:
|
||||
if min_diff is None or abs(reflen-len(test)) < min_diff:
|
||||
min_diff = abs(reflen-len(test))
|
||||
result['reflen'] = reflen
|
||||
|
||||
result["guess"] = [max(len(test)-k+1,0) for k in range(1,n+1)]
|
||||
|
||||
result['correct'] = [0]*n
|
||||
counts = count_ngrams(test, n)
|
||||
for (ngram, count) in list(counts.items()):
|
||||
result["correct"][len(ngram)-1] += min(refmaxcounts.get(ngram,0), count)
|
||||
|
||||
return result
|
||||
|
||||
def score_cooked(allcomps, n=4):
|
||||
totalcomps = {'testlen':0, 'reflen':0, 'guess':[0]*n, 'correct':[0]*n}
|
||||
for comps in allcomps:
|
||||
for key in ['testlen','reflen']:
|
||||
totalcomps[key] += comps[key]
|
||||
for key in ['guess','correct']:
|
||||
for k in range(n):
|
||||
totalcomps[key][k] += comps[key][k]
|
||||
logbleu = 0.0
|
||||
for k in range(n):
|
||||
if totalcomps['correct'][k] == 0:
|
||||
return 0.0
|
||||
#log.write("%d-grams: %f\n" % (k,float(totalcomps['correct'][k])/totalcomps['guess'][k]))
|
||||
logbleu += math.log(totalcomps['correct'][k])-math.log(totalcomps['guess'][k])
|
||||
logbleu /= float(n)
|
||||
#log.write("Effective reference length: %d test length: %d\n" % (totalcomps['reflen'], totalcomps['testlen']))
|
||||
logbleu += min(0,1-float(totalcomps['reflen'])/totalcomps['testlen'])
|
||||
return math.exp(logbleu)
|
||||
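The three phases above (count n-grams, clip the test counts against the reference, combine the precisions with a brevity penalty) can be condensed into a single-reference sketch. This is an illustration under simplifying assumptions, not the module's API: inputs are already tokenized, there is exactly one reference, and the function names `ngram_counts` and `sentence_bleu` are hypothetical.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams of order 1..n."""
    counts = Counter()
    for k in range(1, n + 1):
        for i in range(len(tokens) - k + 1):
            counts[tuple(tokens[i:i + k])] += 1
    return counts

def sentence_bleu(hyp, ref, n=2):
    """BLEU of one tokenized hypothesis against one tokenized reference."""
    hyp_counts = ngram_counts(hyp, n)
    ref_counts = ngram_counts(ref, n)
    correct = [0] * n
    for ngram, count in hyp_counts.items():
        # Clip each hypothesis count by the reference count.
        correct[len(ngram) - 1] += min(count, ref_counts.get(ngram, 0))
    guess = [max(len(hyp) - k + 1, 0) for k in range(1, n + 1)]
    if 0 in correct:
        return 0.0
    log_bleu = sum(math.log(c) - math.log(g) for c, g in zip(correct, guess)) / n
    # Brevity penalty: only hypotheses shorter than the reference are penalized.
    log_bleu += min(0.0, 1 - len(ref) / len(hyp))
    return math.exp(log_bleu)

print(sentence_bleu("the cat sat".split(), "the cat sat".split()))
print(sentence_bleu("the cat sat".split(), "the cat sat down".split()))
```

As in `score_cooked`, any n-gram order with zero matches makes the whole score 0, which is why a low `n` is a sensible default for sentence-level comparison.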
191
ext-lib/bleualign/bleualign/utils.py
Normal file
@@ -0,0 +1,191 @@
|
||||
#!/usr/bin/python
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright: University of Zurich
|
||||
# Author: Rico Sennrich
|
||||
# For licensing information, see LICENSE
|
||||
|
||||
# Evaluation functions for Bleualign
|
||||
|
||||
|
||||
from __future__ import division
|
||||
from operator import itemgetter
|
||||
|
||||
|
||||
def evaluate(options, testalign, goldalign, log_function):
|
||||
goldalign = [(tuple(src),tuple(target)) for src,target in goldalign]
|
||||
|
||||
results = {}
|
||||
paircounts = {}
|
||||
for pair in [(len(srclist),len(targetlist)) for srclist,targetlist in goldalign]:
|
||||
paircounts[pair] = paircounts.get(pair,0) + 1
|
||||
pairs_normalized = {}
|
||||
for pair in paircounts:
|
||||
pairs_normalized[pair] = (paircounts[pair],paircounts[pair] / float(len(goldalign)))
|
||||
|
||||
log_function('\ngold alignment frequencies\n')
|
||||
for aligntype,(abscount,relcount) in sorted(list(pairs_normalized.items()),key=itemgetter(1),reverse=True):
|
||||
log_function(aligntype,end='')
|
||||
log_function(' - ',end='')
|
||||
log_function(abscount,end='')
|
||||
log_function(' ('+str(relcount)+')')
|
||||
|
||||
log_function('\ntotal recall: ',end='')
|
||||
log_function(str(len(goldalign)) + ' pairs in gold')
|
||||
(tpstrict,fnstrict,tplax,fnlax) = recall((0,0),goldalign,[i[0] for i in testalign],log_function)
|
||||
results['recall'] = (tpstrict,fnstrict,tplax,fnlax)
|
||||
|
||||
for aligntype in set([i[1] for i in testalign]):
|
||||
testalign_bytype = []
|
||||
for i in testalign:
|
||||
if i[1] == aligntype:
|
||||
testalign_bytype.append(i)
|
||||
log_function('precision for alignment type ' + str(aligntype) + ' ( ' + str(len(testalign_bytype)) + ' alignment pairs)')
|
||||
precision(goldalign,testalign_bytype,log_function)
|
||||
|
||||
log_function('\ntotal precision:',end='')
|
||||
log_function(str(len(testalign)) + ' alignment pairs found')
|
||||
(tpstrict,fpstrict,tplax,fplax) = precision(goldalign,testalign,log_function)
|
||||
results['precision'] = (tpstrict,fpstrict,tplax,fplax)
|
||||
|
||||
return results
|
||||
|
||||
|
||||
def precision(goldalign, testalign, log_function):
|
||||
tpstrict=0
|
||||
tplax=0
|
||||
fpstrict=0
|
||||
fplax=0
|
||||
for (src,target) in [i[0] for i in testalign]:
|
||||
if (src,target) == ((),()):
|
||||
continue
|
||||
if (src,target) in goldalign:
|
||||
tpstrict +=1
|
||||
tplax += 1
|
||||
else:
|
||||
srcset, targetset = set(src), set(target)
|
||||
for srclist,targetlist in goldalign:
|
||||
#lax condition: hypothesis and gold alignment only need to overlap
|
||||
if srcset.intersection(set(srclist)) and targetset.intersection(set(targetlist)):
|
||||
fpstrict +=1
|
||||
tplax += 1
|
||||
break
|
||||
else:
|
||||
fpstrict +=1
|
||||
fplax +=1
|
||||
log_function('false positive: ',2)
|
||||
log_function((src,target),2)
|
||||
if tpstrict+fpstrict > 0:
|
||||
log_function('precision strict: ',end='')
|
||||
log_function((tpstrict/float(tpstrict+fpstrict)))
|
||||
log_function('precision lax: ',end='')
|
||||
log_function((tplax/float(tplax+fplax)))
|
||||
log_function('')
|
||||
else:
|
||||
log_function('nothing to find')
|
||||
|
||||
return tpstrict,fpstrict,tplax,fplax
|
||||
|
||||
|
||||
def recall(aligntype, goldalign, testalign, log_function):
|
||||
|
||||
srclen,targetlen = aligntype
|
||||
|
||||
if srclen == 0 and targetlen == 0:
|
||||
gapdists = [(0,0) for i in goldalign]
|
||||
elif srclen == 0 or targetlen == 0:
|
||||
log_function('nothing to find')
|
||||
return 0, 0, 0, 0
|
||||
else:
|
||||
gapdists = [(len(srclist),len(targetlist)) for srclist,targetlist in goldalign]
|
||||
|
||||
tpstrict=0
|
||||
tplax=0
|
||||
fnstrict=0
|
||||
fnlax=0
|
||||
for i,pair in enumerate(gapdists):
|
||||
if aligntype == pair:
|
||||
(srclist,targetlist) = goldalign[i]
|
||||
if not srclist or not targetlist:
|
||||
continue
|
||||
elif (srclist,targetlist) in testalign:
|
||||
tpstrict +=1
|
||||
tplax +=1
|
||||
else:
|
||||
srcset, targetset = set(srclist), set(targetlist)
|
||||
for src,target in testalign:
|
||||
#lax condition: hypothesis and gold alignment only need to overlap
|
||||
if srcset.intersection(set(src)) and targetset.intersection(set(target)):
|
||||
tplax +=1
|
||||
fnstrict+=1
|
||||
break
|
||||
else:
|
||||
fnstrict+=1
|
||||
fnlax+=1
|
||||
log_function('not found: ',2)
|
||||
log_function(goldalign[i],2)
|
||||
|
||||
if tpstrict+fnstrict>0:
|
||||
log_function('recall strict: ',end='')
|
||||
log_function((tpstrict/float(tpstrict+fnstrict)))
|
||||
log_function('recall lax: ',end='')
|
||||
log_function((tplax/float(tplax+fnlax)))
|
||||
log_function('')
|
||||
else:
|
||||
log_function('nothing to find')
|
||||
|
||||
return tpstrict,fnstrict,tplax,fnlax
|
||||
|
||||
|
||||
def finalevaluation(results, log_function):
|
||||
recall_value = [0,0,0,0]
|
||||
precision_value = [0,0,0,0]
|
||||
for i,k in list(results.items()):
|
||||
for m,j in enumerate(recall_value):
|
||||
recall_value[m] = j+ k['recall'][m]
|
||||
for m,j in enumerate(precision_value):
|
||||
precision_value[m] = j+ k['precision'][m]
|
||||
|
||||
try:
|
||||
pstrict = (precision_value[0]/float(precision_value[0]+precision_value[1]))
|
||||
except ZeroDivisionError:
|
||||
pstrict = 0
|
||||
try:
|
||||
plax =(precision_value[2]/float(precision_value[2]+precision_value[3]))
|
||||
except ZeroDivisionError:
|
||||
plax = 0
|
||||
try:
|
||||
rstrict= (recall_value[0]/float(recall_value[0]+recall_value[1]))
|
||||
except ZeroDivisionError:
|
||||
rstrict = 0
|
||||
try:
|
||||
rlax=(recall_value[2]/float(recall_value[2]+recall_value[3]))
|
||||
except ZeroDivisionError:
|
||||
rlax = 0
|
||||
if (pstrict+rstrict) == 0:
|
||||
fstrict = 0
|
||||
else:
|
||||
fstrict=2*(pstrict*rstrict)/(pstrict+rstrict)
|
||||
if (plax+rlax) == 0:
|
||||
flax=0
|
||||
else:
|
||||
flax=2*(plax*rlax)/(plax+rlax)
|
||||
|
||||
log_function('\n=========================\n')
|
||||
log_function('total results:')
|
||||
log_function('recall strict: ',end='')
|
||||
log_function(rstrict)
|
||||
log_function('recall lax: ',end='')
|
||||
log_function(rlax)
|
||||
log_function('')
|
||||
|
||||
log_function('precision strict: ',end='')
|
||||
log_function(pstrict)
|
||||
log_function('precision lax: ',end='')
|
||||
log_function(plax)
|
||||
log_function('')
|
||||
|
||||
log_function('f1 strict: ',end='')
|
||||
log_function(fstrict)
|
||||
log_function('f1 lax: ',end='')
|
||||
log_function(flax)
|
||||
log_function('')
|
||||
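The difference between the strict and lax conditions used in `precision` and `recall` can be shown on a toy example. In this sketch a hypothesis bead counts as a strict hit only if it appears verbatim in the gold alignment, and as a lax hit if it overlaps some gold bead on both the source and the target side. The function name and the data are invented for illustration.

```python
def strict_and_lax_hits(gold, hypothesis):
    """Return (strict, lax) true-positive counts for the hypothesis beads."""
    gold_set = set(gold)
    strict = lax = 0
    for src, tgt in hypothesis:
        if (src, tgt) in gold_set:
            strict += 1
            lax += 1
        elif any(set(src) & set(g_src) and set(tgt) & set(g_tgt)
                 for g_src, g_tgt in gold):
            # Overlap on both sides is enough for the lax condition.
            lax += 1
    return strict, lax

gold = [((0,), (0,)), ((1, 2), (1,)), ((3,), (2,))]
hyp = [((0,), (0,)), ((1,), (1,)), ((2,), (5,))]

strict, lax = strict_and_lax_hits(gold, hyp)
print("precision strict:", strict / len(hyp))
print("precision lax:", lax / len(hyp))
```

Here the bead `((1,), (1,))` only partially matches the gold bead `((1, 2), (1,))`, so it is a false positive under the strict measure and a true positive under the lax one.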
158
ext-lib/bleualign/command_utils.py
Normal file
@@ -0,0 +1,158 @@
|
||||
#!/usr/bin/python
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright © 2010 University of Zürich
|
||||
# Author: Rico Sennrich <sennrich@cl.uzh.ch>
|
||||
# For licensing information, see LICENSE
|
||||
|
||||
|
||||
from __future__ import division, print_function
|
||||
import sys
|
||||
import os
|
||||
import getopt
|
||||
|
||||
def usage():
|
||||
bold = "\033[1m"
|
||||
reset = "\033[0;0m"
|
||||
italic = "\033[3m"
|
||||
|
||||
print('\n\t All files need to be one sentence per line and have .EOA as a hard delimiter. --source, --target and --output are mandatory arguments, the others are optional.')
|
||||
print('\n\t' + bold +'--help' + reset + ', ' + bold +'-h' + reset)
|
||||
print('\t\tprint usage information\n')
|
||||
print('\t' + bold +'--source' + reset + ', ' + bold +'-s' + reset + ' file')
|
||||
print('\t\tSource language text.')
|
||||
print('\t' + bold +'--target' + reset + ', ' + bold +'-t' + reset + ' file')
|
||||
print('\t\tTarget language text.')
|
||||
print('\t' + bold +'--output' + reset + ', ' + bold +'-o' + reset + ' filename')
|
||||
print('\t\tOutput file: Will create ' + 'filename' + '-s and ' + 'filename' + '-t')
|
||||
print('\n\t' + bold +'--srctotarget' + reset + ' file')
|
||||
print('\t\tTranslation of source language text to target language. Needs to be sentence-aligned with source language text.')
|
||||
print('\t' + bold +'--targettosrc' + reset + ' file')
|
||||
print('\t\tTranslation of target language text to source language. Needs to be sentence-aligned with target language text.')
|
||||
print('\n\t' + bold +'--factored' + reset)
|
||||
print('\t\tSource and target text can be factored (as defined by moses: | as separator of factors, space as word separator). Only first factor will be used for BLEU score.')
|
||||
print('\n\t' + bold +'--filter' + reset + ', ' + bold +'-f' + reset + ' option')
|
||||
print('\t\tFilters output. Possible options:')
|
||||
print('\t\t' + bold +'sentences' + reset + '\tevaluate each sentence and filter on a per-sentence basis')
|
||||
print('\t\t' + bold +'articles' + reset + '\tevaluate each article and filter on a per-article basis')
|
||||
print('\n\t' + bold +'--filterthreshold' + reset + ' int')
|
||||
print('\t\tFilters output to best XX percent. (Default: 90). Only works if --filter is set.')
|
||||
print('\t' + bold +'--bleuthreshold' + reset + ' float')
|
||||
print('\t\tFilters out sentence pairs with sentence-level BLEU score < XX (in range from 0 to 1). (Default: 0). Only works if --filter is set.')
|
||||
print('\t' + bold +'--filterlang' + reset)
|
||||
print('\t\tFilters out sentences/articles for which BLEU score between source and target is higher than that between translation and target (usually means source and target are in same language). Only works if --filter is set.')
|
||||
print('\n\t' + bold +'--bleu_n' + reset + ' int')
|
||||
print('\t\tConsider n-grams up to size n for BLEU. Default 2.')
|
||||
print('\t' + bold +'--bleu_charlevel' + reset)
|
||||
print('\t\tPerform BLEU on character-level (recommended for continuous script language; also consider increasing bleu_n).')
|
||||
print('\n\t' + bold +'--galechurch' + reset)
|
||||
print('\t\tAlign the bitext using Gale and Church\'s algorithm (without BLEU comparison).')
|
||||
print('\t' + bold +'--printempty' + reset)
|
||||
print('\t\tAlso write unaligned sentences to file. By default, they are discarded.')
|
||||
print('\t' + bold +'--verbosity' + reset + ', ' + bold +'-v' + reset + ' int')
|
||||
print('\t\tVerbosity. Choose amount of debugging output. Default value 1; choose 0 for (mostly) quiet mode, 2 for verbose output')
|
||||
print('\t' + bold +'--processes' + reset + ', ' + bold +'-p' + reset + ' int')
|
||||
print('\t\tNumber of parallel processes. Documents are split across available processes. Default: 4.')
|
||||
|
||||
def load_arguments(sysargv):
|
||||
try:
|
||||
opts, args = getopt.getopt(sysargv[1:], "def:ho:s:t:v:p:", ["factored", "filter=", "filterthreshold=", "bleuthreshold=", "filterlang", "printempty", "deveval","eval", "help", "bleu_n=", "bleu_charlevel", "galechurch", "output=", "source=", "target=", "srctotarget=", "targettosrc=", "verbosity=", "processes="])
|
||||
except getopt.GetoptError as err:
|
||||
# print help information and exit:
|
||||
print(str(err)) # will print something like "option -a not recognized"
|
||||
usage()
|
||||
sys.exit(2)
|
||||
options = {}
|
||||
options['srcfile'] = None
|
||||
options['targetfile'] = None
|
||||
options['output'] = None
|
||||
options['srctotarget'] = []
|
||||
options['targettosrc'] = []
|
||||
options['processes'] = 4
|
||||
bold = "\033[1m"
|
||||
reset = "\033[0;0m"
|
||||
|
||||
project_path = os.path.dirname(os.path.abspath(__file__))
|
||||
for o, a in opts:
|
||||
if o in ("-h", "--help"):
|
||||
usage()
|
||||
sys.exit()
|
||||
elif o in ("-e", "--eval"):
|
||||
options['srcfile'] = os.path.join(project_path,'eval','eval1989.de')
|
||||
options['targetfile'] = os.path.join(project_path,'eval','eval1989.fr')
|
||||
from eval import goldeval
|
||||
goldalign = [None] * len(goldeval.gold1990map)
|
||||
for index, data in list(goldeval.gold1990map.items()):
|
||||
goldalign[index] = goldeval.gold[data]
|
||||
options['eval'] = goldalign
|
||||
elif o in ("-d", "--deveval"):
|
||||
options['srcfile'] = os.path.join(project_path,'eval','eval1957.de')
|
||||
options['targetfile'] = os.path.join(project_path,'eval','eval1957.fr')
|
||||
from eval import golddev
|
||||
goldalign = [golddev.goldalign]
|
||||
options['eval'] = goldalign
|
||||
elif o in ("-o", "--output"):
|
||||
options['output'] = a
|
||||
elif o == "--factored":
|
||||
options['factored'] = True
|
||||
elif o in ("-f", "--filter"):
|
||||
if a in ['sentences','articles']:
|
||||
options['filter'] = a
|
||||
else:
|
||||
print('\nERROR: Valid values for option ' + bold + '--filter'+ reset +' are '+ bold +'sentences '+ reset +'and ' + bold +'articles'+ reset +'.')
|
||||
usage()
|
||||
sys.exit(2)
|
||||
elif o == "--filterthreshold":
|
||||
options['filterthreshold'] = float(a)
|
||||
elif o == "--bleuthreshold":
|
||||
options['bleuthreshold'] = float(a)
|
||||
elif o == "--filterlang":
|
||||
options['filterlang'] = True
|
||||
elif o == "--galechurch":
|
||||
options['galechurch'] = True
|
||||
elif o == "--bleu_n":
|
||||
options['bleu_ngrams'] = int(a)
|
||||
elif o == "--bleu_charlevel":
|
||||
options['bleu_charlevel'] = True
|
||||
elif o in ("-s", "--source"):
|
||||
if not 'eval' in options:
|
||||
options['srcfile'] = a
|
||||
elif o in ("-t", "--target"):
|
||||
if not 'eval' in options:
|
||||
options['targetfile'] = a
|
||||
elif o == "--srctotarget":
|
||||
if a == '-':
|
||||
options['no_translation_override'] = True
|
||||
else:
|
||||
options['srctotarget'].append(a)
|
||||
elif o == "--targettosrc":
|
||||
options['targettosrc'].append(a)
|
||||
elif o == "--printempty":
|
||||
options['printempty'] = True
|
||||
elif o in ("-v", "--verbosity"):
|
||||
global loglevel
|
||||
loglevel = int(a)
|
||||
options['loglevel'] = int(a)
|
||||
options['verbosity'] = int(a)
|
||||
elif o in ("-p", "--processes"):
|
||||
options['num_processes'] = int(a)
|
||||
else:
|
||||
assert False, "unhandled option"
|
||||
|
||||
if not options['output']:
|
||||
print('WARNING: Output not specified. Just printing debugging output.')
|
||||
if not options['srcfile']:
|
||||
print('\nERROR: Source file not specified.')
|
||||
usage()
|
||||
sys.exit(2)
|
||||
if not options['targetfile']:
|
||||
print('\nERROR: Target file not specified.')
|
||||
usage()
|
||||
sys.exit(2)
|
||||
if options['targettosrc'] and not options['srctotarget']:
|
||||
print('\nWARNING: Only --targettosrc specified, but expecting at least one --srctotarget. Please swap source and target side.')
|
||||
sys.exit(2)
|
||||
if not options['srctotarget'] and not options['targettosrc']\
|
||||
and 'no_translation_override' not in options:
|
||||
print("ERROR: no translation available: BLEU scores can be computed between the source and target text, but this is not the intended usage of Bleualign and may result in poor performance! If you're *really* sure that this is what you want, use the option '--srctotarget -'")
|
||||
sys.exit(2)
|
||||
return options
|
||||
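The option handling in `load_arguments` follows the usual `getopt` pattern: a short-option string, a list of long options (a trailing `=` marks options that take a value), and a loop that copies the parsed pairs into a dictionary. A reduced sketch with only four of the options; the argument list and file names are placeholders, not a real invocation:

```python
import getopt

# Hypothetical command line; the file names are placeholders.
argv = ["-s", "src.txt", "--target", "tgt.txt", "--bleu_n", "3", "--galechurch"]

opts, rest = getopt.getopt(
    argv, "s:t:", ["source=", "target=", "bleu_n=", "galechurch"])

options = {}
for o, a in opts:
    if o in ("-s", "--source"):
        options["srcfile"] = a
    elif o in ("-t", "--target"):
        options["targetfile"] = a
    elif o == "--bleu_n":
        options["bleu_ngrams"] = int(a)  # values arrive as strings
    elif o == "--galechurch":
        options["galechurch"] = True     # flag options carry an empty value

print(options)
```

Flags such as `--galechurch` are listed without `=` and are stored as booleans, while valued options are converted explicitly because `getopt` returns every value as a string.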
42
ext-lib/bleualign/setup.py
Normal file
@@ -0,0 +1,42 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
import os
|
||||
import setuptools
|
||||
|
||||
def read_file(filename):
    # Close the file promptly instead of relying on garbage collection.
    with open(os.path.join(os.path.dirname(__file__), filename)) as f:
        return f.read()
|
||||
|
||||
setuptools.setup(
|
||||
name = 'bleualign',
|
||||
version = '0.1.1',
|
||||
description = 'An MT-based sentence alignment tool',
|
||||
long_description = read_file('README.md'),
|
||||
author = 'Rico Sennrich',
|
||||
author_email = 'sennrich@cl.uzh.ch',
|
||||
url = 'https://github.com/rsennrich/Bleualign',
|
||||
download_url = 'https://github.com/rsennrich/Bleualign',
|
||||
keywords = [
|
||||
'Sentence Alignment',
|
||||
'Natural Language Processing',
|
||||
'Statistical Machine Translation',
|
||||
'BLEU',
|
||||
],
|
||||
classifiers = [
|
||||
# which Development Status?
|
||||
# 'Development Status :: 3 - Alpha',
|
||||
'Development Status :: 4 - Beta',
|
||||
# 'Development Status :: 5 - Production/Stable',
|
||||
'License :: OSI Approved :: GNU General Public License v2 (GPLv2)',
|
||||
'Operating System :: OS Independent',
|
||||
'Programming Language :: Python :: 2.6',
|
||||
'Programming Language :: Python :: 2.7',
|
||||
'Programming Language :: Python :: 3',
|
||||
'Programming Language :: Python :: 3.2',
|
||||
'Programming Language :: Python :: 3.3',
|
||||
'Programming Language :: Python :: 3.4',
|
||||
'Topic :: Scientific/Engineering',
|
||||
'Topic :: Scientific/Engineering :: Information Analysis',
|
||||
'Topic :: Text Processing',
|
||||
'Topic :: Text Processing :: Linguistic',
|
||||
],
|
||||
packages = ['bleualign'],
|
||||
)
|
||||