Baseline alignment systems
5  ext-lib/bleualign/.gitignore  vendored  Normal file
@@ -0,0 +1,5 @@
__pycache__/
*.pyc
/dist
/build
/MANIFEST
339  ext-lib/bleualign/LICENSE  Normal file
@@ -0,0 +1,339 @@
GNU GENERAL PUBLIC LICENSE
Version 2, June 1991

Copyright (C) 1989, 1991 Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.

Preamble

The licenses for most software are designed to take away your
freedom to share and change it. By contrast, the GNU General Public
License is intended to guarantee your freedom to share and change free
software--to make sure the software is free for all its users. This
General Public License applies to most of the Free Software
Foundation's software and to any other program whose authors commit to
using it. (Some other Free Software Foundation software is covered by
the GNU Lesser General Public License instead.) You can apply it to
your programs, too.

When we speak of free software, we are referring to freedom, not
price. Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
this service if you wish), that you receive source code or can get it
if you want it, that you can change the software or use pieces of it
in new free programs; and that you know you can do these things.

To protect your rights, we need to make restrictions that forbid
anyone to deny you these rights or to ask you to surrender the rights.
These restrictions translate to certain responsibilities for you if you
distribute copies of the software, or if you modify it.

For example, if you distribute copies of such a program, whether
gratis or for a fee, you must give the recipients all the rights that
you have. You must make sure that they, too, receive or can get the
source code. And you must show them these terms so they know their
rights.

We protect your rights with two steps: (1) copyright the software, and
(2) offer you this license which gives you legal permission to copy,
distribute and/or modify the software.

Also, for each author's protection and ours, we want to make certain
that everyone understands that there is no warranty for this free
software. If the software is modified by someone else and passed on, we
want its recipients to know that what they have is not the original, so
that any problems introduced by others will not reflect on the original
authors' reputations.

Finally, any free program is threatened constantly by software
patents. We wish to avoid the danger that redistributors of a free
program will individually obtain patent licenses, in effect making the
program proprietary. To prevent this, we have made it clear that any
patent must be licensed for everyone's free use or not licensed at all.

The precise terms and conditions for copying, distribution and
modification follow.

GNU GENERAL PUBLIC LICENSE
TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION

0. This License applies to any program or other work which contains
a notice placed by the copyright holder saying it may be distributed
under the terms of this General Public License. The "Program", below,
refers to any such program or work, and a "work based on the Program"
means either the Program or any derivative work under copyright law:
that is to say, a work containing the Program or a portion of it,
either verbatim or with modifications and/or translated into another
language. (Hereinafter, translation is included without limitation in
the term "modification".) Each licensee is addressed as "you".

Activities other than copying, distribution and modification are not
covered by this License; they are outside its scope. The act of
running the Program is not restricted, and the output from the Program
is covered only if its contents constitute a work based on the
Program (independent of having been made by running the Program).
Whether that is true depends on what the Program does.

1. You may copy and distribute verbatim copies of the Program's
source code as you receive it, in any medium, provided that you
conspicuously and appropriately publish on each copy an appropriate
copyright notice and disclaimer of warranty; keep intact all the
notices that refer to this License and to the absence of any warranty;
and give any other recipients of the Program a copy of this License
along with the Program.

You may charge a fee for the physical act of transferring a copy, and
you may at your option offer warranty protection in exchange for a fee.

2. You may modify your copy or copies of the Program or any portion
of it, thus forming a work based on the Program, and copy and
distribute such modifications or work under the terms of Section 1
above, provided that you also meet all of these conditions:

a) You must cause the modified files to carry prominent notices
stating that you changed the files and the date of any change.

b) You must cause any work that you distribute or publish, that in
whole or in part contains or is derived from the Program or any
part thereof, to be licensed as a whole at no charge to all third
parties under the terms of this License.

c) If the modified program normally reads commands interactively
when run, you must cause it, when started running for such
interactive use in the most ordinary way, to print or display an
announcement including an appropriate copyright notice and a
notice that there is no warranty (or else, saying that you provide
a warranty) and that users may redistribute the program under
these conditions, and telling the user how to view a copy of this
License. (Exception: if the Program itself is interactive but
does not normally print such an announcement, your work based on
the Program is not required to print an announcement.)

These requirements apply to the modified work as a whole. If
identifiable sections of that work are not derived from the Program,
and can be reasonably considered independent and separate works in
themselves, then this License, and its terms, do not apply to those
sections when you distribute them as separate works. But when you
distribute the same sections as part of a whole which is a work based
on the Program, the distribution of the whole must be on the terms of
this License, whose permissions for other licensees extend to the
entire whole, and thus to each and every part regardless of who wrote it.

Thus, it is not the intent of this section to claim rights or contest
your rights to work written entirely by you; rather, the intent is to
exercise the right to control the distribution of derivative or
collective works based on the Program.

In addition, mere aggregation of another work not based on the Program
with the Program (or with a work based on the Program) on a volume of
a storage or distribution medium does not bring the other work under
the scope of this License.

3. You may copy and distribute the Program (or a work based on it,
under Section 2) in object code or executable form under the terms of
Sections 1 and 2 above provided that you also do one of the following:

a) Accompany it with the complete corresponding machine-readable
source code, which must be distributed under the terms of Sections
1 and 2 above on a medium customarily used for software interchange; or,

b) Accompany it with a written offer, valid for at least three
years, to give any third party, for a charge no more than your
cost of physically performing source distribution, a complete
machine-readable copy of the corresponding source code, to be
distributed under the terms of Sections 1 and 2 above on a medium
customarily used for software interchange; or,

c) Accompany it with the information you received as to the offer
to distribute corresponding source code. (This alternative is
allowed only for noncommercial distribution and only if you
received the program in object code or executable form with such
an offer, in accord with Subsection b above.)

The source code for a work means the preferred form of the work for
making modifications to it. For an executable work, complete source
code means all the source code for all modules it contains, plus any
associated interface definition files, plus the scripts used to
control compilation and installation of the executable. However, as a
special exception, the source code distributed need not include
anything that is normally distributed (in either source or binary
form) with the major components (compiler, kernel, and so on) of the
operating system on which the executable runs, unless that component
itself accompanies the executable.

If distribution of executable or object code is made by offering
access to copy from a designated place, then offering equivalent
access to copy the source code from the same place counts as
distribution of the source code, even though third parties are not
compelled to copy the source along with the object code.

4. You may not copy, modify, sublicense, or distribute the Program
except as expressly provided under this License. Any attempt
otherwise to copy, modify, sublicense or distribute the Program is
void, and will automatically terminate your rights under this License.
However, parties who have received copies, or rights, from you under
this License will not have their licenses terminated so long as such
parties remain in full compliance.

5. You are not required to accept this License, since you have not
signed it. However, nothing else grants you permission to modify or
distribute the Program or its derivative works. These actions are
prohibited by law if you do not accept this License. Therefore, by
modifying or distributing the Program (or any work based on the
Program), you indicate your acceptance of this License to do so, and
all its terms and conditions for copying, distributing or modifying
the Program or works based on it.

6. Each time you redistribute the Program (or any work based on the
Program), the recipient automatically receives a license from the
original licensor to copy, distribute or modify the Program subject to
these terms and conditions. You may not impose any further
restrictions on the recipients' exercise of the rights granted herein.
You are not responsible for enforcing compliance by third parties to
this License.

7. If, as a consequence of a court judgment or allegation of patent
infringement or for any other reason (not limited to patent issues),
conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License. If you cannot
distribute so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you
may not distribute the Program at all. For example, if a patent
license would not permit royalty-free redistribution of the Program by
all those who receive copies directly or indirectly through you, then
the only way you could satisfy both it and this License would be to
refrain entirely from distribution of the Program.

If any portion of this section is held invalid or unenforceable under
any particular circumstance, the balance of the section is intended to
apply and the section as a whole is intended to apply in other
circumstances.

It is not the purpose of this section to induce you to infringe any
patents or other property right claims or to contest validity of any
such claims; this section has the sole purpose of protecting the
integrity of the free software distribution system, which is
implemented by public license practices. Many people have made
generous contributions to the wide range of software distributed
through that system in reliance on consistent application of that
system; it is up to the author/donor to decide if he or she is willing
to distribute software through any other system and a licensee cannot
impose that choice.

This section is intended to make thoroughly clear what is believed to
be a consequence of the rest of this License.

8. If the distribution and/or use of the Program is restricted in
certain countries either by patents or by copyrighted interfaces, the
original copyright holder who places the Program under this License
may add an explicit geographical distribution limitation excluding
those countries, so that distribution is permitted only in or among
countries not thus excluded. In such case, this License incorporates
the limitation as if written in the body of this License.

9. The Free Software Foundation may publish revised and/or new versions
of the General Public License from time to time. Such new versions will
be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.

Each version is given a distinguishing version number. If the Program
specifies a version number of this License which applies to it and "any
later version", you have the option of following the terms and conditions
either of that version or of any later version published by the Free
Software Foundation. If the Program does not specify a version number of
this License, you may choose any version ever published by the Free Software
Foundation.

10. If you wish to incorporate parts of the Program into other free
programs whose distribution conditions are different, write to the author
to ask for permission. For software which is copyrighted by the Free
Software Foundation, write to the Free Software Foundation; we sometimes
make exceptions for this. Our decision will be guided by the two goals
of preserving the free status of all derivatives of our free software and
of promoting the sharing and reuse of software generally.

NO WARRANTY

11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN
OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS
TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE
PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
REPAIR OR CORRECTION.

12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
POSSIBILITY OF SUCH DAMAGES.

END OF TERMS AND CONDITIONS

How to Apply These Terms to Your New Programs

If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.

To do so, attach the following notices to the program. It is safest
to attach them to the start of each source file to most effectively
convey the exclusion of warranty; and each file should have at least
the "copyright" line and a pointer to where the full notice is found.

<one line to give the program's name and a brief idea of what it does.>
Copyright (C) <year> <name of author>

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

Also add information on how to contact you by electronic and paper mail.

If the program is interactive, make it output a short notice like this
when it starts in an interactive mode:

Gnomovision version 69, Copyright (C) year name of author
Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
This is free software, and you are welcome to redistribute it
under certain conditions; type `show c' for details.

The hypothetical commands `show w' and `show c' should show the appropriate
parts of the General Public License. Of course, the commands you use may
be called something other than `show w' and `show c'; they could even be
mouse-clicks or menu items--whatever suits your program.

You should also get your employer (if you work as a programmer) or your
school, if any, to sign a "copyright disclaimer" for the program, if
necessary. Here is a sample; alter the names:

Yoyodyne, Inc., hereby disclaims all copyright interest in the program
`Gnomovision' (which makes passes at compilers) written by James Hacker.

<signature of Ty Coon>, 1 April 1989
Ty Coon, President of Vice

This General Public License does not permit incorporating your program into
proprietary programs. If your program is a subroutine library, you may
consider it more useful to permit linking proprietary applications with the
library. If this is what you want to do, use the GNU Lesser General
Public License instead of this License.
105  ext-lib/bleualign/README.md  Normal file
@@ -0,0 +1,105 @@
Bleualign
=========
An MT-based sentence alignment tool

Copyright © 2010
Rico Sennrich <sennrich@cl.uzh.ch>

A project of the Computational Linguistics Group at the University of Zurich (http://www.cl.uzh.ch).

Project Homepage: http://github.com/rsennrich/bleualign

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation.

GENERAL INFO
------------

Bleualign is a tool to align parallel texts (i.e. a text and its translation) on a sentence level.
In addition to the source and target text, Bleualign requires an automatic translation of at least one of the texts.
The alignment is then performed on the basis of the similarity (modified BLEU score) between the source text sentences (translated into the target language) and the target text sentences.
See section PUBLICATIONS for more details.

Obtaining an automatic translation is up to the user. The only requirement is that the translation must correspond line-by-line to the source text (no line breaks inserted or removed).
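The similarity computation itself is bundled with this commit in bleualign/score.py. As a minimal illustrative sketch (not part of the upstream documentation; sentences are invented), the modified BLEU score between one translated source sentence and one candidate target sentence can be computed as:

    from bleualign import score

    # n=2 matches Bleualign's default of considering n-grams up to size 2
    ref = score.cook_refs(['the target sentence'], n=2)
    hyp = score.cook_test('the translated source sentence', ref, n=2)
    similarity = score.score_cooked([hyp], n=2)  # float between 0 and 1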
REQUIREMENTS
------------

The software was developed on Linux using Python 2.6, but should also support newer versions of Python (including 3.X) and other platforms.
Please report any issues you encounter to sennrich@cl.uzh.ch.


USAGE INSTRUCTIONS
------------------

The input and output formats of Bleualign are one sentence per line.
A line which only contains .EOA is considered a hard delimiter (end of article).
Sentence alignment does not cross these delimiters: reliable delimiters improve speed and performance; wrong ones will seriously degrade it. An example input file is shown below.
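For instance (an illustrative snippet with invented sentences), a source file containing two articles would look like this:

    First sentence of article one.
    Second sentence of article one.
    .EOA
    First sentence of article two.
    Second sentence of article two.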
Given the files sourcetext.txt, targettext.txt and sourcetranslation.txt (the latter being sentence-aligned with sourcetext.txt), a sample call is:

./bleualign.py -s sourcetext.txt -t targettext.txt --srctotarget sourcetranslation.txt -o outputfile

It is also possible to provide several translations and/or translations in the other translation direction.
Bleualign will run once per translation provided, the final output being the intersection of the individual runs (i.e. sentence pairs produced in each individual run).

./bleualign.py -s sourcetext.txt -t targettext.txt --srctotarget sourcetranslation1.txt --srctotarget sourcetranslation2.txt --targettosrc targettranslation1.txt -o outputfile

./bleualign.py -h will show more usage options.

To facilitate batch processing of multiple files, `batch_align.py` can be used:

python batch_align.py directory source_suffix target_suffix translation_suffix

Example: given the directory `raw_files` with the files `0.de`, `0.fr` and `0.trans` and so on (`0.trans` being the translation of `0.de` into the target language), this command will align all files:

python batch_align.py raw_files de fr trans

This will produce the files `0.de.aligned` and `0.fr.aligned`.

(Note that the copy of `batch_align.py` vendored in this commit reads a tab-separated job file instead of the directory/suffix interface described above; see the header comment of that file.)

Input files are expected to use UTF-8 encoding.

USAGE AS PYTHON MODULE
----------------------

Bleualign works as a stand-alone script, but can also be imported as a module by other Python projects.
For code examples, see the example/ directory. For the full list of options, see the Aligner.default_options variable in bleualign/aligner.py.

To use Bleualign as a Python module, the package needs to be installed (from a local copy) with:

python setup.py install

The Bleualign package can also be installed directly from GitHub with:

pip install git+https://github.com/rsennrich/Bleualign.git
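As a minimal sketch of module usage (mirroring how the bundled batch_align.py drives the aligner; all file names below are placeholders):

    from bleualign.align import Aligner

    options = {
        'srcfile': 'sourcetext.txt',               # source language text
        'targetfile': 'targettext.txt',            # target language text
        'srctotarget': ['sourcetranslation.txt'],  # MT output, line-aligned with srcfile
        'targettosrc': [],
        'output': 'outputfile',                    # writes outputfile-s and outputfile-t
        'verbosity': 1,
    }

    Aligner(options).mainloop()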
EVALUATION
----------

Two hand-aligned documents are provided with the repository for development and testing.
Evaluation is performed if you add the argument `-d` for the development set, or `-e` for the test set.

An example command for aligning the development set (one long document with 468/554 sentences in DE/FR):

./bleualign.py --source eval/eval1957.de --target eval/eval1957.fr --srctotarget eval/eval1957.europarlfull.fr -d

An example command for aligning the test set (7 documents, totalling 993/1011 sentences in DE/FR):

./bleualign.py --source eval/eval1989.de --target eval/eval1989.fr --srctotarget eval/eval1989.europarlfull.fr -e


PUBLICATIONS
------------

The algorithm is described in:

Rico Sennrich and Martin Volk (2010):
MT-based Sentence Alignment for OCR-generated Parallel Texts. In: Proceedings of AMTA 2010, Denver, Colorado.

Rico Sennrich and Martin Volk (2011):
Iterative, MT-based Sentence Alignment of Parallel Texts. In: NODALIDA 2011, Nordic Conference of Computational Linguistics, Riga.


CONTACT
-------

For questions and feedback, please contact sennrich@cl.uzh.ch or use the GitHub repository.
15  ext-lib/bleualign/_bleualign.py  Normal file
@@ -0,0 +1,15 @@
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Copyright © 2010 University of Zürich
# Author: Rico Sennrich <sennrich@cl.uzh.ch>
# For licensing information, see LICENSE

import sys
from command_utils import load_arguments
from bleualign.align import Aligner

if __name__ == '__main__':
    options = load_arguments(sys.argv)

    a = Aligner(options)
    a.mainloop()
51  ext-lib/bleualign/batch_align.py  Normal file
@@ -0,0 +1,51 @@
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Copyright: University of Zurich
# Author: Rico Sennrich

# Script to allow batch alignment of multiple files. No multiprocessing.
# Note: this vendored copy reads a job file instead of the upstream
# directory/suffix interface described in the README.
#
# Syntax: python batch_align.py job_file
#
# Each non-comment line of the job file lists four tab-separated paths:
# translation_file <TAB> source_file <TAB> target_file <TAB> output_file
#
# For each job, output files will be written to output_file-s and output_file-t.


import sys
import os
from bleualign.align import Aligner

if len(sys.argv) < 2:
    sys.stderr.write('Usage: python batch_align.py job_file\n')
    exit()

job_fn = sys.argv[1]

options = {}
options['factored'] = False
options['filter'] = None
options['filterthreshold'] = 90
options['filterlang'] = None
options['targettosrc'] = []
options['eval'] = None
options['galechurch'] = None
options['verbosity'] = 1
options['printempty'] = False

jobs = []
with open(job_fn, 'r', encoding="utf-8") as f:
    for line in f:
        if not line.startswith("#"):
            jobs.append(line.strip())

for rec in jobs:
    translation_document, source_document, target_document, out_document = rec.split("\t")
    options['srcfile'] = source_document
    options['targetfile'] = target_document
    options['srctotarget'] = [translation_document]
    options['output'] = out_document
    a = Aligner(options)
    a.mainloop()
110  ext-lib/bleualign/bleualign.py  Normal file
@@ -0,0 +1,110 @@
# 2021/11/27
# bfsujason@163.com

"""
Usage:

python ext-lib/bleualign/bleualign.py \
  -m data/mac/test/meta_data.tsv \
  -s data/mac/test/zh \
  -t data/mac/test/en \
  -o data/mac/test/auto
"""

import os
import sys
import time
import shutil
import argparse

def main():
    parser = argparse.ArgumentParser(description='Sentence alignment using Bleualign')
    parser.add_argument('-s', '--src', type=str, required=True, help='Source directory.')
    parser.add_argument('-t', '--tgt', type=str, required=True, help='Target directory.')
    parser.add_argument('-o', '--out', type=str, required=True, help='Output directory.')
    parser.add_argument('-m', '--meta', type=str, required=True, help='Metadata file.')
    parser.add_argument('--tok', action='store_true', help='Use tokenized source trans and target text.')
    args = parser.parse_args()

    make_dir(args.out)

    jobs = create_jobs(args.meta, args.src, args.tgt, args.out, args.tok)
    job_path = os.path.abspath(os.path.join(args.out, 'bleualign.job'))
    write_jobs(jobs, job_path)

    bleualign_bin = os.path.abspath('ext-lib/bleualign/batch_align.py')
    run_bleualign(bleualign_bin, job_path)

    convert_format(args.out)

def convert_format(dir):
    for file in os.listdir(dir):
        if file.endswith('-s'):
            file_id = file.split('.')[0]
            src = os.path.join(dir, file)
            tgt = os.path.join(dir, file_id + '.align-t')
            out = os.path.join(dir, file_id + '.align')
            _convert_format(src, tgt, out)
            os.unlink(src)
            os.unlink(tgt)

def _convert_format(src, tgt, path):
    src_align = read_alignment(src)
    tgt_align = read_alignment(tgt)
    with open(path, 'wt', encoding='utf-8') as f:
        for x, y in zip(src_align, tgt_align):
            f.write("{}:{}\n".format(x, y))

def read_alignment(file):
    alignment = []
    with open(file, 'rt', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            alignment.append([int(x) for x in line.split(',')])

    return alignment

def run_bleualign(bin, job):
    cmd = "python {} {}".format(bin, job)
    os.system(cmd)
    os.unlink(job)

def write_jobs(jobs, path):
    jobs = '\n'.join(jobs)
    with open(path, 'wt', encoding='utf-8') as f:
        f.write(jobs)

def create_jobs(meta, src, tgt, out, is_tok):
    jobs = []
    fns = get_fns(meta)
    for file in fns:
        src_path = os.path.abspath(os.path.join(src, file))
        trans_path = os.path.abspath(os.path.join(src, file + '.trans'))
        if is_tok:
            tgt_path = os.path.abspath(os.path.join(tgt, file + '.tok'))
        else:
            tgt_path = os.path.abspath(os.path.join(tgt, file))
        out_path = os.path.abspath(os.path.join(out, file + '.align'))
        jobs.append('\t'.join([trans_path, src_path, tgt_path, out_path]))

    return jobs

def get_fns(meta):
    fns = []
    with open(meta, 'rt', encoding='utf-8') as f:
        next(f)  # skip header
        for line in f:
            recs = line.strip().split('\t')
            fns.append(recs[0])

    return fns

def make_dir(path):
    if os.path.isdir(path):
        shutil.rmtree(path)
    os.makedirs(path, exist_ok=True)

if __name__ == '__main__':
    t_0 = time.time()
    main()
    print("It takes {:.3f} seconds to align all the sentences.".format(time.time() - t_0))
0  ext-lib/bleualign/bleualign/__init__.py  Normal file
1183  ext-lib/bleualign/bleualign/align.py  Normal file
File diff suppressed because it is too large
205  ext-lib/bleualign/bleualign/gale_church.py  Normal file
@@ -0,0 +1,205 @@
# -*- coding: utf-8 -*-

import math

# Based on Gale & Church 1993,
# "A Program for Aligning Sentences in Bilingual Corpora"

infinity = float("inf")

def erfcc(x):
    """Complementary error function."""
    z = abs(x)
    t = 1 / (1 + 0.5 * z)
    r = t * math.exp(-z * z -
                     1.26551223 + t *
                     (1.00002368 + t *
                      (.37409196 + t *
                       (.09678418 + t *
                        (-.18628806 + t *
                         (.27886807 + t *
                          (-1.13520398 + t *
                           (1.48851587 + t *
                            (-.82215223 + t * .17087277)))))))))
    if (x >= 0.):
        return r
    else:
        return 2. - r


def norm_cdf(x):
    """Return the area under the normal distribution from M{-∞..x}."""
    return 1 - 0.5 * erfcc(x / math.sqrt(2))


class LanguageIndependent(object):
    # These are the language-independent probabilities and parameters
    # given in Gale & Church

    # for the computation, l_1 is always the language with less characters
    PRIORS = {
        (1, 0): 0.0099,
        (0, 1): 0.0099,
        (1, 1): 0.89,
        (2, 1): 0.089,
        (1, 2): 0.089,
        (2, 2): 0.011,
    }

    AVERAGE_CHARACTERS = 1
    VARIANCE_CHARACTERS = 6.8
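
# Illustrative note (not part of the original file): PRIORS maps an
# alignment type (sentences on the source side, sentences on the target
# side) to its prior probability, so 1:1 links are by far the most likely.
# align_blocks() below combines these priors with the length-based score,
# e.g. (two sentences per side, lengths in characters):
#
#   links = align_blocks([40, 55], [42, 60])
#   # -> a set of aligned index pairs such as {(0, 0), (1, 1)}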

def trace(backlinks, source, target):
    links = set()
    pos = (len(source) - 1, len(target) - 1)

    # while pos != (-1, -1):
    while pos[0] != -1 and pos[1] != -1:
        s, t = backlinks[pos]
        for i in range(s):
            for j in range(t):
                links.add((pos[0] - i, pos[1] - j))
        pos = (pos[0] - s, pos[1] - t)

    return links


def align_probability(i, j, source_sentences, target_sentences, alignment, params):
    """Returns the probability of the two sentences C{source_sentences[i]}, C{target_sentences[j]}
    being aligned with a specific C{alignment}.

    @param i: The offset of the source sentence.
    @param j: The offset of the target sentence.
    @param source_sentences: The list of source sentence lengths.
    @param target_sentences: The list of target sentence lengths.
    @param alignment: The alignment type, a tuple of two integers.
    @param params: The sentence alignment parameters.

    @returns: The probability of a specific alignment between the two sentences, given the parameters.
    """
    l_s = sum(source_sentences[i - offset] for offset in range(alignment[0]))
    l_t = sum(target_sentences[j - offset] for offset in range(alignment[1]))
    try:
        # actually, the paper says l_s * params.VARIANCE_CHARACTERS; this is based on the C
        # reference implementation. With l_s in the denominator, insertions are impossible.
        m = (l_s + l_t / params.AVERAGE_CHARACTERS) / 2
        delta = (l_t - l_s * params.AVERAGE_CHARACTERS) / math.sqrt(m * params.VARIANCE_CHARACTERS)
    except ZeroDivisionError:
        delta = infinity

    return 2 * (1 - norm_cdf(abs(delta))) * params.PRIORS[alignment]


def align_blocks(source_sentences, target_sentences, params=LanguageIndependent):
    """Creates the sentence alignment of two blocks of texts (usually paragraphs).

    @param source_sentences: The list of source sentence lengths.
    @param target_sentences: The list of target sentence lengths.
    @param params: the sentence alignment parameters.

    @return: The sentence alignments, a list of index pairs.
    """
    alignment_types = list(params.PRIORS.keys())

    # there are always three rows in the history (with the last of them being filled)
    # and the rows are always |target_text| + 2, so that we never have to do
    # boundary checks
    D = [(len(target_sentences) + 2) * [0] for x in range(2)]

    # for the first sentence, only substitution, insertion or deletion are
    # allowed, and they are all equally likely ( == 1)

    D.append([0, 1])
    try:
        D[-2][1] = 1
        D[-2][2] = 1
    except IndexError:  # the target text is too short for these seeds
        pass

    backlinks = {}

    for i in range(len(source_sentences)):
        for j in range(len(target_sentences)):
            m = []
            for a in alignment_types:
                k = D[-(1 + a[0])][j + 2 - a[1]]
                if k > 0:
                    p = k * \
                        align_probability(i, j, source_sentences, target_sentences, a, params)
                    m.append((p, a))

            if len(m) > 0:
                v = max(m)
                backlinks[(i, j)] = v[1]
                D[-1].append(v[0])
            else:
                backlinks[(i, j)] = (1, 1)
                D[-1].append(0)

        D.pop(0)
        D.append([0, 0])

    return trace(backlinks, source_sentences, target_sentences)


def align_texts(source_blocks, target_blocks, params=LanguageIndependent):
    """Creates the sentence alignment of two texts.

    Texts can consist of several blocks. Block boundaries cannot be crossed by sentence
    alignment links.

    Each block consists of a list that contains the lengths (in characters) of the sentences
    in this block.

    @param source_blocks: The list of blocks in the source text.
    @param target_blocks: The list of blocks in the target text.
    @param params: the sentence alignment parameters.

    @returns: A list of sentence alignment lists
    """
    if len(source_blocks) != len(target_blocks):
        raise ValueError("Source and target texts do not have the same number of blocks.")

    return [align_blocks(source_block, target_block, params)
            for source_block, target_block in zip(source_blocks, target_blocks)]


def split_at(it, split_value):
    """Splits an iterator C{it} at values of C{split_value}.

    Each instance of C{split_value} is swallowed. The iterator produces
    subiterators which need to be consumed fully before the next subiterator
    can be used.
    """
    def _chunk_iterator(first):
        v = first
        while v != split_value:
            yield v
            try:
                v = next(it)
            except StopIteration:
                # under PEP 479, StopIteration must not escape a generator
                return

    while True:
        try:
            yield _chunk_iterator(next(it))
        except StopIteration:
            return


def parse_token_stream(stream, soft_delimiter, hard_delimiter):
    """Parses a stream of tokens and splits it into sentences (using C{soft_delimiter} tokens)
    and blocks (using C{hard_delimiter} tokens) for use with the L{align_texts} function.
    """
    return [
        [sum(len(token) for token in sentence_it)
         for sentence_it in split_at(block_it, soft_delimiter)]
        for block_it in split_at(stream, hard_delimiter)]


if __name__ == "__main__":
    import sys

    # contextlib.nested was removed in Python 3; a multi-context with
    # statement does the same job
    with open(sys.argv[1], "r") as s, open(sys.argv[2], "r") as t:
        source = parse_token_stream((l.strip() for l in s), ".EOS", ".EOP")
        target = parse_token_stream((l.strip() for l in t), ".EOS", ".EOP")
        print(align_texts(source, target))
146  ext-lib/bleualign/bleualign/score.py  Normal file
@@ -0,0 +1,146 @@
#!/usr/bin/python
# -*- coding: utf-8 -*-
# File originally part of moses package: http://www.statmt.org/moses/ (as bleu.py)
# Stripped of unused code to reduce number of libraries used

# $Id$

'''Provides:

cook_refs(refs, n=4): Transform a list of reference sentences as strings into a form usable by cook_test().
cook_test(test, refs, n=4): Transform a test sentence as a string (together with the cooked reference sentences) into a form usable by score_cooked().
score_cooked(alltest, n=4): Score a list of cooked test sentences.

score_set(s, testid, refids, n=4): Interface with dataset.py; calculate BLEU score of testid against refids.

The reason for breaking the BLEU computation into three phases cook_refs(), cook_test(), and score_cooked() is to allow the caller to calculate BLEU scores for multiple test sets as efficiently as possible.
'''
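
# Illustrative usage of the three-phase API described above (a sketch,
# not part of the original moses file):
#
#   refs = cook_refs(['the cat is on the mat'])
#   test = cook_test('the cat sat on the mat', refs)
#   bleu = score_cooked([test])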

from __future__ import division, print_function
import sys, math, re, xml.sax.saxutils

# Added to bypass NIST-style pre-processing of hyp and ref files -- wade
nonorm = 0

preserve_case = False
eff_ref_len = "shortest"

normalize1 = [
    ('<skipped>', ''),  # strip "skipped" tags
    (r'-\n', ''),  # strip end-of-line hyphenation and join lines
    (r'\n', ' '),  # join lines
    # (r'(\d)\s+(?=\d)', r'\1'), # join digits
]
normalize1 = [(re.compile(pattern), replace) for (pattern, replace) in normalize1]

normalize2 = [
    (r'([\{-\~\[-\` -\&\(-\+\:-\@\/])', r' \1 '),  # tokenize punctuation. apostrophe is missing
    (r'([^0-9])([\.,])', r'\1 \2 '),  # tokenize period and comma unless preceded by a digit
    (r'([\.,])([^0-9])', r' \1 \2'),  # tokenize period and comma unless followed by a digit
    (r'([0-9])(-)', r'\1 \2 ')  # tokenize dash when preceded by a digit
]
normalize2 = [(re.compile(pattern), replace) for (pattern, replace) in normalize2]

# combine normalize2 into a single regex.
normalize3 = re.compile(r'([\{-\~\[-\` -\&\(-\+\:-\@\/])|(?:(?<![0-9])([\.,]))|(?:([\.,])(?![0-9]))|(?:(?<=[0-9])(-))')

def normalize(s):
    '''Normalize and tokenize text. This is lifted from NIST mteval-v11a.pl.'''
    # Added to bypass NIST-style pre-processing of hyp and ref files -- wade
    if (nonorm):
        return s.split()
    try:
        s.split()
    except AttributeError:
        # s is a sequence of tokens rather than a string
        s = " ".join(s)
    # language-independent part:
    for (pattern, replace) in normalize1:
        s = re.sub(pattern, replace, s)
    s = xml.sax.saxutils.unescape(s, {'&quot;': '"'})
    # language-dependent part (assuming Western languages):
    s = " %s " % s
    if not preserve_case:
        s = s.lower()  # this might not be identical to the original
    return [tok for tok in normalize3.split(s) if tok and tok != ' ']

def count_ngrams(words, n=4):
    counts = {}
    for k in range(1, n + 1):
        for i in range(len(words) - k + 1):
            ngram = tuple(words[i:i + k])
            counts[ngram] = counts.get(ngram, 0) + 1
    return counts

def cook_refs(refs, n=4):
    '''Takes a list of reference sentences for a single segment
    and returns an object that encapsulates everything that BLEU
    needs to know about them.'''

    refs = [normalize(ref) for ref in refs]
    maxcounts = {}
    for ref in refs:
        counts = count_ngrams(ref, n)
        for (ngram, count) in list(counts.items()):
            maxcounts[ngram] = max(maxcounts.get(ngram, 0), count)
    return ([len(ref) for ref in refs], maxcounts)

def cook_ref_set(ref, n=4):
    '''Takes a reference sentence for a single segment
    and returns an object that encapsulates everything that BLEU
    needs to know about it. Also provides a set because bleualign wants it.'''
    ref = normalize(ref)
    counts = count_ngrams(ref, n)
    return (len(ref), counts, frozenset(counts))


def cook_test(test, args, n=4):
    '''Takes a test sentence and returns an object that
    encapsulates everything that BLEU needs to know about it.'''

    reflens, refmaxcounts = args
    test = normalize(test)
    result = {}
    result["testlen"] = len(test)

    # Calculate effective reference sentence length.

    if eff_ref_len == "shortest":
        result["reflen"] = min(reflens)
    elif eff_ref_len == "average":
        result["reflen"] = float(sum(reflens)) / len(reflens)
    elif eff_ref_len == "closest":
        min_diff = None
        for reflen in reflens:
            if min_diff is None or abs(reflen - len(test)) < min_diff:
                min_diff = abs(reflen - len(test))
                result['reflen'] = reflen

    result["guess"] = [max(len(test) - k + 1, 0) for k in range(1, n + 1)]

    result['correct'] = [0] * n
    counts = count_ngrams(test, n)
    for (ngram, count) in list(counts.items()):
        result["correct"][len(ngram) - 1] += min(refmaxcounts.get(ngram, 0), count)

    return result

def score_cooked(allcomps, n=4):
    totalcomps = {'testlen': 0, 'reflen': 0, 'guess': [0] * n, 'correct': [0] * n}
    for comps in allcomps:
        for key in ['testlen', 'reflen']:
            totalcomps[key] += comps[key]
        for key in ['guess', 'correct']:
            for k in range(n):
                totalcomps[key][k] += comps[key][k]
    logbleu = 0.0
    for k in range(n):
        if totalcomps['correct'][k] == 0:
            return 0.0
        # log.write("%d-grams: %f\n" % (k, float(totalcomps['correct'][k]) / totalcomps['guess'][k]))
        logbleu += math.log(totalcomps['correct'][k]) - math.log(totalcomps['guess'][k])
    logbleu /= float(n)
    # log.write("Effective reference length: %d test length: %d\n" % (totalcomps['reflen'], totalcomps['testlen']))
    logbleu += min(0, 1 - float(totalcomps['reflen']) / totalcomps['testlen'])
    return math.exp(logbleu)
191  ext-lib/bleualign/bleualign/utils.py  Normal file
@@ -0,0 +1,191 @@
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Copyright: University of Zurich
# Author: Rico Sennrich
# For licensing information, see LICENSE

# Evaluation functions for Bleualign


from __future__ import division
from operator import itemgetter


def evaluate(options, testalign, goldalign, log_function):
    goldalign = [(tuple(src), tuple(target)) for src, target in goldalign]

    results = {}
    paircounts = {}
    for pair in [(len(srclist), len(targetlist)) for srclist, targetlist in goldalign]:
        paircounts[pair] = paircounts.get(pair, 0) + 1
    pairs_normalized = {}
    for pair in paircounts:
        pairs_normalized[pair] = (paircounts[pair], paircounts[pair] / float(len(goldalign)))

    log_function('\ngold alignment frequencies\n')
    for aligntype, (abscount, relcount) in sorted(list(pairs_normalized.items()), key=itemgetter(1), reverse=True):
        log_function(aligntype, end='')
        log_function(' - ', end='')
        log_function(abscount, end='')
        log_function(' (' + str(relcount) + ')')

    log_function('\ntotal recall: ', end='')
    log_function(str(len(goldalign)) + ' pairs in gold')
    (tpstrict, fnstrict, tplax, fnlax) = recall((0, 0), goldalign, [i[0] for i in testalign], log_function)
    results['recall'] = (tpstrict, fnstrict, tplax, fnlax)

    for aligntype in set([i[1] for i in testalign]):
        testalign_bytype = []
        for i in testalign:
            if i[1] == aligntype:
                testalign_bytype.append(i)
        log_function('precision for alignment type ' + str(aligntype) + ' ( ' + str(len(testalign_bytype)) + ' alignment pairs)')
        precision(goldalign, testalign_bytype, log_function)

    log_function('\ntotal precision:', end='')
    log_function(str(len(testalign)) + ' alignment pairs found')
    (tpstrict, fpstrict, tplax, fplax) = precision(goldalign, testalign, log_function)
    results['precision'] = (tpstrict, fpstrict, tplax, fplax)

    return results


def precision(goldalign, testalign, log_function):
    tpstrict = 0
    tplax = 0
    fpstrict = 0
    fplax = 0
    for (src, target) in [i[0] for i in testalign]:
        if (src, target) == ((), ()):
            continue
        if (src, target) in goldalign:
            tpstrict += 1
            tplax += 1
        else:
            srcset, targetset = set(src), set(target)
            for srclist, targetlist in goldalign:
                # lax condition: hypothesis and gold alignment only need to overlap
                if srcset.intersection(set(srclist)) and targetset.intersection(set(targetlist)):
                    fpstrict += 1
                    tplax += 1
                    break
            else:
                fpstrict += 1
                fplax += 1
                log_function('false positive: ', 2)
                log_function((src, target), 2)
    if tpstrict + fpstrict > 0:
        log_function('precision strict: ', end='')
        log_function((tpstrict / float(tpstrict + fpstrict)))
        log_function('precision lax: ', end='')
        log_function((tplax / float(tplax + fplax)))
        log_function('')
    else:
        log_function('nothing to find')

    return tpstrict, fpstrict, tplax, fplax


def recall(aligntype, goldalign, testalign, log_function):

    srclen, targetlen = aligntype

    if srclen == 0 and targetlen == 0:
        gapdists = [(0, 0) for i in goldalign]
    elif srclen == 0 or targetlen == 0:
        log_function('nothing to find')
        return
    else:
        gapdists = [(len(srclist), len(targetlist)) for srclist, targetlist in goldalign]

    tpstrict = 0
    tplax = 0
    fnstrict = 0
    fnlax = 0
    for i, pair in enumerate(gapdists):
        if aligntype == pair:
            (srclist, targetlist) = goldalign[i]
            if not srclist or not targetlist:
                continue
            elif (srclist, targetlist) in testalign:
                tpstrict += 1
                tplax += 1
            else:
                srcset, targetset = set(srclist), set(targetlist)
                for src, target in testalign:
                    # lax condition: hypothesis and gold alignment only need to overlap
                    if srcset.intersection(set(src)) and targetset.intersection(set(target)):
                        tplax += 1
                        fnstrict += 1
                        break
                else:
                    fnstrict += 1
                    fnlax += 1
                    log_function('not found: ', 2)
                    log_function(goldalign[i], 2)

    if tpstrict + fnstrict > 0:
        log_function('recall strict: ', end='')
        log_function((tpstrict / float(tpstrict + fnstrict)))
        log_function('recall lax: ', end='')
        log_function((tplax / float(tplax + fnlax)))
        log_function('')
    else:
        log_function('nothing to find')

    return tpstrict, fnstrict, tplax, fnlax


def finalevaluation(results, log_function):
    recall_value = [0, 0, 0, 0]
    precision_value = [0, 0, 0, 0]
    for i, k in list(results.items()):
        for m, j in enumerate(recall_value):
            recall_value[m] = j + k['recall'][m]
        for m, j in enumerate(precision_value):
            precision_value[m] = j + k['precision'][m]

    try:
        pstrict = (precision_value[0] / float(precision_value[0] + precision_value[1]))
    except ZeroDivisionError:
        pstrict = 0
    try:
        plax = (precision_value[2] / float(precision_value[2] + precision_value[3]))
    except ZeroDivisionError:
        plax = 0
    try:
        rstrict = (recall_value[0] / float(recall_value[0] + recall_value[1]))
    except ZeroDivisionError:
        rstrict = 0
    try:
        rlax = (recall_value[2] / float(recall_value[2] + recall_value[3]))
    except ZeroDivisionError:
        rlax = 0
    if (pstrict + rstrict) == 0:
        fstrict = 0
    else:
        fstrict = 2 * (pstrict * rstrict) / (pstrict + rstrict)
    if (plax + rlax) == 0:
        flax = 0
    else:
        flax = 2 * (plax * rlax) / (plax + rlax)

    log_function('\n=========================\n')
    log_function('total results:')
    log_function('recall strict: ', end='')
    log_function(rstrict)
    log_function('recall lax: ', end='')
    log_function(rlax)
    log_function('')

    log_function('precision strict: ', end='')
    log_function(pstrict)
    log_function('precision lax: ', end='')
    log_function(plax)
    log_function('')

    log_function('f1 strict: ', end='')
    log_function(fstrict)
    log_function('f1 lax: ', end='')
    log_function(flax)
    log_function('')
158  ext-lib/bleualign/command_utils.py  Normal file
@@ -0,0 +1,158 @@
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Copyright © 2010 University of Zürich
# Author: Rico Sennrich <sennrich@cl.uzh.ch>
# For licensing information, see LICENSE


from __future__ import division, print_function
import sys
import os
import getopt

def usage():
    bold = "\033[1m"
    reset = "\033[0;0m"
    italic = "\033[3m"

    print('\n\t All files need to be one sentence per line and have .EOA as a hard delimiter. --source, --target and --output are mandatory arguments, the others are optional.')
    print('\n\t' + bold + '--help' + reset + ', ' + bold + '-h' + reset)
    print('\t\tprint usage information\n')
    print('\t' + bold + '--source' + reset + ', ' + bold + '-s' + reset + ' file')
    print('\t\tSource language text.')
    print('\t' + bold + '--target' + reset + ', ' + bold + '-t' + reset + ' file')
    print('\t\tTarget language text.')
    print('\t' + bold + '--output' + reset + ', ' + bold + '-o' + reset + ' filename')
    print('\t\tOutput file: Will create ' + 'filename' + '-s and ' + 'filename' + '-t')
    print('\n\t' + bold + '--srctotarget' + reset + ' file')
    print('\t\tTranslation of source language text to target language. Needs to be sentence-aligned with source language text.')
    print('\t' + bold + '--targettosrc' + reset + ' file')
    print('\t\tTranslation of target language text to source language. Needs to be sentence-aligned with target language text.')
    print('\n\t' + bold + '--factored' + reset)
    print('\t\tSource and target text can be factored (as defined by moses: | as separator of factors, space as word separator). Only first factor will be used for BLEU score.')
    print('\n\t' + bold + '--filter' + reset + ', ' + bold + '-f' + reset + ' option')
    print('\t\tFilters output. Possible options:')
    print('\t\t' + bold + 'sentences' + reset + '\tevaluate each sentence and filter on a per-sentence basis')
    print('\t\t' + bold + 'articles' + reset + '\tevaluate each article and filter on a per-article basis')
    print('\n\t' + bold + '--filterthreshold' + reset + ' int')
    print('\t\tFilters output to best XX percent. (Default: 90). Only works if --filter is set.')
    print('\t' + bold + '--bleuthreshold' + reset + ' float')
    print('\t\tFilters out sentence pairs with sentence-level BLEU score < XX (in range from 0 to 1). (Default: 0). Only works if --filter is set.')
    print('\t' + bold + '--filterlang' + reset)
    print('\t\tFilters out sentences/articles for which BLEU score between source and target is higher than that between translation and target (usually means source and target are in same language). Only works if --filter is set.')
    print('\n\t' + bold + '--bleu_n' + reset + ' int')
    print('\t\tConsider n-grams up to size n for BLEU. Default 2.')
    print('\t' + bold + '--bleu_charlevel' + reset)
    print('\t\tPerform BLEU on character level (recommended for continuous-script languages; also consider increasing bleu_n).')
    print('\n\t' + bold + '--galechurch' + reset)
    print('\t\tAlign the bitext using Gale and Church\'s algorithm (without BLEU comparison).')
    print('\t' + bold + '--printempty' + reset)
    print('\t\tAlso write unaligned sentences to file. By default, they are discarded.')
    print('\t' + bold + '--verbosity' + reset + ', ' + bold + '-v' + reset + ' int')
    print('\t\tVerbosity. Choose amount of debugging output. Default value 1; choose 0 for (mostly) quiet mode, 2 for verbose output')
    print('\t' + bold + '--processes' + reset + ', ' + bold + '-p' + reset + ' int')
    print('\t\tNumber of parallel processes. Documents are split across available processes. Default: 4.')


def load_arguments(sysargv):
    try:
        opts, args = getopt.getopt(sysargv[1:], "def:ho:s:t:v:p:", ["factored", "filter=", "filterthreshold=", "bleuthreshold=", "filterlang", "printempty", "deveval", "eval", "help", "bleu_n=", "bleu_charlevel", "galechurch", "output=", "source=", "target=", "srctotarget=", "targettosrc=", "verbosity=", "printempty=", "processes="])
    except getopt.GetoptError as err:
        # print help information and exit:
        print(str(err))  # will print something like "option -a not recognized"
        usage()
        sys.exit(2)
    options = {}
    options['srcfile'] = None
    options['targetfile'] = None
    options['output'] = None
    options['srctotarget'] = []
    options['targettosrc'] = []
    options['processes'] = 4
    bold = "\033[1m"
    reset = "\033[0;0m"

    project_path = os.path.dirname(os.path.abspath(__file__))
    for o, a in opts:
        if o in ("-h", "--help"):
            usage()
            sys.exit()
        elif o in ("-e", "--eval"):
            options['srcfile'] = os.path.join(project_path, 'eval', 'eval1989.de')
            options['targetfile'] = os.path.join(project_path, 'eval', 'eval1989.fr')
            from eval import goldeval
            goldalign = [None] * len(goldeval.gold1990map)
            for index, data in list(goldeval.gold1990map.items()):
                goldalign[index] = goldeval.gold[data]
            options['eval'] = goldalign
        elif o in ("-d", "--deveval"):
            options['srcfile'] = os.path.join(project_path, 'eval', 'eval1957.de')
            options['targetfile'] = os.path.join(project_path, 'eval', 'eval1957.fr')
            from eval import golddev
            goldalign = [golddev.goldalign]
            options['eval'] = goldalign
        elif o in ("-o", "--output"):
            options['output'] = a
        elif o == "--factored":
            options['factored'] = True
        elif o in ("-f", "--filter"):
            if a in ['sentences', 'articles']:
                options['filter'] = a
            else:
                print('\nERROR: Valid values for option ' + bold + '--filter' + reset + ' are ' + bold + 'sentences ' + reset + 'and ' + bold + 'articles' + reset + '.')
                usage()
                sys.exit(2)
        elif o == "--filterthreshold":
            options['filterthreshold'] = float(a)
        elif o == "--bleuthreshold":
            options['bleuthreshold'] = float(a)
        elif o == "--filterlang":
            options['filterlang'] = True
        elif o == "--galechurch":
            options['galechurch'] = True
        elif o == "--bleu_n":
            options['bleu_ngrams'] = int(a)
        elif o == "--bleu_charlevel":
            options['bleu_charlevel'] = True
        elif o in ("-s", "--source"):
            if not 'eval' in options:
                options['srcfile'] = a
        elif o in ("-t", "--target"):
            if not 'eval' in options:
                options['targetfile'] = a
        elif o == "--srctotarget":
            if a == '-':
                options['no_translation_override'] = True
            else:
                options['srctotarget'].append(a)
        elif o == "--targettosrc":
            options['targettosrc'].append(a)
        elif o == "--printempty":
            options['printempty'] = True
        elif o in ("-v", "--verbosity"):
            global loglevel
            loglevel = int(a)
            options['loglevel'] = int(a)
            options['verbosity'] = int(a)
        elif o in ("-p", "--processes"):
            options['num_processes'] = int(a)
        else:
            assert False, "unhandled option"

    if not options['output']:
        print('WARNING: Output not specified. Just printing debugging output.')
    if not options['srcfile']:
        print('\nERROR: Source file not specified.')
        usage()
        sys.exit(2)
    if not options['targetfile']:
        print('\nERROR: Target file not specified.')
        usage()
        sys.exit(2)
    if options['targettosrc'] and not options['srctotarget']:
        print('\nWARNING: Only --targettosrc specified, but expecting at least one --srctotarget. Please swap source and target side.')
||||
sys.exit(2)
|
||||
if not options['srctotarget'] and not options['targettosrc']\
|
||||
and 'no_translation_override' not in options:
|
||||
print("ERROR: no translation available: BLEU scores can be computed between the source and target text, but this is not the intended usage of Bleualign and may result in poor performance! If you're *really* sure that this is what you want, use the option '--srctotarget -'")
|
||||
sys.exit(2)
|
||||
return options
|
||||
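For orientation, a typical invocation that combines the options above might look like this (file and language names are hypothetical):

```
python bleualign.py -s source.de -t target.fr --srctotarget source_translated.fr -o aligned
```

Per the `--output` help above, this writes the aligned bitext to `aligned-s` and `aligned-t`.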
42
ext-lib/bleualign/setup.py
Normal file
@@ -0,0 +1,42 @@
# -*- coding: utf-8 -*-
import os
import setuptools

def read_file(filename):
    return open(os.path.join(os.path.dirname(__file__), filename)).read()

setuptools.setup(
    name = 'bleualign',
    version = '0.1.1',
    description = 'An MT-based sentence alignment tool',
    long_description = read_file('README.md'),
    author = 'Rico Sennrich',
    author_email = 'sennrich@cl.uzh.ch',
    url = 'https://github.com/rsennrich/Bleualign',
    download_url = 'https://github.com/rsennrich/Bleualign',
    keywords = [
        'Sentence Alignment',
        'Natural Language Processing',
        'Statistical Machine Translation',
        'BLEU',
    ],
    classifiers = [
        # which Development Status?
        # 'Development Status :: 3 - Alpha',
        'Development Status :: 4 - Beta',
        # 'Development Status :: 5 - Production/Stable',
        'License :: OSI Approved :: GNU General Public License v2 (GPLv2)',
        'Operating System :: OS Independent',
        'Programming Language :: Python :: 2.6',
        'Programming Language :: Python :: 2.7',
        'Programming Language :: Python :: 3',
        'Programming Language :: Python :: 3.2',
        'Programming Language :: Python :: 3.3',
        'Programming Language :: Python :: 3.4',
        'Topic :: Scientific/Engineering',
        'Topic :: Scientific/Engineering :: Information Analysis',
        'Topic :: Text Processing',
        'Topic :: Text Processing :: Linguistic',
    ],
    packages = ['bleualign'],
)
224381
ext-lib/hunalign/ce.dic
Normal file
File diff suppressed because it is too large
Load Diff
BIN
ext-lib/hunalign/cygwin1.dll
Normal file
Binary file not shown.
224381
ext-lib/hunalign/ec.dic
Normal file
File diff suppressed because it is too large
Load Diff
0
ext-lib/hunalign/empty.dic
Normal file
BIN
ext-lib/hunalign/hunalign
Normal file
Binary file not shown.
BIN
ext-lib/hunalign/hunalign.exe
Normal file
Binary file not shown.
136
ext-lib/hunalign/hunalign.py
Normal file
@@ -0,0 +1,136 @@
# 2021/11/27
# bfsujason@163.com

"""
Usage:

python ext-lib/hunalign/hunalign.py \
  -m data/mac/test/meta_data.tsv \
  -s data/mac/test/zh \
  -t data/mac/test/en \
  -o data/mac/test/auto \
  -d ec.dic
"""

import os
import sys
import time
import shutil
import platform
import argparse

def main():
    parser = argparse.ArgumentParser(description='Sentence alignment using Hunalign')
    parser.add_argument('-s', '--src', type=str, required=True, help='Source directory.')
    parser.add_argument('-t', '--tgt', type=str, required=True, help='Target directory.')
    parser.add_argument('-o', '--out', type=str, required=True, help='Output directory.')
    parser.add_argument('-m', '--meta', type=str, required=True, help='Metadata file.')
    parser.add_argument('-d', '--dic', type=str, help='Dictionary file.')
    args = parser.parse_args()

    make_dir(args.out)

    jobs = create_jobs(args.meta, args.src, args.tgt, args.out)
    job_path = os.path.abspath(os.path.join(args.out, 'hunalign.job'))
    write_jobs(jobs, job_path)

    if args.dic:
        hunalign_dic = os.path.abspath(os.path.join('ext-lib/hunalign', args.dic))
    else:
        hunalign_dic = os.path.abspath('ext-lib/hunalign/null.dic')

    # check system OS and pick the matching Hunalign binary
    OS = platform.system()
    if OS == 'Windows':
        hunalign_bin = os.path.abspath('ext-lib/hunalign/hunalign.exe')
    elif OS == 'Linux':
        hunalign_bin = os.path.abspath('ext-lib/hunalign/hunalign')
    else:
        sys.exit('Unsupported OS: {}'.format(OS))
    print(hunalign_bin)
    print(hunalign_dic)
    print(job_path)
    run_hunalign(hunalign_bin, hunalign_dic, job_path)
    convert_format(args.out)

def convert_format(out_dir):
    # convert each raw Hunalign ladder file into a .align file of sentence-ID pairs
    for file in sorted(os.listdir(out_dir)):
        fp_in = os.path.join(out_dir, file)
        fp_out = os.path.join(out_dir, file + '.align')
        alignment = _convert_format(fp_in, fp_out)
        write_alignment(alignment, fp_out)
        os.unlink(fp_in)

def _convert_format(fp_in, fp_out):
    # Hunalign batch output: one alignment per line, source and target segments
    # separated by a tab, sentences within a segment joined by ' ~~~ '
    src_id = -1
    tgt_id = -1
    alignment = []

    with open(fp_in, 'rt', encoding='utf-8') as f:
        for line in f:
            line = line.strip(' \r\n')
            items = line.split('\t')
            if not items[0] and not items[1]:
                continue
            src_seg_len, src_seg_id = _parse_seg(items[0], src_id)
            tgt_seg_len, tgt_seg_id = _parse_seg(items[1], tgt_id)
            src_id += src_seg_len
            tgt_id += tgt_seg_len
            alignment.append((src_seg_id, tgt_seg_id))

    return alignment

def write_alignment(alignment, fp_out):
    with open(fp_out, 'wt', encoding='utf-8') as f:
        for id in alignment:
            f.write("{}:{}\n".format(id[0], id[1]))

def _parse_seg(seg, id):
    # map a ' ~~~ '-separated segment to a list of 0-based sentence IDs
    seg_len = 0
    seg_id = []
    if seg:
        sents = seg.split(' ~~~ ')
        seg_len = len(sents)
        seg_id = [id + x for x in range(1, seg_len+1)]

    return seg_len, seg_id

def run_hunalign(bin, dic, job):
    cmd = "{} -text -batch {} {}".format(bin, dic, job)
    os.system(cmd)
    os.unlink(job)

def write_jobs(jobs, path):
    jobs = '\n'.join(jobs)
    with open(path, 'wt', encoding='utf-8', newline='\n') as f:
        f.write(jobs)

def create_jobs(meta, src, tgt, out):
    jobs = []
    fns = get_fns(meta)
    for file in fns:
        # using tokenized file
        src_path = os.path.abspath(os.path.join(src, file + '.tok'))
        tgt_path = os.path.abspath(os.path.join(tgt, file + '.tok'))
        out_path = os.path.abspath(os.path.join(out, file))

        jobs.append('\t'.join([src_path, tgt_path, out_path]))

    return jobs

def get_fns(meta):
    fns = []
    with open(meta, 'rt', encoding='utf-8') as f:
        next(f)  # skip header
        for line in f:
            recs = line.strip().split('\t')
            fns.append(recs[0])

    return fns

def make_dir(path):
    if os.path.isdir(path):
        shutil.rmtree(path)
    os.makedirs(path, exist_ok=True)

if __name__ == '__main__':
    t_0 = time.time()
    main()
    print("It takes {:.3f} seconds to align all the sentences.".format(time.time() - t_0))
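To make the ladder conversion above concrete, here is a minimal runnable sketch (sentences and score are made up) of how one Hunalign output line maps to one line of the resulting `.align` file:

```
# One hypothetical line of Hunalign '-text -batch' output: tab-separated source
# and target segments (sentences within a segment joined by ' ~~~ '); any
# trailing confidence column is ignored, as in _convert_format above.
def parse_seg(seg, last_id):
    sents = seg.split(' ~~~ ') if seg else []
    return len(sents), [last_id + i for i in range(1, len(sents) + 1)]

line = 'Sent A ~~~ Sent B\tPhrase un\t0.54'
src_seg, tgt_seg = line.split('\t')[:2]
_, src_ids = parse_seg(src_seg, -1)
_, tgt_ids = parse_seg(tgt_seg, -1)
print('{}:{}'.format(src_ids, tgt_ids))  # -> [0, 1]:[0], a 2-1 alignment
```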
0
ext-lib/hunalign/null.dic
Normal file
0
ext-lib/hunalign/translate.txt
Normal file
9
ext-lib/vecalign/.gitignore
vendored
Normal file
@@ -0,0 +1,9 @@
build/
dp_core.c*
dp_core.html
__pycache__/
.idea
*~
.pytest_cache/
venv/
202
ext-lib/vecalign/LICENSE
Normal file
@@ -0,0 +1,202 @@
                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "[]"
      replaced with your own identifying information. (Don't include
      the brackets!)  The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright [yyyy] [name of copyright owner]

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
158
ext-lib/vecalign/README.md
Normal file
@@ -0,0 +1,158 @@
# Vecalign

Vecalign is an accurate sentence alignment algorithm which is fast even for very long documents.
In conjunction with [LASER](https://github.com/facebookresearch/LASER), Vecalign
works in about 100 languages (i.e. 100^2 language pairs),
without the need for a machine translation system or lexicon.

Vecalign uses the similarity of multilingual sentence embeddings to judge the similarity of sentences.

[image based on [this Facebook AI post](https://engineering.fb.com/ai-research/laser-multilingual-sentence-embeddings/)]

Vecalign uses an approximation to Dynamic Programming based on
[Fast Dynamic Time Warping](https://content.iospress.com/articles/intelligent-data-analysis/ida00303)
which is linear in time and space with respect to the number of sentences being aligned.

### License

Copyright 2019 Brian Thompson

Vecalign is released under the [Apache License, Version 2.0](LICENSE).
For convenience, the dev and test datasets from Bleualign are provided. Bleualign is Copyright 2010 Rico Sennrich and is released under the [GNU General Public License Version 2](bleualign_data/LICENSE).

### Build Vecalign

You will need Python 3.6+ with numpy and cython. You can build an environment using conda as follows:

```
# Use latest conda
conda update conda -y
# Create conda environment
conda create --force -y --name vecalign python=3.7
# Activate new environment
source `conda info --base`/etc/profile.d/conda.sh # See: https://github.com/conda/conda/issues/7980
conda activate vecalign
# Install required packages
conda install -y -c anaconda cython
conda install -y -c anaconda numpy
```

Note that Vecalign contains Cython code, but there is no need to build it manually, as it is compiled automatically by [pyximport](https://github.com/cython/cython/tree/master/pyximport).

### Run Vecalign (using provided embeddings)
```
./vecalign.py --alignment_max_size 8 --src bleualign_data/dev.de --tgt bleualign_data/dev.fr \
   --src_embed bleualign_data/overlaps.de bleualign_data/overlaps.de.emb \
   --tgt_embed bleualign_data/overlaps.fr bleualign_data/overlaps.fr.emb
```

Alignments are written to stdout:
```
[0]:[0]:0.156006
[1]:[1]:0.160997
[2]:[2]:0.217155
[3]:[3]:0.361439
[4]:[4]:0.346332
[5]:[5]:0.211873
[6]:[6, 7, 8]:0.507506
[7]:[9]:0.252747
[8, 9]:[10, 11, 12]:0.139594
[10, 11]:[13]:0.273751
[12]:[14]:0.165397
[13]:[15, 16, 17]:0.436312
[14]:[18, 19, 20, 21]:0.734142
[]:[22]:0.000000
[]:[23]:0.000000
[]:[24]:0.000000
[]:[25]:0.000000
[15]:[26, 27, 28]:0.840094
...
```

The first two entries are the source and target sentence indexes for each alignment, respectively.
The third entry in each line is the sentence alignment cost computed by Vecalign.
Note that this cost includes normalization but does *not* include the penalty terms for containing more than one sentence.
Note that the alignment cost is set to zero for insertions/deletions.
Also note that the results may vary slightly due to randomness in the normalization.
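If you need the alignments programmatically, each stdout line can be parsed back into Python objects with `ast.literal_eval`, mirroring Vecalign's own `read_alignments`. A minimal sketch, assuming stdout was captured to a hypothetical `alignments.txt`:

```
from ast import literal_eval

with open('alignments.txt', 'rt', encoding='utf-8') as f:
    for line in f:
        # each line is "<src ids>:<tgt ids>:<cost>"
        src, tgt, cost = line.strip().rsplit(':', 2)
        src_ids = literal_eval(src)   # e.g. [8, 9]
        tgt_ids = literal_eval(tgt)   # e.g. [10, 11, 12]
        print(src_ids, tgt_ids, float(cost))
```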
To score against a gold alignment, use the "-g" flag.
Flags "-s", "-t", and "-g" can accept multiple arguments. This is primarily useful for scoring, as the output alignments will all be concatenated together in stdout. For example, to align and score the Bleualign test set:
```
./vecalign.py --alignment_max_size 8 --src bleualign_data/test*.de --tgt bleualign_data/test*.fr \
   --gold bleualign_data/test*.defr \
   --src_embed bleualign_data/overlaps.de bleualign_data/overlaps.de.emb \
   --tgt_embed bleualign_data/overlaps.fr bleualign_data/overlaps.fr.emb > /dev/null
```
This should give you results that approximately match the Vecalign paper:

```
---------------------------------
|           | Strict |   Lax   |
| Precision | 0.899  |  0.985  |
| Recall    | 0.904  |  0.987  |
| F1        | 0.902  |  0.986  |
---------------------------------
```

Note: Run `./vecalign.py -h` for full sentence alignment usage and options.
For stand-alone scoring against a gold reference, see [score.py](score.py).

### Embed your own documents

The Vecalign repository contains overlap and embedding files for the Bleualign dev/test files.
This section shows how those files were made, as an example for running on new data.

Vecalign requires not only embeddings of sentences in each document,
but also embeddings of *concatenations* of consecutive sentences.
The embeddings of multiple, consecutive sentences are needed to consider 1-many, many-1, and many-many alignments.

To create a file containing all the sentence combinations in the dev and test files from Bleualign:
```
./overlap.py -i bleualign_data/dev.fr bleualign_data/test*.fr -o bleualign_data/overlaps.fr -n 10
./overlap.py -i bleualign_data/dev.de bleualign_data/test*.de -o bleualign_data/overlaps.de -n 10
```

Note: Run `./overlap.py -h` to see the full set of embedding options.

`bleualign_data/overlaps.fr` and `bleualign_data/overlaps.de` are text files containing one or more sentences per line.
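For instance, for a hypothetical three-sentence document (`A.`, `B.`, `C.`) and `-n 2`, the overlaps file would contain, modulo padding entries and deduplication across input files, each single sentence plus each concatenation of two consecutive sentences:

```
A.
B.
C.
A. B.
B. C.
```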
These files must then be embedded using a multilingual sentence embedder.

We recommend the [Language-Agnostic SEntence Representations (LASER)](https://github.com/facebookresearch/LASER)
toolkit from Facebook, as it has strong performance and comes with a pretrained model which works well in about 100 languages.
However, Vecalign should also work with other embedding methods. Embeddings should be provided as a binary file containing float32 values.
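As a minimal sketch of that raw format (the 1024 dimension assumes LASER; other embedders may differ), the file is just a flat sequence of float32 values with no header:

```
import numpy as np

dim = 1024
embs = np.random.rand(5, dim).astype(np.float32)  # 5 hypothetical overlap lines
embs.tofile('overlaps.example.emb')               # flat float32, no header

loaded = np.fromfile('overlaps.example.emb', dtype=np.float32).reshape(-1, dim)
assert loaded.shape == (5, dim)
```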
The following assumes LASER is installed and the LASER environment variable has been set.

To embed the Bleualign files using LASER:
```
$LASER/tasks/embed/embed.sh bleualign_data/overlaps.fr fr bleualign_data/overlaps.fr.emb
$LASER/tasks/embed/embed.sh bleualign_data/overlaps.de de bleualign_data/overlaps.de.emb
```

Note that LASER will not overwrite an embedding file if it exists, so you may first need to run `rm bleualign_data/overlaps.fr.emb bleualign_data/overlaps.de.emb`.

### Publications

If you use Vecalign, please cite our [paper](https://www.aclweb.org/anthology/D19-1136.pdf):

```
@inproceedings{thompson-koehn-2019-vecalign,
    title = "{V}ecalign: Improved Sentence Alignment in Linear Time and Space",
    author = "Thompson, Brian and Koehn, Philipp",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-1136",
    doi = "10.18653/v1/D19-1136",
    pages = "1342--1348",
}
```
0
ext-lib/vecalign/__init__.py
Normal file
148
ext-lib/vecalign/_vecalign.py
Normal file
@@ -0,0 +1,148 @@
#!/usr/bin/env python3

"""
Copyright 2019 Brian Thompson

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
"""

import argparse
import logging
import pickle
from math import ceil
from random import seed as seed

import numpy as np

logger = logging.getLogger('vecalign')
logger.setLevel(logging.WARNING)
logFormatter = logging.Formatter("%(asctime)s %(levelname)-5.5s %(message)s")
consoleHandler = logging.StreamHandler()
consoleHandler.setFormatter(logFormatter)
logger.addHandler(consoleHandler)

from dp_utils import make_alignment_types, print_alignments, read_alignments, \
    read_in_embeddings, make_doc_embedding, vecalign

from score import score_multiple, log_final_scores


def _main():
    # make runs consistent
    seed(42)
    np.random.seed(42)

    parser = argparse.ArgumentParser('Sentence alignment using sentence embeddings and FastDTW',
                                     formatter_class=argparse.ArgumentDefaultsHelpFormatter)

    parser.add_argument('-s', '--src', type=str, nargs='+', required=True,
                        help='preprocessed source file to align')

    parser.add_argument('-t', '--tgt', type=str, nargs='+', required=True,
                        help='preprocessed target file to align')

    parser.add_argument('-g', '--gold_alignment', type=str, nargs='+', required=False,
                        help='gold alignment file(s) to score against')

    parser.add_argument('--src_embed', type=str, nargs=2, required=True,
                        help='Source embeddings. Requires two arguments: first is a text file, second is a binary embeddings file. ')

    parser.add_argument('--tgt_embed', type=str, nargs=2, required=True,
                        help='Target embeddings. Requires two arguments: first is a text file, second is a binary embeddings file. ')

    parser.add_argument('-a', '--alignment_max_size', type=int, default=4,
                        help='Searches for alignments up to size N-M, where N+M <= this value. Note that the embeddings must support the requested number of overlaps')

    parser.add_argument('-d', '--del_percentile_frac', type=float, default=0.2,
                        help='Deletion penalty is set to this percentile (as a fraction) of the cost matrix distribution. Should be between 0 and 1.')

    parser.add_argument('-v', '--verbose', help='sets console to logging.DEBUG instead of logging.WARN',
                        action='store_true')

    parser.add_argument('--max_size_full_dp', type=int, default=300,
                        help='Maximum size N for which it is acceptable to run full N^2 dynamic programming.')

    parser.add_argument('--costs_sample_size', type=int, default=20000,
                        help='Sample size to estimate costs distribution, used to set deletion penalty in conjunction with del_percentile_frac.')

    parser.add_argument('--num_samps_for_norm', type=int, default=100,
                        help='Number of samples used for normalizing embeddings')

    parser.add_argument('--search_buffer_size', type=int, default=5,
                        help='Width (one side) of search buffer. Larger values make the search more likely to recover from errors but increase runtime.')

    parser.add_argument('--debug_save_stack', type=str,
                        help='Write stack to pickle file for debug purposes')

    args = parser.parse_args()

    if len(args.src) != len(args.tgt):
        raise Exception('number of source files must match number of target files')

    if args.gold_alignment is not None:
        if len(args.gold_alignment) != len(args.src):
            raise Exception('number of gold alignment files, if provided, must match number of source and target files')

    if args.verbose:
        logger.setLevel(logging.INFO)

    if args.alignment_max_size < 2:
        logger.warning('alignment_max_size < 2. Increasing to 2 so that 1-1 alignments will be considered')
        args.alignment_max_size = 2

    src_sent2line, src_line_embeddings = read_in_embeddings(args.src_embed[0], args.src_embed[1])
    tgt_sent2line, tgt_line_embeddings = read_in_embeddings(args.tgt_embed[0], args.tgt_embed[1])

    width_over2 = ceil(args.alignment_max_size / 2.0) + args.search_buffer_size

    test_alignments = []
    stack_list = []
    for src_file, tgt_file in zip(args.src, args.tgt):
        logger.info('Aligning src="%s" to tgt="%s"', src_file, tgt_file)

        src_lines = open(src_file, 'rt', encoding="utf-8").readlines()
        vecs0 = make_doc_embedding(src_sent2line, src_line_embeddings, src_lines, args.alignment_max_size)

        tgt_lines = open(tgt_file, 'rt', encoding="utf-8").readlines()
        vecs1 = make_doc_embedding(tgt_sent2line, tgt_line_embeddings, tgt_lines, args.alignment_max_size)

        final_alignment_types = make_alignment_types(args.alignment_max_size)
        logger.debug('Considering alignment types %s', final_alignment_types)

        stack = vecalign(vecs0=vecs0,
                         vecs1=vecs1,
                         final_alignment_types=final_alignment_types,
                         del_percentile_frac=args.del_percentile_frac,
                         width_over2=width_over2,
                         max_size_full_dp=args.max_size_full_dp,
                         costs_sample_size=args.costs_sample_size,
                         num_samps_for_norm=args.num_samps_for_norm)

        # write final alignments to stdout
        print_alignments(stack[0]['final_alignments'], stack[0]['alignment_scores'])

        test_alignments.append(stack[0]['final_alignments'])
        stack_list.append(stack)

    if args.gold_alignment is not None:
        gold_list = [read_alignments(x) for x in args.gold_alignment]
        res = score_multiple(gold_list=gold_list, test_list=test_alignments)
        log_final_scores(res)

    if args.debug_save_stack:
        pickle.dump(stack_list, open(args.debug_save_stack, 'wb'))


if __name__ == '__main__':
    _main()
411
ext-lib/vecalign/dp_core.pyx
Normal file
@@ -0,0 +1,411 @@
# cython: language_level=3

"""
Copyright 2019 Brian Thompson

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
"""

import numpy as np

cimport numpy as np
cimport cython


def make_x_y_offsets(alignment_types):
    # alignment types for which we will precompute costs

    # deletion/insertion is added later
    for x, y in alignment_types:
        assert (x > 0)
        assert (y > 0)

    x_offsets = np.array([x for x, y in alignment_types], dtype=np.int32)  # MUST **NOT** INCLUDE (0,1), (1,0)
    y_offsets = np.array([y for x, y in alignment_types], dtype=np.int32)  # MUST **NOT** INCLUDE (0,1), (1,0)
    return x_offsets, y_offsets


def make_dense_costs(np.ndarray[float, ndim=3] vecs0,  # input
                     np.ndarray[float, ndim=3] vecs1,  # input
                     np.ndarray[float, ndim=2] norm0,  # input
                     np.ndarray[float, ndim=2] norm1,  # input
                     int offset0 = 0,  # index into vecs0/norms0
                     int offset1 = 0,  # index into vecs1/norms1
                     ):
    """
    Make a full N*M feature matrix. By default, makes 1-1 alignments;
    can build others by specifying offset0, offset1 to index into
    vecs0, norms0 and vecs1, norms1 respectively.
    """
    assert vecs0.shape[0] > offset0
    assert vecs1.shape[0] > offset1
    assert norm0.shape[0] > offset0
    assert norm1.shape[0] > offset1

    cdef int size0 = np.shape(vecs0)[1]
    assert norm0.shape[1] == size0

    cdef int size1 = np.shape(vecs1)[1]
    assert norm1.shape[1] == size1

    cdef int vecsize = np.shape(vecs0)[2]
    assert vecs1.shape[2] == vecsize

    cdef int xi, yi
    cdef float sumx

    cdef np.ndarray[float, ndim=2] costs = np.empty((size0, size1), dtype=np.float32)

    for xi in range(size0):
        for yi in range(size1):
            sumx = 0.0
            for jj in range(vecsize):
                sumx += vecs0[offset0, xi, jj] * vecs1[offset1, yi, jj]

            costs[xi, yi] = 2.0 * (1.0 - sumx) / (1e-6 + norm0[offset0, xi] + norm1[offset1, yi])
            # normalize by alignment type
            costs[xi, yi] = costs[xi, yi] * (offset0 + 1) * (offset1 + 1)

    return costs


def dense_dp(np.ndarray[float, ndim=2] alignment_cost, float pen):
    """
    Compute cost matrix (csum) and backpointers (bp)
    from full 2-D 1-1 alignment costs matrix (alignment_cost)
    """

    size0 = alignment_cost.shape[0]
    size1 = alignment_cost.shape[1]
    # csum and traceback matrix are both on nodes,
    # so they are +1 in each dimension compared to the jump costs matrix.
    # For anything being used in accumulation, use float64
    cdef np.ndarray[double, ndim=2] csum = np.empty((size0 + 1, size1 + 1), dtype=np.float64)
    cdef np.ndarray[int, ndim=2] bp = np.empty((size0 + 1, size1 + 1), dtype=np.int32)

    # bp and csum are nodes,
    # while alignment_cost is the cost of going between the nodes.
    # Size of nodes should be one larger than alignment costs.
    b0, b1 = np.shape(bp)
    c0, c1 = np.shape(csum)
    j0, j1 = np.shape(alignment_cost)
    assert (b0 == c0 == j0 + 1)
    assert (b1 == c1 == j1 + 1)

    cdef int cmax = np.shape(csum)[1]
    cdef int rmax = np.shape(csum)[0]
    cdef int c, r
    cdef double cost0, cost1, cost2

    # initialize the all c-direction deletion path
    for c in range(cmax):
        csum[0, c] = c * pen
        bp[0, c] = 1

    # initialize the all r-direction deletion path
    for r in range(rmax):
        csum[r, 0] = r * pen
        bp[r, 0] = 2

    # Initial cost is 0.0
    csum[0, 0] = 0.0  # noop
    bp[0, 0] = 4  # should not matter

    # Calculate the rest recursively
    for c in range(1, cmax):
        for r in range(1, rmax):

            # alignment_cost indexes are off by 1 wrt
            # csum/bp, since csum/bp are nodes
            cost0 = csum[r - 1, c - 1] + alignment_cost[r - 1, c - 1]
            cost1 = csum[r, c - 1] + pen
            cost2 = csum[r - 1, c] + pen

            csum[r, c] = cost0
            bp[r, c] = 0

            if cost1 < csum[r, c]:
                csum[r, c] = cost1
                bp[r, c] = 1
            if cost2 < csum[r, c]:
                csum[r, c] = cost2
                bp[r, c] = 2

    return csum, bp


def score_path(np.ndarray[int, ndim=1] xx,
               np.ndarray[int, ndim=1] yy,
               np.ndarray[float, ndim=1] norm1,
               np.ndarray[float, ndim=1] norm2,
               np.ndarray[float, ndim=2] vecs1,
               np.ndarray[float, ndim=2] vecs2,
               np.ndarray[float, ndim=1] out):
    cdef int xi, yi, ii, jj
    cdef float outx
    cdef int lenxy = xx.shape[0]
    cdef int vecsize = vecs1.shape[1]

    for ii in range(lenxy):
        xi = xx[ii]
        yi = yy[ii]
        outx = 0.0
        for jj in range(vecsize):
            outx += vecs1[xi, jj] * vecs2[yi, jj]
        out[ii] = 2.0 * (1.0 - outx) / (norm1[xi] + norm2[yi])


# Bounds checking and wraparound slow things down by about 2x
# Division by 0 checking has minimal speed impact
@cython.boundscheck(False)  # turn off bounds-checking for entire function
@cython.wraparound(False)  # turn off negative index wrapping for entire function
@cython.cdivision(True)  # use c-style division (no division-by-zero check)
def make_sparse_costs(np.ndarray[float, ndim=3] vecs0,  # input: num aligns X num sents X dim
                      np.ndarray[float, ndim=3] vecs1,  # input
                      np.ndarray[float, ndim=2] norms0,  # input: num aligns X num sents
                      np.ndarray[float, ndim=2] norms1,  # input
                      x_y_path,
                      alignment_types,
                      int width_over2):
    """
    Make features for DP, *for lines running across approximate path*, *for each alignment type*
    x_offsets, y_offsets should not include (0,1), (1,0)

    Basically, we take the feature matrix, rotate it 45 degrees,
    and compute a "wavy" matrix for the features.
    It's like the diagonal, but it moves around to hopefully always include the true path.
    """

    cdef np.ndarray[int, ndim=2] x_y_path_ = np.array(x_y_path).astype(np.int32)

    assert (vecs0.shape[0] == norms0.shape[0])
    assert (vecs1.shape[0] == norms1.shape[0])

    assert (vecs0.shape[1] == norms0.shape[1])
    assert (vecs1.shape[1] == norms1.shape[1])

    # check how many overlaps vectors were passed in
    num_overlaps_in_vecs0 = vecs0.shape[0]
    num_overlaps_in_vecs1 = vecs1.shape[0]

    # check how many overlaps were requested
    # edge case: alignment_types could be empty
    # In that case, we should just return insertions/deletions
    # and max_x_overlap == max_y_overlap == 0
    max_x_overlap = max([0] + [x for x, y in alignment_types])  # add [0] in case alignment_types is empty
    max_y_overlap = max([0] + [y for x, y in alignment_types])  # add [0] in case alignment_types is empty

    # note: alignment types are specified 1-based, but vectors are stored 0-based
    if max_x_overlap > num_overlaps_in_vecs0:
        raise Exception('%d x overlaps requested (via alignment_types), but vecs0 only has %d' % (
            max_x_overlap, num_overlaps_in_vecs0))
    if max_y_overlap > num_overlaps_in_vecs1:
        raise Exception('%d y overlaps requested (via alignment_types), but vecs1 only has %d' % (
            max_y_overlap, num_overlaps_in_vecs1))

    # number of sentences in each document
    cdef int xsize = vecs0.shape[1]
    cdef int ysize = vecs1.shape[1]

    # vector dimensions should match
    assert (vecs0.shape[2] == vecs1.shape[2])

    cdef np.ndarray[int, ndim=1] x_offsets, y_offsets
    x_offsets, y_offsets = make_x_y_offsets(alignment_types)

    # reserve outputs
    a_len = x_y_path_.shape[0]
    b_len = 2 * width_over2
    cdef np.ndarray[float, ndim=3] a_b_feats = np.empty((len(alignment_types), a_len, b_len), dtype=np.float32)
    cdef np.ndarray[int, ndim=1] b_offset = np.empty(a_len).astype(np.int32)

    cdef int x, y, aa, bb, xx, yy, a_idx, b_idx, bb2, x_offset, y_offset, ii_align, x_offset_idx, y_offset_idx
    cdef int vecsize = vecs0.shape[2]
    cdef int num_alignments = x_offsets.shape[0]

    cdef float sumx, feat
    cdef float inf = np.inf

    for ii in range(x_y_path_.shape[0]):
        x = x_y_path_[ii, 0]
        y = x_y_path_[ii, 1]

        # convert xy to ab coords
        aa = x + y
        bb = y

        a_idx = aa
        b_offset[aa] = bb - width_over2
        for b_idx, bb2 in enumerate(range(bb - width_over2, bb + width_over2)):
            # convert ab to xy coords
            xx = aa - bb2
            yy = bb2

            for ii_align in range(num_alignments):
                x_offset = x_offsets[ii_align]
                x_offset_idx = x_offset - 1  # overlaps start at 1, vectors stored 0-based
                y_offset = y_offsets[ii_align]
                y_offset_idx = y_offset - 1

                if 0 <= xx < xsize and 0 <= yy < ysize:
                    sumx = 0.0
                    for jj in range(vecsize):
                        sumx += vecs0[x_offset_idx, xx, jj] * vecs1[y_offset_idx, yy, jj]
                    feat = 2.0 * x_offset * y_offset * (1.0 - sumx) / (
                            1e-6 + norms0[x_offset_idx, xx] + norms1[y_offset_idx, yy])

                else:
                    feat = inf

                a_b_feats[ii_align, a_idx, b_idx] = feat

    return a_b_feats, b_offset


def sparse_dp(np.ndarray[float, ndim=3] a_b_costs,
              np.ndarray[int, ndim=1] b_offset_in,
              alignment_types,
              double del_penalty,
              int x_in_size,
              int y_in_size):
    """
    Do DP along a path, using features saved off along path.
    x_offsets, y_offsets should not include (0,1), (1,0)

    xsize, ysize refer to the costs a_b_csum, but in x/y space

    As in the simpler full-DP case,
    we compute cumulative costs and backpointers on nodes,
    and there are COSTS associated with moving between them.

    This means the size of the nodes is +1,+1 larger (in x,y) than the COSTS.

    So the sizes of a_b_csum, a_b_xp, a_b_yp are all one larger in x and y compared to the costs.

    In order to save memory (and time, vs a sparse matrix with hashes to look up values), let:
        a = x + y
        b = y

    b_offsets tells us how far from the left edge the features are computed for.
    Basically it's like we are computing along the diagonal,
    but we shift the diagonal around based on our belief
    about where the alignments are.

    b_offsets is used for the costs AND for csum/backpointers, so it needs to be
    +2 longer in the a-direction than the costs.
    """
    cdef np.ndarray[int, ndim=1] x_offsets, y_offsets
    x_offsets, y_offsets = make_x_y_offsets(alignment_types)

    # make x/y offsets, including (0,1), (1,0), i.e. including deletion and insertion
    x_offsets = np.concatenate([x_offsets, np.array([0, 1], dtype=np.int32)])
    y_offsets = np.concatenate([y_offsets, np.array([1, 0], dtype=np.int32)])

    cdef int a_in_size = a_b_costs.shape[1]
    cdef int b_in_size = a_b_costs.shape[2]

    cdef int a_out_size = a_in_size + 2
    cdef int b_out_size = b_in_size

    cdef int x_out_size = x_in_size + 1
    cdef int y_out_size = y_in_size + 1

    # costs are the costs of going between nodes.
    # in x,y for the nodes, we basically add a buffer
    # at x=0 and y=0, and shift the cost by (x=+1,y=+1)
    # In a,b space, this means adding two points (for the buffer)
    # at the beginning, and shifting by (a=+0,b=+1) since
    # a=x+y and b=y
    # for the first two points, we can simply replicate the
    # original b_offset, since it should be -width_over2
    # i.e. b_offset_in[0] == -width_over2
    extra_two_points = np.array([b_offset_in[0], b_offset_in[0]], dtype=np.int32)
    cdef np.ndarray[int, ndim=1] b_offset_out = np.concatenate([extra_two_points, b_offset_in + 1])

    # outputs
    # For anything being used in accumulation, use float64
    cdef np.ndarray[double, ndim=2] a_b_csum = np.zeros((a_in_size + 2, b_in_size),
                                                        dtype=np.float64) + np.inf  # error cumulative sum
    cdef np.ndarray[int, ndim=2] a_b_xp = np.zeros((a_in_size + 2, b_in_size), dtype=np.int32) - 2  # backpointer for x
    cdef np.ndarray[int, ndim=2] a_b_yp = np.zeros((a_in_size + 2, b_in_size), dtype=np.int32) - 2  # backpointer for y

    cdef int num_alignments = x_offsets.shape[0]
    cdef double inf = np.inf
    cdef int xx_out, yy_out, ii_align, x_offset, y_offset
    cdef int aa_in_cost, bb_in_cost, aa_out, bb_out, aa_out_prev, bb_out_prev, xx_in_cost, yy_in_cost, xx_out_prev, yy_out_prev

    cdef double alignment_cost, total_cost, prev_cost

    # increasing in a is the same as going along diagonals in x/y, so DP order works
    # (and any ordering is fine in b - nothing depends on values adjacent on diagonal in x/y)
    for aa_out in range(a_in_size + 2):
        for bb_out in range(b_in_size):
            # xx_out, yy_out = ab2xy_w_offset(aa_out, bb_out, b_offset_out)
            yy_out = bb_out + b_offset_out[aa_out]
            xx_out = aa_out - yy_out

            # edge case: all deletions in y-direction
            if xx_out == 0 and 0 <= yy_out < y_out_size:
                a_b_csum[aa_out, bb_out] = del_penalty * yy_out
                a_b_xp[aa_out, bb_out] = 0
                a_b_yp[aa_out, bb_out] = 1

            # edge case: all deletions in x-direction
            elif yy_out == 0 and 0 <= xx_out < x_out_size:
                a_b_csum[aa_out, bb_out] = del_penalty * xx_out
                a_b_xp[aa_out, bb_out] = 1
                a_b_yp[aa_out, bb_out] = 0

            else:
                # initialize output to inf
                a_b_csum[aa_out, bb_out] = inf
                a_b_xp[aa_out, bb_out] = -42
                a_b_yp[aa_out, bb_out] = -42

                for ii_align in range(num_alignments):
                    x_offset = x_offsets[ii_align]
                    y_offset = y_offsets[ii_align]

                    # coords of location of alignment cost, in input x/y space
                    xx_in_cost = xx_out - 1  # features were front padded,
                    yy_in_cost = yy_out - 1  # so offset is always 1

                    # the coords of location of previous cumsum cost, in input x/y space
                    xx_out_prev = xx_out - x_offset
                    yy_out_prev = yy_out - y_offset

                    if 0 <= xx_in_cost < x_in_size and 0 <= yy_in_cost < y_in_size and 0 <= xx_out_prev < x_out_size and 0 <= yy_out_prev < y_out_size:
                        # convert x,y to a,b
                        aa_in_cost = xx_in_cost + yy_in_cost
                        bb_in_cost = yy_in_cost - b_offset_in[aa_in_cost]

                        aa_out_prev = xx_out_prev + yy_out_prev
                        bb_out_prev = yy_out_prev - b_offset_out[aa_out_prev]

                        if 0 <= aa_in_cost < a_in_size and 0 <= bb_in_cost < b_in_size and 0 <= aa_out_prev < a_out_size and 0 <= bb_out_prev < b_out_size:
                            if x_offset == 0 or y_offset == 0:
                                alignment_cost = del_penalty
                            else:
                                alignment_cost = a_b_costs[ii_align, aa_in_cost, bb_in_cost]

                            prev_cost = a_b_csum[aa_out_prev, bb_out_prev]

                            total_cost = prev_cost + alignment_cost

                            if total_cost < a_b_csum[aa_out, bb_out]:
                                a_b_csum[aa_out, bb_out] = total_cost
                                a_b_xp[aa_out, bb_out] = x_offset
                                a_b_yp[aa_out, bb_out] = y_offset

    return a_b_csum, a_b_xp, a_b_yp, b_offset_out
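To make the banded coordinate change used above concrete, here is a small plain-Python sketch (illustrative only, not part of the extension module) of the mapping between (x, y) and (a, b):

```
# Illustrative sketch of the (x, y) <-> (a, b) banded coordinates in dp_core:
# a = x + y indexes anti-diagonals; b = y, stored relative to b_offset[a].
def xy_to_ab(x, y):
    return x + y, y

def ab_to_xy(a, b):
    return a - b, b

assert xy_to_ab(3, 2) == (5, 2)
assert ab_to_xy(5, 2) == (3, 2)

# With a band of width 2*width_over2 centred on the approximate path,
# only b values in [b_offset[a], b_offset[a] + 2*width_over2) are stored,
# giving linear rather than quadratic memory in the number of sentences.
```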
665
ext-lib/vecalign/dp_utils.py
Normal file
@@ -0,0 +1,665 @@
"""
Copyright 2019 Brian Thompson

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
"""

import logging
import sys
from ast import literal_eval
from collections import OrderedDict
from math import ceil
from time import time

import numpy as np

import pyximport
pyximport.install(setup_args={'include_dirs': np.get_include()}, inplace=True, reload_support=True)

from dp_core import make_dense_costs, score_path, sparse_dp, make_sparse_costs, dense_dp

logger = logging.getLogger('vecalign')  # set up in vecalign.py


def preprocess_line(line):
    line = line.strip()
    if len(line) == 0:
        line = 'BLANK_LINE'
    return line


def yield_overlaps(lines, num_overlaps):
    lines = [preprocess_line(line) for line in lines]
    for overlap in range(1, num_overlaps + 1):
        for out_line in layer(lines, overlap):
            # the truncation must happen here so that all outputs are unique
            out_line2 = out_line[:10000]  # limit line length so we don't encode arbitrarily long sentences
            yield out_line2


def read_in_embeddings(text_file, embed_file):
    """
    Given a text file with candidate sentences and a corresponding embedding file,
    make a mapping from candidate sentence to embedding index,
    and a numpy array of the embeddings.
    """
    sent2line = dict()
    with open(text_file, 'rt', encoding="utf-8") as fin:
        for ii, line in enumerate(fin):
            if line.strip() in sent2line:
                raise Exception('got multiple embeddings for the same line')
            sent2line[line.strip()] = ii

    line_embeddings = np.fromfile(embed_file, dtype=np.float32, count=-1)
    if line_embeddings.size == 0:
        raise Exception('Got empty embedding file')

    laser_embedding_size = line_embeddings.size // len(sent2line)  # LASER embeddings are 1024-dimensional
    if laser_embedding_size != 1024:
        logger.warning('expected an embedding size of 1024, got %s', laser_embedding_size)
    logger.info('laser_embedding_size determined to be %d', laser_embedding_size)
    line_embeddings.resize(line_embeddings.shape[0] // laser_embedding_size, laser_embedding_size)
    return sent2line, line_embeddings


def make_doc_embedding(sent2line, line_embeddings, lines, num_overlaps):
    """
    lines: sentences in the input document to embed
    sent2line, line_embeddings: precomputed embeddings for lines (and overlaps of lines)
    """

    lines = [preprocess_line(line) for line in lines]

    vecsize = line_embeddings.shape[1]

    vecs0 = np.empty((num_overlaps, len(lines), vecsize), dtype=np.float32)

    for ii, overlap in enumerate(range(1, num_overlaps + 1)):
        for jj, out_line in enumerate(layer(lines, overlap)):
            try:
                line_id = sent2line[out_line]
            except KeyError:
                logger.warning('Failed to find overlap=%d line "%s". Will use random vector.', overlap, out_line)
                line_id = None

            if line_id is not None:
                vec = line_embeddings[line_id]
            else:
                vec = np.random.random(vecsize) - 0.5
                vec = vec / np.linalg.norm(vec)

            vecs0[ii, jj, :] = vec

    return vecs0
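# Shape note: vecs0 has shape (num_overlaps, num_lines, embedding_dim); row ii
# holds the embeddings of (ii+1)-sentence concatenations, front-padded with
# 'PAD' entries by layer() below.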

def make_norm1(vecs0):
    """
    make vectors norm==1 so that cosine distance can be computed via dot product
    """
    for ii in range(vecs0.shape[0]):
        for jj in range(vecs0.shape[1]):
            norm = np.sqrt(np.square(vecs0[ii, jj, :]).sum())
            vecs0[ii, jj, :] = vecs0[ii, jj, :] / (norm + 1e-5)


def layer(lines, num_overlaps, comb=' '):
    """
    make front-padded overlapping sentences
    """
    if num_overlaps < 1:
        raise Exception('num_overlaps must be >= 1')
    out = ['PAD', ] * min(num_overlaps - 1, len(lines))
    for ii in range(len(lines) - num_overlaps + 1):
        out.append(comb.join(lines[ii:ii + num_overlaps]))
    return out
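# Worked example (values derived by hand from the loop above):
#   layer(['a', 'b', 'c'], 1) == ['a', 'b', 'c']
#   layer(['a', 'b', 'c'], 2) == ['PAD', 'a b', 'b c']
# so index jj always refers to the overlap *ending* at sentence jj.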

def read_alignments(fin):
    alignments = []
    with open(fin, 'rt', encoding="utf-8") as infile:
        for line in infile:
            fields = [x.strip() for x in line.split(':') if len(x.strip())]
            if len(fields) < 2:
                raise Exception('Got line "%s", which does not have at least two ":" separated fields' % line.strip())
            try:
                src = literal_eval(fields[0])
                tgt = literal_eval(fields[1])
            except:
                raise Exception('Failed to parse line "%s"' % line.strip())
            alignments.append((src, tgt))

    # Bleualign files are known to have a few missing entries,
    # but we do not fix them, in order to stay consistent with previously reported scores.
    return alignments


def print_alignments(alignments, scores=None, file=sys.stdout):
    if scores is not None:
        for (x, y), s in zip(alignments, scores):
            print('%s:%s:%.6f' % (x, y, s), file=file)
    else:
        for x, y in alignments:
            print('%s:%s' % (x, y), file=file)


class DeletionKnob(object):
    """
    A good deletion penalty depends on normalization, and probably also on
    language, domain, etc. We want a way to control the deletion penalty that
    generalizes well; sampling costs and using a percentile seems to work fairly well.
    """

    def __init__(self, samp, res_min, res_max):

        self.res_min = res_min
        self.res_max = res_max

        if self.res_min >= self.res_max:
            logger.warning('res_max <= res_min, increasing it')
            self.res_max = self.res_min + 1e-4

        num_bins = 1000
        num_pts = 30

        self.hist, self.bin_edges = np.histogram(samp, bins=num_bins,
                                                 range=[self.res_min, self.res_max],
                                                 density=True)

        dx = self.bin_edges[1] - self.bin_edges[0]
        self.cdf = np.cumsum(self.hist) * dx

        interp_points = [(0, self.res_min), ]
        for knob_val in np.linspace(0, 1, num_pts - 1)[1:-1]:
            cdf_idx = np.searchsorted(self.cdf, knob_val)
            cdf_val = self.res_min + cdf_idx / float(num_bins) * (self.res_max - self.res_min)
            interp_points.append((knob_val, cdf_val))
        interp_points.append((1, self.res_max))
        self.x, self.y = zip(*interp_points)

    def percentile_frac_to_del_penalty(self, knob_val):
        del_pen = np.interp([knob_val], self.x, self.y)[0]
        return del_pen
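# Minimal usage sketch (sample values are illustrative):
#   knob = DeletionKnob(samp=np.random.random(1000), res_min=0.0, res_max=1.0)
#   del_penalty = knob.percentile_frac_to_del_penalty(0.2)
# i.e. the deletion penalty is placed at roughly the 20th percentile of the
# sampled cost distribution (0.2 is also the --del_percentile_frac default in
# vecalign.py).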

def make_alignment_types(max_alignment_size):
    # return a list of all (n, m) alignment types with n, m >= 1 and n + m <= max_alignment_size
    alignment_types = []
    for x in range(1, max_alignment_size):
        for y in range(1, max_alignment_size):
            if x + y <= max_alignment_size:
                alignment_types.append((x, y))
    return alignment_types
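# Worked example (by hand):
#   make_alignment_types(4) == [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (3, 1)]
# Deletions/insertions (0-n and n-0) are not listed here; they are handled
# separately via del_penalty in the DP.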

def ab2xy_w_offset(aa, bb_idx, bb_offset):
    bb_from_side = bb_idx + bb_offset[aa]
    xx = aa - bb_from_side
    yy = bb_from_side
    return (xx, yy)


def xy2ab_w_offset(xx, yy, bb_offset):
    aa = xx + yy
    bb_from_side = yy
    bb = bb_from_side - bb_offset[aa]
    return aa, bb
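# These two helpers convert between (x, y) grid coordinates and the banded
# coordinates used by the sparse DP: aa is the anti-diagonal index (x + y) and
# bb is the position within the band, shifted by bb_offset[aa].
# Round-trip sanity check (by hand, with an all-zero offset vector):
#   xy2ab_w_offset(3, 2, [0] * 10) == (5, 2)
#   ab2xy_w_offset(5, 2, [0] * 10) == (3, 2)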

def process_scores(scores, alignments):
    # floating point sometimes gives negative numbers, which is a little unnerving ...
    scores = np.clip(scores, a_min=0, a_max=None)

    for ii, (x_algn, y_algn) in enumerate(alignments):
        # the deletion penalty is pretty arbitrary, so just report 0
        if len(x_algn) == 0 or len(y_algn) == 0:
            scores[ii] = 0.0
        # report scores with the alignment-size scaling divided back out
        # (still normalized with random vectors, though)
        else:
            scores[ii] = scores[ii] / len(x_algn) / len(y_algn)

    return scores


def sparse_traceback(a_b_csum, a_b_xp, a_b_yp, b_offset, xsize, ysize):
    alignments = []
    xx = xsize
    yy = ysize

    cum_costs = []

    while True:
        aa, bb = xy2ab_w_offset(xx, yy, b_offset)

        cum_costs.append(a_b_csum[aa, bb])

        xp = a_b_xp[aa, bb]
        yp = a_b_yp[aa, bb]

        if xx == yy == 0:
            break

        if xx < 0 or yy < 0:
            raise Exception('traceback bug')

        x_side = list(range(xx - xp, xx))
        y_side = list(range(yy - yp, yy))
        alignments.append((x_side, y_side))

        xx = xx - xp
        yy = yy - yp

    alignments.reverse()
    cum_costs.reverse()
    costs = np.array(cum_costs[1:]) - np.array(cum_costs[:-1])
    # "costs" are scaled by x_alignment_size * y_alignment_size,
    #    and the cost of a deletion is del_penalty
    # "scores" are 0 for deletions/insertions, and otherwise cosine distance
    #    *not* scaled by len(x_alignment) * len(y_alignment)
    scores = process_scores(scores=costs, alignments=alignments)

    return alignments, scores


def dense_traceback(x_y_tb):
    xsize, ysize = x_y_tb.shape

    xx = xsize - 1
    yy = ysize - 1

    alignments = []
    while True:
        if xx == yy == 0:
            break
        bp = x_y_tb[xx, yy]
        if bp == 0:
            xp, yp = 1, 1
            alignments.append(([xx - 1], [yy - 1]))
        elif bp == 1:
            xp, yp = 0, 1
            alignments.append(([], [yy - 1]))
        elif bp == 2:
            xp, yp = 1, 0
            alignments.append(([xx - 1], []))
        else:
            raise Exception('got unknown value')

        xx = xx - xp
        yy = yy - yp

    alignments.reverse()

    return alignments


def append_slant(path, xwidth, ywidth):
    """
    Append a quantized approximation of a straight line
    from the current (x, y) to the point (x + xwidth, y + ywidth)
    """
    NN = xwidth + ywidth
    xstart, ystart = path[-1]
    for ii in range(1, NN + 1):
        x = xstart + round(xwidth * ii / NN)
        y = ystart + round(ywidth * ii / NN)
        # In the case of ties we want them to round differently,
        #    so explicitly make sure we take a step of 1, not 0 or 2
        lastx, lasty = path[-1]
        delta = x + y - lastx - lasty
        if delta == 1:
            path.append((x, y))
        elif delta == 2:
            path.append((x - 1, y))
        elif delta == 0:
            path.append((x + 1, y))


def alignment_to_search_path(algn):
    """
    Given an alignment, make a search path.
    The search path must step exactly one position in x XOR y at each time step.

    In the case of a block of deletions, the order found by DP is not meaningful.
    To make things consistent, and to improve the probability of recovering
    from search errors, we search an approximately straight line
    through a block of deletions. We do the same through a many-many
    alignment, even though we currently don't refine many-many alignments...
    """
    path = [(0, 0), ]
    xdel, ydel = 0, 0
    for x, y in algn:
        if len(x) and len(y):
            append_slant(path, xdel, ydel)
            xdel, ydel = 0, 0
            append_slant(path, len(x), len(y))
        elif len(x):
            xdel += len(x)
        elif len(y):
            ydel += len(y)

    append_slant(path, xdel, ydel)

    return path
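# Worked example (stepped through by hand): a single 1-1 alignment becomes one
# step in x followed by one step in y:
#   alignment_to_search_path([([0], [0])]) == [(0, 0), (1, 0), (1, 1)]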

def extend_alignments(coarse_alignments, size0, size1):
    """
    extend alignments to include the new endpoints size0, size1
    if the alignments are already larger than size0/size1, raise an exception
    """
    # there could be a string of deletions or insertions at the end, so we cannot just grab the last alignment
    xmax = 0  # maximum x value in coarse_alignments
    ymax = 0  # maximum y value in coarse_alignments
    for x, y in coarse_alignments:
        for xval in x:
            xmax = max(xmax, xval)
        for yval in y:
            ymax = max(ymax, yval)

    if xmax > size0 or ymax > size1:
        raise Exception('asked to extend alignments but already bigger than requested')

    # do not duplicate xmax/ymax, do include size0/size1
    extra_x = list(range(xmax + 1, size0 + 1))
    extra_y = list(range(ymax + 1, size1 + 1))

    logger.debug('extending alignments in x by %d and y by %d', len(extra_x), len(extra_y))

    if len(extra_x) == 0:
        for yval in extra_y:
            coarse_alignments.append(([], [yval]))
    elif len(extra_y) == 0:
        for xval in extra_x:
            coarse_alignments.append(([xval], []))
    else:
        coarse_alignments.append((extra_x, extra_y))


def upsample_alignment(algn):
    def upsample_one_alignment(xx):
        return list(range(min(xx) * 2, (max(xx) + 1) * 2))

    new_algn = []
    for xx, yy in algn:
        if len(xx) == 0:
            for yyy in upsample_one_alignment(yy):
                new_algn.append(([], [yyy]))
        elif len(yy) == 0:
            for xxx in upsample_one_alignment(xx):
                new_algn.append(([xxx], []))
        else:
            new_algn.append((upsample_one_alignment(xx), upsample_one_alignment(yy)))
    return new_algn
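# Worked example (by hand): each index at the coarse level covers two indices
# at the next-finer level, so
#   upsample_alignment([([0], [1]), ([], [2])])
#       == [([0, 1], [2, 3]), ([], [4]), ([], [5])]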

def make_del_knob(e_laser,
                  f_laser,
                  e_laser_norms,
                  f_laser_norms,
                  sample_size):

    e_size = e_laser.shape[0]
    f_size = f_laser.shape[0]

    if e_size > 0 and f_size > 0 and sample_size > 0:

        if e_size * f_size < sample_size:
            # don't sample, just compute the full matrix
            sample_size = e_size * f_size
            x_idxs = np.zeros(sample_size, dtype=np.int32)
            y_idxs = np.zeros(sample_size, dtype=np.int32)
            c = 0
            for ii in range(e_size):
                for jj in range(f_size):
                    x_idxs[c] = ii
                    y_idxs[c] = jj
                    c += 1
        else:
            # get random samples
            x_idxs = np.random.choice(range(e_size), size=sample_size, replace=True).astype(np.int32)
            y_idxs = np.random.choice(range(f_size), size=sample_size, replace=True).astype(np.int32)

        # output
        random_scores = np.empty(sample_size, dtype=np.float32)

        score_path(x_idxs, y_idxs,
                   e_laser_norms, f_laser_norms,
                   e_laser, f_laser,
                   random_scores, )

        min_score = 0
        max_score = max(random_scores)  # could bump this up... but it's probably fine

    else:
        # Not much we can do here...
        random_scores = np.array([0.0, 0.5, 1.0])  # ???
        min_score = 0
        max_score = 1  # ????

    del_knob = DeletionKnob(random_scores, min_score, max_score)

    return del_knob


def compute_norms(vecs0, vecs1, num_samples, overlaps_to_use=None):
    # overlaps_to_use = 10  # 10 matches before

    overlaps1, size1, dim = vecs1.shape
    overlaps0, size0, dim0 = vecs0.shape
    assert (dim == dim0)

    if overlaps_to_use is not None:
        if overlaps_to_use > overlaps1:
            raise Exception('Cannot use more overlaps than provided. You may want to re-run overlap.py with a larger -n value')
    else:
        overlaps_to_use = overlaps1

    samps_per_overlap = ceil(num_samples / overlaps_to_use)

    if size1 and samps_per_overlap:
        # sample the other side (from all overlaps) to compare to this side
        vecs1_rand_sample = np.empty((samps_per_overlap * overlaps_to_use, dim), dtype=np.float32)
        for overlap_ii in range(overlaps_to_use):
            idxs = np.random.choice(range(size1), size=samps_per_overlap, replace=True)
            random_vecs = vecs1[overlap_ii, idxs, :]
            vecs1_rand_sample[overlap_ii * samps_per_overlap:(overlap_ii + 1) * samps_per_overlap, :] = random_vecs

        norms0 = np.empty((overlaps0, size0), dtype=np.float32)
        for overlap_ii in range(overlaps0):
            e_laser = vecs0[overlap_ii, :, :]
            sim = np.matmul(e_laser, vecs1_rand_sample.T)
            norms0[overlap_ii, :] = 1.0 - sim.mean(axis=1)

    else:  # no samples, no normalization
        norms0 = np.ones((overlaps0, size0)).astype(np.float32)

    return norms0
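# norms0[o, i] is 1 minus the mean cosine similarity of sentence i (at overlap
# o) against random vectors drawn from the other side. These values feed the
# cost computation in dp_core, which (as in the Vecalign paper) normalizes raw
# cosine distances by distances to random samples, down-weighting "hub"
# sentences that are close to everything.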

def downsample_vectors(vecs1):
    a, b, c = vecs1.shape
    half = np.empty((a, b // 2, c), dtype=np.float32)
    for ii in range(a):
        # average consecutive vectors
        for jj in range(0, b - b % 2, 2):
            v1 = vecs1[ii, jj, :]
            v2 = vecs1[ii, jj + 1, :]
            half[ii, jj // 2, :] = v1 + v2
        # compute the mean over all vectors
        mean = np.mean(half[ii, :, :], axis=0)
        for jj in range(0, b - b % 2, 2):
            # remove the mean
            half[ii, jj // 2, :] = half[ii, jj // 2, :] - mean
    # make vectors norm==1 so the dot product is cosine distance
    make_norm1(half)
    return half
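# Shape note: (overlaps, n, dim) -> (overlaps, n // 2, dim). Consecutive
# sentence vectors are summed, mean-centered, and re-normalized, halving the
# resolution at each level of the coarse-to-fine recursion.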

def vecalign(vecs0,
             vecs1,
             final_alignment_types,
             del_percentile_frac,
             width_over2,
             max_size_full_dp,
             costs_sample_size,
             num_samps_for_norm,
             norms0=None,
             norms1=None):
    if width_over2 < 3:
        logger.warning('width_over2 was set to %d, which does not make sense. Increasing to 3.', width_over2)
        width_over2 = 3

    # make sure the input embeddings are norm==1
    make_norm1(vecs0)
    make_norm1(vecs1)

    # save off runtime stats for the summary
    runtimes = OrderedDict()

    # Determine the stack depth
    s0, s1 = vecs0.shape[1], vecs1.shape[1]
    max_depth = 0
    while s0 * s1 > max_size_full_dp ** 2:
        max_depth += 1
        s0 = s0 // 2
        s1 = s1 // 2

    # init the recursion stack
    # depth is 0-based (full size is 0, 1 is half, 2 is quarter, etc.)
    stack = {0: {'v0': vecs0, 'v1': vecs1}}

    # downsample sentence vectors
    t0 = time()
    for depth in range(1, max_depth + 1):
        stack[depth] = {'v0': downsample_vectors(stack[depth - 1]['v0']),
                        'v1': downsample_vectors(stack[depth - 1]['v1'])}
    runtimes['Downsample embeddings'] = time() - t0

    # compute norms for all depths, add sizes, add alignment types
    t0 = time()
    for depth in stack:
        stack[depth]['size0'] = stack[depth]['v0'].shape[1]
        stack[depth]['size1'] = stack[depth]['v1'].shape[1]
        stack[depth]['alignment_types'] = final_alignment_types if depth == 0 else [(1, 1)]

        if depth == 0 and norms0 is not None:
            if norms0.shape != vecs0.shape[:2]:
                print('norms0.shape:', norms0.shape)
                print('vecs0.shape[:2]:', vecs0.shape[:2])
                raise Exception('norms0 wrong shape')
            stack[depth]['n0'] = norms0
        else:
            stack[depth]['n0'] = compute_norms(stack[depth]['v0'], stack[depth]['v1'], num_samps_for_norm)

        if depth == 0 and norms1 is not None:
            if norms1.shape != vecs1.shape[:2]:
                print('norms1.shape:', norms1.shape)
                print('vecs1.shape[:2]:', vecs1.shape[:2])
                raise Exception('norms1 wrong shape')
            stack[depth]['n1'] = norms1
        else:
            stack[depth]['n1'] = compute_norms(stack[depth]['v1'], stack[depth]['v0'], num_samps_for_norm)

    runtimes['Normalize embeddings'] = time() - t0

    # Compute the deletion penalty for all depths
    t0 = time()
    for depth in stack:
        stack[depth]['del_knob'] = make_del_knob(e_laser=stack[depth]['v0'][0, :, :],
                                                 f_laser=stack[depth]['v1'][0, :, :],
                                                 e_laser_norms=stack[depth]['n0'][0, :],
                                                 f_laser_norms=stack[depth]['n1'][0, :],
                                                 sample_size=costs_sample_size)
        stack[depth]['del_penalty'] = stack[depth]['del_knob'].percentile_frac_to_del_penalty(del_percentile_frac)
        logger.debug('del_penalty at depth %d: %f', depth, stack[depth]['del_penalty'])
    runtimes['Compute deletion penalties'] = time() - t0

    # full DP at the maximum recursion depth
    t0 = time()
    stack[max_depth]['costs_1to1'] = make_dense_costs(stack[max_depth]['v0'],
                                                      stack[max_depth]['v1'],
                                                      stack[max_depth]['n0'],
                                                      stack[max_depth]['n1'])
    tt = time() - t0
    logger.debug('%d x %d full DP make features: %.6fs (%.3e per dot product)',
                 stack[max_depth]['size0'], stack[max_depth]['size1'], tt,
                 tt / (stack[max_depth]['size0'] + 1e-6) / (stack[max_depth]['size1'] + 1e-6))
    runtimes['Full DP make features'] = tt

    t0 = time()
    _, stack[max_depth]['x_y_tb'] = dense_dp(stack[max_depth]['costs_1to1'], stack[max_depth]['del_penalty'])
    stack[max_depth]['alignments'] = dense_traceback(stack[max_depth]['x_y_tb'])
    runtimes['Full DP'] = time() - t0

    # upsample the path up to the top resolution
    compute_costs_times = []
    dp_times = []
    upsample_depths = [0, ] if max_depth == 0 else list(reversed(range(0, max_depth)))
    for depth in upsample_depths:
        if max_depth > 0:  # upsample the previous alignment to the current resolution
            coarse_alignments = upsample_alignment(stack[depth + 1]['alignments'])
            # features may have been truncated when downsampling, so the alignments may need to be extended
            extend_alignments(coarse_alignments, stack[depth]['size0'], stack[depth]['size1'])  # in-place
        else:  # We did a full-size 1-1 search, so search the same size with more alignment types
            coarse_alignments = stack[0]['alignments']

        # convert the coarse alignments to a search path
        stack[depth]['searchpath'] = alignment_to_search_path(coarse_alignments)

        # compute costs for the sparse DP
        t0 = time()
        stack[depth]['a_b_costs'], stack[depth]['b_offset'] = make_sparse_costs(stack[depth]['v0'], stack[depth]['v1'],
                                                                                stack[depth]['n0'], stack[depth]['n1'],
                                                                                stack[depth]['searchpath'],
                                                                                stack[depth]['alignment_types'],
                                                                                width_over2)

        tt = time() - t0
        num_dot_products = len(stack[depth]['b_offset']) * len(stack[depth]['alignment_types']) * width_over2 * 2
        logger.debug('%d x %d sparse DP (%d alignment types, %d window) make features: %.6fs (%.3e per dot product)',
                     stack[depth]['size0'], stack[depth]['size1'],
                     len(stack[depth]['alignment_types']), width_over2 * 2,
                     tt, tt / (num_dot_products + 1e-6))

        compute_costs_times.append(time() - t0)

        t0 = time()
        # perform the sparse DP
        stack[depth]['a_b_csum'], stack[depth]['a_b_xp'], stack[depth]['a_b_yp'], \
            stack[depth]['new_b_offset'] = sparse_dp(stack[depth]['a_b_costs'], stack[depth]['b_offset'],
                                                     stack[depth]['alignment_types'], stack[depth]['del_penalty'],
                                                     stack[depth]['size0'], stack[depth]['size1'])

        # perform the traceback to get alignments and alignment scores
        # for debugging, avoid overwriting stack[depth]['alignments']
        akey = 'final_alignments' if depth == 0 else 'alignments'
        stack[depth][akey], stack[depth]['alignment_scores'] = sparse_traceback(stack[depth]['a_b_csum'],
                                                                                stack[depth]['a_b_xp'],
                                                                                stack[depth]['a_b_yp'],
                                                                                stack[depth]['new_b_offset'],
                                                                                stack[depth]['size0'],
                                                                                stack[depth]['size1'])
        dp_times.append(time() - t0)

    runtimes['Upsample DP compute costs'] = sum(compute_costs_times[:-1])
    runtimes['Upsample DP'] = sum(dp_times[:-1])

    runtimes['Final DP compute costs'] = compute_costs_times[-1]
    runtimes['Final DP'] = dp_times[-1]

    # log time stats
    max_key_str_len = max([len(key) for key in runtimes])
    for key in runtimes:
        if runtimes[key] > 5e-5:
            logger.info(key + ' took ' + '.' * (max_key_str_len + 5 - len(key)) + ('%.4fs' % runtimes[key]).rjust(7))

    return stack
BIN  ext-lib/vecalign/media/dynamic_programing_approximation.gif  Normal file
Binary file not shown. (new file, 1.0 MiB)
BIN  ext-lib/vecalign/media/multilingual_sentence_embedding.png  Normal file
Binary file not shown. (new file, 56 KiB)
61  ext-lib/vecalign/overlap.py  Normal file
@@ -0,0 +1,61 @@
#!/usr/bin/env python3

"""
Copyright 2019 Brian Thompson

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
"""


import argparse

from dp_utils import yield_overlaps


def go(output_file, input_files, num_overlaps):
    output = set()
    for fin in input_files:
        lines = open(fin, 'rt', encoding="utf-8").readlines()
        for out_line in yield_overlaps(lines, num_overlaps):
            output.add(out_line)

    # sort for reproducibility
    output = list(output)
    output.sort()

    with open(output_file, 'wt', encoding="utf-8") as fout:
        for line in output:
            fout.write(line + '\n')


def _main():
    parser = argparse.ArgumentParser('Create text file containing overlapping sentences.',
                                     formatter_class=argparse.ArgumentDefaultsHelpFormatter)

    parser.add_argument('-i', '--inputs', type=str, nargs='+',
                        help='input text file(s).')

    parser.add_argument('-o', '--output', type=str,
                        help='output text file containing overlapping sentences')

    parser.add_argument('-n', '--num_overlaps', type=int, default=4,
                        help='Maximum number of allowed overlaps.')

    args = parser.parse_args()
    go(output_file=args.output,
       num_overlaps=args.num_overlaps,
       input_files=args.inputs)


if __name__ == '__main__':
    _main()
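# Example invocation (paths are illustrative; the overlap text file produced
# here, once embedded, is what vecalign.py's --src_embed/--tgt_embed expect):
#   python ext-lib/vecalign/overlap.py -i data/mac/dev/zh/*.txt -o data/mac/dev/zh/overlap -n 8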
170  ext-lib/vecalign/score.py  Normal file
@@ -0,0 +1,170 @@
#!/usr/bin/env python3

"""
Copyright 2019 Brian Thompson

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
"""

import argparse
import sys
from collections import defaultdict

import numpy as np

from dp_utils import read_alignments

"""
Faster implementation of lax and strict precision and recall, based on
https://www.aclweb.org/anthology/W11-4624/.
"""


def _precision(goldalign, testalign):
    """
    Computes tpstrict, fpstrict, tplax, fplax for gold/test alignments
    """
    tpstrict = 0  # true positive strict counter
    tplax = 0  # true positive lax counter
    fpstrict = 0  # false positive strict counter
    fplax = 0  # false positive lax counter

    # convert to sets, remove alignments empty on both sides
    testalign = set([(tuple(x), tuple(y)) for x, y in testalign if len(x) or len(y)])
    goldalign = set([(tuple(x), tuple(y)) for x, y in goldalign if len(x) or len(y)])

    # mappings from source test sentence idxs to
    #    target gold sentence idxs for which the source test sentence
    #    was found in the corresponding source gold alignment
    src_id_to_gold_tgt_ids = defaultdict(set)
    for gold_src, gold_tgt in goldalign:
        for gold_src_id in gold_src:
            for gold_tgt_id in gold_tgt:
                src_id_to_gold_tgt_ids[gold_src_id].add(gold_tgt_id)

    for (test_src, test_target) in testalign:
        if (test_src, test_target) == ((), ()):
            continue
        if (test_src, test_target) in goldalign:
            # strict match
            tpstrict += 1
            tplax += 1
        else:
            # For anything with partial gold/test overlap on the source,
            #    see if there is also partial overlap on the gold/test target.
            # If so, it's a lax match.
            target_ids = set()
            for src_test_id in test_src:
                for tgt_id in src_id_to_gold_tgt_ids[src_test_id]:
                    target_ids.add(tgt_id)
            if set(test_target).intersection(target_ids):
                fpstrict += 1
                tplax += 1
            else:
                fpstrict += 1
                fplax += 1

    return np.array([tpstrict, fpstrict, tplax, fplax], dtype=np.int32)
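# Worked example (by hand):
#   gold = [([0], [0]), ([1], [1])], test = [([0], [0]), ([1], [2])]
# The first test pair matches gold exactly (strict and lax true positive); the
# second shares source sentence 1 with gold, but gold never aligns it to
# target 2, so it is a false positive under both strict and lax:
#   _precision(gold, test) == np.array([1, 1, 1, 1])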

def score_multiple(gold_list, test_list, value_for_div_by_0=0.0):
    # accumulate counts for all gold/test files
    pcounts = np.array([0, 0, 0, 0], dtype=np.int32)
    rcounts = np.array([0, 0, 0, 0], dtype=np.int32)
    for goldalign, testalign in zip(gold_list, test_list):
        pcounts += _precision(goldalign=goldalign, testalign=testalign)
        # recall is precision with insertions/deletions removed and the args swapped
        test_no_del = [(x, y) for x, y in testalign if len(x) and len(y)]
        gold_no_del = [(x, y) for x, y in goldalign if len(x) and len(y)]
        rcounts += _precision(goldalign=test_no_del, testalign=gold_no_del)

    # Compute results
    # pcounts: tpstrict, fpstrict, tplax, fplax
    # rcounts: tpstrict, fnstrict, tplax, fnlax

    if pcounts[0] + pcounts[1] == 0:
        pstrict = value_for_div_by_0
    else:
        pstrict = pcounts[0] / float(pcounts[0] + pcounts[1])

    if pcounts[2] + pcounts[3] == 0:
        plax = value_for_div_by_0
    else:
        plax = pcounts[2] / float(pcounts[2] + pcounts[3])

    if rcounts[0] + rcounts[1] == 0:
        rstrict = value_for_div_by_0
    else:
        rstrict = rcounts[0] / float(rcounts[0] + rcounts[1])

    if rcounts[2] + rcounts[3] == 0:
        rlax = value_for_div_by_0
    else:
        rlax = rcounts[2] / float(rcounts[2] + rcounts[3])

    if (pstrict + rstrict) == 0:
        fstrict = value_for_div_by_0
    else:
        fstrict = 2 * (pstrict * rstrict) / (pstrict + rstrict)

    if (plax + rlax) == 0:
        flax = value_for_div_by_0
    else:
        flax = 2 * (plax * rlax) / (plax + rlax)

    result = dict(recall_strict=rstrict,
                  recall_lax=rlax,
                  precision_strict=pstrict,
                  precision_lax=plax,
                  f1_strict=fstrict,
                  f1_lax=flax)

    return result


def log_final_scores(res):
    print(' ---------------------------------', file=sys.stderr)
    print('| | Strict | Lax |', file=sys.stderr)
    print('| Precision | {precision_strict:.3f} | {precision_lax:.3f} |'.format(**res), file=sys.stderr)
    print('| Recall | {recall_strict:.3f} | {recall_lax:.3f} |'.format(**res), file=sys.stderr)
    print('| F1 | {f1_strict:.3f} | {f1_lax:.3f} |'.format(**res), file=sys.stderr)
    print(' ---------------------------------', file=sys.stderr)


def main():
    parser = argparse.ArgumentParser(
        'Compute strict/lax precision and recall for one or more pairs of gold/test alignments',
        formatter_class=argparse.ArgumentDefaultsHelpFormatter)

    parser.add_argument('-t', '--test', type=str, nargs='+', required=True,
                        help='one or more test alignment files')

    parser.add_argument('-g', '--gold', type=str, nargs='+', required=True,
                        help='one or more gold alignment files')

    args = parser.parse_args()

    if len(args.test) != len(args.gold):
        raise Exception('number of gold/test files must be the same')

    gold_list = [read_alignments(x) for x in args.gold]
    test_list = [read_alignments(x) for x in args.test]

    res = score_multiple(gold_list=gold_list, test_list=test_list)
    log_final_scores(res)


if __name__ == '__main__':
    main()
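# Example invocation (file names are illustrative):
#   python ext-lib/vecalign/score.py -g data/mac/dev/gold/doc1.align -t data/mac/dev/auto/doc1.align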
162  ext-lib/vecalign/vecalign.py  Normal file
@@ -0,0 +1,162 @@
#!/usr/bin/env python3

"""
Copyright 2019 Brian Thompson

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
"""

"""
Usage:

python ext-lib/vecalign/vecalign.py \
    -s data/mac/dev/zh \
    -t data/mac/dev/en \
    -o data/mac/dev/auto \
    -m data/mac/dev/meta_data.tsv \
    --src_embed data/mac/dev/zh/overlap data/mac/dev/zh/overlap.emb \
    --tgt_embed data/mac/dev/en/overlap data/mac/dev/en/overlap.emb \
    -a 8 -v
"""

import os
import time
import argparse
import shutil
import logging
import pickle
from math import ceil
from random import seed

import numpy as np

logger = logging.getLogger('vecalign')
logger.setLevel(logging.WARNING)
logFormatter = logging.Formatter("%(asctime)s %(levelname)-5.5s %(message)s")
consoleHandler = logging.StreamHandler()
consoleHandler.setFormatter(logFormatter)
logger.addHandler(consoleHandler)

from dp_utils import make_alignment_types, read_alignments, read_in_embeddings, make_doc_embedding, vecalign


def main():
    # make runs consistent
    seed(42)
    np.random.seed(42)

    parser = argparse.ArgumentParser('Sentence alignment using Vecalign')
    parser.add_argument('-s', '--src', type=str, required=True,
                        help='preprocessed source file to align')
    parser.add_argument('-t', '--tgt', type=str, required=True,
                        help='preprocessed target file to align')
    parser.add_argument('-o', '--out', type=str, required=True,
                        help='Output directory.')
    parser.add_argument('-m', '--meta', type=str, required=True,
                        help='Metadata file.')
    parser.add_argument('--src_embed', type=str, nargs=2, required=True,
                        help='Source embeddings. Requires two arguments: the first is a text file, the second is a binary embeddings file.')
    parser.add_argument('--tgt_embed', type=str, nargs=2, required=True,
                        help='Target embeddings. Requires two arguments: the first is a text file, the second is a binary embeddings file.')
    parser.add_argument('-a', '--alignment_max_size', type=int, default=5,
                        help='Searches for alignments up to size N-M, where N+M <= this value. Note that the embeddings must support the requested number of overlaps.')
    parser.add_argument('-d', '--del_percentile_frac', type=float, default=0.2,
                        help='Deletion penalty is set to this percentile (as a fraction) of the cost matrix distribution. Should be between 0 and 1.')
    parser.add_argument('-v', '--verbose', help='sets console to logging.DEBUG instead of logging.WARN',
                        action='store_true')
    parser.add_argument('--max_size_full_dp', type=int, default=300,
                        help='Maximum size N for which it is acceptable to run full N^2 dynamic programming.')
    parser.add_argument('--costs_sample_size', type=int, default=20000,
                        help='Sample size to estimate the costs distribution, used to set the deletion penalty in conjunction with del_percentile_frac.')
    parser.add_argument('--num_samps_for_norm', type=int, default=100,
                        help='Number of samples used for normalizing embeddings')
    parser.add_argument('--search_buffer_size', type=int, default=5,
                        help='Width (one side) of the search buffer. Larger values make the search more likely to recover from errors, but increase runtime.')
    args = parser.parse_args()

    if args.verbose:
        logger.setLevel(logging.INFO)

    if args.alignment_max_size < 2:
        logger.warning('alignment_max_size < 2. Increasing to 2 so that 1-1 alignments will be considered')
        args.alignment_max_size = 2

    src_sent2line, src_line_embeddings = read_in_embeddings(args.src_embed[0], args.src_embed[1])
    tgt_sent2line, tgt_line_embeddings = read_in_embeddings(args.tgt_embed[0], args.tgt_embed[1])

    width_over2 = ceil(args.alignment_max_size / 2.0) + args.search_buffer_size

    make_dir(args.out)
    jobs = create_jobs(args.meta, args.src, args.tgt, args.out)

    for rec in jobs:
        src_file, tgt_file, align_file = rec.split("\t")
        logger.info('Aligning src="%s" to tgt="%s"', src_file, tgt_file)

        src_lines = open(src_file, 'rt', encoding="utf-8").readlines()
        vecs0 = make_doc_embedding(src_sent2line, src_line_embeddings, src_lines, args.alignment_max_size)

        tgt_lines = open(tgt_file, 'rt', encoding="utf-8").readlines()
        vecs1 = make_doc_embedding(tgt_sent2line, tgt_line_embeddings, tgt_lines, args.alignment_max_size)

        final_alignment_types = make_alignment_types(args.alignment_max_size)
        logger.debug('Considering alignment types %s', final_alignment_types)

        stack = vecalign(vecs0=vecs0,
                         vecs1=vecs1,
                         final_alignment_types=final_alignment_types,
                         del_percentile_frac=args.del_percentile_frac,
                         width_over2=width_over2,
                         max_size_full_dp=args.max_size_full_dp,
                         costs_sample_size=args.costs_sample_size,
                         num_samps_for_norm=args.num_samps_for_norm)

        # write the final alignments
        print_alignments(stack[0]['final_alignments'], align_file)


def create_jobs(meta, src, tgt, out):
    jobs = []
    fns = get_fns(meta)
    for file in fns:
        src_path = os.path.abspath(os.path.join(src, file))
        tgt_path = os.path.abspath(os.path.join(tgt, file))

        out_path = os.path.abspath(os.path.join(out, file + '.align'))
        jobs.append('\t'.join([src_path, tgt_path, out_path]))

    return jobs


def get_fns(meta):
    fns = []
    with open(meta, 'rt', encoding='utf-8') as f:
        next(f)  # skip header
        for line in f:
            recs = line.strip().split('\t')
            fns.append(recs[0])

    return fns


def print_alignments(alignments, out):
    with open(out, 'wt', encoding='utf-8') as f:
        for x, y in alignments:
            f.write("{}:{}\n".format(x, y))


def make_dir(path):
    if os.path.isdir(path):
        shutil.rmtree(path)
    os.makedirs(path, exist_ok=True)


if __name__ == '__main__':
    t_0 = time.time()
    main()
    print("It took {:.1f} seconds to align all the sentences.".format(time.time() - t_0))