Baseline alignment systems

nlpfun
2021-11-28 13:59:28 +08:00
parent e033edad52
commit cc1ca021e8
34 changed files with 453434 additions and 0 deletions

ext-lib/bleualign/.gitignore

@@ -0,0 +1,5 @@
__pycache__/
*.pyc
/dist
/build
/MANIFEST

ext-lib/bleualign/LICENSE

@@ -0,0 +1,339 @@
GNU GENERAL PUBLIC LICENSE
Version 2, June 1991
Copyright (C) 1989, 1991 Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.
Preamble
The licenses for most software are designed to take away your
freedom to share and change it. By contrast, the GNU General Public
License is intended to guarantee your freedom to share and change free
software--to make sure the software is free for all its users. This
General Public License applies to most of the Free Software
Foundation's software and to any other program whose authors commit to
using it. (Some other Free Software Foundation software is covered by
the GNU Lesser General Public License instead.) You can apply it to
your programs, too.
When we speak of free software, we are referring to freedom, not
price. Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
this service if you wish), that you receive source code or can get it
if you want it, that you can change the software or use pieces of it
in new free programs; and that you know you can do these things.
To protect your rights, we need to make restrictions that forbid
anyone to deny you these rights or to ask you to surrender the rights.
These restrictions translate to certain responsibilities for you if you
distribute copies of the software, or if you modify it.
For example, if you distribute copies of such a program, whether
gratis or for a fee, you must give the recipients all the rights that
you have. You must make sure that they, too, receive or can get the
source code. And you must show them these terms so they know their
rights.
We protect your rights with two steps: (1) copyright the software, and
(2) offer you this license which gives you legal permission to copy,
distribute and/or modify the software.
Also, for each author's protection and ours, we want to make certain
that everyone understands that there is no warranty for this free
software. If the software is modified by someone else and passed on, we
want its recipients to know that what they have is not the original, so
that any problems introduced by others will not reflect on the original
authors' reputations.
Finally, any free program is threatened constantly by software
patents. We wish to avoid the danger that redistributors of a free
program will individually obtain patent licenses, in effect making the
program proprietary. To prevent this, we have made it clear that any
patent must be licensed for everyone's free use or not licensed at all.
The precise terms and conditions for copying, distribution and
modification follow.
GNU GENERAL PUBLIC LICENSE
TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
0. This License applies to any program or other work which contains
a notice placed by the copyright holder saying it may be distributed
under the terms of this General Public License. The "Program", below,
refers to any such program or work, and a "work based on the Program"
means either the Program or any derivative work under copyright law:
that is to say, a work containing the Program or a portion of it,
either verbatim or with modifications and/or translated into another
language. (Hereinafter, translation is included without limitation in
the term "modification".) Each licensee is addressed as "you".
Activities other than copying, distribution and modification are not
covered by this License; they are outside its scope. The act of
running the Program is not restricted, and the output from the Program
is covered only if its contents constitute a work based on the
Program (independent of having been made by running the Program).
Whether that is true depends on what the Program does.
1. You may copy and distribute verbatim copies of the Program's
source code as you receive it, in any medium, provided that you
conspicuously and appropriately publish on each copy an appropriate
copyright notice and disclaimer of warranty; keep intact all the
notices that refer to this License and to the absence of any warranty;
and give any other recipients of the Program a copy of this License
along with the Program.
You may charge a fee for the physical act of transferring a copy, and
you may at your option offer warranty protection in exchange for a fee.
2. You may modify your copy or copies of the Program or any portion
of it, thus forming a work based on the Program, and copy and
distribute such modifications or work under the terms of Section 1
above, provided that you also meet all of these conditions:
a) You must cause the modified files to carry prominent notices
stating that you changed the files and the date of any change.
b) You must cause any work that you distribute or publish, that in
whole or in part contains or is derived from the Program or any
part thereof, to be licensed as a whole at no charge to all third
parties under the terms of this License.
c) If the modified program normally reads commands interactively
when run, you must cause it, when started running for such
interactive use in the most ordinary way, to print or display an
announcement including an appropriate copyright notice and a
notice that there is no warranty (or else, saying that you provide
a warranty) and that users may redistribute the program under
these conditions, and telling the user how to view a copy of this
License. (Exception: if the Program itself is interactive but
does not normally print such an announcement, your work based on
the Program is not required to print an announcement.)
These requirements apply to the modified work as a whole. If
identifiable sections of that work are not derived from the Program,
and can be reasonably considered independent and separate works in
themselves, then this License, and its terms, do not apply to those
sections when you distribute them as separate works. But when you
distribute the same sections as part of a whole which is a work based
on the Program, the distribution of the whole must be on the terms of
this License, whose permissions for other licensees extend to the
entire whole, and thus to each and every part regardless of who wrote it.
Thus, it is not the intent of this section to claim rights or contest
your rights to work written entirely by you; rather, the intent is to
exercise the right to control the distribution of derivative or
collective works based on the Program.
In addition, mere aggregation of another work not based on the Program
with the Program (or with a work based on the Program) on a volume of
a storage or distribution medium does not bring the other work under
the scope of this License.
3. You may copy and distribute the Program (or a work based on it,
under Section 2) in object code or executable form under the terms of
Sections 1 and 2 above provided that you also do one of the following:
a) Accompany it with the complete corresponding machine-readable
source code, which must be distributed under the terms of Sections
1 and 2 above on a medium customarily used for software interchange; or,
b) Accompany it with a written offer, valid for at least three
years, to give any third party, for a charge no more than your
cost of physically performing source distribution, a complete
machine-readable copy of the corresponding source code, to be
distributed under the terms of Sections 1 and 2 above on a medium
customarily used for software interchange; or,
c) Accompany it with the information you received as to the offer
to distribute corresponding source code. (This alternative is
allowed only for noncommercial distribution and only if you
received the program in object code or executable form with such
an offer, in accord with Subsection b above.)
The source code for a work means the preferred form of the work for
making modifications to it. For an executable work, complete source
code means all the source code for all modules it contains, plus any
associated interface definition files, plus the scripts used to
control compilation and installation of the executable. However, as a
special exception, the source code distributed need not include
anything that is normally distributed (in either source or binary
form) with the major components (compiler, kernel, and so on) of the
operating system on which the executable runs, unless that component
itself accompanies the executable.
If distribution of executable or object code is made by offering
access to copy from a designated place, then offering equivalent
access to copy the source code from the same place counts as
distribution of the source code, even though third parties are not
compelled to copy the source along with the object code.
4. You may not copy, modify, sublicense, or distribute the Program
except as expressly provided under this License. Any attempt
otherwise to copy, modify, sublicense or distribute the Program is
void, and will automatically terminate your rights under this License.
However, parties who have received copies, or rights, from you under
this License will not have their licenses terminated so long as such
parties remain in full compliance.
5. You are not required to accept this License, since you have not
signed it. However, nothing else grants you permission to modify or
distribute the Program or its derivative works. These actions are
prohibited by law if you do not accept this License. Therefore, by
modifying or distributing the Program (or any work based on the
Program), you indicate your acceptance of this License to do so, and
all its terms and conditions for copying, distributing or modifying
the Program or works based on it.
6. Each time you redistribute the Program (or any work based on the
Program), the recipient automatically receives a license from the
original licensor to copy, distribute or modify the Program subject to
these terms and conditions. You may not impose any further
restrictions on the recipients' exercise of the rights granted herein.
You are not responsible for enforcing compliance by third parties to
this License.
7. If, as a consequence of a court judgment or allegation of patent
infringement or for any other reason (not limited to patent issues),
conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License. If you cannot
distribute so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you
may not distribute the Program at all. For example, if a patent
license would not permit royalty-free redistribution of the Program by
all those who receive copies directly or indirectly through you, then
the only way you could satisfy both it and this License would be to
refrain entirely from distribution of the Program.
If any portion of this section is held invalid or unenforceable under
any particular circumstance, the balance of the section is intended to
apply and the section as a whole is intended to apply in other
circumstances.
It is not the purpose of this section to induce you to infringe any
patents or other property right claims or to contest validity of any
such claims; this section has the sole purpose of protecting the
integrity of the free software distribution system, which is
implemented by public license practices. Many people have made
generous contributions to the wide range of software distributed
through that system in reliance on consistent application of that
system; it is up to the author/donor to decide if he or she is willing
to distribute software through any other system and a licensee cannot
impose that choice.
This section is intended to make thoroughly clear what is believed to
be a consequence of the rest of this License.
8. If the distribution and/or use of the Program is restricted in
certain countries either by patents or by copyrighted interfaces, the
original copyright holder who places the Program under this License
may add an explicit geographical distribution limitation excluding
those countries, so that distribution is permitted only in or among
countries not thus excluded. In such case, this License incorporates
the limitation as if written in the body of this License.
9. The Free Software Foundation may publish revised and/or new versions
of the General Public License from time to time. Such new versions will
be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.
Each version is given a distinguishing version number. If the Program
specifies a version number of this License which applies to it and "any
later version", you have the option of following the terms and conditions
either of that version or of any later version published by the Free
Software Foundation. If the Program does not specify a version number of
this License, you may choose any version ever published by the Free Software
Foundation.
10. If you wish to incorporate parts of the Program into other free
programs whose distribution conditions are different, write to the author
to ask for permission. For software which is copyrighted by the Free
Software Foundation, write to the Free Software Foundation; we sometimes
make exceptions for this. Our decision will be guided by the two goals
of preserving the free status of all derivatives of our free software and
of promoting the sharing and reuse of software generally.
NO WARRANTY
11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN
OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS
TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE
PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
REPAIR OR CORRECTION.
12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
POSSIBILITY OF SUCH DAMAGES.
END OF TERMS AND CONDITIONS
How to Apply These Terms to Your New Programs
If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.
To do so, attach the following notices to the program. It is safest
to attach them to the start of each source file to most effectively
convey the exclusion of warranty; and each file should have at least
the "copyright" line and a pointer to where the full notice is found.
<one line to give the program's name and a brief idea of what it does.>
Copyright (C) <year> <name of author>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
Also add information on how to contact you by electronic and paper mail.
If the program is interactive, make it output a short notice like this
when it starts in an interactive mode:
Gnomovision version 69, Copyright (C) year name of author
Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
This is free software, and you are welcome to redistribute it
under certain conditions; type `show c' for details.
The hypothetical commands `show w' and `show c' should show the appropriate
parts of the General Public License. Of course, the commands you use may
be called something other than `show w' and `show c'; they could even be
mouse-clicks or menu items--whatever suits your program.
You should also get your employer (if you work as a programmer) or your
school, if any, to sign a "copyright disclaimer" for the program, if
necessary. Here is a sample; alter the names:
Yoyodyne, Inc., hereby disclaims all copyright interest in the program
`Gnomovision' (which makes passes at compilers) written by James Hacker.
<signature of Ty Coon>, 1 April 1989
Ty Coon, President of Vice
This General Public License does not permit incorporating your program into
proprietary programs. If your program is a subroutine library, you may
consider it more useful to permit linking proprietary applications with the
library. If this is what you want to do, use the GNU Lesser General
Public License instead of this License.

ext-lib/bleualign/README.md

@@ -0,0 +1,105 @@
Bleualign
=========
An MT-based sentence alignment tool
Copyright © 2010
Rico Sennrich <sennrich@cl.uzh.ch>
A project of the Computational Linguistics Group at the University of Zurich (http://www.cl.uzh.ch).
Project Homepage: http://github.com/rsennrich/bleualign
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation
GENERAL INFO
------------
Bleualign is a tool to align parallel texts (i.e. a text and its translation) on a sentence level.
In addition to the source and target text, Bleualign requires an automatic translation of at least one of the texts.
The alignment is then performed on the basis of the similarity (modified BLEU score) between the source text sentences (translated into the target language) and the target text sentences.
See section PUBLICATIONS for more details.
Obtaining an automatic translation is up to the user. The only requirement is that the translation must correspond line-by-line to the source text (no line breaks inserted or removed).
REQUIREMENTS
------------
The software was developed on Linux using Python 2.6, but should also support newer versions of Python (including 3.X) and other platforms.
Please report any issues you encounter to sennrich@cl.uzh.ch
USAGE INSTRUCTIONS
------------------
The input and output formats of bleualign are one sentence per line.
A line which only contains .EOA is considered a hard delimiter (end of article).
Sentence alignment does not cross these delimiters: reliable delimiters improve speed and performance, wrong ones will seriously degrade performance.
Given the files sourcetext.txt, targettext.txt and sourcetranslation.txt (the latter being sentence-aligned with sourcetext.txt), a sample call is
./bleualign.py -s sourcetext.txt -t targettext.txt --srctotarget sourcetranslation.txt -o outputfile
It is also possible to provide several translations and/or translations in the other translation direction.
bleualign will run once per translation provided, the final output being the intersection of the individual runs (i.e. sentence pairs produced in each individual run).
./bleualign.py -s sourcetext.txt -t targettext.txt --srctotarget sourcetranslation1.txt --srctotarget sourcetranslation2.txt --targettosrc targettranslation1.txt -o outputfile
./bleualign.py -h will show more usage options
To facilitate batch processing of multiple files, `batch_align.py` can be used:
python batch_align.py directory source_suffix target_suffix translation_suffix
Example: given the directory `raw_files` with the files `0.de`, `0.fr` and `0.trans` and so on (`0.trans` being the translation of `0.de` into the target language), this command will align all files:
python batch_align.py raw_files de fr trans
This will produce the files `0.de.aligned` and `0.fr.aligned`.
Input files are expected to use UTF-8 encoding.
USAGE AS PYTHON MODULE
----------------------
Bleualign works as a stand-alone script, but can also be imported as a module in other Python projects.
For code examples, see the example/ directory; for all available options, see the Aligner.default_options variable in bleualign/aligner.py. A minimal sketch is also shown below.
To use Bleualign as a Python module, the package needs to be installed (from a local copy) with:
python setup.py install
The Bleualign package can also be installed directly from Github with:
pip install git+https://github.com/rsennrich/Bleualign.git
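A minimal sketch of using Bleualign as a module (the option keys and default values mirror those set by batch_align.py in this repository; the file names are placeholders):
from bleualign.align import Aligner
options = {
    'srcfile': 'sourcetext.txt',
    'targetfile': 'targettext.txt',
    'srctotarget': ['sourcetranslation.txt'],
    'targettosrc': [],
    'output': 'outputfile',
    'factored': False,
    'filter': None,
    'filterthreshold': 90,
    'filterlang': None,
    'galechurch': None,
    'eval': None,
    'verbosity': 1,
    'printempty': False,
}
Aligner(options).mainloop()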
EVALUATION
----------
Two hand-aligned documents are provided with the repository for development and testing.
Evaluation is performed if you add the argument `-d` for the development set, and `-e` for the test set.
An example command for aligning the development set (one long document with 468/554 sentences in DE/FR):
./bleualign.py --source eval/eval1957.de --target eval/eval1957.fr --srctotarget eval/eval1957.europarlfull.fr -d
An example command for aligning the test set (7 documents, totalling 993/1011 sentences in DE/FR):
./bleualign.py --source eval/eval1989.de --target eval/eval1989.fr --srctotarget eval/eval1989.europarlfull.fr -e
PUBLICATIONS
------------
The algorithm is described in
Rico Sennrich, Martin Volk (2010):
MT-based Sentence Alignment for OCR-generated Parallel Texts. In: Proceedings of AMTA 2010, Denver, Colorado.
Rico Sennrich; Martin Volk (2011):
Iterative, MT-based sentence alignment of parallel texts. In: NODALIDA 2011, Nordic Conference of Computational Linguistics, Riga.
CONTACT
-------
For questions and feedback, please contact sennrich@cl.uzh.ch or use the GitHub repository.

@@ -0,0 +1,15 @@
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Copyright © 2010 University of Zürich
# Author: Rico Sennrich <sennrich@cl.uzh.ch>
# For licensing information, see LICENSE
import sys
from command_utils import load_arguments
from bleualign.align import Aligner
if __name__ == '__main__':
options = load_arguments(sys.argv)
a = Aligner(options)
a.mainloop()

@@ -0,0 +1,51 @@
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Copyright: University of Zurich
# Author: Rico Sennrich
# Script to batch-align multiple file pairs listed in a job file. No multiprocessing.
# syntax: python batch_align.py job_file
#
# Each line of the job file that does not start with '#' holds four tab-separated paths:
#   translation<TAB>source<TAB>target<TAB>output
# where the translation is the source text translated into the target language,
# e.g. raw_files/0.trans<TAB>raw_files/0.de<TAB>raw_files/0.fr<TAB>aligned/0.align
#
# For each job, bleualign writes the aligned sentence indices to output-s and output-t
# (see the --output option in command_utils.py).
import sys
import os
from bleualign.align import Aligner
if len(sys.argv) < 2:
sys.stderr.write('Usage: python batch_align.py job_file\n')
exit()
job_fn = sys.argv[1]
options = {}
options['factored'] = False
options['filter'] = None
options['filterthreshold'] = 90
options['filterlang'] = None
options['targettosrc'] = []
options['eval'] = None
options['galechurch'] = None
options['verbosity'] = 1
options['printempty'] = False
jobs = []
with open(job_fn, 'r', encoding="utf-8") as f:
for line in f:
if not line.startswith("#"):
jobs.append(line.strip())
for rec in jobs:
translation_document, source_document, target_document, out_document = rec.split("\t")
options['srcfile'] = source_document
options['targetfile'] = target_document
options['srctotarget'] = [translation_document]
options['output'] = out_document
a = Aligner(options)
a.mainloop()

@@ -0,0 +1,110 @@
# 2021/11/27
# bfsujason@163.com
"""
Usage:
python ext-lib/bleualign/bleualign.py \
-m data/mac/test/meta_data.tsv \
-s data/mac/test/zh \
-t data/mac/test/en \
-o data/mac/test/auto
"""
import os
import sys
import time
import shutil
import argparse
def main():
parser = argparse.ArgumentParser(description='Sentence alignment using Bleualign')
parser.add_argument('-s', '--src', type=str, required=True, help='Source directory.')
parser.add_argument('-t', '--tgt', type=str, required=True, help='Target directory.')
parser.add_argument('-o', '--out', type=str, required=True, help='Output directory.')
parser.add_argument('-m', '--meta', type=str, required=True, help='Metadata file.')
parser.add_argument('--tok', action='store_true', help='Use tokenized source trans and target text.')
args = parser.parse_args()
make_dir(args.out)
jobs = create_jobs(args.meta, args.src, args.tgt, args.out, args.tok)
job_path = os.path.abspath(os.path.join(args.out, 'bleualign.job'))
write_jobs(jobs, job_path)
bleualign_bin = os.path.abspath('ext-lib/bleualign/batch_align.py')
run_bleualign(bleualign_bin, job_path)
convert_format(args.out)
def convert_format(out_dir):
    for file in os.listdir(out_dir):
        if file.endswith('-s'):
            file_id = file.split('.')[0]
            src = os.path.join(out_dir, file)
            tgt = os.path.join(out_dir, file_id + '.align-t')
            out = os.path.join(out_dir, file_id + '.align')
            _convert_format(src, tgt, out)
            os.unlink(src)
            os.unlink(tgt)
def _convert_format(src, tgt, path):
src_align = read_alignment(src)
tgt_align = read_alignment(tgt)
with open(path, 'wt', encoding='utf-8') as f:
for x, y in zip(src_align, tgt_align):
f.write("{}:{}\n".format(x,y))
def read_alignment(file):
alignment = []
with open(file, 'rt', encoding='utf-8') as f:
for line in f:
line = line.strip()
alignment.append([int(x) for x in line.split(',')])
return alignment
def run_bleualign(bin, job):
cmd = "python {} {}".format(bin, job)
os.system(cmd)
os.unlink(job)
def write_jobs(jobs, path):
jobs = '\n'.join(jobs)
with open(path, 'wt', encoding='utf-8') as f:
f.write(jobs)
def create_jobs(meta, src, tgt, out, is_tok):
jobs = []
fns = get_fns(meta)
for file in fns:
src_path = os.path.abspath(os.path.join(src, file))
trans_path = os.path.abspath(os.path.join(src, file + '.trans'))
if is_tok:
tgt_path = os.path.abspath(os.path.join(tgt, file + '.tok'))
else:
tgt_path = os.path.abspath(os.path.join(tgt, file))
out_path = os.path.abspath(os.path.join(out, file + '.align'))
jobs.append('\t'.join([trans_path, src_path, tgt_path, out_path]))
return jobs
def get_fns(meta):
fns = []
with open(meta, 'rt', encoding='utf-8') as f:
next(f) # skip header
for line in f:
recs = line.strip().split('\t')
fns.append(recs[0])
return fns
def make_dir(path):
if os.path.isdir(path):
shutil.rmtree(path)
os.makedirs(path, exist_ok=True)
if __name__ == '__main__':
t_0 = time.time()
main()
print("It takes {:.3f} seconds to align all the sentences.".format(time.time() - t_0))

File diff suppressed because it is too large.

@@ -0,0 +1,205 @@
# -*- coding: utf-8 -*-
import math
# Based on Gale & Church 1993,
# "A Program for Aligning Sentences in Bilingual Corpora"
infinity = float("inf")
def erfcc(x):
"""Complementary error function."""
z = abs(x)
t = 1 / (1 + 0.5 * z)
r = t * math.exp(-z * z -
1.26551223 + t *
(1.00002368 + t *
(.37409196 + t *
(.09678418 + t *
(-.18628806 + t *
(.27886807 + t *
(-1.13520398 + t *
(1.48851587 + t *
(-.82215223 + t * .17087277)))))))))
if (x >= 0.):
return r
else:
return 2. - r
def norm_cdf(x):
"""Return the area under the normal distribution from M{-∞..x}."""
return 1 - 0.5 * erfcc(x / math.sqrt(2))
class LanguageIndependent(object):
# These are the language-independent probabilities and parameters
# given in Gale & Church
# for the computation, l_1 is always the language with less characters
PRIORS = {
(1, 0): 0.0099,
(0, 1): 0.0099,
(1, 1): 0.89,
(2, 1): 0.089,
(1, 2): 0.089,
(2, 2): 0.011,
}
AVERAGE_CHARACTERS = 1
VARIANCE_CHARACTERS = 6.8
def trace(backlinks, source, target):
links = set()
pos = (len(source) - 1, len(target) - 1)
#while pos != (-1, -1):
while pos[0] != -1 and pos[1] != -1:
#print(pos)
#print(backlinks)
#print(backlinks[pos])
s, t = backlinks[pos]
for i in range(s):
for j in range(t):
links.add((pos[0] - i, pos[1] - j))
pos = (pos[0] - s, pos[1] - t)
return links
def align_probability(i, j, source_sentences, target_sentences, alignment, params):
"""Returns the probability of the two sentences C{source_sentences[i]}, C{target_sentences[j]}
being aligned with a specific C{alignment}.
@param i: The offset of the source sentence.
@param j: The offset of the target sentence.
@param source_sentences: The list of source sentence lengths.
@param target_sentences: The list of target sentence lengths.
@param alignment: The alignment type, a tuple of two integers.
@param params: The sentence alignment parameters.
@returns: The probability of a specific alignment between the two sentences, given the parameters.
"""
l_s = sum(source_sentences[i - offset] for offset in range(alignment[0]))
l_t = sum(target_sentences[j - offset] for offset in range(alignment[1]))
try:
# actually, the paper says l_s * params.VARIANCE_CHARACTERS, this is based on the C
# reference implementation. With l_s in the denominator, insertions are impossible.
m = (l_s + l_t / params.AVERAGE_CHARACTERS) / 2
delta = (l_t - l_s * params.AVERAGE_CHARACTERS) / math.sqrt(m * params.VARIANCE_CHARACTERS)
except ZeroDivisionError:
delta = infinity
return 2 * (1 - norm_cdf(abs(delta))) * params.PRIORS[alignment]
def align_blocks(source_sentences, target_sentences, params = LanguageIndependent):
"""Creates the sentence alignment of two blocks of texts (usually paragraphs).
@param source_sentences: The list of source sentence lengths.
@param target_sentences: The list of target sentence lengths.
@param params: the sentence alignment parameters.
@return: The sentence alignments, a list of index pairs.
"""
alignment_types = list(params.PRIORS.keys())
# there are always three rows in the history (with the last of them being filled)
# and the rows are always |target_text| + 2, so that we never have to do
# boundary checks
D = [(len(target_sentences) + 2) * [0] for x in range(2)]
# for the first sentence, only substitution, insertion or deletion are
# allowed, and they are all equally likely ( == 1)
D.append([0, 1])
try:
D[-2][1] = 1
D[-2][2] = 1
except:
pass
backlinks = {}
for i in range(len(source_sentences)):
for j in range(len(target_sentences)):
m = []
for a in alignment_types:
k = D[-(1 + a[0])][j + 2 - a[1]]
if k > 0:
p = k * \
align_probability(i, j, source_sentences, target_sentences, a, params)
m.append((p, a))
if len(m) > 0:
v = max(m)
backlinks[(i, j)] = v[1]
D[-1].append(v[0])
else:
backlinks[(i, j)] = (1, 1)
D[-1].append(0)
D.pop(0)
D.append([0, 0])
return trace(backlinks, source_sentences, target_sentences)
def align_texts(source_blocks, target_blocks, params = LanguageIndependent):
"""Creates the sentence alignment of two texts.
Texts can consist of several blocks. Block boundaries cannot be crossed by sentence
alignment links.
Each block consists of a list that contains the lengths (in characters) of the sentences
in this block.
@param source_blocks: The list of blocks in the source text.
@param target_blocks: The list of blocks in the target text.
@param params: the sentence alignment parameters.
@returns: A list of sentence alignment lists
"""
if len(source_blocks) != len(target_blocks):
raise ValueError("Source and target texts do not have the same number of blocks.")
return [align_blocks(source_block, target_block, params)
for source_block, target_block in zip(source_blocks, target_blocks)]
def split_at(it, split_value):
"""Splits an iterator C{it} at values of C{split_value}.
Each instance of C{split_value} is swallowed. The iterator produces
subiterators which need to be consumed fully before the next subiterator
can be used.
"""
def _chunk_iterator(first):
v = first
while v != split_value:
yield v
v = next(it)
while True:
yield _chunk_iterator(next(it))
def parse_token_stream(stream, soft_delimiter, hard_delimiter):
"""Parses a stream of tokens and splits it into sentences (using C{soft_delimiter} tokens)
and blocks (using C{hard_delimiter} tokens) for use with the L{align_texts} function.
"""
return [
[sum(len(token) for token in sentence_it)
for sentence_it in split_at(block_it, soft_delimiter)]
for block_it in split_at(stream, hard_delimiter)]
if __name__ == "__main__":
import sys
    with open(sys.argv[1], "r") as s, open(sys.argv[2], "r") as t:
source = parse_token_stream((l.strip() for l in s), ".EOS", ".EOP")
target = parse_token_stream((l.strip() for l in t), ".EOS", ".EOP")
print((align_texts(source, target)))
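# Minimal usage sketch (illustrative; the character counts are made up): each block
# is a list of sentence lengths in characters, and each returned element is a set
# of (source_index, target_index) links for the corresponding block.
#   links = align_texts([[12, 7, 20]], [[11, 25]])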

@@ -0,0 +1,146 @@
#!/usr/bin/python
# -*- coding: utf-8 -*-
#File originally part of moses package: http://www.statmt.org/moses/ (as bleu.py)
#Stripped of unused code to reduce number of libraries used
# $Id$
'''Provides:
cook_refs(refs, n=4): Transform a list of reference sentences as strings into a form usable by cook_test().
cook_test(test, refs, n=4): Transform a test sentence as a string (together with the cooked reference sentences) into a form usable by score_cooked().
score_cooked(alltest, n=4): Score a list of cooked test sentences.
score_set(s, testid, refids, n=4): Interface with dataset.py; calculate BLEU score of testid against refids.
The reason for breaking the BLEU computation into three phases cook_refs(), cook_test(), and score_cooked() is to allow the caller to calculate BLEU scores for multiple test sets as efficiently as possible.
'''
from __future__ import division, print_function
import sys, math, re, xml.sax.saxutils
# Added to bypass NIST-style pre-processing of hyp and ref files -- wade
nonorm = 0
preserve_case = False
eff_ref_len = "shortest"
normalize1 = [
('<skipped>', ''), # strip "skipped" tags
(r'-\n', ''), # strip end-of-line hyphenation and join lines
(r'\n', ' '), # join lines
# (r'(\d)\s+(?=\d)', r'\1'), # join digits
]
normalize1 = [(re.compile(pattern), replace) for (pattern, replace) in normalize1]
normalize2 = [
(r'([\{-\~\[-\` -\&\(-\+\:-\@\/])',r' \1 '), # tokenize punctuation. apostrophe is missing
(r'([^0-9])([\.,])',r'\1 \2 '), # tokenize period and comma unless preceded by a digit
(r'([\.,])([^0-9])',r' \1 \2'), # tokenize period and comma unless followed by a digit
(r'([0-9])(-)',r'\1 \2 ') # tokenize dash when preceded by a digit
]
normalize2 = [(re.compile(pattern), replace) for (pattern, replace) in normalize2]
#combine normalize2 into a single regex.
normalize3 = re.compile(r'([\{-\~\[-\` -\&\(-\+\:-\@\/])|(?:(?<![0-9])([\.,]))|(?:([\.,])(?![0-9]))|(?:(?<=[0-9])(-))')
def normalize(s):
'''Normalize and tokenize text. This is lifted from NIST mteval-v11a.pl.'''
# Added to bypass NIST-style pre-processing of hyp and ref files -- wade
if (nonorm):
return s.split()
try:
s.split()
except:
s = " ".join(s)
# language-independent part:
for (pattern, replace) in normalize1:
s = re.sub(pattern, replace, s)
s = xml.sax.saxutils.unescape(s, {'&quot;':'"'})
# language-dependent part (assuming Western languages):
s = " %s " % s
if not preserve_case:
s = s.lower() # this might not be identical to the original
return [tok for tok in normalize3.split(s) if tok and tok != ' ']
def count_ngrams(words, n=4):
counts = {}
for k in range(1,n+1):
for i in range(len(words)-k+1):
ngram = tuple(words[i:i+k])
counts[ngram] = counts.get(ngram, 0)+1
return counts
def cook_refs(refs, n=4):
'''Takes a list of reference sentences for a single segment
and returns an object that encapsulates everything that BLEU
needs to know about them.'''
refs = [normalize(ref) for ref in refs]
maxcounts = {}
for ref in refs:
counts = count_ngrams(ref, n)
for (ngram,count) in list(counts.items()):
maxcounts[ngram] = max(maxcounts.get(ngram,0), count)
return ([len(ref) for ref in refs], maxcounts)
def cook_ref_set(ref, n=4):
    '''Takes a single reference sentence for a segment
    and returns an object that encapsulates everything that BLEU
    needs to know about it. Also provides a frozenset of its n-grams, which Bleualign uses.'''
ref = normalize(ref)
counts = count_ngrams(ref, n)
return (len(ref), counts, frozenset(counts))
def cook_test(test, args, n=4):
'''Takes a test sentence and returns an object that
encapsulates everything that BLEU needs to know about it.'''
reflens, refmaxcounts = args
test = normalize(test)
result = {}
result["testlen"] = len(test)
# Calculate effective reference sentence length.
if eff_ref_len == "shortest":
result["reflen"] = min(reflens)
elif eff_ref_len == "average":
result["reflen"] = float(sum(reflens))/len(reflens)
elif eff_ref_len == "closest":
min_diff = None
for reflen in reflens:
if min_diff is None or abs(reflen-len(test)) < min_diff:
min_diff = abs(reflen-len(test))
result['reflen'] = reflen
result["guess"] = [max(len(test)-k+1,0) for k in range(1,n+1)]
result['correct'] = [0]*n
counts = count_ngrams(test, n)
for (ngram, count) in list(counts.items()):
result["correct"][len(ngram)-1] += min(refmaxcounts.get(ngram,0), count)
return result
def score_cooked(allcomps, n=4):
totalcomps = {'testlen':0, 'reflen':0, 'guess':[0]*n, 'correct':[0]*n}
for comps in allcomps:
for key in ['testlen','reflen']:
totalcomps[key] += comps[key]
for key in ['guess','correct']:
for k in range(n):
totalcomps[key][k] += comps[key][k]
logbleu = 0.0
for k in range(n):
if totalcomps['correct'][k] == 0:
return 0.0
#log.write("%d-grams: %f\n" % (k,float(totalcomps['correct'][k])/totalcomps['guess'][k]))
logbleu += math.log(totalcomps['correct'][k])-math.log(totalcomps['guess'][k])
logbleu /= float(n)
#log.write("Effective reference length: %d test length: %d\n" % (totalcomps['reflen'], totalcomps['testlen']))
logbleu += min(0,1-float(totalcomps['reflen'])/totalcomps['testlen'])
return math.exp(logbleu)
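# Minimal usage sketch (illustrative): cook the reference(s) for a segment once,
# then score any hypothesis against them; several cooked hypotheses can be pooled
# with score_cooked() to obtain a document-level BLEU score.
#   refs = cook_refs(["the cat sat on the mat"])
#   hyp = cook_test("the cat sat on a mat", refs)
#   bleu = score_cooked([hyp], n=4)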

@@ -0,0 +1,191 @@
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Copyright: University of Zurich
# Author: Rico Sennrich
# For licensing information, see LICENSE
# Evaluation functions for Bleualign
from __future__ import division
from operator import itemgetter
def evaluate(options, testalign, goldalign, log_function):
goldalign = [(tuple(src),tuple(target)) for src,target in goldalign]
results = {}
paircounts = {}
for pair in [(len(srclist),len(targetlist)) for srclist,targetlist in goldalign]:
paircounts[pair] = paircounts.get(pair,0) + 1
pairs_normalized = {}
for pair in paircounts:
pairs_normalized[pair] = (paircounts[pair],paircounts[pair] / float(len(goldalign)))
log_function('\ngold alignment frequencies\n')
for aligntype,(abscount,relcount) in sorted(list(pairs_normalized.items()),key=itemgetter(1),reverse=True):
log_function(aligntype,end='')
log_function(' - ',end='')
log_function(abscount,end='')
log_function(' ('+str(relcount)+')')
log_function('\ntotal recall: ',end='')
log_function(str(len(goldalign)) + ' pairs in gold')
(tpstrict,fnstrict,tplax,fnlax) = recall((0,0),goldalign,[i[0] for i in testalign],log_function)
results['recall'] = (tpstrict,fnstrict,tplax,fnlax)
for aligntype in set([i[1] for i in testalign]):
testalign_bytype = []
for i in testalign:
if i[1] == aligntype:
testalign_bytype.append(i)
log_function('precision for alignment type ' + str(aligntype) + ' ( ' + str(len(testalign_bytype)) + ' alignment pairs)')
precision(goldalign,testalign_bytype,log_function)
log_function('\ntotal precision:',end='')
log_function(str(len(testalign)) + ' alignment pairs found')
(tpstrict,fpstrict,tplax,fplax) = precision(goldalign,testalign,log_function)
results['precision'] = (tpstrict,fpstrict,tplax,fplax)
return results
def precision(goldalign, testalign, log_function):
tpstrict=0
tplax=0
fpstrict=0
fplax=0
for (src,target) in [i[0] for i in testalign]:
if (src,target) == ((),()):
continue
if (src,target) in goldalign:
tpstrict +=1
tplax += 1
else:
srcset, targetset = set(src), set(target)
for srclist,targetlist in goldalign:
#lax condition: hypothesis and gold alignment only need to overlap
if srcset.intersection(set(srclist)) and targetset.intersection(set(targetlist)):
fpstrict +=1
tplax += 1
break
else:
fpstrict +=1
fplax +=1
log_function('false positive: ',2)
log_function((src,target),2)
if tpstrict+fpstrict > 0:
log_function('precision strict: ',end='')
log_function((tpstrict/float(tpstrict+fpstrict)))
log_function('precision lax: ',end='')
log_function((tplax/float(tplax+fplax)))
log_function('')
else:
log_function('nothing to find')
return tpstrict,fpstrict,tplax,fplax
def recall(aligntype, goldalign, testalign, log_function):
srclen,targetlen = aligntype
if srclen == 0 and targetlen == 0:
gapdists = [(0,0) for i in goldalign]
elif srclen == 0 or targetlen == 0:
log_function('nothing to find')
return
else:
gapdists = [(len(srclist),len(targetlist)) for srclist,targetlist in goldalign]
tpstrict=0
tplax=0
fnstrict=0
fnlax=0
for i,pair in enumerate(gapdists):
if aligntype == pair:
(srclist,targetlist) = goldalign[i]
if not srclist or not targetlist:
continue
elif (srclist,targetlist) in testalign:
tpstrict +=1
tplax +=1
else:
srcset, targetset = set(srclist), set(targetlist)
for src,target in testalign:
#lax condition: hypothesis and gold alignment only need to overlap
if srcset.intersection(set(src)) and targetset.intersection(set(target)):
tplax +=1
fnstrict+=1
break
else:
fnstrict+=1
fnlax+=1
log_function('not found: ',2),
log_function(goldalign[i],2)
if tpstrict+fnstrict>0:
        log_function('recall strict: ',end='')
        log_function((tpstrict/float(tpstrict+fnstrict)))
        log_function('recall lax: ',end='')
log_function((tplax/float(tplax+fnlax)))
log_function('')
else:
log_function('nothing to find')
return tpstrict,fnstrict,tplax,fnlax
def finalevaluation(results, log_function):
recall_value = [0,0,0,0]
precision_value = [0,0,0,0]
for i,k in list(results.items()):
for m,j in enumerate(recall_value):
recall_value[m] = j+ k['recall'][m]
for m,j in enumerate(precision_value):
precision_value[m] = j+ k['precision'][m]
try:
pstrict = (precision_value[0]/float(precision_value[0]+precision_value[1]))
except ZeroDivisionError:
pstrict = 0
try:
plax =(precision_value[2]/float(precision_value[2]+precision_value[3]))
except ZeroDivisionError:
plax = 0
try:
rstrict= (recall_value[0]/float(recall_value[0]+recall_value[1]))
except ZeroDivisionError:
rstrict = 0
try:
rlax=(recall_value[2]/float(recall_value[2]+recall_value[3]))
except ZeroDivisionError:
rlax = 0
if (pstrict+rstrict) == 0:
fstrict = 0
else:
fstrict=2*(pstrict*rstrict)/(pstrict+rstrict)
if (plax+rlax) == 0:
flax=0
else:
flax=2*(plax*rlax)/(plax+rlax)
log_function('\n=========================\n')
log_function('total results:')
log_function('recall strict: ',end='')
log_function(rstrict)
log_function('recall lax: ',end='')
log_function(rlax)
log_function('')
log_function('precision strict: ',end='')
log_function(pstrict)
    log_function('precision lax: ',end='')
log_function(plax)
log_function('')
log_function('f1 strict: ',end='')
log_function(fstrict)
log_function('f1 lax: ',end='')
log_function(flax)
log_function('')

@@ -0,0 +1,158 @@
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Copyright © 2010 University of Zürich
# Author: Rico Sennrich <sennrich@cl.uzh.ch>
# For licensing information, see LICENSE
from __future__ import division, print_function
import sys
import os
import getopt
def usage():
bold = "\033[1m"
reset = "\033[0;0m"
italic = "\033[3m"
print('\n\t All files need to be one sentence per line and have .EOA as a hard delimiter. --source, --target and --output are mandatory arguments, the others are optional.')
print('\n\t' + bold +'--help' + reset + ', ' + bold +'-h' + reset)
print('\t\tprint usage information\n')
print('\t' + bold +'--source' + reset + ', ' + bold +'-s' + reset + ' file')
print('\t\tSource language text.')
print('\t' + bold +'--target' + reset + ', ' + bold +'-t' + reset + ' file')
print('\t\tTarget language text.')
print('\t' + bold +'--output' + reset + ', ' + bold +'-o' + reset + ' filename')
print('\t\tOutput file: Will create ' + 'filename' + '-s and ' + 'filename' + '-t')
print('\n\t' + bold +'--srctotarget' + reset + ' file')
print('\t\tTranslation of source language text to target language. Needs to be sentence-aligned with source language text.')
print('\t' + bold +'--targettosrc' + reset + ' file')
print('\t\tTranslation of target language text to source language. Needs to be sentence-aligned with target language text.')
print('\n\t' + bold +'--factored' + reset)
print('\t\tSource and target text can be factored (as defined by moses: | as separator of factors, space as word separator). Only first factor will be used for BLEU score.')
print('\n\t' + bold +'--filter' + reset + ', ' + bold +'-f' + reset + ' option')
print('\t\tFilters output. Possible options:')
print('\t\t' + bold +'sentences' + reset + '\tevaluate each sentence and filter on a per-sentence basis')
print('\t\t' + bold +'articles' + reset + '\tevaluate each article and filter on a per-article basis')
print('\n\t' + bold +'--filterthreshold' + reset + ' int')
print('\t\tFilters output to best XX percent. (Default: 90). Only works if --filter is set.')
print('\t' + bold +'--bleuthreshold' + reset + ' float')
print('\t\tFilters out sentence pairs with sentence-level BLEU score < XX (in range from 0 to 1). (Default: 0). Only works if --filter is set.')
print('\t' + bold +'--filterlang' + reset)
print('\t\tFilters out sentences/articles for which BLEU score between source and target is higher than that between translation and target (usually means source and target are in same language). Only works if --filter is set.')
print('\n\t' + bold +'--bleu_n' + reset + ' int')
print('\t\tConsider n-grams up to size n for BLEU. Default 2.')
print('\t' + bold +'--bleu_charlevel' + reset)
    print('\t\tPerform BLEU on character level (recommended for continuous-script languages; also consider increasing bleu_n).')
print('\n\t' + bold +'--galechurch' + reset)
print('\t\tAlign the bitext using Gale and Church\'s algorithm (without BLEU comparison).')
print('\t' + bold +'--printempty' + reset)
print('\t\tAlso write unaligned sentences to file. By default, they are discarded.')
print('\t' + bold +'--verbosity' + reset + ', ' + bold +'-v' + reset + ' int')
print('\t\tVerbosity. Choose amount of debugging output. Default value 1; choose 0 for (mostly) quiet mode, 2 for verbose output')
print('\t' + bold +'--processes' + reset + ', ' + bold +'-p' + reset + ' int')
print('\t\tNumber of parallel processes. Documents are split across available processes. Default: 4.')
def load_arguments(sysargv):
try:
opts, args = getopt.getopt(sysargv[1:], "def:ho:s:t:v:p:", ["factored", "filter=", "filterthreshold=", "bleuthreshold=", "filterlang", "printempty", "deveval","eval", "help", "bleu_n=", "bleu_charlevel", "galechurch", "output=", "source=", "target=", "srctotarget=", "targettosrc=", "verbosity=", "printempty=", "processes="])
except getopt.GetoptError as err:
# print help information and exit:
print(str(err)) # will print something like "option -a not recognized"
usage()
sys.exit(2)
options = {}
options['srcfile'] = None
options['targetfile'] = None
options['output'] = None
options['srctotarget'] = []
options['targettosrc'] = []
options['processes'] = 4
bold = "\033[1m"
reset = "\033[0;0m"
project_path = os.path.dirname(os.path.abspath(__file__))
for o, a in opts:
if o in ("-h", "--help"):
usage()
sys.exit()
elif o in ("-e", "--eval"):
options['srcfile'] = os.path.join(project_path,'eval','eval1989.de')
options['targetfile'] = os.path.join(project_path,'eval','eval1989.fr')
from eval import goldeval
goldalign = [None] * len(goldeval.gold1990map)
for index, data in list(goldeval.gold1990map.items()):
goldalign[index] = goldeval.gold[data]
options['eval'] = goldalign
elif o in ("-d", "--deveval"):
options['srcfile'] = os.path.join(project_path,'eval','eval1957.de')
options['targetfile'] = os.path.join(project_path,'eval','eval1957.fr')
from eval import golddev
goldalign = [golddev.goldalign]
options['eval'] = goldalign
elif o in ("-o", "--output"):
options['output'] = a
elif o == "--factored":
options['factored'] = True
elif o in ("-f", "--filter"):
if a in ['sentences','articles']:
options['filter'] = a
else:
print('\nERROR: Valid values for option ' + bold + '--filter'+ reset +' are '+ bold +'sentences '+ reset +'and ' + bold +'articles'+ reset +'.')
usage()
sys.exit(2)
elif o == "--filterthreshold":
options['filterthreshold'] = float(a)
elif o == "--bleuthreshold":
options['bleuthreshold'] = float(a)
elif o == "--filterlang":
options['filterlang'] = True
elif o == "--galechurch":
options['galechurch'] = True
elif o == "--bleu_n":
options['bleu_ngrams'] = int(a)
elif o == "--bleu_charlevel":
options['bleu_charlevel'] = True
elif o in ("-s", "--source"):
if not 'eval' in options:
options['srcfile'] = a
elif o in ("-t", "--target"):
if not 'eval' in options:
options['targetfile'] = a
elif o == "--srctotarget":
if a == '-':
options['no_translation_override'] = True
else:
options['srctotarget'].append(a)
elif o == "--targettosrc":
options['targettosrc'].append(a)
elif o == "--printempty":
options['printempty'] = True
elif o in ("-v", "--verbosity"):
global loglevel
loglevel = int(a)
options['loglevel'] = int(a)
options['verbosity'] = int(a)
elif o in ("-p", "--processes"):
options['num_processes'] = int(a)
else:
assert False, "unhandled option"
if not options['output']:
        print('WARNING: Output not specified. Just printing debugging output.')
if not options['srcfile']:
print('\nERROR: Source file not specified.')
usage()
sys.exit(2)
if not options['targetfile']:
print('\nERROR: Target file not specified.')
usage()
sys.exit(2)
if options['targettosrc'] and not options['srctotarget']:
print('\nWARNING: Only --targettosrc specified, but expecting at least one --srctotarget. Please swap source and target side.')
sys.exit(2)
if not options['srctotarget'] and not options['targettosrc']\
and 'no_translation_override' not in options:
print("ERROR: no translation available: BLEU scores can be computed between the source and target text, but this is not the intended usage of Bleualign and may result in poor performance! If you're *really* sure that this is what you want, use the option '--srctotarget -'")
sys.exit(2)
return options

@@ -0,0 +1,42 @@
# -*- coding: utf-8 -*-
import os
import setuptools
def read_file(filename):
    with open(os.path.join(os.path.dirname(__file__), filename)) as f:
        return f.read()
setuptools.setup(
name = 'bleualign',
version = '0.1.1',
description = 'An MT-based sentence alignment tool',
long_description = read_file('README.md'),
author = 'Rico Sennrich',
author_email = 'sennrich@cl.uzh.ch',
url = 'https://github.com/rsennrich/Bleualign',
download_url = 'https://github.com/rsennrich/Bleualign',
keywords = [
'Sentence Alignment',
'Natural Language Processing',
'Statistical Machine Translation',
'BLEU',
],
classifiers = [
# which Development Status?
# 'Development Status :: 3 - Alpha',
'Development Status :: 4 - Beta',
# 'Development Status :: 5 - Production/Stable',
'License :: OSI Approved :: GNU General Public License v2 (GPLv2)',
'Operating System :: OS Independent',
'Programming Language :: Python :: 2.6',
'Programming Language :: Python :: 2.7',
'Programming Language :: Python :: 3',
'Programming Language :: Python :: 3.2',
'Programming Language :: Python :: 3.3',
'Programming Language :: Python :: 3.4',
'Topic :: Scientific/Engineering',
'Topic :: Scientific/Engineering :: Information Analysis',
'Topic :: Text Processing',
'Topic :: Text Processing :: Linguistic',
],
packages = ['bleualign'],
)