CONTRIBUTING.md

# Contribution guide

## Overview

OpenKiwi is an open-source Quality Estimation toolkit aimed at implementing state-of-the-art models in an efficient and unified fashion. While we do welcome contributions, guaranteeing their quality and usefulness requires following a few basic guidelines that ease development, collaboration, and readability.

## Basic guidelines

* The project must fully support Python 3.5 and later.
* Code is linted with [flake8](http://flake8.pycqa.org/en/latest/user/error-codes.html); please run `flake8 kiwi` and fix any remaining errors before pushing code.
* Code formatting must stick to the Facebook style: 80 columns and single quotes. For Python 3.6+, the [black](https://github.com/ambv/black) formatter can be used by running `black kiwi`. For Python 3.5, [YAPF](https://github.com/google/yapf) should get most of the job done, although some manual changes might be necessary.
* Imports are sorted with [isort](https://github.com/timothycrosley/isort).
* Filenames must be lowercase.
* Tests run with [pytest](https://docs.pytest.org/en/latest/), widely regarded as one of the best unit-testing frameworks available. Pytest implements standard test discovery, which means it will only search for `test_*.py` or `*_test.py` files. We do not enforce a minimum code coverage, but it is preferable to have even very basic tests for critical pieces of code. Always test functions that take or return tensor arguments, to document the expected sizes (see the sketch after this guide).
* The `kiwi` folder contains core features. Any script calling these features must be placed in the `scripts` folder.

## Contributing

* Keep track of everything by creating issues and editing them with references to the code! Explain succinctly the problem you are trying to solve and your solution.
* Contributions to `master` should be made through GitHub pull requests.
* Dependencies are managed using `Poetry`. Although we would rather err on the side of fewer rather than more dependencies, any that are needed are declared in the `pyproject.toml` file.
* Work in a clean environment (`virtualenv` is nice).
* Your commit message must start with an infinitive verb (Add, Fix, Remove, ...).
* If your change is based on a paper, please include a clear comment and reference in the code and in the related issue.
* To test your local changes, install OpenKiwi following the instructions in the [documentation](https://unbabel.github.io/openkiwi).
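Below is a minimal sketch of the kind of shape-documenting test the guideline above asks for. The function `project_features` and all sizes are illustrative, not part of the codebase, and the example assumes PyTorch as the tensor library:

```python
import torch


def project_features(batch):
    # Stand-in for a real function under test; maps word ids of shape
    # (batch_size, seq_len) to features of shape (batch_size, seq_len, 50).
    return torch.rand(batch.shape[0], batch.shape[1], 50)


def test_project_features_shape():
    batch = torch.zeros(8, 20, dtype=torch.long)  # (batch_size, seq_len)
    features = project_features(batch)
    # The assertion documents the expected output sizes.
    assert features.shape == (8, 20, 50)
```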
LICENSE

GNU AFFERO GENERAL PUBLIC LICENSE
Version 3, 19 November 2007

Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/>
Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.

Preamble

The GNU Affero General Public License is a free, copyleft license for software and other kinds of works, specifically designed to ensure cooperation with the community in the case of network server software.

The licenses for most software and other practical works are designed to take away your freedom to share and change the works. By contrast, our General Public Licenses are intended to guarantee your freedom to share and change all versions of a program--to make sure it remains free software for all its users.

When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for them if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs, and that you know you can do these things.

Developers that use our General Public Licenses protect your rights with two steps: (1) assert copyright on the software, and (2) offer you this License which gives you legal permission to copy, distribute and/or modify the software.

A secondary benefit of defending all users' freedom is that improvements made in alternate versions of the program, if they receive widespread use, become available for other developers to incorporate. Many developers of free software are heartened and encouraged by the resulting cooperation. However, in the case of software used on network servers, this result may fail to come about. The GNU General Public License permits making a modified version and letting the public access it on a server without ever releasing its source code to the public.

The GNU Affero General Public License is designed specifically to ensure that, in such cases, the modified source code becomes available to the community. It requires the operator of a network server to provide the source code of the modified version running there to the users of that server. Therefore, public use of a modified version, on a publicly accessible server, gives the public access to the source code of the modified version.

An older license, called the Affero General Public License and published by Affero, was designed to accomplish similar goals. This is a different license, not a version of the Affero GPL, but Affero has released a new version of the Affero GPL which permits relicensing under this license.

The precise terms and conditions for copying, distribution and modification follow.

TERMS AND CONDITIONS

0. Definitions.

"This License" refers to version 3 of the GNU Affero General Public License.

"Copyright" also means copyright-like laws that apply to other kinds of works, such as semiconductor masks.

"The Program" refers to any copyrightable work licensed under this License. Each licensee is addressed as "you". "Licensees" and "recipients" may be individuals or organizations.

To "modify" a work means to copy from or adapt all or part of the work in a fashion requiring copyright permission, other than the making of an exact copy. The resulting work is called a "modified version" of the earlier work or a work "based on" the earlier work.

A "covered work" means either the unmodified Program or a work based on the Program.

To "propagate" a work means to do anything with it that, without permission, would make you directly or secondarily liable for infringement under applicable copyright law, except executing it on a computer or modifying a private copy. Propagation includes copying, distribution (with or without modification), making available to the public, and in some countries other activities as well.

To "convey" a work means any kind of propagation that enables other parties to make or receive copies. Mere interaction with a user through a computer network, with no transfer of a copy, is not conveying.
An interactive user interface displays "Appropriate Legal Notices" to the extent that it includes a convenient and prominently visible feature that (1) displays an appropriate copyright notice, and (2) tells the user that there is no warranty for the work (except to the extent that warranties are provided), that licensees may convey the work under this License, and how to view a copy of this License. If the interface presents a list of user commands or options, such as a menu, a prominent item in the list meets this criterion. 1. Source Code. The "source code" for a work means the preferred form of the work for making modifications to it. "Object code" means any non-source form of a work. A "Standard Interface" means an interface that either is an official standard defined by a recognized standards body, or, in the case of interfaces specified for a particular programming language, one that is widely used among developers working in that language. The "System Libraries" of an executable work include anything, other than the work as a whole, that (a) is included in the normal form of packaging a Major Component, but which is not part of that Major Component, and (b) serves only to enable use of the work with that Major Component, or to implement a Standard Interface for which an implementation is available to the public in source code form. A "Major Component", in this context, means a major essential component (kernel, window system, and so on) of the specific operating system (if any) on which the executable work runs, or a compiler used to produce the work, or an object code interpreter used to run it. The "Corresponding Source" for a work in object code form means all the source code needed to generate, install, and (for an executable work) run the object code and to modify the work, including scripts to control those activities. However, it does not include the work's System Libraries, or general-purpose tools or generally available free programs which are used unmodified in performing those activities but which are not part of the work. For example, Corresponding Source includes interface definition files associated with source files for the work, and the source code for shared libraries and dynamically linked subprograms that the work is specifically designed to require, such as by intimate data communication or control flow between those subprograms and other parts of the work. The Corresponding Source need not include anything that users can regenerate automatically from other parts of the Corresponding Source. The Corresponding Source for a work in source code form is that same work. 2. Basic Permissions. All rights granted under this License are granted for the term of copyright on the Program, and are irrevocable provided the stated conditions are met. This License explicitly affirms your unlimited permission to run the unmodified Program. The output from running a covered work is covered by this License only if the output, given its content, constitutes a covered work. This License acknowledges your rights of fair use or other equivalent, as provided by copyright law. You may make, run and propagate covered works that you do not convey, without conditions so long as your license otherwise remains in force. 
You may convey covered works to others for the sole purpose of having them make modifications exclusively for you, or provide you with facilities for running those works, provided that you comply with the terms of this License in conveying all material for which you do not control copyright. Those thus making or running the covered works for you must do so exclusively on your behalf, under your direction and control, on terms that prohibit them from making any copies of your copyrighted material outside their relationship with you. Conveying under any other circumstances is permitted solely under the conditions stated below. Sublicensing is not allowed; section 10 makes it unnecessary. 3. Protecting Users' Legal Rights From Anti-Circumvention Law. No covered work shall be deemed part of an effective technological measure under any applicable law fulfilling obligations under article 11 of the WIPO copyright treaty adopted on 20 December 1996, or similar laws prohibiting or restricting circumvention of such measures. When you convey a covered work, you waive any legal power to forbid circumvention of technological measures to the extent such circumvention is effected by exercising rights under this License with respect to the covered work, and you disclaim any intention to limit operation or modification of the work as a means of enforcing, against the work's users, your or third parties' legal rights to forbid circumvention of technological measures. 4. Conveying Verbatim Copies. You may convey verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice; keep intact all notices stating that this License and any non-permissive terms added in accord with section 7 apply to the code; keep intact all notices of the absence of any warranty; and give all recipients a copy of this License along with the Program. You may charge any price or no price for each copy that you convey, and you may offer support or warranty protection for a fee. 5. Conveying Modified Source Versions. You may convey a work based on the Program, or the modifications to produce it from the Program, in the form of source code under the terms of section 4, provided that you also meet all of these conditions: a) The work must carry prominent notices stating that you modified it, and giving a relevant date. b) The work must carry prominent notices stating that it is released under this License and any conditions added under section 7. This requirement modifies the requirement in section 4 to "keep intact all notices". c) You must license the entire work, as a whole, under this License to anyone who comes into possession of a copy. This License will therefore apply, along with any applicable section 7 additional terms, to the whole of the work, and all its parts, regardless of how they are packaged. This License gives no permission to license the work in any other way, but it does not invalidate such permission if you have separately received it. d) If the work has interactive user interfaces, each must display Appropriate Legal Notices; however, if the Program has interactive interfaces that do not display Appropriate Legal Notices, your work need not make them do so. 
A compilation of a covered work with other separate and independent works, which are not by their nature extensions of the covered work, and which are not combined with it such as to form a larger program, in or on a volume of a storage or distribution medium, is called an "aggregate" if the compilation and its resulting copyright are not used to limit the access or legal rights of the compilation's users beyond what the individual works permit. Inclusion of a covered work in an aggregate does not cause this License to apply to the other parts of the aggregate. 6. Conveying Non-Source Forms. You may convey a covered work in object code form under the terms of sections 4 and 5, provided that you also convey the machine-readable Corresponding Source under the terms of this License, in one of these ways: a) Convey the object code in, or embodied in, a physical product (including a physical distribution medium), accompanied by the Corresponding Source fixed on a durable physical medium customarily used for software interchange. b) Convey the object code in, or embodied in, a physical product (including a physical distribution medium), accompanied by a written offer, valid for at least three years and valid for as long as you offer spare parts or customer support for that product model, to give anyone who possesses the object code either (1) a copy of the Corresponding Source for all the software in the product that is covered by this License, on a durable physical medium customarily used for software interchange, for a price no more than your reasonable cost of physically performing this conveying of source, or (2) access to copy the Corresponding Source from a network server at no charge. c) Convey individual copies of the object code with a copy of the written offer to provide the Corresponding Source. This alternative is allowed only occasionally and noncommercially, and only if you received the object code with such an offer, in accord with subsection 6b. d) Convey the object code by offering access from a designated place (gratis or for a charge), and offer equivalent access to the Corresponding Source in the same way through the same place at no further charge. You need not require recipients to copy the Corresponding Source along with the object code. If the place to copy the object code is a network server, the Corresponding Source may be on a different server (operated by you or a third party) that supports equivalent copying facilities, provided you maintain clear directions next to the object code saying where to find the Corresponding Source. Regardless of what server hosts the Corresponding Source, you remain obligated to ensure that it is available for as long as needed to satisfy these requirements. e) Convey the object code using peer-to-peer transmission, provided you inform other peers where the object code and Corresponding Source of the work are being offered to the general public at no charge under subsection 6d. A separable portion of the object code, whose source code is excluded from the Corresponding Source as a System Library, need not be included in conveying the object code work. A "User Product" is either (1) a "consumer product", which means any tangible personal property which is normally used for personal, family, or household purposes, or (2) anything designed or sold for incorporation into a dwelling. In determining whether a product is a consumer product, doubtful cases shall be resolved in favor of coverage. 
For a particular product received by a particular user, "normally used" refers to a typical or common use of that class of product, regardless of the status of the particular user or of the way in which the particular user actually uses, or expects or is expected to use, the product. A product is a consumer product regardless of whether the product has substantial commercial, industrial or non-consumer uses, unless such uses represent the only significant mode of use of the product. "Installation Information" for a User Product means any methods, procedures, authorization keys, or other information required to install and execute modified versions of a covered work in that User Product from a modified version of its Corresponding Source. The information must suffice to ensure that the continued functioning of the modified object code is in no case prevented or interfered with solely because modification has been made. If you convey an object code work under this section in, or with, or specifically for use in, a User Product, and the conveying occurs as part of a transaction in which the right of possession and use of the User Product is transferred to the recipient in perpetuity or for a fixed term (regardless of how the transaction is characterized), the Corresponding Source conveyed under this section must be accompanied by the Installation Information. But this requirement does not apply if neither you nor any third party retains the ability to install modified object code on the User Product (for example, the work has been installed in ROM). The requirement to provide Installation Information does not include a requirement to continue to provide support service, warranty, or updates for a work that has been modified or installed by the recipient, or for the User Product in which it has been modified or installed. Access to a network may be denied when the modification itself materially and adversely affects the operation of the network or violates the rules and protocols for communication across the network. Corresponding Source conveyed, and Installation Information provided, in accord with this section must be in a format that is publicly documented (and with an implementation available to the public in source code form), and must require no special password or key for unpacking, reading or copying. 7. Additional Terms. "Additional permissions" are terms that supplement the terms of this License by making exceptions from one or more of its conditions. Additional permissions that are applicable to the entire Program shall be treated as though they were included in this License, to the extent that they are valid under applicable law. If additional permissions apply only to part of the Program, that part may be used separately under those permissions, but the entire Program remains governed by this License without regard to the additional permissions. When you convey a copy of a covered work, you may at your option remove any additional permissions from that copy, or from any part of it. (Additional permissions may be written to require their own removal in certain cases when you modify the work.) You may place additional permissions on material, added by you to a covered work, for which you have or can give appropriate copyright permission. 
Notwithstanding any other provision of this License, for material you add to a covered work, you may (if authorized by the copyright holders of that material) supplement the terms of this License with terms: a) Disclaiming warranty or limiting liability differently from the terms of sections 15 and 16 of this License; or b) Requiring preservation of specified reasonable legal notices or author attributions in that material or in the Appropriate Legal Notices displayed by works containing it; or c) Prohibiting misrepresentation of the origin of that material, or requiring that modified versions of such material be marked in reasonable ways as different from the original version; or d) Limiting the use for publicity purposes of names of licensors or authors of the material; or e) Declining to grant rights under trademark law for use of some trade names, trademarks, or service marks; or f) Requiring indemnification of licensors and authors of that material by anyone who conveys the material (or modified versions of it) with contractual assumptions of liability to the recipient, for any liability that these contractual assumptions directly impose on those licensors and authors. All other non-permissive additional terms are considered "further restrictions" within the meaning of section 10. If the Program as you received it, or any part of it, contains a notice stating that it is governed by this License along with a term that is a further restriction, you may remove that term. If a license document contains a further restriction but permits relicensing or conveying under this License, you may add to a covered work material governed by the terms of that license document, provided that the further restriction does not survive such relicensing or conveying. If you add terms to a covered work in accord with this section, you must place, in the relevant source files, a statement of the additional terms that apply to those files, or a notice indicating where to find the applicable terms. Additional terms, permissive or non-permissive, may be stated in the form of a separately written license, or stated as exceptions; the above requirements apply either way. 8. Termination. You may not propagate or modify a covered work except as expressly provided under this License. Any attempt otherwise to propagate or modify it is void, and will automatically terminate your rights under this License (including any patent licenses granted under the third paragraph of section 11). However, if you cease all violation of this License, then your license from a particular copyright holder is reinstated (a) provisionally, unless and until the copyright holder explicitly and finally terminates your license, and (b) permanently, if the copyright holder fails to notify you of the violation by some reasonable means prior to 60 days after the cessation. Moreover, your license from a particular copyright holder is reinstated permanently if the copyright holder notifies you of the violation by some reasonable means, this is the first time you have received notice of violation of this License (for any work) from that copyright holder, and you cure the violation prior to 30 days after your receipt of the notice. Termination of your rights under this section does not terminate the licenses of parties who have received copies or rights from you under this License. If your rights have been terminated and not permanently reinstated, you do not qualify to receive new licenses for the same material under section 10. 9. 
Acceptance Not Required for Having Copies. You are not required to accept this License in order to receive or run a copy of the Program. Ancillary propagation of a covered work occurring solely as a consequence of using peer-to-peer transmission to receive a copy likewise does not require acceptance. However, nothing other than this License grants you permission to propagate or modify any covered work. These actions infringe copyright if you do not accept this License. Therefore, by modifying or propagating a covered work, you indicate your acceptance of this License to do so. 10. Automatic Licensing of Downstream Recipients. Each time you convey a covered work, the recipient automatically receives a license from the original licensors, to run, modify and propagate that work, subject to this License. You are not responsible for enforcing compliance by third parties with this License. An "entity transaction" is a transaction transferring control of an organization, or substantially all assets of one, or subdividing an organization, or merging organizations. If propagation of a covered work results from an entity transaction, each party to that transaction who receives a copy of the work also receives whatever licenses to the work the party's predecessor in interest had or could give under the previous paragraph, plus a right to possession of the Corresponding Source of the work from the predecessor in interest, if the predecessor has it or can get it with reasonable efforts. You may not impose any further restrictions on the exercise of the rights granted or affirmed under this License. For example, you may not impose a license fee, royalty, or other charge for exercise of rights granted under this License, and you may not initiate litigation (including a cross-claim or counterclaim in a lawsuit) alleging that any patent claim is infringed by making, using, selling, offering for sale, or importing the Program or any portion of it. 11. Patents. A "contributor" is a copyright holder who authorizes use under this License of the Program or a work on which the Program is based. The work thus licensed is called the contributor's "contributor version". A contributor's "essential patent claims" are all patent claims owned or controlled by the contributor, whether already acquired or hereafter acquired, that would be infringed by some manner, permitted by this License, of making, using, or selling its contributor version, but do not include claims that would be infringed only as a consequence of further modification of the contributor version. For purposes of this definition, "control" includes the right to grant patent sublicenses in a manner consistent with the requirements of this License. Each contributor grants you a non-exclusive, worldwide, royalty-free patent license under the contributor's essential patent claims, to make, use, sell, offer for sale, import and otherwise run, modify and propagate the contents of its contributor version. In the following three paragraphs, a "patent license" is any express agreement or commitment, however denominated, not to enforce a patent (such as an express permission to practice a patent or covenant not to sue for patent infringement). To "grant" such a patent license to a party means to make such an agreement or commitment not to enforce a patent against the party. 
If you convey a covered work, knowingly relying on a patent license, and the Corresponding Source of the work is not available for anyone to copy, free of charge and under the terms of this License, through a publicly available network server or other readily accessible means, then you must either (1) cause the Corresponding Source to be so available, or (2) arrange to deprive yourself of the benefit of the patent license for this particular work, or (3) arrange, in a manner consistent with the requirements of this License, to extend the patent license to downstream recipients. "Knowingly relying" means you have actual knowledge that, but for the patent license, your conveying the covered work in a country, or your recipient's use of the covered work in a country, would infringe one or more identifiable patents in that country that you have reason to believe are valid. If, pursuant to or in connection with a single transaction or arrangement, you convey, or propagate by procuring conveyance of, a covered work, and grant a patent license to some of the parties receiving the covered work authorizing them to use, propagate, modify or convey a specific copy of the covered work, then the patent license you grant is automatically extended to all recipients of the covered work and works based on it. A patent license is "discriminatory" if it does not include within the scope of its coverage, prohibits the exercise of, or is conditioned on the non-exercise of one or more of the rights that are specifically granted under this License. You may not convey a covered work if you are a party to an arrangement with a third party that is in the business of distributing software, under which you make payment to the third party based on the extent of your activity of conveying the work, and under which the third party grants, to any of the parties who would receive the covered work from you, a discriminatory patent license (a) in connection with copies of the covered work conveyed by you (or copies made from those copies), or (b) primarily for and in connection with specific products or compilations that contain the covered work, unless you entered into that arrangement, or that patent license was granted, prior to 28 March 2007. Nothing in this License shall be construed as excluding or limiting any implied license or other defenses to infringement that may otherwise be available to you under applicable patent law. 12. No Surrender of Others' Freedom. If conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot convey a covered work so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not convey it at all. For example, if you agree to terms that obligate you to collect a royalty for further conveying from those to whom you convey the Program, the only way you could satisfy both those terms and this License would be to refrain entirely from conveying the Program. 13. Remote Network Interaction; Use with the GNU General Public License. 
Notwithstanding any other provision of this License, if you modify the Program, your modified version must prominently offer all users interacting with it remotely through a computer network (if your version supports such interaction) an opportunity to receive the Corresponding Source of your version by providing access to the Corresponding Source from a network server at no charge, through some standard or customary means of facilitating copying of software. This Corresponding Source shall include the Corresponding Source for any work covered by version 3 of the GNU General Public License that is incorporated pursuant to the following paragraph. Notwithstanding any other provision of this License, you have permission to link or combine any covered work with a work licensed under version 3 of the GNU General Public License into a single combined work, and to convey the resulting work. The terms of this License will continue to apply to the part which is the covered work, but the work with which it is combined will remain governed by version 3 of the GNU General Public License. 14. Revised Versions of this License. The Free Software Foundation may publish revised and/or new versions of the GNU Affero General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. Each version is given a distinguishing version number. If the Program specifies that a certain numbered version of the GNU Affero General Public License "or any later version" applies to it, you have the option of following the terms and conditions either of that numbered version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of the GNU Affero General Public License, you may choose any version ever published by the Free Software Foundation. If the Program specifies that a proxy can decide which future versions of the GNU Affero General Public License can be used, that proxy's public statement of acceptance of a version permanently authorizes you to choose that version for the Program. Later license versions may give you additional or different permissions. However, no additional obligations are imposed on any author or copyright holder as a result of your choosing to follow a later version. 15. Disclaimer of Warranty. THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 16. Limitation of Liability. 
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

17. Interpretation of Sections 15 and 16.

If the disclaimer of warranty and limitation of liability provided above cannot be given local legal effect according to their terms, reviewing courts shall apply local law that most closely approximates an absolute waiver of all civil liability in connection with the Program, unless a warranty or assumption of liability accompanies a copy of the Program in return for a fee.

END OF TERMS AND CONDITIONS

How to Apply These Terms to Your New Programs

If you develop a new program, and you want it to be of the greatest possible use to the public, the best way to achieve this is to make it free software which everyone can redistribute and change under these terms.

To do so, attach the following notices to the program. It is safest to attach them to the start of each source file to most effectively state the exclusion of warranty; and each file should have at least the "copyright" line and a pointer to where the full notice is found.

<one line to give the program's name and a brief idea of what it does.>
Copyright (C) <year> <name of author>

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.

Also add information on how to contact you by electronic and paper mail.

If your software can interact with users remotely through a computer network, you should also make sure that it provides a way for users to get its source. For example, if your program is a web application, its interface could display a "Source" link that leads users to an archive of the code. There are many ways you could offer source, and different solutions will be better for different programs; see section 13 for the specific requirements.

You should also get your employer (if you work as a programmer) or school, if any, to sign a "copyright disclaimer" for the program, if necessary. For more information on this, and how to apply and follow the GNU AGPL, see <https://www.gnu.org/licenses/>.

kiwi/__init__.py

# OpenKiwi: Open-Source Machine Translation Quality Estimation
# Copyright (C) 2019 Unbabel
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published
# by the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
#
from kiwi.lib.train import train_from_file as train  # NOQA
from kiwi.lib.predict import load_model  # NOQA

__version__ = '0.1.0'
__copyright__ = (
    '2019 Unbabel. All rights reserved. '
    'Source code available under the AGPL-3.0.'
)
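A usage sketch of the two names re-exported by kiwi/__init__.py above; it assumes train_from_file accepts a path to a YAML experiment file, as its name and the CLI suggest, and both paths are hypothetical:

    import kiwi

    # Train from an experiment config (hypothetical path).
    kiwi.train('experiments/train_nuqe.yaml')

    # Restore a trained model for prediction (hypothetical path).
    model = kiwi.load_model('runs/0/best_model.torch')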
kiwi/__main__.py

# OpenKiwi: Open-Source Machine Translation Quality Estimation
# Copyright (C) 2019 Unbabel
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published
# by the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
#
import kiwi.cli.main


def main():
    return kiwi.cli.main.cli()


if __name__ == '__main__':
    main()

kiwi/cli/__init__.py

# OpenKiwi: Open-Source Machine Translation Quality Estimation
# Copyright (C) 2019 Unbabel
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published
# by the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
#
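Since kiwi/__main__.py above simply delegates to kiwi.cli.main.cli(), the CLI can also be exercised from Python; a hedged sketch, where overriding sys.argv is purely illustrative:

    import sys

    import kiwi.cli.main

    # cli() parses sys.argv via configargparse, so emulate
    # 'kiwi evaluate --help' before calling it.
    sys.argv = ['kiwi', 'evaluate', '--help']
    kiwi.cli.main.cli()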
kiwi/cli/better_argparse.py

# OpenKiwi: Open-Source Machine Translation Quality Estimation
# Copyright (C) 2019 Unbabel
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published
# by the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
#
import logging

import configargparse
from configargparse import Namespace

from kiwi.cli import opts
from kiwi.cli.opts import PathType
from kiwi.lib.utils import merge_namespaces

logger = logging.getLogger(__name__)


class HyperPipelineParser:
    def __init__(
        self, name, pipeline_parser, pipeline_config_key, options_fn=None
    ):
        self.name = name
        self._pipeline_parser = pipeline_parser
        self._pipeline_config_key = pipeline_config_key.replace('-', '_')
        self._parser = configargparse.get_argument_parser(
            self.name,
            prog='kiwi {}'.format(self.name),
            add_help=False,
            config_file_parser_class=configargparse.YAMLConfigFileParser,
            ignore_unknown_config_file_keys=False,
        )
        self._parser.add(
            '--config',
            required=False,
            is_config_file=True,
            type=PathType(exists=True),
            help='Load config file from path',
        )
        if options_fn is not None:
            options_fn(self._parser)

    def parse(self, args):
        if len(args) == 1 and args[0] in ['-h', '--help']:
            self._parser.print_help()
            return None

        # Parse train pipeline options
        meta_options, extra_args = self._parser.parse_known_args(args)
        logger.debug(meta_options)

        if hasattr(meta_options, self._pipeline_config_key):
            extra_args = [
                '--config',
                getattr(meta_options, self._pipeline_config_key),
            ] + extra_args
        pipeline_options = self._pipeline_parser.parse(extra_args)

        options = Namespace()
        options.meta = meta_options
        options.pipeline = pipeline_options

        return options
class PipelineParser:
    _parsers = {}

    def __init__(
        self,
        name,
        model_parsers,
        options_fn=None,
        add_io_options=True,
        add_general_options=True,
        add_logging_options=True,
        add_save_load_options=True,
    ):
        self.name = name
        # Give the option to create pipelines with no models
        if model_parsers is not None:
            self._models = {model.name: model for model in model_parsers}
        else:
            self._models = None

        if name in self._parsers:
            self._parser = self._parsers[name]
        else:
            self._parser = configargparse.get_argument_parser(
                self.name,
                add_help=False,
                prog='kiwi {}'.format(self.name),
                config_file_parser_class=configargparse.YAMLConfigFileParser,
                ignore_unknown_config_file_keys=True,
            )
            self._parsers[name] = self._parser

        self.add_config_option(self._parser)
        if add_io_options:
            opts.io_opts(self._parser)
        if add_general_options:
            opts.general_opts(self._parser)
        if add_logging_options:
            opts.logging_opts(self._parser)
        if add_save_load_options:
            opts.save_load_opts(self._parser)
        if options_fn is not None:
            options_fn(self._parser)

        if model_parsers is not None:
            group = self._parser.add_argument_group('models')
            group.add_argument(
                '--model',
                required=True,
                choices=self._models.keys(),
                help="Use 'kiwi {} --model --help' for specific "
                "model options.".format(self.name),
            )

        if 'config' in self._parsers:
            self._config_option_parser = self._parsers['config']
        else:
            self._config_option_parser = configargparse.get_argument_parser(
                'config', add_help=False
            )
            self._parsers['config'] = self._config_option_parser
        self.add_config_option(self._config_option_parser, read_file=False)

    @staticmethod
    def add_config_option(parser, read_file=True):
        parser.add(
            '--config',
            required=False,
            is_config_file=read_file,
            help='Load config file from path',
        )

    def parse_config_file(self, file_name):
        return self.parse(['--config', str(file_name)])

    def parse(self, args):
        if len(args) == 1 and args[0] in ['-h', '--help']:
            self._parser.print_help()
            return None

        # Parse train pipeline options
        pipeline_options, extra_args = self._parser.parse_known_args(args)
        config_option, _ = self._config_option_parser.parse_known_args(args)

        options = Namespace()
        options.pipeline = pipeline_options
        options.model = None
        options.model_api = None

        # Parse specific model options if there are model parsers
        if self._models is not None:
            if pipeline_options.model not in self._models:
                raise KeyError(
                    'Invalid model: {}'.format(pipeline_options.model)
                )
            if config_option:
                extra_args = ['--config', config_option.config] + extra_args
            # Check if there are model parsers
            model_parser = self._models[pipeline_options.model]
            model_options, remaining_args = model_parser.parse_known_args(
                extra_args
            )
            options.model = model_options
            # Retrieve the respective API for the selected model
            options.model_api = model_parser.api_module
        else:
            remaining_args = extra_args

        options.all_options = merge_namespaces(options.pipeline, options.model)

        if remaining_args:
            raise KeyError('Unrecognized options: {}'.format(remaining_args))

        return options


class ModelParser:
    _parsers = {}

    def __init__(self, name, pipeline, options_fn, api_module, title=None):
        self.name = name
        self._title = title
        self._pipeline = pipeline
        self.api_module = api_module

        self._parser = self.get_parser(
            '{}-{}'.format(name, pipeline), description=self._title
        )
        PipelineParser.add_config_option(self._parser)
        options_fn(self._parser)

    @classmethod
    def get_parser(cls, name, **kwargs):
        if name in cls._parsers:
            return cls._parsers[name]
        parser = configargparse.get_argument_parser(
            name,
            prog='... {}'.format(name),
            config_file_parser_class=configargparse.YAMLConfigFileParser,
            ignore_unknown_config_file_keys=True,
            **kwargs,
        )
        cls._parsers[name] = parser
        return parser

    def parse_known_args(self, args):
        return self._parser.parse_known_args(args)
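To make the collaboration between the two classes concrete, a hedged sketch that registers one model parser with a pipeline parser. The model name and flag are invented for illustration, and the option groups pulled in from kiwi.cli.opts may add further required arguments in a real run:

    def toy_options(parser):
        # Hypothetical model-specific flag.
        parser.add_argument('--hidden-size', type=int, default=100)

    toy = ModelParser('toy', 'train', options_fn=toy_options, api_module=None)
    pipeline = PipelineParser('train', model_parsers=[toy])
    options = pipeline.parse(['--model', 'toy', '--hidden-size', '200'])
    # options.model holds the toy namespace; options.model_api is api_module.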
kiwi/cli/main.py

# OpenKiwi: Open-Source Machine Translation Quality Estimation
# Copyright (C) 2019 Unbabel
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published
# by the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
#
import configargparse

from kiwi import __copyright__, __version__
from kiwi.cli.pipelines import evaluate, jackknife, predict, train


def build_parser():
    global parser
    parser = configargparse.get_argument_parser(
        name='main',
        prog='kiwi',
        description='Quality Estimation toolkit',
        add_help=True,
        epilog='Copyright {}'.format(__copyright__),
    )
    parser.add_argument('--version', action='version', version=__version__)

    subparsers = parser.add_subparsers(
        title='Pipelines',
        description="Use 'kiwi (-h | --help)' to check it out.",
        help='Available pipelines:',
        dest='pipeline',
    )
    subparsers.required = True

    subparsers.add_parser(
        'train',
        # parents=[train.parser],
        add_help=False,
        help='Train a QE model',
    )
    subparsers.add_parser(
        'predict',
        # parents=[predict.parser],
        add_help=False,
        help='Use a pre-trained model for prediction',
    )
    subparsers.add_parser(
        'jackknife',
        # parents=[jackknife.parser],
        add_help=False,
        help='Jackknife training data with model',
    )
    subparsers.add_parser(
        'evaluate',
        add_help=False,
        help='Evaluate a model\'s predictions using popular metrics',
    )

    return parser


def cli():
    options, extra_args = build_parser().parse_known_args()
    if options.pipeline == 'train':
        train.main(extra_args)
    if options.pipeline == 'predict':
        predict.main(extra_args)
    # Meta pipelines
    # if options.pipeline == 'search':
    #     search.main(extra_args)
    if options.pipeline == 'jackknife':
        jackknife.main(extra_args)
    if options.pipeline == 'evaluate':
        evaluate.main(extra_args)


if __name__ == '__main__':  # pragma: no cover
    cli()  # pragma: no cover

kiwi/cli/models/__init__.py

# OpenKiwi: Open-Source Machine Translation Quality Estimation
# Copyright (C) 2019 Unbabel
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published
# by the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
#
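Each model module in this package exposes a parser_for_pipeline factory that returns a ModelParser (or None when the pipeline is unsupported). A hedged sketch of collecting them for a pipeline, assuming only the modules shown below:

    from kiwi.cli.models import linear, nuqe

    model_parsers = [
        module.parser_for_pipeline('train') for module in (linear, nuqe)
    ]
    model_parsers = [mp for mp in model_parsers if mp is not None]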
kiwi/cli/models/linear.py

# OpenKiwi: Open-Source Machine Translation Quality Estimation
# Copyright (C) 2019 Unbabel
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published
# by the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
#
import logging

from kiwi.cli.better_argparse import ModelParser
from kiwi.cli.opts import PathType
from kiwi.models.linear_word_qe_classifier import LinearWordQEClassifier

logger = logging.getLogger(__name__)

title = 'linear'


def _add_vocabulary_opts(parser):
    group = parser.add_argument_group('vocabulary options')
    group.add_argument(
        '--source-vocab-size',
        type=int,
        default=None,
        help='Size of the source vocabulary.',
    )
    group.add_argument(
        '--target-vocab-size',
        type=int,
        default=None,
        help='Size of the target vocabulary.',
    )
    group.add_argument(
        '--source-vocab-min-frequency',
        type=int,
        default=1,
        help='Min word frequency for source vocabulary.',
    )
    group.add_argument(
        '--target-vocab-min-frequency',
        type=int,
        default=1,
        help='Min word frequency for target vocabulary.',
    )


def add_training_data_file_opts(parser):
    # Data options
    group = parser.add_argument_group('data')
    group.add_argument(
        '--train-source',
        type=PathType(exists=True),
        help='Path to training source file',
    )
    group.add_argument(
        '--train-target',
        type=PathType(exists=True),
        help='Path to training target file',
    )
    group.add_argument(
        '--train-alignments',
        type=str,
        help='Path to train alignments between source and target.',
    )
    group.add_argument(
        '--train-source-tags',
        type=PathType(exists=True),
        help='Path to training label file for source (WMT18 format)',
    )
    group.add_argument(
        '--train-target-tags',
        type=PathType(exists=True),
        help='Path to training label file for target',
    )
    group.add_argument(
        '--train-source-pos',
        type=PathType(exists=True),
        help='Path to training PoS tags file for source',
    )
    group.add_argument(
        '--train-target-pos',
        type=PathType(exists=True),
        help='Path to training PoS tags file for target',
    )
    group.add_argument(
        '--train-target-parse',
        type=PathType(exists=True),
        help='Path to training dependency parsing file for target (tabular '
        'format)',
    )
    group.add_argument(
        '--train-target-ngram',
        type=PathType(exists=True),
        help='Path to training highest order ngram file for target (tabular '
        'format)',
    )
    group.add_argument(
        '--train-target-stacked',
        type=PathType(exists=True),
        help='Path to training stacked predictions file for target (tabular '
        'format)',
    )

    group = parser.add_argument_group('validation data')
    group.add_argument(
        '--valid-source',
        type=PathType(exists=True),
        # required=True,
        help='Path to validation source file',
    )
    group.add_argument(
        '--valid-target',
        type=PathType(exists=True),
        # required=True,
        help='Path to validation target file',
    )
    group.add_argument(
        '--valid-alignments',
        type=str,
        # required=True,
        help='Path to valid alignments between source and target.',
    )
    group.add_argument(
        '--valid-source-tags',
        type=PathType(exists=True),
        help='Path to validation label file for source (WMT18 format)',
    )
    group.add_argument(
        '--valid-target-tags',
        type=PathType(exists=True),
        help='Path to validation label file for target',
    )
    group.add_argument(
        '--valid-source-pos',
        type=PathType(exists=True),
        help='Path to validation PoS tags file for source',
    )
    group.add_argument(
        '--valid-target-pos',
        type=PathType(exists=True),
        help='Path to validation PoS tags file for target',
    )
    group.add_argument(
        '--valid-target-parse',
        type=PathType(exists=True),
        help='Path to validation dependency parsing file for target (tabular '
        'format)',
    )
    group.add_argument(
        '--valid-target-ngram',
        type=PathType(exists=True),
        help='Path to validation highest order ngram file for target (tabular '
        'format)',
    )
    group.add_argument(
        '--valid-target-stacked',
        type=PathType(exists=True),
        help='Path to validation stacked predictions file for target (tabular '
        'format)',
    )


def add_predicting_data_file_opts(parser):
    # Data options
    group = parser.add_argument_group('data')
    group.add_argument(
        '--test-source',
        type=PathType(exists=True),
        required=True,
        help='Path to test source file',
    )
    group.add_argument(
        '--test-target',
        type=PathType(exists=True),
        required=True,
        help='Path to test target file',
    )
    group.add_argument(
        '--test-alignments',
        type=PathType(exists=True),
        help='Path to test alignments between source and target.',
    )
    group.add_argument(
        '--test-source-pos',
        type=PathType(exists=True),
        help='Path to test PoS tags file for source',
    )
    group.add_argument(
        '--test-target-pos',
        type=PathType(exists=True),
        help='Path to test PoS tags file for target',
    )
    group.add_argument(
        '--test-target-parse',
        type=PathType(exists=True),
        help='Path to test dependency parsing file for target (tabular format)',
    )
    group.add_argument(
        '--test-target-ngram',
        type=PathType(exists=True),
        help='Path to test highest order ngram file for target (tabular '
        'format)',
    )  # noqa
    group.add_argument(
        '--test-target-stacked',
        type=PathType(exists=True),
        help='Path to test stacked predictions file for target (tabular '
        'format)',
    )  # noqa
    return group


def _add_output_options(group):
    # Other options (used both at training and test time).
    group.add_argument(
        '--evaluation-metric',
        type=str,
        default='f1_mult',
        help='Evaluation metric (f1_mult or f1_bad).',
    )


def add_training_options(training_parser):
    add_training_data_file_opts(training_parser)
    _add_vocabulary_opts(training_parser)

    group = training_parser.add_argument_group(
        'linear', description='Linear Quality Estimation'
    )
    # Model options (training time).
    group.add_argument(
        '--use-basic-features-only',
        type=int,
        default=0,
        help='1 for using only basic features (words).',
    )
    group.add_argument(
        '--use-bigrams',
        type=int,
        default=1,
        help='1 for using bigram features (i.e. a CRF-like model).',
    )
    group.add_argument(
        '--use-simple-bigram-features',
        type=int,
        default=0,
        help='1 for using only label indicators as bigram features.',
    )
    # Training options.
    group.add_argument(
        '--training-algorithm',
        type=str,
        default='svm_mira',
        help='Algorithm for training the model (svm_mira, svm_sgd, '
        'perceptron).',
    )
    group.add_argument(
        '--regularization-constant',
        type=float,
        default=0.001,
        help='L2 regularization constant.',
    )
    group.add_argument(
        '--cost-false-positives',
        type=float,
        default=0.2,
        help='Cost for false positives (svm_mira and svm_sgd only).',
    )
    group.add_argument(
        '--cost-false-negatives',
        type=float,
        default=0.8,
        help='Cost for false negatives (svm_mira and svm_sgd only).',
    )
    _add_output_options(group)


def add_predicting_options(predicting_parser):
    add_predicting_data_file_opts(predicting_parser)
    _add_output_options(predicting_parser)


def parser_for_pipeline(pipeline):
    if pipeline == 'train':
        return ModelParser(
            'linear',
            'train',
            title=LinearWordQEClassifier.title,
            options_fn=add_training_options,
            api_module=LinearWordQEClassifier,
        )
    if pipeline == 'predict':
        return ModelParser(
            'linear',
            'predict',
            title=LinearWordQEClassifier.title,
            options_fn=add_predicting_options,
            api_module=LinearWordQEClassifier,
        )
    return None
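A hedged sketch of exercising the options above in isolation; the paths are placeholders, and because PathType(exists=True) validates them, they must point at real files when this actually runs:

    model_parser = parser_for_pipeline('train')
    options, remaining = model_parser.parse_known_args(
        [
            '--train-source', 'data/train.src',  # placeholder path
            '--train-target', 'data/train.tgt',  # placeholder path
            '--use-bigrams', '1',                # CRF-like bigram features
        ]
    )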
kiwi/cli/models/nuqe.py

# OpenKiwi: Open-Source Machine Translation Quality Estimation
# Copyright (C) 2019 Unbabel
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published
# by the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
#
from distutils.util import strtobool

from kiwi.cli.better_argparse import ModelParser
from kiwi.cli.models.quetch import (
    add_data_flags,
    add_predicting_options,
    add_training_data_file_opts,
    add_vocabulary_opts,
)
from kiwi.models.nuqe import NuQE


def add_model_hyper_params_opts(training_parser):
    group = training_parser.add_argument_group('hyper-parameters')
    group.add_argument(
        '--bad-weight',
        type=float,
        default=3.0,
        help='Relative weight for bad labels.',
    )
    group.add_argument(
        '--window-size', type=int, default=3, help='Sliding window size.'
    )
    group.add_argument(
        '--max-aligned',
        type=int,
        default=5,
        help='Max number of alignments between source and target.',
    )
    group.add_argument(
        '--source-embeddings-size',
        type=int,
        default=50,
        help='Word embedding size for source.',
    )
    group.add_argument(
        '--target-embeddings-size',
        type=int,
        default=50,
        help='Word embedding size for target.',
    )
    group.add_argument(
        '--freeze-embeddings',
        type=lambda x: bool(strtobool(x)),
        nargs='?',
        const=True,
        default=False,
        help='Freeze embedding weights during training.',
    )
    group.add_argument(
        '--embeddings-dropout',
        type=float,
        default=0.0,
        help='Dropout rate for embedding layers.',
    )
    group.add_argument(
        '--hidden-sizes',
        type=int,
        nargs='+',
        # action='append',
        default=[400, 200, 100, 50],
        help='List of hidden sizes.',
    )
    group.add_argument(
        '--dropout',
        type=float,
        default=0.0,
        help='Dropout rate for linear layers.',
    )
    group.add_argument(
        '--init-type',
        type=str,
        default='uniform',
        choices=[
            'uniform',
            'normal',
            'constant',
            'glorot_uniform',
            'glorot_normal',
        ],
        help='Distribution type for parameters initialization.',
    )
    group.add_argument(
        '--init-support',
        type=float,
        default=0.1,
        help='Parameters are initialized over uniform distribution with '
        'support (-param_init, param_init). Use 0 to not use '
        'initialization.',
    )
    return group


def add_training_options(training_parser):
    add_training_data_file_opts(training_parser)
    add_data_flags(training_parser)
    add_vocabulary_opts(training_parser)
    add_model_hyper_params_opts(training_parser)


def parser_for_pipeline(pipeline):
    if pipeline == 'train':
        return ModelParser(
            'nuqe',
            'train',
            title=NuQE.title,
            options_fn=add_training_options,
            api_module=NuQE,
        )
    if pipeline == 'predict':
        return ModelParser(
            'nuqe',
            'predict',
            title=NuQE.title,
            options_fn=add_predicting_options,
            api_module=NuQE,
        )
    return None
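The --freeze-embeddings flag above pairs strtobool with nargs='?' and const=True, so it works both as a bare switch and with an explicit value. A small sketch of the conversion it relies on:

    from distutils.util import strtobool

    # strtobool maps 'y', 'yes', 't', 'true', 'on', '1' to 1 and
    # 'n', 'no', 'f', 'false', 'off', '0' to 0; anything else raises
    # ValueError.
    assert bool(strtobool('true')) is True
    assert bool(strtobool('0')) is False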
# from kiwi.cli.better_argparse import ModelParser from kiwi.cli.models.predictor_estimator import add_pretraining_options from kiwi.models.predictor import Predictor def parser_for_pipeline(pipeline): if pipeline == 'train': return ModelParser( 'predictor', 'train', title=Predictor.title, options_fn=add_pretraining_options, api_module=Predictor, ) return None PK!i9?9?&kiwi/cli/models/predictor_estimator.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see <https://www.gnu.org/licenses/>. # from distutils.util import strtobool from kiwi import constants as const from kiwi.cli.better_argparse import ModelParser from kiwi.cli.opts import PathType from kiwi.lib.utils import parse_integer_with_positive_infinity from kiwi.models.predictor_estimator import Estimator title = 'Estimator (Predictor-Estimator)' def _add_training_data_file_opts(parser): # Data options group = parser.add_argument_group('data') group.add_argument( '--train-source', type=PathType(exists=True), required=True, help='Path to training source file', ) group.add_argument( '--train-target', type=PathType(exists=True), # required=True, help='Path to training target file', ) group.add_argument( '--train-source-tags', type=PathType(exists=True), help='Path to training label file for source (WMT18 format)', ) group.add_argument( '--train-target-tags', type=PathType(exists=True), help='Path to training label file for target', ) group.add_argument( '--train-pe', type=PathType(exists=True), help='Path to file containing post-edited target.', ) group.add_argument( '--train-sentence-scores', type=PathType(exists=True), help='Path to file containing sentence level scores.', ) valid_group = parser.add_argument_group('validation data') valid_group.add_argument( '--split', type=float, help='Split train dataset in case no validation set is given.', ) valid_group.add_argument( '--valid-source', type=PathType(exists=True), # required=True, help='Path to validation source file', ) valid_group.add_argument( '--valid-target', type=PathType(exists=True), # required=True, help='Path to validation target file', ) valid_group.add_argument( '--valid-alignments', type=str, # required=True, help='Path to valid alignments between source and target.', ) valid_group.add_argument( '--valid-source-tags', type=PathType(exists=True), help='Path to validation label file for source (WMT18 format)', ) valid_group.add_argument( '--valid-target-tags', type=PathType(exists=True), help='Path to validation label file for target', ) valid_group.add_argument( '--valid-pe', type=PathType(exists=True), help='Path to file containing post-edited target.', ) valid_group.add_argument( '--valid-sentence-scores', type=PathType(exists=True), help='Path to file containing sentence level scores.', ) def _add_predicting_data_file_opts(parser): # Data options group = parser.add_argument_group('data') group.add_argument( '--test-source',
type=PathType(exists=True), required=True, help='Path to test source file', ) group.add_argument( '--test-target', type=PathType(exists=True), required=True, help='Path to test target file', ) return group def _add_data_flags(parser): group = parser.add_argument_group('data processing options') group.add_argument( '--predict-side', type=str, default=const.TARGET_TAGS, choices=[const.TARGET_TAGS, const.SOURCE_TAGS, const.GAP_TAGS], help='Tagset to predict. Leave unchanged for WMT17 format.', ) group.add_argument( '--wmt18-format', type=lambda x: bool(strtobool(x)), nargs='?', const=True, default=False, help='Read target tags in WMT18 format.', ) group.add_argument( '--source-max-length', type=parse_integer_with_positive_infinity, default=float("inf"), help='Maximum source sequence length', ) group.add_argument( '--source-min-length', type=int, default=0, help='Minimum source sequence length.', ) group.add_argument( '--target-max-length', type=parse_integer_with_positive_infinity, default=float("inf"), help='Maximum target sequence length to keep.', ) group.add_argument( '--target-min-length', type=int, default=0, help='Minimum target sequence length.', ) return group def _add_vocabulary_opts(parser): group = parser.add_argument_group( 'vocabulary options', description='Options for loading vocabulary from a previous run. ' 'This is used e.g. for training a source predictor via ' 'predict-inverse: True. If set, other vocab options are ignored.', ) group.add_argument( '--source-vocab-size', type=int, default=None, help='Size of the source vocabulary.', ) group.add_argument( '--target-vocab-size', type=int, default=None, help='Size of the target vocabulary.', ) group.add_argument( '--source-vocab-min-frequency', type=int, default=1, help='Min word frequency for source vocabulary.', ) group.add_argument( '--target-vocab-min-frequency', type=int, default=1, help='Min word frequency for target vocabulary.', ) def _add_data_options(data_parser): group = data_parser.add_argument_group( 'PredEst data', description='Predictor Estimator specific data ' 'options. (POSTECH)', ) group.add( '--extend-source-vocab', type=PathType(exists=True), help='Optionally load more data which is used only for vocabulary ' 'creation. Path to additional data (Predictor).', ) group.add( '--extend-target-vocab', type=PathType(exists=True), help='Optionally load more data which is used only for vocabulary ' 'creation.
Path to additional data (Predictor).', ) def add_pretraining_options(parser): _add_training_data_file_opts(parser) _add_data_flags(parser) _add_vocabulary_opts(parser) _add_data_options(parser) group = parser.add_argument_group( 'predictor training', description='Predictor Estimator (POSTECH)' ) # Only for training group.add_argument( '--warmup', type=int, default=0, help='Pretrain Predictor for this number of steps.', ) group.add_argument( '--rnn-layers-pred', type=int, default=2, help='Layers in Pred RNN' ) group.add_argument( '--dropout-pred', type=float, default=0.0, help='Dropout in predictor' ) group.add_argument( '--hidden-pred', type=int, default=100, help='Size of hidden layers in LSTM', ) group.add_argument( '--out-embeddings-size', type=int, default=200, help='Word embedding size in output layer.', ) group.add_argument( '--embedding-sizes', type=int, default=0, help='If set, takes precedence over other embedding params', ) group.add_argument( '--share-embeddings', type=lambda x: bool(strtobool(x)), nargs='?', const=True, default=False, help='Tie input and output embeddings for target.', ) group.add_argument( '--predict-inverse', type=lambda x: bool(strtobool(x)), nargs='?', const=True, default=False, help='Predict target -> source instead of source -> target.', ) group = parser.add_argument_group( 'model-embeddings', description='Embedding layers size in case pre-trained embeddings ' 'are not used.', ) group.add_argument( '--source-embeddings-size', type=int, default=50, help='Word embedding size for source.', ) group.add_argument( '--target-embeddings-size', type=int, default=50, help='Word embedding size for target.', ) def add_training_options(training_parser): add_pretraining_options(training_parser) group = training_parser.add_argument_group( 'predictor-estimator training', description='Predictor Estimator (POSTECH). These settings are used ' 'to train the Predictor. They will be ignored if training a ' 'Predictor-Estimator and the `load-model` flag is set.', ) group.add_argument( '--start-stop', type=lambda x: bool(strtobool(x)), nargs='?', const=True, default=False, help='Append start and stop symbols to estimator feature sequence.', ) group.add_argument( '--predict-gaps', type=lambda x: bool(strtobool(x)), nargs='?', const=True, default=False, help='Predict Gap Tags. Requires `train-gap-tags`, `valid-' 'gap-tags` to be set.', ) group.add_argument( '--predict-target', type=lambda x: bool(strtobool(x)), nargs='?', const=True, default=True, help='Predict Target Tags. Requires `train-target-tags`, `valid-' 'target-tags` to be set.', ) group.add_argument( '--predict-source', type=lambda x: bool(strtobool(x)), nargs='?', const=True, default=False, help='Predict Source Tags. Requires `train-source-tags`, `valid-' 'source-tags` to be set.', ) group.add_argument( '--load-pred-source', type=PathType(exists=True), help='If set, model architecture and vocabulary parameters are ' 'ignored. Load pretrained predictor src->tgt.', ) group.add_argument( '--load-pred-target', type=PathType(exists=True), help='If set, model architecture and vocabulary parameters are ' 'ignored.
Load pretrained predictor tgt->src.', ) group.add_argument( '--rnn-layers-est', type=int, default=2, help='Layers in Estimator RNN' ) group.add_argument( '--dropout-est', type=float, default=0.0, help='Dropout in estimator' ) group.add_argument( '--hidden-est', type=int, default=100, help='Size of hidden layers in LSTM', ) group.add_argument( '--mlp-est', type=lambda x: bool(strtobool(x)), nargs='?', const=True, default=False, help="""Pass the Estimator input through a linear layer reducing dimensionality before RNN.""", ) group.add_argument( '--sentence-level', type=lambda x: bool(strtobool(x)), nargs='?', const=True, default=False, help="""Predict Sentence Level Scores. Requires setting `train-sentence-scores, valid-sentence-scores`""", ) group.add_argument( '--sentence-ll', type=lambda x: bool(strtobool(x)), nargs='?', const=True, default=False, help="""Use a probabilistic loss for sentence scores instead of squared error. If set, the model will output mean and variance of a truncated Gaussian distribution over the interval [0, 1], and use the NLL of ground truth `hter` as the loss. This seems to improve performance, and gives you uncertainty estimates for sentence level predictions as a byproduct. If `sentence-level == False`, this has no effect. """, ) group.add_argument( '--sentence-ll-predict-mean', type=lambda x: bool(strtobool(x)), nargs='?', const=True, default=False, help="""If `sentence-ll == True`, by default the prediction for `hter` will be the mean of the Gaussian /before truncation/. After truncation, this will be the mode of the distribution, but not the mean, as a truncated Gaussian is skewed to one side. Set this to `True` to use the true mean after truncation for prediction. """, ) group.add_argument( '--use-probs', type=lambda x: bool(strtobool(x)), nargs='?', const=True, default=False, help='Predict scores as product/sum of word level probs', ) group.add_argument( '--binary-level', type=lambda x: bool(strtobool(x)), nargs='?', const=True, default=False, help="""Predict binary sentence labels indicating `hter == 0.0`. Requires setting `train-sentence-scores`, `valid-sentence-scores`""", ) group.add_argument( '--token-level', type=lambda x: bool(strtobool(x)), nargs='?', const=True, default=False, help="""Continue training the predictor on the post-edited text. If set, will do an additional forward pass through the predictor using the SRC, PE pair and add the `Predictor` loss for the tokens in the post-edited text PE. Recommended if you have access to PE.
Requires setting `train-pe`, `valid-pe`""", ) group.add_argument( '--target-bad-weight', type=float, default=3.0, help='Relative weight for target bad labels.', ) group.add_argument( '--gaps-bad-weight', type=float, default=3.0, help='Relative weight for gaps bad labels.', ) group.add_argument( '--source-bad-weight', type=float, default=3.0, help='Relative weight for source bad labels.', ) def add_predicting_options(predicting_parser): _add_predicting_data_file_opts(predicting_parser) group = predicting_parser.add_argument_group( 'predictor-estimator Prediction', description='Predictor Estimator (POSTECH)', ) group.add_argument( '--wmt18-format', type=lambda x: bool(strtobool(x)), nargs='?', const=True, default=False, help='Read target tags in WMT18 format.', ) group.add_argument( '--sentence-level', type=lambda x: bool(strtobool(x)), nargs='?', const=True, default=False, help='Predict Sentence Level Scores', ) group.add_argument( '--binary-level', type=lambda x: bool(strtobool(x)), nargs='?', const=True, default=False, help='Predict binary sentence labels', ) group.add_argument( '--valid-batch-size', type=int, default=32, help='Batch Size' ) group.add_argument( '--predict-inverse', type=lambda x: bool(strtobool(x)), nargs='?', const=True, default=False, help='Predict target -> source instead of source -> target.', ) def parser_for_pipeline(pipeline): if pipeline == 'train': return ModelParser( 'estimator', 'train', title=Estimator.title, options_fn=add_training_options, api_module=Estimator, ) if pipeline == 'predict': return ModelParser( 'estimator', 'predict', title=Estimator.title, options_fn=add_predicting_options, api_module=Estimator, ) return None PK!wl ( (kiwi/cli/models/quetch.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see . 
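# ---------------------------------------------------------------------------
# Illustrative sketch (not from the OpenKiwi sources; assumes scipy is
# available). The --sentence-ll-predict-mean help above distinguishes the
# untruncated mean from the true mean of a Gaussian truncated to [0, 1];
# a quick numerical check of that claim:
from scipy.stats import truncnorm

mu, sigma = 0.9, 0.5
a, b = (0.0 - mu) / sigma, (1.0 - mu) / sigma  # bounds in standard units
true_mean = truncnorm.mean(a, b, loc=mu, scale=sigma)
# Truncation removes far more mass above 1 than below 0 here, so the true
# mean lies below the untruncated mean mu, while the mode stays at mu:
assert true_mean < mu
# ---------------------------------------------------------------------------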
# from distutils.util import strtobool from kiwi.cli.better_argparse import ModelParser from kiwi.cli.opts import PathType from kiwi.models.quetch import QUETCH def add_training_data_file_opts(parser): # Data options group = parser.add_argument_group('data') group.add_argument( '--train-source', type=PathType(exists=True), required=True, help='Path to training source file', ) group.add_argument( '--train-target', type=PathType(exists=True), required=True, help='Path to training target file', ) group.add_argument( '--train-alignments', type=str, required=True, help='Path to train alignments between source and target.', ) group.add_argument( '--train-source-tags', type=PathType(exists=True), help='Path to training label file for source (WMT18 format)', ) group.add_argument( '--train-target-tags', type=PathType(exists=True), help='Path to training label file for target', ) group.add_argument( '--valid-source', type=PathType(exists=True), required=True, help='Path to validation source file', ) group.add_argument( '--valid-target', type=PathType(exists=True), required=True, help='Path to validation target file', ) group.add_argument( '--valid-alignments', type=str, required=True, help='Path to valid alignments between source and target.', ) group.add_argument( '--valid-source-tags', type=PathType(exists=True), help='Path to validation label file for source (WMT18 format)', ) group.add_argument( '--valid-target-tags', type=PathType(exists=True), help='Path to validation label file for target', ) group.add_argument( '--valid-source-pos', type=PathType(exists=True), help='Path to validation PoS tags file for source', ) group.add_argument( '--valid-target-pos', type=PathType(exists=True), help='Path to validation PoS tags file for target', ) return group def add_predicting_data_file_opts(parser): # Data options group = parser.add_argument_group('data') group.add_argument( '--test-source', type=PathType(exists=True), required=True, help='Path to test source file', ) group.add_argument( '--test-target', type=PathType(exists=True), required=True, help='Path to test target file', ) group.add( '--test-alignments', type=PathType(exists=True), required=True, help='Path to test alignments between source and target.', ) return group def add_data_flags(parser): group = parser.add_argument_group('data processing options') group.add_argument( '--predict-target', type=lambda x: bool(strtobool(x)), nargs='?', const=True, default=True, help='Predict Target Tags.
Leave unchanged for WMT17 format', ) group.add_argument( '--predict-gaps', type=lambda x: bool(strtobool(x)), nargs='?', const=True, default=False, help='Predict Gap Tags.', ) group.add_argument( '--predict-source', type=lambda x: bool(strtobool(x)), nargs='?', const=True, default=False, help='Predict Source Tags.', ) group.add_argument( '--wmt18-format', type=lambda x: bool(strtobool(x)), nargs='?', const=True, default=False, help='Read target tags in WMT18 format.', ) group.add_argument( '--source-max-length', type=int, default=float("inf"), help='Maximum source sequence length', ) group.add_argument( '--source-min-length', type=int, default=1, help='Minimum source sequence length.', ) group.add_argument( '--target-max-length', type=int, default=float("inf"), help='Maximum target sequence length to keep.', ) group.add_argument( '--target-min-length', type=int, default=1, help='Minimum target sequence length.', ) return group def add_vocabulary_opts(parser): group = parser.add_argument_group('vocabulary options') group.add_argument( '--source-vocab-size', type=int, default=None, help='Size of the source vocabulary.', ) group.add_argument( '--target-vocab-size', type=int, default=None, help='Size of the target vocabulary.', ) group.add_argument( '--source-vocab-min-frequency', type=int, default=1, help='Min word frequency for source vocabulary.', ) group.add_argument( '--target-vocab-min-frequency', type=int, default=1, help='Min word frequency for target vocabulary.', ) group.add_argument( '--keep-rare-words-with-embeddings', type=lambda x: bool(strtobool(x)), nargs='?', const=True, default=False, help='Keep words that occur less than min-frequency but ' 'are in embeddings vocabulary.', ) group.add_argument( '--add-embeddings-vocab', type=lambda x: bool(strtobool(x)), nargs='?', const=True, default=False, help='Add words from embeddings vocabulary to source/target ' 'vocabulary.', ) group.add_argument( '--embeddings-format', type=str, default='polyglot', choices=['polyglot', 'word2vec', 'fasttext', 'glove', 'text'], help='Word embeddings format. ' 'See README for specific formatting instructions.', ) group.add_argument( '--embeddings-binary', type=lambda x: bool(strtobool(x)), nargs='?', const=True, default=False, help='Load embeddings stored in binary.', ) group.add_argument( '--source-embeddings', type=PathType(exists=True), help='Path to word embeddings file for source.', ) group.add_argument( '--target-embeddings', type=PathType(exists=True), help='Path to word embeddings file for target.', ) return group def add_model_hyper_params_opts(training_parser): group = training_parser.add_argument_group('hyper-parameters') group.add_argument( '--bad-weight', type=float, default=3.0, help='Relative weight for bad labels.', ) group.add_argument( '--window-size', type=int, default=3, help='Sliding window size.'
) group.add_argument( '--max-aligned', type=int, default=5, help='Max number of alignments between source and target.', ) group.add_argument( '--source-embeddings-size', type=int, default=50, help='Word embedding size for source.', ) group.add_argument( '--target-embeddings-size', type=int, default=50, help='Word embedding size for target.', ) group.add_argument( '--freeze-embeddings', type=lambda x: bool(strtobool(x)), nargs='?', const=True, default=False, help='Freeze embedding weights during training.', ) group.add_argument( '--embeddings-dropout', type=float, default=0.0, help='Dropout rate for embedding layers.', ) group.add_argument( '--hidden-sizes', type=int, nargs='+', default=[50], help='List of hidden sizes.', ) group.add_argument( '--dropout', type=float, default=0.0, help='Dropout rate for linear layers.', ) group.add_argument( '--init-type', type=str, default='uniform', choices=[ 'uniform', 'normal', 'constant', 'glorot_uniform', 'glorot_normal', ], help='Distribution type for parameters initialization.', ) group.add_argument( '--init-support', type=float, default=0.1, help='Parameters are initialized over uniform distribution with ' 'support (-param_init, param_init). Use 0 to not use ' 'initialization.', ) return group def add_training_options(training_parser): add_training_data_file_opts(training_parser) add_data_flags(training_parser) add_vocabulary_opts(training_parser) add_model_hyper_params_opts(training_parser) def add_predicting_options(predicting_parser): add_predicting_data_file_opts(predicting_parser) add_data_flags(predicting_parser) def parser_for_pipeline(pipeline): if pipeline == 'train': return ModelParser( 'quetch', 'train', title=QUETCH.title, options_fn=add_training_options, api_module=QUETCH, ) if pipeline == 'predict': return ModelParser( 'quetch', 'predict', title=QUETCH.title, options_fn=add_predicting_options, api_module=QUETCH, ) return None PK!wtQQkiwi/cli/opts.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see <https://www.gnu.org/licenses/>. # import argparse from distutils.util import strtobool from pathlib import Path from kiwi import constants as const class PathType(object): """Factory for creating pathlib.Path objects Instances of PathType should be passed as type= arguments to the ArgumentParser add_argument() method. Strongly based on argparse.FileType. Keyword Arguments: - exists -- Whether the file must exist or not.
""" def __init__(self, exists=False): self._must_exist = exists def __call__(self, string): if not string: return string # The special argument "-" means sys.std{in,out} in argparse.FileType if string == '-': msg = ( "argument type PathType does not support '-' for referring " "to sys.std{in,out}" ) raise ValueError(msg) # all other arguments are used as file names path = Path(string) if self._must_exist and not path.exists(): message = 'path must exist: {}'.format(string) raise argparse.ArgumentTypeError(message) return str(path) def __repr__(self): arg_str = repr(self._must_exist) return '{}({})'.format(type(self).__name__, arg_str) def io_opts(parser): # Logging group = parser.add_argument_group('I/O') group.add_argument( '--save-config', required=False, type=PathType(exists=False), is_write_out_config_file_arg=False, # Setting it to true makes it save and exit help='Save parsed configuration and arguments to the specified file', ) group.add_argument( '-d', '--debug', action='store_true', help='Output additional messages.' ) group.add_argument( '-q', '--quiet', action='store_true', help='Only output warning and error messages.', ) def logging_opts(parser): # Logging options group = parser.add_argument_group('Logging') group.add_argument( '--log-interval', type=int, default=100, help='Log every k batches.' ) group.add_argument( '--mlflow-tracking-uri', type=str, default='mlruns/', help='If using MLflow, logs model parameters, training metrics, and ' 'artifacts (files) to this MLflow server. Uses the localhost by ' 'default.', ) group.add_argument( '--experiment-name', required=False, help='If using MLflow, it will log this run under this experiment ' 'name, which appears as a separate section' 'in the UI. It will also be used in some messages and files.', ) group.add_argument( '--run-uuid', required=False, help='If specified, MLflow/Default Logger will log metrics and params ' 'under this ID. If it exists, the run status will ' 'change to running. This ID is also used for creating ' 'this run\'s output directory. ' '(Run ID must be a 32-character hex string)', ) group.add_argument( '--output-dir', type=str, help='Output several files for this run under this directory. ' 'If not specified, a directory under "runs" is created ' 'or reused based on the Run UUID. ' 'Files might also be sent to MLflow depending on the ' '--mlflow-always-log-artifacts option.', ) group.add_argument( '--mlflow-always-log-artifacts', type=lambda x: bool(strtobool(x)), nargs='?', const=True, default=False, help='If using MLFlow, always log (send) artifacts (files) to MLflow ' 'artifacts URI. By default (false), artifacts are only logged if' 'MLflow is a remote server (as specified by --mlflow-tracking-uri ' 'option). All generated files are always saved in --output-dir, so it ' 'might be considered redundant to copy them to a local MLflow ' 'server. 
If this is not the case, set this option to true.', ) def general_opts(parser): # Data processing options group = parser.add_argument_group('random') group.add_argument('--seed', type=int, default=42, help='Random seed') # Cuda group = parser.add_argument_group('gpu') group.add_argument( '--gpu-id', default=None, type=int, help='Use CUDA on the listed devices', ) def save_load_opts(parser): group = parser.add_argument_group('save-load') group.add_argument( '--load-model', type=PathType(exists=True), help='Directory containing a {} file to be loaded'.format( const.MODEL_FILE ), ) group.add_argument( '--save-data', type=str, help='Output dir for saving the preprocessed data files.', ) group.add_argument( '--load-data', type=PathType(exists=True), help='Input dir for loading the preprocessed data files.', ) group.add_argument( '--load-vocab', type=PathType(exists=True), help='Directory containing a {} file to be loaded'.format( const.VOCAB_FILE ), ) PK!hkiwi/cli/pipelines/__init__.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see . # PK!jE}}kiwi/cli/pipelines/evaluate.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see . # import logging from kiwi.cli.better_argparse import PipelineParser from kiwi.cli.opts import PathType from kiwi.lib import evaluate logger = logging.getLogger(__name__) def evaluate_opts(parser): # Evaluation options group = parser.add_argument_group("Evaluation of WMT Quality Estimation") group.add_argument( "--type", help="Input type for prediction file", choices=["probs", "tags"], type=str, default="probs", ) group.add_argument( "--format", help="Input format for gold files", choices=["wmt17", "wmt18"], type=str, default="wmt17", ) group.add_argument( "--pred-format", help="Input format for predicted files. Defaults to the same as " "--format.", choices=["wmt17", "wmt18"], type=str, default="wmt18", ) group.add_argument( "--sents-avg", help="Obtain scores for sentences by averaging over tags or " "probabilities.", choices=["probs", "tags"], type=str, # default=None ) # Gold files. group.add_argument( "--gold-sents", help="Sentences gold standard. 
", type=PathType(exists=True), required=False, ) group.add_argument( "--gold-target", help="Target tags gold standard, or target and gaps " 'if format == "wmt18".', type=PathType(exists=True), required=False, ) group.add_argument( "--gold-source", help="Source tags gold standard.", type=PathType(exists=True), required=False, ) group.add_argument( "--gold-cal", help="Target Tags to calibrate.", type=PathType(exists=True), required=False, ) # Prediction Files group.add_argument( "--input-dir", help="Directory with prediction files generated by predict pipeline. " "Setting this argument will evaluate all predictions for " "which a gold file is set.", nargs="+", type=PathType(exists=True), # required=True ) group.add_argument( "--pred-sents", help="Sentences HTER predictions.", type=PathType(exists=True), nargs="+", required=False, ) group.add_argument( "--pred-target", help="Target predictions; can be tags or probabilities (of BAD). " "See --type.", type=PathType(exists=True), nargs="+", required=False, ) group.add_argument( "--pred-gaps", help="Gap predictions; can be tags or probabilities (of BAD). " "(see --type). Use this option for files that only contain gap " "tags.", type=PathType(exists=True), nargs="+", required=False, ) group.add_argument( "--pred-source", help="Source predictions. can be tags or probabilities (of BAD). " " See --type.", type=PathType(exists=True), nargs="+", required=False, ) group.add_argument( "--pred-cal", help="Target Predictions to calibrate.", type=PathType(exists=True), required=False, ) def build_parser(): return PipelineParser( name="evaluate", model_parsers=None, options_fn=evaluate_opts, add_io_options=True, add_general_options=False, add_logging_options=False, add_save_load_options=False, ) def main(argv=None): parser = build_parser() options = parser.parse(args=argv) evaluate.evaluate_from_options(options) if __name__ == "__main__": main() PK! 22kiwi/cli/pipelines/jackknife.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see . 
# import logging from kiwi.cli.better_argparse import HyperPipelineParser from kiwi.cli.opts import PathType from kiwi.cli.pipelines import train from kiwi.lib import jackknife logger = logging.getLogger(__name__) def jackknife_opts(parser): # Training loop options group = parser.add_argument_group('jackknifing') group.add( '--splits', required=False, type=int, default=5, help='Jackknife with X folds.', ) group.add( '--train-config', required=False, type=PathType(exists=True), help='Path to config file with model parameters.', ) def build_parser(): return HyperPipelineParser( name='jackknife', pipeline_parser=train.build_parser(), pipeline_config_key='train-config', options_fn=jackknife_opts, ) def main(argv=None): parser = build_parser() options = parser.parse(args=argv) jackknife.run_from_options(options) if __name__ == '__main__': # pragma: no cover main() # pragma: no cover PK!q::kiwi/cli/pipelines/predict.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see . # import logging from kiwi.cli.better_argparse import PipelineParser from kiwi.cli.models import linear, nuqe, predictor_estimator, quetch from kiwi.lib import predict logger = logging.getLogger(__name__) def predict_opts(parser): group = parser.add_argument_group("predicting") group.add_argument( "--batch-size", type=int, default=64, help="Maximum batch size for predicting.", ) def build_parser(): return PipelineParser( name="predict", model_parsers=[ nuqe.parser_for_pipeline("predict"), predictor_estimator.parser_for_pipeline("predict"), quetch.parser_for_pipeline("predict"), linear.parser_for_pipeline("predict"), ], options_fn=predict_opts, ) def main(argv=None): parser = build_parser() options = parser.parse(args=argv) # is this needed? if options is None: return predict.predict_from_options(options) if __name__ == "__main__": # pragma: no cover main() # pragma: no cover PK!t3ڕkiwi/cli/pipelines/train.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see . 
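# ---------------------------------------------------------------------------
# Illustrative sketch (not from the OpenKiwi sources; assumes OpenKiwi is
# installed). train_opts below only needs the standard add_argument_group /
# add_argument interface, so it can be exercised with a plain ArgumentParser:
import argparse
from kiwi.cli.pipelines.train import train_opts

_p = argparse.ArgumentParser()
train_opts(_p)
opts = _p.parse_args(['--epochs', '10', '--learning-rate', '0.001'])
assert opts.epochs == 10 and opts.learning_rate == 0.001
# ---------------------------------------------------------------------------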
# import logging from distutils.util import strtobool from kiwi.cli.better_argparse import PipelineParser from kiwi.cli.models import linear, nuqe, predictor, predictor_estimator, quetch from kiwi.lib import train logger = logging.getLogger(__name__) def train_opts(parser): # Training loop options group = parser.add_argument_group('training') group.add_argument( '--epochs', type=int, default=50, help='Number of epochs for training.' ) group.add_argument( '--train-batch-size', type=int, default=64, help='Maximum batch size for training.', ) group.add_argument( '--valid-batch-size', type=int, default=64, help='Maximum batch size for evaluating.', ) # Optimization options group = parser.add_argument_group('training-optimization') group.add_argument( '--optimizer', default='adam', choices=['sgd', 'adagrad', 'adadelta', 'adam', 'sparseadam'], help='Optimization method.', ) group.add_argument( '--learning-rate', type=float, default=1.0, help='Starting learning rate. ' 'Recommended settings: sgd = 1, adagrad = 0.1, ' 'adadelta = 1, adam = 0.001', ) group.add_argument( '--learning-rate-decay', type=float, default=1.0, help='Decay learning rate by this factor. ', ) group.add_argument( '--learning-rate-decay-start', type=int, default=0, help='Start decay after this epoch.', ) # Saving and resuming options group = parser.add_argument_group('training-save-load') group.add_argument( '--checkpoint-validation-steps', type=int, default=0, help='Perform validation every X training batches. Saves model' ' if `checkpoint-save` is true.', ) group.add_argument( '--checkpoint-save', type=lambda x: bool(strtobool(x)), nargs='?', const=True, default=True, help='Save a training snapshot when validation is run. If false ' 'it will never save the model.', ) group.add_argument( '--checkpoint-keep-only-best', type=int, default=1, help='Keep only n best models according to main metric (F1Mult ' 'by default); 0 will keep all.', ) group.add_argument( '--checkpoint-early-stop-patience', type=int, default=0, help='Stop training if evaluation metrics do not improve after X ' 'validations; 0 disables this.', ) group.add_argument( '--resume', type=lambda x: bool(strtobool(x)), nargs='?', const=True, default=False, help='Resume training a previous run. ' 'The --run-uuid (and possibly --experiment-name) ' 'option must be specified. Files are then searched ' 'under the "runs" directory. If not found, they are ' 'downloaded from the MLflow server ' '(check the --mlflow-tracking-uri option).', ) def build_parser(): return PipelineParser( name='train', model_parsers=[ nuqe.parser_for_pipeline('train'), predictor_estimator.parser_for_pipeline('train'), predictor.parser_for_pipeline('train'), quetch.parser_for_pipeline('train'), linear.parser_for_pipeline('train'), ], options_fn=train_opts, ) def main(argv=None): parser = build_parser() options = parser.parse(args=argv) train.train_from_options(options) if __name__ == '__main__': # pragma: no cover main() # pragma: no cover PK!kiwi/constants.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see <https://www.gnu.org/licenses/>. # # lowercased special tokens UNK = '<unk>' PAD = '<pad>' START = '<start>' STOP = '<stop>' UNALIGNED = '<unaligned>' # special tokens id (don't edit this order) # FIXME: avoid using these IDs since we don't really make sure they correspond # to the above tokens UNK_ID = 0 PAD_ID = 1 START_ID = 2 STOP_ID = 3 UNALIGNED_ID = 4 PAD_TAGS_ID = 2 # binary labels OK = 'OK' BAD = 'BAD' OK_ID = 0 BAD_ID = 1 LABELS = [OK, BAD] # this should be removed in the future # fields SOURCE = 'source' TARGET = 'target' PE = 'pe' TARGET_TAGS = 'tags' SOURCE_TAGS = 'source_tags' GAP_TAGS = 'gap_tags' TAGS = [TARGET_TAGS, SOURCE_TAGS, GAP_TAGS] SENTENCE_SCORES = 'sentence_scores' BINARY = 'binary' TARGETS = [SENTENCE_SCORES, BINARY] + TAGS ALIGNMENTS = 'alignments' SOURCE_POS = 'source_pos' TARGET_POS = 'target_pos' TARGET_PARSE_HEADS = 'target_parse_heads' TARGET_PARSE_RELATIONS = 'target_parse_relations' TARGET_NGRAM_LEFT = 'target_ngram_left' TARGET_NGRAM_RIGHT = 'target_ngram_right' TARGET_STACKED = 'target_stacked' # Constants for model output names SENT_SIGMA = 'sentence_sigma' LOSS = 'loss' PREQEFV = 'PreQEFV' POSTQEFV = 'PostQEFV' # Standard Names for saving files TRAIN = 'train' DEV = 'dev' TEST = 'test' EVAL = 'eval' VOCAB = 'vocab' CONFIG = 'config' STATE_DICT = 'state_dict' VOCAB_FILE = 'vocab.torch' MODEL_FILE = 'model.torch' DATAFILE = 'dataset.torch' OPTIMIZER = 'optim.torch' BEST_MODEL_FILE = 'best_model.torch' TRAINER = 'trainer.torch' EPOCH = 'epoch' PK!hkiwi/data/__init__.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see <https://www.gnu.org/licenses/>. # PK!|kiwi/data/builders.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see <https://www.gnu.org/licenses/>.
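# ---------------------------------------------------------------------------
# Illustrative sketch (not from the OpenKiwi sources; assumes OpenKiwi is
# installed). The constants above pair the OK/BAD label strings with fixed
# ids, so tags can be mapped back and forth via LABELS:
from kiwi import constants as const

tags = ['OK', 'BAD', 'OK']
tag_ids = [const.LABELS.index(tag) for tag in tags]
assert tag_ids == [const.OK_ID, const.BAD_ID, const.OK_ID]
# ---------------------------------------------------------------------------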
# from functools import partial from pathlib import Path from kiwi.data.corpus import Corpus from kiwi.data.fieldsets import extend_vocabs_fieldset from kiwi.data.fieldsets.fieldset import Fieldset from kiwi.data.qe_dataset import QEDataset from kiwi.data.utils import ( build_vocabulary, filter_len, load_vocabularies_to_datasets, ) def build_dataset(fieldset, prefix='', filter_pred=None, **kwargs): fields, files = fieldset.fields_and_files(prefix, **kwargs) examples = Corpus.from_files(fields=fields, files=files) dataset = QEDataset( examples=examples, fields=fields, filter_pred=filter_pred ) return dataset def build_training_datasets( fieldset, split=0.0, valid_source=None, valid_target=None, load_vocab=None, **kwargs ): """Build a training and validation QE datasets. Required Args: fieldset (Fieldset): specific set of fields to be used (depends on the model to be used). train_source: Train Source train_target: Train Target (MT) Optional Args (depends on the model): train_pe: Train Post-edited train_target_tags: Train Target Tags train_source_tags: Train Source Tags train_sentence_scores: Train HTER scores valid_source: Valid Source valid_target: Valid Target (MT) valid_pe: Valid Post-edited valid_target_tags: Valid Target Tags valid_source_tags: Valid Source Tags valid_sentence_scores: Valid HTER scores split (float): If no validation sets are provided, randomly sample 1 - split of training examples as validation set. target_vocab_size: Maximum Size of target vocabulary source_vocab_size: Maximum Size of source vocabulary target_max_length: Maximum length for target field target_min_length: Minimum length for target field source_max_length: Maximum length for source field source_min_length: Minimum length for source field target_vocab_min_freq: Minimum word frequency target field source_vocab_min_freq: Minimum word frequency source field load_vocab: Path to existing vocab file Returns: A training and a validation Dataset. """ # TODO: improve handling these length options (defaults are set multiple # times). 
filter_pred = partial( filter_len, source_min_length=kwargs.get('source_min_length', 1), source_max_length=kwargs.get('source_max_length', float('inf')), target_min_length=kwargs.get('target_min_length', 1), target_max_length=kwargs.get('target_max_length', float('inf')), ) train_dataset = build_dataset( fieldset, prefix=Fieldset.TRAIN, filter_pred=filter_pred, **kwargs ) if valid_source and valid_target: valid_dataset = build_dataset( fieldset, prefix=Fieldset.VALID, filter_pred=filter_pred, valid_source=valid_source, valid_target=valid_target, **kwargs, ) elif split: if not 0.0 < split < 1.0: raise Exception( 'Invalid data split value: {}; it must be in the ' '(0, 1) interval.'.format(split) ) train_dataset, valid_dataset = train_dataset.split(split) else: raise Exception('Validation data not provided.') if load_vocab: vocab_path = Path(load_vocab) load_vocabularies_to_datasets(vocab_path, train_dataset, valid_dataset) # Even if vocab is loaded, we need to build the vocabulary # in case fields are missing datasets_for_vocab = [train_dataset] if kwargs.get('extend_source_vocab') or kwargs.get('extend_target_vocab'): vocabs_fieldset = extend_vocabs_fieldset.build_fieldset(fieldset) extend_vocabs_ds = build_dataset(vocabs_fieldset, **kwargs) datasets_for_vocab.append(extend_vocabs_ds) fields_vocab_options = fieldset.fields_vocab_options(**kwargs) build_vocabulary(fields_vocab_options, *datasets_for_vocab) return train_dataset, valid_dataset def build_test_dataset(fieldset, load_vocab=None, **kwargs): """Build a test QE dataset. Args: fieldset (Fieldset): specific set of fields to be used (depends on the model to be used.) load_vocab: A path to a saved vocabulary. Returns: A Dataset object. """ test_dataset = build_dataset(fieldset, prefix=Fieldset.TEST, **kwargs) fields_vocab_options = fieldset.fields_vocab_options(**kwargs) if load_vocab: vocab_path = Path(load_vocab) load_vocabularies_to_datasets(vocab_path, test_dataset) else: build_vocabulary(fields_vocab_options, test_dataset) return test_dataset PK!cUg"!"!kiwi/data/corpus.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see . # from torchtext import data class Corpus: def __init__(self, fields_examples=None, dataset_fields=None): """Create a Corpus by specifying examples and fields. Arguments: fields_examples: A list of lists of field values per example. dataset_fields: A list of pairs (field name, field object). Both lists have the same size (number of fields). 
""" self.fields_examples = ( fields_examples if fields_examples is not None else [] ) self.dataset_fields = ( dataset_fields if dataset_fields is not None else [] ) self.number_of_examples = ( len(self.fields_examples[0]) if self.fields_examples else 0 ) def examples_per_field(self): examples = { field: examples for (field, _), examples in zip( self.dataset_fields, self.fields_examples ) } return examples @classmethod def from_files(cls, fields, files): """Create a QualityEstimationDataset given paths and fields. Arguments: fields: A dict between field name and field object. files: A dict between field name and file dict (with 'name' and 'format' keys). """ fields_examples = [] dataset_fields = [] # first load the data for each field for attrib_name, field in fields.items(): file_dict = files[attrib_name] file_name = file_dict['name'] reader = file_dict['reader'] if not reader: with open(file_name, 'r', encoding='utf8') as f: fields_values_for_example = [line.strip() for line in f] else: fields_values_for_example = reader(file_name) fields_examples.append(fields_values_for_example) dataset_fields.append((attrib_name, field)) # then add each corresponding sentence from each field nb_lines = [len(fe) for fe in fields_examples] assert min(nb_lines) == max(nb_lines) # Assert files have the same size return cls(fields_examples, dataset_fields) @staticmethod def read_tabular_file(file_path, sep='\t', extract_column=None): examples = [] line_values = [] with open(file_path, 'r', encoding='utf8') as f: num_columns = None for line_num, line in enumerate(f): line = line.rstrip() if line: values = line.split(sep) line_values.append(values) if num_columns is None: num_columns = len(values) if extract_column is not None and ( extract_column < 1 or extract_column > num_columns ): raise IndexError( 'Cannot extract column {} (of {})'.format( extract_column, num_columns ) ) elif len(values) != num_columns: raise IndexError( 'Number of columns ({}) in line {} is different ' '({}) for file: {}'.format( len(values), line_num + 1, num_columns, file_path, ) ) else: if extract_column is not None: examples.append( ' '.join( [ values[extract_column - 1] for values in line_values ] ) ) else: examples.append( [ ' '.join([values[i] for values in line_values]) for i in range(num_columns) ] ) line_values = [] if line_values: # Add trailing lines before EOF. if extract_column is not None: examples.append( ' '.join( [values[extract_column - 1] for values in line_values] ) ) else: examples.append( [ ' '.join([values[i] for values in line_values]) for i in range(num_columns) ] ) return examples @classmethod def from_tabular_file(cls, fields, file_fields, file_path, sep='\t'): """Create a QualityEstimationDataset given paths and fields. Arguments: fields: A dict between field name and field object. file_fields: A list of field names for each column of the file (by order). File fields not in fields will be ignored, but every field in fields should correspond to some column. file_path: Path to the tabular file. """ fields_examples = [] dataset_fields = [] examples = {field_name: [] for field_name in fields.keys()} example_values = [] with open(file_path, 'r', encoding='utf8') as f: for line in f: line = line.rstrip() if line: values = line.split(sep) example_values.append(values) else: for i, field_name in enumerate(file_fields): if field_name not in fields: # TODO continue examples[field_name].append( ' '.join([values[i] for values in example_values]) ) example_values = [] if example_values: # Add trailing lines before EOF. 
for i, field_name in enumerate(file_fields): if field_name not in fields: continue examples[field_name].append( ' '.join([values[i] for values in example_values]) ) for attrib_name, field in fields.items(): fields_examples.append(examples[attrib_name]) dataset_fields.append((attrib_name, field)) # then add each corresponding sentence from each field nb_lines = [len(fe) for fe in fields_examples] assert min(nb_lines) == max(nb_lines) # Assert files have the same size return cls(fields_examples, dataset_fields) def __iter__(self): for j in range(self.number_of_examples): fields_values_for_example = [ self.fields_examples[i][j] for i in range(len(self.dataset_fields)) ] yield data.Example.fromlist( fields_values_for_example, self.dataset_fields ) def paste_fields(self, corpus): """Pastes (appends) fields from another corpus. Arguments: corpus: A corpus object. Must have the same number of examples as the current corpus. """ assert self.number_of_examples == corpus.number_of_examples self.fields_examples += corpus.fields_examples self.dataset_fields += corpus.dataset_fields PK!hkiwi/data/fields/__init__.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see . # PK!aJJ#kiwi/data/fields/alignment_field.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see . # from torchtext import data class AlignmentField(data.Field): def process(self, batch, *args, **kwargs): """ Process a list of examples to create a batch. Postprocess the batch with user-provided Pipeline. Args: batch (list(object)): A list of object from a batch of examples. Returns: object: Processed object given the input and custom postprocessing Pipeline. """ if self.postprocessing is not None: batch = self.postprocessing(batch) return batch PK!; kiwi/data/fields/qe_field.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. 
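# ---------------------------------------------------------------------------
# Illustrative sketch (not from the OpenKiwi sources; assumes OpenKiwi is
# installed; the tiny CoNLL-like file below is made up). Corpus.read_tabular_file
# above groups blank-line-separated blocks; extract_column (1-based) pulls a
# single column out of each block:
import os
import tempfile
from kiwi.data.corpus import Corpus

with tempfile.NamedTemporaryFile('w', suffix='.tsv', delete=False) as f:
    f.write('1\tThe\n2\tcat\n\n1\tA\n2\tdog\n')
    path = f.name
assert Corpus.read_tabular_file(path, extract_column=2) == ['The cat', 'A dog']
os.remove(path)
# ---------------------------------------------------------------------------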
# # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see . # from collections.__init__ import Counter, OrderedDict from itertools import chain from torchtext import data from kiwi.constants import PAD, START, STOP, UNALIGNED, UNK from kiwi.data.vocabulary import Vocabulary class QEField(data.Field): def __init__( self, unaligned_token=UNALIGNED, unk_token=UNK, pad_token=PAD, init_token=START, eos_token=STOP, **kwargs ): kwargs.setdefault('batch_first', True) super().__init__(**kwargs) self.unk_token = unk_token self.pad_token = pad_token self.init_token = init_token self.eos_token = eos_token self.unaligned_token = unaligned_token self.vocab = None self.vocab_cls = Vocabulary def build_vocab(self, *args, **kwargs): """Add unaligned_token to the list of special symbols.""" counter = Counter() sources = [] for arg in args: if isinstance(arg, data.Dataset): sources += [ getattr(arg, name) for name, field in arg.fields.items() if field is self ] else: sources.append(arg) for sample in sources: for x in sample: if not self.sequential: x = [x] try: counter.update(x) except TypeError: counter.update(chain.from_iterable(x)) specials = list( OrderedDict.fromkeys( tok for tok in [ self.unk_token, self.pad_token, self.init_token, self.eos_token, self.unaligned_token, ] if tok is not None ) ) self.vocab = self.vocab_cls(counter, specials=specials, **kwargs) PK![Ӆ//)kiwi/data/fields/sequence_labels_field.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see . # from collections import Counter from torchtext.data import Field class SequenceLabelsField(Field): """Sequence of Labels. """ def __init__(self, classes, *args, **kwargs): self.classes = classes self.vocab = None super().__init__(*args, **kwargs) def build_vocab(self, *args, **kwargs): specials = self.classes + [ self.pad_token, self.init_token, self.eos_token, ] self.vocab = self.vocab_cls(Counter(), specials=specials, **kwargs) PK!hkiwi/data/fieldsets/__init__.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. 
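# ---------------------------------------------------------------------------
# Illustrative sketch (not from the OpenKiwi sources). QEField.build_vocab
# above dedupes its special tokens while preserving order via
# OrderedDict.fromkeys; the same idiom in isolation (token spellings here
# are placeholders):
from collections import OrderedDict

toks = ['<unk>', '<pad>', '<pad>', None, '<start>', '<stop>', '<unaligned>']
specials = list(OrderedDict.fromkeys(t for t in toks if t is not None))
assert specials == ['<unk>', '<pad>', '<start>', '<stop>', '<unaligned>']
# ---------------------------------------------------------------------------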
# # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see . # PK!-kiwi/data/fieldsets/extend_vocabs_fieldset.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see . # from kiwi import constants as const from kiwi.data.fieldsets.fieldset import Fieldset def build_fieldset(base_fieldset): source_field = base_fieldset.fields[const.SOURCE] target_field = base_fieldset.fields[const.TARGET] source_vocab_options = dict( min_freq='source_vocab_min_frequency', max_size='source_vocab_size' ) target_vocab_options = dict( min_freq='target_vocab_min_frequency', max_size='target_vocab_size' ) extend_vocabs = Fieldset() extend_vocabs.add( name=const.SOURCE, field=source_field, file_option_suffix='extend_source_vocab', required=None, vocab_options=source_vocab_options, ) extend_vocabs.add( name=const.TARGET, field=target_field, file_option_suffix='extend_target_vocab', required=None, vocab_options=target_vocab_options, ) return extend_vocabs PK!kiwi/data/fieldsets/fieldset.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see . # from functools import partial from kiwi.data.vectors import AvailableVectors class Fieldset: ALL = 'all' TRAIN = 'train' VALID = 'valid' TEST = 'test' def __init__(self): """ """ self._fields = {} self._options = {} self._required = {} self._vocab_options = {} self._vocab_vectors = {} self._file_reader = {} def add( self, name, field, file_option_suffix, required=ALL, vocab_options=None, vocab_vectors=None, file_reader=None, ): """ Args: name: field: file_option_suffix: required (str or list or None): file_reader (callable): by default, uses Corpus.from_files(). 
Returns: """ self._fields[name] = field self._options[name] = file_option_suffix if not isinstance(required, list): required = [required] self._required[name] = required self._file_reader[name] = file_reader if vocab_options is None: vocab_options = {} self._vocab_options[name] = vocab_options self._vocab_vectors[name] = vocab_vectors @property def fields(self): return self._fields def is_required(self, name, set_name): required = self._required[name] if set_name in required or self.ALL in required: return True else: return False def fields_and_files(self, set_name, **files_options): fields = {} files = {} for name, file_option_suffix in self._options.items(): file_option = '{}{}'.format(set_name, file_option_suffix) file_name = files_options.get(file_option) if not file_name and self.is_required(name, set_name): raise FileNotFoundError( 'File {} is required (use the {} ' 'option).'.format(file_name, file_option.replace('_', '-')) ) elif file_name: files[name] = { 'name': file_name, 'reader': self._file_reader.get(name), } fields[name] = self._fields[name] return fields, files # def files_formats(self): # return { # set_name: self._file_format.get(set_name) # for set_name in self._fields # } # def vocab_kwargs(self, name, **kwargs): if name not in self._vocab_options: raise KeyError( 'Field named "{}" does not exist in this fieldset'.format(name) ) vkwargs = {} for argument, option_name in self._vocab_options[name].items(): option_value = kwargs.get(option_name) if option_value is not None: vkwargs[argument] = option_value return vkwargs def vocab_vectors_loader( self, name, embeddings_format='polyglot', embeddings_binary=False, **kwargs ): if name not in self._vocab_vectors: raise KeyError( 'Field named "{}" does not exist in this fieldset'.format(name) ) def no_vectors_fn(): return None vectors_fn = no_vectors_fn option_name = self._vocab_vectors[name] if option_name: option_value = kwargs.get(option_name) if option_value: emb_model = AvailableVectors[embeddings_format] # logger.info('Loading {} embeddings from {}'.format( # name, option_value)) vectors_fn = partial( emb_model, option_value, binary=embeddings_binary ) return vectors_fn def vocab_vectors(self, name, **kwargs): vectors_fn = self.vocab_vectors_loader(name, **kwargs) return vectors_fn() def fields_vocab_options(self, **kwargs): vocab_options = {} for name, field in self.fields.items(): vocab_options[name] = dict( vectors_fn=self.vocab_vectors_loader(name, **kwargs) ) vocab_options[name].update(self.vocab_kwargs(name, **kwargs)) return vocab_options PK! kiwi/data/fieldsets/linear.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see . 
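# --- Usage sketch (illustrative, not part of the original source) ---
# Registering a field in a Fieldset and resolving the files a split needs.
# Option names are built as '<set_name><file_option_suffix>', so the
# 'train' split looks for a 'train_source' keyword here. Paths are made up.
from torchtext import data
from kiwi.data.fieldsets.fieldset import Fieldset

demo_fieldset = Fieldset()
demo_fieldset.add(
    name='source',
    field=data.Field(tokenize=str.split),
    file_option_suffix='_source',
    required=Fieldset.TRAIN,
)
demo_fields, demo_files = demo_fieldset.fields_and_files(
    Fieldset.TRAIN, train_source='train.src'
)
assert demo_files['source']['name'] == 'train.src'
# Omitting train_source here would raise FileNotFoundError instead.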
# from functools import partial from torchtext import data from kiwi import constants as const from kiwi.data.corpus import Corpus from kiwi.data.fields.alignment_field import AlignmentField from kiwi.data.fields.qe_field import QEField from kiwi.data.fieldsets.fieldset import Fieldset from kiwi.data.tokenizers import align_tokenizer, tokenizer def build_fieldset(): fs = Fieldset() source_vocab_options = dict( min_freq='source_vocab_min_frequency', max_size='source_vocab_size' ) target_vocab_options = dict( min_freq='target_vocab_min_frequency', max_size='target_vocab_size' ) source_field = QEField(tokenize=tokenizer) target_field = QEField(tokenize=tokenizer) source_pos = QEField(tokenize=tokenizer) target_pos = QEField(tokenize=tokenizer) target_tags_field = data.Field( tokenize=tokenizer, pad_token=None, unk_token=None ) fs.add( name=const.SOURCE, field=source_field, file_option_suffix='_source', required=Fieldset.ALL, vocab_options=source_vocab_options, ) fs.add( name=const.TARGET, field=target_field, file_option_suffix='_target', required=Fieldset.ALL, vocab_options=target_vocab_options, ) fs.add( name=const.ALIGNMENTS, field=AlignmentField(tokenize=align_tokenizer, use_vocab=False), file_option_suffix='_alignments', required=Fieldset.ALL, ) fs.add( name=const.TARGET_TAGS, field=target_tags_field, file_option_suffix='_target_tags', required=[Fieldset.TRAIN, Fieldset.VALID], ) fs.add( name=const.SOURCE_POS, field=source_pos, file_option_suffix='_source_pos', required=None, ) fs.add( name=const.TARGET_POS, field=target_pos, file_option_suffix='_target_pos', required=None, ) target_stacked = data.Field(tokenize=tokenizer) fs.add( name=const.TARGET_STACKED, field=target_stacked, file_option_suffix='_target_stacked', file_reader=partial(Corpus.read_tabular_file, extract_column=1), required=None, ) target_parse_heads = data.Field(tokenize=tokenizer, use_vocab=False) target_parse_relations = data.Field(tokenize=tokenizer) fs.add( name=const.TARGET_PARSE_HEADS, field=target_parse_heads, file_option_suffix='_target_parse', file_reader=partial(Corpus.read_tabular_file, extract_column=1), required=None, ) fs.add( name=const.TARGET_PARSE_RELATIONS, field=target_parse_relations, file_option_suffix='_target_parse', file_reader=partial(Corpus.read_tabular_file, extract_column=2), required=None, ) target_ngram_left = data.Field(tokenize=tokenizer) target_ngram_right = data.Field(tokenize=tokenizer) fs.add( name=const.TARGET_NGRAM_LEFT, field=target_ngram_left, file_option_suffix='_target_ngram', file_reader=partial(Corpus.read_tabular_file, extract_column=1), required=None, ) fs.add( name=const.TARGET_NGRAM_RIGHT, field=target_ngram_right, file_option_suffix='_target_ngram', file_reader=partial(Corpus.read_tabular_file, extract_column=2), required=None, ) return fs # # def build_test_dataset(options): # source_field = QEField(tokenize=tokenizer) # target_field = QEField(tokenize=tokenizer) # source_pos = QEField(tokenize=tokenizer) # target_pos = QEField(tokenize=tokenizer) # alignments_field = AlignmentField( # tokenize=align_tokenizer, use_vocab=False) # target_tags_field = data.Field( # tokenize=tokenizer, pad_token=None, unk_token=None # ) # target_parse_heads = data.Field(tokenize=tokenizer, use_vocab=False) # target_parse_relations = data.Field(tokenize=tokenizer) # target_ngram_left = data.Field(tokenize=tokenizer) # target_ngram_right = data.Field(tokenize=tokenizer) # target_stacked = data.Field(tokenize=tokenizer) # # fields = { # const.SOURCE: source_field, # const.TARGET: target_field, # 
const.ALIGNMENTS: alignments_field, # const.TARGET_TAGS: target_tags_field # } # # test_files = { # const.SOURCE: options.test_source, # const.TARGET: options.test_target, # const.TARGET_TAGS: options.test_target_tags, # const.ALIGNMENTS: options.test_alignments, # } # # if options.test_target_parse: # parse_fields = { # const.TARGET_PARSE_HEADS: target_parse_heads, # const.TARGET_PARSE_RELATIONS: target_parse_relations, # } # parse_file_fields = [ # '', # '', # '', # '', # '', # const.TARGET_PARSE_HEADS, # const.TARGET_PARSE_RELATIONS, # ] # # if options.test_target_ngram: # ngram_fields = { # const.TARGET_NGRAM_LEFT: target_ngram_left, # const.TARGET_NGRAM_RIGHT: target_ngram_right, # } # ngram_file_fields = [ # '', '', '', '', '', '', '', '', '', '', '', '', '', # const.TARGET_NGRAM_LEFT, # const.TARGET_NGRAM_RIGHT, # ] # # if options.test_target_stacked: # stacked_fields = {const.TARGET_STACKED: target_stacked} # stacked_file_fields = [const.TARGET_STACKED] # # if options.test_source_pos: # fields[const.SOURCE_POS] = source_pos # test_files[const.SOURCE_POS] = options.test_source_pos # # if options.test_target_pos: # fields[const.TARGET_POS] = target_pos # test_files[const.TARGET_POS] = options.test_target_pos # # if options.test_target_parse: # test_target_parse_file = options.test_target_parse # # if options.test_target_ngram: # test_target_ngram_file = options.test_target_ngram # # if options.test_target_stacked: # test_target_stacked_file = options.test_target_stacked # # def filter_len(x): # return ( # options.source_min_length # <= len(x.source) # <= options.source_max_length # ) and ( # options.target_min_length # <= len(x.target) # <= options.target_max_length # ) # # test_examples = Corpus.from_files(fields=fields, files=test_files) # if options.test_target_parse: # test_examples.paste_fields( # Corpus.from_tabular_file( # fields=parse_fields, # file_fields=parse_file_fields, # file_path=test_target_parse_file, # ) # ) # if options.test_target_ngram: # test_examples.paste_fields( # Corpus.from_tabular_file( # fields=ngram_fields, # file_fields=ngram_file_fields, # file_path=test_target_ngram_file, # ) # ) # if options.test_target_stacked: # test_examples.paste_fields( # Corpus.from_tabular_file( # fields=stacked_fields, # file_fields=stacked_file_fields, # file_path=test_target_stacked_file, # ) # ) # # dataset = QEDataset( # examples=test_examples, # fields=test_examples.dataset_fields, # filter_pred=filter_len, # ) # # return dataset PK!{ kiwi/data/fieldsets/predictor.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see . 
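# --- Usage sketch (illustrative, not part of the original source) ---
# What the linear-model fieldset defined above requires at training time:
# source, target, alignments and target tags; POS, parse, n-gram and
# stacked files stay optional. File paths are made up.
from kiwi.data.fieldsets.fieldset import Fieldset
from kiwi.data.fieldsets.linear import build_fieldset as build_linear_fieldset

linear_fs = build_linear_fieldset()
linear_fields, linear_files = linear_fs.fields_and_files(
    Fieldset.TRAIN,
    train_source='train.src',
    train_target='train.mt',
    train_alignments='train.align',
    train_target_tags='train.tags',
)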
# from torchtext import data from kiwi import constants as const from kiwi.data.fields.sequence_labels_field import SequenceLabelsField from kiwi.data.fieldsets.fieldset import Fieldset from kiwi.data.tokenizers import tokenizer def build_text_field(): return data.Field( tokenize=tokenizer, init_token=const.START, batch_first=True, eos_token=const.STOP, pad_token=const.PAD, unk_token=const.UNK, ) def build_label_field(postprocessing=None): return SequenceLabelsField( classes=const.LABELS, tokenize=tokenizer, pad_token=const.PAD, batch_first=True, postprocessing=postprocessing, ) def build_fieldset(): source_field = build_text_field() target_field = build_text_field() source_vocab_options = dict( min_freq='source_vocab_min_frequency', max_size='source_vocab_size' ) target_vocab_options = dict( min_freq='target_vocab_min_frequency', max_size='target_vocab_size' ) fieldset = Fieldset() fieldset.add( name=const.SOURCE, field=source_field, file_option_suffix='_source', required=Fieldset.TRAIN, vocab_options=source_vocab_options, ) fieldset.add( name=const.TARGET, field=target_field, file_option_suffix='_target', required=Fieldset.TRAIN, vocab_options=target_vocab_options, ) return fieldset PK!ʋy*kiwi/data/fieldsets/predictor_estimator.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see . 
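# --- Usage sketch (illustrative, not part of the original source) ---
# The predictor's text fields tokenize on whitespace and reserve the
# START/STOP markers for numericalization time. The sentence is made up.
from kiwi import constants as const
from kiwi.data.fieldsets.predictor import build_text_field

pred_text_field = build_text_field()
assert pred_text_field.preprocess('a small test') == ['a', 'small', 'test']
assert pred_text_field.init_token == const.START
assert pred_text_field.eos_token == const.STOP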
# import torch from torchtext import data from kiwi import constants as const from kiwi.data import utils from kiwi.data.fields.sequence_labels_field import SequenceLabelsField from kiwi.data.fieldsets.fieldset import Fieldset from kiwi.data.tokenizers import tokenizer def build_text_field(): return data.Field( tokenize=tokenizer, init_token=const.START, batch_first=True, eos_token=const.STOP, pad_token=const.PAD, unk_token=const.UNK, ) def build_label_field(postprocessing=None): return SequenceLabelsField( classes=const.LABELS, tokenize=tokenizer, pad_token=const.PAD, batch_first=True, postprocessing=postprocessing, ) def build_fieldset(wmt18_format=False): target_field = build_text_field() source_field = build_text_field() source_vocab_options = dict( min_freq='source_vocab_min_frequency', max_size='source_vocab_size' ) target_vocab_options = dict( min_freq='target_vocab_min_frequency', max_size='target_vocab_size' ) fieldset = Fieldset() fieldset.add( name=const.SOURCE, field=source_field, file_option_suffix='_source', required=Fieldset.TRAIN, vocab_options=source_vocab_options, ) fieldset.add( name=const.TARGET, field=target_field, file_option_suffix='_target', required=Fieldset.TRAIN, vocab_options=target_vocab_options, ) fieldset.add( name=const.PE, field=target_field, file_option_suffix='_pe', required=None, vocab_options=target_vocab_options, ) post_pipe_target = data.Pipeline(utils.project) if wmt18_format: post_pipe_gaps = data.Pipeline(utils.wmt18_to_gaps) post_pipe_target = data.Pipeline(utils.wmt18_to_target) fieldset.add( name=const.GAP_TAGS, field=build_label_field(post_pipe_gaps), file_option_suffix='_target_tags', required=[Fieldset.TRAIN, Fieldset.VALID], ) fieldset.add( name=const.TARGET_TAGS, field=build_label_field(post_pipe_target), file_option_suffix='_target_tags', required=None, ) fieldset.add( name=const.SOURCE_TAGS, field=build_label_field(), file_option_suffix='_source_tags', required=None, ) fieldset.add( name=const.SENTENCE_SCORES, field=data.Field( sequential=False, use_vocab=False, dtype=torch.float32 ), file_option_suffix='_sentence_scores', required=None, ) pipe = data.Pipeline(utils.hter_to_binary) fieldset.add( name=const.BINARY, field=data.Field( sequential=False, use_vocab=False, dtype=torch.long, preprocessing=pipe, ), file_option_suffix='_sentence_scores', required=None, ) return fieldset PK!xxkiwi/data/fieldsets/quetch.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see . 
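# --- Usage sketch (illustrative, not part of the original source) ---
# The BINARY field above reuses the '_sentence_scores' files: the
# hter_to_binary pipeline maps an HTER of 0.0 to class 0 and any positive
# score to class 1 (via math.ceil).
from kiwi.data.utils import hter_to_binary

assert hter_to_binary('0.0') == 0
assert hter_to_binary('0.25') == 1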
# from torchtext import data from kiwi import constants as const from kiwi.data import utils from kiwi.data.fields.alignment_field import AlignmentField from kiwi.data.fields.qe_field import QEField from kiwi.data.fields.sequence_labels_field import SequenceLabelsField from kiwi.data.fieldsets.fieldset import Fieldset from kiwi.data.tokenizers import align_tokenizer, tokenizer def build_fieldset(wmt18_format=False): fs = Fieldset() fs.add( name=const.SOURCE, field=QEField( tokenize=tokenizer, init_token=None, eos_token=None, include_lengths=True, ), file_option_suffix='_source', required=Fieldset.ALL, vocab_options=dict( min_freq='source_vocab_min_frequency', max_size='source_vocab_size', rare_with_vectors='keep_rare_words_with_embeddings', add_vectors_vocab='add_embeddings_vocab', ), vocab_vectors='source_embeddings', ) fs.add( name=const.TARGET, field=QEField( tokenize=tokenizer, init_token=None, eos_token=None, include_lengths=True, ), file_option_suffix='_target', required=Fieldset.ALL, vocab_options=dict( min_freq='target_vocab_min_frequency', max_size='target_vocab_size', rare_with_vectors='keep_rare_words_with_embeddings', add_vectors_vocab='add_embeddings_vocab', ), vocab_vectors='target_embeddings', ) fs.add( name=const.ALIGNMENTS, field=AlignmentField(tokenize=align_tokenizer, use_vocab=False), file_option_suffix='_alignments', required=Fieldset.ALL, ) post_pipe_target = data.Pipeline(utils.project) if wmt18_format: post_pipe_gaps = data.Pipeline(utils.wmt18_to_gaps) post_pipe_target = data.Pipeline(utils.wmt18_to_target) fs.add( name=const.GAP_TAGS, field=SequenceLabelsField( classes=const.LABELS, tokenize=tokenizer, pad_token=const.PAD, unk_token=None, batch_first=True, # eos_token=const.STOP, postprocessing=post_pipe_gaps, ), file_option_suffix='_target_tags', required=[Fieldset.TRAIN, Fieldset.VALID], ) fs.add( name=const.TARGET_TAGS, field=SequenceLabelsField( classes=const.LABELS, tokenize=tokenizer, pad_token=const.PAD, unk_token=None, batch_first=True, postprocessing=post_pipe_target, ), file_option_suffix='_target_tags', required=[Fieldset.TRAIN, Fieldset.VALID], ) fs.add( name=const.SOURCE_TAGS, field=SequenceLabelsField( classes=const.LABELS, tokenize=tokenizer, pad_token=const.PAD, unk_token=None, batch_first=True, ), file_option_suffix='_source_tags', required=None, ) return fs PK!A^jvkiwi/data/iterators.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see . 
# import torch from torchtext import data def build_bucket_iterator(dataset, device, batch_size, is_train): device_obj = None if device is None else torch.device(device) iterator = data.BucketIterator( dataset=dataset, batch_size=batch_size, repeat=False, sort_key=dataset.sort_key, sort=False, # sorts the data within each minibatch in decreasing order # set to true if you want use pack_padded_sequences sort_within_batch=is_train, # shuffle batches shuffle=is_train, device=device_obj, train=is_train, ) return iterator PK!cG kiwi/data/qe_dataset.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see . # from torchtext import data class QEDataset(data.Dataset): """Defines a dataset for quality estimation. Based on the WMT 201X.""" @staticmethod def sort_key(ex): # don't work for pack_padded_sequences # return data.interleave_keys(len(ex.source), len(ex.target)) return len(ex.source) def __init__(self, examples, fields, filter_pred=None): """Create a dataset from a list of Examples and Fields. Arguments: examples: List of Examples. fields (List(tuple(str, Field))): The Fields to use in this tuple. The string is a field name, and the Field is the associated field. filter_pred (callable or None): Use only examples for which filter_pred(example) is True, or use all examples if None. Default is None. """ # ensure that examples is not a generator examples = list(examples) super().__init__(examples, fields, filter_pred) def __getstate__(self): """For pickling. Copied from OpenNMT-py DatasetBase implementation. """ return self.__dict__ def __setstate__(self, _d): """For pickling. Copied from OpenNMT-py DatasetBase implementation. """ self.__dict__.update(_d) def __reduce_ex__(self, proto): """For pickling. Copied from OpenNMT-py DatasetBase implementation. """ return super(QEDataset, self).__reduce_ex__(proto) def split( self, split_ratio=0.7, stratified=False, strata_field='label', random_state=None, ): datasets = super().split( split_ratio, stratified, strata_field, random_state ) casted_datasets = [ QEDataset(examples=dataset.examples, fields=dataset.fields) for dataset in datasets ] return casted_datasets PK!EZkiwi/data/tokenizers.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see . 
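# --- Usage sketch (illustrative, not part of the original source) ---
# Wiring a QEDataset into build_bucket_iterator. The fields and the single
# example are made up; with is_train=True, batches are shuffled and sorted
# by source length within each minibatch (for pack_padded_sequence).
from torchtext import data
from kiwi.data.iterators import build_bucket_iterator
from kiwi.data.qe_dataset import QEDataset

demo_src = data.Field(tokenize=str.split, batch_first=True)
demo_tgt = data.Field(tokenize=str.split, batch_first=True)
demo_field_list = [('source', demo_src), ('target', demo_tgt)]
demo_examples = [data.Example.fromlist(['a b c', 'x y'], demo_field_list)]
demo_dataset = QEDataset(demo_examples, demo_field_list)
demo_src.build_vocab(demo_dataset)
demo_tgt.build_vocab(demo_dataset)
demo_iter = build_bucket_iterator(
    demo_dataset, device=None, batch_size=1, is_train=True
)
demo_batch = next(iter(demo_iter))  # has .source and .target tensors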
# def tokenizer(sentence): """Implement your own tokenize procedure.""" return sentence.strip().split() def align_tokenizer(s): """Return a list of pair of integers for each sentence.""" return [tuple(map(int, x.split('-'))) for x in s.strip().split()] def align_reversed_tokenizer(s): """Return a list of pair of integers for each sentence.""" return [tuple(map(int, x.split('-')))[::-1] for x in s.strip().split()] PK!U$T%%kiwi/data/utils.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see . # import copy import logging from collections import defaultdict from math import ceil from pathlib import Path import torch from kiwi import constants as const from kiwi.data.fieldsets.fieldset import Fieldset logger = logging.getLogger(__name__) def serialize_vocabs(vocabs, include_vectors=False): """Make vocab dictionary serializable. """ serialized_vocabs = [] for name, vocab in vocabs.items(): vocab = copy.copy(vocab) vocab.stoi = dict(vocab.stoi) if not include_vectors: vocab.vectors = None serialized_vocabs.append((name, vocab)) return serialized_vocabs def deserialize_vocabs(vocabs): """Restore defaultdict lost in serialization. """ vocabs = dict(vocabs) for name, vocab in vocabs.items(): # Hack. Can't pickle defaultdict :( vocab.stoi = defaultdict(lambda: const.UNK_ID, vocab.stoi) return vocabs def serialize_fields_to_vocabs(fields): """ Save Vocab objects in Field objects to `vocab.pt` file. From OpenNMT """ vocabs = fields_to_vocabs(fields) vocabs = serialize_vocabs(vocabs) return vocabs def deserialize_fields_from_vocabs(fields, vocabs): """ Load serialized vocabularies into their fields. """ # TODO redundant deserialization vocabs = deserialize_vocabs(vocabs) return fields_from_vocabs(fields, vocabs) def fields_from_vocabs(fields, vocabs): """ Load Field objects from vocabs dict. From OpenNMT """ vocabs = deserialize_vocabs(vocabs) for name, vocab in vocabs.items(): if name not in fields: logger.debug( 'No field "{}" for loading vocabulary; ignoring.'.format(name) ) else: fields[name].vocab = vocab return fields def fields_to_vocabs(fields): """ Extract Vocab Dictionary from Fields Dictionary. Args: fields: A dict mapping field names to Field objects Returns: vocab: A dict mapping field names to Vocabularies """ vocabs = {} for name, field in fields.items(): if field is not None and 'vocab' in field.__dict__: vocabs[name] = field.vocab return vocabs def save_vocabularies_from_fields(directory, fields, include_vectors=False): """ Save Vocab objects in Field objects to `vocab.pt` file. 
From OpenNMT """ vocabs = serialize_fields_to_vocabs(fields) vocab_path = Path(directory, const.VOCAB_FILE) torch.save({const.VOCAB: vocabs}, str(vocab_path)) return vocab_path def load_vocabularies_to_fields(vocab_path, fields): """Load serialized Vocabularies from disk into fields.""" if Path(vocab_path).exists(): vocabs_dict = torch.load( str(vocab_path), map_location=lambda storage, loc: storage ) vocabs = vocabs_dict[const.VOCAB] fields = deserialize_fields_from_vocabs(fields, vocabs) logger.info('Loaded vocabularies from {}'.format(vocab_path)) return all( [vocab_loaded_if_needed(field) for _, field in fields.items()] ) return False def load_vocabularies_to_datasets(vocab_path, *datasets): fields = {} for dataset in datasets: fields.update(dataset.fields) return load_vocabularies_to_fields(vocab_path, fields) def vocab_loaded_if_needed(field): return not field.use_vocab or (hasattr(field, const.VOCAB) and field.vocab) def save_vocabularies_from_datasets(directory, *datasets): fields = {} for dataset in datasets: fields.update(dataset.fields) return save_vocabularies_from_fields(directory, fields) def build_vocabulary(fields_vocab_options, *datasets): fields = {} for dataset in datasets: fields.update(dataset.fields) for name, field in fields.items(): if not vocab_loaded_if_needed(field): kwargs_vocab = fields_vocab_options[name] if 'vectors_fn' in kwargs_vocab: vectors_fn = kwargs_vocab['vectors_fn'] kwargs_vocab['vectors'] = vectors_fn() del kwargs_vocab['vectors_fn'] field.build_vocab(*datasets, **kwargs_vocab) def load_datasets(directory, *datasets_names): dataset_path = Path(directory, const.DATAFILE) dataset_dict = torch.load( str(dataset_path), map_location=lambda storage, loc: storage ) datasets = [dataset_dict[name] for name in datasets_names] return datasets def save_datasets(directory, **named_datasets): """Pickle datasets to standard file in directory Note that fields cannot be saved as part of a dataset, so they are ignored. Args: directory (str or Path): directory where to save the datasets pickle. named_datasets (dict): mapping of name and respective dataset. """ # Fields cannot be pickled # Saving field to a temporary list dataset_fields_tmp = [] for dataset in named_datasets.values(): dataset_fields_tmp.append(dataset.fields) dataset.fields = [] logging.info('Saving preprocessed datasets...') dataset_path = Path(directory, const.DATAFILE) torch.save(named_datasets, str(dataset_path)) # Reconstructing dataset.field from the temporary list for dataset, fields in zip(named_datasets.values(), dataset_fields_tmp): dataset.fields = fields def save_training_datasets(directory, train_dataset, valid_dataset): ds_dict = {const.TRAIN: train_dataset, const.EVAL: valid_dataset} save_datasets(directory, **ds_dict) def load_training_datasets(directory, fieldset): # FIXME: test if this works. Ideally, fields would be already contained # inside the loaded datasets. train_ds, valid_ds = load_datasets(directory, const.TRAIN, const.EVAL) # Remove fields not actually loaded (checking if they're required). 
fields = fieldset.fields for field in dict(fields): # Make a copy so del can be used if not hasattr(train_ds.examples[0], field): for set_name in [Fieldset.TRAIN, Fieldset.VALID]: if fieldset.is_required(field, set_name): raise AttributeError( 'Loaded {} dataset does not have a ' '{} field.'.format(set_name, field) ) del fields[field] train_ds.fields = fields valid_ds.fields = fields load_vocabularies_to_fields( Path(directory, const.VOCAB_FILE), fieldset.fields ) return train_ds, valid_ds def cross_split_dataset(dataset, splits): examples_per_split = ceil(len(dataset) / splits) for split in range(splits): held_out_start = examples_per_split * split held_out_stop = examples_per_split * (split + 1) held_out_examples = dataset[held_out_start:held_out_stop] held_in_examples = dataset[:held_out_start] + dataset[held_out_stop:] train_split = dataset.__class__(held_in_examples, dataset.fields) eval_split = dataset.__class__(held_out_examples, dataset.fields) yield train_split, eval_split def save_file(file_path, data, token_sep=' ', example_sep='\n'): if data and isinstance(data[0], list): data = [token_sep.join(map(str, sentence)) for sentence in data] else: data = map(str, data) example_str = example_sep.join(data) + '\n' Path(file_path).write_text(example_str) def save_predicted_probabilities(directory, predictions, prefix=''): directory = Path(directory) directory.mkdir(parents=True, exist_ok=True) for key, preds in predictions.items(): if prefix: key = '{}.{}'.format(prefix, key) output_path = Path(directory, key) logger.info('Saving {} predictions to {}'.format(key, output_path)) save_file(output_path, preds, token_sep=' ', example_sep='\n') def read_file(path): """Reads a file into a list of lists of words. """ with Path(path).open('r', encoding='utf8') as f: return [[token for token in line.strip().split()] for line in f] def hter_to_binary(x): """Transform hter score into binary OK/BAD label. """ return ceil(float(x)) def wmt18_to_target(batch, *args): """Extract target tags from wmt18 format file. """ return batch[1::2] def wmt18_to_gaps(batch, *args): """Extract gap tags from wmt18 format file. """ return batch[::2] def project(batch, *args): """Projection onto the first argument. Needed to create a postprocessing pipeline that implements the identity. """ return batch def filter_len( x, source_min_length=1, source_max_length=float('inf'), target_min_length=1, target_max_length=float('inf'), ): return (source_min_length <= len(x.source) <= source_max_length) and ( target_min_length <= len(x.target) <= target_max_length ) PK!j # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see . 
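# --- Usage sketch (illustrative, not part of the original source) ---
# Doctest-style checks, on made-up inputs, for the helpers defined above.
from kiwi.data.tokenizers import align_tokenizer, tokenizer
from kiwi.data.utils import wmt18_to_gaps, wmt18_to_target

assert tokenizer(' a b  c \n') == ['a', 'b', 'c']
assert align_tokenizer('0-0 1-2') == [(0, 0), (1, 2)]

# WMT18 target-tag files interleave gap and word tags, starting with a gap:
wmt18_tags = ['OK', 'BAD', 'OK', 'OK', 'OK', 'OK', 'OK']  # 4 gaps, 3 words
assert wmt18_to_target(wmt18_tags) == ['BAD', 'OK', 'OK']
assert wmt18_to_gaps(wmt18_tags) == ['OK', 'OK', 'OK', 'OK']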
# import logging from functools import partial import torch from torchtext.vocab import Vectors from kiwi.constants import PAD, START, STOP, UNK logger = logging.getLogger(__name__) class WordEmbeddings(Vectors): def __init__( self, name, emb_format='polyglot', binary=True, map_fn=lambda x: x, **kwargs ): """ Arguments: emb_format: the saved embedding model format, choices are: polyglot, word2vec, fasttext, glove and text binary: only for word2vec, fasttext and text map_fn: a function that maps special original tokens to Polyglot tokens (e.g. to ) save_vectors: save a vectors cache """ self.binary = binary self.emb_format = emb_format self.itos = None self.stoi = None self.dim = None self.vectors = None self.map_fn = map_fn super().__init__(name, **kwargs) def __getitem__(self, token): if token in self.stoi: token = self.map_fn(token) return self.vectors[self.stoi[token]] else: return self.unk_init(torch.Tensor(1, self.dim)) def cache(self, name, cache, url=None, max_vectors=None): if self.emb_format in ['polyglot', 'glove']: try: from polyglot.mapping import Embedding except ImportError: logger.error('Please install `polyglot` package first.') return None if self.emb_format == 'polyglot': embeddings = Embedding.load(name) else: embeddings = Embedding.from_glove(name) self.itos = embeddings.vocabulary.id_word self.stoi = embeddings.vocabulary.word_id self.dim = embeddings.shape[1] self.vectors = torch.Tensor(embeddings.vectors).view(-1, self.dim) elif self.emb_format in ['word2vec', 'fasttext']: try: from gensim.models import KeyedVectors except ImportError: logger.error('Please install `gensim` package first.') return None embeddings = KeyedVectors.load_word2vec_format( name, unicode_errors='ignore', binary=self.binary ) self.itos = embeddings.index2word self.stoi = dict(zip(self.itos, range(len(self.itos)))) self.dim = embeddings.vector_size self.vectors = torch.Tensor(embeddings.vectors).view(-1, self.dim) elif self.emb_format == 'text': tokens = [] vectors = [] if self.binary: import pickle # vectors should be a dict mapping str keys to numpy arrays with open(name, 'rb') as f: d = pickle.load(f) tokens = list(d.keys()) vectors = list(d.values()) else: # each line should contain a token and its following fields # ... with open(name, 'r', encoding='utf8') as f: for line in f: if line: # ignore empty lines fields = line.rstrip().split() tokens.append(fields[0]) vectors.append(list(map(float, fields[1:]))) self.itos = tokens self.stoi = dict(zip(self.itos, range(len(self.itos)))) self.vectors = torch.Tensor(vectors) self.dim = self.vectors.shape[1] def map_to_polyglot(token): mapping = {UNK: '', PAD: '', START: '', STOP: ''} if token in mapping: return mapping[token] return token Polyglot = partial( WordEmbeddings, emb_format='polyglot', map_fn=map_to_polyglot ) Word2Vec = partial(WordEmbeddings, emb_format='word2vec') FastText = partial(WordEmbeddings, emb_format='fasttext') Glove = partial(WordEmbeddings, emb_format='glove') TextVectors = partial(WordEmbeddings, emb_format='text') AvailableVectors = { 'polyglot': Polyglot, 'word2vec': Word2Vec, 'fasttext': FastText, 'glove': Glove, 'text': TextVectors, } PK!ueaZZkiwi/data/vocabulary.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. 
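# --- Usage sketch (illustrative, not part of the original source) ---
# Picking an embeddings loader by format name. The path is hypothetical;
# 'word2vec'/'fasttext' need gensim installed, 'polyglot'/'glove' need
# polyglot, and 'text' expects one 'token v1 v2 ...' entry per line.
from kiwi.data.vectors import AvailableVectors

text_loader = AvailableVectors['text']
# vectors = text_loader('embeddings.txt', binary=False)
# vectors['hello']  # -> torch.Tensor row, or unk_init output if unknown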
# # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see . # import warnings from collections import defaultdict import torchtext from kiwi.constants import PAD, START, STOP, UNALIGNED, UNK, UNK_ID def _default_unk_index(): return UNK_ID # should be zero class Vocabulary(torchtext.vocab.Vocab): """Defines a vocabulary object that will be used to numericalize a field. Attributes: freqs: A collections.Counter object holding the frequencies of tokens in the data used to build the Vocab. stoi: A collections.defaultdict instance mapping token strings to numerical identifiers. itos: A list of token strings indexed by their numerical identifiers. """ def __init__( self, counter, max_size=None, min_freq=1, specials=None, vectors=None, unk_init=None, vectors_cache=None, rare_with_vectors=True, add_vectors_vocab=False, ): """Create a Vocab object from a collections.Counter. Arguments: counter: collections.Counter object holding the frequencies of each value found in the data. max_size: The maximum size of the vocabulary, or None for no maximum. Default: None. min_freq: The minimum frequency needed to include a token in the vocabulary. Values less than 1 will be set to 1. Default: 1. specials: The list of special tokens (e.g., padding or eos) that will be prepended to the vocabulary in addition to an token. Default: [''] vectors: One of either the available pretrained vectors or custom pretrained vectors (see Vocab.load_vectors); or a list of aforementioned vectors unk_init (callback): by default, initialize out-of-vocabulary word vectors to zero vectors; can be any function that takes in a Tensor and returns a Tensor of the same size. Default: torch.Tensor.zero_ vectors_cache: directory for cached vectors. Default: '.vector_cache' rare_with_vectors: if True and a vectors object is passed, then it will add words that appears less than min_freq but are in vectors vocabulary. Default: True. add_vectors_vocab: by default, the vocabulary is built using only words from the provided datasets. If this flag is true, the vocabulary will add words that are not in the datasets but are in the vectors vocabulary (e.g. words from polyglot vectors). Default: False. 
""" if specials is None: specials = [''] self.freqs = counter counter = counter.copy() min_freq = max(min_freq, 1) self.itos = list(specials) # frequencies of special tokens are not counted when building vocabulary # in frequency order for tok in specials: del counter[tok] max_size = None if max_size is None else max_size + len(self.itos) # sort by frequency, then alphabetically words_and_frequencies = sorted(counter.items(), key=lambda tup: tup[0]) words_and_frequencies.sort(key=lambda tup: tup[1], reverse=True) if not isinstance(vectors, list) and vectors is not None: vectors = [vectors] # add words that appears less than min_freq but are in embeddings # vocabulary for word, freq in words_and_frequencies: if freq < min_freq: if vectors is not None and rare_with_vectors: for v in vectors: if word in v.stoi: self.itos.append(word) else: break elif len(self.itos) == max_size: break else: self.itos.append(word) if add_vectors_vocab: if ( max_size is not None and sum(v.stoi for v in vectors) + len(self.itos) > max_size ): warnings.warn( 'Adding the vectors vocabulary will make ' 'len(vocab) > max_vocab_size!' ) vset = set() for v in vectors: vset.update(v.stoi.keys()) v_itos = vset - set(self.itos) self.itos.extend(list(v_itos)) self.stoi = defaultdict(_default_unk_index) # stoi is simply a reverse dict for itos self.stoi.update({tok: i for i, tok in enumerate(self.itos)}) self.vectors = None if vectors is not None: self.load_vectors(vectors, unk_init=unk_init, cache=vectors_cache) else: assert unk_init is None and vectors_cache is None def merge_vocabularies(vocab_a, vocab_b, max_size=None, vectors=None, **kwargs): merged = vocab_a.freqs + vocab_b.freqs return Vocabulary( merged, specials=[UNK, PAD, START, STOP, UNALIGNED], max_size=max_size, vectors=vectors, **kwargs, ) PK!hkiwi/lib/__init__.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see . # PK!Ek7k7kiwi/lib/evaluate.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see . 
# import os.path import warnings from pathlib import Path import numpy as np from more_itertools import flatten from scipy.stats.stats import pearsonr, rankdata, spearmanr from kiwi import constants as const from kiwi.data.utils import read_file from kiwi.metrics.functions import ( delta_average, f1_scores, mean_absolute_error, mean_squared_error, ) def evaluate_from_options(options): """ Evaluates a model's predictions based on the flags received from the configuration files. Refer to configuration for a list of available configuration flags for the evaluate pipeline. Args: options (Namespace): Namespace containing all pipeline options """ if options is None: return setup() pipeline_options = options.pipeline # flag denoting format so there's no need to always check is_wmt18_format = pipeline_options.format.lower() == "wmt18" is_wmt18_pred_format = pipeline_options.pred_format.lower() == "wmt18" # handling of gold target golds = retrieve_gold_standard(pipeline_options, is_wmt18_format) # handling of prediction files pred_files = retrieve_predictions(pipeline_options, is_wmt18_pred_format) if not any(pred_files.values()): print( "Please specify at least one of these options: " "--input-dir, --pred-target, --pred-source, --pred-sents" ) return # evaluate word level for tag in const.TAGS: if tag in golds and pred_files[tag]: scores = eval_word_level(golds, pred_files, tag) print_scores_table(scores, tag) # evaluate sentence level if const.SENTENCE_SCORES in golds: sent_golds = golds[const.SENTENCE_SCORES] sent_preds = retrieve_sentence_predictions(pipeline_options, pred_files) if sent_preds: sentence_scores, sentence_ranking = eval_sentence_level( sent_golds, sent_preds ) print_sentences_scoring_table(sentence_scores) print_sentences_ranking_table(sentence_ranking) for pred_file in pred_files[const.BINARY]: sent_preds.append(pred_file) teardown() # TODO return some evaluation info besides just printing the graph def retrieve_gold_standard(pipeline_options, is_wmt18_format): golds = {} if pipeline_options.gold_target: gold_target = _wmt_to_labels(read_file(pipeline_options.gold_target)) if is_wmt18_format: gold_target, gold_gaps = _split_wmt18(gold_target) golds[const.GAP_TAGS] = gold_gaps golds[const.TARGET_TAGS] = gold_target # handling of gold source if pipeline_options.gold_source: gold_source = _wmt_to_labels(read_file(pipeline_options.gold_source)) golds[const.SOURCE_TAGS] = gold_source # handling of gold sentences if pipeline_options.gold_sents: gold_sentences = _read_sentence_scores(pipeline_options.gold_sents) golds[const.SENTENCE_SCORES] = gold_sentences return golds def retrieve_predictions(pipeline_options, is_wmt18_pred_format): pred_files = {target: [] for target in const.TARGETS} if pipeline_options.pred_target: for pred_file in pipeline_options.pred_target: pred_target = read_file(pred_file) if is_wmt18_pred_format: pred_target, pred_gaps = _split_wmt18(pred_target) pred_files[const.GAP_TAGS].append((str(pred_file), pred_gaps)) pred_files[const.TARGET_TAGS].append((str(pred_file), pred_target)) if pipeline_options.pred_gaps: for pred_file in pipeline_options.pred_gaps: pred_gaps = read_file(pred_file) pred_files[const.GAP_TAGS].append((str(pred_file), pred_gaps)) if pipeline_options.pred_source: for pred_file in pipeline_options.pred_source: pred_source = read_file(pred_file) pred_files[const.SOURCE_TAGS].append((str(pred_file), pred_source)) if pipeline_options.pred_sents: for pred_file in pipeline_options.pred_sents: pred_sents = _read_sentence_scores(pred_file) 
pred_files[const.SENTENCE_SCORES].append( (str(pred_file), pred_sents) ) if pipeline_options.input_dir: for input_dir in pipeline_options.input_dir: input_dir = Path(input_dir) for target in const.TAGS: pred_file = input_dir.joinpath(target) if pred_file.exists() and pred_file.is_file(): pred_files[pred_file.name].append( (str(pred_file), read_file(pred_file)) ) for target in [const.SENTENCE_SCORES, const.BINARY]: pred_file = input_dir.joinpath(target) if pred_file.exists() and pred_file.is_file(): pred_files[pred_file.name].append( (str(pred_file), _read_sentence_scores(str(pred_file))) ) # Numericalize Text Labels if pipeline_options.type == "tags": for tag_name in const.TAGS: for i in range(len(pred_files[tag_name])): fname, pred_tags = pred_files[tag_name][i] pred_files[tag_name][i] = (fname, _wmt_to_labels(pred_tags)) return pred_files def retrieve_sentence_predictions(pipeline_options, pred_files): sent_preds = pred_files[const.SENTENCE_SCORES] sents_avg = ( pipeline_options.sents_avg if pipeline_options.sents_avg else pipeline_options.type ) tag_to_sent = _probs_to_sentence_score if sents_avg == "tags": tag_to_sent = _tags_to_sentence_score for pred_file, pred in pred_files[const.TARGET_TAGS]: sent_pred = np.array(tag_to_sent(pred)) sent_preds.append((pred_file, sent_pred)) return sent_preds def _split_wmt18(tags): """Split tags list of lists in WMT18 format into target and gap tags.""" tags_mt = [sent_tags[1::2] for sent_tags in tags] tags_gaps = [sent_tags[::2] for sent_tags in tags] return tags_mt, tags_gaps def _wmt_to_labels(corpus): """Generates numeric labels from text labels.""" dictionary = dict(zip(const.LABELS, range(len(const.LABELS)))) return [[dictionary[word] for word in sent] for sent in corpus] def _read_sentence_scores(sent_file): """Read File with numeric scores for sentences.""" return np.array([line.strip() for line in open(sent_file)], dtype="float") def _tags_to_sentence_score(tags_sentences): scores = [] bad_label = const.LABELS.index(const.BAD) for tags in tags_sentences: labels = _probs_to_labels(tags) scores.append(labels.count(bad_label) / len(tags)) return scores def _probs_to_sentence_score(probs_sentences): scores = [] for probs in probs_sentences: probs = [float(p) for p in probs] scores.append(np.mean(probs)) return scores def _probs_to_labels(probs, threshold=0.5): """Generates numeric labels from probabilities. 
This assumes two classes and default decision threshold 0.5 """ return [int(float(prob) > threshold) for prob in probs] def _check_lengths(gold, prediction): for i, (g, p) in enumerate(zip(gold, prediction)): if len(g) != len(p): warnings.warn( "Mismatch length for {}th sample " "{} x {}".format(i, len(g), len(p)) ) def _average(probs_per_file): # flat_probs = [list(flatten(probs)) for probs in probs_per_file] probabilities = np.array(probs_per_file, dtype="float32") return probabilities.mean(axis=0).tolist() def _extract_path_prefix(file_names): if len(file_names) < 2: return "", file_names prefix_path = os.path.commonpath( [path for path in file_names if not path.startswith("*")] ) if len(prefix_path) > 0: file_names = [ os.path.relpath(path, prefix_path) if not path.startswith("*") else path for path in file_names ] return prefix_path, file_names def setup(): pass def teardown(): pass def eval_word_level(golds, pred_files, tag_name): scores_table = [] for pred_file, pred in pred_files[tag_name]: _check_lengths(golds[tag_name], pred) scores = score_word_level( list(flatten(golds[tag_name])), list(flatten(pred)) ) scores_table.append((pred_file, *scores)) # If more than one system is provided, compute ensemble score if len(pred_files[tag_name]) > 1: ensemble_pred = _average( [list(flatten(pred)) for _, pred in pred_files[tag_name]] ) ensemble_score = score_word_level( list(flatten(golds[tag_name])), ensemble_pred ) scores_table.append(("*ensemble*", *ensemble_score)) scores = np.array( scores_table, dtype=[ ("File", "object"), ("F1_{}".format(const.LABELS[0]), float), ("F1_{}".format(const.LABELS[1]), float), ("F1_mult", float), ], ) # Put the main metric in the first column scores = scores[ [ "File", "F1_mult", "F1_{}".format(const.LABELS[0]), "F1_{}".format(const.LABELS[1]), ] ] return scores def eval_sentence_level(sent_gold, sent_preds): sentence_scores, sentence_ranking = [], [] for file_name, pred in sent_preds: scoring, ranking = score_sentence_level(sent_gold, pred) sentence_scores.append((file_name, *scoring)) sentence_ranking.append((file_name, *ranking)) ensemble_pred = _average([pred for _, pred in sent_preds]) ensemble_score, ensemble_ranking = score_sentence_level( sent_gold, ensemble_pred ) sentence_scores.append(("*ensemble*", *ensemble_score)) sentence_ranking.append(("*ensemble*", *ensemble_ranking)) sentence_scores = np.array( sentence_scores, dtype=[ ("File", "object"), ("Pearson r", float), ("MAE", float), ("RMSE", float), ], ) sentence_ranking = np.array( sentence_ranking, dtype=[("File", "object"), ("Spearman r", float), ("DeltaAvg", float)], ) return sentence_scores, sentence_ranking def score_word_level(gold, prediction): gold_tags = gold pred_tags = _probs_to_labels(prediction) return f1_scores(pred_tags, gold_tags) def score_sentence_level(gold, pred): pearson = pearsonr(gold, pred) mae = mean_absolute_error(gold, pred) rmse = np.sqrt(mean_squared_error(gold, pred)) spearman = spearmanr( rankdata(gold, method="ordinal"), rankdata(pred, method="ordinal") ) delta_avg = delta_average(gold, rankdata(pred, method="ordinal")) return (pearson[0], mae, rmse), (spearman[0], delta_avg) def print_scores_table(scores, prefix="TARGET"): prefix_path, scores["File"] = _extract_path_prefix(scores["File"]) path_str = " ({})".format(prefix_path) if prefix_path else "" max_method_length = max(len(path_str) + 4, max(map(len, scores["File"]))) print("-" * (max_method_length + 13 * 3)) print("Word-level scores for {}:".format(prefix)) print( "{:{width}} {:9} {:9} {:9}".format( 
"File{}".format(path_str), "F1_mult", "F1_{}".format(const.LABELS[0]), "F1_{}".format(const.LABELS[1]), width=max_method_length, ) ) for score in np.sort(scores, order=["F1_mult", "File"])[::-1]: print( "{:{width}s} {:<9.5f} {:<9.5} {:<9.5f}".format( *score, width=max_method_length ) ) def print_sentences_scoring_table(scores): prefix_path, scores["File"] = _extract_path_prefix(scores["File"]) path_str = " ({})".format(prefix_path) if prefix_path else "" max_method_length = max(len(path_str) + 4, max(map(len, scores["File"]))) print("-" * (max_method_length + 13 * 3)) print("Sentence-level scoring:") print( "{:{width}} {:9} {:9} {:9}".format( "File{}".format(path_str), "Pearson r", "MAE", "RMSE", width=max_method_length, ) ) for score in np.sort(scores, order=["Pearson r", "File"])[::-1]: print( "{:{width}s} {:<9.5f} {:<9.5f} {:<9.5f}".format( *score, width=max_method_length ) ) def print_sentences_ranking_table(scores): prefix_path, scores["File"] = _extract_path_prefix(scores["File"]) path_str = " ({})".format(prefix_path) if prefix_path else "" max_method_length = max(len(path_str) + 4, max(map(len, scores["File"]))) print("-" * (max_method_length + 13 * 3)) print("Sentence-level ranking:") print( "{:{width}} {:10} {:9}".format( "File{}".format(path_str), "Spearman r", "DeltaAvg", width=max_method_length, ) ) # noqa for score in np.sort(scores, order=["Spearman r", "File"])[::-1]: print( "{:{width}s} {:<10.5f} {:<9.5f}".format( *score, width=max_method_length ) ) PK!SBN""kiwi/lib/jackknife.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see . 
# import logging from collections import defaultdict from pathlib import Path import numpy as np import torch from kiwi import constants as const from kiwi import load_model from kiwi.data import utils from kiwi.data.builders import build_test_dataset from kiwi.data.iterators import build_bucket_iterator from kiwi.data.utils import cross_split_dataset, save_predicted_probabilities from kiwi.lib import train from kiwi.lib.utils import merge_namespaces from kiwi.loggers import tracking_logger logger = logging.getLogger(__name__) def run_from_options(options): if options is None: return meta_options = options.meta pipeline_options = options.pipeline.pipeline model_options = options.pipeline.model ModelClass = options.pipeline.model_api tracking_run = tracking_logger.configure( run_uuid=pipeline_options.run_uuid, experiment_name=pipeline_options.experiment_name, tracking_uri=pipeline_options.mlflow_tracking_uri, always_log_artifacts=pipeline_options.mlflow_always_log_artifacts, ) with tracking_run: output_dir = train.setup( output_dir=pipeline_options.output_dir, debug=pipeline_options.debug, quiet=pipeline_options.quiet, ) all_options = merge_namespaces( meta_options, pipeline_options, model_options ) train.log( output_dir, config_options=vars(all_options), config_file_name='jackknife_config.yml', ) run( ModelClass, output_dir, pipeline_options, model_options, splits=meta_options.splits, ) teardown(pipeline_options) def run(ModelClass, output_dir, pipeline_options, model_options, splits): model_name = getattr(ModelClass, 'title', ModelClass.__name__) logger.info('Jackknifing with the {} model'.format(model_name)) # Data fieldset = ModelClass.fieldset( wmt18_format=model_options.__dict__.get('wmt18_format') ) train_set, dev_set = train.retrieve_datasets( fieldset, pipeline_options, model_options, output_dir ) test_set = None try: test_set = build_test_dataset(fieldset, **vars(pipeline_options)) except ValueError: pass except FileNotFoundError: pass device_id = None if pipeline_options.gpu_id is not None and pipeline_options.gpu_id >= 0: device_id = pipeline_options.gpu_id parent_dir = output_dir train_predictions = defaultdict(list) dev_predictions = defaultdict(list) test_predictions = defaultdict(list) splitted_datasets = cross_split_dataset(train_set, splits) for i, (train_fold, pred_fold) in enumerate(splitted_datasets): run_name = 'train_split_{}'.format(i) output_dir = Path(parent_dir, run_name) output_dir.mkdir(parents=True, exist_ok=True) # options.output_dir = str(options.output_dir) # Train vocabs = utils.fields_to_vocabs(train_fold.fields) tracking_run = tracking_logger.start_nested_run(run_name=run_name) with tracking_run: train.setup( output_dir=output_dir, seed=pipeline_options.seed, gpu_id=pipeline_options.gpu_id, debug=pipeline_options.debug, quiet=pipeline_options.quiet, ) trainer = train.retrieve_trainer( ModelClass, pipeline_options, model_options, vocabs, output_dir, device_id, ) # Dataset iterators train_iter = build_bucket_iterator( train_fold, batch_size=pipeline_options.train_batch_size, is_train=True, device=device_id, ) valid_iter = build_bucket_iterator( pred_fold, batch_size=pipeline_options.valid_batch_size, is_train=False, device=device_id, ) trainer.run(train_iter, valid_iter, epochs=pipeline_options.epochs) # Predict predictor = load_model(trainer.checkpointer.best_model_path()) train_predictions_i = predictor.run( pred_fold, batch_size=pipeline_options.valid_batch_size ) dev_predictions_i = predictor.run( dev_set, batch_size=pipeline_options.valid_batch_size ) 
def run(ModelClass, output_dir, pipeline_options, model_options, splits):
    model_name = getattr(ModelClass, 'title', ModelClass.__name__)
    logger.info('Jackknifing with the {} model'.format(model_name))

    # Data
    fieldset = ModelClass.fieldset(
        wmt18_format=model_options.__dict__.get('wmt18_format')
    )
    train_set, dev_set = train.retrieve_datasets(
        fieldset, pipeline_options, model_options, output_dir
    )
    test_set = None
    try:
        test_set = build_test_dataset(fieldset, **vars(pipeline_options))
    except ValueError:
        pass
    except FileNotFoundError:
        pass

    device_id = None
    if pipeline_options.gpu_id is not None and pipeline_options.gpu_id >= 0:
        device_id = pipeline_options.gpu_id

    parent_dir = output_dir
    train_predictions = defaultdict(list)
    dev_predictions = defaultdict(list)
    test_predictions = defaultdict(list)
    splitted_datasets = cross_split_dataset(train_set, splits)
    for i, (train_fold, pred_fold) in enumerate(splitted_datasets):
        run_name = 'train_split_{}'.format(i)
        output_dir = Path(parent_dir, run_name)
        output_dir.mkdir(parents=True, exist_ok=True)
        # options.output_dir = str(options.output_dir)

        # Train
        vocabs = utils.fields_to_vocabs(train_fold.fields)
        tracking_run = tracking_logger.start_nested_run(run_name=run_name)
        with tracking_run:
            train.setup(
                output_dir=output_dir,
                seed=pipeline_options.seed,
                gpu_id=pipeline_options.gpu_id,
                debug=pipeline_options.debug,
                quiet=pipeline_options.quiet,
            )
            trainer = train.retrieve_trainer(
                ModelClass,
                pipeline_options,
                model_options,
                vocabs,
                output_dir,
                device_id,
            )

            # Dataset iterators
            train_iter = build_bucket_iterator(
                train_fold,
                batch_size=pipeline_options.train_batch_size,
                is_train=True,
                device=device_id,
            )
            valid_iter = build_bucket_iterator(
                pred_fold,
                batch_size=pipeline_options.valid_batch_size,
                is_train=False,
                device=device_id,
            )

            trainer.run(
                train_iter, valid_iter, epochs=pipeline_options.epochs
            )

        # Predict
        predictor = load_model(trainer.checkpointer.best_model_path())
        train_predictions_i = predictor.run(
            pred_fold, batch_size=pipeline_options.valid_batch_size
        )
        dev_predictions_i = predictor.run(
            dev_set, batch_size=pipeline_options.valid_batch_size
        )
        test_predictions_i = None
        if test_set:
            test_predictions_i = predictor.run(
                test_set, batch_size=pipeline_options.valid_batch_size
            )

        torch.cuda.empty_cache()

        for output_name in train_predictions_i:
            train_predictions[output_name] += train_predictions_i[output_name]
            dev_predictions[output_name].append(
                dev_predictions_i[output_name]
            )
            if test_set:
                test_predictions[output_name].append(
                    test_predictions_i[output_name]
                )

    dev_predictions = average_all(dev_predictions)
    if test_set:
        test_predictions = average_all(test_predictions)

    save_predicted_probabilities(
        parent_dir, train_predictions, prefix=const.TRAIN
    )
    save_predicted_probabilities(parent_dir, dev_predictions, prefix=const.DEV)
    if test_set:
        save_predicted_probabilities(
            parent_dir, test_predictions, prefix=const.TEST
        )

    teardown(pipeline_options)

    return train_predictions


def teardown(options):
    pass


def average_all(predictions):
    for output_name in predictions:
        predictions[output_name] = average_predictions(
            predictions[output_name]
        )
    return predictions


def average_predictions(ensemble):
    """Average an ensemble of predictions."""
    word_level = isinstance(ensemble[0][0], list)
    if word_level:
        sentence_lengths = [len(sentence) for sentence in ensemble[0]]
        ensemble = [
            [word for sentence in predictions for word in sentence]
            for predictions in ensemble
        ]
    ensemble = np.array(ensemble, dtype='float32')
    averaged_predictions = ensemble.mean(axis=0).tolist()
    if word_level:
        averaged_predictions = reshape_by_lengths(
            averaged_predictions, sentence_lengths
        )
    return averaged_predictions


def reshape_by_lengths(sequence, lengths):
    new_sequences = []
    t = 0
    for length in lengths:
        new_sequences.append(sequence[t : t + length])
        t += length
    return new_sequences
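# Illustrative sketch of the averaging helpers above: sentence-level
# predictions are averaged element-wise, while word-level predictions
# (lists of lists) are flattened, averaged, and reshaped back.
#
#     average_predictions([[0.25, 1.0], [0.75, 0.0]])
#     # -> [0.5, 0.5]
#     average_predictions([[[0.0, 1.0], [0.5]], [[1.0, 0.0], [0.5]]])
#     # -> [[0.5, 0.5], [0.5]]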
Tearing down") teardown(options.pipeline) def load_model(model_path): """Load a pretrained model into a `Predicter` object. Args: load_model (str): A path to the saved model file. Throws: Exception: If the path does not exist, or is not a valid model file. """ model_path = Path(model_path) if not model_path.exists(): raise Exception('Path "{}" does not exist!'.format(model_path)) model = Model.create_from_file(model_path) if not model: raise Exception('No model found in "{}"'.format(model_path)) fieldset = model.fieldset() fields = deserialize_fields_from_vocabs(fieldset.fields, model.vocabs) predicter = Predicter(model, fields=fields) return predicter def run(ModelClass, output_dir, pipeline_opts, model_opts): """ Runs the prediction pipeline. Loads the model and necessary files and creates the model's predictions for all data received. Args: ModelClass (type): Python Type of the Model to train output_dir: Directory to save predictions pipeline_options (Namespace): Generic predict Options batch_size: Max batch size for predicting model_options (Namespace): Model Specific options Returns: Predictions (dict): Dictionary with format {'target':predictions} """ model_name = getattr(ModelClass, "title", ModelClass.__name__) logger.info("Predict with the {} model".format(model_name)) if ModelClass == LinearWordQEClassifier: load_vocab = None model = LinearWordQEClassifier( evaluation_metric=model_opts.evaluation_metric ) model.load(pipeline_opts.load_model) predicter = LinearTester(model) else: load_vocab = pipeline_opts.load_model model = Model.create_from_file(pipeline_opts.load_model) # Set GPU or CPU. This has to be done before instantiating the optimizer device_id = None if pipeline_opts.gpu_id is not None and pipeline_opts.gpu_id >= 0: device_id = pipeline_opts.gpu_id model.to(device_id) predicter = Predicter(model) test_dataset = build_test_dataset( fieldset=ModelClass.fieldset( wmt18_format=model_opts.__dict__.get("wmt18_format") ), load_vocab=load_vocab, **vars(model_opts), ) predictions = predicter.run( test_dataset, batch_size=pipeline_opts.batch_size ) save_predicted_probabilities(output_dir, predictions) return predictions def setup(options): """ Analyze pipeline options and set up requirements to running the prediction pipeline. This includes setting up the output directory, random seeds and the device where predictions are run. Args: options(Namespace): Pipeline specific options Returns: output_dir(str): Path to output directory """ output_dir = setup_output_directory( options.output_dir, options.run_uuid, experiment_id=None, create=True ) configure_logging( output_dir=output_dir, debug=options.debug, quiet=options.quiet ) configure_seed(options.seed) configure_device(options.gpu_id) logger.info(pformat(vars(options))) logger.info("Local output directory is: {}".format(output_dir)) if options.save_config: save_config_file(options, options.save_config) del options.output_dir # FIXME: remove this after making sure no other # place uses it! # noqa return output_dir def teardown(options): """ Tears down after executing prediction pipeline. Args: options(Namespace): Pipeline specific options """ pass PK!!' ' kiwi/lib/search.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. 
# ---- kiwi/lib/search.py ----
# OpenKiwi: Open-Source Machine Translation Quality Estimation
# Copyright (C) 2019 Unbabel
# (AGPL-3.0 header as in kiwi/lib/predict.py above)

import itertools
from collections import OrderedDict
from pathlib import Path

import configargparse

from kiwi.cli.opts import PathType
from kiwi.lib import train
from kiwi.models.model import Model

parser = configargparse.get_argument_parser('search')
parser.add_argument(
    '-e',
    '--experiment-name',
    required=False,
    help='MLflow will log this run under this experiment name, '
    'which appears as a separate section in the UI. It '
    'will also be used in some messages and files.',
)
parser.add(
    '-c',
    '--config',
    required=True,
    is_config_file=False,
    type=PathType(exists=True),
    help='Load config file from path',
)
group = parser.add_argument_group('models')
group.add_argument('model_name', choices=Model.subclasses.keys())


def get_action(option):
    for action in train.parser._actions:
        if option in train.parser.get_possible_config_keys(action):
            return action
    return None


def split_options(options):
    meta_options = OrderedDict()
    normal_options = []
    for key, value in options.items():
        if isinstance(value, list):
            meta_options[key] = value
        else:
            action = get_action(key)
            normal_options += parser.convert_item_to_command_line_arg(
                action, key, value
            )
    return meta_options, normal_options


def run(options, extra_options):
    config_parser = configargparse.YAMLConfigFileParser()
    config_options = config_parser.parse(Path(options.config).read_text())
    meta, fixed_options = split_options(config_options)

    # Run for each combination of arguments
    fixed_args = [options.model_name] + extra_options
    if options.experiment_name:
        fixed_args += parser.convert_item_to_command_line_arg(
            None, 'experiment-name', options.experiment_name
        )
    meta_keys = meta.keys()
    meta_values = meta.values()
    for values in itertools.product(*meta_values):
        assert len(meta_keys) == len(values)
        run_args = []
        for key, value in zip(meta_keys, values):
            action = get_action(key)
            run_args.extend(
                parser.convert_item_to_command_line_arg(
                    action, key, str(value)
                )
            )
        full_args = fixed_args + run_args + fixed_options
        train.main(full_args)


def main(argv=None, external_options=None):
    raise NotImplementedError('Pipeline not yet supported.')
    # options, extra_options = parser.parse_known_args(args=argv)
    # run(options, extra_options)


if __name__ == '__main__':
    main()
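# Illustrative sketch (key names are assumptions): list-valued keys in the
# search config become a grid that `run` expands with itertools.product,
# one training run per combination. A config such as
#
#     learning-rate: [0.1, 0.01]
#     dropout: [0.0, 0.5]
#
# would launch four runs: (0.1, 0.0), (0.1, 0.5), (0.01, 0.0), (0.01, 0.5).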
# ---- kiwi/lib/train.py ----
# OpenKiwi: Open-Source Machine Translation Quality Estimation
# Copyright (C) 2019 Unbabel
# (AGPL-3.0 header as in kiwi/lib/predict.py above)

import logging
from pathlib import Path
from pprint import pformat

import torch

from kiwi import constants as const
from kiwi.cli.pipelines.train import build_parser
from kiwi.data import builders, utils
from kiwi.data.iterators import build_bucket_iterator
from kiwi.data.utils import (
    save_training_datasets,
    save_vocabularies_from_datasets,
)
from kiwi.lib.utils import (
    configure_logging,
    configure_seed,
    merge_namespaces,
    save_args_to_file,
    setup_output_directory,
)
from kiwi.loggers import tracking_logger
from kiwi.models.linear_word_qe_classifier import LinearWordQEClassifier
from kiwi.models.model import Model
from kiwi.trainers.callbacks import Checkpoint
from kiwi.trainers.linear_word_qe_trainer import LinearWordQETrainer
from kiwi.trainers.trainer import Trainer
from kiwi.trainers.utils import optimizer_class

logger = logging.getLogger(__name__)


class TrainRunInfo:
    """Encapsulates relevant information on training runs.

    Can be instantiated with a trainer object.

    Attributes:
        stats: Stats of the best model so far.
        model_path: Path of the best model so far.
        run_uuid: Unique identifier of the current run.
    """

    def __init__(self, trainer):
        # FIXME: linear trainer not yet supported here
        # (no full support to checkpointer)
        self.stats = trainer.checkpointer.best_stats()
        self.model_path = trainer.checkpointer.best_model_path()
        self.run_uuid = tracking_logger.run_uuid


def train_from_file(filename):
    """Load options from a config file and call the training procedure.

    Args:
        filename (str): filename of the configuration file.
    """
    parser = build_parser()
    options = parser.parse_config_file(filename)
    return train_from_options(options)


def train_from_options(options):
    """Run the entire training pipeline using the configuration options
    received.

    These options include the pipeline and model options plus the model's
    API.

    Args:
        options (Namespace): All the configuration options retrieved
            from either a config file or input flags, plus the model
            being used.
    """
    if options is None:
        return

    pipeline_options = options.pipeline
    model_options = options.model
    ModelClass = options.model_api

    tracking_run = tracking_logger.configure(
        run_uuid=pipeline_options.run_uuid,
        experiment_name=pipeline_options.experiment_name,
        tracking_uri=pipeline_options.mlflow_tracking_uri,
        always_log_artifacts=pipeline_options.mlflow_always_log_artifacts,
    )
    with tracking_run:
        output_dir = setup(
            output_dir=pipeline_options.output_dir,
            seed=pipeline_options.seed,
            gpu_id=pipeline_options.gpu_id,
            debug=pipeline_options.debug,
            quiet=pipeline_options.quiet,
        )
        all_options = merge_namespaces(pipeline_options, model_options)
        log(
            output_dir,
            config_options=vars(all_options),
            save_config=pipeline_options.save_config,
        )

        trainer = run(ModelClass, output_dir, pipeline_options, model_options)
        train_info = TrainRunInfo(trainer)

        teardown(pipeline_options)

    return train_info
def run(ModelClass, output_dir, pipeline_options, model_options):
    """Implement the main logic of the training module.

    Instantiates the dataset and the model class, and sets their
    attributes according to the pipeline options received. Loads or
    creates a trainer and runs it.

    Args:
        ModelClass (Model): Python type of the model to train.
        output_dir: Directory to save models.
        pipeline_options (Namespace): Generic train options.
            load_model: load pre-trained predictor model.
            resume: load trainer state and resume training.
            gpu_id: set to a non-negative integer to train on GPU.
            train_batch_size: batch size for training.
            valid_batch_size: batch size for validation.
        model_options (Namespace): Model specific options.

    Returns:
        The trainer object.
    """
    model_name = getattr(ModelClass, "title", ModelClass.__name__)
    logger.info("Training the {} model".format(model_name))

    # FIXME: make sure all places use output_dir
    # del pipeline_options.output_dir
    pipeline_options.output_dir = None

    # Data step
    fieldset = ModelClass.fieldset(
        wmt18_format=model_options.__dict__.get("wmt18_format")
    )
    datasets = retrieve_datasets(
        fieldset, pipeline_options, model_options, output_dir
    )
    save_vocabularies_from_datasets(output_dir, *datasets)
    if pipeline_options.save_data:
        save_training_datasets(pipeline_options.save_data, *datasets)

    # Trainer step
    device_id = None
    if pipeline_options.gpu_id is not None and pipeline_options.gpu_id >= 0:
        device_id = pipeline_options.gpu_id

    vocabs = utils.fields_to_vocabs(datasets[0].fields)

    trainer = retrieve_trainer(
        ModelClass,
        pipeline_options,
        model_options,
        vocabs,
        output_dir,
        device_id,
    )

    logger.info(str(trainer.model))
    logger.info("{} parameters".format(trainer.model.num_parameters()))

    # Dataset iterators
    train_iter = build_bucket_iterator(
        datasets[0],
        batch_size=pipeline_options.train_batch_size,
        is_train=True,
        device=device_id,
    )
    valid_iter = build_bucket_iterator(
        datasets[1],
        batch_size=pipeline_options.valid_batch_size,
        is_train=False,
        device=device_id,
    )

    trainer.run(train_iter, valid_iter, epochs=pipeline_options.epochs)

    return trainer
def retrieve_trainer(
    ModelClass, pipeline_options, model_options, vocabs, output_dir, device_id
):
    """Create a Trainer object with an associated model.

    This object encapsulates the logic behind training the model and
    checkpointing. This method uses the received pipeline options to
    instantiate a Trainer object with the requested model and
    hyperparameters.

    Args:
        ModelClass
        pipeline_options (Namespace): Generic training options.
            resume (bool): Set to true if resuming an existing run.
            load_model (str): Directory containing model.torch for loading
                a pre-created model.
            checkpoint_save (bool): Boolean indicating if snapshots should
                be saved after validation runs. Warning: if false, the
                model will never be saved.
            checkpoint_keep_only_best (int): Indicates kiwi to keep the
                best `n` models.
            checkpoint_early_stop_patience (int): Stops training if metrics
                don't improve after `n` validation runs.
            checkpoint_validation_steps (int): Perform validation every
                `n` training steps.
            optimizer (string): The optimizer to be used in training.
            learning_rate (float): Starting learning rate.
            learning_rate_decay (float): Factor of learning rate decay.
            learning_rate_decay_start (int): Start decay after epoch `x`.
            log_interval (int): Log after `k` batches.
        model_options (Namespace): Model specific options.
        vocabs (dict): Vocab dictionary.
        output_dir (str or Path): Output directory for models and stats
            concerning training.
        device_id (int): The GPU id to be used in training. Set to
            negative to use CPU.

    Returns:
        Trainer
    """
    if pipeline_options.resume:
        return Trainer.resume(local_path=output_dir, device_id=device_id)

    if pipeline_options.load_model:
        model = Model.create_from_file(pipeline_options.load_model)
    else:
        model = ModelClass.from_options(vocabs=vocabs, opts=model_options)

    checkpointer = Checkpoint(
        output_dir,
        pipeline_options.checkpoint_save,
        pipeline_options.checkpoint_keep_only_best,
        pipeline_options.checkpoint_early_stop_patience,
        pipeline_options.checkpoint_validation_steps,
    )

    if isinstance(model, LinearWordQEClassifier):
        trainer = LinearWordQETrainer(
            model,
            model_options.training_algorithm,
            model_options.regularization_constant,
            checkpointer,
        )
    else:
        # Set GPU or CPU; has to be before instantiating the optimizer
        model.to(device_id)

        # Optimizer
        OptimizerClass = optimizer_class(pipeline_options.optimizer)
        optimizer = OptimizerClass(
            model.parameters(), lr=pipeline_options.learning_rate
        )

        scheduler = None
        if 0.0 < pipeline_options.learning_rate_decay < 1.0:
            scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
                optimizer,
                factor=pipeline_options.learning_rate_decay,
                patience=pipeline_options.learning_rate_decay_start,
                verbose=True,
                mode="max",
            )

        trainer = Trainer(
            model,
            optimizer,
            checkpointer,
            log_interval=pipeline_options.log_interval,
            scheduler=scheduler,
        )
    return trainer


def retrieve_datasets(fieldset, pipeline_options, model_options, output_dir):
    """Create `Dataset` objects for the training and validation sets.

    Parses files according to pipeline and model options.

    Args:
        fieldset
        pipeline_options (Namespace): Generic training options.
            load_data (str): Input directory for loading preprocessed data
                files.
            load_model (str): Directory containing model.torch for loading
                a pre-created model.
            resume (boolean): Indicates if training should be resumed from
                a previous run.
            load_vocab (str): Directory containing the vocab.torch file to
                be loaded.
        model_options (Namespace): Model specific options.
        output_dir (str): Path to directory where experiment files should
            be saved.

    Returns:
        datasets (Dataset): Training and validation datasets.
    """
    if pipeline_options.load_data:
        datasets = utils.load_training_datasets(
            pipeline_options.load_data, fieldset
        )
    else:
        load_vocab = None
        if pipeline_options.resume:
            load_vocab = Path(output_dir, const.VOCAB_FILE)
        elif pipeline_options.load_model:
            load_vocab = pipeline_options.load_model
        elif model_options.__dict__.get("load_pred_source"):
            load_vocab = model_options.load_pred_source
        elif model_options.__dict__.get("load_pred_target"):
            load_vocab = model_options.load_pred_target
        elif pipeline_options.load_vocab:
            load_vocab = pipeline_options.load_vocab

        datasets = builders.build_training_datasets(
            fieldset, load_vocab=load_vocab, **vars(model_options)
        )
    return datasets
def setup(output_dir, seed=42, gpu_id=None, debug=False, quiet=False):
    """Analyze pipeline options and set up requirements for running the
    training pipeline.

    This includes setting up the output directory, random seeds and the
    device(s) where training is run.

    Args:
        output_dir: Path to directory to use, or None, in which case one
            is created automatically.
        seed (int): Random seed for all random engines (Python, PyTorch,
            NumPy).
        gpu_id (int): GPU number to use or `None` to use the CPU.
        debug (bool): Whether to increase the verbosity of output
            messages.
        quiet (bool): Whether to decrease the verbosity of output
            messages. Takes precedence over `debug`.

    Returns:
        output_dir (str): Path to the output directory.
    """
    output_dir = setup_output_directory(
        output_dir,
        tracking_logger.run_uuid,
        tracking_logger.experiment_id,
        create=True,
    )
    configure_logging(output_dir=output_dir, debug=debug, quiet=quiet)
    configure_seed(seed)

    logging.info("This is run ID: {}".format(tracking_logger.run_uuid))
    logging.info(
        "Inside experiment ID: {} ({})".format(
            tracking_logger.experiment_id, tracking_logger.experiment_name
        )
    )
    logging.info("Local output directory is: {}".format(output_dir))
    logging.info(
        "Logging execution to MLflow at: {}".format(
            tracking_logger.get_tracking_uri()
        )
    )

    if gpu_id is not None and gpu_id >= 0:
        torch.cuda.set_device(gpu_id)
        logging.info("Using GPU: {}".format(gpu_id))
    else:
        logging.info("Using CPU")

    logging.info(
        "Artifacts location: {}".format(tracking_logger.get_artifact_uri())
    )

    return output_dir


def teardown(options):
    """Tear down after executing the training pipeline.

    Args:
        options (Namespace): Pipeline specific options.
    """
    pass


def log(
    output_dir,
    config_options,
    config_file_name="train_config.yml",
    save_config=None,
):
    """Log configuration options for the current training run.

    Args:
        output_dir (str): Path to directory where experiment files should
            be saved.
        config_options (Namespace): Namespace representing all
            configuration options.
        config_file_name (str): Filename of the config file.
        save_config (str or Path): Path for saving a copy of the
            configuration file, or None to skip saving.
    """
    logging.debug(pformat(config_options))

    config_file_copy = Path(output_dir, config_file_name)
    save_args_to_file(config_file_copy, **config_options)
    if tracking_logger.should_log_artifacts():
        tracking_logger.log_artifact(str(config_file_copy))
    if save_config:
        save_args_to_file(save_config, output_dir=output_dir, **config_options)

    # Log parameters
    tracking_logger.log_param("output_dir", output_dir)
    tracking_logger.log_param("save_config", save_config)
    for param, value in config_options.items():
        tracking_logger.log_param(param, value)
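# Illustrative usage (a sketch; the config path is an assumption):
#
#     from kiwi.lib.train import train_from_file
#
#     train_info = train_from_file('experiments/train_config.yml')
#     print(train_info.model_path, train_info.stats)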
# ---- kiwi/lib/utils.py ----
# OpenKiwi: Open-Source Machine Translation Quality Estimation
# Copyright (C) 2019 Unbabel
# (AGPL-3.0 header as in kiwi/lib/predict.py above)

import argparse
import logging
import random
from argparse import Namespace
from pathlib import Path
from time import gmtime

import configargparse
import numpy as np
import torch


def configure_seed(seed):
    """Configure the random seed for all relevant packages.

    These include: random, numpy, torch and torch.cuda.

    Args:
        seed (int): the random seed to be set.
    """
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)


def configure_device(gpu_id):
    """Configure the GPU to be used in computation.

    Args:
        gpu_id (int): The id of the gpu to be used.
    """
    if gpu_id is not None:
        torch.cuda.set_device(gpu_id)


def configure_logging(output_dir=None, debug=False, quiet=False):
    """Configure the logger.

    Sets up the log format, logging level, and output directory of
    logging.

    Args:
        output_dir: The directory where log output will be stored.
            Defaults to None.
        debug (bool): Change logging level to debug.
        quiet (bool): Change logging level to warning to suppress info
            logs.
    """
    logging.Formatter.converter = gmtime
    logging.Formatter.default_msec_format = '%s.%03d'
    log_format = '%(asctime)s [%(name)s %(funcName)s:%(lineno)s] %(message)s'
    if logging.getLogger().handlers:
        log_formatter = logging.Formatter(log_format)
        for handler in logging.getLogger().handlers:
            handler.setFormatter(log_formatter)
    else:
        logging.basicConfig(level=logging.INFO, format=log_format)
    log_level = logging.INFO
    if debug:
        log_level = logging.DEBUG
    if quiet:
        log_level = logging.WARNING
    logging.getLogger().setLevel(log_level)
    if output_dir is not None:
        fh = logging.FileHandler(str(Path(output_dir, 'output.log')))
        fh.setLevel(log_level)
        logging.getLogger().addHandler(fh)


def save_args_to_file(file_name, **kwargs):
    """Save `**kwargs` to a file.

    Args:
        file_name (str): The name of the file where the args should be
            saved.
    """
    options_to_save = {
        k.replace('_', '-'): v for k, v in kwargs.items() if v is not None
    }
    content = configargparse.YAMLConfigFileParser().serialize(options_to_save)
    Path(file_name).write_text(content)
    logging.debug(
        'Saved current options to config file: {}'.format(file_name)
    )


def save_config_file(options, file_name):
    """Save a configuration file with OpenKiwi configuration options.

    Calls `save_args_to_file`.

    Args:
        options (Namespace): Namespace with all configuration options
            that should be saved.
        file_name (str): Name of the output configuration file.
    """
    # parser.write_config_file(options, [file_name], exit_after=False)
    save_args_to_file(file_name, **vars(options))


def setup_output_directory(
    output_dir, run_uuid=None, experiment_id=None, create=True
):
    """Set up the output directory.

    This means either creating one, or verifying that the provided
    directory exists. Output directories are created using the run and
    experiment ids.

    Args:
        output_dir (str): The target output directory.
        run_uuid: The hash of the current run.
        experiment_id: The id of the current experiment.
        create (bool): Boolean indicating whether to create a new folder.
    """
    if not output_dir:
        if experiment_id is None or run_uuid is None:
            raise argparse.ArgumentError(
                message='Please specify an output directory (--output-dir).',
                argument=output_dir,
            )
        output_path = Path('runs', str(experiment_id), str(run_uuid))
        output_dir = str(output_path)

    if create:
        Path(output_dir).mkdir(parents=True, exist_ok=True)
    elif not Path(output_dir).exists():
        raise FileNotFoundError(
            'Output directory does not exist: {}'.format(output_dir)
        )

    return output_dir


def merge_namespaces(*args):
    """Utility function used to merge Namespaces.

    Useful for merging Argparse options.

    Args:
        *args: Variable length list of Namespaces.
    """
    if not args:
        return None
    options = {}
    for arg in filter(None, args):
        options.update(dict(vars(arg)))
    return Namespace(**options)
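# Illustrative sketch of `merge_namespaces`: later namespaces win on
# duplicate keys.
#
#     >>> from argparse import Namespace
#     >>> merge_namespaces(Namespace(a=1, b=2), Namespace(b=3))
#     Namespace(a=1, b=3)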
def parse_integer_with_positive_infinity(string):
    """Workaround to be able to pass both integers and infinity as CLAs.

    Args:
        string: A string representation of an integer, or infinity.
    """
    try:
        integer = int(string)
        return integer
    except ValueError:
        infinity = float(string)
        if infinity == float('inf'):
            return infinity
        raise ValueError(
            'Could not parse argument "{}" as integer'
            ' with positive infinity'.format(string)
        )


# ---- kiwi/loggers.py ----
# OpenKiwi: Open-Source Machine Translation Quality Estimation
# Copyright (C) 2019 Unbabel
# (AGPL-3.0 header as in kiwi/lib/predict.py above)

import logging
import threading
import uuid

logger = logging.getLogger(__name__)


class TrackingLogger:
    class ActiveRun:
        def __init__(self, run_uuid, experiment_id):
            self.run_uuid = run_uuid
            self.experiment_name = experiment_id

        def __enter__(self):
            return self

        def __exit__(self, exc_type, exc_val, exc_tb):
            return exc_type is None

    def __init__(self):
        self._experiment_id = None
        self._experiment_name = None
        self._active_run_uuids = []

    def configure(
        self, run_uuid, experiment_name, nest_run=True, *args, **kwargs
    ):
        if len(self._active_run_uuids) > 0 and not nest_run:
            raise Exception(
                'A run is already active. To start a nested run, call '
                'start_nested_run(), or configure() with nest_run=True'
            )
        if not self._active_run_uuids:
            self._experiment_name = experiment_name
            self._experiment_id = 0
        if run_uuid is None:
            self._active_run_uuids.append(uuid.uuid4().hex)
        else:
            self._active_run_uuids.append(run_uuid)
        return TrackingLogger.ActiveRun(
            run_uuid=self._active_run_uuids[-1],
            experiment_id=self._experiment_id,
        )

    def start_nested_run(self, run_name=None):
        return self.configure(
            run_uuid=run_name, experiment_name=None, nest_run=True
        )

    @property
    def run_uuid(self):
        return self._active_run_uuids[-1] if self._active_run_uuids else None

    @property
    def experiment_id(self):
        return self._experiment_id

    @property
    def experiment_name(self):
        return self._experiment_name

    def should_log_artifacts(self):
        return False

    def get_tracking_uri(self):
        return None

    @staticmethod
    def log_metric(key, value):
        pass

    @staticmethod
    def log_param(key, value):
        pass

    @staticmethod
    def log_artifact(local_path, artifact_path=None):
        pass

    @staticmethod
    def log_artifacts(local_dir, artifact_path=None):
        return None

    @staticmethod
    def get_artifact_uri():
        return None

    @staticmethod
    def end_run():
        pass
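# Illustrative usage (a sketch): the logger is used as a context manager
# around a run. The call below matches the no-op TrackingLogger; the
# MLflow-backed logger defined next additionally requires a `tracking_uri`
# argument.
#
#     run = tracking_logger.configure(
#         run_uuid=None, experiment_name='my-experiment'
#     )
#     with run:
#         tracking_logger.log_param('learning-rate', 0.001)
#         tracking_logger.log_metric('F1_MULT', 0.45)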
class MLflowLogger:
    def __init__(self):
        self.always_log_artifacts = False
        self._experiment_name = None

    def configure(
        self,
        run_uuid,
        experiment_name,
        tracking_uri,
        always_log_artifacts=False,
        create_run=True,
        create_experiment=True,
        nest_run=True,
    ):
        if mlflow.active_run() and not nest_run:
            logger.info(
                'Ending previous MLFlow run: {}.'.format(self.run_uuid)
            )
            mlflow.end_run()

        self.always_log_artifacts = always_log_artifacts
        self._experiment_name = experiment_name

        # MLflow specific
        if tracking_uri:
            mlflow.set_tracking_uri(tracking_uri)
        if run_uuid:
            existing_run = MlflowClient().get_run(run_uuid)
            if not existing_run and not create_run:
                raise FileNotFoundError(
                    'Run ID {} not found under {}'.format(
                        run_uuid, mlflow.get_tracking_uri()
                    )
                )
        experiment_id = self._retrieve_mlflow_experiment_id(
            experiment_name, create=create_experiment
        )
        return mlflow.start_run(
            run_uuid=run_uuid, experiment_id=experiment_id, nested=nest_run
        )

    def start_nested_run(self, run_name=None):
        return mlflow.start_run(run_name=run_name, nested=True)

    @property
    def run_uuid(self):
        return mlflow.tracking.fluent.active_run().info.run_uuid

    @property
    def experiment_id(self):
        return mlflow.tracking.fluent.active_run().info.experiment_id

    @property
    def experiment_name(self):
        # return MlflowClient().get_experiment(self.experiment_id).name
        return self._experiment_name

    def should_log_artifacts(self):
        return self.always_log_artifacts or self._is_remote()

    @staticmethod
    def get_tracking_uri():
        return mlflow.get_tracking_uri()

    @staticmethod
    def log_metric(key, value):
        mlflow.log_metric(key, value)

    @staticmethod
    def log_param(key, value):
        mlflow.log_param(key, value)

    @staticmethod
    def log_artifact(local_path, artifact_path=None):
        t = threading.Thread(
            target=mlflow.log_artifact,
            args=(local_path,),
            kwargs={'artifact_path': artifact_path},
            daemon=True,
        )
        t.start()

    @staticmethod
    def log_artifacts(local_dir, artifact_path=None):
        def send(dpath, e, path):
            mlflow.log_artifacts(dpath, artifact_path=path)
            e.set()

        event = threading.Event()
        t = threading.Thread(
            target=send, args=(local_dir, event, artifact_path), daemon=True
        )
        t.start()
        return event

    @staticmethod
    def get_artifact_uri():
        return mlflow.get_artifact_uri()

    @staticmethod
    def end_run():
        mlflow.end_run()

    def _is_remote(self):
        return not mlflow.tracking.utils._is_local_uri(
            mlflow.get_tracking_uri()
        )

    @staticmethod
    def _retrieve_mlflow_experiment_id(name, create=False):
        experiment_id = None
        if name:
            existing_experiment = MlflowClient().get_experiment_by_name(name)
            if existing_experiment:
                experiment_id = existing_experiment.experiment_id
            else:
                if create:
                    experiment_id = mlflow.create_experiment(name)
                else:
                    raise Exception(
                        'Experiment "{}" not found in {}'.format(
                            name, mlflow.get_tracking_uri()
                        )
                    )
        return experiment_id


try:
    import mlflow
    from mlflow.tracking import MlflowClient

    tracking_logger = MLflowLogger()
except ImportError:
    tracking_logger = TrackingLogger()
# ---- kiwi/metrics/__init__.py ----
# OpenKiwi: Open-Source Machine Translation Quality Estimation
# Copyright (C) 2019 Unbabel
# (AGPL-3.0 header as in kiwi/lib/predict.py above)

from kiwi.metrics import metrics

F1Metric = metrics.F1Metric
LogMetric = metrics.LogMetric
ExpectedErrorMetric = metrics.ExpectedErrorMetric
PerplexityMetric = metrics.PerplexityMetric
CorrectMetric = metrics.CorrectMetric
RMSEMetric = metrics.RMSEMetric
PearsonMetric = metrics.PearsonMetric
SpearmanMetric = metrics.SpearmanMetric
TokenMetric = metrics.TokenMetric
ThresholdCalibrationMetric = metrics.ThresholdCalibrationMetric


# ---- kiwi/metrics/functions.py ----
# OpenKiwi: Open-Source Machine Translation Quality Estimation
# Copyright (C) 2019 Unbabel
# (AGPL-3.0 header as in kiwi/lib/predict.py above)

import numpy as np
from more_itertools import collapse

# def calibrate_threshold(scores, labels, MetricClass=LazyF1):
#     """Finds optimal decision threshold according to metric.
#
#     Args:
#         scores (list[float]): List of model output scores
#         labels (list): List of corresponding target labels
#
#     Returns:
#         (metric, threshold): The value of the Metric and the Threshold
#             to be used.
#     """
#     metric = MetricClass(scores, labels)
#     scores, labels = metric.sort(scores, labels)
#     init_threshold = scores[0]
#     thresholds = [(metric.compute(), init_threshold)]
#     for score, label in zip(scores, labels):
#         metric.update(score, label)
#         thresholds.append((metric.compute(), score))
#     return metric.choose(thresholds)


def mean_absolute_error(y, y_hat):
    return np.mean(np.absolute(y_hat - y))


def mean_squared_error(y, y_hat):
    return np.square(np.subtract(y, y_hat)).mean()
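# Illustrative sketch of the two error functions above:
#
#     import numpy as np
#
#     mean_absolute_error(np.array([0.0, 1.0]), np.array([1.0, 1.0]))
#     # -> 0.5
#     mean_squared_error(np.array([0.0, 2.0]), np.array([0.0, 0.0]))
#     # -> 2.0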
def delta_average(y_true, y_rank):
    """Calculate the DeltaAvg score.

    This is a much faster version than the Perl one provided in the
    WMT QE task 1.

    References: could not find any.

    Author: Fabio Kepler (contributed to MARMOT)

    Args:
        y_true: array of reference score (not rank) of each segment.
        y_rank: array of rank of each segment.

    Returns:
        the absolute delta average score.
    """
    sorted_ranked_indexes = np.argsort(y_rank)
    y_length = len(sorted_ranked_indexes)

    delta_avg = 0
    max_quantiles = y_length // 2
    set_value = (
        np.sum(y_true[sorted_ranked_indexes[np.arange(y_length)]]) / y_length
    )
    # Cache values, since there are many that are repeatedly computed
    # between various quantiles.
    quantile_values = {
        head: np.sum(y_true[sorted_ranked_indexes[np.arange(head)]]) / head
        for head in range(2, y_length)
    }
    for quantiles in range(2, max_quantiles + 1):
        # Current number of quantiles
        quantile_length = y_length // quantiles
        quantile_sum = 0
        for head in np.arange(
            quantile_length, quantiles * quantile_length, quantile_length
        ):
            quantile_sum += quantile_values[head]
        delta_avg += quantile_sum / (quantiles - 1) - set_value
    if max_quantiles > 1:
        delta_avg /= max_quantiles - 1
    else:
        delta_avg = 0
    return abs(delta_avg)


def precision(tp, fp, fn):
    if tp + fp > 0:
        return tp / (tp + fp)
    return 0


def recall(tp, fp, fn):
    if tp + fn > 0:
        return tp / (tp + fn)
    return 0


def fscore(tp, fp, fn):
    p = precision(tp, fp, fn)
    r = recall(tp, fp, fn)
    if p + r > 0:
        return 2 * (p * r) / (p + r)
    return 0


def confusion_matrix(hat_y, y, n_classes=None):
    hat_y = np.array(list(collapse(hat_y)))
    y = np.array(list(collapse(y)))
    if n_classes is None:
        classes = np.unique(np.union1d(hat_y, y))
        n_classes = len(classes)
    cnfm = np.zeros((n_classes, n_classes))
    for j in range(y.shape[0]):
        cnfm[y[j], hat_y[j]] += 1
    return cnfm


def scores_for_class(class_index, cnfm):
    tp = cnfm[class_index, class_index]
    fp = cnfm[:, class_index].sum() - tp
    fn = cnfm[class_index, :].sum() - tp
    tn = cnfm.sum() - tp - fp - fn

    p = precision(tp, fp, fn)
    r = recall(tp, fp, fn)
    f1 = fscore(tp, fp, fn)
    # Support is the number of gold occurrences of this class (tp + fn).
    support = tp + fn
    return p, r, f1, support


def precision_recall_fscore_support(hat_y, y, labels=None):
    n_classes = len(labels) if labels else None
    cnfm = confusion_matrix(hat_y, y, n_classes)
    if n_classes is None:
        n_classes = cnfm.shape[0]
    scores = np.zeros((n_classes, 4))
    for class_id in range(n_classes):
        scores[class_id] = scores_for_class(class_id, cnfm)
    return scores.T.tolist()


def f1_product(hat_y, y):
    p, r, f1, s = precision_recall_fscore_support(hat_y, y)
    f1_mult = np.prod(f1)
    return f1_mult


def f1_scores(hat_y, y):
    """Return f1_bad, f1_ok and f1_product."""
    p, r, f1, s = precision_recall_fscore_support(hat_y, y)
    f_mult = np.prod(f1)
    return (*f1, f_mult)
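# Illustrative sketch: for binary word-level labels, `f1_scores` returns
# the per-class F1 values followed by their product.
#
#     f1_class0, f1_class1, f1_mult = f1_scores(
#         hat_y=[1, 0, 1, 1], y=[1, 0, 0, 1]
#     )
#     # approximately 0.667, 0.8, and 0.533 respectively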
# ---- kiwi/metrics/metrics.py ----
# OpenKiwi: Open-Source Machine Translation Quality Estimation
# Copyright (C) 2019 Unbabel
# (AGPL-3.0 header as in kiwi/lib/predict.py above)

import math
import time
from collections import OrderedDict

import numpy as np
import torch
from scipy.stats.stats import pearsonr, spearmanr
from torch import nn

from kiwi import constants as const
from kiwi.metrics.functions import fscore, precision_recall_fscore_support
from kiwi.models.utils import replace_token


class Metric:
    def __init__(
        self,
        target_name=None,
        metric_name=None,
        PAD=None,
        STOP=None,
        prefix=None,
    ):
        super().__init__()
        self.reset()
        self.prefix = prefix
        self.target_name = target_name
        self.metric_name = metric_name
        self.PAD = PAD
        self.STOP = STOP

    def update(self, **kwargs):
        raise NotImplementedError

    def reset(self):
        raise NotImplementedError

    def summarize(self, **kwargs):
        raise NotImplementedError

    def get_name(self):
        return self._prefix(self.metric_name)

    def _prefix_keys(self, summary):
        if self.prefix:
            summary = OrderedDict(
                {self._prefix(key): value for key, value in summary.items()}
            )
        return summary

    def _prefix(self, key):
        if self.prefix:
            return '{}_{}'.format(self.prefix, key)
        return key

    def token_mask(self, batch):
        target = self.get_target(batch)
        if self.PAD is not None:
            return target != self.PAD
        else:
            return torch.ones(
                target.shape, dtype=torch.uint8, device=target.device
            )

    def get_target(self, batch):
        target = getattr(batch, self.target_name)
        if self.STOP is not None:
            target = replace_token(target[:, 1:-1], self.STOP, self.PAD)
        return target

    def get_token_indices(self, batch):
        mask = self.token_mask(batch)
        return mask.view(-1).nonzero().squeeze()

    def get_predictions(self, model_out):
        predictions = model_out[self.target_name]
        return predictions

    def get_target_flat(self, batch):
        target_flat = self.get_target(batch).contiguous().view(-1)
        token_indices = self.get_token_indices(batch)
        return target_flat[token_indices]

    def get_predictions_flat(self, model_out, batch):
        predictions = self.get_predictions(model_out).contiguous()
        predictions_flat = predictions.view(
            -1, predictions.shape[-1]
        ).squeeze()
        token_indices = self.get_token_indices(batch)
        return predictions_flat[token_indices]

    def get_tokens(self, batch):
        return self.token_mask(batch).sum().item()


class NLLMetric(Metric):
    def __init__(self, **kwargs):
        super().__init__(metric_name='NLL', **kwargs)

    def update(self, loss, batch, **kwargs):
        self.tokens += self.get_tokens(batch)
        self.nll += loss[self.target_name].item()

    def summarize(self):
        summary = {self.metric_name: self.nll / self.tokens}
        return self._prefix_keys(summary)

    def reset(self):
        self.nll = 0.0
        self.tokens = 0


class PerplexityMetric(Metric):
    def __init__(self, **kwargs):
        super().__init__(metric_name='PERP', **kwargs)

    def reset(self):
        self.tokens = 0
        self.nll = 0.0

    def update(self, loss, batch, **kwargs):
        self.tokens += self.get_tokens(batch)
        self.nll += loss[self.target_name].item()

    def summarize(self):
        summary = {self.metric_name: math.e ** (self.nll / self.tokens)}
        return self._prefix_keys(summary)


class CorrectMetric(Metric):
    def __init__(self, **kwargs):
        super().__init__(metric_name='CORRECT', **kwargs)

    def update(self, model_out, batch, **kwargs):
        self.tokens += self.get_tokens(batch)
        logits = self.get_predictions_flat(model_out, batch)
        target = self.get_target_flat(batch)
        _, pred = logits.max(-1)
        correct = target == pred
        correct_count = correct.sum().item()
        self.correct += correct_count

    def summarize(self):
        summary = {self.metric_name: float(self.correct) / self.tokens}
        return self._prefix_keys(summary)

    def reset(self):
        self.correct = 0
        self.tokens = 0
class F1Metric(Metric):
    def __init__(self, labels, **kwargs):
        super().__init__(metric_name='F1_MULT', **kwargs)
        self.labels = labels

    def update(self, model_out, batch, **kwargs):
        logits = self.get_predictions_flat(model_out, batch)
        target = self.get_target_flat(batch)
        _, y_hat = logits.max(-1)
        self.Y_HAT += y_hat.tolist()
        self.Y += target.tolist()

    def summarize(self):
        summary = OrderedDict()
        _, _, f1, _ = precision_recall_fscore_support(self.Y_HAT, self.Y)
        summary[self.metric_name] = np.prod(f1)
        for i, label in enumerate(self.labels):
            summary['F1_' + label] = f1[i]
        return self._prefix_keys(summary)

    def reset(self):
        self.Y = []
        self.Y_HAT = []


class PearsonMetric(Metric):
    def __init__(self, **kwargs):
        super().__init__(metric_name='PEARSON', **kwargs)

    def reset(self):
        self.predictions = []
        self.target = []

    def update(self, model_out, batch, **kwargs):
        target = self.get_target_flat(batch)
        predictions = self.get_predictions_flat(model_out, batch)
        self.predictions += predictions.tolist()
        self.target += target.tolist()

    def summarize(self):
        pearson = pearsonr(self.predictions, self.target)[0]
        summary = {self.metric_name: pearson}
        return self._prefix_keys(summary)


class SpearmanMetric(Metric):
    def __init__(self, **kwargs):
        super().__init__(metric_name='SPEARMAN', **kwargs)

    def reset(self):
        self.predictions = []
        self.target = []

    def update(self, model_out, batch, **kwargs):
        target = self.get_target_flat(batch)
        predictions = self.get_predictions_flat(model_out, batch)
        self.predictions += predictions.tolist()
        self.target += target.tolist()

    def summarize(self):
        spearman = spearmanr(self.predictions, self.target)[0]
        summary = {self.metric_name: spearman}
        return self._prefix_keys(summary)


class ExpectedErrorMetric(Metric):
    def __init__(self, **kwargs):
        super().__init__(metric_name='ExpErr', **kwargs)

    def update(self, model_out, batch, **kwargs):
        logits = self.get_predictions_flat(model_out, batch)
        target = self.get_target_flat(batch)
        probs = nn.functional.softmax(logits, -1)
        probs = probs.gather(-1, target.unsqueeze(-1)).squeeze()
        errors = 1.0 - probs
        self.tokens += self.get_tokens(batch)
        self.expected_error += errors.sum().item()

    def summarize(self):
        summary = {self.metric_name: self.expected_error / self.tokens}
        return self._prefix_keys(summary)

    def reset(self):
        self.expected_error = 0.0
        self.tokens = 0


class TokPerSecMetric(Metric):
    def __init__(self, **kwargs):
        super().__init__(metric_name='TokPerSec', **kwargs)

    def update(self, batch, **kwargs):
        self.tokens += self.get_tokens(batch)

    def summarize(self):
        summary = {self.metric_name: self.tokens / (time.time() - self.time)}
        return self._prefix_keys(summary)

    def reset(self):
        self.tokens = 0
        self.time = time.time()
""" def __init__(self, targets, metric_name=None, **kwargs): self.targets = targets metric_name = metric_name or self._format(*targets[0]) super().__init__(metric_name=metric_name, **kwargs) def update(self, **kwargs): self.steps += 1 for side, target in self.targets: key = self._format(side, target) self.log[key] += kwargs[side][target].mean().item() def summarize(self): summary = { key: value / float(self.steps) for key, value in self.log.items() } return self._prefix_keys(summary) def reset(self): self.log = { self._format(side, target): 0.0 for side, target in self.targets } self.steps = 0 def _format(self, side, target): return '{}_{}'.format(side, target) class RMSEMetric(Metric): def __init__(self, **kwargs): super().__init__(metric_name='RMSE', **kwargs) def update(self, batch, model_out, **kwargs): predictions = self.get_predictions_flat(model_out, batch) target = self.get_target_flat(batch) self.squared_error += ((predictions - target) ** 2).sum().item() self.tokens += self.get_tokens(batch) def summarize(self): rmse = math.sqrt(self.squared_error / self.tokens) summary = {self.metric_name: rmse} return self._prefix_keys(summary) def reset(self): self.squared_error = 0.0 self.tokens = 0 class TokenMetric(Metric): def __init__(self, target_token=const.UNK_ID, token_name='UNK', **kwargs): self.target_token = target_token super().__init__(metric_name='UNKS', **kwargs) def update(self, batch, **kwargs): target = self.get_target_flat(batch) self.targets += (target == self.target_token).sum().item() self.tokens += self.get_tokens(batch) def summarize(self): summary = {} if self.tokens: summary = {self.metric_name: self.targets / self.tokens} return self._prefix_keys(summary) def reset(self): self.tokens = 0 self.targets = 0 class ThresholdCalibrationMetric(Metric): def __init__(self, **kwargs): super().__init__(metric_name='F1_CAL', **kwargs) def update(self, model_out, batch, **kwargs): logits = self.get_predictions_flat(model_out, batch) bad_probs = nn.functional.softmax(logits, -1)[:, const.BAD_ID] target = self.get_target_flat(batch) self.scores += bad_probs.tolist() self.Y += target.tolist() def summarize(self): summary = {} mid = len(self.Y) // 2 if mid: perm = np.random.permutation(len(self.Y)) self.Y = [self.Y[idx] for idx in perm] self.scores = [self.scores[idx] for idx in perm] m = MovingF1() fscore, threshold = m.choose( m.eval(self.scores[:mid], self.Y[:mid]) ) predictions = [ const.BAD_ID if score >= threshold else const.OK_ID for score in self.scores[mid:] ] _, _, f1, _ = precision_recall_fscore_support( predictions, self.Y[mid:] ) f1_mult = np.prod(f1) summary = {self.metric_name: f1_mult} return self._prefix_keys(summary) def reset(self): self.scores = [] self.Y = [] class MovingMetric: """Class to compute the changes in one metric as a function of a second metric. Example: F1 score vs. Classification Threshold, Quality vs Skips """ def eval(self, scores, labels): """Compute the graph metric1 vs metric2 Args: Scores: Model Outputs Labels: Corresponding Labels """ self.init(scores, labels) scores, labels = self.sort(scores, labels) init_threshold = scores[0] thresholds = [(self.compute(), init_threshold)] for score, label in zip(scores, labels): self.update(score, label) thresholds.append((self.compute(), score)) return thresholds def init(self, scores, labels): """Initialize the Metric for threshold < min(scores) """ return scores, labels def sort(self, scores, labels): """Sort List of labels and scores. 
""" return zip(*sorted(zip(scores, labels))) def update(self, score, label): """Move the threshold past score """ return None def compute(self): """Compute the current Value of the metric """ pass def choose(self, thresholds): """Choose the best (threshold, metric) tuple from an iterable. """ pass class MovingF1(MovingMetric): def init(self, scores, labels, class_idx=1): """ Compute F1 Mult for all decision thresholds over (scores, labels) Initialize the threshold s.t. all examples are classified as `class_idx`. Args: scores: Likelihood scores for class index Labels: Gold Truth classes in {0,1} class_index: ID of class """ # -1 if class_idx == 0 , 1 if class_idx == 1 self.sign = 2 * class_idx - 1 class_one = sum(labels) class_zero = len(labels) - class_one self.fp_zero = (1 - class_idx) * class_one self.tp_zero = (1 - class_idx) * class_zero self.fp_one = class_idx * class_zero self.tp_one = class_idx * class_one def update(self, score, label): """Move the decision threshold. """ self.tp_zero += self.sign * (1 - label) self.fp_zero += self.sign * label self.tp_one -= self.sign * label self.fp_one -= self.sign * (1 - label) def compute(self): f1_zero = fscore(self.tp_zero, self.fp_zero, self.fp_one) f1_one = fscore(self.tp_one, self.fp_one, self.fp_zero) return f1_one * f1_zero def choose(self, thresholds): return max(thresholds) class MovingSkipsAtQuality(MovingMetric): """Computes Quality of skipped examples vs fraction of skips. """ def __init__( self, scores_higher_is_better=False, labels_higher_is_better=False ): """ Args: scores_higher_is_better: If True, higher model outputs indicate higher quality. labels_higher_is_better: If True, higher label values indicate higher quality. """ self.scores_higher_is_better = scores_higher_is_better self.labels_higher_is_better = labels_higher_is_better def eval(self, scores, labels): """ Args: scores: Model output quality or error scores. If quality scores are provided, pass scores_higher_is_better=True. labels: Ground truth quality or error scores. If quality scores are provided, pass labels_higher_is_better=True. """ return super().eval(scores, labels) def init(self, scores, labels): """ Args: scores: Model output quality or error scores. If quality scores are provided, pass scores_higher_is_better=True. labels: Ground truth quality or error scores. If quality scores are provided, pass labels_higher_is_better=True. """ self.cumulative_qual = 0.0 self.skipped = 0 self.data_size = len(scores) def update(self, score, label): self.cumulative_qual += label self.skipped += 1 def compute(self): if not self.skipped: return None, 0.0 return ( self.skipped / self.data_size, self.cumulative_qual / self.skipped, ) def choose(self, thresholds, target_qual): """Chooses the smallest threshold such that avg. 
class MovingSkipsAtQuality(MovingMetric):
    """Compute quality of skipped examples vs. fraction of skips."""

    def __init__(
        self, scores_higher_is_better=False, labels_higher_is_better=False
    ):
        """
        Args:
            scores_higher_is_better: If True, higher model outputs indicate
                higher quality.
            labels_higher_is_better: If True, higher label values indicate
                higher quality.
        """
        self.scores_higher_is_better = scores_higher_is_better
        self.labels_higher_is_better = labels_higher_is_better

    def eval(self, scores, labels):
        """
        Args:
            scores: Model output quality or error scores. If quality
                scores are provided, pass scores_higher_is_better=True.
            labels: Ground truth quality or error scores. If quality
                scores are provided, pass labels_higher_is_better=True.
        """
        return super().eval(scores, labels)

    def init(self, scores, labels):
        """
        Args:
            scores: Model output quality or error scores. If quality
                scores are provided, pass scores_higher_is_better=True.
            labels: Ground truth quality or error scores. If quality
                scores are provided, pass labels_higher_is_better=True.
        """
        self.cumulative_qual = 0.0
        self.skipped = 0
        self.data_size = len(scores)

    def update(self, score, label):
        self.cumulative_qual += label
        self.skipped += 1

    def compute(self):
        if not self.skipped:
            return None, 0.0
        return (
            self.skipped / self.data_size,
            self.cumulative_qual / self.skipped,
        )

    def choose(self, thresholds, target_qual):
        """Choose the smallest threshold such that the average quality is
        greater than or equal to target_qual.
        """
        best = None
        sign = 1 if self.labels_higher_is_better else -1
        for (skip, qual), t in thresholds:
            if (sign * (qual - target_qual)) >= 0:
                # The quality at threshold t is admissible given target_qual
                if best is None:
                    best = ((skip, qual), t)
                else:
                    # Compare against the quality (index 1, not the skip
                    # fraction) of the previous best.
                    last_best = abs(best[0][1] - target_qual)
                    if abs(qual - target_qual) < last_best:
                        # The quality at threshold t is admissible given
                        # target_qual and closer than the previous best
                        best = ((skip, qual), t)
        return best

    def sort(self, scores, labels):
        return zip(
            *sorted(zip(scores, labels), reverse=self.scores_higher_is_better)
        )
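# Illustrative sketch (variable names are assumptions): choosing the
# smallest skip fraction whose average quality of skipped examples meets a
# target.
#
#     m = MovingSkipsAtQuality(
#         scores_higher_is_better=False, labels_higher_is_better=False
#     )
#     curve = m.eval(scores=model_error_scores, labels=gold_error_scores)
#     best = m.choose(curve, target_qual=0.2)
#     # best is ((skip_fraction, avg_quality), threshold), or None if no
#     # threshold reaches the target quality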
# ---- kiwi/metrics/stats.py ----
# OpenKiwi: Open-Source Machine Translation Quality Estimation
# Copyright (C) 2019 Unbabel
# (AGPL-3.0 header as in kiwi/lib/predict.py above)

import functools
import logging
from collections import OrderedDict

from kiwi.loggers import tracking_logger

logger = logging.getLogger(__name__)


@functools.total_ordering
class StatsSummary(OrderedDict):
    def __init__(self, prefix=None, main_metric=None, ordering=max, **kwargs):
        self.prefix = prefix
        self._main_metric_name = main_metric
        self.ordering = ordering
        super().__init__(**kwargs)

    @property
    def main_metric(self):
        if self._main_metric_name:
            return self._main_metric_name
        elif self:
            return list(self.keys())[0]
        return None

    def main_metric_value(self):
        return self.__getitem__(self.main_metric)

    def _make_key(self, key):
        if self.prefix:
            key = '{}_{}'.format(self.prefix, key)
        return key

    def __str__(self):
        return ', '.join(
            ['{}: {:0.4f}'.format(k, v) for k, v in self.items()]
        )

    def log(self):
        """Log statistics to output and also to the tracking logger."""
        print('\r', end='\r')
        logger.info(self)
        for k, v in self.items():
            tracking_logger.log_metric(k, v)

    def __setitem__(self, key, value):
        key = self._make_key(key)
        super().__setitem__(key, value)

    def __getitem__(self, key):
        key = self._make_key(key)
        return super().__getitem__(key)

    def __contains__(self, key):
        key = self._make_key(key)
        return super().__contains__(key)

    def get(self, key, default=None):
        key = self._make_key(key)
        return super().get(key, default)

    def __eq__(self, other):
        return isinstance(other, StatsSummary) and self.get(
            self.main_metric
        ) == other.get(self.main_metric)

    def __le__(self, other):
        if self.ordering == max:
            return isinstance(other, StatsSummary) and self.get(
                self.main_metric
            ) <= other.get(self.main_metric)
        else:
            return isinstance(other, StatsSummary) and self.get(
                self.main_metric
            ) >= other.get(self.main_metric)

    def __gt__(self, other):
        if self.ordering == max:
            return isinstance(other, StatsSummary) and self.get(
                self.main_metric
            ) > other.get(self.main_metric)
        else:
            return isinstance(other, StatsSummary) and self.get(
                self.main_metric
            ) < other.get(self.main_metric)

    def better_than(self, other):
        if self.ordering == max:
            return isinstance(other, StatsSummary) and self.get(
                self.main_metric
            ) > other.get(self.main_metric)
        else:
            return isinstance(other, StatsSummary) and self.get(
                self.main_metric
            ) < other.get(self.main_metric)


class Stats:
    def __init__(
        self,
        metrics,
        main_metric=None,
        main_metric_ordering=max,
        log_interval=0,
    ):
        self.metrics = metrics
        main_metric = main_metric or self.metrics[0]
        self.main_metric_name = main_metric.get_name()
        self.main_metric_ordering = main_metric_ordering
        self.log_interval = log_interval
        self.reset()

    def update(self, **kwargs):
        self.steps += 1
        for metric in self.metrics:
            metric.update(**kwargs)

    def summarize(self, prefix=None):
        summary = StatsSummary(
            prefix=prefix,
            main_metric=self.main_metric_name,
            ordering=self.main_metric_ordering,
        )
        if self.steps:
            for metric in self.metrics:
                summary.update(metric.summarize())
        return summary

    def reset(self):
        self.steps = 0
        for metric in self.metrics:
            metric.reset()

    def wrap_up(self, prefix=None):
        summary = self.summarize(prefix)
        self.reset()
        return summary

    def log(self, step=None):
        if (
            step is None
            or self.log_interval > 0
            and not step % self.log_interval
        ):
            stats_summary = self.wrap_up()
            stats_summary.log()


# ---- kiwi/models/__init__.py ----
# OpenKiwi: Open-Source Machine Translation Quality Estimation
# Copyright (C) 2019 Unbabel
# (AGPL-3.0 header as in kiwi/lib/predict.py above)


# ---- kiwi/models/linear/__init__.py ----
# OpenKiwi: Open-Source Machine Translation Quality Estimation
# Copyright (C) 2019 Unbabel
# (AGPL-3.0 header as in kiwi/lib/predict.py above)
# # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see <https://www.gnu.org/licenses/>. # kiwi/models/linear/label_dictionary.py # -*- coding: utf-8 -*- """This implements a dictionary of labels.""" # OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see <https://www.gnu.org/licenses/>. # import warnings class LabelDictionary(dict): """This class implements a dictionary of labels. Labels are mapped to integers, and it is efficient to retrieve the label name from its integer representation, and vice-versa.""" def __init__(self, label_names=None): dict.__init__(self) self.names = [] if label_names is not None: for name in label_names: self.add(name) def add(self, name): """Add new label; duplicates are ignored and keep their original id.""" if name in self: warnings.warn('Ignoring duplicated label ' + name) return self[name] label_id = len(self.names) self[name] = label_id self.names.append(name) return label_id def get_label_name(self, label_id): """Get label name from id.""" return self.names[label_id] def get_label_id(self, name): """Get label id from name.""" return self[name] def save(self, label_file): """Save labels to a file.""" with open(label_file, 'w') as f: for name in self.names: f.write(name + '\n') def load(self, label_file): """Load labels from a file.""" self.names = [] self.clear() with open(label_file) as f: for line in f: name = line.rstrip('\n') self.add(name) kiwi/models/linear/linear_model.py """This implements a linear model.""" # OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see <https://www.gnu.org/licenses/>. # from .sparse_vector import SparseVector class LinearModel(object): """ An abstract linear model.""" def __init__(self): self.use_average = True self.weights = SparseVector() self.averaged_weights = SparseVector() def clear(self): """Clear all weights.""" self.weights.clear() self.averaged_weights.clear() def finalize(self, t): """Finalize by setting the weights as the running average.
This is a no-op if use_average=False.""" if self.use_average: self.averaged_weights.scale(1.0 / float(t)) self.weights.add(self.averaged_weights) def compute_score(self, features): """Compute a score by taking the inner product with a feature vector.""" score = features.dot_product(self.weights) return score def compute_score_binary_features(self, binary_features): """Compute a score by taking the inner product with a binary feature vector.""" score = 0.0 for f in binary_features: if f in self.weights: score += self.weights[f] return score def make_gradient_step(self, features, eta, t, gradient): """Make a gradient step with stepsize eta.""" self.weights.add(features, -eta * gradient) if self.use_average: self.averaged_weights.add(features, eta * float(t) * gradient) def save(self, model_file, average=False, feature_indices=None): """Save the model to a file.""" f = open(model_file, 'w') if feature_indices is not None: w = SparseVector() for index in self.weights: w[feature_indices.get_label_name(index)] = self.weights[index] w.save(f) else: self.weights.save(f) f.close() if average: f = open(model_file + '_average', 'w') if feature_indices is not None: w = SparseVector() for index in self.averaged_weights: w[ feature_indices.get_label_name(index) ] = self.averaged_weights[index] w.save(f) else: self.averaged_weights.save(f) f.close() def load(self, model_file, average=False, feature_indices=None): """Load the model from a file.""" f = open(model_file, 'r') if feature_indices is not None: w = SparseVector() w.load(f) for key in w: index = feature_indices.add(key) self.weights[index] = w[key] else: self.weights.load(f) f.close() if average: f = open(model_file + '_average', 'r') if feature_indices is not None: w = SparseVector() w.load(f) for key in w: index = feature_indices.get_label_id(key) self.averaged_weights[index] = w[key] else: self.averaged_weights.load(f) f.close() def write_fnames(self, fnames_file, fnames): """Write file mapping from integers to feature descriptions.""" f = open(fnames_file, 'w') for fid, fname in enumerate(fnames): f.write(str(1 + fid) + ' ' + fname + '\n') f.close() def read_fnames(self, fnames_file): """Read file mapping from integers to feature descriptions.""" assert False, 'This is not being called' fids = {} f = open(fnames_file) maxfid = -1 for line in f: line = line.rstrip('\n') fields = line.split(' ') fid = int(fields[0]) fname = fields[1] fids[fname] = fid if fid > maxfid: maxfid = fid fnames = [''] * maxfid for fname, fid in fids.items(): fnames[fid - 1] = fname f.close() return fnames, fids kiwi/models/linear/linear_trainer.py """A generic implementation of a basic trainer.""" # OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see <https://www.gnu.org/licenses/>.
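#
# (The trainer below drives a LinearModel through a small contract:
# score features, take gradient steps, and, for the non-SGD
# algorithms, average the weights at the end of training. A minimal
# sketch of that contract, illustrative only, assuming LinearModel and
# SparseVector as defined in this package:
#
#     >>> model = LinearModel()
#     >>> feats = SparseVector()
#     >>> feats['BIAS_OK'] = 1.0
#     >>> model.make_gradient_step(feats, eta=0.1, t=1, gradient=-1.0)
#     >>> model.compute_score(feats)  # weights moved towards feats
#     0.1
#
# Calling model.finalize(t) afterwards replaces the weights by their
# running average over the t rounds seen so far.)
#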
# import logging from pathlib import Path import numpy as np from kiwi import constants as const from kiwi.models.linear.sparse_vector import SparseVector from .utils import nearly_eq_tol logger = logging.getLogger(__name__) class LinearTrainer(object): def __init__( self, classifier, checkpointer, algorithm='svm_mira', regularization_constant=1e12, ): self.classifier = classifier self.algorithm = algorithm self.regularization_constant = regularization_constant self.checkpointer = checkpointer # Only for training with SGD. self.initial_learning_rate = 0.001 # Only for training with SGD. Change to 'inv' for Pegasos-style # updating. self.learning_rate_schedule = 'invsqrt' # Best metric value (to pick the best iteration). self.best_metric_value = -np.inf def _make_gradient_step( self, parts, features, eta, t, gold_output, predicted_output ): """Perform a gradient step updating the current model.""" for r in range(len(parts)): if predicted_output[r] == gold_output[r]: continue if self.classifier.use_binary_features: part_features = features[r].to_sparse_vector() else: part_features = features[r] self.classifier.model.make_gradient_step( part_features, eta, t, predicted_output[r] - gold_output[r] ) def _make_feature_difference( self, parts, features, gold_output, predicted_output ): """Compute the difference between predicted and gold feature vector.""" difference = SparseVector() for r in range(len(parts)): if predicted_output[r] == gold_output[r]: continue if self.classifier.use_binary_features: part_features = features[r].to_sparse_vector() else: part_features = features[r] # FIXME: shouldn't the next line be outside the else? difference.add( part_features, predicted_output[r] - gold_output[r] ) return difference def run(self, train_iterator, valid_iterator, epochs=50): """Train with a general online algorithm.""" import time dataset = self.classifier.create_instances(train_iterator.dataset) if not isinstance(valid_iterator, list): valid_iterator = [valid_iterator] dev_datasets = [ self.classifier.create_instances(iterator.dataset) for iterator in valid_iterator ] self.classifier.model.clear() for epoch in range(epochs): tic = time.time() logger.info('Epoch %d' % (epoch + 1)) self._train_epoch(epoch, dataset, dev_datasets) toc = time.time() logger.info('Elapsed time (epoch): %d' % (toc - tic)) if self.algorithm != 'svm_sgd': self.classifier.model.finalize(len(train_iterator.dataset) * epochs) self.checkpointer.check_out() def _train_epoch(self, epoch, dataset, dev_datasets): """Run one epoch of an online algorithm.""" algorithm = self.algorithm total_loss = 0.0 total_cost = 0.0 if algorithm in ['perceptron']: num_mistakes = 0 num_total = 0 elif algorithm in ['mira', 'svm_mira']: truncated = 0 lambda_coefficient = 1.0 / ( self.regularization_constant * float(len(dataset)) ) t = len(dataset) * epoch for instance in dataset: # Compute parts, features, and scores. parts, gold_output = self.classifier.make_parts(instance) features = self.classifier.make_features(instance, parts) scores = self.classifier.compute_scores(instance, parts, features) # Do the decoding. 
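#
# (Notes on the branches below, matching the update rules used by this
# trainer: 'perceptron' decodes with the plain decoder and counts
# per-part mistakes; 'mira' calls decode_mira with old_mira=True
# (classical MIRA); 'svm_mira' and 'svm_sgd' use cost-augmented
# decoding, which returns the prediction together with its cost (task
# loss) and the structured hinge loss
#     loss = cost + w . (f(predicted) - f(gold)).
# Further down, the passive-aggressive stepsize for 'mira'/'svm_mira'
# is eta = min(C, loss / ||f(predicted) - f(gold)||^2), where C is the
# regularization constant; 'svm_sgd' instead follows the configured
# learning-rate schedule.)
#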
if algorithm in ['perceptron']: predicted_output = self.classifier.decoder.decode( instance, parts, scores ) for r in range(len(parts)): num_total += 1 if not nearly_eq_tol( gold_output[r], predicted_output[r], 1e-6 ): num_mistakes += 1 elif algorithm in ['mira']: predicted_output, cost, loss = self.classifier.decoder.decode_mira( # NOQA instance, parts, scores, gold_output, True ) elif algorithm in ['svm_mira', 'svm_sgd']: predicted_output, cost, loss = self.classifier.decoder.decode_cost_augmented( # NOQA instance, parts, scores, gold_output ) else: raise NotImplementedError # Update the total loss and cost. if algorithm in ['mira', 'svm_mira', 'svm_sgd']: if loss < 0.0: if loss < -1e-12: logger.warning('Negative loss: ' + str(loss)) loss = 0.0 if cost < 0.0: if cost < -1e-12: logger.warning('Negative cost:' + str(cost)) cost = 0.0 total_loss += loss total_cost += cost num_parts = len(parts) assert len(gold_output) == num_parts assert len(predicted_output) == num_parts # Compute the stepsize. if algorithm in ['perceptron']: eta = 1.0 elif algorithm in ['mira', 'svm_mira']: difference = self._make_feature_difference( parts, features, gold_output, predicted_output ) squared_norm = difference.squared_norm() threshold = 1e-9 if loss < threshold or squared_norm < threshold: eta = 0.0 else: eta = loss / squared_norm if eta > self.regularization_constant: eta = self.regularization_constant truncated += 1 elif algorithm in ['svm_sgd']: if self.learning_rate_schedule == 'invsqrt': eta = self.initial_learning_rate / np.sqrt(float(t + 1)) elif self.learning_rate_schedule == 'inv': eta = self.initial_learning_rate / (float(t + 1)) else: raise NotImplementedError # Scale the weight vector. decay = 1.0 - eta * lambda_coefficient assert decay >= -1e-12 self.classifier.model.weights.scale(decay) # Make gradient step. self._make_gradient_step( parts, features, eta, t, gold_output, predicted_output ) # Increment the round. t += 1 # Evaluate on development data. weights = self.classifier.model.weights.copy() averaged_weights = self.classifier.model.averaged_weights.copy() if algorithm != 'svm_sgd': self.classifier.model.finalize(len(dataset) * (1 + epoch)) dev_scores = [] for dev_dataset in dev_datasets: predictions = self.classifier.test(dev_dataset) dev_score = self.classifier.evaluate( dev_dataset, predictions, print_scores=True ) dev_scores.append(dev_score) if algorithm in ['perceptron']: logger.info( '\t'.join( [ 'Epoch: %d' % (epoch + 1), 'Mistakes: %d/%d (%f)' % ( num_mistakes, num_total, float(num_mistakes) / float(num_total), ), 'Dev scores: %s' % ' '.join( ["%.5g" % (100 * score) for score in dev_scores] ), ] ) ) else: sq_norm = self.classifier.model.weights.squared_norm() regularization_value = ( 0.5 * lambda_coefficient * float(len(dataset)) * weights.squared_norm() ) logger.info( '\t'.join( [ 'Epoch: %d' % (epoch + 1), 'Cost: %f' % total_cost, 'Loss: %f' % total_loss, 'Reg: %f' % regularization_value, 'Loss+Reg: %f' % (total_loss + regularization_value), 'Norm: %f' % sq_norm, 'Dev scores: %s' % ' '.join( ["%.5g" % (100 * score) for score in dev_scores] ), ] ) ) # If this is the best model so far, save it as the default model. # Assume the metric to optimize is on the first dev set, the highest # the best. 
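#
# (The weight vectors were copied above because `finalize` averages the
# weights only temporarily, for evaluating on the dev sets; the
# un-averaged vectors are restored right after the best-model check
# below so that training can resume from them.)
#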
# TODO: replace by checkpointer functionality metric_value = dev_scores[0] if metric_value > self.best_metric_value: self.best_metric_value = metric_value self.checkpointer.check_in( self, self.best_metric_value, epoch=epoch ) self.classifier.model.weights = weights self.classifier.model.averaged_weights = averaged_weights def save(self, output_directory): output_directory = Path(output_directory) output_directory.mkdir(exist_ok=True) logging.info('Saving training state to {}'.format(output_directory)) model_path = output_directory / const.MODEL_FILE self.classifier.model.save( str(model_path), feature_indices=self.classifier.feature_indices ) self.classifier.save(str(model_path)) return None PK!# _ ,kiwi/models/linear/linear_word_qe_decoder.py"""Decoder for word-level quality estimation.""" # OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see . # import numpy as np from .sequence_parts import SequenceBigramPart, SequenceUnigramPart from .structured_decoder import StructuredDecoder def logzero(): """Return log of zero.""" return -np.inf class LinearWordQEDecoder(StructuredDecoder): """A decoder for word-level quality estimation.""" def __init__( self, estimator, cost_false_positives=0.5, cost_false_negatives=0.5 ): StructuredDecoder.__init__(self) self.estimator = estimator self.cost_false_positives = cost_false_positives self.cost_false_negatives = cost_false_negatives def decode_mira( self, instance, parts, scores, gold_outputs, old_mira=False ): """Cost-augmented decoder. Allows a compromise between precision and recall. In general: p = a - (a+b)*z0 q = b*sum(z0) p'*z + q = a*sum(z) - (a+b)*z0'*z + b*sum(z0) = a*(1-z0)'*z + b*(1-z)'*z0 a => penalty for predicting 1 when it is 0 (FP) b => penalty for predicting 0 when it is 1 (FN) F1: a = 0.5, b = 0.5 recall: a = 0, b = 1""" a = self.cost_false_positives b = self.cost_false_negatives # Allow multiple bad labels. bad = [] for label in self.estimator.labels: coarse_label = self.estimator.get_coarse_label(label) if coarse_label == 'BAD': bad.append(self.estimator.labels[label]) bad = set(bad) index_parts = [ i for i in range(len(parts)) if isinstance(parts[i], SequenceUnigramPart) and parts[i].label in bad ] p = np.zeros(len(parts)) p[index_parts] = a - (a + b) * gold_outputs[index_parts] q = b * np.ones(len(gold_outputs[index_parts])).dot( gold_outputs[index_parts] ) if old_mira: predicted_outputs = self.decode(instance, parts, scores) else: scores_cost = scores + p predicted_outputs = self.decode(instance, parts, scores_cost) cost = p.dot(predicted_outputs) + q loss = cost + scores.dot(predicted_outputs - gold_outputs) return predicted_outputs, cost, loss def decode(self, instance, parts, scores): """Decoder. 
Return the most likely sequence of OK/BAD labels.""" if self.estimator.use_bigrams: return self.decode_with_bigrams(instance, parts, scores) else: return self.decode_with_unigrams(instance, parts, scores) def decode_with_unigrams(self, instance, parts, scores): """Decoder for a non-sequential model (unigrams only).""" predicted_output = np.zeros(len(scores)) parts_by_index = [[] for _ in range(instance.num_words())] for r, part in enumerate(parts): parts_by_index[part.index].append(r) for i in range(instance.num_words()): num_labels = len(parts_by_index[i]) label_scores = np.zeros(num_labels) predicted_for_word = [0] * num_labels for k, r in enumerate(parts_by_index[i]): label_scores[k] = scores[r] best = np.argmax(label_scores) predicted_for_word[best] = 1.0 r = parts_by_index[i][best] predicted_output[r] = 1.0 return predicted_output def decode_with_bigrams(self, instance, parts, scores): """Decoder for a sequential model (with bigrams).""" num_labels = len(self.estimator.labels) num_words = instance.num_words() initial_scores = np.zeros(num_labels) transition_scores = np.zeros((num_words - 1, num_labels, num_labels)) final_scores = np.zeros(num_labels) emission_scores = np.zeros((num_words, num_labels)) indexed_unigram_parts = [{} for _ in range(num_words)] indexed_bigram_parts = [{} for _ in range(num_words + 1)] for r, part in enumerate(parts): if isinstance(part, SequenceUnigramPart): indexed_unigram_parts[part.index][part.label] = r emission_scores[part.index, part.label] = scores[r] elif isinstance(part, SequenceBigramPart): indexed_bigram_parts[part.index][ (part.label, part.previous_label) ] = r if part.previous_label < 0: assert part.index == 0 initial_scores[part.label] = scores[r] elif part.label < 0: assert part.index == num_words final_scores[part.previous_label] = scores[r] else: transition_scores[ part.index - 1, part.label, part.previous_label ] = scores[r] else: raise NotImplementedError best_path, _ = self.run_viterbi( initial_scores, transition_scores, final_scores, emission_scores ) predicted_output = np.zeros(len(scores)) previous_label = -1 for i, label in enumerate(best_path): r = indexed_unigram_parts[i][label] predicted_output[r] = 1.0 r = indexed_bigram_parts[i][(label, previous_label)] predicted_output[r] = 1.0 previous_label = label r = indexed_bigram_parts[num_words][(-1, previous_label)] predicted_output[r] = 1.0 return predicted_output def run_viterbi( self, initial_scores, transition_scores, final_scores, emission_scores ): """Computes the viterbi trellis for a given sequence. Receives: - Initial scores: (num_states) array - Transition scores: (length-1, num_states, num_states) array - Final scores: (num_states) array - Emission scores: (length, num_states) array.""" length = np.size(emission_scores, 0) # Length of the sequence. num_states = np.size(initial_scores) # Number of states. # Variables storing the Viterbi scores. viterbi_scores = np.zeros([length, num_states]) + logzero() # Variables storing the paths to backtrack. viterbi_paths = -np.ones([length, num_states], dtype=int) # Most likely sequence. best_path = -np.ones(length, dtype=int) # Initialization. viterbi_scores[0, :] = emission_scores[0, :] + initial_scores # Viterbi loop. 
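#
# (Recurrence computed below, in log space, for pos = 1..length-1:
#     viterbi_scores[pos, s] = emission_scores[pos, s]
#         + max over s' of (viterbi_scores[pos - 1, s']
#                           + transition_scores[pos - 1, s, s'])
# with the maximizing s' stored in viterbi_paths[pos, s] for
# backtracking. Note that transition_scores is indexed as
# [position - 1, current_state, previous_state].)
#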
for pos in range(1, length): for current_state in range(num_states): viterbi_scores[pos, current_state] = np.max( viterbi_scores[pos - 1, :] + transition_scores[pos - 1, current_state, :] ) viterbi_scores[pos, current_state] += emission_scores[ pos, current_state ] viterbi_paths[pos, current_state] = np.argmax( viterbi_scores[pos - 1, :] + transition_scores[pos - 1, current_state, :] ) # Termination. assert len(viterbi_scores[length - 1, :] + final_scores) best_score = np.max(viterbi_scores[length - 1, :] + final_scores) best_path[length - 1] = np.argmax( viterbi_scores[length - 1, :] + final_scores ) # Backtrack. for pos in range(length - 2, -1, -1): best_path[pos] = viterbi_paths[pos + 1, best_path[pos + 1]] return best_path, best_score PK!hb UU-kiwi/models/linear/linear_word_qe_features.py"""A class for handling features for word-level quality estimation.""" # OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see . # import numpy as np from kiwi.models.linear.linear_word_qe_sentence import LinearWordQESentence from .sparse_feature_vector import SparseFeatureVector def quantize(value, bins_down): """Quantize a numeric feature into bins. Example: bins = [50, 40, 30, 25, 20, 18, 16, 14, 12, 10].""" bin_up = np.inf for bin_down in bins_down: if bin_down < value <= bin_up: bin_value = bin_down return bin_value bin_up = bin_down return value class LinearWordQEFeatures(SparseFeatureVector): """This class implements a feature vector for word-level quality estimation.""" def __init__( self, use_basic_features_only=True, use_simple_bigram_features=True, use_parse_features=False, use_stacked_features=False, save_to_cache=False, load_from_cache=False, cached_features_file=None, ): SparseFeatureVector.__init__( self, save_to_cache, load_from_cache, cached_features_file ) self.use_basic_features_only = use_basic_features_only # True for using only a single bigram indicator feature. 
self.use_simple_bigram_features = use_simple_bigram_features self.use_parse_features = use_parse_features self.use_stacked_features = use_stacked_features self.use_client_features = False def get_siblings(self, sentence_word_features, index): if index < 0 or index >= len(sentence_word_features): info = None else: info = sentence_word_features[index] if info is not None: siblings = [ k for k in range(len(sentence_word_features)) if sentence_word_features[k].target_head == info.target_head ] left_siblings = [k for k in siblings if k < index] right_siblings = [k for k in siblings if k > index] if len(left_siblings) > 0: left_sibling = max(left_siblings) else: left_sibling = -1 if len(right_siblings) > 0: right_sibling = min(right_siblings) else: right_sibling = -1 else: left_sibling = -2 right_sibling = -2 if left_sibling >= 0: left_sibling_info = sentence_word_features[left_sibling] left_sibling_token = left_sibling_info.token left_sibling_pos = left_sibling_info.target_pos elif left_sibling == -1: left_sibling_token = '__ROOT__' left_sibling_pos = '__ROOT__' else: left_sibling_info = None left_sibling_token = '__START__' left_sibling_pos = '__START__' if right_sibling >= 0: right_sibling_info = sentence_word_features[right_sibling] right_sibling_token = right_sibling_info.token right_sibling_pos = right_sibling_info.target_pos elif right_sibling == -1: right_sibling_info = None right_sibling_token = '__ROOT__' right_sibling_pos = '__ROOT__' else: right_sibling_info = None right_sibling_token = '__START__' right_sibling_pos = '__START__' return ( left_sibling_token, left_sibling_pos, right_sibling_token, right_sibling_pos, ) def get_head(self, sentence_word_features, index): if index < 0 or index >= len(sentence_word_features): info = None else: info = sentence_word_features[index] if info is not None: head_index = info.target_head - 1 else: head_index = -2 if head_index >= 0: head_info = sentence_word_features[head_index] head_token = head_info.token head_pos = head_info.target_pos head_morph = head_info.target_morph elif head_index == -1: head_info = None head_token = '__ROOT__' head_pos = '__ROOT__' head_morph = '__ROOT__' else: head_info = None head_token = '__START__' head_pos = '__START__' head_morph = '__START__' return head_index, head_token, head_pos, head_morph def compute_unigram_features(self, sentence_word_features, part): """Compute unigram features (depending only on a single label).""" if self.load_from_cache: self.load_cached_features() return index = part.index ignore_source = False only_basic_features = self.use_basic_features_only use_client_features = self.use_client_features use_parse_features = self.use_parse_features use_stacked_features = self.use_stacked_features use_bias = True use_language_model = True use_binary_features = False if use_parse_features: use_split_morphs = False use_morph_features = False use_deprel_features = True use_head_features = True use_grandparent_features = True use_sibling_features = True else: use_split_morphs = False use_morph_features = False use_deprel_features = False use_head_features = False use_grandparent_features = False use_sibling_features = False use_unuseful_shared_task_features = False info = sentence_word_features[index] if use_client_features: labels = [str(part.label), info.client_name + '_' + str(part.label)] else: labels = [str(part.label)] for label in labels: if use_bias: self.add_binary_feature('BIAS_%s' % label) if use_unuseful_shared_task_features: self.add_binary_feature( 'F0=%d_%s' % ( 
quantize(info.source_token_count, [40, 30, 20, 10]), label, ) ) self.add_binary_feature( 'F1=%d_%s' % ( quantize(info.target_token_count, [40, 30, 20, 10]), label, ) ) self.add_binary_feature( 'F2=%f_%s' % ( quantize( info.source_target_token_count_ratio, [5.0, 2.0] ), label, ) ) self.add_binary_feature('F3=%s_%s' % (info.token, label)) self.add_binary_feature('F4=%s_%s' % (info.left_context, label)) self.add_binary_feature('F5=%s_%s' % (info.right_context, label)) if not ignore_source: self.add_binary_feature( 'F6=%s_%s' % (info.first_aligned_token, label) ) self.add_binary_feature( 'F7=%s_%s' % (info.left_alignment, label) ) self.add_binary_feature( 'F8=%s_%s' % (info.right_alignment, label) ) if use_binary_features and not only_basic_features: # Ablated for German WMT16 (the provided stoplist is wrong). # self.add_binary_feature( # 'F9=%d_%s' % (int(info.is_stopword), label)) self.add_binary_feature( 'F10=%d_%s' % (int(info.is_punctuation), label) ) # Ablated for German (capitalized words are nouns) # self.add_binary_feature( # 'F11=%d_%s' % (int(info.is_proper_noun), label)) self.add_binary_feature( 'F12=%d_%s' % (int(info.is_digit), label) ) if use_language_model and not only_basic_features: self.add_binary_feature( 'F13=%d_%s' % (info.highest_order_ngram_left, label) ) self.add_binary_feature( 'F14=%d_%s' % (info.highest_order_ngram_right, label) ) # if use_language_model and not only_basic_features: # self.add_binary_feature( # 'F15=%d_%s' % (info.backoff_behavior_left, label)) # self.add_binary_feature( # 'F16=%d_%s' % (info.backoff_behavior_middle, label)) # self.add_binary_feature( # 'F17=%d_%s' % (info.backoff_behavior_right, label)) if use_language_model and not only_basic_features: self.add_binary_feature( 'F18=%d_%s' % (info.source_highest_order_ngram_left, label) ) self.add_binary_feature( 'F19=%d_%s' % (info.source_highest_order_ngram_right, label) ) self.add_binary_feature( 'F20=%d_%s' % (int(info.pseudo_reference), label) ) if not only_basic_features: self.add_binary_feature('F21=%s_%s' % (info.target_pos, label)) self.add_binary_feature( 'F22=%s_%s' % (info.aligned_source_pos_list, label) ) if use_unuseful_shared_task_features: self.add_binary_feature( 'F23=%d_%s' % (info.polysemy_count_source, label) ) self.add_binary_feature( 'F24=%d_%s' % (info.polysemy_count_target, label) ) # QUETCH linear model conjoined features. self.add_binary_feature( 'G0=%s_%s_%s' % (info.token, info.left_context, label) ) self.add_binary_feature( 'G1=%s_%s_%s' % (info.token, info.right_context, label) ) if not ignore_source: self.add_binary_feature( 'G2=%s_%s_%s' % (info.token, info.first_aligned_token, label) ) if not only_basic_features: self.add_binary_feature( 'G3=%s_%s_%s' % (info.target_pos, info.aligned_source_pos_list, label) ) # Parse features. 
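#
# (Template naming in this method: F* are basic unigram templates over
# the token, its context and its alignment; G* are QUETCH-style
# conjoined templates; and the H* templates in the block below are
# built from the dependency parse (deprel, head, grandparent and
# sibling conjunctions). Every template string ends in the candidate
# label, so the model effectively learns one weight per
# (pattern, label) pair.)
#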
if use_parse_features: head_index, head_token, head_pos, head_morph = self.get_head( sentence_word_features, index ) head_on_left = True # (head_index <= index) if head_index >= 0: _, grandparent_token, grandparent_pos, _ = self.get_head( sentence_word_features, head_index ) else: grandparent_token, grandparent_pos = head_token, head_pos grandparent_on_left = True # (grandparent_index <= index) left_sibling_token, left_sibling_pos, right_sibling_token, right_sibling_pos = self.get_siblings( # NOQA sentence_word_features, index ) if use_deprel_features: self.add_binary_feature( 'H0=%s_%s' % (info.target_deprel, label) ) self.add_binary_feature( 'H1=%s_%s_%s' % (info.token, info.target_deprel, label) ) if use_head_features: # self.add_binary_feature( # 'H2=%s_%s_%s' % (info.target_pos, head_pos, label)) # self.add_binary_feature( # 'H3=%s_%s_%s' % (info.token, head_token, label)) self.add_binary_feature( 'H2=%s_%s_%d_%s' % (info.target_pos, head_pos, int(head_on_left), label) ) self.add_binary_feature( 'H3=%s_%s_%d_%s' % (info.token, head_token, int(head_on_left), label) ) self.add_binary_feature( 'H3a=%s_%s_%d_%s' % (info.token, head_pos, int(head_on_left), label) ) self.add_binary_feature( 'H3b=%s_%s_%d_%s' % ( info.target_pos, head_token, int(head_on_left), label, ) ) if use_morph_features: self.add_binary_feature( 'H4=%s_%s' % (info.target_morph, label) ) self.add_binary_feature( 'H5=%s_%s_%s' % (info.target_morph, head_morph, label) ) if use_split_morphs: all_morphs = info.target_morph.split('|') all_head_morphs = head_morph.split('|') for m in all_morphs: self.add_binary_feature('H6=%s_%s' % (m, label)) for hm in all_head_morphs: self.add_binary_feature( 'H7=%s_%s_%s' % (m, hm, label) ) if use_sibling_features: self.add_binary_feature( 'H8=%s_%s_%s' % (info.target_pos, left_sibling_pos, label) ) self.add_binary_feature( 'H9=%s_%s_%s' % (info.token, left_sibling_token, label) ) self.add_binary_feature( 'H10=%s_%s_%s' % (info.target_pos, right_sibling_pos, label) ) self.add_binary_feature( 'H11=%s_%s_%s' % (info.token, right_sibling_token, label) ) if use_grandparent_features: self.add_binary_feature( 'H12=%s_%s_%d_%s' % ( info.target_pos, grandparent_pos, int(grandparent_on_left), label, ) ) self.add_binary_feature( 'H13=%s_%s_%d_%s' % ( info.token, grandparent_token, int(grandparent_on_left), label, ) ) self.add_binary_feature( 'H14=%s_%s_%s_%d_%s' % ( info.target_pos, head_pos, grandparent_pos, int(grandparent_on_left), label, ) ) self.add_binary_feature( 'H15=%s_%s_%s_%d_%s' % ( info.token, head_pos, grandparent_token, int(grandparent_on_left), label, ) ) self.add_binary_feature( 'H16=%s_%s_%s_%d_%s' % ( info.token, head_token, grandparent_pos, int(grandparent_on_left), label, ) ) self.add_binary_feature( 'H17=%s_%s_%s_%d_%s' % ( info.target_pos, head_token, grandparent_token, int(grandparent_on_left), label, ) ) self.add_binary_feature( 'H18=%s_%s_%s_%d_%s' % ( info.target_pos, head_pos, grandparent_token, int(grandparent_on_left), label, ) ) self.add_binary_feature( 'H19=%s_%s_%s_%d_%s' % ( info.target_pos, head_token, grandparent_pos, int(grandparent_on_left), label, ) ) self.add_binary_feature( 'H20=%s_%s_%s_%d_%s' % ( info.token, head_pos, grandparent_pos, int(grandparent_on_left), label, ) ) if use_stacked_features: if len(info.stacked_features) > 0: for i, value in enumerate(info.stacked_features): self.add_numeric_feature('S%d_%s' % (i, label), value) if self.save_to_cache: self.save_cached_features() return def compute_bigram_features(self, sentence_word_features, part): 
"""Compute bigram features (that depend on consecutive labels).""" if self.load_from_cache: self.load_cached_features() return index = part.index label = part.label previous_label = part.previous_label ignore_source = False only_basic_features = self.use_basic_features_only use_client_features = self.use_client_features use_parse_features = self.use_parse_features use_stacked_features = self.use_stacked_features # False use_bias = True # True for using only a single bigram indicator feature. use_only_bias = self.use_simple_bigram_features use_language_model = True use_binary_features = False use_trigram_features = True if use_parse_features: use_split_morphs = False use_morph_features = False use_deprel_features = True use_head_features = False use_sibling_features = False else: use_split_morphs = False use_morph_features = False use_deprel_features = False use_head_features = False use_sibling_features = False if index < len(sentence_word_features): info = sentence_word_features[index] else: info = LinearWordQESentence.create_stop_symbol() if index > 0: info_previous = sentence_word_features[index - 1] else: info_previous = LinearWordQESentence.create_stop_symbol() bigram_label = str(previous_label) + '_' + str(label) if use_client_features: labels = [bigram_label, info.client_name + '_' + bigram_label] else: labels = [bigram_label] for label in labels: if use_bias: self.add_binary_feature('B1=%s' % label) if use_only_bias: continue self.add_binary_feature('B2=%s_%s' % (info.token, label)) self.add_binary_feature('B3=%s_%s' % (info_previous.token, label)) self.add_binary_feature('B4=%s_%s' % (info.right_context, label)) self.add_binary_feature( 'B5=%s_%s' % (info_previous.left_context, label) ) if not ignore_source: self.add_binary_feature( 'B6=%s_%s' % (info.first_aligned_token, label) ) self.add_binary_feature( 'B7=%s_%s' % (info.left_alignment, label) ) self.add_binary_feature( 'B8=%s_%s' % (info.right_alignment, label) ) self.add_binary_feature( 'B9=%s_%s' % (info_previous.first_aligned_token, label) ) self.add_binary_feature( 'B10=%s_%s' % (info_previous.left_alignment, label) ) self.add_binary_feature( 'B11=%s_%s' % (info_previous.right_alignment, label) ) if use_binary_features and not only_basic_features: # Ablated for German WMT16 (the provided stoplist is wrong). 
# self.add_binary_feature( # 'B12=%d_%s' % (int(info.is_stopword), label)) # self.add_binary_feature( # 'B13=%d_%s' % (int(info_previous.is_stopword), label)) self.add_binary_feature( 'B14=%d_%s' % (int(info.is_punctuation), label) ) self.add_binary_feature( 'B15=%d_%s' % (int(info_previous.is_punctuation), label) ) # Ablated for German (capitalized words are nouns) # self.add_binary_feature( # 'B16=%d_%s' % (int(info.is_proper_noun), label)) # self.add_binary_feature( # 'B17=%d_%s' % (int(info_previous.is_proper_noun), label)) self.add_binary_feature( 'B18=%d_%s' % (int(info.is_digit), label) ) self.add_binary_feature( 'B19=%d_%s' % (int(info_previous.is_digit), label) ) if use_language_model and not only_basic_features: self.add_binary_feature( 'B20=%d_%s' % (info.highest_order_ngram_left, label) ) self.add_binary_feature( 'B21=%d_%s' % (info.highest_order_ngram_right, label) ) self.add_binary_feature( 'B22=%d_%s' % (info_previous.highest_order_ngram_left, label) ) self.add_binary_feature( 'B23=%d_%s' % (info_previous.highest_order_ngram_right, label) ) # if use_language_model and not only_basic_features: # self.add_binary_feature( # 'B24=%d_%s' % (info.backoff_behavior_left, label)) # self.add_binary_feature( # 'B25=%d_%s' % (info.backoff_behavior_middle, label)) # self.add_binary_feature( # 'B26=%d_%s' % (info.backoff_behavior_right, label)) # self.add_binary_feature( # 'B27=%d_%s' % (info_previous.backoff_behavior_left, # label)) # self.add_binary_feature( # 'B28=%d_%s' % (info_previous.backoff_behavior_middle, # label)) # self.add_binary_feature( # 'B29=%d_%s' % (info_previous.backoff_behavior_right, # label)) if use_language_model and not only_basic_features: self.add_binary_feature( 'B30=%d_%s' % (info.source_highest_order_ngram_left, label) ) self.add_binary_feature( 'B31=%d_%s' % (info.source_highest_order_ngram_right, label) ) self.add_binary_feature( 'B33=%d_%s' % (info_previous.source_highest_order_ngram_left, label) ) self.add_binary_feature( 'B34=%d_%s' % (info_previous.source_highest_order_ngram_right, label) ) if not only_basic_features: self.add_binary_feature('B35=%s_%s' % (info.target_pos, label)) self.add_binary_feature( 'B36=%s_%s' % (info.aligned_source_pos_list, label) ) self.add_binary_feature( 'B37=%s_%s' % (info_previous.target_pos, label) ) self.add_binary_feature( 'B38=%s_%s' % (info_previous.aligned_source_pos_list, label) ) # Conjoined features. self.add_binary_feature( 'C0=%s_%s_%s' % (info.token, info.left_context, label) ) self.add_binary_feature( 'C1=%s_%s_%s' % (info.token, info.right_context, label) ) self.add_binary_feature( 'C2=%s_%s_%s' % (info_previous.token, info_previous.left_context, label) ) self.add_binary_feature( 'C3=%s_%s_%s' % (info_previous.token, info_previous.right_context, label) ) if use_trigram_features: self.add_binary_feature( 'D1=%s_%s_%s_%s' % ( info_previous.left_context, info_previous.token, info.token, label, ) ) self.add_binary_feature( 'D2=%s_%s_%s_%s' % ( info_previous.token, info.token, info.right_context, label, ) ) if not ignore_source: self.add_binary_feature( 'C4=%s_%s_%s' % (info.token, info.first_aligned_token, label) ) self.add_binary_feature( 'C5=%s_%s_%s' % ( info_previous.token, info_previous.first_aligned_token, label, ) ) if not only_basic_features: self.add_binary_feature( 'C6=%s_%s_%s' % (info.target_pos, info.aligned_source_pos_list, label) ) self.add_binary_feature( 'C7=%s_%s_%s' % ( info_previous.target_pos, info_previous.aligned_source_pos_list, label, ) ) # Parse features. 
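#
# (As in the unigram case, each bigram template is conjoined with the
# label pair "previous_label_label": B* are basic bigram templates,
# C* conjoin token context, D1/D2 above are trigram windows, and the
# parse-based block below uses D* templates for the previous word and
# E* templates for the current word. When use_simple_bigram_features
# is set, only the single indicator B1 is emitted per label pair,
# which reduces the bigram part of the model to plain label-transition
# scores while still enabling Viterbi decoding.)
#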
if use_parse_features: head_index = info.target_head - 1 previous_head_index = info_previous.target_head - 1 if head_index >= 0: head_info = sentence_word_features[head_index] head_token = head_info.token head_pos = head_info.target_pos head_morph = head_info.target_morph elif head_index == -1: head_info = None head_token = '__ROOT__' head_pos = '__ROOT__' head_morph = '__ROOT__' else: head_info = None head_token = '__START__' head_pos = '__START__' head_morph = '__START__' if previous_head_index >= 0: previous_head_info = sentence_word_features[ previous_head_index ] previous_head_token = previous_head_info.token previous_head_pos = previous_head_info.target_pos previous_head_morph = previous_head_info.target_morph elif previous_head_index == -1: previous_head_info = None previous_head_token = '__ROOT__' previous_head_pos = '__ROOT__' previous_head_morph = '__ROOT__' else: previous_head_info = None previous_head_token = '__START__' previous_head_pos = '__START__' previous_head_morph = '__START__' left_sibling_token, left_sibling_pos, right_sibling_token, right_sibling_pos = self.get_siblings( # NOQA sentence_word_features, index ) previous_left_sibling_token, previous_left_sibling_pos, previous_right_sibling_token, previous_right_sibling_pos = self.get_siblings( # NOQA sentence_word_features, index - 1 ) if use_deprel_features: self.add_binary_feature( 'D0=%s_%s' % (info_previous.target_deprel, label) ) self.add_binary_feature( 'D1=%s_%s_%s' % ( info_previous.token, info_previous.target_deprel, label, ) ) if use_head_features: self.add_binary_feature( 'D2=%s_%s_%s' % (info_previous.target_pos, previous_head_pos, label) ) self.add_binary_feature( 'D3=%s_%s_%s' % (info_previous.token, previous_head_token, label) ) if use_morph_features: self.add_binary_feature( 'D4=%s_%s' % (info_previous.target_morph, label) ) self.add_binary_feature( 'D5=%s_%s_%s' % ( info_previous.target_morph, previous_head_morph, label, ) ) if use_split_morphs: all_morphs = info_previous.target_morph.split('|') all_head_morphs = previous_head_morph.split('|') for m in all_morphs: self.add_binary_feature('D6=%s_%s' % (m, label)) for hm in all_head_morphs: self.add_binary_feature( 'D7=%s_%s_%s' % (m, hm, label) ) if use_sibling_features: self.add_binary_feature( 'D8=%s_%s_%s' % ( info_previous.target_pos, previous_left_sibling_pos, label, ) ) self.add_binary_feature( 'D9=%s_%s_%s' % ( info_previous.token, previous_left_sibling_token, label, ) ) self.add_binary_feature( 'D10=%s_%s_%s' % ( info_previous.target_pos, previous_right_sibling_pos, label, ) ) self.add_binary_feature( 'D11=%s_%s_%s' % ( info_previous.token, previous_right_sibling_token, label, ) ) if use_deprel_features: self.add_binary_feature( 'E0=%s_%s' % (info.target_deprel, label) ) self.add_binary_feature( 'E1=%s_%s_%s' % (info.token, info.target_deprel, label) ) if use_head_features: self.add_binary_feature( 'E2=%s_%s_%s' % (info.target_pos, head_pos, label) ) self.add_binary_feature( 'E3=%s_%s_%s' % (info.token, head_token, label) ) if use_morph_features: self.add_binary_feature( 'E4=%s_%s' % (info.target_morph, label) ) self.add_binary_feature( 'E5=%s_%s_%s' % (info.target_morph, head_morph, label) ) if use_split_morphs: all_morphs = info.target_morph.split('|') all_head_morphs = head_morph.split('|') for m in all_morphs: self.add_binary_feature('E6=%s_%s' % (m, label)) for hm in all_head_morphs: self.add_binary_feature( 'E7=%s_%s_%s' % (m, hm, label) ) if use_sibling_features: self.add_binary_feature( 'E8=%s_%s_%s' % (info.target_pos, left_sibling_pos, 
label) ) self.add_binary_feature( 'E9=%s_%s_%s' % (info.token, left_sibling_token, label) ) self.add_binary_feature( 'E10=%s_%s_%s' % (info.target_pos, right_sibling_pos, label) ) self.add_binary_feature( 'E11=%s_%s_%s' % (info.token, right_sibling_token, label) ) if use_stacked_features: if len(info.stacked_features) > 0: for i, value in enumerate(info.stacked_features): self.add_numeric_feature('Z%d_%s' % (i, label), value) if len(info_previous.stacked_features) > 0: for i, value in enumerate(info_previous.stacked_features): self.add_numeric_feature('ZZ%d_%s' % (i, label), value) if self.save_to_cache: self.save_cached_features() return PK!Y=&&-kiwi/models/linear/linear_word_qe_sentence.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see . # import string class LinearWordQETokenFeatures(object): def __init__( self, stacked_features=None, source_token_count=-1, target_token_count=-1, source_target_token_count_ratio=0.0, token='', left_context='', right_context='', first_aligned_token='', left_alignment='', right_alignment='', is_stopword=False, is_punctuation=False, is_proper_noun=False, is_digit=False, highest_order_ngram_left=-1, highest_order_ngram_right=-1, backoff_behavior_left=0.0, backoff_behavior_middle=0.0, backoff_behavior_right=0.0, source_highest_order_ngram_left=-1, source_highest_order_ngram_right=-1, pseudo_reference=False, target_pos='', target_morph='', target_head=-1, target_deprel='', aligned_source_pos_list='', polysemy_count_source=0, polysemy_count_target=0, ): self.stacked_features = ( stacked_features if stacked_features is not None else [] ) self.source_token_count = source_token_count # Not used. self.target_token_count = target_token_count # Not used. # Not used. self.source_target_token_count_ratio = source_target_token_count_ratio self.token = token self.left_context = left_context self.right_context = right_context self.first_aligned_token = first_aligned_token self.left_alignment = left_alignment self.right_alignment = right_alignment self.is_stopword = is_stopword # Not used (at least for En-De). self.is_punctuation = is_punctuation self.is_proper_noun = is_proper_noun # Not used (at least for En-De). self.is_digit = is_digit self.highest_order_ngram_left = highest_order_ngram_left self.highest_order_ngram_right = highest_order_ngram_right self.backoff_behavior_left = backoff_behavior_left # Not used. self.backoff_behavior_middle = backoff_behavior_middle # Not used. self.backoff_behavior_right = backoff_behavior_right # Not used. self.source_highest_order_ngram_left = source_highest_order_ngram_left self.source_highest_order_ngram_right = source_highest_order_ngram_right self.pseudo_reference = pseudo_reference # Not used in the WMT16+ data. self.target_pos = target_pos self.target_morph = target_morph # Not used. 
self.target_head = target_head self.target_deprel = target_deprel self.aligned_source_pos_list = aligned_source_pos_list self.polysemy_count_source = polysemy_count_source # Not used. self.polysemy_count_target = polysemy_count_target # Not used. class LinearWordQESentence: """Represents a sentence (word features and their labels).""" @staticmethod def create_stop_symbol(): """Generates dummy features for a stop symbol.""" return LinearWordQETokenFeatures( token='__STOP__', left_context='__STOP__', right_context='__STOP__', first_aligned_token='__STOP__', left_alignment='__STOP__', right_alignment='__STOP__', target_pos='__STOP__', aligned_source_pos_list='__STOP__', target_morph='__STOP__', target_deprel='__STOP__', ) def __init__(self): self.sentence_word_features = [] self.sentence_word_labels = [] def num_words(self): """Returns the number of words of the sentence.""" return len(self.sentence_word_features) def create_from_sentence_pair( self, source_words, target_words, alignments, source_pos_tags=None, target_pos_tags=None, target_parse_heads=None, target_parse_relations=None, target_ngram_left=None, target_ngram_right=None, target_stacked_features=None, labels=None, ): """Creates an instance from source/target token and alignment information.""" self.sentence_word_features = [] aligned_source_words = [[] for _ in target_words] for source, target in alignments: aligned_source_words[target].append(source) aligned_source_words = [ sorted(aligned) for aligned in aligned_source_words ] if source_pos_tags is None: source_pos_tags = ['' for _ in source_words] if target_pos_tags is None: target_pos_tags = ['' for _ in target_words] if target_parse_heads is None: target_parse_heads = [-1 for _ in target_words] if target_parse_relations is None: target_parse_relations = ['' for _ in target_words] if target_ngram_left is None: target_ngram_left = [-1 for _ in target_words] if target_ngram_right is None: target_ngram_right = [-1 for _ in target_words] if target_stacked_features is None: target_stacked_features = ['' for _ in target_words] if labels is not None: if len(labels) != len(target_words): # WMT18 format with labels for the gaps. assert len(labels) == 2 * len(target_words) + 1 labels = labels[1::2] for i in range(len(target_words)): word = target_words[i] tag = target_pos_tags[i] parse_head = int(target_parse_heads[i]) # TODO: don't cast here. parse_relation = target_parse_relations[i] ngram_left = int(target_ngram_left[i]) # TODO: don't cast here. ngram_right = int(target_ngram_right[i]) # TODO: don't cast here. if not target_stacked_features[i]: stacked_features = None else: stacked_features = [ float(p) for p in target_stacked_features[i].split('|') ] if i == 0: previous_word = '' # previous_tag = '' else: previous_word = target_words[i - 1] # previous_tag = target_pos_tags[i-1] if i == len(target_words) - 1: next_word = '' # next_tag = '' else: next_word = target_words[i + 1] # next_tag = target_pos_tags[i+1] if len(aligned_source_words[i]) == 0: source_word = '__unaligned__' previous_source_word = '__unaligned__' next_source_word = '__unaligned__' source_tag = '__unaligned__' # previous_source_tag = '__unaligned__' # next_source_tag = '__unaligned__' else: # Concatenate all source words in order of appearance. # The previous word is the one before the first aligned # source word; the next word is the one after the last # aligned word. 
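#
# (Worked example with hypothetical tokens: source_words = ['a', 'b',
# 'c'] and a one-word target aligned via alignments = [(0, 0), (2, 0)]
# give aligned_source_words == [[0, 2]], so first_aligned_token becomes
# 'a|c'. Both alignment context words are then taken around the first
# aligned index j = 0: left_alignment is '' (sentence-initial) and
# right_alignment is 'b'.)
#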
source_word = '|'.join( [source_words[j] for j in aligned_source_words[i]] ) source_tag = '|'.join( [source_pos_tags[j] for j in aligned_source_words[i]] ) j = aligned_source_words[i][0] if j == 0: previous_source_word = "" # previous_source_tag = "" else: previous_source_word = source_words[j - 1] # previous_source_tag = source_pos_tags[j-1] if j == len(source_words) - 1: next_source_word = "" # next_source_tag = "" else: next_source_word = source_words[j + 1] # next_source_tag = source_pos_tags[j+1] word_features = LinearWordQETokenFeatures( stacked_features=stacked_features, source_token_count=len(source_words), target_token_count=len(target_words), source_target_token_count_ratio=float(len(source_words)) / len(target_words), token=word, is_punctuation=all([c in string.punctuation for c in word]), is_digit=word.isdigit(), target_pos=tag, left_context=previous_word, right_context=next_word, first_aligned_token=source_word, aligned_source_pos_list=source_tag, left_alignment=previous_source_word, right_alignment=next_source_word, target_head=parse_head, target_deprel=parse_relation, highest_order_ngram_left=ngram_left, highest_order_ngram_right=ngram_right, ) self.sentence_word_features.append(word_features) self.sentence_word_labels.append( labels[i] if labels is not None else '' ) PK!0:  $kiwi/models/linear/sequence_parts.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see . # class SequenceUnigramPart(object): """A part for unigrams (a single label at a word position).""" def __init__(self, index, label): self.label = label self.index = index class SequenceBigramPart(object): """A part for bigrams (two labels at consecutive words position). Necessary for the model to be sequential.""" def __init__(self, index, label, previous_label): self.label = label self.previous_label = previous_label self.index = index PK!+kiwi/models/linear/sparse_feature_vector.py"""This defines the class for defining sparse features in linear models.""" # OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see . 
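#
# (The part classes above define the search space of the sequential
# word-level model: one SequenceUnigramPart per (position, label) and
# one SequenceBigramPart per (position, label, previous_label), with
# previous_label = -1 at the sentence start and label = -1 at the stop
# position, as assumed by decode_with_bigrams. A sketch, illustrative
# only, for a two-word sentence with two labels (0 and 1):
#
#     >>> parts = [SequenceUnigramPart(i, l) for i in (0, 1) for l in (0, 1)]
#     >>> parts += [SequenceBigramPart(0, l, -1) for l in (0, 1)]
#     >>> parts += [SequenceBigramPart(1, l, p) for l in (0, 1) for p in (0, 1)]
#     >>> parts += [SequenceBigramPart(2, -1, p) for p in (0, 1)]
#
# The decoder scores every part and returns a 0/1 vector over them that
# selects one consistent path of unigrams and bigrams.)
#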
# from .sparse_vector import SparseVector class SparseFeatureVector(SparseVector): """A generic class for a sparse feature vector.""" def __init__( self, save_to_cache=False, load_from_cache=False, cached_features_file=None, ): SparseVector.__init__(self) self.cached_features_file = cached_features_file self.save_to_cache = save_to_cache self.load_from_cache = load_from_cache def add_categorical_feature(self, name, value, allow_duplicates=False): """Add a categorical feature, represented internally as a binary feature.""" fname = name + "=" + value assert allow_duplicates or fname not in self self[fname] = 1.0 def add_binary_feature(self, name): """Add a binary feature.""" if name in self: return self[name] = 1.0 def add_numeric_feature(self, name, value): """Add a numeric feature.""" self[name] = value def save_cached_features(self): """Save features to file.""" self.cached_features_file.write(str(len(self)) + '\n') for key in self: self.cached_features_file.write(key + '\t' + str(self[key]) + '\n') def load_cached_features(self): """Load features from file.""" num_features = int(next(self.cached_features_file)) for i in range(num_features): key, value = ( next(self.cached_features_file).rstrip('\n').split('\t') ) self[key] = float(value) class SparseBinaryFeatureVector(list): """A generic class for a sparse binary feature vector.""" def __init__( self, feature_indices=None, save_to_cache=False, load_from_cache=False, cached_features_file=None, ): list.__init__(self) self.feature_indices = feature_indices self.cached_features_file = cached_features_file self.save_to_cache = save_to_cache self.load_from_cache = load_from_cache def add_categorical_feature(self, name, value): """Add a categorical feature, represented internally as a binary feature.""" fname = name + "=" + value self.add_binary_feature(fname) def add_binary_feature(self, name): """Add a binary feature, registering it if unseen.""" index = self.feature_indices.get(name, -1) if index < 0: index = self.feature_indices.add(name) self.append(index) def to_sparse_vector(self): """Convert to a SparseVector.""" vector = SparseVector() for index in self: vector[index] = 1.0 return vector def save_cached_features(self): """Save features to file.""" self.cached_features_file.write( '\t'.join([str(key) for key in self]) + '\n' ) def load_cached_features(self): """Load features from file.""" self[:] = [ int(key) for key in next(self.cached_features_file).rstrip('\n').split('\t') ] kiwi/models/linear/sparse_vector.py # -*- coding: utf-8 -*- """This defines a generic class for sparse vectors.""" # OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see <https://www.gnu.org/licenses/>.
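#
# (The SparseVector defined below backs every feature and weight vector
# in this package: a dict from feature keys to float values with a few
# vector operations on top. A minimal sketch of its arithmetic,
# illustrative only:
#
#     >>> u, v = SparseVector(), SparseVector()
#     >>> u['a'] = 1.0
#     >>> v['a'] = 2.0
#     >>> v['b'] = 3.0
#     >>> u.dot_product(v)       # iterates over the shorter operand
#     2.0
#     >>> u.add(v, scalar=0.5)   # u := u + 0.5 * v
#     >>> u.squared_norm()       # (1 + 0.5*2)**2 + (0.5*3)**2
#     6.25
#
# dot_product falls back to vector.dot_product(self) when self has more
# nonzeros, keeping the cost proportional to the smaller vector.)
#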
# import math class SparseVector(dict): """Implementation of a sparse vector using a dictionary.""" def __init__(self): dict.__init__(self) def copy(self): """Returns a copy of the current vector.""" vector = SparseVector() for key in self: vector[key] = self[key] return vector def as_string(self): """Returns a string representation.""" s = '' for key in self: s += key + ':' + str(self[key]) + ' ' return s def save(self, f): """Save vector to file.""" for key in self: f.write(str(key) + '\t' + str(self[key]) + '\n') def load(self, f, dtype=str): """Load vector from file.""" self.clear() for line in f: fields = line.split('\t') key = fields[0] value = float(fields[1]) self[dtype(key)] = value def add(self, vector, scalar=1.0): """ Adds this vector and a given vector.""" for key in vector: if key in self: self[key] += scalar * vector[key] else: self[key] = scalar * vector[key] def scale(self, scalar): """Scales this vector by a scale factor.""" for key in self: self[key] *= scalar def add_constant(self, scalar): """Adds a constant to each element of the vector.""" for key in self: self[key] += scalar def squared_norm(self): """Computes the squared norm of the vector.""" return self.dot_product(self) def dot_product(self, vector): """ Computes the dot product with a given vector. Note: this iterates through the self vector, so it may be inefficient if the number of nonzeros in self is much larger than the number of nonzeros in vector. Hence the function reverts to vector.dot_product(self) if that is beneficial.""" if len(self) > len(vector): return vector.dot_product(self) value = 0.0 for key in self: if key in vector: value += self[key] * vector[key] return value def normalize(self): """ Normalize the vector. Note: if the norm is zero, do nothing.""" norm = 0.0 for key in self: value = self[key] norm += value * value norm = math.sqrt(norm) if norm > 0.0: for key in self: self[key] /= norm PK!n=Ɣ+kiwi/models/linear/structured_classifier.py"""A generic implementation of an abstract structured linear classifier.""" # OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see . # import logging import numpy as np from .linear_model import LinearModel from .structured_decoder import StructuredDecoder from .utils import nearly_eq_tol logger = logging.getLogger(__name__) class StructuredClassifier: """ An abstract structured classifier.""" def __init__(self): self.model = LinearModel() self.decoder = StructuredDecoder() self.use_binary_features = False self.feature_indices = None def save(self, model_path): """Save the full configuration and model.""" raise NotImplementedError def load(self, model_path): """Load the full configuration and model.""" raise NotImplementedError def create_instances(self, dataset): """Preprocess the dataset if needed to create instances. Default is returning the dataset itself. 
        Override if needed."""
        return dataset

    def label_instance(self, instance, parts, predicted_output):
        """Return a labeled instance by adding the predicted output
        information."""
        raise NotImplementedError

    def create_prediction(self, instance, parts, predicted_output):
        """Create a prediction for an instance."""
        raise NotImplementedError

    def make_parts(self, instance):
        """Compute the task-specific parts for this instance."""
        raise NotImplementedError

    def make_features(self, instance, parts):
        """Create a feature vector for each part."""
        raise NotImplementedError

    def compute_scores(self, instance, parts, features):
        """Compute a score for every part in the instance using the current
        model and the part-specific features."""
        num_parts = len(parts)
        scores = np.zeros(num_parts)
        for r in range(num_parts):
            if self.use_binary_features:
                scores[r] = self.model.compute_score_binary_features(
                    features[r]
                )
            else:
                scores[r] = self.model.compute_score(features[r])
        return scores

    def run(self, instance):
        """Run the structured classifier on a single instance."""
        parts, gold_output = self.make_parts(instance)
        features = self.make_features(instance, parts)
        scores = self.compute_scores(instance, parts, features)
        predicted_output = self.decoder.decode(instance, parts, scores)
        labeled_instance = self.label_instance(
            instance, parts, predicted_output
        )
        return labeled_instance

    def test(self, instances):
        """Run the structured classifier on dev/test data."""
        num_mistakes = 0
        num_parts_total = 0
        predictions = []
        for instance in instances:
            # TODO: use self.run(instance) instead?
            parts, gold_output = self.make_parts(instance)
            features = self.make_features(instance, parts)
            scores = self.compute_scores(instance, parts, features)
            predicted_output = self.decoder.decode(instance, parts, scores)
            predictions.append(
                self.create_prediction(instance, parts, predicted_output)
            )
            num_parts = len(parts)
            assert len(predicted_output) == num_parts
            assert len(gold_output) == num_parts
            for i in range(num_parts):
                if not nearly_eq_tol(
                    gold_output[i], predicted_output[i], 1e-6
                ):
                    num_mistakes += 1
            num_parts_total += num_parts
        logger.info(
            'Part accuracy: %f',
            float(num_parts_total - num_mistakes) / float(num_parts_total),
        )
        return predictions

    def evaluate(self, instances, predictions, print_scores=True):
        """Evaluate the structured classifier, computing a task-dependent
        evaluation metric."""
        raise NotImplementedError


kiwi/models/linear/structured_decoder.py

# OpenKiwi: Open-Source Machine Translation Quality Estimation
# Copyright (C) 2019 Unbabel
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published
# by the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
#
import numpy as np


class StructuredDecoder(object):
    """An abstract decoder for structured prediction."""

    def __init__(self):
        pass

    def decode(self, instance, parts, scores):
        """Decode, computing the highest-scoring output.

        Must return a vector of 0/1 predicted_outputs of the same size
        as parts."""
        raise NotImplementedError

    def decode_mira(
        self, instance, parts, scores, gold_outputs, old_mira=False
    ):
        """Perform cost-augmented decoding or classical MIRA."""
        p = 0.5 - gold_outputs
        q = 0.5 * np.ones(len(gold_outputs)).dot(gold_outputs)
        if old_mira:
            predicted_outputs = self.decode(instance, parts, scores)
        else:
            scores_cost = scores + p
            predicted_outputs = self.decode(instance, parts, scores_cost)
        cost = p.dot(predicted_outputs) + q
        loss = cost + scores.dot(predicted_outputs - gold_outputs)
        return predicted_outputs, cost, loss

    def decode_cost_augmented(self, instance, parts, scores, gold_outputs):
        """Perform cost-augmented decoding."""
        return self.decode_mira(
            instance, parts, scores, gold_outputs, old_mira=False
        )
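
# ---------------------------------------------------------------------------
# Hedged numeric sketch (added for clarity; not part of the original source)
# of the cost-augmented decoding above, for a toy instance with 3 parts:
#
#     gold_outputs = np.array([1.0, 0.0, 1.0])
#     p = 0.5 - gold_outputs                    # [-0.5, 0.5, -0.5]
#     q = 0.5 * np.ones(3).dot(gold_outputs)    # 1.0
#     # Decoding with scores + p rewards flipping parts away from the gold
#     # output; when predicted_outputs == gold_outputs, the Hamming-style
#     # cost p.dot(predicted_outputs) + q evaluates to -1.0 + 1.0 == 0.
# ---------------------------------------------------------------------------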
kiwi/models/linear/utils.py

# -*- coding: utf-8 -*-
"""Several utility functions."""

# OpenKiwi: Open-Source Machine Translation Quality Estimation
# Copyright (C) 2019 Unbabel
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published
# by the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
#
def nearly_eq_tol(a, b, tol):
    """Check whether two numbers are equal up to a tolerance.

    Note: the squared difference, not the absolute difference, is
    compared against tol."""
    return (a - b) * (a - b) <= tol


def nearly_binary_tol(a, tol):
    """Check whether a number is binary (0 or 1) up to a tolerance."""
    return nearly_eq_tol(a, 0.0, tol) or nearly_eq_tol(a, 1.0, tol)


def nearly_zero_tol(a, tol):
    """Check whether a number is zero up to a tolerance."""
    return (a <= tol) and (a >= -tol)


kiwi/models/linear_word_qe_classifier.py

"""This is the main script for the linear sequential word-based quality
estimator."""

# OpenKiwi: Open-Source Machine Translation Quality Estimation
# Copyright (C) 2019 Unbabel
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published
# by the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
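
# ---------------------------------------------------------------------------
# Orientation note (added for clarity; not part of the original source).
# The classifier defined in this file fills in the abstract hooks of
# StructuredClassifier above: make_parts() builds unigram/bigram parts,
# make_features() attaches a LinearWordQEFeatures vector to each part, and
# the inherited run()/test() methods then chain parts -> features -> scores
# -> decode to produce word-level OK/BAD predictions.
# ---------------------------------------------------------------------------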
# import logging from pathlib import Path import numpy as np from kiwi import constants as const from kiwi.data.fieldsets.linear import build_fieldset from kiwi.models.linear.label_dictionary import LabelDictionary from kiwi.models.linear.linear_word_qe_decoder import LinearWordQEDecoder from kiwi.models.linear.linear_word_qe_features import LinearWordQEFeatures from kiwi.models.linear.linear_word_qe_sentence import LinearWordQESentence from kiwi.models.linear.sequence_parts import ( SequenceBigramPart, SequenceUnigramPart, ) from kiwi.models.linear.structured_classifier import StructuredClassifier logger = logging.getLogger(__name__) class LinearWordQEClassifier(StructuredClassifier): """Main class for the word-level quality estimator. Inherits from a general structured classifier.""" title = 'Linear Model' def __init__( self, use_basic_features_only=True, use_bigrams=True, use_simple_bigram_features=True, use_parse_features=False, use_stacked_features=False, evaluation_metric='f1_bad', cost_false_positives=0.5, cost_false_negatives=0.5, ): super().__init__() self.decoder = LinearWordQEDecoder( self, cost_false_positives, cost_false_negatives ) self.labels = LabelDictionary() self.use_basic_features_only = use_basic_features_only self.use_bigrams = use_bigrams self.use_simple_bigram_features = use_simple_bigram_features self.use_parse_features = use_parse_features self.use_stacked_features = use_stacked_features # Evaluation. self.evaluation_metric = evaluation_metric @staticmethod def fieldset(*args, **kwargs): return build_fieldset() @staticmethod def from_options(vocabs, opts): use_parse_features = True if opts.train_target_parse else False use_stacked_features = True if opts.train_target_stacked else False model = LinearWordQEClassifier( use_basic_features_only=opts.use_basic_features_only, use_bigrams=opts.use_bigrams, use_simple_bigram_features=opts.use_simple_bigram_features, use_parse_features=use_parse_features, use_stacked_features=use_stacked_features, evaluation_metric=opts.evaluation_metric, cost_false_positives=opts.cost_false_positives, cost_false_negatives=opts.cost_false_negatives, ) return model def num_parameters(self): return len(self.__dict__) # -- END of new methods -- # TODO: Eliminate this function. def get_coarse_label(self, label): """Get the coarse part of a fine-grained label. The coarse label is the prefix before the underscore (if any). 
For example, the coarse part of BAD_SUB, BAD_DEL, and BAD is BAD.""" sep = label.find('_') if sep >= 0: coarse_label = label[:sep] else: coarse_label = label return coarse_label def create_instances(self, dataset): instances = [] num_words = 0 for example in dataset: sentence = LinearWordQESentence() labels = [] for label in example.tags: if label in self.labels: label_id = self.labels.get_label_id(label) else: label_id = self.labels.add(label) labels.append(label_id) sentence.create_from_sentence_pair( source_words=example.source, target_words=example.target, alignments=example.alignments, source_pos_tags=getattr(example, const.SOURCE_POS, None), target_pos_tags=getattr(example, const.TARGET_POS, None), target_parse_heads=getattr( example, const.TARGET_PARSE_HEADS, None ), target_parse_relations=getattr( example, const.TARGET_PARSE_RELATIONS, None ), target_ngram_left=getattr( example, const.TARGET_NGRAM_LEFT, None ), target_ngram_right=getattr( example, const.TARGET_NGRAM_RIGHT, None ), target_stacked_features=getattr( example, const.TARGET_STACKED, None ), labels=labels, ) instances.append(sentence) num_words += sentence.num_words() logger.info('Number of sentences: %d' % len(instances)) logger.info('Number of words: %d' % num_words) logger.info('Number of labels: %d' % len(self.labels)) return instances def make_parts(self, instance): """Creates the parts (unigrams and bigrams) for an instance.""" gold_list = [] parts = [] make_gold = True for word_index in range(instance.num_words()): for label_id in range(len(self.labels)): part = SequenceUnigramPart(word_index, label_id) parts.append(part) if make_gold: if label_id == instance.sentence_word_labels[word_index]: gold_list.append(1.0) else: gold_list.append(0.0) if self.use_bigrams: # First word. for label_id in range(len(self.labels)): part = SequenceBigramPart(0, label_id, -1) parts.append(part) if make_gold: if label_id == instance.sentence_word_labels[0]: gold_list.append(1.0) else: gold_list.append(0.0) # Intermediate word. for word_index in range(1, instance.num_words()): for label_id in range(len(self.labels)): for previous_label_id in range(len(self.labels)): part = SequenceBigramPart( word_index, label_id, previous_label_id ) parts.append(part) if make_gold: if ( label_id == instance.sentence_word_labels[word_index] and previous_label_id == instance.sentence_word_labels[word_index - 1] ): gold_list.append(1.0) else: gold_list.append(0.0) # Last word. 
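            # (Descriptive note added for clarity; not in the original
            # source.) With L labels and n words, make_parts builds n*L
            # unigram parts, L start bigrams (previous_label_id == -1),
            # (n-1)*L*L interior bigrams, and, below, L final bigrams whose
            # label_id == -1 marks the stop transition; e.g. L == 2, n == 3
            # gives 6 + 2 + 8 + 2 == 18 parts.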
for previous_label_id in range(len(self.labels)): part = SequenceBigramPart( instance.num_words(), -1, previous_label_id ) parts.append(part) if make_gold: if ( previous_label_id == instance.sentence_word_labels[ instance.num_words() - 1 ] ): gold_list.append(1.0) else: gold_list.append(0.0) if make_gold: gold_array = np.array(gold_list) return parts, gold_array else: return parts def make_features(self, instance, parts): """Creates a feature vector for each part.""" features = [] for part in parts: part_features = LinearWordQEFeatures( use_basic_features_only=self.use_basic_features_only, use_simple_bigram_features=self.use_simple_bigram_features, use_parse_features=self.use_parse_features, use_stacked_features=self.use_stacked_features, ) if isinstance(part, SequenceUnigramPart): part_features.compute_unigram_features( instance.sentence_word_features, part ) elif isinstance(part, SequenceBigramPart): part_features.compute_bigram_features( instance.sentence_word_features, part ) else: raise NotImplementedError features.append(part_features) return features def label_instance(self, instance, parts, predicted_output): """Return a labeled instance by adding the predicted output information.""" assert False, 'This does not seem to be called' labeled_instance = LinearWordQESentence(instance.sentence) labeled_instance.sentence_word_features = ( instance.sentence_word_features ) predictions = np.zeros(instance.num_words(), dtype=int) for r, part in enumerate(parts): if isinstance(part, SequenceUnigramPart): continue if predicted_output[r] > 0.5: predictions[part.index] = part.label labeled_instance.sentence_word_labels = [ self.labels.get_label_name(pred) for pred in predictions ] return labeled_instance def create_prediction(self, instance, parts, predicted_output): """Creates a list of word-level predictions for a sentence. For compliance with probabilities, it returns 1 if label is BAD, 0 if OK.""" predictions = np.zeros(instance.num_words(), dtype=int) for r, part in enumerate(parts): if not isinstance(part, SequenceUnigramPart): continue if predicted_output[r] > 0.5: predictions[part.index] = part.label predictions = [ int(const.BAD == self.labels.get_label_name(pred)) for pred in predictions ] return predictions def test(self, instances): """Run the model on test data.""" logger.info('Testing...') predictions = StructuredClassifier.test(self, instances) return predictions def evaluate(self, instances, predictions, print_scores=True): """Evaluates the model's accuracy and F1-BAD score.""" all_predictions = [] for word_predictions in predictions: labels = [ const.BAD if prediction else const.OK for prediction in word_predictions ] labels = [int(self.labels[label]) for label in labels] all_predictions.append(labels) # TODO: Get rid of fine-grained labels. # Allow fine-grained labels. Their names should be a coarse-grained # label, followed by an underscore, followed by a sub-label. # For example, BAD_SUB or BAD_DEL are two instances of bad labels. fine_to_coarse = -np.ones(len(self.labels), dtype=int) coarse_labels = LabelDictionary() for label in self.labels: coarse_label = self.get_coarse_label(label) if coarse_label not in coarse_labels: lid = coarse_labels.add(coarse_label) else: lid = coarse_labels[coarse_label] fine_to_coarse[self.labels[label]] = lid # Iterate through sentences and compare gold values with predicted # values. Update counts. 
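        # (Descriptive note added for clarity; not in the original source.)
        # num_matched/num_predicted feed the overall accuracy below, while
        # the per-coarse-label arrays feed precision = matched/predicted and
        # recall = matched/gold, combined as F1 = 2*P*R / (P + R) for the
        # coarse BAD and OK labels.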
num_matched = 0 num_matched_labels = np.zeros(len(coarse_labels)) num_predicted = 0 num_predicted_labels = np.zeros(len(coarse_labels)) num_gold_labels = np.zeros(len(coarse_labels)) assert len(all_predictions) == len(instances) for i, instance in enumerate(instances): predictions = all_predictions[i] assert len(instance.sentence_word_labels) == len(predictions) for j in range(len(predictions)): if ( fine_to_coarse[predictions[j]] == fine_to_coarse[instance.sentence_word_labels[j]] ): num_matched += 1 num_predicted += 1 if ( fine_to_coarse[predictions[j]] == fine_to_coarse[instance.sentence_word_labels[j]] ): num_matched_labels[fine_to_coarse[predictions[j]]] += 1 num_predicted_labels[fine_to_coarse[predictions[j]]] += 1 num_gold_labels[ fine_to_coarse[instance.sentence_word_labels[j]] ] += 1 acc = float(num_matched) / float(num_predicted) logger.info('Accuracy: %f' % acc) # We allow multiple bad labels. They should be named BAD*. bad = coarse_labels['BAD'] if num_matched_labels[bad] == 0: f1_bad = 0.0 else: precision_bad = float(num_matched_labels[bad]) / float( num_predicted_labels[bad] ) recall_bad = float(num_matched_labels[bad]) / float( num_gold_labels[bad] ) f1_bad = ( 2 * precision_bad * recall_bad / (precision_bad + recall_bad) ) logger.info( '# gold bad: %d/%d' % (num_gold_labels[bad], sum(num_gold_labels)) ) logger.info( '# predicted bad: %d/%d' % (num_predicted_labels[bad], sum(num_predicted_labels)) ) ok = coarse_labels['OK'] if num_matched_labels[ok] == 0: f1_ok = 0.0 else: precision_ok = float(num_matched_labels[ok]) / float( num_predicted_labels[ok] ) recall_ok = float(num_matched_labels[ok]) / float( num_gold_labels[ok] ) f1_ok = 2 * precision_ok * recall_ok / (precision_ok + recall_ok) logger.info( '# gold ok: %d/%d' % (num_gold_labels[ok], sum(num_gold_labels)) ) logger.info( '# predicted ok: %d/%d' % (num_predicted_labels[ok], sum(num_predicted_labels)) ) logger.info('F1 bad: %f' % f1_bad) logger.info('F1 ok: %f' % f1_ok) logger.info('F1 mult: %f' % (f1_bad * f1_ok)) if self.evaluation_metric == 'f1_mult': return f1_bad * f1_ok elif self.evaluation_metric == 'f1_bad': return f1_bad else: raise NotImplementedError def load_configuration(self, config): self.use_basic_features_only = config['use_basic_features_only'] self.use_bigrams = config['use_bigrams'] self.use_simple_bigram_features = config['use_simple_bigram_features'] self.use_stacked_features = config['use_stacked'] self.use_parse_features = config['use_parse'] def save_configuration(self): config = { 'use_basic_features_only': self.use_basic_features_only, 'use_bigrams': self.use_bigrams, 'use_simple_bigram_features': self.use_simple_bigram_features, 'use_stacked': self.use_stacked_features, 'use_parse': self.use_parse_features, } return config def load(self, model_path): import pickle with Path(model_path).open('rb') as fid: config = pickle.load(fid) self.load_configuration(config) self.labels = pickle.load(fid) self.model = pickle.load(fid) try: self.source_vocab = pickle.load(fid) self.target_vocab = pickle.load(fid) except EOFError: self.source_vocab = None self.target_vocab = None def save(self, model_path): import pickle with Path(model_path).open('wb') as fid: config = self.save_configuration() pickle.dump(config, fid) pickle.dump(self.labels, fid) pickle.dump(self.model, fid) # pickle.dump(self.source_vocab, fid) # pickle.dump(self.target_vocab, fid) PK! 
kiwi/models/model.py

# OpenKiwi: Open-Source Machine Translation Quality Estimation
# Copyright (C) 2019 Unbabel
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published
# by the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
#
import logging
from abc import ABCMeta, abstractmethod

import torch
import torch.nn as nn

import kiwi
from kiwi import constants as const
from kiwi.data import utils

logger = logging.getLogger(__name__)


class ModelConfig:
    # Note: assigning __metaclass__ is the Python 2 idiom and has no effect
    # under Python 3; it is kept here as in the original code.
    __metaclass__ = ABCMeta

    def __init__(self, vocabs):
        """Model Configuration Base Class.

        Args:
            vocabs: Dictionary Mapping Field Names to Vocabularies.
                    Must contain 'source' and 'target' keys.
        """
        self.source_vocab_size = len(vocabs[const.SOURCE])
        self.target_vocab_size = len(vocabs[const.TARGET])

    @classmethod
    def from_dict(cls, config_dict, vocabs):
        """Create config from a saved state_dict.

        Args:
            config_dict: A dictionary that is the return value of a call to
                         the `state_dict()` method of `cls`.
            vocabs: See `ModelConfig.__init__`.
        """
        config = cls(vocabs)
        config.update(config_dict)
        return config

    def update(self, other_config):
        """Update the config object with the values of `other_config`.

        Args:
            other_config: The `dict` or `ModelConfig` object to update with.
        """
        config_dict = dict()
        if isinstance(self, other_config.__class__):
            config_dict = other_config.__dict__
        elif isinstance(other_config, dict):
            config_dict = other_config
        self.__dict__.update(config_dict)

    def state_dict(self):
        """Return the __dict__ for serialization."""
        self.__dict__['__version__'] = kiwi.__version__
        return self.__dict__


class Model(nn.Module):
    __metaclass__ = ABCMeta

    subclasses = {}

    def __init__(self, vocabs, ConfigCls=ModelConfig, config=None, **kwargs):
        """Quality Estimation Base Class.

        Args:
            vocabs: Dictionary Mapping Field Names to Vocabularies.
            ConfigCls: ModelConfig Subclass
            config: A State Dict of a ModelConfig subclass.
                    If set, passing other kwargs will raise an Exception.
""" super().__init__() self.vocabs = vocabs if config is None: config = ConfigCls(vocabs=vocabs, **kwargs) else: config = ConfigCls.from_dict(config_dict=config, vocabs=vocabs) assert not kwargs self.config = config @classmethod def register_subclass(cls, subclass): cls.subclasses[subclass.__name__] = subclass return subclass @abstractmethod def loss(self, model_out, target): pass @abstractmethod def forward(self, *args, **kwargs): pass def num_parameters(self): return sum(p.numel() for p in self.parameters()) def predict(self, batch, class_name=const.BAD, unmask=True): model_out = self(batch) predictions = {} class_index = torch.tensor([const.LABELS.index(class_name)]) for key in model_out: if key in [const.TARGET_TAGS, const.SOURCE_TAGS, const.GAP_TAGS]: # Models are assumed to return logits, not probabilities logits = model_out[key] probs = torch.softmax(logits, dim=-1) class_probs = probs.index_select( -1, class_index.to(device=probs.device) ) class_probs = class_probs.squeeze(-1).tolist() if unmask: if key == const.SOURCE_TAGS: input_key = const.SOURCE else: input_key = const.TARGET mask = self.get_mask(batch, input_key) if key == const.GAP_TAGS: # Append one extra token mask = torch.cat( [mask.new_ones((mask.shape[0], 1)), mask], dim=1 ) lengths = mask.int().sum(dim=-1) for i, x in enumerate(class_probs): class_probs[i] = x[: lengths[i]] predictions[key] = class_probs elif key == const.SENTENCE_SCORES: predictions[key] = model_out[key].tolist() elif key == const.BINARY: logits = model_out[key] probs = torch.softmax(logits, dim=-1) class_probs = probs.index_select( -1, class_index.to(device=probs.device) ) predictions[key] = class_probs.tolist() return predictions def predict_raw(self, examples): batch = self.preprocess(examples) return self.predict(batch, class_name=const.BAD_ID, unmask=True) def preprocess(self, examples): """Preprocess Raw Data. Args: examples (list of dict): List of examples. Each Example is a dict with field strings as keys, and unnumericalized, tokenized data as values. Return: A batch object. """ raise NotImplementedError def get_mask(self, batch, output): """Compute Mask of Tokens for side. Args: batch: Namespace of tensors side: String identifier. 
""" side = output # if output in [const.TARGET_TAGS, const.GAP_TAGS]: # side = const.TARGET # elif output == const.SOURCE_TAGS: # side = const.SOURCE input_tensor = getattr(batch, side) if isinstance(input_tensor, tuple) and len(input_tensor) == 2: input_tensor, lengths = input_tensor # output_tensor = getattr(batch, output) # if isinstance(output_tensor, tuple) and len(output_tensor) == 2: # output_tensor, lengths = output_tensor mask = torch.ones_like(input_tensor, dtype=torch.uint8) possible_padding = [const.PAD, const.START, const.STOP] unk_id = self.vocabs[side].stoi.get(const.UNK) for pad in possible_padding: pad_id = self.vocabs[side].stoi.get(pad) if pad_id is not None and pad_id != unk_id: mask &= input_tensor != pad_id return mask @staticmethod def create_from_file(path): model_dict = torch.load( str(path), map_location=lambda storage, loc: storage ) for model_name in Model.subclasses: if model_name in model_dict: model = Model.subclasses[model_name].from_dict(model_dict) return model return None @classmethod def from_file(cls, path): model_dict = torch.load( str(path), map_location=lambda storage, loc: storage ) if cls.__name__ not in model_dict: raise KeyError( '{} model data not found in {}'.format(cls.__name__, path) ) return cls.from_dict(model_dict) @classmethod def from_dict(cls, model_dict): vocabs = utils.deserialize_vocabs(model_dict[const.VOCAB]) class_dict = model_dict[cls.__name__] model = cls(vocabs=vocabs, config=class_dict[const.CONFIG]) model.load_state_dict(class_dict[const.STATE_DICT]) return model def save(self, path): vocabs = utils.serialize_vocabs(self.vocabs) model_dict = { '__version__': kiwi.__version__, const.VOCAB: vocabs, self.__class__.__name__: { const.CONFIG: self.config.state_dict(), const.STATE_DICT: self.state_dict(), }, } torch.save(model_dict, str(path)) PK!hkiwi/models/modules/__init__.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see . # PK!Pds@yy kiwi/models/modules/attention.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see . # import torch from torch import nn class Attention(nn.Module): """Generic Attention Implementation. Module computes a convex combination of a set of values based on the fit of their keys with a query. 
""" def __init__(self, scorer): super().__init__() self.scorer = scorer self.mask = None def forward(self, query, keys, values=None): if values is None: values = keys scores = self.scorer(query, keys) # Masked Softmax scores = scores - scores.mean(1, keepdim=True) # numerical stability scores = torch.exp(scores) if self.mask is not None: scores = self.mask * scores convex = scores / scores.sum(1, keepdim=True) return torch.einsum('bs,bsi->bi', [convex, values]) def set_mask(self, mask): self.mask = mask PK!Luukiwi/models/modules/scorer.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see . # import torch from torch import nn class Scorer(nn.Module): """Score function for Attention module. """ def __init__(self): super().__init__() def forward(self, query, keys): """Computes Scores for each key given the query. args: query: FloatTensor batch x n keys: FloatTensor batch x seq_length x m ret: scores: FloatTensor batch x seq_length """ raise NotImplementedError class MLPScorer(Scorer): """Implements a score function based on a Multilayer Perceptron. """ def __init__(self, query_size, key_size, layers=2, nonlinearity=nn.Tanh): super().__init__() layer_list = [] size = query_size + key_size for i in range(layers): size_next = size // 2 if i < layers - 1 else 1 layer_list.append( nn.Sequential(nn.Linear(size, size_next), nonlinearity()) ) size = size_next self.layers = nn.ModuleList(layer_list) def forward(self, query, keys): layer_in = torch.cat([query.unsqueeze(1).expand_as(keys), keys], dim=-1) layer_in = layer_in.reshape(-1, layer_in.size(-1)) for layer in self.layers: layer_in = layer(layer_in) out = layer_in.reshape(keys.size()[:-1]) return out PK!d# # kiwi/models/nuqe.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see . 
# from collections import OrderedDict import torch import torch.nn as nn import torch.nn.functional as F from kiwi import constants as const from kiwi.data.fieldsets.quetch import build_fieldset from kiwi.models.model import Model from kiwi.models.quetch import QUETCH from kiwi.models.utils import make_loss_weights @Model.register_subclass class NuQE(QUETCH): """Neural Quality Estimation (NuQE) model for word level quality estimation.""" title = 'NuQE' def __init__(self, vocabs, **kwargs): self.source_emb = None self.target_emb = None self.linear_1 = None self.linear_2 = None self.linear_3 = None self.linear_4 = None self.linear_5 = None self.linear_6 = None self.linear_out = None self.embeddings_dropout = None self.dropout = None self.gru1 = None self.gru2 = None self.is_built = False super().__init__(vocabs, **kwargs) def build(self, source_vectors=None, target_vectors=None): nb_classes = self.config.nb_classes # FIXME: Remove dependency on magic number weight = make_loss_weights( nb_classes, const.BAD_ID, self.config.bad_weight ) self._loss = nn.CrossEntropyLoss( weight=weight, ignore_index=self.config.tags_pad_id, reduction='sum' ) # Embeddings layers: self._build_embeddings(source_vectors, target_vectors) feature_set_size = ( self.config.source_embeddings_size + self.config.target_embeddings_size ) * self.config.window_size l1_dim = self.config.hidden_sizes[0] l2_dim = self.config.hidden_sizes[1] l3_dim = self.config.hidden_sizes[2] l4_dim = self.config.hidden_sizes[3] nb_classes = self.config.nb_classes dropout = self.config.dropout # Linear layers self.linear_1 = nn.Linear(feature_set_size, l1_dim) self.linear_2 = nn.Linear(l1_dim, l1_dim) self.linear_3 = nn.Linear(2 * l2_dim, l2_dim) self.linear_4 = nn.Linear(l2_dim, l2_dim) self.linear_5 = nn.Linear(2 * l2_dim, l3_dim) self.linear_6 = nn.Linear(l3_dim, l4_dim) # Output layer self.linear_out = nn.Linear(l4_dim, nb_classes) # Recurrent Layers self.gru_1 = nn.GRU( l1_dim, l2_dim, bidirectional=True, batch_first=True ) self.gru_2 = nn.GRU( l2_dim, l2_dim, bidirectional=True, batch_first=True ) # Dropout after linear layers self.dropout_in = nn.Dropout(dropout) self.dropout_out = nn.Dropout(dropout) # Explicit initializations nn.init.xavier_uniform_(self.linear_1.weight) nn.init.xavier_uniform_(self.linear_2.weight) nn.init.xavier_uniform_(self.linear_3.weight) nn.init.xavier_uniform_(self.linear_4.weight) nn.init.xavier_uniform_(self.linear_5.weight) nn.init.xavier_uniform_(self.linear_6.weight) # nn.init.xavier_uniform_(self.linear_out) nn.init.constant_(self.linear_1.bias, 0.0) nn.init.constant_(self.linear_2.bias, 0.0) nn.init.constant_(self.linear_3.bias, 0.0) nn.init.constant_(self.linear_4.bias, 0.0) nn.init.constant_(self.linear_5.bias, 0.0) nn.init.constant_(self.linear_6.bias, 0.0) # nn.init.constant_(self.linear_out.bias, 0.) 
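        # (Illustrative note added for clarity; not part of the original
        # source.) Per-token dimension flow of the stack assembled above:
        #   features -> l1 (linear_1, linear_2) -> 2*l2 (bi-GRU gru_1)
        #   -> l2 (linear_3, linear_4) -> 2*l2 (bi-GRU gru_2)
        #   -> l3 (linear_5) -> l4 (linear_6) -> nb_classes (linear_out)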
self.is_built = True @staticmethod def fieldset(*args, **kwargs): return build_fieldset(*args, **kwargs) @staticmethod def from_options(vocabs, opts): model = NuQE( vocabs=vocabs, predict_target=opts.predict_target, predict_gaps=opts.predict_gaps, predict_source=opts.predict_source, source_embeddings_size=opts.source_embeddings_size, target_embeddings_size=opts.target_embeddings_size, hidden_sizes=opts.hidden_sizes, bad_weight=opts.bad_weight, window_size=opts.window_size, max_aligned=opts.max_aligned, dropout=opts.dropout, embeddings_dropout=opts.embeddings_dropout, freeze_embeddings=opts.freeze_embeddings, ) return model def forward(self, batch): assert self.is_built if self.config.predict_source: align_side = const.SOURCE_TAGS else: align_side = const.TARGET_TAGS target_input, source_input, nb_alignments = self.make_input( batch, align_side ) # # Source Branch # # (bs, ts, aligned, window) -> (bs, ts, aligned, window, emb) h_source = self.source_emb(source_input) h_source = self.embeddings_dropout(h_source) if len(h_source.shape) == 5: # (bs, ts, aligned, window, emb) -> (bs, ts, window, emb) h_source = h_source.sum(2, keepdim=False) / nb_alignments.unsqueeze( -1 ).unsqueeze(-1) # (bs, ts, window, emb) -> (bs, ts, window * emb) h_source = h_source.view(source_input.size(0), source_input.size(1), -1) # # Target Branch # # (bs, ts * window) -> (bs, ts * window, emb) h_target = self.target_emb(target_input) h_target = self.embeddings_dropout(h_target) if len(h_target.shape) == 5: # (bs, ts, aligned, window, emb) -> (bs, ts, window, emb) h_target = h_target.sum(2, keepdim=False) / nb_alignments.unsqueeze( -1 ).unsqueeze(-1) # (bs, ts * window, emb) -> (bs, ts, window * emb) h_target = h_target.view(target_input.size(0), target_input.size(1), -1) # # POS tags branches # feature_set = (h_source, h_target) # # Merge Branches # # (bs, ts, window * emb) -> (bs, ts, 2 * window * emb) h = torch.cat(feature_set, dim=-1) h = self.dropout_in(h) # # First linears # # (bs, ts, 2 * window * emb) -> (bs, ts, l1_dim) h = F.relu(self.linear_1(h)) # (bs, ts, l1_dim) -> (bs, ts, l1_dim) h = F.relu(self.linear_2(h)) # # First recurrent # # (bs, ts, l1_dim) -> (bs, ts, l1_dim) h, _ = self.gru_1(h) # # Second linears # # (bs, ts, l1_dim) -> (bs, ts, l2_dim) h = F.relu(self.linear_3(h)) # (bs, ts, l2_dim) -> (bs, ts, l2_dim) h = F.relu(self.linear_4(h)) # # Second recurrent # # (bs, ts, l2_dim) -> (bs, ts, l2_dim) h, _ = self.gru_2(h) # # Third linears # # (bs, ts, l1_dim) -> (bs, ts, l3_dim) h = F.relu(self.linear_5(h)) # (bs, ts, l3_dim) -> (bs, ts, l4_dim) h = F.relu(self.linear_6(h)) h = self.dropout_out(h) # # Output layer # # (bs, ts, hs) -> (bs, ts, 2) h = self.linear_out(h) # h = F.log_softmax(h, dim=-1) outputs = OrderedDict() if self.config.predict_target: outputs[const.TARGET_TAGS] = h if self.config.predict_gaps: outputs[const.GAP_TAGS] = h if self.config.predict_source: outputs[const.SOURCE_TAGS] = h return outputs PK!^g33kiwi/models/predictor.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see . # from collections import OrderedDict import torch from torch import nn from kiwi import constants as const from kiwi.metrics import CorrectMetric, ExpectedErrorMetric, PerplexityMetric from kiwi.models.model import Model, ModelConfig from kiwi.models.modules.attention import Attention from kiwi.models.modules.scorer import MLPScorer from kiwi.models.utils import apply_packed_sequence, replace_token class PredictorConfig(ModelConfig): def __init__( self, vocabs, hidden_pred=400, rnn_layers_pred=3, dropout_pred=0.0, share_embeddings=False, embedding_sizes=0, target_embeddings_size=200, source_embeddings_size=200, out_embeddings_size=200, predict_inverse=False, ): """Predictor Hyperparams. """ super().__init__(vocabs) # Vocabulary self.target_side = const.TARGET self.source_side = const.SOURCE self.predict_inverse = predict_inverse if self.predict_inverse: self.source_side, self.target_side = ( self.target_side, self.source_side, ) self.target_vocab_size, self.source_vocab_size = ( self.source_vocab_size, self.target_vocab_size, ) # Architecture self.hidden_pred = hidden_pred self.rnn_layers_pred = rnn_layers_pred self.dropout_pred = dropout_pred self.share_embeddings = share_embeddings if embedding_sizes: self.target_embeddings_size = embedding_sizes self.source_embeddings_size = embedding_sizes self.out_embeddings_size = embedding_sizes else: self.target_embeddings_size = target_embeddings_size self.source_embeddings_size = source_embeddings_size self.out_embeddings_size = out_embeddings_size @Model.register_subclass class Predictor(Model): """Bidirectional Conditional Language Model Implemented after Kim et al 2017, see: http://www.statmt.org/wmt17/pdf/WMT63.pdf """ title = 'PredEst Predictor model (an embedder model)' def __init__(self, vocabs, **kwargs): """ Args: vocabs: Dictionary Mapping Field Names to Vocabularies. kwargs: config: A state dict of a PredictorConfig object. dropout: LSTM dropout Default 0.0 hidden_pred: LSTM Hidden Size, default 200 rnn_layers: Default 3 embedding_sizes: If set, takes precedence over other embedding params Default 100 source_embeddings_size: Default 100 target_embeddings_size: Default 100 out_embeddings_size: Output softmax embedding. Default 100 share_embeddings: Tie input and output embeddings for target. Default False predict_inverse: Predict from target to source. 
Default False """ super().__init__(vocabs=vocabs, ConfigCls=PredictorConfig, **kwargs) scorer = MLPScorer( self.config.hidden_pred * 2, self.config.hidden_pred * 2, layers=2 ) self.attention = Attention(scorer) self.embedding_source = nn.Embedding( self.config.source_vocab_size, self.config.source_embeddings_size, const.PAD_ID, ) self.embedding_target = nn.Embedding( self.config.target_vocab_size, self.config.target_embeddings_size, const.PAD_ID, ) self.lstm_source = nn.LSTM( input_size=self.config.source_embeddings_size, hidden_size=self.config.hidden_pred, num_layers=self.config.rnn_layers_pred, batch_first=True, dropout=self.config.dropout_pred, bidirectional=True, ) self.forward_target = nn.LSTM( input_size=self.config.target_embeddings_size, hidden_size=self.config.hidden_pred, num_layers=self.config.rnn_layers_pred, batch_first=True, dropout=self.config.dropout_pred, bidirectional=False, ) self.backward_target = nn.LSTM( input_size=self.config.target_embeddings_size, hidden_size=self.config.hidden_pred, num_layers=self.config.rnn_layers_pred, batch_first=True, dropout=self.config.dropout_pred, bidirectional=False, ) self.W1 = self.embedding_target if not self.config.share_embeddings: self.W1 = nn.Embedding( self.config.target_vocab_size, self.config.out_embeddings_size, const.PAD_ID, ) self.W2 = nn.Parameter( torch.zeros( self.config.out_embeddings_size, self.config.out_embeddings_size ) ) self.V = nn.Parameter( torch.zeros( 2 * self.config.target_embeddings_size, 2 * self.config.out_embeddings_size, ) ) self.C = nn.Parameter( torch.zeros( 2 * self.config.hidden_pred, 2 * self.config.out_embeddings_size ) ) self.S = nn.Parameter( torch.zeros( 2 * self.config.hidden_pred, 2 * self.config.out_embeddings_size ) ) for p in self.parameters(): if len(p.shape) > 1: nn.init.xavier_uniform_(p) self._loss = nn.CrossEntropyLoss( reduction='sum', ignore_index=const.PAD_ID ) @staticmethod def fieldset(*args, **kwargs): from kiwi.data.fieldsets.predictor import build_fieldset return build_fieldset() @staticmethod def from_options(vocabs, opts): """ Args: vocabs: opts: Returns: """ model = Predictor( vocabs, hidden_pred=opts.hidden_pred, rnn_layers_pred=opts.rnn_layers_pred, dropout_pred=opts.dropout_pred, share_embeddings=opts.share_embeddings, embedding_sizes=opts.embedding_sizes, target_embeddings_size=opts.target_embeddings_size, source_embeddings_size=opts.source_embeddings_size, out_embeddings_size=opts.out_embeddings_size, predict_inverse=opts.predict_inverse, ) return model def loss(self, model_out, batch, target_side=None): if not target_side: target_side = self.config.target_side target = getattr(batch, target_side) # There are no predictions for first/last element target = replace_token(target[:, 1:-1], const.STOP_ID, const.PAD_ID) # Predicted Class must be in dim 1 for xentropyloss logits = model_out[target_side] logits = logits.transpose(1, 2) loss = self._loss(logits, target) loss_dict = OrderedDict() loss_dict[target_side] = loss loss_dict[const.LOSS] = loss return loss_dict def forward(self, batch, source_side=None, target_side=None): if not source_side: source_side = self.config.source_side if not target_side: target_side = self.config.target_side source = getattr(batch, source_side) target = getattr(batch, target_side) batch_size, target_len = target.shape[:2] # Remove First and Last Element (Start / Stop Tokens) source_mask = self.get_mask(batch, source_side)[:, 1:-1] source_lengths = source_mask.sum(1) target_lengths = self.get_mask(batch, target_side).sum(1) source_embeddings = 
self.embedding_source(source) target_embeddings = self.embedding_target(target) # Source Encoding source_contexts, hidden = apply_packed_sequence( self.lstm_source, source_embeddings, source_lengths ) # Target Encoding. h_forward, h_backward = self._split_hidden(hidden) forward_contexts, _ = self.forward_target(target_embeddings, h_forward) target_emb_rev = self._reverse_padded_seq( target_lengths, target_embeddings ) backward_contexts, _ = self.backward_target(target_emb_rev, h_backward) backward_contexts = self._reverse_padded_seq( target_lengths, backward_contexts ) # For each position, concatenate left context i-1 and right context i+1 target_contexts = torch.cat( [forward_contexts[:, :-2], backward_contexts[:, 2:]], dim=-1 ) # For each position i, concatenate Emeddings i-1 and i+1 target_embeddings = torch.cat( [target_embeddings[:, :-2], target_embeddings[:, 2:]], dim=-1 ) # Get Attention vectors for all positions and stack. self.attention.set_mask(source_mask.float()) attns = [ self.attention( target_contexts[:, i], source_contexts, source_contexts ) for i in range(target_len - 2) ] attns = torch.stack(attns, dim=1) # Combine attention, embeddings and target context vectors C = torch.einsum('bsi,il->bsl', [attns, self.C]) V = torch.einsum('bsj,jl->bsl', [target_embeddings, self.V]) S = torch.einsum('bsk,kl->bsl', [target_contexts, self.S]) t_tilde = C + V + S # Maxout with pooling size 2 t, _ = torch.max( t_tilde.view( t_tilde.shape[0], t_tilde.shape[1], t_tilde.shape[-1] // 2, 2 ), dim=-1, ) f = torch.einsum('oh,bso->bsh', [self.W2, t]) logits = torch.einsum('vh,bsh->bsv', [self.W1.weight, f]) PreQEFV = torch.einsum('bsh,bsh->bsh', [self.W1(target[:, 1:-1]), f]) PostQEFV = torch.cat([forward_contexts, backward_contexts], dim=-1) return { target_side: logits, const.PREQEFV: PreQEFV, const.POSTQEFV: PostQEFV, } @staticmethod def _reverse_padded_seq(lengths, sequence): """ Reverses a batch of padded sequences of different length. """ batch_size, max_length = sequence.shape[:-1] reversed_idx = [] for i in range(batch_size * max_length): batch_id = i // max_length sent_id = i % max_length if sent_id < lengths[batch_id]: sent_id_rev = lengths[batch_id] - sent_id - 1 else: sent_id_rev = sent_id # Padding symbol, don't change order reversed_idx.append(max_length * batch_id + sent_id_rev) flat_sequence = sequence.contiguous().view(batch_size * max_length, -1) reversed_seq = flat_sequence[reversed_idx, :].view(*sequence.shape) return reversed_seq @staticmethod def _split_hidden(hidden): """Split Hidden State into forward/backward parts. 
""" h, c = hidden size = h.shape[0] idx_forward = torch.arange(0, size, 2, dtype=torch.long) idx_backward = torch.arange(1, size, 2, dtype=torch.long) hidden_forward = (h[idx_forward], c[idx_forward]) hidden_backward = (h[idx_backward], c[idx_backward]) return hidden_forward, hidden_backward def metrics(self): metrics = [] main_metric = PerplexityMetric( prefix=self.config.target_side, target_name=self.config.target_side, PAD=const.PAD_ID, STOP=const.STOP_ID, ) metrics.append(main_metric) metrics.append( CorrectMetric( prefix=self.config.target_side, target_name=self.config.target_side, PAD=const.PAD_ID, STOP=const.STOP_ID, ) ) metrics.append( ExpectedErrorMetric( prefix=self.config.target_side, target_name=self.config.target_side, PAD=const.PAD_ID, STOP=const.STOP_ID, ) ) return metrics def metrics_ordering(self): return min PK!XzU_U_"kiwi/models/predictor_estimator.py# OpenKiwi: Open-Source Machine Translation Quality Estimation # Copyright (C) 2019 Unbabel # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Affero General Public License as published # by the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Affero General Public License for more details. # # You should have received a copy of the GNU Affero General Public License # along with this program. If not, see . # import logging from collections import OrderedDict import torch from torch import nn from torch.distributions.normal import Normal from kiwi import constants as const from kiwi.metrics import ( CorrectMetric, ExpectedErrorMetric, F1Metric, LogMetric, PearsonMetric, PerplexityMetric, RMSEMetric, SpearmanMetric, ThresholdCalibrationMetric, TokenMetric, ) from kiwi.models.model import Model from kiwi.models.predictor import Predictor, PredictorConfig from kiwi.models.utils import apply_packed_sequence, make_loss_weights logger = logging.getLogger(__name__) class EstimatorConfig(PredictorConfig): def __init__( self, vocabs, hidden_est=100, rnn_layers_est=1, mlp_est=True, dropout_est=0.0, start_stop=False, predict_target=True, predict_gaps=False, predict_source=False, token_level=True, sentence_level=True, sentence_ll=True, binary_level=True, target_bad_weight=2.0, source_bad_weight=2.0, gaps_bad_weight=2.0, **kwargs ): """Predictor Estimator Hyperparams. 
""" super().__init__(vocabs, **kwargs) self.start_stop = start_stop or predict_gaps self.hidden_est = hidden_est self.rnn_layers_est = rnn_layers_est self.mlp_est = mlp_est self.dropout_est = dropout_est self.predict_target = predict_target self.predict_gaps = predict_gaps self.predict_source = predict_source self.token_level = token_level self.sentence_level = sentence_level self.sentence_ll = sentence_ll self.binary_level = binary_level self.target_bad_weight = target_bad_weight self.source_bad_weight = source_bad_weight self.gaps_bad_weight = gaps_bad_weight @Model.register_subclass class Estimator(Model): title = 'PredEst (Predictor-Estimator)' def __init__( self, vocabs, predictor_tgt=None, predictor_src=None, **kwargs ): super().__init__(vocabs=vocabs, ConfigCls=EstimatorConfig, **kwargs) if predictor_src: self.config.update(predictor_src.config) elif predictor_tgt: self.config.update(predictor_tgt.config) # Predictor Settings # predict_tgt = self.config.predict_target or self.config.predict_gaps if predict_tgt and not predictor_tgt: predictor_tgt = Predictor( vocabs=vocabs, predict_inverse=False, hidden_pred=self.config.hidden_pred, rnn_layers_pred=self.config.rnn_layers_pred, dropout_pred=self.config.dropout_pred, target_embeddings_size=self.config.target_embeddings_size, source_embeddings_size=self.config.source_embeddings_size, out_embeddings_size=self.config.out_embeddings_size, ) if self.config.predict_source and not predictor_src: predictor_src = Predictor( vocabs=vocabs, predict_inverse=True, hidden_pred=self.config.hidden_pred, rnn_layers_pred=self.config.rnn_layers_pred, dropout_pred=self.config.dropout_pred, target_embeddings_size=self.config.target_embeddings_size, source_embeddings_size=self.config.source_embeddings_size, out_embeddings_size=self.config.out_embeddings_size, ) # Update the predictor vocabs if token level == True # Required by `get_mask` call in predictor forward with `pe` side # to determine padding IDs. 
if self.config.token_level: if predictor_src: predictor_src.vocabs = vocabs if predictor_tgt: predictor_tgt.vocabs = vocabs self.predictor_tgt = predictor_tgt self.predictor_src = predictor_src predictor_hidden = self.config.hidden_pred embedding_size = self.config.out_embeddings_size input_size = 2 * predictor_hidden + embedding_size self.nb_classes = len(const.LABELS) self.lstm_input_size = input_size self.mlp = None self.sentence_pred = None self.sentence_sigma = None self.binary_pred = None self.binary_scale = None # Build Model # if self.config.start_stop: self.start_PreQEFV = nn.Parameter(torch.zeros(1, 1, embedding_size)) self.end_PreQEFV = nn.Parameter(torch.zeros(1, 1, embedding_size)) if self.config.mlp_est: self.mlp = nn.Sequential( nn.Linear(input_size, self.config.hidden_est), nn.Tanh() ) self.lstm_input_size = self.config.hidden_est self.lstm = nn.LSTM( input_size=self.lstm_input_size, hidden_size=self.config.hidden_est, num_layers=self.config.rnn_layers_est, batch_first=True, dropout=self.config.dropout_est, bidirectional=True, ) self.embedding_out = nn.Linear( 2 * self.config.hidden_est, self.nb_classes ) if self.config.predict_gaps: self.embedding_out_gaps = nn.Linear( 4 * self.config.hidden_est, self.nb_classes ) self.dropout = None if self.config.dropout_est: self.dropout = nn.Dropout(self.config.dropout_est) # Multitask Learning Objectives # sentence_input_size = ( 2 * self.config.rnn_layers_est * self.config.hidden_est ) if self.config.sentence_level: self.sentence_pred = nn.Sequential( nn.Linear(sentence_input_size, sentence_input_size // 2), nn.Sigmoid(), nn.Linear(sentence_input_size // 2, sentence_input_size // 4), nn.Sigmoid(), nn.Linear(sentence_input_size // 4, 1), ) self.sentence_sigma = None if self.config.sentence_ll: # Predict truncated Gaussian distribution self.sentence_sigma = nn.Sequential( nn.Linear(sentence_input_size, sentence_input_size // 2), nn.Sigmoid(), nn.Linear( sentence_input_size // 2, sentence_input_size // 4 ), nn.Sigmoid(), nn.Linear(sentence_input_size // 4, 1), nn.Sigmoid(), ) if self.config.binary_level: self.binary_pred = nn.Sequential( nn.Linear(sentence_input_size, sentence_input_size // 2), nn.Tanh(), nn.Linear(sentence_input_size // 2, sentence_input_size // 4), nn.Tanh(), nn.Linear(sentence_input_size // 4, 2), ) # Build Losses # # FIXME: Remove dependency on magic numbers self.xents = nn.ModuleDict() weight = make_loss_weights( self.nb_classes, const.BAD_ID, self.config.target_bad_weight ) self.xents[const.TARGET_TAGS] = nn.CrossEntropyLoss( reduction='sum', ignore_index=const.PAD_TAGS_ID, weight=weight ) if self.config.predict_source: weight = make_loss_weights( self.nb_classes, const.BAD_ID, self.config.source_bad_weight ) self.xents[const.SOURCE_TAGS] = nn.CrossEntropyLoss( reduction='sum', ignore_index=const.PAD_TAGS_ID, weight=weight ) if self.config.predict_gaps: weight = make_loss_weights( self.nb_classes, const.BAD_ID, self.config.gaps_bad_weight ) self.xents[const.GAP_TAGS] = nn.CrossEntropyLoss( reduction='sum', ignore_index=const.PAD_TAGS_ID, weight=weight ) if self.config.sentence_level and not self.config.sentence_ll: self.mse_loss = nn.MSELoss(reduction='sum') if self.config.binary_level: self.xent_binary = nn.CrossEntropyLoss(reduction='sum') @staticmethod def fieldset(*args, **kwargs): from kiwi.data.fieldsets.predictor_estimator import build_fieldset return build_fieldset(*args, **kwargs) @staticmethod def from_options(vocabs, opts): """ Args: vocabs: opts: predict_target (bool): Predict target tags 
predict_source (bool): Predict source tags predict_gaps (bool): Predict gap tags token_level (bool): Train predictor using PE field. sentence_level (bool): Predict Sentence Scores sentence_ll (bool): Use likelihood loss for sentence scores (instead of squared error) binary_level: Predict binary sentence labels target_bad_weight: Weight for target tags bad class. Default 3.0 source_bad_weight: Weight for source tags bad class. Default 3.0 gaps_bad_weight: Weight for gap tags bad class. Default 3.0 Returns: """ predictor_src = predictor_tgt = None if opts.load_pred_source: predictor_src = Predictor.from_file(opts.load_pred_source) if opts.load_pred_target: predictor_tgt = Predictor.from_file(opts.load_pred_target) model = Estimator( vocabs, predictor_tgt=predictor_tgt, predictor_src=predictor_src, hidden_est=opts.hidden_est, rnn_layers_est=opts.rnn_layers_est, mlp_est=opts.mlp_est, dropout_est=opts.dropout_est, start_stop=opts.start_stop, predict_target=opts.predict_target, predict_gaps=opts.predict_gaps, predict_source=opts.predict_source, token_level=opts.token_level, sentence_level=opts.sentence_level, sentence_ll=opts.sentence_ll, binary_level=opts.binary_level, target_bad_weight=opts.target_bad_weight, source_bad_weight=opts.source_bad_weight, gaps_bad_weight=opts.gaps_bad_weight, hidden_pred=opts.hidden_pred, rnn_layers_pred=opts.rnn_layers_pred, dropout_pred=opts.dropout_pred, share_embeddings=opts.dropout_est, embedding_sizes=opts.embedding_sizes, target_embeddings_size=opts.target_embeddings_size, source_embeddings_size=opts.source_embeddings_size, out_embeddings_size=opts.out_embeddings_size, predict_inverse=opts.predict_inverse, ) return model def forward(self, batch): outputs = OrderedDict() contexts_tgt, h_tgt = None, None contexts_src, h_src = None, None if self.config.predict_target or self.config.predict_gaps: model_out_tgt = self.predictor_tgt(batch) input_seq, target_lengths = self.make_input( model_out_tgt, batch, const.TARGET_TAGS ) contexts_tgt, h_tgt = apply_packed_sequence( self.lstm, input_seq, target_lengths ) if self.config.predict_target: logits = self.predict_tags(contexts_tgt) if self.config.start_stop: logits = logits[:, 1:-1] outputs[const.TARGET_TAGS] = logits if self.config.predict_gaps: contexts_gaps = self.make_contexts_gaps(contexts_tgt) logits = self.predict_tags( contexts_gaps, out_embed=self.embedding_out_gaps ) outputs[const.GAP_TAGS] = logits if self.config.predict_source: model_out_src = self.predictor_src(batch) input_seq, target_lengths = self.make_input( model_out_src, batch, const.SOURCE_TAGS ) contexts_src, h_src = apply_packed_sequence( self.lstm, input_seq, target_lengths ) logits = self.predict_tags(contexts_src) outputs[const.SOURCE_TAGS] = logits # Sentence/Binary/Token Level prediction sentence_input = self.make_sentence_input(h_tgt, h_src) if self.config.sentence_level: outputs.update(self.predict_sentence(sentence_input)) if self.config.binary_level: bin_logits = self.binary_pred(sentence_input).squeeze() outputs[const.BINARY] = bin_logits if self.config.token_level and hasattr(batch, const.PE): if self.predictor_tgt: model_out = self.predictor_tgt(batch, target_side=const.PE) logits = model_out[const.PE] outputs[const.PE] = logits if self.predictor_src: model_out = self.predictor_src(batch, source_side=const.PE) logits = model_out[const.SOURCE] outputs[const.SOURCE] = logits # TODO remove? 
# if self.use_probs: # logits -= logits.mean(-1, keepdim=True) # logits_exp = logits.exp() # logprobs = logits - logits_exp.sum(-1, keepdim=True).log() # sentence_scores = ((logprobs.exp() * token_mask).sum(1) # / target_lengths) # sentence_scores = sentence_scores[..., 1 - self.BAD_ID] # binary_logits = (logprobs * token_mask).sum(1) return outputs def make_input(self, model_out, batch, tagset): """Make Input Sequence from predictor outputs. """ PreQEFV = model_out[const.PREQEFV] PostQEFV = model_out[const.POSTQEFV] side = const.TARGET if tagset == const.SOURCE_TAGS: side = const.SOURCE token_mask = self.get_mask(batch, side) batch_size = token_mask.shape[0] target_lengths = token_mask.sum(1) if self.config.start_stop: target_lengths += 2 start = self.start_PreQEFV.expand( batch_size, 1, self.config.out_embeddings_size ) end = self.end_PreQEFV.expand( batch_size, 1, self.config.out_embeddings_size ) PreQEFV = torch.cat((start, PreQEFV, end), dim=1) else: PostQEFV = PostQEFV[:, 1:-1] input_seq = torch.cat([PreQEFV, PostQEFV], dim=-1) length, input_dim = input_seq.shape[1:] if self.mlp: input_flat = input_seq.view(batch_size * length, input_dim) input_flat = self.mlp(input_flat) input_seq = input_flat.view( batch_size, length, self.lstm_input_size ) return input_seq, target_lengths def make_contexts_gaps(self, contexts): # Concat Contexts Shifted contexts = torch.cat((contexts[:, :-1], contexts[:, 1:]), dim=-1) return contexts def make_sentence_input(self, h_tgt, h_src): """Reshape last hidden state. """ h = h_tgt[0] if h_tgt else h_src[0] h = h.contiguous().transpose(0, 1) return h.reshape(h.shape[0], -1) def predict_sentence(self, sentence_input): """Compute Sentence Score predictions.""" outputs = OrderedDict() sentence_scores = self.sentence_pred(sentence_input).squeeze() outputs[const.SENTENCE_SCORES] = sentence_scores if self.sentence_sigma: # Predict truncated Gaussian on [0,1] sigma = self.sentence_sigma(sentence_input).squeeze() outputs[const.SENT_SIGMA] = sigma outputs['SENT_MU'] = outputs[const.SENTENCE_SCORES] mean = outputs['SENT_MU'].clone().detach() # Compute log-likelihood of x given mu, sigma normal = Normal(mean, sigma) # Renormalize on [0,1] for truncated Gaussian partition_function = (normal.cdf(1) - normal.cdf(0)).detach() outputs[const.SENTENCE_SCORES] = mean + ( ( sigma ** 2 * (normal.log_prob(0).exp() - normal.log_prob(1).exp()) ) / partition_function ) return outputs def predict_tags(self, contexts, out_embed=None): """Compute Tag Predictions.""" if not out_embed: out_embed = self.embedding_out batch_size, length, hidden = contexts.shape if self.dropout: contexts = self.dropout(contexts) # Fold sequence length in batch dimension contexts_flat = contexts.contiguous().view(-1, hidden) logits_flat = out_embed(contexts_flat) logits = logits_flat.view(batch_size, length, self.nb_classes) return logits def sentence_loss(self, model_out, batch): """Compute Sentence score loss""" sentence_pred = model_out[const.SENTENCE_SCORES] sentence_scores = batch.sentence_scores if not self.sentence_sigma: return self.mse_loss(sentence_pred, sentence_scores) else: sigma = model_out[const.SENT_SIGMA] mean = model_out['SENT_MU'] # Compute log-likelihood of x given mu, sigma normal = Normal(mean, sigma) # Renormalize on [0,1] for truncated Gaussian partition_function = (normal.cdf(1) - normal.cdf(0)).detach() nll = partition_function.log() - normal.log_prob(sentence_scores) return nll.sum() def word_loss(self, model_out, batch): """Compute Sequence Tagging Loss""" word_loss = 
    def word_loss(self, model_out, batch):
        """Compute sequence tagging loss."""
        word_loss = OrderedDict()
        for tag in const.TAGS:
            if tag in model_out:
                logits = model_out[tag]
                logits = logits.transpose(1, 2)
                word_loss[tag] = self.xents[tag](logits, getattr(batch, tag))
        return word_loss

    def binary_loss(self, model_out, batch):
        """Compute sentence classification loss."""
        labels = getattr(batch, const.BINARY)
        loss = self.xent_binary(model_out[const.BINARY], labels.long())
        return loss

    def loss(self, model_out, batch):
        """Compute model loss."""
        loss_dict = self.word_loss(model_out, batch)
        if self.config.sentence_level:
            loss_sent = self.sentence_loss(model_out, batch)
            loss_dict[const.SENTENCE_SCORES] = loss_sent
        if self.config.binary_level:
            loss_bin = self.binary_loss(model_out, batch)
            loss_dict[const.BINARY] = loss_bin
        if const.PE in model_out:
            loss_token = self.predictor_tgt.loss(
                model_out, batch, target_side=const.PE
            )
            loss_dict[const.PE] = loss_token[const.PE]
        if const.SOURCE in model_out:
            loss_token = self.predictor_src.loss(model_out, batch)
            loss_dict[const.SOURCE] = loss_token[const.SOURCE]
        loss_dict[const.LOSS] = sum(loss.sum() for loss in loss_dict.values())
        return loss_dict

    def metrics(self):
        metrics = []
        if self.config.predict_target:
            metrics.append(
                F1Metric(
                    prefix=const.TARGET_TAGS,
                    target_name=const.TARGET_TAGS,
                    PAD=const.PAD_TAGS_ID,
                    labels=const.LABELS,
                )
            )
            metrics.append(
                ThresholdCalibrationMetric(
                    prefix=const.TARGET_TAGS,
                    target_name=const.TARGET_TAGS,
                    PAD=const.PAD_TAGS_ID,
                )
            )
            metrics.append(
                CorrectMetric(
                    prefix=const.TARGET_TAGS,
                    target_name=const.TARGET_TAGS,
                    PAD=const.PAD_TAGS_ID,
                )
            )
        if self.config.predict_source:
            metrics.append(
                F1Metric(
                    prefix=const.SOURCE_TAGS,
                    target_name=const.SOURCE_TAGS,
                    PAD=const.PAD_TAGS_ID,
                    labels=const.LABELS,
                )
            )
            metrics.append(
                CorrectMetric(
                    prefix=const.SOURCE_TAGS,
                    target_name=const.SOURCE_TAGS,
                    PAD=const.PAD_TAGS_ID,
                )
            )
        if self.config.predict_gaps:
            metrics.append(
                F1Metric(
                    prefix=const.GAP_TAGS,
                    target_name=const.GAP_TAGS,
                    PAD=const.PAD_TAGS_ID,
                    labels=const.LABELS,
                )
            )
            metrics.append(
                CorrectMetric(
                    prefix=const.GAP_TAGS,
                    target_name=const.GAP_TAGS,
                    PAD=const.PAD_TAGS_ID,
                )
            )
        if self.config.sentence_level:
            metrics.append(RMSEMetric(target_name=const.SENTENCE_SCORES))
            metrics.append(PearsonMetric(target_name=const.SENTENCE_SCORES))
            metrics.append(SpearmanMetric(target_name=const.SENTENCE_SCORES))
            if self.config.sentence_ll:
                metrics.append(
                    LogMetric(targets=[('model_out', const.SENT_SIGMA)])
                )
        if self.config.binary_level:
            metrics.append(
                CorrectMetric(prefix=const.BINARY, target_name=const.BINARY)
            )
        if self.config.token_level and self.predictor_tgt is not None:
            metrics.append(
                CorrectMetric(
                    prefix=const.PE,
                    target_name=const.PE,
                    PAD=const.PAD_ID,
                    STOP=const.STOP_ID,
                )
            )
            metrics.append(
                ExpectedErrorMetric(
                    prefix=const.PE,
                    target_name=const.PE,
                    PAD=const.PAD_ID,
                    STOP=const.STOP_ID,
                )
            )
            metrics.append(
                PerplexityMetric(
                    prefix=const.PE,
                    target_name=const.PE,
                    PAD=const.PAD_ID,
                    STOP=const.STOP_ID,
                )
            )
        if self.config.token_level and self.predictor_src is not None:
            metrics.append(
                CorrectMetric(
                    prefix=const.SOURCE,
                    target_name=const.SOURCE,
                    PAD=const.PAD_ID,
                    STOP=const.STOP_ID,
                )
            )
            metrics.append(
                ExpectedErrorMetric(
                    prefix=const.SOURCE,
                    target_name=const.SOURCE,
                    PAD=const.PAD_ID,
                    STOP=const.STOP_ID,
                )
            )
            metrics.append(
                PerplexityMetric(
                    prefix=const.SOURCE,
                    target_name=const.SOURCE,
                    PAD=const.PAD_ID,
                    STOP=const.STOP_ID,
                )
            )
        metrics.append(
            TokenMetric(
                target_name=const.TARGET, STOP=const.STOP_ID, PAD=const.PAD_ID
            )
        )
        return metrics

    def metrics_ordering(self):
        return max
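    # Note: returning the builtin `max` here tells the Stats object (which
    # receives it as `main_metric_ordering`) that larger values of the main
    # metric are better when comparing validation summaries.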
PK! kiwi/models/quetch.py
# OpenKiwi: Open-Source Machine Translation Quality Estimation
# Copyright (C) 2019 Unbabel
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published
# by the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
#

from collections import OrderedDict

import torch
import torch.nn as nn
import torch.nn.functional as F

from kiwi import constants as const
from kiwi.data.fieldsets.quetch import build_fieldset
from kiwi.metrics import CorrectMetric, F1Metric, LogMetric
from kiwi.models.model import Model, ModelConfig
from kiwi.models.utils import align_tensor, convolve_tensor, make_loss_weights


class QUETCHConfig(ModelConfig):
    def __init__(
        self,
        vocabs,
        predict_target=True,
        predict_gaps=False,
        predict_source=False,
        source_embeddings_size=50,
        target_embeddings_size=50,
        hidden_sizes=None,
        bad_weight=3.0,
        window_size=10,
        max_aligned=5,
        dropout=0.4,
        embeddings_dropout=0.4,
        freeze_embeddings=False,
    ):
        super().__init__(vocabs)

        if hidden_sizes is None:
            hidden_sizes = [100]

        source_vectors = vocabs[const.SOURCE].vectors
        target_vectors = vocabs[const.TARGET].vectors
        if source_vectors is not None:
            source_embeddings_size = source_vectors.size(1)
        if target_vectors is not None:
            target_embeddings_size = target_vectors.size(1)
        self.source_embeddings_size = source_embeddings_size
        self.target_embeddings_size = target_embeddings_size
        self.bad_weight = bad_weight
        self.dropout = dropout
        self.embeddings_dropout = embeddings_dropout
        self.freeze_embeddings = freeze_embeddings
        # self.predict_side = predict_side

        # When predicting gap or source tags, the default
        # predict_target=True does not make sense, so turn it off.
        if predict_gaps or predict_source:
            predict_target = False
        self.predict_target = predict_target
        self.predict_gaps = predict_gaps
        self.predict_source = predict_source
        self.window_size = window_size
        self.max_aligned = max_aligned
        self.hidden_sizes = hidden_sizes

        if const.SOURCE_TAGS in vocabs:
            self.tags_pad_id = vocabs[const.SOURCE_TAGS].stoi[const.PAD]
        elif const.GAP_TAGS in vocabs:
            self.tags_pad_id = vocabs[const.GAP_TAGS].stoi[const.PAD]
        else:
            self.tags_pad_id = vocabs[const.TARGET_TAGS].stoi[const.PAD]

        # FIXME: this might not correspond to reality (in vocabs)!
        self.nb_classes = len(const.LABELS)
        self.tag_bad_index = const.BAD_ID
        self.pad_token = const.PAD
        self.unaligned_idx = const.UNALIGNED_ID
        self.source_padding_idx = const.PAD_ID
        self.target_padding_idx = const.PAD_ID
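# Shape sketch (illustrative values): with window_size=3 and max_aligned=2,
# a target batch of token ids with shape (bs, ts) becomes (bs, ts, 3) after
# convolve_tensor (one window per token), and the source side aligned to it
# becomes (bs, ts, 2, 3) after align_tensor (up to two aligned source windows
# per target token; see kiwi/models/utils.py below).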
""" title = "QUETCH" def __init__(self, vocabs, **kwargs): super().__init__(vocabs=vocabs, ConfigCls=QUETCHConfig, **kwargs) self.source_emb = None self.target_emb = None self.embeddings_dropout = None self.linear = None self.dropout = None self.linear_out = None source_vectors = vocabs[const.SOURCE].vectors target_vectors = vocabs[const.TARGET].vectors self.build(source_vectors, target_vectors) @staticmethod def fieldset(*args, **kwargs): return build_fieldset(*args, **kwargs) @staticmethod def from_options(vocabs, opts): model = QUETCH( vocabs=vocabs, predict_target=opts.predict_target, predict_gaps=opts.predict_gaps, predict_source=opts.predict_source, source_embeddings_size=opts.source_embeddings_size, target_embeddings_size=opts.target_embeddings_size, hidden_sizes=opts.hidden_sizes, bad_weight=opts.bad_weight, window_size=opts.window_size, max_aligned=opts.max_aligned, dropout=opts.dropout, embeddings_dropout=opts.embeddings_dropout, freeze_embeddings=opts.freeze_embeddings, ) return model def loss(self, model_out, target): if self.config.predict_source: output_name = const.SOURCE_TAGS elif self.config.predict_gaps: output_name = const.GAP_TAGS else: output_name = const.TARGET_TAGS # (bs*ts, nb_classes) probs = model_out[output_name] # (bs*ts, ) y = getattr(target, output_name) predicted = probs.view(-1, self.config.nb_classes) y = y.view(-1) loss = self._loss(predicted, y) return {const.LOSS: loss} def _build_embeddings(self, source_vectors=None, target_vectors=None): # Embeddings layers: if source_vectors is not None: # source_embeddings_size = self.source_embeddings.size(1) self.source_emb = nn.Embedding( num_embeddings=source_vectors.size(0), embedding_dim=source_vectors.size(1), padding_idx=self.config.source_padding_idx, _weight=source_vectors, ) else: self.source_emb = nn.Embedding( num_embeddings=self.config.source_vocab_size, embedding_dim=self.config.source_embeddings_size, padding_idx=self.config.source_padding_idx, ) if target_vectors is not None: self.target_emb = nn.Embedding( num_embeddings=target_vectors.size(0), embedding_dim=target_vectors.size(1), padding_idx=self.config.target_padding_idx, _weight=target_vectors, ) else: self.target_emb = nn.Embedding( num_embeddings=self.config.target_vocab_size, embedding_dim=self.config.target_embeddings_size, padding_idx=self.config.target_padding_idx, ) if self.config.freeze_embeddings: self.source_emb.weight.requires_grad = False self.source_emb.bias.requires_grad = False self.target_emb.weight.requires_grad = False self.target_emb.bias.requires_grad = False self.embeddings_dropout = nn.Dropout(self.config.embeddings_dropout) def build(self, source_vectors=None, target_vectors=None): hidden_size = self.config.hidden_sizes[0] nb_classes = self.config.nb_classes dropout = self.config.dropout weight = make_loss_weights( nb_classes, const.BAD_ID, self.config.bad_weight ) self._loss = nn.CrossEntropyLoss( weight=weight, ignore_index=const.PAD_TAGS_ID ) # Embeddings layers: self._build_embeddings(source_vectors, target_vectors) feature_set_size = ( self.config.source_embeddings_size + self.config.target_embeddings_size ) * self.config.window_size self.linear = nn.Linear(feature_set_size, hidden_size) self.linear_out = nn.Linear(hidden_size, nb_classes) self.dropout = nn.Dropout(dropout) torch.nn.init.xavier_uniform_(self.linear.weight) torch.nn.init.xavier_uniform_(self.linear_out.weight) torch.nn.init.constant_(self.linear.bias, 0.0) torch.nn.init.constant_(self.linear_out.bias, 0.0) self.is_built = True def make_input(self, 
    def make_input(self, batch, side):
        target_input, target_lengths = getattr(batch, const.TARGET)
        source_input, source_lengths = getattr(batch, const.SOURCE)
        alignments = batch.alignments
        if self.config.predict_gaps and not self.config.predict_target:
            target_input = F.pad(
                target_input,
                pad=(0, 1),
                value=self.vocabs[const.TARGET].stoi[const.UNALIGNED],
            )
            source_input = F.pad(
                source_input,
                pad=(0, 1),
                value=self.vocabs[const.SOURCE].stoi[const.UNALIGNED],
            )
        target_input = convolve_tensor(
            target_input,
            self.config.window_size,
            self.config.target_padding_idx,
        )
        source_input = convolve_tensor(
            source_input,
            self.config.window_size,
            self.config.source_padding_idx,
        )
        if side == const.SOURCE_TAGS:
            # Flip (source, target) alignment pairs to align target to source
            alignments = [
                [alignment[::-1] for alignment in example_alignment]
                for example_alignment in alignments
            ]
            target_input, nb_alignments = align_tensor(
                target_input,
                alignments,
                self.config.max_aligned,
                self.config.unaligned_idx,
                self.config.target_padding_idx,
                pad_size=source_input.shape[1],
            )
        else:
            source_input, nb_alignments = align_tensor(
                source_input,
                alignments,
                self.config.max_aligned,
                self.config.unaligned_idx,
                self.config.source_padding_idx,
                pad_size=target_input.shape[1],
            )
        return target_input, source_input, nb_alignments

    def forward(self, batch):
        assert self.is_built

        if self.config.predict_source:
            align_side = const.SOURCE_TAGS
        else:
            align_side = const.TARGET_TAGS

        target_input, source_input, nb_alignments = self.make_input(
            batch, align_side
        )

        #
        # Source branch
        #
        # (bs, ts, aligned, window) -> (bs, ts, aligned, window, emb)
        h_source = self.source_emb(source_input)
        if len(h_source.shape) == 5:
            # (bs, ts, aligned, window, emb) -> (bs, ts, window, emb)
            h_source = h_source.sum(
                2, keepdim=False
            ) / nb_alignments.unsqueeze(-1).unsqueeze(-1)
        # (bs, ts, window, emb) -> (bs, ts, window * emb)
        h_source = h_source.view(
            source_input.size(0), source_input.size(1), -1
        )

        #
        # Target branch
        #
        # (bs, ts, window) -> (bs, ts, window, emb)
        h_target = self.target_emb(target_input)
        if len(h_target.shape) == 5:
            # (bs, ts, aligned, window, emb) -> (bs, ts, window, emb)
            h_target = h_target.sum(
                2, keepdim=False
            ) / nb_alignments.unsqueeze(-1).unsqueeze(-1)
        # (bs, ts, window, emb) -> (bs, ts, window * emb)
        h_target = h_target.view(
            target_input.size(0), target_input.size(1), -1
        )

        #
        # POS tags branches
        #
        feature_set = (h_source, h_target)

        #
        # Merge branches
        #
        # (bs, ts, window * emb) -> (bs, ts, 2 * window * emb)
        h = torch.cat(feature_set, dim=-1)
        h = self.embeddings_dropout(h)

        # (bs, ts, 2 * window * emb) -> (bs, ts, hs)
        h = torch.tanh(self.linear(h))
        h = self.dropout(h)

        # (bs, ts, hs) -> (bs, ts, nb_classes)
        h = self.linear_out(h)

        outputs = OrderedDict()
        if self.config.predict_target:
            outputs[const.TARGET_TAGS] = h
        if self.config.predict_gaps:
            outputs[const.GAP_TAGS] = h
        if self.config.predict_source:
            outputs[const.SOURCE_TAGS] = h
        return outputs

    @staticmethod
    def _unmask(tensor, mask):
        lengths = mask.int().sum(dim=-1)
        return [x[: lengths[i]] for i, x in enumerate(tensor)]
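    # Illustration of _unmask (a sketch; tensors assumed):
    # >>> mask = torch.tensor([[1, 1, 0], [1, 1, 1]])
    # >>> QUETCH._unmask(torch.arange(6).view(2, 3), mask)
    # [tensor([0, 1]), tensor([3, 4, 5])]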
    def metrics(self):
        metrics = []
        if self.config.predict_target:
            metrics.append(
                F1Metric(
                    prefix=const.TARGET_TAGS,
                    target_name=const.TARGET_TAGS,
                    PAD=const.PAD_TAGS_ID,
                    labels=const.LABELS,
                )
            )
            metrics.append(
                CorrectMetric(
                    prefix=const.TARGET_TAGS,
                    target_name=const.TARGET_TAGS,
                    PAD=const.PAD_TAGS_ID,
                )
            )
        if self.config.predict_source:
            metrics.append(
                F1Metric(
                    prefix=const.SOURCE_TAGS,
                    target_name=const.SOURCE_TAGS,
                    PAD=const.PAD_TAGS_ID,
                    labels=const.LABELS,
                )
            )
            metrics.append(
                CorrectMetric(
                    prefix=const.SOURCE_TAGS,
                    target_name=const.SOURCE_TAGS,
                    PAD=const.PAD_TAGS_ID,
                )
            )
        if self.config.predict_gaps:
            metrics.append(
                F1Metric(
                    prefix=const.GAP_TAGS,
                    target_name=const.GAP_TAGS,
                    PAD=const.PAD_TAGS_ID,
                    labels=const.LABELS,
                )
            )
            metrics.append(
                CorrectMetric(
                    prefix=const.GAP_TAGS,
                    target_name=const.GAP_TAGS,
                    PAD=const.PAD_TAGS_ID,
                )
            )
        metrics.append(LogMetric(targets=[(const.LOSS, const.LOSS)]))
        return metrics

    def metrics_ordering(self):
        return max

PK! kiwi/models/utils.py
# OpenKiwi: Open-Source Machine Translation Quality Estimation
# Copyright (C) 2019 Unbabel
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published
# by the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
#

import numpy as np
import torch
import torch.nn.functional as F
from more_itertools import first, flatten
from torch.autograd import Function
from torch.nn.utils.rnn import pack_padded_sequence as pack
from torch.nn.utils.rnn import pad_packed_sequence as unpack


def unroll(list_of_lists):
    """Flatten a list of lists (one level deep).

    :param list_of_lists: a list that contains lists
    :return: a flattened list
    """
    if isinstance(first(list_of_lists), (np.ndarray, list)):
        return list(flatten(list_of_lists))
    return list_of_lists


def convolve_tensor(sequences, window_size, pad_value=0):
    """Convolve a sequence and apply padding.

    :param sequences: 2D tensor
    :param window_size: filter length
    :param pad_value: int value used as padding
    :return: 3D tensor, where the last dimension has size window_size
    """
    pad = (window_size // 2,) * 2
    t = F.pad(sequences, pad=pad, value=pad_value)
    t = t.unfold(1, window_size, 1)
    # For 3D tensors:
    # torch.nn.ConstantPad2d((0, 0, 1, 1), 0)(x).unfold(1, 3, 1)
    # F.pad(x, (0, 0, 1, 1), value=0).unfold(1, 3, 1)
    return t


# def convolve_sequence(sequence, window_size, pad_value=0):
#     """Convolve a sequence and apply padding
#
#     :param sequence: list of ids
#     :param window_size: filter length
#     :param pad_value: int value used as padding
#     :return: list of lists with size of window_size
#     """
#     pad = [pad_value for _ in range(window_size // 2)]
#     pad_sequence = pad + sequence + pad
#     return list(windowed(pad_sequence, window_size, fillvalue=pad_value))
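# Illustration of convolve_tensor (a sketch; values assumed):
# >>> t = torch.tensor([[1, 2, 3, 4]])
# >>> convolve_tensor(t, window_size=3, pad_value=0)
# tensor([[[0, 1, 2],
#          [1, 2, 3],
#          [2, 3, 4],
#          [3, 4, 0]]])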
def align_tensor(
    tensor,
    alignments,
    max_aligned,
    unaligned_idx,
    padding_idx,
    pad_size,
    target_length=None,
):
    alignments = [
        map_alignments_to_target(sample, target_length=target_length)
        for sample in alignments
    ]
    # aligned_tensor = tensor.new_full(
    #     (tensor.shape[0], pad_size, max_aligned, tensor.shape[2]),
    #     padding_idx)
    aligned = [
        align_source(
            sample, alignment, max_aligned, unaligned_idx, padding_idx,
            pad_size
        )
        for sample, alignment in zip(tensor, alignments)
    ]
    aligned_tensor = torch.stack([sample[0] for sample in aligned])
    nb_alignments = torch.stack([sample[1] for sample in aligned])
    return aligned_tensor, nb_alignments


def map_alignments_to_target(src2tgt_alignments, target_length=None):
    """Map each target index to a list of source indexes.

    Args:
        src2tgt_alignments (list): list of tuples with source, target
            indexes.
        target_length: size of the target side; if None, the highest index
            in the alignments is used.

    Returns:
        A list of size target_length where position i refers to the i-th
        target token and contains a list of source indexes aligned to it.
    """
    if target_length is None:
        if not src2tgt_alignments:
            target_length = 0
        else:
            target_length = 1 + max(src2tgt_alignments, key=lambda a: a[1])[1]
    trg2src = [None] * target_length
    for source, target in src2tgt_alignments:
        if not trg2src[target]:
            trg2src[target] = []
        trg2src[target].append(source)
    return trg2src


def align_source(
    source, trg2src_alignments, max_aligned, unaligned_idx, padding_idx,
    pad_size,
):
    assert len(source.shape) == 2
    window_size = source.shape[1]
    assert len(trg2src_alignments) <= pad_size
    aligned_source = source.new_full(
        (pad_size, max_aligned, window_size), padding_idx
    )
    unaligned = source.new_full((window_size,), unaligned_idx)
    nb_alignments = source.new_ones(pad_size, dtype=torch.float)
    for i, source_positions in enumerate(trg2src_alignments):
        if not source_positions:
            aligned_source[i, 0] = unaligned
        else:
            selected = torch.index_select(
                source,
                0,
                torch.tensor(
                    source_positions[:max_aligned], device=source.device
                ),
            )
            aligned_source[i, : len(selected)] = selected
            # Count how many source tokens this target token is aligned to
            nb_alignments[i] = len(selected)
    return aligned_source, nb_alignments


def apply_packed_sequence(rnn, embedding, lengths):
    """Run a forward pass of embeddings through an RNN using packed sequence.

    Args:
        rnn: The RNN that we want to compute a forward pass with.
        embedding (FloatTensor b x seq x dim): A batch of sequence
            embeddings.
        lengths (LongTensor batch): The length of each sequence in the batch.

    Returns:
        The output of the RNN `rnn` with input `embedding`, along with the
        (hidden, cell) states, all restored to the original batch order.
    """
    # Sort batch by sequence length
    lengths_sorted, permutation = torch.sort(lengths, descending=True)
    embedding_sorted = embedding[permutation]

    # Use packed sequence
    embedding_packed = pack(
        embedding_sorted, lengths_sorted, batch_first=True
    )
    outputs_packed, (hidden, cell) = rnn(embedding_packed)
    outputs_sorted, _ = unpack(outputs_packed, batch_first=True)
    # Restore original order
    _, permutation_rev = torch.sort(permutation, descending=False)
    outputs = outputs_sorted[permutation_rev]
    hidden, cell = hidden[:, permutation_rev], cell[:, permutation_rev]
    return outputs, (hidden, cell)


def replace_token(target, old, new):
    """Replace old tokens with new.

    Args:
        target (LongTensor)
        old (int): The token to be replaced by new
        new (int): The token used to replace old
    """
    return target.masked_fill(target == old, new)


def make_loss_weights(nb_classes, target_idx, weight):
    """Create a loss weight vector for nn.CrossEntropyLoss.

    Args:
        nb_classes: Number of classes
        target_idx: ID of the target (reweighted) class
        weight: Weight of the target class

    Returns:
        weights (FloatTensor): Weight Tensor of shape `nb_classes` such that
            `weights[target_idx] = weight` and `weights[other_idx] = 1.0`
    """
    weights = torch.ones(nb_classes)
    weights[target_idx] = weight
    return weights


class GradientMul(Function):
    @staticmethod
    def forward(ctx, x, constant=0):
        ctx.constant = constant
        return x

    @staticmethod
    def backward(ctx, grad):
        return ctx.constant * grad, None


gradient_mul = GradientMul.apply
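# Usage sketch for gradient_mul (illustrative): the forward pass is the
# identity, but gradients flowing back into `x` are scaled by `constant`.
# >>> x = torch.ones(3, requires_grad=True)
# >>> y = gradient_mul(x, 0.5).sum()
# >>> y.backward()
# >>> x.grad
# tensor([0.5000, 0.5000, 0.5000])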
PK! kiwi/predictors/__init__.py
# OpenKiwi: Open-Source Machine Translation Quality Estimation
# Copyright (C) 2019 Unbabel
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published
# by the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
#

PK! kiwi/predictors/linear_tester.py
"""A generic implementation of a basic tester."""
# OpenKiwi: Open-Source Machine Translation Quality Estimation
# Copyright (C) 2019 Unbabel
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published
# by the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
#

from kiwi import constants as const


class LinearTester(object):
    def __init__(self, classifier):
        self.classifier = classifier

    def run(self, dataset, **kwargs):
        instances = self.classifier.create_instances(dataset)
        predictions = self.classifier.test(instances)
        self.classifier.evaluate(instances, predictions)
        return {const.TARGET_TAGS: predictions}

PK! kiwi/predictors/predictor.py
# OpenKiwi: Open-Source Machine Translation Quality Estimation
# Copyright (C) 2019 Unbabel
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published
# by the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
#

import logging
from collections import defaultdict

import torch
from torchtext.data import Example

from kiwi import constants as const
from kiwi.data.iterators import build_bucket_iterator
from kiwi.data.qe_dataset import QEDataset

logger = logging.getLogger(__name__)


class Predicter:
    def __init__(self, model, fields=None):
        """Class to load a model for inference.

        Args:
            model (kiwi.models.Model): A trained QE model.
            fields (dict[str: Field]): A dict mapping field names to Field
                objects. Needed for online prediction.
        """
        self.model = model
        self.fields = fields
        # Will break in Multi GPU mode
        self._device = next(model.parameters()).device
    def predict(self, examples, batch_size=1):
        """Create predictions for a list of examples.

        Args:
            examples: A dict mapping field names to the list of raw examples
                (strings).
            batch_size: Batch size to use. Default 1.

        Returns:
            A dict mapping prediction levels (word, sentence, ...) to the
            model predictions for each example.

        Raises:
            Exception: If an example has an empty string as `source` or
                `target` field.

        Example:
            >>> import kiwi
            >>> predictor = kiwi.load_model(
            ...     'tests/toy-data/models/nuqe.torch'
            ... )
            >>> src = ['a b c', 'd e f g']
            >>> tgt = ['q w e r', 't y']
            >>> align = ['0-0 1-1 1-2', '1-1 3-0']
            >>> examples = {kiwi.constants.SOURCE: src,
            ...             kiwi.constants.TARGET: tgt,
            ...             kiwi.constants.ALIGNMENTS: align}
            >>> predictor.predict(examples)
            {'tags': [[0.4760947525501251,
                       0.47569847106933594,
                       0.4948718547821045,
                       0.5305878520011902],
                      [0.5105430483818054,
                       0.5252899527549744]]}
        """
        if not examples:
            return defaultdict(list)
        if self.fields is None:
            raise Exception('Missing fields object.')
        if not examples.get(const.SOURCE):
            raise KeyError('Missing required field "{}"'.format(const.SOURCE))
        if not examples.get(const.TARGET):
            raise KeyError('Missing required field "{}"'.format(const.TARGET))
        if not all(
            [s.strip() for s in examples[const.SOURCE] + examples[const.TARGET]]
        ):
            raise Exception(
                'Empty String in {} or {} field found!'.format(
                    const.SOURCE, const.TARGET
                )
            )
        fields = [(name, self.fields[name]) for name in examples]
        field_examples = [
            Example.fromlist(values, fields)
            for values in zip(*examples.values())
        ]
        dataset = QEDataset(field_examples, fields=fields)
        return self.run(dataset, batch_size)

    def run(self, dataset, batch_size=1):
        iterator = build_bucket_iterator(
            dataset, self._device, batch_size, is_train=False
        )
        self.model.eval()
        predictions = defaultdict(list)
        with torch.no_grad():
            for batch in iterator:
                model_pred = self.model.predict(batch)
                for key, values in model_pred.items():
                    if isinstance(values, list):
                        predictions[key] += values
                    else:
                        predictions[key].append(values)
        return dict(predictions)

PK! kiwi/trainers/__init__.py
# OpenKiwi: Open-Source Machine Translation Quality Estimation
# Copyright (C) 2019 Unbabel
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published
# by the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
#

PK! kiwi/trainers/callbacks.py
# OpenKiwi: Open-Source Machine Translation Quality Estimation
# Copyright (C) 2019 Unbabel
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published
# by the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
#

import heapq
import logging
import shutil
import threading
from pathlib import Path

from kiwi import constants as const
from kiwi.data.utils import save_predicted_probabilities

logger = logging.getLogger(__name__)


class EarlyStopException(StopIteration):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
""" def __init__( self, output_dir, checkpoint_save=False, checkpoint_keep_only_best=0, checkpoint_early_stop_patience=0, checkpoint_validation_steps=0, ): """ Args: output_dir (Path): Required if checkpoint_save == True. checkpoint_save (bool): Save a training snapshot when validation is run. checkpoint_keep_only_best: Keep only this number of saved snapshots; 0 will keep all. checkpoint_early_stop_patience: Stop training if evaluation metrics do not improve after /X/ validations; 0 disables this. checkpoint_validation_steps: Perform validation every /X/ training batches. """ self.output_directory = Path(output_dir) self.validation_steps = checkpoint_validation_steps self.early_stop_patience = checkpoint_early_stop_patience self.save = checkpoint_save self.keep_only_best = checkpoint_keep_only_best if self.keep_only_best <= 0: self.keep_only_best = float('inf') # if self.save and not self.output_directory: # logger.warning('Asked to save training snapshots, ' # 'but no output directory was specified.') # self.save = False self.main_metric = None # This should be kept as a heap self.best_stats_summary = [] self.stats_summary_history = [] self._last_saved = 0 self._validation_epoch = 0 def must_eval(self, epoch=None, step=None): if epoch is not None: return True if step is not None: return self.validation_steps and step % self.validation_steps == 0 return False def must_save(self, stats): if self.save: if self._validation_epoch <= self.keep_only_best: return True elif stats > self.worst_stats(): return True return False def early_stopping(self): no_improvement = self._validation_epoch - self._last_saved return 0 < self.early_stop_patience <= no_improvement def __call__(self, trainer, valid_iterator, epoch=None, step=None): if self.must_eval(epoch=epoch, step=step): eval_stats_summary = trainer.eval_epoch(valid_iterator) eval_stats_summary.log() if trainer.scheduler: trainer.scheduler.step(eval_stats_summary.main_metric_value()) saved_path = self.check_in( trainer, eval_stats_summary, epoch=epoch, step=step ) if saved_path: predictions = trainer.predict(valid_iterator) if predictions is not None: save_predicted_probabilities(saved_path, predictions) elif self.early_stopping(): raise EarlyStopException( 'Early stopping training after {} validations ' 'without improvements on the validation set'.format( self.early_stop_patience ) ) def check_in(self, trainer, stats, epoch=None, step=None): self._validation_epoch += 1 self.stats_summary_history.append(stats) if self.must_save(stats): self._last_saved = self._validation_epoch output_path = self.make_output_path(epoch=epoch, step=step) path_to_remove = self.push_to_heap(stats, output_path) event = trainer.save(output_path) if path_to_remove: self.remove_snapshot(path_to_remove, event) return output_path return None def make_output_path(self, epoch=None, step=None): if epoch is not None: sub_dir = 'epoch_{}'.format(epoch) elif step is not None: sub_dir = 'step_{}'.format(step) else: sub_dir = 'epoch_unknown' return self.output_directory / sub_dir def push_to_heap(self, stats, output_path): """Push stats and output path to the heap.""" path_to_remove = None # The second element (`-self._validation_epoch`) serves as a timestamp # to ensure that in case of a tie, the earliest model is saved. 
    def push_to_heap(self, stats, output_path):
        """Push stats and output path to the heap."""
        path_to_remove = None
        # The second element (`-self._validation_epoch`) serves as a
        # timestamp to ensure that in case of a tie, the earliest model
        # is saved.
        heap_element = (stats, -self._validation_epoch, output_path)
        if self._validation_epoch <= self.keep_only_best:
            heapq.heappush(self.best_stats_summary, heap_element)
        else:
            worst_stat = heapq.heapreplace(
                self.best_stats_summary, heap_element
            )
            path_to_remove = str(worst_stat[2])  # Worst output path
        return path_to_remove

    def remove_snapshot(self, path_to_remove, event=None):
        """Remove snapshot locally and in MLflow."""

        def _remove_snapshot(path, event, message):
            if event:
                event.wait()
            logger.info(message)
            shutil.rmtree(str(path))
            if event:
                event.clear()

        removal_message = (
            'Removing previous snapshot because it is worse: '
            '{}'.format(path_to_remove)
        )
        t = threading.Thread(
            target=_remove_snapshot,
            args=(path_to_remove, event, removal_message),
            daemon=True,
        )
        try:
            t.start()
        except FileNotFoundError as e:
            logger.exception(e)

    def best_stats_and_path(self):
        if self.best_stats_summary:
            stat, order, path = max(self.best_stats_summary)
            return stat, path
        return None, None

    def best_iteration_path(self):
        _, path = self.best_stats_and_path()
        return path

    def best_stats(self):
        stats, _ = self.best_stats_and_path()
        return stats

    def worst_stats(self):
        if self.best_stats_summary:
            return self.best_stats_summary[0][0]
        else:
            return None

    def best_model_path(self):
        path = self.output_directory / const.BEST_MODEL_FILE
        if path.exists():
            return path
        return self.best_iteration_path()

    def check_out(self):
        best_path = self.best_iteration_path()
        if best_path:
            self.copy_best_model(best_path, self.output_directory)

    @staticmethod
    def copy_best_model(model_dir, output_dir):
        model_path = model_dir / const.MODEL_FILE
        best_model_path = output_dir / const.BEST_MODEL_FILE
        logger.info('Copying best model to {}'.format(best_model_path))
        shutil.copy(str(model_path), str(best_model_path))
        return best_model_path
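# Sketch of the snapshot heap behavior (illustrative): with keep_only_best=2,
# the first two validations always push their snapshots onto the min-heap;
# from the third validation on, a snapshot is saved only if its stats beat
# the heap root (the worst kept snapshot), in which case heapreplace evicts
# the root and its directory is deleted in a background thread.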
PK! kiwi/trainers/linear_word_qe_trainer.py
# OpenKiwi: Open-Source Machine Translation Quality Estimation
# Copyright (C) 2019 Unbabel
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published
# by the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
#

from kiwi.models.linear.linear_trainer import LinearTrainer


class LinearWordQETrainer(LinearTrainer):
    def __init__(
        self, model, optimizer_name, regularization_constant, checkpointer
    ):
        super().__init__(
            classifier=model,
            checkpointer=checkpointer,
            algorithm=optimizer_name,
            regularization_constant=regularization_constant,
        )

    @property
    def model(self):
        return self.classifier

PK! kiwi/trainers/trainer.py
# OpenKiwi: Open-Source Machine Translation Quality Estimation
# Copyright (C) 2019 Unbabel
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published
# by the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
#

import logging
from collections import defaultdict
from pathlib import Path

import torch
from tqdm import tqdm

import kiwi
from kiwi import constants as const
from kiwi.loggers import tracking_logger
from kiwi.metrics.stats import Stats
from kiwi.models.model import Model
from kiwi.trainers.callbacks import EarlyStopException
from kiwi.trainers.utils import optimizer_class

logger = logging.getLogger(__name__)


class Trainer:
    def __init__(
        self, model, optimizer, checkpointer, log_interval=100, scheduler=None
    ):
        """
        Args:
            model: A kiwi.Model to train.
            optimizer: An optimizer.
            checkpointer: A Checkpoint object.
            log_interval: Log train stats every /n/ batches. Default 100.
            scheduler: A learning rate scheduler.
        """
        self.model = model
        self.stats = Stats(
            metrics=model.metrics(),
            main_metric_ordering=model.metrics_ordering(),
            log_interval=log_interval,
        )
        self.optimizer = optimizer
        self.checkpointer = checkpointer
        self.scheduler = scheduler
        self._step = 0
        self._epoch = 0

    @property
    def stats_summary_history(self):
        return self.checkpointer.stats_summary_history

    def run(self, train_iterator, valid_iterator, epochs=50):
        """
        Args:
            train_iterator: An iterator over the training batches.
            valid_iterator: An iterator over the validation batches.
            epochs: Number of epochs for training.
        """
        # log(self.eval_epoch(valid_dataset))
        for epoch in range(self._epoch + 1, epochs + 1):
            logger.info('Epoch {} of {}'.format(epoch, epochs))
            self.train_epoch(train_iterator, valid_iterator)
            self.stats.log()
            try:
                self.checkpointer(self, valid_iterator, epoch=epoch)
            except EarlyStopException as e:
                logger.info(e)
                break
        self.checkpointer.check_out()

    def train_epoch(self, train_iterator, valid_iterator):
        self.model.train()
        for batch in tqdm(
            train_iterator,
            total=len(train_iterator),
            desc='Batches',
            unit=' batches',
            ncols=80,
        ):
            self._step += 1
            outputs = self.train_step(batch)
            self.stats.update(batch=batch, **outputs)
            self.stats.log(step=self._step)
            try:
                self.checkpointer(self, valid_iterator, step=self._step)
            except EarlyStopException as e:
                logger.info(e)
                break
        self._epoch += 1

    def train_steps(self, train_iterator, valid_iterator, max_steps):
        train_iterator.repeat = True
        self.model.train()
        step = 0
        for step, batch in tqdm(
            enumerate(train_iterator, 1),
            total=max_steps,
            desc='Steps',
            unit=' batches',
            ncols=80,
        ):
            self._step += 1
            outputs = self.train_step(batch)
            self.stats.update(batch=batch, **outputs)
            self.stats.log(step=self._step)
            try:
                self.checkpointer(self, valid_iterator, step=self._step)
            except EarlyStopException as e:
                logger.info(e)
                break
            if step > max_steps:
                break
        eval_stats_summary = self.eval_epoch(valid_iterator)
        eval_stats_summary.log()
        sub_path = Path('step_{}'.format(self._step))
        self.save(self.checkpointer.output_directory / sub_path)
        train_iterator.repeat = False

    def train_step(self, batch):
        self.model.zero_grad()
        model_out = self.model(batch)
        loss_dict = self.model.loss(model_out, batch)
        loss_dict[const.LOSS].backward()
        self.optimizer.step()
        return dict(loss=loss_dict, model_out=model_out)

    def eval_epoch(self, valid_iterator, prefix='EVAL'):
        self.model.eval()
        self.stats.reset()
        with torch.no_grad():
            for batch in valid_iterator:
                outputs = self.eval_step(batch)
                self.stats.update(batch=batch, **outputs)
        stats_summary = self.stats.wrap_up(prefix=prefix)
        self.model.train()
        return stats_summary
    def eval_step(self, batch):
        model_out = self.model(batch)
        loss_dict = self.model.loss(model_out, batch)
        return dict(loss=loss_dict, model_out=model_out)

    def predict(self, valid_iterator):
        self.model.eval()
        with torch.no_grad():
            predictions = defaultdict(list)
            for batch in valid_iterator:
                model_pred = self.model.predict(batch)
                for key, values in model_pred.items():
                    predictions[key] += values
        self.model.train()
        return predictions

    def make_sub_directory(
        self, root_directory, current_epoch, prefix='epoch'
    ):
        root_path = Path(root_directory)
        epoch_path = Path('{}_{}'.format(prefix, current_epoch))
        output_path = root_path / epoch_path
        output_path.mkdir(exist_ok=True)
        return output_path

    def save(self, output_directory):
        output_directory = Path(output_directory)
        output_directory.mkdir(exist_ok=True)
        logger.info('Saving training state to {}'.format(output_directory))

        model_path = output_directory / const.MODEL_FILE
        self.model.save(str(model_path))

        optimizer_path = output_directory / const.OPTIMIZER
        scheduler_dict = None
        if self.scheduler:
            scheduler_dict = {
                'name': type(self.scheduler).__name__.lower(),
                'state_dict': self.scheduler.state_dict(),
            }
        optimizer_dict = {
            'name': type(self.optimizer).__name__.lower(),
            'state_dict': self.optimizer.state_dict(),
            'scheduler_dict': scheduler_dict,
        }
        torch.save(optimizer_dict, str(optimizer_path))

        state = {
            '__version__': kiwi.__version__,
            '_epoch': self._epoch,
            '_step': self._step,
            'checkpointer': self.checkpointer,
        }
        state_path = output_directory / const.TRAINER
        torch.save(state, str(state_path))

        # Send to MLflow
        event = None
        if tracking_logger.should_log_artifacts():
            logger.info('Logging artifacts to {}'.format(output_directory))
            event = tracking_logger.log_artifacts(
                str(output_directory),
                artifact_path=str(output_directory.name),
            )
        return event

    def load(self, directory):
        logger.info('Loading training state from {}'.format(directory))
        root_path = Path(directory)

        model_path = root_path / const.MODEL_FILE
        self.model = self.model.from_file(model_path)

        optimizer_path = root_path / const.OPTIMIZER
        optimizer_dict = torch.load(
            str(optimizer_path), map_location=lambda storage, loc: storage
        )
        if optimizer_dict['name'] != type(self.optimizer).__name__.lower():
            logger.warning('Trying to load the wrong optimizer.')
        self.optimizer.load_state_dict(optimizer_dict['state_dict'])
        scheduler_dict = optimizer_dict['scheduler_dict']
        if scheduler_dict and self.scheduler:
            self.scheduler.load_state_dict(scheduler_dict['state_dict'])

        trainer_path = root_path / const.TRAINER
        state = torch.load(
            str(trainer_path), map_location=lambda storage, loc: storage
        )
        self.__dict__.update(state)

    @classmethod
    def from_directory(cls, directory, device_id=None):
        logger.info('Loading training state from {}'.format(directory))
        root_path = Path(directory)
        model_path = root_path / const.MODEL_FILE
        model = Model.create_from_file(model_path)
        if device_id is not None:
            model.to(device_id)

        optimizer_path = root_path / const.OPTIMIZER
        optimizer_dict = torch.load(
            str(optimizer_path), map_location=lambda storage, loc: storage
        )
        optimizer = optimizer_class(optimizer_dict['name'])(
            model.parameters(), lr=0.0
        )
        optimizer.load_state_dict(optimizer_dict['state_dict'])

        trainer = cls(model, optimizer, checkpointer=None)
        trainer_path = root_path / const.TRAINER
        state = torch.load(
            str(trainer_path), map_location=lambda storage, loc: storage
        )
        trainer.__dict__.update(state)
        return trainer
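    # Example of the snapshot lookup done by `resume` below (illustrative):
    # given run artifacts containing directories 'epoch_1/' and 'epoch_3/',
    # the glob 'epoch_*' yields checkpoints [1, 3], and training resumes
    # from 'epoch_3', the most recent one.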
    @classmethod
    def resume(cls, local_path=None, prefix='epoch_', device_id=None):
        if local_path:
            artifacts_uri = Path(local_path)
        else:
            artifacts_uri = Path(tracking_logger.get_artifact_uri())
        saved_checkpoints = [
            int(str(path.name).replace(prefix, ''))
            for path in artifacts_uri.glob('{}*'.format(prefix))
            if path.is_dir()
        ]
        if not saved_checkpoints:
            logger.warning(
                'No saved trainer checkpoint found at {}'.format(
                    artifacts_uri / (prefix + '*')
                )
            )
            return None

        last_save = max(saved_checkpoints)
        snapshot_dir = artifacts_uri / '{}{}'.format(prefix, last_save)

        logger.info('Resuming training from: {}'.format(snapshot_dir))
        return cls.from_directory(snapshot_dir, device_id=device_id)

PK! kiwi/trainers/utils.py
# OpenKiwi: Open-Source Machine Translation Quality Estimation
# Copyright (C) 2019 Unbabel
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published
# by the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
#

from torch import optim


def optimizer_class(name):
    if name == 'sgd':
        OptimizerClass = optim.SGD
    elif name == 'adagrad':
        OptimizerClass = optim.Adagrad
    elif name == 'adadelta':
        OptimizerClass = optim.Adadelta
    elif name == 'adam':
        OptimizerClass = optim.Adam
    elif name == 'sparseadam':
        OptimizerClass = optim.SparseAdam
    else:
        raise RuntimeError('Invalid optim method: ' + name)
    return OptimizerClass
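# Usage sketch (illustrative; `model` assumed to be a built kiwi.Model):
# >>> OptimizerClass = optimizer_class('adam')
# >>> optimizer = OptimizerClass(model.parameters(), lr=1e-3)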
PK! pyproject.toml
# Configuration file as per PEP 518
# https://www.python.org/dev/peps/pep-0518/

[tool.poetry]
name = "openkiwi"
version = "0.1.0"
description = "Machine Translation Quality Estimation Toolkit"
authors = ["AI Research, Unbabel "]
license = "AGPL-3.0"
readme = 'README.md'
homepage = 'https://github.com/Unbabel/OpenKiwi'
repository = 'https://github.com/Unbabel/OpenKiwi'
documentation = 'https://unbabel.github.io/OpenKiwi/'
keywords = ['OpenKiwi', 'Quality Estimation', 'Machine Translation', 'Unbabel']
classifiers = [
    'Development Status :: 4 - Beta',
    'Environment :: Console',
    'Intended Audience :: Science/Research',
    'Topic :: Scientific/Engineering :: Artificial Intelligence',
]
packages = [
    {include = "kiwi"},
]
include = ['pyproject.toml', 'CHANGELOG', 'LICENSE', 'CONTRIBUTING.md']

[tool.poetry.scripts]
kiwi = 'kiwi.__main__:main'

[tool.poetry.dependencies]
python = "^3.5"
torch = ">= 0.4.1"
torchtext = "^0.3.1"
tqdm = "^4.29"
configargparse = "^0.14.0"
numpy = "^1.16"
more-itertools = "^5.0"
scipy = "^1.2"
pyyaml = "^3.13"
pathlib2 = {version = "^2.3", python = "3.5"}
mlflow = {version = "~0.8", optional = true}
seaborn = {version = "^0.9.0", optional = true}
polyglot = {version = "^16.7", optional = true}

[tool.poetry.dev-dependencies]
bump2version = "^0.5.10"
tox = "^3.7"
pytest = "^4.1"
flake8 = "^3.6"
isort = "^4.3"
coverage = "^4.5"
sphinx = "^1.8"
sphinx-argparse = "^0.2.5"
m2r = "^0.2.1"
sphinx_rtd_theme = "^0.4.3"
yapf = {version = "^0.25.0", allows-prereleases = true, python = "~3.5"}
black = {version = "^18.9-beta.0", allows-prereleases = true, python = "^3.6"}

[tool.poetry.extras]
embeddings = ["polyglot"]
plots = ["seaborn"]
mlflow = ["mlflow"]

[tool.black]
line-length = 80  # Changed from default 88
# You probably noticed the peculiar default line length. Black defaults
# to 88 characters per line, which happens to be 10% over 80. This number was
# found to produce significantly shorter files than sticking with 80 (the most
# popular), or even 79 (used by the standard library). In general, 90-ish
# seems like the wise choice: https://youtu.be/wf-BqAjZb8M?t=260.
skip-string-normalization = true  # Don't switch to double quotes
py36 = false
exclude = '''
/(
    \.git
  | \.tox
  | \.venv
  | build
  | dist
)/
'''

[build-system]
requires = ["poetry>=0.12"]
build-backend = "poetry.masonry.api"

PK! openkiwi-0.1.0.dist-info/entry_points.txt
[console_scripts]
kiwi=kiwi.__main__:main

PK! openkiwi-0.1.0.dist-info/LICENSE
(Identical to the LICENSE file at the top of this archive.)
"The Program" refers to any copyrightable work licensed under this License. Each licensee is addressed as "you". "Licensees" and "recipients" may be individuals or organizations. To "modify" a work means to copy from or adapt all or part of the work in a fashion requiring copyright permission, other than the making of an exact copy. The resulting work is called a "modified version" of the earlier work or a work "based on" the earlier work. A "covered work" means either the unmodified Program or a work based on the Program. To "propagate" a work means to do anything with it that, without permission, would make you directly or secondarily liable for infringement under applicable copyright law, except executing it on a computer or modifying a private copy. Propagation includes copying, distribution (with or without modification), making available to the public, and in some countries other activities as well. To "convey" a work means any kind of propagation that enables other parties to make or receive copies. Mere interaction with a user through a computer network, with no transfer of a copy, is not conveying. An interactive user interface displays "Appropriate Legal Notices" to the extent that it includes a convenient and prominently visible feature that (1) displays an appropriate copyright notice, and (2) tells the user that there is no warranty for the work (except to the extent that warranties are provided), that licensees may convey the work under this License, and how to view a copy of this License. If the interface presents a list of user commands or options, such as a menu, a prominent item in the list meets this criterion. 1. Source Code. The "source code" for a work means the preferred form of the work for making modifications to it. "Object code" means any non-source form of a work. A "Standard Interface" means an interface that either is an official standard defined by a recognized standards body, or, in the case of interfaces specified for a particular programming language, one that is widely used among developers working in that language. The "System Libraries" of an executable work include anything, other than the work as a whole, that (a) is included in the normal form of packaging a Major Component, but which is not part of that Major Component, and (b) serves only to enable use of the work with that Major Component, or to implement a Standard Interface for which an implementation is available to the public in source code form. A "Major Component", in this context, means a major essential component (kernel, window system, and so on) of the specific operating system (if any) on which the executable work runs, or a compiler used to produce the work, or an object code interpreter used to run it. The "Corresponding Source" for a work in object code form means all the source code needed to generate, install, and (for an executable work) run the object code and to modify the work, including scripts to control those activities. However, it does not include the work's System Libraries, or general-purpose tools or generally available free programs which are used unmodified in performing those activities but which are not part of the work. For example, Corresponding Source includes interface definition files associated with source files for the work, and the source code for shared libraries and dynamically linked subprograms that the work is specifically designed to require, such as by intimate data communication or control flow between those subprograms and other parts of the work. 
The Corresponding Source need not include anything that users can regenerate automatically from other parts of the Corresponding Source. The Corresponding Source for a work in source code form is that same work. 2. Basic Permissions. All rights granted under this License are granted for the term of copyright on the Program, and are irrevocable provided the stated conditions are met. This License explicitly affirms your unlimited permission to run the unmodified Program. The output from running a covered work is covered by this License only if the output, given its content, constitutes a covered work. This License acknowledges your rights of fair use or other equivalent, as provided by copyright law. You may make, run and propagate covered works that you do not convey, without conditions so long as your license otherwise remains in force. You may convey covered works to others for the sole purpose of having them make modifications exclusively for you, or provide you with facilities for running those works, provided that you comply with the terms of this License in conveying all material for which you do not control copyright. Those thus making or running the covered works for you must do so exclusively on your behalf, under your direction and control, on terms that prohibit them from making any copies of your copyrighted material outside their relationship with you. Conveying under any other circumstances is permitted solely under the conditions stated below. Sublicensing is not allowed; section 10 makes it unnecessary. 3. Protecting Users' Legal Rights From Anti-Circumvention Law. No covered work shall be deemed part of an effective technological measure under any applicable law fulfilling obligations under article 11 of the WIPO copyright treaty adopted on 20 December 1996, or similar laws prohibiting or restricting circumvention of such measures. When you convey a covered work, you waive any legal power to forbid circumvention of technological measures to the extent such circumvention is effected by exercising rights under this License with respect to the covered work, and you disclaim any intention to limit operation or modification of the work as a means of enforcing, against the work's users, your or third parties' legal rights to forbid circumvention of technological measures. 4. Conveying Verbatim Copies. You may convey verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice; keep intact all notices stating that this License and any non-permissive terms added in accord with section 7 apply to the code; keep intact all notices of the absence of any warranty; and give all recipients a copy of this License along with the Program. You may charge any price or no price for each copy that you convey, and you may offer support or warranty protection for a fee. 5. Conveying Modified Source Versions. You may convey a work based on the Program, or the modifications to produce it from the Program, in the form of source code under the terms of section 4, provided that you also meet all of these conditions: a) The work must carry prominent notices stating that you modified it, and giving a relevant date. b) The work must carry prominent notices stating that it is released under this License and any conditions added under section 7. This requirement modifies the requirement in section 4 to "keep intact all notices". 
c) You must license the entire work, as a whole, under this License to anyone who comes into possession of a copy. This License will therefore apply, along with any applicable section 7 additional terms, to the whole of the work, and all its parts, regardless of how they are packaged. This License gives no permission to license the work in any other way, but it does not invalidate such permission if you have separately received it. d) If the work has interactive user interfaces, each must display Appropriate Legal Notices; however, if the Program has interactive interfaces that do not display Appropriate Legal Notices, your work need not make them do so. A compilation of a covered work with other separate and independent works, which are not by their nature extensions of the covered work, and which are not combined with it such as to form a larger program, in or on a volume of a storage or distribution medium, is called an "aggregate" if the compilation and its resulting copyright are not used to limit the access or legal rights of the compilation's users beyond what the individual works permit. Inclusion of a covered work in an aggregate does not cause this License to apply to the other parts of the aggregate. 6. Conveying Non-Source Forms. You may convey a covered work in object code form under the terms of sections 4 and 5, provided that you also convey the machine-readable Corresponding Source under the terms of this License, in one of these ways: a) Convey the object code in, or embodied in, a physical product (including a physical distribution medium), accompanied by the Corresponding Source fixed on a durable physical medium customarily used for software interchange. b) Convey the object code in, or embodied in, a physical product (including a physical distribution medium), accompanied by a written offer, valid for at least three years and valid for as long as you offer spare parts or customer support for that product model, to give anyone who possesses the object code either (1) a copy of the Corresponding Source for all the software in the product that is covered by this License, on a durable physical medium customarily used for software interchange, for a price no more than your reasonable cost of physically performing this conveying of source, or (2) access to copy the Corresponding Source from a network server at no charge. c) Convey individual copies of the object code with a copy of the written offer to provide the Corresponding Source. This alternative is allowed only occasionally and noncommercially, and only if you received the object code with such an offer, in accord with subsection 6b. d) Convey the object code by offering access from a designated place (gratis or for a charge), and offer equivalent access to the Corresponding Source in the same way through the same place at no further charge. You need not require recipients to copy the Corresponding Source along with the object code. If the place to copy the object code is a network server, the Corresponding Source may be on a different server (operated by you or a third party) that supports equivalent copying facilities, provided you maintain clear directions next to the object code saying where to find the Corresponding Source. Regardless of what server hosts the Corresponding Source, you remain obligated to ensure that it is available for as long as needed to satisfy these requirements. 
e) Convey the object code using peer-to-peer transmission, provided you inform other peers where the object code and Corresponding Source of the work are being offered to the general public at no charge under subsection 6d.

A separable portion of the object code, whose source code is excluded from the Corresponding Source as a System Library, need not be included in conveying the object code work.

A "User Product" is either (1) a "consumer product", which means any tangible personal property which is normally used for personal, family, or household purposes, or (2) anything designed or sold for incorporation into a dwelling. In determining whether a product is a consumer product, doubtful cases shall be resolved in favor of coverage. For a particular product received by a particular user, "normally used" refers to a typical or common use of that class of product, regardless of the status of the particular user or of the way in which the particular user actually uses, or expects or is expected to use, the product. A product is a consumer product regardless of whether the product has substantial commercial, industrial or non-consumer uses, unless such uses represent the only significant mode of use of the product.

"Installation Information" for a User Product means any methods, procedures, authorization keys, or other information required to install and execute modified versions of a covered work in that User Product from a modified version of its Corresponding Source. The information must suffice to ensure that the continued functioning of the modified object code is in no case prevented or interfered with solely because modification has been made.

If you convey an object code work under this section in, or with, or specifically for use in, a User Product, and the conveying occurs as part of a transaction in which the right of possession and use of the User Product is transferred to the recipient in perpetuity or for a fixed term (regardless of how the transaction is characterized), the Corresponding Source conveyed under this section must be accompanied by the Installation Information. But this requirement does not apply if neither you nor any third party retains the ability to install modified object code on the User Product (for example, the work has been installed in ROM).

The requirement to provide Installation Information does not include a requirement to continue to provide support service, warranty, or updates for a work that has been modified or installed by the recipient, or for the User Product in which it has been modified or installed. Access to a network may be denied when the modification itself materially and adversely affects the operation of the network or violates the rules and protocols for communication across the network.

Corresponding Source conveyed, and Installation Information provided, in accord with this section must be in a format that is publicly documented (and with an implementation available to the public in source code form), and must require no special password or key for unpacking, reading or copying.

7. Additional Terms.

"Additional permissions" are terms that supplement the terms of this License by making exceptions from one or more of its conditions. Additional permissions that are applicable to the entire Program shall be treated as though they were included in this License, to the extent that they are valid under applicable law.
If additional permissions apply only to part of the Program, that part may be used separately under those permissions, but the entire Program remains governed by this License without regard to the additional permissions.

When you convey a copy of a covered work, you may at your option remove any additional permissions from that copy, or from any part of it. (Additional permissions may be written to require their own removal in certain cases when you modify the work.) You may place additional permissions on material, added by you to a covered work, for which you have or can give appropriate copyright permission.

Notwithstanding any other provision of this License, for material you add to a covered work, you may (if authorized by the copyright holders of that material) supplement the terms of this License with terms:

a) Disclaiming warranty or limiting liability differently from the terms of sections 15 and 16 of this License; or

b) Requiring preservation of specified reasonable legal notices or author attributions in that material or in the Appropriate Legal Notices displayed by works containing it; or

c) Prohibiting misrepresentation of the origin of that material, or requiring that modified versions of such material be marked in reasonable ways as different from the original version; or

d) Limiting the use for publicity purposes of names of licensors or authors of the material; or

e) Declining to grant rights under trademark law for use of some trade names, trademarks, or service marks; or

f) Requiring indemnification of licensors and authors of that material by anyone who conveys the material (or modified versions of it) with contractual assumptions of liability to the recipient, for any liability that these contractual assumptions directly impose on those licensors and authors.

All other non-permissive additional terms are considered "further restrictions" within the meaning of section 10. If the Program as you received it, or any part of it, contains a notice stating that it is governed by this License along with a term that is a further restriction, you may remove that term. If a license document contains a further restriction but permits relicensing or conveying under this License, you may add to a covered work material governed by the terms of that license document, provided that the further restriction does not survive such relicensing or conveying.

If you add terms to a covered work in accord with this section, you must place, in the relevant source files, a statement of the additional terms that apply to those files, or a notice indicating where to find the applicable terms.

Additional terms, permissive or non-permissive, may be stated in the form of a separately written license, or stated as exceptions; the above requirements apply either way.

8. Termination.

You may not propagate or modify a covered work except as expressly provided under this License. Any attempt otherwise to propagate or modify it is void, and will automatically terminate your rights under this License (including any patent licenses granted under the third paragraph of section 11).

However, if you cease all violation of this License, then your license from a particular copyright holder is reinstated (a) provisionally, unless and until the copyright holder explicitly and finally terminates your license, and (b) permanently, if the copyright holder fails to notify you of the violation by some reasonable means prior to 60 days after the cessation.
Moreover, your license from a particular copyright holder is reinstated permanently if the copyright holder notifies you of the violation by some reasonable means, this is the first time you have received notice of violation of this License (for any work) from that copyright holder, and you cure the violation prior to 30 days after your receipt of the notice.

Termination of your rights under this section does not terminate the licenses of parties who have received copies or rights from you under this License. If your rights have been terminated and not permanently reinstated, you do not qualify to receive new licenses for the same material under section 10.

9. Acceptance Not Required for Having Copies.

You are not required to accept this License in order to receive or run a copy of the Program. Ancillary propagation of a covered work occurring solely as a consequence of using peer-to-peer transmission to receive a copy likewise does not require acceptance. However, nothing other than this License grants you permission to propagate or modify any covered work. These actions infringe copyright if you do not accept this License. Therefore, by modifying or propagating a covered work, you indicate your acceptance of this License to do so.

10. Automatic Licensing of Downstream Recipients.

Each time you convey a covered work, the recipient automatically receives a license from the original licensors, to run, modify and propagate that work, subject to this License. You are not responsible for enforcing compliance by third parties with this License.

An "entity transaction" is a transaction transferring control of an organization, or substantially all assets of one, or subdividing an organization, or merging organizations. If propagation of a covered work results from an entity transaction, each party to that transaction who receives a copy of the work also receives whatever licenses to the work the party's predecessor in interest had or could give under the previous paragraph, plus a right to possession of the Corresponding Source of the work from the predecessor in interest, if the predecessor has it or can get it with reasonable efforts.

You may not impose any further restrictions on the exercise of the rights granted or affirmed under this License. For example, you may not impose a license fee, royalty, or other charge for exercise of rights granted under this License, and you may not initiate litigation (including a cross-claim or counterclaim in a lawsuit) alleging that any patent claim is infringed by making, using, selling, offering for sale, or importing the Program or any portion of it.

11. Patents.

A "contributor" is a copyright holder who authorizes use under this License of the Program or a work on which the Program is based. The work thus licensed is called the contributor's "contributor version".

A contributor's "essential patent claims" are all patent claims owned or controlled by the contributor, whether already acquired or hereafter acquired, that would be infringed by some manner, permitted by this License, of making, using, or selling its contributor version, but do not include claims that would be infringed only as a consequence of further modification of the contributor version. For purposes of this definition, "control" includes the right to grant patent sublicenses in a manner consistent with the requirements of this License.
Each contributor grants you a non-exclusive, worldwide, royalty-free patent license under the contributor's essential patent claims, to make, use, sell, offer for sale, import and otherwise run, modify and propagate the contents of its contributor version.

In the following three paragraphs, a "patent license" is any express agreement or commitment, however denominated, not to enforce a patent (such as an express permission to practice a patent or covenant not to sue for patent infringement). To "grant" such a patent license to a party means to make such an agreement or commitment not to enforce a patent against the party.

If you convey a covered work, knowingly relying on a patent license, and the Corresponding Source of the work is not available for anyone to copy, free of charge and under the terms of this License, through a publicly available network server or other readily accessible means, then you must either (1) cause the Corresponding Source to be so available, or (2) arrange to deprive yourself of the benefit of the patent license for this particular work, or (3) arrange, in a manner consistent with the requirements of this License, to extend the patent license to downstream recipients. "Knowingly relying" means you have actual knowledge that, but for the patent license, your conveying the covered work in a country, or your recipient's use of the covered work in a country, would infringe one or more identifiable patents in that country that you have reason to believe are valid.

If, pursuant to or in connection with a single transaction or arrangement, you convey, or propagate by procuring conveyance of, a covered work, and grant a patent license to some of the parties receiving the covered work authorizing them to use, propagate, modify or convey a specific copy of the covered work, then the patent license you grant is automatically extended to all recipients of the covered work and works based on it.

A patent license is "discriminatory" if it does not include within the scope of its coverage, prohibits the exercise of, or is conditioned on the non-exercise of one or more of the rights that are specifically granted under this License. You may not convey a covered work if you are a party to an arrangement with a third party that is in the business of distributing software, under which you make payment to the third party based on the extent of your activity of conveying the work, and under which the third party grants, to any of the parties who would receive the covered work from you, a discriminatory patent license (a) in connection with copies of the covered work conveyed by you (or copies made from those copies), or (b) primarily for and in connection with specific products or compilations that contain the covered work, unless you entered into that arrangement, or that patent license was granted, prior to 28 March 2007.

Nothing in this License shall be construed as excluding or limiting any implied license or other defenses to infringement that may otherwise be available to you under applicable patent law.

12. No Surrender of Others' Freedom.

If conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot convey a covered work so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not convey it at all.
For example, if you agree to terms that obligate you to collect a royalty for further conveying from those to whom you convey the Program, the only way you could satisfy both those terms and this License would be to refrain entirely from conveying the Program.

13. Remote Network Interaction; Use with the GNU General Public License.

Notwithstanding any other provision of this License, if you modify the Program, your modified version must prominently offer all users interacting with it remotely through a computer network (if your version supports such interaction) an opportunity to receive the Corresponding Source of your version by providing access to the Corresponding Source from a network server at no charge, through some standard or customary means of facilitating copying of software. This Corresponding Source shall include the Corresponding Source for any work covered by version 3 of the GNU General Public License that is incorporated pursuant to the following paragraph.

Notwithstanding any other provision of this License, you have permission to link or combine any covered work with a work licensed under version 3 of the GNU General Public License into a single combined work, and to convey the resulting work. The terms of this License will continue to apply to the part which is the covered work, but the work with which it is combined will remain governed by version 3 of the GNU General Public License.

14. Revised Versions of this License.

The Free Software Foundation may publish revised and/or new versions of the GNU Affero General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns.

Each version is given a distinguishing version number. If the Program specifies that a certain numbered version of the GNU Affero General Public License "or any later version" applies to it, you have the option of following the terms and conditions either of that numbered version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of the GNU Affero General Public License, you may choose any version ever published by the Free Software Foundation.

If the Program specifies that a proxy can decide which future versions of the GNU Affero General Public License can be used, that proxy's public statement of acceptance of a version permanently authorizes you to choose that version for the Program.

Later license versions may give you additional or different permissions. However, no additional obligations are imposed on any author or copyright holder as a result of your choosing to follow a later version.

15. Disclaimer of Warranty.

THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.

16. Limitation of Liability.
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

17. Interpretation of Sections 15 and 16.

If the disclaimer of warranty and limitation of liability provided above cannot be given local legal effect according to their terms, reviewing courts shall apply local law that most closely approximates an absolute waiver of all civil liability in connection with the Program, unless a warranty or assumption of liability accompanies a copy of the Program in return for a fee.

END OF TERMS AND CONDITIONS

How to Apply These Terms to Your New Programs

If you develop a new program, and you want it to be of the greatest possible use to the public, the best way to achieve this is to make it free software which everyone can redistribute and change under these terms.

To do so, attach the following notices to the program. It is safest to attach them to the start of each source file to most effectively state the exclusion of warranty; and each file should have at least the "copyright" line and a pointer to where the full notice is found.

    <one line to give the program's name and a brief idea of what it does.>
    Copyright (C) <year>  <name of author>

    This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

    This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

    You should have received a copy of the GNU Affero General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.

Also add information on how to contact you by electronic and paper mail.

If your software can interact with users remotely through a computer network, you should also make sure that it provides a way for users to get its source. For example, if your program is a web application, its interface could display a "Source" link that leads users to an archive of the code. There are many ways you could offer source, and different solutions will be better for different programs; see section 13 for the specific requirements.

You should also get your employer (if you work as a programmer) or school, if any, to sign a "copyright disclaimer" for the program, if necessary. For more information on this, and how to apply and follow the GNU AGPL, see <https://www.gnu.org/licenses/>.
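As an illustration of the notice-attachment step described above, here is a minimal sketch of what such a header might look like at the top of a Python source file. The bracketed fields are the license's own placeholders, not values taken from this repository; fill them in with your program's name, year, and author.

    # <one line to give the program's name and a brief idea of what it does.>
    # Copyright (C) <year>  <name of author>
    #
    # This program is free software: you can redistribute it and/or modify
    # it under the terms of the GNU Affero General Public License as
    # published by the Free Software Foundation, either version 3 of the
    # License, or (at your option) any later version.
    #
    # This program is distributed in the hope that it will be useful,
    # but WITHOUT ANY WARRANTY; without even the implied warranty of
    # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    # GNU Affero General Public License for more details.
    #
    # You should have received a copy of the GNU Affero General Public
    # License along with this program. If not, see
    # <https://www.gnu.org/licenses/>.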