{ "info": { "author": "Graham Neubig", "author_email": "", "bugtrack_url": null, "classifiers": [ "Intended Audience :: Developers", "License :: OSI Approved :: BSD License", "Programming Language :: Python :: 3", "Topic :: Scientific/Engineering :: Artificial Intelligence", "Topic :: Text Processing" ], "description": "# compare-mt\nby [NeuLab](http://www.cs.cmu.edu/~neulab/) @ [CMU LTI](https://lti.cs.cmu.edu), and other contributors\n\n[![Build Status](https://travis-ci.org/neulab/compare-mt.svg?branch=master)](https://travis-ci.org/neulab/compare-mt)\n\n`compare-mt` (for \"compare my text\") is a program to compare the output of multiple systems for language generation,\nincluding machine translation, summarization, dialog response generation, etc. \nTo use it you need to have, in text format, a \"correct\" reference, and the output of two different systems.\nBased on this, `compare-mt` will run a number of analyses that attempt to pick out salient differences between\nthe systems, which will make it easier for you to figure out what things one system is doing better than another.\n\n## Basic Usage\n\nFirst, you need to install the package:\n\n```bash\n# Requirements\npip install -r requirements.txt\n# Install the package\npython setup.py install\n```\n\nThen, as an example, you can run this over two included system outputs.\n\n```bash\ncompare-mt example/ted.ref.eng example/ted.sys1.eng example/ted.sys2.eng\n```\n\nHere, system 1 and system 2 are the baseline phrase-based and neural Slovak-English systems from our\n[EMNLP 2018 paper](http://aclweb.org/anthology/D18-1103). This will print out a number of statistics including:\n\n* **Aggregate Scores:** A report on overall BLEU scores and length ratios\n* **Word Accuracy Analysis:** A report on the F-measure of words by frequency bucket\n* **Sentence Bucket Analysis:** Bucket sentences by various statistics (e.g. 
sentence BLEU, length difference with the\n reference, overall length), and calculate statistics by bucket (e.g. number of sentences, BLEU score per bucket)\n* **N-gram Difference Analysis:** Calculate which n-grams one system is consistently translating better\n* **Sentence Examples:** Find sentences where one system is doing better than the other according to sentence BLEU\n\nYou can see an example of running this analysis (as well as the more advanced analysis below) either through a\n[generated HTML report here](http://phontron.com/compare-mt/output/), or in the following narrated video:\n\n[![IMAGE ALT TEXT HERE](https://img.youtube.com/vi/K-MNPOGKnDQ/0.jpg)](https://www.youtube.com/watch?v=K-MNPOGKnDQ)\n\nTo summarize the results that immediately stick out from the basic analysis:\n\n* From the *aggregate scores* we can see that the BLEU of neural MT is higher, but its sentences are slightly shorter.\n* From the *word accuracy analysis* we can see that phrase-based MT is better at low-frequency words.\n* From the *sentence bucket analysis* we can see that neural seems to be better at translating shorter sentences.\n* From the *n-gram difference analysis* we can see that there are a few words that neural MT is not good at\n but phrase based MT gets right (e.g. \"phantom\"), while there are a few long phrases that neural MT does better with\n (e.g. \"going to show you\").\n\nIf you run on your own data, you might be able to find more interesting things about your own systems. Try comparing\nyour modified system with your baseline and seeing what you find! 
\n\n## Other Options\n\nThere are many options that can be used to do different types of analysis.\nIf you want to find all the different types of analysis supported, the most comprehensive way to do so is by\ntaking a look at `compare-mt`, which is documented relatively well and should give examples.\nWe do highlight a few particularly useful and common types of analysis below:\n\n### Significance Tests\n\nThe script allows you to perform statistical significance tests for scores based on bootstrap resampling. You can set\nthe number of samplings manually. Here is an example using the example data:\n\n\n```bash\ncompare-mt example/ted.ref.eng example/ted.sys1.eng example/ted.sys2.eng --compare_scores score_type=bleu,bootstrap=1000,prob_thresh=0.05\n```\n\n### Using Training Set Frequency\n\nOne useful piece of analysis is the \"word accuracy by frequency\" analysis. By default this frequency is the frequency\nin the *test set*, but arguably it is more informative to know accuracy by frequency in the *training set* as this\ndemonstrates the models' robustness to words they haven't seen much, or at all, in the training data. 
To change the\ncorpus used to calculate word frequency and use the training set (or some other set), you can set the `freq_corpus_file`\noption to the appropriate corpus.\n\n\n```bash\ncompare-mt example/ted.ref.eng example/ted.sys1.eng example/ted.sys2.eng \\\n --compare_word_accuracies bucket_type=freq,freq_corpus_file=example/ted.train.eng\n```\n\nIn addition, because training sets may be very big, you can also calculate the counts on the file beforehand,\n\n```bash\npython scripts/count.py < example/ted.train.eng > example/ted.train.counts\n```\n\nand then use these counts directly to improve efficiency.\n\n```bash\ncompare-mt example/ted.ref.eng example/ted.sys1.eng example/ted.sys2.eng \\\n --compare_word_accuracies bucket_type=freq,freq_count_file=example/ted.train.counts\n```\n\n\n### Incorporating Word/Sentence Labels\n\nIf you're interested in performing aggregate analysis over labels for each word/sentence instead of the words/sentences themselves, it\nis possible to do so. As an example, we've included POS tags for each of the example outputs. You can use these in\naggregate analysis, or n-gram-based analysis. The following gives an example:\n\n\n```bash\ncompare-mt example/ted.ref.eng example/ted.sys1.eng example/ted.sys2.eng \\\n --compare_word_accuracies bucket_type=label,ref_labels=example/ted.ref.eng.tag,out_labels=\"example/ted.sys1.eng.tag;example/ted.sys2.eng.tag\",label_set=CC+DT+IN+JJ+NN+NNP+NNS+PRP+RB+TO+VB+VBP+VBZ \\\n --compare_ngrams compare_type=match,ref_labels=example/ted.ref.eng.tag,out_labels=\"example/ted.sys1.eng.tag;example/ted.sys2.eng.tag\"\n```\n\nThis will calculate word accuracies and n-gram matches by POS bucket, and allows you to see things like the fact\nthat the phrase-based MT system is better at translating content words such as nouns and verbs, while neural MT\nis doing better at translating function words.\n\nIt is also possible to create labels that represent numerical values. 
For example, `scripts/relativepositiontag.py` calculates the relative position of words in the sentence, where 0 is the first word in the sentence, 0.5 is the word in the middle, and 1.0 is the word at the end. These numerical values can then be bucketed. Here is an example:\n\n```bash\ncompare-mt example/ted.ref.eng example/ted.sys1.eng example/ted.sys2.eng \\\n --compare_word_accuracies bucket_type=numlabel,ref_labels=example/ted.ref.eng.rptag,out_labels=\"example/ted.sys1.eng.rptag;example/ted.sys2.eng.rptag\"\n```\n\nFrom this particular analysis we can discover that NMT does worse than PBMT at the end of the sentence, and of course other varieties of numerical labels could be used to measure different properties of words.\n\nYou can also perform analysis over labels for sentences. Here is an example:\n\n```bash\ncompare-mt example/ted.ref.eng example/ted.sys1.eng example/ted.sys2.eng \\\n --compare_sentence_buckets 'bucket_type=label,out_labels=example/ted.sys1.eng.senttag;example/ted.sys2.eng.senttag,label_set=0+10+20+30+40+50+60+70+80+90+100,statistic_type=score,score_measure=bleu'\n```\n\n\n### Analyzing Source Words\n\nIf you have a source corpus that is aligned to the target, you can also analyze accuracies according to features of the\nsource language words, which would allow you to examine whether, for example, infrequent words on the source side are\nhard to output properly. 
Here is an example using the example data:\n\n```bash\ncompare-mt example/ted.ref.eng example/ted.sys1.eng example/ted.sys2.eng --src_file example/ted.orig.slk --compare_src_word_accuracies ref_align_file=example/ted.ref.align\n```\n\n### Analyzing Word Likelihoods\n\nIf you wish to analyze the word log likelihoods by two systems on the target corpus, you can use the following:\n\n```bash\ncompare-ll --ref example/ll_test.txt --ll-files example/ll_test.sys1.likelihood example/ll_test.sys2.likelihood --compare-word-likelihoods bucket_type=freq,freq_corpus_file=example/ll_test.txt\n```\n\nYou can analyze the word log likelihoods over labels for each word instead of the words themselves:\n\n```bash\ncompare-ll --ref example/ll_test.txt --ll-files example/ll_test.sys1.likelihood example/ll_test.sys2.likelihood --compare-word-likelihoods bucket_type=label,label_corpus=example/ll_test.tag,label_set=CC+DT+IN+JJ+NN+NNP+NNS+PRP+RB+TO+VB+VBP+VBZ\n```\n\nNOTE: You can also use the above to analyze the word likelihoods produced by two language models.\n\n### Analyzing Other Language Generation Systems\n\nYou can also analyze other language generation systems using the script. Here is an example of comparing two text summarization systems. 
\n\n```bash\ncompare-mt example/sum.ref.eng example/sum.sys1.eng example/sum.sys2.eng --compare_scores 'score_type=rouge1' 'score_type=rouge2' 'score_type=rougeL'\n```\n\n## Citation/References\n\nIf you use compare-mt, we'd appreciate if you cite the [paper](http://arxiv.org/abs/1903.07926) about it!\n\n @inproceedings{neubig19naacl,\n title = {compare-mt: A Tool for Holistic Comparison of Language Generation Systems},\n author = {Graham Neubig and Zi-Yi Dou and Junjie Hu and Paul Michel and Danish Pruthi and Xinyi Wang},\n booktitle = {Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL) Demo Track},\n address = {Minneapolis, USA},\n month = {June},\n url = {http://arxiv.org/abs/1903.07926},\n year = {2019}\n }\n\nThere is an extensive literature review included in the paper above, but some key papers that it borrows ideas from are below:\n\n* **Automatic Error Analysis:**\n Popovic and Ney \"[Towards Automatic Error Analysis of Machine Translation Output](https://www.mitpressjournals.org/doi/pdf/10.1162/COLI_a_00072)\" Computational Linguistics 2011.\n* **POS-based Analysis:**\n Chiang et al. \"[The Hiero Machine Translation System](http://aclweb.org/anthology/H05-1098)\" EMNLP 2005.\n* **n-gram Difference Analysis**\n Akabe et al. \"[Discriminative Language Models as a Tool for Machine Translation Error Analysis](http://www.phontron.com/paper/akabe14coling.pdf)\" COLING 2014.\n\nThere is also other good software for automatic comparison or error analysis of MT systems:\n\n* **[MT-ComparEval](https://github.com/choko/MT-ComparEval):** Very nice for visualization of individual examples, but\n not as focused on aggregate analysis as `compare-mt`. 
Also has more software dependencies and requires using a web\n browser, while `compare-mt` can be used as a command-line tool.", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/neulab/compare-mt", "keywords": "", "license": "BSD 3-Clause", "maintainer": "", "maintainer_email": "", "name": "compare-mt", "package_url": "https://pypi.org/project/compare-mt/", "platform": "", "project_url": "https://pypi.org/project/compare-mt/", "project_urls": { "Homepage": "https://github.com/neulab/compare-mt" }, "release_url": "https://pypi.org/project/compare-mt/0.2.7/", "requires_dist": null, "requires_python": "", "summary": "Holistic comparison of the output of text generation models", "version": "0.2.7" }, "last_serial": 5853176, "releases": { "0.2": [ { "comment_text": "", "digests": { "md5": "bfe581b92a6bfccd2cf48c0f67df91c3", "sha256": "10d1e0a87f134544364aaaf0bee80947b6e7f8b5e43f6b2b49c00ee521a978d7" }, "downloads": -1, "filename": "compare_mt-0.2.tar.gz", "has_sig": false, "md5_digest": "bfe581b92a6bfccd2cf48c0f67df91c3", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 31527, "upload_time": "2019-02-11T00:30:35", "url": "https://files.pythonhosted.org/packages/37/17/c2e0651e6df64da1c4b796da4a61c7c0fb87f85561622b6b813b2305cb39/compare_mt-0.2.tar.gz" } ], "0.2.1": [ { "comment_text": "", "digests": { "md5": "25516892e8c1de6ba11f18912da9e35a", "sha256": "25a1684cc2a07cf4d1db288067ae7785797478e083b5071a6fa6b65044488480" }, "downloads": -1, "filename": "compare_mt-0.2.1.tar.gz", "has_sig": false, "md5_digest": "25516892e8c1de6ba11f18912da9e35a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 31617, "upload_time": "2019-02-11T15:59:53", "url": 
"https://files.pythonhosted.org/packages/ac/97/60c8782726b224b1fa9bdf823f6ac2dbd8f1a6c6e63d2653e3c69d715179/compare_mt-0.2.1.tar.gz" } ], "0.2.2": [ { "comment_text": "", "digests": { "md5": "66923abe2e7e200d36d5278e0effc9f8", "sha256": "366ac175fd42cea3340bc81035fa4f13e89a5b484bd2abd526b4cf10266a044a" }, "downloads": -1, "filename": "compare_mt-0.2.2.tar.gz", "has_sig": false, "md5_digest": "66923abe2e7e200d36d5278e0effc9f8", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 34752, "upload_time": "2019-06-11T22:11:57", "url": "https://files.pythonhosted.org/packages/14/94/25f5f79d2305a8437c969d98bc79f190d2b05736ec5e3c015120e02fadd6/compare_mt-0.2.2.tar.gz" } ], "0.2.3": [ { "comment_text": "", "digests": { "md5": "c732630de37bf5eb6daed90b66afbdf3", "sha256": "08249b346f0613f1ca8cc5296fbd70a116d8e5e0493692e31ebf187718333a87" }, "downloads": -1, "filename": "compare_mt-0.2.3.tar.gz", "has_sig": false, "md5_digest": "c732630de37bf5eb6daed90b66afbdf3", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 37211, "upload_time": "2019-07-02T17:31:56", "url": "https://files.pythonhosted.org/packages/86/bb/c24f0185b78c20d8827bc1038f8f51dd645284d8657ba2384b5c38e7940c/compare_mt-0.2.3.tar.gz" } ], "0.2.4": [ { "comment_text": "", "digests": { "md5": "e84b6f678ef034bc1bd088936bbc390e", "sha256": "52b07d611e799af2da6c5a60c5243f8bdc0db65c32ce92a96b33ed3feb5d3f10" }, "downloads": -1, "filename": "compare_mt-0.2.4.tar.gz", "has_sig": false, "md5_digest": "e84b6f678ef034bc1bd088936bbc390e", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 37804, "upload_time": "2019-07-03T21:16:46", "url": "https://files.pythonhosted.org/packages/f1/54/872de9ecba0371ac5daaa280df7357dfd19e8f1c02a2a7d7a61a3a6b338c/compare_mt-0.2.4.tar.gz" } ], "0.2.5": [ { "comment_text": "", "digests": { "md5": "8fadf016a57384654110a6be14454abc", "sha256": 
"fda4beb9fa0a1f9070ef316b764e23afda529dd4ded074d9dd4846df1c72af3e" }, "downloads": -1, "filename": "compare_mt-0.2.5.tar.gz", "has_sig": false, "md5_digest": "8fadf016a57384654110a6be14454abc", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 37874, "upload_time": "2019-07-20T00:52:00", "url": "https://files.pythonhosted.org/packages/64/29/0f04eec25e0ad7f88696e0cd8db72f5d7ba5da25cd2bb743a1cb2fead312/compare_mt-0.2.5.tar.gz" } ], "0.2.6": [ { "comment_text": "", "digests": { "md5": "cef393105baa3db1fea5d56dd6e559dd", "sha256": "cefc400964a1374c058b9212645e3f76d06e1a24ad0975e79f3414134921a727" }, "downloads": -1, "filename": "compare_mt-0.2.6.tar.gz", "has_sig": false, "md5_digest": "cef393105baa3db1fea5d56dd6e559dd", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 40154, "upload_time": "2019-08-07T14:31:08", "url": "https://files.pythonhosted.org/packages/2e/95/ca3f56d29c8e2180098ba381f0a63c10fb5eb341251be846e631808a5694/compare_mt-0.2.6.tar.gz" } ], "0.2.7": [ { "comment_text": "", "digests": { "md5": "452fff7ef817d6f391710c9c96869b1f", "sha256": "5d338aba0e8dfc2ffb9a0905620b2ccb26e8390896343b81caf9fc8fccaa747b" }, "downloads": -1, "filename": "compare_mt-0.2.7.tar.gz", "has_sig": false, "md5_digest": "452fff7ef817d6f391710c9c96869b1f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 41437, "upload_time": "2019-09-19T00:01:58", "url": "https://files.pythonhosted.org/packages/14/6c/6b9866a00a977e2375a20310334df2767aed5e26bf8e8aa803441ea282a5/compare_mt-0.2.7.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "452fff7ef817d6f391710c9c96869b1f", "sha256": "5d338aba0e8dfc2ffb9a0905620b2ccb26e8390896343b81caf9fc8fccaa747b" }, "downloads": -1, "filename": "compare_mt-0.2.7.tar.gz", "has_sig": false, "md5_digest": "452fff7ef817d6f391710c9c96869b1f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 41437, 
"upload_time": "2019-09-19T00:01:58", "url": "https://files.pythonhosted.org/packages/14/6c/6b9866a00a977e2375a20310334df2767aed5e26bf8e8aa803441ea282a5/compare_mt-0.2.7.tar.gz" } ] }