{ "info": { "author": "Masashi Yoshikawa", "author_email": "yoshikawa.masashi.yh8@is.naist.jp", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: MIT License", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6" ], "description": "# depccg v1\n\nUPDATE 2019/6/7 \n_The datasets and code for my ACL2019 paper ([Automatic Generation of High Quality CCGbanks for Parser Domain Adaptation](https://arxiv.org/abs/1906.01834)) are available at the following repo!_: https://github.com/masashi-y/ud2ccg\n\nCodebase for [A\\* CCG Parsing with a Supertag and Dependency Factored Model](https://arxiv.org/abs/1704.06936)\n\n### Requirements\n\n* Python >= 3.6.0\n* A C++ compiler supporting the [C++11 standard](https://en.wikipedia.org/wiki/C%2B%2B11) (for gcc, version >= 4.8 is required)\n* OpenMP (optional, for efficient batched parsing)\n\n\n## Installation\n\nUsing pip:\n```sh\n\u279c pip install cython numpy depccg\n```\n\nIf OpenMP is available in your environment, you can use it for more efficient parsing:\n```sh\n\u279c USE_OPENMP=1 pip install cython numpy depccg\n```\n\n## Usage\n\n### Using a pretrained English parser\n\n__A better-performing ELMo model is also [available](#the-best-performing-elmo-model) now.__\n\nThe best-performing model from the paper, trained with tri-training, is available:\n```sh\n\u279c depccg_en download\n```\n\nIt can be downloaded directly [here](https://drive.google.com/file/d/1mxl1HU99iEQcUYhWhvkowbE4WOH0UKxv/view?usp=sharing) (189M).\n\n\n```sh\n\u279c echo \"this is a test sentence .\" | depccg_en\nID=1, Prob=-0.0006299018859863281\n( ( () ( () ( () ( () () ) ) ) ) () )\n```\nYou can specify the output format (see [below](#available-output-formats)).\n\n```sh\n\u279c echo \"this is a test sentence .\" | depccg_en --format deriv\nID=1, Prob=-0.0006299018859863281\n this is a test sentence .\n NP (S[dcl]\\NP)/NP NP[nb]/N N/N N .\n ---------------->\n N\n -------------------------->\n NP\n 
------------------------------------------>\n S[dcl]\\NP\n------------------------------------------------<\n S[dcl]\n---------------------------------------------------\n S[dcl]\n```\n\nBy default, the input is expected to be pre-tokenized. To process untokenized sentences, pass the `--tokenize` option.\n\nThe POS and NER tags in the output are filled with `XX` by default. You can replace them with ones predicted using [SpaCy](https://spacy.io):\n```sh\n\u279c pip install spacy\n\u279c python -m spacy download en\n\u279c echo \"this is a test sentence .\" | depccg_en --annotator spacy\nID=1, Prob=-0.0006299018859863281\n( ( () ( () ( () ( () () ) ) ) ) () )\n```\nThe parser uses the SpaCy model symlinked to `en` (it loads the model with `spacy.load('en')`).\n\nAlternatively, you can use the POS/NER taggers implemented in [C&C](https://www.cl.cam.ac.uk/~sc609/candc-1.00.html), which may be useful in some parsing experiments:\n\n```sh\n\u279c export CANDC=/path/to/candc\n\u279c echo \"this is a test sentence .\" | depccg_en --annotator candc\nID=1, Prob=-0.0006299018859863281\n( ( () ( () ( () ( () () ) ) ) ) () )\n```\n\nBy default, depccg expects the POS and NER models to be placed in `$CANDC/models/pos` and `$CANDC/models/ner`, but you can specify them explicitly by setting the `CANDC_MODEL_POS` and `CANDC_MODEL_NER` environment variables.\n\nIt is also possible to obtain logical formulas using [ccg2lambda](https://github.com/mynlp/ccg2lambda)'s semantic parsing algorithm.\n```sh\n\u279c echo \"This is a test sentence .\" | depccg_en --format ccg2lambda --annotator spacy\nID=0 log probability=-0.0006299018859863281\nexists x.(_this(x) & exists z1.(_sentence(z1) & _test(z1) & (x = z1)))\n```\n\n### The best performing ELMo model\n\n\nIn accordance with many other reported results, depccg obtains improved performance by using contextualized word embeddings ([ELMo](https://allennlp.org/elmo); Peters et al., 2018).\n\nThe ELMo model replaces affix 
embeddings in (Yoshikawa et al., 2017) with ELMo, resulting in 1124-dimensional input embeddings (ELMo + GloVe). It is trained on CCGbank and the [tri-training](https://drive.google.com/file/d/1rCJyb98AcNx5eBuC18-koCWJFfU4OV06/view?usp=sharing) silver dataset.\n\n||Unlabeled F1|Labeled F1|\n|:-|:-|:-|\n|(Yoshikawa et al., 2017)|94.0|88.8|\n|+ELMo|94.98|90.51|\n\n\nPlease download the model from the following link.\n* [English ELMo model](https://drive.google.com/file/d/1UldQDigVq4VG2pJx9yf3krFjV0IYOwLr/view?usp=sharing) (649M)\n\nTo use the model, install `allennlp`:\n\n```sh\n\u279c pip install allennlp\n```\n\nand then,\n```sh\n\u279c echo \"this is a test sentence .\" | depccg_en --model lstm_parser_elmo_finetune.tar.gz\n```\n\nUsing a GPU (with the `--gpu` option) is recommended if possible.\n\n### Using a pretrained Japanese parser\n\nThe best-performing model is available via:\n```sh\n\u279c depccg_ja download\n```\n\nIt can be downloaded directly [here](https://drive.google.com/file/d/1bblQ6FYugXtgNNKnbCYgNfnQRkBATSY3/view?usp=sharing) (56M).\n\nThe Japanese parser depends on [Janome](https://github.com/mocobeta/janome) for tokenization. 
Please install it with:\n```sh\n\u279c pip install janome\n```\n\nThe parser provides almost the same interface as the English one, with slight differences such as the default output format, which is compatible with the Japanese CCGbank:\n```sh\n\u279c echo \"\u3053\u308c\u306f\u30c6\u30b9\u30c8\u306e\u6587\u3067\u3059\u3002\" | depccg_ja\nID=1, Prob=-53.98793411254883\n{< S[mod=nm,form=base,fin=t] {< S[mod=nm,form=base,fin=f] {< NP[case=nc,mod=nm,fin=f] {NP[case=nc,mod=nm,fin=f] \u3053\u308c/\u3053\u308c/**} {NP[case=nc,mod=nm,fin=f]\\NP[case=nc,mod=nm,fin=f] \u306f/\u306f/**}} {< S[mod=nm,form=base,fin=f]\\NP[case=nc,mod=nm,fin=f] {< NP[case=nc,mod=nm,fin=f] {< NP[case=nc,mod=nm,fin=f] {NP[case=nc,mod=nm,fin=f] \u30c6\u30b9\u30c8/\u30c6\u30b9\u30c8/**} {NP[case=nc,mod=nm,fin=f]\\NP[case=nc,mod=nm,fin=f] \u306e/\u306e/**}} {NP[case=nc,mod=nm,fin=f]\\NP[case=nc,mod=nm,fin=f] \u6587/\u6587/**}} {(S[mod=nm,form=base,fin=f]\\NP[case=nc,mod=nm,fin=f])\\NP[case=nc,mod=nm,fin=f] \u3067\u3059/\u3067\u3059/**}}} {S[mod=nm,form=base,fin=t]\\S[mod=nm,form=base,fin=f] \u3002/\u3002/**}}\n```\n\nYou can pass pre-tokenized sentences as well:\n```sh\n\u279c echo \"\u3053\u308c \u306f \u30c6\u30b9\u30c8 \u306e \u6587 \u3067\u3059 \u3002\" | depccg_ja --pre-tokenized\nID=1, Prob=-53.98793411254883\n{< S[mod=nm,form=base,fin=t] {< S[mod=nm,form=base,fin=f] {< NP[case=nc,mod=nm,fin=f] {NP[case=nc,mod=nm,fin=f] \u3053\u308c/\u3053\u308c/**} {NP[case=nc,mod=nm,fin=f]\\NP[case=nc,mod=nm,fin=f] \u306f/\u306f/**}} {< S[mod=nm,form=base,fin=f]\\NP[case=nc,mod=nm,fin=f] {< NP[case=nc,mod=nm,fin=f] {< NP[case=nc,mod=nm,fin=f] {NP[case=nc,mod=nm,fin=f] \u30c6\u30b9\u30c8/\u30c6\u30b9\u30c8/**} {NP[case=nc,mod=nm,fin=f]\\NP[case=nc,mod=nm,fin=f] \u306e/\u306e/**}} {NP[case=nc,mod=nm,fin=f]\\NP[case=nc,mod=nm,fin=f] \u6587/\u6587/**}} {(S[mod=nm,form=base,fin=f]\\NP[case=nc,mod=nm,fin=f])\\NP[case=nc,mod=nm,fin=f] \u3067\u3059/\u3067\u3059/**}}} 
{S[mod=nm,form=base,fin=t]\\S[mod=nm,form=base,fin=f] \u3002/\u3002/**}}\n```\n\n### Available output formats\n\n* `auto` - the standard format, following the AUTO format of the English CCGbank\n* `deriv` - visualized derivations in ASCII art\n* `xml` - XML format compatible with C&C's XML format (only for English parsing)\n* `conll` - CoNLL format\n* `html` - visualized trees in MathML\n* `prolog` - Prolog-like format\n* `jigg_xml` - XML format compatible with [Jigg](https://github.com/mynlp/jigg)\n* `ptb` - Penn Treebank-style format\n* `ccg2lambda` - logical formula converted from a derivation using [ccg2lambda](https://github.com/mynlp/ccg2lambda)\n* `jigg_xml_ccg2lambda` - jigg_xml format with ccg2lambda logical formula inserted\n* `json` - JSON format\n* `ja` - a format adopted in Japanese CCGbank (only for Japanese)\n\n### Programmatic Usage\n\n```python\nfrom depccg.parser import EnglishCCGParser\nfrom pathlib import Path\n\n# Available keyword arguments for initializing a CCG parser\n# Please refer to the following paper for category dictionary, seen rules, pruning etc.\n# \"A* CCG Parsing with a Supertag-factored Model\", Lewis and Steedman, 2014\nkwargs = dict(\n # A list of binary rules \n # By default: depccg.combinator.en_default_binary_rules\n binary_rules=None,\n # Penalize an application of a unary rule by adding this value (negative log probability)\n unary_penalty=0.1,\n # Prune supertags with low probabilities using this value\n beta=0.00001,\n # Set to False to disable beta pruning\n use_beta=True,\n # Use category dictionary\n use_category_dict=True,\n # Use seen rules\n use_seen_rules=True,\n # This is also used to prune supertags\n pruning_size=50,\n # Number of n-best outputs\n nbest=1,\n # Limit categories that can appear at the root of a CCG tree\n # By default: S[dcl], S[wq], S[q], S[qem], NP.\n possible_root_cats=None,\n # Give up parsing long sentences\n max_length=250,\n # Give up parsing if it runs too many steps\n max_steps=100000,\n # You can specify a GPU\n 
gpu=-1\n)\n\n# Initialize a parser from a model directory\nmodel = \"/path/to/model/directory\"\nparser = EnglishCCGParser.from_dir(\n model,\n load_tagger=True, # Load supertagging model\n **kwargs)\n\nmodel = Path(\"/path/to/model/directory\")\nparser = EnglishCCGParser.from_files(\n unary_rules=model / 'unary_rules.txt',\n category_dict=model / 'cat_dict.txt',\n seen_rules=model / 'seen_rules.txt',\n tagger_model=model / 'tagger_model',\n **kwargs)\n\n# If you prefer not to keep separate files,\n# wget http://cl.naist.jp/~masashi-y/resources/depccg/config.json\nmodel = Path(\"/path/to/model/directory\")\nparser = EnglishCCGParser.from_json(\n model / 'config.json',\n tagger_model=model / 'tagger_model',\n **kwargs)\n\nsents = [\n \"This is a test sentence .\",\n \"This is second .\"\n]\n\nresults = parser.parse_doc(sents)\nfor nbests in results:\n for tree, log_prob in nbests:\n print(tree.deriv)\n```\n\nFor Japanese CCG parsing, use `depccg.parser.JapaneseCCGParser`,\nwhich has exactly the same interface.\nNote that the Japanese parser accepts pre-tokenized sentences as input.\n\n## Train your own English supertagging model\n\nYou can use my [allennlp](https://allennlp.org/)-based supertagger and extend it.\n\nTo train a supertagger, prepare [the English CCGbank](https://catalog.ldc.upenn.edu/LDC2005T13) and download [vocab](http://cl.naist.jp/~masashi-y/resources/depccg/vocabulary.tar.gz):\n```sh\n\u279c cat ccgbank/data/AUTO/{0[2-9],1[0-9],20,21}/* > wsj_02-21.auto\n\u279c cat ccgbank/data/AUTO/00/* > wsj_00.auto\n```\n```sh\n\u279c wget http://cl.naist.jp/~masashi-y/resources/depccg/vocabulary.tar.gz\n\u279c tar xvf vocabulary.tar.gz\n```\n\nthen,\n```sh\n\u279c vocab=vocabulary train_data=wsj_02-21.auto test_data=wsj_00.auto gpu=0 \\\n encoder_type=lstm token_embedding_type=char \\\n allennlp train --include-package depccg.models.my_allennlp --serialization-dir results supertagger.jsonnet\n```\nThe training configs are passed either through environment 
variables or by writing them directly into the jsonnet config files, which are available as [supertagger.jsonnet](depccg/models/my_allennlp/config/supertagger.jsonnet) or [supertagger_tritrain.jsonnet](depccg/models/my_allennlp/config/supertagger_tritrain.jsonnet).\nThe latter is a config file for using [tri-training silver data](http://cl.naist.jp/~masashi-y/resources/depccg/headfirst_parsed.conll.stagged.gz) (309M) constructed in (Yoshikawa et al., 2017), on top of the English CCGbank.\n\nTo use the trained supertagger,\n```sh\n\u279c echo \"this is a test sentence .\" | depccg_en --model results/model.tar.gz\n```\n\nor alternatively,\n```sh\n\u279c echo '{\"sentence\": \"this is a test sentence .\"}' > input.jsonl\n\u279c allennlp predict results/model.tar.gz --include-package depccg.models.my_allennlp --output-file weights.json input.jsonl\n\u279c cat weights.json | depccg_en --input-format json\n```\nwhere `weights.json` contains probabilities used in the parser (`p_tag` and `p_dep`).\n\n### Evaluation in terms of predicate-argument dependencies\nThe standard CCG parsing evaluation can be performed with the following script:\n\n```sh\n\u279c cat ccgbank/data/PARG/00/* > wsj_00.parg\n\u279c export CANDC=/path/to/candc\n\u279c python -m depccg.tools.evaluate wsj_00.parg wsj_00.predicted.auto\n```\nCurrently, the script depends on [C&C](https://www.cl.cam.ac.uk/~sc609/candc-1.00.html)'s `generate` program, which is only available by compiling C&C from source.\n\n## Miscellaneous\n\n### Diff tool\n\nIn error analysis, you may want to see diffs between trees in an intuitive way.\n`depccg.tools.diff` does exactly this:\n\n```sh\n\u279c python -m depccg.tools.diff file1.auto file2.auto > diff.html\n```\n\nwhich outputs:\n\n![show diffs between trees](images/diff.png)\n\nwhere trees on the same lines of the two files are compared and the diffs are marked in color.\n\n## Citation\n\nIf you make use of this software, please cite the following:\n\n 
@inproceedings{yoshikawa:2017acl,\n author={Yoshikawa, Masashi and Noji, Hiroshi and Matsumoto, Yuji},\n title={A* CCG Parsing with a Supertag and Dependency Factored Model},\n booktitle={Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},\n publisher={Association for Computational Linguistics},\n year={2017},\n pages={277--287},\n location={Vancouver, Canada},\n doi={10.18653/v1/P17-1026},\n url={http://aclweb.org/anthology/P17-1026}\n }\n\n\n\n## Licence\nMIT Licence\n\n## Contact\nFor questions and usage issues, please contact yoshikawa.masashi.yh8@is.naist.jp .\n\n## Acknowledgement\nIn creating the parser, I owe very much to:\n- [EasyCCG](https://github.com/mikelewis0/easyccg): from which I learned everything\n- [NLTK](http://www.nltk.org/): for nice pretty printing for parse derivation", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/masashi-y/depccg", "keywords": "", "license": "MIT License", "maintainer": "", "maintainer_email": "", "name": "depccg", "package_url": "https://pypi.org/project/depccg/", "platform": "", "project_url": "https://pypi.org/project/depccg/", "project_urls": { "Homepage": "https://github.com/masashi-y/depccg" }, "release_url": "https://pypi.org/project/depccg/1.0.8/", "requires_dist": null, "requires_python": "", "summary": "A parser for natural language based on combinatory categorial grammar", "version": "1.0.8" }, "last_serial": 5869011, "releases": { "1.0.0": [ { "comment_text": "", "digests": { "md5": "923594962678697a2c304b3ccd4ff71e", "sha256": "40c3f3356328cf599a66ae1ff79c15dd9f6765f7e8668fc4455ae265f0c8eaa2" }, "downloads": -1, "filename": "depccg-1.0.0.tar.gz", "has_sig": false, "md5_digest": "923594962678697a2c304b3ccd4ff71e", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 3307635, 
"upload_time": "2019-04-09T10:24:50", "url": "https://files.pythonhosted.org/packages/58/49/98e546a5f52d1b99b65ba5765dbc3c5f7867d1eda7d32f9e4660beb80ca0/depccg-1.0.0.tar.gz" } ], "1.0.1": [ { "comment_text": "", "digests": { "md5": "9f82b8aee87a47c4dbc24b492d130a9e", "sha256": "668ca6ef0ded3452940f73c25cafca46de80f22e292164b9450a967c735cb735" }, "downloads": -1, "filename": "depccg-1.0.1.tar.gz", "has_sig": false, "md5_digest": "9f82b8aee87a47c4dbc24b492d130a9e", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 3309196, "upload_time": "2019-04-09T10:30:21", "url": "https://files.pythonhosted.org/packages/b6/06/17ab7dc9b078baf86b2682a0f5eeb91011d6644a8235e3e4ee5669013c6c/depccg-1.0.1.tar.gz" } ], "1.0.2": [ { "comment_text": "", "digests": { "md5": "79681519edb8c10defc809e140ebdaf1", "sha256": "6d6d77a8c3f99f5ac0869fddce61df801ac52072523ef1dc3bcd632c63d31d77" }, "downloads": -1, "filename": "depccg-1.0.2.tar.gz", "has_sig": false, "md5_digest": "79681519edb8c10defc809e140ebdaf1", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 3311779, "upload_time": "2019-04-09T10:39:43", "url": "https://files.pythonhosted.org/packages/68/1b/f6089b576116fa059298a977322ed99c675a5e65a86679114b1ad7d0bed8/depccg-1.0.2.tar.gz" } ], "1.0.3": [ { "comment_text": "", "digests": { "md5": "5b795244b38eaed3de3625ebde0342a0", "sha256": "099016259c57f136ec9674d9974ebde570ace1b63ed9e55632f38bc2e2e52ed6" }, "downloads": -1, "filename": "depccg-1.0.3.tar.gz", "has_sig": false, "md5_digest": "5b795244b38eaed3de3625ebde0342a0", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 3479410, "upload_time": "2019-04-13T07:48:32", "url": "https://files.pythonhosted.org/packages/c3/b1/f929b373d2adf666082f4e11a5a0568e206758fcd28029a23c8930104651/depccg-1.0.3.tar.gz" } ], "1.0.4": [ { "comment_text": "", "digests": { "md5": "da9176f56c47bfb89605436ad1ae72e4", "sha256": 
"082d1c5a7750de80fd8917c9c170f23cafd4942d17de9f6d806872cf11e8363e" }, "downloads": -1, "filename": "depccg-1.0.4.tar.gz", "has_sig": false, "md5_digest": "da9176f56c47bfb89605436ad1ae72e4", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 3482289, "upload_time": "2019-04-25T14:44:23", "url": "https://files.pythonhosted.org/packages/b8/b7/370e753d6832582a30544462a62737b1241f9e3aa68742173aa024627d75/depccg-1.0.4.tar.gz" } ], "1.0.5": [ { "comment_text": "", "digests": { "md5": "80c8a0ee4679ad174ce21c0d0c5b287b", "sha256": "0071101aea0369358452705289b22ae419cf002084f8d17af173c51bd36d2477" }, "downloads": -1, "filename": "depccg-1.0.5.tar.gz", "has_sig": false, "md5_digest": "80c8a0ee4679ad174ce21c0d0c5b287b", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 3482289, "upload_time": "2019-04-25T14:46:55", "url": "https://files.pythonhosted.org/packages/dd/22/08ac5f2a06453be03df0f1b4438aa00d36f1b13c5c10848433ce3960a8b9/depccg-1.0.5.tar.gz" } ], "1.0.6": [ { "comment_text": "", "digests": { "md5": "f470a0e8f16f1b2882ce92b2352387d6", "sha256": "c4caedde98a8f6c64832985adaca7ae9394f593c3ea2e3f527411ee84358df4f" }, "downloads": -1, "filename": "depccg-1.0.6.tar.gz", "has_sig": false, "md5_digest": "f470a0e8f16f1b2882ce92b2352387d6", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 3483063, "upload_time": "2019-06-10T17:19:25", "url": "https://files.pythonhosted.org/packages/5f/2e/49edf3d7df404c8bbcb8db21918274193287bdbf0b78338ddfc1921ef617/depccg-1.0.6.tar.gz" } ], "1.0.7": [ { "comment_text": "", "digests": { "md5": "94bdb674ceb186312f85cb7d2cbeb5a0", "sha256": "285dcd640db4ac8c9e4009d6d098931d78c318d47073bd126cbfc72a064edbe9" }, "downloads": -1, "filename": "depccg-1.0.7.tar.gz", "has_sig": false, "md5_digest": "94bdb674ceb186312f85cb7d2cbeb5a0", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 3480701, "upload_time": 
"2019-06-13T07:04:55", "url": "https://files.pythonhosted.org/packages/e6/49/45f4f4b38192a6ff619a0f1eea39c7c3ad01dd03c1d4b99be4f4c9ba2375/depccg-1.0.7.tar.gz" } ], "1.0.8": [ { "comment_text": "", "digests": { "md5": "8e78d42cba57f78bd25823d2ce799982", "sha256": "6d4e2c7c437cf3e71d7859a68f0753d28ad69180cd3f8ac00376b8dcc99b455e" }, "downloads": -1, "filename": "depccg-1.0.8.tar.gz", "has_sig": false, "md5_digest": "8e78d42cba57f78bd25823d2ce799982", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 61877367, "upload_time": "2019-09-22T12:59:05", "url": "https://files.pythonhosted.org/packages/83/0b/6df5160a885e3bad64f910ca95495e73208f94d11a054eec9f9949f7094e/depccg-1.0.8.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "8e78d42cba57f78bd25823d2ce799982", "sha256": "6d4e2c7c437cf3e71d7859a68f0753d28ad69180cd3f8ac00376b8dcc99b455e" }, "downloads": -1, "filename": "depccg-1.0.8.tar.gz", "has_sig": false, "md5_digest": "8e78d42cba57f78bd25823d2ce799982", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 61877367, "upload_time": "2019-09-22T12:59:05", "url": "https://files.pythonhosted.org/packages/83/0b/6df5160a885e3bad64f910ca95495e73208f94d11a054eec9f9949f7094e/depccg-1.0.8.tar.gz" } ] }