{ "info": { "author": "giganticode", "author_email": "hlibbabii@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Environment :: Console", "Intended Audience :: Science/Research", "License :: OSI Approved :: MIT License", "Natural Language :: English", "Operating System :: POSIX :: Linux", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7", "Topic :: Software Development :: Pre-processors" ], "description": "# Dataprep\n\n[![Build Status](https://travis-ci.org/giganticode/dataprep.svg?branch=master)](https://travis-ci.org/giganticode/dataprep)\n\n**This is a tool for preprocessing source code corpora according to a specified vocabulary modeling choice.**\n\nSupported modeling choices are: \n* Splitting algorithm (no identifier splitting, camel-case splitting, snake-case splitting, BPE (byte-pair-encoding), \nnumber-splitting, ronin: http://joss.theoj.org/papers/10.21105/joss.00653); \n* Number of merges if using BPE; \n* Ignoring/preserving string literals; \n* Ignoring/preserving comments; \n* Preserving case/lowercasing;\n* Preserving/ignoring newlines and tabs.\n* applying/not applying stemming after basic splitting \n\n# Getting started\n\nMake sure you have python >= 3.6 installed in your system; pip, setuptools and wheel are up to date.\n```bash\npython --version\npython -m pip install --upgrade pip setuptools wheel\n```\n\nInstall **dataprep** lib:\n```bash\npip install giganticode-dataprep\n```\n\nIn order to run the **ronin** algorithm, you will have to additionally install Spiral module (https://github.com/casics/spiral/):\n```bash\npip install git+https://github.com/casics/spiral.git\n```\n\nThe tool can be used **as a python library** as well as a standalone module runnable with a **CLI**. \nYou can pass the path to the dataset or the text itself to be preprocessed. 
When using the Python API, \nyou need to import methods from the `dataprep.api.text` module to preprocess a text snippet, and from `dataprep.api.corpus` to preprocess a corpus at a given path.\nBelow you can see the general patterns of usage.\n\n\nPython API\n```python\n>>> import dataprep.api.text as pp\n>>> pp.<api_function>('Some code to be split')\n```\n\n```python\n>>> import dataprep.api.corpus as pp\n>>> pp.<api_function>('/path/to/the/dataset')\n```\n\nCLI\n```bash\ndataprep \"Some code to be split\"\n```\n\n```bash\ndataprep --path /path/to/the/dataset\n```\n\nHereafter we will demonstrate the usage as a Python library. The CLI is analogous to the Python API. You can find the documentation about how to use it [here](dataprep/cli/spec.py). \n\n## Usage examples\n\n### Basic splitting \nTokenization + camelCase- and snake_case-splitting:\n\n```python\n>>> import dataprep.api.text as pp\n>>> input_code = '''void test_WordUeberraschungPrinter() {\n... if (eps >= 0.345e+4) { // FIXME\n... printWord(\" ... \u00dcberraschung\");\n... }\n... }'''\n>>> pp.basic(input_code)\n['void', '<w>', 'test', '_', 'Word', 'Ueberraschung', 'Printer', '</w>', '(', ')', '{', '\\n', \n'\\t', 'if', '(', 'eps', '>', '=', '0', '.', '<w>', '345', 'e', '</w>', '+', '4', ')', '{', '/', '/', 'FIXME', '\\n', \n'\\t', '\\t', '<w>', 'print', 'Word', '</w>', '(', '\"', '\\t', '.', '.', '.', '\\t', '\u00dcberraschung', '\"', ')', ';', '\\n', \n'\\t', '}', '\\n', \n'}']\n```\n\n### Tokenize but don't split identifiers\n\n```python\n>>> import dataprep.api.text as pp\n>>> input_code = '''void test_WordUeberraschungPrinter() {\n... if (eps >= 0.345e+4) { // FIXME\n... printWord(\" ... \u00dcberraschung\");\n... }\n... 
}'''\n>>> pp.nosplit(input_code)\n['void', 'test_WordUeberraschungPrinter', '(', ')', '{', '\\n', \n'\\t', 'if', '(', 'eps', '>', '=', '0', '.', '345e', '+', '4', ')', '{', '/', '/', 'FIXME', '\\n', \n'\\t', '\\t', 'printWord', '(', '\"', '\\t', '.', '.', '.', '\\t', '\u00dcberraschung', '\"', ')', ';', '\\n', \n'\\t', '}', '\\n', \n'}']\n```\n\n### BPE (Byte-Pair Encoding)\n\nThe following code does **camelCase-** and **snake_case-** splitting and applies **BPE with 10k merges** on top:\n\n```python\n>>> import dataprep.api.text as pp\n>>> input_code = '''void test_WordUeberraschungPrinter() {\n... if (eps >= 0.345e+4) { // FIXME\n... printWord(\" ... \u00dcberraschung\");\n... }\n... }'''\n>>> pp.bpe(input_code, bpe_codes_id='10k')\n['void', '<w>', 'test', '_', 'Word', 'U', 'e', 'ber', 'r', 'as', 'ch', 'ung', 'Printer', '</w>', '(', ')', '{', '\\n', \n'\\t', 'if', '(', '<w>', 'ep', 's', '</w>', '>', '=', '0', '.', '<w>', '34', '5', 'e', '</w>', '+', '4', ')', '{', '/', '/', 'FIXME', '\\n', \n'\\t', '\\t', '<w>', 'print', 'Word', '</w>', '(', '\"', '\\t', '.', '.', '.', '\\t', '<w>', '\u00dc', 'ber', 'r', 'as', 'ch', 'ung', '</w>', '\"', ')', ';', '\\n', \n'\\t', '}', '\\n', \n'}']\n```\n\nBy default, **Dataprep** does BPE using BPE codes learned on [the Github Java Corpus](http://groups.inf.ed.ac.uk/cup/javaGithub/). The argument `bpe_codes_id='10k'` tells the **dataprep** tool to use 10,000 BPE merges. \nOther possible values are `1k` and `5k` (1,000 and 5,000 merges respectively). 
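To make the notion of BPE merges concrete, here is a minimal, generic sketch of the greedy merge-learning loop (an illustration of the technique only, not **dataprep**'s actual implementation; the function name is hypothetical):

```python
from collections import Counter

def learn_bpe_merges(word_freqs, num_merges):
    """Greedily learn byte-pair-encoding merges from a word-frequency dict."""
    # Represent each word as a tuple of symbols (initially single characters).
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges
```

On a toy frequency dictionary such as `{'low': 5, 'lower': 2}`, the first learned merge is `('l', 'o')`, followed by `('lo', 'w')`; a real run over a corpus produces the thousands of merges that a codes id like `10k` refers to.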
Please refer to the section [Learning custom BPE codes](#Learning-custom-BPE-codes) to train custom BPE codes.\n\n**For other commands and options like `chars`, `--split-numbers`, `--ronin`, `--stem`, please refer to the [docs](dataprep/cli/spec.py)**.\n\n## Calculate vocabulary \nSet the `calc_vocab` param to `True` when calling a preprocessing method to calculate the vocabulary of the preprocessed corpus, e.g.:\n```python\n>>> import dataprep.api.corpus as pp\n>>> pp.basic('/path/to/train/on', calc_vocab=True)\n...\nVocab is available at /path/to/vocab\n```\n\n## Learning custom BPE codes\nIf you don't want to use pre-trained BPE codes, it's possible to train custom ones. For example, to train 10,000 merges on the corpus located at the path `/path/to/train/on`, the following command should be run (CLI only):\n\n```bash\ndataprep learn-bpe 10000 -p /path/to/train/on --id custom-bpe-codes \n```\n\nNow it is possible to do BPE splitting by running the `bpe` command with a number of merges from 0 to 10,000 (for example, with 3500 merges):\n\n```bash\ndataprep bpe custom-bpe-codes-3500 -p /path/to/preprocess \n```\n\nBefore BPE codes are trained, the [basic preprocessing](#basic-splitting) is done, which can also be tuned with the arguments described in the section [Tweaking preprocessing](#tweaking-preprocessing).\n\n\n## Additional options\n### Tweaking preprocessing\nYou can pass the following parameters with a `True` value (they all default to `False`) to tweak the way the input is preprocessed:\n\n * `no_str` - replace strings with placeholders.\n * `no_com` - replace comments with placeholders.\n * `no_spaces` - remove newlines and tabs.\n * `no_unicode` - replace words containing non-ASCII characters with placeholders.\n * `no_case` - lowercase words and encode case information in separate tokens.\n```python\n>>> import dataprep.api.text as pp\n>>> input_code = '''void test_WordUeberraschungPrinter() {\n... if (eps >= 0.345e+4) { // FIXME\n... printWord(\" ... 
\u00dcberraschung\");\n... }\n... }'''\n>>> pp.basic(input_code,no_spaces=True,no_unicode=True,no_case=True,no_com=True,no_str=True)\n['void', '', 'test', '_', '', 'word', '', 'ueberraschung', '', 'printer', '', '(', ')', '{', \n'if', '(', 'eps', '>', '=', '0', '.', '', '345', 'e', '', '+', '4', ')', '{', '/', '/', '', 'fixme', \n'', 'print', '', 'word', '', '(', '\"', '.', '.', '.', '', '', '\"', ')', ';', \n'}', \n'}']\n```\n\nSimilar params can be specified as switches `--no-str`, `--no-com`, `--no-spaces`, `--no-unicode`, `--no-case` in CLI commands.\n\n### Specifying the language\nUnless explicitely specified, **dataprep** will try to guess the language of the code to be preprocessed. To make sure the input is preprocessed as intended, it is always **highly recommended** to specify it:\n```python\nimport dataprep.api.text as pp\n>>> pp.bpe(\"volatile\",'1k',extension=\"py\")\n['', 'vo', 'l', 'at', 'ile', '']\n>>> pp.bpe(\"volatile\",'1k',extension=\"java\")\n['volatile']\n# Since 'volatile' is a keyword in java, it is represented as one token unlike in python \n# where it is pretty rare when used as an identifier and therefore represented as multiple subtokens.\n```\n\nWhen preprocessing a corpus, `dateprep` identifies the language based on the file extension. 
If you want only files with (a) certain extension(s) to be preprocessed, you can specify the `--ext` param:\n```bash\ndataprep basic --path /path/to/be/preprocessed --ext \"java\"\n\n# or if you want to pre-process multiple types of files: \ndataprep basic --path /path/to/be/preprocessed --ext \"java|c|py|js\"\n```\n### Miscellaneous\nYou can specify the path to which the preprocessed corpus will be written:\n```bash\ndataprep basic --path /path/to/preprocess --output-path /path/to/output\n```\n\nTo print logs with log level DEBUG and higher to stdout:\n```bash\ndataprep basic --path /path/to/preprocess --verbose\n```\n\n## Getting Help\nTo get help on commands and options:\n\n```bash\ndataprep --help\n```\n\n\n# Advanced\n\n### Caching\n\nWhen preprocessing a dataset, **dataprep** first parses the source code and converts it into an internal representation, \nwhich is then converted to a preprocessed dataset depending on the provided parameters. The intermediate \nrepresentation is cached, so that when the same dataset is pre-processed again with different parameters,\n**dataprep** (provided no changes have been made to the dataset) will use the cache rather than parsing \nthe source code again.\n\nTo store the cache, **dataprep** uses the directory `$XDG_CACHE_HOME/dataprep/` if the `$XDG_CACHE_HOME` variable is set, \nand `$HOME/.cache/dataprep/` otherwise.\n\nRemoving the cache will not change the final result; however, it will make pre-processing slower.\n\n# Releases\n\n## 1.0.0-alpha.6\n\nInitial PyPI release\n\n## 1.0.0-alpha.7 (NOT backward compatible with 1.0.0-alpha.6)\n\n- Store version in `dataprep.__version__`\n- implement `--full-strings` and `--max-str-length` options\n- replace the `ronin` method/command with the `--ronin` option and apply the ronin algorithm at the word level instead of the full-identifier level\n- if the `split_numbers` option is set to `True`, split numbers not only in code but also in strings and comments\n- add `get_corpus_size()` method to `PreprocessedCorpus` 
class\n- change placeholder values to more human-readable ones\n- improve logging output\n- Bugfixes\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://github.com/giganticode/dataprep", "keywords": "big large data source code corpus machine learning pre-processing nlp", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "giganticode-dataprep", "package_url": "https://pypi.org/project/giganticode-dataprep/", "platform": "", "project_url": "https://pypi.org/project/giganticode-dataprep/", "project_urls": { "Homepage": "http://github.com/giganticode/dataprep" }, "release_url": "https://pypi.org/project/giganticode-dataprep/1.0.0a7/", "requires_dist": [ "appdirs (==1.4.3)", "coverage (==4.5.3)", "dill (==0.2.9)", "docopt (==0.6.2)", "docopt-subcommands (==3.0.0)", "jsons (==0.8.3)", "matplotlib (==3.0.3)", "nltk (==3.4.4)", "Pygments (==2.3.1)", "PyYAML (==5.1)", "regex (==2019.3.12)", "tqdm (==4.31.1)" ], "requires_python": ">=3.6", "summary": "A toolkit for pre-processing large source code corpora", "version": "1.0.0a7" }, "last_serial": 5615616, "releases": { "1.0.0a6": [ { "comment_text": "", "digests": { "md5": "eb422f7705f73cbe8b8fa5a93076ae2f", "sha256": "97f9fc2bfd19cbbf26d628567dc65b2f8406ecc438cb183aeaeacaec37bf393e" }, "downloads": -1, "filename": "giganticode_dataprep-1.0.0a6-py3-none-any.whl", "has_sig": false, "md5_digest": "eb422f7705f73cbe8b8fa5a93076ae2f", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6", "size": 196475, "upload_time": "2019-07-15T09:23:34", "url": "https://files.pythonhosted.org/packages/bf/02/916ce568534f8717c909812d3cfbf85ce9271014aa963fdc9c09117e6665/giganticode_dataprep-1.0.0a6-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "4f4d633fe51280501c76a9a2c69dff95", "sha256": "1a5da5cf41cf542d427e4c9747ff6ba219085c560fa9ffd76d032e824bd20e6f" }, 
"downloads": -1, "filename": "giganticode-dataprep-1.0.0a6.tar.gz", "has_sig": false, "md5_digest": "4f4d633fe51280501c76a9a2c69dff95", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 168406, "upload_time": "2019-07-15T09:23:43", "url": "https://files.pythonhosted.org/packages/85/85/7207ff07c4754f322f1de924d1a63e8d033baa256b7f125fb9e414e5db75/giganticode-dataprep-1.0.0a6.tar.gz" } ], "1.0.0a7": [ { "comment_text": "", "digests": { "md5": "a0170230f873d3a985ef068e999c35ca", "sha256": "2d2a8b57e70a248574a6bd3a3cd4923127bab47700b2ba666c9a22073a1568d0" }, "downloads": -1, "filename": "giganticode_dataprep-1.0.0a7-py3-none-any.whl", "has_sig": false, "md5_digest": "a0170230f873d3a985ef068e999c35ca", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6", "size": 195958, "upload_time": "2019-07-31T21:10:21", "url": "https://files.pythonhosted.org/packages/57/26/0bf2eafdc5743ac05fcdbd47c42a6f4f8aab012abf9f0b0bd6fb6327e004/giganticode_dataprep-1.0.0a7-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "c4b6c2285bee5d22b742d391c660055d", "sha256": "db53fde5ac8ab82ccd2df9be8e659de3eed02a844aa4abd80b548ec31df340da" }, "downloads": -1, "filename": "giganticode-dataprep-1.0.0a7.tar.gz", "has_sig": false, "md5_digest": "c4b6c2285bee5d22b742d391c660055d", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 168992, "upload_time": "2019-07-31T21:10:24", "url": "https://files.pythonhosted.org/packages/0a/1b/1784abf2d1eaff39240f2d1034de235c50e0f408133bcd5eb9dfa0b24c5b/giganticode-dataprep-1.0.0a7.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "a0170230f873d3a985ef068e999c35ca", "sha256": "2d2a8b57e70a248574a6bd3a3cd4923127bab47700b2ba666c9a22073a1568d0" }, "downloads": -1, "filename": "giganticode_dataprep-1.0.0a7-py3-none-any.whl", "has_sig": false, "md5_digest": "a0170230f873d3a985ef068e999c35ca", "packagetype": "bdist_wheel", "python_version": 
"py3", "requires_python": ">=3.6", "size": 195958, "upload_time": "2019-07-31T21:10:21", "url": "https://files.pythonhosted.org/packages/57/26/0bf2eafdc5743ac05fcdbd47c42a6f4f8aab012abf9f0b0bd6fb6327e004/giganticode_dataprep-1.0.0a7-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "c4b6c2285bee5d22b742d391c660055d", "sha256": "db53fde5ac8ab82ccd2df9be8e659de3eed02a844aa4abd80b548ec31df340da" }, "downloads": -1, "filename": "giganticode-dataprep-1.0.0a7.tar.gz", "has_sig": false, "md5_digest": "c4b6c2285bee5d22b742d391c660055d", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 168992, "upload_time": "2019-07-31T21:10:24", "url": "https://files.pythonhosted.org/packages/0a/1b/1784abf2d1eaff39240f2d1034de235c50e0f408133bcd5eb9dfa0b24c5b/giganticode-dataprep-1.0.0a7.tar.gz" } ] }