{
    "info": {
        "author": "giganticode",
        "author_email": "hlibbabii@gmail.com",
        "bugtrack_url": null,
        "classifiers": [
            "Development Status :: 3 - Alpha",
            "Environment :: Console",
            "Intended Audience :: Science/Research",
            "License :: OSI Approved :: MIT License",
            "Natural Language :: English",
            "Operating System :: POSIX :: Linux",
            "Programming Language :: Python :: 3.6",
            "Programming Language :: Python :: 3.7",
            "Topic :: Software Development :: Pre-processors"
        ],
        "description": "# Dataprep\n\n[![Build Status](https://travis-ci.org/giganticode/dataprep.svg?branch=master)](https://travis-ci.org/giganticode/dataprep)\n\n**This is a tool for preprocessing source code corpora according to a specified vocabulary modeling choice.**\n\nSupported modeling choices are: \n* Splitting algorithm (no identifier splitting, camel-case splitting, snake-case splitting, BPE (byte-pair-encoding), \nnumber-splitting, ronin: http://joss.theoj.org/papers/10.21105/joss.00653); \n* Number of merges if using BPE; \n* Ignoring/preserving string literals; \n* Ignoring/preserving comments; \n* Preserving case/lowercasing;\n* Preserving/ignoring newlines and tabs.\n* applying/not applying stemming after basic splitting \n\n# Getting started\n\nMake sure you have python >= 3.6 installed in your system; pip, setuptools and wheel are up to date.\n```bash\npython --version\npython -m pip install --upgrade pip setuptools wheel\n```\n\nInstall **dataprep** lib:\n```bash\npip install giganticode-dataprep\n```\n\nIn order to run the **ronin** algorithm, you will have to additionally install Spiral module (https://github.com/casics/spiral/):\n```bash\npip install git+https://github.com/casics/spiral.git\n```\n\nThe tool can be used **as a python library** as well as a standalone module runnable with a **CLI**. \nYou can pass the path to the dataset or the text itself to be preprocessed. When using Python API for the former option \nyou need to import methods from `dataprep.api.text` module, for the latter - from `dataprep.api.corpus`.\nBelow you can see the general patterns of usage.\n\n\nPython API\n```python\n>>> import dataprep.api.text as pp\n>>> pp.<commmand>('Some code to be split')\n```\n\n```python\n>>> import dataprep.api.corpus as pp\n>>> pp.<commmand>('/path/to/the/dataset')\n```\n\nCLI\n```bash\ndataprep <commmand> \"Some code to be split\"\n```\n\n```bash\ndataprep <commmand> --path /path/to/the/dataset\n```\n\nHereafter we will demonstrate the usage as a python library. The CLI is analogous to the python API. You can find the documentation about how to use it [here](dataprep/cli/spec.py). \n\n## Usage examples\n\n### Basic splitting \nTokenization + CamelCase- and snake_case- splitting:\n\n```python\n>>> import dataprep.api.text as pp\n>>> input_code = '''void test_WordUeberraschungPrinter() {\n...     if (eps >= 0.345e+4) { // FIXME\n...         printWord(\"     ...     \u00dcberraschung\");\n...     }\n... }'''\n>>> pp.basic(input_code)\n['void', '<w>', 'test', '_', 'Word', 'Ueberraschung', 'Printer', '</w>', '(', ')', '{', '\\n', \n'\\t', 'if', '(', 'eps', '>', '=', '0', '.', '<w>', '345', 'e', '</w>', '+', '4', ')', '{', '/', '/', 'FIXME', '\\n', \n'\\t', '\\t', '<w>', 'print', 'Word', '</w>', '(', '\"', '\\t', '.', '.', '.', '\\t', '\u00dcberraschung', '\"', ')', ';', '\\n', \n'\\t', '}', '\\n', \n'}']\n```\n\n### Tokenize but don't split identifiers\n\n```python\n>>> import dataprep.api.text as pp\n>>> input_code = '''void test_WordUeberraschungPrinter() {\n...     if (eps >= 0.345e+4) { // FIXME\n...         printWord(\"     ...     \u00dcberraschung\");\n...     }\n... }'''\n>>> pp.nosplit(input_code)\n['void', 'test_WordUeberraschungPrinter', '(', ')', '{', '\\n', \n'\\t', 'if', '(', 'eps', '>', '=', '0', '.', '345e', '+', '4', ')', '{', '/', '/', 'FIXME', '\\n', \n'\\t', '\\t', 'printWord', '(', '\"', '\\t', '.', '.', '.', '\\t', '\u00dcberraschung', '\"', ')', ';', '\\n', \n'\\t', '}', '\\n', \n'}']\n```\n\n### BPE (Byte-Pair encoding)\n\nThe following code does **camelCase-** and **snake_case-** splitting and applies **bpe with 10k merges** on top:\n\n```python\n>>> import dataprep.api.text as pp\n>>> input_code = '''void test_WordUeberraschungPrinter() {\n...     if (eps >= 0.345e+4) { // FIXME\n...         printWord(\"     ...     \u00dcberraschung\");\n...     }\n... }'''\n>>> pp.bpe(input_code,bpe_codes_id='10k')\n['void', '<w>', 'test', '_', 'Word', 'U', 'e', 'ber', 'r', 'as', 'ch', 'ung', 'Printer', '</w>', '(', ')', '{', '\\n', \n'\\t', 'if', '(', '<w>', 'ep', 's', '</w>', '>', '=', '0', '.', '<w>', '34', '5', 'e', '</w>', '+', '4', ')', '{', '/', '/', 'FIXME', '\\n', \n'\\t', '\\t', '<w>', 'print', 'Word', '</w>', '(', '\"', '\\t', '.', '.', '.', '\\t', '<w>', '\u00dc', 'ber', 'r', 'as', 'ch', 'ung', '</w>', '\"', ')', ';', '\\n', \n'\\t', '}', '\\n', \n'}']\n```\n\n**Dataprep** by default does BPE using bpe codes leaned on [the Github Java Corpus](http://groups.inf.ed.ac.uk/cup/javaGithub/). The argument `bpe_codes_id='10k'` tells the **dataprep** tool to use 10,000 bpe merges. \nOther possible values are `1k` and `5k` (1,000 and 5,000 merges respectively). Please refer to section [Learning custom BPE codes](#Learning-custom-BPE-codes) to train custom bpe codes.\n\n**For other commands and options like `chars`, `--split-numbers`, `--ronin`, `--stem`, please refer to the [docs](dataprep/cli/spec.py)**.\n\n## Calculate vocabulary \nSet `calc_vocab` param to `True` when calling a preprocessing method to calculate the vocabulary of the preprocessed corpus, e.g.:\n```python\n>>> import dataprep.api.corpus as pp\n>>> pp.basic('/path/to/train/on',calc_vocab=True)\n...\nVocab is available at /path/to/vocab\n```\n\n## Learning custom BPE codes\nIf you don't want to use, pre-trained BPE codes, it's possible to train custom ones. For example, to train 10,000 merges on the corpus located at the path `/path/to/train/on`, the following command should be run (only CLI):\n\n```bash\ndataprep learn-bpe 10000 -p /path/to/train/on --id custom-bpe-codes \n```\n\nNow it is possible to do bpe splitting by running the bpe command with the number of merges from 0 to 10,000 (for example with 3500 merges):\n\n```bash\ndataprep bpe custom-bpe-codes-3500 -p /path/to/preprocess \n```\n\nBefore bpe codes are trained, the [basic preprocessing](#basic-splitting) is done, which can also be tuned with arguments described in section [Tweaking preprocessing](#tweaking-preprocessing).\n\n\n## Additional options\n### Tweaking preprocessing\nYou can pass the following parameters with a `True` value (default values for all of them are False), to tweak the way the imput is preprocessed:\n\n * `no_str` - replace strings with <string> placeholders.\n * `no_com` - replace comments with <comment> placeholders.\n * `no_spaces` - remove newlines and tabs.\n * `no_unicode` - replace words containing non-ascii characters with <non-en> placeholders.\n * `no_case` - lowercase words and encode information about case in <Cap> <CAP> tokens.\n```python\n>>> import dataprep.api.text as pp\n>>> input_code = '''void test_WordUeberraschungPrinter() {\n...     if (eps >= 0.345e+4) { // FIXME\n...         printWord(\"     ...     \u00dcberraschung\");\n...     }\n... }'''\n>>> pp.basic(input_code,no_spaces=True,no_unicode=True,no_case=True,no_com=True,no_str=True)\n['void', '<w>', 'test', '_', '<Cap>', 'word', '<Cap>', 'ueberraschung', '<Cap>', 'printer', '</w>', '(', ')', '{', \n'if', '(', 'eps', '>', '=', '0', '.', '<w>', '345', 'e', '</w>', '+', '4', ')', '{', '/', '/', '<CAPS>', 'fixme', \n'<w>', 'print', '<Cap>', 'word', '</w>', '(', '\"', '.', '.', '.', '<Cap>', '<non-en>', '\"', ')', ';', \n'}', \n'}']\n```\n\nSimilar params can be specified as switches `--no-str`, `--no-com`, `--no-spaces`, `--no-unicode`, `--no-case` in CLI commands.\n\n### Specifying the language\nUnless explicitely specified, **dataprep** will try to guess the language of the code to be preprocessed. To make sure the input is preprocessed as intended, it is always **highly recommended** to specify it:\n```python\nimport dataprep.api.text as pp\n>>> pp.bpe(\"volatile\",'1k',extension=\"py\")\n['<w>', 'vo', 'l', 'at', 'ile', '</w>']\n>>> pp.bpe(\"volatile\",'1k',extension=\"java\")\n['volatile']\n# Since 'volatile' is a keyword in java, it is represented as one token unlike in python \n# where it is pretty rare when used as an identifier and therefore represented as multiple subtokens.\n```\n\nWhen preprocessing a corpus, `dateprep` identifies the language based on the file extension. If you want only files with (a) certain extension(s) to be preprocessed, you can specify --ext param \n```bash\ndataprep basic --path /path/to/be/preprocessed --ext \"java\"\n\n# or if you want to pre-process multiple types of files: \ndataprep basic --path /path/to/be/preprocessed --ext \"java|c|py|js\"\n```\n### Miscellaneous\nYou can specify the path to where the preprocessed corpus will be written:\n```bash\ndataprep basic --path /path/to/preprocess --output-path /path/to/output\n```\n\nTo print logs with log level DEBUG and higher to stdout:\n```bash\ndataprep basic --path /path/to/preprocess --verbose\n```\n\n## Getting Help\nTo get help on commands and options:\n\n```bash\ndataprep --help\n```\n\n\n# Advanced\n\n### Caching\n\nWhen preprocessing a dataset, **dataprep** first parses source code and converts it into internal representation, \nwhich is after that converted to a preprocessed dataset depending on provided parameters. The intermediate \nrepresentation is cached, so that when the same dataset is pre-processed again with different parameters,\n**dataprep** (providing no changes have been made to the dataset) would use the cache rather than parsing \nthe source code again.\n\nTo store the cache, **dataprep** uses a directory speecified by `$XDG_CACHE_HOME/dataprep/<dataprep_version>` variable if its value is set, \n`$HOME/.cache/dataprep/<dataprep_version>` otherwise.\n\nRemoving the cache will not change the final result, however, will result in slower pre-processing.\n\n# Releases\n\n## 1.0.0-alpha.6\n\nInitial PyPI release\n\n## 1.0.0-alpha.7 (NOT backward compatible with 1.0.0-alpha.6)\n\n- Store version in `dataprep.__version__`\n- implement `--full-strings` and `--max-str-length` options\n- replace `ronin` method/command wit`--ronin` option and apply ronin algorithm on word level instead of full identifier level\n- if `split_numbers` option is set to `True`, split numbers not only in code but also in strings and comments\n- add `get_corpus_size()` method to `PreprocessedCorpus` class\n- change placeholder values to more human-readable\n- improve logging displaying\n- Bugfixes\n\n\n",
        "description_content_type": "text/markdown",
        "docs_url": null,
        "download_url": "",
        "downloads": {
            "last_day": -1,
            "last_month": -1,
            "last_week": -1
        },
        "home_page": "http://github.com/giganticode/dataprep",
        "keywords": "big large data source code corpus machine learning pre-processing nlp",
        "license": "MIT",
        "maintainer": "",
        "maintainer_email": "",
        "name": "giganticode-dataprep",
        "package_url": "https://pypi.org/project/giganticode-dataprep/",
        "platform": "",
        "project_url": "https://pypi.org/project/giganticode-dataprep/",
        "project_urls": {
            "Homepage": "http://github.com/giganticode/dataprep"
        },
        "release_url": "https://pypi.org/project/giganticode-dataprep/1.0.0a7/",
        "requires_dist": [
            "appdirs (==1.4.3)",
            "coverage (==4.5.3)",
            "dill (==0.2.9)",
            "docopt (==0.6.2)",
            "docopt-subcommands (==3.0.0)",
            "jsons (==0.8.3)",
            "matplotlib (==3.0.3)",
            "nltk (==3.4.4)",
            "Pygments (==2.3.1)",
            "PyYAML (==5.1)",
            "regex (==2019.3.12)",
            "tqdm (==4.31.1)"
        ],
        "requires_python": ">=3.6",
        "summary": "A toolkit for pre-processing large source code corpora",
        "version": "1.0.0a7"
    },
    "last_serial": 5615616,
    "releases": {
        "1.0.0a6": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "eb422f7705f73cbe8b8fa5a93076ae2f",
                    "sha256": "97f9fc2bfd19cbbf26d628567dc65b2f8406ecc438cb183aeaeacaec37bf393e"
                },
                "downloads": -1,
                "filename": "giganticode_dataprep-1.0.0a6-py3-none-any.whl",
                "has_sig": false,
                "md5_digest": "eb422f7705f73cbe8b8fa5a93076ae2f",
                "packagetype": "bdist_wheel",
                "python_version": "py3",
                "requires_python": ">=3.6",
                "size": 196475,
                "upload_time": "2019-07-15T09:23:34",
                "url": "https://files.pythonhosted.org/packages/bf/02/916ce568534f8717c909812d3cfbf85ce9271014aa963fdc9c09117e6665/giganticode_dataprep-1.0.0a6-py3-none-any.whl"
            },
            {
                "comment_text": "",
                "digests": {
                    "md5": "4f4d633fe51280501c76a9a2c69dff95",
                    "sha256": "1a5da5cf41cf542d427e4c9747ff6ba219085c560fa9ffd76d032e824bd20e6f"
                },
                "downloads": -1,
                "filename": "giganticode-dataprep-1.0.0a6.tar.gz",
                "has_sig": false,
                "md5_digest": "4f4d633fe51280501c76a9a2c69dff95",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": ">=3.6",
                "size": 168406,
                "upload_time": "2019-07-15T09:23:43",
                "url": "https://files.pythonhosted.org/packages/85/85/7207ff07c4754f322f1de924d1a63e8d033baa256b7f125fb9e414e5db75/giganticode-dataprep-1.0.0a6.tar.gz"
            }
        ],
        "1.0.0a7": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "a0170230f873d3a985ef068e999c35ca",
                    "sha256": "2d2a8b57e70a248574a6bd3a3cd4923127bab47700b2ba666c9a22073a1568d0"
                },
                "downloads": -1,
                "filename": "giganticode_dataprep-1.0.0a7-py3-none-any.whl",
                "has_sig": false,
                "md5_digest": "a0170230f873d3a985ef068e999c35ca",
                "packagetype": "bdist_wheel",
                "python_version": "py3",
                "requires_python": ">=3.6",
                "size": 195958,
                "upload_time": "2019-07-31T21:10:21",
                "url": "https://files.pythonhosted.org/packages/57/26/0bf2eafdc5743ac05fcdbd47c42a6f4f8aab012abf9f0b0bd6fb6327e004/giganticode_dataprep-1.0.0a7-py3-none-any.whl"
            },
            {
                "comment_text": "",
                "digests": {
                    "md5": "c4b6c2285bee5d22b742d391c660055d",
                    "sha256": "db53fde5ac8ab82ccd2df9be8e659de3eed02a844aa4abd80b548ec31df340da"
                },
                "downloads": -1,
                "filename": "giganticode-dataprep-1.0.0a7.tar.gz",
                "has_sig": false,
                "md5_digest": "c4b6c2285bee5d22b742d391c660055d",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": ">=3.6",
                "size": 168992,
                "upload_time": "2019-07-31T21:10:24",
                "url": "https://files.pythonhosted.org/packages/0a/1b/1784abf2d1eaff39240f2d1034de235c50e0f408133bcd5eb9dfa0b24c5b/giganticode-dataprep-1.0.0a7.tar.gz"
            }
        ]
    },
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "a0170230f873d3a985ef068e999c35ca",
                "sha256": "2d2a8b57e70a248574a6bd3a3cd4923127bab47700b2ba666c9a22073a1568d0"
            },
            "downloads": -1,
            "filename": "giganticode_dataprep-1.0.0a7-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "a0170230f873d3a985ef068e999c35ca",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6",
            "size": 195958,
            "upload_time": "2019-07-31T21:10:21",
            "url": "https://files.pythonhosted.org/packages/57/26/0bf2eafdc5743ac05fcdbd47c42a6f4f8aab012abf9f0b0bd6fb6327e004/giganticode_dataprep-1.0.0a7-py3-none-any.whl"
        },
        {
            "comment_text": "",
            "digests": {
                "md5": "c4b6c2285bee5d22b742d391c660055d",
                "sha256": "db53fde5ac8ab82ccd2df9be8e659de3eed02a844aa4abd80b548ec31df340da"
            },
            "downloads": -1,
            "filename": "giganticode-dataprep-1.0.0a7.tar.gz",
            "has_sig": false,
            "md5_digest": "c4b6c2285bee5d22b742d391c660055d",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 168992,
            "upload_time": "2019-07-31T21:10:24",
            "url": "https://files.pythonhosted.org/packages/0a/1b/1784abf2d1eaff39240f2d1034de235c50e0f408133bcd5eb9dfa0b24c5b/giganticode-dataprep-1.0.0a7.tar.gz"
        }
    ]
}