{ "info": { "author": "Madhu TV", "author_email": "madhavi.tv@gmail.com", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 3" ], "description": "Create Train, Test and Validation Datasets for NLP from wikipedia. Datasets are created \nusing provided seed WikiPages and also by traversing links within pages that meet the \nspecified match pattern. Idea is to leverage links within wiki pages to create more data. \nThe thought is, wikipedia will already contain links to additional pages that are \nrelevant and links within pages can be narrowed through pattern matching.\n\n### Installation\n```\n pip install nlp-data-py\n```\n\n### Usage\n- [Command line](#command-line)\n- [Programatic](#programmatic-usage) \n\n### Command line\n\n#### QuickStart Example\n```\nwiki_dataset --seed Brain Human_Brain --match .*neuro|.*neural \n```\n\nIn short the above command:\n - **Read Wiki**: Reads Brain and Human_Brain pages from wikipedia\n - **Shuffle**: Shuffles data based on some default criteria\n (see, [chunk_splitter](#--chunk_splitter-or--cs) & \n [chunks_per_page](#--chunks_per_page-or--cp) for defaults) \n - **Create Datasets**: Creates train, validation and test datasets in ./vars/ folder. \n By default, split ratio is 80%, 10% and 10% for train, val and test datasets\n - **Extract Links**: Extracts any link that match the pattern mentioned \n in [--match](#--match-or--m) option. In this example, links containing \n 'neuro' or 'neural' are tracked\n - **Read More**: Additional 20 pages from the above \"Extract Links\" are read \n and appended to datasets and the links in those pages that match pattern \n are also tracked.\n - **Track Read**: Pages that are read are tracked and written to a pickle \n file at ./vars/scanned.pkl. This will be useful when the same command \n is run again. i.e. if the above command is re-run, Brain & Human_brain \n & 20 pages from \"Read More\" will not be read again. Instead, the next \n 20 pages due to \"Extract Links\" will be read and appended to datasets \n in ./vars/\n\n#### Command Line Options\n\n- [--seed or -s](#--seed-or--s)\n- [--match or -m](#--match-or--m)\n- [--recursive or -r](#--recursive-or--r)\n- [--limit or -l](#--limit-or--l)\n- [--pickle or -p](#--pickle-or--p)\n- [--output or -o](#--output-or--o)\n- [--chunk_splitter or -cs](#--chunk_splitter-or--cs)\n- [--chunks_per_page or -cp](#--chunks_per_page-or--cp)\n- [--split_ratio or -sr](#--split_ratio-or--sr)\n- [--datasets or -ds](#--datasets-or--ds)\n- [--shuffle or -sf](#--shuffle-or--sf)\n\n#### --seed or -s:\n`Description`: List of initial Wiki Page names to start with. \n\n`Default`: None. If nothing is specified, items in [pickle](#--pickle-or--p) \nfile will be read. If pickle file also dose not exists, nothing will be done and \nthe code exits.\n\n`Example`: \n```\nwiki_dataset --seed Brain Human_Brain\n```\n\n#### --match or -m: \n`Description`: This option serves 2 purpose. One to track links in WikiPages \nand another to read additional pages either from links or [saved pickle](#--pickle-or--p) \nfile. Links that match the pattern will be considered to be added to datasets.\nAlso see [limit](#--limit-or--l) \n\n`Default`: \"\". All links from a wikipage will be considered and tracked.\n\n`Example`: In the below example, any links that match *neuro* or *neural* will be tracked \nand/or read to create datasets. \n\n```\nwiki_dataset --seed Brain Human_Brain -m \".*neuro|.*neural\"\n```\n\n#### --recursive or -r:\n`Description`: If this option is true, then additional pages will be read \neither based on links or previously scanned pickle file. This option will \nbe used in conjunction with limit to determine number of additional \npages to read.\nAlso see [limit](#--limit-or--l)\n\n\n`Default`: true\n\n`Example`: In the below example, only Brain and Human_Brain wiki pages will be read. \nHowever, links that match the match patter from these pages will be tracked \nand stored in a pickle file which may be used later on.\n\n```\nwiki_dataset --seed Brain Human_Brain -m \".*neuro|.*neural\" -r false\n```\n\n#### --limit or -l:\n`Description`: Wikipedia may contain too many links especially when looking \nat pages recursively. This option limits the number of additional pages to be read.\nThis option will only be relevant if recursive is set to true.\n\n`Default` 20\n\n`Example`: In the below example, along with reading Brain & Human_Brain \nand tracking links that match the match pattern, 100 additional pages \nare read either based on links or [pickle file](#--pickle-or--p).\n\n```\nwiki_dataset --seed Brain Human_Brain -m \".*neuro|.*neural\" -l 100\n```\n\n#### --pickle or -p:\n`Description`: Path to pickle file tracking items that are read. This enables to \nincrementally read items. Pickle file stores a dict. Example:\n```\n {\n \"item1\": 1,\n \"item2\": 0,\n \"item3\": -1\n }\n```\n\nIn the above example, item1 was read previously hence, wont be read again. item2 was \nnot read and will be consider in future reads. item3 errored out in previous reads \nand will not be attempted again\n\n`Default`: ./vars/scanned.pkl\n\n`Example`:\n```\nwiki_dataset --seed Brain Human_Brain -m \".*neuro|.*neural\" -p scanned.pkl\n```\nIn the above example:\n- Brain & Human_Brain and 20 pages matching pattern are read and stored as read \n in the pickle file. Any additional links that were not read due to reaching\n the limit will be stored as unread in the pickle file\n- If the above command is re-run, all the read pages including seed will not\n be read again. Instead, additional unread pages from pickle file will\n be read and pickle file will be updated to track read pages and any additional\n links that were encountered in the newly traversed pages\n\n#### --output or -o:\n`Description`: Path for datasets. \n\n`Default`: ./vars/datasets/\n\n`Example`: \n```\nwiki_dataset --seed Brain Human_Brain -m \".*neuro|.*neural\" -o ./datasets/\n```\nIn the above example, train, val and test datasets will be created in datasets/\nfolder. Future re-runs will append to these files\n\n\n#### --chunk_splitter or -cs: \n`Description`: This option, along with [chunks_per_page](#--chunks_per_page-or--cp) \ndefines a page. This comes in handy when creating datasets, especially, if the \ndata needs to be shuffled.\n\n`Default`: '(?<=[.!?]) +'\n\n`Example`:\n```\nwiki_dataset --seed Brain Human_Brain -m \".*neuro|.*neural\" -cs '(?<=[.!?]) +'\n```\nIn the above example, text from wiki pages are split into sentences (chunks) based on ., ! or ?\n**Note**: On windows, use double quotes like -cs \"(?<=[.!?]) +\"\n\n#### --chunks_per_page or -cp:\n`Description`: This defines pages. i.e. this defines number of chunks for a page. \nThis comes in handy when data needs to be shuffled for creating test, train and val datasets.\n\n`Default`: 5\n\n`Example`:\n```\nwiki_dataset --seed Brain Human_Brain -m \".*neuro|.*neural\" -cs '(?<=[.!?]) +' -cp 10\n```\n\nIn the above example, wiki page is split into chunks based on ., ? or !. And 10 contiguous \nchunks form a page. For example, if wiki page has 100 sentences, in the above example,\ngroups of 10s are considered to form a page. So, this wiki page contain 10 pages. \n\n#### --split_ratio or -sr:\n`Description`: Ratio to split the train, val and test datasets. Split happens based on\nnumber of pages. \n\n`Default`: 80%, 10% and 10% for train, val and test\n\n`Example`:\n```\nwiki_dataset --seed Brain Human_Brain -m \".*neuro|.*neural\" -cs '(?<=[.!?]) +' -cp 10 -sr .8 0.1 0.1\n```\n\nIf a wiki page has 10 pages (as defined by [chuck_splitter](#--chunk_splitter-or--cs) and \n[chunks_per_page](#--chunks_per_page-or--cp)), then in the above example, train will contain\n8, val and test will contain 1 each. Note that the actual page in each of these datasets depend\non if [shuffle](#--shuffle-or--sf) is on. If shuffle is on, pages are shuffled and any 8 page\ncan make train dataset and any of the remaining 2 pages can be val and test. If shuffle is \noff, then first 8 pages will be train, next 1 is val and final page is test\n\n#### --datasets or -ds:\n\n`Description`: Names the datasets\n\n`Default`: train, val and test\n\n`Example`:\n```\nwiki_dataset --seed Brain Human_Brain -m \".*neuro|.*neural\" -sr 80 20 -ds set1 set2 \n```\nIn the above example, 2 datasets: set1 & set2 will be created\n\n#### --shuffle or -sf:\n`Description`: Shuffle pages (see [chuck_splitter](#--chunk_splitter-or--cs) and \n[chunks_per_page](#--chunks_per_page-or--cp) for pages) before creating datasets\n\n`Default`: True\n\n`Example`:\n```\nwiki_dataset --seed Brain Human_Brain -m \".*neuro|.*neural\" -sf false\n```\n\nSince shuffle is false in the above example, pages in wiki page will be taken in order. i.e.\nsince default ratio is 80%, 10% and 10%, first 80% of this wiki page will be in train, next 10%\nin val and final 10% in test.\n\nActual pages in each of the datasets depend on if shuffle is on. \nIf shuffle is on, pages are shuffled and any 80% page can make train dataset \nand any of the remaining 20% pages can be val and test. \nIf shuffle is off, then first 80% will be train, next 10% val and final 10% is test\n\n\n\n### Programmatic Usage\n\nBelow is a simple example:\n\n```python\nfrom nlp_data_py import WikiDataset\n\nWikiDataset.create_dataset_from_wiki(seeds=['Brain', 'Human_brain'], match=\".*neuro\")\n\n```\nIn the above example,\n- Brain will be read from wikipedia\n- contents will be broken to pages as defined by default book. ([chuck_splitter](#--chunk_splitter-or--cs) and \n[chunks_per_page](#--chunks_per_page-or--cp))\n- pages will be shuffled\n- pages will be split as defined by default splitter \n([split_ratio](#--split_ratio-or--sr), [datasets](#--datasets-or--ds), [shuffle](#--shuffle-or--sf))\n- links will be extracted from the page\n- links matching patter in match (in this case links containing 'neuro')\n will be added to self.scanned if they are not already there\n- Brain will be set to 1 in self.scanned to indicate that this\n page is already read\n- same steps are repeated with 'Medulla_oblongata'\n- since recursive is set to true, and limit is 20, next\n 20 unread items from self.scanned will be read and their\n links will be tracked in self.scanned\n- finally self.scanned is written to a pickle file\n- if the same code is run again, pickle file will be read\n and since Brain and Medulla oblangata are already read,\n they will be skipped and next 20 items from self.scanned\n are read\n\n\nBelow is an example where default options are overridden:\n\n```python\nfrom nlp_data_py import WikiDataset\nfrom nlp_data_py import Book, Splitter\n\nscanned_pickle = \"./scanned.pkl\"\nsave_dataset_path = \"./datasets/\"\n\nbook_def: Book = Book(chunk_splitter='(?<=[.!?]) +', chunks_per_page=2)\nsplitter: Splitter = Splitter(split_ratios=[0.5, 0.25, 0.25], dataset_names=['train', 'val', 'test'], shuffle=False)\n\nwiki = WikiDataset.create_dataset_from_wiki(seeds=['Brain', 'Human_brain'], \n match=\".*neuro\",\n recursive=True, limit=2,\n scanned_pickle=scanned_pickle,\n save_dataset_path=save_dataset_path,\n book_def=book_def,\n splitter=splitter)\n\n```\n\n\n\n\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/madutv/nlp-data-py.git", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "nlp-data-py", "package_url": "https://pypi.org/project/nlp-data-py/", "platform": "", "project_url": "https://pypi.org/project/nlp-data-py/", "project_urls": { "Homepage": "https://github.com/madutv/nlp-data-py.git" }, "release_url": "https://pypi.org/project/nlp-data-py/0.0.9/", "requires_dist": [ "numpy", "wikipedia" ], "requires_python": "", "summary": "Create Test, Train and Validation datasets for NLP. Currently, creating these datasets from wikipedia is supported", "version": "0.0.9" }, "last_serial": 4500557, "releases": { "0.0.7": [ { "comment_text": "", "digests": { "md5": "d2e217c060b99f7e1cdab708b73d2b72", "sha256": "0ab8c328c54a2d147874da5e5e35cd3feee35a56f022426d5d2af939b0dfcea8" }, "downloads": -1, "filename": "nlp_data_py-0.0.7-py3-none-any.whl", "has_sig": false, "md5_digest": "d2e217c060b99f7e1cdab708b73d2b72", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 23648, "upload_time": "2018-11-18T17:49:57", "url": "https://files.pythonhosted.org/packages/e4/d5/cc7b06626a66ee9af4f2a1f8db66f09cb6b6d545b2ddce4487a35ad08431/nlp_data_py-0.0.7-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "6af72c93775d56041112b5a016f27390", "sha256": "f3670392ceefa145e5ae49a508c8dcde10bb6e9f9d56f1c847ef098372121c98" }, "downloads": -1, "filename": "nlp-data-py-0.0.7.tar.gz", "has_sig": false, "md5_digest": "6af72c93775d56041112b5a016f27390", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 19337, "upload_time": "2018-11-18T17:49:58", "url": "https://files.pythonhosted.org/packages/64/cc/3f50262739a032836f6c5b5951e09d5bc6d601a9837b1d34562394f63244/nlp-data-py-0.0.7.tar.gz" } ], "0.0.8": [ { "comment_text": "", "digests": { "md5": "ddf86f490039dd51090b4cf4b6f0f18d", "sha256": "1f8689924e44ef1579a5c66ceda02ff8e25bd8827bc42876b7e3e5e881e383b3" }, "downloads": -1, "filename": "nlp_data_py-0.0.8-py3-none-any.whl", "has_sig": false, "md5_digest": "ddf86f490039dd51090b4cf4b6f0f18d", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 23753, "upload_time": "2018-11-18T19:03:30", "url": "https://files.pythonhosted.org/packages/c3/f5/b6bde3b7e936b19680cb824a0345a70a302a62f8259aa50f07b44e6cbb12/nlp_data_py-0.0.8-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "898de14d99f821605ef830af1f53bf70", "sha256": "afb2ad516bcddcef866151c2b3a0c1925971b91f6c901e3be8d565c5cf27ac54" }, "downloads": -1, "filename": "nlp-data-py-0.0.8.tar.gz", "has_sig": false, "md5_digest": "898de14d99f821605ef830af1f53bf70", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 19455, "upload_time": "2018-11-18T19:03:31", "url": "https://files.pythonhosted.org/packages/bf/07/31974d5b109ab26b3843735e465829fa9e81b69396981e95d29df0911062/nlp-data-py-0.0.8.tar.gz" } ], "0.0.9": [ { "comment_text": "", "digests": { "md5": "e6d22e02a8b9c4eccddb0ae323fe661b", "sha256": "4169d8577fcc09d06dcf87ab444788d536ee5005cf2e9c8daef6b4cc8fee2d19" }, "downloads": -1, "filename": "nlp_data_py-0.0.9-py3-none-any.whl", "has_sig": false, "md5_digest": "e6d22e02a8b9c4eccddb0ae323fe661b", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 23782, "upload_time": "2018-11-18T19:59:36", "url": "https://files.pythonhosted.org/packages/ea/db/cf17c8c2613164c9676e9af938403bd27d7daca3ac8edf4b20950ba0ec89/nlp_data_py-0.0.9-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "683489f5ad89cb8288bf1ab50be3adff", "sha256": "af212b4dcf5814d1b5efff0a3d55c22bd5c0196f1ad99780cd456ca58e39c257" }, "downloads": -1, "filename": "nlp-data-py-0.0.9.tar.gz", "has_sig": false, "md5_digest": "683489f5ad89cb8288bf1ab50be3adff", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 19546, "upload_time": "2018-11-18T19:59:37", "url": "https://files.pythonhosted.org/packages/47/2c/2e77705898b9a7006ea3218e7cb84ee2d2ec3428e38eb16a47f809855cff/nlp-data-py-0.0.9.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "e6d22e02a8b9c4eccddb0ae323fe661b", "sha256": "4169d8577fcc09d06dcf87ab444788d536ee5005cf2e9c8daef6b4cc8fee2d19" }, "downloads": -1, "filename": "nlp_data_py-0.0.9-py3-none-any.whl", "has_sig": false, "md5_digest": "e6d22e02a8b9c4eccddb0ae323fe661b", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 23782, "upload_time": "2018-11-18T19:59:36", "url": "https://files.pythonhosted.org/packages/ea/db/cf17c8c2613164c9676e9af938403bd27d7daca3ac8edf4b20950ba0ec89/nlp_data_py-0.0.9-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "683489f5ad89cb8288bf1ab50be3adff", "sha256": "af212b4dcf5814d1b5efff0a3d55c22bd5c0196f1ad99780cd456ca58e39c257" }, "downloads": -1, "filename": "nlp-data-py-0.0.9.tar.gz", "has_sig": false, "md5_digest": "683489f5ad89cb8288bf1ab50be3adff", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 19546, "upload_time": "2018-11-18T19:59:37", "url": "https://files.pythonhosted.org/packages/47/2c/2e77705898b9a7006ea3218e7cb84ee2d2ec3428e38eb16a47f809855cff/nlp-data-py-0.0.9.tar.gz" } ] }