{ "info": { "author": "Ingo Kleiber", "author_email": "ingo@kleiber.me", "bugtrack_url": null, "classifiers": [ "Development Status :: 2 - Pre-Alpha", "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Natural Language :: English", "Programming Language :: Python :: 3.6" ], "description": "=============\nTextDirectory\n=============\n\n\n.. image:: https://img.shields.io/pypi/v/textdirectory.svg\n :target: https://pypi.python.org/pypi/textdirectory\n\n.. image:: https://img.shields.io/travis/IngoKl/textdirectory.svg\n :target: https://travis-ci.org/IngoKl/textdirectory\n\n.. image:: https://readthedocs.org/projects/textdirectory/badge/?version=latest\n :target: https://textdirectory.readthedocs.io/en/latest/?badge=latest\n :alt: Documentation Status\n\n|\n|\n\n.. image:: https://user-images.githubusercontent.com/16179317/39367680-cd409a00-4a37-11e8-8d42-0bed5a4e814b.png\n :alt: TextDirectory\n\n*TextDirectory* allows you to combine multiple text files into one aggregated file. TextDirectory also supports matching\nfiles for certain criteria and applying transformations to the aggregated text.\n\n*TextDirectory* can be used as a mere tool (via the CLI) and as a Python library.\n\nOf course, everything *TextDirectory* does could be achieved in bash or PowerShell. However, there are certain\nuse-cases (e.g. when used as a library) in which it might be useful.\n\n\n* Free software: MIT license\n* Documentation: https://textdirectory.readthedocs.io.\n\n\nFeatures\n--------\n* Aggregating multiple text files\n* Filtering documents/texts based on various parameters such as length, content, and random sampling\n* Transforming the aggregated text (e.g. transforming the text to lowercase)\n\n.. 
csv-table::\n :header: \"Version\", \"Filters\", \"Transformations\"\n :widths: 10, 30, 30\n\n 0.1.0, filter_by_max_chars(n int); filter_by_min_chars(n int); filter_by_max_tokens(n int); filter_by_min_tokens(n int); filter_by_contains(str); filter_by_not_contains(str); filter_by_random_sampling(n int; replace=False), transformation_lowercase\n 0.1.1, filter_by_chars_outliers(n sigmas int), transformation_remove_nl\n 0.1.2, filter_by_filename_contains(str), transformation_usas_en_semtag; transformation_uppercase; transformation_postag(spacy_model str)\n 0.1.3, filter_by_similar_documents(reference_file str; threshold float), transformation_remove_non_ascii; transformation_remove_non_alphanumerical\n 0.2.0, filter_by_max_filesize(max_kb int); filter_by_min_filesize(min_kb int), transformation_to_leetspeak; transformation_crude_spellchecker(spacy_model str)\n 0.2.1, ,transformation_remove_stopwords(stopwords_source str; stopwords str [en]; spacy_model str; custom_stopwords str); transformation_remove_htmltags\n\nQuickstart\n----------\nInstall *TextDirectory* via pip: ``pip install textdirectory``\n\n*TextDirectory*, as exemplified below, works with a two-stage model. After loading in your data (directory) you can iteratively select the files you want to process. In a second step you can perform transformations on the text before finally aggregating it.\n\n.. image:: https://user-images.githubusercontent.com/16179317/39367589-7f774116-4a37-11e8-9a09-5cbdf5f3311b.png\n :alt: TextDirectory\n\nAs a Command-Line Tool\n~~~~~~~~~~~~~~~~~~~~~~\n*TextDirectory* comes equipped with a CLI.\n\nThe syntax for both the *filters* and *transformations* works similarly. 
They are chained by adding slashes (/) and\nparameters are passed via commas (,): ``filter_by_min_tokens,5/filter_by_random_sampling,2``.\n\n**Example 1: A Very Simple Aggregation**\n\n``textdirectory --directory testdata --output_file aggregated.txt``\n\nThis will take all files (.txt) in *testdata* and aggregate them into a file called *aggregated.txt*.\n\n**Example 2: Applying Filters and Transformations**\n\nIn this example we want to filter the files based on their token count, perform a random sampling and finally transform all text to lowercase.\n\n``textdirectory --directory testdata --output_file aggregated.txt --filters filter_by_min_tokens,5/filter_by_random_sampling,2 --transformations transformation_lowercase``\n\nAfter passing two filters (*filter_by_min_tokens* and *filter_by_random_sampling*) we've applied the *transformation_lowercase* transformation.\n\nThe resulting file will contain the content of two files that each have at least five tokens.\n\nAs a Python Library\n~~~~~~~~~~~~~~~~~~~\nIn order to demonstrate *TextDirectory* as a Python library, we'll recreate the second example from above:\n\n.. code:: python\n\n import textdirectory\n td = textdirectory.TextDirectory(directory='testdata')\n td.load_files(recursive=False, filetype='txt', sort=True)\n td.filter_by_min_tokens(5)\n td.filter_by_random_sampling(2)\n td.stage_transformation(['transformation_lowercase'])\n td.aggregate_to_file('aggregated.txt')\n\nIf we wanted to keep working with the actual aggregated text, we could have called ``text = td.aggregate_to_memory()``.\n\nIt's also possible to pass arguments to the individual transformations. In order to do this (at the moment) you have to adhere to the correct order of arguments.\n\n.. 
code:: python\n\n # def transformation_remove_stopwords(text, stopwords_source='internal', stopwords='en', spacy_model='en_core_web_sm', custom_stopwords=None, *args)\n td.stage_transformation(['transformation_remove_stopwords', 'internal', 'en', 'en_core_web_sm', 'dolor'])\n\nIn the above example, we are adding additional custom stopwords to the transformer.\n\nToDo\n--------\n* Increasing test coverage\n* Writing better documentation\n* Adding better error handling (raw exceptions are, well ...)\n* Adding logging\n* Allowing users to pass keyword arguments to transformers\n* Implementing autodoc (via Sphinx)\n\nBehaviour\n---------\nWe are not holding the actual texts in memory. This leads to much more disk read activity (and time inefficiency), but\nsaves memory.\n\n``transformation_usas_en_semtag`` relies on the web version of `Paul Rayson's USAS Tagger\n`_. Don't use this transformation for large amounts of text, give credit, and\nconsider using their commercial product `Wmatrix `_.\n\n\nCredits\n-------\nThis package is based on the `audreyr/cookiecutter-pypackage`_ cookiecutter template. The *crude spellchecker*\n(transformation) is implemented following Peter Norvig's excellent `tutorial`_.\n\n.. _Cookiecutter: https://github.com/audreyr/cookiecutter\n.. _`audreyr/cookiecutter-pypackage`: https://github.com/audreyr/cookiecutter-pypackage\n.. 
_`tutorial`: http://norvig.com/spell-correct.html\n\n\n=======\nHistory\n=======\n\n0.1.0 (2018-04-26)\n------------------\n\n* Initial release\n* First release on PyPI.\n\n0.1.1 (2018-04-27)\n------------------\n\n* added filter_by_chars_outliers\n* added transformation_remove_nl\n\n0.1.2 (2018-04-29)\n------------------\n* added transformation_postag\n* added transformation_usas_en_semtag\n* added transformation_uppercase\n* added filter_by_filename_contains\n* added parameter support for transformations\n\n0.1.3 (2018-04-30)\n------------------\n* filter_by_random_sampling now has a \"replacement\" option\n* changed from tabulate to an embedded function\n* added transformation_remove_non_ascii\n* added transformation_remove_non_alphanumerical\n* added filter_by_similar_documents\n\n0.1.4 (2018-05-02)\n------------------\n* fixed an object mutation problem in the tabulate function\n\n0.2.0 (2018-05-13)\n------------------\n* added transform_to_memory() function\n* added transformation_to_leetspeak() function\n* added transformation_crude_spellchecker\n* added filter_by_max_filesize\n* added filter_by_min_filesize\n* fixed a bug where load_files() would fail if there were no files\n\n0.2.1 (2019-06-13)\n------------------\n* added transformation_remove_stopwords\n* added transformation_remove_htmltags\n* fixed some minor bugs\n\n0.2.2 (2019-06-13)\n------------------\n* changed the data packaging\n\n\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/IngoKl/textdirectory", "keywords": "textdirectory", "license": "MIT license", "maintainer": "", "maintainer_email": "", "name": "textdirectory", "package_url": "https://pypi.org/project/textdirectory/", "platform": "", "project_url": "https://pypi.org/project/textdirectory/", "project_urls": { "Homepage": "https://github.com/IngoKl/textdirectory" }, "release_url": 
"https://pypi.org/project/textdirectory/0.2.2/", "requires_dist": [ "Click (>=6.0)", "numpy", "requests", "beautifulsoup4", "spacy" ], "requires_python": "", "summary": "TextDirectory allows you to combine multiple text files into one.", "version": "0.2.2" }, "last_serial": 5396032, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "9dfa59c304fb89d17e00daf90e314e07", "sha256": "fa01cd7a4267ba538d25625f43b0444ba7964ec56925b7241de1a7ae11600032" }, "downloads": -1, "filename": "textdirectory-0.1.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "9dfa59c304fb89d17e00daf90e314e07", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 7294, "upload_time": "2018-04-27T13:24:30", "url": "https://files.pythonhosted.org/packages/a1/3d/c99a1735b5e500e7ad8f40e58d39997d038403a82c4434e02677ebc98072/textdirectory-0.1.0-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "900f0c3460822ad87b675ebc185f2e4a", "sha256": "e5e2634f1d80ef6b633e18795f1126c5b349e12fea7eee840368466fec2c2938" }, "downloads": -1, "filename": "textdirectory-0.1.0.tar.gz", "has_sig": false, "md5_digest": "900f0c3460822ad87b675ebc185f2e4a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11030, "upload_time": "2018-04-27T13:24:31", "url": "https://files.pythonhosted.org/packages/ac/0d/c695dd4aaf04126e9c98fa73357fdb5cddc74e016c2422ab5e40df16c54f/textdirectory-0.1.0.tar.gz" } ], "0.1.2": [ { "comment_text": "", "digests": { "md5": "31c5d0d2bc75bde1745dcc711fbf9d05", "sha256": "fd0eea9a172cbb953b1211a169c501d1b527eebd47281d61ceb61705867318ec" }, "downloads": -1, "filename": "textdirectory-0.1.2-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "31c5d0d2bc75bde1745dcc711fbf9d05", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 10842, "upload_time": "2018-04-29T16:47:46", "url": 
"https://files.pythonhosted.org/packages/7e/b6/034c56cd6c5fa37444b926e3c4764200711b82368c1c0edabc46767c6c00/textdirectory-0.1.2-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "eb1c138e3ceba7ad89c695d119c6b73d", "sha256": "7112d64abb93d23f68fe84cbb837f3b4e946b119bf88d5064604790186836408" }, "downloads": -1, "filename": "textdirectory-0.1.2.tar.gz", "has_sig": false, "md5_digest": "eb1c138e3ceba7ad89c695d119c6b73d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 15119, "upload_time": "2018-04-29T16:47:48", "url": "https://files.pythonhosted.org/packages/cc/ac/b262c793628858a67a2b256803528de57848d919899afd3efda7e35d7d1e/textdirectory-0.1.2.tar.gz" } ], "0.1.3": [ { "comment_text": "", "digests": { "md5": "1b9a8b4a612f2755279a7180a93dc211", "sha256": "c3ac9d91f5d9a9daab5b64a06d7ed6588ebd1a66afb70034b90ce7f5b749da1d" }, "downloads": -1, "filename": "textdirectory-0.1.3-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "1b9a8b4a612f2755279a7180a93dc211", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 12255, "upload_time": "2018-04-30T21:28:39", "url": "https://files.pythonhosted.org/packages/88/4c/a48b4dc25c13e4f3aea6c497e30595240aae9b72f60645f843ba34e8d521/textdirectory-0.1.3-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "51f23add19b369faec806699ab8ec592", "sha256": "5563836b58b380996eb489c68557c0fcc1d55beecc3af32c3b88a74011f24916" }, "downloads": -1, "filename": "textdirectory-0.1.3.tar.gz", "has_sig": false, "md5_digest": "51f23add19b369faec806699ab8ec592", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 22945, "upload_time": "2018-04-30T21:28:42", "url": "https://files.pythonhosted.org/packages/a0/6b/82ffab0692ac9f9fed201c406a30f0c66a9c1e5b7e03e2ad434aa966e60f/textdirectory-0.1.3.tar.gz" } ], "0.1.4": [ { "comment_text": "", "digests": { "md5": "3f2c6b8e69090834b20985c855f830b4", "sha256": 
"bd674a12328042cc8f55b65ca270805564a06aab1b9b32b9b84bd866cc805692" }, "downloads": -1, "filename": "textdirectory-0.1.4-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "3f2c6b8e69090834b20985c855f830b4", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 12412, "upload_time": "2018-05-02T12:18:57", "url": "https://files.pythonhosted.org/packages/22/9f/938d8b38eaa2e04849e2ccf829951a9da2813a8b8db109f8b6264d5e6b45/textdirectory-0.1.4-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "cd13396ecef704d39a731df01aa8ccf8", "sha256": "8ae1d01633e160590a344954c32f5ce386ddd2ce0176bc74321f6456806ae0ad" }, "downloads": -1, "filename": "textdirectory-0.1.4.tar.gz", "has_sig": false, "md5_digest": "cd13396ecef704d39a731df01aa8ccf8", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 17817, "upload_time": "2018-05-02T12:18:58", "url": "https://files.pythonhosted.org/packages/90/28/f2cbc73d1dcbf4eee2ab49877726c66848f9c1db2fc1ac8943a94ae73685/textdirectory-0.1.4.tar.gz" } ], "0.2.0": [ { "comment_text": "", "digests": { "md5": "19871d49fde4451372431fe67a68d4a4", "sha256": "5d1070f016a8e5c80882a02991c3993c10f6fbbf711af30f8402e89b5a03ce65" }, "downloads": -1, "filename": "textdirectory-0.2.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "19871d49fde4451372431fe67a68d4a4", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 15235, "upload_time": "2018-05-13T14:40:23", "url": "https://files.pythonhosted.org/packages/bd/00/ae8f1c153f758fca8ae3ef794435a6da876f93130f03d5d8aa38ac58999a/textdirectory-0.2.0-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "f93ff3c5779e3ef8abd983364925e71c", "sha256": "5c3b42c1aabc6cb07668413da052802ef9af943c1c60ce1842316af5575809cf" }, "downloads": -1, "filename": "textdirectory-0.2.0.tar.gz", "has_sig": false, "md5_digest": "f93ff3c5779e3ef8abd983364925e71c", "packagetype": "sdist", 
"python_version": "source", "requires_python": null, "size": 20589, "upload_time": "2018-05-13T14:40:25", "url": "https://files.pythonhosted.org/packages/20/42/b953c62a08b9881858f1ebb116d1ba2d9dd224118ebca9f54d7ac768553a/textdirectory-0.2.0.tar.gz" } ], "0.2.1": [ { "comment_text": "", "digests": { "md5": "cc818dea2ffdce59c03fdd6ae5a48b47", "sha256": "3cb99a771ca7c47f919197af453feb34201d551f4e8391a5c2f1a29010dfd989" }, "downloads": -1, "filename": "textdirectory-0.2.1-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "cc818dea2ffdce59c03fdd6ae5a48b47", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 13269, "upload_time": "2019-06-13T12:28:52", "url": "https://files.pythonhosted.org/packages/3c/74/097ebbf63c234ecf39b8fa16bf011585214f4fb58c953de69912438e6720/textdirectory-0.2.1-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "783cf331bf9d798f0b3256dbda16b8ec", "sha256": "100f712062508b3b267bde9bf84731d39bbf442767fcfe4ba880deb02b3ae09c" }, "downloads": -1, "filename": "textdirectory-0.2.1.tar.gz", "has_sig": false, "md5_digest": "783cf331bf9d798f0b3256dbda16b8ec", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 21467, "upload_time": "2019-06-13T12:28:53", "url": "https://files.pythonhosted.org/packages/08/45/f4d73d4ee9cf73405be138acf3b254d2472e7acb6bab87199e18b70457b3/textdirectory-0.2.1.tar.gz" } ], "0.2.2": [ { "comment_text": "", "digests": { "md5": "2ddb4bf0eb146b9d68c772f902562b9c", "sha256": "d8f8f5a632b5bcd7d52a4a64d765004310046c636070ad5657fa620eb20011a9" }, "downloads": -1, "filename": "textdirectory-0.2.2-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "2ddb4bf0eb146b9d68c772f902562b9c", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 12288038, "upload_time": "2019-06-13T13:59:52", "url": 
"https://files.pythonhosted.org/packages/7d/77/73fc13c178a976a4b991de92873341b6cae7f74e9a8aadfe39805dc458e5/textdirectory-0.2.2-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "ff756268f2c104be75002204657965f4", "sha256": "419f7e4c272d2439a2f0bdd8f99254b9259c1c0279aff8bde3d90b343c4e9193" }, "downloads": -1, "filename": "textdirectory-0.2.2.tar.gz", "has_sig": false, "md5_digest": "ff756268f2c104be75002204657965f4", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6159019, "upload_time": "2019-06-13T13:59:55", "url": "https://files.pythonhosted.org/packages/60/35/59a8fe5368dc0cf5ed084f050cd7b68e6847b79bae66dd7546b8ead65d28/textdirectory-0.2.2.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "2ddb4bf0eb146b9d68c772f902562b9c", "sha256": "d8f8f5a632b5bcd7d52a4a64d765004310046c636070ad5657fa620eb20011a9" }, "downloads": -1, "filename": "textdirectory-0.2.2-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "2ddb4bf0eb146b9d68c772f902562b9c", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 12288038, "upload_time": "2019-06-13T13:59:52", "url": "https://files.pythonhosted.org/packages/7d/77/73fc13c178a976a4b991de92873341b6cae7f74e9a8aadfe39805dc458e5/textdirectory-0.2.2-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "ff756268f2c104be75002204657965f4", "sha256": "419f7e4c272d2439a2f0bdd8f99254b9259c1c0279aff8bde3d90b343c4e9193" }, "downloads": -1, "filename": "textdirectory-0.2.2.tar.gz", "has_sig": false, "md5_digest": "ff756268f2c104be75002204657965f4", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6159019, "upload_time": "2019-06-13T13:59:55", "url": "https://files.pythonhosted.org/packages/60/35/59a8fe5368dc0cf5ed084f050cd7b68e6847b79bae66dd7546b8ead65d28/textdirectory-0.2.2.tar.gz" } ] }