{
"info": {
"author": "Ingo Kleiber",
"author_email": "ingo@kleiber.me",
"bugtrack_url": null,
"classifiers": [
"Development Status :: 2 - Pre-Alpha",
"Intended Audience :: Developers",
"License :: OSI Approved :: MIT License",
"Natural Language :: English",
"Programming Language :: Python :: 3.6"
],
"description": "=============\nTextDirectory\n=============\n\n\n.. image:: https://img.shields.io/pypi/v/textdirectory.svg\n :target: https://pypi.python.org/pypi/textdirectory\n\n.. image:: https://img.shields.io/travis/IngoKl/textdirectory.svg\n :target: https://travis-ci.org/IngoKl/textdirectory\n\n.. image:: https://readthedocs.org/projects/textdirectory/badge/?version=latest\n :target: https://textdirectory.readthedocs.io/en/latest/?badge=latest\n :alt: Documentation Status\n\n|\n|\n\n.. image:: https://user-images.githubusercontent.com/16179317/39367680-cd409a00-4a37-11e8-8d42-0bed5a4e814b.png\n :alt: TextDirectory\n\n*TextDirectory* allows you to combine multiple text files into one aggregated file. TextDirectory also supports matching\nfiles for certain criteria and applying transformations to the aggregated text.\n\n*TextDirectory* can be used as a mere tool (via the CLI) and as a Python library.\n\nOf course, everything *TextDirectory* does could be achieved in bash or PowerShell. However, there are certain\nuse-cases (e.g. when used as a library) in which it might be useful.\n\n\n* Free software: MIT license\n* Documentation: https://textdirectory.readthedocs.io.\n\n\nFeatures\n--------\n* Aggregating multiple text files\n* Filtering documents/texts based on various parameters such as length, content, and random sampling\n* Transforming the aggregated text (e.g. transforming the text to lowercase)\n\n.. 
csv-table::\n :header: \"Version\", \"Filters\", \"Transformations\"\n :widths: 10, 30, 30\n\n 0.1.0, filter_by_max_chars(n int); filter_by_min_chars(n int); filter_by_max_tokens(n int); filter_by_min_tokens(n int); filter_by_contains(str); filter_by_not_contains(str); filter_by_random_sampling(n int; replace=False), transformation_lowercase\n 0.1.1, filter_by_chars_outliers(n sigmas int), transformation_remove_nl\n 0.1.2, filter_by_filename_contains(str), transformation_usas_en_semtag; transformation_uppercase; transformation_postag(spacy_model str)\n 0.1.3, filter_by_similar_documents(reference_file str; threshold float), transformation_remove_non_ascii; transformation_remove_non_alphanumerical\n 0.2.0, filter_by_max_filesize(max_kb int); filter_by_min_filesize(min_kb int), transformation_to_leetspeak; transformation_crude_spellchecker(spacy_model str)\n 0.2.1, ,transformation_remove_stopwords(stopwords_source str; stopwords str [en]; spacy_model str; custom_stopwords str); transformation_remove_htmltags\n\nQuickstart\n----------\nInstall *TextDirectory* via pip: ``pip install textdirectory``\n\n*TextDirectory*, as exemplified below, works with a two-stage model. After loading in your data (directory) you can iteratively select the files you want to process. In a second step you can perform transformations on the text before finally aggregating it.\n\n.. image:: https://user-images.githubusercontent.com/16179317/39367589-7f774116-4a37-11e8-9a09-5cbdf5f3311b.png\n :alt: TextDirectory\n\nAs a Command-Line Tool\n~~~~~~~~~~~~~~~~~~~~~~\n*TextDirectory* comes equipped with a CLI.\n\nThe syntax for both the *filters* and *transformations* works similarly. 
They are chained by adding slashes (/) and\nparameters are passed via commas (,): ``filter_by_min_tokens,5/filter_by_random_sampling,2``.\n\n**Example 1: A Very Simple Aggregation**\n\n``textdirectory --directory testdata --output_file aggregated.txt``\n\nThis takes all files (.txt) in *testdata* and aggregates them into a file called *aggregated.txt*.\n\n**Example 2: Applying Filters and Transformations**\n\nIn this example we want to filter the files based on their token count, perform a random sampling and finally transform all text to lowercase.\n\n``textdirectory --directory testdata --output_file aggregated.txt --filters filter_by_min_tokens,5/filter_by_random_sampling,2 --transformations transformation_lowercase``\n\nAfter passing two filters (*filter_by_min_tokens* and *filter_by_random_sampling*) we've applied the *transformation_lowercase* transformation.\n\nThe resulting file will contain the content of two files that each have at least five tokens.\n\nAs a Python Library\n~~~~~~~~~~~~~~~~~~~\nIn order to demonstrate *TextDirectory* as a Python library, we'll recreate the second example from above:\n\n.. code:: python\n\n import textdirectory\n td = textdirectory.TextDirectory(directory='testdata')\n td.load_files(recursive=False, filetype='txt', sort=True)\n td.filter_by_min_tokens(5)\n td.filter_by_random_sampling(2)\n td.stage_transformation(['transformation_lowercase'])\n td.aggregate_to_file('aggregated.txt')\n\nIf we wanted to keep working with the actual aggregated text, we could have called ``text = td.aggregate_to_memory()``.\n\nIt's also possible to pass arguments to the individual transformations. In order to do this (at the moment) you have to adhere to the correct order of arguments.\n\n.. 
code:: python\n\n # def transformation_remove_stopwords(text, stopwords_source='internal', stopwords='en', spacy_model='en_core_web_sm', custom_stopwords=None, *args)\n td.stage_transformation(['transformation_remove_stopwords', 'internal', 'en', 'en_core_web_sm', 'dolor'])\n\nIn the above example, we are adding additional custom stopwords to the transformer.\n\nToDo\n--------\n* Increasing test coverage\n* Writing better documentation\n* Adding better error handling (raw exceptions are, well ...)\n* Adding logging\n* Allowing users to pass keyword arguments to transformers\n* Implementing autodoc (via Sphinx)\n\nBehaviour\n---------\nWe are not holding the actual texts in memory. This leads to much more disk read activity (and time inefficiency), but\nsaves memory.\n\n``transformation_usas_en_semtag`` relies on the web version of `Paul Rayson's USAS Tagger\n`_. Don't use this transformation for large amounts of text, give credit, and\nconsider using their commercial product `Wmatrix `_.\n\n\nCredits\n-------\nThis package is based on the `audreyr/cookiecutter-pypackage`_ cookiecutter template. The *crude spellchecker*\n(transformation) is implemented following Peter Norvig's excellent `tutorial`_.\n\n.. _Cookiecutter: https://github.com/audreyr/cookiecutter\n.. _`audreyr/cookiecutter-pypackage`: https://github.com/audreyr/cookiecutter-pypackage\n.. 
_`tutorial`: http://norvig.com/spell-correct.html\n\n\n=======\nHistory\n=======\n\n0.1.0 (2018-04-26)\n------------------\n\n* Initial release\n* First release on PyPI.\n\n0.1.1 (2018-04-27)\n------------------\n\n* added filter_by_chars_outliers\n* added transformation_remove_nl\n\n0.1.2 (2018-04-29)\n------------------\n* added transformation_postag\n* added transformation_usas_en_semtag\n* added transformation_uppercase\n* added filter_by_filename_contains\n* added parameter support for transformations\n\n0.1.3 (2018-04-30)\n------------------\n* filter_by_random_sampling now has a \"replacement\" option\n* changed from tabulate to an embedded function\n* added transformation_remove_non_ascii\n* added transformation_remove_non_alphanumerical\n* added filter_by_similar_documents\n\n0.1.4 (2018-05-02)\n------------------\n* fixed an object mutation problem in the tabulate function\n\n0.2.0 (2018-05-13)\n------------------\n* added transform_to_memory() function\n* added transformation_to_leetspeak() function\n* added transformation_crude_spellchecker\n* added filter_by_max_filesize\n* added filter_by_min_filesize\n* fixed a bug where load_files() would fail if there were no files\n\n0.2.1 (2019-06-13)\n------------------\n* added transformation_remove_stopwords\n* added transformation_remove_htmltags\n* fixed some minor bugs\n\n0.2.2 (2019-06-13)\n------------------\n* changed the data packaging\n\n\n",
"description_content_type": "",
"docs_url": null,
"download_url": "",
"downloads": {
"last_day": -1,
"last_month": -1,
"last_week": -1
},
"home_page": "https://github.com/IngoKl/textdirectory",
"keywords": "textdirectory",
"license": "MIT license",
"maintainer": "",
"maintainer_email": "",
"name": "textdirectory",
"package_url": "https://pypi.org/project/textdirectory/",
"platform": "",
"project_url": "https://pypi.org/project/textdirectory/",
"project_urls": {
"Homepage": "https://github.com/IngoKl/textdirectory"
},
"release_url": "https://pypi.org/project/textdirectory/0.2.2/",
"requires_dist": [
"Click (>=6.0)",
"numpy",
"requests",
"beautifulsoup4",
"spacy"
],
"requires_python": "",
"summary": "TextDirectory allows you to combine multiple text files into one.",
"version": "0.2.2"
},
"last_serial": 5396032,
"releases": {
"0.1.0": [
{
"comment_text": "",
"digests": {
"md5": "9dfa59c304fb89d17e00daf90e314e07",
"sha256": "fa01cd7a4267ba538d25625f43b0444ba7964ec56925b7241de1a7ae11600032"
},
"downloads": -1,
"filename": "textdirectory-0.1.0-py2.py3-none-any.whl",
"has_sig": false,
"md5_digest": "9dfa59c304fb89d17e00daf90e314e07",
"packagetype": "bdist_wheel",
"python_version": "py2.py3",
"requires_python": null,
"size": 7294,
"upload_time": "2018-04-27T13:24:30",
"url": "https://files.pythonhosted.org/packages/a1/3d/c99a1735b5e500e7ad8f40e58d39997d038403a82c4434e02677ebc98072/textdirectory-0.1.0-py2.py3-none-any.whl"
},
{
"comment_text": "",
"digests": {
"md5": "900f0c3460822ad87b675ebc185f2e4a",
"sha256": "e5e2634f1d80ef6b633e18795f1126c5b349e12fea7eee840368466fec2c2938"
},
"downloads": -1,
"filename": "textdirectory-0.1.0.tar.gz",
"has_sig": false,
"md5_digest": "900f0c3460822ad87b675ebc185f2e4a",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 11030,
"upload_time": "2018-04-27T13:24:31",
"url": "https://files.pythonhosted.org/packages/ac/0d/c695dd4aaf04126e9c98fa73357fdb5cddc74e016c2422ab5e40df16c54f/textdirectory-0.1.0.tar.gz"
}
],
"0.1.2": [
{
"comment_text": "",
"digests": {
"md5": "31c5d0d2bc75bde1745dcc711fbf9d05",
"sha256": "fd0eea9a172cbb953b1211a169c501d1b527eebd47281d61ceb61705867318ec"
},
"downloads": -1,
"filename": "textdirectory-0.1.2-py2.py3-none-any.whl",
"has_sig": false,
"md5_digest": "31c5d0d2bc75bde1745dcc711fbf9d05",
"packagetype": "bdist_wheel",
"python_version": "py2.py3",
"requires_python": null,
"size": 10842,
"upload_time": "2018-04-29T16:47:46",
"url": "https://files.pythonhosted.org/packages/7e/b6/034c56cd6c5fa37444b926e3c4764200711b82368c1c0edabc46767c6c00/textdirectory-0.1.2-py2.py3-none-any.whl"
},
{
"comment_text": "",
"digests": {
"md5": "eb1c138e3ceba7ad89c695d119c6b73d",
"sha256": "7112d64abb93d23f68fe84cbb837f3b4e946b119bf88d5064604790186836408"
},
"downloads": -1,
"filename": "textdirectory-0.1.2.tar.gz",
"has_sig": false,
"md5_digest": "eb1c138e3ceba7ad89c695d119c6b73d",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 15119,
"upload_time": "2018-04-29T16:47:48",
"url": "https://files.pythonhosted.org/packages/cc/ac/b262c793628858a67a2b256803528de57848d919899afd3efda7e35d7d1e/textdirectory-0.1.2.tar.gz"
}
],
"0.1.3": [
{
"comment_text": "",
"digests": {
"md5": "1b9a8b4a612f2755279a7180a93dc211",
"sha256": "c3ac9d91f5d9a9daab5b64a06d7ed6588ebd1a66afb70034b90ce7f5b749da1d"
},
"downloads": -1,
"filename": "textdirectory-0.1.3-py2.py3-none-any.whl",
"has_sig": false,
"md5_digest": "1b9a8b4a612f2755279a7180a93dc211",
"packagetype": "bdist_wheel",
"python_version": "py2.py3",
"requires_python": null,
"size": 12255,
"upload_time": "2018-04-30T21:28:39",
"url": "https://files.pythonhosted.org/packages/88/4c/a48b4dc25c13e4f3aea6c497e30595240aae9b72f60645f843ba34e8d521/textdirectory-0.1.3-py2.py3-none-any.whl"
},
{
"comment_text": "",
"digests": {
"md5": "51f23add19b369faec806699ab8ec592",
"sha256": "5563836b58b380996eb489c68557c0fcc1d55beecc3af32c3b88a74011f24916"
},
"downloads": -1,
"filename": "textdirectory-0.1.3.tar.gz",
"has_sig": false,
"md5_digest": "51f23add19b369faec806699ab8ec592",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 22945,
"upload_time": "2018-04-30T21:28:42",
"url": "https://files.pythonhosted.org/packages/a0/6b/82ffab0692ac9f9fed201c406a30f0c66a9c1e5b7e03e2ad434aa966e60f/textdirectory-0.1.3.tar.gz"
}
],
"0.1.4": [
{
"comment_text": "",
"digests": {
"md5": "3f2c6b8e69090834b20985c855f830b4",
"sha256": "bd674a12328042cc8f55b65ca270805564a06aab1b9b32b9b84bd866cc805692"
},
"downloads": -1,
"filename": "textdirectory-0.1.4-py2.py3-none-any.whl",
"has_sig": false,
"md5_digest": "3f2c6b8e69090834b20985c855f830b4",
"packagetype": "bdist_wheel",
"python_version": "py2.py3",
"requires_python": null,
"size": 12412,
"upload_time": "2018-05-02T12:18:57",
"url": "https://files.pythonhosted.org/packages/22/9f/938d8b38eaa2e04849e2ccf829951a9da2813a8b8db109f8b6264d5e6b45/textdirectory-0.1.4-py2.py3-none-any.whl"
},
{
"comment_text": "",
"digests": {
"md5": "cd13396ecef704d39a731df01aa8ccf8",
"sha256": "8ae1d01633e160590a344954c32f5ce386ddd2ce0176bc74321f6456806ae0ad"
},
"downloads": -1,
"filename": "textdirectory-0.1.4.tar.gz",
"has_sig": false,
"md5_digest": "cd13396ecef704d39a731df01aa8ccf8",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 17817,
"upload_time": "2018-05-02T12:18:58",
"url": "https://files.pythonhosted.org/packages/90/28/f2cbc73d1dcbf4eee2ab49877726c66848f9c1db2fc1ac8943a94ae73685/textdirectory-0.1.4.tar.gz"
}
],
"0.2.0": [
{
"comment_text": "",
"digests": {
"md5": "19871d49fde4451372431fe67a68d4a4",
"sha256": "5d1070f016a8e5c80882a02991c3993c10f6fbbf711af30f8402e89b5a03ce65"
},
"downloads": -1,
"filename": "textdirectory-0.2.0-py2.py3-none-any.whl",
"has_sig": false,
"md5_digest": "19871d49fde4451372431fe67a68d4a4",
"packagetype": "bdist_wheel",
"python_version": "py2.py3",
"requires_python": null,
"size": 15235,
"upload_time": "2018-05-13T14:40:23",
"url": "https://files.pythonhosted.org/packages/bd/00/ae8f1c153f758fca8ae3ef794435a6da876f93130f03d5d8aa38ac58999a/textdirectory-0.2.0-py2.py3-none-any.whl"
},
{
"comment_text": "",
"digests": {
"md5": "f93ff3c5779e3ef8abd983364925e71c",
"sha256": "5c3b42c1aabc6cb07668413da052802ef9af943c1c60ce1842316af5575809cf"
},
"downloads": -1,
"filename": "textdirectory-0.2.0.tar.gz",
"has_sig": false,
"md5_digest": "f93ff3c5779e3ef8abd983364925e71c",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 20589,
"upload_time": "2018-05-13T14:40:25",
"url": "https://files.pythonhosted.org/packages/20/42/b953c62a08b9881858f1ebb116d1ba2d9dd224118ebca9f54d7ac768553a/textdirectory-0.2.0.tar.gz"
}
],
"0.2.1": [
{
"comment_text": "",
"digests": {
"md5": "cc818dea2ffdce59c03fdd6ae5a48b47",
"sha256": "3cb99a771ca7c47f919197af453feb34201d551f4e8391a5c2f1a29010dfd989"
},
"downloads": -1,
"filename": "textdirectory-0.2.1-py2.py3-none-any.whl",
"has_sig": false,
"md5_digest": "cc818dea2ffdce59c03fdd6ae5a48b47",
"packagetype": "bdist_wheel",
"python_version": "py2.py3",
"requires_python": null,
"size": 13269,
"upload_time": "2019-06-13T12:28:52",
"url": "https://files.pythonhosted.org/packages/3c/74/097ebbf63c234ecf39b8fa16bf011585214f4fb58c953de69912438e6720/textdirectory-0.2.1-py2.py3-none-any.whl"
},
{
"comment_text": "",
"digests": {
"md5": "783cf331bf9d798f0b3256dbda16b8ec",
"sha256": "100f712062508b3b267bde9bf84731d39bbf442767fcfe4ba880deb02b3ae09c"
},
"downloads": -1,
"filename": "textdirectory-0.2.1.tar.gz",
"has_sig": false,
"md5_digest": "783cf331bf9d798f0b3256dbda16b8ec",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 21467,
"upload_time": "2019-06-13T12:28:53",
"url": "https://files.pythonhosted.org/packages/08/45/f4d73d4ee9cf73405be138acf3b254d2472e7acb6bab87199e18b70457b3/textdirectory-0.2.1.tar.gz"
}
],
"0.2.2": [
{
"comment_text": "",
"digests": {
"md5": "2ddb4bf0eb146b9d68c772f902562b9c",
"sha256": "d8f8f5a632b5bcd7d52a4a64d765004310046c636070ad5657fa620eb20011a9"
},
"downloads": -1,
"filename": "textdirectory-0.2.2-py2.py3-none-any.whl",
"has_sig": false,
"md5_digest": "2ddb4bf0eb146b9d68c772f902562b9c",
"packagetype": "bdist_wheel",
"python_version": "py2.py3",
"requires_python": null,
"size": 12288038,
"upload_time": "2019-06-13T13:59:52",
"url": "https://files.pythonhosted.org/packages/7d/77/73fc13c178a976a4b991de92873341b6cae7f74e9a8aadfe39805dc458e5/textdirectory-0.2.2-py2.py3-none-any.whl"
},
{
"comment_text": "",
"digests": {
"md5": "ff756268f2c104be75002204657965f4",
"sha256": "419f7e4c272d2439a2f0bdd8f99254b9259c1c0279aff8bde3d90b343c4e9193"
},
"downloads": -1,
"filename": "textdirectory-0.2.2.tar.gz",
"has_sig": false,
"md5_digest": "ff756268f2c104be75002204657965f4",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 6159019,
"upload_time": "2019-06-13T13:59:55",
"url": "https://files.pythonhosted.org/packages/60/35/59a8fe5368dc0cf5ed084f050cd7b68e6847b79bae66dd7546b8ead65d28/textdirectory-0.2.2.tar.gz"
}
]
},
"urls": [
{
"comment_text": "",
"digests": {
"md5": "2ddb4bf0eb146b9d68c772f902562b9c",
"sha256": "d8f8f5a632b5bcd7d52a4a64d765004310046c636070ad5657fa620eb20011a9"
},
"downloads": -1,
"filename": "textdirectory-0.2.2-py2.py3-none-any.whl",
"has_sig": false,
"md5_digest": "2ddb4bf0eb146b9d68c772f902562b9c",
"packagetype": "bdist_wheel",
"python_version": "py2.py3",
"requires_python": null,
"size": 12288038,
"upload_time": "2019-06-13T13:59:52",
"url": "https://files.pythonhosted.org/packages/7d/77/73fc13c178a976a4b991de92873341b6cae7f74e9a8aadfe39805dc458e5/textdirectory-0.2.2-py2.py3-none-any.whl"
},
{
"comment_text": "",
"digests": {
"md5": "ff756268f2c104be75002204657965f4",
"sha256": "419f7e4c272d2439a2f0bdd8f99254b9259c1c0279aff8bde3d90b343c4e9193"
},
"downloads": -1,
"filename": "textdirectory-0.2.2.tar.gz",
"has_sig": false,
"md5_digest": "ff756268f2c104be75002204657965f4",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 6159019,
"upload_time": "2019-06-13T13:59:55",
"url": "https://files.pythonhosted.org/packages/60/35/59a8fe5368dc0cf5ed084f050cd7b68e6847b79bae66dd7546b8ead65d28/textdirectory-0.2.2.tar.gz"
}
]
}