{ "info": { "author": "Riccardo Di Maio", "author_email": "riccardodimaio11@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "Environment :: Console", "Intended Audience :: Developers", "Intended Audience :: Information Technology", "Intended Audience :: Other Audience", "Intended Audience :: Science/Research", "Intended Audience :: System Administrators", "License :: OSI Approved :: MIT License", "Operating System :: POSIX", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Topic :: Text Processing", "Topic :: Utilities" ], "description": "
\n  \n \n \"Logo\"/\n \n\n A text parser that doesn't care about your file extensions\n\n \n \n \"Build\n \n \n \n \"Code\n \n \n \n \"SemVer\n \n\n Key Features \u2022\n Supported Formats \u2022\n Installation \u2022\n Usage \u2022\n Related projects \u2022\n Contributing \u2022\n MIT License\n
\n\n![Demo GIF](parsa/img/demo.gif?raw=true \"Demo GIF\")\n\n

\n \n Parsa is a textract-based CLI text parser that supports multiple file extensions.\n It takes any number of inputs, and outputs them to .txt files in a directory of choice, preserving the structure of the original text.\n \n

\n\n# Key features\n- Extends [textract](https://github.com/deanmalmgren/textract)'s functionalities to work with multiple inputs and to automatically save the output\n- Takes an arbitrary number of inputs of different filetypes, and processess them all equally when supported\n- Outputs the parsed text from the input files individually to corresponding .txt files, with the option of selecting a custom output path\n- Includes a naming system that always avoids overwriting existing files, instead naming new files in a simple manner\n- Supports over 20 of the most common formats (see [Supported formats](#supported-formats) for more)\n- Preserves the structure of document file formats (.docx, .pdf, ...)\n- Supports audio formats (.wav, .mp3, ...) via the speech recognition tools [sox](https://github.com/chirlu/sox), [SpeechRecognition](https://github.com/Uberi/speech_recognition) and [pocketsphinx](https://github.com/cmusphinx/pocketsphinx/)\n- Supports image formats (.jpg, .png, ...), via the optical character recognition (OCR) tool [tesseract-ocr](https://github.com/tesseract-ocr/tesseract)\n- Prompts the user for an input file's extension if it's not explicitly present; this feature can be turned off via `--noprompt`\n\n# Supported formats\nSee [this page](https://textract.readthedocs.io/en/stable/#currently-supporting) from textract's documentation for a full list of the supported formats and their linked dependencies.\n\n# Installation\n## System requirements\n- Linux\n- Python 2.7/3.x (any Python 3 version)\n## Linux\nVia `pip`:\n```bash\n$ pip install parsa\n```\n\nOr, if you prefer, you can install it from source:\n```bash\n# Clone the repository\n$ git clone https://github.com/rdimaio/parsa\n\n# Go into the parsa folder\n$ cd parsa\n\n# Install parsa\n$ python setup.py install\n```\n\n### Tests\n```bash\n$ python -m unittest discover tests\n```\n\n# Usage\n## Single input\n```bash\n# Basic usage\n$ parsa path/to/input_file\n# The output will be saved inside the input file's parent folder.\n```\n\n## Multi input\n```bash\n# Basic usage\n$ parsa path/to/input_folder\n# The output will be saved inside a folder named `parsaoutput` in the input folder.\n```\n\n### Optional: custom output folder\n```bash\n# Basic usage\n$ parsa path/to/input -o path/to/output_folder\n# Works with both single and multi input.\n```\n\n### Optional: ignore files without an explicit extension\n```bash\n# Basic usage\n$ parsa --noprompt path/to/input\n# Useful for situations where your input includes log/system files without an extension.\n```\n\n## Full help message\n```\n$ parsa --help\nusage: parsa [-h] [--noprompt] [--output [OUTPUT]] input\n\nTextract-based text parser that supports most text file extensions. Parsa can\nparse multiple formats at once, writing them to .txt files in the directory of\nchoice.\n\npositional arguments:\n input input file or folder; if a folder is passed as input,\n parsa will scan every file inside it recursively\n (scanning subfolders as well)\n\noptional arguments:\n -h, --help show this help message and exit\n --noprompt, -n ignore files without an extension and don't prompt the\n user to input their extension\n --output [OUTPUT], -o [OUTPUT]\n folder where the output files will be stored. The default folder is:\n (a) the input file's parent folder, if the input is a file, or\n (b) a folder named 'parsaoutput' located in the input folder, if the input is a folder.\n```\n\n# Related projects\n- [parsa-gui](https://github.com/rdimaio/parsa-gui) - Graphical version of parsa (WIP)\n- [xparsa](https://github.com/rdimaio/xparsa) - Extended parsa, enhanced with statistics about the parsed files (WIP)\n- [xparsa-gui](https://github.com/rdimaio/xparsa-gui) - GUI for xparsa (WIP)\n\n# Contributing\nPull requests are welcome! If you would like to include/remove/change a major feature, please open an issue first.\n\n# License\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://github.com/rdimaio/parsa", "keywords": "parsa", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "parsa", "package_url": "https://pypi.org/project/parsa/", "platform": "", "project_url": "https://pypi.org/project/parsa/", "project_urls": { "Homepage": "http://github.com/rdimaio/parsa" }, "release_url": "https://pypi.org/project/parsa/1.1.5/", "requires_dist": null, "requires_python": "", "summary": "A multiformat text parser", "version": "1.1.5" }, "last_serial": 4613030, "releases": { "1.1.5": [ { "comment_text": "", "digests": { "md5": "26065fb2f539792912bb6beadf2c84f9", "sha256": "443f77b3ca810d4faadad70d03a2d68b3b1928076fdcb27547e7458a2620ae07" }, "downloads": -1, "filename": "parsa-1.1.5.tar.gz", "has_sig": false, "md5_digest": "26065fb2f539792912bb6beadf2c84f9", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7211, "upload_time": "2018-12-18T17:06:29", "url": "https://files.pythonhosted.org/packages/c8/ec/2ef5d03b49665f10109934bbf095178ec6728c3a77700d4ab3f89f2a85b8/parsa-1.1.5.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "26065fb2f539792912bb6beadf2c84f9", "sha256": "443f77b3ca810d4faadad70d03a2d68b3b1928076fdcb27547e7458a2620ae07" }, "downloads": -1, "filename": "parsa-1.1.5.tar.gz", "has_sig": false, "md5_digest": "26065fb2f539792912bb6beadf2c84f9", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7211, "upload_time": "2018-12-18T17:06:29", "url": "https://files.pythonhosted.org/packages/c8/ec/2ef5d03b49665f10109934bbf095178ec6728c3a77700d4ab3f89f2a85b8/parsa-1.1.5.tar.gz" } ] }