{ "info": { "author": "Roberto Ferro", "author_email": "roberto.ferro@ikdev.eu", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 2.7" ], "description": "#Document Parser\r\n>A simple CLI tool that allow to extract all text contained into a document.\r\n\r\n## Installation\r\nExecute the followings command to before installing DocumentParser\r\n\r\n#### Debian/Ubuntu \r\n* sudo apt-get update\r\n* sudo apt-get install build-essential libpoppler-cpp-dev pkg-config python-dev\r\n* apt-get install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr \\\r\nflac ffmpeg lame libmad0 libso-fmt-mp3 sox libjpeg-dev swigx\r\n* pip install docparser\r\n\r\n#### MacOSx\r\n* brew install pkg-config poppler\r\n* brew cask install xquartz\r\n* brew install poppler antiword unrtf tesseract swig\r\n\r\n#### Fedora / CentOS\r\nBefore you start you've to know that there's no a quickly way to install DocParser in a Fedora based system.\r\nThis is caused by some missing dependences. This can be the hardest way, but in the end you'll be proud of yourself XD.\r\n\r\n* yum -y update\r\n* yum install python-pip\r\n\r\n>Required by the .docx parser which uses lxml via python-docx.\r\n\r\n* yum install libxml2 libxslt-devel libxml2-devel\r\n\r\n>Required by the .docx parser which users lxml via python-docx.\r\n\r\n* yum install libxslt\r\n\r\n>Required by the .doc and .ps parser.\r\n\r\n* wget https://forensics.cert.org/cert-forensics-tools-release-el7.rpm\r\n* rpm -Uvh cert-forensics-tools-release*rpm\r\n* yum --enablerepo=forensics install antiword\r\n* yum --enablerepo=forensics install pstotext\r\n\r\n>Require by .pdf parser\r\n\r\n*yum install poppler-utils\r\n\r\n>Requred by .jpg, .png, gif parser\r\n\r\n* cd /opt\r\n\r\n* yum -y install libstdc++ autoconf automake libtool autoconf-archive pkg-config gcc gcc-c++ make libjpeg-devel libpng-devel libtiff-devel zlib-devel\r\n\r\n>Install AutoConf-Archive\r\n\r\n* wget ftp://mirror.switch.ch/pool/4/mirror/epel/7/ppc64/a/autoconf-archive-2016.09.16-1.el7.noarch.rpm\r\n* rpm -i autoconf-archive-2016.09.16-1.el7.noarch.rpm\r\n\r\n>Install Leptonica from Source\r\n\r\n* wget http://www.leptonica.com/source/leptonica-1.75.3.tar.gz\r\n* tar -zxvf leptonica-1.75.3.tar.gz\r\n* cd leptonica-1.75.3\r\n* ./autobuild\r\n* ./configure\r\n* make\r\n* make install\r\n* cd ..\r\n\r\n>Install Tesseract from Source\r\n\r\n* wget https://github.com/tesseract-ocr/tesseract/archive/3.05.01.tar.gz\r\n* tar -zxvf 3.05.01.tar.gz\r\n* cd tesseract-3.05.01\r\n* ./autogen.sh\r\n* PKG_CONFIG_PATH=/usr/local/lib/pkgconfig LIBLEPT_HEADERSDIR=/usr/local/include ./configure --with-extra-includes=/usr/local/include --with-extra-libraries=/usr/local/lib\r\n* LDFLAGS=\"-L/usr/local/lib\" CFLAGS=\"-I/usr/local/include\" make\r\n* make install\r\n* ldconfig\r\n* cd ..\r\n\r\n>Download and install tesseract language files\r\n\r\n* wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/ben.traineddata\r\n* wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/eng.traineddata\r\n* wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.traineddata\r\n* wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/tha.traineddata\r\n* wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/osd.traineddata\r\n* mv *.traineddata /usr/local/share/tessdata\r\n\r\n>Download Hindi Cube data\r\n\r\n* wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.cube.bigrams\r\n* wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.cube.fold\r\n* wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.cube.lm\r\n* wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.cube.nn\r\n* wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.cube.params\r\n* wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.cube.word-freq\r\n* wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.tesseract_cube.nn\r\n* mv hin.* /usr/local/share/tessdata\r\n\r\n* ln -s /opt/tesseract-3.05.01 /opt/tesseract-latest\r\n\r\n>Required by .mp3 and .ogg parser\r\n\r\n* yum install sox\r\n* rm cert-forensics-tools-release-el7.rpm\r\n\r\n>Install textract without unsupported features\r\n\r\n* git clone https://github.com/deanmalmgren/textract.git\r\n* rm textract/requirements/python && cp requirements/textract/python textract/requirements/python\r\n* cd textract && chmod +x setup.py\r\n* python setup.py install\r\n\r\n* yum install gcc-c++ pkgconfig poppler-cpp-devel python-devel redhat-rpm-config\r\n\r\n\r\n\r\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/RobyFerro/DocumentParser", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "documentparser", "package_url": "https://pypi.org/project/documentparser/", "platform": "", "project_url": "https://pypi.org/project/documentparser/", "project_urls": { "Homepage": "https://github.com/RobyFerro/DocumentParser" }, "release_url": "https://pypi.org/project/documentparser/1.0a1/", "requires_dist": null, "requires_python": "", "summary": "A simple CLI tool that allow to extract all text contained into a document.", "version": "1.0a1" }, "last_serial": 3944933, "releases": { "1.0a1": [ { "comment_text": "", "digests": { "md5": "b113aa4daf6e8fd47b15b27e1cfeb1b6", "sha256": "93137d65ee7193d4b8e5704dc4438ad31b9ec2a9954b5b648e9c8f1adcdc6f12" }, "downloads": -1, "filename": "documentparser-1.0a1-py2-none-any.whl", "has_sig": false, "md5_digest": "b113aa4daf6e8fd47b15b27e1cfeb1b6", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": null, "size": 3499, "upload_time": "2018-06-09T09:11:38", "url": "https://files.pythonhosted.org/packages/e3/d7/6290a56f1b37f90194b965fdd00dbdd0ecfaaf3779bc890ef58b57dc8888/documentparser-1.0a1-py2-none-any.whl" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "b113aa4daf6e8fd47b15b27e1cfeb1b6", "sha256": "93137d65ee7193d4b8e5704dc4438ad31b9ec2a9954b5b648e9c8f1adcdc6f12" }, "downloads": -1, "filename": "documentparser-1.0a1-py2-none-any.whl", "has_sig": false, "md5_digest": "b113aa4daf6e8fd47b15b27e1cfeb1b6", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": null, "size": 3499, "upload_time": "2018-06-09T09:11:38", "url": "https://files.pythonhosted.org/packages/e3/d7/6290a56f1b37f90194b965fdd00dbdd0ecfaaf3779bc890ef58b57dc8888/documentparser-1.0a1-py2-none-any.whl" } ] }