{ "info": { "author": "Ben Timby", "author_email": "btimby@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Intended Audience :: Developers", "Operating System :: OS Independent", "Programming Language :: Python", "Topic :: Software Development :: Libraries :: Python Modules", "Topic :: Text Processing :: Indexing" ], "description": ".. figure:: https://travis-ci.org/btimby/fulltext.png\n :alt: Travis CI Status\n :target: https://travis-ci.org/btimby/fulltext\n\n.. figure:: https://www.smartfile.com/assets/img/smartfile-logo-new.png\n :alt: SmartFile\n\nA `SmartFile`_ Open Source project.\n\nIntroduction\n------------\n\nFulltext extracts texts from various document formats. It can be used as the\nfirst part of search indexing, document analysis etc.\n\nFulltext differs from other libraries in that it tries to use file data in the\nform it is given. For most backends, a file-like object or path can be handled\ndirectly, removing the need to write temporary files.\n\nFulltext uses native python libraries when possible and utilizes CLI tools\nwhen necessary, for example, the following CLI tools are required.\n\n * antiword - Legacy .doc (Word) format.\n * unrtf - .rtf format.\n * pdf2text (poppler-utils) - .pdf format.\n\nSupported formats\n-----------------\n\n* csv - Uses Python ``csv`` module.\n* doc - Uses ``/bin/antiword`` CLI tool.\n* docx - Uses Python ``docx2txt`` module.\n* html - Uses Python ``BeautifulSoup`` module.\n* ods - Uses Python ``lxml``, ``zipfile`` modules.\n* odt - Uses Python ``lxml``, ``zipfile`` modules.\n* pdf - Uses ``/bin/pdf2text`` CLI tool.\n* rtf - Uses ``/bin/unrtf`` CLI tool.\n* text - Default backend that uses various Python stdlib modules to extract\n strings from arbitrary (possibly) binary files.\n* xls - Uses Python ``xlrd`` module.\n* xlsx - Uses Python ``xlrd`` module.\n* zip - Uses Python ``zipfile`` module.\n\nInstalling tools\n----------------\n\nFulltext uses a number of pure Python libraries. Fulltext also uses the\ncommand line tools: antiword, pdf2text and unrtf. To install the required\nlibraries and CLI tools, you can use your package manager.\n\n.. code:: bash\n\n $ sudo yum install antiword unrtf poppler-utils libjpeg-dev\n\nOr for debian-based systems:\n\n.. code:: bash\n\n $ sudo apt-get install antiword unrtf poppler-utils libjpeg-dev\n\n\nUsage\n-----\n\nFulltext uses a simple dictionary-style interface. A single public function\n``fulltext.get()`` is provided. This function takes an optional default\nparameter which when supplied will supress errors and return that default if\ntext could not be extracted.\n\n.. code:: python\n\n > import fulltext\n > fulltext.get('does-not-exist.pdf', '< no content >')\n '< no content >'\n > fulltext.get('exists.pdf', '< no content >'')\n 'Lorem ipsum...'\n\nYou can pass a file-like object or a path to ``.get()`` Fulltext will try to\ndo the right thing, using memory buffers or temp files depending on the\nbackend.\n\n\nCustom backends\n---------------\n\nTo write a new backend, you need to do two things. First, create a python\nmodule that implements the interface that Fulltext expects. Second, define an\nenvironment variable that informs Fulltext where to find your module.\n\n.. code:: python\n\n # Tell Fulltext what file extensions your backend supports.\n EXTENSIONS = ('foo', 'bar')\n\n\n def _get_file(f, **kwargs):\n # Extract text from a file-like object. This should be defined when\n # possible.\n pass\n\n\n def _get_path(path, **kwargs):\n # Extract text from a path. This should only be defined if it can be\n # done more efficiently than having Python open() and read() the file,\n # passing it to _get_file().\n pass\n\nIf you only implement ``_get_file`` Fulltext will open any paths and pass them\nto that function. Therefore if possible, define at least this function. If\nworking with file-like objects is not possible and you only define\n``_get_path`` then Fulltext will save any file-like objects to a temporary\nfile and use that function. Sometimes it is advantageous to define both\nfunctions in cases when you can do each efficiently.\n\nIf you have questions about writing a backend, see the `backends/`_ directory\nfor some examples.\n\n.. _backends/: fulltext/backends/\n\nOnce written, simply define an environment variable ``FULLTEXT_PATH`` to\ncontain paths to your backend modules.\n\n.. code:: bash\n\n FULLTEXT_PATH=/path/to/my/module;/path/to/other/module python myprogram.py", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://github.com/btimby/fulltext/", "keywords": "", "license": "GPLv3", "maintainer": "", "maintainer_email": "", "name": "fulltext", "package_url": "https://pypi.org/project/fulltext/", "platform": "", "project_url": "https://pypi.org/project/fulltext/", "project_urls": { "Homepage": "http://github.com/btimby/fulltext/" }, "release_url": "https://pypi.org/project/fulltext/0.5/", "requires_dist": null, "requires_python": "", "summary": "Convert binary files to plain text for indexing.", "version": "0.5" }, "last_serial": 3254402, "releases": { "0.3-1": [], "0.3-2": [], "0.4-1": [ { "comment_text": "", "digests": { "md5": "d4c3b69aa560f74167f4f82458af6625", "sha256": "9f4432dc8e25e12b440a334da55c09bbbc6bf765bcac3897de8b6e0463b6ae07" }, "downloads": -1, "filename": "fulltext-0.4-1.tar.gz", "has_sig": false, "md5_digest": "d4c3b69aa560f74167f4f82458af6625", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6551, "upload_time": "2013-02-15T16:02:35", "url": "https://files.pythonhosted.org/packages/b7/0f/6321bcff52b261542cef09ccee9e45e322e1c2d93c88efa7905a8b32a559/fulltext-0.4-1.tar.gz" } ], "0.4.post1": [ { "comment_text": "", "digests": { "md5": "714fcedc939f8624fd41554b78ab7595", "sha256": "8741a20c0120fda20224d23182cd2165a2ef8b35d8da0323515845a37cd53bb8" }, "downloads": -1, "filename": "fulltext-0.4.post1.tar.gz", "has_sig": false, "md5_digest": "714fcedc939f8624fd41554b78ab7595", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6939, "upload_time": "2017-10-16T18:25:05", "url": "https://files.pythonhosted.org/packages/98/ef/cad95623d47ed12ef98410717f8d4d73fe414d387e96e636c8ecd6fbd0f2/fulltext-0.4.post1.tar.gz" } ], "0.5": [ { "comment_text": "", "digests": { "md5": "5cbfe51d60324a1902ba4ab5d7fd6604", "sha256": "9e903570f80dfbbab763245d3f18dc1f16af975f0eda280d8138b7a7bccde21b" }, "downloads": -1, "filename": "fulltext-0.5.tar.gz", "has_sig": false, "md5_digest": "5cbfe51d60324a1902ba4ab5d7fd6604", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8029, "upload_time": "2017-10-16T18:25:06", "url": "https://files.pythonhosted.org/packages/ed/f6/291a9f44fb048ef2d78822490213394d74fbbcc4b1c316b1763c9acf5740/fulltext-0.5.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "5cbfe51d60324a1902ba4ab5d7fd6604", "sha256": "9e903570f80dfbbab763245d3f18dc1f16af975f0eda280d8138b7a7bccde21b" }, "downloads": -1, "filename": "fulltext-0.5.tar.gz", "has_sig": false, "md5_digest": "5cbfe51d60324a1902ba4ab5d7fd6604", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8029, "upload_time": "2017-10-16T18:25:06", "url": "https://files.pythonhosted.org/packages/ed/f6/291a9f44fb048ef2d78822490213394d74fbbcc4b1c316b1763c9acf5740/fulltext-0.5.tar.gz" } ] }