{ "info": { "author": "Olivier Cort\u00e8s", "author_email": "contact@oliviercortes.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "License :: OSI Approved :: GNU Affero General Public License v3 or later (AGPLv3+)", "Natural Language :: English", "Operating System :: OS Independent", "Programming Language :: Python", "Programming Language :: Python :: 2.7", "Topic :: Text Processing :: Filters" ], "description": "# python-ftr\n\n\nFTR is a *partial* (re-)implementation of the [Five-Filters extractor\n](http://fivefilters.org/) in Python.\n\nIt cleans up HTML web pages and extract their content and metadata for a\nmore comfortable reading experience (or whatever you need it for). It uses\na centralized and mutualized repository of configuration files to parse\nwebsites at the most precise level possible, and fallbacks to the well-known\n`readability` automatic extractor if no configuration is found.\n\nA notable difference is that this python implementation will fetch the\nwebsite configuration from a centralized repository on the internet on the\nfly if no configuration is found locally.\n\n[Full documentation is available](http://python-ftr.readthedocs.org).\n\n\n\n\n## Differences with the FiveFilters PHP implementation\n\n`python-ftr`\u00a0:\n- has only one parser library (`lxml`) for now. The `html5lib` has not been ported yet.\n- does not convert date strings to `datetime` objects. I felt this more flexible to handle them at the upper level, giving access to custom datetime parsers. This is likely to change if I implement passing a custom parsing function to the extractor.\n- uses [readability-lxml](https://github.com/buriy/python-readability) for cleaning after non-automatic body extraction. Even if it's a port of Arc90 `readability.js` like the [PHP Readability port](https://github.com/wallabag/wallabag/blob/master/inc/3rdparty/libraries/readability/Readability.php) used by Five Filters, it could **eventually** produce different results from it given the way they compute weight on contents (I didn't compare them code-wise). \n- **does not fallback to automatic parsing when no site config is available**, but it does partially when a config is found and fails. As `python-ftr` was created to be included in a complex parsing chain, we had no need for an automatic parsing *when there is no config for current site*. See below for details. \n- has no fingerprints support. This feature looked unfinished or at least not enough documented for me to understand it in the original code.\n- does not use the `global` five-filters config file at all. The `.txt` looked unmaintained, and generic fallbacks can still be implemented outside of this module\u00a0: you can provide your own global config via an argument when using the API.\n\n\n\n## Automatic extraction\n\nIf you need fully-automatic parsing in no-config-found situations \u2014\u00a0which are easily detectable because `process()` and the low-level API raise `SiteConfigNotFound`\u00a0\u2014 just use `readability-lxml`, `breadability`, `python-goose`, `soup-strainer` or whatever fits you. \n\nIn the case of an **existing config but parsing failing** for whatever reason, we still honor `autodetect_on_failure` and try to extract at least a `title` and a `body` via `readability-lxml`.\n\nThis is not as featureful as the PHP implementation which tries to extract date, language and authors via other ways, but still better than nothing. \n\nWhen automatic extraction is used, the `ContentExtractor` instance will have a `.failures` attributes, listing exactly which non-automatic extraction(s) failed.\n\nIn the case where a config is found but it has no `site` or `body` directive (eg. automatic extraction should be explicitely used), the `.failures` attributes will not be set if automatic extraction worked. \n\n\n\n## TODO\n\nSee [issues wishlist](/1flow/python-ftr/labels/wishlist) for a dynamic todo list.\n\n\n\n## License\n\nGNU Affero GPL version 3.", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/1flow/python-ftr", "keywords": null, "license": "AGPLv3", "maintainer": null, "maintainer_email": null, "name": "ftr", "package_url": "https://pypi.org/project/ftr/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/ftr/", "project_urls": { "Download": "UNKNOWN", "Homepage": "https://github.com/1flow/python-ftr" }, "release_url": "https://pypi.org/project/ftr/0.9.3/", "requires_dist": null, "requires_python": null, "summary": "HTML Article cleaner / extractor, Five-Filters compatible.", "version": "0.9.3" }, "last_serial": 1470722, "releases": { "0.7.12": [ { "comment_text": "", "digests": { "md5": "73ea935249f5cf0f27a8b059f40c2f6c", "sha256": "98e3b627375c84a076582ef16c202ef1327ba16f020c4f1abc12ee80dd56bb95" }, "downloads": -1, "filename": "ftr-0.7.12.tar.gz", "has_sig": false, "md5_digest": "73ea935249f5cf0f27a8b059f40c2f6c", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 21875, "upload_time": "2015-03-16T08:58:53", "url": "https://files.pythonhosted.org/packages/bb/c1/63ddcd9c2fbfcbf046446a976c8994046f7e85f8805d8872947393395595/ftr-0.7.12.tar.gz" } ], "0.7.13": [ { "comment_text": "", "digests": { "md5": "095bd9f7f1fdf00781b3e9892fd5dbb6", "sha256": "bc031289e15d452a070cd5b88714403a13395999badb3e41baf5888227718b45" }, "downloads": -1, "filename": "ftr-0.7.13.tar.gz", "has_sig": false, "md5_digest": "095bd9f7f1fdf00781b3e9892fd5dbb6", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 21961, "upload_time": "2015-03-16T09:02:14", "url": "https://files.pythonhosted.org/packages/48/89/79b6ee9f7aef3c48e0f23199b49810dd4cd36b346b2384b470f6f9aee90b/ftr-0.7.13.tar.gz" } ], "0.7.14": [ { "comment_text": "", "digests": { "md5": "8d30355f09ef8bb651a5505041aff7db", "sha256": "3d6ff9ec3e7f7d986c7e9fb69f1da0d62f778a27cc0561c5809b8b811f11692a" }, "downloads": -1, "filename": "ftr-0.7.14.tar.gz", "has_sig": false, "md5_digest": "8d30355f09ef8bb651a5505041aff7db", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 22060, "upload_time": "2015-03-16T10:00:24", "url": "https://files.pythonhosted.org/packages/d3/58/ce6570ac50bc88888d93f27fd15e85286b5c2b98b57cd985fc3826bd5038/ftr-0.7.14.tar.gz" } ], "0.7.15": [ { "comment_text": "", "digests": { "md5": "c58b2fd7fa4a20307c5defb65d233fd4", "sha256": "882431ecf0d3c56e1f668e02ab01dd736dda1540a0d632b7cdad50f89162f5de" }, "downloads": -1, "filename": "ftr-0.7.15.tar.gz", "has_sig": false, "md5_digest": "c58b2fd7fa4a20307c5defb65d233fd4", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 22087, "upload_time": "2015-03-16T10:11:59", "url": "https://files.pythonhosted.org/packages/4c/4f/8e00701bfc6e8b1eac0a4c51a9d8446cbe7515cd9e757018110be13d6cf7/ftr-0.7.15.tar.gz" } ], "0.8": [ { "comment_text": "", "digests": { "md5": "c95aedd05616e87f19971b632f002bd3", "sha256": "5bc32c3ec1551e1857318730b064db46f2f77c1fdf210f8ba7627136c81d748a" }, "downloads": -1, "filename": "ftr-0.8.tar.gz", "has_sig": false, "md5_digest": "c95aedd05616e87f19971b632f002bd3", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 22442, "upload_time": "2015-03-16T11:41:37", "url": "https://files.pythonhosted.org/packages/1c/3f/4e5dbef7369a7eb6ee77cd284428dd3109da258841c4db7284ff96957708/ftr-0.8.tar.gz" } ], "0.8.1": [ { "comment_text": "", "digests": { "md5": "af172090d1cc7c543219e1f94914fb49", "sha256": "04096e00d1ac3d432906f7e30b97ab79c4c138f93f61bbb0bdfb96110cca955e" }, "downloads": -1, "filename": "ftr-0.8.1.tar.gz", "has_sig": false, "md5_digest": "af172090d1cc7c543219e1f94914fb49", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 22527, "upload_time": "2015-03-16T11:45:18", "url": "https://files.pythonhosted.org/packages/d1/d3/0488e1b027c34a4ad37aef02c3a9fbfe97febfc4cecc7a97e12cb19fda27/ftr-0.8.1.tar.gz" } ], "0.8.2": [ { "comment_text": "", "digests": { "md5": "1d6e3b12d7c29456c68397b2b7fdf160", "sha256": "a83bdfb436aae25020361233f616a8644b433eab8ce424269b77305d518f20fb" }, "downloads": -1, "filename": "ftr-0.8.2.tar.gz", "has_sig": false, "md5_digest": "1d6e3b12d7c29456c68397b2b7fdf160", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 22570, "upload_time": "2015-03-16T15:31:46", "url": "https://files.pythonhosted.org/packages/3c/8e/2b82133566e53490933dbe55e70f5e90d0744c26cf26788c9101d94df990/ftr-0.8.2.tar.gz" } ], "0.9.3": [ { "comment_text": "", "digests": { "md5": "7929013b6df4f93daa71bad2c4fcbb50", "sha256": "0d483f55533f916c83c9ce3a3e7fbdf2ef2392900136af6ee326c290f15ca956" }, "downloads": -1, "filename": "ftr-0.9.3.tar.gz", "has_sig": false, "md5_digest": "7929013b6df4f93daa71bad2c4fcbb50", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 26478, "upload_time": "2015-03-20T21:01:49", "url": "https://files.pythonhosted.org/packages/b3/af/2f17052f6bee4e04bb46098f72883327f327e155938b0997ce09888b4bf1/ftr-0.9.3.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "7929013b6df4f93daa71bad2c4fcbb50", "sha256": "0d483f55533f916c83c9ce3a3e7fbdf2ef2392900136af6ee326c290f15ca956" }, "downloads": -1, "filename": "ftr-0.9.3.tar.gz", "has_sig": false, "md5_digest": "7929013b6df4f93daa71bad2c4fcbb50", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 26478, "upload_time": "2015-03-20T21:01:49", "url": "https://files.pythonhosted.org/packages/b3/af/2f17052f6bee4e04bb46098f72883327f327e155938b0997ce09888b4bf1/ftr-0.9.3.tar.gz" } ] }