{ "info": { "author": "Rodrigo Palacios", "author_email": "rodrigopala91@gmail.com", "bugtrack_url": null, "classifiers": [], "description": "eatiht\r\n======\r\n\r\nA Python package for **e**\\ xtracting **a**\\ rticle **t**\\ ext **i**\\ n\r\n**ht**\\ ml documents. Check out the new twitter-bootstrap-ready\r\n`demo `__\r\nproduced by the new extraction algorithm!\r\n\r\nLatest News\r\n~~~~~~~~~~~\r\n\r\nCheck out my latest project: `autocomplete - a kid and adult friendly\r\nexercise in machine\r\nlearning `__\r\n\r\nI'm collaborating with `Tim Weninger `__\r\non a must-read data-driven opinion piece (publication date is TBA). I\r\nbenchmarked Eatiht and many more content extractors; you can follow the\r\n`current work\r\nhere! `__.\r\n\r\nRead `Matthew Peters's `__ article that\r\nbenchmarked Eatiht, along with a few other content extractors written in\r\nPython.\r\n\r\ntl;dr: Eatiht's etv2 is fast, but not so accurate (my own research\r\nsuggests that the original algo is more reliable).\r\n\r\nCheck out eatiht's `website `__\r\nwhere I walk through each step in the original algorithm!\r\n\r\nFollow me on `twitter `__ :)\r\n\r\nWhat people have been saying\r\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~\r\n\r\n*You should write a paper on this work* -\r\n`/u/queue\\_cumber `__\r\n\r\n*This is neat-o. A short and sweet project...* -\r\n`/u/CandyCorns\\_ `__\r\n\r\n*This is both useful and shows a simple use case for data mining for the\r\ngeneral population - an outreach of sorts.* -\r\n`/u/tweninger `__\r\n\r\nAt a Glance\r\n-----------\r\n\r\nTo install:\r\n^^^^^^^^^^^\r\n\r\n.. 
code:: bash\r\n\r\n pip install eatiht\r\n ...\r\n easy_install eatiht\r\n\r\nNote: On Windows, you may need to install lxml manually using: pip\r\ninstall lxml\r\n\r\nUsing in Python\r\n^^^^^^^^^^^^^^^\r\n\r\nCurrently, there are two new submodules:\r\n\r\n- etv2.py - class-based approach\r\n\r\n- v2.py - script-like approach\r\n\r\nAs `requested `__,\r\netv2.extract will extract not only the text, but also the parent\r\nelement's html:\r\n\r\n.. code:: python\r\n\r\n import eatiht.etv2 as etv2\r\n\r\n url = \"http://sputniknews.com/middleeast/20141225/1016239222.html\"\r\n\r\n tree = etv2.extract(url)\r\n\r\n # we know what this does...\r\n # print tree.get_text()\r\n\r\n # add necessary link tags to bootstrap cdn, center content, etc.\r\n tree.bootstrapify()\r\n\r\n print tree.get_html_string()\r\n\r\nOutput:\r\n\r\n::\r\n\r\n Syrian Army Kills Nearly 5,000 IS Militants in Three Months: Source / Sputnik International\r\n \r\n

Syrian Army Kills Nearly 5,000 IS Militants in Three Months: Source / Sputnik International

...\r\n\r\nNow what does that look like when rendered?\r\n\r\n`With\r\nbootstrap `__\r\n\r\n`Without `__\r\n\r\netv2 uses classes defined in\r\n`eatiht\\_trees.py `__\r\nto construct what is sometimes known as the \"state space\" in the world\r\nof AI. But instead of only keeping track of averages and totals - as is\r\nrequired for the algorithm - the \"state\" class\r\n`TextNodeSubTree `__\r\nalso keeps a reference to the original lxml.html element from which it\r\ncame.\r\n\r\nYou can access the original, extracted html elements like this:\r\n\r\n.. code:: python\r\n\r\n subtrees = tree.get_subtrees()\r\n\r\n first_subtree = subtrees[0]\r\n\r\n first_subtree.get_html()\r\n # \r\n\r\n first_subtree.get_html().tag\r\n # 'div'\r\n\r\nPlease refer to\r\n`eatiht\\_trees.py `__\r\nfor more info on what functions are available for you to use.\r\n\r\nv2 is functionally identical to the original eatiht:\r\n\r\n.. code:: python\r\n\r\n import eatiht.v2 as v2\r\n\r\n url = 'http://www.washingtonpost.com/blogs/the-switch/wp/2014/12/26/elon-musk-the-new-tesla-roadster-can-travel-some-400-miles-on-a-single-charge/'\r\n\r\n print v2.extract(url)\r\n\r\nOutput:\r\n\r\n::\r\n\r\n Car nerds, you just got an extra present under the tree.\r\n\r\n Tesla announced Friday an upgrade for its Roadster, the electric car company\u2019s convertible model,\r\n and said that the new features significantly boost its range -- beyond what many traditional cars\r\n can get on a tank of gasoline.\r\n\r\nv2 contains one extra function that executes the extraction algorithm,\r\nbut along with returning the text, it also returns the structures that\r\nwere used to calculate the output (i.e. histogram, list of xpaths, etc.):\r\n\r\n.. 
code:: python\r\n\r\n results = v2.extract_more(url)\r\n\r\n results[0] # extracted text\r\n results[1] # frequency distribution (histogram)\r\n results[2] # subtrees (list of textnodes pre-filter)\r\n results[3] # pruned subtrees\r\n results[4] # list of paragraphs (as separated in original website)\r\n\r\nWhether or not this function's output looks messy is up for debate;\r\nI personally think it looks messy, and it is difficult to remember which\r\nindex leads to what.\r\n\r\nI suggest using this module if you simply want the extracted text.\r\n\r\nAnd of course, there is the original:\r\n\r\n.. code:: python\r\n\r\n # from initial release\r\n import eatiht\r\n\r\n url = 'http://news.yahoo.com/curiosity-rover-drills-mars-rock-finds-water-122321635.html'\r\n\r\n print eatiht.extract(url)\r\n\r\nOutput\r\n''''''\r\n\r\n::\r\n\r\n NASA's Curiosity rover is continuing to help scientists piece together the mystery of how Mars\r\n lost its surface water over the course of billions of years. The rover drilled into a piece of\r\n Martian rock called Cumberland and found some ancient water hidden within it...\r\n\r\nUsing as a command line tool:\r\n^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\r\n\r\n.. 
code:: bash\r\n\r\n eatiht http://news.yahoo.com/curiosity-rover-drills-mars-rock-finds-water-122321635.html >> out.txt\r\n\r\nNote: Windows users may have to add the C:directory to their\r\n`\"path\" `__ so that the\r\ncommand line tool works from any directory, not only the ..directory.\r\n\r\nRequirements\r\n------------\r\n\r\n::\r\n\r\n lxml\r\n *requests, as of v0.1.0, is no longer required\r\n\r\nMotivation\r\n----------\r\n\r\nAfter searching through the deepest crevices of the internet for some\r\ntool\\|library\\|module that could effectively extract the main content\r\nfrom a website (ignoring text from ads, sidebar links, etc.), I was\r\nslightly disheartened by the apparent ambiguity caused by this\r\ncontent-extraction problem.\r\n\r\nMy survey resulted in some of the following solutions:\r\n\r\n- `boilerpipe `__ - *Boilerplate\r\n Removal and Fulltext Extraction from HTML pages*. Java library\r\n written by Christian Kohlsch\u00fctter\r\n- `\"The Easy Way to Extract Useful Text from Arbitrary\r\n HTML\" `__\r\n - a Python tutorial on implementing a neural network for html content\r\n extraction. Written by alexjc\r\n- `Pyteaser's Cleaners\r\n module `__\r\n - from what I can tell, it's a purely heuristic-based process\r\n- `\"Text Extraction from the Web via Text-to-Tag\r\n Ratio\" `__ - a\r\n thesis on Text-to-Tag-heuristic driven clustering as a solution for\r\n the problem at hand. Written by Tim Weninger & William H. Hsu\r\n\r\nThe number of research papers I found on the subject largely outweighs\r\nthe number of available open-source projects. This is my attempt at\r\nbalancing out the disparity.\r\n\r\nIn the process of coming up with a solution, I made two unoriginal\r\nobservations:\r\n\r\n1. XPath's select all (//), parent node (..) queries and functions\r\n ('string-length') are remarkably powerful when used together\r\n2. 
Unnecessary machine learning is unnecessary\r\n\r\nBy making a (trivial) assumption about sentence length, one can\r\nquery for text-nodes satisfying said sentence length, then create a\r\nfrequency distribution (histogram) across the parent-nodes; the\r\nargmax of the resulting distribution is the xpath that is shared amongst\r\nlikely sentences.\r\n\r\nThe results were surprisingly good. I personally prefer this approach to\r\nthe others as it seems to lie somewhere in between the purely rule-based\r\nand the drowning-in-ML approaches.\r\n\r\nIssues or Contact\r\n-----------------\r\n\r\nPlease raise any `issues `__\r\nor yell at me at rodrigopala91@gmail.com or\r\n`@rodricios <https://twitter.com/rodricios>`__\r\n\r\nTests\r\n-----\r\n\r\nCurrently, the tests are lacking. But please still run these tests to\r\nensure that modifications to eatiht.py, v2.py, and etv2.py run properly.\r\n\r\n.. code:: bash\r\n\r\n python setup.py test\r\n\r\nTODO:\r\n-----\r\n\r\n- [STRIKEOUT:HTML-and-text extraction]\r\n- etv2 command line scripts\r\n- [STRIKEOUT:etv2.py tests]\r\n- improve filtering\\|pruning step so that taglines from articles get\r\n dropped\r\n\r\n - if and only if tagline has a reference image, don't prune\r\n\r\n- add some template engine (see \"bootstrapify\" function for current\r\n state)\r\n\r\n", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/rodricios/eatiht", "keywords": "extract extracted extracting extraction filter filtered filtering out remove removed removing removal text textbody body content contents sentence sentences word words boilerplate boilerpipe delete tag tags unsupervised classification machine learning algorithm opening closing main article html hypertext Rodrigo Palacios rodrigo palacios im-rodrigo im_rodrigo rodricios", "license": "MIT", "maintainer": null, "maintainer_email": null, "name": 
"eatiht", "package_url": "https://pypi.org/project/eatiht/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/eatiht/", "project_urls": { "Download": "UNKNOWN", "Homepage": "https://github.com/rodricios/eatiht" }, "release_url": "https://pypi.org/project/eatiht/0.1.14/", "requires_dist": null, "requires_python": null, "summary": "A simple tool used to extract an article's text in html documents.", "version": "0.1.14" }, "last_serial": 1481085, "releases": { "0.0.10": [ { "comment_text": "", "digests": { "md5": "d45b08e0f9c6a02465bda93788f16820", "sha256": "79de69ea53aa71cef653fb7962a8f7d160b0723691021162fc34bade3bdb0d6b" }, "downloads": -1, "filename": "eatiht-0.0.10.zip", "has_sig": false, "md5_digest": "d45b08e0f9c6a02465bda93788f16820", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 9062, "upload_time": "2014-12-21T11:54:25", "url": "https://files.pythonhosted.org/packages/9d/e4/c65bff5f373923708dbd5a28e35a8f7a11c3fb26d3da8603b1d8fafb8f83/eatiht-0.0.10.zip" } ], "0.0.3": [ { "comment_text": "", "digests": { "md5": "6979ea9f46ae4b13d727ea07c7d2063f", "sha256": "867dcfee9c88c58b891dd0fa6e2fe64d7a9b8593af05f386c17d04a7d57cf416" }, "downloads": -1, "filename": "eatiht-0.0.3.zip", "has_sig": false, "md5_digest": "6979ea9f46ae4b13d727ea07c7d2063f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4991, "upload_time": "2014-12-18T11:59:59", "url": "https://files.pythonhosted.org/packages/6e/22/2bf7f58c9064ed206cb81cbd41e5c37aa50e3d26f716bbba363c4e7dd1b5/eatiht-0.0.3.zip" } ], "0.0.4": [ { "comment_text": "", "digests": { "md5": "43a6119220e4ceb214cf6f957594cb11", "sha256": "4084d927ec1bd4f4013a28b93cfe64bd5c1ca9664cbdfa34827006a818b8f5a3" }, "downloads": -1, "filename": "eatiht-0.0.4.zip", "has_sig": false, "md5_digest": "43a6119220e4ceb214cf6f957594cb11", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4984, "upload_time": 
"2014-12-18T12:06:56", "url": "https://files.pythonhosted.org/packages/93/c9/a0ce53611424595086d1e640c50b0ff7b15280db7c9acd4689317a1510a9/eatiht-0.0.4.zip" } ], "0.0.5": [ { "comment_text": "", "digests": { "md5": "e7e59dd46f5c268eac7176a6d783bf47", "sha256": "b5dc5b9675dbefdc233b6631137bf033d0679c68b55ab4eedf1efe0a255327b0" }, "downloads": -1, "filename": "eatiht-0.0.5.zip", "has_sig": false, "md5_digest": "e7e59dd46f5c268eac7176a6d783bf47", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4969, "upload_time": "2014-12-18T13:59:23", "url": "https://files.pythonhosted.org/packages/5b/d0/6276b61b36003512683ef7a9442eb4bbc20c6b7dc871973daeb25eebbf4c/eatiht-0.0.5.zip" } ], "0.0.6": [ { "comment_text": "", "digests": { "md5": "64514420c171e2ae21993029914b6ea7", "sha256": "f20c433ba7c39e1d5086b3714c1c20b4b0c01abd2af69f882e99b0fc37e7e1a2" }, "downloads": -1, "filename": "eatiht-0.0.6.zip", "has_sig": false, "md5_digest": "64514420c171e2ae21993029914b6ea7", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4969, "upload_time": "2014-12-18T14:05:41", "url": "https://files.pythonhosted.org/packages/9c/49/bbd604e19917bd6ef15b4f057030b0ec49d6198722ce0969646c9f0d9c40/eatiht-0.0.6.zip" } ], "0.0.7": [ { "comment_text": "", "digests": { "md5": "52868e8d55dadcaf761f6cf2ca458452", "sha256": "1b5aefa4ffeced8fc318f1a0a254d64f61e0d8cb13710473f1a5f919849794a5" }, "downloads": -1, "filename": "eatiht-0.0.7.zip", "has_sig": false, "md5_digest": "52868e8d55dadcaf761f6cf2ca458452", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5487, "upload_time": "2014-12-19T04:31:19", "url": "https://files.pythonhosted.org/packages/8e/ac/9bb9ce0432f24579c08ca6a651985e9e55ffc797e43da5e78db3925dd679/eatiht-0.0.7.zip" } ], "0.0.8": [ { "comment_text": "", "digests": { "md5": "ef72cc8412e5e317167891e1c394fcd1", "sha256": "15e6d143af976d653211ae37de58aca98b6dddbb7f8ce0ea609df8092ba178bf" }, 
"downloads": -1, "filename": "eatiht-0.0.8.zip", "has_sig": false, "md5_digest": "ef72cc8412e5e317167891e1c394fcd1", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8356, "upload_time": "2014-12-19T15:25:06", "url": "https://files.pythonhosted.org/packages/68/5c/ab5b2f5cbc0202fa08ab9234f250a743222c4e0af6861136d836f64385e2/eatiht-0.0.8.zip" } ], "0.0.9": [ { "comment_text": "", "digests": { "md5": "dbc668c31dd3a1cc19168071ace3082b", "sha256": "4164a18081f8830c21f3fae01657a0dbacfd86bb1ca7fb0f4d4e62e97f6dc2b5" }, "downloads": -1, "filename": "eatiht-0.0.9.zip", "has_sig": false, "md5_digest": "dbc668c31dd3a1cc19168071ace3082b", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8528, "upload_time": "2014-12-19T23:35:31", "url": "https://files.pythonhosted.org/packages/ac/d1/9392a508268462fed1a5c9a4f96e83d87f7442778ac090133457cd04071d/eatiht-0.0.9.zip" } ], "0.1.0": [ { "comment_text": "", "digests": { "md5": "353da498e7f19da4d3d60bed152fe48a", "sha256": "9008e899137aec2f7be5de5397322322b4707220d3dc8a0456311c18accdde05" }, "downloads": -1, "filename": "eatiht-0.1.0.zip", "has_sig": false, "md5_digest": "353da498e7f19da4d3d60bed152fe48a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 135052, "upload_time": "2014-12-27T06:24:52", "url": "https://files.pythonhosted.org/packages/36/b5/62883675689e2313f45b0eed6a3477da8b30b2466634515e8a7accfd4000/eatiht-0.1.0.zip" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "8f12d203890f7ab1c423b864ea4eb833", "sha256": "5968435c4f0c4f05c04f8ba44b532706b7cb7c80799bdb0874163abfd3c647d5" }, "downloads": -1, "filename": "eatiht-0.1.1.zip", "has_sig": false, "md5_digest": "8f12d203890f7ab1c423b864ea4eb833", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 135905, "upload_time": "2014-12-27T13:28:02", "url": 
"https://files.pythonhosted.org/packages/0c/d0/ce7df187d733463e3eaa83ada1f7c67a8fe988d92383cd27dbf872639629/eatiht-0.1.1.zip" } ], "0.1.11": [ { "comment_text": "", "digests": { "md5": "111c1cc398ba73c922f7d1eabaf249b6", "sha256": "172a82c27a04d2474eef97e84d561e6732e9e60477fed8735e7144eefc71b663" }, "downloads": -1, "filename": "eatiht-0.1.11.zip", "has_sig": false, "md5_digest": "111c1cc398ba73c922f7d1eabaf249b6", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 136640, "upload_time": "2014-12-28T12:36:17", "url": "https://files.pythonhosted.org/packages/12/b7/456e0bb1c8f39eaca42bba62f395cf1f0b0a50184ac22c7497ebd591eb1a/eatiht-0.1.11.zip" } ], "0.1.12": [ { "comment_text": "", "digests": { "md5": "28bc85435af95850b4d51d41b9d5ffaf", "sha256": "3d74d2c623a5fc80db74a84a065d9f5e107bf1070736718717b9b73b5f8b8d06" }, "downloads": -1, "filename": "eatiht-0.1.12.zip", "has_sig": false, "md5_digest": "28bc85435af95850b4d51d41b9d5ffaf", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 136667, "upload_time": "2014-12-31T03:50:16", "url": "https://files.pythonhosted.org/packages/7b/93/6ffa800cc54c512a6c8010b1dbcdc5508fd44f6701ea8633bc78b3f8eb89/eatiht-0.1.12.zip" } ], "0.1.13": [ { "comment_text": "", "digests": { "md5": "f6c0e86ea6bd90210373e9c83fae15d1", "sha256": "30e1a6a915a732d67e25604e70e3210bf7701b146ad119e52080505deec1fa0a" }, "downloads": -1, "filename": "eatiht-0.1.13.zip", "has_sig": false, "md5_digest": "f6c0e86ea6bd90210373e9c83fae15d1", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 137013, "upload_time": "2015-03-13T08:56:41", "url": "https://files.pythonhosted.org/packages/fe/97/f40de046e0c59743f1aaa8086dde7d96bb7cfb62edcfc7e7b24a7b75994d/eatiht-0.1.13.zip" } ], "0.1.14": [ { "comment_text": "", "digests": { "md5": "8ce80a1a59458534caada1a10533d80f", "sha256": "fe47b5b4e14fec2848126d9cbf3c6f8c4601118cece3b909e5669af6d9cc336a" }, "downloads": -1, 
"filename": "eatiht-0.1.14.zip", "has_sig": false, "md5_digest": "8ce80a1a59458534caada1a10533d80f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 137309, "upload_time": "2015-03-28T08:28:46", "url": "https://files.pythonhosted.org/packages/09/3d/c0dbdb054d5a097fcef58dd0ecb590d71e0a10ce7aaeff16402d06ee2267/eatiht-0.1.14.zip" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "8ce80a1a59458534caada1a10533d80f", "sha256": "fe47b5b4e14fec2848126d9cbf3c6f8c4601118cece3b909e5669af6d9cc336a" }, "downloads": -1, "filename": "eatiht-0.1.14.zip", "has_sig": false, "md5_digest": "8ce80a1a59458534caada1a10533d80f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 137309, "upload_time": "2015-03-28T08:28:46", "url": "https://files.pythonhosted.org/packages/09/3d/c0dbdb054d5a097fcef58dd0ecb590d71e0a10ce7aaeff16402d06ee2267/eatiht-0.1.14.zip" } ] }