{
    "info": {
        "author": "Jeroen Janssens",
        "author_email": "jeroen.janssens@visualrevenue.com",
        "bugtrack_url": null,
        "classifiers": [],
        "description": "Reporter\n========\n\nFlexible text extraction from HTML in Python.\n\nIn short, Reporter:\n\n-\tExtracts the main text from HTML.\n-\tUses a white-box scoring algorithm to determine the main text container.\n-\tCan easily be extended.\n-\tSupports Unicode without pain.\n-\tHas awesome debugging facilities.\n\n\nBackground\n----------\nReporter is being developed at [Visual Revenue, Inc.](http://www.visualrevenue.com) where it is used to extract the main text from news articles. \nThe name Reporter and internal terms are inspired by the news domain.\n\nUsage\n-----\n\nReporter can be invoked from the command line:\n\n\t$ ./reporter.py --url URL\n\nThe HTML from URL will be parsed by [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) and the main\ntext will be printed on stdout. If the **--debug** flag is added, the text\nand HTML will be saved to file. The HTML will be styled as follows.\nEach tag will get a background color based on its score, ranging from\nred (low score) to green (high score). Moreover, the tag that is\nselected as news container (see below) will have a blue dashed line.\n\nIf the **--test** flag is given, all files in ./test/input will be\nprocessed, and the text and HTML will be saved, as in **--debug**. This is\nuseful for processing many local files, so that these only have to\ndownloaded once. \n\nPlease see **./reporter.py --help** for more options.\n\nReporter can also be used from Python:\n\n\tmy_reporter = Reporter()\n\tmy_reporter.read(url='http://example.com')\n\tprint my_reporter.report_news()\n\nPlease note that Reporter is not (yet) a Python package.\n\n\nScoring algorithm\n-----------------\n\nTo extract the main text from an HTML document, Reporter gives each HTML\ntag (e.g, DIV, H1, and P) a score. 
The text contained in the tag with\nthe highest score is returned as the main text of the news article.\n\nThe main part of the scoring algorithm is based on traversing the\nparsed HTML and works as follows. Reporter traverses the HTML in\nreverse order, i.e., it starts at the leaves of the DOM tree. Each tag\nis scored either as a paragraph or as a container. A tag is considered\nto be a paragraph (in the abstract sense, not in the P sense) when it\ncontains more than 10 characters\\*; otherwise it is considered to be a\ncontainer. The exact scoring of a tag is defined in the **Autocue**. An\nAutocue is a list of scoring rules that get triggered at various\nstages. For example, when a tag is to be scored as a paragraph, one\nrule may count the number of words and return 2 points per word. Once\na tag and its siblings are scored, its parent is scored. If the\nparent is also considered to be a paragraph, which happens, for\ninstance, with the P tag in: \n\n\t<DIV><P>Hello World, this is the <B>Reporter package</B></P></DIV>\n\nthe scores of the B tag are discarded and the complete text is re-scored. The DIV tag is scored as a container because (in this case) it contains no text by itself. In fact, there is\nan important scoring rule which penalises containers. If such a rule\nwere not included, the HTML tag would always receive the highest score,\nwhich would not be very effective.\n\n_\\*) Currently, this is the only heuristic that is hard-coded. In\n[Readability](https://github.com/gfxmonk/python-readability), which served as the inspiration for Reporter, all scoring is hard-coded._\n\nAs mentioned, a scoring rule is triggered at a certain stage as\nReporter is processing the Autocue. Below, we list and explain the\nseven triggers with Python code. (The complete default Autocue is in\n**autocues.py**, which is easily extensible with additional rules.)\n\n\n- **HTML**, operates on the raw HTML. 
For example, split a paragraph at two consecutive line breaks into two paragraphs:\n\n\t\tdefault_autocue.append((RegExReplacer(pattern='<br */? *>[\\\\r\\\\n]*<br */? *>', repl='</p><p>'), HTML))\n\n- **PRE\\_TRAVERSAL**, scores or prunes (deletes) tags before the DOM is traversed. This is useful for getting rid of specific tags such as footers, or for giving positive scores to certain tags. For example, delete all comments (specific to a certain news property):\n\n\t\tdefault_autocue.append((CSSSelector(\"div#comments\", Pruner()), PRE_TRAVERSAL))\n\nNow, the HTML will be traversed as explained above.\n\n- **EVAL\\_PARAGRAPH**, scores a tag as a paragraph. For example, by counting words:\n\n\t\tdefault_autocue.append((Scorer(RegExMatcher(\"(\\w)+(['`]\\w)?\", factor=2, name=\"word\"), reset_children=True), EVAL_PARAGRAPH))\n\n- **EVAL\\_CONTAINER**, scores a tag as a container. For example, combining the scores of the child tags with a 70-point penalty, giving a minimal score of 0:\n\n\t\tdefault_autocue.append((ScoreAggregator(start_score=-70, vmin=0), EVAL_CONTAINER))\n\nThis concludes the traversing of the HTML.\n\n- **POST\\_TRAVERSAL**, scores or prunes tags after Reporter has traversed the HTML. \n\nThe tag with the highest score is selected as news container.\n\n- **NEWS\\_CONTAINER** is like POST\\_TRAVERSAL but only applies to the tag that is selected as news container.\n\n  Example: penalize DIVs inside the news container:\n\n\t\tdefault_autocue.append((CSSSelector(\"div\", Scorer(FixedValue(-60))), NEWS_CONTAINER))\n\n  Example: get rid of any tags that have a score below -50:\n\n\t\tdefault_autocue.append((ScoreSelector(threshold=-50, mode=\"upper\", actor=Pruner()), NEWS_CONTAINER))\n\n- **NEWS\\_TEXT**, operates on the text inside the news container. 
For example, put all text on one line:\n\n\t\tdefault_autocue.append((RegExReplacer(pattern='\\s+', repl=' '), NEWS_TEXT))\n\nNow, we can return the final text as the main text of the HTML!\n\n\nLicense\n-------\nBSD",
        "description_content_type": null,
        "docs_url": null,
        "download_url": "UNKNOWN",
        "downloads": {
            "last_day": -1,
            "last_month": -1,
            "last_week": -1
        },
        "home_page": "http://pypi.python.org/pypi/reporter/",
        "keywords": null,
        "license": "BSD",
        "maintainer": null,
        "maintainer_email": null,
        "name": "reporter",
        "package_url": "https://pypi.org/project/reporter/",
        "platform": "UNKNOWN",
        "project_url": "https://pypi.org/project/reporter/",
        "project_urls": {
            "Download": "UNKNOWN",
            "Homepage": "http://pypi.python.org/pypi/reporter/"
        },
        "release_url": "https://pypi.org/project/reporter/0.1.2/",
        "requires_dist": null,
        "requires_python": null,
        "summary": "Flexible text extraction from HTML in Python.",
        "version": "0.1.2"
    },
    "last_serial": 798784,
    "releases": {
        "0.1.0": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "37e08fd654cdc9910c97991b99858d7e",
                    "sha256": "81b9a89314f5dd6fc3c2818f2895fe57337895fed7ab6b6dc6a4d7467ccdd39c"
                },
                "downloads": -1,
                "filename": "reporter-0.1.0.tar.gz",
                "has_sig": false,
                "md5_digest": "37e08fd654cdc9910c97991b99858d7e",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 9043,
                "upload_time": "2012-10-29T22:22:57",
                "url": "https://files.pythonhosted.org/packages/08/8b/ab75314db6641c24f2d22329c4feada0b5f2348913549afa2d914cca629c/reporter-0.1.0.tar.gz"
            }
        ],
        "0.1.1": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "a152c575353b7d87a0aa97fcfe34e843",
                    "sha256": "0dc455e70c6c6b669d0c06701a3615cce1fb3074cbfc6ec061cace39bb5b023e"
                },
                "downloads": -1,
                "filename": "reporter-0.1.1.tar.gz",
                "has_sig": false,
                "md5_digest": "a152c575353b7d87a0aa97fcfe34e843",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 9066,
                "upload_time": "2012-10-29T22:30:13",
                "url": "https://files.pythonhosted.org/packages/dd/5c/38b638f3293ceedbffdf5dac48f926c5ae6a78349ccbd16907ddea8980ac/reporter-0.1.1.tar.gz"
            }
        ],
        "0.1.2": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "05ad32a50ddd2761ada621cc4b97e00e",
                    "sha256": "662d9c7a5598b016d38bd41499aee569bc2d2b806af63a81ff32e1202edd5d21"
                },
                "downloads": -1,
                "filename": "reporter-0.1.2.tar.gz",
                "has_sig": false,
                "md5_digest": "05ad32a50ddd2761ada621cc4b97e00e",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 9096,
                "upload_time": "2012-10-29T22:51:48",
                "url": "https://files.pythonhosted.org/packages/29/c5/25a4e0716b9050df82b146b2379d3ce40040273a39c02d5f0f4617322910/reporter-0.1.2.tar.gz"
            }
        ]
    },
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "05ad32a50ddd2761ada621cc4b97e00e",
                "sha256": "662d9c7a5598b016d38bd41499aee569bc2d2b806af63a81ff32e1202edd5d21"
            },
            "downloads": -1,
            "filename": "reporter-0.1.2.tar.gz",
            "has_sig": false,
            "md5_digest": "05ad32a50ddd2761ada621cc4b97e00e",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 9096,
            "upload_time": "2012-10-29T22:51:48",
            "url": "https://files.pythonhosted.org/packages/29/c5/25a4e0716b9050df82b146b2379d3ce40040273a39c02d5f0f4617322910/reporter-0.1.2.tar.gz"
        }
    ]
}