{ "info": { "author": "Aleksandr Chuklin", "author_email": "UNKNOWN", "bugtrack_url": null, "classifiers": [], "description": "`ClickModels `__\n===========================================================\n\nClickModels is a small set of Python scripts for the user click models\ninitially developed at `Yandex `__. A *Click\nModel* is a probabilistic graphical model used to predict search engine\nclick data from past observations. This project is aimed to deal with\nclick models used in Information Retrieval (see next section) and\nintended to be easy-to-read and easy-to-modify. If it's not, please let\nme know how to improve it :)\n\nIf you are using this code for your research work, consider citing one\nof our papers when appropriate (see\n`References `__\nsection below).\n\nIf you are looking for a general-purpose framework to work with\nprobabilistic graphical models you might want to examine\n`Infer.NET `__.\nIt should also work with IronPython.\n\nQuick Start\n===========\n\n- ``cp clickmodels/config_sample.py config.py``\n- ``vim config.py``\n- ``python bin/run_inference.py < data/click_log_sample.tsv 2>inference.log``\n\nMore details about the config and input data formats below.\n\nSystem-Wide Install\n===================\n\nIf you wish, you can install the ClickModels core (parameter inference\nand click simulation) to a system-wide location:\n\n::\n\n sudo python setup.py install\n\nUninstall:\n\n::\n\n sudo pip uninstall clickmodels\n\nNew!\n====\n\nNow, thanks to `agrotov `__, the models can\nalso be run in a click generation mode and predict relevance (DBN only).\nCheck out ``ClickModel.get_model_relevances()`` and\n``ClickModels.generate_clicks()`` methods.\n\n**N.B.**: Use this code with care as it is not fully tested yet.\n\n--------------\n\nModels Implemented\n==================\n\n- *Dynamic Bayesian Network* ( **DBN** ) model: Chapelle, O. and Zhang,\n Y. 2009. A dynamic bayesian network click model for web search\n ranking. 
WWW (2009).\n- *User Browsing Model* ( **UBM** ): Dupret, G. and Piwowarski, B.\n 2008. A user browsing model to predict search engine click data from\n past observations. SIGIR (2008).\n- *Exploration Bias User Browsing Model* ( **EB\\_UBM** ): Chen, D. et\n al. 2012. Beyond ten blue links: enabling user click modeling in\n federated web search. WSDM (2012).\n- *Dependent Click Model* ( **DCM** ): Guo, F. et al. 2009. Efficient\n multiple-click models in web search. WSDM (2009).\n- *Intent-Aware Models* ( **DBN-IA, UBM-IA, EB\\_UBM-IA, DCM-IA** ):\n `Chuklin, A. et al. 2013. Using Intent Information to Model User\n Behavior in Diversified Search. ECIR\n (2013). `__\n\n--------------\n\nFormat of the Input Data (Click Log)\n====================================\n\nA small example can be found under ``data/click_log_sample.tsv``. This\nis a tab-separated file, where each line has 7 elements. For example,\nthe line\n``1dd100500 QUERY1 50 0.259109 [\"http://1\", \"http://2\", \"http://3\",\"http://4\",\"http://5\",\"http://6\",\"http://7\",\"http://8\",\"http://9\",\"Http://10\"] [false, false, false, false, true, true, false, false, false, false] [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0]``\nhas the following fields:\n\n1. ``1dd100500`` \u2014 some identifier (currently not used)\n2. ``QUERY1`` \u2014 text of the query. It can contain any UTF-8 characters\n except the tab character ``\\t``\n3. ``50`` \u2014 integer identifier of the region (country, city) of the user\n who submitted the query. **If you don't want this, just put some\n constant (e.g., ``0``) in this column**. At Yandex the user region is\n heavily used by ranking, so throughout the code the pair\n ``(query, region)`` is used to identify the query, i.e., if we have\n the same query string from two different regions we consider them as\n two separate queries.\n4. 
``0.259109`` \u2014 float value, corresponding to the probability\n ``P(I = V)`` that the user has a *vertical* intent V, i.e., he or she is\n interested more in the vertical documents than in organic web\n documents. **If you do not want any of this intent stuff just put\n ``0`` in this column**. We assume that the user has one of the two\n intents: *vertical* intent V with probability ``P(I = V)`` and\n regular *web* intent with probability ``1 - P(I = V)``. For example,\n if we want to take into account the user's interest in images and we\n somehow know the probability that the user is interested more in\n images than in organic *web* results, we can make use of intent-aware\n click models. See `Chuklin, A. et al. 2013. Using Intent Information\n to Model User Behavior in Diversified Search. ECIR\n (2013) `__\n for more details.\n5. **json** list of the URLs of the documents that make up the SERP (search\n engine result page). A document's URL is an identifier, so in principle\n you can use any (string) id you want. **NB**: this is not a Python\n list, so you have to use double quotes and no comma after the last\n element.\n6. **json** list with the *presentation types* (vertical types) of the\n documents (see `Chuklin, A. et al. 2013. Using Intent Information to\n Model User Behavior in Diversified Search. ECIR\n (2013). `__).\n **If you do not want to know this** just set it to the list of\n ``false`` of the same length as the previous list.\n7. **json** list of clicks. Each element is the number of times the\n corresponding URL was clicked\n\nIf you need more data to experiment with you can use any publicly\navailable dataset and convert it to the format described above. 
For\nexample, you can use a dataset provided by one of the Yandex challenges\n(you need to register to get access to the data):\n\n- http://imat-relpred.yandex.ru/en/datasets\n- http://switchdetect.yandex.ru/en/datasets\n\n--------------\n\nFiles\n=====\n\nREADME.md\n---------\n\nThis file.\n\nLICENSE, AUTHORS, CHANGES.txt\n-----------------------------\n\nSelf-explanatory.\n\nsetup.py\n--------\n\nPython package installation file: ``python setup.py --help``\n\nbin/\n----\n\nDirectory with the scripts.\n\nclickmodels/\n------------\n\nDirectory with the core code. This is the directory that gets installed\nby ``setup.py``.\n\ndata/\n-----\n\nThe ``data/`` directory contains an example of a click log (see format\ndescription above) as well as two examples of result pages with fresh\nblock included (see ``makeGluedSERP.py`` description above):\n``data/serp_sample.json`` is used in an example above, while\n``data/serp_sample2.json`` was used to create a picture in the paper\n`Chuklin, A. et al. 2013. Using Intent Information to Model User\nBehavior in Diversified Search. ECIR\n(2013). `__\n\ndoc/\n----\n\nAutomatically generated documentation. Also available\n`online `__.\n\nhtml/\n-----\n\nDirectory with the CSS/JS files for ``bin/generate_serp.py``; the script\nwrites a number of HTML files there for you to examine to get an idea what kind\nof SERPs we address with the intent-aware click models.\n\nbin/generate\\_serp.py\n---------------------\n\n**{not used by the other, does not use other code}** Create html of the\nSERP containing fresh block item. This is used just to illustrate the\nnotion of *presentation types* used by Intent-Aware models. Run as:\n``bin/generate_serp.py < data/serp_sample.json``. 
Output is placed in the\n``html/`` directory.\n\n**WARNING:** all previously generated html files in this directory will\nbe removed.\n\nbin/compare\\_click\\_models.py\n-----------------------------\n\n**{not used by other scripts}** Script used to compare different models\nand output the significance of the difference. The pair of models to compare\nis specified in the code by modifying the ``TESTED_MODEL_PAIRS`` variable.\nA model pair is a text string which is mapped to the pair of functions\nreturning model objects (see the ``MODEL_CONSTRUCTORS`` dict for the\nmapping). E.g. ``UBMvsDBN`` is used to compare the UBM model\n(``UbmModel()``) to the default DBN model\n(``DbnModel((0.9, 0.9, 0.9, 0.9))``). **NB**: we may have a list of\nmodels that need to be compared to each other. For this purpose the same\nnotion of *pair* is abused. For instance,\n``MODEL_CONSTRUCTORS['EB_UBM']`` contains 3 algorithms to be compared to\neach other: **UBM, EB\\_UBM, EB\\_UBM-IA**.\n\n- **Usage**:\n ``bin/compare_click_models.py directory_with_click_logs 2>run.log``\n- **Input**: ``directory_with_click_logs`` \u2014 directory containing files\n with click logs. Each file is in the format described above. These\n files are then sorted alphabetically and split into pairs where the first\n file is used for training and the second one is used for testing. For\n example, if the directory contains files ``f01``, ``f02``, ``f03``,\n ``f04`` then ``f02`` will be used to test models trained using\n ``f01``, ``f04`` will be used to test models trained using ``f03``\n and so on. Two models are evaluated on the test set and their\n performances (Average Perplexity or Log Likelihood) are compared\n using the appropriate formula (see ``perpGain`` and ``llGain`` functions\n respectively). 
**NB**: Multiple train and test files are needed to\n calculate a confidence interval for the gains (``bootstrap.py`` is used\n for this purpose).\n- **Output**: some progress output is printed to ``sys.stderr`` which\n might be useful for a long run. Finally the gains of one model over\n another are output in the following format:\n\n ::\n\n UBM (0, 1) [-0.0115, 0.0659, 0.075, 0.0778, 0.0623, 0.0403, 0.0593, -0.040] (0.0095, 0.0662) \n\n It first outputs the name of the \"pair\" ``pairName`` specified in the\n ``TESTED_MODEL_PAIRS``, then a pair of indices ``(i, j)`` which means\n that the models compared are ``MODEL_CONSTRUCTORS[pairName][i]`` and\n ``MODEL_CONSTRUCTORS[pairName][j]`` which are UBM and UBM-IA in our\n example. Next is the list of gains of model ``j`` over model ``i``\n for each pair of the train/test logs (in our example we had 8 pairs\n of files under ``directory_with_click_logs``). The next element is\n the confidence interval according to the bootstrap test (with 95%\n confidence level and 1000 bootstrap samples). This line will be\n printed for all the \"pairs\" listed under ``TESTED_MODEL_PAIRS`` and\n for both Average Perplexity (PERPLEXITY) and Log Likelihood (LL). For\n the perplexity measure, the gains for each individual position (rank) are\n also printed, like this:\n ``UBM POSITION PERPLEXITY GAINS: (0, 1) [[average_gain_for_pos1, confidence_interval_for_pos_1], \u2026]``\n\nbin/run\\_inference.py\n---------------------\n\nRun click model inference and evaluate the models:\n\n- **Usage**:\n ``bin/run_inference.py < data/click_log_sample.tsv 2>inference.log``\n- **Input**: click log in the format described above (``sys.stdin``)\n- **Output** (assuming that ``TRAIN_FOR_METRIC = False``):\n ``ModelName (LogLikelihood, Perplexity)``\n\nclickmodels/inference.py\n------------------------\n\nThis file contains the implementation of all the click models, probabilistic\ninference and helper functions needed to work with them. This is the\ncore of the codebase. 
More details about the classes/functions below.\n\nclickmodels/bootstrap.py\n------------------------\n\nCopyright \u00a9 `Ernesto P.\nAdorio `__: code we use\nto perform a bootstrap test.\n\nclickmodels/config\\_sample.py\n-----------------------------\n\nCopy this file to ``config.py`` and modify it; it will be used by\n``bin/run_inference.py`` and ``bin/compare_click_models.py``. This is\nthe file with default settings for different parameters of the\ninference, input and output data.\n\n- ``MAX_ITERATIONS`` \u2014 maximum number of iterations in the Expectation\n Maximization (EM) algorithm (applicable only for models using the EM\n algorithm).\n- ``DEBUG`` \u2014 perform some additional tests when running the algorithm\n (makes it slower)\n- ``PRETTY_LOG`` \u2014 make log output prettier. If ``False`` then more\n information is put into the log.\n- ``USED_MODELS`` \u2014 list of model names to be tested in the ``__main__``\n section of the script. Possible names are\n ``['Baseline', 'SDBN', 'UBM', 'UBM-IA', 'EB_UBM', 'EB_UBM-IA', 'DCM', 'DCM-IA', 'DBN', 'DBN-IA']``.\n Please refer to the ``__main__`` section of ``inference.py`` to see\n how these names are expressed in terms of our class hierarchy (all\n those nasty ``if 'XXX' in USED_MODELS``).\n- ``MIN_DOCS_PER_QUERY``, ``MAX_DOCS_PER_QUERY`` \u2013 number of documents\n per query. Set to 10 by default as most search engines return a list\n of 10 documents.\n- ``SERP_SIZE`` - size of the search engine result page (SERP). Used if\n we want to model clicks beyond the first result page. See the section\n named **Beyond the First Result Page** below for more details.\n- ``EXTENDED_LOG_FORMAT`` - if set to ``True`` the urls, layout and\n clicks are dicts instead of lists (see the **Format of the Input Data\n (Click Log)** section above). 
Example:\n ``data/click_log_sample_extended_format.tsv``.\n- ``TRANSFORM_LOG`` - transform the click log by inserting the fake\n documents for the pagination button (currently works only with\n ``EXTENDED_LOG_FORMAT = True``). See the section named **Beyond the\n First Result Page** below.\n- ``QUERY_INDEPENDENT_PAGER`` - used to switch between ``SDBN(P)`` /\n ``SDBN(P-Q)``. Only used with ``TRANSFORM_LOG = True``.\n- ``TRAIN_FOR_METRIC`` \u2013 if ``True`` the model will be trained such\n that its parameters can be used in a metric (like `Chuklin, A. et al.\n 2013. Click Model-Based Information Retrieval Metrics. SIGIR\n (2013). `__).\n See the section below for more details.\n- ``PRINT_EBU_STATS`` \u2014 if ``True`` the parameters of the EBU metric\n will be printed first (*Yilmaz, E. et al. 2010. Expected browsing\n utility for web search evaluation. CIKM. (2010)*).\n- ``DEFAULT_REL`` - default (prior) relevance (attractiveness,\n satisfaction) parameter values used in click models like DBN or EBU.\n\nclickmodels/input\\_reader.py\n----------------------------\n\nA class used for reading input data in the click log format described\nabove.\n\n--------------\n\nClass Hierarchy\n===============\n\nAlso see epydoc-generated\n`documentation `__.\n\nClick Models\n------------\n\n.. figure:: https://raw.github.com/varepsilon/clickmodels/master/doc/html/class_hierarchy_for_clickmodel.gif\n :alt: \n\nThe base class for all the click models is the class called\n``ClickModel``. In order to define a new click model you should create a\nclass inherited from it and re-define the methods ``train`` and\n``_getClickProbs``.\n\n- ``train`` function\n\nNote that the ``test`` method is already implemented and uses the\n``_getClickProbs`` function. 
If you redefine the ``__init__`` method, then\nbe sure to invoke the ``__init__`` of the parent class to set the\n``ignoreIntents`` and ``ignoreLayout`` parameters (they should be set to\n``True`` unless you are using an *Intent Aware* model).\n\nThe ``ClickModel`` class by itself represents a baseline click model which\nassigns probability 0.5 to any click event.\n\nDbnModel\n~~~~~~~~\n\nThis class is, in fact, an implementation of a more general **DBN-IA**\nmodel (`Chuklin, A. et al. 2013. Using Intent Information to Model User\nBehavior in Diversified Search. ECIR\n(2013). `__\n) that makes use of intent and presentation type of the documents when\n``ignoreIntents`` and/or ``ignoreLayout`` is set to ``False``. The\n``train`` method is a probabilistic EM inference.\n\nIf all you want is the original DBN model by Chapelle et al. you\nshould create it as ``DbnModel((0.9, 0.9, 0.9, 0.9))`` (``ignoreIntents``\nand ``ignoreLayout`` are ``True`` by default).\n\nSimplifiedDbnModel (DbnModel)\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nThis is the same as\n``DbnModel((1.0, 1.0, 1.0, 1.0), ignoreIntents, ignoreLayout)``, but the\n``train`` method is just counting instead of the EM algorithm. See\n*Chapelle, O. and Zhang, Y. 2009. A Dynamic Bayesian Network click model\nfor web search ranking. WWW (2009).*, Section 5 (Algorithm 1).\n\nUbmModel\n~~~~~~~~\n\nThis is the most general case for all UBM-like intent-aware models.\nBy changing the ``ignoreIntents``, ``ignoreLayout`` and ``explorationBias``\nparameters you can get different models: **UBM, UBM-intent, UBM-layout,\nUBM-IA, EB\\_UBM, EB\\_UBM-intent, EB\\_UBM-layout, EB\\_UBM-IA** (for the\nnames see `Chuklin, A. et al. 2013. Using Intent Information to Model\nUser Behavior in Diversified Search. ECIR\n(2013). `__).\n\nEbUbmModel (UbmModel)\n~~~~~~~~~~~~~~~~~~~~~\n\nJust a shortcut for\n``UbmModel(ignoreIntents, ignoreLayout, explorationBias=True)`` which\ncorresponds to the model called *Exploration Bias UBM* in *Chen, D. et\nal. 2012. 
Beyond ten blue links: enabling user click modeling in\nfederated web search. WSDM (2012).*\n\nDcmModel\n~~~~~~~~\n\nThis model is, again, the more general **DCM-IA** model which reduces to\n**DCM** when ``ignoreIntents = True``, ``ignoreLayout = True``.\nThe ``train`` method is simple counting, no EM algorithm.\n\nPlease note that the ``getGamma`` method invokes ``DbnModel.getGamma``, so\nbe careful when changing that.\n\nInputReader\n-----------\n\nThis class is intended to read input (click log) in the format described\nabove. To save memory, it maps queries and urls to ids. This means that\nyou need to use the same instance of the ``InputReader`` class even if\nyou read multiple click log files. Otherwise you will end up with two\ndifferent ids assigned to the same query.\n\n--------------\n\nPerformance Issues\n==================\n\nIf you experience performance issues consider using\n`PyPy `__ instead of regular CPython. It may lead to\na 10x speedup. You can also install and use the\n`simplejson `__ module instead\nof ``json``.\n\n--------------\n\nTRAIN\\_FOR\\_METRIC\n==================\n\nIf you set ``TRAIN_FOR_METRIC = True`` the code will expect you to\nprovide document relevances instead of URLs. We assume that\ndocument attractiveness and/or satisfaction probability only depends on\nits human-assigned relevance grade. A model will then be trained to\nassign the same attractiveness / satisfaction probabilities to all the\ndocuments with the same relevance.\n\nInput Format\n------------\n\nThe format in this case is similar to the one described above with the only\ndifference that URLs should be replaced by the relevance grade of the\ncorresponding document to the corresponding query. The query field will\nbe ignored in that case. 
The relevance grade should take one of the\nfollowing values:\n\n- ``IRRELEVANT`` \u2014 lowest relevance score, document is not relevant to\n the query\n- ``RELEVANT`` \u2014 combines marginally relevant and just relevant\n documents\n- ``USEFUL`` \u2014 document is more than just relevant, it is really useful\n- ``VITAL`` \u2014 highest relevance score, the document is essential\n\nPlease note that if you also have ``PRINT_EBU_STATS`` set to ``True``,\nthen the parameters of the EBU / rrDBN metric will be printed out first\n(these can be computed directly without the need to train a model).\n\nOutput\n------\n\nFor each model the corresponding parameters will be printed out:\n\n- ``UBM`` \u2014 attractiveness probabilities ``alpha`` and position\n discount parameters ``gamma``\n- ``DCM`` \u2014 attractiveness probabilities ``alpha`` (named as\n ``urlRelevances`` in the code) and position discount parameters\n ``gamma``\n\nFor more conceptual details about converting click models into\nevaluation metrics please refer to the paper `Chuklin, A. et al. 2013.\nClick Model-Based Information Retrieval Metrics. SIGIR\n(2013). `__\n\n--------------\n\nBeyond the First Result Page\n============================\n\nIf you want to model the clicks beyond the first result page you may\nwant to model the pagination button separately. We implemented the models\ndescribed in the paper `A. Chuklin, P. Serdyukov, and M. de Rijke.\nModeling Clicks Beyond the First Result Page. In CIKM. ACM,\n2013. `__.\nNamely, by setting the following config options you will get:\n\n- ``TRANSFORM_LOG = True``, ``QUERY_INDEPENDENT_PAGER = False``:\n ``SDBN(P)`` model\n- ``TRANSFORM_LOG = True``, ``QUERY_INDEPENDENT_PAGER = True``:\n ``SDBN(P-Q)`` model\n- ``TRANSFORM_LOG = False``: regular ``SDBN`` model\n\nPlease refer to the paper for more details.\n\n--------------\n\nReferences\n==========\n\n- *A. Chuklin, P. Serdyukov, and M. de Rijke. 
Using Intent Information\n to Model User Behavior in Diversified Search. In ECIR, 2013.*\n `[pdf] `__\n- *A. Chuklin, P. Serdyukov, and M. de Rijke. Click model-based\n information retrieval metrics. In SIGIR. ACM, 2013.*\n `[pdf] `__\n- *A. Chuklin, P. Serdyukov, and M. de Rijke. Modeling Clicks Beyond\n the First Result Page. In CIKM. ACM, 2013.*\n `[pdf] `__\n\nCopyright and License\n=====================\n\nCopyright \u00a9 `Yandex `__ 2012-2013,\n`varepsilon `__ 2012-\u221e\n\nPublished under the BSD license.", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/varepsilon/clickmodels", "keywords": "", "license": "LICENSE", "maintainer": "", "maintainer_email": "", "name": "clickmodels", "package_url": "https://pypi.org/project/clickmodels/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/clickmodels/", "project_urls": { "Download": "UNKNOWN", "Homepage": "https://github.com/varepsilon/clickmodels" }, "release_url": "https://pypi.org/project/clickmodels/1.0.2/", "requires_dist": null, "requires_python": null, "summary": "Probabilistic models of user behavior on a search engine result page", "version": "1.0.2" }, "last_serial": 1118123, "releases": { "1.0.1": [ { "comment_text": "", "digests": { "md5": "eda4fdd73dad161ddcb9e1204ca4261f", "sha256": "7d20c910d941c874c753b8ba380f02072c50322fd9aafbc59e9a188688d42550" }, "downloads": -1, "filename": "clickmodels-1.0.1.tar.gz", "has_sig": false, "md5_digest": "eda4fdd73dad161ddcb9e1204ca4261f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 16502, "upload_time": "2014-06-08T09:25:35", "url": "https://files.pythonhosted.org/packages/73/38/fefee398ef0f9d74070c3f972fc37bd9589f27ebc6d8dd844ce068606cdd/clickmodels-1.0.1.tar.gz" } ], "1.0.2": [ { "comment_text": "", "digests": { "md5": "90ded537733a5a078f872a69e4173362", "sha256": 
"419a1562f2c5d6d38289322ea55030b8f5d2fecc37e1c668ac0359ba19211cad" }, "downloads": -1, "filename": "clickmodels-1.0.2.tar.gz", "has_sig": false, "md5_digest": "90ded537733a5a078f872a69e4173362", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 21586, "upload_time": "2014-06-08T11:16:15", "url": "https://files.pythonhosted.org/packages/50/7c/c40372822c8177febc6593d5b785f069b3ab2134c1b509de98036f5c41e7/clickmodels-1.0.2.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "90ded537733a5a078f872a69e4173362", "sha256": "419a1562f2c5d6d38289322ea55030b8f5d2fecc37e1c668ac0359ba19211cad" }, "downloads": -1, "filename": "clickmodels-1.0.2.tar.gz", "has_sig": false, "md5_digest": "90ded537733a5a078f872a69e4173362", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 21586, "upload_time": "2014-06-08T11:16:15", "url": "https://files.pythonhosted.org/packages/50/7c/c40372822c8177febc6593d5b785f069b3ab2134c1b509de98036f5c41e7/clickmodels-1.0.2.tar.gz" } ] }