{ "info": { "author": "Edward Newell", "author_email": "edward.newell@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Programming Language :: Python :: 2.7", "Topic :: Software Development :: Build Tools" ], "description": "# theano-word2vec\nAn implementation of Mikolov's word2vec in Python 2 using Theano and Lasagne.\n\n## About this package\nThis package has been written with care for modularity of its components, \nwith the hope that they will be re-usable in creating variations on standard\nword2vec. Soon I'll provide full documentation with guidelines on \ncustomization and extension, as well as a tour of how the package is set up.\nFor now, please enjoy this quickstart guide.\n\n## Quickstart\n*NOTE: This package is only available for Python 2 right now.*\n\n### Install\nInstall from the Python Package Index:\n```bash\npip install theano-word2vec\n```\n\nAlternatively, install a version you can hack on:\n```bash\ngit clone https://github.com/enewe101/word2vec.git\ncd word2vec\npython setup.py develop\n```\n\n### Use\n\nThe simplest way to train a word2vec embedding:\n```python\n>>> from word2vec import word2vec\n>>> embedder, dictionary = word2vec(files=['corpus/file1.txt', 'corpus/file2.txt'])\n```\nThe input files should be formatted with one sentence per line, with\nspace-separated tokens.\n\nOnce trained, the embedder can be used to convert words to vectors:\n```python\n>>> tokens = 'A sentence to embed'.split()\n>>> token_ids = dictionary.get_ids(tokens)\n>>> vectors = embedder.embed(token_ids)\n```\n\nThe `word2vec()` function exposes most of the basic parameters appearing\nin Mikolov's skip-gram model based on noise contrastive estimation:\n```python\n>>> import re\n>>> import lasagne\n>>> embedder, dictionary = word2vec(\n...\t\t# Directory in which to save embedding parameters (deepest dir created if it doesn't 
exist)\n...\t\tsavedir='data/my-embedding',\n...\n...\t\t# List of files comprising the corpus\n...\t\tfiles=['corpus/file1.txt', 'corpus/file2.txt'],\n...\n...\t\t# Include whole directories of files (deep files not included)\n...\t\tdirectories=['corpus', 'corpus/additional'],\n...\n...\t\t# Indicate files to exclude using regexes\n...\t\tskip=[re.compile(r'.*\\.bk$'), re.compile('exclude-from-corpus')],\n...\n...\t\t# Number of passes through training corpus\n...\t\tnum_epochs=5,\n...\n...\t\t# Specify the mapping from tokens to ints (else create it automatically)\n...\t\tunigram_dictionary=preexisting_dictionary,\n...\n...\t\t# Number of \"noise\" examples included for every \"signal\" example\n...\t\tnoise_ratio=15,\n...\n...\t\t# Relative probability of skip-gram sampling centered on query word\n...\t\tkernel=[1,2,3,3,2,1],\n...\n...\t\t# Threshold used to calculate discard-probability for query words\n...\t\tt=1.0e-5,\n...\n...\t\t# Size of minibatches during training\n...\t\tbatch_size=1000,\n...\n...\t\t# Dimensionality of the embedding vector space\n...\t\tnum_embedding_dimensions=500,\n...\n...\t\t# Initializer for embedding parameters (can be a numpy array too)\n...\t\tword_embedding_init=lasagne.init.Normal(),\n...\n...\t\t# Initializer for context embedding parameters (can be a numpy array too)\n...\t\tcontext_embedding_init=lasagne.init.Normal(),\n...\n...\t\t# Size of stochastic gradient descent steps during training\n...\t\tlearning_rate=0.1,\n...\n...\t\t# Amount of Nesterov momentum during training\n...\t\tmomentum=0.9,\n...\n...\t\t# Print messages during training\n...\t\tverbose=True,\n...\n...\t\t# Number of parallel corpus-reading processes\n...\t\tnum_example_generators=3\n...\t)\n```\n\nFor more customization, check out the documentation (coming soon) to see how to \nassemble your own training setup using the classes provided in word2vec.", "description_content_type": null, "docs_url": null, "download_url": 
"", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/enewe101/word2vec", "keywords": "word2vec word embeddings deep learning nlp", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "theano-word2vec", "package_url": "https://pypi.org/project/theano-word2vec/", "platform": "", "project_url": "https://pypi.org/project/theano-word2vec/", "project_urls": { "Homepage": "https://github.com/enewe101/word2vec" }, "release_url": "https://pypi.org/project/theano-word2vec/0.2.2/", "requires_dist": null, "requires_python": "", "summary": "word2vec using Theano and Lasagne", "version": "0.2.2" }, "last_serial": 3029281, "releases": { "0.1": [ { "comment_text": "", "digests": { "md5": "8a1635da8408b749ae27ea7c9f895161", "sha256": "106d906c3f1d8762ec4bde7ad8d55a1f18d8781a52cb37aba8689155ed8adc45" }, "downloads": -1, "filename": "theano-word2vec-0.1.tar.gz", "has_sig": false, "md5_digest": "8a1635da8408b749ae27ea7c9f895161", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7314, "upload_time": "2016-04-11T04:36:25", "url": "https://files.pythonhosted.org/packages/7a/15/83191d85e83f1dd6d6a092b8dceee2a10178c009f0ee95f7911f18fd2d5a/theano-word2vec-0.1.tar.gz" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "4970cad7cf1c9822dddd487e9b3ecd1a", "sha256": "bb1e9a82b540222b0d9100f447527322fce70d4115fa44e3e18d1babfa3d7e12" }, "downloads": -1, "filename": "theano-word2vec-0.1.1.tar.gz", "has_sig": false, "md5_digest": "4970cad7cf1c9822dddd487e9b3ecd1a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7358, "upload_time": "2016-04-11T04:46:45", "url": "https://files.pythonhosted.org/packages/5d/0b/07cf87e7698eeb57210fb939af318a7ecd30f9bda208367435613bf9b935/theano-word2vec-0.1.1.tar.gz" } ], "0.1.2": [ { "comment_text": "", "digests": { "md5": "1894c473525acef9f850e310046ab5e0", "sha256": 
"96891617731138d95bbb187c3f29bac4834bf7ef59d84ac5a53bd9191a431642" }, "downloads": -1, "filename": "theano-word2vec-0.1.2.tar.gz", "has_sig": false, "md5_digest": "1894c473525acef9f850e310046ab5e0", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 22562, "upload_time": "2016-04-18T04:17:40", "url": "https://files.pythonhosted.org/packages/f2/a7/311bfa18afe115a856e4caf93c125434d1f3fc49e847ab8d8be0d02340df/theano-word2vec-0.1.2.tar.gz" } ], "0.1.3": [ { "comment_text": "", "digests": { "md5": "77acdaad29ccb509a43c9f458d59c879", "sha256": "c394a6f181e424a8e298521ac11bc99f4d1b96afb3cd451bef9f9b7cefc1d346" }, "downloads": -1, "filename": "theano-word2vec-0.1.3.tar.gz", "has_sig": false, "md5_digest": "77acdaad29ccb509a43c9f458d59c879", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 31739, "upload_time": "2016-05-04T16:17:23", "url": "https://files.pythonhosted.org/packages/bc/9e/970f48922ed385c2eceb6dda17fbce2755b7488f88fc197bb407321773c1/theano-word2vec-0.1.3.tar.gz" } ], "0.1.4": [ { "comment_text": "", "digests": { "md5": "f36a7396739d99c47354cb4dc8e786df", "sha256": "aabe3a6e2876482c74233e56adaba71313faf7dd54d08fdc9917b5329d19b177" }, "downloads": -1, "filename": "theano-word2vec-0.1.4.tar.gz", "has_sig": false, "md5_digest": "f36a7396739d99c47354cb4dc8e786df", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 32128, "upload_time": "2016-05-04T16:38:20", "url": "https://files.pythonhosted.org/packages/bf/ad/d7ed763b43516bbf2c6e185f4b513fcb7c1e27f88b0acd504112f43c18b4/theano-word2vec-0.1.4.tar.gz" } ], "0.1.5": [ { "comment_text": "", "digests": { "md5": "459dab5ba18a3ecdf16765f9314c7313", "sha256": "18efbf5244584e9771d6af11142f31e7e4d732e1f037962fba1667894eb69a71" }, "downloads": -1, "filename": "theano-word2vec-0.1.5.tar.gz", "has_sig": false, "md5_digest": "459dab5ba18a3ecdf16765f9314c7313", "packagetype": "sdist", "python_version": "source", 
"requires_python": null, "size": 32216, "upload_time": "2016-05-04T16:57:24", "url": "https://files.pythonhosted.org/packages/ec/51/ca84bafb95d8e1d0870537d52288e619912ca8862f3b35fa5b41f2a96683/theano-word2vec-0.1.5.tar.gz" } ], "0.2.1": [ { "comment_text": "", "digests": { "md5": "3f65843c300759c6f25124035edfaf25", "sha256": "f8387be665b91902785204965c0b3719aa894cc489198dcad3e16f24fae9df15" }, "downloads": -1, "filename": "theano-word2vec-0.2.1.tar.gz", "has_sig": false, "md5_digest": "3f65843c300759c6f25124035edfaf25", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 38053, "upload_time": "2016-10-05T02:47:24", "url": "https://files.pythonhosted.org/packages/9d/71/740635afbcdcf200b1664b5a92a8eaee11e8a9c5f9dcd4876ebcb65e1e1f/theano-word2vec-0.2.1.tar.gz" } ], "0.2.2": [ { "comment_text": "", "digests": { "md5": "95db195793e40836fb0787b17fb5ca18", "sha256": "b5ef5dd940c18c1c39d3d73f06c62ef9012f9ea281253734dff92205eef73211" }, "downloads": -1, "filename": "theano-word2vec-0.2.2.tar.gz", "has_sig": false, "md5_digest": "95db195793e40836fb0787b17fb5ca18", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 39039, "upload_time": "2017-07-17T17:50:02", "url": "https://files.pythonhosted.org/packages/01/69/40dea6b9c4bc992fbc2fa6c15f1d404bceb389aa0d86f85fc103064570a9/theano-word2vec-0.2.2.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "95db195793e40836fb0787b17fb5ca18", "sha256": "b5ef5dd940c18c1c39d3d73f06c62ef9012f9ea281253734dff92205eef73211" }, "downloads": -1, "filename": "theano-word2vec-0.2.2.tar.gz", "has_sig": false, "md5_digest": "95db195793e40836fb0787b17fb5ca18", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 39039, "upload_time": "2017-07-17T17:50:02", "url": "https://files.pythonhosted.org/packages/01/69/40dea6b9c4bc992fbc2fa6c15f1d404bceb389aa0d86f85fc103064570a9/theano-word2vec-0.2.2.tar.gz" } ] }