{ "info": { "author": "Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer", "author_email": "allennlp-contact@allenai.org", "bugtrack_url": null, "classifiers": [ "Intended Audience :: Developers", "Intended Audience :: Education", "Intended Audience :: Science/Research", "License :: OSI Approved :: Apache Software License", "Operating System :: OS Independent", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7", "Topic :: Scientific/Engineering", "Topic :: Scientific/Engineering :: Artificial Intelligence", "Topic :: Scientific/Engineering :: Information Analysis", "Topic :: Scientific/Engineering :: Mathematics", "Topic :: Software Development", "Topic :: Software Development :: Libraries", "Topic :: Software Development :: Libraries :: Python Modules", "Topic :: Text Processing" ], "description": "# bilm-tf\nTensorflow implementation of the pretrained biLM used to compute ELMo\nrepresentations from [\"Deep contextualized word representations\"](http://arxiv.org/abs/1802.05365).\n\nThis repository supports both training biLMs and using pre-trained models for prediction.\n\nWe also have a pytorch implementation available in [AllenNLP](http://allennlp.org/).\n\nYou may also find it easier to use the version provided in [Tensorflow Hub](https://www.tensorflow.org/hub/modules/google/elmo/2) if you just like to make predictions.\n\nCitation:\n\n```\n@inproceedings{Peters:2018,\n author={Peters, Matthew E. and Neumann, Mark and Iyyer, Mohit and Gardner, Matt and Clark, Christopher and Lee, Kenton and Zettlemoyer, Luke},\n title={Deep contextualized word representations},\n booktitle={Proc. of NAACL},\n year={2018}\n}\n```\n\n\n## Installing\nInstall python version 3.5 or later, tensorflow version 1.2 and h5py:\n\n```\npip install tensorflow-gpu==1.2 h5py\npython setup.py install\n```\n\nEnsure the tests pass in your environment by running:\n```\npython -m unittest discover tests/\n```\n\n## Installing with Docker\n\nTo run the image, you must use nvidia-docker, because this repository\nrequires GPUs.\n```\nsudo nvidia-docker run -t allennlp/bilm-tf:training-gpu\n```\n\n## Using pre-trained models\n\nWe have several different English language pre-trained biLMs available for use.\nEach model is specified with two separate files, a JSON formatted \"options\"\nfile with hyperparameters and a hdf5 formatted file with the model\nweights. Links to the pre-trained models are available [here](https://allennlp.org/elmo).\n\n\nThere are three ways to integrate ELMo representations into a downstream task, depending on your use case.\n\n1. Compute representations on the fly from raw text using character input. This is the most general method and will handle any input text. It is also the most computationally expensive.\n2. Precompute and cache the context independent token representations, then compute context dependent representations using the biLSTMs for input data. This method is less computationally expensive then #1, but is only applicable with a fixed, prescribed vocabulary.\n3. Precompute the representations for your entire dataset and save to a file.\n\nWe have used all of these methods in the past for various use cases. #1 is necessary for evaluating at test time on unseen data (e.g. public SQuAD leaderboard). #2 is a good compromise for large datasets where the size of the file in #3 is unfeasible (SNLI, SQuAD). 
\n#### Shape conventions\nEach tokenized sentence is a list of `str`, and a batch of sentences is\na list of tokenized sentences (`List[List[str]]`).\n\nThe `Batcher` packs these into a shape\n`(n_sentences, max_sentence_length + 2, 50)` numpy array of character\nids, padding on the right with 0 ids for sentences less than the maximum\nlength. The first and last tokens for each sentence are special\nbegin and end of sentence ids added by the `Batcher`.\n\nThe input character id placeholder can be dimensioned `(None, None, 50)`,\nwith both the batch dimension (axis=0) and time dimension (axis=1) determined\nfor each batch, up to the maximum batch size specified in the\n`BidirectionalLanguageModel` constructor.\n\nAfter running inference with the batch, the returned biLM embeddings are\na numpy array with shape `(n_sentences, 3, max_sentence_length, 1024)`,\nafter removing the special begin/end tokens.\n\n#### Vocabulary file\nThe `Batcher` takes a vocabulary file as input for efficiency. This is a\ntext file, with one token per line, separated by newlines (`\\n`).\nEach token in the vocabulary is cached as the appropriate 50 character id\nsequence once. Since the model is completely character based, tokens not in\nthe vocabulary file are handled appropriately at run time, with a slight\nrun time penalty. It is recommended to always include the special\n`<S>` and `</S>` tokens (case sensitive) in the vocabulary file.\n\n### ELMo with character input\n\nSee `usage_character.py` for a detailed usage example (a condensed sketch appears above).\n\n### ELMo with pre-computed and cached context independent token representations\nTo speed up model inference with a fixed, specified vocabulary, it is\npossible to pre-compute the context independent token representations,\nwrite them to a file, and re-use them for inference. Note that we don't\nsupport falling back to character inputs for out-of-vocabulary words,\nso this should only be used when the biLM is used to compute embeddings\nfor input with a fixed, defined vocabulary.\n\nTo use this option:\n\n1. First create a vocabulary file with all of the unique tokens in your\ndataset and add the special `<S>` and `</S>` tokens.\n2. Run `dump_token_embeddings` with the full model to write the token\nembeddings to an hdf5 file.\n3. Use `TokenBatcher` (instead of `Batcher`) with your vocabulary file,\nand pass `use_token_inputs=False` and the name of the output file from step\n2 to the `BidirectionalLanguageModel` constructor.\n\nSee `usage_token.py` for a detailed usage example.\n\n### Dumping biLM embeddings for an entire dataset to a single file.\n\nTo take this option, create a text file with your tokenized dataset. Each line is one tokenized sentence (whitespace separated). Then use `dump_bilm_embeddings`.\n\nThe output file is `hdf5` format. Each sentence in the input data is stored as a dataset with key `str(sentence_id)` where `sentence_id` is the line number in the dataset file (indexed from 0).\nThe embeddings for each sentence are an array with shape `(3, n_tokens, 1024)`.\n\nSee `usage_cached.py` for a detailed example.\n
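\nA minimal sketch of this option (the file names below are placeholders; `usage_cached.py` is the authoritative version):\n\n```\nimport h5py\nfrom bilm import dump_bilm_embeddings\n\n# Placeholder paths; dataset_file has one whitespace separated, tokenized\n# sentence per line as described above.\nvocab_file = 'vocab.txt'\noptions_file = 'options.json'\nweight_file = 'weights.hdf5'\ndataset_file = 'dataset.txt'\nembedding_file = 'elmo_embeddings.hdf5'\n\n# Write the biLM embeddings for the whole dataset to a single hdf5 file.\ndump_bilm_embeddings(vocab_file, dataset_file, options_file, weight_file,\n                     embedding_file)\n\n# Read back the embeddings for the first sentence (line 0 of dataset_file).\nwith h5py.File(embedding_file, 'r') as fin:\n    first_sentence = fin['0'][...]  # shape (3, n_tokens, 1024)\n```\n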
\n## Training a biLM on a new corpus\n\nBroadly speaking, the process to train and use a new biLM is:\n\n1. Prepare input data and a vocabulary file.\n2. Train the biLM.\n3. Test (compute the perplexity of) the biLM on heldout data.\n4. Write out the weights from the trained biLM to an hdf5 file.\n5. See the instructions above for using the output from Step #4 in downstream models.\n\n\n#### 1. Prepare input data and a vocabulary file.\nTo train and evaluate a biLM, you need to provide:\n\n* a vocabulary file\n* a set of training files\n* a set of heldout files\n\nThe vocabulary file is a text file with one token per line. It must also include the special tokens `<S>`, `</S>` and `<UNK>` (case sensitive) in the file.\n\nIMPORTANT: the vocabulary file should be sorted in descending order by token count in your training data. The first three lines should be the special tokens (`<S>`, `</S>` and `<UNK>`), then the most common token in the training data, ending with the least common token (see the short example at the end of this section).\n\nNOTE: the vocabulary file used in training may differ from the one used for prediction.\n\nThe training data should be randomly split into many training files,\neach containing one slice of the data. Each file contains pre-tokenized and\nwhite space separated text, one sentence per line.\nDon't include the `<S>` or `</S>` tokens in your training data.\n\nAll tokenization/normalization is done before training a model, so both\nthe vocabulary file and training files should include normalized tokens.\nAs the default settings use a fully character based token representation, in general we do not recommend any normalization other than tokenization.\n\nFinally, reserve a small amount of the training data as heldout data for evaluating the trained biLM.\n
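\nTo make the vocabulary file layout concrete, a hypothetical file (sorted by descending token count, with made-up common tokens) would begin like this:\n\n```\n<S>\n</S>\n<UNK>\nthe\n,\n.\nof\nand\n```\n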
\n#### 2. Train the biLM.\nThe hyperparameters used to train the ELMo model can be found in `bin/train_elmo.py`.\n\nThe ELMo model was trained on 3 GPUs.\nTo train a new model with the same hyperparameters, first download the training data from the [1 Billion Word Benchmark](http://www.statmt.org/lm-benchmark/).\nThen download the [vocabulary file](https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/vocab-2016-09-10.txt).\nFinally, run:\n\n```\nexport CUDA_VISIBLE_DEVICES=0,1,2\npython bin/train_elmo.py \\\n --train_prefix='/path/to/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/*' \\\n --vocab_file /path/to/vocab-2016-09-10.txt \\\n --save_dir /output_path/to/checkpoint\n```\n\n#### 3. Evaluate the trained model.\n\nUse `bin/run_test.py` to evaluate a trained model, e.g.\n\n```\nexport CUDA_VISIBLE_DEVICES=0\npython bin/run_test.py \\\n --test_prefix='/path/to/1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-000*' \\\n --vocab_file /path/to/vocab-2016-09-10.txt \\\n --save_dir /output_path/to/checkpoint\n```\n\n#### 4. Convert the tensorflow checkpoint to hdf5 for prediction with `bilm` or `allennlp`.\n\nFirst, create an `options.json` file for the newly trained model. To do so,\nfollow the template in an existing file (e.g. the [original `options.json`](https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_options.json)) and modify it for your hyperparameters.\n\n**Important**: always set `n_characters` to 262 after training (see below).\n\nThen run:\n\n```\npython bin/dump_weights.py \\\n --save_dir /output_path/to/checkpoint \\\n --outfile /output_path/to/weights.hdf5\n```\n
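\nThe resulting `options.json` and `weights.hdf5` can then be used for prediction with `bilm` as described above, or with the pytorch implementation in AllenNLP. As a rough sketch of the latter (assuming the AllenNLP `Elmo` module and its documented arguments; this is not part of this repository):\n\n```\nfrom allennlp.modules.elmo import Elmo, batch_to_ids\n\noptions_file = 'options.json'  # the file created above, with n_characters=262\nweight_file = 'weights.hdf5'   # written by bin/dump_weights.py\n\n# One mixture of the biLM layers; dropout disabled for inference.\nelmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0)\n\nsentences = [['First', 'sentence', '.'], ['Another', 'one', '.']]\ncharacter_ids = batch_to_ids(sentences)\n\noutputs = elmo(character_ids)\n# outputs['elmo_representations'][0] has shape (2, 3, 1024).\n```\n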
\n## Frequently asked questions and other warnings\n\n#### Can you provide the tensorflow checkpoint from training?\nThe tensorflow checkpoint is available by downloading these files:\n\n* [vocabulary](https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/vocab-2016-09-10.txt)\n* [checkpoint](https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway_tf_checkpoint/checkpoint)\n* [options](https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway_tf_checkpoint/options.json)\n* [1](https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway_tf_checkpoint/model.ckpt-935588.data-00000-of-00001)\n* [2](https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway_tf_checkpoint/model.ckpt-935588.index)\n* [3](https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway_tf_checkpoint/model.ckpt-935588.meta)\n\n\n#### How do I fine tune a model on additional unlabeled data?\n\nFirst download the checkpoint files above.\nThen prepare the dataset as described in the section \"Training a biLM on a new corpus\", with the exception that we will use the existing vocabulary file instead of creating a new one. Finally, use the script `bin/restart.py` to restart training with the existing checkpoint on the new dataset.\nFor small datasets (e.g. < 10 million tokens) we only recommend tuning for a small number of epochs and monitoring the perplexity on a heldout set, otherwise the model will overfit the small dataset.\n\n#### Are the softmax weights available?\n\nThey are available in the training checkpoint above.\n\n#### Can you provide some more details about how the model was trained?\nThe script `bin/train_elmo.py` has hyperparameters for training the model.\nThe original model was trained on 3 GTX 1080 GPUs for 10 epochs, taking about\ntwo weeks.\n\nFor input processing, we used the raw 1 Billion Word Benchmark dataset\n[here](http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz), and the existing vocabulary of 793471 tokens, including `<S>`, `</S>` and `<UNK>`.\nYou can find our vocabulary file [here](https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/vocab-2016-09-10.txt).\nAt the model input, all text used the full character based representation,\nincluding tokens outside the vocab.\nFor the softmax output we replaced OOV tokens with `<UNK>`.\n\nThe model was trained with a fixed size window of 20 tokens.\nThe batches were constructed by padding sentences with `<S>` and `</S>`, then packing tokens from one or more sentences into each row to completely fill each batch.\nPartial sentences and the LSTM states were carried over from batch to batch so that the language model could use information across batches for context, but backpropagation was broken at each batch boundary.\n\n#### Why do I get slightly different embeddings if I run the same text through the pre-trained model twice?\nAs a result of the training method (see above), the LSTMs are stateful, and carry their state forward from batch to batch.\nConsequently, this introduces a small amount of non-determinism, especially\nfor the first two batches.\n\n#### Why does training seem to take forever even with my small dataset?\nThe number of gradient updates during training is determined by:\n\n* the number of tokens in the training data (`n_train_tokens`)\n* the batch size (`batch_size`)\n* the number of epochs (`n_epochs`)\n\nBe sure to set these values for your particular dataset in `bin/train_elmo.py`.\n\n\n#### What's the deal with `n_characters` and padding?\nDuring training, we fill each batch to exactly 20 tokens by adding `<S>` and `</S>` to each sentence, then packing tokens from one or more sentences into each row to completely fill each batch.\nAs a result, we do not allocate space for a special padding token.\nThe `UnicodeCharsVocabulary` that converts token strings to lists of character\nids always uses a fixed number of character embeddings of `n_characters=261`, so always\nset `n_characters=261` during training.\n\nHowever, for prediction, we ensure each sentence is fully contained in a single batch,\nand as a result pad sentences of different lengths with a special padding id.\nThis occurs in the `Batcher` [see here](https://github.com/allenai/bilm-tf/blob/master/bilm/data.py#L220).\nAs a result, set `n_characters=262` during prediction in the `options.json`.\n\n#### How can I use ELMo to compute sentence representations?\nSimple methods like average and max pooling of the word level ELMo representations across sentences work well, often outperforming supervised methods on benchmark datasets.\nSee \"Evaluation of sentence embeddings in downstream and linguistic probing tasks\", Perone et al., 2018 [arxiv link](https://arxiv.org/abs/1806.06259).\n\n\n#### I'm seeing a WARNING when serializing models, is it a problem?\nThe below 
warning can be safely ignored:\n```\n2018-08-24 13:04:08,779 : WARNING : Error encountered when serializing lstm_output_embeddings.\nType is unsupported, or the types of the items don't match field type in CollectionDef.\n'list' object has no attribute 'name'\n```\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://github.com/allenai/bilm-tf", "keywords": "bilm elmo nlp embedding", "license": "Apache License 2.0", "maintainer": "Matthew Peters", "maintainer_email": "", "name": "bilm", "package_url": "https://pypi.org/project/bilm/", "platform": "", "project_url": "https://pypi.org/project/bilm/", "project_urls": { "Homepage": "http://github.com/allenai/bilm-tf" }, "release_url": "https://pypi.org/project/bilm/0.1.post5/", "requires_dist": [ "h5py" ], "requires_python": ">=3.5", "summary": "Tensorflow implementation of contextualized word representations from bi-directional language models", "version": "0.1.post5" }, "last_serial": 4395481, "releases": { "0.1.post4": [ { "comment_text": "", "digests": { "md5": "e7fdc187c05b59253c9a0afd58d0564a", "sha256": "84b394e0b78462f207ecdbef35948bcbae155c06246d9c35ee1cd3760015f193" }, "downloads": -1, "filename": "bilm-0.1.post4-py3-none-any.whl", "has_sig": false, "md5_digest": "e7fdc187c05b59253c9a0afd58d0564a", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.5", "size": 28972, "upload_time": "2018-09-06T17:02:16", "url": "https://files.pythonhosted.org/packages/44/41/965dae76efb9f22472202c5c8d900266594ef13c8cde58eb6627fd1cf2f8/bilm-0.1.post4-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "89de9977f13ad39aa9657c150bf29e6f", "sha256": "93c11e34bb1643184a1d931ce363843581f60534840212bda943d7747a9d2c0a" }, "downloads": -1, "filename": "bilm-0.1.post4.tar.gz", "has_sig": false, "md5_digest": "89de9977f13ad39aa9657c150bf29e6f", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.5", "size": 31764, "upload_time": "2018-09-06T17:02:18", "url": "https://files.pythonhosted.org/packages/28/9d/ac5fe475f5caa223b9cf501c350ab27a715b82dc3a98fbfee444ab739923/bilm-0.1.post4.tar.gz" } ], "0.1.post5": [ { "comment_text": "", "digests": { "md5": "d282ed77d512992a0a4d781bad1b4367", "sha256": "98a2e3c332bcefcccda6bf6056f8a0be52c474e62bcb0817787010b22b6969a3" }, "downloads": -1, "filename": "bilm-0.1.post5-py3-none-any.whl", "has_sig": false, "md5_digest": "d282ed77d512992a0a4d781bad1b4367", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.5", "size": 29126, "upload_time": "2018-10-19T19:10:07", "url": "https://files.pythonhosted.org/packages/22/a6/711e6ea5a05f7ce72f0a5c6c3bfbd1451aeb8810c9ec8074d5667e3ff433/bilm-0.1.post5-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "eec7e87b5914ed7c12ddde48b62fcfac", "sha256": "62b85a938098bd7296008d9e6df1e04a0fd4c17bb0c5e78b848a3210afb76ee2" }, "downloads": -1, "filename": "bilm-0.1.post5.tar.gz", "has_sig": false, "md5_digest": "eec7e87b5914ed7c12ddde48b62fcfac", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.5", "size": 32060, "upload_time": "2018-10-19T19:10:10", "url": "https://files.pythonhosted.org/packages/74/63/8493c275ca15774dd4de12a657317449575ff4e325b309688c4b7c8429f6/bilm-0.1.post5.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "d282ed77d512992a0a4d781bad1b4367", "sha256": 
"98a2e3c332bcefcccda6bf6056f8a0be52c474e62bcb0817787010b22b6969a3" }, "downloads": -1, "filename": "bilm-0.1.post5-py3-none-any.whl", "has_sig": false, "md5_digest": "d282ed77d512992a0a4d781bad1b4367", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.5", "size": 29126, "upload_time": "2018-10-19T19:10:07", "url": "https://files.pythonhosted.org/packages/22/a6/711e6ea5a05f7ce72f0a5c6c3bfbd1451aeb8810c9ec8074d5667e3ff433/bilm-0.1.post5-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "eec7e87b5914ed7c12ddde48b62fcfac", "sha256": "62b85a938098bd7296008d9e6df1e04a0fd4c17bb0c5e78b848a3210afb76ee2" }, "downloads": -1, "filename": "bilm-0.1.post5.tar.gz", "has_sig": false, "md5_digest": "eec7e87b5914ed7c12ddde48b62fcfac", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.5", "size": 32060, "upload_time": "2018-10-19T19:10:10", "url": "https://files.pythonhosted.org/packages/74/63/8493c275ca15774dd4de12a657317449575ff4e325b309688c4b7c8429f6/bilm-0.1.post5.tar.gz" } ] }