{ "info": { "author": "Explosion", "author_email": "contact@explosion.ai", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Intended Audience :: Developers", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7", "Topic :: Scientific/Engineering :: Artificial Intelligence" ], "description": "\n\n# spaCy wrapper for PyTorch Transformers\n\nThis package provides [spaCy](https://github.com/explosion/spaCy) model\npipelines that wrap\n[Hugging Face's `pytorch-transformers`](https://github.com/huggingface/pytorch-transformers)\npackage, so you can use them in spaCy. The result is convenient access to\nstate-of-the-art transformer architectures, such as BERT, GPT-2, XLNet, etc. For\nmore details and background, check out\n[our blog post](https://explosion.ai/blog/spacy-pytorch-transformers).\n\n[![Azure Pipelines](https://img.shields.io/azure-devops/build/explosion-ai/public/11/master.svg?logo=azure-devops&style=flat-square)](https://dev.azure.com/explosion-ai/public/_build?definitionId=11)\n[![PyPi](https://img.shields.io/pypi/v/spacy-pytorch-transformers.svg?style=flat-square)](https://pypi.python.org/pypi/spacy-pytorch-transformers)\n[![GitHub](https://img.shields.io/github/release/explosion/spacy-pytorch-transformers/all.svg?style=flat-square)](https://github.com/explosion/spacy-pytorch-transformers/releases)\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg?style=flat-square)](https://github.com/ambv/black)\n[![Open demo in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/explosion/spacy-pytorch-transformers/blob/master/examples/Spacy_Transformers_Demo.ipynb)\n\n## Features\n\n- Use **BERT**, **RoBERTa**, **XLNet** and **GPT-2** directly in your spaCy\n pipeline.\n- **Fine-tune** pretrained transformer models on your task using spaCy's API.\n- Custom component for **text classification** using transformer features.\n- 
Automatic **alignment** of wordpieces and outputs to linguistic tokens.\n- Process multi-sentence documents with intelligent **per-sentence\n prediction**.\n- Built-in hooks for **context-sensitive vectors** and similarity.\n- Out-of-the-box serialization and model packaging.\n\n## \ud83d\ude80 Quickstart\n\nInstalling the package from pip will automatically install all dependencies,\nincluding PyTorch and spaCy. Make sure you install this package **before** you\ninstall the models. Also note that this package requires **Python 3.6+** and the\nlatest version of spaCy,\n[v2.1.7](https://github.com/explosion/spaCy/releases/tag/v2.1.7) or above.\n\n```bash\npip install spacy-pytorch-transformers\n```\n\nFor GPU installation, find your CUDA version using `nvcc --version` and add the\n[version in brackets](https://spacy.io/usage/#gpu), e.g.\n`spacy-pytorch-transformers[cuda92]` for CUDA9.2 or\n`spacy-pytorch-transformers[cuda100]` for CUDA10.0.\n\nWe've also pre-packaged some of the pretrained models as spaCy model packages.\nYou can either use the `spacy download` command or download the packages from\nthe [model releases](https://github.com/explosion/spacy-models/releases).\n\n| Package name | Pretrained model | Language | Author | Size | Release |\n| ---------------------------------- | ------------------------- | -------- | --------------------------------------------------------------------------- | ----: | :--------------------------------------------------------------------------------------------------: |\n| `en_pytt_bertbaseuncased_lg` | `bert-base-uncased` | English | [Google Research](https://github.com/google-research/bert) | 406MB | [\ud83d\udce6\ufe0f](https://github.com/explosion/spacy-models/releases/tag/en_pytt_bertbaseuncased_lg-2.1.1) |\n| `de_pytt_bertbasecased_lg` | `bert-base-german-cased` | German | [deepset](https://deepset.ai/german-bert) | 406MB | 
[\ud83d\udce6\ufe0f](https://github.com/explosion/spacy-models/releases/tag/de_pytt_bertbasecased_lg-2.1.1) |\n| `en_pytt_xlnetbasecased_lg` | `xlnet-base-cased` | English | [CMU/Google Brain](https://github.com/zihangdai/xlnet/) | 434MB | [\ud83d\udce6\ufe0f](https://github.com/explosion/spacy-models/releases/tag/en_pytt_xlnetbasecased_lg-2.1.1) |\n| `en_pytt_robertabase_lg` | `roberta-base` | English | [Facebook](https://github.com/pytorch/fairseq/tree/master/examples/roberta) | 292MB | [\ud83d\udce6\ufe0f](https://github.com/explosion/spacy-models/releases/tag/en_pytt_robertabase_lg-2.1.0) |\n| `en_pytt_distilbertbaseuncased_lg` | `distilbert-base-uncased` | English | [Hugging Face](https://medium.com/huggingface/distilbert-8cf3380435b5) | 245MB | [\ud83d\udce6\ufe0f](https://github.com/explosion/spacy-models/releases/tag/en_pytt_distilbertbaseuncased_lg-2.1.0) |\n\n```bash\npython -m spacy download en_pytt_bertbaseuncased_lg\npython -m spacy download de_pytt_bertbasecased_lg\npython -m spacy download en_pytt_xlnetbasecased_lg\npython -m spacy download en_pytt_robertabase_lg\npython -m spacy download en_pytt_distilbertbaseuncased_lg\n```\n\nOnce the model is installed, you can load it in spaCy like any other model\npackage.\n\n```python\nimport spacy\n\nnlp = spacy.load(\"en_pytt_bertbaseuncased_lg\")\ndoc = nlp(\"Apple shares rose on the news. Apple pie is delicious.\")\nprint(doc[0].similarity(doc[7]))\nprint(doc._.pytt_last_hidden_state.shape)\n```\n\n> \ud83d\udca1 If you're seeing an error like `No module named 'spacy.lang.pytt'`,\n> double-check that `spacy-pytorch-transformers` is installed. It needs to be\n> available so it can register its language entry points. Also make sure that\n> you're running spaCy v2.1.7 or higher.\n\n## \ud83d\udcd6 Usage\n\n### Transfer learning\n\nThe main use case for pretrained transformer models is transfer learning. 
You\nload in a large generic model pretrained on lots of text, and start training on\nyour smaller dataset with labels specific to your problem. This package has\ncustom pipeline components that make this especially easy. We provide an example\ncomponent for text categorization. Development of analogous components for other\ntasks should be quite straight-forward.\n\nThe `pytt_textcat` component is based on spaCy's built-in\n[`TextCategorizer`](https://spacy.io/api/textcategorizer) and supports using the\nfeatures assigned by the PyTorch-Transformers models, via the `pytt_tok2vec`\ncomponent. This lets you use a model like BERT to predict contextual token\nrepresentations, and then learn a text categorizer on top as a task-specific\n\"head\". The API is the same as any other spaCy pipeline:\n\n```python\nTRAIN_DATA = [\n (\"text1\", {\"cats\": {\"POSITIVE\": 1.0, \"NEGATIVE\": 0.0}})\n]\n```\n\n```python\nimport spacy\nfrom spacy.util import minibatch\nimport random\nimport torch\n\nis_using_gpu = spacy.prefer_gpu()\nif is_using_gpu:\n torch.set_default_tensor_type(\"torch.cuda.FloatTensor\")\n\nnlp = spacy.load(\"en_pytt_bertbaseuncased_lg\")\nprint(nlp.pipe_names) # [\"sentencizer\", \"pytt_wordpiecer\", \"pytt_tok2vec\"]\ntextcat = nlp.create_pipe(\"pytt_textcat\", config={\"exclusive_classes\": True})\nfor label in (\"POSITIVE\", \"NEGATIVE\"):\n textcat.add_label(label)\nnlp.add_pipe(textcat)\n\noptimizer = nlp.resume_training()\nfor i in range(10):\n random.shuffle(TRAIN_DATA)\n losses = {}\n for batch in minibatch(TRAIN_DATA, size=8):\n texts, cats = zip(*batch)\n nlp.update(texts, cats, sgd=optimizer, losses=losses)\n print(i, losses)\nnlp.to_disk(\"/bert-textcat\")\n```\n\nFor a full example, see the\n[`examples/train_textcat.py` script](examples/train_textcat.py).\n\n### Vectors and similarity\n\nThe `PyTT_TokenVectorEncoder` component of the model sets custom hooks that\noverride the default behaviour of the `.vector` attribute and `.similarity`\nmethod 
of the `Token`, `Span` and `Doc` objects. By default, these usually refer\nto the word vectors table at `nlp.vocab.vectors`. Naturally, in the transformer\nmodels we'd rather use the `doc.tensor` attribute, since it holds a much more\ninformative context-sensitive representation.\n\n```python\napple1 = nlp(\"Apple shares rose on the news.\")\napple2 = nlp(\"Apple sold fewer iPhones this quarter.\")\napple3 = nlp(\"Apple pie is delicious.\")\nprint(apple1[0].similarity(apple2[0]))\nprint(apple1[0].similarity(apple3[0]))\n```\n\n### Serialization\n\nSaving and loading pretrained transformer models and packaging them as spaCy\nmodels \u2728just works \u2728 (at least, it should). The wrapper and components follow\nspaCy's API, so when you save and load the `nlp` object, it...\n\n- Writes the pretrained weights to disk / bytes and loads them back in.\n- Adds `\"lang_factory\": \"pytt\"` in the `meta.json` so spaCy knows how to\n initialize the `Language` class when you load the model.\n- Adds this package and its version to the `\"requirements\"` in the\n `meta.json`, so when you run\n [`spacy package`](https://spacy.io/api/cli#package) to create an installable\n Python package it's automatically added to the setup's `install_requires`.\n\nFor example, if you've trained your own text classifier, you can package it like\nthis:\n\n```bash\npython -m spacy package /bert-textcat /output\ncd /output/en_pytt_bertbaseuncased_lg-1.0.0\npython setup.py sdist\npip install dist/en_pytt_bertbaseuncased_lg-1.0.0.tar.gz\n```\n\n### Extension attributes\n\nThis wrapper sets the following\n[custom extension attributes](https://spacy.io/usage/processing-pipelines#custom-components-attributes)\non the `Doc`, `Span` and `Token` objects:\n\n| Name | Type | Description |\n| ----------------------------- | ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n| 
`._.pytt_alignment` | `List[List[int]]` | Alignment between wordpieces and spaCy tokens. Contains lists of wordpiece token indices (one per spaCy token) or a list of indices (if called on a `Token`). |\n| `._.pytt_word_pieces` | `List[int]` | The wordpiece IDs. |\n| `._.pytt_word_pieces_` | `List[str]` | The string forms of the wordpiece IDs. |\n| `._.pytt_last_hidden_state` | `ndarray` | The `last_hidden_state` output from the PyTorch-Transformers model. |\n| `._.pytt_pooler_output` | `List[ndarray]` | The `pooler_output` output from the PyTorch-Transformers model. |\n| `._.pytt_all_hidden_states` | `List[ndarray]` | The `all_hidden_states` output from the PyTorch-Transformers model. |\n| `._.pytt_all_attentions` | `List[ndarray]` | The `all_attentions` output from the PyTorch-Transformers model. |\n| `._.pytt_d_last_hidden_state` | `ndarray` | The gradient of the `last_hidden_state` output from the PyTorch-Transformers model. |\n| `._.pytt_d_pooler_output` | `List[ndarray]` | The gradient of the `pooler_output` output from the PyTorch-Transformers model. |\n| `._.pytt_d_all_hidden_states` | `List[ndarray]` | The gradient of the `all_hidden_states` output from the PyTorch-Transformers model. |\n| `._.pytt_d_all_attentions` | `List[ndarray]` | The gradient of the `all_attentions` output from the PyTorch-Transformers model. |\n\nThe values can be accessed via the `._` attribute. For example:\n\n```python\ndoc = nlp(\"This is a text.\")\nprint(doc._.pytt_word_pieces_)\n```\n\n### Setting up the pipeline\n\nThe `nlp` object created using `PyTT_Language` requires a few\ncomponents to run in this order: a component that assigns sentence boundaries (e.g.\nspaCy's built-in\n[`Sentencizer`](https://spacy.io/usage/linguistic-features#sbd-component)), the\n`PyTT_WordPiecer`, which assigns the wordpiece tokens, and the\n`PyTT_TokenVectorEncoder`, which assigns the token vectors. The `pytt_name`\nargument defines the name of the pretrained model to use. 
The `from_pretrained`\nmethods load the pretrained model via `pytorch-transformers`.\n\n```python\nfrom spacy_pytorch_transformers import PyTT_Language, PyTT_WordPiecer, PyTT_TokenVectorEncoder\n\nname = \"bert-base-uncased\"\nnlp = PyTT_Language(pytt_name=name, meta={\"lang\": \"en\"})\nnlp.add_pipe(nlp.create_pipe(\"sentencizer\"))\nnlp.add_pipe(PyTT_WordPiecer.from_pretrained(nlp.vocab, name))\nnlp.add_pipe(PyTT_TokenVectorEncoder.from_pretrained(nlp.vocab, name))\nprint(nlp.pipe_names) # ['sentencizer', 'pytt_wordpiecer', 'pytt_tok2vec']\n```\n\nYou can also use the [`init_model.py`](examples/init_model.py) script in the\nexamples.\n\n#### Loading models from a path\n\nPyTorch-Transformers models can also be loaded from a file path instead of just\na name. For instance, let's say you want to use Allen AI's\n[`scibert`](https://github.com/allenai/scibert). First, download the PyTorch\nmodel files, unpack them, unpack the `weights.tar`, rename the\n`bert_config.json` to `config.json` and put everything into one directory. Your\ndirectory should now have a `pytorch_model.bin`, `vocab.txt` and `config.json`.\nAlso make sure that your path **includes the name of the model**. You can then\ninitialize the `nlp` object like this:\n\n```python\nfrom spacy_pytorch_transformers import PyTT_Language, PyTT_WordPiecer, PyTT_TokenVectorEncoder\n\nname = \"scibert-scivocab-uncased\"\npath = \"/path/to/scibert-scivocab-uncased\"\n\nnlp = PyTT_Language(pytt_name=name, meta={\"lang\": \"en\"})\nnlp.add_pipe(nlp.create_pipe(\"sentencizer\"))\nnlp.add_pipe(PyTT_WordPiecer.from_pretrained(nlp.vocab, path))\nnlp.add_pipe(PyTT_TokenVectorEncoder.from_pretrained(nlp.vocab, path))\n```\n\n### Tokenization alignment\n\nTransformer models are usually trained on text preprocessed with the \"wordpiece\"\nalgorithm, which limits the number of distinct token-types the model needs to\nconsider. 
Wordpiece is convenient for training neural networks, but it doesn't\nproduce segmentations that match up to any linguistic notion of a \"word\". Most\nrare words will map to multiple wordpiece tokens, and occasionally the alignment\nwill be many-to-many. `spacy-pytorch-transformers` calculates this alignment,\nwhich you can access at `doc._.pytt_alignment`. It's a list of length equal to\nthe number of spaCy tokens. Each value in the list is a list of consecutive\nintegers, which are indexes into the wordpieces list.\n\nIf you can work on representations that aren't aligned to actual words, it's\nbest to use the raw outputs of the transformer, which can be accessed at\n`doc._.pytt_last_hidden_state`. This variable gives you a tensor with one row\nper wordpiece token.\n\nIf you're working on token-level tasks such as part-of-speech tagging or\nspelling correction, you'll want to work on the token-aligned features, which\nare stored in the `doc.tensor` variable.\n\nWe've taken care to calculate the aligned `doc.tensor` representation as\nfaithfully as possible, with priority given to avoiding information loss. The\nalignment has been calculated such that\n`doc.tensor.sum(axis=1) == doc._.pytt_last_hidden_state.sum(axis=1)`. To make\nthis work, each row of the `doc.tensor` (which corresponds to a spaCy token) is\nset to a weighted sum of the rows of the `last_hidden_state` tensor that the\ntoken is aligned to, where the weighting is proportional to the number of other\nspaCy tokens aligned to that row. To include the information from the (often\nimportant, see Clark et al., 2019) boundary tokens, we imagine that these are\nalso \"aligned\" to all of the tokens in the sentence.\n\n### Batching, padding and per-sentence processing\n\nTransformer models have quadratic runtime and memory complexity with respect to\nsequence length. 
This means that longer texts need to be divided into sentences\nin order to achieve reasonable efficiency.\n\n`spacy-pytorch-transformers` handles this internally, and requires that some sort of\nsentence-boundary detection component has been added to the pipeline. We\nrecommend:\n\n```python\nsentencizer = nlp.create_pipe(\"sentencizer\")\nnlp.add_pipe(sentencizer, first=True)\n```\n\nInternally, the transformer model will predict over sentences, and the resulting\ntensor features will be reconstructed to produce document-level annotations.\n\nIn order to further improve efficiency and reduce memory requirements,\n`spacy-pytorch-transformers` also performs length-based subbatching internally.\nThe subbatching regroups the batched sentences by sequence length, to minimise\nthe amount of padding required. The configuration option `words_per_batch`\ncontrols this behaviour. You can set it to 0 to disable the subbatching, or set\nit to an integer to set a maximum limit on the number of words (including\npadding) per subbatch. The default value of 3000 words works reasonably well on\na Tesla V100.\n\nMany of the pretrained transformer models have a maximum sequence length. If a\nsentence is longer than the maximum, it is truncated and the affected ending\ntokens will receive zeroed vectors.\n\n## \ud83c\udf9b API\n\n### class `PyTT_Language`\n\nA subclass of [`Language`](https://spacy.io/api/language) that holds a\nPyTorch-Transformer (PyTT) pipeline. PyTT pipelines work only slightly\ndifferently from spaCy's default pipelines. Specifically, we introduce a new\npipeline component at the start of the pipeline, `PyTT_TokenVectorEncoder`. 
We\nthen modify the [`nlp.update`](https://spacy.io/api/language#update) function to\nrun the `PyTT_TokenVectorEncoder` before the other pipeline components, and\nbackprop it after the other components are done.\n\n#### staticmethod `PyTT_Language.install_extensions`\n\nRegister the\n[custom extension attributes](https://spacy.io/usage/processing-pipelines#custom-components-attributes)\non the `Doc`, `Span` and `Token` objects. If the extensions have already been\nregistered, spaCy will raise an error. [See here](#extension-attributes) for the\nextension attributes that will be set. You shouldn't have to call this method\nyourself \u2013 it already runs when you import the package.\n\n#### method `PyTT_Language.__init__`\n\nSee [`Language.__init__`](https://spacy.io/api/language#init). Expects either a\n`pytt_name` setting in the `meta` or as a keyword argument, specifying the\npretrained model name. This is used to set up the model-specific tokenizer.\n\n#### method `PyTT_Language.update`\n\nUpdate the models in the pipeline.\n\n| Name | Type | Description |\n| --------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------ |\n| `docs` | iterable | A batch of `Doc` objects or unicode. If unicode, a `Doc` object will be created from the text. |\n| `golds` | iterable | A batch of `GoldParse` objects or dictionaries. Dictionaries will be used to create [`GoldParse`](https://spacy.io/api/goldparse) objects. |\n| `drop` | float | The dropout rate. |\n| `sgd` | callable | An optimizer. |\n| `losses` | dict | Dictionary to update with the loss, keyed by pipeline component. |\n| `component_cfg` | dict | Config parameters for specific pipeline components, keyed by component name. |\n\n### class `PyTT_WordPiecer`\n\nspaCy pipeline component to assign PyTorch-Transformers wordpiece tokenization\nto the Doc, which can then be used by the token vector encoder. 
Note that this\ncomponent doesn't modify spaCy's tokenization. It only sets extension attributes\n`pytt_word_pieces_`, `pytt_word_pieces` and `pytt_alignment` (alignment between\nwordpiece tokens and spaCy tokens).\n\nThe component is available as `pytt_wordpiecer` and registered via an entry\npoint, so it can also be created using\n[`nlp.create_pipe`](https://spacy.io/api/language#create_pipe):\n\n```python\nwordpiecer = nlp.create_pipe(\"pytt_wordpiecer\")\n```\n\n#### Config\n\nThe component can be configured with the following settings, usually passed in\nas the `**cfg`.\n\n| Name | Type | Description |\n| ----------- | ------- | ----------------------------------------------------- |\n| `pytt_name` | unicode | Name of pretrained model, e.g. `\"bert-base-uncased\"`. |\n\n#### classmethod `PyTT_WordPiecer.from_nlp`\n\nFactory to add to `Language.factories` via entry point.\n\n| Name | Type | Description |\n| ----------- | ------------------------- | ----------------------------------------------- |\n| `nlp` | `spacy.language.Language` | The `nlp` object the component is created with. |\n| `**cfg` | - | Optional config parameters. |\n| **RETURNS** | `PyTT_WordPiecer` | The wordpiecer. |\n\n#### method `PyTT_WordPiecer.__init__`\n\nInitialize the component.\n\n| Name | Type | Description |\n| ----------- | ------------------- | ----------------------------------------------------- |\n| `vocab` | `spacy.vocab.Vocab` | The spaCy vocab to use. |\n| `name` | unicode | Name of pretrained model, e.g. `\"bert-base-uncased\"`. |\n| `**cfg` | - | Optional config parameters. |\n| **RETURNS** | `PyTT_WordPiecer` | The wordpiecer. |\n\n#### method `PyTT_WordPiecer.predict`\n\nRun the wordpiece tokenizer on a batch of docs and return the extracted strings.\n\n| Name | Type | Description |\n| ----------- | -------- | -------------------------------------------------------------------------------- |\n| `docs` | iterable | A batch of `Doc`s to process. 
|\n| **RETURNS** | tuple | A `(strings, None)` tuple. The strings are lists of strings, one list per `Doc`. |\n\n#### method `PyTT_WordPiecer.set_annotations`\n\nAssign the extracted tokens and IDs to the `Doc` objects.\n\n| Name | Type | Description |\n| --------- | -------- | ------------------------- |\n| `docs` | iterable | A batch of `Doc` objects. |\n| `outputs` | iterable | A batch of outputs. |\n\n### class `PyTT_TokenVectorEncoder`\n\nspaCy pipeline component to use PyTorch-Transformers models. The component\nassigns the output of the transformer to extension attributes. We also calculate\nan alignment between the wordpiece tokens and the spaCy tokenization, so that we\ncan use the last hidden states to set the `doc.tensor` attribute. When multiple\nwordpiece tokens align to the same spaCy token, the spaCy token receives the sum\nof their values.\n\nThe component is available as `pytt_tok2vec` and registered via an entry point,\nso it can also be created using\n[`nlp.create_pipe`](https://spacy.io/api/language#create_pipe):\n\n```python\ntok2vec = nlp.create_pipe(\"pytt_tok2vec\")\n```\n\n#### Config\n\nThe component can be configured with the following settings, usually passed in\nas the `**cfg`.\n\n| Name | Type | Description |\n| ----------------- | ------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n| `pytt_name` | unicode | Name of pretrained model, e.g. `\"bert-base-uncased\"`. |\n| `words_per_batch` | int | Group sentences into subbatches of max `words_per_batch` in size. For instance, a batch with one 100 word sentence and one 10 word sentence will have size 200 (due to padding). Set to `0` to disable. Defaults to `2000`. 
|\n\n#### classmethod `PyTT_TokenVectorEncoder.from_nlp`\n\nFactory to add to `Language.factories` via entry point.\n\n| Name | Type | Description |\n| ----------- | ------------------------- | ----------------------------------------------- |\n| `nlp` | `spacy.language.Language` | The `nlp` object the component is created with. |\n| `**cfg` | - | Optional config parameters. |\n| **RETURNS** | `PyTT_TokenVectorEncoder` | The token vector encoder. |\n\n#### classmethod `PyTT_TokenVectorEncoder.from_pretrained`\n\nCreate a `PyTT_TokenVectorEncoder` instance using pretrained weights from a\nPyTorch-Transformers model, even if it's not installed as a spaCy package.\n\n```python\nfrom spacy_pytorch_transformers import PyTT_TokenVectorEncoder\nfrom spacy.vocab import Vocab\ntok2vec = PyTT_TokenVectorEncoder.from_pretrained(Vocab(), \"bert-base-uncased\")\n```\n\n| Name | Type | Description |\n| ----------- | ------------------------- | ----------------------------------------------------- |\n| `vocab` | `spacy.vocab.Vocab` | The spaCy vocab to use. |\n| `name` | unicode | Name of pretrained model, e.g. `\"bert-base-uncased\"`. |\n| `**cfg` | - | Optional config parameters. |\n| **RETURNS** | `PyTT_TokenVectorEncoder` | The token vector encoder. |\n\n#### classmethod `PyTT_TokenVectorEncoder.Model`\n\nCreate an instance of `PyTT_Wrapper`, which holds the PyTorch-Transformers\nmodel.\n\n| Name | Type | Description |\n| ----------- | -------------------- | ----------------------------------------------------- |\n| `name` | unicode | Name of pretrained model, e.g. `\"bert-base-uncased\"`. |\n| `**cfg` | - | Optional config parameters. |\n| **RETURNS** | `thinc.neural.Model` | The wrapped model. |\n\n#### method `PyTT_TokenVectorEncoder.__init__`\n\nInitialize the component.\n\n| Name | Type | Description |\n| ----------- | ----------------------------- | ------------------------------------------------------- |\n| `vocab` | `spacy.vocab.Vocab` | The spaCy vocab to use. 
|\n| `model` | `thinc.neural.Model` / `True` | The component's model or `True` if not initialized yet. |\n| `**cfg` | - | Optional config parameters. |\n| **RETURNS** | `PyTT_TokenVectorEncoder` | The token vector encoder. |\n\n#### method `PyTT_TokenVectorEncoder.__call__`\n\nProcess a `Doc` and assign the extracted features.\n\n| Name | Type | Description |\n| ----------- | ------------------ | --------------------- |\n| `doc` | `spacy.tokens.Doc` | The `Doc` to process. |\n| **RETURNS** | `spacy.tokens.Doc` | The processed `Doc`. |\n\n#### method `PyTT_TokenVectorEncoder.pipe`\n\nProcess `Doc` objects as a stream and assign the extracted features.\n\n| Name | Type | Description |\n| ------------ | ------------------ | ------------------------------------------------- |\n| `stream` | iterable | A stream of `Doc` objects. |\n| `batch_size` | int | The number of texts to buffer. Defaults to `128`. |\n| **YIELDS** | `spacy.tokens.Doc` | Processed `Doc`s in order. |\n\n#### method `PyTT_TokenVectorEncoder.predict`\n\nRun the transformer model on a batch of docs and return the extracted features.\n\n| Name | Type | Description |\n| ----------- | ------------ | ----------------------------------- |\n| `docs` | iterable | A batch of `Doc`s to process. |\n| **RETURNS** | `namedtuple` | Named tuple containing the outputs. |\n\n#### method `PyTT_TokenVectorEncoder.set_annotations`\n\nAssign the extracted features to the Doc objects and overwrite the vector and\nsimilarity hooks.\n\n| Name | Type | Description |\n| --------- | -------- | ------------------------- |\n| `docs` | iterable | A batch of `Doc` objects. |\n| `outputs` | iterable | A batch of outputs. |\n\n### class `PyTT_TextCategorizer`\n\nSubclass of spaCy's built-in\n[`TextCategorizer`](https://spacy.io/api/textcategorizer) component that\nsupports using the features assigned by the PyTorch-Transformers models via the\ntoken vector encoder. 
It requires the `PyTT_TokenVectorEncoder` to run before it\nin the pipeline.\n\nThe component is available as `pytt_textcat` and registered via an entry point,\nso it can also be created using\n[`nlp.create_pipe`](https://spacy.io/api/language#create_pipe):\n\n```python\ntextcat = nlp.create_pipe(\"pytt_textcat\")\n```\n\n#### classmethod `PyTT_TextCategorizer.from_nlp`\n\nFactory to add to `Language.factories` via entry point.\n\n| Name | Type | Description |\n| ----------- | ------------------------- | ----------------------------------------------- |\n| `nlp` | `spacy.language.Language` | The `nlp` object the component is created with. |\n| `**cfg` | - | Optional config parameters. |\n| **RETURNS** | `PyTT_TextCategorizer` | The text categorizer. |\n\n#### classmethod `PyTT_TextCategorizer.Model`\n\nCreate a text classification model using a PyTorch-Transformers model for token\nvector encoding.\n\n| Name | Type | Description |\n| ------------------- | -------------------- | -------------------------------------------------------- |\n| `nr_class` | int | Number of classes. |\n| `width` | int | The width of the tensors being assigned. |\n| `exclusive_classes` | bool | Make categories mutually exclusive. Defaults to `False`. |\n| `**cfg` | - | Optional config parameters. |\n| **RETURNS** | `thinc.neural.Model` | The model. |\n\n### dataclass `Activations`\n\nDataclass to hold the features produced by PyTorch-Transformers.\n\n| Attribute | Type | Description |\n| ------------------- | ------ | ----------- |\n| `last_hidden_state` | object | |\n| `pooler_output` | object | |\n| `all_hidden_states` | object | |\n| `all_attentions` | object | |\n| `is_grad` | bool | |\n\n### Entry points\n\nThis package exposes several\n[entry points](https://spacy.io/usage/saving-loading#entry-points) that tell\nspaCy how to initialize its components. 
If `spacy-pytorch-transformers` and\nspaCy are installed in the same environment, you'll be able to run the following\nand it'll work as expected:\n\n```python\ntok2vec = nlp.create_pipe(\"pytt_tok2vec\")\n```\n\nThis also means that your custom models can ship a `pytt_tok2vec` component and\ndefine `\"pytt_tok2vec\"` in their pipelines, and spaCy will know how to create\nthose components when you deserialize the model. The following entry points are\nset:\n\n| Name | Target | Type | Description |\n| ----------------- | ------------------------- | ----------------- | -------------------------------- |\n| `pytt_wordpiecer` | `PyTT_WordPiecer` | `spacy_factories` | Factory to create the component. |\n| `pytt_tok2vec` | `PyTT_TokenVectorEncoder` | `spacy_factories` | Factory to create the component. |\n| `pytt_textcat` | `PyTT_TextCategorizer` | `spacy_factories` | Factory to create the component. |\n| `pytt` | `PyTT_Language` | `spacy_languages` | Custom `Language` subclass. |", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://explosion.ai", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "spacy-pytorch-transformers", "package_url": "https://pypi.org/project/spacy-pytorch-transformers/", "platform": "", "project_url": "https://pypi.org/project/spacy-pytorch-transformers/", "project_urls": { "Homepage": "https://explosion.ai" }, "release_url": "https://pypi.org/project/spacy-pytorch-transformers/0.4.0/", "requires_dist": null, "requires_python": ">=3.6", "summary": "spaCy pipelines for pre-trained BERT and other transformers", "version": "0.4.0" }, "last_serial": 5852998, "releases": { "0.0.1": [ { "comment_text": "", "digests": { "md5": "730cdfd62c294a97e8eadc11d7b73466", "sha256": "b7319154416f51cbe7ada9ec3f1e7777ce7cac83bba21f629be7800af220379f" }, "downloads": -1, "filename": 
"spacy_pytorch_transformers-0.0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "730cdfd62c294a97e8eadc11d7b73466", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6", "size": 36407, "upload_time": "2019-08-02T19:37:12", "url": "https://files.pythonhosted.org/packages/46/a9/c39c72db9f2bb9f47355afa39beda3bbce46c4eb0f9f4347289fdb20344e/spacy_pytorch_transformers-0.0.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "47e0984fc3435643339bc4fba8571235", "sha256": "b7e4f9d057a11a4291e26a83f1f7acb0bab9a919038eaf5c70600a6b8b5f81cd" }, "downloads": -1, "filename": "spacy-pytorch-transformers-0.0.1.tar.gz", "has_sig": false, "md5_digest": "47e0984fc3435643339bc4fba8571235", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 43017, "upload_time": "2019-08-02T19:36:58", "url": "https://files.pythonhosted.org/packages/21/18/ce2c35584c91a0127c0f45ac093426b7143564a8553dc7b176aef8a92442/spacy-pytorch-transformers-0.0.1.tar.gz" } ], "0.1.0": [ { "comment_text": "", "digests": { "md5": "ca53cd19d48dcd962d9dd462d283623c", "sha256": "49b09791d73f3d355361c2d70f0f58908ac60602fe84d55ffd2642f3c7bea2f0" }, "downloads": -1, "filename": "spacy_pytorch_transformers-0.1.0-py3-none-any.whl", "has_sig": false, "md5_digest": "ca53cd19d48dcd962d9dd462d283623c", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6", "size": 54546, "upload_time": "2019-08-10T09:28:42", "url": "https://files.pythonhosted.org/packages/3a/14/d167f0e41813f3a61f4dadeaddbf6eda291932aac3431b526fbeb424526d/spacy_pytorch_transformers-0.1.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "5d7d8f32d40e7f9dec96ed1ced7ac310", "sha256": "5e7b82002615bf31a68a6357ee9c18ddb05a412e0e502f0d510f767311c1b88b" }, "downloads": -1, "filename": "spacy-pytorch-transformers-0.1.0.tar.gz", "has_sig": false, "md5_digest": "5d7d8f32d40e7f9dec96ed1ced7ac310", "packagetype": "sdist", "python_version": "source", 
"requires_python": ">=3.6", "size": 51196, "upload_time": "2019-08-10T09:28:28", "url": "https://files.pythonhosted.org/packages/17/52/c676c2663bb8b53ffe69ad38b6790bfc98c6e15561e5e7169dc2d8d3330f/spacy-pytorch-transformers-0.1.0.tar.gz" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "99ac4d9f07ef00205964447424926b40", "sha256": "eb8caf7e10d945dcb87059d9103f3ff8b7fd78b3da4bc955b25ee3fb5729d44d" }, "downloads": -1, "filename": "spacy_pytorch_transformers-0.1.1-py3-none-any.whl", "has_sig": false, "md5_digest": "99ac4d9f07ef00205964447424926b40", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6", "size": 54580, "upload_time": "2019-08-10T12:12:36", "url": "https://files.pythonhosted.org/packages/a2/a0/c003c654c172b9e7b127c371af6792ffa87a60c9e576976ef10572caf939/spacy_pytorch_transformers-0.1.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "9b58f0a4fab27a6097790e2014ab813e", "sha256": "9577559b49678c3ba60371d0d85979f1795cfa0e7cc98ec8f3a2f43089bfa2da" }, "downloads": -1, "filename": "spacy-pytorch-transformers-0.1.1.tar.gz", "has_sig": false, "md5_digest": "9b58f0a4fab27a6097790e2014ab813e", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 51240, "upload_time": "2019-08-10T12:12:22", "url": "https://files.pythonhosted.org/packages/3f/53/31d7fea43f9177a7b59805edef0b78102c937cb03730d83c9f039967b509/spacy-pytorch-transformers-0.1.1.tar.gz" } ], "0.2.0": [ { "comment_text": "", "digests": { "md5": "df0b83e8fdc4da7220eedcca91cecbbd", "sha256": "e937cfcdb2d612cc6ebc81cbc3ecf5b58af7a3f3b0e8863aca6a440a61c3852a" }, "downloads": -1, "filename": "spacy_pytorch_transformers-0.2.0-py3-none-any.whl", "has_sig": false, "md5_digest": "df0b83e8fdc4da7220eedcca91cecbbd", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6", "size": 57934, "upload_time": "2019-08-12T11:24:12", "url": 
"https://files.pythonhosted.org/packages/b4/a5/45618feff3774b96b046eaafd0d5980c8671159da0bac8ac308dc387532f/spacy_pytorch_transformers-0.2.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "71833d04a7c2ef0ba4f950f034a284d2", "sha256": "ffd17828ba073c4fa7bd593fd41f8969f45c52c78f9847a7a8f847d07d67474a" }, "downloads": -1, "filename": "spacy-pytorch-transformers-0.2.0.tar.gz", "has_sig": false, "md5_digest": "71833d04a7c2ef0ba4f950f034a284d2", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 52123, "upload_time": "2019-08-12T11:24:00", "url": "https://files.pythonhosted.org/packages/79/94/15d79aed3d0822ed98a8593592de73e53cac1708e9a5500d4a2c0c53c0e9/spacy-pytorch-transformers-0.2.0.tar.gz" } ], "0.3.0": [ { "comment_text": "", "digests": { "md5": "7cc7220083fb1eaca5513345f842ff23", "sha256": "9b2f943e19d01dd35b3689fac5f99fcc9becd6816cac7cef08f4c5822926f37e" }, "downloads": -1, "filename": "spacy_pytorch_transformers-0.3.0-py3-none-any.whl", "has_sig": false, "md5_digest": "7cc7220083fb1eaca5513345f842ff23", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6", "size": 61921, "upload_time": "2019-08-27T10:11:07", "url": "https://files.pythonhosted.org/packages/52/4e/c1e18b58eeb7d1bcee19aca284daf1e8658005b8f47c05443a94be377ee8/spacy_pytorch_transformers-0.3.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "726238a81713300c6f6dafc47cd99eac", "sha256": "6ceac4c6db0f507ffd6aeee1620ebed7b07ab97eff4edb72c3a3e2aa3f40d788" }, "downloads": -1, "filename": "spacy-pytorch-transformers-0.3.0.tar.gz", "has_sig": false, "md5_digest": "726238a81713300c6f6dafc47cd99eac", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 56125, "upload_time": "2019-08-27T10:10:54", "url": "https://files.pythonhosted.org/packages/2e/50/478af9def719ca887734bde66775f51bab0f775cacbe039bad65eb484b77/spacy-pytorch-transformers-0.3.0.tar.gz" } ], "0.4.0": [ { "comment_text": 
"", "digests": { "md5": "943c07c260ba7f3da26880cac4b0ab34", "sha256": "77ed625fb02001f73b55d62ff5388b33ad23995465e3a48334742317eac7a10f" }, "downloads": -1, "filename": "spacy_pytorch_transformers-0.4.0-py3-none-any.whl", "has_sig": false, "md5_digest": "943c07c260ba7f3da26880cac4b0ab34", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6", "size": 62135, "upload_time": "2019-09-04T15:13:11", "url": "https://files.pythonhosted.org/packages/fd/46/3271586944ee5e0bd493df03b1ad189eb9ccdad1d2476aeb843b0d2f1b47/spacy_pytorch_transformers-0.4.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "54d1a0d200054b33bd2013cac20c9638", "sha256": "7ff4be62212d636c126b21ac54367c0c210ebdabfb67837b904cde8e55123cde" }, "downloads": -1, "filename": "spacy-pytorch-transformers-0.4.0.tar.gz", "has_sig": false, "md5_digest": "54d1a0d200054b33bd2013cac20c9638", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 56516, "upload_time": "2019-09-04T15:12:58", "url": "https://files.pythonhosted.org/packages/46/65/85f4cc13b6b2fe93b9eb646d422b9a049e2ac4372e86ab081b1ba5e996f4/spacy-pytorch-transformers-0.4.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "943c07c260ba7f3da26880cac4b0ab34", "sha256": "77ed625fb02001f73b55d62ff5388b33ad23995465e3a48334742317eac7a10f" }, "downloads": -1, "filename": "spacy_pytorch_transformers-0.4.0-py3-none-any.whl", "has_sig": false, "md5_digest": "943c07c260ba7f3da26880cac4b0ab34", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6", "size": 62135, "upload_time": "2019-09-04T15:13:11", "url": "https://files.pythonhosted.org/packages/fd/46/3271586944ee5e0bd493df03b1ad189eb9ccdad1d2476aeb843b0d2f1b47/spacy_pytorch_transformers-0.4.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "54d1a0d200054b33bd2013cac20c9638", "sha256": "7ff4be62212d636c126b21ac54367c0c210ebdabfb67837b904cde8e55123cde" }, "downloads": -1, "filename": 
"spacy-pytorch-transformers-0.4.0.tar.gz", "has_sig": false, "md5_digest": "54d1a0d200054b33bd2013cac20c9638", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 56516, "upload_time": "2019-09-04T15:12:58", "url": "https://files.pythonhosted.org/packages/46/65/85f4cc13b6b2fe93b9eb646d422b9a049e2ac4372e86ab081b1ba5e996f4/spacy-pytorch-transformers-0.4.0.tar.gz" } ] }