{ "info": { "author": "LOGILAB S.A. (Paris, FRANCE)", "author_email": "contact@logilab.fr", "bugtrack_url": null, "classifiers": [ "Environment :: Web Environment", "Framework :: CubicWeb", "Programming Language :: JavaScript", "Programming Language :: Python" ], "description": "=======\nSummary\n=======\n\nCube for data input/output, import and export\n\n\nMassive Store\n=============\n\nThe Massive Store is a CW store used to push massive amount\nof data using pure SQL logic, thus avoiding CW checks.\nIt is faster than other CW stores (it does not check eid at each step,\nit use COPY FROM method), but is less safe (no data integrity securities),\nand does not return an eid while using create_entity function.\n\nWARNING: This store may be only used with PostgreSQL for now, as it\nrelies on the COPY FROM method, and on specific PostgreSQL tables\nto get all the indexes.\n\n\nWorkflow of Massive Store\n-------------------------\n\nThe Massive Store workflow is the following:\n\n * Drop indexes and constraints from the meta-data tables (entities, is_instance_of, ...);\n\n\n * Insertion of data:\n\n * using the `create_entity` function for entities;\n\n * using the `relate` function for relations;\n\n * using the `related_by_iid` function for relations based on external identifiers;\n\n * each insertion of a rtype that has not been seen yet will trigger the\n creation of a temporary table for this rtype, to store the results.\n\n * each insertion of an etype that has not been seen yet will remove all\n the indexes/constraints on the entity table.\n\n\n * At a given point, one should call the `flush` method:\n\n * it will flush the entities data into the database based on `COPY_FROM`.\n\n * it will flush the relations data into the database based on `COPY_FROM`.\n\n * it will flush the relations-iid data into the database based on `COPY_FROM`.\n\n * it will create the metadata (entities, ...) for the insered entities.\n\n * it will commit.\n\n\n * If some relations are created based on external identifiers (`relate_by_iid`),\n the conversion should be manually done using the `convert_relations` method.\n\n\n * At the end of the insertion, one should call the `cleanup` method:\n\n * it will re-create the indexes/constraints/primary key for the entities/relations tables.\n\n * it will re-create the indexes/constraints on the meta-data tables.\n\n * it will remove temporary tables and internal store tables.\n\n\n\n\nEntities/Relations in Massive Store\n-----------------------------------\n\nDue to the technical constraints on the database insertion, there are some following specific points to notice:\n\n * a `create_entity` call will return an entity with a specific `eid`. Eids are automatically dealt with\n by the Massive Store (it will fetch for a given range of eids for its internal use), but you can\n pass a specific eid in the kwargs of the `create_entity` call to bypass the automatic assignation of an eid.\n\n * inlined-relations are not supported in the `relate` method.\n\nA buffer will be created for the call to the PostgreSQL `COPY_FROM` clause.\nIf the separator used for the creation of this tabular file is found in the data of the entities (or relations),\nit will be replace by the `replace_sep` of the store (default is to '').\n\n\n\nBasic use of Massive Store\n--------------------------\n\nA simple script using the Massive Store::\n\n # Initialize the store\n store = MassiveObjectStore(session)\n # Initialize the Relation table\n store.init_rtype_table('Person', 'lives', 'Location')\n\n # Import logic\n ...\n entity = store.create_entity('Person', ...)\n entity = store.create_entity('Location', ...)\n\n # Flush the data in memory to sql database\n store.flush()\n\n # Import logic\n ...\n entity = store.create_entity('Person', ...)\n entity = store.create_entity('Location', ...)\n # Person_iid and location_iid are unique iid that are data dependant (e.g URI)\n store.relate_by_iid(person_iid, 'lives', location_iid)\n ...\n\n # Flush the data in memory to sql database\n store.flush()\n\n # Convert the relation\n store.convert_relations('Person', 'lives', 'Location')\n\n # Clean the store / rebuild indexes\n store.cleanup()\n\nIn this case, iid_subj and iid_obj represent an unique id\n(e.g. uri, or id from the imported database) that can be used to create\nrelations after importing entities.\n\n\nAdvanced use of Massive Store\n-----------------------------\n\nThe simple and default use of the Massive Store is conservative to\navoid issues in meta-data management. However it is possible to\nincrease insertion speed:\n\n* the flushing of meta-data could be costly if done too many times.\n A good practive is to do only once at the end of the import.\n For doing so, you should set `autoflush_metadata` to False in the store creation,\n and you should call the `flush_meta_data` at the end of the import\n (**but before the call to `cleanup`**).\n\n* you may avoid to commit at each flush, by setting `commit_at_flush`\n to False in the store creation. Thus you should explicitely call\n the `commit` method at least once **before flushing the meta data\n and cleaning up the store**.\n\n* you could avoid dropping the different indexes and constraints using\n the `drop_index` attribute during the store creation.\n\n* you could set a different starting point of the eids sequence using\n the `eids_seq_start` attribute during the store creation.\n\n* additional callbacks could be given to deal with commit and rollback\n (`on_commit_callback` and `on_rollback_callback`).\n\n\nExample of advanced use of Massive Store::\n\n store = MassiveObjectStore(session,\n\t\t\t\t autoflush_metadata=False,\n commit_at_flush=False)\n store.init_rtype_table('Location', 'names', 'LocationName')\n for ind, infos in enumerate(ucsvreader(open(dumpname))):\n entity = {'name': infos[1], ...}\n entity['my_inlined_relation'] = my_dict.get(infos[2])\n entity = store.create_entity('Location', **entity)\n store.relate_by_iid(entity.cwuri, 'my_external_relation', infos[3])\n if ind and ind % 200000 == 0:\n store.flush()\n store.commit()\n store.flush()\n store.commit()\n store.flush_meta_data()\n store.convert_relations('Location', 'my_external_relation', 'Location',\n 'cwuri', 'cwuri')\n store.cleanup()\n\n\nRestoring a database after Massive Store failure\n------------------------------------------------\n\nThe Massive Store remove some constraints and indexes that are\nautomatically rebuild during the `cleanup` call. If there is an error\nduring the import process, you could still call to the `cleanup`\nmethod, or even recreate after the failure another store and call the\n`cleanup` method of this store.\n\nThe Massive Store create the following tables for its internal use:\n\n* `dataio_initialized`: information on the initialized etype/rtype tables.\n\n* `dataio_constraints`: the queries that may be used to restore the constraints/indexes\n for the different etype/rtype tables.\n\n* `dataio_metadata`: the etypes that have already have their meta-data pushed.\n\n\nSlave Mode\n----------\n\nA slave mode is available for parallel use of the Massive Store:\n\n* a Massive Store (*master*) should be created.\n\n* for all the possible etype/rtype that may be encoutered during the\n import, the `init_etype_table`/`init_relation_table` methods of the\n *master* store should be called.\n\n* different *slave* stores could be created using the `slave_mode`\n attribute during the store creation. The `autoflush_metadata`\n attribute should be setted to False.\n\n* each *slave* store could be used in a different thread, for creating\n entity and relation, and should only call to its `flush` and\n `commit` methods.\n\n* The *master* store should call its `flush_meta_data` and `cleanup`\n methods at the end of the import.\n\n\nRDF Store\n=========\n\nThe RDF Store is used to import RDF data into a CubicWeb data, based\non a Yams <-> RDF schema conversion. The conversion rules are stored\nin a XY structure.\n\n\nBuilding an XY structure\n------------------------\n\nYou have to create a file (usually called `xy.py`) in your cube, and\nimport the dataio version of xy::\n\n from cubes.dataio import xy\n\nYou have to register the different prefixes (common prefixes as skos\nor foaf are already registered)::\n\n xy.register_prefix('diseasome', 'http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseasome/')\n\nBy default, the entity type is based on the rdf property \"rdf:type\",\nbut you may changed it using::\n\n xy.register_rdf_etype_property('skos:inScheme')\n\nIt is also possible to give a specific callback to determine the\nentity type from the rdf properties::\n\n def _rameau_etype_callback(rdf_properties):\n if 'skos:inScheme' in rdf_properties and 'skos:prefLabel' in rdf_properties:\n \t return 'Rameau'\n\n xy.register_etype_callback(_rameau_etype_callback)\n\n\nThe URI is fetched from the \"rdf:about\" property, and can be\nnormalized using a specific callback::\n\n def normalize_uri(uri):\n \tif uri.endswith('.rdf'):\n return uri[:-4]\n\treturn uri\n\n xy.register_uri_conversion_callback(normalize_uri)\n\n\n\nDefining the conversion rules\n-----------------------------\n\n\nThen, you may write the conversion rules:\n\n- xy.add_equivalence allows you to add a basic equivalence between\n entity type / attribute / relations, and RDF properties. You may use\n \"*\" as a wild cart in the Yams part. E.g. for entity types::\n\n\txy.add_equivalence('Gene', 'diseasome:genes')\n\txy.add_equivalence('Disease', 'diseasome:diseases')\n\n E.g. for attributes::\n\n \txy.add_equivalence('* name', 'diseasome:name')\n\txy.add_equivalence('* label', 'rdfs:label')\n\txy.add_equivalence('* label', 'diseasome:label')\n\txy.add_equivalence('* class_degree', 'diseasome:classDegree')\n\txy.add_equivalence('* size', 'diseasome:size')\n\n\n E.g. for relations::\n\n xy.add_equivalence('Disease close_match ExternalUri', 'diseasome:classes')\n xy.add_equivalence('Disease subtype_of Disease', 'diseasome:diseaseSubtypeOf')\n xy.add_equivalence('Disease associated_genes Gene', 'diseasome:associatedGene')\n xy.add_equivalence('Disease chromosomal_location ExternalUri', 'diseasome:chromosomalLocation')\n xy.add_equivalence('* sameas ExternalUri', 'owl:sameAs')\n xy.add_equivalence('Gene gene_id ExternalUri', 'diseasome:geneId')\n xy.add_equivalence('Gene bio2rdf_symbol ExternalUri', 'diseasome:bio2rdfSymbol')\n\n\n- A base URI can be given to automatically determine if a Resource should be considered\n as an external URI or an internal relation::\n\n xy.register_base_uri('http://www4.wiwiss.fu-berlin.de/diseasome/resource/')\n\n\n A more complex logic can be used by giving a specific callback::\n\n def externaluri_callback(uri):\n \t if uri.startswith('http://www4.wiwiss.fu-berlin.de/diseasome/resource/'):\n if uri.endswith('disease') or uri.endswith('gene'):\n return False\n\t return True\n\t return True\n\n xy.register_externaluri_callback(externaluri_callback)\n\n\n\nThe values of attributes are built based on the Yams type. But you\ncould use a specific callback to compute the correct values from the\nrdf properties::\n\n\t def _convert_date(_object, datetime_format='%Y-%m-%d'):\n\t \"\"\" Convert an rdf value to a date \"\"\"\n\t try:\n\t\treturn datetime.strptime(_object.format(), datetime_format)\n\t except:\n\t return None\n\n xy.register_attribute_callback('Date', _convert_date)\n\nor::\n\n\tdef format_isbn(rdf_properties):\n \t if 'bnf-onto:isbn' in rdf_properties:\n\t isbn = rdf_properties['bnf-onto:isbn'][0]\n\t isbn = [i for i in isbn if i in '0123456789']\n\t return int(''.join(isbn)) if isbn else None\n\n\txy.register_attribute_callback('Manifestation formatted_isbn', format_isbn)\n\n\n\nImporting data\n--------------\n\nData may thus be imported using the \"import-rdf\" command of cubicweb-ctl::\n\n\n cubicweb-ctl import-rdf \n\nThe default library used for reading the data is \"rdflib\" but one may\nuse \"librdf\" using the \"--lib\" option.\n\nIt is also possible to force the rdf-format (it is automatically\ndetermined, but this may sometimes lead to errors), using the\n\"--rdf-format\" option.\n\n\n\nExporting data\n--------------\n\nThe view 'rdf' may be called and will create a RDF file from the\nresult set. It is a modified version of the CubicWeb RDFView, that\ntake into account the more complex conversion rules from the dataio\ncube. The format can also be forced (default is XML) using the\n\"--format\" option in the url (xml, n3 or nt).\n\n\n\nExamples\n--------\n\nExamples of use of dataio rdf import could be found in the nytimes and\ndiseasome cubes.", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://www.cubicweb.org/project/cubicweb-dataio", "keywords": null, "license": "LGPL", "maintainer": null, "maintainer_email": null, "name": "cubicweb-dataio", "package_url": "https://pypi.org/project/cubicweb-dataio/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/cubicweb-dataio/", "project_urls": { "Download": "UNKNOWN", "Homepage": "http://www.cubicweb.org/project/cubicweb-dataio" }, "release_url": "https://pypi.org/project/cubicweb-dataio/0.7.0/", "requires_dist": null, "requires_python": null, "summary": "Cube for data input/output, import and export", "version": "0.7.0" }, "last_serial": 2060157, "releases": { "0.2.0": [ { "comment_text": "", "digests": { "md5": "b1a7f1a0918b10dcb6e43629b26a6924", "sha256": "5f4bb4784cd6838c139dae581d53ce5f89497a8ac8aabba24660837d9e98beb4" }, "downloads": -1, "filename": "cubicweb-dataio-0.2.0.tar.gz", "has_sig": false, "md5_digest": "b1a7f1a0918b10dcb6e43629b26a6924", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 32671, "upload_time": "2013-07-18T13:22:59", "url": "https://files.pythonhosted.org/packages/fb/47/6bbb0fdedfb698a23fd5632be1be0e06465e41edba4b9e1e06c1ec3193d0/cubicweb-dataio-0.2.0.tar.gz" } ], "0.3.3": [ { "comment_text": "", "digests": { "md5": "8653e1529e94287171734c5f2792f264", "sha256": "d9f0395fb5b5f2c88357b53d75bac31da2133c25064d87436bf9bc09970acb1c" }, "downloads": -1, "filename": "cubicweb-dataio-0.3.3.tar.gz", "has_sig": false, "md5_digest": "8653e1529e94287171734c5f2792f264", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 41430, "upload_time": "2013-11-21T13:30:40", "url": "https://files.pythonhosted.org/packages/c7/66/8386c6c52a5be11c84cb9eaaded3dd4a37489903019640a6acdaf5530083/cubicweb-dataio-0.3.3.tar.gz" } ], "0.3.4": [ { "comment_text": "", "digests": { "md5": "4b31f12a5c80db1f1b4855b690192f8e", "sha256": "4c9929e8d68bd53ddc3a42f036444fb395b23c4867799bea626c4ae7f0916ef4" }, "downloads": -1, "filename": "cubicweb-dataio-0.3.4.tar.gz", "has_sig": false, "md5_digest": "4b31f12a5c80db1f1b4855b690192f8e", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 41435, "upload_time": "2014-01-24T17:14:37", "url": "https://files.pythonhosted.org/packages/0b/22/93c12fc1ee2c78e1c59ff38c0ab654c40a1358413e139ffb58f5e80d2067/cubicweb-dataio-0.3.4.tar.gz" } ], "0.4.1": [ { "comment_text": "", "digests": { "md5": "5eeee928e45d953562f9b5058e9a05c9", "sha256": "d8d4243af26986ee988c1f960e2b35dd578423e2ddfc7b00b7fb024551580b0c" }, "downloads": -1, "filename": "cubicweb-dataio-0.4.1.tar.gz", "has_sig": false, "md5_digest": "5eeee928e45d953562f9b5058e9a05c9", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 41269, "upload_time": "2014-03-05T10:53:25", "url": "https://files.pythonhosted.org/packages/d9/9d/9f3741adb6868b39527a3d5b1447493aa159675030a87573b1f9748fbdca/cubicweb-dataio-0.4.1.tar.gz" } ], "0.5.0": [ { "comment_text": "", "digests": { "md5": "78d98c87e594c4ade3af715e4c772100", "sha256": "370610382a53ae56ef74e7916f697a36f6e7d7f7df83eb52d4995e09ac137ba3" }, "downloads": -1, "filename": "cubicweb-dataio-0.5.0.tar.gz", "has_sig": false, "md5_digest": "78d98c87e594c4ade3af715e4c772100", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 42518, "upload_time": "2014-04-29T14:45:05", "url": "https://files.pythonhosted.org/packages/e3/ea/83075067084530dc1042b19341aeda8dead77e7786db838af71414863580/cubicweb-dataio-0.5.0.tar.gz" } ], "0.6.1": [ { "comment_text": "", "digests": { "md5": "41f2e77b52407ff049ee5be4d1d50d31", "sha256": "639435e43d983f0d1c29cfc8113eb72c69a46cbe4d0097cc828bac8087d1f49f" }, "downloads": -1, "filename": "cubicweb-dataio-0.6.1.tar.gz", "has_sig": false, "md5_digest": "41f2e77b52407ff049ee5be4d1d50d31", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 41173, "upload_time": "2015-07-16T14:01:51", "url": "https://files.pythonhosted.org/packages/05/fd/e1506b96f0c21a09ff1aa22213ea2da0bcff28028d44f21bccc5e4043d70/cubicweb-dataio-0.6.1.tar.gz" } ], "0.7.0": [ { "comment_text": "", "digests": { "md5": "bd8120be5b1949f969e5b5a035820877", "sha256": "4e96b4715a9d9e7a2bbce1324c2349dfbb4fe0e7e3ac3e673f5104b35619ddea" }, "downloads": -1, "filename": "cubicweb-dataio-0.7.0.tar.gz", "has_sig": false, "md5_digest": "bd8120be5b1949f969e5b5a035820877", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 41136, "upload_time": "2016-04-12T15:22:39", "url": "https://files.pythonhosted.org/packages/c7/4a/31f5b219df3c07dc89f64c240e8d5dd12a4cb1e367041ba5805fecf6135e/cubicweb-dataio-0.7.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "bd8120be5b1949f969e5b5a035820877", "sha256": "4e96b4715a9d9e7a2bbce1324c2349dfbb4fe0e7e3ac3e673f5104b35619ddea" }, "downloads": -1, "filename": "cubicweb-dataio-0.7.0.tar.gz", "has_sig": false, "md5_digest": "bd8120be5b1949f969e5b5a035820877", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 41136, "upload_time": "2016-04-12T15:22:39", "url": "https://files.pythonhosted.org/packages/c7/4a/31f5b219df3c07dc89f64c240e8d5dd12a4cb1e367041ba5805fecf6135e/cubicweb-dataio-0.7.0.tar.gz" } ] }