{ "info": { "author": "Sylvain Mari\u00e9", "author_email": "\"Sylvain Mari\u00e9\" ", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Intended Audience :: Developers", "License :: OSI Approved :: BSD License", "Programming Language :: Python :: 3.5", "Topic :: Software Development :: Libraries :: Python Modules" ], "description": "python simple file collection parsing framework (sficopaf)\n==========================================================\n\nA declarative framework to read complex objects made of several files,\nusing a user-provided library of unitary file parsers.\n\nThis library provides a *framework*, not a specific parser ! Although it\ncomes with a couple unitary file parsers as an example, it is intended\nfor users that already know how to parse their various files\nindependently, but who are looking for a higher-level tool to read\ncomplex objects made of several files/folders and potentially requiring\nto combine several parsers.\n\nTypical use cases of this library :\n\n- read collections of test cases on the file system - each test case\n being composed of several files (for example 2 'test inputs' .csv\n files, 1 'test configuration' .cfg file, and one 'reference test\n results' json file)\n- more generally, read complex objects, for example made of several csv\n files (timeseries + descriptive data), combinations of csv and\n xml/json files, configuration files, etc.\n\nMain features\n-------------\n\n- **Declarative**: you *first* define the type of objects to parse - by\n creating a class -, then you use ``parse_collection`` or\n ``parse_item`` on the appropriate folder or file path.\n- **Supports several unitary file parsers for the same object type**.\n Thanks to a combined *{Type+Extension}* registration, you register\n unitary file parsers for a given object type *and* for a given file\n extension (for example ``str`` + ``.txt``). This allows users to\n register several parsers for the same object type, supporting various\n formats represented by the extensions.\n- **Supports complex classes** : the main interest of this framework is\n its ability to define complex classes that spans across several\n files. For example, a ``MyTestCase`` class that would have two fields\n ``input: DataFrame`` and ``expected_output: str``. The class\n constructor is introspected in order to find the *required* and\n *optional* fields and their names. Fields may be objects or\n collections (that should be declared with the ``typing`` module such\n as ``Dict[str, Foo]``) in order for the framework to keep track of\n the underlying collection types)\n- **Recursive**: fields may themselves be collections or complex types.\n In which case they are represented by several files.\n- Supports **two main file mapping flavours**:\n\n - *flat*, where all items are represented as files in the same\n folder (even fields and collection elements)\n - *wrapped*, where all items that represent collections or complex\n types are represented by folders, and all ready-to-parse items are\n represented by files.\n\n- **Safe**: files are opened and finally closed by the framework, your\n parsing function may exit without closing\n- **Lazy-parsing** : TODO, a later version will allow to only trigger\n parsing when objects are read, in the case of collections\n\nInstallation\n------------\n\nRecommended : create a clean virtual environment\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nWe strongly recommend that you use conda *environment* or pip\n*virtualenv*/*venv* in order to better manage packages. Once you are in\nyour virtual environment, open a terminal and check that the python\ninterpreter is correct:\n\n.. code:: bash\n\n (Windows)> where python\n (Linux) > which python\n\nThe first executable that should show up should be the one from the\nvirtual environment.\n\nInstallation steps\n~~~~~~~~~~~~~~~~~~\n\nThis package is available on ``PyPI``. You may therefore use ``pip`` to\ninstall from a release\n\n.. code:: bash\n\n > pip install sficopaf\n\nUninstalling\n~~~~~~~~~~~~\n\nAs usual :\n\n.. code:: bash\n\n > pip uninstall sficopaf\n\nExamples\n--------\n\nBasic: the ``op_function`` test cases\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\na - example overview\n^^^^^^^^^^^^^^^^^^^^\n\nIn this very simple example, we will parse 'test cases' for an imaginary\nfunction that performs operations :\n``op_function(a:int, b:int, operation:str = '+') -> output:int.``\n\nEach of our 'test case' items will be made of several things: \\*\nmandatory input data (here, ``a`` and ``b``) \\* optional configuration\n(here, ``operation``) \\* mandatory expected result (here, ``output``)\n\nWe would like these things stored in separate files. Typically the\nreason is that you will want to separate the various formats that you\nwish to use: *csv*, *xml*, *json*...\n\nIn this first example we decide to store all items in separate files. So\nour data folder structure looks like this:\n\n.. code:: bash\n\n test_cases\n \u251c\u2500\u2500 case1\n \u2502\u00a0\u00a0 \u251c\u2500\u2500 input_a.txt\n \u2502\u00a0\u00a0 \u251c\u2500\u2500 input_b.txt\n \u2502\u00a0\u00a0 \u2514\u2500\u2500 output.txt\n \u251c\u2500\u2500 case2\n \u2502\u00a0\u00a0 \u251c\u2500\u2500 input_a.txt\n \u2502\u00a0\u00a0 \u251c\u2500\u2500 input_b.txt\n \u2502\u00a0\u00a0 \u251c\u2500\u2500 options.txt\n \u2502\u00a0\u00a0 \u2514\u2500\u2500 output.txt\n \u2514\u2500\u2500 case3\n \u251c\u2500\u2500 input_a.txt\n \u251c\u2500\u2500 input_b.txt\n \u251c\u2500\u2500 options.cfg\n \u2514\u2500\u2500 output.txt\n\n(this data is available in the source code of this project, in folder\n``test_data/demo``)\n\nNote that the configuration file is optional. Here, only ``case2`` and\n``case3`` will have a non-default configuration.\n\nYou may also have noticed that the configuration file is present with\ntwo different extensions : ``.txt`` (in case2) and ``.cfg`` (in case3).\nThis framework allows to register several file extensions for the same\ntype of object to parse. Each extension may have its own parser\nfunction.\n\nb - base types and parsers registration - simple\n^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\nFirst import the package and create a root parser.\n\n.. code:: python\n\n import sficopaf as sf\n root_parser = sf.RootParser()\n\nThen register a parser function for all items that will be represented\nas **single** files.\n\nIn this example, all inputs and outputs are ``int`` so we create a first\nfunction to parse an ``int`` from a text file:\n\n.. code:: python\n\n from io import TextIOBase\n def parse_int_file(file_object: TextIOBase) -> int:\n integer_str = file_object.readline()\n return int(integer_str)\n\nand we register it:\n\n.. code:: python\n\n root_parser.register_extension_parser(int, '.txt', parse_int_file)\n\nNote that the parsing framework automatically opens and closes the file\nfor you, even in case of exception.\n\nc - base types and parsers registration - proxies and item collections\n^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\nWe also need to be able to read a ``configuration``, that we would like\nto be a ``Dict[str, str]`` in order for it to later contain more\nconfiguration options.\n\nUnfortunately this type is an 'item collection' type (dict, list, set,\ntuple), so we have to create our own custom wrapper class, in order for\nthe framework not to think that it has to read each ````\npair as a separate file. Indeed by default the framework considers that\nall 'item collection' types are collections of files.\n\n.. code:: python\n\n class OpConfig(dict):\n \"\"\"\n An OpConfig object is a Dict[str, str] object\n \"\"\"\n def __init__(self, config: Dict[str, str]):\n super(OpConfig, self).__init__()\n self.__wrapped_impl = config\n\n # here you may wish to perform additional checks on the wrapped object\n unrecognized = set(config.keys()) - set('operation')\n if len(unrecognized) > 0:\n raise ValueError('Unrecognized options : ' + unrecognized)\n\n # Delegate all calls to the implementation:\n def __getattr__(self, name):\n return getattr(self.__wrapped_impl, name)\n\nThis is named a dynamic proxy. The ``OpConfig`` class extends the\n``dict`` class, but delegates everything to the underlying ``dict``\nimplementation provided in the constructor.\n\n*Note: this pattern is very useful to use this library, even if the\nunderlying class is not an 'item collection' type. Indeed, this is a\ngood way to create specialized versions of generic objects created by\nyour favourite parsers. For example two ``pandas.DataFrame`` might\nrepresent a training set, and a prediction table. Both objects, although\nsimilar (both are tables with rows and columns), might have very\ndifferent contents (column names, column types, number of rows, etc.).\nWe can make this fundamental difference appear at the parsing level, by\ncreating two classes.*\n\nBack to our example, we propose two formats for the ``OpConfig``: \\* one\n``.txt`` format where the first row will directly contain the value for\nthe ``operation`` \\* one ``.cfg`` format where the configuration will be\navailable in a ``configparser`` format, and for which we want to reuse\nthe existing parser.\n\n.. code:: python\n\n from typing import Dict\n def parse_configuration_txt_file(file_object: TextIOBase) -> OpConfig:\n return {'operation': file_object.readline()}\n\n def parse_configuration_cfg_file(file_object: TextIOBase) -> OpConfig:\n import configparser\n config = configparser.ConfigParser()\n config.read_file(file_object)\n return dict(config['main'].items())\n\nOnce again, we finally register the parsers:\n\n.. code:: python\n\n root_parser.register_extension_parser(OpConfig, '.txt', parse_configuration_txt_file)\n root_parser.register_extension_parser(OpConfig, '.cfg', parse_configuration_cfg_file)\n\nd - main complex type and final parsing execution\n^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\nFinally, we define the ``OpTestCase`` object. Its constructor should\nreflect the way we want to dispatch the various pieces of information in\nseparate files, as well as indicate the files the are optional:\n\n.. code:: python\n\n class OpTestCase(object):\n def __init__(self, input_a: int, input_b: int, output: int, options: OpConfig = None):\n self.input_a, self.input_b, self.output = input_a, input_b, output\n if options is None:\n self.op = '+'\n else:\n self.op = options['operation']\n\n def __str__(self):\n return self.__repr__()\n\n def __repr__(self):\n return str(self.input_a) + ' ' + self.op + ' ' + str(self.input_b) + ' =? ' + str(self.output)\n\nParsing is then straightforward: simply provide the root folder, the\nobject type, and the file mapping flavour.\n\n.. code:: python\n\n results = root_parser.parse_collection('./test_data/demo', OpTestCase)\n\nThe output shows the created test case objects:\n\n.. code:: python\n\n pprint(results)\n\n {'case1': 1 + 2 =? 3, 'case2': 1 + 3 =? 4, 'case3': 1 - 2 =? -1}\n\nAdvanced topics\n~~~~~~~~~~~~~~~\n\nFlat mode\n^^^^^^^^^\n\nIn our example we used folders to encapsulate object fields and item\ncollections. This is ``flat_mode=False``. Alternatively you may wish to\nset ``flat_mode=True``. In this case the folder structure should be\nflat, as shown below. Item names and field names are separated by a\nconfigurable character string. For example to parse the following tree\nstructure:\n\n.. code:: bash\n\n .\n \u251c\u2500\u2500 case1--input_a.txt\n \u251c\u2500\u2500 case1--input_b.txt\n \u251c\u2500\u2500 case1--output.txt\n \u251c\u2500\u2500 case2--input_a.txt\n \u251c\u2500\u2500 case2--input_b.txt\n \u251c\u2500\u2500 case2--options.txt\n \u251c\u2500\u2500 case2--output.txt\n \u251c\u2500\u2500 case3--input_a.txt\n \u251c\u2500\u2500 case3--input_b.txt\n \u251c\u2500\u2500 case3--options.cfg\n \u2514\u2500\u2500 case3--output.txt\n\nyou'll need to call\n\n.. code:: python\n\n results = root_parser.parse_collection('./test_data/demo_flat', OpTestCase, flat_mode=True, sep_for_flat='--')\n pprint(results)\n\nNote that the dot may be safely used as a separator too.\n\nItem collections\n^^^^^^^^^^^^^^^^\n\nThe parsing framework automatically detects any object that is of a\n'item collection' type (``dict``, ``list``, ``set``, and currently\n``tuple`` is not supported). These types should be well defined\naccording to the ``typing`` module: for example let's imagine that we\nhave an additional ``input_c`` in our example, with type\n``typing.Dict[str, typing.List[int]]``.\n\n.. code:: python\n\n class OpTestCaseColl(object):\n def __init__(self, input_a: int, input_b: int, output: int,\n input_c: Dict[str, List[int]] = None, options: OpConfig = None):\n self.input_a, self.input_b, self.output = input_a, input_b, output\n if options is None:\n self.op = '+'\n else:\n self.op = options['operation']\n self.input_c = input_c or None\n\n def __str__(self):\n return self.__repr__()\n\n def __repr__(self):\n return str(self.input_a) + ' ' + self.op + ' ' + str(self.input_b) + ' =? ' + str(\n self.output) + ' ' + str(self.input_c)\n\nFor ``flat_mode=True`` : \\* dictionary keys are read from the text\nbehind the separator after ``input_c`` (so below, ``keyA`` and ``keyB``\nare the key names) \\* list items are indicated by any character\nsequence, but that sequence is not kept when creating the list object\n(below, ``item1`` and ``item2`` will not be kept in the output list)\n\n.. code:: bash\n\n .\n \u251c\u2500\u2500 case1--input_a.txt\n \u251c\u2500\u2500 case1--input_b.txt\n \u251c\u2500\u2500 case1--output.txt\n \u251c\u2500\u2500 case2--input_a.txt\n \u251c\u2500\u2500 case2--input_b.txt\n \u251c\u2500\u2500 case2--options.txt\n \u251c\u2500\u2500 case2--output.txt\n \u251c\u2500\u2500 case3--input_a.txt\n \u251c\u2500\u2500 case3--input_b.txt\n \u251c\u2500\u2500 case3--input_c--keyA--item1.txt\n \u251c\u2500\u2500 case3--input_c--keyA--item2.txt\n \u251c\u2500\u2500 case3--input_c--keyB--item1.txt\n \u251c\u2500\u2500 case3--options.cfg\n \u2514\u2500\u2500 case3--output.txt\n\n.. code:: python\n\n results = root_parser.parse_collection('./test_data/demo_flat_coll', OpTestCaseColl, flat_mode=True, sep_for_flat='--')\n pprint(results['case3'].input_c)\n\nResults:\n\n.. code:: python\n\n {'keyA': [-1, -1], 'keyB': [-1]}\n\nFor ``flat_mode=False`` : \\* we already saw that complex objects are\nrepresented by folders (for example ``case1``, ``case2`` and ``case3``)\n\\* item collections are, too : ``input_c`` is a folder \\* dictionary\nkeys are read from the files or folder names (so below, ``keyA`` and\n``keyB`` are the key names, and since their content is a complex or\ncollection object they are folders themselves) \\* list items are\nindicated by files or folders with any name, but that name is not kept\nwhen creating the list object (below, ``item1`` and ``item2`` are not\nkept in the output list, only their contents is)\n\n.. code:: bash\n\n .\n \u251c\u2500\u2500 case1\n \u2502\u00a0\u00a0 \u251c\u2500\u2500 input_a.txt\n \u2502\u00a0\u00a0 \u251c\u2500\u2500 input_b.txt\n \u2502\u00a0\u00a0 \u2514\u2500\u2500 output.txt\n \u251c\u2500\u2500 case2\n \u2502\u00a0\u00a0 \u251c\u2500\u2500 input_a.txt\n \u2502\u00a0\u00a0 \u251c\u2500\u2500 input_b.txt\n \u2502\u00a0\u00a0 \u251c\u2500\u2500 options.txt\n \u2502\u00a0\u00a0 \u2514\u2500\u2500 output.txt\n \u2514\u2500\u2500 case3\n \u251c\u2500\u2500 input_a.txt\n \u251c\u2500\u2500 input_b.txt\n \u251c\u2500\u2500 input_c\n \u2502\u00a0\u00a0 \u251c\u2500\u2500 keyA\n \u2502\u00a0\u00a0 \u2502\u00a0\u00a0 \u251c\u2500\u2500 item1.txt\n \u2502\u00a0\u00a0 \u2502\u00a0\u00a0 \u2514\u2500\u2500 item2.txt\n \u2502\u00a0\u00a0 \u2514\u2500\u2500 keyB\n \u2502\u00a0\u00a0 \u2514\u2500\u2500 item1.txt\n \u251c\u2500\u2500 options.cfg\n \u2514\u2500\u2500 output.txt\n\n.. code:: python\n\n results = root_parser.parse_collection('./test_data/demo_coll', OpTestCaseColl, flat_mode=False)\n pprint(results['case3'].input_c)\n\nResults:\n\n.. code:: python\n\n {'keyA': [-1, -1], 'keyB': [-1]}\n\nFinally, note that it is not possible to mix collection and\nnon-collection items together (for example, ``Union[int, List[int]]`` is\nnot supported)\n\nSee Also\n--------\n\nCheck `here `__ for\nother parsers in Python, that you might wish to register as unitary\nparsers to perform specific file format parsing (binary, json,\ncustom...) for some of your objects.\n\nDevelopers\n----------\n\nPackaging\n~~~~~~~~~\n\nThis project uses ``setuptools_scm`` to synchronise the version number.\nTherefore the following command should be used for development snapshots\nas well as official releases:\n\n.. code:: bash\n\n python setup.py egg_info bdist_wheel rotate -m.whl -k3\n\nReleasing memo\n~~~~~~~~~~~~~~\n\n.. code:: bash\n\n twine upload dist/* -r pypitest\n twine upload dist/*", "description_content_type": null, "docs_url": null, "download_url": "https://github.com/smarie/python-simple-file-collection-parsing-framework/tarball/1.0.1", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/smarie/python-simple-file-collection-parsing-framework", "keywords": "file collection parsing framework", "license": "BSD 3-Clause", "maintainer": "", "maintainer_email": "", "name": "sficopaf", "package_url": "https://pypi.org/project/sficopaf/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/sficopaf/", "project_urls": { "Download": "https://github.com/smarie/python-simple-file-collection-parsing-framework/tarball/1.0.1", "Homepage": "https://github.com/smarie/python-simple-file-collection-parsing-framework" }, "release_url": "https://pypi.org/project/sficopaf/1.0.1/", "requires_dist": [ "numpy; extra == 'pandas_parser'", "pandas; extra == 'pandas_parser'" ], "requires_python": "", "summary": "Simple FIle COllection PArsing Framework", "version": "1.0.1" }, "last_serial": 2529607, "releases": { "1.0.0": [ { "comment_text": "", "digests": { "md5": "6cf7678c6428c569f88ab1d3f88bc20f", "sha256": "dcbbac2f4f896564ff9d1179a671747a1b1c6872ce8ae852ae2dc83612a0d55c" }, "downloads": -1, "filename": "sficopaf-1.0.0-py3-none-any.whl", "has_sig": false, "md5_digest": "6cf7678c6428c569f88ab1d3f88bc20f", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 25048, "upload_time": "2016-12-20T03:40:07", "url": "https://files.pythonhosted.org/packages/df/dc/d668ea871a2e3b330a8db5e564b8c1616b66cfd522fef3ab66bb7d341ef0/sficopaf-1.0.0-py3-none-any.whl" } ], "1.0.1": [ { "comment_text": "", "digests": { "md5": "a182c58eff5f72f746ee8c137e725687", "sha256": "48408bfff832c4248bdaef17eca061cbdf9c86f05e66903c1372efa8c6b0df17" }, "downloads": -1, "filename": "sficopaf-1.0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "a182c58eff5f72f746ee8c137e725687", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 25045, "upload_time": "2016-12-20T03:44:23", "url": "https://files.pythonhosted.org/packages/73/02/d137bcc1e4836fd0f31b88a7ff426a118090c4aaed0c102c797e0a223a59/sficopaf-1.0.1-py3-none-any.whl" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "a182c58eff5f72f746ee8c137e725687", "sha256": "48408bfff832c4248bdaef17eca061cbdf9c86f05e66903c1372efa8c6b0df17" }, "downloads": -1, "filename": "sficopaf-1.0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "a182c58eff5f72f746ee8c137e725687", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 25045, "upload_time": "2016-12-20T03:44:23", "url": "https://files.pythonhosted.org/packages/73/02/d137bcc1e4836fd0f31b88a7ff426a118090c4aaed0c102c797e0a223a59/sficopaf-1.0.1-py3-none-any.whl" } ] }