{ "info": { "author": "Rohan Kekatpure", "author_email": "rohan_kekatpure@intuit.com", "bugtrack_url": null, "classifiers": [], "description": "=========\nFilemerge\n=========\n\nFilemerge is a utility for merging a large number of small HDFS files into\na smaller number of large files. Filemerge is intended for use by Hadoop operations\nengineers and map-reduce application developers.\n\nThe structure of the code is simple. The actual merging is performed by a Pig\nscript created at run time using user-supplied parameters. These parameters\ncontrol the set of files to merge. The utility consists of a single file,\n``filemerge.py``, that takes the input parameters and invokes the generated Pig\nscript. As such, the ``pig`` **command must be available on the path of the\nruntime user.** The user specifies the input path, output path, topic, and the\nfiles to be merged, either in year/month/day format, as a specific HDFS directory,\nor as a list of HDFS directories in a file.\n\n========================\nInstallation and testing\n========================\n\nBecause the application code is small and self-contained, installation requires\nsimply cloning the repository.\n\n\n.. code-block:: sh\n\n git clone https://github.com/intuit/filemerge.git\n\n\nNote that ``filemerge`` itself does not have any dependencies besides the ``pig``\ncommand-line tool. However, running the test suite locally requires installation of\nthe test discovery and mocking packages. These dependencies are listed in\n``filemerge/requirements.txt`` and can be installed as follows.\n\n.. code-block:: sh\n\n cd filemerge\n pip install -r requirements.txt\n\nFinally, installation can be verified by running the test suite locally.\n\n.. code-block:: sh\n\n nosetests -w unit_tests -v\n\nDevelopers can check test coverage by running\n\n.. 
code-block:: sh\n\n ./runtests.sh\n\nfrom the project top-level directory.\n\n\n==================\nRunning the script\n==================\n\n--------------\nThe script API\n--------------\n\nThe full API of the script is available on the command line by typing\n\n.. code-block:: sh\n\n python filemerge.py -h\n\nThe help message is reproduced below for reference.\n\n.. code-block:: sh \n\n Usage:\n python filemerge.py --topic=<'topic-in-single-quotes'>\n --input-prefix=<'HDFS-location-in-single-quotes'>\n --output-prefix=<'HDFS-location-in-single-quotes'>\n --num-reducers=\n --queue=\n [--year=<4-digit-year>]\n [--month=]\n [--day=]\n [--dir=]\n [--file=]\n [--window=]\n [--codec=]\n [-r]\n\n\n\n Options:\n -h, --help show this help message and exit\n -y YEAR, --year=YEAR Year for the merge\n -m MONTH, --month=MONTH\n Month for the merge\n -d DAY, --day=DAY Day for the merge\n -D DIRECTORY, --directory=DIRECTORY\n Directory containing files to merge\n -f FILE, --file=FILE File containing list of input directories\n -w WINDOW, --window=WINDOW\n Window in days (merge for the past *n* days\n -l LOOKBACK, --lookback=LOOKBACK\n Lookback period (merge for 1 day *n* days prior)\n -t TOPIC, --topic=TOPIC\n Topic for the merge\n -i INPUT_PREFIX, --input-prefix=INPUT_PREFIX\n Input directory prefix\n -o OUTPUT_PREFIX, --output-prefix=OUTPUT_PREFIX\n Output directory prefix\n -n NUM_REDUCERS, --num-reducers=NUM_REDUCERS\n Number of reducers\n -c CODEC, --codec=CODEC\n Compression codec to use\n -q QUEUE, --queue=QUEUE\n Mapreduce job queue\n -r, --dry-run Dry run; create, but dont execute the Pig script\n\nThe arguments outside the square brackets are required and those in the square\nbrackets are optional, but a minimum set of the optional arguments is needed to compute\nthe set of directories to be merged. 
The acceptable option groups are the following:\n\n - Group 1\n - year (-y)\n - year (-y), month (-m)\n - year (-y), month (-m), day (-d)\n\n - Group 2\n - HDFS directory (-D)\n\n - Group 3\n - file with a list of HDFS directories (-f)\n\n - Group 4\n - window with a start date (-w); files for all days from the start date minus\n the window up to the start date will be merged\n\n - Group 5\n - lookback with a start date (-l); files for the single day that lies lookback days before the\n start date will be merged\n\nThese option groups are designed to enable merging at the directory, day, month,\nor year level. The ``-f`` option offers the ability to merge non-contiguous directory\nblocks. The ``-w`` and ``-l`` options allow merging of directories at periodic\nintervals using a sliding window.\n\nOne can further enhance the flexibility of these options by wrapping the\n``python`` call in a shell script and providing a custom list of directories,\nnon-contiguous months, chunking large directory lists into smaller parts, etc.\n\n++++++++++++++++++++++++\nWhy all the flexibility?\n++++++++++++++++++++++++\n\nThe ``filemerge`` tool is written with operations and map-reduce application\ndevelopers in mind. Operations teams will need periodic merges based on the\nretention policy and will typically use the tool with the ``-y, -m, -d``\noptions. Map-reduce application developers might need to merge single\ndirectories or arbitrary directory groups and will use the ``-D`` and ``-f``\noptions.\n\n---------------------------------------------\nBasic usage: Merging all files in a directory\n---------------------------------------------\n\nThe most common usage pattern for ``filemerge`` is to merge all files in a\ndirectory and produce one output file (in a different directory). To merge files\nunder a specific directory, provide the base path using the ``-i`` option and\nthe final directory name using the ``-D`` option. 
In the following invocation,\n``/hdfs/path/to/clickstream`` is the base HDFS path and ``jan2016`` is the\nsubdirectory that contains the files to be merged (in this case, for January\n2016). In other words, the full path to the files that will be merged is:\n``/hdfs/path/to/clickstream/jan2016``\n\n.. code-block:: sh\n\n python filemerge/filemerge.py \\\n -i '/hdfs/path/to/clickstream' \\\n -D 'jan2016' \\\n -o '/hdfs/path/to/jan2016-merged' \\\n -t 'clickstream'\n\n\n-----------------------------------------\nExample invocation for a full month merge\n-----------------------------------------\n\nThe following command invokes the script to merge the February 2015 data of the\n'clickstream' directory in HDFS. This is the raw call to the filemerge python script\nand will initiate 28 map-reduce jobs, one per day.\n\n.. code-block:: sh\n \n python filemerge/filemerge.py \\\n -i '/hdfs/path/to/clickstream' \\\n -o '/hdfs/path/to/clickstream-merged' \\\n -t 'clickstream' \\\n -y 2015 \\\n -m 2\n\n----------------------------------------\nExample invocation for a full year merge\n----------------------------------------\n\nSimply omit the month and day options and the merge will be performed for the\nfull year. The following command invokes the script to merge the entire 2015 data\nof the 'clickstream' directory with a 1 day chunk size. This will initiate 365\nmap-reduce jobs.\n\n.. code-block:: sh\n \n python filemerge/filemerge.py \\\n -i '/hdfs/path/to/clickstream' \\\n -o '/hdfs/path/to/clickstream-merged' \\\n -t 'clickstream' \\\n -y 2015\n\nNote that detecting files in a time window (e.g. a certain month or a year)\nrequires ``filemerge`` to assume a certain directory naming convention. 
This\nconvention is specified in ``filemerge/templates.py`` and can be user-defined.\n\n------------------------------------------------------\nExample invocation for a non-contiguous directory list\n------------------------------------------------------\n\nTo merge files under unrelated, non-contiguous directories, list all the final\ndirectory names in a file and pass the full file path to the ``-f`` option. In\nthe invocation below, ``-i`` captures the common portion of the path to all the\ndirectories and the final directories are listed in the file.\n\n.. code-block:: sh\n\n python filemerge/filemerge.py \\\n -i '/hdfs/path/to/clickstream' \\\n -o '/hdfs/path/to/clickstream-merged' \\\n -t 'clickstream' \\\n -f /local/filesystem/path/to/directory_list.txt\n\nLet's assume that ``/local/filesystem/path/to/directory_list.txt`` contains\nthe following lines\n\n.. code-block:: sh\n\n d_20150225\n d_20160309\n d_20150728\n\nIn that case all files under ``/hdfs/path/to/clickstream/{d_20150225,\nd_20160309, d_20150728}`` will be merged. Note that they won't be merged into\nthe *same* file. Rather, three different output directories, one for each directory\nlisted in ``directory_list.txt``, will be created.\n\n--------------------------------------------\nExample invocation for a sliding time window\n--------------------------------------------\n\nThe following invocation of the ``filemerge`` script will merge files in the\n``clickstream`` directory for the last 20 days (not including today). The window\nis datetime aware.\n\n.. 
code-block:: sh\n\n python filemerge/filemerge.py \\\n -i '/hdfs/path/to/clickstream' \\\n -o '/hdfs/path/to/clickstream-merged' \\\n -t 'clickstream' \\\n -w 20\n\n---------------------------------------------------\nExample invocation for a sliding window daily merge\n---------------------------------------------------\n\nThe following invocation of the ``filemerge`` script will merge files in the\n``clickstream`` topic for the day 20 days prior to today. The lookback is\ndatetime aware.\n\n.. code-block:: sh\n\n python filemerge/filemerge.py \\\n -i '/hdfs/path/to/clickstream' \\\n -o '/hdfs/path/to/clickstream-merged' \\\n -t 'clickstream' \\\n -l 20\n\n---------------------\nMulti-directory merge\n---------------------\n\nFor multi-directory merges, ``filemerge.py`` can be called from a script that\nprovides the list of directories and the merge frequency. The following wrapper\nscript shows how to merge 2015 files for a subset of directories. The script needs to\nbe present in the same directory as the ``filemerge.py`` script.\n\n .. code-block:: sh\n\n #!/bin/bash\n\n # List of all HDFS subdirectories can be obtained as follows\n # hadoop fs -ls /hdfs/base/path | sed -E \"s:.*/hdfs/base/path/(.*)$:\\\\1:\"\n \n # Set of subdirectories to be merged, obtained from output of the\n # above command\n\n TOPICS=(\n businessevents\n customer-transactions\n desktop-clickstream\n mobile-clickstream-ios\n mobile-clickstream-android)\n \n YEAR=2015\n for TOPIC in \"${TOPICS[@]}\"; do\n OUTPUT_DIR=\"/hdfs/base/path/${TOPIC}-merged\"\n\n # Double quotes so that the shell expands the variables\n python filemerge/filemerge.py \\\n -i \"/hdfs/base/path/${TOPIC}\" \\\n -o \"${OUTPUT_DIR}\" \\\n -t \"${TOPIC}\" \\\n -y \"${YEAR}\"\n done\n\n-----------------------\nMerge for custom months\n-----------------------\n\nMerging for custom months is straightforward and is similar to the above looping\nlogic. Once again, the following script needs to be located in the same directory\nas ``filemerge.py``.\n\n .. 
code-block:: sh\n\n #!/bin/bash\n\n # Subset of months to be merged\n MONTHS=(\n 01\n 02\n 07\n 09\n 10\n 12)\n \n YEAR=2015\n TOPIC=\"clickstream\"\n OUTPUT_DIR=\"/hdfs/base/path/${TOPIC}-merged\"\n\n for MM in \"${MONTHS[@]}\"; do\n # Double quotes so that the shell expands the variables\n python filemerge/filemerge.py \\\n -i \"/hdfs/base/path/${TOPIC}\" \\\n -o \"${OUTPUT_DIR}\" \\\n -t \"${TOPIC}\" \\\n -y \"${YEAR}\" \\\n -m \"${MM}\"\n\n done\n\n------------------\nHigh-level pattern\n------------------\n\nThe overarching point here is that **the unit of time for the merge\nlogic is a directory**. As long as this is noted, the actual logic can be customized\nin more ways than those shown above: simply write a wrapper shell script to\ncreate your variables and loop over them. These variables can be months,\ninput directories, or output directories.", "description_content_type": null, "docs_url": null, "download_url": "https://github.com/intuit/filemerge/tarball/0.1", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/intuit/filemerge", "keywords": "pypi,filemerge", "license": "UNKNOWN", "maintainer": null, "maintainer_email": null, "name": "filemerge", "package_url": "https://pypi.org/project/filemerge/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/filemerge/", "project_urls": { "Download": "https://github.com/intuit/filemerge/tarball/0.1", "Homepage": "https://github.com/intuit/filemerge" }, "release_url": "https://pypi.org/project/filemerge/0.0.3/", "requires_dist": null, "requires_python": null, "summary": "Filemerge: Tool to merge small HDFS files", "version": "0.0.3" }, "last_serial": 2369805, "releases": { "0.0.2": [], "0.0.3": [ { "comment_text": "", "digests": { "md5": "db7b1fc7f43bfb1f22fd6a47f5847470", "sha256": "59d42958d48db5f76f2c6a5f64ebbe4a2437f50755a97e26f59da47d2b1145f0" }, "downloads": -1, "filename": "filemerge-0.0.3.tar.gz", "has_sig": false, "md5_digest": "db7b1fc7f43bfb1f22fd6a47f5847470", "packagetype": "sdist", "python_version": 
"source", "requires_python": null, "size": 11827, "upload_time": "2016-09-28T23:20:38", "url": "https://files.pythonhosted.org/packages/d0/74/4c86dae0ef8c0e7f9098eb621acff5b4817cede82b2ed2438383b34637fd/filemerge-0.0.3.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "db7b1fc7f43bfb1f22fd6a47f5847470", "sha256": "59d42958d48db5f76f2c6a5f64ebbe4a2437f50755a97e26f59da47d2b1145f0" }, "downloads": -1, "filename": "filemerge-0.0.3.tar.gz", "has_sig": false, "md5_digest": "db7b1fc7f43bfb1f22fd6a47f5847470", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11827, "upload_time": "2016-09-28T23:20:38", "url": "https://files.pythonhosted.org/packages/d0/74/4c86dae0ef8c0e7f9098eb621acff5b4817cede82b2ed2438383b34637fd/filemerge-0.0.3.tar.gz" } ] }