======
README
======

This package provides a grid file and storage implementation for Zope 3. It
offers a file item and a storage which handle file uploads and store the file
content in a MongoDB database.

NOTE
----

This implementation is not compatible with the default gridfs implementation
from mongodb. Our implementation uses a custom collection to store an item
including the meta data and only stores the additional chunks in a chunk
collection.


How we process file upload
--------------------------

This description outlines how a file upload gets processed in some rough
steps. It describes some internal parts we use for processing an input stream,
but it doesn't really explain how we implemented the grid file pattern.

The browser defines a form with a file upload input field:

  - client starts file upload

The file upload will get sent to the server:

  - create request

  - read input stream

  - process input stream 

    - define cgi parser (p01.cgi.parser.parseFormData)

    - parse input stream with cgi parser
    
      - write file upload part in tmp file
  
      - wrap file upload part from input stream with FileUpload
  
    - store FileUpload instance in request.form with the form input field
      name as key
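
The server-side steps above can be sketched roughly as follows. This is only
an illustration and not the p01.cgi implementation: the ``FileUpload`` class,
the ``storeUploadPart`` helper and the ``blockSize`` parameter are hypothetical
stand-ins, and the real parser also handles the multipart boundaries and
headers which we skip here.

```python
import io
import tempfile


class FileUpload(object):
    """Hypothetical stand-in for the FileUpload wrapper described above."""

    def __init__(self, filename, tmpFile):
        self.filename = filename
        self.headers = {}
        self._file = tmpFile

    def read(self, size=-1):
        return self._file.read(size)

    def seek(self, pos):
        self._file.seek(pos)


def storeUploadPart(form, fieldName, filename, stream, blockSize=8192):
    """Copy one upload part to a tmp file and store the wrapper in the form."""
    tmp = tempfile.TemporaryFile()
    while True:
        block = stream.read(blockSize)
        if not block:
            break
        # write the file upload part block by block into the tmp file
        tmp.write(block)
    tmp.seek(0)
    # the form input field name is used as key
    form[fieldName] = FileUpload(filename, tmp)
    return form[fieldName]


# usage: an in-memory stream plays the role of the request input stream
form = {}
storeUploadPart(form, 'upload', u'test.txt', io.BytesIO(b'Hello World'))
content = form['upload'].read()
```

Copying block by block keeps the memory usage bounded by ``blockSize``, which
is the point of writing the upload part to a tmp file first.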

The file upload gets processed from the request by using z3c.form components:

  - z3c.form defines a widget

  - z3c.form widget reads the FileUpload from the request
  
  - z3c.form data converter returns the plain FileUpload
  
  - z3c.form data manager stores the FileUpload as attribute value

Each file item provides a fileUpload property (attribute) which is responsible
for processing the given FileUpload object. The default built-in fileUpload
property does the following:

  - get a FileWriter

The FileWriter knows how to write the given FileUpload tmp file to mongodb.


setup
-----

  >>> import re
  >>> from pprint import pprint
  >>> from pymongo import ASCENDING
  >>> import transaction
  >>> import m01.mongo.testing
  >>> import m01.grid.testing

Also define a normalizer:

  >>> patterns = [
  ...    (re.compile(r"ObjectId\('[a-zA-Z0-9]+'\)"), r"ObjectId('...')"),
  ...    (re.compile(r"datetime.datetime\([a-zA-Z0-9, ]+tzinfo=<bson.tz_util.FixedOffset[a-zA-Z0-9 ]+>\)"),
  ...                "datetime(..., tzinfo=<bson.tz_util.FixedOffset ...>)"),
  ...    (re.compile(r"datetime.datetime\([a-zA-Z0-9, ]+tzinfo=[a-zA-Z0-9>]+\)"),
  ...                "datetime(..., tzinfo= ...)"),
  ...    (re.compile(r"datetime\([a-z0-9, ]+\)"), "datetime(...)"),
  ...    (re.compile(r"object at 0x[a-zA-Z0-9]+"), "object at ..."),
  ...    ]
  >>> reNormalizer = m01.mongo.testing.RENormalizer(patterns)
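
To illustrate what such a normalizer does with the patterns, here is a minimal
standalone sketch. The ``normalize`` function is a hypothetical helper and not
the RENormalizer implementation; we only assume that each pattern gets applied
in list order to the printed output.

```python
import re


def normalize(text, patterns):
    """Apply each (compiled pattern, replacement) pair in order."""
    for pattern, replacement in patterns:
        text = pattern.sub(replacement, text)
    return text


patterns = [
    (re.compile(r"ObjectId\('[a-zA-Z0-9]+'\)"), r"ObjectId('...')"),
    (re.compile(r"object at 0x[a-zA-Z0-9]+"), "object at ..."),
]

raw = "{'_id': ObjectId('4e7ddf12e138237403000000')}"
normalized = normalize(raw, patterns)
```

Because the patterns run in order, a more specific pattern has to come before
a more general one that would match the same text.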

Test the grid storage:

  >>> db = m01.grid.testing.getTestDatabase()

  >>> chunks = m01.grid.testing.getTestChunksCollection()
  >>> chunks.name
  u'test.chunks'

  >>> files = m01.grid.testing.getTestFilesCollection()
  >>> files.name
  u'test.files'

  >>> storage = m01.grid.testing.SampleFileStorage()
  >>> storage
  <m01.grid.testing.SampleFileStorage object at ...>

Our test setup offers a log handler which we can use like this:

  >>> logger.clear()
  >>> print logger


FileStorageItem
---------------

The FileStorageItem is implemented as an IMongoStorageItem and provides IFile.
This item can get stored in an IMongoStorage. This is known as the
container/item pattern. This container only defines an add method which
implicitly uses the item's __name__ as key.
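
The container/item pattern can be sketched without any mongodb involved. The
``Item`` and ``Container`` classes and the ``parent`` attribute below are
hypothetical and only show the idea of an add method which takes the key from
the item; they are not the m01.mongo classes.

```python
class Item(object):
    """Hypothetical item which knows its own name."""

    def __init__(self, name):
        self.__name__ = name
        self.parent = None


class Container(object):
    """Hypothetical container with an add method but no __setitem__."""

    def __init__(self):
        self._items = {}

    def add(self, item):
        # the key is not passed in, it is taken from the item itself
        key = item.__name__
        item.parent = self
        self._items[key] = item
        return key

    def get(self, key, default=None):
        return self._items.get(key, default)


container = Container()
item = Item(u'first')
key = container.add(item)
```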

  >>> txt = u'Hello World'
  >>> upload = m01.grid.testing.getFileUpload(txt)
  >>> upload.filename
  u'test.txt'
  
  >>> upload.headers
  {}
  
  >>> upload.read()
  'Hello World'
  
  >>> upload.seek(0)
  
  >>> data = {'title': u'title', 'description': u'description'}
  >>> item = m01.grid.testing.SampleFileStorageItem(data)
  >>> firstID = item._id

Apply the file upload to the item:

  >>> item.applyFileUpload(upload)
  Traceback (most recent call last):
  ...
  OperationFailure: command SON([('filemd5', ObjectId('...')), ('root', 'test')]) failed: need an index on { files_id : 1 , n : 1 }

And we've got a log entry:

  >>> print logger
  m01.grid DEBUG
    ... test.chunks ChunkWriter add
  m01.grid DEBUG
    ... test.chunks ChunkWriter flush data
  m01.grid ERROR
    ... test.chunks ChunkWriter add caused an error
  m01.grid ERROR
    command SON([('filemd5', ObjectId('...')), ('root', 'test')]) failed: need an index on { files_id : 1 , n : 1 }

  >>> logger.clear()

As you can see, the gridfs index is missing. This index is required for
getting the md5 hash which the database calculates for us. Let's add the index:

  >>> chunks = m01.grid.testing.getTestChunksCollection()
  >>> chunks.ensure_index([("files_id", ASCENDING), ("n", ASCENDING)],
  ...     unique=True)
  u'files_id_1_n_1'
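
The md5 value itself is a plain MD5 digest over the file content. As a
standalone cross-check, and assuming the chunks get hashed in the order of
their ``n`` value, the same hash can be computed locally with hashlib. The
``md5OfChunks`` helper is hypothetical and not part of this package:

```python
import hashlib


def md5OfChunks(chunks):
    """Return the hex MD5 digest over all chunk data in the given order."""
    digest = hashlib.md5()
    for data in chunks:
        digest.update(data)
    return digest.hexdigest()


# the content may be split into any number of chunks, the digest is the same
checksum = md5OfChunks([b'Hello ', b'World'])
```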

Try again:

  >>> item.applyFileUpload(upload)
  >>> print logger
  m01.grid DEBUG
    ... test.chunks ChunkWriter add
  m01.grid DEBUG
    ... test.chunks ChunkWriter flush data
  m01.grid DEBUG
    ... test.chunks ChunkWriter success

  >>> logger.clear()

Now let's see how our FileItem got enhanced with the chunk info:

  >>> reNormalizer.pprint(item.__dict__)
  {'_id': ObjectId('...'),
   '_m_changed': True,
   '_m_initialized': True,
   '_m_parent': None,
   '_pid': None,
   '_type': u'SampleFileStorageItem',
   '_version': 0,
   'contentType': u'text/plain',
   'created': datetime(..., tzinfo= ...),
   'description': u'description',
   'filename': u'test.txt',
   'length': 11,
   'md5': u'b10a8db164e0754105b7a99be72e3fe5',
   'numChunks': 1,
   'title': u'title',
   'uploadDate': datetime(..., tzinfo= ...)}

  >>> reNormalizer.pprint(item.dump())
  {'__name__': u'...',
   '_id': ObjectId('...'),
   '_type': u'SampleFileStorageItem',
   '_version': 0,
   'chunkSize': 262144,
   'contentType': u'text/plain',
   'created': datetime(..., tzinfo=...),
   'description': u'description',
   'filename': u'test.txt',
   'length': 11,
   'md5': u'b10a8db164e0754105b7a99be72e3fe5',
   'numChunks': 1,
   'removed': False,
   'title': u'title',
   'uploadDate': datetime(..., tzinfo=...)}

As you can see, we can look up the chunks in our chunks collection by calling
find_one. Note that we should not use find and iterate, because this would
leave a cursor open and, more importantly, would not use our chunk index, which
requires using the `files_id` and `n` fields:

  >>> reNormalizer.pprint(chunks.find_one({'files_id': item._id, 'n': 0}))
  {u'_id': ObjectId('...'),
   u'data': Binary('Hello World', 0),
   u'files_id': ObjectId('...'),
   u'n': 0}
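
The chunk layout shown above can be sketched in plain Python. The helper names
``makeChunkDocs`` and ``readChunks`` are hypothetical, and a dict keyed by
``(files_id, n)`` stands in for the indexed chunks collection; the real
ChunkWriter may work differently.

```python
def makeChunkDocs(filesId, data, chunkSize=262144):
    """Split data into chunk documents with files_id, n and data keys."""
    docs = []
    for n, start in enumerate(range(0, len(data), chunkSize)):
        docs.append({
            'files_id': filesId,
            'n': n,
            'data': data[start:start + chunkSize],
        })
    return docs


def readChunks(index, filesId, numChunks):
    """Read chunk by chunk with one exact (files_id, n) lookup each."""
    return b''.join(index[(filesId, n)] for n in range(numChunks))


# a tiny chunk size makes the split visible
docs = makeChunkDocs('some-id', b'Hello World', chunkSize=4)
index = dict(((doc['files_id'], doc['n']), doc['data']) for doc in docs)
content = readChunks(index, 'some-id', len(docs))
```

Each lookup names both ``files_id`` and ``n``, which is the same access path
the find_one call above uses.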

Now let's store our item in our storage:

  >>> key = storage.add(item)
  >>> len(key)
  24

  >>> reNormalizer.pprint(item.__dict__)
  {'_id': ObjectId('...'),
   '_m_changed': True,
   '_m_initialized': True,
   '_m_parent': <m01.grid.testing.SampleFileStorage object at ...>,
   '_pid': None,
   '_type': u'SampleFileStorageItem',
   '_version': 0,
   'contentType': u'text/plain',
   'created': datetime(..., tzinfo= ...),
   'description': u'description',
   'filename': u'test.txt',
   'length': 11,
   'md5': u'b10a8db164e0754105b7a99be72e3fe5',
   'numChunks': 1,
   'title': u'title',
   'uploadDate': datetime(..., tzinfo= ...)}

  >>> reNormalizer.pprint(item.dump())
  {'__name__': u'...',
   '_id': ObjectId('...'),
   '_type': u'SampleFileStorageItem',
   '_version': 0,
   'chunkSize': 262144,
   'contentType': u'text/plain',
   'created': datetime(..., tzinfo=...),
   'description': u'description',
   'filename': u'test.txt',
   'length': 11,
   'md5': u'b10a8db164e0754105b7a99be72e3fe5',
   'numChunks': 1,
   'removed': False,
   'title': u'title',
   'uploadDate': datetime(..., tzinfo=...)}

Now let's commit the items to mongo:

  >>> transaction.commit()

Now let's read the file data:

  >>> item = storage.get(key)
  >>> reader = item.getFileReader()

  >>> reader.read()
  'Hello World'

  >>> for chunk in reader:
  ...     chunk
  'Hello World'


compatibility
-------------

Our implementation is compatible with the gridfs implementation. But take care
if you write file objects to the mongodb with the gridfs library and don't
forget to add the required data the application uses for the specific FileItem.

First let's see what we have stored in our files collection:

  >>> files = m01.grid.testing.getTestFilesCollection()
  >>> for data in files.find():
  ...     reNormalizer.pprint(data)
  {u'__name__': u'...',
   u'_id': ObjectId('...'),
   u'_type': u'SampleFileStorageItem',
   u'_version': 1,
   u'chunkSize': 262144,
   u'contentType': u'text/plain',
   u'created': datetime(..., tzinfo=<bson.tz_util.FixedOffset ...>),
   u'description': u'description',
   u'filename': u'test.txt',
   u'length': 11,
   u'md5': u'b10a8db164e0754105b7a99be72e3fe5',
   u'modified': datetime(..., tzinfo=<bson.tz_util.FixedOffset ...>),
   u'numChunks': 1,
   u'removed': False,
   u'title': u'title',
   u'uploadDate': datetime(..., tzinfo=<bson.tz_util.FixedOffset ...>)}
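
If a file document gets written by the gridfs library, the application data
has to be added afterwards. The following sketch checks which fields are
missing. Note that the field list is only an assumption taken from the
document shown above, and ``missingItemFields`` is a hypothetical helper; a
concrete FileItem may require other fields.

```python
# assumed application fields, taken from the sample document above
APP_FIELDS = ('__name__', '_type', '_version', 'numChunks', 'removed',
              'created')


def missingItemFields(doc, required=APP_FIELDS):
    """Return the sorted names of required fields the document lacks."""
    return sorted(name for name in required if name not in doc)


# a document with only file metadata and no application data
gridfsDoc = {
    '_id': 1,
    'filename': u'test.txt',
    'length': 11,
    'chunkSize': 262144,
    'md5': u'b10a8db164e0754105b7a99be72e3fe5',
}
missing = missingItemFields(gridfsDoc)
```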

Now let's test how we can read with gridfs:

  >>> import gridfs
  >>> grid = gridfs.GridFS(db, 'test')
  >>> f = grid.get(firstID)
  >>> f
  <gridfs.grid_file.GridOut object at ...>

  >>> f.read()
  'Hello World'

Test iterator:

  >>> for chunk in f:
  ...    chunk
  'Hello World'


update
------

We can also update a file by applying a new fileUpload:

  >>> txt = u'Hello NEW World'
  >>> newUpload = m01.grid.testing.getFileUpload(txt)
  >>> newUpload.filename = u'new.txt'
  >>> newUpload.filename
  u'new.txt'

  >>> item = storage.get(key)
  >>> item.applyFileUpload(newUpload)

As you can see, our logger reports that the previous chunk gets marked as tmp
and removed after the upload:

  >>> print logger
  m01.grid DEBUG
    ... test.chunks ChunkWriter update
  m01.grid DEBUG
    ... test.chunks ChunkWriter make tmp chunk
  m01.grid DEBUG
    ... test.chunks ChunkWriter flush data
  m01.grid DEBUG
    ... test.chunks ChunkWriter remove tmp chunk
  m01.grid DEBUG
    ... test.chunks ChunkWriter success

  >>> logger.clear()
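
The order of the log entries suggests a three step update. The sketch below
replays these steps on a plain list of chunk documents. How the ChunkWriter
really marks a chunk as tmp is not shown in this README, so the ``tmp`` flag
and the ``replaceChunks`` helper are assumptions, and error handling is left
out.

```python
def replaceChunks(store, filesId, newChunks):
    """Replace the chunks of one file in a list of chunk documents."""
    # 1. make tmp chunk: mark the existing chunks of this file
    for doc in store:
        if doc['files_id'] == filesId:
            doc['tmp'] = True
    # 2. flush data: write the new chunks next to the tmp chunks
    for n, data in enumerate(newChunks):
        store.append({'files_id': filesId, 'n': n, 'data': data,
                      'tmp': False})
    # 3. remove tmp chunk: drop the old chunks after the new ones are stored
    store[:] = [doc for doc in store
                if not (doc['files_id'] == filesId and doc.get('tmp'))]


store = [
    {'files_id': 'a', 'n': 0, 'data': b'Hello World'},
    {'files_id': 'b', 'n': 0, 'data': b'other file'},
]
replaceChunks(store, 'a', [b'Hello NEW World'])
```

Keeping the old chunks until the new ones are written means the previous
content is still around if the upload fails half way.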

Before we commit, let's check if we get a _m_changed marker:

  >>> reNormalizer.pprint(item.__dict__)
  {'_id': ObjectId('...'),
   '_m_changed': True,
   '_m_initialized': True,
   '_m_parent': <m01.grid.testing.SampleFileStorage object at ...>,
   '_pid': None,
   '_type': u'SampleFileStorageItem',
   '_version': 1,
   'contentType': u'text/plain',
   'created': datetime(..., tzinfo=<bson.tz_util.FixedOffset ...>),
   'description': u'description',
   'filename': u'new.txt',
   'length': 15,
   'md5': u'a3875fc03680b88b13b6ea75c49f8abc',
   'modified': datetime(..., tzinfo=<bson.tz_util.FixedOffset ...>),
   'numChunks': 1,
   'title': u'title',
   'uploadDate': datetime(..., tzinfo= ...)}

Commit the transaction and check the item:

  >>> transaction.commit()

Now let's check if the storage cache is empty, so that we don't get the
cached item without the changed data:

  >>> storage._cache
  {}

Check what we have in mongo:

  >>> files = m01.grid.testing.getTestFilesCollection()
  >>> for data in files.find():
  ...     reNormalizer.pprint(data)
  {u'__name__': u'...',
   u'_id': ObjectId('...'),
   u'_type': u'SampleFileStorageItem',
   u'_version': 2,
   u'chunkSize': 262144,
   u'contentType': u'text/plain',
   u'created': datetime(..., tzinfo=<bson.tz_util.FixedOffset ...>),
   u'description': u'description',
   u'filename': u'new.txt',
   u'length': 15,
   u'md5': u'a3875fc03680b88b13b6ea75c49f8abc',
   u'modified': datetime(..., tzinfo=<bson.tz_util.FixedOffset ...>),
   u'numChunks': 1,
   u'removed': False,
   u'title': u'title',
   u'uploadDate': datetime(..., tzinfo=<bson.tz_util.FixedOffset ...>)}

And let's load the item with our storage:

  >>> item = storage.get(key)
  >>> reNormalizer.pprint(item.__dict__)
  {'_id': ObjectId('...'),
   '_m_changed': False,
   '_m_initialized': True,
   '_m_parent': <m01.grid.testing.SampleFileStorage object at ...>,
   '_pid': None,
   '_type': u'SampleFileStorageItem',
   '_version': 2,
   'contentType': u'text/plain',
   'created': datetime(..., tzinfo=<bson.tz_util.FixedOffset ...>),
   'description': u'description',
   'filename': u'new.txt',
   'length': 15,
   'md5': u'a3875fc03680b88b13b6ea75c49f8abc',
   'modified': datetime(..., tzinfo=<bson.tz_util.FixedOffset ...>),
   'numChunks': 1,
   'title': u'title',
   'uploadDate': datetime(..., tzinfo=<bson.tz_util.FixedOffset ...>)}

Now let's read the file data:

  >>> reader = item.getFileReader()
  >>> reader.read()
  'Hello NEW World'

  >>> for chunk in reader:
  ...     chunk
  'Hello NEW World'


FileObject
----------

The FileObject provides IFile and IMongoObject.
