At the moment, the File class stores its contents in the database. This does not scale well and makes database backups a burden when a large number of files are stored. Ideally, File should allow contents to be stored outside the database via pluggable storage providers, making it possible to keep data in the filesystem or in the cloud, for example.
Implementation Overview
The idea is to create an additional field in the database that holds information specific to external storage providers. If this field is left blank, the current behaviour is unchanged and File contents are kept in the database. If the field holds some sort of URL, an external provider is selected based on it, and File contents are stored and retrieved through that provider.
Current test cases must continue to pass. Additional test cases need to be created to exercise the new field (let's call it storage).
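To make the selection rule concrete, here is a minimal sketch of how a provider could be chosen from the storage field. The provider classes, the PROVIDERS mapping and the select_provider function are all hypothetical names invented for illustration; nothing here is existing Kotti code.

```python
from urllib.parse import urlparse


class FileSystemProvider:
    """Hypothetical provider; a real one would read and write files on disk."""


class CloudProvider:
    """Hypothetical provider; a real one would talk to a remote object store."""


# Hypothetical mapping from URL scheme to provider class.
PROVIDERS = {"file": FileSystemProvider, "s3": CloudProvider}


def select_provider(storage):
    """Return a provider for the `storage` field, or None to use the database."""
    if not storage:
        # Blank field: current behaviour, contents stay in the database.
        return None
    scheme = urlparse(storage).scheme
    try:
        return PROVIDERS[scheme]()
    except KeyError:
        raise ValueError("no storage provider registered for %r" % scheme)
```

Existing rows, whose storage field is empty, would keep working unchanged, which is what lets the current test cases continue to pass.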
Implementation Alternatives
At first glance, SQLAlchemy-ImageAttach ( http://sqlalchemy-imageattach.readthedocs.org/ ) appears to be an excellent candidate for speeding up the implementation of this new feature. Unfortunately, its API is designed specifically with images in mind, as opposed to arbitrary attachments. Browsing its source code, it becomes apparent that this assumption appears in several places, which makes it inconvenient for our purposes.
A second alternative is OFS ( https://pypi.python.org/pypi/ofs ), which is BLOB-oriented and presents a generic API "documented" here: https://github.com/okfn/ofs/blob/master/ofs/base.py . There are concrete implementations using local data storage in this directory ( https://github.com/okfn/ofs/tree/master/ofs/local ) and remote implementations using S3, among others, in this directory ( https://github.com/okfn/ofs/tree/master/ofs/remote ). Although the OFS documentation is far from great, the source code proved to be simple and clean enough to browse.
Although OFS looks sufficiently simple, extensible and mature, I would like to allow even more flexibility, namely not depending explicitly on OFS, since OFS may sometimes be tightly dependent on CKAN... and I have already had my fair share of frustration with CKAN. Anyway, the idea is to allow OFS to be one (and possibly the most important and popular) alternative for external data storage in Kotti, but not the only one. This topic is discussed in more detail below.
Proposed Implementation
When Kotti starts, a singleton class called Storage is instantiated and stored in settings. This singleton reads the configuration and initializes itself according to the OFS implementation chosen, either local or remote. It is also possible that nothing is specified in the configuration; in that case Storage is None, which means that File should use the database as it does at the moment. Every time a File is stored or retrieved, the [singleton] Storage is obtained from settings and, if it is not None, used accordingly.
When a File is constructed, its __init__ method should accept an optional argument storage which, if specified, lets that specific instance of File use a specific instance of Storage. This is helpful when storage migration or data reorganization is needed, since it allows multiple "Storages" and not only the one provided by the singleton Storage. More details about this are presented below.
When a File record is stored in the database, an additional field called storage is also persisted, making it possible to select the appropriate storage provider later, when that File requires data retrieval. Retrieval must, by the way, be deferred in the same way it is deferred at the moment, when the data is stored internally in the database.
Details of Implementation
At the moment, I have an implementation "already working" but not integrated into File as proposed here. So the next step would be exactly the sort of integration we are talking about here.
I'm sharing some ideas below so that more experienced Kotti developers can have their say as soon as possible, which should speed up the implementation.
Below is a draft specification, which someone could eventually reorganize into something more formal, if desired.
Storage
1. Class Storage implements a marker interface IStorage.
2. At the moment, Storage recognizes configuration entries starting with storage.ofs, which makes OFS a first-class citizen for external data storage, but not the only one. In the future, other libraries or implementations could be plugged in, not only OFS.
3. A system-wide (singleton) Storage instance is saved into settings. If no configuration items are found, a null Storage reference is created and saved into settings.
4. OFS itself permits several storage implementations, each with its own initialization details. For PTOFS ( https://github.com/okfn/ofs/blob/master/ofs/local/pairtreestore.py ), for example, these two configuration entries would be enough:

    storage.ofs.class=ofs.local.PTOFS
    storage.ofs.ptofs.directory=/path/to/storage/directory

5. At this point, I'm supporting only PTOFS. In the future, Storage would recognize further configuration settings and be able to initialize not only PTOFS but also other specific OFS implementations.
   NOTE (5.a): This is only a limitation of scope for this initial implementation, since I have very limited time to dedicate to it.
6. At this point, our IStorage interface allows the operations store, retrieve and remove.
   NOTE (6.a): We are not supporting changes to the metadata at this point. See "Example Usage" at https://pypi.python.org/pypi/ofs for more clarification about this.
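The handling of the storage.ofs entries described above could look roughly like the sketch below. The functions storage_options and make_storage, and the factories mapping, are hypothetical; the stand-in factory only records its options and does not call the real OFS classes.

```python
def storage_options(settings, prefix="storage.ofs."):
    """Collect the entries under `prefix`, with the prefix stripped."""
    return {
        key[len(prefix):]: value
        for key, value in settings.items()
        if key.startswith(prefix)
    }


def make_storage(settings, factories):
    """Build the system-wide storage, or return None when nothing is configured.

    `factories` maps a `storage.ofs.class` value to a callable taking the
    remaining options; it is a hypothetical hook standing in for the real
    OFS classes.
    """
    options = storage_options(settings)
    if not options:
        return None  # no configuration: File keeps using the database
    class_name = options.pop("class")
    return factories[class_name](options)


settings = {
    "storage.ofs.class": "ofs.local.PTOFS",
    "storage.ofs.ptofs.directory": "/path/to/storage/directory",
}
# A stand-in factory that just records what it was given.
factories = {"ofs.local.PTOFS": lambda options: ("PTOFS", options)}
settings["storage"] = make_storage(settings, factories)
```

Keeping the class name as a lookup key, rather than hard-coding PTOFS, is one way other OFS implementations (or non-OFS libraries) could be plugged in later.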
Example
This is a very brief example of what it looks like:

    # The system-wide Storage instance kept in settings.
    storage = settings['storage']
    url = storage.store('This is some data', filename='test.txt')
    # Accumulate the contents line by line.
    data = ''
    with storage.retrieve(url) as f:
        for line in f:
            data += line
    assert data == 'This is some data'
    storage.remove(url)
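To try the three operations without OFS installed, a toy in-memory implementation is enough. This is only a sketch: InMemoryStorage and its "mem://" URL scheme are invented here and are not part of OFS or Kotti.

```python
import io
import uuid


class InMemoryStorage:
    """Toy storage offering store, retrieve and remove; illustration only."""

    def __init__(self):
        self._blobs = {}

    def store(self, data, filename=None):
        # The "mem://" URL scheme is invented for this sketch.
        url = "mem://%s/%s" % (uuid.uuid4().hex, filename or "")
        self._blobs[url] = data
        return url

    def retrieve(self, url):
        # io.StringIO is iterable line by line and works as a context manager.
        return io.StringIO(self._blobs[url])

    def remove(self, url):
        del self._blobs[url]


storage = InMemoryStorage()
url = storage.store("This is some data", filename="test.txt")
with storage.retrieve(url) as f:
    data = f.read()
storage.remove(url)
```

A class like this could also serve as a test double when writing the additional test cases for File.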
File
1. StoredFile extends File.
2. StoredFile does not implement any additional interfaces, but it uses composition, relying on the [singleton] Storage described above.
3. StoredFile also accepts an optional initialization parameter called storage, which lets a specific instance of StoredFile use a Storage instance distinct from the system-wide [singleton] Storage.
   NOTE (3.a): Now it becomes clear that Storage is a singleton only in general. Certain applications, like storage reorganization or storage migration, would use multiple Storage instances, but these would be exceptions to the general rule.
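A rough sketch of how the optional storage parameter could support migration follows. Everything in it is an assumption made for illustration: File is a bare stand-in for Kotti's class, DictStorage and DEFAULT_STORAGE stand in for real Storage instances, and the url attribute and migrate method are hypothetical names.

```python
class DictStorage:
    """Minimal stand-in for a Storage instance (store, retrieve, remove)."""

    def __init__(self, name):
        self.name = name
        self._blobs = {}

    def store(self, data, filename=None):
        url = "%s://%s" % (self.name, filename)
        self._blobs[url] = data
        return url

    def retrieve(self, url):
        return self._blobs[url]

    def remove(self, url):
        del self._blobs[url]


class File:
    """Stand-in for Kotti's File; only the parts the sketch needs."""

    def __init__(self, data=None, filename=None):
        self.data = data
        self.filename = filename


# Stand-in for the instance Kotti would keep in settings['storage'].
DEFAULT_STORAGE = DictStorage("default")


class StoredFile(File):
    def __init__(self, data=None, filename=None, storage=None):
        # Contents go to the storage, so the database column stays empty.
        File.__init__(self, data=None, filename=filename)
        # Composition: use the given Storage, else the system-wide one.
        self._storage = storage if storage is not None else DEFAULT_STORAGE
        self.url = self._storage.store(data, filename=filename)

    def read(self):
        return self._storage.retrieve(self.url)

    def migrate(self, target):
        """Copy the contents to `target`, then drop them from the old storage."""
        data = self.read()
        new_url = target.store(data, filename=self.filename)
        self._storage.remove(self.url)
        self._storage, self.url = target, new_url


archive = DictStorage("archive")
f = StoredFile("This is some data", filename="test.txt")
f.migrate(archive)
```

Only the migration code ever sees two Storage instances at once; ordinary callers construct StoredFile without the storage argument and get the system-wide one.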