– by Anuj Thakur
First, some clarification of terms:
Dataverse installation: An installation of the Dataverse software, that is, a Dataverse site serving as a data repository.
Dataverse: A container of datasets. This is an object that has a minimal set of metadata and can contain datasets or other dataverses. The Dataverse installation itself is the Root dataverse.
Data file: The actual data associated with a dataset object.
Versioning: New versions are created at the dataset level, either because the metadata has changed or because the files have changed. Dataverse does not support versions at the data file level (at this point).
Harvesting: Making a copy of the metadata from datasets of a Dataverse installation (A) to another Dataverse installation (B). The datasets from A become searchable in B, but from the search results the user is directed to the dataset landing page in A, and the data files stay in A.
Cloud Dataverse adds two features to the Metadata Export and Harvesting on Swift storage (4.5-export-harvest-swift) code base:
- Caching the data files of harvested datasets on Swift, in addition to their metadata.
- Using MOC OpenStack environment user credentials to authenticate to the Keystone endpoint.
What is Metadata Export and Harvesting?
Metadata Export will allow for federation and interoperability with other systems to help make Dataverse more widely and easily discoverable. Another key function of the export is not only to share metadata but to store it in the file system in a preservation format. Harvesting allows the metadata from datasets from another site to be imported so they appear to be local, though data files remain on the remote site. This makes it possible to access content from data repositories and other sites with interesting content as long as they support the protocols Dataverse uses. Additionally, harvesting allows for Dataverse installations and other repositories to share metadata with each other to create a data sharing community and provide more access to the datasets stored in each repository.
What does Cloud Dataverse do?
Cloud Dataverse provides the existing Metadata Export and Harvesting functionality with additional features. In addition to harvesting metadata, it also caches the data files. The data files are cached on Swift storage, and OpenStack user credentials are used to store each data file there.
Keystone authentication is used to store data on Swift. Authentication is done using the OpenStack credentials for a user. These credentials are stored in the swift.properties file.
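To make the role of swift.properties more concrete, here is a minimal sketch of reading such credentials with java.util.Properties. The key names (swift.auth_url, swift.tenant, swift.username, swift.password) and the SwiftSettings class are assumptions made up for illustration; the keys the Cloud Dataverse code actually reads may differ.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical holder for the Keystone credentials kept in swift.properties.
// All key names below are illustrative assumptions, not the real ones.
class SwiftSettings {
    final String authUrl;
    final String tenant;
    final String username;
    final String password;

    SwiftSettings(Properties props) {
        this.authUrl = require(props, "swift.auth_url");
        this.tenant = require(props, "swift.tenant");
        this.username = require(props, "swift.username");
        this.password = require(props, "swift.password");
    }

    // Fail early with a clear message if a required setting is missing or blank.
    static String require(Properties props, String key) {
        String value = props.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("Missing required setting: " + key);
        }
        return value.trim();
    }

    // Parse the text of a properties file (kept in a string here so the
    // example does not depend on a file being present on disk).
    static SwiftSettings parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new SwiftSettings(props);
    }

    public static void main(String[] args) {
        String example = String.join("\n",
                "swift.auth_url=https://keystone.example.org:5000/v2.0/tokens",
                "swift.tenant=demo-tenant",
                "swift.username=demo-user",
                "swift.password=change-me");
        SwiftSettings settings = parse(example);
        System.out.println("Authenticating as " + settings.username
                + " in tenant " + settings.tenant);
    }
}
```

Validating the settings when they are loaded means a missing credential is reported before any harvesting or upload work starts.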
How it actually works:
The process of caching the data files begins after the metadata has been harvested for the OAI set (the set of datasets that is allowed to be harvested). Each harvested dataset has a list of data files that have to be cached in the process. The location of a data file can be retrieved from its storage identifier member variable. Before caching, the storage identifier holds the URL of the data file on the server being harvested.
Using this URL, the data file is downloaded from that server to the /tmp directory of the OpenStack VM instance. The storage identifier of the data file is changed to indicate the file's new location.
The data file is then copied from the temporary directory to its permanent location, which in this case is Swift by default. After the data file has been uploaded to the Object Store service endpoint, the storage identifier is updated to the location of the data file on Swift, i.e. the Object Store service endpoint of the MOC OpenStack environment.
In the figure above, the Dataverse installation has an OAI server. The OAI server has different OAI sets with multiple datasets each. Cloud Dataverse creates an OAI client for a particular OAI set of that OAI server. After the harvesting is completed, all the data files for the datasets in that set are present on the Swift service endpoint.
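The download, staging, and upload steps described above can be sketched roughly as follows. This is not the Cloud Dataverse code: the HarvestedFileCache, DataFileRecord, and ObjectStore names are hypothetical, the in-memory store stands in for Swift, and the swift:// identifier format is an assumption. The example uses a local file: URL in place of the remote server so that it runs without network access.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.HashMap;
import java.util.Map;

class HarvestedFileCache {

    // Hypothetical stand-in for the object store (Swift in the article).
    interface ObjectStore {
        // Stores the file and returns an identifier for its new location.
        String put(String objectName, Path localFile) throws IOException;
    }

    // Hypothetical stand-in for a harvested data file and its storage identifier.
    static class DataFileRecord {
        final String name;
        String storageIdentifier;

        DataFileRecord(String name, String storageIdentifier) {
            this.name = name;
            this.storageIdentifier = storageIdentifier;
        }
    }

    // In-memory store used only to keep this sketch runnable without Swift.
    static class InMemoryStore implements ObjectStore {
        final Map<String, byte[]> objects = new HashMap<>();

        @Override
        public String put(String objectName, Path localFile) throws IOException {
            objects.put(objectName, Files.readAllBytes(localFile));
            return "swift://cached-files/" + objectName; // assumed identifier format
        }
    }

    static void cache(DataFileRecord file, ObjectStore store, Path tmpDir) {
        Path local = tmpDir.resolve(file.name);
        try {
            // Step 1: download from the URL held in the storage identifier.
            try (InputStream in = URI.create(file.storageIdentifier).toURL().openStream()) {
                Files.copy(in, local, StandardCopyOption.REPLACE_EXISTING);
            }
            // Step 2: the identifier now points at the temporary copy.
            file.storageIdentifier = local.toUri().toString();
            // Step 3: upload to the permanent store and record the final location.
            file.storageIdentifier = store.put(file.name, local);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path remoteDir = Files.createTempDirectory("remote");
        Path tmpDir = Files.createTempDirectory("staging");
        Path source = Files.write(remoteDir.resolve("survey.tab"),
                "id\tvalue\n".getBytes(StandardCharsets.UTF_8));
        DataFileRecord record = new DataFileRecord("survey.tab", source.toUri().toString());
        cache(record, new InMemoryStore(), tmpDir);
        System.out.println(record.storageIdentifier);
    }
}
```

Keeping the object store behind a small interface is a design choice made for this sketch: it lets the caching logic be exercised without a live Swift endpoint.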
JOSS, a Java library for OpenStack storage (i.e. Swift), is used to communicate with the Swift endpoint. Two modes of authentication are supported by the JOSS library; of these, the Keystone authentication mechanism is used, which passes the tenant name along with the user credentials to authenticate. The container is created and accessed after authentication is done. For this mechanism, the required details need to be added to the swift.properties file.
Caching the files from the Harvard Dataverse installation on the Swift storage endpoint took 65 minutes during the August 4th demo. This set of datasets was 5 GB in size. Of those 65 minutes, 61-62 minutes were spent uploading the data files to the Swift endpoint. Running the same deployment after Ceilometer was turned off in the OpenStack production environment at MOC took around 35-37 minutes for the same process and the same set of datasets.
By using Java multithreading to run multiple threads that upload data files to the Swift endpoint, the process now takes 15-16 minutes. The code that does this still needs to be added to the pull request after some clean-up.
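Since that code is not yet part of the pull request, the following is only a sketch of one common way to parallelize uploads in Java, using a fixed-size thread pool from java.util.concurrent. The ParallelUploader class, the uploadOne function, and the thread count are assumptions for illustration and are not taken from the actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class ParallelUploader {

    // Uploads every file using a fixed pool of worker threads and returns the
    // resulting locations in the same order as the input list.
    // uploadOne is a hypothetical function that uploads a single file and
    // returns its new location.
    static List<String> uploadAll(List<String> fileNames,
                                  Function<String, String> uploadOne,
                                  int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per file; the pool runs up to 'threads' at a time.
            List<Future<String>> pending = new ArrayList<>();
            for (String name : fileNames) {
                pending.add(pool.submit(() -> uploadOne.apply(name)));
            }
            // Wait for each upload in submission order.
            List<String> locations = new ArrayList<>();
            for (Future<String> result : pending) {
                locations.add(result.get());
            }
            return locations;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while uploading", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Upload failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("a.tab", "b.tab", "c.tab");
        // The lambda stands in for a real upload to the Swift endpoint.
        List<String> locations = uploadAll(files, name -> "swift://cached-files/" + name, 2);
        locations.forEach(System.out::println);
    }
}
```

A bounded pool keeps the number of simultaneous connections to the Swift endpoint under control; the best thread count depends on the network and the endpoint and would need to be measured.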