
Data Formats

This page is dedicated to discussing the different data formats that Telemeta may handle.

Introduction

Telemeta aims at handling audio files associated with metadata. The main purpose of this metadata is to describe and categorize the sound files. From a user perspective, Telemeta should allow users to:

  • browse and search this metadata and listen to the corresponding audio files
  • edit the metadata and modify/upload audio files
  • export both the metadata and audio files in a compact and "easy to carry" format
  • backup and restore both the metadata and audio files

The first implementation of Telemeta, a command line tool, uses XML as the data format for metadata storage. XML is handy because it is meant to be readable by both machines and humans.

However, this first implementation focuses on backup. This is different from everyday use, where (possibly many) users will work, search and edit at the same time.

This document tries to determine which data storage is best suited to each of the different usage contexts.

Everyday use

Most users should eventually access the audio files and associated description through a web interface. In this context, the initial Telemeta idea is to let this web application access the XML files directly, possibly building an index out of their content.

In my opinion, XML does not suit such a distributed context. XML is great for interoperability, for exchanging data between different systems, or even for configuration files, and more generally wherever the data must be easily parsable by a machine and readable by a human being.

But XML as data storage will not scale under heavy load the way an SQL database will. There is a little confusion here, however: XML is only a data format, whereas an SQL database is a data handling engine. Still, XML handling APIs such as DOM do not offer all of the nifty indexing, data corruption prevention, and fast concurrent access features provided by an SQL engine such as MySQL or PostgreSQL.

And this is my second point: why should we reinvent the wheel by writing data indexing routines when most SQL engines perform indexing flawlessly and transparently?

However, I do not question XML as a data backup format. My point of view is that the central data repository for everyday use should be stored on an SQL database server.

Of course, I'm not talking about storing the audio data in the SQL database. Audio files should simply be stored on the local file system and properly linked with the corresponding data records. This is extremely easy to achieve.
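To make this concrete, here is a minimal sketch of such a repository, using Python's built-in sqlite3 module as a stand-in for MySQL or PostgreSQL. The table name `item`, its columns and the helper functions are hypothetical, not an actual Telemeta schema. It only illustrates two points: the engine maintains the index by itself, and the audio stays on disk with just its path stored in the record.

```python
import sqlite3


def open_repository(db_path=":memory:"):
    """Create a minimal metadata store (hypothetical table and column names)."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS item (
               id INTEGER PRIMARY KEY,
               title TEXT NOT NULL,
               collector TEXT,
               audio_path TEXT NOT NULL  -- the file stays on disk, only its path is stored
           )"""
    )
    # The engine maintains this index; no hand-written indexing code is needed.
    conn.execute("CREATE INDEX IF NOT EXISTS item_title_idx ON item (title)")
    return conn


def add_item(conn, title, collector, audio_path):
    """Insert one metadata record linked to an audio file and return its id."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO item (title, collector, audio_path) VALUES (?, ?, ?)",
            (title, collector, audio_path),
        )
    return cur.lastrowid


def find_audio_paths(conn, title):
    """Return the audio file paths of every record with the given title."""
    rows = conn.execute(
        "SELECT audio_path FROM item WHERE title = ? ORDER BY id", (title,)
    )
    return [row[0] for row in rows]
```

A web front end would call something like `find_audio_paths` and then serve the file from the local file system; the database never sees the audio bytes.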

Backup

Telemeta aims at producing a clean backup. It should be easy, with a simple text editor and an audio player, to find what you are looking for in the backup. In this regard, XML is excellent.

And fortunately, producing XML out of an SQL database is an extremely common task. There are plenty of tools, libraries and well-known practices to do this. For example, a web application usually consists of generating XHTML, an XML dialect, out of an SQL database.

One might think that generating a web page and making a backup are pretty different things. But if we follow the ModelViewController design pattern, these are simply two Views that interface with the same Model, and both consist of generating XML out of SQL. So they are similar with regard to the application architecture and even in their implementation details.
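As an illustration, a Backup View could be sketched as below with Python's standard sqlite3 and xml.etree.ElementTree modules. The `item` table, the `telemeta-backup` root element and the child element names are assumptions made for this example, not the actual Telemeta backup format.

```python
import sqlite3
import xml.etree.ElementTree as ET


def backup_to_xml(conn):
    """Serialize every row of the hypothetical `item` table as an XML string."""
    root = ET.Element("telemeta-backup")  # hypothetical root element name
    rows = conn.execute("SELECT id, title, audio_path FROM item ORDER BY id")
    for item_id, title, audio_path in rows:
        item = ET.SubElement(root, "item", id=str(item_id))
        ET.SubElement(item, "title").text = title
        # Only the path is written; the audio file itself is copied separately.
        ET.SubElement(item, "audio").text = audio_path
    return ET.tostring(root, encoding="unicode")
```

An HTML View would run the same kind of query against the same Model and simply emit different elements.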

But backup is not only about storing the data somewhere and forgetting about it. It is also about restoring. That point might be a little more challenging. But, as with reading data from SQL and converting it into XML, importing XML data into SQL is a common task, with handy tools and established practices.
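A restore step might look like the following sketch. It assumes a hypothetical backup layout in which each `<item>` element carries an `id` attribute plus `<title>` and `<audio>` children, and a matching hypothetical `item` table; sqlite3 stands in for the real SQL server.

```python
import sqlite3
import xml.etree.ElementTree as ET


def restore_from_xml(conn, xml_text):
    """Load <item> elements back into the hypothetical `item` table."""
    root = ET.fromstring(xml_text)
    records = [
        (int(item.get("id")), item.findtext("title"), item.findtext("audio"))
        for item in root.iter("item")
    ]
    conn.execute(
        "CREATE TABLE IF NOT EXISTS item ("
        "id INTEGER PRIMARY KEY, title TEXT, audio_path TEXT)"
    )
    with conn:  # the inserts run in one transaction: all are applied or none
        conn.executemany(
            "INSERT OR REPLACE INTO item (id, title, audio_path) VALUES (?, ?, ?)",
            records,
        )
    return len(records)
```

Using `INSERT OR REPLACE` keyed on the original id makes the restore repeatable: running it twice on the same backup leaves a single copy of each record.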

Export and broadcasting

The web interface will of course allow users to listen to the audio files. For this, the files must be properly compressed, and for performance reasons this can't happen in real time every time an audio file is played or downloaded. We should generate and cache the compressed audio files whenever a new audio sample is uploaded. This is no big deal.

However, Telemeta aims at providing standalone files which contain both audio and metadata. This is easy to carry, for example when you go on an ethnological expedition in Zimbabwe :)

This document does not address the details of these special audio/metadata file formats, only how they might be generated out of the central SQL server. And if you followed me, you have already understood that it all boils down to another View in our ModelViewController architecture.

The Export View, which could feature several drivers (OGG, MP3, ...), would interface with the Model in a similar manner to the Backup and HTML Views.
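One possible shape for such a View is sketched below. All class and method names are hypothetical, and `TaggedTextDriver` is a deliberately trivial stand-in: it does not produce a real OGG or MP3 container, it only shows where a real driver would plug in.

```python
class ExportDriver:
    """Base class for hypothetical export drivers; one subclass per format."""

    extension = None

    def encode(self, audio_bytes, metadata):
        raise NotImplementedError


class TaggedTextDriver(ExportDriver):
    """Illustration-only driver: prepends the metadata as text lines
    instead of building a real OGG or MP3 file."""

    extension = "txt"

    def encode(self, audio_bytes, metadata):
        header = "".join(f"{key}={metadata[key]}\n" for key in sorted(metadata))
        return header.encode("utf-8") + b"\n" + audio_bytes


class ExportView:
    """Looks up the driver for a format and asks it to build the bundle."""

    def __init__(self):
        self._drivers = {}

    def register(self, driver):
        self._drivers[driver.extension] = driver

    def export(self, fmt, audio_bytes, metadata):
        try:
            driver = self._drivers[fmt]
        except KeyError:
            raise ValueError(f"no export driver registered for {fmt!r}") from None
        return driver.encode(audio_bytes, metadata)
```

The View itself knows nothing about any particular format, so supporting a new one would only mean registering another driver.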

But again, we can't regenerate these files bundling audio and metadata every time they are requested for playback or download. So we'll also need to generate and cache them every time the metadata - not only the audio content - is updated.
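One way to get this behaviour, offered here as an assumption rather than a decided design, is to name each cached bundle after a hash of both the audio and the metadata. The function names, the `.bundle` suffix and the `build` callback are hypothetical; `build` stands for whatever export driver produces the file.

```python
import hashlib
import json
import os


def bundle_key(audio_bytes, metadata):
    """Derive a cache key that changes when either the audio or the metadata changes."""
    digest = hashlib.sha256()
    digest.update(audio_bytes)
    digest.update(json.dumps(metadata, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()


def cached_bundle(cache_dir, audio_bytes, metadata, build):
    """Return the path of the cached bundle, calling `build` only on a cache miss."""
    path = os.path.join(cache_dir, bundle_key(audio_bytes, metadata) + ".bundle")
    if not os.path.exists(path):
        data = build(audio_bytes, metadata)
        with open(path, "wb") as handle:
            handle.write(data)
    return path
```

Calling `cached_bundle` from the upload and metadata edit handlers would build the file once per change, so that playback and download only ever read an existing file. Stale bundles left behind by older metadata would still need to be cleaned up separately.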

Last modified on 06/21/09 12:27:57