Design/Metadata
From SF Data Wiki
Metadata
This page discusses the metadata requirements for incoming data files, how we want to collect that metadata, and how we want to process it.
Metadata Requirements At Upload Time
The metadata we need when new files are uploaded is detailed below. We'll extend it with anything else we need for processing and for the web application.
Provider
This is metadata about the provider of the data. This should only need to be entered once, although it may be updated when the provider changes staff.
Must Have
This is metadata we need to have for all incoming files; otherwise we ought to simply discard them.
- Provider name (unique, long text)
- Provider description (long text)
- Provider contact(s) info (one-to-many) (need at least one contact)
  - Name (first/last)
  - Email (or phone)
  - Phone (or email)
Derived Automatically
We can also collect some information about the provider automatically:
- Provider Created On
- Last Update
- Provider ID (text prefix with autonumber, e.g. MAY3451)
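As a rough illustration of the provider record described above, here is a minimal Python sketch. The class names, the `id_prefix` field, the four-digit padding, and the in-memory counter are all assumptions made for the example; the page does not say how the prefix is chosen or where the autonumber comes from.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from itertools import count
from typing import Optional


@dataclass
class Contact:
    first_name: str
    last_name: str
    email: Optional[str] = None
    phone: Optional[str] = None

    def __post_init__(self):
        # Each contact needs an email or a phone number (either is enough).
        if not (self.email or self.phone):
            raise ValueError("contact needs an email or a phone number")


# Stand-in for a database autonumber/sequence (assumption for this sketch).
_provider_numbers = count(1)


def make_id(prefix: str, number: int) -> str:
    """Combine a text prefix with an autonumber, e.g. ("MAY", 3451) gives "MAY3451"."""
    return f"{prefix.upper()}{number:04d}"


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class Provider:
    name: str
    description: str
    contacts: list
    id_prefix: str  # hypothetical field: the page does not say how the prefix is picked
    created_on: datetime = field(default_factory=_now)   # derived automatically
    last_update: datetime = field(default_factory=_now)  # derived automatically
    provider_id: str = field(init=False, default="")     # derived automatically

    def __post_init__(self):
        # At least one contact is a "must have".
        if not self.contacts:
            raise ValueError("provider needs at least one contact")
        self.provider_id = make_id(self.id_prefix, next(_provider_numbers))
```

Uniqueness of the provider name is not checked here; that would most likely be enforced by whatever store ends up holding these records.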
Dataset
Each Provider will have zero or more data "series" (the datasets this section is named for). A Series is a specific set of related data files, generally a time series of files periodically generated from the same source. A series will generally have a single file format.
Must Have
- Name (long text) (unique with Provider Name)
- Provider (link)
Good To Have
This is additional data about the series. We can give people an incentive to enter it because it will auto-populate most of their file information. In some cases, we may be able to derive it from the first files they provide.
- First data from: (timestamp)
- File Format: (text) (from list of mime types)
- Data Frequency (interval) (hourly|daily|weekly|biweekly|monthly|quarterly|annually|etc.) (probably quantity + unit would cover all cases)
- Other Data Division (text) (how the files are divided up, other than by time)
- Series Comment (text, optional)
- File Naming Convention (regex) (template of how files in the series are named. defaults to "starts with")
- Data Maps (zero-to-several) (filenames) (links to one or several files which describe or encode mapping the data)
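Two of the fields in this list lend themselves to a short sketch: Data Frequency as a quantity plus a unit, and the File Naming Convention with its "starts with" default. The table and function names below are illustrative assumptions, not an agreed design.

```python
import re

# Each named frequency expressed as (quantity, unit), which suggests that
# quantity + unit can cover the listed cases.
NAMED_FREQUENCIES = {
    "hourly": (1, "hour"),
    "daily": (1, "day"),
    "weekly": (1, "week"),
    "biweekly": (2, "week"),
    "monthly": (1, "month"),
    "quarterly": (3, "month"),
    "annually": (1, "year"),
}


def naming_pattern(convention: str, is_regex: bool = False) -> re.Pattern:
    """Compile a series naming convention.

    By default the convention is treated as a literal prefix ("starts with");
    pass is_regex=True to use it as a regular expression instead.
    """
    if is_regex:
        return re.compile(convention)
    return re.compile(re.escape(convention))


def belongs_to_series(filename: str, pattern: re.Pattern) -> bool:
    # Pattern.match() anchors at the start of the string, which gives the
    # "starts with" behaviour for literal prefixes.
    return pattern.match(filename) is not None
```

For example, `belongs_to_series("crime_2009_01.csv", naming_pattern("crime_"))` is true, while a file named `parking.csv` would not match that series.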
Automatically Derived
- First data uploaded on: (date)
- Last upload: (date)
- Key: (autonumber combined with prefix text, e.g. PRM1324) (needed for bulk loads)
File
This is metadata about the actual file, to be provided with each uploaded file.
Must Have
This is metadata we need to have for all incoming files; otherwise we ought to simply discard them.
- Dataset/Provider (link)
- Data Format (from list of mime types) (if different from Series)
- Filename (from upload, or load)
Good To Have
This is metadata we would like to have, and plan to ask for, for incoming files. In some cases, this data can be derived from the contents of the file.
- File comment (long text) (optional)
- Data Starts On: (timestamp)
- Data Ends On: (timestamp) (allow entry of data period) (derive from Series if possible)
Automatically Derived
We'll also want to record the following:
- File loaded on (timestamp)
- File size (numeric)
- Processing info (TBD)
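The automatically derived file fields could be collected along these lines. This is only a sketch: the function name and dictionary keys are assumptions, and the MIME type guess from the file extension is an extra that a user-supplied Data Format would override.

```python
import mimetypes
import os
from datetime import datetime, timezone


def derive_file_metadata(path: str) -> dict:
    """Collect the metadata we can record without asking the user."""
    # guess_type() only looks at the file name and may return None.
    guessed_format, _encoding = mimetypes.guess_type(path)
    return {
        "filename": os.path.basename(path),
        "file_size": os.path.getsize(path),       # size in bytes
        "loaded_on": datetime.now(timezone.utc),  # File loaded on
        "data_format_guess": guessed_format,
    }
```

Processing info is left out because it is still marked TBD above.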
Metadata Methods
How will we get this metadata?
Web Upload
This is the most direct method. Users of the file-at-a-time web upload interface will fill out a web form that requests the metadata. To make things easy for users, form fields will autofill based on Provider or Series information, or on their last upload.
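The autofill behaviour could be sketched as a layered lookup. The precedence used here (last upload first, then Series, then Provider) is an assumption, since the page does not say which source should win, and the field names are made up for the example.

```python
from collections import ChainMap


def _known(values: dict) -> dict:
    # Ignore fields that were left empty so they do not mask a lower layer.
    return {key: value for key, value in values.items() if value is not None}


def autofill(last_upload: dict, series_defaults: dict, provider_defaults: dict) -> dict:
    """Return default form values, with earlier arguments taking precedence."""
    layers = ChainMap(_known(last_upload), _known(series_defaults), _known(provider_defaults))
    return dict(layers)
```

The user can still overwrite any prefilled field before submitting the form.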
Bulk FTP Upload
Users who want a bulk upload interface would upload a series of files to an FTP folder, accompanied by a metadata file in either CSV or XML format. For CSV, we could distribute an Excel/OOo template.
The reason for the composite IDs for Series and Providers is to enable using them in the metadata files.
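To show how the composite IDs might be used in a CSV metadata file, here is a minimal reader. The column names (`series_key`, `filename`, and the two date columns in the sample) are hypothetical; no file layout has been agreed on this page.

```python
import csv
import io

# Hypothetical required columns for the bulk metadata file.
REQUIRED_COLUMNS = {"series_key", "filename"}


def read_bulk_metadata(text: str) -> list:
    """Parse a CSV metadata file into one dict per uploaded file."""
    reader = csv.DictReader(io.StringIO(text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"metadata file is missing columns: {sorted(missing)}")
    rows = []
    for row in reader:
        # Every file must say which series it belongs to and what it is called.
        if not row["series_key"] or not row["filename"]:
            raise ValueError(f"line {reader.line_num}: series_key and filename are required")
        rows.append(row)
    return rows


sample = (
    "series_key,filename,data_starts_on,data_ends_on\n"
    "PRM1324,permits_2009_01.csv,2009-01-01,2009-01-31\n"
    "PRM1324,permits_2009_02.csv,2009-02-01,2009-02-28\n"
)
rows = read_bulk_metadata(sample)
```

Because each row carries a series key such as PRM1324, the loader can look up the Series (and through it the Provider) without the user repeating that information for every file.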
Derived From Contents
Some metadata could be derived from the actual file contents. For example, if a file has a column containing timestamps, the date range could be derived from it. We would, however, need a mechanism for configuring this derivation per file.
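For a CSV file, the date-range derivation might look like the following sketch. The column name and timestamp format are passed in as parameters, standing in for whatever per-file (or per-series) configuration mechanism is eventually chosen; both are assumptions.

```python
import csv
import io
from datetime import datetime


def derive_date_range(text: str, column: str, fmt: str = "%Y-%m-%d %H:%M:%S"):
    """Return (earliest, latest) timestamps found in `column`, or None if there are none."""
    reader = csv.DictReader(io.StringIO(text))
    # Skip rows where the column is absent or empty.
    stamps = [datetime.strptime(row[column], fmt) for row in reader if row.get(column)]
    if not stamps:
        return None
    return min(stamps), max(stamps)


sample = (
    "reported_at,category\n"
    "2009-01-03 14:00:00,noise\n"
    "2009-01-01 09:30:00,graffiti\n"
    "2009-01-31 23:59:00,noise\n"
)
starts_on, ends_on = derive_date_range(sample, "reported_at")
```

The two results would map onto the file's Data Starts On and Data Ends On fields.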
Metadata Storage
How do we keep this metadata? How do we connect it to data files?
