Architecture

From SF Data Wiki

The goal is a simple but scalable architecture that can run on a laptop or on a Compute Cloud.
See Design goals below.

Data source

Use case 1 - Simple

Goal: Publish a simple dataset.

Example: crime data

Diagram

Image:arch1_simple.png

Components

  1. Data source: CRM or any other internal application. It can even be e-mail. In this example the crime data would originate here.
  2. Raw data: Raw data exported from the data source (1) in a supported format (CSV, XML, Access etc.)
  3. Processing: Data conversion, sanity verification, annotation (receipt date, dataset identifier). Data processing may be written in any language supported by the equipment used for processing. LAMP stacks are cheap and widespread across all sectors, so a reasonable baseline implementation might aim for any interpreted language available on a typical Linux distribution.
  4. Clean data: Data ready to be accessed. It may be kept in a standardized format or maintained in domain-specific formats.
  5. Access application (web service)
  6. Public access: AtomPub (domain specific XML)
  7. Optional: The original raw data URL can be obtained from the Access API (5)
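The Processing step (3) can be sketched in a few lines of Python. This is only a minimal illustration, assuming the raw data arrives as CSV text. The function name process_raw, the column names and the rule "skip rows with an empty field" are hypothetical and not part of the design.

```python
import csv
import io
from datetime import date


def process_raw(raw_csv, dataset_id, receipt_date):
    """Convert raw CSV text into clean, annotated records."""
    clean = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        # Conversion: normalize whitespace in column names and values.
        record = {
            key.strip(): (value or "").strip()
            for key, value in row.items()
            if key is not None
        }
        # Sanity verification (assumed rule): skip rows with an empty field.
        if not all(record.values()):
            continue
        # Annotation: attach the dataset identifier and the receipt date.
        record["dataset_id"] = dataset_id
        record["receipt_date"] = receipt_date.isoformat()
        clean.append(record)
    return clean


# Hypothetical raw export; the second row fails the sanity check.
RAW = (
    "incident,category,district\n"
    "1001,THEFT,MISSION\n"
    "1002,,TENDERLOIN\n"
    "1003,ASSAULT,SOMA\n"
)
records = process_raw(RAW, "crime", date(2010, 1, 15))
```

Here records holds the two complete rows, each annotated with the dataset identifier and the receipt date.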

Notes

  • Data source (1) can be one application or several, in which case Processing (3) will also aggregate the data
  • From Raw data storage (2) onward, any or all components can be hosted outside of the local network.
  • Processing (3) can be a simple application or distributed (ex. Hadoop, Cassandra), running on local servers or on a 3rd party compute cluster (ex. Amazon EC2) - See Use case 2
  • Conforming applications can export Clean data (4) directly, in which case the Raw data store (2) and Processing (3) are not needed
  • Data format: XML by default; HTML, JSON or Freebase-accessible RDF can also be selected
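Format selection in the Access application (5) could look roughly like the sketch below. It covers only XML (the default) and JSON. The element names records and record, and the function names, are assumptions made for illustration, and a real service would emit AtomPub rather than this bare XML.

```python
import json
import xml.etree.ElementTree as ET


def to_xml(records):
    """Serialize records as a flat XML document (hypothetical element names)."""
    root = ET.Element("records")
    for record in records:
        item = ET.SubElement(root, "record")
        for key, value in record.items():
            ET.SubElement(item, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


def to_json(records):
    """Serialize records as a JSON array."""
    return json.dumps(records, sort_keys=True)


SERIALIZERS = {"xml": to_xml, "json": to_json}


def render(records, fmt="xml"):
    """Return the records in the requested format; XML is the default."""
    if fmt not in SERIALIZERS:
        raise ValueError(f"unsupported format: {fmt}")
    return SERIALIZERS[fmt](records)


sample = [{"category": "THEFT"}]
as_xml = render(sample)
as_json = render(sample, "json")
```

Adding HTML or RDF output would mean registering one more function in SERIALIZERS, which keeps the format choice separate from the data itself.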

Use case 2 - Large dataset

Goal: Publish a large (aggregated) dataset.

Example: Daily statistics (?)

Diagram

Similar to Use case 1.

Components

Processing (3), Clean data (4) and Access (5) typically run on a redundant and scalable cluster.

Use case 3 - With non-public data

Goal: Allow internal access to the full data, and filter out sensitive information for public access.

Sensitive data: employee home address, water pipe geocode

Diagram

Image:Arch3_internal.png

Components

See Use case 1 for details

  1. One or multiple data sources
  2. Common (and secure) storage for exported data
  3. Processing and Access Control List (ACL)
  4. Clean data - public and sensitive mixed in the same database
  5. Internal access to full database
  6. Clean data - only the public records, hosted outside the local network
  7. Public access
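The filtering between the mixed database (4) and the public copy (6) might be sketched as follows. This assumes sensitivity is decided per field. The field names and the SENSITIVE_FIELDS set are hypothetical, and a real deployment would derive them from the ACL in step 3.

```python
# Hypothetical field names for the sensitive data mentioned above.
SENSITIVE_FIELDS = {"home_address", "pipe_geocode"}


def public_view(record):
    """Return a copy of the record without any sensitive field."""
    return {
        key: value
        for key, value in record.items()
        if key not in SENSITIVE_FIELDS
    }


def export_public(records):
    """Build the public dataset (6) from the full internal dataset (4)."""
    return [public_view(record) for record in records]


internal = [
    {"employee": "J. Doe", "department": "Water", "home_address": "1 Main St"},
    {"pipe_id": "P-17", "diameter_mm": 300, "pipe_geocode": "37.77,-122.42"},
]
public = export_public(internal)
```

The internal records are left untouched, so internal access (5) still sees the full data.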

Use case 4 - Feedback

Goal: Allow feedback (voting and correction). See The need for User Input for the reason why this is important.

Example:

Diagram

Image:arch1_feedback.png

Components

The first part is similar to Use case 1 (Simple)

  1. Data source
  2. Raw data
  3. Processing
  4. Clean data
  5. Access application (web service)
  6. API: Read data
  7. API: Attach a comment, correction or rating to data. Feedback should generate a tracking number and a status/notification
  8. Incoming data stored separately (no direct editing)
  9. Optional: Internal application allows manual or automated update of data
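Steps 7 to 9 can be illustrated with a small in-memory sketch: feedback is stored apart from the clean data, each submission receives a tracking number, and an internal application later updates its status. The class names, status values and sequential numbering are assumptions, not a specified API.

```python
import itertools
from dataclasses import dataclass


@dataclass
class Feedback:
    tracking_id: int
    record_id: str
    kind: str
    body: str
    status: str = "received"  # hypothetical initial status


class FeedbackStore:
    """Keeps incoming feedback separate from the clean data (step 8)."""

    KINDS = {"comment", "correction", "rating"}

    def __init__(self):
        self._next_id = itertools.count(1)
        self._items = {}

    def submit(self, record_id, kind, body):
        """Store feedback and return its tracking number (step 7)."""
        if kind not in self.KINDS:
            raise ValueError(f"unknown feedback kind: {kind}")
        feedback = Feedback(next(self._next_id), record_id, kind, body)
        self._items[feedback.tracking_id] = feedback
        return feedback.tracking_id

    def status(self, tracking_id):
        """Look up the current status of a submission."""
        return self._items[tracking_id].status

    def resolve(self, tracking_id, status):
        """Called by the internal application after review (step 9)."""
        self._items[tracking_id].status = status


store = FeedbackStore()
ticket = store.submit("1001", "correction", "District should be SOMA")
initial_status = store.status(ticket)
store.resolve(ticket, "accepted")
final_status = store.status(ticket)
```

Because submit never touches the published records, a correction changes the data only after step 9 applies it.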

Data consumer

Use case 1 - Web mashup

Image:arch1_Web.png

Web site: interactive data

  1. Data Source(s) through CivicDB API
  2. Web server
  3. Internet connection (http)
  4. Web browser

Examples

Use case 2 - Download for local processing

Image:arch1_Download.png

Similar to Use case 1, but the data is provided in a format other than HTML, mainly for local processing.

  1. Data Source(s) through CivicDB API
  2. Web server
  3. The user selects the format
  4. Internet connection (http)
  5. Download the data in desired format for local processing
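Local processing of a downloaded file might be as simple as the sketch below, which assumes the user chose JSON and received an array of records. The payload shape and the field name category are hypothetical.

```python
import json
from collections import Counter


def count_by(payload, field):
    """Count downloaded records by the value of one field."""
    records = json.loads(payload)
    return Counter(record[field] for record in records if field in record)


# Stand-in for the content of a downloaded file.
SAMPLE = (
    '[{"category": "THEFT"}, {"category": "ASSAULT"}, {"category": "THEFT"}]'
)
counts = count_by(SAMPLE, "category")
```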

Use case 3 - Applications

Image:arch1_Applications.png

Similar to Use case 1, but with native applications that can connect directly to the API (XML, JSON) or through 3rd party providers.

  1. Data Source(s) through CivicDB API
  2. Optional: Web server (login, profiles etc.)
  3. Internet connection (http)
  4. Mobile or Desktop application (iPhone, Google Earth style application etc.)

Design goals

  • International (locale and language)
  • Multi platform
  • Scalable
  • Language bindings: C++, C#, Java, PHP, ASP, Python, RoR, Objective C
  • Built-in performance monitoring and metrics
  • Open source and license-free (ex. CC-Zero)

See Project Requirements
