An Introduction To Archipelago

Created: 09/02/2024

Updated: 10/02/2024

Author: Miquel Herrera

Archipelago is a framework for building data processing and delivery services.

Archipelago aims to challenge the economics of building and evolving this category of system through pervasive composability, conciseness of expression and efficient resource utilization.

Origins

Archipelago grew from experience gained building ETL systems using the Akka toolkit. The Actor paradigm, as exemplified by Akka, offers strong support for fault tolerance and straightforward concurrency, but it suffers from untyped messaging and a lack of protocols for services and flow control.

To be truly scalable, a system must execute its primary tasks concurrently, with that concurrency maximized by low-impedance routing and coordination of data. Scalable systems must accept that errors and outages will happen and tolerate them, rather than letting them distort processing with endless defensive coding. When elements encapsulate idiosyncrasies, they become brittle and reuse plummets, forcing case-by-case development and losing the testing and hardening benefits that arise from reuse.

Archipelago also grew from the belief that contemporary micro-services, in pursuing a single purpose and a one-to-one mapping to OS processes, suffer from poor locality of data, marshalling costs and fragile interprocess communication. Archipelago aims to exploit each VM as much as possible before scaling out, co-locating services which share data or compose into higher-level services while still providing the desired levels of isolation between unrelated services.

Characteristics

Archipelago’s main characteristics are:

Composability	Cells (Archipelago components) can be composed by parameter types, functional mixins, configuration and associatively by message contract. Higher level structures compose from lower level structure instances and in turn can also act as prototypes for similar structures. Libraries of Cells and higher level structures offer the same flexibility and promotion of reuse. Services compose in series, parallel or hierarchically. Archipelago itself can run clustered, standalone, as a script or embedded.
Conciseness	System specifications are created with a simple, precise DSL. Specifications define new elements and reference pre-existing elements, describing routing, contracts and configuration only as required. Cell implementations are focussed on their domain rather than non-functional issues, facilitated by standard methods and isolated by stable, testable APIs.
Performance	Performance is gained through full utilization of concurrency, locality of data, micro-batching, caching and a simple transaction model that keeps data paths conceptually clean. Co-locating related services dramatically improves performance, but the transition to distribution is transparent (virtually invisible in the logical model). Reactive, flow-based asynchronous processing results in resources being dedicated only to active elements. Distribution of services can be dynamic, adjusting to load or outages. This approach to resilience and elasticity can be effected at runtime on any affected service’s transaction boundaries.
Visibility	Status information is pervasive but highly decoupled, taking very little away from data-path performance. Every element can report standard or custom status, but only on demand and on change. Status data is collated, aggregated and cached in-process. It is accessible remotely via REST calls and locally for feedback and potential analysis.
Scalability	Archipelago builds systems from specification definitions (Specs). A Spec can support remoting hints attached to Subsystems and Cell groups (Reactors). Fragments of a Spec to be remoted are delivered to a remote Archipelago instance (agent process), where they are built out and attached to a proxy node associated with the local parent Subsystem node. Archipelago takes a simple view of scalability: services can be local or remote, and loss of contact between services results in (sub) transaction failure of the subordinate. Native remote communication in Archipelago is point to point, at Cell level for data and at Subsystem level for system signals (both via Akka remoting). The prevailing cluster management system arbitrates top-level service placement in available Archipelago agent instances.
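
The series and parallel composition described under Composability can be pictured with a minimal sketch. This is not Archipelago's DSL or API, neither of which is shown in this introduction. The names `Cell`, `series` and `parallel`, and the example cells, are hypothetical, and plain Python functions stand in for Cells.

```python
from typing import Callable, Iterable, List

# Hypothetical stand-in for a Cell: one message in, zero or more messages out.
Cell = Callable[[dict], Iterable[dict]]


def series(*cells: Cell) -> Cell:
    """Compose cells so that each cell's outputs feed the next cell."""
    def composed(msg: dict) -> List[dict]:
        msgs = [msg]
        for cell in cells:
            msgs = [out for m in msgs for out in cell(m)]
        return msgs
    return composed


def parallel(*cells: Cell) -> Cell:
    """Compose cells so that every cell sees the same input.

    "Parallel" here is logical fan-out only; the cells run one after
    another in this sketch, and their outputs are merged in order.
    """
    def composed(msg: dict) -> List[dict]:
        return [out for cell in cells for out in cell(msg)]
    return composed


# Illustrative cells (assumed message fields: kind, id, amount, fx).
def only_trades(msg: dict) -> List[dict]:
    return [msg] if msg.get("kind") == "trade" else []


def to_usd(msg: dict) -> List[dict]:
    return [{**msg, "amount_usd": msg["amount"] * msg["fx"]}]


def audit(msg: dict) -> List[dict]:
    return [{"kind": "audit", "id": msg["id"]}]


# A composed structure is itself a Cell, so it can be composed again.
pipeline = series(only_trades, parallel(to_usd, audit))
```

Because `series` and `parallel` both return something with the same shape as their inputs, composed structures nest freely, which is the property the Composability entry relies on.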

Application Domain

Archipelago is designed for flow-based processing rather than memory-resident systems, and as such it is suited to classic ETL and in-line processing of streams of data.

Applications such as routing, filtering, aggregation and transformation are particularly simple to create, from single-shot processing run as a script to multi-datacentre deployments processing infinite data streams.
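
As a rough illustration of this kind of flow, the sketch below filters, transforms and aggregates a stream in micro-batches using lazy Python generators. It does not use Archipelago itself. The function names and the record fields (`symbol`, `qty`, `price`) are assumptions made for the example.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def filter_valid(records: Iterable[dict]) -> Iterator[dict]:
    """Drop records with a non-positive quantity."""
    for r in records:
        if r.get("qty", 0) > 0:
            yield r


def transform(records: Iterable[dict]) -> Iterator[dict]:
    """Normalise the symbol and compute a value per record."""
    for r in records:
        yield {"symbol": r["symbol"].upper(), "value": r["qty"] * r["price"]}


def micro_batches(records: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Group a (possibly unbounded) stream into lists of at most `size`."""
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def aggregate(batch: List[dict]) -> Dict[str, float]:
    """Sum values per symbol within one batch."""
    totals: Dict[str, float] = {}
    for r in batch:
        totals[r["symbol"]] = totals.get(r["symbol"], 0) + r["value"]
    return totals


def run(records: Iterable[dict], batch_size: int = 2) -> Iterator[Dict[str, float]]:
    """Yield one aggregate per micro-batch; nothing runs until consumed."""
    stream = transform(filter_valid(records))
    for batch in micro_batches(stream, batch_size):
        yield aggregate(batch)
```

Each stage pulls from the previous one only when asked, so the same code serves a finite input run as a script and an unbounded stream consumed batch by batch.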

More challenging systems such as rules engines, pricing engines, risk reporting and data analysis tools are all very much in scope.

Used as an adjunct to Spark, Archipelago simplifies building AI and machine learning applications.
