Sunday, January 16, 2011

Caching Scenarios

Caching is a quick and easy way to save roundtrips between your application and where you store your data. However, it’s not as easy as just snapping a map into your application – to really leverage a cache, you not only have to understand where a cache can be used, but how it can affect what your application does and how your architecture is used.

Definition of caching

Caching is the use of an intermediate and temporary data store to preserve computation or data retrieval time. It sounds simple, but it's worth taking the time to consider cache strategies and approaches. (What’s more, if you’re really going to leverage the technology and tools, it’s worth wondering if you can get rid of the ‘intermediate’ and ‘temporary’ label there.)

Definitions

Some definitions are in order to make discussion easier.

Data. Data, in this context, is any in-memory representation of an object: a collection of values, or the result of a computation.

Data can have a lifetime, after which the data is considered stale and inappropriate for use.

Cache. A cache is the temporary datastore. Normally it's a region of memory, although this memory region can be persisted in any number of ways.

Datasource. A datasource is any source for data - it can be a process that performs calculations (in which case the result is often referred to as having been “memoized”) or a datastore like a database.

Participant. A participant system is any client of a cache, distributed or otherwise. In the case of a distributed cache, a participant is able to contribute local resources to the general cache.

Writethrough/Readthrough. Readthrough and writethrough refer to the ability of a cache to pass updates made to the cache through to the datasource, or to automatically query the datasource for missing results. This is crucial for a cache's ability to act as a system of record.

System of Record. A system of record is the authoritative repository for data. It will usually be a system's primary database, but some systems are able to use other systems of record efficiently. Few cache systems are able to serve as good systems of record, although some good systems of record can act as caches.

Transaction. A transaction is an operation in which state changes (deletes, updates, insertions) can be discarded or committed. An alternate definition is a cohesive unit of work, but that definition comes from application requirements and has no technical meaning in this context.

Scenarios

There are four basic strategies for caching, with two of them being fairly rare. They are: local caching, distributed caching, system of record with local storage, and event handling. Of the four of them, the first is by far the most common, the second is growing in deployments, the third is very rare but shouldn't be, and the last is almost entirely inappropriate for caching approaches - but is used nonetheless.

That doesn't mean this distribution of usage is appropriate. The plain caching approach is workable, but if it's used for the most common case - intermediate data storage - it usually means your data takes too long to retrieve. The second most common case is memoization - storage of intermediate and calculated results - and local caching is generally fine for this, although distributed caching is better because it prevents different participating processors from having to calculate the same intermediate results.

Plain Caching

Plain caching and distributed caching are very different in terms of configuration, but otherwise very similar in terms of how they're used. Typically, the cache is represented to the participating application as a map (a key/value store), where each data item is accessed by a key of some kind.
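As a rough sketch of what that looks like in code, here's a map-backed cache in front of a datasource; Customer and CustomerRepository are hypothetical stand-ins for whatever your data item and datasource happen to be.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-ins for a cached data item and its datasource.
interface CustomerRepository {
    Customer findById(long id);
}

record Customer(long id, String name) {}

public class CustomerCache {
    private final Map<Long, Customer> cache = new ConcurrentHashMap<>();
    private final CustomerRepository repository;

    public CustomerCache(CustomerRepository repository) {
        this.repository = repository;
    }

    public Customer byId(long id) {
        // Return the cached copy if present; otherwise fetch it from the
        // datasource and remember it under its primary key.
        return cache.computeIfAbsent(id, repository::findById);
    }
}
```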

The difficulties come in two areas: determining the data key and querying.

If the data represents a row in a database table, for example, the key can be very easy to determine: it's the table's primary key. However, if the data represents more than a simple row, things can get more complicated: the key might be a SQL query used to retrieve the data, or some other form of composite data.

This carries with it its own difficulties, because there might be more than one set of data for a given primary key. This leads into the other difficult area: queries.

Usually caches don't expose views of their data outside of the primary key. This means that if you can't reconstruct the key, that data is lost. This ... may be suboptimal.

Application: Hibernate

Hibernate has two levels of caching. The first is the session-level cache (typically scoped to a single transaction), such that an object accessed repeatedly within that session is not fetched (or updated) over and over again.

The second-level cache is held by the session factory, and is where plain caching (or distributed caching) comes into play. There are different concurrency strategies available here, from read-only caches to fully transactional updating caches. Performance goes down as you move from read-only, to nonstrict-read-write, to read-write, to transactional, but update safety improves as the tradeoff.
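To make that concrete, here's a minimal sketch of enabling the second-level cache for a single entity, assuming a JPA/Hibernate setup with a cache provider already configured; the entity and its fields are hypothetical.

```java
import javax.persistence.Entity;
import javax.persistence.Id;
import org.hibernate.annotations.Cache;
import org.hibernate.annotations.CacheConcurrencyStrategy;

// A hypothetical entity with second-level caching enabled; the concurrency
// strategy chosen here trades some performance for update safety.
@Entity
@Cache(usage = CacheConcurrencyStrategy.READ_WRITE)
public class Country {

    @Id
    private Long id;

    private String name;

    // getters and setters omitted for brevity
}
```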

The second-level caches can be distributed, but distribution of a cache has drastic effects - because in most cases, synchronization takes time. Imagine an item stored in the cache on participant 1, then updated; it's then accessed from participant 2 and updated. There's a synchronization point that needs to be addressed between those nodes before updates can be properly applied.

Distributed Caching

Distributed caching is used when a cache is represented on more than one participant at once. Different cache providers provide different services for distributed caching.

Topologies

There are many topologies involved in distributed mechanisms (not just cache!) - the two basic starting points are hub-and-spoke and mesh topologies. Most other topologies can be considered as variants or mixtures of these two, although there are of course exceptions.

Hub and Spoke

The hub and spoke topology has the concept of a central server, and many clients (participants, using our definitions) connect to the central server. The main server is the cache's "system of record" although it may not be the application's system of record (and often isn't). The network bandwidth available to the central server is a limiting factor here.

One strength of this topology is that network management is usually very simple, and that only one system has to be optimized for data access: the central server.

Usually, systems with the hub and spoke topology have a backup live on the network in the case of failure; therefore, client systems are each connected to two servers (one primary, one backup).

Mesh

Mesh topologies are topologies in which each participating node is connected to every other participating node. There's usually not a central server in this topology, although there might be a coordinating process of some kind.

Each participant is a peer to the other, so there's no dominant server, nor is there an authoritative place for data to live. Network limitations are per-participant.

This is a very, very common topology for clustered environments. Participants tend to prefer data cached locally, which speeds things up; synchronization, however, can be costly depending on the transaction strategy.

One downside to this approach is that participants tend to need more resources allocated to them: enough room to serve cached data to other peers, as well as room to hold results that have been fetched from other peers locally.

Application: Memoization

One application for which distributed caches shine is memoization. We've mentioned memoization a few times, but it's worth defining: memoization is turning the result of a function into something to be remembered.

Consider the strategy for calculating the sine of a value. One strategy that used to be very common (and may still be, although I'm not sure) was the storage of coarse sines - every quarter-radian, for example. Then, when a sine was requested, the calculation would look to see if it had previously calculated the result, and if not, would perform the expensive calculation... and save the result of the calculation in an internal table.
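Here's a minimal sketch of that idea, using a plain in-memory table rather than anything distributed.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// The result of an expensive calculation is stored the first time it's
// requested, and simply looked up on every request after that.
public class MemoizedSine {
    private final ConcurrentMap<Double, Double> table = new ConcurrentHashMap<>();

    public double sine(double radians) {
        // computeIfAbsent performs the calculation only on a cache miss.
        return table.computeIfAbsent(radians, Math::sin);
    }
}
```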

This is memoization. Typical uses for memoization in a caching application might be the calculation of slices of datasets, preserving the slices for later retrieval. Since calculating the slices can be expensive (depending on the size of the data set, of course), it's often worthwhile using a distributed cache to spend the calculation time once and then factor in only the time necessary to transfer the data across the network to other participants.

System of Record

A cache is not normally considered a system of record. The typical data access involving a cache looks something like this:

• Calculate a cache key.

• If cache key exists in the cache: return the data from the cache.

• Fetch the data from the system of record.

• Store the data in the cache.

• Return the data.

A typical write access would look like this:

• Calculate a cache key for the data to be written.

• Write the data to the cache with the cache key.

• Write the data to the system of record.

This works well, but means that data access is... complicated. Every write ends up going to the network multiple times (for the distributed cache and for the system of record) and synchronization is manual.
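Here's a rough sketch of that manual pattern; Cache, Database, and Order are hypothetical stand-ins, and the point is simply that the programmer coordinates both stores by hand.

```java
// The programmer has to touch both the cache and the system of record on
// every write, and keep them consistent manually.
public class OrderDao {
    private final Cache<String, Order> cache;
    private final Database database;

    public OrderDao(Cache<String, Order> cache, Database database) {
        this.cache = cache;
        this.database = database;
    }

    public Order read(long orderId) {
        String key = "order:" + orderId;            // calculate a cache key
        Order cached = cache.get(key);
        if (cached != null) {
            return cached;                          // cache hit
        }
        Order fromDb = database.loadOrder(orderId); // fetch from the system of record
        cache.put(key, fromDb);                     // store it in the cache
        return fromDb;
    }

    public void write(Order order) {
        String key = "order:" + order.getId();      // calculate a cache key
        cache.put(key, order);                      // write to the cache
        database.saveOrder(order);                  // write to the system of record
    }

    // Hypothetical collaborators, standing in for a real cache and database.
    interface Cache<K, V> { V get(K key); void put(K key, V value); }
    interface Database { Order loadOrder(long id); void saveOrder(Order order); }
    interface Order { long getId(); }
}
```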

There are cache capabilities, however, that provide for write-through and read-through. With these capabilities, one is able to request a key from the cache, and if the cache does not have the value, it will itself go to the “official” system of record and request data; likewise, when data is written to the cache, the cache will synchronize to the database itself without programmer participation.
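The read-through idea, sketched in the abstract (the Loader interface here is hypothetical, not any particular product's API): the cache owns the loading logic, so callers never touch the system of record directly.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// On a miss, the cache goes to the datasource itself; callers only ever
// talk to the cache.
public class ReadThroughCache<K, V> {
    public interface Loader<K, V> { V load(K key); }

    private final Map<K, V> store = new ConcurrentHashMap<>();
    private final Loader<K, V> loader;

    public ReadThroughCache(Loader<K, V> loader) {
        this.loader = loader;
    }

    public V get(K key) {
        return store.computeIfAbsent(key, loader::load);
    }
}
```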

Naturally, there are concerns. It's usually of fairly minor impact if data is read twice from the backend system of record at the “same time” - but can be fatal if data is written to the database more than once. Therefore, transaction support is mandatory.

The transaction requirement implies that the write-behind is synchronous. (If it's not, do not use it in any kind of transactional system.) The key at this point is to determine how synchronous the events are, and where the synchronization points lie.

Application: GigaSpaces

GigaSpaces XAP is a commercial product from GigaSpaces Technologies. It's more than a simple cache; it's a data grid and application platform (as we'll see more of in the next section). The network structure is a combination of mesh and hub-and-spoke; participants connect to a network of managing applications, which inform the participants who their peers are for failover and high availability. A participant is designated by the management system as a primary or backup; primaries synchronize to their backups and no other systems. (There are exceptions to this, naturally.)

In GigaSpaces XAP, write-behind is supported as a configuration element, normally backed by Hibernate (i.e., writing to a relational database) but offering a customizable externalization interface.

When a data item in the data grid is updated, the write is synchronized between the primary and its backup, and if write-behind is configured, the update is also written through to a mirroring service. The mirroring service then consumes the writes and sends them to the backend datastore. Due to the nature of the mirror synchronization, you have to have a true and catastrophic failure of the data grid in order to have any problems.

Beyond Caching

The limitations of caching, up to the point of considering a cache a system of record, lie in the concept that data is stored in more than one place at any given time. This leads you to data synchronization issues, which cost you in time and complexity.

Considering a cache as a system of record can help solve some of this complexity, but most cache products' support for systems of record is cruelly limited, to say the least (with writethrough and readthrough generally bolted on).

With that said, there are scenarios beyond caching to consider, arenas in which a “simple cache” can bloom into a full-bore application architecture: messaging, event handling, and distributed processing.

Messaging/Event Handling

An event handling, or messaging, system is a system in which events are delivered to waiting listeners. It’s really that simple: an event is an object constructed with some information about what happened (which can be as simple as “EventHappened” or as complex as an insurance application, complete with the full state of the application).

A process somewhere is looking for that event, either constantly or regularly (i.e., always listening, or polling the event system at intervals), and upon receipt of the event, removes the event from the message system and does something to it (which often includes writing another event back in the system for further processing by something else.)

In many ways, event handling is a core concept of object-oriented programming, serving as the model for Objective-C’s object paradigm, for example, and playing the same role in Smalltalk and Erlang.

A message system, however, has the ability to queue operations, such that expensive resources (CPU time, database connections, whatever) can be managed by the number of consumers of messages; if your system can only handle one message at a time, you would simply use one consumer of messages, and messages could build up to be handled as the system was able.
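As a sketch of that throttling idea, a single consumer draining a queue is enough; the message type and handling step here are placeholders.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Producers enqueue messages as fast as they like; the single consumer drains
// them one at a time, so the expensive resource is never over-committed.
public class SingleConsumerQueue {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> messages = new ArrayBlockingQueue<>(1000);

        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    String message = messages.take(); // blocks until a message arrives
                    System.out.println("handled " + message);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // shut down cleanly
            }
        });
        consumer.start();

        // Messages simply build up in the queue until the consumer gets to them.
        messages.offer("event-1");
        messages.offer("event-2");

        TimeUnit.SECONDS.sleep(1);
        consumer.interrupt();
    }
}
```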

A cache normally cannot serve as a message system, except very crudely at best. However, some cache-like systems (including GigaSpaces XAP) provide notification services and, in the best cases, complex and powerful message selectors to provide a full message bus.

GigaSpaces XAP not only provides a full message selection API, but multiple programming interfaces to send and receive messages, including cache-oriented interfaces.

Event handling systems refer to transactions in two contexts: one is the process transaction, an operation in which state changes can be committed or discarded. (This is the definition offered in the definitions section of this document.) The other is a requirements-defined unit of work, which can itself contain multiple transactions of both types.

It’s critical that event handlers be transactional from the process’ standpoint: otherwise an event handler that fails for any reason will cause the event to be lost entirely.

Messages should also have lifetimes, such that a message is discarded if it’s not consumed within a specific amount of time. This is usually available in cache-like systems, because cache entries normally have lifetimes associated with them already.

Another aspect to consider for event handlers is whether event order is preserved. GigaSpaces XAP, of course, can support all of the requirements for an event handling system, and adds distributed processing and data locality as well.

Distributed Processing

Distributed processing refers to the capability for related tasks to run simultaneously, or on whichever participants have available resources; it's a crucial aspect of message handling systems.

Imagine a system that needs to create PDFs on a regular basis. PDF creation is expensive (in time and memory); a system that needs to create many PDFs would easily be overloaded. With a messaging system, the number of simultaneous PDF creation processes can be limited per system participant. Correspondingly, if desired, machines with low system load could “pick up” the PDF generation tasks, which prevents the more burdened systems from being overloaded.
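As a sketch, limiting simultaneous PDF jobs on one participant can be as simple as a bounded worker pool; renderPdf here is a hypothetical stand-in for the expensive generation step.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// This participant will never render more than two PDFs at once; further
// requests simply queue until a worker thread is free.
public class PdfWorker {
    private final ExecutorService pool = Executors.newFixedThreadPool(2);

    public void submit(String documentId) {
        pool.submit(() -> renderPdf(documentId));
    }

    private void renderPdf(String documentId) {
        // Hypothetical stand-in for the expensive generation step.
        System.out.println("rendering " + documentId);
    }
}
```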

Distribution of tasks relies heavily on how far-reaching state changes are. Amdahl’s Law describes how the serial portion of a process limits the speedup you can get from parallelization; the tasks that can be broken away from serial processing can be executed in parallel, in the amount of time it takes to process the slowest of them.

In other words, if you have a task “D” composed of tasks “A,” “B,” and “C,” which take three, four, and five seconds respectively, in a sequential process D would take twelve seconds to execute. However, if A, B, and C don’t rely on each others’ data – C doesn’t use A’s results, for example – then D can feasibly execute in five seconds, the amount of time it takes to run the longest subtask.

If C does rely on A’s data, but B does not, then D will take eight seconds to run (B would be able to finish in the time it takes to run A+C.) Amdahl’s Law refers to the potential for time savings by parallelizing tasks that don’t rely on other tasks.
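A small sketch of that example, assuming three hypothetical tasks with those durations: B runs in parallel with the A-then-C chain, so the whole thing completes in roughly eight seconds rather than twelve.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class TaskPipeline {
    // Hypothetical sub-tasks; the three-, four-, and five-second durations
    // mirror the A/B/C example above.
    static int taskA() { sleep(3); return 1; }
    static int taskB() { sleep(4); return 2; }
    static int taskC(int fromA) { sleep(5); return fromA + 3; }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);

        // B has no dependencies, so it runs alongside the A -> C chain.
        CompletableFuture<Integer> b = CompletableFuture.supplyAsync(TaskPipeline::taskB, pool);
        CompletableFuture<Integer> ac = CompletableFuture
                .supplyAsync(TaskPipeline::taskA, pool)
                .thenApply(TaskPipeline::taskC);

        // Wall-clock time is roughly max(A + C, B) = 8 seconds, not 12.
        System.out.println("D = " + (ac.get() + b.get()));
        pool.shutdown();
    }

    private static void sleep(int seconds) {
        try {
            TimeUnit.SECONDS.sleep(seconds);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```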

There are also other variants: one is Map/Reduce, which refers to an algorithm that breaks a request into many small portions, easily handled by participant systems (the “map” phase), with the results finally collated into a cohesive result (the “reduce” phase.)

Map/Reduce is well-known in the distributed industry, but there’s an aspect that sometimes gets overlooked: locality of data.

Data can be routed, as it is in a hash, such that related data lives on a given participant. Node A might get all data for Florida, for example, while node B might get all data for New York. Therefore, tasks limited to Florida would get handled only by node A, as node B would have no local data to operate on. (That isn’t to say that node B couldn’t request this data, but this means extra resource consumption.)
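A minimal sketch of that kind of routing: a routing value (the state, in this example) decides which participant owns the data, so all of the Florida records land on the same node.

```java
// Route a record to a participant based on a routing key; the same key always
// maps to the same node, so related data stays together.
public final class Router {
    private final int nodeCount;

    public Router(int nodeCount) {
        this.nodeCount = nodeCount;
    }

    public int nodeFor(String routingKey) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(routingKey.hashCode(), nodeCount);
    }
}
```

With four nodes, new Router(4).nodeFor("FL") always yields the same index, so a task scoped to Florida can be sent straight to the participant that holds the Florida data.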

This has huge implications for Map/Reduce.

Imagine a system whose purpose it is to contain… marbles of four colors: red, green, blue, and white. Counting the marbles of each color would traditionally involve iterating through all of the marbles (assuming you lose none), incrementing a count for color as each marble is processed. Even assuming the triviality of this task, it can be made quicker and simpler.

Consider: with multiple participant nodes (there’s no point discussing distributed processing with only a single node), you’d have marble colors assigned to each node. With four nodes, each node would have its own marble color; node A would have only red marbles, node B would have only green marbles, and so forth.

Counting marbles, then, becomes a task of merely counting however many marbles a given node has, then collating the results. Node A would count its marbles (with no consideration for color, as all it has are red marbles), and provide a result; nodes B through D would do the exact same thing. This is the Map phase.

The reduce phase would look at the four results (node A: n marbles, node B: p marbles, etc.) and build a list of marble colors and counts out of those results.
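In code, the two phases might look something like this sketch, where each node runs countLocal over its own marbles and the partial results are collated afterwards.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MarbleCount {
    // Map phase: executed on each participant, against its local marbles only.
    static Map<String, Long> countLocal(List<String> localMarbles) {
        Map<String, Long> counts = new HashMap<>();
        for (String color : localMarbles) {
            counts.merge(color, 1L, Long::sum);
        }
        return counts;
    }

    // Reduce phase: collate the partial counts returned by each node.
    static Map<String, Long> reduce(List<Map<String, Long>> partials) {
        Map<String, Long> total = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((color, count) -> total.merge(color, count, Long::sum));
        }
        return total;
    }
}
```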

The key here is that A, since it contains all of the red marbles, has direct access to its data, such that it’s merely iterating over its internal indexes of marbles – a surprisingly fast operation (not much slower than counting the elements in an array.)

Even if a node has more than one color of marble, it can be amazingly fast – since it’s still using its internal indexes instead of acquiring data from a filesystem or database.

Distributed systems also rely on transactions, as well as supporting systems of record; note that an external system of record (like a relational database) can cripple your execution time, since accessing the external database has to factor into your runtime. Transactions (and transaction visibility, where a record might not be available to a task until another transaction is complete) will add to the execution time; a distributed system that also serves as a system of record brings transaction times down to near-heap access times, yielding massive benefits.