Developer's guide to CQL on Accord
Intro
Accord is implemented as a library that is agnostic to the underlying database it integrates with. It has little to no awareness of schema, query language, messaging, threading, etc. Instead it presents interfaces for the database to implement that describe the configuration and topology of the database, what reads and writes need to execute and what their dependencies are, and how to actually execute reads and writes at the configured locations.
This guide describes how Cassandra goes about leveraging those interfaces to implement reading and writing CQL as well as live migrating from CQL running on Cassandra to CQL running on Accord.
This guide doesn't cover how Accord works, nor every part of the Accord integration in Cassandra such as threading, caching, persistence, and messaging. It also isn't intended to be a user guide and doesn't fully overlap with it; you should start with the user guide to pick up any context that may be missing here.
Anatomy of a transaction
The primary way of interacting with Accord is to define a transaction using Txn/Txn.InMemory and then ask Accord to execute it. Transactions express what they touch by declaring the set of keys or ranges that will be read or written. This set must be declared up front and can't change during transaction execution, and a transaction is either a key transaction or a range transaction, never both.
Range transactions are more expensive for Accord to execute as the dependency tracking work Accord has to do is more CPU and memory intensive and the transactions are more likely to conflict and block execution of other transactions.
Accord is not aware of tables, only ranges and keys. Keys and ranges can span any tables managed by Accord because they encode the table they apply to. A range transaction covering multiple tables therefore has one range per table, and from Accord's perspective these are completely different ranges.
Transactions also declare a Kind, which can be Read, Write (read/write), EphemeralRead, or ExclusiveSyncPoint. Read and Write are what you would expect. EphemeralRead is a read that only provides per-key linearizability, but offers better performance than Read. ExclusiveSyncPoint is a transaction that establishes a happens-before relationship with its dependencies without interfering with their execution. It is used by live migration and repair to ensure that all committed Accord transactions are visible at ALL to non-transactional reads.
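Before digging into the individual pieces, a purely illustrative model of a transaction's anatomy may help. None of the names below are the real accord-core API; they simply mirror the parts the rest of this guide describes (the declared keys or ranges, a Kind, and the Read/Query/Update components).

```java
import java.util.List;

// Purely illustrative model of a transaction's anatomy; the real types live in accord-core
// (Txn / Txn.InMemory, Txn.Kind, Keys, Ranges, Read, Query, Update) and differ in detail.
enum KindSketch { READ, WRITE, EPHEMERAL_READ, EXCLUSIVE_SYNC_POINT }

record TxnSketch(KindSketch kind,
                 List<String> keysOrRanges, // declared up front; cannot change during execution
                 Object read,               // how to read the declared keys (Read / TxnRead)
                 Object query,              // computes the client-visible Result (Query / TxnQuery)
                 Object update)             // turns the read Data into a Write (Update / TxnUpdate)
{
}
```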
Keys and Ranges
Keys and Ranges are prefixed with TableId in the most significant position to allow Accord to interact with multiple tables without knowing anything about schema. From Accord's perspective there is just a set of ranges that it is responsible for replicating and transacting over, and they can be compared, sorted, and split, but beyond that they are completely opaque. A follow-on effect is that token ranges (and the token ring) are per table.
Key is conceptually similar to DecoratedKey and is implemented by PartitionKey. RoutingKey is conceptually similar to Token and is implemented by TokenKey.
Accord's Range is conceptually equivalent to Cassandra's Range<RingPosition> and is implemented by TokenRange. An Accord Range is start-exclusive and end-inclusive, just like Cassandra's Range, and we use it exclusively in that mode. No other forms of inclusive/exclusive bounds or ranges are used directly by Accord; Range's implementation suggests support for other forms of bounds, but they are not currently supported. It's theoretically possible to use something similar to Range<PartitionPosition> as the implementation of Accord's Range, but we don't do that because Cassandra doesn't support splitting partitions.
To integrate Cassandra with Accord it's necessary to have a few different versions of TokenKey that make it possible to describe cluster topology and perform query routing to Accord across a range of partitioners. A TokenKey can be a sentinel for a given table, which maps to -inf or +inf for that table, and it's possible to create a minimum sentinel that sorts before -inf or after +inf. Additionally, it's possible to declare a TokenKey that lies between a token and either token - 1 or token + 1. Accord expects to be able to convert a RoutingKey to a Range, which is facilitated by being able to create these in-between tokens without requiring the partitioner to support incrementing or decrementing a token. Partition range reads also leverage these in-between tokens to convert Range bounds from inclusive to exclusive and vice versa, to match the inclusivity/exclusivity of the query being executed.
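To make the sentinel/in-between idea concrete, here is a simplified model (not the real TokenKey/TokenRange code; all names here are illustrative) of a TableId-prefixed routing key with a relative position, which lets inclusive and exclusive bounds be converted without asking the partitioner for token - 1 or token + 1.

```java
import java.util.Comparator;

// Simplified model of a TableId-prefixed routing key; the real classes are TokenKey and
// TokenRange in Cassandra's Accord integration, and their layout differs in detail.
record SketchRoutingKey(String tableId, long token, int position) // position: -1 = just before, 0 = exact, +1 = just after
{
    static final Comparator<SketchRoutingKey> ORDER =
            Comparator.comparing(SketchRoutingKey::tableId)      // TableId is the most significant component
                      .thenComparingLong(SketchRoutingKey::token)
                      .thenComparingInt(SketchRoutingKey::position);

    // Accord ranges are start-exclusive/end-inclusive. An inclusive start bound at `token`
    // can be represented as "just before token" without needing token - 1 from the partitioner.
    static SketchRoutingKey inclusiveStartBound(String tableId, long token)
    {
        return new SketchRoutingKey(tableId, token, -1);
    }

    // Similarly, an exclusive end bound at `token` becomes "just before token" when expressed
    // as the inclusive end of an Accord-style range.
    static SketchRoutingKey exclusiveEndBound(String tableId, long token)
    {
        return new SketchRoutingKey(tableId, token, -1);
    }
}
```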
Seekable, Unseekable, Routable
The implementations of these interfaces, most of which were just discussed, are always prefixed with TableId.
A Seekable has enough information that it can be used to both route a
query and then execute it because it identifies what exactly to read and
write. An Unseekable is more compact (just a token) for Accord to work
with and can be used to route and schedule transaction execution. A
Routable could be either Seekable or Unseekable and is generally
used when you need to handle both.
Seekable can be either a Key or an Accord Range. Key has both
routing (token) and partition key/clustering information. Range is
Seekable but its bounds are only Routable. Range is in an odd
place in terms of being Seekable . It’s helpful because APIs can
accept Seekable and then handle both Key and Range domains.
Seekables is the collection version of Seekable and can be either
Keys or Ranges.
Unseekable can be either a RoutingKey (TokenKey) or a Range (TokenRange), and Unseekables is either RoutingKeys or Ranges. Route and various other kinds of Routables exist, but they are primarily used inside Accord.
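A small illustration of why accepting the Seekable-like supertype is convenient: code can take either domain and branch once. The types below are illustrative stand-ins, not the real Routable/Seekable/Unseekable interfaces.

```java
// Simplified model of the Key/Range domains; the real interfaces are Routable, Seekable,
// Unseekable (and Routable.Domain) in accord-core.
sealed interface SeekableSketch permits KeySketch, RangeSketch {}
record KeySketch(String tableId, long token, byte[] partitionKey) implements SeekableSketch {}
record RangeSketch(String tableId, long startToken, long endToken) implements SeekableSketch {}

final class ReadDispatcher
{
    // APIs that accept the Seekable-like supertype can handle both domains in one place.
    static String describe(SeekableSketch seekable)
    {
        return switch (seekable)
        {
            // A key carries routing (token) plus the partition key needed to actually read it.
            case KeySketch k -> "single-partition read in table " + k.tableId();
            // A range is Seekable, but its bounds are only routing information (tokens).
            case RangeSketch r -> "range read in table " + r.tableId()
                                  + " over (" + r.startToken() + ", " + r.endToken() + "]";
        };
    }
}
```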
Data
Data is an opaque container for data that has been read during execution of a transaction. Accord doesn't know anything about its contents; the only requirement is that Data instances can be merged, since Accord will execute multiple reads at different command stores and needs to merge the results.
Data is implemented by TxnData, which is a glorified map from a unique integer identifying each piece of data read to a TxnDataValue, which can be either a TxnDataKeyValue or a TxnDataRangeValue. TxnDataKeyValue doesn't support merging because Accord only reads a key from a single replica. TxnDataRangeValue does support merging: the integer key in TxnData identifies the logical read in the transaction, but the actual execution of a range read can touch an arbitrary number of command stores covered by the range, and each produces its own TxnDataRangeValue for its portion of the read.
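A sketch of that merge requirement, using illustrative names rather than the real TxnData/TxnDataValue classes: each command store returns a partial result keyed by the logical read's integer name, and the coordinator folds them together, with only range reads producing values that genuinely need merging.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative model of TxnData's "read name -> value" shape and its merge.
final class TxnDataSketch
{
    // name of the logical read -> rows produced for it by one command store
    final Map<Integer, List<String>> values = new HashMap<>();

    // Accord executes pieces of a transaction at different command stores and merges the
    // resulting Data; key reads come from a single replica so they never collide, while
    // range reads can produce one partial result per intersecting command store.
    TxnDataSketch merge(TxnDataSketch other)
    {
        TxnDataSketch merged = new TxnDataSketch();
        merged.values.putAll(this.values);
        other.values.forEach((name, rows) ->
                merged.values.merge(name, rows, (a, b) -> {
                    List<String> combined = new ArrayList<>(a);
                    combined.addAll(b); // range reads: concatenate each store's portion
                    return combined;
                }));
        return merged;
    }
}
```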
Result
Result is the interface for what is returned by Query and ends up being returned as the non-error result by Accord to the coordinator of a transaction. It is implemented by TxnData for key read results and by TxnRangeReadResult for range reads.
There is also RetryNewProtocolResult, which can be returned by Cassandra's integration with Accord during live migration. This retry error indicates that Accord determined the transaction's execute time is in an epoch where Accord does not manage some or all of that data for read or write, so the transaction should be retried on whatever system currently manages that data.
Read
Read is where a transaction defines how data should be read during execution in order to return a result; its read method is invoked with the specific keys to be read at each command store. A Read has to define all the keys it will access up front and needs to support slice/intersecting/merge so that Accord can send only the relevant parts of a transaction's reads to the command stores that are responsible for persisting metadata about the transaction and executing the read.
TxnRead implements Read and is a sorted collection of TxnNamedRead. The name in TxnNamedRead refers to what is now the integer identifier for each logical read in the transaction. TxnNamedRead supports both key and range reads, although not both in the same transaction.
The name for a read is an incrementing integer encoded at planning time, with the higher-order bits storing the kind of read and the lower-order bits storing the index of the read (see the sketch after this list). Kinds of reads include:
- USER - LET statements
- RETURNING - the returning SELECT in a TransactionStatement
- AUTO_READ - automatically generated reads, such as setting a list element by index
- CAS_READ - the read for CAS statements
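A minimal sketch of that encoding; the bit widths, ordinal values, and class name below are illustrative rather than the exact layout used by the implementation.

```java
// Illustrative encoding of a read's name: high bits = kind of read, low bits = index.
final class TxnDataNameSketch
{
    enum ReadKind { USER, RETURNING, AUTO_READ, CAS_READ }

    private static final int INDEX_BITS = 24;                  // assumed width, for illustration
    private static final int INDEX_MASK = (1 << INDEX_BITS) - 1;

    static int encode(ReadKind kind, int index)
    {
        return (kind.ordinal() << INDEX_BITS) | (index & INDEX_MASK);
    }

    static ReadKind kind(int name)
    {
        return ReadKind.values()[name >>> INDEX_BITS];
    }

    static int index(int name)
    {
        return name & INDEX_MASK;
    }
}
```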
Every read in a transaction is executed concurrently in the read stage
threadpool and the resulting Data (TxnData) is merged into a single
value.
TxnRead contains a read consistency level, not visible to Accord, that declares the consistency level the transaction requires. This will be discussed more when we cover interoperability, but if it is set then the transaction will actually read from multiple replicas, complete with short read protection and blocking read repair.
Query
Query is the portion of the transaction definition responsible for computing the Result of the transaction that will be returned at the coordinator. It's implemented by TxnQuery, which has several different modes it can operate in.
Query has only one method, compute, which computes the result and is run on the coordinator of a transaction. TxnQuery is responsible for a few things, such as validating that the query accesses data managed by Accord and generating a retry error if it doesn't. For CAS statements it's also responsible for checking the CAS condition and returning the appropriate result, and for range reads it merges the range read results and reapplies the limit.
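A rough outline of that coordinator-side flow, with illustrative types; the real logic lives in TxnQuery and consults TableMigrationState via cluster metadata.

```java
import java.util.List;
import java.util.function.LongPredicate;

// Illustrative outline of a Query implementation's coordinator-side responsibilities.
final class QuerySketch
{
    sealed interface ResultSketch permits Rows, RetryOnNewProtocol {}
    record Rows(List<String> rows) implements ResultSketch {}
    record RetryOnNewProtocol() implements ResultSketch {} // retry on whichever system owns the data

    static ResultSketch compute(long executeAtEpoch,
                                LongPredicate accordManagesAllTouchedDataIn,
                                List<String> mergedRows,
                                int limit)
    {
        // Validate that, in the epoch the transaction executes in, Accord manages everything
        // the transaction touches; if not, tell the coordinator to retry on the other system.
        if (!accordManagesAllTouchedDataIn.test(executeAtEpoch))
            return new RetryOnNewProtocol();

        // For range reads the per-command-store results have been merged, so re-apply the limit.
        // (For CAS, this is also where the condition would be checked against the read data.)
        return new Rows(mergedRows.subList(0, Math.min(limit, mergedRows.size())));
    }
}
```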
TxnQuery also has an implementation, UNSAFE_EMPTY, used for Accord system transactions, that does no validation that Accord owns the ranges in question. This is because, from Accord's perspective, it immediately adopts all the ranges in a table when that table begins migration to Accord, but from live migration's perspective (which Accord can't see) there is a TableMigrationState that specifies which ranges within a table are managed by Accord. Accord system transactions only touch Accord metadata, so "they don't exist" from the perspective of live migration and of concurrent reads and writes to data.
Update
Update is invoked via the apply method on the Accord coordinator and is responsible for taking the Data produced by Read and producing the Write that contains all the writes to be applied as part of committing the transaction.
Update requires support for slice/intersecting/merge so that
Accord only needs to distribute and persist the potentially sizable
partial or complete updates to the shards that actually need them.
TxnUpdate implements Update and can contain complete or partial updates; partial updates are completed when apply is called with the TxnData from TxnRead. Updates that are not data dependent (blind writes) are handled differently from data-dependent updates: data-dependent updates are computed at the coordinator and returned in the TxnWrite, while blind writes are omitted and instead retrieved from TxnUpdate at each replica when TxnWrite.apply is called.
TxnUpdate is also responsible for populating the update with the monotonic transactional hybrid logical clock for the execution time of the transaction. Once a range starts migrating to Accord, normal CQL operations use the Accord timestamp, and they fall back to the server timestamp when migrating away from Accord. For normal CQL operations, USING TIMESTAMP is supported and will cause the data to use the user-provided timestamp instead of the Accord one, though this breaks linearizability and should be avoided when possible.
For BATCH operations, timestamp handling is more complex. If the batch itself uses USING TIMESTAMP, the user timestamp is used; likewise if every mutation in the batch uses USING TIMESTAMP. If USING TIMESTAMP is not used and all partition keys are on Accord, the Accord timestamp is used. If the batch has some partitions on Accord and others not, the server timestamp is used (writes to the Accord table will not be linearizable for multi-table batches where one table is not migrating to Accord). If a batch mixes server-timestamp and USING TIMESTAMP mutations, the default behavior is to reject the batch; this is configurable via accord.mixed_time_source_handling with the values reject (the default), log (accept, and log operations whose writes to the Accord table will not be linearizable), or ignore (accept silently).
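The decision logic, roughly, as a sketch of the rules above; accord.mixed_time_source_handling is the real property name, but the method and type names here are illustrative.

```java
// Sketch of the BATCH timestamp rules described above.
enum TimeSource { USER, ACCORD, SERVER }
enum MixedTimeSourceHandling { REJECT, LOG, IGNORE } // accord.mixed_time_source_handling

final class BatchTimestampSketch
{
    static TimeSource choose(boolean batchUsesUsingTimestamp,
                             boolean allMutationsUseUsingTimestamp,
                             boolean someMutationsUseUsingTimestamp,
                             boolean allPartitionsOnAccord,
                             MixedTimeSourceHandling handling)
    {
        // USING TIMESTAMP on the batch itself, or on every mutation in it, wins outright.
        if (batchUsesUsingTimestamp || allMutationsUseUsingTimestamp)
            return TimeSource.USER;

        // A mix of server-timestamp and USING TIMESTAMP mutations is rejected by default;
        // LOG accepts and logs the non-linearizable writes, IGNORE accepts silently.
        if (someMutationsUseUsingTimestamp && handling == MixedTimeSourceHandling.REJECT)
            throw new IllegalArgumentException("batch mixes USING TIMESTAMP and server timestamps");

        // Every partition on Accord: use the transactional hybrid logical clock; otherwise
        // fall back to the server timestamp (Accord writes in such a batch are not linearizable).
        return allPartitionsOnAccord ? TimeSource.ACCORD : TimeSource.SERVER;
    }
}
```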
TxnUpdate has a write consistency level that is not visible to Accord; it is similar to the commit consistency level for CAS writes. If the write consistency level is set then Accord will do a synchronous commit at the specified consistency level; otherwise Accord defaults to asynchronous commit. How consistency levels are handled is covered under interoperability and live migration.
Write
Write is produced by invoking Update.apply and is not required to be splittable/mergeable because all writes are sent to all shards. Write is implemented by TxnWrite, which each command store will invoke via apply for each intersecting key. This causes all writes in a transaction to run concurrently on the mutation stage.
Putting it all together
With all the components of a transaction available, they can be assembled and handed to Accord to coordinate, implementing all the existing CQL interfaces as well as the new TransactionStatement interface. See TransactionStatement.createTxn, CQL3CasRequest.toAccordTxn, ConsensusMigrationHelper.mutateWithAccordAsync, StorageProxy.readWithAccord, and BlockingReadRepair.repairViaAccordTransaction.
There isn't as much magic as you might think in how Accord executes transactions when operating with exclusive access to a table. Accord is able to execute ReadCommands mostly unmodified, with some accommodations for the fact that reads are strongly consistent from a single replica so filtering can be pushed down. The majority of the work is just making the description of things like CAS serializable so it can be persisted by Accord for transaction recovery.
Where things get complicated is live migrating to Accord and supporting interoperability with non-Accord reads and writes.
Live migration
Core challenges
Accord and Paxos operate in fundamentally different ways in terms of what they perform consensus on and how transactions are recovered. Paxos performs consensus on the exact set of writes to apply, and recovering a transaction only requires those writes to be applied. Accord's consensus is on the transaction definition, a superset of the dependencies, and the execution timestamp of the transaction.
Accord needs to recompute the writes during transaction recovery, which means it may need to repeat any reads necessary to compute those writes, which in turn means reads must be repeatable during transaction execution and recovery. Non-Accord writes cause non-determinism for Accord reads. Accord also reads at ONE, so it would miss QUORUM writes.
The big hammer we use to deal with this is to avoid ever requiring
Accord to read data that is not replicated at ALL. If we did it would
lead to non-deterministic transaction recovery. This isn’t something
that can be addressed by having Accord read at QUORUM and then
performing blocking read repair because different Accord coordinators
can still witness different sets of non-Accord writes.
Accord also defaults to asynchronous commit, so when migrating away from Accord it's not safe for Paxos and non-SERIAL reads to read committed Accord writes.
Bridging the gap
Cassandra needs to be highly available while transitioning, but operations that propagate data at ALL, like Cassandra's Data Repair + Paxos Repair or Accord's repair syncs, are not highly available. Going forward these will be referred to as range barriers.
At every point during migration there has to be some system safely
capable of executing every operation type. Highly available key barriers
solve this problem by allowing the migration of a single key at QUORUM
to meet the requirements for execution on the migration target system.
A key barrier on Paxos uses the existing Paxos repair mechanism to apply any partially committed transactions at QUORUM, which can then be safely read by Accord if Accord reads at QUORUM. A key barrier on Accord uses Accord's sync mechanism to wait until all transactions in an epoch that could have modified the key are applied at QUORUM.
There is a system table and a small in-memory cache for key barriers to avoid repeatedly performing key migrations, but a key migration is only recorded if the coordinator is a replica, to avoid the cache growing too large.
No non-SERIAL key migration
One wrinkle is that it is not possible to do key migration for non-SERIAL Cassandra writes because there is no metadata to check for uncommitted operations like there is with Paxos and Accord. This applies to all sources of non-SERIAL writes, such as read repair, logged batches, and hints. Accord doesn't have this issue because any data managed by Accord always has metadata available, since all operations are routed through Accord.
Splitting migration to Accord into two phases solves this issue: while Accord is unable to safely read non-SERIAL writes, it can safely apply them, because recovery of blind write transactions is still deterministic in Accord. In the first phase of migration to Accord, all non-SERIAL writes are executed on Accord and synchronously applied at the requested consistency level while a data repair (full or incremental) runs and makes it safe for Accord to read non-SERIAL writes. Paxos continues to execute all SERIAL writes, because Accord is unable to execute SERIAL writes while it can't yet read.
After a data repair completes, the second phase of migration to Accord begins and all operations are executed on Accord, after Paxos key migration is run to ensure that the key being read by Accord has no unapplied Paxos transactions. After a Paxos repair + data repair (full only), the remaining Paxos writes will be visible at ALL, and Accord can begin executing reads at ONE instead of the requested consistency level, performing asynchronous commit, and ignoring the requested commit/write consistency level.
A quirk of incremental repair is that it flushes memtables before Paxos
repair runs and as a result it doesn’t replicate at ALL the data that
Paxos repair propagated at QUORUM. Thus a full repair is required for
the second phase of migration to Accord so that the Paxos data ends up
repaired at ALL. It’s possible, but difficult, to make the
migration three phases and track the Paxos repair independently so that
you could do Paxos repair and then use IR, but this is not currently
implemented.
Supported consistency levels
Live migration to/from Accord requires Accord to honor the requested consistency levels for reads and writes. Cassandra's Accord integration only adds support for the subset of consistency levels listed in IAccordService. DC-aware consistency levels are not supported, nor are TWO and THREE.
Accord will always reject unsupported consistency levels, even if it will not actually honor them during execution, to ensure that your application remains ready to migrate away from Accord in the future.
In the case of ONE as a write/commit consistency level, the commit will silently be performed at QUORUM.
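A small sketch of that behavior; the authoritative list of supported levels is in IAccordService, and the particular set used below is assumed purely for illustration.

```java
import java.util.EnumSet;
import java.util.Set;

// Illustrative consistency-level handling; the authoritative list lives in IAccordService.
enum CL { ANY, ONE, TWO, THREE, QUORUM, ALL, LOCAL_ONE, LOCAL_QUORUM, EACH_QUORUM }

final class AccordConsistencySketch
{
    // Assumed subset, for illustration only.
    static final Set<CL> SUPPORTED_COMMIT = EnumSet.of(CL.ONE, CL.QUORUM, CL.ALL);

    static CL validateCommitCL(CL requested)
    {
        // Unsupported levels are rejected even when they would be ignored during execution,
        // so the application stays ready to migrate back off Accord later.
        if (!SUPPORTED_COMMIT.contains(requested))
            throw new IllegalArgumentException("Unsupported commit consistency level: " + requested);

        // ONE is silently upgraded to QUORUM for commit.
        return requested == CL.ONE ? CL.QUORUM : requested;
    }
}
```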
Interoperability support
Interoperability aims to extend Accord to support reading and writing at configurable consistency levels as well as to add support for synchronous commit. This is facilitated by extension points in Accord that allow injecting custom implementations for various protocol steps via CoordinationAdapter and AccordInteropAdapter.
AccordInteropAdapter can inject custom versions of the execute and persist phases, and does so conditionally at transaction execution time based on the read and write consistency levels provided by TxnRead and TxnUpdate. These consistency levels can differ from the ones requested by the application because live migration may choose to ignore them when they aren't needed.
AccordInteropExecution allows reading at a requested consistency level. It largely inverts control of reading in Accord and uses Cassandra's existing Read Executor functionality to determine which nodes to contact and what commands to send them, while providing short read protection and blocking read repair. Read executors interface with Accord via the ReadCoordinator interface, which can either send a regular read message or go through Accord to send an Accord-specific read message that causes the read to execute at the appropriate command store, in the appropriate transactional context, after all dependencies have been applied.
ReadCoordinator also intercepts blocking read repair during execution of an Accord transaction and executes it through the appropriate command store. The only legitimate way for this to occur is after Paxos key migration, when the data has only been propagated at QUORUM, so it is possible that Accord reading at QUORUM will find replicas to read repair. This isn't strictly necessary, since we already know the data is propagated at QUORUM, but the support is there.
ReadCoordinator also helps apply read repair mutations via Accord in TransactionalMode.MIXED and during migration by applying the read repair mutations in Accord's execute phase instead of waiting for apply. This is safe because read repair only proposes already committed Accord writes or already unsafe non-SERIAL writes, which aren't allowed anyway.
AccordInteropPersist
adds support for synchronous commit and commit at a requested
consistency level. It sends AccordInteropApply which is a synchronous
apply message that only responds once application is complete.
TransactionalMode defines the supported modes, commitCLForMode determines the commit consistency level, and readCLForMode determines the read consistency level. These two methods take into account the requested consistency level, the table-specific migration state, the current transactional mode, and the target transactional mode in order to decide whether to honor the requested consistency level.
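A rough sketch of the kind of decision these methods make; the real implementations take more inputs into account, and the names below are illustrative.

```java
// Illustrative decision for the commit consistency level; the real logic is in
// TransactionalMode#commitCLForMode and readCLForMode.
enum ModeSketch { FULL, MIXED_READS, OFF }

final class CommitClSketch
{
    /** Returns null to mean "use Accord's default asynchronous commit". */
    static String commitCL(ModeSketch targetMode, boolean migrationComplete, String requestedCL)
    {
        // In FULL mode, once migration has completed Accord may ignore the requested
        // consistency level and commit asynchronously (1 WAN round trip instead of 2).
        if (targetMode == ModeSketch.FULL && migrationComplete)
            return null;

        // Otherwise (MIXED_READS, or still migrating) the requested commit consistency
        // level must be honored so the other system and non-Accord reads can see the writes.
        return requestedCL;
    }
}
```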
Routing requests during migration
During migration, requests race with changes to TableMigrationState while executing, and may complete or partially complete on the system they were originally routed to. This race is resolved by allowing requests to return a new retry-on-different-system error response that has to be handled by the coordinator. It's possible that a request may still complete after receiving a retry-on-different-system error because the target consistency level was still met.
Migration is per table and per token range so it’s possible for part of a table to be running on Accord and part of it to be running on Paxos. Requests can end up executing partially on Cassandra and partially on Accord.
Detecting misrouted requests
For Paxos this is resolved in the prepare phase, where a failure to meet the required consistency level means the operation does not run on Paxos. If the prepare phase is being performed to recover an existing transaction then it is allowed to proceed, because recovery will deterministically create the same state every time it runs, so it's safe to repeat even after key or range migration has occurred since those would have already recovered the transaction.
Accord determines an executeAt timestamp for each transaction that is deterministic even during transaction recovery and includes an epoch corresponding to the epoch used by TableMigrationState; this is used to check all the tables and keys being touched by the transaction. TxnQuery then returns a retry-on-different-system error if any part of the transaction is not eligible to run on Accord.
ColumnFamilyStore checks every Mutation to see if it is marked as allowing potential transaction conflicts. Paxos and Accord always mark their Mutations as allowing potential transaction conflicts because they do the work to check for them directly, but non-SERIAL sources of Mutations are subject to that check, and a RetryOnDifferentSystemException is thrown if the mutation is detected to be misrouted according to the latest cluster metadata available at the node attempting to apply it.
ReadCommand has a similar arrangement where each read command is
marked with whether it allows potential transaction conflicts and when
executeLocally is run the check is done against cluster metadata to
determine whether or not to throw RetryOnDifferentSystemException.
Accord always allows potential transaction conflicts on its read
commands, but Paxos does not because Paxos does not need to read data in
order to recover transactions.
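A sketch of the replica-side check, with illustrative stand-ins for Mutation, TableMigrationState, and RetryOnDifferentSystemException.

```java
// Illustrative replica-side misrouting check; the real exception is
// RetryOnDifferentSystemException and the real metadata is TableMigrationState.
final class MisrouteCheckSketch
{
    static class RetryOnDifferentSystemSketchException extends RuntimeException {}

    interface MigrationStateSketch
    {
        // true if, per the latest cluster metadata on this node, the token for this
        // table is currently managed by Accord
        boolean accordManages(String tableId, long token);
    }

    static void checkMutation(String tableId, long token,
                              boolean allowsPotentialTransactionConflicts,
                              MigrationStateSketch state)
    {
        // Paxos and Accord mark their mutations as allowing potential transaction conflicts
        // because they already checked; everything else is validated here.
        if (allowsPotentialTransactionConflicts)
            return;

        if (state.accordManages(tableId, token))
            throw new RetryOnDifferentSystemSketchException();
    }
}
```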
Splitting write requests
For non-SERIAL writes, the Mutation is split into the portion that will execute on Accord and the portion that will execute on Cassandra; the Accord portion is executed asynchronously while the Cassandra portion is executed synchronously. If either attempt fails due to misrouting, the write is re-split with updated cluster metadata and retried without raising an error.
Logged batches are currently always written to the system table and then split for execution, and if part of the batch fails then batchlog replay will replay the entire batch and re-split it in the process. Batchlog replay only makes a single attempt to replay before converting the batch contents to hints. If part of the batch was routed to Accord then there is no node to hint so there is a fake node that a hint is written to and when that hint is dispatched it will be split and then executed appropriately. In CASSANDRA-20588 this needs to be simplified to writing the entire batch through Accord if any part of it should be written through Accord because it also addresses an atomicity issue with single token batches which can be torn when part is applied through Accord and part is applied through Cassandra.
Hints can be for multiple tables, some of which may be on Accord and some not, so splitting occurs. It's also possible that a hint is for an operation that was sent to Accord (not a real node) via the batchlog, and that splitting discovers the hint now needs to be executed without Accord; in that scenario the hint is converted to a hint for every replica. This conversion can only occur once, so the write amplification is bounded.
Splitting of mutations is done in ConsensusMigrationMutationHelper, with the retry loop being implemented at each caller (batch mutation, mutation, batch log, hints).
Paxos has a retry loop but does not do any splitting because Paxos only supports a single key.
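A sketch of the split itself, with illustrative types; the real code in ConsensusMigrationMutationHelper operates on Mutation objects rather than bare tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongPredicate;

// Illustrative split of a multi-partition write into its Accord and Cassandra portions.
final class MutationSplitSketch
{
    record Split(List<Long> accordTokens, List<Long> cassandraTokens) {}

    static Split split(List<Long> tokens, LongPredicate accordManagesToken)
    {
        List<Long> accord = new ArrayList<>();
        List<Long> cassandra = new ArrayList<>();
        for (long token : tokens)
            (accordManagesToken.test(token) ? accord : cassandra).add(token);

        // The Accord portion is executed asynchronously, the Cassandra portion synchronously;
        // on a misroute (RetryOnDifferentSystemException) the caller re-splits with fresh
        // cluster metadata and retries.
        return new Split(accord, cassandra);
    }
}
```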
Partition range reads
Partition range reads are managed by RangeCommandIterator, which continues to split range reads using the existing algorithm that is agnostic as to how the range command will be executed. Each generated range read is then split on the boundaries of which system is responsible for reading that range, and that is wrapped in a retrying iterator which repeats the splitting if any part of the range read ends up routed to the wrong system.
Range reads do not execute any key barriers, and when migrating away from Accord you will see weaker consistency compared to Paxos because Accord does not necessarily honor commit consistency levels and commits asynchronously. As things currently stand it's uncertain whether key barriers would run fast enough to avoid timing out range read requests, so they are not done.
Range reads also consume more memory when executed on Accord when a limit is used. A single range read command is split into one range read command per intersecting command store; these execute concurrently and each can return up to the limit number of results before they are merged at the coordinator and the limit is re-applied. This could be improved by applying the limit again before serializing, or by executing the reads serially at command stores until the limit is met.
Transactional modes
Transactional modes are set per table and define how Accord, Paxos, and
non-SERIAL operations will execute. The three supported modes are
FULL, MIXED_READS, and OFF.
FULL routes all reads and writes through Accord once migration is
complete and allows Accord to ignore read and write consistency levels.
This allows Accord to perform asynchronous commit reducing the number of
WAN roundtrips from 2 to 1.
MIXED_READS routes all writes through Accord once migration is complete, but allows non-SERIAL reads to safely execute outside of Accord and still read Accord writes because Accord will honor the provided commit consistency level. This means Accord will need to perform synchronous commit, requiring one extra WAN roundtrip for two total.
OFF is the default where everything runs either on Paxos if it is
SERIAL or on the usual eventually consistent paths for everything
else.
Other modes exist for testing purposes and are disabled by default unless unlocked via system property.
