Searchable Sequence
Searchable sequence or rank/select dictionary is a sequence of symbols that supports two operations:
rank(position, symbol)
is number of occurrences ofsymbol
in the sequence in the range [0, position)select(rank, symbol)
is a position ofrank
-th occurrence of thesymbol
in the sequence.
Usually the alphabet is {0, 1} because of its practical importance, but larger alphabets are of significant interest too. Especially in Bioinformatics and Artificial Intelligence.
There are many implementations of binary searchable sequences (bitmaps) providing fast query operations with $O(1)$ time complexity. Memoria uses partial sum indexes to speedup rank/select queries. They are asymptotically slower than other methods but have additional space overhead for the index.
Packed searchable sequence is a searchable sequences that has all its data structured packed into a single contiguous memory block with packed allocator. It consists from two data structures:
- multi-index partial sum tree to speedup rank/select queries;
- array of sequence’s symbols.
See the following figure for the case of searchable bitmap.
Note that for a sequence with K-symbol alphabet, packed sum tree has K indexes, that results in significant overhead even for relatively small alphabets. For example 8-bit sequence has 256-index packed tree that takes more than 200% of raw data size if the tree is not compressed. To lower this overhead Memoria provides various compressed encodings for the index’s values.
Creation and Access
To create partial sum tree for a sequence we first need to split it logically into blocks of fixed number of symbols (16 at the figure). Then sum different symbols in the block, each such vector is a simple partial sum tree leaf. Build other levels of the tree accordingly.
- Symbol update is relatively fast, it takes $O(log(N))$ time.
- Symbol insertion is $O(N)$, it requires full rebuilding of partial sum tree.
- Symbol access does not require the tree to perform, it takes $O(1)$ time.
Rank
To compute rank(position, symbol)
- Given
position
, determine sequenceblock_number
for that position, andblock_pos
position in the block, $O(1)$; block_rank
= sum(0,block_number
) in the sum tree, $O(log(N))$;- count number of
symbol
s in the block toblock_pos
, $O(1)$; - final rank is (2) + (3).
Select
To compute select(rank, symbol)
- Given partial sum tree and a
symbol
, determine target sequenceblock_number
havingtotal_rank <= rank
for the givensymbol
using findLE operation, $O(log(N))$; - For the given block, compute
rank_prefix
= sum(0,block_number
) for the givensymbol
, $O(log(N))$; - Compute
local_rank = rank - rank_prefix
; - Scan the block and find position of
symbol
having rank in the block =local_rank
, $O(1)$.
Actual implementation joins operations (1) and (2) into a single traverse of the sum tree.
Hardware Acceleration
Consideration are the same as for partial/prefix sum trees, especially, because searchable sequence contains partial sum tree as an indexing structure.
Modern CPUs usually have direct implementations for rank (PopCount) for binary alphabets. Select operation may also be partially supported. To accelerate searchable sequences, it’s necessary to implement rang/select over arbitrary alphabets (1-8 bits per symbol).
Symbol blocks are also contiguous in memory and can be multiple of DRAM memory blocks. Rank/select machinery is simpler or comparable with machinery for addition and subtraction. Those operations can be efficiently implemented in a small silicon budget and at high frequency.