Bit Vectors

We define three operations on bit vectors:

Access(B, i): returns the bit at position $i$ in $B$.
Rank(B, i): returns the number of 1's in the range $B[1, i]$.
Select(B, x): returns the smallest position $i$ at which the rank becomes $x$, or $|B| + 1$ if $\mathrm{Rank}(B, n) < x$, where $n = |B|$.
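
To pin down the 1-based conventions, here is a naive, uncompressed reference sketch of the three operations. It assumes $B$ is a Python list of 0/1 integers and that $x \geq 1$; the lowercase function names are only for illustration, and each call scans the vector, so this is a specification rather than an efficient structure.

```python
def access(B, i):
    """Return the bit at 1-based position i."""
    return B[i - 1]


def rank(B, i):
    """Return the number of 1s in B[1, i] (1-based, inclusive)."""
    return sum(B[:i])


def select(B, x):
    """Return the 1-based position of the x-th 1 (x >= 1 assumed),
    or len(B) + 1 if B holds fewer than x ones."""
    count = 0
    for pos, bit in enumerate(B, start=1):
        if bit == 1:
            count += 1
            if count == x:
                return pos
    return len(B) + 1


B = [1, 0, 1, 1, 0, 1]
print(access(B, 3))  # -> 1
print(rank(B, 4))    # -> 3
print(select(B, 3))  # -> 4
print(select(B, 5))  # -> 7 (only four 1s, so |B| + 1)
```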

The book covers both the zero-order and the $k$-th order compression of $B$.

Zero-order Compression

Given a bit vector $B$, we divide it into blocks of size $b$, such that $B_i = B[(i-1)b + 1, ib]$. Every block $B_i$ is assigned a class $c_i$, which is the number of 1's in $B_i$.

Observe that a class $c_i = c$ can represent $\binom{b}{c}$ distinct blocks.

Example

Let $b = 4$. Then the class $c = 0$ represents only the block $\{0000\}$, while $c = 1$ represents the blocks $\{0001, 0010, 0100, 1000\}$.
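
The example can be reproduced with a small sketch. It assumes bit vectors are Python strings of '0'/'1' and that a short final block is padded with zeros (a choice made here for illustration, not something stated in these notes); both function names are hypothetical.

```python
from itertools import combinations
from math import comb


def blocks_of_class(b, c):
    """All b-bit strings with exactly c ones, in increasing order."""
    out = []
    for ones in combinations(range(b), c):
        bits = ["0"] * b
        for p in ones:
            bits[p] = "1"
        out.append("".join(bits))
    return sorted(out)


def block_classes(B, b):
    """Split B into blocks of b bits and return the class of each block.
    Assumption: the last block is zero-padded to length b."""
    padded = B + "0" * (-len(B) % b)
    return [padded[i:i + b].count("1") for i in range(0, len(padded), b)]


print(blocks_of_class(4, 0))            # -> ['0000']
print(blocks_of_class(4, 1))            # -> ['0001', '0010', '0100', '1000']
print(len(blocks_of_class(4, 2)) == comb(4, 2))  # -> True
print(block_classes("101100000100", 4))  # -> [3, 0, 1]
```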

Notice that the variable $c_i$ alone is not enough to describe $B_i$, so an offset variable $o_i$ is used to identify which of the possible blocks of class $c_i$ is $B_i$.

Assigning Offsets

Given that we have $\binom{b}{c}$ possible blocks to assign offsets to, observe that:

There are $\binom{b-1}{c}$ blocks that start with 0.
And $\binom{b-1}{c-1}$ blocks that start with 1.
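
A brute-force count can confirm these two numbers for small blocks. The sketch below (the function name `count_by_first_bit` is just for illustration, and $b \geq 1$ is assumed) enumerates every $b$-bit block with $c$ 1's and tallies them by their first bit.

```python
from itertools import product
from math import comb


def count_by_first_bit(b, c):
    """Count the b-bit blocks with c ones, split by their first bit.
    Returns (blocks starting with 0, blocks starting with 1)."""
    zeros = ones = 0
    for bits in product("01", repeat=b):
        if bits.count("1") == c:
            if bits[0] == "0":
                zeros += 1
            else:
                ones += 1
    return zeros, ones


print(count_by_first_bit(4, 3))      # -> (1, 3)
print((comb(3, 3), comb(3, 2)))      # -> (1, 3)
```

The two counts add up to $\binom{b}{c}$, which is Pascal's rule.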

Using this, the book describes an algorithm to assign an offset $o$ to a block $B$ of class $c$:

Algorithm 1

Encoding offsets:

Set $o = 0$

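For each bit $B' = B[i]$, scanned left to right, if:

$B' = 0$, then set $b = b - 1$ and keep $c$ and $o$ unchanged, continuing with $B = B[i+1..n]$; that is, skip over the 0's (only the remaining length shrinks).

$B' = 1$, then set $o = o + \binom{b-1}{c}$ (using the current $b$ and $c$), and then set $b = b - 1, c = c - 1, B = B[i+1..n]$.

Note that $\binom{x}{y} = 0$ when $x < y$.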

Example - encoding

$B = 1011$, so we start with $b = 4, c = 3, o = 0$

$B' = 1$; so $o = o + \binom{3}{3} = 0 + 1 = 1$; and set $b = 3, c = 2$

$B' = 0$; so $o = o$; and set $b = 2, c = 2$

$B' = 1$; so $o = o + \binom{1}{2} = 1 + 0 = 1$; and set $b = 1, c = 1$

$B' = 1$; so $o = o + \binom{0}{1} = 1 + 0 = 1$; and set $b = 0, c = 0$

So $B = 1011$ is represented by the pair $(c, o) = (3, 1)$
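
The encoding walk can be sketched in a few lines. This is a minimal illustration, assuming a block is given as a Python string of '0'/'1'; the name `encode_offset` is hypothetical, and `math.comb` is used in place of a lookup table because it already returns 0 when the lower argument exceeds the upper one.

```python
from math import comb


def encode_offset(block):
    """Return (class, offset) for a block given as a '0'/'1' string."""
    b = len(block)          # bits still to be scanned
    c = block.count("1")    # 1s still to be scanned
    cls = c                 # the class is fixed up front
    o = 0
    for bit in block:
        if bit == "1":
            # Skip every block that has a 0 here: comb(b - 1, c) of them.
            o += comb(b - 1, c)
            c -= 1
        b -= 1
    return cls, o


print(encode_offset("1011"))  # -> (3, 1)
print(encode_offset("0111"))  # -> (3, 0)
print(encode_offset("1110"))  # -> (3, 3)
```

Within one class the offsets follow the increasing order of the blocks, so the four blocks of class 3 get offsets 0 to 3.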

Algorithm 2

Decoding offsets:

The operation is the inverse of encoding, so given $b, c, o$:

If $o < \binom{b-1}{c}$ then $B'$ is a 0, and we set $b = b - 1$, keeping $c$ and $o$ unchanged.

Otherwise $B' = 1$, and we set $o = o - \binom{b-1}{c}$, then $b = b - 1, c = c - 1$.

Example - decoding

Given $b = 4, c = 3, o = 1$

$o = 1 \geq \binom{3}{3} = 1$; so $B' = 1$, giving $b = 3, c = 2, o = 0$

$o = 0 < \binom{2}{2} = 1$; so $B' = 0$, giving $b = 2, c = 2, o = 0$

$o = 0 \geq \binom{1}{2} = 0$; so $B' = 1$, giving $b = 1, c = 1, o = 0$

$o = 0 \geq \binom{0}{1} = 0$; so $B' = 1$, giving $b = 0, c = 0$

Finally if $c = 0$ we can set the remaining $b$ bits to 0, and $b = 0$ is the exit condition.
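
The decoding loop, including the $c = 0$ shortcut and the $b = 0$ exit condition, can be sketched as follows. The function name `decode_offset` and the string output are assumptions made for illustration; `math.comb` stands in for a table lookup.

```python
from math import comb


def decode_offset(b, c, o):
    """Rebuild the b-bit block of class c with offset o as a '0'/'1' string."""
    bits = []
    while b > 0:
        if c == 0:
            # No 1s left: the remaining b bits are all 0.
            bits.append("0" * b)
            break
        zero_first = comb(b - 1, c)   # blocks that have a 0 at this position
        if o < zero_first:
            bits.append("0")          # c and o stay unchanged
        else:
            bits.append("1")
            o -= zero_first
            c -= 1
        b -= 1
    return "".join(bits)


print(decode_offset(4, 3, 1))  # -> 1011
print([decode_offset(4, 2, o) for o in range(comb(4, 2))])
# -> ['0011', '0101', '0110', '1001', '1010', '1100']
```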

Structures:

The compressed version of a bitvector $B$ includes 3 structures:

Class array

An array $C[1, \lceil n/b \rceil]$ whose elements each take $\lceil \log_2(b+1) \rceil$ bits, since a class is a value in $[0, b]$.

So its size is: $\lceil n/b \rceil \times \log_2(b+1)$

We then use the identity $\log(b+1) = \log(b) + \log(1 + \frac{1}{b})$ (it is exact, since $b + 1 = b(1 + \frac{1}{b})$) to get: $\lceil n/b \rceil \times \log_2(b+1) = \lceil n/b \rceil \times \log_2(b) + \lceil n/b \rceil \times \log_2(1 + \frac{1}{b})$

Some Big O abuse allows us to ignore the ceilings, since each one adds at most an $O(1)$ term, and to absorb the $\log_2(1 + \frac{1}{b})$ factor, since it is at most $1$ for any $b \geq 1$ and so contributes only $O(n/b)$ in total. Thus: $\lceil n/b \rceil \times \log_2(b+1) = (n/b) \times \log_2(b) + O(n/b)$
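
To get a feel for how small the dropped terms are, the sketch below compares the exact class array size (with both ceilings kept) against the leading term $(n/b) \log_2 b$. The values of $n$ and $b$ are arbitrary picks for illustration, and the function names are hypothetical.

```python
from math import ceil, log2


def class_array_bits(n, b):
    """Exact size in bits: ceil(n/b) classes of ceil(log2(b+1)) bits each."""
    num_blocks = -(-n // b)            # integer ceiling of n / b
    return num_blocks * ceil(log2(b + 1))


def leading_term(n, b):
    """The (n/b) * log2(b) part of the bound."""
    return (n / b) * log2(b)


n, b = 1_000_000, 15
exact = class_array_bits(n, b)
print(exact)                              # -> 266668
# Everything beyond the leading term is the part hidden in O(n/b).
print((exact - leading_term(n, b)) / (n / b))
```

For these values the overhead is a small fraction of $n/b$, consistent with the $O(n/b)$ term.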

Offset array

Similar to the class array, we have $\lceil n/b \rceil$ elements, but element $i$ has size $|o_i| = \lceil \log_2 \binom{b}{c_i} \rceil$ bits, which depends on its class.

That gives the total size: $\sum_{i=1}^{\lceil n/b \rceil} \lceil \log_2 \binom{b}{c_i} \rceil$

Removing each ceiling costs less than 1 bit per term, so over the whole sum we get at most: $\lceil n/b \rceil + \sum_{i=1}^{\lceil n/b \rceil} \log_2 \binom{b}{c_i}$

Lemma 1

We know that $\log(x) + \log(y) = \log(xy)$, so we can pull the log out of a sum by replacing a summation of logs with the log of a product: $\sum_{i=1}^{k} \log x_i = \log \prod_{i=1}^{k} x_i$

Using Lemma 1 we get: $\lceil n/b \rceil + \log_2 \prod_{i=1}^{\lceil n/b \rceil} \binom{b}{c_i}$

Lemma 2

Given some values $b, c, c'$, the book states that $\binom{b}{c} \cdot \binom{b}{c'} \leq \binom{2b}{c + c'}$

Proof (non-formal)

Take a bit vector of size $2b$ and treat it as two separate vectors of size $b$ placed beside each other. The left half can hold any of the $\binom{b}{c}$ arrangements of $c$ 1's, and the right half any of the $\binom{b}{c'}$ arrangements of $c'$ 1's (w.l.o.g., it does not matter which side is used for $c$ and which for $c'$).

Every such pair of halves is a distinct $2b$ vector with exactly $c + c'$ 1's, so there are $\binom{b}{c} \cdot \binom{b}{c'}$ of them. If we remove the restriction that exactly $c$ of the 1's fall in the left half and instead treat the $2b$ vector as one continuous block, the 1's can be split between the halves in any way, so it can represent at least as many arrangements. Thus

$\binom{b}{c} \cdot \binom{b}{c'} \leq \binom{2b}{c + c'}$
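
As a sanity check on the inequality (not a proof), the sketch below tests it exhaustively for small block sizes. The function name is hypothetical. The inequality also follows from Vandermonde's identity, $\binom{2b}{k} = \sum_j \binom{b}{j} \binom{b}{k-j}$, in which the product on the left of the lemma is just one term of the sum.

```python
from math import comb


def lemma2_holds(b):
    """Check comb(b, c) * comb(b, c2) <= comb(2b, c + c2) for all 0 <= c, c2 <= b."""
    return all(
        comb(b, c) * comb(b, c2) <= comb(2 * b, c + c2)
        for c in range(b + 1)
        for c2 in range(b + 1)
    )


print(all(lemma2_holds(b) for b in range(1, 13)))  # -> True
# One concrete instance with b = 4, c = 1, c' = 2: 4 * 6 = 24 <= C(8, 3) = 56
print(comb(4, 1) * comb(4, 2), comb(8, 3))         # -> 24 56
```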

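Now, let $m$ be the total number of 1's in $B$ (so it is $c_1 + c_2 + \dots$) and $n$ the total number of bits in $B$ (so it is $b + b + \dots$ over all blocks). Applying Lemma 2 repeatedly to merge the blocks (the same argument works when the two sides have different sizes), the bound becomes: $\lceil n/b \rceil + \log_2 \binom{n}{m}$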

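FINALLY, recall from the entropy chapter that the worst-case entropy of a universe $U$ is $\log_2 |U|$. Our universe here is the set of bit vectors of length $n$ with $m$ 1's, so $|U| = \binom{n}{m}$, and $\log_2 \binom{n}{m} \leq n H_0(B)$, where $H_0$ is the zero-order empirical entropy. So the size of the offset array $O$ is bounded by: $|O| \leq n H_0(B) + \lceil n/b \rceil$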
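
The bound can be checked numerically on a toy input. The sketch below assumes $B$ is a '0'/'1' string and uses the standard definition $n H_0(B) = m \log_2 \frac{n}{m} + (n - m) \log_2 \frac{n}{n - m}$; the function names and the sample vector are made up for illustration.

```python
from math import comb, log2


def offset_array_bits(B, b):
    """Sum of ceil(log2 C(b, c_i)) over the blocks of B."""
    total = 0
    for i in range(0, len(B), b):
        block = B[i:i + b]
        ways = comb(len(block), block.count("1"))
        total += (ways - 1).bit_length()   # exact ceil(log2(ways)) for ways >= 1
    return total


def n_h0(B):
    """n * H_0(B): the zero-order empirical entropy of B times its length."""
    n, m = len(B), B.count("1")
    if m == 0 or m == n:
        return 0.0
    return m * log2(n / m) + (n - m) * log2(n / (n - m))


B = "1011" + "0000" + "0100" + "0001"      # n = 16, m = 5
print(offset_array_bits(B, 4))             # -> 6
print(offset_array_bits(B, 4) <= n_h0(B) + 4)   # -> True
print(log2(comb(16, 5)) <= n_h0(B))        # -> True
```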

Lookup Table

We also need a lookup table $K$ holding all the binomial values $\binom{i}{j}$ for $0 \leq j \leq i \leq b$, which comes to roughly $b^2$ entries, each stored in $w$ bits. (I remember the C++ creator talking about triangular matrices; only the triangle $j \leq i$ is needed, so a $b^2$ upper bound is not really tight, but it's Big O so who cares.)
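
One simple way to fill such a table is Pascal's rule, $\binom{i}{j} = \binom{i-1}{j-1} + \binom{i-1}{j}$. The sketch below is an assumption about how $K$ could be built, not the book's construction; it stores the full square for simplicity, with zeros above the diagonal, which also covers the $\binom{x}{y} = 0$ for $x < y$ convention.

```python
def build_binomial_table(b):
    """Return K with K[i][j] = C(i, j) for 0 <= i, j <= b (0 when j > i)."""
    K = [[0] * (b + 1) for _ in range(b + 1)]
    for i in range(b + 1):
        K[i][0] = 1
        for j in range(1, i + 1):
            # Pascal's rule; K[i-1][j] is 0 when j == i.
            K[i][j] = K[i - 1][j - 1] + K[i - 1][j]
    return K


K = build_binomial_table(4)
print(K[4])      # -> [1, 4, 6, 4, 1]
print(K[3][3])   # -> 1
print(K[1][2])   # -> 0
```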

Bringing the structures together

We have $C, O, K$ with a combined size of $((n/b) \cdot \log_2(b) + O(n/b)) + (n H_0(B) + \lceil n/b \rceil) + O(w b^2)$

Which with more Big O magic gives the bound: $n H_0(B) + (n/b) \log_2(b) + O(n/b + w b^2)$
