This project contains simple, easy-to-use tools for studying strings whose characters are drawn from the byte alphabet. It currently consists of
- analytic tools
  - `lyndonfactorization`: counts the number of Lyndon factors. Outputs all ending positions of the Lyndon factors when the environment variable `RUST_LOG=debug` is set
  - `mus`: computes all minimal unique substrings
  - `count_r`: counts the number of runs in the BWT obtained by the suffix array
  - `count_sigma`: counts the number of different characters
  - `count_z`: counts the number of overlapping LZ77 factors
  - `count_lexparse`: counts the number of lexparse factors
  - `entropy`: computes the k-th order empirical entropy
  - `entropykmer`: computes the k-th order empirical entropy via k-mers, for k in [1..7]
- word generators with program `word`
  - `enumerate`: enumerates all strings of a specific length and alphabet size, starting with character `a`
  - `thuemorse`: computes the n-th Thue-Morse word
  - `fibonacci`: computes the n-th Fibonacci word
  - `perioddoubling`: computes the n-th period-doubling sequence
  - `debruijn`: computes a binary de Bruijn word of order n
  - `FibonacciLyndonFactor`: computes the n-th Lyndon factor of the infinite Fibonacci word
- `lyndonwords` to generate all Lyndon words up to a specific length for a given alphabet size
- transforms
  - `reverse`: reverses the input byte-wise
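
To illustrate what the Lyndon factor count and the ending positions mean, here is a minimal sketch of Duval's algorithm, a standard linear-time method for the Lyndon factorization. It is not taken from this project; the function name `lyndon_factor_ends` is hypothetical, and the positions it returns are exclusive end offsets, which may differ from the convention the `lyndonfactorization` tool uses in its debug output.

```rust
/// Hypothetical helper (not part of this project): returns the exclusive
/// end positions of the Lyndon factors of `text`, using Duval's algorithm.
fn lyndon_factor_ends(text: &[u8]) -> Vec<usize> {
    let n = text.len();
    let mut ends = Vec::new();
    let mut i = 0; // start of the part that is not yet factorized
    while i < n {
        let mut j = i + 1; // next position to compare
        let mut k = i; // position inside the current candidate period
        while j < n && text[k] <= text[j] {
            if text[k] < text[j] {
                // the candidate grows into one longer Lyndon word
                k = i;
            } else {
                // the candidate keeps repeating with period j - k
                k += 1;
            }
            j += 1;
        }
        // emit every full repetition of the Lyndon word of length j - k
        while i <= k {
            i += j - k;
            ends.push(i);
        }
    }
    ends
}
```

For the input `banana`, the factors are `b`, `an`, `an`, `a`, so the sketch reports four factors; the number of factors is simply the length of the returned vector.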
Compile and run with Cargo, the Rust build tool:
cargo build
cargo run --bin count_sigma -- --file ./data/tudocomp/einstein.en.txt
To compute the 5th Fibonacci word:
cargo run --bin word -- -n fibonacci -k 5
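
For readers unfamiliar with Fibonacci words, the following sketch builds one by repeated concatenation. It assumes the convention F_1 = `b`, F_2 = `a`, and F_n = F_{n-1} F_{n-2}; the indexing and the starting letters used by the `word` program may differ, and the function name `fibonacci_word` is hypothetical.

```rust
/// Hypothetical helper (not part of this project): builds the n-th
/// Fibonacci word under the assumed convention F_1 = "b", F_2 = "a",
/// F_n = F_{n-1} F_{n-2}. Values of n below 1 are treated like n = 1.
fn fibonacci_word(n: usize) -> Vec<u8> {
    let mut prev = b"b".to_vec(); // F_1
    let mut cur = b"a".to_vec(); // F_2
    if n <= 1 {
        return prev;
    }
    for _ in 2..n {
        // F_{i+1} = F_i followed by F_{i-1}
        let mut next = cur.clone();
        next.extend_from_slice(&prev);
        prev = cur;
        cur = next;
    }
    cur
}
```

Under this convention, the lengths of the words follow the Fibonacci numbers 1, 1, 2, 3, 5, and so on.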
Datasets can be found at http://dolomit.cs.tu-dortmund.de/tudocomp/
The output format of the analytic tools is compatible with sqlplot.
- The BWT computation requires that the zero byte does not occur in your input. To enforce that, you can use the `escape` program to escape all zero bytes. For instance, `escape -f 0 -t 255 -e 254 | count_r` escapes 0 with the escape byte 254 to byte 255, and pipes the result to `count_r`, which counts the number of BWT runs.
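
One way such a byte escaping can work is sketched below. This is an assumption for illustration, not the documented behavior of the `escape` program: each occurrence of the byte `from` is written as the pair `esc`, `to`, and each literal `esc` byte is doubled so that the result stays decodable. The function name `escape_bytes` and its parameters are hypothetical.

```rust
/// Hypothetical helper (not part of this project): one possible escaping
/// scheme. Every `from` byte becomes the pair (`esc`, `to`), and every
/// literal `esc` byte becomes the pair (`esc`, `esc`). All other bytes
/// are copied unchanged. Assumes `from`, `to`, and `esc` are distinct.
fn escape_bytes(input: &[u8], from: u8, to: u8, esc: u8) -> Vec<u8> {
    let mut out = Vec::with_capacity(input.len());
    for &b in input {
        if b == from {
            out.push(esc);
            out.push(to);
        } else if b == esc {
            out.push(esc);
            out.push(esc);
        } else {
            out.push(b);
        }
    }
    out
}
```

With `from = 0`, `to = 255`, and `esc = 254`, the output of this sketch never contains a zero byte, which is the property the BWT computation needs.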