Markov-Chain
Markov-chains store the letter frequency distribution for each \(n\)-length prefix, and are able to use this to generate randomized pseudo-words with a similar letter distribution as the wordlist they were generated from.
Frequency table
A frequency table stores the frequency (occurences) of each letter. For example:
Character | Count |
---|---|
a | 6 |
b | 3 |
c | 1 |
d | 9 |
... | ... |
Generating a random character from such a table is implemented using Weighted Generation.
Prefix frequency table
Data Structures
For example, a word such as discombobulated would be split up into character tokens like this:
\0 \0 d i s c o m b o b u l a t e d \0
There are three considerations to make for the in-memory representation of a markov-chain:
- Time to build (insertion time)
- Memory usage
- Time to generate it (access time).
In terms of data structures, implementing a Markov-chain requires two primary data structures: a
data structure to represent character frequency (such as a map from character
to occurences
), and
a data structure to map prefixes to this character frequency data. In Rust terms, such a table
could be represented as such:
// Stores frequency per character
type Frequency = Map<char, usize>;
// Prefix of length two
type Prefix = [char; 2];
// Stores frequency for every prefix
type Markov = Map<Prefix, Frequency>;
There are different built-in data structures in Rust which can be used for this mapping. For this reason, both the frequency table and the prefix table were made into traits, and several different implementations were tested to evaluated for their memory usage and insertion speed.
The implementations that were tested are:
- Rust's built-in
HashMap
- Rust's built-in
BTreeMap
- Rust's built-in
Vec
- The
hashbrown
crateHashMap
Table creation
The memory usage of the different options were very similar:
However, a larger difference can be seen in the insertion time.