Markov-Chain

Markov chains store the letter frequency distribution for each prefix of length \(n\), and can use this to generate randomized pseudo-words with a letter distribution similar to that of the wordlist they were built from.

Frequency table

A frequency table stores the frequency (number of occurrences) of each letter. For example:

Character   Count
a           6
b           3
c           1
d           9
...         ...

Generating a random character from such a table is implemented using Weighted Generation.
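
The idea is to pick a character with probability proportional to its count. Below is a minimal sketch of this, assuming a HashMap<char, usize> frequency table and the rand crate's 0.8-style gen_range API; it is only an illustration, not necessarily how the Weighted Generation section implements it.

use std::collections::HashMap;

use rand::Rng;

// A sketch of weighted selection: draw a random index into the total count,
// then walk the table until the running sum passes it.
fn weighted_pick(freq: &HashMap<char, usize>, rng: &mut impl Rng) -> Option<char> {
    let total: usize = freq.values().sum();
    if total == 0 {
        return None;
    }
    let mut target = rng.gen_range(0..total);
    for (&c, &count) in freq {
        if target < count {
            return Some(c);
        }
        target -= count;
    }
    None // unreachable when the counts are consistent
}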

Prefix frequency table

A prefix frequency table maps each prefix of \(n\) characters to a frequency table of the characters that can follow it. To also capture word boundaries, each word is padded with \(n\) null characters (\0) at the start and a single \0 at the end.

For example, with a prefix length of two, a word such as discombobulated would be split up into character tokens like this:

\0 \0 d i s c o m b o b u l a t e d \0
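
A minimal sketch of this tokenization step, assuming a prefix length of two (the function name tokenize is illustrative, not taken from the actual implementation):

// Pad a word with sentinels and walk over it with a two-character window,
// yielding (prefix, next character) pairs.
fn tokenize(word: &str) -> Vec<([char; 2], char)> {
    // Two leading sentinels and one trailing sentinel.
    let tokens: Vec<char> = ['\0', '\0']
        .into_iter()
        .chain(word.chars())
        .chain(['\0'])
        .collect();

    // Each window of three tokens gives a prefix and the character that follows it.
    tokens
        .windows(3)
        .map(|w| ([w[0], w[1]], w[2]))
        .collect()
}

fn main() {
    for (prefix, next) in tokenize("discombobulated") {
        println!("{:?} -> {:?}", prefix, next);
    }
}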

Data Structures

There are three considerations to make for the in-memory representation of a Markov chain:

  • Time to build (insertion time)
  • Memory usage
  • Time to generate words from it (access time)

Implementing a Markov chain requires two primary data structures: one to represent character frequency (such as a map from character to occurrence count), and one to map each prefix to its character frequency data. In Rust terms, such a table could be represented as follows:

// Using the standard library HashMap here as an example; any map-like
// structure with the same shape would work.
use std::collections::HashMap;

// Stores the frequency (occurrence count) per character
type Frequency = HashMap<char, usize>;

// Prefix of length two
type Prefix = [char; 2];

// Stores a character frequency table for every prefix
type Markov = HashMap<Prefix, Frequency>;

There are several built-in data structures in Rust that can be used for this mapping. For this reason, both the frequency table and the prefix table were made into traits, and several different implementations were evaluated for their memory usage and insertion speed.
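
A rough sketch of what such an abstraction could look like; the trait and method names here are hypothetical and only meant to illustrate the idea:

use std::collections::HashMap;

// Hypothetical trait and method names; the actual ones may differ.
trait FrequencyTable {
    // Record one occurrence of a character.
    fn record(&mut self, c: char);
}

trait PrefixTable {
    type Freq: FrequencyTable;

    // Get (or create) the frequency table stored for a prefix.
    fn freq_for(&mut self, prefix: [char; 2]) -> &mut Self::Freq;
}

impl FrequencyTable for HashMap<char, usize> {
    fn record(&mut self, c: char) {
        *self.entry(c).or_insert(0) += 1;
    }
}

impl PrefixTable for HashMap<[char; 2], HashMap<char, usize>> {
    type Freq = HashMap<char, usize>;

    fn freq_for(&mut self, prefix: [char; 2]) -> &mut Self::Freq {
        self.entry(prefix).or_default()
    }
}

The same traits can then be implemented for the other candidates listed below, so that they can be benchmarked interchangeably.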

The implementations that were tested are:

  • Rust's built-in HashMap
  • Rust's built-in BTreeMap
  • Rust's built-in Vec
  • The HashMap from the hashbrown crate

Table creation

The memory usage of the different options was very similar.

However, a larger difference can be seen in the insertion time.
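
For illustration, insertion time can be measured along these lines; this is only a sketch using std::time::Instant with a placeholder wordlist, not the actual benchmark setup.

use std::collections::HashMap;
use std::time::Instant;

fn main() {
    // Placeholder wordlist; a real benchmark would read an actual corpus
    // and typically use a benchmarking harness.
    let words = ["discombobulated", "markov", "chain"];

    let start = Instant::now();
    let mut table: HashMap<[char; 2], HashMap<char, usize>> = HashMap::new();
    for word in words {
        // Pad with two leading sentinels and one trailing sentinel,
        // then record each (prefix, next character) pair.
        let tokens: Vec<char> = ['\0', '\0']
            .into_iter()
            .chain(word.chars())
            .chain(['\0'])
            .collect();
        for w in tokens.windows(3) {
            *table.entry([w[0], w[1]]).or_default().entry(w[2]).or_insert(0) += 1;
        }
    }
    println!(
        "built a table with {} prefixes in {:?}",
        table.len(),
        start.elapsed()
    );
}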

Generation