doc: add full algorithm reference (MD5, SHA-256, Whirlpool)

2026-05-06 10:39:32 +02:00 · 2026-05-06 10:39:32 +02:00 · 3e36cb4906
commit 3e36cb4906
parent 3289a9191d
9 changed files with 488 additions and 35 deletions
--- a/doc/Makefile.am
+++ b/doc/Makefile.am
@ -1,11 +1,18 @@
-EXTRA_DIST = libft_ssl.tex
+EXTRA_DIST = \
 	libft_ssl.tex \
 	preliminaries.tex \
 	introduction.tex \
 	generic_interface.tex \
 	md5.tex \
 	sha256.tex \
 	whirlpool.tex
 if ENABLE_DOC
 pdf: libft_ssl.pdf
 libft_ssl.pdf: libft_ssl.tex
-	$(PDFLATEX) $<
+	TEXINPUTS=$(srcdir): $(PDFLATEX) $<
-	$(PDFLATEX) $<
+	TEXINPUTS=$(srcdir): $(PDFLATEX) $<
 endif
 clean-local:
--- a/doc/generic_interface.tex
+++ b/doc/generic_interface.tex
@ -0,0 +1,53 @@
 \section{Generic Digest Interface}
 All hash algorithms in \texttt{libft\_ssl} are exposed through a single,
 uniform interface built around the \texttt{struct digest\_algo} type. This
 structure holds the algorithm's metadata (its name, digest size and block size)
 along with three function pointers: \texttt{init}, \texttt{update} and
 \texttt{final}. This design allows any algorithm to be driven through the same
 calling convention without the caller needing to know which one is in use. The
 associated context is held in a \texttt{union digest\_ctx}, which overlays the
 per-algorithm state structures so that a single allocation covers all supported
 algorithms.
 \vspace{1em}
 \texttt{struct digest\_algo} describes a hash algorithm as a set of metadata
 and three function pointers. The \texttt{name} field identifies the algorithm.
 The \texttt{digest\_size} and \texttt{block\_size} fields express its output
 length and internal block size in bytes. The three function pointers
 \texttt{init}, \texttt{update} and \texttt{final} define the algorithm's
 lifecycle: \texttt{init} sets the context to its initial state, \texttt{update}
 feeds an arbitrary amount of data into it, and \texttt{final} produces the
 digest and resets the context. All three operate on a \texttt{void~*} context
 pointer, which allows the interface to remain algorithm-agnostic.
 \vspace{1em}
 The \texttt{union digest\_ctx} type provides a single allocation large enough
 to hold the context of any supported algorithm. Because only one algorithm is
 active at a time, overlaying the per-algorithm structures in a union avoids the
 overhead of a separate heap allocation while keeping the calling code uniform.
 The active member is always the one matching the \texttt{struct digest\_algo}
 being used.
 \vspace{1em}
 Each supported algorithm is registered in \texttt{digest\_algos.h} through an
 X-macro list. This file defines a single macro \texttt{DIGEST\_ALGOS(X)} that
 expands \texttt{X} once per algorithm, passing its name, digest size and block
 size. Consuming this list with a different definition of \texttt{X} generates
 the corresponding code or data without repetition --- the global \texttt{struct
 digest\_algo} instances in \texttt{libft\_ssl.c} are produced this way. Adding
 a new algorithm to the library reduces to adding one line to this list.
 \vspace{1em}
 All three algorithms follow the Merkle-Damgård construction. The message is
 split into fixed-size blocks and processed sequentially. After each block, the
 compressed output is combined with the previous state to produce the new state
 --- this chaining ensures that the final digest depends on every bit of the
 input. The exact combination operation is algorithm-specific: MD5 and SHA-256
 use an additive feedforward, while Whirlpool uses the Miyaguchi-Preneel scheme.
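 \vspace{1em}
 The chaining described above can be sketched in a few lines of Python. This is
 a toy illustration, not code from \texttt{libft\_ssl}: the names
 \texttt{toy\_compress} and \texttt{toy\_merkle\_damgard}, the 8-byte block
 size, the initial value and the simplified padding are all assumptions chosen
 to keep the example short, and the compression function is not secure.

```python
MASK32 = 0xFFFFFFFF


def toy_compress(state: int, block: bytes) -> int:
    """Toy 32-bit compression function (hypothetical, NOT cryptographic)."""
    mixed = state
    for byte in block:
        mixed = ((mixed * 31) ^ byte) & MASK32
    # Additive feedforward: combine the compressed output with the previous state.
    return (state + mixed) & MASK32


def toy_merkle_damgard(message: bytes, block_size: int = 8) -> int:
    """Pad the message, split it into blocks and chain the compression function."""
    bit_len = (8 * len(message)) & MASK32
    # Simplified padding (assumption): 0x80 marker, zero fill, 4-byte length.
    zeros = (-(len(message) + 5)) % block_size
    padded = message + b"\x80" + b"\x00" * zeros + bit_len.to_bytes(4, "big")

    state = 0x12345678  # arbitrary fixed initial value for this sketch
    for off in range(0, len(padded), block_size):
        state = toy_compress(state, padded[off:off + block_size])
    return state
```

 Because each block is folded into the state produced by the previous one,
 changing a single input byte changes the final value, which is the property
 the real algorithms obtain with much stronger compression functions.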
 \newpage
--- a/doc/introduction.tex
+++ b/doc/introduction.tex
@ -0,0 +1,19 @@
 \section{Introduction}
 \texttt{libft\_ssl} is a C library implementing cryptographic hash functions
 from scratch. A cryptographic hash function maps an arbitrary-length input to a
 fixed-size digest. This operation is deterministic and one-way: it is
 computationally infeasible to recover the original input from its digest.
 The library currently implements the following algorithms:
 \begin{itemize}
    \item \textbf{MD5} - produces a 128-bit digest.
    \item \textbf{SHA-256} - produces a 256-bit digest.
    \item \textbf{Whirlpool} - produces a 512-bit digest.
 \end{itemize}
 These functions are commonly used for data integrity verification, digital
 signatures, and \textbf{M}essage \textbf{A}uthentication \textbf{C}ode\textbf{s} (MACs).
 \newpage
--- a/doc/libft_ssl.tex
+++ b/doc/libft_ssl.tex
@ -6,7 +6,7 @@
 \usepackage{amssymb}
 \usepackage{listings}
 \usepackage{xcolor}
-\usepackage{hyperref}
+\usepackage[hidelinks]{hyperref}
 \usepackage{geometry}
 \geometry{margin=2.5cm}
@ -21,36 +21,11 @@
 \tableofcontents
 \newpage
-\section{Introduction}
+\input{preliminaries}
 \input{introduction}
 \input{generic_interface}
 \input{md5}
 \input{sha256}
 \input{whirlpool}
 \texttt{libft\_ssl} is a C library implementing cryptographic hash functions
 from scratch. A cryptographic hash function maps an arbitrary-length input to a
 fixed-size digest. This operation is deterministic and one-way: it is
 computationally infeasible to recover the original input from its digest.
 The library currently implements the following algorithms:
 \begin{itemize}
    \item \textbf{MD5} - produces a 128-bit digest.
    \item \textbf{SHA-256} - produces a 256-bit digest.
    \item \textbf{Whirlpool} - produces a 512-bit digest.
 \end{itemize}
 These functions are commonly used for data integrity verification, digital
 signatures, and message authentication codes (MACs).
 \newpage
 \section{Library core}
 \newpage
 \section{MD5}
 \newpage
 \section{SHA-256}
 \newpage
 \section{Whirlpool}
 \end{document}
--- a/doc/md5.tex
+++ b/doc/md5.tex
@ -0,0 +1,102 @@
 \section{MD5}
 MD5 (Message Digest Algorithm 5) was designed by Ronald Rivest in 1991 as a
 strengthened replacement for MD4. It produces a 128-bit digest from a message
 of arbitrary length, processing data in 512-bit blocks. Although MD5 is now
 considered cryptographically broken (practical collision attacks have been
 known since 2004), it remains widely used for non-security purposes such as
 checksums that detect accidental data corruption.
 \vspace{1em}
 MD5 maintains a state of four 32-bit words, conventionally named $A$, $B$, $C$
 and $D$, initialized to fixed constants defined in RFC 1321. Each 512-bit block
 is processed in four rounds of sixteen operations each, for a total of 64
 operations per block. Each operation applies one of four non-linear functions
 to the state words, adds a message word and a precomputed constant derived from
 the sine function, and rotates the result by a fixed amount.
 \vspace{1em}
 The state is initialized to the following fixed constants, as specified in RFC 1321:
 \begin{align*}
 A &= \texttt{0x67452301} \\
 B &= \texttt{0xefcdab89} \\
 C &= \texttt{0x98badcfe} \\
 D &= \texttt{0x10325476}
 \end{align*}
 \vspace{1em}
 Before processing, the message is padded to a length congruent to 448 bits
 modulo 512. A single \texttt{1} bit is appended first, followed by as many
 \texttt{0} bits as needed. The original message length in bits is then appended
 as a 64-bit little-endian integer, bringing the total padded length to an exact
 multiple of 512 bits.
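 \vspace{1em}
 As a worked example, the 3-byte message \texttt{abc} is 24 bits long. It is
 followed by the \texttt{1} bit, 423 \texttt{0} bits and the 64-bit length, so
 the padded message fills exactly one block:
 \begin{align*}
 24 + 1 + 423 + 64 = 512
 \end{align*}
 \noindent In hexadecimal, the block consists of the bytes
 \texttt{61 62 63 80}, then 52 zero bytes, then the length 24 in little-endian
 order, \texttt{18 00 00 00 00 00 00 00}.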
 \vspace{1em}
 Each of the four rounds uses a distinct non-linear function applied to the
 state words $B$, $C$ and $D$:
 \begin{align*}
 F(B, C, D) &= (B \land C) \lor (\lnot B \land D) \\
 G(B, C, D) &= (B \land D) \lor (C \land \lnot D) \\
 H(B, C, D) &= B \oplus C \oplus D \\
 I(B, C, D) &= C \oplus (B \lor \lnot D)
 \end{align*}
 The message word index used at step $i$ is not sequential: each round applies
 a distinct selector function $k_r$ where $r = \lfloor i / 16 \rfloor$:
 \begin{align*}
 k_0(i) &= i \bmod 16 \\
 k_1(i) &= (5i + 1) \bmod 16 \\
 k_2(i) &= (3i + 5) \bmod 16 \\
 k_3(i) &= 7i \bmod 16
 \end{align*}
 At each step $i$ (with $0 \leq i < 64$), one of the four functions is selected
 according to the current round, and the state is updated as follows:
 \begin{align*}
 A &\leftarrow B + \bigl((A + \phi(B, C, D) + M[k] + T[i]) \lll s[i]\bigr)
 \end{align*}
 \noindent where $\phi$ is the auxiliary function for the current round, $M[k]$
 with $k = k_r(i)$ is the selected 32-bit word of the current block (read in
 little-endian order), $T[i]$ is a precomputed constant, $s[i]$ is the rotation
 amount, $+$ denotes addition modulo $2^{32}$, and $\lll$ denotes a left
 rotation. After this operation, the state words are cycled:
 $(A, B, C, D) \leftarrow (D, A, B, C)$.
 \vspace{1em}
 Each round uses its own set of four rotation amounts $s[i]$, which repeats
 every four steps:
 \begin{align*}
 \text{Round 0} &: 7,\ 12,\ 17,\ 22 \\
 \text{Round 1} &: 5,\ 9,\ 14,\ 20 \\
 \text{Round 2} &: 4,\ 11,\ 16,\ 23 \\
 \text{Round 3} &: 6,\ 10,\ 15,\ 21
 \end{align*}
 \vspace{1em}
 The 64 constants $T[i]$ are derived from the sine function, with its argument
 taken in radians:
 \begin{align*}
 \forall i \in \mathbb{N},\ 0 \leq i < 64,\quad T[i] = \left\lfloor 2^{32}\,|\sin(i+1)| \right\rfloor
 \end{align*}
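 \vspace{1em}
 For instance, with the argument of the sine taken in radians, the first and
 last constants evaluate to:
 \begin{align*}
 T[0]  &= \left\lfloor 2^{32}\,|\sin(1)| \right\rfloor  = \left\lfloor 2^{32} \times 0.8414709\ldots \right\rfloor = \texttt{0xd76aa478} \\
 T[63] &= \left\lfloor 2^{32}\,|\sin(64)| \right\rfloor = \left\lfloor 2^{32} \times 0.9200260\ldots \right\rfloor = \texttt{0xeb86d391}
 \end{align*}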
 After each block is processed, the compressed state is added word-by-word to
 the state before compression:
 \begin{align*}
 (A, B, C, D) \leftarrow (A + A_0,\ B + B_0,\ C + C_0,\ D + D_0)
 \end{align*}
 \noindent where $A_0$, $B_0$, $C_0$, $D_0$ denote the state at the beginning
 of the block. After all blocks have been processed, the four state words are
 serialized in little-endian order to produce the 128-bit digest.
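 \vspace{1em}
 The steps of this section can be put together in a short Python sketch. It is
 meant as a readable cross-check of the formulas, not as the implementation
 used by \texttt{libft\_ssl}: the function name \texttt{md5\_sketch} and the
 whole-message (non-streaming) interface are assumptions made for brevity.

```python
import math
import struct

MASK = 0xFFFFFFFF
# Rotation amounts: one set of four per round, repeated four times.
S = [7, 12, 17, 22] * 4 + [5, 9, 14, 20] * 4 + [4, 11, 16, 23] * 4 + [6, 10, 15, 21] * 4
# Sine-derived constants T[i] = floor(2^32 * |sin(i + 1)|).
T = [int(2**32 * abs(math.sin(i + 1))) & MASK for i in range(64)]


def rotl32(x: int, n: int) -> int:
    """Rotate a 32-bit value left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK


def md5_sketch(message: bytes) -> bytes:
    # Padding: 0x80, zero bytes up to 56 mod 64, then the 64-bit LE bit length.
    bit_len = (8 * len(message)) & 0xFFFFFFFFFFFFFFFF
    zeros = (55 - len(message)) % 64
    padded = message + b"\x80" + b"\x00" * zeros + struct.pack("<Q", bit_len)

    a0, b0, c0, d0 = 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476
    for off in range(0, len(padded), 64):
        m = struct.unpack("<16I", padded[off:off + 64])  # 16 little-endian words
        a, b, c, d = a0, b0, c0, d0
        for i in range(64):
            if i < 16:
                f, k = (b & c) | (~b & d), i
            elif i < 32:
                f, k = (b & d) | (c & ~d), (5 * i + 1) % 16
            elif i < 48:
                f, k = b ^ c ^ d, (3 * i + 5) % 16
            else:
                f, k = c ^ (b | (~d & MASK)), (7 * i) % 16
            total = (a + f + m[k] + T[i]) & MASK
            # A <- B + (total <<< s[i]), then cycle (A, B, C, D) <- (D, A, B, C).
            a, b, c, d = d, (b + rotl32(total, S[i])) & MASK, b, c
        # Feedforward: add the compressed state to the state before the block.
        a0, b0 = (a0 + a) & MASK, (b0 + b) & MASK
        c0, d0 = (c0 + c) & MASK, (d0 + d) & MASK
    return struct.pack("<4I", a0, b0, c0, d0)  # little-endian serialization
```

 Calling \texttt{md5\_sketch(b"abc").hex()} should reproduce the corresponding
 entry of the RFC 1321 test suite.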
 \newpage
--- a/doc/md5_init_T.py
+++ b/doc/md5_init_T.py
@ -0,0 +1,15 @@
 import math
 def md5_T():
    T = []
    for i in range(64):
        val = int(math.floor((2**32) * abs(math.sin(i + 1))))
        T.append(val & 0xFFFFFFFF)
    return T
 if __name__ == "__main__":
    # Print the 64 constants as C initializer entries, one per line.
    for v in md5_T():
        print(f"0x{v:08x}, ")
--- a/doc/preliminaries.tex
+++ b/doc/preliminaries.tex
@ -0,0 +1,66 @@
 \section{Preliminaries}
 This section defines the terminology used throughout the document. The concepts
 introduced here are general to cryptographic hash functions and apply to all
 algorithms described in subsequent sections.
 \vspace{1em}
 A \textbf{bit} is the smallest unit of information, taking a value of either 0
 or 1. A \textbf{byte} is a group of eight bits, and is the standard unit of
 data storage and transmission. A \textbf{word} is a fixed-size integer used
 internally by a hash algorithm --- MD5 and SHA-256 operate on 32-bit words,
 while Whirlpool operates on 64-bit words.
 \vspace{1em}
 \textbf{Endianness} refers to the byte order used when storing a multi-byte
 integer in memory. In \textbf{little-endian} order, the least significant byte
 is stored first; in \textbf{big-endian} order, the most significant byte is
 stored first. This distinction matters both when message bytes are grouped into
 words and when the internal state is serialized to produce the final digest:
 MD5 uses little-endian, while SHA-256 and Whirlpool use big-endian.
 \vspace{1em}
 A \textbf{message} is the arbitrary-length input fed to a hash function. The
 \textbf{digest} is the fixed-size output it produces. A hash function is said
 to be \textbf{one-way} if it is computationally infeasible to recover any input
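 that produces a given digest. A \textbf{collision} occurs when two distinct
 messages produce the same digest; a hash function with an $n$-bit digest is
 considered broken when collisions can be found with substantially fewer than
 the roughly $2^{n/2}$ evaluations required by a generic birthday search.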
 \vspace{1em}
 Hash functions process their input in fixed-size chunks called \textbf{blocks}.
 \textbf{Padding} is always appended to the message, even when its length is
 already a multiple of the block size, so that the padded length is a whole
 number of blocks and encodes the original message length. The \textbf{state}
 is a set of words initialized to fixed constants and
 updated after each block; it accumulates the result of the computation and is
 serialized into the digest at the end. The \textbf{compression function} is the
 core transformation applied to each block --- it takes the current state and
 one block of data, and produces a new state.
 \vspace{1em}
 The \textbf{Miyaguchi-Preneel} construction is a way to build a compression
 function from a block cipher $E$. Given a current state $H$ and a message
 block $M$, it produces a new state as:
 \begin{align*}
 H \leftarrow E(H,\ M) \oplus M \oplus H
 \end{align*}
 \noindent where $E(H, M)$ denotes the encryption of $M$ using $H$ as the key.
 A block cipher is invertible by anyone who knows the key, so $E$ alone would
 not be one-way; XORing its output with both $M$ and $H$ removes this direct
 inversion.
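 \vspace{1em}
 The update rule can be written as a one-line function. In the sketch below the
 state and the block are modelled as 64-bit integers, and \texttt{toy\_encrypt}
 is a hypothetical stand-in for the block cipher $E$; it is invertible but
 offers no security and only serves to make the example runnable.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def toy_encrypt(key: int, block: int) -> int:
    """Toy invertible 64-bit 'cipher' (hypothetical, NOT secure)."""
    mixed = (block + key) & MASK64                   # key mixing
    return ((mixed << 13) | (mixed >> 51)) & MASK64  # rotate left by 13 bits


def miyaguchi_preneel(state: int, block: int, encrypt=toy_encrypt) -> int:
    """One Miyaguchi-Preneel step: H <- E(H, M) xor M xor H."""
    return encrypt(state, block) ^ block ^ state
```

 With these definitions, \texttt{miyaguchi\_preneel(5, 1)} encrypts the block
 1 under the key 5 and then XORs the result with both inputs.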
 \vspace{1em}
 The \textbf{wide-pipe} construction is a variant of Merkle-Damgård where the
 internal state is wider than the final digest. This makes collision attacks
 harder: an attacker targeting the output must first find a collision in the
 larger internal state, which requires significantly more work than attacking
 the digest directly.
 \newpage
--- a/doc/sha256.tex
+++ b/doc/sha256.tex
@ -0,0 +1,110 @@
 \section{SHA-256}
 SHA-256 is part of the SHA-2 family of cryptographic hash functions, designed
 by the NSA and first published by NIST in 2001. It produces a 256-bit digest
 from a message of arbitrary length, processing data in 512-bit blocks. Unlike
 MD5, SHA-256 has no known practical collision attacks and remains widely used
 in security-critical applications such as TLS certificates and Bitcoin's
 proof-of-work.
 \vspace{1em}
 SHA-256 maintains a state of eight 32-bit words, initialized to fixed constants
 derived from the square roots of the first eight prime numbers. Each 512-bit
 block is processed in 64 rounds. Each round applies a compression step involving
 two non-linear functions, a message schedule word, and a precomputed constant
 derived from the cube roots of the first 64 prime numbers.
 \vspace{1em}
 The padding scheme has the same structure as in MD5: a single \texttt{1} bit is
 appended, followed by \texttt{0} bits until the message length is congruent to
 448 bits modulo 512, and the original length in bits is appended as a 64-bit
 integer. The only difference is that SHA-256 encodes this length in big-endian
 order.
 \vspace{1em}
 Each round uses two non-linear functions applied to the state words:
 \begin{align*}
 \text{Ch}(E, F, G)  &= (E \land F) \oplus (\lnot E \land G) \\
 \text{Maj}(A, B, C) &= (A \land B) \oplus (A \land C) \oplus (B \land C)
 \end{align*}
 \noindent and two rotation-based functions applied to the state words $A$ and $E$:
 \begin{align*}
 \Sigma_0(A) &= (A \ggg 2)  \oplus (A \ggg 13) \oplus (A \ggg 22) \\
 \Sigma_1(E) &= (E \ggg 6)  \oplus (E \ggg 11) \oplus (E \ggg 25)
 \end{align*}
 \noindent where $\ggg$ denotes a right rotation.
 \vspace{1em}
 At each round $i$ (with $0 \leq i < 64$), the state is updated as follows:
 \begin{align*}
 T_1 &= H + \Sigma_1(E) + \text{Ch}(E, F, G) + K[i] + W[i] \\
 T_2 &= \Sigma_0(A) + \text{Maj}(A, B, C) \\
 H &\leftarrow G, \quad G \leftarrow F, \quad F \leftarrow E, \quad E \leftarrow D + T_1 \\
 D &\leftarrow C, \quad C \leftarrow B, \quad B \leftarrow A, \quad A \leftarrow T_1 + T_2
 \end{align*}
 \noindent where $K[i]$ is a precomputed constant, $W[i]$ is a word from the
 message schedule, and all additions are performed modulo $2^{32}$.
 \vspace{1em}
 The 64 round constants $K[i]$ are the first 32 bits of the fractional parts of
 the cube roots of the first 64 prime numbers:
 \begin{align*}
 \forall i \in \mathbb{N},\ 0 \leq i < 64,\quad K[i] = \left\lfloor 2^{32} \times \left(\sqrt[3]{p_{i+1}} \bmod 1\right) \right\rfloor
 \end{align*}
 \noindent where $p_{i+1}$ is the $(i+1)$-th prime number and $\bmod 1$ denotes
 the fractional part.
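 \vspace{1em}
 For instance, for $i = 0$ the prime is $p_1 = 2$, which gives:
 \begin{align*}
 K[0] &= \left\lfloor 2^{32} \times \left(\sqrt[3]{2} \bmod 1\right) \right\rfloor
       = \left\lfloor 2^{32} \times 0.2599210\ldots \right\rfloor = \texttt{0x428a2f98}
 \end{align*}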
 \vspace{1em}
 The message schedule extends the 16 words of the current block into 64 words
 using two additional rotation-based functions:
 \begin{align*}
 \sigma_0(x) &= (x \ggg 7)  \oplus (x \ggg 18) \oplus (x \gg 3) \\
 \sigma_1(x) &= (x \ggg 17) \oplus (x \ggg 19) \oplus (x \gg 10)
 \end{align*}
 \noindent where $\gg$ denotes a logical right shift. With $M[i]$ denoting the
 $i$-th 32-bit word of the current 512-bit block, read in big-endian order, the
 schedule is defined as:
 \begin{align*}
 W[i] = \begin{cases}
    M[i] & i \in \mathbb{N},\ 0 \leq i < 16 \\
    \sigma_1(W[i-2]) + W[i-7] + \sigma_0(W[i-15]) + W[i-16] & i \in \mathbb{N},\ 16 \leq i < 64
 \end{cases}
 \end{align*}
 \vspace{1em}
 The state is initialized to the first 32 bits of the fractional parts of the
 square roots of the first eight prime numbers:
 \begin{align*}
 A &= \texttt{0x6a09e667}, \quad B = \texttt{0xbb67ae85}, \quad
 C = \texttt{0x3c6ef372}, \quad D = \texttt{0xa54ff53a} \\
 E &= \texttt{0x510e527f}, \quad F = \texttt{0x9b05688c}, \quad
 G = \texttt{0x1f83d9ab}, \quad H = \texttt{0x5be0cd19}
 \end{align*}
 After each block is processed, the compressed state is added word-by-word to
 the state before compression:
 \begin{align*}
 (A, \ldots, H) \leftarrow (A + A_0,\ B + B_0,\ C + C_0,\ D + D_0,\ E + E_0,\ F + F_0,\ G + G_0,\ H + H_0)
 \end{align*}
 \noindent where $A_0, \ldots, H_0$ denote the state at the beginning of the
 block. After all blocks have been processed, the eight state words are
 serialized in big-endian order to produce the 256-bit digest.
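 \vspace{1em}
 The whole section can be condensed into the following Python sketch. It is an
 illustration of the formulas rather than the code of \texttt{libft\_ssl}: the
 names \texttt{sha256\_sketch}, \texttt{first\_primes} and \texttt{icbrt} are
 assumptions, and the constants are recomputed from the primes with exact
 integer arithmetic instead of being stored as tables.

```python
import math
import struct

MASK = 0xFFFFFFFF


def first_primes(count: int) -> list:
    """Return the first `count` primes by trial division."""
    primes, n = [], 2
    while len(primes) < count:
        if all(n % p for p in primes):
            primes.append(n)
        n += 1
    return primes


def icbrt(n: int) -> int:
    """Largest integer r with r**3 <= n (exact binary search)."""
    lo, hi = 0, 1 << (n.bit_length() // 3 + 2)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


PRIMES = first_primes(64)
# Low 32 bits of floor(2^32 * root) are the first 32 fractional bits of the root.
H0 = [math.isqrt(p << 64) & MASK for p in PRIMES[:8]]
K = [icbrt(p << 96) & MASK for p in PRIMES]


def rotr(x: int, n: int) -> int:
    """Rotate a 32-bit value right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK


def sha256_sketch(message: bytes) -> bytes:
    # Padding: 0x80, zero bytes up to 56 mod 64, then the 64-bit BE bit length.
    bit_len = (8 * len(message)) & 0xFFFFFFFFFFFFFFFF
    zeros = (55 - len(message)) % 64
    padded = message + b"\x80" + b"\x00" * zeros + struct.pack(">Q", bit_len)

    state = list(H0)
    for off in range(0, len(padded), 64):
        w = list(struct.unpack(">16I", padded[off:off + 64]))
        for i in range(16, 64):  # message schedule
            s0 = rotr(w[i - 15], 7) ^ rotr(w[i - 15], 18) ^ (w[i - 15] >> 3)
            s1 = rotr(w[i - 2], 17) ^ rotr(w[i - 2], 19) ^ (w[i - 2] >> 10)
            w.append((s1 + w[i - 7] + s0 + w[i - 16]) & MASK)
        a, b, c, d, e, f, g, h = state
        for i in range(64):  # compression rounds
            ch = (e & f) ^ (~e & g)
            maj = (a & b) ^ (a & c) ^ (b & c)
            t1 = (h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)) + ch + K[i] + w[i]) & MASK
            t2 = ((rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)) + maj) & MASK
            a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
        # Feedforward: add the compressed state to the state before the block.
        state = [(s + v) & MASK for s, v in zip(state, (a, b, c, d, e, f, g, h))]
    return struct.pack(">8I", *state)  # big-endian serialization
```

 Deriving \texttt{H0} and \texttt{K} from the primes also checks the constants
 quoted in this section against their definition.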
 \newpage
--- a/doc/whirlpool.tex
+++ b/doc/whirlpool.tex
@ -0,0 +1,106 @@
 \section{Whirlpool}
 Whirlpool is a cryptographic hash function designed by Vincent Rijmen and Paulo
 Barreto, first published in 2000 and standardized by ISO/IEC in 2004. It
 produces a 512-bit digest from a message of arbitrary length, processing data
 in 512-bit blocks. Its compression function is a dedicated block cipher used in
 the Miyaguchi-Preneel construction. The cipher shares design principles with
 AES, using a substitution-permutation network over an $8 \times 8$ matrix of
 bytes.
 \vspace{1em}
 Whirlpool maintains a state of eight 64-bit words, forming an $8 \times 8$
 matrix of bytes. Each 512-bit block is processed in 10 rounds. Each round
 applies four successive transformations to the state matrix: a byte
 substitution, a column shift, a row mixing, and a round key addition.
 \vspace{1em}
 The padding scheme follows the same structure as MD5 and SHA-256, but with a
 wider length field: a single \texttt{1} bit is appended, followed by
 \texttt{0} bits until the message length is congruent to 256 bits modulo 512.
 The original message length in bits is then appended as a 256-bit big-endian
 integer.
 \vspace{1em}
 Each round applies the following four transformations in order:
 \textbf{SubBytes} replaces each byte of the state matrix by its image under the
 Whirlpool S-box, a fixed 256-entry lookup table defined in the Whirlpool
 specification.
 \medskip
 \textbf{ShiftColumns} cyclically shifts each column $j$ of the state matrix
 downward by $j$ positions, so that the bytes of each row end up spread across
 all eight rows. Formally, if $a_{i,j}$ denotes the byte at row $i$, column $j$
 of the state matrix, ShiftColumns produces:
 \begin{align*}
 b_{i,j} = a_{i',\ j} \quad \text{where } i' = (i - j) \bmod 8
 \end{align*}
 \medskip
 \textbf{MixRows} multiplies each row of the state matrix by a fixed MDS matrix
 over $\mathrm{GF}(2^8)$ with irreducible polynomial $x^8 + x^4 + x^3 + x^2 +
 1$, providing diffusion across the eight bytes of each row. Formally, for each
 row $i$, each output byte $b_{i,j}$ is computed as:
 \begin{align*}
 b_{i,j} = \bigoplus_{k=0}^{7} \mathrm{MDS}[(j - k) \bmod 8] \cdot a_{i,k}
 \end{align*}
 \noindent where $\mathrm{MDS}[0], \ldots, \mathrm{MDS}[7]$ is the first row of
 the MDS matrix, which is circulant, $\cdot$ denotes multiplication in
 $\mathrm{GF}(2^8)$ and $\oplus$ denotes XOR.
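 \medskip
 As a small worked example, assume the first row given in the Whirlpool
 specification, $(\texttt{01}, \texttt{01}, \texttt{04}, \texttt{01},
 \texttt{08}, \texttt{05}, \texttt{02}, \texttt{09})$ in hexadecimal. An input
 row whose only non-zero byte is $a_{i,0} = \texttt{01}$ is then mapped to this
 row of coefficients itself. Products that overflow eight bits are reduced by
 the irreducible polynomial, for instance:
 \begin{align*}
 \texttt{80} \cdot \texttt{02} = x^7 \cdot x = x^8 \equiv x^4 + x^3 + x^2 + 1 = \texttt{1d}
 \end{align*}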
 \medskip
 \textbf{AddRoundKey} XORs the state with the current round key.
 \vspace{1em}
 The S-box and the MDS matrix coefficients are fixed by the Whirlpool
 specification; the 256-entry S-box is too large to reproduce here. The
 round constants $\mathrm{RC}[r]$, $r \in \mathbb{N},\ 1 \leq r \leq 10$, are
 however directly derived from the S-box. Each $\mathrm{RC}[r]$ is an 8-word
 state where only the first word is non-zero:
 \begin{align*}
 \mathrm{RC}[r][0] &= \sum_{k=0}^{7} S[8(r-1)+k] \cdot 2^{8(7-k)} \\
 \mathrm{RC}[r][j] &= 0 \quad \forall j \in \mathbb{N},\ 1 \leq j \leq 7
 \end{align*}
 \noindent Their role is to break symmetry in the key schedule: without them, a
 symmetric input state would produce symmetric round keys, weakening the internal
 block transformation.
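 \vspace{1em}
 As a worked example, the first eight S-box entries listed in the Whirlpool
 specification are \texttt{18}, \texttt{23}, \texttt{c6}, \texttt{e8},
 \texttt{87}, \texttt{b8}, \texttt{01} and \texttt{4f}. Placing $S[0]$ in the
 most significant byte gives the first round constant:
 \begin{align*}
 \mathrm{RC}[1][0] = \texttt{0x1823c6e887b8014f}
 \end{align*}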
 \vspace{1em}
 The round keys $K[r]$, $r \in \mathbb{N},\ 0 \leq r \leq 10$, are derived from
 the current hash state. $K[0]$ is set to the state before processing the block.
 Each subsequent key is obtained by applying the round function to the previous
 key, with the round constant $\mathrm{RC}[r]$ taking the place of the round
 key. Here $\text{Round}(S, K)$ denotes the successive application of SubBytes,
 ShiftColumns, MixRows, and AddRoundKey with key $K$ to state $S$:
 \begin{align*}
 K[0] &= H \\
 K[r] &= \text{Round}(K[r-1],\ \mathrm{RC}[r]) \quad r \in \mathbb{N},\ 1 \leq r \leq 10
 \end{align*}
 The block $M$ is then encrypted with these keys: $K[0]$ is first XORed into
 $M$, and the ten rounds $\text{Round}(\cdot, K[r])$, $1 \leq r \leq 10$, are
 applied in order. The final state update follows the Miyaguchi-Preneel scheme:
 \begin{align*}
 H \leftarrow E(H,\ M) \oplus M \oplus H
 \end{align*}
 \noindent where $E(H, M)$ denotes the encryption of $M$ with key schedule
 derived from $H$.
 \vspace{1em}
 The state is initialized to all zeros. After all blocks have been processed,
 the eight 64-bit state words are serialized in big-endian order to produce the
 512-bit digest.
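 \vspace{1em}
 The pieces of this section fit together as in the Python sketch below. It is
 an unoptimized illustration, not the code of \texttt{libft\_ssl}. Several
 elements are assumptions taken from the Whirlpool specification rather than
 from this document: the mini-boxes \texttt{E} and \texttt{R} used to generate
 the S-box, the MDS row, and the 256-bit length field in the padding. The
 function names are hypothetical.

```python
# Mini-boxes and first MDS row as given in the Whirlpool specification (assumed).
E = [0x1, 0xB, 0x9, 0xC, 0xD, 0x6, 0xF, 0x3, 0xE, 0x8, 0x7, 0x4, 0xA, 0x2, 0x5, 0x0]
R = [0x7, 0xC, 0xB, 0xD, 0xE, 0x4, 0x9, 0xF, 0x6, 0x3, 0x8, 0xA, 0x2, 0x5, 0x1, 0x0]
E_INV = [E.index(v) for v in range(16)]
MDS = [1, 1, 4, 1, 8, 5, 2, 9]


def make_sbox() -> list:
    """Build the 256-entry S-box from the 4-bit mini-boxes."""
    sbox = []
    for x in range(256):
        hi_n, lo_n = E[x >> 4], E_INV[x & 0xF]
        r = R[hi_n ^ lo_n]
        sbox.append((E[hi_n ^ r] << 4) | E_INV[lo_n ^ r])
    return sbox


SBOX = make_sbox()


def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x^2 + 1 (0x11D)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result


def round_fn(state: list, key: list) -> list:
    """SubBytes, ShiftColumns, MixRows, then AddRoundKey on 8x8 byte matrices."""
    sub = [[SBOX[v] for v in row] for row in state]
    shifted = [[sub[(i - j) % 8][j] for j in range(8)] for i in range(8)]
    out = []
    for i in range(8):
        row = []
        for j in range(8):
            acc = key[i][j]
            for k in range(8):
                acc ^= gf_mul(MDS[(j - k) % 8], shifted[i][k])
            row.append(acc)
        out.append(row)
    return out


def whirlpool_sketch(message: bytes) -> bytes:
    # Padding: 0x80, zero bytes up to 32 mod 64, then the 256-bit BE bit length.
    zeros = (31 - len(message)) % 64
    padded = message + b"\x80" + b"\x00" * zeros + (8 * len(message)).to_bytes(32, "big")

    h = [[0] * 8 for _ in range(8)]  # all-zero initial state
    for off in range(0, len(padded), 64):
        m = [list(padded[off + 8 * i: off + 8 * i + 8]) for i in range(8)]
        key = h                                             # K[0] = H
        state = [[m[i][j] ^ key[i][j] for j in range(8)] for i in range(8)]
        for r in range(1, 11):
            rc = [SBOX[8 * (r - 1): 8 * r]] + [[0] * 8 for _ in range(7)]
            key = round_fn(key, rc)                         # K[r]
            state = round_fn(state, key)
        # Miyaguchi-Preneel: H <- E(H, M) xor M xor H.
        h = [[state[i][j] ^ m[i][j] ^ h[i][j] for j in range(8)] for i in range(8)]
    return bytes(v for row in h for v in row)
```

 The matrix is stored row by row, so row $i$ corresponds to the $i$-th 64-bit
 state word in big-endian order.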