doc: add full algorithm reference (MD5, SHA-256, Whirlpool)

2026-05-06 10:39:32 +02:00 · 2026-05-06 10:39:32 +02:00 · 3e36cb4906
commit 3e36cb4906
parent 3289a9191d
9 changed files with 488 additions and 35 deletions
--- a/doc/Makefile.am
+++ b/doc/Makefile.am
@ -1,11 +1,18 @@
-EXTRA_DIST = libft_ssl.tex
+EXTRA_DIST = \
+	libft_ssl.tex \
+	preliminaries.tex \
+	introduction.tex \
+	generic_interface.tex \
+	md5.tex \
+	sha256.tex \
+	whirlpool.tex

 if ENABLE_DOC
 pdf: libft_ssl.pdf

 libft_ssl.pdf: libft_ssl.tex
-	$(PDFLATEX) $<
-	$(PDFLATEX) $<
+	TEXINPUTS=$(srcdir): $(PDFLATEX) $<
+	TEXINPUTS=$(srcdir): $(PDFLATEX) $<
 endif

 clean-local:
--- a/doc/generic_interface.tex
+++ b/doc/generic_interface.tex
@ -0,0 +1,53 @@
+\section{Generic Digest Interface}
+
+All hash algorithms in \texttt{libft\_ssl} are exposed through a single,
+uniform interface built around the \texttt{struct digest\_algo} type. This
+structure holds the algorithm's metadata (its name, digest size and block size)
+along with three function pointers: \texttt{init}, \texttt{update} and
+\texttt{final}. This design allows any algorithm to be driven through the same
+calling convention without the caller needing to know which one is in use. The
+associated context is held in a \texttt{union digest\_ctx}, which overlays the
+per-algorithm state structures so that a single allocation covers all supported
+algorithms.
+
+\vspace{1em}
+
+\texttt{struct digest\_algo} describes a hash algorithm as a set of metadata
+and three function pointers. The \texttt{name} field identifies the algorithm.
+The \texttt{digest\_size} and \texttt{block\_size} fields express its output
+length and internal block size in bytes. The three function pointers
+\texttt{init}, \texttt{update} and \texttt{final} define the algorithm's
+lifecycle: \texttt{init} sets the context to its initial state, \texttt{update}
+feeds an arbitrary amount of data into it, and \texttt{final} produces the
+digest and resets the context. All three operate on a \texttt{void~*} context
+pointer, which allows the interface to remain algorithm-agnostic.
+
+\vspace{1em}
+
+The \texttt{union digest\_ctx} type provides a single allocation large enough
+to hold the context of any supported algorithm. Because only one algorithm is
+active at a time, overlaying the per-algorithm structures in a union avoids the
+overhead of a separate heap allocation while keeping the calling code uniform.
+The active member is always the one matching the \texttt{struct digest\_algo}
+being used.
+
+\vspace{1em}
+
+Each supported algorithm is registered in \texttt{digest\_algos.h} through an
+X-macro list. This file defines a single macro \texttt{DIGEST\_ALGOS(X)} that
+expands \texttt{X} once per algorithm, passing its name, digest size and block
+size. Consuming this list with a different definition of \texttt{X} generates
+the corresponding code or data without repetition --- the global \texttt{struct
+digest\_algo} instances in \texttt{libft\_ssl.c} are produced this way. Adding
+a new algorithm to the library reduces to adding one line to this list.
+
+\vspace{1em}
+
+All three algorithms follow the Merkle-Damgård construction. The message is
+split into fixed-size blocks and processed sequentially. After each block, the
+compressed output is combined with the previous state to produce the new state
+--- this chaining ensures that the final digest depends on every bit of the
+input. The exact combination operation is algorithm-specific: MD5 and SHA-256
+use an additive feedforward, while Whirlpool uses the Miyaguchi-Preneel scheme.
+
+\newpage
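Note on doc/generic_interface.tex: the init/update/final lifecycle described in this file can be illustrated with a small Python analogy. This is only a sketch under stated assumptions: `DigestAlgo`, `_make` and `digest` are hypothetical names, and Python's `hashlib` stands in for the library's C implementations, since the actual layout of `struct digest_algo` is not shown in this commit. Unlike the documented `final`, `hashlib`'s `digest()` does not reset the context.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class DigestAlgo:
    """Python stand-in for the C `struct digest_algo` (hypothetical layout)."""
    name: str
    digest_size: int                          # output length in bytes
    block_size: int                           # internal block size in bytes
    init: Callable[[], object]                # returns a fresh context
    update: Callable[[object, bytes], None]   # feeds data into the context
    final: Callable[[object], bytes]          # returns the digest


def _make(name: str, digest_size: int, block_size: int) -> DigestAlgo:
    # hashlib plays the role of the per-algorithm C code (assumption).
    return DigestAlgo(
        name, digest_size, block_size,
        init=lambda: hashlib.new(name),
        update=lambda ctx, data: ctx.update(data),
        final=lambda ctx: ctx.digest(),
    )


# One entry per algorithm, mirroring the one-line-per-algorithm X-macro list.
DIGEST_ALGOS = {a.name: a for a in (_make("md5", 16, 64), _make("sha256", 32, 64))}


def digest(algo: DigestAlgo, chunks: Iterable[bytes]) -> bytes:
    """Drive any registered algorithm through the same three calls."""
    ctx = algo.init()
    for chunk in chunks:
        algo.update(ctx, chunk)
    return algo.final(ctx)
```

The driver never inspects which algorithm it was handed, so feeding `[b"a", b"bc"]` in two chunks yields the same bytes as hashing `b"abc"` in one call.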
--- a/doc/introduction.tex
+++ b/doc/introduction.tex
@ -0,0 +1,19 @@
+\section{Introduction}
+
+\texttt{libft\_ssl} is a C library implementing cryptographic hash functions
+from scratch. A cryptographic hash function maps an arbitrary-length input to a
+fixed-size digest. This operation is deterministic and one-way: it is
+computationally infeasible to recover the original input from its digest.
+
+The library currently implements the following algorithms:
+
+\begin{itemize}
+    \item \textbf{MD5} - produces a 128-bit digest.
+    \item \textbf{SHA-256} - produces a 256-bit digest.
+    \item \textbf{Whirlpool} - produces a 512-bit digest.
+\end{itemize}
+
+These functions are commonly used for data integrity verification, digital
+signatures, and \textbf{M}essage \textbf{A}uthentication \textbf{C}ode\textbf{s} (MACs).
+
+\newpage
--- a/doc/libft_ssl.tex
+++ b/doc/libft_ssl.tex
@ -6,7 +6,7 @@
 \usepackage{amssymb}
 \usepackage{listings}
 \usepackage{xcolor}
-\usepackage{hyperref}
+\usepackage[hidelinks]{hyperref}
 \usepackage{geometry}

 \geometry{margin=2.5cm}
@ -21,36 +21,11 @@
 \tableofcontents
 \newpage

-\section{Introduction}
+\input{preliminaries}
+\input{introduction}
+\input{generic_interface}
+\input{md5}
+\input{sha256}
+\input{whirlpool}

-\texttt{libft\_ssl} is a C library implementing cryptographic hash functions
-from scratch. A cryptographic hash function maps an arbitrary-length input to a
-fixed-size digest. This operation is deterministic and one-way: it is
-computationally infeasible to recover the original input from its digest.
-
-The library currently implements the following algorithms:
-
-\begin{itemize}
-    \item \textbf{MD5} - produces a 128-bit digest.
-    \item \textbf{SHA-256} - produces a 256-bit digest.
-    \item \textbf{Whirlpool} - produces a 512-bit digest.
-\end{itemize}
-
-These functions are commonly used for data integrity verification, digital
-signatures, and message authentication codes (MACs).
-\newpage
-
-\section{Library core}
-
-\newpage
-
-\section{MD5}
-
-\newpage
-
-\section{SHA-256}
-
-\newpage
-
-\section{Whirlpool}
 \end{document}
--- a/doc/md5.tex
+++ b/doc/md5.tex
@ -0,0 +1,102 @@
+\section{MD5}
+
+MD5 (Message Digest Algorithm 5) was designed by Ronald Rivest in 1991 as a
+strengthened replacement for MD4. It produces a 128-bit digest from a message
+of arbitrary length, processing data in 512-bit blocks. Although MD5 is now
+considered cryptographically broken (collision attacks have been
+demonstrated since 2004), it remains widely used for non-security purposes
+such as checksums that detect accidental data corruption.
+
+\vspace{1em}
+
+MD5 maintains a state of four 32-bit words, conventionally named $A$, $B$, $C$
+and $D$, initialized to fixed constants defined in RFC 1321. Each 512-bit block
+is processed in four rounds of sixteen operations each, for a total of 64
+operations per block. Each operation applies one of four non-linear functions
+to the state words, adds a message word and a precomputed constant derived from
+the sine function, and rotates the result by a fixed amount.
+
+\vspace{1em}
+
+The state is initialized to the following fixed constants, as specified in RFC 1321:
+
+\begin{align*}
+A &= \texttt{0x67452301} \\
+B &= \texttt{0xefcdab89} \\
+C &= \texttt{0x98badcfe} \\
+D &= \texttt{0x10325476}
+\end{align*}
+
+\vspace{1em}
+
+Before processing, the message is padded to a length congruent to 448 bits
+modulo 512. A single \texttt{1} bit is appended first, followed by as many
+\texttt{0} bits as needed. The original message length in bits is then appended
+as a 64-bit little-endian integer, bringing the total padded length to an exact
+multiple of 512 bits.
+
+\vspace{1em}
+
+Each of the four rounds uses a distinct non-linear function applied to the
+state words $B$, $C$ and $D$:
+
+\begin{align*}
+F(B, C, D) &= (B \land C) \lor (\lnot B \land D) \\
+G(B, C, D) &= (B \land D) \lor (C \land \lnot D) \\
+H(B, C, D) &= B \oplus C \oplus D \\
+I(B, C, D) &= C \oplus (B \lor \lnot D)
+\end{align*}
+
+The message word index used at step $i$ is not sequential: each round applies
+a distinct selector function $k_r$ where $r = \lfloor i / 16 \rfloor$:
+
+\begin{align*}
+k_0(i) &= i \bmod 16 \\
+k_1(i) &= (5i + 1) \bmod 16 \\
+k_2(i) &= (3i + 5) \bmod 16 \\
+k_3(i) &= 7i \bmod 16
+\end{align*}
+
+At each step $i$ (with $0 \leq i < 64$), one of the four functions is selected
+according to the current round, and the state is updated as follows:
+
+\begin{align*}
+A &\leftarrow B + \bigl((A + \phi(B, C, D) + M[k] + T[i]) \lll s[i]\bigr)
+\end{align*}
+
+\noindent where $\phi$ is the auxiliary function for the current round, $M[k]$
+is the 32-bit word of the current block selected by $k = k_r(i)$, $T[i]$ is a
+precomputed constant, $s[i]$ is the rotation amount, $\lll$ denotes a left
+rotation, and all additions are performed modulo $2^{32}$. After this
+operation, the state words are cycled: $(A, B, C, D) \leftarrow (D, A, B, C)$.
+
+\vspace{1em}
+
+The rotation amounts $s[i]$ depend on the round and cycle through four values:
+
+\begin{align*}
+\text{Round 0} &: 7,\ 12,\ 17,\ 22 \\
+\text{Round 1} &: 5,\ 9,\ 14,\ 20 \\
+\text{Round 2} &: 4,\ 11,\ 16,\ 23 \\
+\text{Round 3} &: 6,\ 10,\ 15,\ 21
+\end{align*}
+
+\vspace{1em}
+
+The 64 constants $T[i]$ are derived from the sine function, with the argument
+taken in radians:
+
+\begin{align*}
+\forall i \in \mathbb{N},\ 0 \leq i < 64,\quad T[i] = \left\lfloor 2^{32}\,|\sin(i+1)| \right\rfloor
+\end{align*}
+
+After each block is processed, the compressed state is added word-by-word to
+the state before compression:
+
+\begin{align*}
+(A, B, C, D) \leftarrow (A + A_0,\ B + B_0,\ C + C_0,\ D + D_0)
+\end{align*}
+
+\noindent where $A_0$, $B_0$, $C_0$, $D_0$ denote the state at the beginning
+of the block. After all blocks have been processed, the four state words are
+serialized in little-endian order to produce the 128-bit digest.
+
+\newpage
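Note on doc/md5.tex: the formulas in this file can be checked end to end with a short Python sketch. It is not the library's C code; the names `PHI`, `K_SEL`, `rotl` and `md5` are chosen here for illustration, and the message is assumed to be a whole number of bytes. The table `T` is computed the same way as in md5_init_T.py.

```python
import math
import struct

MASK = 0xFFFFFFFF

# Rotation amounts: four values per round, repeated four times each.
S = [7, 12, 17, 22] * 4 + [5, 9, 14, 20] * 4 + [4, 11, 16, 23] * 4 + [6, 10, 15, 21] * 4
# Sine-derived constants T[i] = floor(2^32 * |sin(i + 1)|), argument in radians.
T = [int(2**32 * abs(math.sin(i + 1))) & MASK for i in range(64)]

# Round functions F, G, H, I applied to (B, C, D).
PHI = [
    lambda b, c, d: (b & c) | (~b & d),
    lambda b, c, d: (b & d) | (c & ~d),
    lambda b, c, d: b ^ c ^ d,
    lambda b, c, d: c ^ (b | ~d),
]
# Message word selectors k_0 .. k_3.
K_SEL = [
    lambda i: i % 16,
    lambda i: (5 * i + 1) % 16,
    lambda i: (3 * i + 5) % 16,
    lambda i: (7 * i) % 16,
]


def rotl(x: int, n: int) -> int:
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK


def md5(message: bytes) -> bytes:
    state = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476]
    # Padding: one 1 bit (0x80), zeros up to 448 mod 512 bits, 64-bit LE length.
    padded = message + b"\x80" + b"\x00" * ((55 - len(message)) % 64)
    padded += struct.pack("<Q", (8 * len(message)) % 2**64)
    for off in range(0, len(padded), 64):
        m = struct.unpack("<16I", padded[off:off + 64])
        a, b, c, d = state
        for i in range(64):
            r = i // 16
            f = PHI[r](b, c, d) & MASK
            a = (b + rotl((a + f + m[K_SEL[r](i)] + T[i]) & MASK, S[i])) & MASK
            a, b, c, d = d, a, b, c          # cycle the state words
        # Feedforward: add the compressed words to the pre-block state.
        state = [(x + y) & MASK for x, y in zip(state, (a, b, c, d))]
    return struct.pack("<4I", *state)        # little-endian serialization
```

The test vectors from RFC 1321 (for example the empty string and `"abc"`) are a convenient way to confirm that the selectors, rotation amounts and byte order were transcribed correctly.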
--- a/doc/md5_init_T.py
+++ b/doc/md5_init_T.py
@ -0,0 +1,15 @@
+import math
+
+def md5_T():
+    T = []
+    for i in range(64):
+        val = int(math.floor((2**32) * abs(math.sin(i + 1))))
+        T.append(val & 0xFFFFFFFF)
+    return T
+
+if __name__ == "__main__":
+    T = md5_T()
+    for v in T:
+        print(f"0x{v:08x}, ")
+
+
--- a/doc/preliminaries.tex
+++ b/doc/preliminaries.tex
@ -0,0 +1,66 @@
+\section{Preliminaries}
+
+This section defines the terminology used throughout the document. The concepts
+introduced here are general to cryptographic hash functions and apply to all
+algorithms described in subsequent sections.
+
+\vspace{1em}
+
+A \textbf{bit} is the smallest unit of information, taking a value of either 0
+or 1. A \textbf{byte} is a group of eight bits, and is the standard unit of
+data storage and transmission. A \textbf{word} is a fixed-size integer used
+internally by a hash algorithm --- MD5 and SHA-256 operate on 32-bit words,
+while Whirlpool operates on 64-bit words.
+
+\vspace{1em}
+
+\textbf{Endianness} refers to the byte order used when storing a multi-byte
+integer in memory. In \textbf{little-endian} order, the least significant byte
+is stored first; in \textbf{big-endian} order, the most significant byte is
+stored first. This distinction matters when serializing the internal state to
+produce the final digest --- MD5 uses little-endian, while SHA-256 and
+Whirlpool use big-endian.
+
+\vspace{1em}
+
+A \textbf{message} is the arbitrary-length input fed to a hash function. The
+\textbf{digest} is the fixed-size output it produces. A hash function is said
+to be \textbf{one-way} if it is computationally infeasible to recover any input
+that produces a given digest. A \textbf{collision} occurs when two distinct
+messages produce the same digest; a hash function is considered broken when
+collisions can be found efficiently.
+
+\vspace{1em}
+
+Hash functions process their input in fixed-size chunks called \textbf{blocks}.
+\textbf{Padding} is always appended to the message, even when its length is
+already a multiple of the block size, so that the padded input fills a whole
+number of blocks and encodes the original length. The \textbf{state} is a set
+of words initialized to fixed constants and
+updated after each block; it accumulates the result of the computation and is
+serialized into the digest at the end. The \textbf{compression function} is the
+core transformation applied to each block --- it takes the current state and
+one block of data, and produces a new state.
+
+\vspace{1em}
+
+The \textbf{Miyaguchi-Preneel} construction is a way to build a compression
+function from a block cipher $E$. Given a current state $H$ and a message
+block $M$, it produces a new state as:
+
+\begin{align*}
+H \leftarrow E(H,\ M) \oplus M \oplus H
+\end{align*}
+
+\noindent where $E(H, M)$ denotes the encryption of $M$ using $H$ as the key.
+A block cipher is invertible once its key is known; feeding $M$ and $H$
+forward into the output with XOR is what makes the compression function hard
+to invert.
+
+\vspace{1em}
+
+The \textbf{wide-pipe} construction is a variant of Merkle-Damgård where the
+internal state is wider than the final digest. This makes attacks that rely on
+internal collisions harder: two messages that yield the same digest do not
+necessarily share the same internal state, and colliding the larger state
+requires significantly more work than colliding the digest.
+
+\newpage
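Note on doc/preliminaries.tex: the endianness and padding definitions in this file can be made concrete with a few lines of Python. This is a sketch only; `md_pad` is a hypothetical helper that covers the 64-bit length field used by MD5 and SHA-256, and it assumes the message is a whole number of bytes.

```python
import struct

# The same 32-bit word serialized in both byte orders.
word = 0x0A0B0C0D
little = struct.pack("<I", word)   # least significant byte first
big = struct.pack(">I", word)      # most significant byte first


def md_pad(message: bytes, byteorder: str) -> bytes:
    """Pad to a multiple of 64 bytes: 0x80, zero bytes, 64-bit bit length.

    byteorder is "little" (MD5) or "big" (SHA-256).
    """
    zeros = (55 - len(message)) % 64          # reach 56 bytes modulo 64
    bit_len = (8 * len(message)) % 2**64
    return message + b"\x80" + b"\x00" * zeros + bit_len.to_bytes(8, byteorder)
```

A 64-byte message still grows to 128 bytes after padding: the marker bit and the length field need room of their own, so a full extra block is added.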
--- a/doc/sha256.tex
+++ b/doc/sha256.tex
@ -0,0 +1,110 @@
+\section{SHA-256}
+
+SHA-256 is part of the SHA-2 family of cryptographic hash functions, designed
+by the NSA and first published by NIST in 2001. It produces a 256-bit digest
+from a message of arbitrary length, processing data in 512-bit blocks. Unlike
+MD5, SHA-256 has no known practical collision attacks and remains widely used
+in security-critical applications such as TLS certificates and Bitcoin's
+proof-of-work.
+
+\vspace{1em}
+
+SHA-256 maintains a state of eight 32-bit words, initialized to fixed constants
+derived from the square roots of the first eight prime numbers. Each 512-bit
+block is processed in 64 rounds. Each round applies a compression step involving
+two non-linear functions, a message schedule word, and a precomputed constant
+derived from the cube roots of the first 64 prime numbers.
+
+\vspace{1em}
+
+The padding scheme is identical to MD5: a single \texttt{1} bit is appended,
+followed by \texttt{0} bits until the message length is congruent to 448 bits
+modulo 512, and the original length in bits is appended as a 64-bit integer.
+The difference is that SHA-256 encodes this length in big-endian order.
+
+\vspace{1em}
+
+Each round uses two non-linear functions applied to the state words:
+
+\begin{align*}
+\text{Ch}(E, F, G)  &= (E \land F) \oplus (\lnot E \land G) \\
+\text{Maj}(A, B, C) &= (A \land B) \oplus (A \land C) \oplus (B \land C)
+\end{align*}
+
+\noindent and two rotation-based functions applied to the state words $A$ and $E$:
+
+\begin{align*}
+\Sigma_0(A) &= (A \ggg 2)  \oplus (A \ggg 13) \oplus (A \ggg 22) \\
+\Sigma_1(E) &= (E \ggg 6)  \oplus (E \ggg 11) \oplus (E \ggg 25)
+\end{align*}
+
+\noindent where $\ggg$ denotes a right rotation.
+
+\vspace{1em}
+
+At each round $i$ (with $0 \leq i < 64$), the state is updated as follows:
+
+\begin{align*}
+T_1 &= H + \Sigma_1(E) + \text{Ch}(E, F, G) + K[i] + W[i] \\
+T_2 &= \Sigma_0(A) + \text{Maj}(A, B, C) \\
+H &\leftarrow G, \quad G \leftarrow F, \quad F \leftarrow E, \quad E \leftarrow D + T_1 \\
+D &\leftarrow C, \quad C \leftarrow B, \quad B \leftarrow A, \quad A \leftarrow T_1 + T_2
+\end{align*}
+
+\noindent where $K[i]$ is a precomputed constant and $W[i]$ is a word from the
+message schedule. All additions are performed modulo $2^{32}$.
+
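+\vspace{1em}
+
+The constants $K[i]$ are the first 32 bits of the fractional parts of the cube
+roots of the first 64 prime numbers:
+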
+\begin{align*}
+\forall i \in \mathbb{N},\ 0 \leq i < 64,\quad K[i] = \left\lfloor 2^{32} \times \left(\sqrt[3]{p_{i+1}} \bmod 1\right) \right\rfloor
+\end{align*}
+
+\noindent where $p_{i+1}$ is the $(i+1)$-th prime number and $\bmod 1$ denotes
+the fractional part.
+
+\vspace{1em}
+
+The message schedule extends the 16 words of the current block into 64 words
+using two additional rotation-based functions:
+
+\begin{align*}
+\sigma_0(x) &= (x \ggg 7)  \oplus (x \ggg 18) \oplus (x \gg 3) \\
+\sigma_1(x) &= (x \ggg 17) \oplus (x \ggg 19) \oplus (x \gg 10)
+\end{align*}
+
+\noindent where $\gg$ denotes a logical right shift, and $M[i]$ denotes the
+$i$-th 32-bit word of the current 512-bit block. The schedule is then defined
+as:
+
+\begin{align*}
+W[i] = \begin{cases}
+    M[i] & i \in \mathbb{N},\ 0 \leq i < 16 \\
+    \sigma_1(W[i-2]) + W[i-7] + \sigma_0(W[i-15]) + W[i-16] & i \in \mathbb{N},\ 16 \leq i < 64
+\end{cases}
+\end{align*}
+
+\vspace{1em}
+
+The state is initialized to the first 32 bits of the fractional parts of the
+square roots of the first eight prime numbers:
+
+\begin{align*}
+A &= \texttt{0x6a09e667}, \quad B = \texttt{0xbb67ae85}, \quad
+C = \texttt{0x3c6ef372}, \quad D = \texttt{0xa54ff53a} \\
+E &= \texttt{0x510e527f}, \quad F = \texttt{0x9b05688c}, \quad
+G = \texttt{0x1f83d9ab}, \quad H = \texttt{0x5be0cd19}
+\end{align*}
+
+After each block is processed, the compressed state is added word-by-word to
+the state before compression:
+
+\begin{align*}
+(A, \ldots, H) \leftarrow (A + A_0,\ B + B_0,\ C + C_0,\ D + D_0,\ E + E_0,\ F + F_0,\ G + G_0,\ H + H_0)
+\end{align*}
+
+\noindent where $A_0, \ldots, H_0$ denote the state at the beginning of the
+block. After all blocks have been processed, the eight state words are
+serialized in big-endian order to produce the 256-bit digest.
+
+\newpage
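Note on doc/sha256.tex: the constant derivation, message schedule and round update in this file can be exercised with a compact Python sketch. It is an illustration rather than the library's implementation; `first_primes`, `icbrt`, `rotr` and `sha256` are hypothetical names, and the constants are derived with integer arithmetic (an assumption made here to avoid floating-point rounding) instead of being hard-coded.

```python
import math
import struct

MASK = 0xFFFFFFFF


def first_primes(n: int) -> list:
    """Return the first n primes by trial division."""
    primes, candidate = [], 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def icbrt(n: int) -> int:
    """Integer cube root: largest x with x**3 <= n (binary search)."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


PRIMES = first_primes(64)
# floor(2^32 * frac(cbrt(p))) and floor(2^32 * frac(sqrt(p))), computed exactly.
K = [icbrt(p << 96) & MASK for p in PRIMES]
IV = [math.isqrt(p << 64) & MASK for p in PRIMES[:8]]


def rotr(x: int, n: int) -> int:
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK


def sha256(message: bytes) -> bytes:
    state = list(IV)
    # Same padding as MD5, but the 64-bit length is big-endian.
    padded = message + b"\x80" + b"\x00" * ((55 - len(message)) % 64)
    padded += struct.pack(">Q", (8 * len(message)) % 2**64)
    for off in range(0, len(padded), 64):
        w = list(struct.unpack(">16I", padded[off:off + 64]))
        for i in range(16, 64):               # message schedule
            s0 = rotr(w[i - 15], 7) ^ rotr(w[i - 15], 18) ^ (w[i - 15] >> 3)
            s1 = rotr(w[i - 2], 17) ^ rotr(w[i - 2], 19) ^ (w[i - 2] >> 10)
            w.append((s1 + w[i - 7] + s0 + w[i - 16]) & MASK)
        a, b, c, d, e, f, g, h = state
        for i in range(64):                   # 64 rounds
            ch = (e & f) ^ (~e & g)
            maj = (a & b) ^ (a & c) ^ (b & c)
            t1 = (h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)) + ch + K[i] + w[i]) & MASK
            t2 = ((rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)) + maj) & MASK
            h, g, f, e, d, c, b, a = g, f, e, (d + t1) & MASK, c, b, a, (t1 + t2) & MASK
        state = [(x + y) & MASK for x, y in zip(state, (a, b, c, d, e, f, g, h))]
    return struct.pack(">8I", *state)         # big-endian serialization
```

Deriving `K` and `IV` from the primes, rather than copying the tables, doubles as a check on the formula for $K[i]$ and on the eight initial values listed in the file.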
--- a/doc/whirlpool.tex
+++ b/doc/whirlpool.tex
@ -0,0 +1,106 @@
+\section{Whirlpool}
+
+Whirlpool is a cryptographic hash function designed by Vincent Rijmen and Paulo
+Barreto, first published in 2000 and standardized by ISO/IEC in 2004. It
+produces a 512-bit digest from a message of arbitrary length, processing data
+in 512-bit blocks. Its compression function applies the Miyaguchi-Preneel
+construction to a dedicated block cipher that shares design principles with
+AES, using a substitution-permutation network over an $8 \times 8$ matrix of
+bytes.
+
+\vspace{1em}
+
+Whirlpool maintains a state of eight 64-bit words, forming an $8 \times 8$
+matrix of bytes. Each 512-bit block is processed in 10 rounds. Each round
+applies four successive transformations to the state matrix: a byte
+substitution, a column shift, a row mixing, and a round key addition.
+
+\vspace{1em}
+
+The padding scheme follows the same structure as MD5 and SHA-256, with a wider
+length field: a single \texttt{1} bit is appended, followed by \texttt{0} bits
+until the message length is congruent to 256 bits modulo 512. The original
+message length in bits is then appended as a 256-bit big-endian integer.
+
+\vspace{1em}
+
+Each round applies the following four transformations in order:
+
+\textbf{SubBytes} replaces each byte of the state matrix by its image under the
+Whirlpool S-box, a fixed 256-entry lookup table defined in the Whirlpool
+specification.
+
+\medskip
+
+\textbf{ShiftColumns} cyclically shifts each column $j$ of the state matrix
+downward by $j$ positions, a permutation that spreads the bytes of each row
+across all eight rows. Formally, if $a_{i,j}$ denotes the byte at row $i$,
+column $j$ of the state matrix, ShiftColumns produces:
+
+\begin{align*}
+b_{i,j} = a_{i',\ j} \quad \text{where } i' = (i - j) \bmod 8
+\end{align*}
+
+\medskip
+
+\textbf{MixRows} multiplies each row of the state matrix by a fixed MDS matrix
+over $\mathrm{GF}(2^8)$ with irreducible polynomial $x^8 + x^4 + x^3 + x^2 +
+1$, providing diffusion across the eight bytes of each row. Formally, for each
+row $i$, each output byte $b_j$ is computed as:
+
+\begin{align*}
+b_j = \bigoplus_{k=0}^{7} \mathrm{MDS}[(j - k) \bmod 8] \cdot a_{i,k}
+\end{align*}
+
+\noindent where $\mathrm{MDS}[0..7]$ is the first row of the MDS matrix, which
+is circulant, $\cdot$ denotes multiplication in $\mathrm{GF}(2^8)$ and
+$\oplus$ denotes XOR.
+
+\medskip
+
+\textbf{AddRoundKey} XORs the state with the current round key.
+
+\vspace{1em}
+
+The S-box and the MDS matrix coefficients are fixed tables defined in the
+Whirlpool specification; their values are too large to reproduce here. The
+round constants $\mathrm{RC}[r]$, $r \in \mathbb{N},\ 1 \leq r \leq 10$, are
+however directly derived from the S-box. Each $\mathrm{RC}[r]$ is an 8-word
+state where only the first word is non-zero:
+
+\begin{align*}
+\mathrm{RC}[r][0] &= \sum_{k=0}^{7} S[8(r-1)+k] \cdot 2^{8(7-k)} \\
+\mathrm{RC}[r][j] &= 0 \quad \forall j \in \mathbb{N},\ 1 \leq j \leq 7
+\end{align*}
+
+\noindent Their role is to break symmetry in the key schedule: without them, a
+symmetric input state would produce symmetric round keys, weakening the internal
+block transformation.
+
+\vspace{1em}
+
+The round keys $K[r]$, $r \in \mathbb{N},\ 0 \leq r \leq 10$, are derived from
+the current hash state. $K[0]$ is set to the state before processing the block.
+Each subsequent key is obtained by applying the round function to the previous
+key with a precomputed round constant, where $\text{Round}(S, K)$ denotes the
+successive application of SubBytes, ShiftColumns, MixRows, and AddRoundKey with
+key $K$ to state $S$:
+
+\begin{align*}
+K[0] &= H \\
+K[r] &= \text{Round}(K[r-1],\ \mathrm{RC}[r]) \quad r \in \mathbb{N},\ 1 \leq r \leq 10
+\end{align*}
+
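+The block $M$ is then encrypted with these keys: the cipher state starts as
+$M \oplus K[0]$ and is updated as $\text{Round}(\cdot,\ K[r])$ for
+$r = 1, \ldots, 10$; the result is $E(H, M)$. The final state update follows
+the Miyaguchi-Preneel scheme: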
+
+\begin{align*}
+H \leftarrow E(H,\ M) \oplus M \oplus H
+\end{align*}
+
+\noindent where $E(H, M)$ denotes the encryption of $M$ with key schedule
+derived from $H$.
+
+\vspace{1em}
+
+The state is initialized to all zeros. After all blocks have been processed,
+the eight 64-bit state words are serialized in big-endian order to produce the
+512-bit digest.
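
Note on doc/whirlpool.tex: the ShiftColumns, MixRows and round-constant formulas in this file can be sketched in Python without reproducing the S-box. The function names below are hypothetical, the state is modelled as a list of eight rows of eight bytes, and both the MDS first row and the S-box are passed in as parameters rather than hard-coded, so the sketch makes no claim about their actual values.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x^2 + 1 (0x11D)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D        # reduce as soon as degree 8 appears
        b >>= 1
    return result


def shift_columns(a: list) -> list:
    """b[i][j] = a[(i - j) mod 8][j], as in the ShiftColumns formula."""
    return [[a[(i - j) % 8][j] for j in range(8)] for i in range(8)]


def mix_rows(a: list, first_row: list) -> list:
    """b[j] = XOR over k of first_row[(j - k) mod 8] * a[i][k], per row."""
    out = []
    for row in a:
        new_row = []
        for j in range(8):
            acc = 0
            for k in range(8):
                acc ^= gf_mul(first_row[(j - k) % 8], row[k])
            new_row.append(acc)
        out.append(new_row)
    return out


def round_constant(r: int, sbox: list) -> list:
    """RC[r] as eight 64-bit words; only word 0 is non-zero (1 <= r <= 10)."""
    first = int.from_bytes(bytes(sbox[8 * (r - 1) + k] for k in range(8)), "big")
    return [first] + [0] * 7
```

Passing a first row of `[1, 0, 0, 0, 0, 0, 0, 0]` to `mix_rows` leaves the state unchanged, which is a quick way to check the index arithmetic before plugging in the coefficients from the specification.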