doc: add full algorithm reference (MD5, SHA-256, Whirlpool)

lohhiiccc 2026-05-06 10:39:32 +02:00
parent 3289a9191d
commit 3e36cb4906
9 changed files with 488 additions and 35 deletions

@@ -1,11 +1,18 @@
-EXTRA_DIST = libft_ssl.tex
+EXTRA_DIST = \
+	libft_ssl.tex \
+	preliminaries.tex \
+	introduction.tex \
+	generic_interface.tex \
+	md5.tex \
+	sha256.tex \
+	whirlpool.tex
 if ENABLE_DOC
 pdf: libft_ssl.pdf
 libft_ssl.pdf: libft_ssl.tex
-	$(PDFLATEX) $<
-	$(PDFLATEX) $<
+	TEXINPUTS=$(srcdir): $(PDFLATEX) $<
+	TEXINPUTS=$(srcdir): $(PDFLATEX) $<
 endif
 clean-local:

doc/generic_interface.tex Normal file

@@ -0,0 +1,53 @@
\section{Generic Digest Interface}
All hash algorithms in \texttt{libft\_ssl} are exposed through a single,
uniform interface built around the \texttt{struct digest\_algo} type. This
structure holds the algorithm's metadata (its name, digest size and block size)
along with three function pointers: \texttt{init}, \texttt{update} and
\texttt{final}. This design allows any algorithm to be driven through the same
calling convention without the caller needing to know which one is in use. The
associated context is held in a \texttt{union digest\_ctx}, which overlays the
per-algorithm state structures so that a single allocation covers all supported
algorithms.
\vspace{1em}
\texttt{struct digest\_algo} describes a hash algorithm as a set of metadata
and three function pointers. The \texttt{name} field identifies the algorithm.
The \texttt{digest\_size} and \texttt{block\_size} fields express its output
length and internal block size in bytes. The three function pointers
\texttt{init}, \texttt{update} and \texttt{final} define the algorithm's
lifecycle: \texttt{init} sets the context to its initial state, \texttt{update}
feeds an arbitrary amount of data into it, and \texttt{final} produces the
digest and resets the context. All three operate on a \texttt{void~*} context
pointer, which allows the interface to remain algorithm-agnostic.
\vspace{1em}
The \texttt{union digest\_ctx} type provides a single allocation large enough
to hold the context of any supported algorithm. Because only one algorithm is
active at a time, overlaying the per-algorithm structures in a union avoids the
overhead of a separate heap allocation while keeping the calling code uniform.
The active member is always the one matching the \texttt{struct digest\_algo}
being used.
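\vspace{1em}

The two types described above could be declared roughly as follows. This is a
minimal sketch and not the library's actual header: the field types, the
function-pointer signatures, the per-algorithm context structures and the
helper \texttt{digest\_buffer} are assumptions made for illustration.

\begin{lstlisting}[language=C]
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-algorithm contexts; the real layouts may differ. */
struct md5_ctx       { uint32_t state[4]; uint64_t bit_len; uint8_t buf[64]; };
struct sha256_ctx    { uint32_t state[8]; uint64_t bit_len; uint8_t buf[64]; };
struct whirlpool_ctx { uint64_t state[8]; uint8_t bit_len[32]; uint8_t buf[64]; };

/* One storage area large enough for any supported algorithm. */
union digest_ctx {
    struct md5_ctx       md5;
    struct sha256_ctx    sha256;
    struct whirlpool_ctx whirlpool;
};

/* Metadata plus the three lifecycle callbacks (signatures assumed). */
struct digest_algo {
    const char *name;
    size_t      digest_size;  /* output length in bytes */
    size_t      block_size;   /* internal block size in bytes */
    void      (*init)(void *ctx);
    void      (*update)(void *ctx, const void *data, size_t len);
    void      (*final)(void *ctx, uint8_t *out);
};

/* Hash one buffer with whichever algorithm `algo` describes.
 * `out` must provide at least algo->digest_size bytes. */
static void digest_buffer(const struct digest_algo *algo,
                          const void *data, size_t len, uint8_t *out)
{
    union digest_ctx ctx;  /* automatic storage, no heap allocation */

    algo->init(&ctx);
    algo->update(&ctx, data, len);
    algo->final(&ctx, out);
}
\end{lstlisting}

\noindent The caller never names a concrete algorithm: it only passes a
\texttt{struct digest\_algo} pointer and lets the union provide the storage.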
\vspace{1em}
Each supported algorithm is registered in \texttt{digest\_algos.h} through an
X-macro list. This file defines a single macro \texttt{DIGEST\_ALGOS(X)} that
expands \texttt{X} once per algorithm, passing its name, digest size and block
size. Consuming this list with a different definition of \texttt{X} generates
the corresponding code or data without repetition --- the global \texttt{struct
digest\_algo} instances in \texttt{libft\_ssl.c} are produced this way. Adding
a new algorithm to the library reduces to adding one line to this list.
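\vspace{1em}

The X-macro technique can be illustrated with the following self-contained
sketch. The argument order follows the description above; the entry spellings,
the reduced \texttt{struct algo\_info} type and the macro name
\texttt{AS\_TABLE\_ROW} are hypothetical and do not come from
\texttt{digest\_algos.h}.

\begin{lstlisting}[language=C]
#include <stddef.h>
#include <stdio.h>

/* Sketch of the list: X(name, digest size, block size), sizes in bytes. */
#define DIGEST_ALGOS(X) \
    X(md5,       16, 64) \
    X(sha256,    32, 64) \
    X(whirlpool, 64, 64)

/* Reduced stand-in for struct digest_algo, metadata only. */
struct algo_info {
    const char *name;
    size_t      digest_size;
    size_t      block_size;
};

/* One definition of X turns every list entry into a table row. */
#define AS_TABLE_ROW(name, dsize, bsize) { #name, dsize, bsize },
static const struct algo_info algo_table[] = { DIGEST_ALGOS(AS_TABLE_ROW) };
#undef AS_TABLE_ROW

int main(void)
{
    size_t count = sizeof algo_table / sizeof algo_table[0];

    for (size_t i = 0; i < count; i++)
        printf("%-10s digest=%zu block=%zu\n", algo_table[i].name,
               algo_table[i].digest_size, algo_table[i].block_size);
    return 0;
}
\end{lstlisting}

\noindent A second definition of \texttt{X} could expand the same list into
function prototypes or an \texttt{enum}, which is what keeps the per-algorithm
boilerplate in a single place.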
\vspace{1em}
All three algorithms follow the Merkle-Damgård construction. The message is
split into fixed-size blocks and processed sequentially. After each block, the
compressed output is combined with the previous state to produce the new state
--- this chaining ensures that the final digest depends on every bit of the
input. The exact combination operation is algorithm-specific: MD5 and SHA-256
use an additive feedforward, while Whirlpool uses the Miyaguchi-Preneel scheme.
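\vspace{1em}

The chaining described above can be summarized as follows. The symbols $f$
(compression function), $C$ (the rounds applied to a block), $\mathrm{IV}$
(initial state) and $M_1, \ldots, M_n$ (padded message blocks) are notation
introduced here for illustration only:
\begin{align*}
H_0 &= \mathrm{IV} \\
H_i &= f(H_{i-1},\ M_i) \quad 1 \leq i \leq n \\
\text{digest} &= \text{serialize}(H_n)
\end{align*}
\noindent For MD5 and SHA-256, $f(H, M) = H + C(H, M)$ with word-wise addition
modulo $2^{32}$; for Whirlpool, $f(H, M) = E(H, M) \oplus M \oplus H$.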
\newpage

doc/introduction.tex Normal file

@@ -0,0 +1,19 @@
\section{Introduction}
\texttt{libft\_ssl} is a C library implementing cryptographic hash functions
from scratch. A cryptographic hash function maps an arbitrary-length input to a
fixed-size digest. This operation is deterministic and one-way: it is
computationally infeasible to recover the original input from its digest.
The library currently implements the following algorithms:
\begin{itemize}
\item \textbf{MD5} - produces a 128-bit digest.
\item \textbf{SHA-256} - produces a 256-bit digest.
\item \textbf{Whirlpool} - produces a 512-bit digest.
\end{itemize}
These functions are commonly used for data integrity verification, digital
signatures, and \textbf{M}essage \textbf{A}uthentication \textbf{C}ode\textbf{s} (MACs).
\newpage

@@ -6,7 +6,7 @@
 \usepackage{amssymb}
 \usepackage{listings}
 \usepackage{xcolor}
-\usepackage{hyperref}
+\usepackage[hidelinks]{hyperref}
 \usepackage{geometry}
 \geometry{margin=2.5cm}
@@ -21,36 +21,11 @@
 \tableofcontents
 \newpage
-\section{Introduction}
+\input{preliminaries}
+\input{introduction}
+\input{generic_interface}
+\input{md5}
+\input{sha256}
+\input{whirlpool}
-\texttt{libft\_ssl} is a C library implementing cryptographic hash functions
-from scratch. A cryptographic hash function maps an arbitrary-length input to a
-fixed-size digest. This operation is deterministic and one-way: it is
-computationally infeasible to recover the original input from its digest.
-The library currently implements the following algorithms:
-\begin{itemize}
-\item \textbf{MD5} - produces a 128-bit digest.
-\item \textbf{SHA-256} - produces a 256-bit digest.
-\item \textbf{Whirlpool} - produces a 512-bit digest.
-\end{itemize}
-These functions are commonly used for data integrity verification, digital
-signatures, and message authentication codes (MACs).
-\newpage
-\section{Library core}
-\newpage
-\section{MD5}
-\newpage
-\section{SHA-256}
-\newpage
-\section{Whirlpool}
 \end{document}

doc/md5.tex Normal file

@@ -0,0 +1,102 @@
\section{MD5}
MD5 (Message Digest Algorithm 5) was designed by Ronald Rivest in 1991 as a
strengthened replacement for MD4. It produces a 128-bit digest from a message
of arbitrary length, processing data in 512-bit blocks. Although MD5 is now
considered cryptographically broken (practical collision attacks have been
demonstrated since 2004), it remains widely used for non-security purposes
such as checksums and the detection of accidental data corruption.
\vspace{1em}
MD5 maintains a state of four 32-bit words, conventionally named $A$, $B$, $C$
and $D$, initialized to fixed constants defined in RFC 1321. Each 512-bit block
is processed in four rounds of sixteen operations each, for a total of 64
operations per block. Each operation applies one of four non-linear functions
to the state words, adds a message word and a precomputed constant derived from
the sine function, and rotates the result by a fixed amount.
\vspace{1em}
The state is initialized to the following fixed constants, as specified in RFC 1321:
\begin{align*}
A &= \texttt{0x67452301} \\
B &= \texttt{0xefcdab89} \\
C &= \texttt{0x98badcfe} \\
D &= \texttt{0x10325476}
\end{align*}
\vspace{1em}
Before processing, the message is padded to a length congruent to 448 bits
modulo 512. A single \texttt{1} bit is appended first, followed by as many
\texttt{0} bits as needed. The original message length in bits is then appended
as a 64-bit little-endian integer, bringing the total padded length to an exact
multiple of 512 bits.
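\vspace{1em}

As a worked example (the message is chosen here for illustration), consider
the 3-byte ASCII message \texttt{abc}, of length $\ell = 24$ bits. The number
of \texttt{0} bits to append is
\begin{align*}
k = (447 - \ell) \bmod 512 = 423
\end{align*}
\noindent so that $24 + 1 + 423 = 448$ bits precede the length field and
$448 + 64 = 512$ bits form exactly one block. In hexadecimal, the block is
\texttt{61 62 63 80}, followed by 52 zero bytes, followed by the length
\texttt{18 00 00 00 00 00 00 00}. A message of exactly 448 bits gives
$k = 511$ and therefore spills into a second block.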
\vspace{1em}
Each of the four rounds uses a distinct non-linear function applied to the
state words $B$, $C$ and $D$:
\begin{align*}
F(B, C, D) &= (B \land C) \lor (\lnot B \land D) \\
G(B, C, D) &= (B \land D) \lor (C \land \lnot D) \\
H(B, C, D) &= B \oplus C \oplus D \\
I(B, C, D) &= C \oplus (B \lor \lnot D)
\end{align*}
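
\noindent As a small illustration (4-bit values chosen here for readability;
real MD5 words are 32 bits wide), take $B = \texttt{1100}$,
$C = \texttt{1010}$ and $D = \texttt{0110}$, so that
$\lnot B = \texttt{0011}$ and $\lnot D = \texttt{1001}$:
\begin{align*}
F(B, C, D) &= \texttt{1000} \lor \texttt{0010} = \texttt{1010} \\
G(B, C, D) &= \texttt{0100} \lor \texttt{1000} = \texttt{1100} \\
H(B, C, D) &= \texttt{1100} \oplus \texttt{1010} \oplus \texttt{0110} = \texttt{0000} \\
I(B, C, D) &= \texttt{1010} \oplus \texttt{1101} = \texttt{0111}
\end{align*}
\noindent $F$ acts as a bitwise selector: where a bit of $B$ is 1 the result
takes the bit of $C$, otherwise the bit of $D$. $G$ does the same with $D$
selecting between $B$ and $C$.
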
The message word index used at step $i$ is not sequential: each round applies
a distinct selector function $k_r$ where $r = \lfloor i / 16 \rfloor$:
\begin{align*}
k_0(i) &= i \bmod 16 \\
k_1(i) &= (5i + 1) \bmod 16 \\
k_2(i) &= (3i + 5) \bmod 16 \\
k_3(i) &= 7i \bmod 16
\end{align*}
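
\noindent For instance, $k_1(17) = (5 \times 17 + 1) \bmod 16 = 86 \bmod 16 = 6$.
Evaluating the last three selectors at the first four steps of their rounds
gives the following word indices, which can be checked against the step tables
of RFC 1321:
\begin{align*}
k_1(16), \ldots, k_1(19) &= 1,\ 6,\ 11,\ 0 \\
k_2(32), \ldots, k_2(35) &= 5,\ 8,\ 11,\ 14 \\
k_3(48), \ldots, k_3(51) &= 0,\ 7,\ 14,\ 5
\end{align*}
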
At each step $i$ (with $0 \leq i < 64$), one of the four functions is selected
according to the current round $r = \lfloor i / 16 \rfloor$, and the state is
updated as follows:
\begin{align*}
A &\leftarrow B + \bigl((A + \phi(B, C, D) + M[k_r(i)] + T[i]) \lll s[i]\bigr)
\end{align*}
\noindent where $\phi$ is the auxiliary function for the current round
($F$, $G$, $H$ or $I$), $M[k_r(i)]$ is the 32-bit word of the current block
chosen by the selector, $T[i]$ is a precomputed constant, $s[i]$ is the
rotation amount, and $\lll$ denotes a left rotation. All additions are taken
modulo $2^{32}$. After this operation, the state words are cycled:
$(A, B, C, D) \leftarrow (D, A, B, C)$.
\vspace{1em}
The rotation amount $s[i]$ depends only on the round and on $i \bmod 4$: each
round cycles through its own four values in order:
\begin{align*}
\text{Round 0} &: 7,\ 12,\ 17,\ 22 \\
\text{Round 1} &: 5,\ 9,\ 14,\ 20 \\
\text{Round 2} &: 4,\ 11,\ 16,\ 23 \\
\text{Round 3} &: 6,\ 10,\ 15,\ 21
\end{align*}
\vspace{1em}
The 64 constants $T[i]$ are derived from the sine function, with the argument
taken in radians:
\begin{align*}
T[i] = \left\lfloor 2^{32}\,|\sin(i+1)| \right\rfloor \quad 0 \leq i < 64
\end{align*}
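
\noindent As a check of the formula, the first two constants work out as
follows (the decimal values are rounded intermediate results):
\begin{align*}
T[0] &= \left\lfloor 2^{32} \times 0.8414709848\ldots \right\rfloor = 3614090360 = \texttt{0xd76aa478} \\
T[1] &= \left\lfloor 2^{32} \times 0.9092974268\ldots \right\rfloor = 3905402710 = \texttt{0xe8c7b756}
\end{align*}
\noindent These match the first two entries of the constant table in RFC 1321.
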
After each block is processed, the compressed state is added word-by-word to
the state before compression:
\begin{align*}
(A, B, C, D) \leftarrow (A + A_0,\ B + B_0,\ C + C_0,\ D + D_0)
\end{align*}
\noindent where $A_0$, $B_0$, $C_0$, $D_0$ denote the state at the beginning
of the block. After all blocks have been processed, the four state words are
serialized in little-endian order to produce the 128-bit digest.
\newpage

doc/md5_init_T.py Normal file

@@ -0,0 +1,15 @@
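import math


def md5_T():
    """Return the 64 MD5 constants T[i] = floor(2^32 * |sin(i + 1)|), i + 1 in radians."""
    T = []
    for i in range(64):
        val = int(math.floor((2**32) * abs(math.sin(i + 1))))
        T.append(val & 0xFFFFFFFF)  # keep the low 32 bits
    return T


if __name__ == "__main__":
    # Print one constant per line, ready to paste into a C array initializer.
    for v in md5_T():
        print(f"0x{v:08x}, ")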

doc/preliminaries.tex Normal file

@@ -0,0 +1,66 @@
\section{Preliminaries}
This section defines the terminology used throughout the document. The concepts
introduced here are general to cryptographic hash functions and apply to all
algorithms described in subsequent sections.
\vspace{1em}
A \textbf{bit} is the smallest unit of information, taking a value of either 0
or 1. A \textbf{byte} is a group of eight bits, and is the standard unit of
data storage and transmission. A \textbf{word} is a fixed-size integer used
internally by a hash algorithm --- MD5 and SHA-256 operate on 32-bit words,
while Whirlpool operates on 64-bit words.
\vspace{1em}
\textbf{Endianness} refers to the byte order used when storing a multi-byte
integer in memory. In \textbf{little-endian} order, the least significant byte
is stored first; in \textbf{big-endian} order, the most significant byte is
stored first. This distinction matters when serializing the internal state to
produce the final digest --- MD5 uses little-endian, while SHA-256 and
Whirlpool use big-endian.
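\vspace{1em}

For instance (the value is chosen for illustration), the 32-bit word
\texttt{0x67452301} is laid out in memory, from the lowest address upward, as:
\begin{align*}
\text{little-endian} &: \texttt{01 23 45 67} \\
\text{big-endian} &: \texttt{67 45 23 01}
\end{align*}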
\vspace{1em}
A \textbf{message} is the arbitrary-length input fed to a hash function. The
\textbf{digest} is the fixed-size output it produces. A hash function is said
to be \textbf{one-way} if it is computationally infeasible to recover any input
that produces a given digest. A \textbf{collision} occurs when two distinct
messages produce the same digest. Collisions always exist, because there are
more possible messages than digests; a hash function is considered broken when
they can be found substantially faster than the generic birthday bound of about
$2^{n/2}$ evaluations for an $n$-bit digest.
\vspace{1em}
Hash functions process their input in fixed-size chunks called \textbf{blocks}.
Since the message length is rarely a multiple of the block size,
\textbf{padding} is appended to the last block to bring it to the required
length. The \textbf{state} is a set of words initialized to fixed constants and
updated after each block; it accumulates the result of the computation and is
serialized into the digest at the end. The \textbf{compression function} is the
core transformation applied to each block --- it takes the current state and
one block of data, and produces a new state.
\vspace{1em}
The \textbf{Miyaguchi-Preneel} construction is a way to build a compression
function from a block cipher $E$. Given a current state $H$ and a message
block $M$, it produces a new state as:
\begin{align*}
H \leftarrow E(H,\ M) \oplus M \oplus H
\end{align*}
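\noindent where $E(H, M)$ denotes the encryption of $M$ using $H$ as the key.
A block cipher is invertible once its key is known, so $E$ alone would make a
poor compression function; the feed-forward XOR with both $M$ and $H$ is what
prevents the output from being trivially inverted.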
\vspace{1em}
The \textbf{wide-pipe} construction is a variant of Merkle-Damgård where the
internal state is wider than the final digest. This makes collision attacks
harder: an attacker targeting the output must first find a collision in the
larger internal state, which requires significantly more work than attacking
the digest directly.
\newpage

doc/sha256.tex Normal file

@@ -0,0 +1,110 @@
\section{SHA-256}
SHA-256 is part of the SHA-2 family of cryptographic hash functions, designed
by the NSA and first published by NIST in 2001. It produces a 256-bit digest
from a message of arbitrary length, processing data in 512-bit blocks. Unlike
MD5, SHA-256 has no known practical collision attacks and remains widely used
in security-critical applications such as TLS certificates and Bitcoin's
proof-of-work.
\vspace{1em}
SHA-256 maintains a state of eight 32-bit words, initialized to fixed constants
derived from the fractional parts of the square roots of the first eight prime
numbers. Each 512-bit block is processed in 64 rounds. Each round applies a
compression step involving two non-linear functions, two rotation-based
functions, a message schedule word, and a precomputed constant derived from the
fractional parts of the cube roots of the first 64 prime numbers.
\vspace{1em}
The padding scheme is identical to MD5: a single \texttt{1} bit is appended,
followed by \texttt{0} bits until the message length is congruent to 448 bits
modulo 512, and the original length in bits is appended as a 64-bit integer.
The difference is that SHA-256 encodes this length in big-endian order.
\vspace{1em}
Each round uses two non-linear functions applied to the state words:
\begin{align*}
\text{Ch}(E, F, G) &= (E \land F) \oplus (\lnot E \land G) \\
\text{Maj}(A, B, C) &= (A \land B) \oplus (A \land C) \oplus (B \land C)
\end{align*}
\noindent and two rotation-based functions applied to the state words $A$ and $E$:
\begin{align*}
\Sigma_0(A) &= (A \ggg 2) \oplus (A \ggg 13) \oplus (A \ggg 22) \\
\Sigma_1(E) &= (E \ggg 6) \oplus (E \ggg 11) \oplus (E \ggg 25)
\end{align*}
\noindent where $\ggg$ denotes a right rotation.
\vspace{1em}
At each round $i$ (with $0 \leq i < 64$), the state is updated as follows:
\begin{align*}
T_1 &= H + \Sigma_1(E) + \text{Ch}(E, F, G) + K[i] + W[i] \\
T_2 &= \Sigma_0(A) + \text{Maj}(A, B, C) \\
H &\leftarrow G, \quad G \leftarrow F, \quad F \leftarrow E, \quad E \leftarrow D + T_1 \\
D &\leftarrow C, \quad C \leftarrow B, \quad B \leftarrow A, \quad A \leftarrow T_1 + T_2
\end{align*}
\noindent where $K[i]$ is a precomputed constant and $W[i]$ is a word from the
message schedule.
\vspace{1em}
The 64 round constants $K[i]$ are the first 32 bits of the fractional parts of
the cube roots of the first 64 prime numbers:
\begin{align*}
K[i] = \left\lfloor 2^{32} \times \left(\sqrt[3]{p_{i+1}} \bmod 1\right) \right\rfloor \quad 0 \leq i < 64
\end{align*}
\noindent where $p_{i+1}$ is the $(i+1)$-th prime number and $\bmod 1$ denotes
the fractional part.
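
\noindent As a check of the formula for $i = 0$, where $p_1 = 2$ and
$\sqrt[3]{2} = 1.2599210498\ldots$:
\begin{align*}
K[0] = \left\lfloor 2^{32} \times 0.2599210498\ldots \right\rfloor = 1116352408 = \texttt{0x428a2f98}
\end{align*}
\noindent which is the first round constant listed in the SHA-256
specification.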
\vspace{1em}
The message schedule extends the 16 words of the current block into 64 words
using two additional rotation-based functions:
\begin{align*}
\sigma_0(x) &= (x \ggg 7) \oplus (x \ggg 18) \oplus (x \gg 3) \\
\sigma_1(x) &= (x \ggg 17) \oplus (x \ggg 19) \oplus (x \gg 10)
\end{align*}
\noindent where $\gg$ denotes a logical right shift. With $M[i]$ denoting the
$i$-th 32-bit word of the current 512-bit block, read in big-endian order, the
schedule is defined as:
\begin{align*}
W[i] = \begin{cases}
M[i] & i \in \mathbb{N},\ 0 \leq i < 16 \\
\sigma_1(W[i-2]) + W[i-7] + \sigma_0(W[i-15]) + W[i-16] & i \in \mathbb{N},\ 16 \leq i < 64
\end{cases}
\end{align*}
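
\noindent To illustrate the difference between rotation and shift (input value
chosen for illustration), take $x = \texttt{0x00000001}$. The two rotations
move the single set bit to positions 25 and 14, while the shift discards it:
\begin{align*}
\sigma_0(x) = \texttt{0x02000000} \oplus \texttt{0x00004000} \oplus \texttt{0x00000000} = \texttt{0x02004000}
\end{align*}
\noindent The first extended word is then obtained by instantiating the
recurrence at $i = 16$:
\begin{align*}
W[16] = \sigma_1(W[14]) + W[9] + \sigma_0(W[1]) + W[0]
\end{align*}
\noindent with the additions taken modulo $2^{32}$.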
\vspace{1em}
The state is initialized to fixed constants, the first 32 bits of the
fractional parts of the square roots of the first eight prime numbers:
\begin{align*}
A &= \texttt{0x6a09e667}, \quad B = \texttt{0xbb67ae85}, \quad
C = \texttt{0x3c6ef372}, \quad D = \texttt{0xa54ff53a} \\
E &= \texttt{0x510e527f}, \quad F = \texttt{0x9b05688c}, \quad
G = \texttt{0x1f83d9ab}, \quad H = \texttt{0x5be0cd19}
\end{align*}
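
\noindent For example, the first word can be recomputed from
$\sqrt{2} = 1.4142135623\ldots$:
\begin{align*}
A = \left\lfloor 2^{32} \times 0.4142135623\ldots \right\rfloor = 1779033703 = \texttt{0x6a09e667}
\end{align*}
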
After each block is processed, the compressed state is added word-by-word to
the state before compression:
\begin{align*}
(A, \ldots, H) \leftarrow (A + A_0,\ B + B_0,\ C + C_0,\ D + D_0,\ E + E_0,\ F + F_0,\ G + G_0,\ H + H_0)
\end{align*}
\noindent where $A_0, \ldots, H_0$ denote the state at the beginning of the
block. After all blocks have been processed, the eight state words are
serialized in big-endian order to produce the 256-bit digest.
\newpage

doc/whirlpool.tex Normal file

@@ -0,0 +1,106 @@
\section{Whirlpool}
Whirlpool is a cryptographic hash function designed by Vincent Rijmen and Paulo
Barreto, first published in 2000 and standardized by ISO/IEC in 2004. It
produces a 512-bit digest from a message of arbitrary length, processing data
in 512-bit blocks. Its compression function applies the Miyaguchi-Preneel
construction to a dedicated 512-bit block cipher that shares design principles
with AES, using a substitution-permutation network over an $8 \times 8$ matrix
of bytes.
\vspace{1em}
Whirlpool maintains a state of eight 64-bit words, forming an $8 \times 8$
matrix of bytes. Each 512-bit block is processed in 10 rounds. Each round
applies four successive transformations to the state matrix: a byte
substitution, a column shift, a row mixing, and a round key addition.
\vspace{1em}
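The padding scheme follows the same structure as MD5 and SHA-256, but with a
wider length field: a single \texttt{1} bit is appended, followed by
\texttt{0} bits until the message length is congruent to 256 bits modulo 512.
The original message length in bits is then appended as a 256-bit big-endian
integer. For example, a 24-bit message receives one \texttt{1} bit and 231
\texttt{0} bits, and the 256-bit length then completes a single 512-bit block.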
\vspace{1em}
Each round applies the following four transformations in order:
\textbf{SubBytes} replaces each byte of the state matrix by its image under the
Whirlpool S-box, a fixed 256-entry lookup table defined in the Whirlpool
specification.
\medskip
\textbf{ShiftColumns} cyclically shifts each column $j$ of the state matrix
downward by $j$ positions, so that the bytes of any one row end up in eight
different rows. Formally, if $a_{i,j}$ denotes the byte at row $i$, column $j$
of the state matrix, ShiftColumns produces:
\begin{align*}
b_{i,j} = a_{i',\ j} \quad \text{where } i' = (i - j) \bmod 8
\end{align*}
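
\noindent Instantiating the formula for the first two columns shows the
effect. Column 0 is left unchanged, while in column 1 every byte moves to the
next row index and the last byte wraps around:
\begin{align*}
b_{i,0} &= a_{i,0} \quad 0 \leq i \leq 7 \\
b_{0,1} &= a_{7,1}, \quad b_{1,1} = a_{0,1}, \quad b_{2,1} = a_{1,1}, \quad \ldots, \quad b_{7,1} = a_{6,1}
\end{align*}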
\medskip
\textbf{MixRows} multiplies each row of the state matrix by a fixed MDS matrix
over $\mathrm{GF}(2^8)$ with irreducible polynomial $x^8 + x^4 + x^3 + x^2 +
1$, providing diffusion across the eight bytes of each row. Formally, for each
row $i$, each output byte $b_j$ is computed as:
\begin{align*}
b_j = \bigoplus_{k=0}^{7} \mathrm{MDS}[(j - k) \bmod 8] \cdot a_{i,k}
\end{align*}
\noindent where $\cdot$ denotes multiplication in $\mathrm{GF}(2^8)$ and
$\oplus$ denotes XOR.
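
\noindent As a worked example of the field arithmetic, multiplying by
\texttt{02} (the polynomial $x$) is a one-bit left shift followed, when a bit
is shifted out, by an XOR with \texttt{1D}, the low byte of the reduction
polynomial:
\begin{align*}
\texttt{02} \cdot \texttt{80} &= x \cdot x^7 = x^8 \equiv x^4 + x^3 + x^2 + 1 = \texttt{1D} \\
\texttt{04} \cdot \texttt{80} &= \texttt{02} \cdot \texttt{1D} = \texttt{3A} \\
\texttt{05} \cdot \texttt{80} &= \texttt{3A} \oplus \texttt{80} = \texttt{BA}
\end{align*}
\noindent Assuming the coefficient vector of the final Whirlpool
specification, $\mathrm{MDS} = (\texttt{01}, \texttt{01}, \texttt{04},
\texttt{01}, \texttt{08}, \texttt{05}, \texttt{02}, \texttt{09})$ (to be
checked against the tables used by the library), a row whose only non-zero
byte is $a_{i,0} = \texttt{80}$ gives $b_j = \mathrm{MDS}[j] \cdot \texttt{80}$,
that is:
\begin{align*}
(b_0, \ldots, b_7) = (\texttt{80},\ \texttt{80},\ \texttt{3A},\ \texttt{80},\ \texttt{74},\ \texttt{BA},\ \texttt{1D},\ \texttt{F4})
\end{align*}
\noindent A single input byte therefore influences all eight bytes of the
output row.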
\medskip
\textbf{AddRoundKey} XORs the state with the current round key.
\vspace{1em}
The S-box and the MDS matrix coefficients are fixed tables defined in the
Whirlpool specification; their values are too large to reproduce here. The
round constants $\mathrm{RC}[r]$, $r \in \mathbb{N},\ 1 \leq r \leq 10$, are
however directly derived from the S-box. Each $\mathrm{RC}[r]$ is an 8-word
state where only the first word is non-zero:
\begin{align*}
\mathrm{RC}[r][0] &= \sum_{k=0}^{7} S[8(r-1)+k] \cdot 2^{8(7-k)} \\
\mathrm{RC}[r][j] &= 0 \quad \forall j \in \mathbb{N},\ 1 \leq j \leq 7
\end{align*}
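\noindent where $S$ is the S-box and the sum is ordinary integer arithmetic: it
packs eight consecutive S-box entries into one 64-bit word, most significant
byte first. The role of the round constants is to break symmetry in the key
schedule: without them, a symmetric input state would produce symmetric round
keys, weakening the internal block transformation.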
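
\noindent For example, assuming the S-box of the final Whirlpool
specification, whose first eight entries are \texttt{18}, \texttt{23},
\texttt{c6}, \texttt{e8}, \texttt{87}, \texttt{b8}, \texttt{01} and
\texttt{4f}, the first round constant is:
\begin{align*}
\mathrm{RC}[1][0] = \texttt{0x1823c6e887b8014f}
\end{align*}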
\vspace{1em}
The round keys $K[r]$, $r \in \mathbb{N},\ 0 \leq r \leq 10$, are derived from
the current hash state. $K[0]$ is set to the state before processing the block.
Each subsequent key is obtained by applying the round function to the previous
key with a precomputed round constant, where $\text{Round}(X, K)$ denotes the
successive application of SubBytes, ShiftColumns, MixRows, and AddRoundKey with
key $K$ to a state $X$:
\begin{align*}
K[0] &= H \\
K[r] &= \text{Round}(K[r-1],\ \mathrm{RC}[r]) \quad r \in \mathbb{N},\ 1 \leq r \leq 10
\end{align*}
The block $M$ is then encrypted using these keys. The chaining state has the
same width as the digest (512 bits), so this is an ordinary rather than a
wide-pipe chain. The final state update follows the Miyaguchi-Preneel scheme:
\begin{align*}
H \leftarrow E(H,\ M) \oplus M \oplus H
\end{align*}
\noindent where $E(H, M)$ denotes the encryption of $M$ with key schedule
derived from $H$.
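
\noindent Written out with the round function and round keys defined above,
the encryption can be sketched as follows, where the intermediate states
$W_r$ are notation introduced here for illustration:
\begin{align*}
W_0 &= M \oplus K[0] \\
W_r &= \text{Round}(W_{r-1},\ K[r]) \quad 1 \leq r \leq 10 \\
E(H,\ M) &= W_{10}
\end{align*}
\noindent The data path and the key schedule thus run the same round function
side by side, one keyed by $K[r]$ and the other by $\mathrm{RC}[r]$.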
\vspace{1em}
The state is initialized to all zeros. After all blocks have been processed,
the eight 64-bit state words are serialized in big-endian order to produce the
512-bit digest.