%%^^A%%  fontspec-doc-enc.tex -- part of FONTSPEC <latex3.github.io/fontspec>

\documentclass[a4paper]{l3doc}
\usepackage{fontspec-doc-style}
\showexamplesfalse
\begin{document}

\part{Commands for accents and symbols (`encodings')}
\label{part:enc}

\textbf{The functionality described in this section is experimental.}

In the pre-Unicode era, significant work was required by \LaTeX\ to ensure that
input characters in the source could be interpreted correctly depending on file encoding,
and that glyphs in the output were selected correctly depending on the font encoding.
With Unicode, we have the luxury of a single file and font encoding that is used for both input
and output.

While this may provide some illusion that we could get away simply with typing
Unicode text and receive correct output, this is not always the case.
For a start, hyphenation in particular is language-specific, so tags should be used
when switch between languages in a document.
The \pkg{babel} and \pkg{polyglossia} packages both provide features for this.

Multilingual documents will often use different fonts for different languages,
not just for style, but for the more pragmatic reason that fonts do not all contain
the same glyphs. (In fact, only test fonts such as Code2000 provide
anywhere near the full Unicode coverage.)
Indeed, certain fonts may be perfect for a certain application but miss a handful
of necessary diacritics or accented letters.
In these cases, \pkg{fontspec} can leverage the font encoding technology built
into \LaTeX2\ to provide on a per-font basis either provide fallback options or
error messages when a desired accent or symbol is not available.
However, at present
these features can only be provided for input using \LaTeX\ commands rather
than Unicode input; for example, typing |\`e| instead of |è| or |\textcopyright|
instead of |©| in the source file.

The most widely-used encoding in \LaTeXe\ was |T1| with companion `|TS1|' symbols
provided by the \pkg{textcomp} package.
These encodings provided glyphs to typeset text in a variety of western European languages.
As with most legacy \LaTeXe\ input methods, accents and symbols were input using
encoding-dependent commands such as |\`e| as described above.
As of 2017, in \LaTeXe\ on \XeTeX\ and \LuaTeX, the default encoding is |TU|,
which uses Unicode for input and output.
The |TU| encoding provides appropriate encoding-dependent definitions for input commands
to match the coverage of the |T1+TS1| encodings.
Wider coverage is not provided by default since (a)~each font will provide different glyph coverage, and
(b)~it is expected that most users will be writing with direct Unicode input.

For those users who do need finer-grained control, \pkg{fontspec} provides an
interface for a more extensible system.


\section{A new Unicode-based encoding from scratch}

Let's say you need to provide support for a document originally written with fonts
in the |OT2| encoding, which contains encoding-dependent commands for Cyrillic letters.
An example from the |OT2| encoding definition file (|ot2enc.def|) reads:
\begin{Verbatim}[numbers=left,firstnumber=57]
\DeclareTextSymbol{\CYRIE}{OT2}{5}
\DeclareTextSymbol{\CYRDJE}{OT2}{6}
\DeclareTextSymbol{\CYRTSHE}{OT2}{7}
\DeclareTextSymbol{\cyrnje}{OT2}{8}
\DeclareTextSymbol{\cyrlje}{OT2}{9}
\DeclareTextSymbol{\cyrdzhe}{OT2}{10}
\end{Verbatim}

To recreate this encoding in a form suitable for \pkg{fontspec}, create a new file
named, say, |fontrange-cyr.def| and populate it with
\begin{Verbatim}
...
\DeclareTextSymbol{\CYRIE}  {\LastDeclaredEncoding}{"0404}
\DeclareTextSymbol{\CYRDJE} {\LastDeclaredEncoding}{"0402}
\DeclareTextSymbol{\CYRTSHE}{\LastDeclaredEncoding}{"040B}
\DeclareTextSymbol{\cyrnje} {\LastDeclaredEncoding}{"045A}
\DeclareTextSymbol{\cyrlje} {\LastDeclaredEncoding}{"0459}
\DeclareTextSymbol{\cyrdzhe}{\LastDeclaredEncoding}{"045F}
...
\end{Verbatim}
The numbers |"0404|, |"0402|, \dots, are the Unicode slots (in hexadecimal)
of each glyph respectively.
The \pkg{fontspec} package provides a number of shorthands to simplify this style of input; in this case,
you could also write
\begin{Verbatim}
\EncodingSymbol{\CYRIE}{"0404}
...
\end{Verbatim}

To use this encoding in a \pkg{fontspec} font, you would first add this to your preamble:
\begin{Verbatim}
\DeclareUnicodeEncoding{unicyr}{
  \input{fontrange-cyr.def}
}
\end{Verbatim}
Then follow it up with a font loading call such as
\begin{Verbatim}
\setmainfont{...}[NFSSEncoding=unicyr]
\end{Verbatim}
The first argument |unicyr| is the name of the `encoding' to use in the
font family. (There's nothing special about the name chosen but it must be unique.)
The second argument to |\DeclareUnicodeEncoding| also allows adjustments to be made
for per-font changes.
We'll cover this use case in the next section.


\section{Adjusting a pre-existing encoding}

There are three reasons to adjust a pre-existing encoding:
to add, to remove, and to redefine some symbols, letters, and/or accents.

When adding symbols, etc., simply write
\begin{Verbatim}
\DeclareUnicodeEncoding{unicyr}{
  \input{tuenc.def}
  \input{fontrange-cyr.def}
  \EncodingSymbol{\textruble}{"20BD}
}
\end{Verbatim}
Of course if you consistently add a number of symbols to an encoding it would be
a good idea to create a new |fontrange-XX.def| file to suit your needs.

When removing symbols, use the |\UndeclareSymbol|\marg{cmd} command.
For example, if you a loading a font that you know is missing, say, the interrobang
(not that unusual a situation), you might write:
\begin{Verbatim}
\DeclareUnicodeEncoding{nobang}{
  \input{tuenc.def}
  \UndeclareSymbol\textinterrobang
}
\end{Verbatim}
Provided that you use the command |\textinterrobang| to typeset this symbol,
it will appear in fonts with the default encoding, while in any font loaded with
the |nobang| encoding an attempt to access the symbol will either use the default
fallback definition or return an error, depending on the symbol being undeclared.


The third use case is to redefine a symbol or accent. The most common use case
in this scenario is to adjust a specific accent command to either fine-tune its placement
or to `fake' it entirely.
For example, the underdot diacritic is used in typeset Sanskrit,
but it is not necessarily included as an accent symbol is all fonts.
By default the underdot is defined in |TU| as:
\begin{Verbatim}
\EncodingAccent{\d}{"0323}
\end{Verbatim}
For fonts with a missing (or poorly-spaced) |"0323| accent glyph, the `traditional' \TeX\ fake accent
construction could be used instead:
\begin{Verbatim}
\DeclareUnicodeEncoding{fakeacc}{
  \input{tuenc.def}
  \EncodingCommand{\d}[1]{%
    \hmode@bgroup
      \o@lign{\relax#1\crcr\hidewidth\ltx@sh@ft{-1ex}.\hidewidth}%
    \egroup
  }
}
\end{Verbatim}
This would be set up in a document as such:
\begin{Verbatim}
\newfontfamily\sanskitfont{CharisSIL}
\newfontfamily\titlefont{Posterama}[NFSSEncoding=fakeacc]
\end{Verbatim}
Then later in the document, no additional work is needed:
\begin{Verbatim}
...{\titlefont   kalita\d m}...  % <- uses fake accent
...{\sanskitfont kalita\d m}...  % <- uses real accent
\end{Verbatim}
To reiterate from above, typing this input with Unicode text (`|kalitaṃ|')
will \emph{bypass} this encoding mechanism and you will receive only what is contained
literally within the font.


\section{Summary of commands}

The \LaTeXe\ kernel provides the following font encoding commands suitable for Unicode encodings:
\begin{quote}\obeylines
  \cs{DeclareTextCommand}\marg{command}\marg{encoding}\oarg{num}\oarg{default}\marg{code}
  \cs{DeclareUnicodeAccent}\marg{command}\marg{encoding}\marg{slot}
  \cs{DeclareTextSymbol}\marg{command}\marg{encoding}\marg{slot}
  \cs{DeclareTextComposite}\marg{command}\marg{encoding}\marg{letter}\marg{slot}
  \cs{DeclareTextCompositeCommand}\marg{command}\marg{encoding}\marg{letter}\marg{code}
  \cs{UndeclareTextCommand}\marg{command}\marg{encoding}
\end{quote}
See |fntguide.pdf| for full documentation of these.
As shown above, the following shorthands are provided by \pkg{fontspec} to simplify
the process of defining Unicode font range encodings:
\begin{quote}\obeylines
  \cs{EncodingCommand}\marg{command}\oarg{num}\oarg{default}\marg{code}
  \cs{EncodingAccent}\marg{command}\marg{code}
  \cs{EncodingSymbol}\marg{command}\marg{code}
  \cs{EncodingComposite}\marg{command}\marg{letter}\marg{slot}
  \cs{EncodingCompositeCommand}\marg{command}\marg{letter}\marg{code}
  \cs{UndeclareSymbol}\marg{command}
  \cs{UndeclareAccent}\marg{command}
  \cs{UndeclareCommand}\marg{command}
  \cs{UndeclareComposite}\marg{command}\marg{letter}
\end{quote}

\end{document}


% /©
% ------------------------------------------------
% The FONTSPEC package  <latex3.github.io/fontspec>
% ------------------------------------------------
% Copyright  2022-2024  The LaTeX project,  LPPL "maintainer"
% Copyright  2004-2022  Will Robertson
% Copyright  2009-2015  Khaled Hosny
% Copyright  2013       Philipp Gesang
% Copyright  2013-2016  Joseph Wright
% ------------------------------------------------
% This package is free software and may be redistributed and/or modified under
% the conditions of the LaTeX Project Public License, version 1.3c or higher
% (your choice): <http://www.latex-project.org/lppl/>.
% ------------------------------------------------
% ©/