2.19 Wide character support
SWI-Prolog supports wide characters: characters with character codes above 255 that cannot be represented in a single byte. The Universal Character Set (UCS) is the ISO/IEC 10646 standard that assigns a unique 31-bit unsigned integer to any character in any language. It is a superset of 16-bit Unicode, which in turn is a superset of ISO 8859-1 (ISO Latin-1), itself a superset of US-ASCII. UCS can handle strings holding characters from multiple languages, and character classification (uppercase, lowercase, digit, etc.) and operations such as case conversion are unambiguously defined.
For this reason SWI-Prolog has two internal representations for atoms and string objects (see section 5.2). If the text fits in ISO Latin-1, it is represented as an array of 8-bit characters; otherwise it is represented as an array of 32-bit numbers. This representational issue is completely transparent to the Prolog user, but users of the foreign language interface described in chapter 12 sometimes need to be aware of it.
Character encoding comes into view when text needs to be read from or written to a file, or when it has to be communicated to other software components using the foreign language interface. In this section we only deal with I/O through streams, which includes file I/O as well as I/O through network sockets.
2.19.1 Wide character encodings on streams
Although characters are uniquely coded using the UCS standard internally, streams and files are byte (8-bit) oriented, and there are a variety of ways to represent the larger UCS codes in an 8-bit octet stream. The most popular one, especially in the context of the web, is UTF-8. Bytes 0 ... 127 simply represent the corresponding US-ASCII character, while bytes 128 ... 255 are used for multi-byte encoding of characters placed higher in the UCS space. Especially on MS-Windows, the 16-bit Unicode standard, represented by pairs of bytes, is also popular.
Prolog I/O streams have a property called encoding, which specifies the encoding in use. It influences get_code/2 and put_code/2 as well as all other text I/O predicates.
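For illustration, here is a minimal sketch (the predicate name is our own choice, not part of the system) that inspects the encoding property of an open stream with stream_property/2 and then changes it with set_stream/2:

```
% show_and_set_encoding(+Stream): report the current encoding of Stream,
% then switch the stream to UTF-8.  The predicate name is illustrative only.
show_and_set_encoding(Stream) :-
    stream_property(Stream, encoding(Old)),
    format('Current encoding: ~w~n', [Old]),
    set_stream(Stream, encoding(utf8)),
    stream_property(Stream, encoding(New)),
    format('New encoding: ~w~n', [New]).
```

For example, ?- open('data.txt', read, S), show_and_set_encoding(S), close(S). (with 'data.txt' a hypothetical file) would report the encoding chosen when the file was opened and then switch the stream to UTF-8.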
The default encoding for files is derived from the Prolog flag encoding, which is initialised from setlocale(LC_CTYPE, NULL) to one of text, utf8 or iso_latin_1. One of the latter two is used if the encoding name is recognized, while text is used as the default. Using text, the translation is left to the wide-character functions of the C library (the Prolog native UTF-8 mode is considerably faster than the generic mbrtowc() one). The encoding can be specified explicitly in load_files/2 for loading Prolog source with an alternative encoding, in open/4 when opening files, or using set_stream/2 on any open stream. For Prolog source files we also provide the encoding/1 directive, which can be used to switch between encodings that are compatible with US-ASCII (ascii, iso_latin_1, utf8 and many locales). See also section 3.1.3 for writing Prolog files with non-US-ASCII characters and section 2.16.1.9 for syntax issues. For additional information and Unicode resources, please visit http://www.unicode.org/.
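The sketch below illustrates both mechanisms; the file name and predicate name are invented for the example. The encoding/1 directive declares the encoding of the source file it appears in, while the encoding(utf8) option to open/4 overrides the locale-derived default for one particular stream:

```
:- encoding(utf8).      % declare the encoding of this source file itself

% read_utf8_lines(+File, -Lines): open File explicitly as UTF-8, regardless
% of the locale-derived default, and split its contents into lines.
read_utf8_lines(File, Lines) :-
    setup_call_cleanup(
        open(File, read, In, [encoding(utf8)]),
        read_string(In, _Length, Text),
        close(In)),
    split_string(Text, "\n", "", Lines).
```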
SWI-Prolog currently defines and supports the following encodings:

- octet: Default encoding for binary streams. This causes the stream to be read and written fully untranslated.
- ascii: 7-bit encoding in 8-bit bytes. Equivalent to iso_latin_1, but generates errors and warnings on encountering values above 127.
- iso_latin_1: 8-bit encoding supporting many Western languages. This causes the stream to be read and written fully untranslated.
- text: C library default locale encoding for text files. Files are read and written using the C library functions mbrtowc() and wcrtomb(). This may be the same as one of the other encodings; notably it may be the same as iso_latin_1 for Western languages and utf8 in a UTF-8 context.
- utf8: Multi-byte encoding of full UCS, compatible with ascii. See above.
- unicode_be: Unicode Big Endian. Reads input in pairs of bytes, most significant byte first. Can only represent 16-bit characters.
- unicode_le: Unicode Little Endian. Reads input in pairs of bytes, least significant byte first. Can only represent 16-bit characters.
Note that not all encodings can represent all characters. This implies that writing text to a stream may cause errors because the stream cannot represent these characters. The behaviour of a stream on these errors can be controlled using set_stream/2. Initially the terminal stream writes the characters using Prolog escape sequences while other streams generate an I/O exception.
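A small sketch of this, assuming the representation_errors property of set_stream/2 (file and predicate names are invented): the stream below is opened as ISO Latin-1 and asked to write Prolog escape sequences, rather than raise an exception, for characters the encoding cannot represent:

```
% write_latin1_with_escapes(+File, +Text): write Text to a Latin-1 file,
% emitting Prolog escape sequences for characters the encoding cannot
% represent instead of raising an I/O exception.
write_latin1_with_escapes(File, Text) :-
    setup_call_cleanup(
        open(File, write, Out, [encoding(iso_latin_1)]),
        (   set_stream(Out, representation_errors(prolog)),
            write(Out, Text)
        ),
        close(Out)).
```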
2.19.1.1 BOM: Byte Order Mark
From section 2.19.1, you may have got the impression that text files are complicated. This section deals with a related topic that often makes life easier for the user, but gives the programmer something else to worry about: the BOM or Byte Order Mark, a technique for identifying Unicode text files as well as the encoding they use. Such files start with the Unicode character 0xFEFF, a zero-width, non-breaking space character. This sequence is unlikely to be the start of a non-Unicode file and uniquely distinguishes the various Unicode file formats. As it is a zero-width blank, it does not even produce any output. That solves all problems, or almost: some formats start off as US-ASCII and may contain an encoding mark to switch to UTF-8, such as the encoding="UTF-8" in an XML header. Such formats often explicitly forbid the use of a UTF-8 BOM. In other cases there is additional information revealing the encoding, making the use of a BOM redundant or even illegal.
The BOM is handled by the SWI-Prolog open/4 predicate. By default, text files are probed for a BOM when opened for reading. If a BOM is found, the encoding is set accordingly and the property bom(true) is available through stream_property/2. When opening a file for writing, writing a BOM can be requested using the option bom(true) with open/4.
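To illustrate both directions, here is a sketch (file and predicate names are ours) that writes a UTF-8 file starting with a BOM and then re-opens it, letting open/4 detect the BOM and set the encoding:

```
% bom_roundtrip(+File): write File as UTF-8 starting with a BOM, then
% re-open it and report the encoding and BOM property detected by open/4.
bom_roundtrip(File) :-
    setup_call_cleanup(
        open(File, write, Out, [encoding(utf8), bom(true)]),
        format(Out, "caf\u00e9~n", []),          % text with a non-ASCII character
        close(Out)),
    setup_call_cleanup(
        open(File, read, In, []),
        (   stream_property(In, encoding(Enc)),
            (   stream_property(In, bom(true)) -> BOM = true ; BOM = false ),
            format("encoding: ~w, bom: ~w~n", [Enc, BOM])
        ),
        close(In)).
```

Re-opening the file without an explicit encoding should report utf8 and bom: true, since the BOM written in the first step is detected while probing the file.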