In many languages, every character can be expressed by a single byte.
Multi-byte character encodings are used for languages that need more characters than a single byte can represent.
mbstring was developed to handle Japanese characters. However, many mbstring functions
can also handle character encodings other than Japanese ones.
A multi-byte character encoding represents a single character with consecutive bytes. Some
character encodings use shift (escape) sequences to start and end multi-byte character strings.
Therefore, a multi-byte character string may be corrupted when it is divided or counted unless
a multi-byte-safe method is used. This module provides multi-byte-safe
string functions and other utility functions such as conversion functions.
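As a minimal sketch of why byte-oriented functions are unsafe on multi-byte strings, the following compares strlen() and substr() with their mbstring counterparts. The sample string and variable names are illustrative assumptions; the string is written as byte escapes so the result does not depend on how the script file is saved.

```php
<?php
// "日本語" as UTF-8 bytes: three characters, three bytes each.
$text = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

$byteLength = strlen($text);             // counts bytes
$charLength = mb_strlen($text, 'UTF-8'); // counts characters

// substr() cuts at byte offsets, so it can split a character in half.
$byteCut = substr($text, 0, 4);
// mb_substr() cuts at character offsets, so the result stays well-formed.
$charCut = mb_substr($text, 0, 2, 'UTF-8');

// Check whether each result is still valid UTF-8.
$byteCutIsValid = mb_check_encoding($byteCut, 'UTF-8');
$charCutIsValid = mb_check_encoding($charCut, 'UTF-8');
```

Here the byte-based cut ends in the middle of the second character, while the character-based cut returns the first two characters intact.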
PHP is designed primarily for ISO-8859-1, so some multi-byte character encodings do
not work well with it. It is therefore important to set mbstring.internal_encoding to a
character encoding that works with PHP.
The following are examples of internal character encodings that work with PHP and of ones that do
NOT.
Character encodings that work with PHP:
ISO-8859-*, EUC-JP, UTF-8
Character encodings that do NOT work with PHP:
JIS, SJIS
A character encoding that does not work with PHP may be converted with mbstring's
HTTP input/output conversion feature.
Note: SJIS should not be used as the internal encoding unless you are familiar with
parsers/compilers, character encodings and character encoding issues.
Note: If you use a database with PHP, it is recommended that you use the same character
encoding for both the database and the internal encoding, for ease of use and better
performance.
PostgreSQL supports a client character encoding that differs from the backend
character encoding. See the PostgreSQL manual for details.
mbstring is an extension module. You must enable it with the configure
script. Refer to the Install section for details.
The following configure options are related to the mbstring module.
- --enable-mbstring : Enables mbstring functions. This option is required
to use mbstring functions.
- --enable-mbstr-enc-trans : Enables HTTP input character encoding conversion using
the mbstring conversion engine. If this feature is enabled, HTTP input character encoding may
be converted to mbstring.internal_encoding automatically.
HTTP input/output character encoding conversion may also convert binary data. Users are
expected to control character encoding conversion themselves if binary data is used for HTTP input/output.
If enctype for an HTML form is set to multipart/form-data,
mbstring does not convert the character encoding of POST data. In that case, strings
need to be converted to the internal character encoding in the script.
- HTTP Input
There is no way to control HTTP input character conversion from a PHP script. To disable HTTP
input character conversion, it must be done in php.ini.
Example 1. Disable HTTP input conversion in php.ini
;; Disable HTTP Input conversion
mbstring.http_input = pass
When using PHP as an Apache module, it is possible to override php.ini settings per virtual
host in httpd.conf or per directory with .htaccess. Refer to the Configuration section and the Apache manual for details.
- HTTP Output
There are several ways to enable output character encoding conversion. One is to use
php.ini; another is to call ob_start() with mb_output_handler() as the callback
function.
Note: For PHP3-i18n users, mbstring's output conversion differs from that of PHP3-i18n.
Character encoding is converted using an output buffer.
Example 2. php.ini setting example
;; Enable output character encoding conversion for all PHP pages
;; Enable Output Buffering
output_buffering = On
;; Set mb_output_handler to enable output conversion
output_handler = mb_output_handler
Example 3. Script example
<?php
// Enable output character encoding conversion only for this page
// Set HTTP output character encoding to SJIS
mb_http_output('SJIS');
// Start buffering and specify "mb_output_handler" as
// callback function
ob_start('mb_output_handler');
?>
The following character encodings are currently supported by the mbstring module.
A character encoding may be specified for the encoding
parameter of mbstring functions.
UCS-4, UCS-4BE, UCS-4LE, UCS-2, UCS-2BE,
UCS-2LE, UTF-32, UTF-32BE, UTF-32LE,
UTF-16, UTF-16BE, UTF-16LE, UTF-8, UTF-7, ASCII,
EUC-JP, SJIS, eucJP-win, SJIS-win, ISO-2022-JP,
JIS, ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-4,
ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8,
ISO-8859-9, ISO-8859-10, ISO-8859-13, ISO-8859-14,
ISO-8859-15, byte2be, byte2le, byte4be, byte4le,
BASE64, 7bit, 8bit and UTF7-IMAP.
php.ini entries that accept an encoding name also accept "auto" and
"pass". mbstring functions that accept an encoding name also accept
"auto".
If "pass" is set, no character encoding conversion is performed.
If "auto" is set, it is expanded to "ASCII,JIS,UTF-8,EUC-JP,SJIS".
See also mb_detect_order().
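The detection list can also be set from a script. The following is a small sketch, assuming a current PHP with mbstring loaded; it writes out the list that "auto" is described as expanding to and then detects the encoding of a sample string. The sample bytes and variable names are assumptions made for illustration.

```php
<?php
// Write out the list that "auto" is described as expanding to.
mb_detect_order(array('ASCII', 'JIS', 'UTF-8', 'EUC-JP', 'SJIS'));

// "日本語" as UTF-8 bytes (sample input chosen for illustration).
$sample = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

// Passing null makes mb_detect_encoding() use the current detect order;
// strict mode rejects candidates for which the bytes are invalid.
$detected = mb_detect_encoding($sample, null, true);
```

For this input only UTF-8 accepts the byte sequence, so the order does not matter; for short or ASCII-only input, several candidates can match and the order decides the result.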
Note: A "supported character encoding" is not necessarily usable as the internal character
encoding.
- mbstring.internal_encoding defines the default internal character encoding.
- mbstring.http_input defines the default HTTP input character encoding.
- mbstring.http_output defines the default HTTP output character encoding.
- mbstring.detect_order defines the default character encoding detection order. See also mb_detect_order().
- mbstring.substitute_character defines the character to substitute for invalid
character encoding.
Web browsers are supposed to submit a form in the same character encoding as the page that contains it.
However, browsers may not do so. See mb_http_input() to detect the character encoding used by
browsers.
If enctype is set to multipart/form-data in an HTML form,
mbstring does not convert the character encoding of POST data. The script must convert
the data itself if conversion is needed.
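A minimal sketch of such a manual conversion follows. The helper name convert_fields() is hypothetical, and the sketch assumes the form is known to have been submitted as SJIS and that the internal encoding is UTF-8; the array stands in for $_POST.

```php
<?php
// Hypothetical helper: convert every string value in $fields from $from to $to.
function convert_fields(array $fields, $to, $from)
{
    foreach ($fields as $name => $value) {
        if (is_string($value)) {
            $fields[$name] = mb_convert_encoding($value, $to, $from);
        }
    }
    return $fields;
}

// Simulated POST data: "日本語" as Shift_JIS bytes (assumed source encoding).
$post = array('title' => "\x93\xFA\x96\x7B\x8C\xEA", 'id' => '42');
$converted = convert_fields($post, 'UTF-8', 'SJIS');
```

Binary fields such as uploaded file contents should be left out of a conversion like this, since converting them would corrupt the data.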
Although browsers are smart enough to detect the character encoding of HTML, it
is better to set the charset in the HTTP header. Change default_charset according to the character
encoding in use.
Example 4. php.ini setting example
;; Set default internal encoding
;; Note: Make sure to use a character encoding that works with PHP
mbstring.internal_encoding = UTF-8 ; Set internal encoding to UTF-8
;; Set default HTTP input character encoding
;; Note: Script cannot change http_input setting.
mbstring.http_input = pass ; No conversion.
mbstring.http_input = auto ; Set HTTP input to auto
; "auto" is expanded to "ASCII,JIS,UTF-8,EUC-JP,SJIS"
mbstring.http_input = SJIS ; Set HTTP input to SJIS
mbstring.http_input = UTF-8,SJIS,EUC-JP ; Specify order
;; Set default HTTP output character encoding
mbstring.http_output = pass ; No conversion
mbstring.http_output = UTF-8 ; Set HTTP output encoding to UTF-8
;; Set default character encoding detection order
mbstring.detect_order = auto ; Set detect order to auto
mbstring.detect_order = ASCII,JIS,UTF-8,SJIS,EUC-JP ; Specify order
;; Set default substitute character
mbstring.substitute_character = 12307 ; Specify Unicode value
mbstring.substitute_character = none ; Do not print character
mbstring.substitute_character = long ; Long Example: U+3000,JIS+7E7E
Example 5. php.ini setting for EUC-JP users
;; Disable Output Buffering
output_buffering = Off
;; Set HTTP header charset
default_charset = EUC-JP
;; Set HTTP input encoding conversion to auto
mbstring.http_input = auto
;; Convert HTTP output to EUC-JP
mbstring.http_output = EUC-JP
;; Set internal encoding to EUC-JP
mbstring.internal_encoding = EUC-JP
;; Do not print invalid characters
mbstring.substitute_character = none
Example 6. php.ini setting for SJIS users
;; Enable Output Buffering
output_buffering = On
;; Set mb_output_handler to enable output conversion
output_handler = mb_output_handler
;; Set HTTP header charset
default_charset = Shift_JIS
;; Set http input encoding conversion to auto
mbstring.http_input = auto
;; Convert to SJIS
mbstring.http_output = SJIS
;; Set internal encoding to EUC-JP
mbstring.internal_encoding = EUC-JP
;; Do not print invalid characters
mbstring.substitute_character = none
Most Japanese characters need more than 1 byte per character. In addition, several
character encoding schemes are used in a Japanese environment: EUC-JP, Shift_JIS (SJIS)
and ISO-2022-JP (JIS). As Unicode becomes popular, UTF-8 is also used. To develop
Web applications for a Japanese environment, it is important to use the appropriate character encoding for the task
at hand, whether that is HTTP input/output, an RDBMS or e-mail.
- Storage for a character can be up to six bytes.
- A multi-byte character is usually twice as wide as a single-byte character.
Wider characters are called "zen-kaku", meaning full width; narrower characters are called
"han-kaku", meaning half width. "zen-kaku" characters are usually fixed width.
- Some character encodings define shift (escape) sequences for entering/exiting multi-byte
character strings.
- ISO-2022-JP must be used for SMTP/NNTP.
- "i-mode" web sites are supposed to use SJIS.
Multi-byte character encodings and their related issues are very complex. It is impossible to
cover them in sufficient detail here. Please refer to the following URLs and other resources for further
reading.
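Two of the points listed above, display width and escape sequences, can be checked directly. The following is a sketch under the assumption that the input strings are UTF-8; the sample strings and variable names are chosen for illustration only.

```php
<?php
// Full-width ("zen-kaku") katakana "アイウ" as UTF-8 bytes.
$zenkaku = "\xE3\x82\xA2\xE3\x82\xA4\xE3\x82\xA6";
// Half-width ("han-kaku") katakana "ｱｲｳ" as UTF-8 bytes.
$hankaku = "\xEF\xBD\xB1\xEF\xBD\xB2\xEF\xBD\xB3";

// mb_strwidth() counts full-width characters as 2 and half-width as 1.
$zenWidth = mb_strwidth($zenkaku, 'UTF-8');
$hanWidth = mb_strwidth($hankaku, 'UTF-8');

// "K" converts han-kaku katakana to zen-kaku; "V" merges voiced sound marks.
$widened = mb_convert_kana($hankaku, 'KV', 'UTF-8');

// ISO-2022-JP wraps multi-byte text in escape sequences:
// ESC $ B switches to the kanji set, ESC ( B switches back to ASCII.
$jis = mb_convert_encoding("\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E", 'ISO-2022-JP', 'UTF-8');
$jisHex = bin2hex($jis);
```

The hexadecimal dump of $jis starts with 1b2442 and ends with 1b2842, which are the two escape sequences; cutting the string between them without a multi-byte-safe function would leave the text in the wrong shift state.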
- Table of Contents
- mb_language -- Set/Get current language
- mb_parse_str -- Parse GET/POST/COOKIE data
and set global variable
- mb_internal_encoding -- Set/Get
internal character encoding
- mb_http_input -- Detect HTTP input
character encoding
- mb_http_output -- Set/Get HTTP output
character encoding
- mb_detect_order -- Set/Get character
encoding detection order
- mb_substitute_character -- Set/Get substitution character
- mb_output_handler -- Callback function
converts character encoding in output buffer
- mb_preferred_mime_name -- Get MIME
charset string
- mb_strlen -- Get string length
- mb_strpos -- Find position of first occurrence
of string in a string
- mb_strrpos -- Find position of last
occurrence of a string in a string
- mb_substr -- Get part of string
- mb_strcut -- Get part of string
- mb_strwidth -- Return width of string
- mb_strimwidth -- Get truncated string with
specified width
- mb_convert_encoding -- Convert
character encoding
- mb_detect_encoding -- Detect character
encoding
- mb_convert_kana -- Convert one type of "kana" to
another ("zen-kaku", "han-kaku" and more)
- mb_encode_mimeheader -- Encode
string for MIME header
- mb_decode_mimeheader -- Decode
string in MIME header field
- mb_convert_variables -- Convert
character code in variable(s)
- mb_encode_numericentity --
Encode character to HTML numeric string reference
- mb_decode_numericentity --
Decode HTML numeric string reference to character
- mb_send_mail -- Send encoded mail.
- mb_get_info -- Get internal settings of
mbstring