Japanese Input: Detecting Half-Width (Han Kaku) Characters

Problem

Our web application has some modules in which only Japanese characters may be entered. Moreover, they must be full-width characters (Zen Kaku is the Japanese term); half-width characters (Han Kaku) must not be allowed.

Examples of full-width (Zen Kaku) characters are: ヴ, ガ, ギ and examples of half-width (Han Kaku) characters are: ｳﾞ, ｶﾞ, ｷﾞ.

As you can see, they are the same characters, but Han Kaku is a bit narrower than Zen Kaku. This is what the form looks like:

[Screenshot of the form: hankaku-blog-001]

Initial Research

I have read this resource, "Need code to prevent two-byte characters in form fields", but it doesn’t work for my case.

So I investigated it myself. Comparing strlen and mb_strlen on the example characters, strlen returns 3 bytes per Zen Kaku character and 6 bytes per Han Kaku character, while mb_strlen returns a length of 1 per Zen Kaku character and 2 per Han Kaku character. The reason is that each of those Han Kaku characters is stored as a base kana followed by a separate voiced sound mark.
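The comparison can be reproduced with a small script. This is only a sketch, assuming PHP 7 or later (for the `\u{...}` string escapes) with the mbstring extension loaded; the variable names are mine and not part of the project code.

```php
<?php
// Full-width GA is a single code point; half-width GA is KA plus a voiced mark.
$zenKaku = "\u{30AC}";         // U+30AC
$hanKaku = "\u{FF76}\u{FF9E}"; // U+FF76 followed by U+FF9E

// strlen() counts bytes, mb_strlen() counts code points.
$lengths = array(
    'zen' => array(strlen($zenKaku), mb_strlen($zenKaku, 'UTF-8')),
    'han' => array(strlen($hanKaku), mb_strlen($hanKaku, 'UTF-8')),
);

printf("Zen Kaku: %d bytes, %d code point(s)\n", $lengths['zen'][0], $lengths['zen'][1]);
printf("Han Kaku: %d bytes, %d code point(s)\n", $lengths['han'][0], $lengths['han'][1]);
```

Every character involved takes 3 bytes in UTF-8, so the byte count only reflects how many code points the string holds, not how wide they are.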

So you may think that combining strlen and mb_strlen will work? No! I only know that mb_strlen returns a length of 2 for a single Han Kaku character because I can see on screen that it is one character. The program cannot see that, so it has no way to tell that those two units form one character.

Worse, only 32 of the Han Kaku characters are reported as length 2 by mb_strlen. Another 64 Han Kaku characters are reported as a single character by mb_strlen, which makes the problem even worse.
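To illustrate why the lengths cannot separate the two widths, here is a rough sketch that prints the lengths and the code points of three sample strings. It assumes PHP 7.2 or later with mbstring (for `mb_ord`); `codePoints()` is a hypothetical helper of mine, not something from the application.

```php
<?php
// Hypothetical helper: list the code points of a UTF-8 string as "U+XXXX".
function codePoints($str)
{
    $out = array();
    foreach (preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY) as $ch) {
        $out[] = sprintf('U+%04X', mb_ord($ch, 'UTF-8'));
    }
    return $out;
}

$samples = array(
    'full-width A'  => "\u{30A2}",         // full-width katakana A
    'half-width A'  => "\u{FF71}",         // half-width A, no voiced mark
    'half-width GA' => "\u{FF76}\u{FF9E}", // half-width KA plus voiced mark
);

foreach ($samples as $label => $str) {
    printf(
        "%-14s strlen=%d mb_strlen=%d %s\n",
        $label,
        strlen($str),
        mb_strlen($str, 'UTF-8'),
        implode(' ', codePoints($str))
    );
}
```

The first two samples report identical lengths even though one is Zen Kaku and the other is Han Kaku. Only the code point values tell them apart.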

Even if we map the whole set of Han Kaku characters into an array, there is still no way to recognize them by length, because some of them are detected as two characters.

The list was posted in a user comment on the PHP manual: http://www.php.net/manual/en/function.mb-convert-kana.php#81221
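That manual page documents mb_convert_kana itself, which suggests a different approach from the one I ended up using: convert the input to full-width and see whether anything changed. The sketch below is an untested idea rather than the project's code; it assumes PHP 7 or later with mbstring, and `containsHanKakuKana()` is a hypothetical name.

```php
<?php
// Hypothetical helper: true if $str contains half-width katakana.
function containsHanKakuKana($str)
{
    // "K" converts half-width katakana to full-width katakana.
    // "V" merges a kana and a following voiced mark into one character
    // (it only has an effect together with "K" or "H").
    $converted = mb_convert_kana($str, 'KV', 'UTF-8');

    // If the conversion changed the string, it held half-width kana.
    return $converted !== $str;
}

var_dump(containsHanKakuKana("\u{FF76}\u{FF9E}")); // half-width GA
var_dump(containsHanKakuKana("\u{30AC}"));         // full-width GA
```

This only covers half-width katakana. It says nothing about non-Japanese input, so it would not replace the full range check on its own.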

The most effective method I’ve come up with is to map all the Unicode ranges and use if-else conditions.

Resources

From this resource: http://www.unicode.org/Public/5.2.0/ucd/UnicodeData-5.2.0d4.txt I have listed all Japanese character Unicode ranges, including the half-width and full-width ones.

Another, more precise table dedicated to Japanese characters, which clearly indicates the type of each character: http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml

Another resource, which allows me to convert UTF-8 characters into Unicode values: http://randomchaos.com/documents/?source=php_and_unicode

And to complete the recipe, another resource that allows me to convert hexadecimal to decimal and vice versa: http://www.parkenet.com/apl/HexDecConverter.html
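The hexadecimal to decimal step can also be done in PHP itself with the built-in hexdec and dechex functions. A minimal sketch, using the boundaries of the half-width range as sample values (the variable names are only for illustration):

```php
<?php
// Boundaries of the half-width range as they appear in the Unicode tables.
$startHex = 'FF61';
$endHex   = 'FF9F';

// Hexadecimal string to decimal integer.
$start = hexdec($startHex);
$end   = hexdec($endHex);

printf("U+%s = %d\n", $startHex, $start);
printf("U+%s = %d\n", $endHex, $end);

// And back again: decimal integer to upper-case hexadecimal string.
$roundTrip = strtoupper(dechex($start));
printf("%d = U+%s\n", $start, $roundTrip);
```

The decimal results, 65377 and 65439, are the same numbers that appear in the Han Kaku condition of the detection function.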

Final Solution

So here is the final compilation based on my project.

This is the JavaScript (jQuery) code for the Ajax Add process:

	if (confirm(MSG_CONFIRM_REG))
	{
		var url = baseUrl + "/mntname/ajaxadd/Data_Division/" + dataDivision + "/Name_Cd/" + nameCd + "/";
		
		$.post(url, {
			Name1: name1,
			Name2: name2
			},
			function(data) {
				switch (data)
				{
				case "1":
					alert(MSG_REGISTERED);
					processSearch();
					break;
				case "0":
					alert(MSG_ERROR);
					break;
				case "2":
					alert(MSG_DUPLICATE_EXISTS);
					break;
				case "NOT_ZEN_KAKU":
					alert(MSG_NOT_ZEN_KAKU);
					$("#F_Name1").focus();
					break;
				default:
					alert(MSG_ERROR);
					break;
				}
		});
	}

This is for the controller (Zend Framework) part of my application – Converting UTF-8 into Unicode value array:

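    protected function _utf8ToUnicode($str)
    {
        // Assumes $str is valid UTF-8.
        $unicode = array();   // decoded code points
        $values = array();    // bytes of the multi-byte sequence being collected
        $lookingFor = 1;      // total number of bytes in the current sequence
        $length = strlen($str);

        for ($i = 0; $i < $length; $i++) {

            $thisValue = ord($str[$i]);

            if ($thisValue < 128) {
                // single-byte (ASCII) character
                $unicode[] = $thisValue;
            } else {

                if (count($values) == 0) {
                    // lead byte: 110xxxxx = 2 bytes, 1110xxxx = 3 bytes, 11110xxx = 4 bytes
                    if ($thisValue < 224) {
                        $lookingFor = 2;
                    } elseif ($thisValue < 240) {
                        $lookingFor = 3;
                    } else {
                        $lookingFor = 4;
                    }
                }

                $values[] = $thisValue;

                if (count($values) == $lookingFor) {

                    // keep the payload bits of each byte and join them
                    if ($lookingFor == 4) {
                        $number = (($values[0] % 8) * 262144) + (($values[1] % 64) * 4096)
                            + (($values[2] % 64) * 64) + ($values[3] % 64);
                    } elseif ($lookingFor == 3) {
                        $number = (($values[0] % 16) * 4096) + (($values[1] % 64) * 64) + ($values[2] % 64);
                    } else {
                        $number = (($values[0] % 32) * 64) + ($values[1] % 64);
                    }

                    $unicode[] = $number;
                    $values = array();
                    $lookingFor = 1;

                } // if

            } // if

        } // for
        return $unicode;
    }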
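To see what the three-byte branch of _utf8ToUnicode is doing, here is the same arithmetic worked through by hand for one character, half-width KA (U+FF76). This is a stand-alone sketch for illustration, assuming PHP 7 or later; the variable names are not from the controller.

```php
<?php
// The three UTF-8 bytes of half-width KA (U+FF76).
$bytes = array(0xEF, 0xBD, 0xB6);

// A lead byte 1110xxxx keeps its low 4 bits,
// each continuation byte 10xxxxxx keeps its low 6 bits.
$high = $bytes[0] % 16; // 239 % 16 = 15
$mid  = $bytes[1] % 64; // 189 % 64 = 61
$low  = $bytes[2] % 64; // 182 % 64 = 54

// 15 * 4096 + 61 * 64 + 54 = 61440 + 3904 + 54
$codePoint = $high * 4096 + $mid * 64 + $low;

printf("%d (U+%04X)\n", $codePoint, $codePoint);
```

The result, 65398, falls inside the 65377 to 65439 range that the detection function treats as Han Kaku.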

Detecting Japanese character function:

    /**
     * Classifies a single Unicode code point
     * Returns  0 if not japanese
     *          1 if Zen Kaku (full-width Roman characters and symbols)
     *          2 if Han Kaku
     *          3 if Not Han Kaku but Japanese Character (Hiragana, Kanji, etc)
     *
     * @param int $unicodeVal
     * @return int
     */
    protected function _isJapanese($unicodeVal)
    {
        $ret = 0;
        //unicodeVal is a single value only
        if ($unicodeVal == 8221)
        {
            //right double quotation
            $ret = 3;
        }
        elseif ($unicodeVal >= 12288 && $unicodeVal <= 12351)
        {
            //Japanese Style Punctuation
            $ret = 3;
        }
        elseif ($unicodeVal >= 12352 && $unicodeVal <= 12447)
        {
            //Hiragana
            $ret = 3;
        }
        elseif ($unicodeVal >= 12448 && $unicodeVal <= 12543)
        {
            //Katakana
            $ret = 3;
        }
        elseif ($unicodeVal >= 12784 && $unicodeVal <= 12799)
        {
            //Katakana phonetic extensions
            $ret = 3;
        }
        elseif ($unicodeVal >= 12800 && $unicodeVal <= 13054)
        {
            //Enclosed CJK letters and months
            $ret = 3;
        }
        elseif ($unicodeVal >= 65280 && $unicodeVal <= 65376)
        {
            //full width roman character (Zen Kaku)
            $ret = 1;
        }
        elseif ($unicodeVal >= 65377 && $unicodeVal <= 65439)
        {
            //half width character (Han Kaku)
            $ret = 2;
        }
        elseif ($unicodeVal >= 65504 && $unicodeVal <= 65510)
        {
            //full width character (Zen Kaku)
            $ret = 1;
        }
        elseif ($unicodeVal >= 65512 && $unicodeVal <= 65518)
        {
            //half width character (Han Kaku)
            $ret = 2;
        }
        elseif ($unicodeVal >= 19968 && $unicodeVal <= 40879)
        {
            //common and uncommon kanji
            $ret = 3;
        }
        elseif ($unicodeVal >= 13312 && $unicodeVal <= 19903)
        {
            //Rare Kanji
            $ret = 3;
        }

        return $ret;
    }
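The same kind of range check could also be written as a lookup table with hexadecimal literals, which is easier to compare against the Unicode charts. The sketch below is a hypothetical stand-alone variant covering only some of the ranges; `classifyCodePoint()` and `$ranges` are names I made up, and the category numbers follow the ones used by _isJapanese.

```php
<?php
// Each row is array(first code point, last code point, category).
$ranges = array(
    array(0x3000, 0x303F, 3), // Japanese style punctuation
    array(0x3040, 0x309F, 3), // Hiragana
    array(0x30A0, 0x30FF, 3), // Katakana
    array(0x4E00, 0x9FAF, 3), // common and uncommon kanji
    array(0xFF00, 0xFF60, 1), // full-width Roman characters (Zen Kaku)
    array(0xFF61, 0xFF9F, 2), // half-width characters (Han Kaku)
);

// Hypothetical helper: return the category of the first matching range.
function classifyCodePoint($codePoint, array $ranges)
{
    foreach ($ranges as $row) {
        list($first, $last, $category) = $row;
        if ($codePoint >= $first && $codePoint <= $last) {
            return $category;
        }
    }
    return 0; // not in any listed range
}

var_dump(classifyCodePoint(0x30AC, $ranges)); // full-width katakana GA
var_dump(classifyCodePoint(0xFF76, $ranges)); // half-width katakana KA
```

Adding or correcting a range then means editing one row of data instead of another elseif branch.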

Detecting Han Kaku character function:

    /**
     * Checks that the supplied string contains only full-width (Zen Kaku)
     * Japanese characters
     * @param string $str UTF-8 encoded input
     * @return bool false if a non-japanese or han-kaku character is found
     */
    protected function _detectZenKaku($str)
    {
        $unicode = $this->_utf8ToUnicode($str);
        $ret = true;

        foreach ($unicode as $uni)
        {
            $chk = $this->_isJapanese($uni);
            if ($chk == 0 || $chk == 2)
            {
                //non-japanese or han kaku found!
                $ret = false;
                break;
            }
        }

        return $ret;
    }
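If only the Han Kaku part of the check is needed, the two half-width ranges from _isJapanese can also be tested with a single regular expression. This is a sketch of an alternative, not the code used in my project; it assumes PHP 7 or later, and `hasHanKaku()` is a hypothetical name.

```php
<?php
// Hypothetical helper: true if $str contains a character from the two
// half-width ranges U+FF61 to U+FF9F and U+FFE8 to U+FFEE.
function hasHanKaku($str)
{
    // The /u modifier makes PCRE treat the pattern and subject as UTF-8,
    // so \x{FF61} refers to a code point and not to a byte.
    return preg_match('/[\x{FF61}-\x{FF9F}\x{FFE8}-\x{FFEE}]/u', $str) === 1;
}

var_dump(hasHanKaku("\u{FF76}\u{FF9E}")); // half-width GA
var_dump(hasHanKaku("\u{30AC}"));         // full-width GA
```

Unlike _detectZenKaku, this does not reject non-Japanese characters, so it covers only half of the requirement.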

And the action (Ajax) controller:

						$proceed = true;
						if ($params['Data_Division'] == 2)
						{
							if (!$this->_detectZenKaku($post['Name1']))
							{
								$proceed = false;
							}		
						}
						if ($proceed)
						{
							$data = $name->add(
										array(
											'Data_Division'		=> (int)$params['Data_Division'],
											'Name_Cd'			=> (int)$params['Name_Cd'],
											'Name1'				=> $post['Name1'],
											'Name2'				=> $post['Name2'],
											'Opt_Cd1'			=> 0,
											'Opt_Cd2'			=> 0,
											'Del_Flg'			=> 0,
											'Update_Date'		=> date('Y-m-d H:i:s'),
											'Update_Cd'			=> (int)$_COOKIE['User_Cd'],
											'Update_Name'		=> $_COOKIE['Name']
											)
										);
							if ($data)
							{
								$ret = 1;
							}
						}
						else
						{
							$ret = 'NOT_ZEN_KAKU';
						}

So that solves the problem.

2 thoughts on “Japanese Input: Detecting Half-Width (Han Kaku) Characters”

  1. If you have something to add or correct, please post your comment to improve this post. I hope this will help somebody who will be employed in a Japanese company or something that deals with Japanese characters.
