Handling GB2312-to-UTF-8 conversion in Go

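Problem description:

A colleague recently used the iconv-go library at work to convert character encodings, ran into the error below, and asked me about it, so I took the opportunity to look into it myself.

The error message was: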

invalid or incomplete multibyte or wide character
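
This wording appears to be the text associated with the EILSEQ errno (illegal byte sequence). cgo reports a C function's errno to Go as a syscall.Errno, so one way to recognize this failure in code is to compare against syscall.EILSEQ instead of matching the message string. The following is a minimal sketch; it assumes the library passes the errno through unchanged, and isIllegalSequence and errSample are hypothetical names introduced only for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// isIllegalSequence reports whether err is, or wraps, the EILSEQ errno.
func isIllegalSequence(err error) bool {
	return errors.Is(err, syscall.EILSEQ)
}

// errSample stands in for an error returned by a conversion call.
// It is a hypothetical value built here for illustration only.
var errSample = fmt.Errorf("convert body: %w", syscall.EILSEQ)

func main() {
	fmt.Println(isIllegalSequence(errSample))      // true
	fmt.Println(isIllegalSequence(syscall.EINVAL)) // false
	fmt.Println(isIllegalSequence(nil))            // false
}
```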

The Go conversion library in use:

github.com/djimenez/iconv-go

The function being called:

body, err = iconv.ConvertString(body, "GBK", "utf-8")

Approach:

Open github.com/djimenez/iconv-go and read the source.

First, iconv.ConvertString is implemented in iconv.go:

func ConvertString(input string, fromEncoding string, toEncoding string) (output string, err error) {
	// create a temporary converter
	converter, err := NewConverter(fromEncoding, toEncoding)
	if err == nil {
		// convert the string
		output, err = converter.ConvertString(input)
		// close the converter
		converter.Close()
	}
	return
}

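From this we can see that it calls

NewConverter(fromEncoding, toEncoding)

to create a new Converter struct, and then calls this method on it:

output, err = converter.ConvertString(input)

Tracing further, the Converter type and its constructor are found in converter.go: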

type Converter struct {
	context C.iconv_t
	open    bool
}

// Initialize a new Converter. If fromEncoding or toEncoding are not supported by
// iconv then an EINVAL error will be returned. An ENOMEM error maybe returned if
// there is not enough memory to initialize an iconv descriptor
func NewConverter(fromEncoding string, toEncoding string) (converter *Converter, err error) {
	converter = new(Converter)
	// convert to C strings
	toEncodingC := C.CString(toEncoding)
	fromEncodingC := C.CString(fromEncoding)
	// open an iconv descriptor
	converter.context, err = C.iconv_open(toEncodingC, fromEncodingC)
	// free the C Strings
	C.free(unsafe.Pointer(toEncodingC))
	C.free(unsafe.Pointer(fromEncodingC))
	// check err
	if err == nil {
		// no error, mark the context as open
		converter.open = true
	}
	return
}

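As we can see, under the hood it calls into the C library through cgo to do the conversion:

converter.context, err = C.iconv_open(toEncodingC, fromEncodingC)

The C library documentation (man iconv_open) says the following in its DESCRIPTION section: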

The empty encoding name "" is equivalent to "char": it denotes the locale dependent character encoding.

When the string "//TRANSLIT" is appended to tocode, transliteration is activated. This means that when a character cannot be represented in the target character set, it can be approximated through one or several similarly looking characters.

When the string "//IGNORE" is appended to tocode, characters that cannot be represented in the target character set will be silently discarded.

The resulting conversion descriptor can be used with iconv any number of times. It remains valid until deallocated using iconv_close.

A conversion descriptor contains a conversion state. After creation using iconv_open, the state is in the initial state. Using iconv modifies the descriptor's conversion state. (This implies that a conversion descriptor can not be used in multiple threads simultaneously.) To bring the state back to the initial state, use iconv with NULL as inbuf argument.

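The key sentence is this one:

When the string "//IGNORE" is appended to tocode, characters that cannot be represented in the target character set will be silently discarded.

In other words, if "//IGNORE" is appended to tocode, characters that cannot be represented in tocode are dropped automatically. Good, that is exactly what I want.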

Given this chain of calls:

ConvertString(input string, fromEncoding string, toEncoding string)
NewConverter(fromEncoding string, toEncoding string) (converter *Converter, err error)
C.iconv_open(toEncodingC, fromEncodingC)

we only need to pass //IGNORE through to the C library for it to take effect.
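
Since toEncoding is handed to iconv_open as-is, passing the option amounts to plain string concatenation on the Go side. A minimal sketch of a small helper follows; withIgnore is a hypothetical name and is not part of iconv-go.

```go
package main

import (
	"fmt"
	"strings"
)

// withIgnore returns toEncoding with "//IGNORE" appended, unless the
// suffix is already present (compared case-insensitively).
func withIgnore(toEncoding string) string {
	if strings.Contains(strings.ToUpper(toEncoding), "//IGNORE") {
		return toEncoding
	}
	return toEncoding + "//IGNORE"
}

func main() {
	fmt.Println(withIgnore("utf-8"))         // utf-8//IGNORE
	fmt.Println(withIgnore("utf-8//ignore")) // utf-8//ignore
}
```

The returned string would then be used as the third argument of iconv.ConvertString.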

So the code becomes:

body, err = iconv.ConvertString(body, "GBK", "utf-8//IGNORE")

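After testing, err is no longer returned. Done. Keep in mind that, as the man page says, the discarded characters are dropped silently.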
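
Because //IGNORE drops data without reporting it, it may be worth checking where the input is malformed before relying on the converted result. The following is a rough sketch, assuming the input is meant to be plain GBK (lead byte 0x81-0xFE, trail byte 0x40-0xFE except 0x7F). firstInvalidGBK is a hypothetical helper; it checks byte ranges only and does not verify that each pair maps to an assigned character.

```go
package main

import "fmt"

// firstInvalidGBK returns the index of the first byte that cannot begin
// or complete a GBK sequence, or -1 if the whole slice is well-formed.
// Only byte ranges are checked, not whether a pair is an assigned character.
func firstInvalidGBK(b []byte) int {
	for i := 0; i < len(b); {
		c := b[i]
		switch {
		case c < 0x80: // single-byte ASCII
			i++
		case c == 0x80 || c == 0xFF: // outside the GBK lead-byte range
			return i
		default: // lead byte 0x81-0xFE, needs exactly one trail byte
			if i+1 >= len(b) {
				return i // incomplete sequence at end of input
			}
			t := b[i+1]
			if t < 0x40 || t == 0x7F || t == 0xFF {
				return i // invalid trail byte
			}
			i += 2
		}
	}
	return -1
}

func main() {
	fmt.Println(firstInvalidGBK([]byte{0xC4, 0xE3, 0xBA, 0xC3})) // -1
	fmt.Println(firstInvalidGBK([]byte{0xC4, 0xE3, 0xBA}))       // 2
	fmt.Println(firstInvalidGBK([]byte{'a', 0xFF, 'b'}))         // 1
}
```

A result other than -1 points at the first byte that a strict converter would reject, which helps decide whether silently discarding it is acceptable.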

To restate the solution:

body, err = iconv.ConvertString(body, "GBK", "utf-8//IGNORE")