Handling GB2312-to-UTF-8 conversion in Go

Editor: 种花家  Revised by: 德玛西亚  Source: 利志分享  2023-09-11

If you have ever rewritten old PHP or Java code in Go, you have probably run into the problem of converting GB2312 to UTF-8.

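Problem description:

A colleague recently used the iconv-go library at work to convert character encodings and hit the error below. He asked me about it, so I dug into the library myself.

The error message was: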

invalid or incomplete multibyte or wide character

The Go conversion library in use was:

github.com/djimenez/iconv-go

The function called was:

body, err = iconv.ConvertString(body, "GBK", "utf-8")
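Before changing the conversion itself, it can help to see where the input stops looking like GBK. The following is a minimal sketch, not part of iconv-go: firstInvalidGBK is a hypothetical helper, and it only checks the byte structure I am assuming for GBK (lead byte 0x81-0xFE, trail byte 0x40-0xFE except 0x7F). It does not check whether a well-formed pair is actually assigned to a character.

```go
package main

import "fmt"

// firstInvalidGBK (hypothetical helper) returns the byte offset of the first
// sequence that is not structurally valid GBK, or -1 if none is found.
// Assumed layout: single bytes below 0x80, or a lead byte 0x81-0xFE followed
// by a trail byte 0x40-0xFE other than 0x7F.
func firstInvalidGBK(b []byte) int {
	for i := 0; i < len(b); {
		c := b[i]
		switch {
		case c < 0x80:
			// Plain ASCII byte.
			i++
		case c >= 0x81 && c <= 0xFE:
			if i+1 >= len(b) {
				return i // lead byte with nothing after it: incomplete sequence
			}
			t := b[i+1]
			if t < 0x40 || t == 0x7F || t == 0xFF {
				return i // trail byte outside the assumed range
			}
			i += 2
		default:
			return i // 0x80 or 0xFF cannot start a sequence in this layout
		}
	}
	return -1
}

func main() {
	good := []byte{'g', 'o', 0xD6, 0xD0, 0xCE, 0xC4} // two ASCII bytes, then two double-byte pairs
	truncated := good[:5]                            // last pair cut in half

	fmt.Println(firstInvalidGBK(good))      // -1
	fmt.Println(firstInvalidGBK(truncated)) // 4
}
```

An offset near the end of the input usually points to truncated data (for example a body read in chunks), while an offset in the middle suggests the input is not really GBK or is corrupted.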

Approach:

Open github.com/djimenez/iconv-go and read the source.

First, iconv.ConvertString is implemented in iconv.go:

func ConvertString(input string, fromEncoding string, toEncoding string) (output string, err error) {
	// create a temporary converter
	converter, err := NewConverter(fromEncoding, toEncoding)
	if err == nil {
		// convert the string
		output, err = converter.ConvertString(input)
		// close the converter
		converter.Close()
	}
	return
}

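From this we can see that it calls

NewConverter(fromEncoding, toEncoding)

to create a new Converter struct, and then calls that struct's method

output, err = converter.ConvertString(input)

Tracing further, converter.go contains the struct and its constructor: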

type Converter struct {
	context C.iconv_t
	open    bool
}

// Initialize a new Converter. If fromEncoding or toEncoding are not supported by
// iconv then an EINVAL error will be returned. An ENOMEM error maybe returned if
// there is not enough memory to initialize an iconv descriptor
func NewConverter(fromEncoding string, toEncoding string) (converter *Converter, err error) {
	converter = new(Converter)
	// convert to C strings
	toEncodingC := C.CString(toEncoding)
	fromEncodingC := C.CString(fromEncoding)
	// open an iconv descriptor
	converter.context, err = C.iconv_open(toEncodingC, fromEncodingC)
	// free the C Strings
	C.free(unsafe.Pointer(toEncodingC))
	C.free(unsafe.Pointer(fromEncodingC))
	// check err
	if err == nil {
		// no error, mark the context as open
		converter.open = true
	}
	return
}

As you can see, under the hood it calls the C iconv library through cgo:

converter.context, err = C.iconv_open(toEncodingC, fromEncodingC)

The DESCRIPTION section of the C library manual (man iconv_open) says:

The empty encoding name "" is equivalent to "char": it denotes the locale dependent character encoding.

When the string "//TRANSLIT" is appended to tocode, transliteration is activated. This means that when a character cannot be represented in the target character set, it can be approximated through one or several similarly looking characters.

When the string "//IGNORE" is appended to tocode, characters that cannot be represented in the target character set will be silently discarded.

The resulting conversion descriptor can be used with iconv any number of times. It remains valid until deallocated using iconv_close.

A conversion descriptor contains a conversion state. After creation using iconv_open, the state is in the initial state. Using iconv modifies the descriptor's conversion state. (This implies that a conversion descriptor can not be used in multiple threads simultaneously.) To bring the state back to the initial state, use iconv with NULL as inbuf argument.

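The key sentence is this one:

When the string "//IGNORE" is appended to tocode, characters that cannot be represented in the target character set will be silently discarded.

In other words, appending "//IGNORE" to tocode makes iconv silently drop any characters that cannot be represented in tocode. That is exactly what I wanted. Keep in mind that the dropped characters are simply missing from the output.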

Given this chain of calls:

ConvertString(input string, fromEncoding string, toEncoding string)
NewConverter(fromEncoding string, toEncoding string) (converter *Converter, err error)
C.iconv_open(toEncodingC, fromEncodingC)

the toEncoding string is handed to the C library unchanged, so we only need to include //IGNORE in it.

So the code becomes:

body, err = iconv.ConvertString(body, "GBK", "utf-8//IGNORE")

After testing, no err was returned. Done.
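Because //IGNORE drops data silently, one option is to try the strict conversion first and fall back to //IGNORE only when it fails, recording that the fallback happened. The sketch below is my own illustration under that assumption: convertLenient, withIgnore and convertFunc are hypothetical names, not part of iconv-go, and fakeConvert is a stand-in so the example runs without cgo. The convertFunc type has the same shape as the iconv.ConvertString signature shown earlier.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// convertFunc has the same shape as iconv.ConvertString from
// github.com/djimenez/iconv-go, so that function could be passed in.
type convertFunc func(input, fromEncoding, toEncoding string) (string, error)

// withIgnore (hypothetical helper) appends "//IGNORE" to an encoding name
// unless the name already ends with it.
func withIgnore(enc string) string {
	if strings.HasSuffix(strings.ToUpper(enc), "//IGNORE") {
		return enc
	}
	return enc + "//IGNORE"
}

// convertLenient (hypothetical helper) tries a strict conversion first and
// retries with //IGNORE only if that fails. lossy is true when the retry
// succeeded, meaning some input may have been dropped.
func convertLenient(conv convertFunc, input, from, to string) (out string, lossy bool, err error) {
	out, err = conv(input, from, to)
	if err == nil {
		return out, false, nil
	}
	out, err = conv(input, from, withIgnore(to))
	return out, err == nil, err
}

// fakeConvert stands in for the real converter in this sketch. It fails
// unless the target encoding carries the //IGNORE suffix.
func fakeConvert(input, from, to string) (string, error) {
	if !strings.HasSuffix(to, "//IGNORE") {
		return "", errors.New("invalid or incomplete multibyte or wide character")
	}
	return input, nil
}

func main() {
	out, lossy, err := convertLenient(fakeConvert, "hello", "GBK", "utf-8")
	fmt.Println(out, lossy, err)
}
```

With the real library, the call would presumably look like convertLenient(iconv.ConvertString, body, "GBK", "utf-8"). Logging the inputs for which lossy is true makes it possible to check later what was discarded.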

To restate the solution:

body, err = iconv.ConvertString(body, "GBK", "utf-8//IGNORE")

