为此,我使用mb_detect_encoding()函数.
问题是如果我用mb_detect_order()函数改变可能的编码顺序,函数会返回不同的结果.
请考虑以下示例
$html = <<< STR ちょっとのアクセスで落ちてしまったり、サーバー障害が多いレンタルサーバーを選ぶとあなたのビジネス等にかなりの影響がでてしまう可能性があります.特に商売をされている個人の方、法人の方は気をつけるようにしてください STR; mb_detect_order(array('UTF-8','EUC-JP','SJIS','eucJP-win','SJIS-win','JIS','ISO-2022-JP','ISO-8859-1','ISO-8859-2')); $originalEncoding = mb_detect_encoding($str); die($originalEncoding); // $originalEncoding = 'UTF-8'
但是,如果您更改mb_detect_order()中的编码顺序,结果将会有所不同:
mb_detect_order(array('EUC-JP','UTF-8','ISO-8859-2')); die($originalEncoding); // $originalEncoding = 'EUC-JP'
检测算法可能只是按顺序继续尝试在mb_detect_order中指定的编码,然后返回字节流有效的第一个编码.
编辑:参见例如this article更智能的方法.
Due to its importance,automatic charset detection is already implemented in major Internet applications such as Mozilla or Internet Explorer. They are very accurate and fast,but the implementation applies many domain specific knowledges in case-by-case basis. As opposed to their methods,we aimed at a simple algorithm which can be uniformly applied to every charset,and the algorithm is based on well-established,standard machine learning techniques. We also studied the relationship between language and charset detection,and compared byte-based algorithms and character-based algorithms. We used Naive Bayes (NB) and Support Vector Machine (SVM).