Regular Expressions for Chinese Characters


Regex for Chinese: [\u2E80-\uFE4F]+


The following two patterns are the ones most commonly seen online:
/^[\u0391-\uFFE5]+$/
/^[\u4E00-\u9FA5]+$/

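Clearly, the second range is the narrower one. Testing shows that it is incomplete: '\u9FA6', the code point just past its upper bound, is the Chinese character "龦", so the second pattern does not cover everything it needs to.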
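The claim about the narrower pattern can be checked directly. This is a minimal sketch; the variable names are illustrative and not part of the original post.

```javascript
// The narrower of the two popular patterns.
const narrow = /^[\u4E00-\u9FA5]+$/;

// Everyday Chinese text falls inside U+4E00..U+9FA5, so it matches.
const common = narrow.test("\u4E2D\u6587"); // "中文"

// U+9FA6 sits one code point past the upper bound, so it is rejected
// even though it belongs to the CJK Unified Ideographs block.
const justOutside = narrow.test("\u9FA6");

console.log(common, justOutside);
```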
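The last character of the first pattern, '\uFFE5', is the fullwidth yen sign '￥', and the next code point, '\uFFE6', is the fullwidth won sign '￦'. So I think the first pattern is roughly right. Its starting character '\u0391', however, is 'Α': neither the halfwidth Latin A nor the fullwidth A, but the Greek capital letter Alpha. So the first range is probably somewhat too broad, especially at the low end.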
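To see how broad the first pattern is at the low end, it can be run against text that is plainly not Chinese. A small sketch, with sample strings chosen here purely for illustration:

```javascript
// The broader of the two popular patterns.
const broad = /^[\u0391-\uFFE5]+$/;

// Greek capitals Alpha, Beta, Gamma (U+0391..U+0393) are inside the range.
const greek = broad.test("\u0391\u0392\u0393");

// Cyrillic letters (the U+0400 block) are inside the range as well.
const cyrillic = broad.test("\u041F\u0440\u0438\u0432\u0435\u0442");

// Plain ASCII letters lie below U+0391, so they are rejected.
const ascii = broad.test("ABC");

console.log(greek, cyrillic, ascii);
```

Both the Greek and the Cyrillic strings are accepted, which supports the suspicion that the range starts too early.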
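So I consulted the Unicode block table.
It turns out that Chinese characters are encoded in an unusual way: they do not sit in one contiguous block the way, say, Hebrew does (U+0590 -- U+05FF). They are split across many smaller blocks, and because many of them are shared by Chinese, Japanese, and Korean, the blocks labelled CJK in Unicode are generally the ones that hold Han characters.
Going through the table, the first CJK block starts at U+2E80 and the last one ends at U+FE4F. The final conclusion is therefore: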
/^[\u2E80-\uFE4F]+$/
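A short sketch of how this pattern behaves in JavaScript follows. The sample strings and variable names are assumptions made for illustration. Two points are worth knowing before relying on it: the range also spans the non-Han blocks that lie between U+2E80 and U+FE4F (Hiragana, Hangul Syllables, the surrogate range, and others), and without the `u` flag JavaScript matches UTF-16 code units, so each half of a surrogate pair is tested on its own.

```javascript
const cjkRange = /^[\u2E80-\uFE4F]+$/;   // matches per UTF-16 code unit
const cjkRangeU = /^[\u2E80-\uFE4F]+$/u; // matches per code point

const han = cjkRange.test("\u6C49\u5B57");   // "汉字": inside the range
const latin = cjkRange.test("hello");        // ASCII: outside the range

// Hiragana (U+3040..U+309F) lies between the two bounds, so it matches too.
const kana = cjkRange.test("\u3072\u3089\u304C\u306A");

// U+1F600 is stored as the surrogate pair D83D DE00. Both halves fall in
// the range, so the pattern without the u flag accepts the emoji.
const emojiNoFlag = cjkRange.test("\u{1F600}");

// With the u flag the pair is read as the single code point U+1F600,
// which is above U+FE4F, so it is rejected.
const emojiWithFlag = cjkRangeU.test("\u{1F600}");

console.log(han, latin, kana, emojiNoFlag, emojiWithFlag);
```

If only Han characters should pass, a tighter set of ranges or the `u` flag may be worth considering.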
Finally, here is the Unicode block table for reference:
U+0000 -- U+007F: Basic Latin
U+0080 -- U+00FF: Latin-1 Supplement
U+0100 -- U+017F: Latin Extended-A
U+0180 -- U+024F: Latin Extended-B
U+0250 -- U+02AF: IPA Extensions
U+02B0 -- U+02FF: Spacing Modifier Letters
U+0300 -- U+036F: Combining Diacritical Marks
U+0370 -- U+03FF: Greek and Coptic
U+0400 -- U+04FF: Cyrillic
U+0500 -- U+052F: Cyrillic Supplement
U+0530 -- U+058F: Armenian
U+0590 -- U+05FF: Hebrew
U+0600 -- U+06FF: Arabic
U+0700 -- U+074F: Syriac
U+0750 -- U+077F: Arabic Supplement
U+0780 -- U+07BF: Thaana
U+07C0 -- U+07FF: NKo
U+0900 -- U+097F: Devanagari
U+0980 -- U+09FF: Bengali
U+0A00 -- U+0A7F: Gurmukhi
U+0A80 -- U+0AFF: Gujarati
U+0B00 -- U+0B7F: Oriya
U+0B80 -- U+0BFF: Tamil
U+0C00 -- U+0C7F: Telugu
U+0C80 -- U+0CFF: Kannada
U+0D00 -- U+0D7F: Malayalam
U+0D80 -- U+0DFF: Sinhala
U+0E00 -- U+0E7F: Thai
U+0E80 -- U+0EFF: Lao
U+0F00 -- U+0FFF: Tibetan
U+1000 -- U+109F: Myanmar
U+10A0 -- U+10FF: Georgian
U+1100 -- U+11FF: Hangul Jamo
U+1200 -- U+137F: Ethiopic
U+1380 -- U+139F: Ethiopic Supplement
U+13A0 -- U+13FF: Cherokee
U+1400 -- U+167F: Unified Canadian Aboriginal Syllabics
U+1680 -- U+169F: Ogham
U+16A0 -- U+16FF: Runic
U+1700 -- U+171F: Tagalog
U+1720 -- U+173F: Hanunoo
U+1740 -- U+175F: Buhid
U+1760 -- U+177F: Tagbanwa
U+1780 -- U+17FF: Khmer
U+1800 -- U+18AF: Mongolian
U+1900 -- U+194F: Limbu
U+1950 -- U+197F: Tai Le
U+1980 -- U+19DF: New Tai Lue
U+19E0 -- U+19FF: Khmer Symbols
U+1A00 -- U+1A1F: Buginese
U+1B00 -- U+1B7F: Balinese
U+1D00 -- U+1D7F: Phonetic Extensions
U+1D80 -- U+1DBF: Phonetic Extensions Supplement
U+1DC0 -- U+1DFF: Combining Diacritical Marks Supplement
U+1E00 -- U+1EFF: Latin Extended Additional
U+1F00 -- U+1FFF: Greek Extended
U+2000 -- U+206F: General Punctuation
U+2070 -- U+209F: Superscripts and Subscripts
U+20A0 -- U+20CF: Currency Symbols
U+20D0 -- U+20FF: Combining Diacritical Marks for Symbols
U+2100 -- U+214F: Letterlike Symbols
U+2150 -- U+218F: Number Forms
U+2190 -- U+21FF: Arrows
U+2200 -- U+22FF: Mathematical Operators
U+2300 -- U+23FF: Miscellaneous Technical
U+2400 -- U+243F: Control Pictures
U+2440 -- U+245F: Optical Character Recognition
U+2460 -- U+24FF: Enclosed Alphanumerics
U+2500 -- U+257F: Box Drawing
U+2580 -- U+259F: Block Elements
U+25A0 -- U+25FF: Geometric Shapes
U+2600 -- U+26FF: Miscellaneous Symbols
U+2700 -- U+27BF: Dingbats
U+27C0 -- U+27EF: Miscellaneous Mathematical Symbols-A
U+27F0 -- U+27FF: Supplemental Arrows-A
U+2800 -- U+28FF: Braille Patterns
U+2900 -- U+297F: Supplemental Arrows-B
U+2980 -- U+29FF: Miscellaneous Mathematical Symbols-B
U+2A00 -- U+2AFF: Supplemental Mathematical Operators
U+2B00 -- U+2BFF: Miscellaneous Symbols and Arrows
U+2C00 -- U+2C5F: Glagolitic
U+2C60 -- U+2C7F: Latin Extended-C
U+2C80 -- U+2CFF: Coptic
U+2D00 -- U+2D2F: Georgian Supplement
U+2D30 -- U+2D7F: Tifinagh
U+2D80 -- U+2DDF: Ethiopic Extended
U+2E00 -- U+2E7F: Supplemental Punctuation
U+2E80 -- U+2EFF: CJK Radicals Supplement
U+2F00 -- U+2FDF: Kangxi Radicals
U+2FF0 -- U+2FFF: Ideographic Description Characters
U+3000 -- U+303F: CJK Symbols and Punctuation
U+3040 -- U+309F: Hiragana
U+30A0 -- U+30FF: Katakana
U+3100 -- U+312F: Bopomofo
U+3130 -- U+318F: Hangul Compatibility Jamo
U+3190 -- U+319F: Kanbun
U+31A0 -- U+31BF: Bopomofo Extended
U+31C0 -- U+31EF: CJK Strokes
U+31F0 -- U+31FF: Katakana Phonetic Extensions
U+3200 -- U+32FF: Enclosed CJK Letters and Months
U+3300 -- U+33FF: CJK Compatibility
U+3400 -- U+4DBF: CJK Unified Ideographs Extension A
U+4DC0 -- U+4DFF: Yijing Hexagram Symbols
U+4E00 -- U+9FFF: CJK Unified Ideographs
U+A000 -- U+A48F: Yi Syllables
U+A490 -- U+A4CF: Yi Radicals
U+A700 -- U+A71F: Modifier Tone Letters
U+A720 -- U+A7FF: Latin Extended-D
U+A800 -- U+A82F: Syloti Nagri
U+A840 -- U+A87F: Phags-pa
U+AC00 -- U+D7AF: Hangul Syllables
U+D800 -- U+DB7F: High Surrogates
U+DB80 -- U+DBFF: High Private Use Surrogates
U+DC00 -- U+DFFF: Low Surrogates
U+E000 -- U+F8FF: Private Use Area
U+F900 -- U+FAFF: CJK Compatibility Ideographs
U+FB00 -- U+FB4F: Alphabetic Presentation Forms
U+FB50 -- U+FDFF: Arabic Presentation Forms-A
U+FE00 -- U+FE0F: Variation Selectors
U+FE10 -- U+FE1F: Vertical Forms
U+FE20 -- U+FE2F: Combining Half Marks
U+FE30 -- U+FE4F: CJK Compatibility Forms
U+FE50 -- U+FE6F: Small Form Variants
U+FE70 -- U+FEFF: Arabic Presentation Forms-B
U+FF00 -- U+FFEF: Halfwidth and Fullwidth Forms
U+FFF0 -- U+FFFF: Specials
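As an alternative to one wide range, a checker can be built from only the Han-related blocks in the table above. This is a sketch under stated assumptions: the choice of which blocks count as "Chinese" is a judgment made here, not the original author's, and `hanBlocks`, `isHanByBlock`, and `hanScript` are hypothetical names. The last line uses the `\p{Script=Han}` property escape, which is available in ES2018 and later together with the `u` flag.

```javascript
// Han-related blocks copied from the table (Basic Multilingual Plane only).
const hanBlocks = [
  [0x2e80, 0x2eff], // CJK Radicals Supplement
  [0x2f00, 0x2fdf], // Kangxi Radicals
  [0x3400, 0x4dbf], // CJK Unified Ideographs Extension A
  [0x4e00, 0x9fff], // CJK Unified Ideographs
  [0xf900, 0xfaff], // CJK Compatibility Ideographs
];

// Returns true only if the text is non-empty and every code point
// falls inside one of the blocks listed above.
function isHanByBlock(text) {
  if (text.length === 0) return false;
  for (const ch of text) { // for...of iterates by code point
    const cp = ch.codePointAt(0);
    const inBlock = hanBlocks.some(([lo, hi]) => cp >= lo && cp <= hi);
    if (!inBlock) return false;
  }
  return true;
}

// Script-based alternative: lets the regex engine decide what is Han.
const hanScript = /^\p{Script=Han}+$/u;

console.log(isHanByBlock("\u6C49\u5B57"), hanScript.test("\u6C49\u5B57"));
```

The block list rejects Hiragana and Hangul, which the single wide range accepts. The script property additionally covers Han characters outside the table, such as U+20000.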

Original link: https://www.f2er.com/regex/358990.html
