Garbled characters in different Chinese encodings
Here is an example. Suppose a repository contains three PHP files, each in a different encoding. What will be displayed? Looking at the result directly with git log -p shows unreadable garbled characters, as in the figure below. In that output I cannot see what was modified, because the files are not encoded in UTF-8.
After setting the diff Git attribute, the situation is different. The output below correctly displays the text files in all three encodings:
This was achieved through Git attributes.
About Git attributes
Using attributes, you can define different merge strategies for individual files or directories, and you can also tell Git how to compare non-text files, for example to diff two Word documents.
Thanks to this feature, we can also make files with different encodings display in a single encoding on the terminal.
In this article, I will demonstrate how to use Git attributes to convert files with different encodings to UTF-8 for display on the terminal.
Git attributes can be set by adding a .gitattributes file, which is usually placed in the project directory. If you don't want the attributes to be committed along with the project, you can set them in your own .git/info/attributes instead.
1. First, create .git/info/attributes with the following content:
*.php diff=big5
*.html diff=big5
*.htm diff=big5
The syntax above is straightforward: it targets *.php, *.html, and *.htm files and tells Git to use the big5 diff driver when diffing them. In other words, .php, .html, and .htm files are passed through the big5 converter and then displayed and compared as UTF-8.
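As a rough illustration of how those three lines select a diff driver, here is a minimal Python sketch. It is not Git's implementation: the names ATTRIBUTE_RULES and diff_driver_for are made up for this example, and it only models the simple case used here, where a pattern without a slash is matched against the file name and a later matching line overrides an earlier one.

```python
import os
from fnmatch import fnmatch

# Patterns copied from the .git/info/attributes example above.
# ATTRIBUTE_RULES is a hypothetical name used only in this sketch.
ATTRIBUTE_RULES = [
    ("*.php", "big5"),
    ("*.html", "big5"),
    ("*.htm", "big5"),
]


def diff_driver_for(path, rules=ATTRIBUTE_RULES):
    """Return the diff driver name for a path, or None if no rule matches."""
    name = os.path.basename(path)
    driver = None
    for pattern, value in rules:
        if fnmatch(name, pattern):
            driver = value  # a later matching rule overrides an earlier one
    return driver


print(diff_driver_for("src/index.php"))
print(diff_driver_for("README.md"))
```

Files that match no pattern get no driver and are diffed as usual, which is why only the PHP and HTML files are affected.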
2. Manually edit the .git/config file and add the following settings:
[diff "big5"]
textconv = hkscs2utf8
Alternatively, add the setting with a command:
git config diff.big5.textconv hkscs2utf8
hkscs2utf8 is a simple transcoding bash script I wrote, which converts GB2312 and Big5 to UTF-8 for display. That is, when Git performs a diff, the files matched in the Git attributes, such as .php, .html, or .htm files, are transcoded through hkscs2utf8, and the converted results are compared.
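To make the "convert first, then compare" idea concrete, here is a small Python sketch. It only imitates the effect of a textconv filter on two versions of a file; the function names and the sample PHP lines are invented for illustration, and Git's own diff machinery is not involved.

```python
import difflib


def to_text(raw, encoding="big5hkscs"):
    """Decode bytes from a legacy encoding, as a textconv filter would."""
    return raw.decode(encoding)


def diff_converted(old_raw, new_raw, encoding="big5hkscs"):
    """Diff two byte strings after converting both to text."""
    old_lines = to_text(old_raw, encoding).splitlines()
    new_lines = to_text(new_raw, encoding).splitlines()
    return list(
        difflib.unified_diff(old_lines, new_lines, "a/file", "b/file", lineterm="")
    )


# Two hypothetical versions of a Big5-encoded PHP file.
old = "<?php echo '你好';".encode("big5hkscs")
new = "<?php echo '再見';".encode("big5hkscs")

diff_lines = diff_converted(old, new)
for line in diff_lines:
    print(line)
```

Because both sides are decoded before the comparison, the changed lines come out as readable text rather than raw Big5 bytes.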
3. Save the bash script below as /usr/local/bin/hkscs2utf8.
You can run echo $PATH to confirm that /usr/local/bin is on your path.
Note: This script is only suitable for macOS. That does not mean the same effect cannot be achieved on Windows or Linux.
This article mainly conveys the concept and use cases of Git attributes. You can transcode in other ways, such as with Python, or on other platforms with a more rigorous transcoding program of your own.
#!/bin/bash
# Check whether the file is already UTF-8.
# ANSI files can use different encodings (called code pages on Windows), and it is
# hard to tell whether a DBCS file is Big5 or GB2312, because the same byte values
# can be valid in both character sets.
# So first determine only whether the file is UTF-8.
# If grep finds utf-8 (exit status 0), output the UTF-8 file as-is.
if file -I "$1" | grep -q utf-8; then
    cat "$1"
    exit 0
fi
# The UTF-8 check failed, so try the other encodings.
# First try converting from GB2312 to UTF-8, discarding the output and error messages.
# A non-zero exit status means the file is not GB2312, so transcode from Big5-HKSCS.
# If the trial conversion succeeds (exit status 0), run the GB2312 conversion again,
# this time writing the UTF-8 output.
if iconv -f gb2312 -t utf-8 "$1" >/dev/null 2>&1; then
    iconv -f gb2312 -t utf-8 "$1"
else
    iconv -f big5-hkscs//IGNORE -t utf-8 "$1"
fi
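The ambiguity mentioned in the script's comments can be explored with a short Python sketch. It lists every encoding, from a fixed set, under which some bytes decode without error. The function name and the sample string are assumptions made for this example. Which encodings are reported depends entirely on the input bytes.

```python
def candidate_encodings(raw, encodings=("utf-8", "gb2312", "big5hkscs")):
    """Return every listed encoding under which the bytes decode without error."""
    matches = []
    for name in encodings:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        matches.append(name)
    return matches


# Hypothetical sample: two Chinese characters stored as GB2312 bytes.
sample = "中文".encode("gb2312")
print(candidate_encodings(sample))
```

When more than one encoding is reported for the same bytes, a successful conversion does not prove the guess was right. That is why the script checks for UTF-8 first and then tries GB2312 before falling back to Big5-HKSCS.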
Remember to run chmod 700 hkscs2utf8, otherwise the script cannot be executed. That completes the setup. Now when you run git log -p, you can see normal Chinese characters. Magic.
4. I also wrote another transcoding program in Python, which should work better than the bash version.
If your computer can run Python, you can try it.
1. Save cv.py in the /usr/local/bin directory.
2. To use cv.py, run: git config diff.big5.textconv cv.py
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys

# Check whether the data is UTF-8 encoded
def isUTF8(data):
    try:
        decoded = data.decode('UTF-8')
    except UnicodeDecodeError:
        return False
    else:
        # Reject surrogate code points, which are not valid in UTF-8
        for ch in decoded:
            if 0xD800 <= ord(ch) <= 0xDFFF:
                return False
        return True

# Read the file content as binary
def get_bytes_from_file(filename):
    with open(filename, "rb") as f:
        return f.read()

# Get the file name
filename = sys.argv[1]
data = get_bytes_from_file(filename)
# Transcode the content if it is not UTF-8
if not isUTF8(data):
    try:
        udata = data.decode("gb2312")
    except UnicodeDecodeError:
        udata = data.decode("hkscs")
    data = udata.encode("utf-8", "ignore")
# Write the raw UTF-8 bytes; on Python 3, print() would show a b'...' literal
getattr(sys.stdout, "buffer", sys.stdout).write(data)
If commit messages still display as garbled characters in a UTF-8 environment, the following command makes them display correctly:
git config --global i18n.logOutputEncoding utf8
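Conceptually, this setting asks Git to re-encode log output into the given encoding before showing it. The Python sketch below shows that kind of re-encoding on a single message. The function name, the Big5 source encoding, and the sample message are assumptions for illustration, not a description of Git's internals.

```python
def reencode_message(raw, source_encoding, output_encoding="utf-8"):
    """Re-encode message bytes from the stored encoding to the output encoding."""
    return raw.decode(source_encoding).encode(output_encoding)


# Hypothetical commit message stored as Big5 bytes.
stored = "修正".encode("big5")
shown = reencode_message(stored, "big5")
print(shown.decode("utf-8"))
```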