Decode Arabic chars from a linux encoding

Issue

I’m working on an old file I found on a Linux server, and it contains Arabic words in the following strange format:

\350\343\240\317\344\307\345\331\307\...

I don’t know whether Linux converted the Arabic characters automatically or someone did it by hand, but I need to understand the Arabic written this way.

So, does anyone know what the above format/encoding is?

Thanks in advance.

Solution

I don’t read Arabic, but I can offer some speculations, which should hopefully at least provide enough information to allow you to finish the identification task.

The string you display in your question looks like octal character codes, i.e. how for example Emacs would display a file which contains those characters as bytes. Converting those to hex yields

bash$ python
>>> print([hex(ord(x)) for x in '\350\343\240\317\344\307\345\331\307'])
['0xe8', '0xe3', '0xa0', '0xcf', '0xe4', '0xc7', '0xe5', '0xd9', '0xc7']

Looking these up on https://tripleee.github.io/8bit/ and looking for (probably) Arabic glyphs gets me several candidate encodings.

(Disclosure: In case it’s not obvious, the linked page is mine.)

I could go on, but since I can’t tell which combinations yield valid Arabic text, I’ll leave it to you to continue this investigation. Picking what seem like the top candidates here so far, see which ones decode to something meaningful:

>>> print(b'\350\343\240\317\344\307\345\331\307'.decode('cp720'))
كعب╧غ╟ف┘╟
>>> #   ^ partial gibberish, so probably no
>>> print(b'\350\343\240\317\344\307\345\331\307'.decode('iso8859_6'))
وك دلامعا

(Google Translate gets me "you are in tears" for this text, so this looks vaguely promising.)

Windows code page 720 is native to Windows. It’s of course possible that it was popular on other platforms at some point in some region, but that’s another reason to regard it as less likely than the actual standard, ISO 8859-6. Based on the evidence so far, I’d go with ISO 8859-6.
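If you want to compare several candidates side by side, a small loop saves some typing. This is only a sketch: the list of codecs is an assumption (a few Arabic 8-bit encodings that Python ships with), and try_decodings is a hypothetical helper name, not part of any library.

```python
# The bytes from the question, written as a Python bytes literal.
raw = b'\350\343\240\317\344\307\345\331\307'

# Assumed candidate list: Arabic 8-bit codecs from Python's standard library.
candidates = ['iso8859_6', 'cp1256', 'cp720', 'cp864']


def try_decodings(data, encodings):
    """Return a dict mapping each encoding to the decoded text, or None on failure."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = data.decode(enc)
        except (UnicodeDecodeError, LookupError):
            # Byte sequence not valid in this codec, or codec not available.
            results[enc] = None
    return results


for enc, text in try_decodings(raw, candidates).items():
    print(f'{enc:10} {text!r}')
```

A decoding which succeeds is not necessarily the right one; someone who reads Arabic still has to judge which output is meaningful.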

To convert the entire file to UTF-8 (or whatever is the default on your system) from that encoding, try

iconv -f iso-8859-6 file.txt >new.txt
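If iconv is not available, roughly the same conversion can be sketched in Python. The function name convert_file is hypothetical, the file names are placeholders, and unlike the iconv command above this always writes UTF-8 rather than the system default.

```python
def convert_file(src, dst, source_encoding='iso-8859-6'):
    """Re-encode the text file src into UTF-8, writing the result to dst."""
    # newline='' leaves line endings exactly as they are in the input.
    with open(src, encoding=source_encoding, newline='') as infile, \
         open(dst, 'w', encoding='utf-8', newline='') as outfile:
        for line in infile:
            outfile.write(line)

# Example call (placeholder file names):
# convert_file('file.txt', 'new.txt')
```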

If your original file contains something other than what I assumed, perhaps edit your question to clarify. See also the character-encoding tag info page on Stack Overflow and "Problematic questions about decoding errors".

If your file contains literal backslashes and numbers, try something like

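#!/usr/bin/env python3
from sys import stdin

for line in stdin:
    # Split on backslashes; drop whatever precedes the first one.
    octal_codes = line.rstrip('\n').split('\\')[1:]
    # Parse each piece as an octal number and collect the values as bytes.
    bline = bytes(int(x, 8) for x in octal_codes)
    print(bline.decode('iso-8859-6'))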

There is no need to use Python specifically for this; it’s just what I use, so it’s convenient for me, and it’s widely understood.

As a quick whirlwind summary, bytes and the corresponding b'...' byte strings are Python binary data types which do not have an encoding; they just represent literal 8-bit bytes. Encoding a text string requires you to specify an encoding and produces bytes. Going in the other direction, decoding converts bytes or a b'...' string to an actual string (which is always Unicode in Python 3), and yields different strings (or an error) depending on which encoding you pass in.
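To make that concrete, here is a minimal sketch using the first two bytes from the question. The choice of cp1256 and utf-8 as comparison encodings is just for illustration.

```python
raw = b'\350\343'                  # two bytes; no encoding is attached to them

text = raw.decode('iso8859_6')     # bytes -> str, interpreting them as ISO 8859-6
print(text, [hex(ord(c)) for c in text])

# Encoding the string back with the same codec returns the original bytes ...
print(text.encode('iso8859_6') == raw)
# ... while encoding the same string as UTF-8 produces different bytes.
print(text.encode('utf-8'))

# Decoding the same bytes with another codec gives a different string.
print(raw.decode('cp1256'))

# Some codecs reject the bytes outright.
try:
    raw.decode('utf-8')
except UnicodeDecodeError as err:
    print('not valid UTF-8:', err)
```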

int(str, base) converts str to an integer in base base, so for example, int("345", 8) converts octal 345 to decimal 229 (hex 0xe5).
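As a small worked example of that conversion, the sketch below turns a few of the octal codes from the question into single bytes. The helper name octal_to_byte is made up for this illustration.

```python
def octal_to_byte(token):
    """Convert a string of octal digits such as '345' into a single byte."""
    return bytes([int(token, 8)])


for token in ['350', '343', '345']:
    value = int(token, 8)          # e.g. '345' -> 3*64 + 4*8 + 5
    print(token, value, hex(value), octal_to_byte(token))
```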

This trivial script just reads from standard input and writes to standard output. If you need this for more than a quick one-off, probably add an option parser to accept file name arguments etc.
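A version with an option parser might look something like the following. This is a sketch under assumptions: the option names (-e/--encoding) and the function names decode_line and main are my own choices, and the input files are assumed to be plain ASCII containing only the backslash-and-digits notation.

```python
#!/usr/bin/env python3
import argparse
import sys


def decode_line(line, encoding):
    """Turn one line of backslash-octal codes into text in the given encoding."""
    tokens = line.rstrip('\n').split('\\')[1:]
    return bytes(int(tok, 8) for tok in tokens).decode(encoding)


def main(argv=None):
    parser = argparse.ArgumentParser(
        description='Decode lines of backslash-octal character codes.')
    parser.add_argument('files', nargs='*',
                        help='input files (default: standard input)')
    parser.add_argument('-e', '--encoding', default='iso-8859-6',
                        help='encoding of the original bytes')
    args = parser.parse_args(argv)

    if not args.files:
        for line in sys.stdin:
            print(decode_line(line, args.encoding))
        return
    for name in args.files:
        with open(name, encoding='ascii') as handle:
            for line in handle:
                print(decode_line(line, args.encoding))

# When saving this as a script, finish with:
#     if __name__ == '__main__':
#         main()
```

Keeping the decoding in its own function means a different encoding can be tried by changing only the command-line option.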

Answered By – tripleee

This Answer collected from stackoverflow, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
