Why does a Unicode character print correctly even when I handle it one byte at a time?

Issue

I am doing a school project and I came across something that shouldn’t work in theory.

I need to create two programs that communicate with each other through Unix signals; I will call them client and server. I pass a message in my client's argv, break each char into bits, and send them to the server.

The idea is to use bitwise communication (something simple and rudimentary): if the bit is 0, I send SIGUSR1 to the server PID using the kill system call; if it is 1, I send SIGUSR2.

#client send a char to server
int send_sig(int pid, unsigned char b)
{
    int a;

    a = 0;
    while (a < 8)
    {
        if (b & 1)
            kill(pid, SIGUSR2);
        else
            kill(pid, SIGUSR1);
        b = b >> 1;
        a++;
        usleep(1000);
    }
    return (0);
}

The problem arises with Unicode characters. argv is always a string (an array of char), so a Unicode character I pass takes from 1 to 4 bytes. Even so, the client side continues normally; the problem happens on my server side, where I receive these bits.

The way I structured my code, I have to print one byte at a time (which is acceptable, since in theory a char in C is equivalent to one byte). But even when I pass 4-byte Unicode characters and print them one byte at a time, it keeps working (it's like Russian roulette: sometimes it breaks and sometimes it works normally).

# Server receiving the 
unsigned char   reverse(unsigned char b)
{
    b = (b & 0xF0) >> 4 | (b & 0x0F) << 4;
    b = (b & 0xCC) >> 2 | (b & 0x33) << 2;
    b = (b & 0xAA) >> 1 | (b & 0x55) << 1;
    return (b);
}

void    signal_handler(int sig, siginfo_t *p_info, void *ucontext)
{
    static unsigned int     a = 0;
    static unsigned int     b = 0;

    a <<= 1;
    if (sig == SIGUSR2)
        a++;
    b++;
    if (b == 8)
    {
        b = 0;
        ft_printf("%c\0", reverse(a));
    }
    p_info = p_info;
    ucontext = ucontext;
}

Why does this happen? Shouldn't it just break and print something wrong?

Speculations:

  • the way I print to stdout without a NULL byte makes the shell and terminal interpret the bytes as a whole without losing the UTF-8 mapping

  • the Unicode character fits in a char (but this is impossible, I guess)

The behavior can be reproduced with this code:

#client.c file
#include <signal.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
void send_sig(int pid, char b)
{
    int a = 0;
    printf("%c", b);
    while (a < 8)
    {
        if (b & 1)
            kill(pid, SIGUSR2);
        else
            kill(pid, SIGUSR1);
        b >>= 1;
        a++;
        usleep(500);
    }
}
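int main(int argc, char *argv[])
{
    char *s = "🤨🤨🤨🤨🤨🤨🤨";

    if (argc < 2)
        return (1);
    /* send the current byte, then advance (the first byte is not skipped
       and the terminating '\0' is not sent) */
    while (*s != '\0')
        send_sig(atoi(argv[1]), *s++);
    return (0);
}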
#server.c file
#include <unistd.h>
#include <stdio.h>
#include <signal.h>

unsigned char   reverse(unsigned char b)
{
    b = (b & 0xF0) >> 4 | (b & 0x0F) << 4;
    b = (b & 0xCC) >> 2 | (b & 0x33) << 2;
    b = (b & 0xAA) >> 1 | (b & 0x55) << 1;
    return (b);
}

void    signal_handler(int sig, siginfo_t *p_info, void *ucontext)
{
    static unsigned int     a = 0;
    static unsigned int     b = 0;

    a <<= 1;
    if (sig == SIGUSR2)
        a++;
    b++;
    if (b == 8)
    {
        b = 0;
        a = reverse(a);
        write(1, &a, 1);
    }
    p_info = p_info;
    ucontext = ucontext;
}

int main(void)
{
    struct sigaction    act;

    act.sa_sigaction = signal_handler;
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;
    sigaction(SIGUSR1, &act, NULL);
    sigaction(SIGUSR2, &act, NULL);
    printf("The server pid: %d\n", getpid());
    while (1)
        usleep(300);
}

Solution

Unicode text can be sent bit by bit in two ways: as fixed-size code units of 16 bits (UTF-16) or 32 bits (UTF-32), so that every unit transmitted is always 16 or 32 bits long (note that a character outside the Basic Multilingual Plane, such as 🤨, takes two UTF-16 units), or byte by byte as UTF-8. In the latter case, the first byte determines the number of bytes (and therefore bits) in the transmission. Currently, your server reads only 8 bits at a time and sends each received byte straight to the output; it never treats the remaining bytes of a (possibly multibyte) character as part of the same unit.

Once your server has the first byte (8 bits), calculate the number of bytes in the transmission as follows:
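
As a side note on why the original server often appears to work: a terminal decodes its input as a stream of bytes, so a multibyte character written one byte per write call should still display correctly, provided every byte arrives intact and in order. A minimal sketch of that idea, assuming a UTF-8 terminal; write_bytewise is a hypothetical helper and not part of the original code:

```c
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: writes a string to fd one byte per write() call.
 * Returns the number of bytes written, or -1 on the first failed write. */
long write_bytewise(int fd, const char *s)
{
    size_t len = strlen(s);
    size_t i;

    for (i = 0; i < len; i++)
    {
        if (write(fd, &s[i], 1) != 1)
            return (-1);
    }
    return ((long)len);
}
```

Calling write_bytewise(STDOUT_FILENO, "\xF0\x9F\xA4\xA8\n") passes the four UTF-8 bytes of 🤨 separately; on a UTF-8 terminal they are expected to be shown as a single character. A lost or corrupted signal, on the other hand, changes one of those bytes and breaks the whole sequence.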

if (byte < 0x80)
    num_bytes = 1; //single byte, no further read required
else if ((byte & 0xe0) == 0xc0)
    num_bytes = 2; //one more byte to read
else if ((byte & 0xf0) == 0xe0)
    num_bytes = 3; //two more bytes to read
else if ((byte & 0xf8) == 0xf0)
    num_bytes = 4; //three more bytes to read

Then, to form a valid UTF-8 (multibyte) character, read the following bytes (if any) into a char array, e.g. unsigned char utf8_bytes[4];

Of course, to form a valid null-terminated (printable) string, the array needs a size of 5, with the byte after the last stored one set to '\0'.
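
The two steps above (reading the length from the lead byte, then collecting the remaining bytes into a 5-byte buffer) can be sketched as a small accumulator. This is only an illustration: t_utf8_acc, utf8_expected_len and utf8_push are hypothetical names, continuation bytes are not validated, and an invalid lead byte is assumed to be passed through as a 1-byte sequence.

```c
/* Hypothetical accumulator for one UTF-8 character (names are illustrative). */
typedef struct s_utf8_acc
{
    unsigned char bytes[5]; /* up to 4 bytes plus the terminating '\0' */
    int           len;      /* bytes stored so far for the current character */
    int           expected; /* total bytes announced by the lead byte */
}   t_utf8_acc;

/* Length announced by a lead byte. ASCII and invalid lead bytes both yield 1,
 * so the accumulator never waits for bytes that will not come. */
int utf8_expected_len(unsigned char lead)
{
    if ((lead & 0xe0) == 0xc0)
        return (2);
    if ((lead & 0xf0) == 0xe0)
        return (3);
    if ((lead & 0xf8) == 0xf0)
        return (4);
    return (1);
}

/* Stores one byte. Returns 1 when acc->bytes holds a complete,
 * null-terminated character (and resets len for the next one), else 0. */
int utf8_push(t_utf8_acc *acc, unsigned char byte)
{
    if (acc->len == 0)
        acc->expected = utf8_expected_len(byte);
    acc->bytes[acc->len++] = byte;
    if (acc->len < acc->expected)
        return (0);
    acc->bytes[acc->len] = '\0';
    acc->len = 0;
    return (1);
}
```

With a zero-initialized t_utf8_acc, the caller feeds each received byte to utf8_push and uses acc.bytes as a string whenever the function returns 1.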


Addition

Your client is sending the bit-sequence (byte: 10101010) as follows:

1010101|0 -> SIGUSR1
 101010|1 -> SIGUSR2
  10101|0 -> SIGUSR1
   1010|1 -> SIGUSR2
    101|0 -> SIGUSR1
     10|1 -> SIGUSR2
      1|0 -> SIGUSR1
       |1 -> SIGUSR2

So, every time your server is receiving a SIGUSR2 it has to set the bit at a certain position, which can be easily done like this:

if (sig == SIGUSR2)
    byte |= (1 << bit_counter);

++bit_counter;
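
Because the client shifts bits out starting with the least significant one, setting bit number bit_counter rebuilds the byte directly, which makes the reverse() step of the original server unnecessary. A minimal sketch of that collector, with hypothetical names (t_bit_acc, push_bit) and kept outside of any signal handler so it can be tried on its own:

```c
/* Hypothetical bit collector: the first bit received becomes bit 0
 * (LSB-first), matching the order in which the client shifts bits out. */
typedef struct s_bit_acc
{
    unsigned char byte;  /* bits collected so far */
    int           count; /* number of bits collected (0..7) */
}   t_bit_acc;

/* Adds one bit. Returns 1 and stores the finished byte in *out once
 * 8 bits have been collected, otherwise returns 0. */
int push_bit(t_bit_acc *acc, int bit, unsigned char *out)
{
    if (bit)
        acc->byte |= (unsigned char)(1u << acc->count);
    if (++acc->count < 8)
        return (0);
    *out = acc->byte;
    acc->byte = 0;
    acc->count = 0;
    return (1);
}
```

In the server, bit would be 1 for SIGUSR2 and 0 for SIGUSR1.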

The complete server code could look like this:

void signal_handler(int sig, siginfo_t *p_info, void *ucontext)
{
    static unsigned char utf8_bytes[5]; //multibyte storage (4 bytes + '\0')
    static unsigned char byte = 0; //bitset

    static int byte_index  = 0; //current position in the mb storage
    static int bit_counter = 0; //number of bits received
    static int num_bytes   = 1; //remaining bytes of the mb character

    (void)p_info;   //unused
    (void)ucontext; //unused

    if (sig == SIGUSR2) //bit: 1
        byte |= (1 << bit_counter); //set the according bit in byte

    if (++bit_counter == 8) { //we received 8 bits -> 1 byte

        if (byte_index == 0) { //if first byte in sequence
            if ((byte & 0xe0) == 0xc0)
                num_bytes = 2; //one more byte to read
            else if ((byte & 0xf0) == 0xe0)
                num_bytes = 3; //two more bytes to read
            else if ((byte & 0xf8) == 0xf0)
                num_bytes = 4; //three more bytes to read
            else
                num_bytes = 1; //single byte (or invalid lead byte), no further read
        }

        utf8_bytes[byte_index++] = byte; //store every byte, including the last one

        //since we completed 1 byte, decrease num_bytes
        if (--num_bytes == 0) { //and if there are no more bytes to read
            utf8_bytes[byte_index] = '\0'; //make null-terminated string
            write(1, utf8_bytes, byte_index); //do something useful (write is async-signal-safe, printf is not)
            byte_index = 0; //reset the byte index
        }

        bit_counter = 0; //reset counter
        byte        = 0; //reset byte (set all bits to zero)

    }
}

Answered By – Erdal Küçük

This answer, collected from Stack Overflow, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
