How to use UTF-8 in C code


My setup: gcc-4.9.2, UTF-8 environment.

The following C program works with ASCII text but fails with UTF-8 text.

Create input file:

echo -n 'привет мир' > /tmp/вход

This is test.c:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SIZE 10

int main(void)
{
  char buf[SIZE+1];
  char *pat = "привет мир";
  char str[SIZE+2];

  FILE *f1;
  FILE *f2;

  f1 = fopen("/tmp/вход","r");
  f2 = fopen("/tmp/выход","w");

  if (fread(buf, 1, SIZE, f1) > 0) {
    buf[SIZE] = 0;

    if (strncmp(buf, pat, SIZE) == 0) {
      sprintf(str, "% 11s\n", buf);
      fwrite(str, 1, SIZE+2, f2);
    }
  }

  fclose(f1);
  fclose(f2);

  exit(0);
}

Check the result:

./test; grep -q ' привет мир' /tmp/выход && echo OK

What should be done to make UTF-8 code work as if it were ASCII code, so that I don't have to care how many bytes a character takes? In other words: what should I change in the example so that any UTF-8 character is treated as a single unit (this includes argv, STDIN, STDOUT, STDERR, file input and output, and the program code itself)?

Best Solution

#define SIZE 10

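The buffer size of 10 is too small to hold the UTF-8 string привет мир. SIZE counts bytes, not characters: the string is 10 characters long but takes 19 bytes, because each of the nine Cyrillic letters is encoded in 2 bytes and the space in 1. Try changing SIZE to a larger value. On my system (Ubuntu 12.04, gcc 4.8.1), changing it to 20 worked perfectly.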
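To make the difference between characters and bytes concrete, here is a minimal sketch that measures both. It assumes the input is valid, NUL-terminated UTF-8; `utf8_length` and `fits_in_buffer` are hypothetical helper names, not standard library functions.

```c
#include <stddef.h>
#include <string.h>

/* Count code points in a NUL-terminated UTF-8 string.
   Assumption: the input is valid UTF-8. Continuation bytes have the
   bit pattern 10xxxxxx, so every other byte starts a new code point. */
static size_t utf8_length(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

/* Return 1 if s plus its terminating NUL fits into buf_size bytes.
   strlen() reports bytes, which is what a char buffer has to hold. */
static int fits_in_buffer(const char *s, size_t buf_size)
{
    return strlen(s) + 1 <= buf_size;
}
```

If the source file is saved as UTF-8, `utf8_length("привет мир")` gives 10 while `strlen` gives 19, so the string does not fit the original `char buf[SIZE+1]` of 11 bytes but does fit a 21-byte buffer.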

UTF-8 is a multibyte encoding that uses between 1 and 4 bytes per character, so 40 (10 characters × 4 bytes) is a safer buffer size for the example above. There is a long discussion under "How many bytes does one Unicode character take?" which might be interesting.
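One way to apply the worst-case rule is to size the buffer from a character count and then work out how many bytes a given number of characters actually occupies, for example to pass the right length to `strncmp` or `fwrite`. The following is only a sketch under the assumption of valid UTF-8 input; `MAX_CHARS`, `UTF8_MAX_BYTES`, `BUF_SIZE` and `utf8_prefix_bytes` are hypothetical names introduced for illustration.

```c
#include <stddef.h>

#define MAX_CHARS      10  /* characters the buffer should hold */
#define UTF8_MAX_BYTES 4   /* worst case per UTF-8 code point */
#define BUF_SIZE (MAX_CHARS * UTF8_MAX_BYTES + 1)  /* +1 for the NUL */

/* Return how many bytes the first max_chars code points of s occupy.
   Assumption: s is valid, NUL-terminated UTF-8. If s has fewer than
   max_chars code points, the byte length of the whole string is returned. */
static size_t utf8_prefix_bytes(const char *s, size_t max_chars)
{
    size_t bytes = 0;
    size_t chars = 0;

    while (s[bytes] != '\0') {
        /* Any byte that is not 10xxxxxx starts a new code point. */
        if (((unsigned char)s[bytes] & 0xC0) != 0x80) {
            if (chars == max_chars)
                break;          /* next code point would exceed the limit */
            ++chars;
        }
        ++bytes;
    }
    return bytes;
}
```

With a buffer declared as `char buf[BUF_SIZE]`, the value returned by `utf8_prefix_bytes(pat, MAX_CHARS)` can replace `SIZE` as the byte count in the comparison, so the limit is expressed in characters while the library calls still receive bytes.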