EmbeddedRelated.com Forums

Getting started with AVR and C

Started by Robert Roland November 24, 2012
"Boudewijn Dijkstra" <sp4mtr4p.boudewijn@indes.com> writes:

> On Thu, 29 Nov 2012 21:36:53 +0100, Keith Thompson <kst-u@mib.org> wrote:
>> upsidedown@downunder.com writes:
>>> On Thu, 29 Nov 2012 11:01:34 -0500, James Kuyper
>>> <jameskuyper@verizon.net> wrote:
>>>
>>>> On 11/29/2012 10:23 AM, Grant Edwards wrote:
>>>>> On 2012-11-29, Tim Wescott <tim@seemywebsite.com> wrote:
>>>>>
>>>>> ... Trying to implement
>>>>> any sort of communications protocol with that was fun.
>>>
>>> Using left/right shifts and AND and OR operations works just fine.
>>> It works OK on platforms with different CHAR_BIT and different
>>> endianness. Do not try to use structs etc.
>>>
>>>> Thanks for that information. Claims have frequently been made on
>>>> comp.lang.c that, while the C standard allows CHAR_BIT != 8, the
>>>> existence of such implementations is a myth. I'm glad to have a
>>>> specific counterexample to cite.
>>>
>>> IMHO CHAR_BIT = 21 is the correct way to handle the Unicode range.
>>>
>>> On the Unicode list, I even suggested packing three 21-bit characters
>>> into a single 64-bit data word as UTF-64 :-)
>>
>> I like it -- but it breaks as soon as they add U+200000 or higher,
>> and I'm not aware of any guarantee that they won't.
>
> Not really. You can use the spare bit to indicate a different packing.
>
>> I've thought of UTF-24, encoding each character in 3 octets; that's
>> good for up to 16,777,216 distinct code points.

I hope that, when the galactic discovery is underway that would make
this many code points necessary, software engineering will have evolved
beyond the point of humans worrying about bit widths and encodings.
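[The shift-and-mask approach recommended above can be sketched in C roughly as follows. The function names are illustrative, not from the thread; the point is that only shifts, AND, and OR touch the octets, so the wire format is identical regardless of host endianness or struct padding.]

```c
#include <stdint.h>

/* Serialize a 32-bit value into four octets, big-endian on the wire,
 * using only shifts and masks -- no structs, no memcpy, so the result
 * does not depend on host endianness or padding. */
static void put_u32_be(uint8_t *buf, uint32_t v)
{
    buf[0] = (uint8_t)((v >> 24) & 0xFFu);
    buf[1] = (uint8_t)((v >> 16) & 0xFFu);
    buf[2] = (uint8_t)((v >> 8)  & 0xFFu);
    buf[3] = (uint8_t)( v        & 0xFFu);
}

/* Reassemble with shifts and OR; again host-independent. */
static uint32_t get_u32_be(const uint8_t *buf)
{
    return ((uint32_t)buf[0] << 24) |
           ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8)  |
            (uint32_t)buf[3];
}
```

[On a CHAR_BIT > 8 machine the same masking still yields values 0..255 per array element, which is exactly why this style survives where overlaying a struct on a buffer does not.]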
UTF-8 is the way forward, isn't it?

-- 
John Devereux
John Devereux <john@devereux.me.uk> writes:
> "Boudewijn Dijkstra" <sp4mtr4p.boudewijn@indes.com> writes:
>> On Thu, 29 Nov 2012 21:36:53 +0100, Keith Thompson wrote:

[...]

>>> I've thought of UTF-24, encoding each character in 3 octets; that's
>>> good for up to 16,777,216 distinct code points.

>> I hope that, when the galactic discovery is underway that would make
>> this many code points necessary, software engineering will have
>> evolved beyond the point of humans worrying about bit widths and
>> encodings.

> UTF-8 is the way forward, isn't it?

I doubt it is. FWIW, it requires three octets for Cyrillic, while UTF-16
requires only two. Personally, I'd try to use the latter whenever
possible (which means: anywhere, unless OS interaction issues are deeply
involved in the matter.)

-- 
FSF associate member #7257
On 12/11/12 9:53 AM, John Devereux wrote:
> UTF-8 is the way forward, isn't it?

As with most compression systems, it depends on the usage pattern of the
characters. If the text is mostly the 7-bit ASCII character set, with
some of the other low-valued characters and only a few higher-valued
characters, UTF-8 makes sense. If most of the characters have larger
values (as when using a non-Latin-based character set), then UTF-16 may
make much more sense.
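[To make the break-even points concrete, here is a small helper, with a name of my own choosing, that returns how many octets UTF-8 needs for a given code point. For comparison, UTF-16 uses two octets for everything up to U+FFFF and four above that.]

```c
#include <stdint.h>

/* Number of octets UTF-8 needs for a given Unicode code point.
 * ASCII stays at 1 byte; many non-Latin scripts land at 2-3 bytes,
 * which is where UTF-16 (2 bytes anywhere in the BMP) can win. */
static int utf8_len(uint32_t cp)
{
    if (cp < 0x80)    return 1;  /* ASCII */
    if (cp < 0x800)   return 2;  /* Latin supplements, Greek, Cyrillic, ... */
    if (cp < 0x10000) return 3;  /* rest of the BMP, incl. CJK */
    return 4;                    /* supplementary planes (UTF-16 needs 4 too) */
}
```

[So for mostly-CJK text UTF-8 costs three octets per character against UTF-16's two, while for mostly-ASCII text the ratio flips.]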
Richard Damon <news.x.richarddamon@xoxy.net> writes:
> On 12/11/12 9:53 AM, John Devereux wrote:
>> UTF-8 is the way forward, isn't it?
>
> As with most compression systems, it depends on the usage pattern of
> the characters. If the text is mostly the 7-bit ASCII character set,
> with some of the other low-valued characters and only a few
> higher-valued characters, UTF-8 makes sense. If most of the characters
> have larger values (as when using a non-Latin-based character set),
> then UTF-16 may make much more sense.

UTF-8 has a couple of other advantages. It's equivalent to ASCII as long
as all the characters are <= 127, which means you can (mostly) deal with
UTF-8 using old tools that aren't Unicode-aware. And it has no byte
ordering issues, so it doesn't need a BOM (Byte Order Mark).

As for compression, you can always use another compression tool if
necessary; gzipped UTF-8 should be about as compact as gzipped UTF-16.

-- 
Keith Thompson (The_Other_Keith) kst-u@mib.org <http://www.ghoti.net/~kst>
Will write code for food.
"We must do something. This is something. Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
On 12/15/12 3:24 PM, Keith Thompson wrote:
> Richard Damon <news.x.richarddamon@xoxy.net> writes:
>> On 12/11/12 9:53 AM, John Devereux wrote:
>>> UTF-8 is the way forward, isn't it?
>>
>> As with most compression systems, it depends on the usage pattern of
>> the characters. If the text is mostly the 7-bit ASCII character set,
>> with some of the other low-valued characters and only a few
>> higher-valued characters, UTF-8 makes sense. If most of the characters
>> have larger values (as when using a non-Latin-based character set),
>> then UTF-16 may make much more sense.
>
> UTF-8 has a couple of other advantages. It's equivalent to ASCII
> as long as all the characters are <= 127, which means you can
> (mostly) deal with UTF-8 using old tools that aren't Unicode-aware.
> And it has no byte ordering issues, so it doesn't need a BOM (Byte
> Order Mark).
>
> As for compression, you can always use another compression tool
> if necessary; gzipped UTF-8 should be about as compact as gzipped
> UTF-16.
UTF-8 and UTF-16 *ARE* compression methods. Uncompressed Unicode would
be UTF-32 or UCS-4, using 32 bits per character.

For most uses, if you don't need code points above U+FFFF, you might
consider UCS-2 the uncompressed format. Then UTF-16 isn't really
compression, but a method to mark the very rare characters above U+FFFF.

UTF-8 is really just a compression format that tries to squeeze out some
of the extra space, and it succeeds to the extent that characters in the
range 0-7F are more common than U+0800 and higher: the former save you a
byte each, the latter cost you one.

UTF-8 does have the other advantage you mention: looking like ASCII for
those characters allows many Unicode-unaware programs to mostly function
with UTF-8 data.
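[The "marking" of characters above U+FFFF is the surrogate-pair mechanism. A minimal encoder sketch (my own naming; no validation of the reserved surrogate range) shows how small the step from UCS-2 to UTF-16 really is:]

```c
#include <stdint.h>

/* Encode one code point as UTF-16 code units; returns the count (1 or 2).
 * Everything up to U+FFFF is a single 16-bit unit (effectively UCS-2);
 * the rare characters above U+FFFF become a surrogate pair. */
static int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                               /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```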
Richard Damon <news.x.richarddamon@xoxy.net> writes:
> On 12/15/12 3:24 PM, Keith Thompson wrote:
[...]
>> As for compression, you can always use another compression tool
>> if necessary; gzipped UTF-8 should be about as compact as gzipped
>> UTF-16.
>
> UTF-8 and UTF-16 *ARE* compression methods.

[...]

I don't recall saying they aren't.

But they're (relatively) simplistic compression methods that don't adapt
to the content being compressed, which is why applying another
compression tool (I *did* say "another") can be useful.

-- 
Keith Thompson (The_Other_Keith) kst-u@mib.org <http://www.ghoti.net/~kst>
Will write code for food.
"We must do something. This is something. Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
On Sat, 15 Dec 2012 14:30:44 -0500, Richard Damon wrote:

>> UTF-8 is the way forward, isn't it?
>
> As with most compression systems, it depends on the usage pattern of
> the characters. If the text is mostly the 7-bit ASCII character set,
> with some of the other low-valued characters and only a few
> higher-valued characters, UTF-8 makes sense. If most of the characters
> have larger values (as when using a non-Latin-based character set),
> then UTF-16 may make much more sense.
Size isn't the only issue; the fact that UTF-16 may (and usually does)
contain null bytes ('\0') rules it out for many applications. Similarly,
anything that expects specific bytes (e.g. '\x0a', '\x0d', etc.) to have
their "usual" meanings regardless of context will work fine with UTF-8
but not with UTF-16 or UTF-32.
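[The embedded-null problem is easy to demonstrate. The UTF-16LE bytes for plain ASCII text interleave 0x00 octets, so the classic byte-oriented C string functions stop at the first character (the array name here is my own):]

```c
#include <string.h>

/* "AB" encoded as UTF-16LE, followed by its two-byte terminator.
 * Every ASCII character carries a 0x00 octet, so code that treats '\0'
 * as end-of-string (strlen, strcpy, and friends) truncates after 'A'.
 * The same text as UTF-8 would simply be "AB" -- no embedded nulls. */
static const char utf16le_ab[] = { 0x41, 0x00, 0x42, 0x00, 0x00, 0x00 };
```

[strlen() on this buffer reports 1, while the payload is really four octets; this is the sense in which UTF-16 breaks byte-oriented tooling independent of any size argument.]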
On Sat, 15 Dec 2012 14:30:44 -0500, Richard Damon
<news.x.richarddamon@xoxy.net> wrote:

> On 12/11/12 9:53 AM, John Devereux wrote:
>> UTF-8 is the way forward, isn't it?
>
> As with most compression systems, it depends on the usage pattern of
> the characters. If the text is mostly the 7-bit ASCII character set,
> with some of the other low-valued characters and only a few
> higher-valued characters, UTF-8 makes sense. If most of the characters
> have larger values (as when using a non-Latin-based character set),
> then UTF-16 may make much more sense.

For any given non-Latin-based language, there are only a few possible
bit combinations in the first byte(s) of each UTF-8 sequence, so it
should compress quite well.

For use inside a program, UTF-32 would be the natural choice, with one
array element per character.

Compressing a UTF-32 file using some form of Huffman coding should not
take more space than compressed UTF-8/UTF-16 files, since the symbol
table actually used (and stored) would reflect the actual usage of
sequences in the whole file. Doing the compression on the fly over a
communication link would be less effective, since only part of the data
is available at a time, in order to keep latencies acceptable.
On 12/15/12 6:45 PM, Keith Thompson wrote:
> Richard Damon <news.x.richarddamon@xoxy.net> writes:
>> On 12/15/12 3:24 PM, Keith Thompson wrote:
> [...]
>>> As for compression, you can always use another compression tool
>>> if necessary; gzipped UTF-8 should be about as compact as gzipped
>>> UTF-16.
>>
>> UTF-8 and UTF-16 *ARE* compression methods.
> [...]
>
> I don't recall saying they aren't.
>
> But they're (relatively) simplistic compression methods that don't
> adapt to the content being compressed, which is why applying another
> compression tool (I *did* say "another") can be useful.
But they are fundamentally different from other compression methods.
Multi-byte/symbol encodings are generally designed so that it is
possible to process the data in that encoding; it isn't much harder to
process the data than if it were kept fully expanded. Some operations,
like computing the length of a string, require a pass over the data
instead of just taking the difference of two addresses, but nothing
becomes particularly hard.

On the other hand, it is very unusual for any program to actually
process "zipped" data as such; it is almost always uncompressed to be
worked on and then re-compressed, and any change tends to require
reprocessing the entire rest of the file (or at least the current
compression block).
Richard Damon <news.x.richarddamon@xoxy.net> writes:
> On 12/15/12 6:45 PM, Keith Thompson wrote:
>> Richard Damon <news.x.richarddamon@xoxy.net> writes:
>>> On 12/15/12 3:24 PM, Keith Thompson wrote:
>> [...]
>>>> As for compression, you can always use another compression tool
>>>> if necessary; gzipped UTF-8 should be about as compact as gzipped
>>>> UTF-16.
>>>
>>> UTF-8 and UTF-16 *ARE* compression methods.
>> [...]
>>
>> I don't recall saying they aren't.
>>
>> But they're (relatively) simplistic compression methods that don't
>> adapt to the content being compressed, which is why applying another
>> compression tool (I *did* say "another") can be useful.
>
> But they are fundamentally different from other compression methods.
> Multi-byte/symbol encodings are generally designed so that it is
> possible to process the data in that encoding; it isn't much harder to
> process the data than if it were kept fully expanded. Some operations,
> like computing the length of a string, require a pass over the data
> instead of just taking the difference of two addresses, but nothing
> becomes particularly hard.
>
> On the other hand, it is very unusual for any program to actually
> process "zipped" data as such; it is almost always uncompressed to be
> worked on and then re-compressed, and any change tends to require
> reprocessing the entire rest of the file (or at least the current
> compression block).

I'd say that's a difference of degree, not anything fundamental.
Computing the length of a string requires a pass over it, whether it's
UTF-8 encoded or gzipped. And it's certainly possible to process UTF-8
data by internally converting it to UTF-32. And copying a file doesn't
require uncompressing it, regardless of the format.

-- 
Keith Thompson (The_Other_Keith) kst-u@mib.org <http://www.ghoti.net/~kst>
Will write code for food.
"We must do something. This is something. Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
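[For what it's worth, the single pass both posters refer to is short to write in C. A sketch that counts code points in a NUL-terminated UTF-8 string (function name is my own): continuation bytes all match the pattern 10xxxxxx, so counting every byte that is *not* a continuation byte yields the character count.]

```c
#include <stddef.h>

/* Count code points in a NUL-terminated UTF-8 string in one pass:
 * continuation bytes look like 10xxxxxx, so count every byte whose
 * top two bits are not exactly 10.  O(n), like strlen itself. */
static size_t utf8_count(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

[This illustrates the "process the data in that encoding" point: no decompression step, just a different loop body than the UTF-32 version would use.]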
