Web Framework (97)

1 Name: #!/usr/bin/anonymous : 2008-01-14 05:50 ID:EkVUNjls

What should I use for my web application that isn't PHP? I would have gone straight to mod_perl a few years ago, but now there are so many alternatives. Is there anything better?

48 Name: #!/usr/bin/anonymous : 2008-02-06 01:58 ID:Heaven

>>47

> My understanding so far is that you're arguing that I should make no attempt to sort at all until it's perfect. Yes/no?

You shouldn't use the language's built-in sort operator when you want a locale-sensitive sort. In POSIX C, you can use qsort() with strcoll() and setlocale() to get a locale-sensitive sort. In Python this (usually) means mapping the strings through locale.strxfrm first. Perl has POSIX::strxfrm().
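
Roughly, a sketch of the Python version (assuming a Python 3 interpreter and a UTF-8 locale available on the system; in Python 2, locale.strxfrm wants byte strings instead):

import locale

locale.setlocale(locale.LC_COLLATE, '')     # pick up the user's locale from the environment
words = ['cote', 'coté', 'côte', 'côté']    # the classic French collation example
print(sorted(words))                        # plain code-point order
print(sorted(words, key=locale.strxfrm))    # locale-sensitive order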

I'm saying you shouldn't use sorted() if you're printing the result out to human beings. And if you insist on using it anyway, you can't turn around and argue that it's useful to bury unicode in the language semantics when you're not actually using it.

> > You're making do with that because your language's sort operator has to work on its strings and it's sort operator cannot be useful to humans.
> Could you rephrase this?

Sure. strxfrm() works on bytes, not unicode strings, in most environments, and yet it is almost certainly what you want when you're printing a sorted list. Unfortunately, the obvious and simple sorted() isn't what you want there; it is what you want when you need a stable (locale-ignorant) sort for unique detection and other algorithms. But that kind of sort doesn't care whether your characters are 8 bits or 24 bits wide; the unicode support in your language was, in this case, a waste of code.

49 Name: dmpk2k!hinhT6kz2E : 2008-02-06 06:25 ID:Heaven

> You shouldn't use the language builtin sort-operator when you want a locale-sensitive sort.

Why shouldn't the language's built-in sort support this?

50 Name: #!/usr/bin/anonymous : 2008-02-06 14:51 ID:Heaven

> Why shouldn't the language's built-in sort support this?

Because a locale-sensitive sort isn't necessarily stable.

More to the point, it's the comparison function that's the dangerous part (most sort algorithms simply require a comparator that returns left or right, -1 or 1, etc.). A locale-sensitive sort would change its output as soon as your locale changed.

51 Name: #!/usr/bin/anonymous : 2008-02-06 15:41 ID:vwnO7WC3

> The whole complaint about <tables> and unicode is stupid. The programmer doesn't deal with either of those things in arc. The fact that the programmer has to deal with those things in other languages points out what's stupid about those languages, and isn't in fact exposing anything about arc.

This was the original claim made. I still haven't seen anything even remotely resembling a justification for this.

All I've seen is some kind of claim that because implementing Unicode is hard, we should all just use byte arrays like we were coding in C. That doesn't really seem like "the programmer doesn't deal with these things", it seems like "the programmer has to implement these things from scratch again and again" or "the programmer has to use clumsy libraries to deal with these things".

That hardly seems like a good choice for a language meant for "exploratory programming".

52 Name: dmpk2k!hinhT6kz2E : 2008-02-06 18:34 ID:Heaven

> Because a locale-sensitive sort isn't necessarily stable.

Many implementations of quicksort aren't stable either -- or even deterministic -- but that hasn't stopped anyone. You can add an index field to force stability if that's desired.
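
A rough sketch of that decorate-with-index trick (plain Python; the data is made up):

data = ['b2', 'a1', 'b1']
# sort on the first character only; the index breaks ties in original order
stable = [s for i, s in sorted(enumerate(data), key=lambda t: (t[1][0], t[0]))]
print(stable)   # ['a1', 'b2', 'b1']

(Python's own sorted() happens to be stable already; the index just makes the guarantee explicit for sorts that aren't.)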

> A locale-sensitive sort would change the output as soon as your locale changed.

Often that's the point. If I know a user speaks Norwegian, either through their preferences or Content-Language, I'd like what they see to be sorted in an order they're familiar with.

If I don't know, well, better an ordering than no ordering.

53 Name: #!/usr/bin/anonymous : 2008-02-06 20:27 ID:Heaven

>>52 So you accept that building unicode into the language didn't help with sorting?

>>51 The programmer has to use clumsy libraries anyway. They just think they don't, because they think having unicode built into the language is causing fewer problems when it is in fact causing more.

> I still haven't seen anything even remotely resembling a justification for this.

The onus isn't on me to demonstrate the non-usefulness of a non-feature, it's for you to demonstrate the usefulness of a feature; I said arc lacking unicode doesn't matter but you seem to think it does.

You think it's easy
You think it lets you do something you couldn't otherwise
You think it's saving you time
You think it's more elegant

Not only have I demonstrated that none of these are true, I've gone further and shown that it's a good place to hide bugs and that its behaviour is often unexpected.

54 Name: #!/usr/bin/anonymous : 2008-02-06 21:24 ID:j/yqEPJq

>>53

None of that has anything to do with "The programmer doesn't deal with either of those things in arc."

That is a completely different, and completely insane, statement. If it were actually true, it would mean pretty much the exact opposite of what you're arguing.

55 Name: #!/usr/bin/anonymous : 2008-02-07 03:18 ID:Heaven

>>54 No, it has to do with the negative justification for unicode in the language.

Really: if I found myself needing some unicode transformation, I'd write a macro and be done with it. In a language like Python, by contrast, you have to find out about the problem, go edit all your uses of sorted(), and apologize to the user for not understanding the full scope of the problem when you decided to use a language like Python.

If you think not having unicode builtin to arc causes problems, you're going to have to prove it.

56 Name: #!/usr/bin/anonymous : 2008-02-07 16:09 ID:Heaven

> If I found myself needing some unicode transformation I'd write a macro and I'd be done with it.

In other words, you'd have to deal with it. Manually.

And so "The programmer doesn't deal with either of those things in arc" is still untrue.

57 Name: #!/usr/bin/anonymous : 2008-02-07 17:57 ID:Heaven

>>56 That's your problem? Seriously?

The programmer in lisp sees things like this:

(elipse s)
(prn s)

he doesn't see:

elipse(s.decode('utf8'))
"<table><tr>"+"</tr><tr>".join(map....)."</tr></table>"

That's what I meant by not dealing with it.

Some more examples. I can use:

(utf8-chars s)

and:

(len s)

instead of:

s.decode('utf8').length
do { use bytes; length($s); };

and:

s.encode('utf8').length
do { use utf8; length($s); };

Yeah. Building it into the language seems like a big win.

Meanwhile, Pg says he's going to make a case for unicode. It's not important for any of the programs arc has been used for so far, and it's not a showstopper for any real application anyone is writing. Let's see it done right, rather than repeating the crap that Python, Perl, Java, and, well, everyone else did.

58 Name: #!/usr/bin/anonymous : 2008-02-07 22:01 ID:Heaven

> s.decode('utf8').length
> s.encode('utf8').length

You mean len(s.decode('utf8')) and len(s.encode('utf8')). And if you're actually using unicode in an app, you really ought to be using it end-to-end anyway, which means you're doing your encoding/decoding at input and output -- and that's something Python already gives you with codecs.open().
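
For instance, a minimal sketch of that end-to-end style with codecs.open (Python 2; the file names are just placeholders):

import codecs

src = codecs.open('in.txt', 'r', encoding='utf-8')
text = src.read()                 # unicode objects from here on in
src.close()

dst = codecs.open('out.txt', 'w', encoding='utf-8')
dst.write(text)                   # encoded back to utf-8 only at output
dst.close()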

How often do you need to know how many bytes something is, except when writing something to disk? (Which, incidentally, falls clearly under the category of "output".) Even then, it's not generally necessary.

59 Name: #!/usr/bin/anonymous : 2008-02-08 01:58 ID:Heaven

No! You shouldn't be using it end-to-end! You should be avoiding transformations, because they take time and introduce more places for bugs to hide. Saying you need len() to mean characters instead of storage size because "How often do you need to know how many bytes something is, except when writing something to disk?" is about the stupidest rationalization I've ever seen!

Web apps receive data as byte-escaped utf8 or iso-8859-1/15. Apart from the strings you're receiving, why would you bother re-coding every other string just to handle those?

Unicode strings just plain aren't useful. They may be convenient for the language designer, but they aren't useful to the programmer. The programmer normally only needs two kinds of strings: byte strings and locale-aware presentation strings. The fact that you can build both on top of unicode strings isn't a feature of unicode strings.

60 Name: #!/usr/bin/anonymous : 2008-02-08 02:49 ID:Heaven

>>57

You can do that in almost any language ever, if you just use libraries that hide all the hard work from you. That's not a strength or weakness of any language.

61 Name: #!/usr/bin/anonymous : 2008-02-10 12:17 ID:Heaven

The problem with byte strings is that for any given byte string declared somewhere in the application, nobody knows its encoding, so nobody can convert it into usable characters. So you end up making some class which ties the encoding together with the byte array, or standardising on some encoding. But if your application is going to be used in more than a single country, what choice of encodings do you have if you want to support them all?

Even wchar_t has an encoding, of course; it's just that it's always UTF-16. And UTF-16 still requires two code units to encode a single character in some situations; the way it does so happens to be very similar to UTF-8.
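
For example (a quick check in Python; the particular character is arbitrary):

>>> s = u'\U00010000'                  # first code point outside the BMP
>>> len(s.encode('utf-16-be')) // 2    # UTF-16 code units: a surrogate pair
2
>>> len(s.encode('utf-8'))             # bytes in UTF-8
4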

I've yet to see a system which plans ahead and uses UTF-32 for strings. It would take more memory, sure... but memory isn't that expensive.

62 Name: #!/usr/bin/anonymous : 2008-02-13 21:46 ID:i+ITJfDJ

>>61 Justify converting it into usable characters.

printf() certainly accepts bytestrings. So does xterm and your-favorite-web-browser-here, so what exactly does one need to convert a bytestring into "usable characters"?

The operations that people have mentioned -- sorting, ellipsizing, and substringing -- vary more with the locale than with the supplied encoding. For the cases where you want sorting, or substrings for non-presentation uses (machine algorithms like unique-detection and counting), bytestrings are satisfactory.

Btw, Erlang uses 32-bit integers for strings, and Factor uses 24-bit ones. Even those aren't "big enough" for the eventual case when someone decides a bignum is needed...

63 Name: #!/usr/bin/anonymous : 2008-02-14 12:23 ID:A7jyxFQb

>what exactly does one need to convert a bytestring into "usable characters"?

This is the dumbest question I have heard here yet. Any application which needs to do actual processing on text will need to know where one character stops and another character begins. If you don't know what character set is inside your blob of byte[] then how do you even find out where the spaces are to break at the word boundaries?

64 Name: #!/usr/bin/anonymous : 2008-02-14 14:44 ID:i+ITJfDJ

>>63 That's as stupid as insisting you can't use a void* to refer to an int because you don't know that it is in fact an int. Look at this another way: your text files on your hard disk don't have character set tagging, and yet you can read from them just fine.

If you're reading utf8 files into strings, you know the bytestring contains utf8. If you're reading shift-jis files into strings you know the bytestring contains shift-jis. You generally isolate all of your charset and locale-awareness into a specific part of your program. You don't need to pepper it all over the fucking place to go wordwrap(s,72).

wchar_t was bad engineering. It convinced a lot of people that you needed another set of string APIs, and another kind of string. You don't. Your filesystem doesn't support those kinds of strings anyway, so it doesn't really add any features or give you any new expressiveness (or conciseness), but it does introduce strange new places to hide bugs.

The fact remains: unicode support in the language doesn't buy you anything, and costs you a lot. You still need to be aware (as the programmer) of charset conversion at input and output, because that information isn't available from the environment. Two trivial examples that don't exist in reality don't change that. Your isspace() example could simply be called utf8_isspace(), because you still need to know that the input was utf8 anyway.
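
As a sketch of what such a utf8_isspace() might look like (hypothetical helper, Python 2; the caller is asserting that the bytes are utf8):

def utf8_isspace(b):
    # decode at this one boundary; everything else stays a bytestring
    return b.decode('utf-8').isspace()

print(utf8_isspace(' '))           # True
print(utf8_isspace('\xc2\xa0'))    # True: U+00A0 NO-BREAK SPACE is whitespace too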

Maybe this'd be different if the filesystem encoded charset and locale information reliably. It doesn't, though, so you're still tasked (as the programmer) with working primarily in bytestrings, and with transcoding explicitly when directed.

65 Name: #!/usr/bin/anonymous : 2008-02-14 16:47 ID:Heaven

>>64

> Your isspace() example could simply be called utf8_isspace() because you still need to know what was inputted was utf8 anyway.

Oh yes, let's hardcode everything to use utf8, and force everyone to use it. That's much better than supporting a wide range of character sets, being flexible, and allowing people to load their existing files without running their entire hard drive through recode, remastering all their CDs and DVDs, and proxying every website they look at.

Not to mention perhaps people want to read an existing SJIS formatted text file when the system they're using defaults to UTF-8. So now you have two encodings to worry about. Are you going to keep your calls to utf8_isspace and sjis_isspace straight? Or will you go insane encoding and decoding manually at every step instead of using plain and SIMPLE Unicode in the backend, and setting input_encoding and output_encoding flags for the I/O subsystem?

66 Name: #!/usr/bin/anonymous : 2008-02-14 19:11 ID:i+ITJfDJ

>>65

> That's much better than supporting a wide range of character sets,

It is.

> and allowing people to load their existing files without running their entire hard drive through recode,

How does that follow? Your unicode-aware language is recoding their files every time you load them, but you don't know what the original coding is. My plan only recodes them when we think it's worth the bother, and hopefully we can be aware that figuring out the charset is part of that bother.

> remastering all their CDs and DVDs,

How does this follow? Your unicode-aware language needs to be aware of all character sets in order to read any of them. Mine doesn't even bother most of the time and reads CDs and DVDs just fine.

> and proxying every website they look at.

You have to do this anyway: you cannot save an HTML file from the web to your hard disk without transcoding all the entities to us-ascii or having a charset-preserving filesystem. HTML has a different "default" charset on disk (us-ascii) than it does on the wire (iso-8859-1).

> Not to mention perhaps people want to read an existing SJIS formatted text file when the system they're using ... So now you have two encodings to worry about.

You already have two encodings to worry about. You don't know the charset of the file you're loading because your filesystem doesn't preserve that information reliably.

> Are you going to keep your calls to utf8_isspace and sjis_isspace straight?

Why do you think I have to? Only a very small class of programs doing a very small class of things will ever have to deal with SJIS, let alone any other character set. For a web app I'll only ever see %-encoded utf-8 or iso-8859-1. If I'm writing a text editor, presumably I translate the code points to glyphs that I'll use during rendering.
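
For the web-app case, a rough sketch of what that %-encoded utf-8 looks like when it arrives (Python 2; the form field is made up):

import urllib

raw = 'name=%C3%A9clair'                     # application/x-www-form-urlencoded
value = urllib.unquote(raw.split('=')[1])    # '\xc3\xa9clair' -- still utf-8 bytes
print(value.decode('utf-8'))                 # becomes a charstring only when characters are needed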

> setting input_encoding and output_encoding flags for the I/O subsystem

The I/O system only deals with bytes. TCP only deals with bytes. Disks only hold files containing: bytes. If your internal representation is utf8, and your external representation is utf8, why the hell would you transcode at all?

67 Name: #!/usr/bin/anonymous : 2008-02-14 23:14 ID:Heaven

>>64

Look, what you're talking about is how programming languages used to work in the past. Perl 4 was like that. C is still like that.

Turns out, it fucking sucks to do things that way. That is why all languages add unicode support these days. They don't do it because they've been brainwashed by some Unicode conspiracy to design their languages to be horrible. They do it because it makes everything much easier, and everybody knows it, because the alternatives have been tried.

68 Name: #!/usr/bin/anonymous : 2008-02-15 00:50 ID:Heaven

> How does this follow? Your unicode-aware language needs to be aware of all character sets in order to read any of them. Mine doesn't even bother most of the time and reads CDs and DVDs just fine.

What?! By that logic, your brain needs to be aware of all languages in order to understand any of them.

I'm not even going to bother replying to the rest of that nonsense because you're obviously a fucking troll.

69 Name: #!/usr/bin/anonymous : 2008-02-15 13:19 ID:A7jyxFQb

>>68 is quite right.

To cite a concrete example, Java uses UTF-16 as its native string storage, and the extra charsets are an optional part of the install. If you don't install them, everything still works fine unless you happen to run into one of those character sets.

>>64 seems to think that decoding a string at every single function is an efficient way to write code. You do understand that decoding has overhead. Right?

I don't really object to UTF-8 though. At the very least it's neither better nor worse than UTF-16, as they both work in more or less the same fashion for characters outside the range that can be represented by a single code unit.

Oh yes, and DVDs work fine because the subtitles are stored as GRAPHICS. I think you will find that real subtitles require knowing the encoding of the subtitle file, whether it's standardised, stored in the file or somewhere else.

70 Name: #!/usr/bin/anonymous : 2008-02-15 15:37 ID:Heaven

>>69 WRONG. Java doesn't use UTF-16. Check again!

It uses a character set that ONLY Java uses or supports. It's based on UTF-16, but some code points have the wrong byte values for UTF-16. If you read a Java binary file as UTF-16, you will destroy data and never know it.

Yet another reason why transparent transcoding is stupid.

> >>64 seems to think that decoding a string at every single function is an efficient way to write code.

64 thinks that decoding and recoding strings all the time is a stupid and mind-blowingly retarded way to write code. Why do you think I'm saying otherwise? Because I'm saying that having transcoding built into the language and made transparent is stupid and worse than useless? Because I'm saying (and demonstrating) that transcoding bugs are hard to locate because they are entirely data-related? Or because having every string be a transcoding unicode string and every i/o operation be a transcoding unicode operation means that EVERY STRING operation and EVERY IO operation is a place where you could have a data-related bug that you might never find?

You don't need to transcode very often. Having it built into the language makes it easier to transcode- when you don't have to, and when you're doing it wrong. It hides bugs, and it doesn't solve problems.

68 has failed to indicate a single reason why having the language transcode for you transparently and invisibly is a good thing, and YOU, >>69, are defending him while at the same time saying you're worried about performance. What is wrong with you?

71 Name: dmpk2k!hinhT6kz2E : 2008-02-15 17:12 ID:Heaven

> EVERY STRING operation and EVERY IO operation is a place where you could have a data-related bug that you might never find.

This sounds like the manually- versus automatically-managed memory argument again.

72 Name: #!/usr/bin/anonymous : 2008-02-16 12:24 ID:Heaven

> 64 thinks that decoding and recoding strings all the time is a stupid and mind-blowingly retarded way to write code. Why do you think I'm saying otherwise?

Because you suggested having utf8_* functions to do every single string operation. What do you think UTF-8 is, a chicken? It's an encoding!

73 Name: #!/usr/bin/anonymous : 2008-02-16 12:26 ID:Heaven

> Yet another reason why transparent transcoding is stupid.

Um, Java doesn't do it transparently; you have to specify the charset in almost all situations. If you specify nothing, it uses the platform default encoding (dangerous in its own way, but not the topic of the conversation).

74 Name: #!/usr/bin/anonymous : 2008-02-16 13:36 ID:ONvOLVru

>>72 I most certainly did not! I said you can use a utf8_* function if you know the content is utf8 and it matters. If it doesn't matter, don't transcode it. Don't even look at it! Most of the things you want to do with a string are the same when treating it as a bytestring. The special cases are locale-sensitive comparison and character-aware ellipsizing/word-wrapping. If you're writing routines to do this over and over again, then yes, you should have it in your language. But if you're not, why are you translating to bignum arrays all the time? Why is substr so slow?
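
A sketch of the ellipsizing special case (hypothetical helper, Python 2; assumes the caller knows the bytes are utf8):

def utf8_ellipsize(b, maxchars):
    u = b.decode('utf-8')          # the one place where characters matter
    if len(u) <= maxchars:
        return b
    return (u[:maxchars - 1] + u'\u2026').encode('utf-8')   # U+2026 HORIZONTAL ELLIPSIS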

If you think there are other special cases, I'd love to hear about them. Nobody seems to post any of them here.

>>73 "Almost all" situations? I was specifically talking about serialization, but platform-default encoding is a better example.

What exactly is the platform-default encoding, anyway? When you save HTML files to a Windows PC, do you convert them to us-ascii? Or do you violate MIME and at least avoid destroying data by converting to the current codepage, and god forbid the user ever changes it?

On a Linux PC, what exactly is the default coding? Or on a Mac?

If there were a meaningful default and lossless coding, it might be useful to operate this way. But as it is, the "default" IO often simply destroys data and nobody ever notices, which I think makes it exactly the topic of conversation: unicode hides bugs.

Unicode doesn't solve anything in programming languages because the messy nonsense is in locale-specific things and in system-specific things and history demonstrates that programming languages can't really solve either of those things. Because of that I contend that unicode in the language is simply a place to hide bugs and unexpected gotchas, for no real benefit.

http://www.moserware.com/2008/02/does-your-code-pass-turkey-test.html

brought up this exact topic, although it reaches very different conclusions. The author suggests you pepper your code with NumberFormatInfo.Invariant and StringComparison.CurrentCultureIgnoreCase. Using strcmp for asciiz strings and strcoll when you're comparing user input seems fine to me. If the environment is unicode, it had to fit into a bytestring anyway for it to get into argv. As you can see, the cool thing is that you get better i18n support without putting unicode in your language, because thinking of characters as the primitive unit of IO is what's wrong.

75 Name: #!/usr/bin/anonymous : 2008-02-16 15:30 ID:Heaven

Oh god, someone put this thread out of its misery.

76 Name: #!/usr/bin/anonymous : 2008-02-17 13:42 ID:Heaven

>>74
Default encoding depends on the locale. The locale depends on the user. Personally on my Linux machines I have it set to UTF-8.

77 Name: #!/usr/bin/anonymous : 2008-02-17 16:43 ID:Heaven

>>76

Right, but if another user doesn't know what encoding it is, that user might not be able to read it.

Seriously: You wouldn't expect cat to refuse to type out a file because $LANG was set wrong, would you?

Unicode is hard. It's not surprising there aren't any languages that have gotten it right yet, and the arguments about "how do you ellipsize text without a unicode string" demonstrate just how insulated most programmers are from these problems.

What is surprising is how much ignorance there is on the subject, the underlying assumption that unicode is basically a solved problem, and the vehement badmouthing of a language that hasn't implemented anything about it yet.

78 Name: dmpk2k!hinhT6kz2E : 2008-02-17 18:26 ID:Heaven

> "how do you elipse text without a unicode string" demonstrate just how insulated most programmers are from these problems.

Hey, I thought that was the idea!

It's why we leave memory to a GC and don't build our own stack frames. If it's hard we should leave it to experts to solve, and hopefully they'll be nice enough to provide us a simple API that hides the hairiness.

79 Name: #!/usr/bin/anonymous : 2008-02-17 22:05 ID:Heaven

> Right, but if another user doesn't know what encoding it is, that user might not be able to read it.

That's not an argument against unicode. That's just common sense and it holds true no matter how you do the string processing in your program.

80 Name: #!/usr/bin/anonymous : 2008-02-17 23:10 ID:ONvOLVru

>>78 I didn't say the expectation was wrong, just the implementation. All these so-called experts are still figuring it out themselves, and have been for over two decades. Ignoring that fact is extremely dangerous.

Until it's a solved problem, all programmers need to be at least vaguely aware of just how bad this is. Right now: files and network-streams use bytes. Operating systems use bytes. Bytes are very well understood. Bignum strings are not.

>>79 No, you're absolutely right. It's not an argument against unicode; it's an argument against building unicode into the language, precisely because programmers have to be aware of what's wrong and why it's wrong. So long as they don't know, and don't think about this sort of thing, they're making mistakes, and some of them are big ones.

81 Name: #!/usr/bin/anonymous : 2008-02-18 00:43 ID:Heaven

82 Name: #!/usr/bin/anonymous : 2008-02-18 00:47 ID:Heaven

>>81 I described python's behavior. Other languages aren't dissimilar.

83 Name: #!/usr/bin/anonymous : 2008-02-18 12:50 ID:Heaven

>>82

I am talking about your stuff like utf8_isspace().

84 Name: #!/usr/bin/anonymous : 2008-02-18 13:22 ID:A7jyxFQb

> Seriously: You wouldn't expect cat to refuse to type out a file because $LANG was set wrong, would you?

It will result in bogus output, but it won't be cat's fault. cat ultimately types out bytes, but the terminal will decode it incorrectly and you'll still get garbage. To the end user the result is more or less the same whether it's cat's fault or not though.

85 Name: #!/usr/bin/anonymous : 2008-02-18 18:10 ID:Heaven

>>83 Why do you think utf8_isspace() is worse than .decode('utf8').isspace()? I think utf8_isspace() is better simply because in the latter case, the programmer is likely to simply use isspace() instead and be wrong.

>>84 Are you saying you would prefer cat refuse to type out a file because $LANG was set wrong?

And to be clear: the user isn't always aware of what's wrong. Many users insist that utf-8 is wrong on IRC because many people outside the US don't use utf-8 on IRC (for whatever legacy reason). IRC doesn't transcode for people, and the protocol doesn't specify the character set. It's a mess, but it's a mess that the programmers ultimately have to deal with.

86 Name: #!/usr/bin/anonymous : 2008-02-19 12:08 ID:Heaven

>>85
utf8_isspace() by itself is fine, until you have utf8_chomp(), utf8_substring(), utf8_indexof(), utf8_charat(), utf8_reverse(), utf8_split() and a few dozen other functions to do all the other string manipulations you need. Then you realise that instead of decoding it into characters once, you're decoding it every time you call a utility method. It makes more sense to get everything into characters up-front, and do the decode/encode only once for the entire journey.
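
The "decode once" shape, roughly (Python 2; the string literal stands in for bytes read off a socket):

import textwrap

raw = 'caf\xc3\xa9 au lait, s\'il vous pla\xc3\xaet'   # utf-8 bytes straight from I/O
text = raw.decode('utf-8')            # one decode, at the boundary
print(len(text))                      # counts characters, not bytes
wrapped = textwrap.fill(text, 10)     # every later operation works on characters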

If you're worried about developers using isspace() on a raw byte, then don't put that method on the Byte class, only put it on the Character class. What is a "space byte" anyway?

And yeah, IRC as a protocol is evil (most people on there suck too, but that's another story). Well, it turned out not to be very good for languages other than English, and someone had to perform some kind of hack to make it work. Since there is no way to specify the character set (although this could have been added trivially via a new command), UTF-8 would probably be the best choice out of all the bad ones.

At least XMPP standardises on UTF-8. Even though it's based on XML, the standard clearly specifies that use of other character sets is non-standard.

87 Name: #!/usr/bin/anonymous : 2008-02-19 14:21 ID:Heaven

>>85

Why do I think that?

You should be asking, why does pretty much every language designer out there think it is worse. That was my point.

Once again: We tried that, and it was horrible. Now we don't do that any more.

88 Name: #!/usr/bin/anonymous : 2008-02-19 17:58 ID:i+ITJfDJ

>>87

> You should be asking, why does pretty much every language designer out there think it is worse. That was my point.

Oh, okay, so because every language designer thinks cat should complain about poorly coded text files, you think so as well?

I think it's much more likely that every language designer was wrong. Heck, Guido knows he's wrong which is why python2 doesn't work like python1, and python3 will be deliberately incompatible on this point.

Larry went through similar growing pains; 5.0, 5.6, 5.8 and 5.10 all work differently. perl6 may be different still. This obviously isn't solved.

Common Lisp has strict isolation, and yet the default external format for files is :default, which is simply some undefined, implementation-specific coding system.

Javascript implementations frequently disagree on what coding their source is in and what coding their data is in. us-ascii is the only thing that works portably.

Java has the worst of both worlds: Slow character access, lying length, and a horde of supporters who think that it has "pretty good unicode support".

Scheme provides no information on which character code maps to which code point, which means it is IMPOSSIBLE to implement case folding or locale-aware collating routines in portable Scheme.

How can you possibly believe that every other language has unicode? How can you possibly believe that the state of unicode is such that we can expect a certain level of functionality out of new protocols and new languages?

> Once again: We tried that, and it was horrible. Now we don't do that any more.

GNOME is one of the finest, most i18n-y application suites I've ever seen. It correctly handles bidi, currencies, datetime formats, and so on. And it's written in C. Simply writing _("text") fragments is easy because it encourages the programmer to use the formatting macros, and so everyone does it. It's a small thing to search for strings. It's a much harder thing to search for string manipulation functions.

i18n and localization are quite a bit better in C than they are in all these "new languages" being made by the language designers you keep talking about.

89 Name: #!/usr/bin/anonymous : 2008-02-19 18:27 ID:Heaven

> I think it's much more likely that every language designer was wrong. Heck, Guido knows he's wrong which is why python2 doesn't work like python1, and python3 will be deliberately incompatible on this point.

Yes, everybody was wrong. You're right, now guide us down the path to sanity and design us the perfect language that's completely free of any problems whatsoever. You seem to know everything about internationalization ever, so quit with the small talk and show us your amazing solution.

90 Name: #!/usr/bin/anonymous : 2008-02-19 19:57 ID:Heaven

> Oh, okay, so because every language designer thinks cat should complain about poorly coded text files, you think so as well?

Nobody thinks that outside of your head. Please don't insult our intelligence with such utterly ridiculous strawman arguments. Seriously.

> This obviously isn't solved.

It may not be solved, but you're saying we should stop even trying to solve it, and just suffer on with the solutions we used to have, which were much, much worse.

All you're doing is picking on problems other languages have, but not offering anything that isn't many times worse. Pretty much everybody would much rather have those problems than suffer under the horrible inflexibility you seem to think is preferable.

> GNOME

Is an application, not a programming language. Also, nobody has said that you can't write internationalized code in C. We are saying that it is much more work than in a language designed to handle this properly. And even so GNOME only needs to use a single character set, instead of a multitude of them.

> i18n and localization is quite a bit better in C than it is in all these "new languages" being made by these language designers you keep talking about.

Find a single real programmer who agrees with you on that, and perhaps I will take you seriously.

91 Name: #!/usr/bin/anonymous : 2008-02-19 19:59 ID:i+ITJfDJ

>>89 I have seen many problems, and I can share what I consider obvious. But I do not have such hubris that I think I have all the answers.

Nevertheless, here are some things I think might help:

  1. Using unicharacter strings whose character codes align with ascii makes it hard for people unfamiliar with the subtleties of unicode to test properly. Possible solution: make it harder for them to stay unfamiliar by XORing all the unicharacter codes with some magic number.
  2. Case transformation and collating are locale-sensitive. All unicharacter library routines should take a locale identifier where appropriate. The default locale should be Russian everywhere but Russia, and Chinese there. Word wrapping, collating, ellipsizing, and case transformation should require this locale setting. Bonus: make the locale a dynamic variable instead of an argument.
  3. Streams are always binary. They might have many coded things, and they might even change coding in the middle. Don't ask for coding at open-time, but still make a per-stream default character set so that read() and write() can use a "default" character set sensibly.
  4. read() should accept a character set as an argument. If being given the "default" character set, it will produce a unistring. Without any character-set argument (or null, or None, etc) it should produce a bytestring.
  5. read() should be aware of a per-stream flag that indicates what byte-sequence (when read) stops automatic decoding.
  6. write() should accept a character set as an argument. It should generate an error if it cannot encode. It must not attempt to coerce a string into a bytestring if a character set argument is omitted.
  7. Message strings should be allocated specially, to ensure they're being setup by the i18n allocator. printf-like functions should not process %-masks on strings that didn't come from this allocator (or that aren't constant). This would eliminate a large number of security bugs that come from poor i18n practices as a side-effect.

That covers most of the ones I see on a daily basis. They're all easy to implement as libraries, because that's how I use them; a rough sketch of 4 and 6 follows below. The unicode support in most languages is simply too buggy and has too many subtle problems for it to be easy to do the right thing. Obviously, language-oriented support for these things would work differently from these library-centric ideas, but hopefully it would solve the same kinds of problems.
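
A rough library-level sketch of 4 and 6 (Python 2; the names uread/uwrite are made up for illustration):

def uread(stream, n=-1, charset=None):
    data = stream.read(n)
    # with a charset you get a charstring; without one you get the raw bytestring
    return data.decode(charset) if charset else data

def uwrite(stream, s, charset=None):
    if charset is None and isinstance(s, unicode):
        # point 6: never silently coerce a charstring into bytes
        raise TypeError('refusing to guess a coding for a charstring')
    stream.write(s.encode(charset) if charset else s)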

92 Name: #!/usr/bin/anonymous : 2008-02-19 20:09 ID:Heaven

>>90

> Nobody thinks that outside of your head. Please don't insult our intelligence with such utterly ridiculous strawman arguments. Seriously.

You're being naive. People do think this. Seriously. What do you expect this to do?

import shutil
f = file("in")
out = file("out", "w")
shutil.copyfileobj(f, out)

> We are saying that it is much more work than in a language designed to handle this properly.

And I am pointing out that you're wrong on both counts, and to make matters worse no language exists that handles it properly.

> Find a single real programmer who agrees with you on that, and perhaps I will take you seriously.

No you won't. Perhaps you live in a fantasy world where unicode is something everyone does the same way (they don't), and you reject a language that doesn't have explicit support for unicode because "everyone else does" but then you admit that unicode is hard, and that nobody is doing it right yet.

> It may not be solved, but you're saying we should stop even trying to solve it, and just suffer on with the solutions we used to have, which were much, much worse.

No, I'm not. I started by saying don't reject arc for lacking unicode support, because it doesn't need unicode support. Unicode support is completely broken WRT the web anyway. I'm saying nobody's got it right, so quit acting like the expectation is normal. Your expectation for unicode support is broken and brain-damaged, because it isn't being fulfilled by any language you or anyone else here has brought up.

FWIW, I think languages could have useful unicode support if they bothered to look at how people fuck up unicode support and explicitly target that. I'm convinced that the reason nobody has decent unicode support is that it's hard, and I'm objecting to the idea that it's easy and that what every language has is a good thing.

> Pretty much everybody would much rather have those problems than suffer under the horrible inflexibility you seem to think is preferable.

Those people are idiots then, or they've never given it any serious thought. I suspect more of the latter looking at this thread "oh python has unicode support xxx yyy" and that's the end of it.

93 Name: #!/usr/bin/anonymous : 2008-02-19 22:30 ID:Heaven

>>92

Are you even aware that most languages do make a distinction between byte streams and character streams? I've never seen a language where you had to decode byte streams from a file. You seem to be living in some insane fantasy land where people actually have to do this. If that were actually the case, you'd be right: it would be stupid.

Thing is, that is not the case.

If you're not going to argue about the real world, why should we bother listening to you?

94 Name: #!/usr/bin/anonymous : 2008-02-19 23:38 ID:Heaven

>>93

> Are you even aware that most language do make a difference between byte streams and character streams?

Are you even aware that there's no such thing as a character stream in most languages? The stream itself is in bytes. In order to operate on characters you need to decode, which produces charstrings from bytestrings.

Very few languages differentiate between bytestrings and charstrings:

>>> 'foo' == u'foo'
True

Python doesn't.

perl -e 'print do { use bytes; "foo"; } eq do { use utf8; "foo"; };'

Perl doesn't either.

(Common Lisp however, does)

Since you don't know the coding of a file (because the filesystem doesn't know), or perhaps since you're writing an IRC client and the coding can change mid-stream (and change back with a terminator character), you're almost certainly writing buggy code.

> I've never seen a language where you had to decode byte streams from a file.

That's a problem. You must decode bytestrings into charstrings if you're going to operate on them as characters. But because character operations also work on bytestrings, you can end up working on something you think has been decoded when it hasn't.

I recommend that people either avoid character operations or explicitly type their strings as charstrings or bytestrings, because otherwise you're keeping track of the coding across a potentially long path. Other programmers recommend simply using unicode for everything, but that makes people encode/decode needlessly and introduces new places where exceptions can be raised by surprise.

> If you're not going to argue about the real world, why should we bother listening to you?

You shouldn't listen to me. You should think about this yourself. Programmers are supposed to be thinking about this sort of thing. You are not thinking about it. You're arguing that other programmers are doing the same thing when they're clearly not and using that as justification to be an asshat.

95 Name: #!/usr/bin/anonymous : 2008-02-20 03:17 ID:Heaven

>>94
How about using the correct comparison operators to answer the question asked, instead of carefully constructing misleading code to support your argument? Python most certainly differentiates between the two; that's why there are DIFFERENT CLASSES for unicode and str. You can set a coding for a Python source file, which defaults to ASCII, and that is what you're testing: that the Unicode representation of a string is equal to the ASCII representation.

>>> 'foo' is 'foo'
True
>>> 'foo' is u'foo'
False

> You shouldn't listen to me.

Done. I shall now fully ignore this thread, for it gets stupider with every post.

96 Name: #!/usr/bin/anonymous : 2008-02-20 17:41 ID:i+ITJfDJ

>>95

I assume you're accepting all the other points in >>94, which, according to your statement in >>93, means you agree that almost all languages (including Python) have a stupid implementation of unicode.

I'll only cover the specific point brought up in >>95.

The is operator is irrelevant. Python will automatically promote a bytestring to a charstring. Consider the following:

>>> 'foo'+u'foo' 
u'foofoo'

This causes problems if the bytestring contains non-ascii characters:

>>> '\xa0'+u'foo'
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

This is clearly a bug, but one that will simply never come up for someone who is primarily testing with us-ascii data.

If Python refused to convert bytestrings to charstrings, this bug wouldn't exist.

> You can set a coding for a Python file, which defaults to ASCII, and that is what you're testing - that the Unicode representation of a string is equal to the ASCII representation.

No, you're not. You're testing whether a charstring contains the same characters as a bytestring. The problem is that bytestrings don't really contain characters:

>>>  u'\xa0' == '\xa0'
False

Python makes this compromise for legacy (pre-unicode) python code, and that's why Guido says he intends to fix it for Py3K. But this compromise creates bugs that programmers won't notice, and that's bad. If the language didn't treat bytestrings as us-ascii-encoded charstrings, 'foo' == u'foo' would fail and the programmer would notice immediately that something was being left non-unicode and go fix it.
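
Under Python 3's semantics, where bytestrings and charstrings never compare equal, the same check comes out the other way (assuming a Python 3 interpreter):

>>> b'foo' == 'foo'
False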

97 Name: #!/usr/bin/anonymous : 2008-02-21 03:00 ID:Heaven

>>95

> Done. I shall now fully ignore this thread, for it gets stupider with every post.

I bailed in the 50s.

This thread has been closed. You cannot post in this thread any longer.