Newsvine
Newsvine and Unicode

Mon Sep 1, 2008 11:41 PM EDT
newsvine, meta, html, xml, unicode, programmerdude, newsvine-comment, utf-8, newsvine-article, newsvine-seed, utf8, codepoint
By Chris from MN

Pasting from a word processor can surprise you!

The original Newsvine Comment

Characters & their Unicode Code Points

Basic Unicode Map. Over 65K Served!

128 codes in US-ASCII

Select and Copy

Edit - Paste (same as ^V)

Edit - Paste Special

Paste Special - select Unformatted Text

Same source, two results!

Table 1


It's tempting to use richly-formatted text as a copy-and-paste source, but it doesn't always work as expected. Sometimes you get strange characters or little boxes. Here's the deal.

The original text was, “let’s be friends. Can’t we all get along?” However, both apostrophes were replaced with junk character sequences. If you look closely, you see three characters: an â followed by two skinny boxes. (At least, that's how it displayed on my system.)

ASCII[1] and Code Points and Chars[2]

To make complete sense of all this, there are some things you need to know. You need to know a little about Unicode, a little about ASCII, a little about The Web and a little about copy/paste. I'll keep it as basic as possible.

Unicode

The Unicode part is fairly simple. Unicode is basically an international standard that assigns a unique number to each unique character. The number, called a code point, identifies a particular letter or symbol. So, for example, "A" is code point 65 and "Ω" is code point 937. The goal of Unicode is to provide a unique number for every character of every active major language worldwide.

Numbering all the world's characters requires a lot of code points; the current version of Unicode allows a little over one million (1,114,112, to be exact). What's important is that it means any given Unicode character could be a number as big as 1,114,111. Therefore, any software that works with Unicode must deal with numbers that large.

People (usually) count in decimal; computers (usually) count in binary. The noticeable difference is this: a decimal digit is one of ten symbols ("0"-"9"); a binary digit ("bit") is one of two ("0" & "1"). In both cases, the more digits (or bits) you use, the higher you can count.

The bottom line is this: computers need at least 21 bits to hold a Unicode character.
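If you have Python handy, you can check these numbers yourself. This is just a small sketch; the variable names are mine, and it relies on the fact that the highest code point Unicode defines is hex 10FFFF (1,114,111).

```python
# Code points are just integers; ord() gives the number for a character.
for ch in ["A", "Ω"]:
    print(ch, ord(ch))

# The highest code point Unicode defines is hex 10FFFF.
MAX_CODE_POINT = 0x10FFFF

# bit_length() reports how many bits are needed to hold a number.
bits_needed = MAX_CODE_POINT.bit_length()
print(MAX_CODE_POINT, bits_needed)
```

Twenty bits top out at 1,048,575, which is just short, so 21 it is.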

ASCII (a little history)

Long ago, when the USA was the center of the Universe, computers only "spoke" English. English has 26 letters, 10 counting digits and a bunch of punctuation. English has upper- and lower-case letters, so there's really 52 letters. Other codes are necessary for the space, the tab and other special characters. But, even if we are generous with punctuation and other codes, it's easy to get by with only 120 or so.

An early standard, called US-ASCII, assigns numbers to all the characters described above. This early standard has only 128 codes (numbered 0 through 127).

In binary, it takes just seven bits to count from 0 to 127. Computers of the day used eight-bit packages ("bytes"), which allowed 256 codes (numbered from 0 through 255). So it was natural that variations sprang up and expanded on US-ASCII by using the eighth bit. (That extra bit doubled the possible code points by adding 128 through 255.)

One immediate problem was that the extensions were not compatible. Different organizations used the new code points in different ways. An IBM PC version added box drawing characters (e.g. ■ ╠ ╤ ╛); a Latin version added characters from European languages (e.g. Ñ õ æ ç þ). Character sets became a very complicated proposition in computing.
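Here is a quick way to see the incompatibility. It is only a sketch, and I am assuming two specific extensions as stand-ins: Python's "cp437" codec for the IBM PC set and "latin-1" (ISO-8859-1) for the Latin set. The same octet comes out as two different characters.

```python
# One octet with the eighth bit set (decimal 209).
raw = bytes([0xD1])

# Interpret the very same octet under two extended-ASCII sets.
as_latin1 = raw.decode("latin-1")   # Western European set (assumed stand-in)
as_cp437 = raw.decode("cp437")      # original IBM PC set (assumed stand-in)

print(repr(as_latin1), repr(as_cp437))
```

Nothing about the octet itself says which reading is correct; the receiver has to know (or guess) which set the sender used.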

When computing became a global prospect, even the extended versions of ASCII weren't enough. There are thousands of Chinese characters and dozens of other active languages with their own characters. Early attempts tried to use the 8-bit packaging, but this complicated things and created even more standards to manage.

Unicode approached this differently by defining an abstract standard that just assigns numbers to characters. It also defines encoding schemes that describe how to package the numbers in different bit sizes.

The Web

The Internet uses 8-bit packages (formally called "octets") to move data. This means Unicode, as is, cannot be sent across The Web. A Unicode encoding scheme, called UTF-8, provides a way to transport Unicode on the Internet.

Basically, UTF-8 considers the number of bits a code point actually needs. When necessary, it breaks large codes into a sequence of small codes that fit in octets. Code points using zero to seven bits (0-127) pass through as is. Code points using eight or more bits (128+) are converted to a sequence of two or more octets.[3]
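For the curious, the packing can be sketched in a few lines of Python. The function name utf8_octets is made up for this illustration, and real software should simply use the built-in encoder (it also rejects a few illegal code points, which this sketch does not bother with).

```python
def utf8_octets(code_point: int) -> bytes:
    """Pack one code point into 1 to 4 octets, UTF-8 style."""
    if code_point < 0x80:        # up to 7 bits: passes through as is
        return bytes([code_point])
    if code_point < 0x800:       # up to 11 bits: two octets
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # up to 16 bits: three octets
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # up to 21 bits: four octets
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


for ch in ["A", "Ω", "\u2019"]:
    octets = utf8_octets(ord(ch))
    # Cross-check the sketch against Python's own UTF-8 encoder.
    assert octets == ch.encode("utf-8")
    print(ord(ch), " ".join(f"{b:02X}" for b in octets))
```

The first octet of a sequence announces how many octets follow, and every following octet starts with the bits 10, which is how a reader can tell where each character begins.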

Copy and Paste

The idea behind Copy/Paste is that you select and copy something over here and paste it over there. One important aspect is that often what you copy isn't copied so much as remembered until you actually paste it. (Imagine copying a large file, but never doing the paste. Don't do the work until you have to!) The system simply remembers the specifications of what you want to copy.

When you do the paste, the paste target checks those specifications for a format it can handle. Typically, given a choice, the target takes the most feature-rich choice. For example, copying a web page usually provides a formatted choice and a plain-text choice. Most targets, by default, pick the formatted choice.

A Simple Experiment

If you use Windows™, try this simpl’ exper’ment.
Open Windows™ WRITE.[4]
 ¶  “Copy all the text of this Experiment section.”
 ¶  “Paste the text into WRITE using Edit→Paste.”
 ¶  “Paste the text into WRITE again using Edit→Paste Special.”
(In the Paste Special window, select Unformatted Text.)
The copied text appears twice in WRITE, but note the difference in the formatting.
This is because the source, this web page, offers multiple formats.

Nicely Formatted Text

Some word processors like to use the "sexier" versions of single- and double-quotes. These curvier versions have obvious left- and right-orientation. Table 1 shows the plain ASCII quotes as well as the sexy left/right versions.

The problem is that these curvier characters have codes greater than 255, so they can't be transmitted as is on the Internet. They can be sent as UTF-8 sequences, if the system supports UTF-8. They can also be sent using HTML character encoding. The simplest form of that is &#code-point;. For example, use &#8217; to generate the curvy apostrophe. Some codes are so common they have named versions, too. The apostrophe (right single quote) can also be generated using &rsquo;.
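Python's standard html module understands both forms, so it makes a handy way to check a code. A minimal sketch (the variable names are mine):

```python
import html
from html.entities import name2codepoint

curvy = "\u2019"  # the curvy apostrophe (right single quote)

# Build the numeric form, &#code-point;, from the character itself.
numeric_ref = f"&#{ord(curvy)};"
print(numeric_ref)

# The named version maps to the same code point.
print(name2codepoint["rsquo"])

# Both forms decode back to the same character.
assert html.unescape(numeric_ref) == curvy
assert html.unescape("&rsquo;") == curvy
```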

That's a lot of explanation for an apostrophe, no matter how sexy, but the subject of computer character sets is actually hugely complicated and involved. At least now we can address the original question.

So why the corrupted characters?

When you attempt to copy and paste from Word to a Newsvine Edit box, a number of things can happen. What does happen depends a great deal on the Edit box, and this is determined by your browser and possibly your operating system. If the edit box can accept Unicode characters and it sends them as UTF-8 or HTML encoding then all is well. If the edit box mishandles the paste or the sending, there can be problems.

What seems to have happened is that the system converted the original characters to their UTF-8 sequences, but at some point forgot they were UTF-8 sequences. Somewhere along the line they should have been converted back into a sexy apostrophe, but instead are seen as three weird characters. The sexy apostrophe converts to three UTF-8 octets (hex E2, 80 and 99). Read as a character by itself, the first of these is â. The other two are perfectly legal as part of a UTF-8 sequence, but read by themselves they are not printable characters in the Western European "ISO" set, so many systems represent them with skinny boxes or question marks. Systems that do show them (those using the Windows-1252 set, for example) may show the entire sequence as â€™.
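You can reproduce the corruption on purpose. This sketch encodes the curvy apostrophe as UTF-8 and then deliberately reads the octets back with the wrong character set. Using "cp1252" and "latin-1" as the wrong sets is my assumption about what a Western European system would pick.

```python
curvy = "\u2019"                   # the curvy apostrophe, code point 8217
octets = curvy.encode("utf-8")     # the three octets E2 80 99

# Wrong guess 1: Windows-1252 has printable characters at 80 and 99.
as_cp1252 = octets.decode("cp1252")
print(as_cp1252)

# Wrong guess 2: ISO-8859-1 (Latin-1) only has invisible control codes
# there, which is where the skinny boxes come from.
as_latin1 = octets.decode("latin-1")
print([hex(ord(c)) for c in as_latin1])
```

Either way, one character has become three, and the first of the three is â.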

Bottom line, this isn't a bug, per se, but a conflict between two incompatible systems. It might work to ensure that your browser is using UTF-8 rather than Western European "ISO" sets. Newsvine does set a META tag specifying UTF-8, but it's possible for the local system to override that.

Footnotes

[1] "ASCII" is pronounced "as-key" (you can add another "s").
 
[2] "Char" is pronounced "care" (it's short for character). A character is any single letter, digit, punctuation or symbol.
 
[3]  Bit Heads will recognize that this all is based on "the eighth bit." Any octet without bit 8 set is an ordinary character (US-ASCII). Any octet with bit 8 set is special and part of a multi-byte sequence. (Because bit 8 is special, UTF-8 does not allow extended ASCII characters.)
 
[4]  Click the [Start] button, select Run..., type write, click [OK].
 

© 2008 Chris from MN


Published to:

  • Chris from MN's Column
  • Groups: MetaVine, Newsvine Help, Newsvine Mentors
  • Regions: none
  • Public Discussion (9)
Chris from MN

Not too long ago, Scott (Scoop) Butki asked a question.

This is the overly-detailed, highly-geekified complete (give or take) explanation....

  • 2 votes
#1 - Mon Sep 1, 2008 11:49 PM EDT
Scott (Scoop) Butki

oh, sure, blame me.

  • 3 votes
#1.1 - Tue Sep 2, 2008 12:10 AM EDT
Division by Zero

Great post, Chris! I like to use the Xinha plugin for Firefox for my posts and it generally provides a nice wysiwyg interface. I cringe at the thought of copying and pasting from Word. I've seen the messy html that Word generates and wouldn't want to go near it! If I'm somewhere where I can't get online but I want to work on an article, I'll open Notepad and hand-key everything. 

  • 3 votes
#2 - Tue Sep 2, 2008 12:47 AM EDT
Chris from MN

What would your reaction be if I said I'd been hand-coding most of my HTML using a Windows version of vi, called gvim, since webdawn?

Stoneage?   :-)

:wq

  • 4 votes
#2.1 - Tue Sep 2, 2008 1:53 AM EDT
spiffie

I once sent a bug report about this behavior, several months ago (not anywhere near this detailed, though). My suggestion would be that you do the same and include a link to this article so they know exactly what you're referring to.

  • 3 votes
#3 - Tue Sep 2, 2008 12:59 AM EDT
Chris from MN

I will let them know. The more I think about this, the more I think it's possible it is a bug. It depends on what's in the POST data. It may still be related to how different browsers operate.

Say you paste in the curvy apostrophe (code 8217); I can see three possibilities. The browser sends the apostrophe as:

  1. ...the UTF-8 three-octet sequence  xE2  x80  x99.
  2. ...the HTML text string for the code point   "&#8217;"
  3. ...the HTML text string for the UTF-8 sequence   "&#226;&#128;&#153;"

It's case #1 or case #3 that are causing the problem. Both express the three-octet UTF-8 sequence. If case one goes through as just UTF-8, it should work fine.

Case #3, the HTML text strings, if left alone, I think would not be recognized by the browser. I think it would see them as the three codes, and that's exactly what we're seeing.

The question is should the POST data contain case #3? If that's a legal mode, then it's more likely it's happening on the Newsvine side. It may be a choice, rather than a bug, if handling the HTML sequences causes more problems than it solves.
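To make the three cases concrete, here is a small Python sketch that builds each form from the apostrophe itself. It only illustrates the three representations; I have not captured real POST data, and the names case1, case2 and case3 are just labels for this example.

```python
curvy = "\u2019"   # the curvy apostrophe, code point 8217

# Case 1: the raw UTF-8 three-octet sequence.
case1 = curvy.encode("utf-8")

# Case 2: HTML text string for the code point itself.
case2 = f"&#{ord(curvy)};"

# Case 3: HTML text strings for each octet of the UTF-8 sequence.
case3 = "".join(f"&#{octet};" for octet in case1)

print(" ".join(f"x{b:02X}" for b in case1))
print(case2)
print(case3)
```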

  • 3 votes
#3.1 - Tue Sep 2, 2008 2:26 AM EDT
MinnieApolis

Excellent explanations, and the illustrations add so much both for pure interest and for clarity's sake. I still would not pass a quiz on the material here, but at least now maybe I have a couple new tricks.

  • 1 vote
#4 - Fri Sep 5, 2008 5:24 PM EDT
Chris from MN

Thank you very much; I'm glad you found it useful!

  • 1 vote
#4.1 - Sun Sep 7, 2008 3:34 PM EDT
Chris from MN

An update after a couple emails with a Newsvine admin.

HTML character encoding is apparently not processed on the inbound side, but it is processed on the outbound side. That processing converts it to UTF-8, although I don't have the specifics.

However, the more I think about it, the more I think this could be the cause of the problem. Whether it constitutes a "bug" or a "necessary choice" still, I think, depends on what the browser sends as POST data. Of the choices listed in comment #3.1, if both #2 and #3 are legal, it's possible Newsvine is not correctly handling one of them.

Even so, it may be a "lesser of two evils" choice, since correctly handling HTML character sequences isn't simple.

For what it's worth, here's my armchair analyst guess at what's happening. Below are two sequences of notation. Below them the notation is defined and the consequences explained.

HTML(8217) → Unicode(2019) → UTF8(E2,80,99)

HTML(226,128,153) → Unicode(E2;80;99) → UTF8(C3,A2;C2,80;C2,99)

In all cases, the number(s) inside the parentheses are character codes. Codes for single characters are separated by semicolons. Multiple codes that together represent a single character are separated by commas.

HTML(#) is an HTML-encoded Unicode character; Unicode(#) is internally-stored Unicode; UTF8(#) is transmittable UTF-8 characters.

Both notated sequences consider what happens to the sexy apostrophe (code 8217) in two cases of browser POST data → NV outbound processing → result page sent to viewer. The first sequence considers case #2 from the comment above; the second sequence considers case #3.

If the HTML sequences are considered individually and converted to Unicode and then to UTF-8, there's a problem with the second sequence (case #3). Instead of seeing the three strings of HTML encoding as representing one Unicode character, the processing sees three individual Unicode characters. If these are expressed as their corresponding three UTF-8 sequences, the result page is corrupted exactly as described in the article.
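The two notated sequences can be simulated in Python. This is strictly a sketch of my guess, not Newsvine's actual code: naive_decode is a made-up helper that converts each &#number; string to a character on its own, with no attempt to notice that several of them might belong together.

```python
import re


def naive_decode(text: str) -> str:
    """Turn each &#number; into one character, considered individually."""
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), text)


for label, post_data in [("case #2", "&#8217;"),
                         ("case #3", "&#226;&#128;&#153;")]:
    internal = naive_decode(post_data)   # internally-stored Unicode
    outbound = internal.encode("utf-8")  # what goes out on the result page
    print(label,
          [f"{ord(c):X}" for c in internal],
          " ".join(f"{b:02X}" for b in outbound))
```

Case #2 comes out as the expected three octets; case #3 comes out as six octets, which is three separate characters and the corruption described in the article.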

Processing this correctly is tricky, as it requires detecting the embedded UTF-8 sequence of case #3. So a great deal depends on whether case #3 is a legal case for the data. It may also depend on how the apostrophe was pasted into the edit box by the source system. If the UTF-8 is in the edit box, then the POST data may represent (corrupted) data accurately.
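As a rough illustration of why detection is tricky, here is one possible heuristic in Python. The function name repair_embedded_utf8 is hypothetical, and this is an assumption about how one might attempt it, not a description of what any real system does: if every character fits in one octet and those octets happen to form valid UTF-8, treat them as an embedded sequence.

```python
def repair_embedded_utf8(text: str) -> str:
    """Guess whether text is really a UTF-8 sequence stored one octet per character."""
    try:
        # Fails if any character needs more than one octet (code above 255),
        # or if the octets are not a valid UTF-8 sequence.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # leave it alone; it was probably fine already


print(repr(repair_embedded_utf8("\u00e2\u0080\u0099")))  # the three stray characters
print(repr(repair_embedded_utf8("caf\u00e9")))           # ordinary accented text
print(repr(repair_embedded_utf8("let\u2019s")))          # already-correct apostrophe
```

The catch is that it is a guess. Text that legitimately contains a pair like Ã followed by © would be "repaired" into é, so a heuristic like this can cause new problems while solving old ones.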

Bottom line, the jury is still out (in my mind) on whose bug this is, but I can now see how it could be on either end.

  • 2 votes
#5 - Sun Sep 7, 2008 4:12 PM EDT