It's tempting to use richly-formatted text as a copy-and-paste source, but it doesn't always work as expected. Sometimes you get strange characters or little boxes. Here's the deal.
The original text was, “let’s be friends. Can’t we all get along?” However, both apostrophes were replaced with junk character sequences. If you look closely, you see three characters: an â followed by two skinny boxes. (At least, that's how it displayed on my system.)
ASCII[1] and Code Points and Chars[2]
To make complete sense of all this, there are some things you need to know. You need to know a little about Unicode, a little about ASCII, a little about The Web and a little about copy/paste. I'll keep it as basic as possible.
Unicode
The Unicode part is fairly simple. Unicode is basically an international standard that assigns a unique number to each unique character. The number, called a code point, identifies a particular letter or symbol. So, for example, "A" is code point 65 and "Ω" is code point 937. The goal of Unicode is to provide a unique number for every character of every active major language worldwide.
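To see code points for yourself, here is a small Python sketch (Python is my choice for illustration, not something this article depends on). The built-in `ord()` gives the code point of a character and `chr()` goes the other way.

```python
# ord() returns a character's code point; chr() turns a code point back
# into a character.
code_points = {ch: ord(ch) for ch in "AΩ"}

print(code_points)           # {'A': 65, 'Ω': 937}
print(chr(937))              # Ω
print(f"U+{ord('Ω'):04X}")   # U+03A9 (the usual hexadecimal notation)
```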
Numbering all the world's characters requires a lot of code points; the current version of Unicode allows a little over one million (1,114,112 of them, numbered 0 through 1,114,111). What's important is that any given Unicode character could be a number as big as 1,114,111. Therefore, any software that works with Unicode must deal with numbers that large.
People (usually) count in decimal; computers (usually) count in binary. The noticeable difference is this: a decimal digit is one of ten symbols ("0"-"9"); a binary digit ("bit") is one of two ("0" & "1"). In both cases, the more digits (or bits) you use, the higher you can count.
The bottom line is this: computers need at least 21 bits to hold a Unicode character.
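The 21-bit figure can be checked with a couple of lines of Python. This is only a sketch: `sys.maxunicode` is Python's name for the highest code point it supports (U+10FFFF, the top of the Unicode range), and `int.bit_length()` counts the bits a number needs.

```python
import sys

# The highest code point is U+10FFFF, which is 1,114,111 in decimal.
highest_code_point = sys.maxunicode

# bit_length() reports how many binary digits are needed to write the number.
bits_needed = highest_code_point.bit_length()

print(highest_code_point)        # 1114111
print(bits_needed)               # 21
print((127).bit_length())        # 7  (enough for US-ASCII)
print((255).bit_length())        # 8  (one full byte)
```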
ASCII (a little history)
Long ago, when the USA was the center of the Universe, computers only "spoke" English. English has 26 letters, 10 counting digits and a bunch of punctuation. English has upper- and lower-case letters, so there are really 52 letters. Other codes are necessary for the space, the tab and other special characters. But, even if we are generous with punctuation and other codes, it's easy to have only 120 or so.
An early standard, called US-ASCII, assigns numbers to all the characters described above. This early standard has only 128 codes (numbered 0 through 127).
In binary, it takes just seven bits to count from 0 to 127. Computers of the day used eight-bit packages ("bytes"), which allowed 256 codes (numbered from 0 through 255). So it was natural that variations sprang up and expanded on US-ASCII by using the eighth bit. (That extra bit doubled the possible code points by adding 128 through 255.)
One immediate problem was that the extensions were not compatible. Different organizations used the new code points in different ways. An IBM PC version added box drawing characters (e.g. ■ ╠ ╤ ╛); a Latin version added characters from European languages (e.g. Ñ õ æ ç þ). Character sets became a very complicated proposition in computing.
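The incompatibility is easy to demonstrate. In this Python sketch I assume the "IBM PC version" is the character set Python calls `cp437` and the "Latin version" is `latin-1` (ISO 8859-1); those codec names are my mapping, not something stated above. The same byte value, 209, comes out as a different character in each.

```python
# One byte with the eighth bit set: 0xD1 (decimal 209).
raw = bytes([0xD1])

# The same byte means different things in different extended-ASCII sets.
as_ibm_pc = raw.decode("cp437")     # IBM PC box-drawing character
as_latin1 = raw.decode("latin-1")   # Western European letter

print(as_ibm_pc)   # ╤
print(as_latin1)   # Ñ

# Plain US-ASCII (0 through 127) is the same in both sets.
print(b"A".decode("cp437") == b"A".decode("latin-1"))   # True
```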
When computing became a global prospect, even the extended versions of ASCII weren't enough. There are thousands of Chinese characters and dozens of other active languages with their own characters. Early attempts tried to use the 8-bit packaging, but this complicated things and created even more standards to manage.
Unicode approached this differently by defining an abstract standard that just assigns numbers to characters. It also defines encoding schemes that describe how to package the numbers in different bit sizes.
The Web
The Internet uses 8-bit packages (formally called "octets") to move data. This means Unicode, as is, cannot be sent across The Web. A Unicode encoding scheme, called UTF-8, provides a way to transport Unicode on the Internet.
Basically, UTF-8 considers the number of bits a code point actually needs. When necessary, it breaks large codes into a sequence of small codes that fit in octets. Code points using zero to seven bits (0-127) pass through as is. Code points using eight or more bits (128+) are converted to a sequence of two to four octets.[3]
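Here is a short Python sketch of UTF-8 at work. The helper name `utf8_octets` is made up for this illustration; it just wraps Python's built-in `str.encode("utf-8")`. It shows a one-octet, a two-octet and a three-octet character.

```python
def utf8_octets(ch: str) -> list[int]:
    """Return the UTF-8 octets for one character as a list of integers."""
    return list(ch.encode("utf-8"))

# "A" (65), "Ω" (937) and the curvy apostrophe (8217, written as \u2019).
for ch in ["A", "Ω", "\u2019"]:
    octets = utf8_octets(ch)
    print(f"U+{ord(ch):04X} -> {[hex(b) for b in octets]}")

# Output:
# U+0041 -> ['0x41']
# U+03A9 -> ['0xce', '0xa9']
# U+2019 -> ['0xe2', '0x80', '0x99']
```

Notice that every octet in the multi-octet sequences is 128 or higher, which is how a receiver tells them apart from plain US-ASCII.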
Copy and Paste
The idea behind Copy/Paste is that you select and copy something over here and paste it over there. One important aspect is that often what you copy isn't copied so much as remembered until you actually paste it. (Imagine copying a large file, but never doing the paste. Don't do the work until you have to!) The system simply remembers the specifications of what you want to copy.
When you do the paste, the paste target checks those specifications for a format it can handle. Typically, given a choice, the target takes the most feature-rich one. For example, copying a web page usually provides a formatted choice and a plain-text choice. Most targets, by default, pick the formatted choice.
A Simple Experiment
If you use Windows™, try this simpl’ exper’ment.
Open Windows™ WRITE.[4]
¶ “Copy all the text of this Experiment section.”
¶ “Paste the text into WRITE using Edit→Paste.”
¶ “Paste the text into WRITE again using Edit→Paste Special.”
(In the Paste Special window, select Unformatted Text.)
The copied text appears twice in WRITE, but note the difference in the formatting.
This is because the source, this web page, offers multiple formats.
Nicely Formatted Text
Some word processors like to use the "sexier" versions of single- and double-quotes. These curvier versions have obvious left- and right-orientation. Table 1 shows the plain ASCII quotes as well as the sexy left/right versions.
The problem is that these curvier characters have codes greater than 255, so they can't be transmitted as is on the Internet. They can be sent as UTF-8 sequences, if the system supports UTF-8. They can also be sent using HTML character encoding. The simplest form of that is &#code-point;. For example, use &#8217; to generate the curvy apostrophe. Some codes are so common they have named versions, too. The apostrophe (right single quote) can also be generated using &rsquo;.
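If you want to check the two HTML forms of the curvy apostrophe, Python's standard `html` module can decode them. This is just a sketch; the variable names are mine.

```python
import html

# The numeric form uses the decimal code point; the named form uses "rsquo"
# (right single quote). Both decode to the same character, U+2019.
numeric = html.unescape("&#8217;")
named = html.unescape("&rsquo;")

print(numeric == named)     # True
print(ord(numeric))         # 8217

# Going the other way: build the numeric form from any character.
encoded = f"&#{ord(numeric)};"
print(encoded)              # &#8217;
```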
That's a lot of explanation for an apostrophe, no matter how sexy, but the subject of computer character sets is actually hugely complicated and involved. At least now we can address the original question.
So why the corrupted characters?
When you attempt to copy and paste from Word to a Newsvine Edit box, a number of things can happen. What does happen depends a great deal on the Edit box, and this is determined by your browser and possibly your operating system. If the edit box can accept Unicode characters and it sends them as UTF-8 or HTML encoding then all is well. If the edit box mishandles the paste or the sending, there can be problems.
What seems to have happened is that the system converted the original characters to their UTF-8 sequences, but at some point forgot they were UTF-8 sequences. Somewhere along the line they should have been converted back into a sexy apostrophe, but instead each octet is treated as a character of its own, so you see three weird characters. The sexy apostrophe converts to three UTF-8 octets (226, 128 and 153), the first of which displays as â. The other two have no printable character in the Western European "ISO" set, so many systems represent them with skinny boxes or question marks. Systems that do show them (using the Windows Western European set, for example) may show the entire sequence as â€™.
Bottom line, this isn't a bug, per se, but a conflict between using two incompatible systems. It might work to ensure that your browser is using UTF-8 rather than Western European "ISO" sets. Newsvine does set a META tag specifying UTF-8, but it's possible for the local system to override that.
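You can reproduce the corruption in a few lines of Python. This is a sketch of my best guess at the mechanism, not a trace of what Newsvine or any browser actually does; I am assuming the misreading uses either Python's `latin-1` codec (the Western European "ISO" set) or `cp1252` (the Windows variant).

```python
apostrophe = "\u2019"                  # the curvy apostrophe

# Step 1: the character is correctly converted to UTF-8 octets.
octets = apostrophe.encode("utf-8")
print(list(octets))                    # [226, 128, 153]

# Step 2: something forgets these are UTF-8 and reads one character per octet.
as_windows = octets.decode("cp1252")   # three printable characters
as_iso = octets.decode("latin-1")      # â plus two invisible control codes
print(as_windows)                      # â€™
print(len(as_iso), as_iso[0])          # 3 â

# Reversing the mistake recovers the original character.
repaired = as_windows.encode("cp1252").decode("utf-8")
print(repaired == apostrophe)          # True
```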
Footnotes
[1] "ASCII" is pronounced "as-key" (you can add another "s").
[2] "Char" is pronounced "care" (it's short for character). A character is any single letter, digit, punctuation or symbol.
[3] Bit Heads will recognize that this all is based on "the eighth bit." Any octet without bit 8 set is an ordinary character (US-ASCII). Any octet with bit 8 set is special and part of a multi-byte sequence. (Because bit 8 is special, UTF-8 does not allow extended ASCII characters.)
[4] Click the [Start] button, select Run..., type write, click [OK].
© 2008 Chris from MN













