Tuesday, April 30, 2013

On bitcoin data spam, and evil data

What happens if somebody puts evil data in the blockchain?  What responses are available?

It is a truly awful situation, and difficult to address.

What happened?

The easiest way to explain what happened here is through analogy. Imagine if someone picked a penny stock on the NYSE and made a sequence of apparently pointless trades. Then they announced that the prices of their stock trades actually encoded links to some "evil" websites. You know, maybe $0.01 means "a" and $0.02 means "b", etc. Stock market tickers are public, lots of places archive that data, so now lots of people have "links to evil data". Except really they don't. What they have is a list of stock trades. You'd need special software to turn that into some other kind of data.

This is what someone has done with Bitcoin. They sent a series of monetary transactions that did not actually represent real trades, and then announced that with a special program you could turn them back into some text. That text then contains links to, well, I don't actually know what because I haven't looked. But let's assume it's bad stuff.

What solutions are available?  Software update?

The answer is very complex, with implications that travel to the heart of bitcoin's value.

Sending bitcoins requires two pieces of data: a bitcoin address, and an amount (number of bitcoins).  There is no "comments field" or anything of that nature.  A bitcoin address is, underneath its text encoding, just a 20-byte piece of data.  Normally those 20 bytes are derived by hashing a public key with the SHA256 and RIPEMD160 algorithms, but 20 bytes produced that way cannot be distinguished from 20 bytes chosen arbitrarily.  Therefore, if you are willing to waste money -- albeit very small fractions like 0.00000001 bitcoins -- by sending that money to addresses nobody holds a key for, you have essentially created a channel for arbitrary data transmission.
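To make the mechanism concrete, here is a minimal sketch of how arbitrary bytes could be packed into strings that look like ordinary addresses. It assumes the standard Base58Check address format with version byte 0x00; the function names, the zero-byte padding scheme, and the sample message are illustrative assumptions, not a description of what the spammer's tool actually did.

```python
import hashlib

B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"
CHUNK_SIZE = 20  # an address carries 20 bytes of payload


def _checksum(data: bytes) -> bytes:
    """First 4 bytes of double SHA-256, as used by Base58Check."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()[:4]


def payload_to_address(payload: bytes) -> str:
    """Wrap 20 arbitrary bytes as a Base58Check string (version byte 0x00)."""
    if len(payload) != CHUNK_SIZE:
        raise ValueError("payload must be exactly 20 bytes")
    raw = b"\x00" + payload
    raw += _checksum(raw)
    n = int.from_bytes(raw, "big")
    digits = ""
    while n > 0:
        n, rem = divmod(n, 58)
        digits = B58_ALPHABET[rem] + digits
    # Each leading zero byte is written as a leading '1'.
    zeros = len(raw) - len(raw.lstrip(b"\x00"))
    return "1" * zeros + digits


def address_to_payload(address: str) -> bytes:
    """Recover the 20 payload bytes, verifying the checksum."""
    n = 0
    for ch in address:
        n = n * 58 + B58_ALPHABET.index(ch)
    zeros = len(address) - len(address.lstrip("1"))
    raw = b"\x00" * zeros + n.to_bytes((n.bit_length() + 7) // 8, "big")
    body, checksum = raw[:-4], raw[-4:]
    if _checksum(body) != checksum:
        raise ValueError("bad checksum")
    return body[1:]  # drop the version byte


def encode_message(message: bytes) -> list:
    """Split a message into 20-byte chunks, one fake address per chunk."""
    padded = message + b"\x00" * (-len(message) % CHUNK_SIZE)
    return [payload_to_address(padded[i:i + CHUNK_SIZE])
            for i in range(0, len(padded), CHUNK_SIZE)]


def decode_message(addresses: list) -> bytes:
    """Reverse encode_message (trailing zero bytes are treated as padding)."""
    return b"".join(address_to_payload(a) for a in addresses).rstrip(b"\x00")


if __name__ == "__main__":
    for addr in encode_message(b"Hello, blockchain! This text is hidden in addresses."):
        print(addr)
```

Each printed string passes the same checksum test as a real address, which is why the network has no way to tell them apart from addresses that someone can actually spend from.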

The bitcoin blockchain is in one sense a massively replicated ~7GB database that stores data for all eternity.  There remains the open question of what happens if somebody dumps data into the blockchain, unrelated to currency.  Maybe a government finds that data illegal.  Smart people argue that the legal doctrine of mens rea (criminal intent) and similar mitigating factors apply.  But it remains an unknown.  The vast majority of people are burdened with this awful data they don't care about, simply to use the bitcoin payment system they do care about.

There are many conflicting motives and incentives (very Brave New War-ish):

  • Anarchist activists want to publish this information, to force authorities to act (or not) when this illegal data is published.
  • Bitcoin activists want to publish this information, to force developers (us) to address The Filter Issue (see below).
  • Some people see more value in bitcoin as "eternity data storage", if expensive and inefficient, than bitcoin as a currency.
  • It is, quite literally, impossible to prevent use of bitcoin for data transmission.  It is a purely digital currency.  Who can say which digits are "evil" or "good", allowed or disallowed?  You can detect certain patterns, and possibly filter those.
  • Many bitcoin users are using bitcoin for its intended purpose, as currency transfer, and dislike carrying the costs for these data transmission uses.
  • As this carrying-data issue rears its head, it increases the costs for anyone running a P2P node on the all-volunteer bitcoin P2P network.  This shrinks the total number of bitcoin P2P nodes.
  • As such, due to both legal and resource-usage issues, "data spam" has long been theorized as an attack vector.


The "Filter Issue"

There are very large ramifications to filtering out transactions, even ones that are obviously data spam.

Fungibility: currently, all bitcoins have the same value.  My 1.0 BTC and your 1.0 BTC are equivalent in value.  Once you start filtering transactions, you are injecting policy-based censorship into the mix. Some bitcoins are accepted by all, some bitcoins are only accepted by a few.  The value of a bitcoin itself becomes a product of its ancestry.  If this policy is implemented, perhaps by court order to a bitcoin mining pool, it could lead to chain forks, where, for example, bitcoin users in the United States see a different set of spendable bitcoins than users outside the US.  That would be a disaster for bitcoin.

It is widely speculated, based on common forum comments in the crypto-anarchist community, that this current round of data spam is intended to force bitcoin users, developers and governments of the world to take action to censor -- or not -- certain bitcoin transactions.  Trying to force the issue, to establish a precedent one way or the other.  Or, more pessimistically, a party could be simply trying to shut down bitcoin.

The bitcoin community is very staunchly anti-censorship, but if data spam were to threaten the life of bitcoin, I imagine ideology-neutral "it looks like data, not currency" filtering might appear.  Bitcoin is ultimately a product of voting -- you vote by choosing which software version and software ruleset to download.

The users can always vote data spam off the island...  but will they? Is data transmission a valid use of bitcoin?  The users themselves choose the definition of "valid."

What solutions could be deployed right now?

Currently being discussed is avoiding the relay of economically worthless (under $0.0001, say) bitcoin transactions.  Senders would then have to attach more value or higher transaction fees to each transaction used to push out data, directly raising the cost of sending lots of it.
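As a rough sketch of what such a relay rule might look like, the check below refuses to relay a transaction if any of its outputs is worth less than a threshold. The function names, the fixed exchange rate, and the idea of pricing the threshold in dollars are assumptions made for illustration; a real node would more likely express the limit directly in bitcoin units.

```python
from decimal import Decimal

SATOSHIS_PER_BTC = 10**8  # 1 BTC = 100,000,000 satoshis


def output_value_usd(satoshis: int, usd_per_btc: Decimal) -> Decimal:
    """Dollar value of one transaction output at an assumed exchange rate."""
    return Decimal(satoshis) / SATOSHIS_PER_BTC * usd_per_btc


def should_relay(output_values, usd_per_btc: Decimal,
                 min_usd: Decimal = Decimal("0.0001")) -> bool:
    """Relay only if every output is worth at least min_usd (hypothetical policy)."""
    return all(output_value_usd(v, usd_per_btc) >= min_usd
               for v in output_values)


if __name__ == "__main__":
    rate = Decimal("100")  # assumed: 1 BTC = $100
    print(should_relay([1], rate))             # a single 1-satoshi output
    print(should_relay([5_000_000, 100], rate))
```

Note that this is relay policy, not a validity rule: a miner could still include such a transaction in a block, so the filter raises the cost of data spam rather than forbidding it.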


See Gregory Maxwell's post, "to prevent arbitrary data storage in txouts — The Ultimate Solution" for a proposed solution.


  1. The blockchain IS a place for information storage. It is just a matter of interpretation of the data. You do not like it - you do not interpret it. And put your mind at ease knowing that the senders paid hefty transaction fees for those megabytes.

  2. Why not just implement address checking in the mining nodes,
    so you can't send to invalid addresses?


    1. Quoting from your link:

      "the next twenty bytes are a RIPEMD-160 digest, but you don't have to know that for this task: you can consider them a pure arbitrary data"

      It is impossible to validate this data.

  3. >"Sending bitcoins requires two pieces of data: a bitcoin address, and an amount (number of bitcoins). There is no "comments field" or anything of that nature."

    That is not quite true. A bitcoin transaction sends coins to a script that the next user has to satisfy in order to send them on. Someone who knows what they are doing could insert plaintext information into a script and still send the bitcoins to a valid address.

    1. Yes, that was intentionally simplified for the audience.

      Nonetheless, it is correct. In the current data spam, the data is stored as bitcoin addresses.

      As you point out, it is not _required_ that data be transmitted this way, even though in this instance it is.

    2. What is gained by, "...intentionally [simplifying] for the audience.."? What is lost by providing a concise, accurate report that may not be understood by 100% of the audience, but is 100% accurate? Do you see the difference? Do you see the elitism? Do you see the lack of trust?

  4. There is actually exactly that, basically a comments section. A public note can be embedded in the blockchain for all to see.

    1. You really think you know more than a guy listed on the official developers page at http://bitcoin.org/en/development ? There is nothing like a comments section, only a few non-standard ways to encode data. By your logic, there is also an audio attachment and a virus field available for every bitcoin transaction.

      If you're talking about the public note feature offered by blockchain.info, I've seen no evidence that this is actually stored in the blockchain, it is only on their servers. Feel free to provide good evidence.

      Please either show you know what you're talking about, or stop spreading unsupported mistruths.

    2. From the wiki: "Generations have a single input, and this input has a "coinbase" parameter instead of a scriptSig. The data in "coinbase" can be anything; it isn't used."

      Basically, miners put their own data in the generation transaction when creating a block. The same field is used for "extra nonce," since it changes the hash. So you can surely finalize the block without running out of nonces -- you just change the generation transaction.

      This data can be plain text visible without specialized software.

  5. Any large enough number may be interpreted as *anything you want*.

    The blockchain is a huge number. It may be interpreted as pretty much anything one may imagine. Any CP picture existent may be "extracted" from it, as well as any "secret documents from the CIA" and so on.
    Obviously, to be able to make such extraction, you need the necessary software. Bitcoin clients will never be written to make such "nasty" extractions, so we are fine.

    Prosecutors can't just get your computer, pass it through their nasty software, and then claim: see! CP got out!
    That could be done out of any computer in the world, with or without a blockchain.

    So, please, don't overreact and attempt to kill "arbitrary data in the blockchain". There might be interesting and useful use cases for it that just haven't come up yet. As long as spamming is discouraged (and miners have a strong interest in discouraging it), we're fine.

    1. There's a pretty clear information-theoretical difference between "I can extract any arbitrary sequence of bytes from this large number, if I can use an equally large program to do it" and "I can extract a particular sequence of meaningful bytes from this large number using a relatively small program".

      That is, sure, if you prepare the right humongous XOR string then you can apply that to the blockchain and have a Disney movie come out. But you have to use a large amount of program/input data to get there, and the movie is technically encoded in the input data, not in the blockchain. On the other hand, if I encode a Disney movie into the blockchain in an easily extractable way—such that you only need a small program of 100k bytes or so to get the movie out of the blockchain—then the movie IS encoded in the blockchain and would probably be subject to copyright law.

  6. This comment has been removed by the author.

  7. The ability to encode data into the block chain could be a useful anonymous comms application for many people (because the block chain is received by everyone, it is impossible to tell who the intended recipient of encoded communications is). Good and bad applications are possible, e.g. an oppressed person in North Korea sending out news to the CIA vs an Osama bin Laden sending out instructions to attack. Many, though, may just like the idea of creating indelible graffiti on the Internet. Block chain message encoding can't easily be prevented, but if levels increase, it makes sense to me to introduce a minimum transaction size, so that not everyone with an application can do it and degrade the network for free.

  8. 1* - I can't imagine any bad data, can you give me an example?
    2* - If the government wants to block this data from being read, they have to block the method that explains how the data has to be read, not the whole blockchain, this is obvious!!

    Can you explain that to me better? I'm not an expert in this field....

  9. What if I am hired by Goldman Sachs, "here, take this 2M USD go destroy BTC"
    Let's say that 1 BTC==$100
    I go and get 10K BTC for $1M , I keep the other Mil for myself ;).
    Let's say I send 0.0001 BTC to fake addresses or whatever you are talking about ( arbitrary data in block chain ).
    From 10K BTC I could make 100 Mil fake transactions , and if I send 0.00001 BTC it would be 1 Bil non valid transactions.
    So my question is: would this kind of action hurt the BTC network/miners/btc coin value in any way?

  10. I have another question (excuse me if I said silly things before): http://cnnmoneytech.tumblr.com/post/49468888972/how-to-turn-bitcoin-code-into-a-ben-bernanke-portrait Is it still possible to embed comments directly in the transaction hash, and if yes, how?