Building a blockchain in R


A blockchain is a distributed system using cryptography to secure an evolving consensus about an economically valuable token. Dhruv Bansal

Blocks

A blockchain is a chain of blocks. Hence, we first describe a block. A block is a container for data. In its simplest form it contains:

  • an identification number
  • a timestamp of block creation
  • a bunch of data
  • a reference to the previous block (parent block) in the chain
block <- list(number = 3,
             timestamp = "2018-10-01 17:24:18 CEST",
             data = "London",
             parent = 2)

Chain

Blocks are concatenated into a chain. The first block of the chain is called the genesis block and has no parent block. The last block of a chain is the current block and is not parent of any other block. Here is the genesis block of the Ethereum blockchain. Let’s chain some blocks.

# some blocks
block1 <- list(number = 1,
             timestamp = "2018-10-01 17:24:00 CEST",
             data = "London",
             parent = NA)

block2 <- list(number = 2,
             timestamp = "2018-10-01 17:24:15 CEST",
             data = "Paris",
             parent = 1)

block3 <- list(number = 3,
             timestamp = "2018-10-01 17:24:30 CEST",
             data = "Rome",
             parent = 2)

# the blockchain
blockchain = list(block1, block2, block3)

# get the 2nd block
blockchain[[2]]
## $number
## [1] 2
## 
## $timestamp
## [1] "2018-10-01 17:24:15 CEST"
## 
## $data
## [1] "Paris"
## 
## $parent
## [1] 1

Let’s also write a validation function for the blockchain. For now, we only check that the parent field of a non-genesis block references to the previous block in the chain.

validate = function(blockchain) {
  if (length(blockchain) >= 2) {
    for (i in 2:length(blockchain)) {
      if (blockchain[[i]]$parent != blockchain[[i-1]]$number) {
        return(FALSE)
      }
    }
  }
  return(TRUE)
}

validate(blockchain)
## [1] TRUE

Hash

The linking structure of the chain is more complex than one described above: each block in a blockchain is identified by a hash value of its content and references to the hash of the parent block, or 0 if the block is the genesis one.

Hashes of blocks are created using cryptographic hash functions, that are mathematical algorithms that maps data of arbitrary size to a bit string of a fixed size (called hash or digest). The hash is used to certify the information content of the block: by modifying even a single bit of the content, the hash completely changes. Furthermore, notice that the digest of the current block is computed in terms of the previous block digest. This means if you alter one block you need to modify not only the hash of it but that of all following block for the chain to be valid.

The hash algorithm used here (SHA-256) is part of SHA-2 (Secure Hash Algorithm 2), a set of cryptographic hash functions designed by the United States National Security Agency (NSA). In particular, is uses digests of 256 bits (or 32 hexadecimal figures). It is implemented in R package digest.

# creates hash digests of arbitrary R objects
library(digest)

# hash a string
digest("A blockchain is a chain of blocks", "sha256")
## [1] "5c2005976411a1628fabcdde3ac04be563d18943e710b7446afa2a8a5fab9abc"
# hash blocks
block1 <- list(number = 1,
             timestamp = "2018-10-01 17:24:00 CEST",
             data = "London",
             parent_hash = "0")

block1$hash = digest(block1, "sha256")

block2 <- list(number = 2,
             timestamp = "2018-10-01 17:24:15 CEST",
             data = "Paris",
             parent_hash = block1$hash)

block2$hash = digest(block2, "sha256")

block3 <- list(number = 3,
             timestamp = "2018-10-01 17:24:30 CEST",
             data = "Rome",
             parent_hash = block2$hash)

block3$hash = digest(block3, "sha256")

# the blockchain
blockchain = list(block1, block2, block3)

Let’s update the validation function and give a simple example:

validate = function(blockchain) {
  for (i in 1:length(blockchain)) {
    block = blockchain[[i]]
    hash = block$hash
    block$hash = NULL
    hash_expected = digest(block, "sha256")
    if (hash != hash_expected) {
      return(FALSE)
    }
  }
  if (length(blockchain) >= 2) {
    for (i in 2:length(blockchain)) {
      if (blockchain[[i]]$parent_hash != blockchain[[i-1]]$hash) {
        return(FALSE)
      }
    }
  }
  return(TRUE)
}

validate(blockchain)
## [1] TRUE
# alter data of first block
blockchain[[1]]$data = "Budapest"
validate(blockchain)
## [1] FALSE
# restore data
blockchain[[1]]$data = "London"
validate(blockchain)
## [1] TRUE
# alter data and hash of first block
blockchain[[1]]$data = "Budapest"
blockchain[[1]]$hash = NULL
blockchain[[1]]$hash = digest(blockchain[[1]], "sha256")

validate(blockchain)
## [1] FALSE

Proof-of-Work

Hash alone is not enough to prevent tampering, since hash values can be computed fast by computers. A Proof-of-Work (PoW) algorithm controls the difficulty of creating a new block. For blockchains like BitCoin or Ethereum blocks are created (mined) by so-called miners. When a new block has to be created, a hard computational problem is sent out to the network. The miner which solves the problem first creates the new block and is rewarded in crypto currency.

In the case of BitCoin the PoW problem involves the problem of finding a number (called nonce) that once added to the block is such that the corresponding block hash contains a certain amount of leading zeros called difficulty (more specifically, Hashcash). The average work that a miner needs to perform in order to find a valid nonce is exponential in the difficulty, while one can verify the validity of the block by executing a single hash function.

proof_of_work = function(block, difficulty) {
  block$nonce <- 0
  hash = digest(block, "sha256")
  zero <- paste(rep("0", difficulty), collapse="")
  while(substr(hash, 1, difficulty) != zero) {
      block$nonce = block$nonce + 1
      hash = digest(block, "sha256")  
  }
  return(list(hash = hash, nonce = block$nonce))
}

block <- list(number = 1,
             timestamp = "2018-10-01 17:24:00 CEST",
             data = "London",
             hash = "88e96d4537bea4d9c05d12549907b32561d3bf31f45aae734cdc119f13406cb6Parent",
             parent_hash = "d4e56740f876aef8c010b86a40d5f56745a118d0906a34e69aec8c0db1cb8fa3")


proof_of_work(block, 3)
## $hash
## [1] "0003b3433712535240d61d5c5c29c2975ec8a170e941323d489bcc5fedad55f6"
## 
## $nonce
## [1] 2937
n = 4
iterations = vector("integer", n)
for (i in 1:n) {
  iterations[i] = proof_of_work(block, i)$nonce
}

iterations
## [1]     6   165  2937 14644
plot(1:n, log2(iterations), type = "b", xlab = "difficulty", ylab = "log(iterations)")

Now let’s build a blockchain using the PoW method:

mine <- function(previous_block, difficulty = 3, genesis = FALSE){
  
  if (genesis) {
    # define genesis block
    new_block <-  list(number = 1,
                       timestamp = Sys.time(),
                       data = "I'm genesis block",
                       parent_hash = "0")  
  } else {
    # create new block
    new_block <- list(number = previous_block$number + 1,
                      timestamp = Sys.time(),
                      data = paste0("I'm block ", previous_block$number + 1),
                      parent_hash = previous_block$hash)
  }
  # add nonce with PoW
  new_block$nonce <- proof_of_work(new_block, difficulty)$nonce
  # add hash 
  new_block$hash <- digest(new_block, "sha256")
  return(new_block)
}

blockchained = function(difficulty = 3, nblocks = 3) {
  # mine genesis block
  block_genesis = mine(NULL, difficulty, TRUE)   
  # first block is the genesis block
  blockchain <- list(block_genesis)

  if (nblocks >= 2) {
    # add new blocks to the chain
    for (i in 2:nblocks){
      blockchain[[i]] <- mine(blockchain[[i-1]], difficulty) 
    }
    
  }
  
  return(blockchain)
}


blockchained(nblocks = 3)
## [[1]]
## [[1]]$number
## [1] 1
## 
## [[1]]$timestamp
## [1] "2018-10-03 13:43:29 CEST"
## 
## [[1]]$data
## [1] "I'm genesis block"
## 
## [[1]]$parent_hash
## [1] "0"
## 
## [[1]]$nonce
## [1] 1000
## 
## [[1]]$hash
## [1] "0001ceed012053903ade916e4b307d5d055f75780828d7a4e793de130eaae867"
## 
## 
## [[2]]
## [[2]]$number
## [1] 2
## 
## [[2]]$timestamp
## [1] "2018-10-03 13:43:29 CEST"
## 
## [[2]]$data
## [1] "I'm block 2"
## 
## [[2]]$parent_hash
## [1] "0001ceed012053903ade916e4b307d5d055f75780828d7a4e793de130eaae867"
## 
## [[2]]$nonce
## [1] 5241
## 
## [[2]]$hash
## [1] "000cec76518c0212facc76f3cc8cc29a41690eac7334468f947ce61ec63683af"
## 
## 
## [[3]]
## [[3]]$number
## [1] 3
## 
## [[3]]$timestamp
## [1] "2018-10-03 13:43:29 CEST"
## 
## [[3]]$data
## [1] "I'm block 3"
## 
## [[3]]$parent_hash
## [1] "000cec76518c0212facc76f3cc8cc29a41690eac7334468f947ce61ec63683af"
## 
## [[3]]$nonce
## [1] 212
## 
## [[3]]$hash
## [1] "000ea56698e2c0273598d3cdf50eba6678072b7f4fe5592cb72cf965897e3036"

Transactions

Typically, the data field of a block contains a certain number of transactions. Each transaction has a sender, a receiver and a value of crypto currency that is transferred from sender to receiver. Moreover, it contains a fee of the transaction. Transactions are stored in a pool of pending transactions and each mined block will include a (proper) subset of pending transactions. The miner of the block gets the fees of all blocked transactions plus a fixed mining reward (and this transaction is included in the following block). This is how new coins are introduced in the blockchain economy.

Let’s create a pool of fictitious pending transactions. We store them in a data frame structure using the tibble package.

library(tibble)
ntrx = 10
sender = sample(LETTERS, ntrx)
receiver = sample(LETTERS, ntrx)
value = round(runif(n = ntrx, min = 0, max = 100), 0)
fee = round(runif(n = ntrx, min = 0, max = 1), 2)
(transactions = tibble(sender, receiver, value, fee))
## # A tibble: 10 x 4
##    sender receiver value   fee
##    <chr>  <chr>    <dbl> <dbl>
##  1 S      Z            5  0.17
##  2 N      V           10  0.39
##  3 D      U           65  0.5 
##  4 E      S            6  0.06
##  5 R      Y           65  0.64
##  6 J      W           25  0.08
##  7 O      A           36  0.63
##  8 G      R            3  0.89
##  9 I      Q           33  0.68
## 10 T      N           57  0.68

Let’s update the mining function. Each block will include the most profitable transactions (the ones with the highest fees) among the pending ones. The miner of the block gets as reward the sum of the fees of transactions in the block. The reward transaction is inserted in the next block. We take advantage of dplyr package.

library(dplyr)

mine <- function(previous_block, transactions, difficulty = 3, block_size = 3, miner = "Z", genesis = FALSE){
  # filter transactions to add
  trans_pending = arrange(transactions, -fee)
  if (nrow(trans_pending) < block_size) {
    trans_to_add = trans_pending
    trans_pending = tibble()
  } else {
    trans_to_add =  filter(trans_pending, row_number() <= block_size) 
    trans_pending = filter(trans_pending, row_number() > block_size) 
  }
  
  if (genesis) {
    # define genesis block
    new_block <-  list(number = 1,
                       timestamp = Sys.time(),
                       data = trans_to_add,
                       parent_hash = "0")  
  } else {
    # create new block
    new_block <- list(number = previous_block$number + 1,
                      timestamp = Sys.time(),
                      data = trans_to_add,
                      parent_hash = previous_block$hash)
  }
  
  # add nonce with PoW
  new_block$nonce <- proof_of_work(new_block, difficulty)$nonce
  # add hash 
  new_block$hash <- digest(new_block, "sha256")
  # add reward transaction
  trans_pending = rbind(trans_pending, data.frame(sender = NA, receiver = miner, value = sum(new_block$data$fee), fee = 0.01))
  return(list(block = new_block, transactions = trans_pending))
}


blockchained = function(transactions, difficulty = 3, block_size = 3, nblocks = 3) {
  # define genesis block
  mined = mine(NULL, transactions, difficulty, block_size, miner = "Z", genesis = TRUE)
  block_genesis <- mined$block
  pending = mined$transactions
  # first block is the genesis block
  blockchain <- list(block_genesis)

  if (nblocks >= 2) {
    # add blocks to the chain
    for (i in 2:nblocks){
      mined <- mine(blockchain[[i-1]], pending, difficulty, block_size, miner = "Z")
      blockchain[[i]] <- mined$block
      pending = mined$transactions
    }
  }
  
  return(blockchain)
}

blockchained(transactions, nblocks = 3, block_size = 5)
## [[1]]
## [[1]]$number
## [1] 1
## 
## [[1]]$timestamp
## [1] "2018-10-04 09:03:40 CEST"
## 
## [[1]]$data
## # A tibble: 5 x 4
##   sender receiver value   fee
##   <chr>  <chr>    <dbl> <dbl>
## 1 G      R            3  0.89
## 2 I      Q           33  0.68
## 3 T      N           57  0.68
## 4 R      Y           65  0.64
## 5 O      A           36  0.63
## 
## [[1]]$parent_hash
## [1] "0"
## 
## [[1]]$nonce
## [1] 1284
## 
## [[1]]$hash
## [1] "0007efef9c847ad36f1d4b8536360efd2380dfb0528f4ea96faa492a63874201"
## 
## 
## [[2]]
## [[2]]$number
## [1] 2
## 
## [[2]]$timestamp
## [1] "2018-10-04 09:03:40 CEST"
## 
## [[2]]$data
## # A tibble: 5 x 4
##   sender receiver value   fee
##   <chr>  <chr>    <dbl> <dbl>
## 1 D      U           65  0.5 
## 2 N      V           10  0.39
## 3 S      Z            5  0.17
## 4 J      W           25  0.08
## 5 E      S            6  0.06
## 
## [[2]]$parent_hash
## [1] "0007efef9c847ad36f1d4b8536360efd2380dfb0528f4ea96faa492a63874201"
## 
## [[2]]$nonce
## [1] 1125
## 
## [[2]]$hash
## [1] "000c66c1e0f5a5e340a7ba2805d67299a5343ca9109e6247b4215cf3eeb954d1"
## 
## 
## [[3]]
## [[3]]$number
## [1] 3
## 
## [[3]]$timestamp
## [1] "2018-10-04 09:03:40 CEST"
## 
## [[3]]$data
## # A tibble: 2 x 4
##   sender receiver value   fee
##   <chr>  <chr>    <dbl> <dbl>
## 1 <NA>   Z         3.52  0.01
## 2 <NA>   Z         1.2   0.01
## 
## [[3]]$parent_hash
## [1] "000c66c1e0f5a5e340a7ba2805d67299a5343ca9109e6247b4215cf3eeb954d1"
## 
## [[3]]$nonce
## [1] 281
## 
## [[3]]$hash
## [1] "00008d3be2986a8ea71028dbfd796dac4f0a84eb706a6299e1a1df0ec99b0d37"

Digital signature

How can we be sure that transactions are authentic? Blockchain uses asymmetric cryptography to implement digital signatures of transactions. Each transaction is signed by the sender with her private key and anyone can verify the authenticity of the signature (and of the transaction) using the sender’s public key.

Public-key cryptography, or asymmetric cryptography, is any cryptographic system that uses pairs of keys: public keys which may be disseminated widely, and private keys which are known only to the owner. This accomplishes two functions: authentication, where the public key verifies that a holder of the paired private key sent the message, and encryption, where only the paired private key holder can decrypt the message encrypted with the public key. The strength of a public key cryptography system relies on the computational effort required to find the private key from its paired public key.

In a public key encryption system, any person can encrypt a message using the receiver’s public key. That encrypted message can only be decrypted with the receiver’s private key. In a public key signature system, a person can combine a message with a private key to create a short digital signature on the message. Anyone with the corresponding public key can combine the signed message and the known public key to verify whether the signature on the message was valid, i.e. made by the owner of the corresponding private key. Changing the message, even replacing a single letter, will cause verification to fail.

To speed up the process of transmission, instead of applying the sender’s digital signature to the (possibly large) message, the sender can rather hash the message using a cryptographic hash function and then digitally sign the generated hash value. The sender would then sign the newly generated hash value with their private key and encrypt the original message with the receiver’s public key. The transmission would then take place securely and with confidentiality and non-repudiation still intact. The receiver would then verify the signature with sender’s public key and decrypt the message with their private key.

RSA (Rivest–Shamir–Adleman) is one of the first public-key cryptosystems and is widely used for secure data transmission. In RSA, the asymmetry is based on the practical difficulty of the factorization of the product of two large prime numbers.

Next is an example of encryption and digital signature using package openssl:

library(openssl)

# encryption
# generate a private key (key) and a public key (pubkey)
key <- rsa_keygen(512)
pubkey <- key$pubkey
# message
msg <- charToRaw("Blockchain is terrific!")
# cipher the message with public key
ciphermsg <- rsa_encrypt(msg, pubkey)
# decrypt the message with private key
rawToChar(rsa_decrypt(ciphermsg, key))
## [1] "Blockchain is terrific!"
# signature
# generate a private key (key) and a public key (pubkey)
key <- rsa_keygen()
pubkey <- key$pubkey
# build a transaction
trans = list(sender = "A", receiver = "B", amount = "100")
# serialize data
data <- serialize(trans, NULL)
# sign (a hash of) the transaction with private key
sig <- signature_create(data, sha256, key = key)
# verify the message with public key
signature_verify(data, sig, sha256, pubkey = pubkey)
## [1] TRUE
# signature and encryption
# generate a private key (key) and a public key (pubkey)
key <- rsa_keygen()
pubkey <- key$pubkey
# build a transaction
trans = list(sender = "A", receiver = "B", value = "100")
# serialize data
data <- serialize(trans, NULL)
# sign (a hash of) the transaction with private key
sig <- signature_create(data, sha256, key = key)
# cipher the transaction with public key
ciphermsg <- rsa_encrypt(data, pubkey)
# verify the message with public key
signature_verify(data, sig, sha256, pubkey = pubkey)
## [1] TRUE
# decrypt and unserialize the transation with private key
unserialize(rsa_decrypt(ciphermsg, key))
## $sender
## [1] "A"
## 
## $receiver
## [1] "B"
## 
## $value
## [1] "100"

Network

Finally, the blockchain ledger is distributed over a peer-to-peer network. The steps to run the network are as follows:

  1. new transactions are broadcast to all nodes;
  2. each node collects new transactions into a block;
  3. each node works on finding a difficult proof-of-work for its block;
  4. when a node finds a proof-of-work, it broadcasts the block to all nodes;
  5. nodes accept the block only if all transactions in it are valid and not already spent;
  6. nodes express their acceptance of the block by working on creating the next block in the chain, using the hash of the accepted block as the previous hash.

Nodes always consider the longest chain to be the correct one and will keep working on extending it. The incentive of rewards may help encourage nodes to stay honest. If a greedy attacker is able to assemble more CPU power than all the honest nodes, he would have to choose between using it to defraud people by stealing back his payments, or using it to generate new coins. He ought to find it more profitable to play by the rules (generate new coins), such rules that favour him with more new coins than everyone else combined, than to undermine the system and the validity of his own wealth.

Privacy

The traditional banking model achieves a level of privacy by limiting access to information to the parties involved and the trusted third party. The necessity to announce all transactions publicly precludes this method, but privacy can still be maintained by breaking the flow of information in another place: by keeping public keys anonymous. The public can see that someone is sending an amount to someone else, but without information linking the transaction to anyone. This is similar to the level of information released by stock exchanges, where the time and size of individual trades, the tape, is made public, but without telling who the parties were.