Vineet Gupta

At Directi, we are taking a hard look at the way our applications need to store and retrieve data, and whether we really need to use a traditional RDBMS for all scenarios. This does not mean that we will eschew relational systems altogether. What it means is that we will use the best tool for the job – we will use non-relational options wherever needed and not throw everything at a relational database with a mindless one-size-fits-all approach.
This post covers the current landscape of the NoSQL space. In a subsequent post, I intend to cover in more detail the various problem areas addressed by NoSQL systems and the specific algorithms used.
Caveat: though I have tried quite hard to understand each database, the fact is that most of the systems discussed are quite comprehensive and summarizing their capabilities in a few bullets cannot do them any justice. And it is certainly possible that my interpretation maybe wrong in some cases since I haven’t actually used these systems as yet.

Summary

The need to look at Non SQL systems arises out of scalability issues with relational databases, which are a function of the fact that relational databases were not designed to be distributed (which is key to write scalability), and could thus afford to provide abstractions like ACID transactions and a rich high-level query model.
All NoSQL databases try and address the scalability issue in many ways – by being distributed, by providing a simpler data / query model, by relaxing consistency requirements, etc. I find it useful to bucket them as follows:

Distributed vs. Not-distributed: Distributed databases take the responsibility of data partitioning (for scalability) and replication (for availability) and do not leave that to the client.

DistributedNot Distributed (responsibility on client)
  • Amazon Dynamo
  • Amazon S3
  • Scalaris
  • Voldemort
  • CouchDb (thru Lounge)
  • Riak
  • MongoDb (in alpha)
  • BigTable
  • Cassandra
  • HyperTable
  • HBase
  • Redis
  • Tokyo Tyrant
  • MemcacheDb
  • Amazon SimpleDb

Over here, the model of distribution which seems most compelling is the one used by Dynamo. Indeed, it is copied by Voldemort, Riak and Cassandra – three very different kinds of stores. The reason it is most compelling is because it gives simple knobs to an application to tune its expectations of durability, read performance, consistency, write performance, etc. This makes it very general purpose. The other reason this model is good is because it allows heterogeneous hardware to be used in an efficient way (however Cassandra departs from this).

Data Model richness: The other key distinction is in terms of richness of data model:

Key-Value storeDocument storeColumn-Store
  • Amazon Dynamo
  • Amazon S3
  • Redis
  • Scalaris
  • Voldemort
  • Amazon SimpleDb
  • Apache Couchdb
  • MongoDb
  • Riak
  • Cassandra
  • Google BigTable
  • HBase
  • Hyperbase

On one end of the spectrum are the simple key-value stores: Dynamo, S3, Redis, Scalaris, Voldemort, etc. At the other end of the spectrum are the column-stores which provide a very rich model: BigTable, Cassandra, Hypertable and HBase fall into this bucket. This richness comes at a price – the model is not simple, and you need to think data modeling grounds up. Somewhere in between are the document stores – a sort of schema free cousins of relational databases: CouchDb, Riak, MongoDb, etc. Simple to understand and richer than plain key-value stores.

Disk vs. Memory: A third useful dimension is whether the database is memory-driven or disk-driven. This is important since in the latter case you need an explicit cache, while in the former case you are not durable:

MemoryConfigurableDisk
  • Scalaris
  • Redis
  • BigTable
  • Cassandra
  • Hbase
  • HyperTable
  • CouchDb
  • MongoDb
  • Riak
  • Voldemort

On one end of the spectrum is Scalaris which is entirely memory-driven, and Redis which is primarily memory oriented (you can do background snapshots). Cassandra, BigTable, Hypertable, Hbase allow configuring how large the Memtable can get, so that provides a lot of control. The document stores – CouchDb, MongoDb and Riak – all seem to be using on-disk B+ trees, and Voldemort uses BDB and MySQL. Having a pluggable storage engine which is in-memory, does not make it configurable since the in that scenario you are entirely memory driven with no durability at all!

So which one wins? Well, I am biased towards the database solving the scalability issue by taking over the responsibility of partitioning instead of leaving that problem with me. So theoretically, Cassandra seems to combine the best of both worlds: the sophisticated distribution model of Dynamo with the richness of a column-store. Also, it is in heavy production use despite still being in beta. However, I believe the learning curve here would be higher for modeling data, and it may make sense to opt for the simplicity of a Voldemort for most cases. Ultimately it would depend on your app requirements and development team.

1. Why Non-Relational

Some of the stuff here may not resonate with you if you are an enterprise developer since enterprise apps don’t have to deal with the kind of gigantic scale that (some) consumer web applications deal with. However, given the rate at which data is growing and the number of users who are using IT systems, these issues are only going to become more and more common – for smaller consumer apps, as well as for enterprise apps. In fact, even today, irrespective of the scale at which your app operates, if you want to take advantage of a Cloud platform like Google App Engine or Microsoft Azure or Amazon Web Services, you would perhaps need to think of some of the issues below, because at the infra level these platforms do have to bother about high scale and may impose constraints on the application / data model to help them scale.

1.1 Relational databases are hard to scale

1.1.1 Replication – scaling by duplication

  • Master-Slave:
    • Each write results in N x writes where N is the number of slaves. This leads to diminishing returns beyond a point, thus imposing a limit
    • While reads would get faster (since you can now read from N nodes), writes are still bottle-necked to one node
    • Critical reads still need to go the master since the write may not have propagated to all nodes. This logic needs to be built into the app.
    • High volumes of data pose a problem since you need to duplicate the data N times. This also leads to limiting how much you can scale with
      this approach.
  • Multi-Master

1.1.2 Partitioning (sharding) – scaling by division:

  • Scales reads as well as writes
  • Not transparent to the application. The application needs to be partition aware.
  • The value of an RDBMS is in relations. Once partitioned, these relations get broken – you cannot do a join across shards – this now needs to be done in the app layer.
  • In general, manual sharding in relational databases is not simple.

1.2 Don’t need some features

1.2.1 UPDATEs and DELETEs

  • Typically not used since that leads to loss of information
    • May need the record for auditing, or for re-activation
    • Typically, the info is never really “deleted” from a domain perspective anyway
      • A user “leaves” a community – his posts would not be removed
      • Employees “leave” companies – their employment record is maintained
      • The canonical ACID transactions example: debits from a bank account – this is not a DELETE, but an INSERT
    • Nor is info just “updated”
      • Salary gets “incremented” – the previous record is maintained
      • Your bank account balance is not updated – there are “debits” and “credits”
  • So one can typically model an UPDATE / DELETE as an INSERT and version the record.
    • When data gets too large, archive inactive parts of data
  • Two problems that arise when you go for an INSERT-only system:
    • The database cannot help you with cascades thru triggers – this needs to be done explicitly in the app layer
      • The cascades are actually far more complex than propagating a DELETE / UPDATE – this is a domain requirement:
        • When an employee leaves, you need to update the payroll system so that full and final compensation can be carried out
        • Everytime a bank account gets debited, checks need to be made on standing instructions, minimum account balance, etc.
    • Queries need to filter out the inactive records
      • Can lead to dirty looking code – addressed using views
      • There would be some perf penalty that can be addressed by archival

1.2.2 JOINs

  • Why avoid
    • Joins are expensive when data volumes are high since the database server has to perform complex set operations over large volumes of data
    • Do not work across partitions
    • Techniques like Materialized / Indexed Views not supported by all databases
  • How to avoid? De-normalize!
    • Purpose of normalization
      • Make it easier to have consistent data by keeping just one copy
      • Reduce the amount of storage
    • With De-normalization
      • Burden of consistency shifts from the database to the application layer
      • Easier if you only do INSERTs and no UPDATEs / DELETEs
      • Would lead to data bloat – can be significant for large volumes, but storage is cheap and you can archive inactive data

1.2.3 ACID Transactions

  • Atomic – do not need atomicity on modification of more than one record. Single key atomicity is enough
  • ConsistencyCAP theorem – can get any two of Consistency, Availability, Partition tolerance – not all three. (Also see http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.1915)
    • Most systems need partition tolerance and availability ahead of consistency.
      • Customer wants to place an order – you will accept the order, not return the money saying the system is unavailable – availability is important
      • Inventory would be checked asynchronously
      • Order details would be checked asynchronously
      • … would be done asynchronously
      • All this while data would be in an inconsistent state
      • This is ok – businesses are like that. They do not operate on a single version of truth. Reconciliation happens all the time.
    • Therefore our data model need not be strongly / strictly consistent. We can do with Eventual Consistency.
    • In most scenarios we need Read-Your-Writes Consistency and Monotonic Reads Consistency (as defined by Vogels in the paper above)
    • Strong consistency relies upon conflict resolution at write time to keep read complexity simpler. This does not scale.
  • Isolation – do not need isolation beyond Read-Committed. Easy with single key atomicity (above)
  • Durability – need durability till the time that RAM becomes cheap enough that one can afford many peer replicated nodes holding data in memory so that data is available even with node failures.

1.2.4 Fixed Schema

  • In an RDBMS you have to define the schema before you can start using data (somewhat like declaring types in statically typed languages)
    • Define each entity (table), its attributes (columns) and relations between entities
    • Define usage patterns (indexes)
  • Modifying schemas is essential
    • Intense competition and rapid growth necessitate adding new features / tweaking existing features rapidly
    • Changes usually require modifying the data model, thus precipitating schema changes
  • Modifying the schema is hard
    • Adding / Deleting / Modifying a column may lock the rows (imagine doing this for several million rows)
    • Adding / Removing an index may lock up the table

1.3 Don’t get some features

  • Hard to model hierarchical data
  • Hard to model graphs
  • Don’t rely primarily on main memory
    • Preferable to avoid going to disk as far as possible and serve out of main memory to get faster response time
    • Most relational systems are not memory oriented, but disk-oriented. Even with large main memory, relational databases end up going to disk for most queries – they are not aggressive about serving data from main memory and avoiding going to disk.
    • Vendors are trying to address this by acquiring / building in-memory database technology, but this is far from mainstream

2. Desired Characteristics

The environment expected is that the system would be spread over 100s to 1000s of commodity machines (hence called nodes) with different capacities. In a system like this failure is expected, tolerated and recovered from, with no loss of data, and without affecting the overall a
vailability and scalability of the system at large. The following characteristics are explicitly required to address these requirements:

2.1 High Scalability

  • Ability to add nodes incrementally to support more users and data
  • Achieved via partitioning
  • Increasing number of nodes should not result in a diminishing return beyond a threshold

2.2 High Availability

  • No single point of failure
  • Data should be available even as some nodes go down
  • Achieved via replication since data is duplicated
  • Also achieved via partitioning since at least some data continues to be available

2.3 High Performance

  • Operations should return fast
  • Achieved via being main-memory oriented instead of being disk-oriented, using non-blocking writes, lower complexity algorithms, etc.

2.4 Atomicity

  • Individual writes need to be atomic – should not expose in-between state to a read operation
  • Batching of multiple writes into a single atomic unit not required

2.5 Consistency

  • Do not need strong consistency
  • Ok to have eventual consistency (read-your-writes consistency)

2.6 Durability

  • Data should be persisted to disk and not just kept in volatile memory

2.7 Deployment Flexibility

  • Addition / removal of nodes should distribute data and load automatically without requiring manual intervention
  • Should not impose constraints like requiring a distributed file system or a shared storage
  • Should not require specialized hardware
  • Should be able to work with heterogeneous hardware – identical hardware should not be required to use the system optimally (otherwise you need to upgrade all nodes to get max efficiency which is not practical in a system running on 1000s of nodes)
  • Application should be able to control the degree of consistency, durability and read / write performance it requires without being aware of the deployment model.

2.8 Modeling flexibility

Should be able to model various types of data in a simple way:

  • Key-Value pairs
  • Hierarchical data
  • Graphs

2.9 Query Flexibility

  • Multi-Gets (retrieve a bunch of values, for the set of keys provided, in one call)
  • Range queries (retrieve data based on the specified range of the key)

3. Database Managers

A Database Manager is a software library that is loaded in-process to provide a high performance database. Its focus is typically restricted to the job of maintaining on-disk (or in-memory) data structures. These libraries are often used by distributed databases as a storage backend for managing the CRUD operations.

3.1 Berkley DB

3.2 Tokyo Cabinet

  • http://1978th.net/tokyocabinet/
  • Written in C. Language Bindings: Java, Ruby, Perl, Lua
  • Data Model: Hash, B+ Tree, Fixed length and Table
  • Concurrency:
    • Re-entrant API functions.
    • Writers block
    • Locking granularity is per record for hash and fixed length databases, Per file for others
  • Transactions:
    • Atomic operations
    • Two isolation levels: serializable and read uncommitted
    • Durability thru write ahead logging and shadow paging

4. Key Value Stores

Key-Value stores provide the simplest possible data model. However, this comes at a cost. Range queries are not straightforward (unless the database provides explicit support), and in general modeling applications in general on top of a key-value store can get complicated.

4.1 Amazon Dynamo

  • http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html – terrific paper, must read!
  • Distributed key-value store. Values are opaque to system. Targets objects typically < 1 MB
  • Internal to Amazon, not available for direct consumption externally
  • Partitioning
    • Uses a variant of consistent hashing that addresses
      • Non-uniform data and load distribution
      • Heterogeneity in capacity / performance of each node
    • The hash ring is divided into V partitions – each being serviced by a virtual node. There are P physical servers, with each physical server being typically responsible for Q/P virtual nodes, however this is decided based on the capacity of each physical node. Number of physical nodes is much less than number of virtual nodes.
    • When a physical node starts for the first time, it chooses its set of tokens (virtual nodes) and maps nodes to their tokens. This mapping is stored on disk and initially contains only the local node and token set.
    • Mappings stored at different nodes are communicated to peers using a Gossip based protocol.
    • Eventually consistent view of mappings – each node connects to a randomly chosen peer every second and efficiently reconciles mappings.
  • Replication – data is duplicated on multiple hosts.
    • Each key is associated with a
      “Preference List” -  a list of nodes to which it is propagated. This consists of N nodes where replicas are stored, and some more nodes for failure handling (see below)
    • Also, since number of physical nodes < virtual nodes, the preference list is built to ensure that the list contains only distinct physical nodes (by skipping positions in the hash ring). Since each node is aware of the token ranges handled by its peers (because of mappings getting propagated), this allows each node to forward a key’s Read / Write operation to the correct set of nodes directly
    • There are three configurable values:
      • N = number of nodes that store replicas of data
      • W = Min no. of nodes that must acknowledge the receipt of a Write before a Write completes.
      • R = Min of nodes that are contacted when a data item is to be read
      • If R +W > N, then write and read set overlap, giving Quorum.
    • Both Read and Write requests are handled by nodes called coordinators. A coordinator node is one of the top nodes in the Preference List for the key. The selection of a coordinator node is done by either using a partition aware client, or by using a load balancer.
      • Writes: The Write (Put) request is sent to the coordinator. The coordinator stores the value locally and then sends it to N highest ranking nodes in the Preference List that are reachable (healthy). If at least W-1 nodes respond with success, the Write is considered successful.
      • Reads: A Read (Get) request is received by the coordinator. The coordinator asks for all existing versions of that key from the top N ranking nodes in the Preference List that are reachable. If at least R nodes respond, the Read is considered a success.
      • So performance would be governed by the slowest node in a R/W operation.
    • This model of configurable (N, R, W) values gives applications control on the desired performance, availability, durability and consistency
      • Increasing W would ensure that data is replicated many times increasing durability. However, since W nodes would also be required to have a successful write, availability would reduce as W increases. Performance would also reduce since the chance that one of the nodes is a low thruput node increases. Also increasing W would potentially increase number of divergent versions (see below) creating reconciliation overhead.
      • Increasing R would ensure that consistency is high, but again availability would be reduced and performance may also reduce.
      • Reducing R would improve read performance. So for a read-intensive, low updates application, (N, R, W) can be set to (3, 1, 3)
      • Typical values of (N, R, W) for Amazon Apps are (3, 2, 2)
  • Consistency – Since updates propagate to replicas asynchronously, the model is eventually consistent
    • Uses object versioning (each update results in a new immutable object) using Vector Clocks.
    • Consistency protocol: As mentioned above, a successful read requires that at least R nodes respond. Each response includes all versions of the object. If the coordinator node ends up with multiple versions of data, it does the following:
      • Returns all versions it deems to be causally un-related
      • Divergent versions are reconciled
      • Reconciled version superseding the current version is written back.
  • Handling Temporary Failures:
    • As mentioned above, the Read and Write propagation considers the first N healthy nodes, not the first N nodes in the Preference List.
    • However, when a node that is not in the top N nodes (but in the top N Healthy nodes) receives a Write request, it is also given extra metadata to indicate which node was it actually meant for. This is called a Hinted Replica.
    • Each node periodically scans its hinted replica set and tries to connect to the original nodes for which the replica was meant.
    • If it is able to connect, the node tries to deliver the replica to the original node and if delivered, may delete the replica
    • This process is called Hinted Handoff
  • Handling Permanent Failures:
    • Permanent failures can arise in case a hinted replica becomes unavailable or some other risk
    • To reduce chances of permanent failures, nodes sync replicas thru Merkel Trees
    • Each node maintains a Merkel tree for the key range hosted by it
    • To sync, two nodes exchange the root of the tree common to the key ranges hosted by them, and in case there are any divergences, the apt sync action is taken.
  • Ring membership:
    • Addition / removal of node from ring is relatively expensive because of the need to rebalance partition assignment
    • Therefore addition of a node to a ring and its departure are explicit and not based on non-availability since non-availability can be because of transient conditions (say network)
    • Membership propagation and reconciliation happens during the same communication that is used for token mapping propagation and reconciliation (see partitioning above)
  • Failure Detection:
    • The need to detect failure is to avoid trying to reach unhealthy (unreachable) nodes.
    • When a node A fails to reach another node B, it propagates that information thru Gossip. Checks periodically to see if B comes back. Info on arrival of node also propagates using Gossip.
  • Persistence:
    • Pluggable model – Berkley DB Transactional store, MySQL, In-memory buffer with persistent backing store.
    • Apps choose the kind of storage they need.

4.2 Amazon S3

4.3 Project Voldemort

4.4 Redis

4.5 Scalaris

  • http://code.google.com/p/scalaris/
  • Distributed key-value store based on written in Erlang
  • No persistence, entirely memory driven. Can use a backend like Tokyo Cabinet for handling database size > RAM + Swap.
  • Partitioning – Uses a modified version of Chord called Chord# that
    • Preserves key order (allowing range queries)
    • Does routing in node space
    • Delivers proven O(Log N) performance (proof in the paper linked above)
  • Replication – Essential for being fault tolerant since there is no persistence. Uses a peer to peer replication approach called Symmetric Replication that reads and writes to a majority of nodes participating in replication.
  • Consistency: Strong consistency – uses a modified version of Paxos that provides atomic transactions
    : non-blocking writes that execute in parallel with strong consistency.
  • Language support: Erlang, Java, JSON RPC
  • License: Apache 2.0
  • Production Use: None mentioned

4.6 Others

5. Document Stores

Document stores are a step further from key-value stores in the sense that a value associated with a key is a full blown record (document) which is not opaque to the database, but exposes a structure which allows the database to perform some operations on it. For example, the document  can be a JSON serialized object.
Note that this approach is different from a relational database where the schema is defined upfront in the form of a table. In a document store, each document can have a different schema. Despite this, it is possible to describe relationships between records, just the way you do in relational databases:

  • One to Many:
    • Embed key of foreign entity in the document. Then the foreign entity can be retrieved.
    • Embed foreign entity itself in the document. Does not scale, but is faster
  • Many to Many
    • Embed keys of one type of entities in the other type of entity
    • Maintain another entity that embeds the keys of the related entities.

As should be obvious – document stores are conceptually very similar to relational databases except that schema definition is not upfront.

5.1 Amazon SimpleDB

  • http://aws.amazon.com/simpledb/
  • Written in Erlang
  • Data is modeled in terms of
    • Domain (a container for holding related data together)
    • Item (a entity that goes into a domain)
    • Attribute and its Value (a properties of an item)
  • So basically each item is a dictionary to which you can add / remove attributes (keys) at any time. Related items go into a Domain.
  • When an item is added, it is indexed based on its attributes (and the attributes of other items in the domain).
  • You can issue SELECT statements much the way you would issue to a relational system, with the concept of a table being replaced by a Domain.
  • Eventually consistent
  • Partitioning: The domain-item-attribute model imposes limits on size of each domain, number of attributes / domain and number of domains, so in order to scale, you need to partition the data manually.

5.2 Apache CouchDB

  • http://couchdb.apache.org/
  • Written in Erlang
  • Data and Storage Model
    • Records serialized in the form of JSON strings (documents)
    • Each document has a unique DocId
    • Uses B+ Trees: Both for main data as well as for Indexes
    • Append only operations (Every update generates a new Sequence Id for the DocId). Committed data is never over-written
      • Every write operation also results in index updates which are also append only, so a new node is added to the B+ Tree which recursively leads to Log(n) updates to the B+ tree.
      • During Compaction older versions are removed
    • Append only model means that Reads are done using Multi-Version Concurrency Control. A client can hold on to the older root of the B+ tree and get a consistent view while the data is being modified continuously.
    • Writes are serialized, except for blobs which can be concurrently written.
    • An update to a document is atomically committed with data and index updates getting synchronously flushed to disk, and the database header is updated.
    • When resuming from a crash, no recovery is required (except for recovering the database header). Partial updates are simply truncated.
  • Views
    • Required to “add structure back to unstructured and semi-structured data.”
    • Can have multiple views for the same data
    • Defined using a JavaScript function that maps keys to values in a special type of document called a Design Document.
    • View functions are run when your query a view: The system takes the source code and runs it against every document in the database. There are basically two kinds of functions:
      • Map function: takes a single parameter – the document – a single document in the system, does some operation on it (can’t modify the document) and returns a result (mapping an input to a result).
      • Reduce function: takes a list of key-values and returns a single result after performing some aggregate operation on it (reducing multiple inputs to a single result)
    • View results are stored in a B+ tree and unless the document changes or the view definition changes, the results are not re-computed and fetched straight from the B+ tree.
  • Replication
    • Master-slave with multi-master support
    • Only latest revision replicated
  • Conflict Handling in Multi-master
    />

    • CouchDb detects conflicts and uses a deterministic algorithm for picking a “winner”. Deterministic means that all replicas end up winning the same version without talking to each other.
    • This winner may or may not be the one your app expected. This will have to fixed by the app itself.
  • Partitioning and Load Balancing – Provided thru CouchDb Lounge
    • Uses consistent hashing to partition the data
    • Dumbproxy used for all requests except View requests. Implemented as an Nginx module
    • Smartproxy used for View requests only
      • Twisted python based daemon
      • Sends view requests to all shards
      • Merges responses before sending them back
      • Merges happen in constant memory space
  • License: Apache 2.0
  • Production Use: BBC, PylonsHQ, GPirate, and several others

5.3 Riak

  • http://riak.basho.com/
  • Written in Erlang. Uses the same architecture and algorithms as Amazon Dynamo.
  • Client libraries
  • Data Model
    • Key-Value. Value is a document.
  • Riak nodes go on a ordered consistent hash ring
    • 160-bit binary hash of key-value pair, maps to a position on the ring.
    • Ring divided into partitions. Each partition is designated an owner called v-node. The v-node is said to claim a partition.
    • Nodes try and run equal number of v-nodes. So each node is responsible for 1/(number of nodes) or (number of partitions) / (number of v-nodes). So if 2 nodes define a 16-partition cluster, each node would run 8 v-nodes.
  • Writes (Puts)
    • Any node may be chosen as coordinator for the Write request
    • Coordinator node figures out the v-node responsible for the partition where the key needs to go.
    • Coordinator sends Write request to that v-node as well as next N-1 partitions in the ring, where N is a configurable parameter that defines how many copies of the value need to be created.
    • Write request may also specify that at least W (=< N) of the v-nodes respond with success, and that DW (=< W) of the W nodes respond with success only after durably storing the state.
  • Reads (Gets)
    • Any node may be chosen as a coordinator for the Read request
    • Coordinator node figures out the v-node responsible for the partition for the key
    • Sends request to v-node, as well as next N-1 nodes
    • Request can also specify R (=< N) nodes that must reply before a response is returned.
  • Ring state propagation
    • Consists of things like arrival and departure of nodes on ring
    • Done using Gossip protocol
  • Eventual Consistency
    • When a object is added it is tagged with a vector clock specifying its initial version
    • For each update, the clock is updated in such a way that the system can compare two versions of an object and determine how the two objects are related (direct descendant, common parents, unrelated)
    • To compare object versions, a Merkle tree is used
  • Failure handling is done using the same hinted replicas and hinted hand-offs approach as described in the Dynamo paper.
  • Storage model – pluggable:
  • Queries
    • Map-reduce model
    • Map operations run local to the data on the hash ring – can be run in parallel
    • Reduce operations take inputs from the preceding phase and reduce (typically perform an aggregate operation) the input. This cannot be parallelized
  • License: Apache 2.0
  • Production use: No names mentioned, except that the FAQ says that “Riak has robustly served the needs of multiple revenue-generating applications for nearly two years now.”

5.4 MongoDB

  • http://www.mongodb.org/
  • Written in C++
  • Language bindings (drivers): C, C++, Java, JavaScript, Perl, PHP, Python, Ruby. Also community provided support for C#, F#, Erlang, Groovy, etc.
  • Data Model
    • Collections – similar to Domains in Amazon SimpleDb – a bucket to hold together similar documents
    • Key-Value, with value being binary serialized JSON (BSON)
      • 4 MB limit on a BSON object
      • Large object support via GridFS
    • B-Trees used for indexes. You specify the fields on which to index using a function called db.things.ensureIndex() that takes the field as a parameter.
  • Storage: Uses memory mapped files. So caching is controlled by the OS VMM
  • Writes
    • In-place updates
    • Provides partial updates – you can update the value without sending a full row update.
    • Single document Atomic updates. Atomic batch updates possible thru db.eval() but may not be supported with partitioned data

    <
    br />

  • Queries
    • Provides a JSON style based syntax to pick values from inside documents
    • Support for conditional operators, regular expressions, etc.
    • Supports forward-only cursors
    • Query optimizer is not cost-based, instead multiple query plans are tried and the one that works best is picked.
  • Batch Operations
  • Replication:
  • Partitioning:
    • Partitioning In alpha stage.
    • Data sharded by Collection with order preservation
    • A shard consists of one or more servers replicating using an enhanced version of replica pairs
    • Shards deal with data in units called Chunks.
      • A chunk is a contiguous range of data from a collection with a max limit of 50 mb. After that a chunk splits into two chunks
      • Data migrates from one shard to another in chunks (in case of shard having excess data, or having nearby shards)
  • License: AGPL 3.0
  • In Production use at: Sourceforge, Github, EA, and several others

6. Column-Family Stores

6.1 BigTable

  • http://labs.google.com/papers/bigtable.html
  • Internally used at Google. Exposed to general public thru Google App Engine
  • Building Blocks
    • Google File System – a distributed file system. Provides raw storage
      • Files broken into chunks (typically 64 mb)
      • Each chunk is replicated across 3 machines for safety.
        • A heuristic here is that one of the replicas tries be on the same rack as the original and the other two replicas somewhere else
      • Chunk Servers – responsible for storing chunks
      • Data transfer happens directly between clients and chunk servers
      • Master – responsible for metadata in the file system
    • The storage file format is called SSTable (sorted strings table)
      • Persistent, ordered, immutable map of keys to values with both keys and values being arbitrary byte strings
      • Operations to look up the value specified with a key and to iterate over key/value pairs in a given key range
      • Optionally, an SSTable can be completely mapped into memory
      • Immutability of SSTables:
        • Means no synchronization required during file access.
        • Makes removing permanently deleted data a garbage collection exercise (happens in a major compaction – see below)
    • Scheduler
      • Schedules jobs on machines
      • Watches for failures and re-schedules if required
    • Chubby Service – Distributed lock / file / name manager
      • Coarse grained lock – can store small amount of data in lock
      • 5 replicas, 1 master. Need majority vote to be active. Uses Paxos to keep replicas in sync
      • Provides a namespace that consists of directories and small files. Each file / directory can be used as a lock, and reads and writes to a file are atomic.
      • Each chubby client maintains a session with a chubby service. There is a lease time which needs to renewed.
      • Clients can register callbacks on Chubby files and directories for notification of changes / session expiration
    • Optional Services used by BigTable applications (but not used by BigTable itself)
      • Map-Reduce: Batch processing – clients often use this to read / write BigTable data
      • Sawzall: to support execution of client provided scripts in the server space for transformation, filtering and summarization
  • Data Model
    • Multi-dimensional map (map of maps)
    • Data organized into tables. A table is an entity that holds related data
      • Each record is identified by a row key
      • Each row has columns. Columns here should be thought of as attributes of a row. Not all rows would have the same columns
      • Intersection of a row and a column creates a cell
      • A cell has versioned data based on timestamp. So a single cell may holds multiple versions.
      • It is convenient to organize attributes. Thus columns are grouped into column families. A column family can be thought of as a namespace for a set of related attributes.
    • So the dimensions of the multi-dimensional map are:
      • Row key: Uniquely identifies a datum (arbitrary byte string)
      • Column-family: A grouping of attributes – a namespace for a set of related attributes.
      • Column: A single attribute.
      • Timestamp: Denoting the time when the value was changed.
        • Different versions are stored in decreasing order of timestamp, so most recent versions can be read first
        • Client can specify that only last n-revisions of cell should be kept, or only values written in the last n-days should be kept
    • Transactional access possible to any column within a single row
      • Can perform atomic read-modify-write sequences on data stored under a single row key
      • Transactions across row keys not supported
      • Batch operations across row keys are possible but are not atomic
  • Tablets
    • Data is sorted lexicographically by Row key
    • Row range for a table is dynamically partitioned.
    • A row range is called a Tablet. So you can describe a tablet in terms of a start row key and an end row key
    • One tablet = 100 to 200 mb
      • Tablets can be split based on tablet size or load
    • A tablet server may end up serving anywhere from 10 to 1000 tablets (typically hundreds).
      • While each tablet consists of contiguous data, a tablet server would pick up tablets spread across the entire data set and not data that is related. This helps allevi
        ate hot spots since load gets spread evenly
      • The high ratio of tablets to tablet servers gives very fast recovery since if a machine that was serving 100 tablets fails, 100 machines can each pick up 1 tablet from the failed machine making recovery highly parallelized
      • Also gives fine-grained load balancing – you can migrate tablets away from an overloaded machine.
    • Servers are further grouped into cells
      • A cell is a collection of servers. Most cells have 10-20 servers
      • Some cells have as many as thousands of servers managing several hundred TB of data
  • A typical cluster consists of
    • Commodity Intel based PC hardware with each node running
      • Google’s version of Linux
      • GFS Chunk server – for raw reads and writes
      • Scheduler slave process – runs jobs
      • Tablet server – serves BigTable read / write requests
      • Some nodes may also run a BigTable master which is responsible for meta-data requests (create a new table) and load monitoring / balancing
    • GFS Master
    • Cluster scheduling master – talks to the scheduler slave process on each node to startup jobs
    • Lock service – used for electing BigTable masters and for bootstrapping the tablet lookup (below)
  • Locating Tablets
    • To figure out which server is serving which tablet (row key range), this mapping info is kept in special metadata tables (which in turn can split into tablets)
    • Three level hierarchical lookup with bootstrap provided by Chubby
      • Chubby file points to a root metadata tablet – this tablet is never split
      • This tablet points to other metadata tablets
      • Each 2nd level metadata tablet points to app data tablets
      • Each metadata row is an encoding of tablet’s table id and end row key, and takes approx 1 kb
      • With a 128 mb limit on metadata tables, this scheme can address upto 2^34 tablets (or 2^61 bytes in 128 mb tablets)
    • The lookup is aggressively pre-fetched and cached by clients so most queries end up going straight to servers
      • Empty cache: There are 3 network round trips including one read from Chubby
      • Stale cache: could take up to  round trips since stale entries are discovered via misses
  • Tablet Assignment
    • Each tablet assigned to one server at a time. Master keeps track
    • When a tablet gets unassigned, the Master asks a Tablet server with available capacity to take over the tablet
    • Tablet assignment uses Chubby
      • On starting, a tablet server creates and acquires an exclusive lock on a uniquely named file in a specific Chubby directory called Servers
      • Master monitors this directory to discover tablet servers
      • If a tablet server loses its exclusive lock it stops serving the tablets
      • Tablet server may try and re-acquire the lock if the file still exists, however if the file is gone, the tablet server kills itself
    • Master asks each tablet server periodically for the status of its lock
      • If tablet server reports a loss, or if master fails to connect to tablet server, the master tries to acquire an exclusive lock on the tablet server’s file
      • If master is able to acquire lock, it means Chubby is alive. So master deletes the file so that the tablet server can never serve again, and the master moves all tablets assigned to the tablet server to the set of unassigned tablets
      • If master is unable to reach Chubby, it kills itself
  • Writes
    • When a write request goes to a tablet server, it checks it for well formedness, and authorization (Chubby maintains a list of permitted writers)
    • Valid mutations are written to a commit log (grouped for small mutations)
    • After the write has been committed, its contents are inserted into a sorted in-memory buffer called memtable. From here, updates keep going to a sequence of SSTables as they get older
  • Reads
    • Similar checks as for a write request – well formedness and auth
    • Read op executed on a merged view of SSTable sequence and memtable
  • Tablet Recovery
    • Maybe required if a Tablet server dies
    • Tablet server reads its metadata from the MetaData table – list of SSTables and set of redo points which point to commit logs
    • Server reads indices of SSTables into memory and reconstructs memtable by applying all updates that have been committed since redo points
  • Compactions
    • Minor Compactions:
      • As writes occur, memtable grows till a threshold. Is frozen. New memtable created. Frozen memtable converted to SSTable and written to GFS.
      • Shrinks memory usage of tablet server
      • Reduces amount of data that needs to be read from Commit log during recovery
    • SSTable Compactions
      • As minor compactions occur, new SSTables get created. This can lead to a Read operation slowing down as it may need to merge data from multiple SSTables
      • So periodically a merging compaction runs which reads a few SSTables and the Memtable and writes out a new SSTable
      • This does not remove special deletion entries (used to suppress deleted data)
    • Major Compactions
      • From time to time BigTable runs a compaction that rewrites all SSTables into exactly one table is. This is called a Major compaction
      • During this compaction deleted data / deletion entries are removed.
      • Required for claiming resources
  • Refinements
    • Locality Groups: Are used to group together parts of data that have similar usage characteristics.
      • Client can group multiple column families into a locality group
      • Separate SSTable generated for each locality group
      • Locality group can be marked as being in-memory (lazy loaded). BigTable itself uses it for the Location column-family in metadata table.
    • Compression: Clients can specify whether or not SSTables for a locality group are compressed or not, and which compression algorithm to use
    • Caching: Tablet servers use two levels of caching:
      • Scan cache – higher level. Caches key-value pairs returned by SSTable interface to Tablet server code. Useful for apps that tend to read the same data repeatedly
      • Block cache – lower level. Caches SSTable blocks reads from GFS. Useful for apps that read data close to data they recently read

6.2 Cassandra

  • http://incubator.apache.org/cassandra/
  • Combines the distributed architecture of Dynamo with BigTable’s column-family data model
  • Written in Java. Thrift based interface. High-level libraries available on Ruby, Perl, Python, Scala and Java
  • Data Model:
    • Multi-dimensional map indexed by a key like BigTable.
      • Each application creates its own key-space (table in BigTable)
      • Key – a name for a record. Can be an arbitrarily long string. Indexed by Cassandra
      • Column – an attribute of a record. Smallest datum you can get hold of. This is timestamped.
      • Column-family: Grouping of columns. Similar to a relational table.
      • Super-Colum
        ns: A list of columns
      • A column-family can either contain columns or super-columns, but not both. Coumn family containing super-columns is called super-column-family
      • The way to get to a piece of data is: KeySpace.ColumnFamily.Key.[SuperColumn].Column
    • Sorting
      • Data is sorted at write time (unlike relational)
      • Columns are sorted within their row by column-name. The sorting provider is pluggable
  • Partitioning: Similar to Dynamo
    • Consistent hashing using an order preserving hash function.
    • However uses the Chord approach to load balancing rather than Dynamo’s v-node approach
  • Replication
    • Same concept of coordinator nodes and preference lists as Dynamo
    • Various replication policies: Datacenter aware, Rack-aware, Rack-unaware
    • Rack-unaware approach is same as Dynamo (replicate to N-1 nodes)
    • Rack aware and Datacenter aware use a more intricate algorithm which involves a leader (chosen using Zookeeper) that maintains the N-1 replicas invariant. This is apparently not there in Apache Cassandra.
  • Membership: Membership based on Scuttlebutt – anti-entropy Gossip based mechanism. Same concept of explicit membership changes as Dynamo
  • Failure detection: Uses a modified version of Φ Accrual Failure Detector which gives a probabilistic value for each monitored node. This apparently is fairly accurate and more representative of network conditions.
  • Failure Handling: Same concept of hinted replicas and hinted hand offs as in Dynamo
  • Persistence:
    • Relies on local file system (unlike BigTable that uses GFS)
    • Storage structure and approach is similar to BigTable: SSTables, Memtables, Commit logs, Compaction, Bloom filters, etc.
  • Writes
    • Similar to the BigTable approach of writing to commit log for durability and recoverability, followed by update to Memtable
    • Dedicated disk on each machine for commit log makes the writes sequential and thus maximizes thruput
    • When memtable crosses a threshold (out of space / too many keys / client provided time duration), it gets written to two files:
      • Data file – SSTable
      • Index File – key/offset pair
    • No seeks – always sequential, thus quite fast
    • Atomic within Columnfamily
  • Reads
    • Similar approach as Dynamo for figuring out which nodes will serve, read-repair, etc.
    • At a storage level, approach is similar to BigTable: read from memtable + SSTable sequence
  • License: Apache 2
  • In Production at: Facebook, Digg, Rackspace

We recently started off on a couple of projects that target the JVM. The team had no prior Java experience but had worked on C#/.Net. We found Java tedious and ceremonial and decided to investigate the other languages that target the JVM – Groovy, Scala and Clojure. These notes summarize findings based on my interpretation of the language specs, examples and scanning the project sites. They are not based on any real programming experience on these languages.

1. Groovy

 

2. Scala

 

3. Clojure

 

Conclusion

  • Groovy – If you are an experienced Java programmer and need to do scripting work.
  • Clojure – If you have a Lisp background and want to target the JVM. Take the high ground on concurrency – very promising.
  • Scala – If you want sophistication, and the best of OO and functional. Most impressive.
  • Cool part is that like the CLR, you can do language interop here – so choose the best language for the job

Personally, I am excited about Scala – I think it will replace Java as the mainstream language on the JVM in the long-term. Clojure is exciting because I always wanted to learn Lisp – here’s the perfect opportunity!

We recently augmented the team working on our desktop product. At the core of the product is XMPP – the protocol that drives several instant messaging servers and clients, sites like Chesspark and now Google Wave. Since XMPP is not known by many people, let alone be understood well enough, every time we on-board someone new, they have to go thru a steep learning curve. This post is an attempt to make it easier to understand the protocol .

 

What is XMPP

XMPP generally refers to a collection of specifications that define protocols for real time interactions over the public internet.  The core set of specifications has been standardized by IETF. While the first application (and the origin) of the protocols was to address instant messaging, the extensible nature of XMPP has led it to be used for a wide range of applications which need real time communication.

 

Architecture

XMPP has a decentralized client-server architecture (like the WWW) – there are several hundreds of XMPP deployments, each running anywhere from one to hundreds of servers to which millions of clients connect. Key aspects:

 

Addressing:

Each interacting entity on XMPP needs to have a unique address. This address is called a Jabber Id (JID). A JID consists of three parts:

user-identifier: Unicode string representing the interacting entity. For example: user1

domain-identifier: Unicode string representing the domain of the interacting entity. For example: directi.com

resource-identifier: Unicode string representing a resource used by the entity. For example: pwdesktop

A full JID looks as follows: <user-identifier>@<domain-identifier>/<resource-identifier>. For example: [email protected]/pwdesktop. An identifier without the resource is  called a bare JID: [email protected]. Another way of addressing is to use XMPP URIs: xmpp:[email protected]

 

Communication:

The original purpose of XMPP was to do instant messaging. This requires the server to be able to push messages out to the client on a real time basis. Over HTTP, this requires the client to poll the server, or a technique like Comet where the server caches HTTP requests and then sends responses on those requests. XMPP however uses a long lived TCP connection. This gives the server an always on channel to push info to the client. The client also need not wait for the server to respond to its messages, but can instead send an indefinite amount of messages to the server without blocking, which the server can then respond to, enabling an asynchronous kind of communication. This is further helped by the fact that XMPP does not require every packet being sent over the wire to be acknowledged. An entity assumes a packet to be delivered unless it receives an error.

 

Protocol:

XMPP uses streaming XML over a long lived TCP connection (though HTTP is also possible thru BOSH – see later) for communication. There is one stream from client to server and another stream from server to client. To start communicating, a client would send an opening XML tag:

<?xml version="1.0"?>

This is followed by a stream element – this marks the beginning or root of the document:

<stream:stream>

This in turn is followed by various messages sent across as XML elements. These elements are called stanzas. These stanzas can continue to get exchanged between the server and the client endlessly till a closing stream tag is sent which marks the end of the communication. There are three kind of stanzas:

1) Message: The <message/> stanza denotes the basic push method for sending stuff from one entity to the other. These need not be acknowledged and provide a quick fire and forget mechanism to send info from one point to the other. A Message has a “to” and a “from” attribute denoting the receiver and the sender, and can include one or more payloads. In IM conversations, the payload is often HTML markup for richly formatted text (defined by XHTML-IM)

2) Presence: The <presence/> stanza denotes a broadcast sent out by an entity to advertise its availability to other entities who have subscribed to receive these updates from the advertising entity. Presence messages can also include a payload. Common uses are to include an availability state like “away,” “busy,” etc. and personal status messages like “Working on blog post 1 of 6.”

3) Information Query: The <iq/> stanza provides a Request-Response mechanism like HTTP verbs GET, PUT and POST. The payload defines the request of the sender which needs to be processed by the receiver. This is the only stanza type where the sender expects a reply – a result or an error. This makes IQ more reliable than a Message, allowing the two entities to carry out a structured interaction. Common examples of IQ usage in IM applications are to fetch the the Roster, and add / remove entries in the Roster.

A closing stream tag indicates end of conversation.

 

Decentralized client-server architecture

XMPP servers talk to each other directly. So if [email protected] needs to interact with [email protected], the servers at domain1.com would interact with domain2.com servers directly. This is different from email where communication between servers on different domains happens thru hops (which can lead to address spoofing and other issues) or HTTP where servers do not interact with each other at all.

 

 

Creating a XMPP Session

[email protected] wants to talk to [email protected]. To accomplish this:

 

1) Client creates a TCP session with the server:

The client used by user1 needs to find out the box that hosts the XMPP service for domain2. This is accomplished by doing a DNS service lookup by checking the DNS SRV record which maps the service to the machine name and the port of the service (5222 by default) becomes known, a AAAA lookup gives the IP of the machine. With the IP and the port we can now open a TCP connection.

 

2) Client and Server start streams in opposite directions:

a) The client sends across the opening XML text declaration tag (optional) followed by an initial stream header:

<?xml version="1.0"?><stream:stream to="domain2.com"     version="1.0"     xmlns="jabber:client"    xmlns:stream="http://etherx.jabber.org/streams">

b) The server sends back a response stream header with a unique stream id:

<?xml version="1.0"?><stream:stream from="domain2.com"     id="0123456789" version="1.0"     xmlns="jabber:client"    xmlns:stream="http://etherx.jabber.org/streams">

 

3) Client and Server negotiate Stream Features:

Right after sending the response stream header, the server send across a <stream:features> message on the features it supports. These features are typically about:

a) Whether server supports TLS or not (recommended)

b) Authentication mechanism supported (see my earlier blog post on authentication mechanisms) – typically SASL plaintext and digest-md5 are supported. Ideally one should not use plain text without TLS since in that case the password is sent in clear on the wire.

c) Stream compression (optional)

At this point, the client uses <iq/> stanzas to negotiate which features it wants to use, and if TLS is used, enter into a TLS negotiation, or if SASL is used, authenticate via the appropriate SASL mechanism.

 

4) Post Authentication Stream Negotiation

After authentication, the server resets the session by sending a new stream header with a new stream id. This is done for security purposes. This new stream does not publish any authentication features (since that is already done), but now publishes new features. These typically include:

a) Compression support

b) Resource binding

c) Formally starting an XMPP session

At this point, the client again uses <iq/> stanzas to negotiate which features it wants to use, and once that part is over, the actual task of application specific stanza exchange can start between the client and the server.

 

BOSH

I earlier mentioned that XMPP can work over HTTP as well. This seems counter-intuitive: XMPP requires push, and HTPP is pull based (client sends a request and server responds). However, it turns out that one can do push over HTTP as well – the technique for using XMPP over HTTP is called Bidirectional-streams Over Synchronous HTTP (BOSH):

 

1) There is a server in front of the XMPP server which handles HTTP clients. This is called a BOSH connection manager (CM).

 

2) Client sends DNS query for TXT records, and discovers that there is an entry for BOSH connection which points to the BOSH server mentioned above.

 

3) Session Creation Request: Client now sends a HTTP POST with an empty <body/> tag with some attributes set. The important ones are:

a) hold – the number of HTTP requests the BOSH server can queue. This is typically set to 1

b) wait – the timeout in seconds before which the server must respond to a pending request

c) rid – a large random number that acts as the initial request id

 

4) Session Creation Response: BOSH server opens a regular XMPP stream with the XMPP server over a TCP connection, receives the server’s XMPP response, wraps it up in a <body/> tag and returns this to the client over HTTP. The body tag contains the following attributes:

a) hold – same as earlier

b) requests – max number of HTTP requests that the client can open with the BOSH server at any time. This is typically set to hold + 1. Since hold is typically 1, requests is typically 2.

c) sid – a large random number that acts as the session id. This is diff from the stream id sent by the XMPP server. The client must now include the sid in every subsequent request.

 

5) Hereafter the client and the server negotiate stream features and authenticate pretty much in the same way as with a TCP connection, with the BOSH server sitting in between and wrapping / unwrapping the <stream> and <iq/> stanzas in <body/> tags. The XMPP application is now ready to exchange its specific stanzas.

 

6) The question at this stage is, how does the server do a push. Recall that hold=1 is the max number of HTTP requests the BOSH server can queue, and requests=2 is the max number of HTTP requests that the client can open with the BOSH server at any time. Assume that the last request from the client was sent to the BOSH server just around 60 seconds back (let’s say a <presence/> packet) . The server had nothing to respond because the XMPP server had no stanzas. Now since the 60 seconds timeout is about to be over, this is what happens:

a) BOSH server returns a HTTP 200 ok response to the last request with an empty <body/> tag. If the client too has nothing to say, it also sends across a HTTP 200 ok with an empty <body/> tag, to which the server can again respond at the 60 second timeout. This can go on ad-infiniteum as a keep-alive mechanism.

b) Now assume that the client has something to say. One request is already in the play and max two are allowed. So the client can now send the new stanza in a new HTTP request. The BOSH server immediately responds to the earlier request (which was kept on hold) with a HTTP 200 ok with empty <body/> tag. It now again has one request outstanding which it can use to either send back the keep-alive, or send back a response from the XMPP server.

c) Assume that the client and the server have been playing the keep alive game. Now the XMPP server sends a stanza (say an authorization request). The BOSH server needs to push this to the client. This is easy since it has a cached HTTP request. Hence push is accomplished over HTTP, without doing constant polling.

 

The advantage of using BOSH is that it can work even in flaky networks where a TCP connection would break, forcing the client to once again establish an XMPP session. Also, this makes it possible to use XMPP in web clients where one cannot open a TCP connection, for example, Facebook’s chat feature uses XMPP over BOSH.

 

Jingle

XMPP uses a client-server model for all communication and is optimized for small snippets of info. So if the amount of data to be exchanged is very large, for example in applications like file transfer, audio-video calls and screen-sharing, an XEP called Jingle. Jingle is large and complex enough to deserve its own series of posts. I will summarize basic facts here:

1) The basic idea behind Jingle (and other multimedia protocols like SIP) is to use two channels:

a) Signaling Channel to set up, manage and tear down application defined sessions

b) Media Channel to transfer the payload either peer to peer or relayed thru a mediator over a application defined transport

 

2) In a Jingle negotiation (<jingle/> element inside a <iq/> element),  the initiator makes an offer to start a session by declaring one or more app type (say voice video, etc.) and a transport method (ICE, UDP, etc.). The responder and the initiator then negotiate a set of parameters (for example codecs to be used), and if the negotiation works data is exchanged. Some parameters can be modified even while the data is being exchanged.

 

3) Jingle supports two transport types:

a) Datagram transports like UDP – can tolerate packet loss – meant for apps like media streaming

b) Streaming transports like TCP – no packet loss tolerated – for example file transfer

 

4) The real power of Jingle comes from using Jingle over ICE. ICE provides a mechanism for two entities to communicate and negotiate all possible ways of connecting between each other – direct or mediated. ICE in turn can use a STUN server to find out the IP address and port of an endpoint from outside the firewall, and a TURN server to relay data in case a direct peer to peer connection is not available.

When the Twitter reply feature tweak story started breaking, my first reaction was – this could be us in future. Watching Twitter struggle with the change and the backlash it generated was a big public lesson in product design, tech implementation and communications, and I thought I should document it here, lest I ever forget it.

For the uninitiated, Twitter recently removed a setting where you could see responses by people who you follow to other people who you don’t follow. This was primarily done for tech reasons, but the blog announcement did not explain that very well, leading people to think of it as an arbitrary move, resulting in some strong feedback. Twitter however responded quickly with the real story: it was done for technical reasons. Here’s what I learnt:

 

1) Product Design:

According to Alex Payne – Twitter’s API lead, only 3% of the users had ever turned on that feature, however, these 3% users are apparently power users, which in Twitter’s world also means that they are also strongly vocal with a large influence base, and they were using the feature to discover new friends. Hence the strong backlash. The lessons are:

a) If you change a feature, figure out the audience profile it impacts most, and the consequence of that impact. Numbers-wise 3% is low, but what if they are some of your most important and vocal users?

b) When you do change a feature, figure out if the impact can be lessened by coming up with an alternative. It seems that the Twitter team is now working on this – they could have started the work before going ahead with the decision to remove the setting.

 

2) Tech Implementation:

The feature was removed for tech scalability reasons. Let us see what the situation is:

  • User U1 has n1 followers represented by the set S1. One of the followers in S1 is U2 who has in turn n2 followers represented by the set S2.
  • User U1 makes a twitter post – all n1 followers in S1 see it
  • User U2 makes a reply to this post. Now twitter needs to deliver it to people from both S1 and S2:
    • First deliver it to the intersection of S1 and S2 since that is the set which is following both U1 and U2.
    • Now find out (S1 + S2) – (S1 x S2) – this is the remaining set which is not following both U1 and U2
    • In this set find out the 3% users who have opted for receiving replies even when they are not following the user.

How hard is it to do this? Well you are looking at at least two select queries, where the second one involves a complex join. This is where the stress on the system comes from. Can this not be addressed by caching? Here’s what would need to be cached: the follower list for each user, and the 3% of the total users who have got this feature turned on. It would be ineffective if the follower numbers n1, n2, n3, … for users U1, U2, U3, … were very large and not easily cacheable. For a typical social network like Facebook it would not be very difficult, but Twitter is different: it is asymmetric – you can follow someone without requiring that person to follow you. This results in massive values of n1, n2, n3, … for popular users. How big? According to http://wefollow.com/top, each of the top 10 twitter users has more than a million followers, with Oprah Winfrey at 1.02M and Ashton Kutcher at 1.85M. Clearly not an easy number to cache. What is scarier is that Twitter is growing at a very fast pace – faster than anyone else according to http://blog.nielsen.com/nielsenwire/online_mobile/twitters-tweet-smell-of-success/, and the number of celebrities and businesses using Twitter is growing at a fast pace. According to http://twitterholic.com/, 5 of the top 10 users have joined in the last one year.

Lesson: when designing features, design with scalability in mind. If it is not scalable, avoid putting it in – you may need to kill it.

 

3) Communications:

This has been covered ad nauseam all over the web, so I won’t repeat it here. The moral of the story is that if you do have to kill a feature, the key thing is to be honest about the reasons and be clear in the message leaving no ambiguities. First Facebook learnt the lesson and now Twitter has. Hopefully, we would be able to avoid these mistakes.

One of the cornerstones of the .PW platform is the Wall – the real time aggregate of activities being done by an entity and its network. So while I am personally rather inactive on the various social networks (way too distracting), the recent announcement by FB on opening up their feed via activity streams led to some analysis and thought – here’s a summary.

 

The Premise

  • A user carries out several activities across multiple systems – posting items, joining forums, connecting to people, etc.
  • Each of these systems have their own way of capturing, storing and publishing this information.
  • Each system wants to puts various degrees of control on the usage of this information. Twitter makes it freely available, FB puts a wall-garden around it, others fall in between.
  • The user wants to be able to control who can see what information, and exercise his copyright on that information in terms of how it is consumed, used, persisted and further shared.

 

Challenges

  1. How does one capture the semantic richness of the various types of activities in a simple, precise way.
  2. How does one publish the stream in near real time without going into expensive polling. Why is this imp? Because on Jul 21, 2008 FriendFeed crawled Flickr ~2.7M times for a grand total of 6700 updates. HTTP was not made for Push, and Pull is a resource hog.
  3. How does a user continue to exercise his copyright on his content even as his feed becomes available in a machine readable way and published.

 

Solutions

  • Challenge #1 is being addressed thru the emerging activity streams standard. More on this in a minute.
  • For #2, activity streams uses Atom. So this can theoretically be layered over XMPP. (Any implementations, anyone?)
  • For #3, there are no straightforward solutions. Twitter makes everything public – as a user you can opt out. FB comes from the other side of the fence – the user opts in for sharing. Copyright enforcement in both cases is contractually enforced.

 

Activity Streams Standard

Take a look at the formats of the following public feeds:

1) http://api.flickr.com/services/feeds/photos_public.gne

<entry>  <title>death valley 07a</title>  <link rel="alternate" type="text/html" href="http://www.flickr.com/photos/willburn25/3488468568/"/>  <id>tag:flickr.com,2005:/photo/3488468568</id>  <published>2009-04-30T08:40:47Z</published>  <updated>2009-04-30T08:40:47Z</updated>  <dc:date.Taken>2009-04-30T02:40:47-08:00</dc:date.Taken>  <content type="html"> &lt;p&gt;&lt;a href=&quot;http://www.flickr.com/people/willburn25/&quot;&gt;Shackleton, Jules&lt;/a&gt; posted a photo:&lt;/p&gt;    &lt;p&gt;&lt;a href=&quot;http://www.flickr.com/photos/willburn25/3488468568/&quot; title=&quot;death valley 07a&quot;&gt;&lt;img src=&quot;http://farm4.static.flickr.com/3391/3488468568_b3eea508d4_m.jpg&quot; width=&quot;240&quot; height=&quot;92&quot; alt=&quot;death valley 07a&quot; /&gt;&lt;/a&gt;&lt;/p&gt; </content>  <author>    <name>Shackleton, Jules</name>    <uri>http://www.flickr.com/people/willburn25/</uri>  </author>  <link rel="enclosure" type="image/jpeg" href="http://farm4.static.flickr.com/3391/3488468568_b3eea508d4_m.jpg" />

</entry>

 

2) http://picasaweb.google.com/data/feed/api/all 


<entry>  <id>http://picasaweb.google.com/data/entry/api/user/isabellechedin/albumid/5284084403719516961/photoid/5330400059699013378</id>  <published>2009-04-30T08:34:36.000Z</published>  <updated>2009-04-30T08:34:36.000Z</updated>  <categoryscheme='http://schemas.google.com/g/2005#kind' term='http://schemas.google.com/photos/2007#photo'/>  <titletype='text'>DSC03419.JPG</title>  <summarytype='text'/>  <contenttype='image/jpeg' src='http://lh6.ggpht.com/_2pkj6pXKPwY/SflinNUUpwI/AAAAAAAABxs/gXHsFzjrylU/DSC03419.JPG'/>  <linkrel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://picasaweb.google.com/data/feed/api/user/isabellechedin/albumid/5284084403719516961/photoid/5330400059699013378'/>  <linkrel='alternate' type='text/html' href='http://picasaweb.google.com/isabellechedin/ExpatriationHongKong#5330400059699013378'/>  <linkrel='http://schemas.google.com/photos/2007#canonical' type='text/html' href='http://picasaweb.google.com/lh/photo/FV74oIogEEchKANsA9dfLg'/>  <linkrel='self' type='application/atom+xml' href='http://picasaweb.google.com/data/entry/api/user/isabellechedin/albumid/5284084403719516961/photoid/5330400059699013378'/>  <linkrel='http://schemas.google.com/photos/2007#report' type='text/html' href='http://picasaweb.google.com/lh/reportAbuse?uname=isabellechedin&amp;aid=5284084403719516961&amp;iid=5330400059699013378'/>  <author>    <name>isa</name>    <uri>http://picasaweb.google.com/isabellechedin</uri>    <email>isabellechedin</email>    <gphoto:user>isabellechedin</gphoto:user>    <gphoto:nickname>isa</gphoto:nickname>    <gphoto:thumbnail>http://lh5.ggpht.com/_2pkj6pXKPwY/AAAArGA7Css/AAAAAAAAAAA/lspX8vyux5o/s48-c/isabellechedin.jpg</gphoto:thumbnail>  </author>  <gphoto:id>5330400059699013378</gphoto:id>  <gphoto:albumid>5284084403719516961</gphoto:albumid>  <gphoto:access>public</gphoto:access>  <gphoto:width>1600</gphoto:width>  <gphoto:height>1200</gphoto:height>  <gphoto:timestamp>1240920712000</gphoto:timestamp>  <exif:tags>    <exif:fstop>3.5</exif:fstop>    <exif:make>SONY</exif:make>    <exif:model>DSC-T100</exif:model>    <exif:exposure>0.04</exif:exposure>    <exif:flash>false</exif:flash>    <exif:focallength>5.8</exif:focallength>    <exif:iso>400</exif:iso>    <exif:time>1240920712000</exif:time>  </exif:tags>  <media:group>    <media:contenturl='http://lh6.ggpht.com/_2pkj6pXKPwY/SflinNUUpwI/AAAAAAAABxs/gXHsFzjrylU/DSC03419.JPG' height='1200' width='1600' type='image/jpeg' medium='image'/>    <media:credit>isa</media:credit>    <media:descriptiontype='plain'/>    <media:thumbnailurl='http://lh6.ggpht.com/_2pkj6pXKPwY/SflinNUUpwI/AAAAAAAABxs/gXHsFzjrylU/s72/DSC03419.JPG' height='54' width='72'/>    <media:thumbnailurl='http://lh6.ggpht.com/_2pkj6pXKPwY/SflinNUUpwI/AAAAAAAABxs/gXHsFzjrylU/s144/DSC03419.JPG' height='108' width='144'/>    <media:thumbnailurl='http://lh6.ggpht.com/_2pkj6pXKPwY/SflinNUUpwI/AAAAAAAABxs/gXHsFzjrylU/s288/DSC03419.JPG' height='216' width='288'/>    <media:titletype='plain'>DSC03419.JPG</media:title>  </media:group>  <gphoto:albumtitle>expatriation à Hong-Kong</gphoto:albumtitle>  <gphoto:albumctitle>ExpatriationHongKong</gphoto:albumctitle>  <gphoto:albumdesc/>  <gphoto:location/>  <gphoto:snippet/>  <gphoto:snippettype>PHOTO_DESCRIPTION</gphoto:snippettype>  <gphoto:truncated>0</gphoto:truncated></entry>

While both of them are about public photos, the formats are vastly different since they come from two different providers. So a potential system that wants to aggregate photo updates across multiple providers needs to understand each vendor’s format separately. Efforts like Yahoo’s Media RSS extensions seek to address this.  (The picasa feed above uses the Yahoo extensions but the Flickr feed does not, which is strange considering it is a Yahoo property). Another example is Apple’s iTunes RSS extensions. However, these are RSS extensions, and what we really need in this space is an Atom extension. (why? see http://blog.unto.net/work/on-rss-and-atom/). An example is http://martin.atkins.me.uk/specs/atommedia.

However, this is only about photos which is just one of the activities. When we start considering the possible list of activities, the problem scope becomes extremely large. This is where the Activity Streams effort comes in. Consider the following activity: “Vineet posted new photographs to Agastya on Facebook 6 hours ago.” This can be broken down as:

  • Vineet = actor
  • posted = verb
  • new photographs = item / article
  • Agastya = social object (album here)
  • Facebook = site
  • 6 hours ago = timestamp

Other possible fields here can be:

  • User Agent / tool (example Thwirl)
  • Location
  • Mood
  • Title
  • Summary
  • Detail
  • Link
  • Verb collection
  • Actor collection
  • Object collection
  • Comments (points to another object)
  • The producer of the content may define the default viewing mechanism which can be used by a consumer

To capture the above, the following draft specs are currently in place.

  1. Atom Activity Extensions defines Actor, Verbs, Objects, Title, Summary, Detail, Link, etc.
  2. Atom Activity Base Schema defines the semantics of various Verbs and Objects, Location and Mood
  3. Atom Media Extensions defines how typical media – videos, photographs, etc. should be described. So in the above example, if we wanted to enclose the actual photographs of Agastya, we would need to describe the actual link, the preview thumbnail, image type, height, width, etc.

Note that this is work in progress and if you read the drafts, the gaps are fairly obvious. The big ones (I think) are:

  • As activities get republished, a downstream consumer may end up dealing with duplicates. This can be addressed by a combination of a URI + per item identifier
  • Some activities may be in response to other activities (classic case being video responses on YouTube) – the relationship would need to be captured
  • Instead of a single actor-verb-object, we may have collections of actors, verbs and objects. Examples being multiple actors on a single object (wiki editing) or single actor, multiple objects (uploading multiple pictures) or combinations.

 

Activity Streams Implementation

1. MySpace: http://wiki.developer.myspace.com/index.php?title=Standards_for_Activity_Streams is compliant to the current draft specs. Also, their documentation is simple to read, understand and use.

2. Facebook: http://wiki.developers.facebook.com/index.php/Using_Activity_Streams. For one, the stream only consists of user generated content and not app generated content. Second, they have a model where you need to prompt the user for permission to access the stream. This has two repercussions:

  • This requires the user to be logged on to Facebook
  • You can only show the user his own data. So if I import the FB feed on Friendfeed, the subscribers to my Friendfeed feed would not get to see the FB feed. This could have been possible had apps like FF could have persisted FB data, but FB TOS prohibit caching data for more than 24 hours.

Note that the Activity Stream standard is just one way of accessing the FB stream. They also provide a XML/HTTP and a FQL API. See Using the Open Stream API for details. The restrictions however stay in place.

So in summary, while the Activity Stream standard effort is quite exciting, the
current FB implementation is not. All one can use it for is building clients
which show the user her own data. The real value would be in letting apps
re-publish the data without violating the privacy needs of the user. As of now,
the wall garden is very much in place.

 

Further Reading

After publishing my previous post, I had thought that I would not be coming back to Crypto for a while. However, today evening Sebastiaan posted on SCRAM on one of the Directi mailing lists, and I got compelled to write down this one.

Authentication in Cryptography has two aspects: data authentication and entity authentication. Data authentication is addressed using HMACs and Digital Signatures (discussed in my previous post). What we are talking about here is entity authentication: how does one entity get to know that the other entity is actually what is claims to be. This in turn has two aspects:

  1. Alice and Bob are communicating in “real time” – the typical way of authentication here is by Alice issuing a challenge to which only Bob can respond.
  2. Alice and Bob are sending messages to each other which are delivered after an appreciable delay since a message may be stored and forwarded thru multiple devices. Here since Alice can’t wait for Bob to respond to the challenge, the interest is only in validity of origin of data.

Here we are talking about the first kind of authentication. The goals of an entity authentication / identification protocol are as follows:

  1. Bob should be able to successfully authenticate himself to Alice.
  2. A third party should not be able to impersonate Bob to authenticate itself as Bob to Alice
  3. Alice should not be able to use the data used in authenticating Bob to impersonate Bob to a third party.

 

Passwords

A password is a shared secret between two entities. By sending across the shared secret, one entity can prove to the other that it is indeed the possessor of the secret and hence the valid entity. However, sending the shared secret across as a mechanism for authentication is fairly weak. Here are some inherent weaknesses in the scheme:

  1. Passwords are easy to guess (or brute force) since they are unlikely to be very long or use a random sequence or be chosen from a large alphabet set.
  2. Passwords do not change frequently so brute forcing them becomes easier since there is a huge time window available.
  3. A password is a shared secret between the two entities. Any entity that is able to intercept the password being sent by Bob over the wire can now use the password to authenticate itself to Alice.
  4. In systems that store password in clear text, this is even worse, since privileged insiders or administrators now have access to the shared secret.

 

Some of these issues can be addressed to some extent thru the following:

  1. Enforce a password policy that requires a long length, and inclusion of numbers or special symbols. This makes a brute force / dictionary / exhaustive search attack harder, but not impossible.
  2. Make the user change the password frequently – while good in principle, in practice this does not help in a huge way since most password changing policies provide a lot of time (30-45 days) for a dedicated hacker.
  3. Encrypt the password, preferably using a one-way function so that even if the algorithm, key and the Ciphertext are all available to the privileged user, he cannot get to know the shared secret.
  4. Salting the one way function by using a randomly generated number as a key also makes a dictionary attack harder.

However, the biggest drawback of the fixed password approach is that it is open to a replay attack. So it cannot be used over an open network, but must require a trusted connection.

 

Challenge Response Identification

The idea of a Challenge  Response scheme is simple: Both Alice and Bob have a shared secret. To authenticate Bob, Alice makes him a challenge. To respond to the challenge, the shared secret is required which only Bob has. Thus by sending the response, Bob proves that he is the possessor of the shared secret without ever sending across the shared secret. There are multiple ways of doing challenge response:

 

Symmetric key Challenge Response using Encryption

Here the assumption is that Alice and Bob exchange the shared secret a priori. Let’s  suppose the shared key is K. Then:

  1. Alice generates a random number n1, and sends it to Bob.
  2. Bob applies the encryption algorithm to generate n2 = EA(n1, K) and sends it to Alice
  3. Alice apples the decryption algorithm to compute n3 = DA(n2, K)
  4. If n1 = n3, Bob indeed owns the shared secret and is who he claims to be.

The EA is typically a function like modular exponentiation which is easy to compute, but hard to reverse. Also, data integrity must be ensured in all transmission by using a HMAC.

 

Symmetric key Challenge Response using One-Way functions

In case the usage of an Encryption / Decryption routine is infeasible (because of computing cost maybe), it is possible to achieve the same using one-way functions as well:

  1. Alice generates a random number n1, and sends it to Bob.
  2. Bob applies the one-way function H to generate n2 = H(n1, K) and sends it to Alice
  3. Alice also applies the one-way function H to generate n3 = H(n1, K)
  4. If n1 = n3, Bob indeed owns the shared secret and is who he claims to be.

 

Asymmetric Challenge Response

  1. Alice generates a random number n1 and sends it to Bob
  2. Bob encrypts the random number n1 using his private key to generate n2 and sends it over to Alice
  3. Alice decrypts n2 using Bob’s public key and obtains n3
  4. If n1 = n3, Bob indeed owns the public key under his name

 

Zero Knowledge Protocols

Since in all the cases above, Bob responds with some response based on the secret key, there is a theoretical risk of information leakage. Zero knowledge protocols seek to address this by enabling a claimant to demonstrate knowledge of the secret without revealing any information whatsoever.  The basic idea is that the claimant must answer two questions: one involving demonstrating the knowledge of the secret, and the second is a simple question to prevent cheating. The series of steps is repeated several times (typically 40) to reduce the probability of cheating, and only if all answers are correct is the entity authenticated. The math involved in doing this is detailed but straight forward, and it is also easy to understand how the protocol works without revealing any information. You can read about it any standard cryptography textbook.

 

SASL Mechanisms

With that backgrounder, let’s come back to the issue of SCRAM-MD5 where we started. Basically SASL is a framework for entity and data authentication supporting multiple mechanisms including anonymous, plain-text, one-time-password, CRAM, Digest and now SCRAM. Let’s quickly examine these (please do not take the descriptions below as accurate – they are greatly simplified):

  1. Anonymous: As the name suggests, the purpose of the mechanism here is for an entity to be able to use the service without proving its identity. The claimant basically sends a single string indicating that it wants to access the service anonymously. The server if it has that capability (which typically is checked by the client before sending the anonymous auth request) allows the entity to use the service anonymously.
  2. Pla
    in
    -text: The plain-text mechanism is expected to be used over a secure pipe provided by a lower layer which takes care of confidentiality and integrity. Basically, the client sends over the username and password in plain text over TLS. The problem here is that in the absence of a secure lower layer, the entity can be easily compromised. Also, there is no mutual authentication.
  3. OTP: One time passwords are a simple concept:
    • The user provides the password.
    • The client receives a non-secret seed from the server and appends the password with the seed
    • The result is passed thru a hash function (typically MD5, SHA1 and MD4 also allowed) and then reduced to 64-bits. This result is called S
    • This secret S is now run thru the hash function N-times, then N-1 times, N-2 times, and so on to produce multiple one time passwords.
    • Every time authentication needs to be done, the OTP along with the sequence number and the hashing function used are send over to the server
    • The adversary cannot invert the hash function and hence cannot recover the original password
    • The server applies the same hash function the same number of times and if it gets the same data, authenticates the user.
    • Good: makes replay attack difficult
    • Bad: Server stores password, no mutual auth. Open to dictionary attack. Hard to maintain sequence in the event of communication failure. Open to a pre-play attack.
  4. CRAM:
    • Server sends a challenge string
    • Client responds with username followed by a message digest created by hashing the challenge string using the password as a key
    • Server performs the same computation on its side, and if it obtains the same message digest, authenticates the user
    • Good: Not open to replay. No disclosure of password
    • Bad: No mutual auth. Open to a dictionary attack if a CRAM exchange is captured. Server needs to store passwords (hopefully encrypted).
  5. Digest:
    • Server stores one way hash of username and password H1 = Hash(username, password)
    • On request of a resource, server calculates hash of resource and the method required to access it H2 = Hash(resourceURI, method)
    • Server sends a nonce to client
    • Client calculates H1 = Hash(username, password), H2 = (resource URI, method)
    • Client generates Response = Hash(H1, nonce, H2) and sends it back to the server
    • Server does the same calculation at its end and if both are identical, the user is authenticated.
    • Good: Server does not store password, but a one-way hash
    • Bad: Usage of MD5 as a hashing function. Many security settings are optional. Open to a man in the middle attack.
  6. SCRAM: It is tedious to describe the protocol in brief since it involves multiple repeated computations. Best to read the draft. The key improvements over Digest and CRAM, etc. are:
    • Server stores salted one way hash of password
    • Mutual authentication
    • Channel bindings
    • Can be used as a GSS API mechanism as well. GSS API is like SASL – a generic security framework.
    • Server cannot impersonate the client to other servers
    • Permits the use of a server-authorized proxy without requiring the proxy to have super-user rights with the backend server

I am mentioning the last two points in good faith. Have not studied the protocol deeply enough to arrive at how these two conclusions are valid.

Apr 202009

Most developers whom I have come across, lack a solid grasp of the fundamentals of cryptography.  When a developer who does not understand crypto needs to use crypto, several things can go wrong:

•    Not understanding the implications of using some crypto technology in the code
•    Not realizing where to use crypto
•    Not implementing crypto correctly and hoping that the implementation is correct
•    Not implementing crypto correctly, but feeling secure because “we have used cryptography”
•    Not using crypto at all

The unfortunate part to this whole situation is that Cryptography is not hard to understand and most of the perceived complexity is in areas where an application developer would typically never venture: designing an algorithm or a protocol, cryptanalysis, etc.  In most cases, one can use one of the modern crypto libraries and go by with a minimal level of understanding of the principles involved. The aim of this post is to provide that minimum background required to start using crypto effectively in day to day work.

 

1. Introduction
While Cryptography can serve in multiple ways, its most common usage – which also illustrates most of its principles – is secure communication. To understand the notion of secure communication, let’s bring in our cast:

This story is about two nice, genuine people – Alice and Bob – who are in separate locations and want to have a conversation. The third character in our story is Eve who does not like Alice and Bob and is their adversary. She wants to intercept this conversation, find out what Alice and Bob are talking about, and if possible, disrupt the conversation by modifying the messages between them or by impersonating Alice or Bob.

Now Alice and Bob are aware of Eve and to prevent her from accomplishing her purposes, they set up a pipe which goes from Alice to Bob. Whatever message Alice drops in the pipe at her end would be received by Bob at the other end and vice versa. The pipe is opaque and indestructible so Eve cannot look at the messages nor modify them.

Such a pipe or channel then has the following properties:

•    Confidentiality: Contents of messages passing between sender and receiver cannot be seen by the adversary since the pipe is opaque.
•    Integrity: Contents of messages cannot be modified since the pipe cannot be penetrated
•    Authentication: The pipe, being indestructible, does not allow anyone else except Alice and Bob to send messages. So Eve cannot masquerade as Bob.
•    Non-Repudiation: Since only Alice and Bob can send messages on the pipe, when Bob receives a message on the pipe, Alice cannot deny that she had not sent that message.

The purpose of cryptography is to establish this kind of a secure channel over a public network like the Internet.

 

2. Confidentiality using a Shared Secret
Traditionally the problem of making communication confidential over an insecure transport has been solved by Private-Key Encryption [also called Symmetric Encryption]. The idea is simple: Before exchanging messages, Alice and Bob meet in person and decide on the following:

  • An Encryption Algorithm (EA)
  • A Decryption Algorithm (DA)
  • A Shared Secret – a key (K)

[Note: Alice and Bob do not care if Eve gets to know EA or DA, as long as K remains a secret. This is a very important point in modern-day cryptography: we assume that the Adversary has the knowledge of how the algorithm works. This is called Kerckhoffs’ Principle. Mechanisms that are based on keeping the working of an algorithm secret are obscure, not secure. Such mechanisms are of historical interest only and are inadequate for today’s world.

Some people compare this with government agencies (like NSA in the US) not publishing their mechanisms. While this is true, such non-disclosure does not imply that the algorithms used by agencies like NSA are designed on the basis of algorithm obscurity. Security agencies in governments (or large IT companies creating security technology) hire top-notch cryptographers who are well aware of the Kirchhoff's Principle and design with the assumption in mind. Obscurity then is either an extra defense (however weak that maybe), or a way of protecting intellectual property rights on the mechanism. Bottom-line: modern-cryptography assumes that the attacker has knowledge of the working of the algorithm and only the key is secret.]

Now to send messages securely, Alice takes the message (called Plaintext in this context) and generates an encrypted message out of it (called Ciphertext) by using the Encryption Algorithm EA and the Secret Key K. Thus:

Ciphertext = EA(Plaintext, K)

This Ciphertext is now sent to Bob over the public network. On receiving the Ciphertext, Bob uses the Decryption Algorithm (DA) and the Secret Key (K) to generate the Plaintext back. Thus:

Plaintext = DA(Ciphertext, K)

Eve does not have the Secret Key (K) and is therefore not able to obtain the Plaintext. We have achieved confidentiality by sharing a secret.

 

2.1 Substitution Ciphers
The simplest way of understanding how private-key encryption works in practice is to consider an elementary algorithm – The Substitution Cipher. A Substitution Cipher works as follows: There is a message M that needs to be sent. To encrypt the message, characters in blocks of n characters are taken, and substituted by a different set of characters picked up from the Alphabet, differing from the original set by a function of K.

The idea is not new. It’s most famous usage was by none other than Julius Caesar who replaced each character in the message by a character three characters ahead (shifting by 3) in the Latin Alphabet. This is called the Caesar-Cipher. Given that most of his enemies were not even literate, let alone thinking of breaking ciphers, it would have been fairly secure. Even as late as 1915, the Russian army was using the Caesar cipher since their troops could not learn to use more complicated ones. A trivial modern day implementation is called ROT13 (Rotation 13 – rotation by 13 positions). Since the English alphabet is 26 characters, ROT13 is its own inverse, i.e., the decryption and encryption algorithms are the same rule because of the choice of the key (13). ROT13 is used by people asking questions / puzzles, etc, on the Usenet. The standard Usenet client has ROT13 built-in.

What can Eve – the adversary – do about breaking a substitution cipher being used by Alice and Bob? Cryptanalysis – analysis of a Cryptographic Scheme with the goal of breaking it – of a substitution Cipher is rather straightforward, and by analyzing a few instances of Ciphertext produced by a substitution cipher, a cryptanalyst can easily break the cipher.

One of the simplest techniques that Eve can use for this purpose is Frequency Analysis. This is based on the fact that in any language some letters are used more than others. (In English the approximate sequence is ETAOIN SHRDLU for the top 12 characters). So by analyzing the frequency of the alphabets appearing in the Ciphertext, the Plaintext can be figured out. The earliest use of this technique was by the Arabs in the 9th century who realized that certain characters appear more frequently in the Koran than others. Several of the ciphers used in World War II were also broken using frequency analysis.

 

2.2 Transposition Ciphers
We know that substitution ciphers are not strong enough. So what are our other choices? There is another technique called the Transposition Cipher, where the characters are permutated, i.e., positions of characters are interchanged. Transposition Ciphers have also been known for a very long time and one can find a transposition ci
ph
er every day in a newspaper. We know it by the name “Anagram.” An anagram is a very simple form of a transposition cipher, and has more recreational value than cryptographic. Nonetheless, anagrams have been used by linguists and intellectuals throughout history for amusement or for hiding messages, and provided a powerful plot device in Dan Brown’s The Da Vinci Code.

Anagrams aside, the earliest example of a Transposition Cipher being used for a security purpose is a Scytale – Greek for a baton – a device used by the ancient Greeks to communicate messages during a military campaign. The working was really simple: a strip of paper was wound around a cylinder of a known diameter and a message written on the paper. The paper would then be peeled off and sent thru a messenger. The recipient would have a similar cylinder on which it would wind the paper again to read the message. Unless the recipient’s cylinder was of the same diameter, the letters would not come together properly and the message would not make sense. However, manual analysis of the letters is straightforward – letters after a specific distance need to be put together to make sense out of the message, and the strip of paper is a big giveaway.

 

2.3 Product Ciphers
Transposition Ciphers or Substitution Ciphers are both open to fairly simple Cryptanalysis, but a combination of a Transposition operation with a Substitution operation leads to a much stronger cipher. For example, a Product Cipher, alternates transposition and substitution, and carries out this core function over several iterations. The ciphers used in World War II (ENIGMA by the Germans, PURPLE by the Japanese) were Product Ciphers. Modern-day Product Ciphers include DES as well as AES (a.k.a. Rijndael). DES uses a 64-bit key and alternates 16 substitutions with 15 transpositions. AES uses a 128-bit key and alternates 10 substitutions with 10 transpositions. AES is stronger.

 

2.4 Modes of Operation
Depending on how a Symmetric encryption algorithm deals with data, we can define two distinct categories for symmetric algorithms: Block Ciphers and Stream Ciphers. In a Block Cipher, data is taken in blocks (typically 64-bit or 128-bit block) and the algorithm encrypts / decrypts the block as a whole. A Stream Cipher operates on a stream of data and encrypts / decrypts the data one-bit at a time.

 

2.4.1 Stream Ciphers
A Stream Cipher uses the bits in the secret key to perform some operation with each bit in a stream to generate Ciphertext. For example, a simple operation would be to XOR each bit in a key with each bit in the Plaintext stream. However, it is highly unlikely that the key would be as long as the Plaintext (see “One Time Pad” in section 3.1 for a discussion), so what typically happens is that the Secret Key is used to generate a pseudo-random bit-stream called keystream, and the keystream is then used to perform some operation with the Plaintext stream.

The keystream is generated using some internal state in the algorithm. For higher security, this internal state needs to be updated continuously. Based on how this state change takes place, Stream Ciphers can be classified as:

1.    Synchronous Stream Ciphers: The internal state of the cipher changes independent of Plaintext or Ciphertext.

2.    Self-Synchronizing Stream Ciphers: The internal state changes based on the previous Ciphertext digits.

Since the keystream itself is not really random, the bit sequence in the stream would repeat itself after a period. The longer the period, the more secure the algorithm.

 

2.4.2 Block Ciphers
A block cipher operates on a pre-defined block-size. Typically the block-size is 64-bits or 128-bits. So for a block cipher, the biggest question is how to deal with messages longer than the block-size. There are various modes of operation which a Block Cipher can employ. The main ones are:

1.    Electronic Codebook (ECB): This is the simplest mode. The message is split into parts and each block is encrypted / decrypted separately. This is insecure and not recommended since it is subject to two simple attacks:
•    Stereotyped Beginnings and Endings: Messages have typical headers and footers and that knowledge can compromise the key.
•    Replay Attack: The adversary does not know the Plaintext or the key, but she simply intercepts the Ciphertext and forwards it to the receiver. An example where this can hurt is authentication.

2.    Cipher Block Chaining (CBC):  In this mode, each block is XORed with the previous block before encryption. It is very commonly used. A variation on CBC is called Propagating Cipher Block Chaining (PCBC). This mode is used in Kerberos.

3.    Cipher Feedback (CFB), Output Feedback (OFB), Counter (CTR): All of these modes convert a block-cipher into a stream-cipher by taking the output from the previous block. CFB converts a CBC into a self-synchronizing stream cipher. OFB and CTR makes a block cipher a synchronous stream cipher.

All modes except ECB use output from the previous block to operate on the next block. This leads us to the question: How is the first block processed? To solve this problem, a dummy block, called an Initialization Vector (IV) is used. The IV need not be secret, but the same IV should never be used with the same key.

 

3. The Chimera of Perfect Secrecy
I mentioned that Product Ciphers are stronger as compared to Transposition and Substitution Ciphers. But on what basis does one make a statement like that? Is there a way of figuring out how secure an encryption is, or how difficult it is to break it? Till recently, most cryptography was ad-hoc without any solid theory behind it. This changed in 1948 when Claude Shannon – the Father of Information Theory and Modern-Day Cryptography – published A Mathematical Theory of Communication (I tried reading it and lost my way by the 3rd page, so this really is the interpretation of other people – unlikely to be wrong since a lot of very learned people say the same thing). What he showed was that a secure encryption system could exist only if the size of the Secret Key was as large as the total size of the messages that would ever be exchanged using that secret key.

3.1 One Time Pad
An example that illustrates Shannon’s principle is the One Time Pad (OTP). The idea is this: suppose Alice has Plaintext of length N characters. She now generates a key randomly, also of length N and uses it to generate the Ciphertext. The key is used only once, and for the next message, a new key is generated, again equal to the length of the message. This is mathematically proven to be unbreakable. (Since the key can be used only once, it is called “One Time”, and “Pad” because the key used to be on a slip of paper torn from a Pad on which random text was printed).

The problem with OTPs is key distribution – it is logistically tough and prone to security breaches. So while OTPs in themselves are perfectly secure, their implementations are vulnerable because of the insecurity involved in key distribution. OTPs are typically never used, except in highly critical applications where the parties involved are willing to undertake the key exchange challenge.

The reason I brought up OTPs is to point out that perfect security – that can defeat unlimited computing power – is not always practical. Modern-day Cryptography does not chase this chimera. Its goal is not to provide security in the face of infinite computing capacity. Instead, it assumes that the adversary has (reasonably) limited resources. What makes modern-day Crypto work is the gap between the computational ability it takes the legitimate

user to encrypt vs. the computational infeasibility of decryption for the adversary who does not have the secret.

3.2 Computational Complexity Theory
Whether or not a computation is feasible can be found out if we know the cost involved in making the computation. The discipline of computer science which deals with cost of computation is Computational Complexity Theory. The theory measures the cost of making a computation in terms of the space (quantity of information storage) and time it takes to make the computation. Here it is important to note that when we think of computation, we are not thinking about a specific computation model like the Von Neumann Architecture on which modern PCs are based, or a Turing Machine, but any arbitrary model.

3.2.1 Algorithm Efficiency
Before we get deeper into the theory, a quick word about algorithm efficiency: An algorithm’s efficiency is defined as the amount of the time it takes the algorithm to compute the problem, as a function of the problem size. (The problem size measures the size of the input to the program. For different problems, problem sizes are differently defined, and getting it right takes some familiarity with complexity theory.) So for example, the time taken could be:

T(n) = n3 + n2 + 1

Where n = problem size.

Now as n gets arbitrarily large, the value of (n2 + 1) becomes insignificant as compared to n3. So T(n) = n3 for very large n. Such an algorithm then has complexity of the “order of n cube”, and the algorithm is called a O(n3) algorithm.

An algorithm is considered to run in polynomial time if T(n) = O(nk) where k is a constant. Example include

•    Constant time: O(1) [k = 0].
•    Linear time: O(n) [k = 1]
•    Quadratic time: O(n2) [k = 2]
•    Cubic time: O(n3) [k = 3]

An algorithm running in polynomial time would be more efficient than an algorithm running in exponential time: T(n) = O(kn)

 

3.2.2 Complexity Classes
In general, complexity theory looks at the universal set of problems in terms of what are called Complexity Classes. A complexity class is a set of problems that are of related complexity, as in, the algorithm efficiency for solving that set of problems can be expressed by T(n) = O(f(n)).  As can be expected, there are a lot of complexity classes, but the ones of biggest interest to us are called P and NP classes.

 

3.2.3 Class P Complexity
A problem that can be solved in polynomial time is said to have Class P Complexity. A number of problems fall under Class P complexity:

  • Finding the greatest common divisor
  • Determining if a number is Prime
  • Linear Programming
  • Shortest Path
  • Sorting
  • Modular Exponentiation MOD(x, y, z) = xy  MOD z

From the perspective of Complexity theory any algorithm that runs in polynomial time is considered as being efficient (or feasible), and any algorithm that does not run in polynomial time as being intractable or infeasible. So any computation in the class P-Complexity is considered efficient. However, this is only in theory, it may not be so in real life. For example if the value of k is 10000, T(n) = O(n10000) is certainly a very inefficient algorithm.

 

3.2.4 Class NP Complexity
Put informally, problems of class NP complexity can be defined as ones having a verifier function that can be solved in polynomial time. To understand this better, consider the problem of finding out whether a number is composite or not. Doing this for large numbers is really hard, with T(n) = O(kLog n) for the most efficient algorithms. However, once we have a candidate factor of n (say x), finding if x is actually a factor or not is trivial:

bool IsFactor(int n, int x)
{
    if (n % x != 0 || x == 1 || x == n) return false; else return true;
}

Since the verifier function (IsFactor) is a class P algorithm, the problem of whether n is composite or not is a NP problem. All NP problems have a verifier function which accepts an input and a verification value, which can be computed in polynomial time. A number of problems are Class NP Complex, the most common being the Factoring problem and the traveling salesman problem.

The hardest problems in Class NP complexity are said to be NP Complete. An example is the problem of multi-processor scheduling: Find the minimum time required to schedule n jobs (of varying lengths) on m processors such that no two jobs overlap.

 

3.2.5 Is P = NP?
Consider the case where a P class problem also has a P class verifier. Would this problem be a class NP problem or a class P problem? The simple answer is P is a subset of NP. So every P problem is a NP problem, but the reverse may not be true. However, some researchers believe that any problem whose verifier can be solved in polynomial time, can itself also be solved in polynomial time. In short, P = NP.

Whether P is equal to NP is one of the most important unsolved problems in modern mathematics and if it can be proven that all class NP problems are actually class P problems, the implications would be stunning, with a huge impact on our society (for starters, most of modern-day Cryptography would be rendered useless)

 

3.3 Information Leakage
While the computation of the Plaintext from the Ciphertext may be infeasible (NP Complete), what if it is feasible (P Complex) to compute part of the Plaintext? For example, what if the length of the Plaintext can be known? Does that compromise the security of the message? Assume a situation where Eve expects Bob to ask Alice whether she would marry him. The answer which Alice would return is a simple yes / no, represented by a single bit. Now if the Eve is looking at the Ciphertext going along the wire, and discovers a single unit moving across, she may deduce that Alice has replied to Bob’s question.  What now is the probability of Eve finding out the correct answer? 50%. Much higher than what a secure communication may require.

So from a Cryptography standpoint, when we look at computational feasibility, we need to look at it not just from the perspective of the entire data becoming available, but also how feasible it would be to recover part of the information.

 

4. Cryptographic Primitives
Modern-day cryptography – based on computational complexity – is built using certain Cryptographic Primitives. What is a primitive? A primitive is an atomic operation (from the perspective of the layer above it), a building block, which does a certain defined task. Cryptography requires primitives that have certain computational hardness. All computational complexity in Cryptography comes from these underlying primitives. Understanding the primitives is the key to understanding Cryptography. While you would not have to typically deal with how Protocols are constructed, a basic understanding is important to get the fundamentals right.

 

4.1 One Way Functions
The simplest of the primitives is a One-Way Function. A one-way function is any function that is easy to compute but hard to invert. A mathematical function that is hard to invert is multiplication: Given integers P and Q, N = P x Q can be easily calculated, but given a random N, how would one factor it into P and Q? Sure, we can produce all the factors. But getting P and Q specifically is probabilistic. The higher the value of N, the lower is the probability of getting P and Q back quickly and consistently. This is called the factoring problem, and it is a good example of a functi

on that is efficient to compute, but infeasible to invert.

Today we have no rigorous mathematical proof that what we consider one-way functions, are indeed one-way. What that means, is that today we are not sure whether there exist inversion methods that have lower computational complexity than the ones we are aware of. Just because a lower-complexity inversion has not yet been discovered, does not mean that they cannot exist. So cryptography is built on assumptions. So far the assumptions have been valid.

 

4.2 Hash Functions
One of the most important one-way functions is a Hash Function. A Hash function takes an arbitrary number of bits as input and produces a smaller number of bits based on the input. The algorithm used is deterministic (given the same input, it would always pass thru the same steps and produce the same output), so for the same input bits, it would always generate the same output bits. In context of hash functions, the input is often called a message and the output is called a message digest or a digital fingerprint. So:

Digital Fingerprint / Message Digest = HashFunction(Message)

A hash function is obviously one-way. That’s because a hash function is lossy – the number of output bits is less than the number of input bits so there is not enough information in the output to calculate the input. However, this alone would not make hash functions useful. What makes hash functions useful is that they are collision resistant and create an Avalanche effect.

A collision is a situation where the hash function generates the same output for two different inputs. A good hash function is collision resistant. Once again, if you consider the fact that the number of output bits is less than the number of input bits, it is obvious that there is no way of avoiding a collision. A simple way of approaching this is to think about the input as a 64-bit number and the output as a 16-bit number. Now there are 18,000 trillion 64-bit numbers (264 = 18,446,744,073,709,551,616) 64-bit numbers and 65,536 16-bit numbers. As we keep reducing each number in the 18,000 trillion-range to 65,536 numbers, there are bound to be collisions.

So a hash function cannot be collision resistant. However, we can reduce the probability of collision if we choose a large enough output range and a good algorithm. For cryptographic purposes, we define collision resistance as: It should be computationally hard to find two inputs m1 and m2 so that their output h = HashingFunction(m1) = HashingFunction (m2). The other important property for a cryptographically secure hashing algorithm is that it should be Pre-Image Resistant: Given an output h, it should be computationally hard to find an input m so that h = HashingFunction(m).

The other desirable property of a good hash function is what is called the Avalanche Effect. The idea is that if the input changes even by a small amount, the output changes dramatically. The Strict Avalanche Criterion states that if one bit is flipped, each of the output bits should change with a probability of 0.5, in other words, half of the output bits should flip.

 

4.3 Random Numbers
The other important family of Cryptographic primitives is Random numbers and their generators. Recall Kerckhoffs’ Principle – the only secret is the key, not the algorithm. Now, we may keep the key a secret, but what if someone guesses the key? So we want to make the key harder to guess. A “guess” can even be an automated attack where the program tries all possible permutations. (This is called a brute-force attack) To reduce the possibility of such an attack from being successful, there are two things that are needed:

1. The key should be long. The longer the key, the stronger the encryption
2. The key should be unpredictable. The more random a key, lesser are the chances of it being predictable.

Point one is easy to understand, but point two is a little ambiguous. What is “more random?” Without getting into a philosophical discussion on randomness, I would just like to say that a random number is one which is generated in such a way that its probability of being generated is as high / low as any other number, and there is no way of determining when the number would be generated.

Generation of random numbers depends on the initial input on which a mathematical calculation would be performed. But unless that initial condition has randomness in itself, the output would also not be in random. This initial randomness is called an Entropy Source. Based on the entropy source, Random Number Generators (RNGs) can be classified as:

1. RNGs based on Deterministic Random-bit Generators (DRBGs): DRBG means that the source of entropy is deterministic. Modern computers are all deterministic (as compared to quantum computers which are non-deterministic). If the entropy source is deterministic, the numbers generated cannot really be random. This means that a software-based generator does not generate truly random numbers. Hence we call the numbers generated by such a generator Pseudo-Random Numbers, and the generator itself is called a Pseudo-Random Number Generator (PRNG). All software based RNGs are PRNGs.

2. RNGs based on Non-Deterministic Random-bit Generators (NDRBGs): NDRBG means that the source of entropy is non-deterministic. Such RNGs are typically hardware based and use a physical phenomenon as a source of entropy. Examples of physical phenomena used by NDRBGs include photoelectric effect, thermal noise, or some other quantum phenomena. Quantum phenomena in theory are unpredictable and therefore a reliable source of entropy. We call the numbers generated by such a generator Truly-Random Numbers, and the generator itself is called a True-Random Number Generator (TRNG). TRNGs are rather expensive and are needed only for very high security scenarios.

How does one find out if a PRNG is good enough to be used in Cryptography? A random number generator would typically generate numbers in a given range. This leads to two important properties:

• The distribution of values of the numbers generated in the range. The more even the distribution, the better.
• Ease of prediction of numbers based on a given set of numbers from the generator. The more difficult the prediction, the better is the PRNG.

In order to have these properties, a PRNG needs two things:

1. A random bit stream from a source of entropy. There is some entropy even in a deterministic computer. Examples include Clock Tick count, and values of System Handles (like Thread handle, Process handle, Network Card Id, etc.)
2. A mathematical calculation that removes the bias from a bit-stream and makes it more randomized. Examples include hashing algorithms like MD4, SHA, etc.

For the second point, there is a standard by a body called FIPS (Federal Information Processing Standard) which can be found in FIPS 186-2.

A function catering to the above two requirements is called a Cryptographically Secure Pseudo Random Generator (CRNG). Typical random number generators like the C++ rand() are not CRNG and they should not be used in Cryptography.

 

5. Confidentiality without using a Shared Secret
Coming back to the problem of message exchange between Alice and Bob, the situation so far is as follows:

  • Prior to exchanging messages, Alice and Bob agree upon a large random key. The longer and more random the key, the more secure the encryption. The key is private to Alice and Bob – it is a shared secret between them and not disclosed to a third party.
  • The sender then uses a sophisticated encryption scheme (like a Product Cipher) to encrypt the message. The strength of the algorithm being determined by how efficient the algorithm is
    for the sender / receiver to execute, but computationally hard for Eve (the Adversary) to invert. The bigger this difference, the better the algorithm.
  • To decrypt the message, the recipient uses the secret key to reverse the computation carried out to encrypt the message. The shared Secret makes the reversal of the computation efficient for the recipient and gives him the computational advantage over the Adversary.

The only problem in this situation is the initial condition – the sharing of the secret key. Since the security of the scheme derives from the key (recall Kerckhoffs’ Principle), its sharing in a secure way is a pre-requisite. The question is how does one share the secret in a secure way without meeting in person? There has to be some way of transferring the key in a secure way too. But wouldn’t secure transmission of the key also involve some sort of encryption? And in that case, wouldn’t we need a key once more? The problem seems to be recursive. But we do know that the problem has been solved, and the key to solving the problem (pun unintended), lies in the symmetry of the situation.

 

5.1 Diffie-Hellman Key Exchange
The first time a solution was published for solving the problem of key sharing was in 1976 when Whitfield Diffie and Martin Hellman proposed The Diffie-Hellman Key Exchange:

  • Alice and Bob agree to use a prime number p (say 11) and a ‘generator’ g (say 5). Now these numbers are well established and everyone (including Eve, the Adversary) is aware of it.
  • Hereafter, the following actions take place on Alice’s and Bob’s sides (MOD is the modulus / remainder operation):
AliceBob
Alice chooses an integer ‘a’ at random. Say 2.Bob chooses an integer ‘b’ at random. Say 3.
Alice computes (ga MOD p) and sends it over to Bob. (52 MOD 11 = 3) is sent over.Bob computes (gb MOD p) and sends it over to Alice. (53 MOD 11 = 4) is sent over.
Alice now computes (gb MOD p) a MOD p. So Alice computes 42 MOD 11 = 5.Bob now computes (ga MOD p) b MOD p. So Bob computes 33 MOD 11 = 5.
Alice obtains 5Bob obtains 5
Shared Secret = 5Shared Secret = 5

I have never been a UI guy, my focus being mostly on databases and middleware. Sure I know HTML / CSS / XAML, but my knowledge here is far less than what I know about the other tiers. In any case this is from a development point of view – whatever little I do know of these languages does not help me distinguish a good UI from a bad UI, leave alone design a good UI. However, users ultimately buy experience, and as someone who owns a few products at Directi it is my job to own that experience. Now I doubt I can teach myself aesthetic appreciation, but usability is a little bit more objective. So of late I have been investing some time in understanding the core principles behind usability. This series of posts summarizes the key things that I learnt and is meant as notes to myself. Hopefully, this material would be useful to others as well.

Basic Usability Design

The first thing any programmer who is looking to improve his UI skills must read is User Interface Design for Programmers – classic Joel Spolsky from his early days. The key points are simple enough:

  1. Invent some users
  2. Figure out the important activities
  3. Figure out the user model — how the user will expect to accomplish those activities
  4. Sketch out the first draft of the design
  5. Iterate over your design again and again, making it easier and easier until it’s well within the capabilities of your imaginary users
  6. Watch real humans trying to use your software. Note the areas where people have trouble, which probably demonstrate areas where the program model isn’t matching the user model.

However, what is hard is implementing these steps correctly, and Joel gives several examples on correct and incorrect decision and typical mistakes. As I said, a must read.

 

Research Findings 

This content comes from a document that my friend Ninad Rawal – the Usability Lead at Directi – shared with me. He is also responsible for introducing me to the Web Form Design book which is the topic of the second post.

 

1) Memory

a) Initial short-term memory is fragile – you cannot remember a lot. (So avoid having your user remember stuff across screens)

b) More you process info, more you remember it. So for example if your screen refreshes to display an update, keep that update in the same spot because your mind remember the update being there at that spot. Example is the clock in Windows Taskbar in the bottom right corner. Consistency is key since it increases recall.

c) Interfering with info while it is being processed affects how much you remember. I think it is for this reason that I prefer the static notification icon in Facebook to the popup toasts common in IM apps.

d) How much you are able to remember can be assisted by chunking information. For example, phone numbers. Or updates on a micro-blog being grouped by date. Ideal chunking is 3-4 chunks with 3-4 items / chunk.

e) Mostly you do not remember events, you remember your interpretation of events. You typically remember parts of the event and fill in the gaps by parts of other memories or what you believe is probable. Memory is not perfect and not reliable. Avoid having people to remember things. (I am thinking passwords here!)

 

2) Visual Recall

a) There is a common perspective associated with each object we see in daily life. In engineering terms, it happens to be the isometric perspective, followed by the front-perspective, followed by the top-perspective. This is why icons should be isometric (3d like) and not flat. For 2d objects like files and messages, a 2d icon is alright.

b)  Humans are great at shape recognition (reading is a form of shape recognition). The sides of a shape are more important than the middle of a shape for recall. The mind can extrapolate and create the middle parts if the edges are present but not the other way round.

c) People tend to scan horizontally rather than diagonally. While looking for something specific, we start at top-left and then go to center, avoiding edges. So put most imp info in the center and assume a horizontal scan. Don’t put imp info at the edges.

d) Clutter creates confusion. This is because if two things are present together, the mind processes them together, even if they are unrelated. So put less space between related items and increase space between unrelated items.

e) In printed text, it is easier to read mixed upper case and lower case. For isolated words, all caps is better.

 

3) Decision making

a) First impressions are important, especially if they are consistently reinforced. It is hard to change once someone has made up her mind even sub-consciously.

b) Humans like to have more info. So if your user needs to use x amount of data, you may need to provide 4x.

c) Abstract info is hard to grasp – real-life examples, especially actual events (case-studies, anecdotes) create a bigger impact

 

4) Text Reading

a) Length of Line: (source:  http://www.ingentaconnect.com/content/tandf/tbit/2004/00000023/00000006/art00002)

People prefer reading text with a smaller line length because of less eye movement, even though they are able to read faster with a longer line length of around 95 chars.

 

b) Number of Columns and Justification: (source: http://psychology.wichita.edu/surl/usabilitynews/72/columns.asp)

- Fast readers get best reading speed and efficiency in two column fully-justified text

- Slow readers get best reading speed and efficiency in single column left justified text

- If you are not sure, choose single column left justified text

 

c) Serif vs. Sans-serif Typeface: (source: http://www.alexpoole.info/academic/literaturereview.html)

- They are equally legible – could be a matter of preference, not of performance

- Choose a standard typeface – TNR, Arial, Verdana …

 

d) Visual Factors for GUI and Web: (source: http://www.ingentaconnect.com/content/hfes/hf/2005/00000047/00000001/art00012)

- Reading Performance reduces with in pages with many links and variable densities

- Text Alignment is not a performance enhancing factor

 

e) Font and Background Color: (source: http://sigs.aisnet.org/sighci/bit04/BIT_Hall.pdf)

- People believe black text on white background is most readable as compared to color fonts / backgrounds (though it does not affect their ability to retain info)

- People find color to be aesthetically appealing

- People equate higher readability with higher professionalism

 

5) Interaction

a) Icons (source: http://www.informaworld.com/smpp/content~content=a785033788~db=all)

- Bigger icons are faster to
w
ork with

- Icons left on the screen (like on the Windows task bar) result in faster response. Increasing number of icons does not increase response time.

- Icons appearing on the screen on an action (say on pressing Start button in Windows) result in a slower response. Increasing number of icons increases response time.

- Minimizing movement reduces response time (Fitts’ Law) – hence square configuration is best, followed by horizontal

 

b) Keyboard Shortcuts (source: Hidden Cost of GUI : Failure to Make Transition form Menu and Icon Toolbars. International Journal of Human Computer Interaction, 18(2), pp. 133-144)

- Using a keyboard shortcut is fastest, followed by icon, followed by menu

- People don’t learn shortcuts despite being easy to learn.

- Habitual patterns dominate performance

Microsoft’s application platform has now got a shade of the sky – Azure was formally announced at the PDC yesterday by Ray Ozzie in his opening keynote address.

I have said this several times in the past – computing going forward is going to be all about parallelization. On the client, parallelization of code is necessitated by the advent of many-core CPUs. On the server, parallelization is about scaling out – partitioning your code, app, data to be able to run on several servers in parallel to provide high-scale and redundancy. Of course it is rather difficult to design, build and operate the app infrastructure required to run an application that can potentially run across hundreds of servers. And this is where the Azure platform comes in – it is a platform to run .Net applications on Microsoft’s cloud fabric, hosted in its datacenters across the globe, offering virtually limitless scalability and availability. This is done by providing a set of abstractions that let you focus on the core business logic of the application instead of worrying about the underlying physical infrastructure. You can get an overview of these abstractions by reading David Chappell’s excellent whitepaper that introduces the three key set of abstractions announced at the PDC. In this post, I am going to focus on the lowest layer – Windows Azure.

 

Cloud Operating System

Windows Azure is the OS for the Cloud. Marketing speak? Not really. What does an OS do? It abstracts away the hardware. Before the OS came along, people used to code specifically for each machine architecture. You had to know the instruction set of a machine in order to be able to program for it. This changed with the notion of an OS. You did not have to worry about the specific machine architecture – you wrote to the OS and it took care of loading your code in memory, loading stack in the register, moving the instruction pointer, loading the data from disk to RAM, RAM to cache, cache to register, etc.

Today, when we think about creating really large scale applications, we need to know our deployment environment fairly well. How many web-servers are there? How should I partition my data – where is the node to which I need to send the writes, from which nodes can I do the reads. And then servicing this entire thing is difficult as well – how do I upgrade my code across the farm? How do I apply patches? And then there is the problem of recovery – a node goes down, now how do I route traffic? What happened to in-flight data on that node? What about the database node that just went down? Recovery gets very hard in a truly large scale app that is deployed across several physical boxes. It is not that it cannot be done – every single successful dotcom / web2.0 site has gone thru this route. But it is a hard route to follow, it takes a ton of engineering talent, a lot of upfront money and you still need to be lucky to get it right.

Enter the Cloud OS – an abstraction for a scaled out infrastructure that you can start using without bothering about underlying details. The idea is this – you design your app to scale (yes, you still have to do the design – no abstraction can help scale an app that has sticky session state or unpartitionable data design), and stop bothering about everything else underneath it. You basically state your intent by defining your deployment architecture declaratively by telling the OS the number of front-end servers you need, the number of back-end processing machines you need, what is the response time you expect on a certain operation, etc. and the system figures out the rest of it – it is an OS for the Cloud!

 

The Programming Model

At a high level, the programming model is really straight-forward. You write your code just the way you typically do (hopefully designed for scale-out) and alongside, you provide meta-data about what your app is and how it should be deployed. Then you submit your code and the metadata to the Cloud OS.

In the OS, a component called the Fabric Controller examines your metadata and finds the available resources that match the physical needs you describe and pushes your code to these resources – typically virtual machines. On each of these resources another component called an Agent is running which is in constant communication with the Controller. When the Controller pushes your code to these boxes, the agent configures it on this box as you defined – and keeps a check on its activities to make sure that the code is able to deliver the kind of characteristics you desire in terms of response time, footprint, etc. Similarly, if your metadata describes other resources to be used – routers, switches, load balancers, etc. – the Controller also gets these resources provisioned and configured according to your definition.

Once your app is up and running, the agent on a box maintains a heartbeat with your code - if it sees anything moving out the range it may take action like moving your app to a different box, or provisioning another box, etc. Of course what can also happen is that this box itself may die, taking the agent with it. This is when the Controller comes into action. The controller also maintains a heartbeat with the various Agents, and in case an Agent goes down, it can try and re-start the machine, etc.

 

Service Architecture

The architecture of a Service targeting the Azure CTP release is given at http://msdn.microsoft.com/library/dd179341.aspx. Couple of key points:

1) The service can consist of up to one app in a web role and one app in a worker role. However, there can be any number of instances of these apps.

2) The metadata is split into two parts:

a) Static Part – This part, called the System Definition, cannot be changed for a running service. This describes the "shape" of your service – no of web roles, worker roles. To change this, the service has to be republished. The schema is given at http://msdn.microsoft.com/library/dd179395.aspx

b) Dynamic Part – This part, called the System Configuration, can be changed for a running service. This describes the instancing of your service – how many instances of the web role and the worker role do you need. The schema for this is given at http://msdn.microsoft.com/library/dd179389.aspx

3) Storage: There are two options for storing data: First one is to use SQL Server Services – this is the database in the sky option. More on this at http://msdn.microsoft.com/library/dd200927.aspx. Second option is to use Windows Azure native storage: This is a simpler storage option. There are three services here:

a) Blob Storage: Meant for opaque objects like media files. This is similar to Amazon’s S3.More on http://msdn.microsoft.com/library/dd135733.aspx.

b) Table Storage: Meant for structured data. This is conceptually the same as ADO.net Data Services. More on http://msdn.microsoft.com/library/dd179423.aspx

c) Queue Storage: A basic message queue system. More on http://msdn.microsoft.com/library/dd179363.aspx 

 

Creating a Azure hosted Service

The Azure SDK provides an emulator called the Development Fabric that simplifies development tasks a lot. Here are the key steps:

1) Write your code

2) For storing data, a Development Storage is provided. You need to initialize this storage thru a tool called DevelopmentStorage

3) Create the Service Definition and Service Configuration files

4) Change the storage URIs as per http://msdn.microsoft.com/library/dd179425.aspx in the configuration file

5) Package the code along with the Service-Def file using a tool called CSPack.

6) Publish the package using a tool called CSRun.

 

Of course, before you can carry out step 6) you need to get your Azure account. To do that, visit https://connect.microsoft.com/site/sitehome.aspx?SiteID=681.

 

But Remember …

… that though the OS makes it simpler to run a scalable application by abstracting away all the infra underneath and by providing multiple services, the key is to design the scalable app. I will write more on writing scalable applications and also on using Azure in subsequent posts.

No this is not a line from a new novel by Douglas Adams, but our regular doomsayers. Yessir, not only have we become powerful enough to destroy the Universe, we are foolish enough to do it too! So after several failed warnings, finally the end of the world is in sight and you do not have to pay your EMI any more. Only this time it has not do with the turn of the millennium or a celestial body bringing the wrath of the Gods or the second coming of Christ, but something we humans are going to unleash upon ourselves. And it goes by the inane name of LHC. No it is not a mass hypnosis drug – far from it – LHC stands for the Large Hydron Collider, the largest particle accelerator in the world.

What are particle accelerators, you ask and how can one end the world? Well particle accelerators are instruments built by physicists (the Einstein variety of scientists – classification by subject matter, not gray) to study the behavior of very tiny particles at high energies. To get these tiny particles at the high energy levels, they make them go round and round in a long circular tube, accelerated by magnetic fields, till they start approaching the fastest speed possible – that of light in vacuum (a staggering 300,000 km / second). Then these particles are smashed into one another producing fireworks which the physicists watch with anticipation and glee. The glee is mostly because of being able to spend billions of dollars to smash things into one another at high speeds, but occasionally is also caused by observation of phenomena that reveal truths about the fundamental nature of our world – how the Universe got created and why things work the way they do.

At a length of 27 km and a budget of 6 Billion dollars, the LHC is just the largest and the most expensive of these toys yet. And apparently this toy can create particles that can destroy our world, or so say some scientists. Nay, say the toy makers- CERN – the same lot who built the world wide web. Nay, also say the 7000 odd kids who are going to play with the toys – some of the most eminent physicists of the world. The situation has the beautifully tragic potential of a foolish young civilization ending itself by probing too deep into the mind of God. It also has an element of ironical comedy which would have fitted straight into a sub-plot of the Hitchhiker’s Guide to the Galaxy. But we are dealing with an important question here – whether to pay the EMI or not – and must find out which camp to believe.

Now popular media cannot be believed, since it is caught up between two forces – the commercial force of getting more eyeballs which leads to sensationalism and the commercial force of credibility so that they can continue to get eyeballs which leads to more responsible reporting. Of course when it comes to such a juicy story, the immediate grabbing of eyeballs is more important, credibility can be built by other stories. Which is why you see "News" channels these days warning people against the wrath of the solar-eclipse and the evil eye of Saturn in the vocal styles of C-grade horror movie protagonists – it is all about immediate cash flow. However, this blog is not constrained by commercial interests – partly because my employers take good care of me (not for long, says my manager, if I continue writing such pieces), but mostly because Microsoft is late to bring out an ad-module for Live Spaces. So lets examine the arguments on both sides.

Of the several particles that would be produced at the LHC, the following four are the ones under the spotlight:

 

1) Micro Black Holes:

What is a Black Hole?

All masses attract each other thru the force of gravity. However the force is very weak – if you place two balls apart, they do not automatically collide since the force of attraction is unable to overcome the force of friction. This force of gravity increases as the mass of an object increases and the distance between the objects decreases. For objects as large as the planet Earth, the force is quite strong – actually it is this force that holds you and me to Earth.  Since matter is made up of smaller particles, all particles also attract thru the force of gravity, a single piece of matter can get compressed. However, the internal repulsive forces prevent this from happening. Which is why we do not collapse to a single point. However, when the mass is as high as a Star, the internal pressures can be overcome and the mass can collapse on itself undergoing a compression. The reason Stars do not collapse is because the tight compression fuses nuclei and generates energy which pushes particles apart, preventing further compression. However, very large stars, after exhausting their nuclear fusion can go on compressing infinitely producing what is called a Black Hole. Nothing is able to escape the gravitational pull of a Black Hole, not even light with its fast speed. That is why the name Black Hole – it cannot be seen because light is not able to escape it. So a black hole swallows matter around it and grows stronger. However, Black Holes are not a one way tunnel. Stephen Hawking proved that Black Holes evaporate energy, diminishing their mass gradually unless they completely disappear. This is called Hawking Radiation. Hawking Radiation is stronger for smaller black holes than larger black holes. This means that a small black hole radiates energy faster. Since the pull of a small black hole is smaller, it also grows slowly as compared to a larger black hole. Hence, a small black hole dissipates faster than a big one.

What is a Micro-Black Hole?

Theorists have long held that lighter black holes are also possible. As light as a millionth of a gram! However, the energy that it would take to produce such a small black hole is quite high. Now remember that in the LHC, particles would collide at very high energies, so maybe, just maybe a micro black hole can get formed. According to one theoretical model however, the energy required would actually be just sufficient and micro black holes can get formed at a rate of as much as one black hole per second in the LHC.

What is the Danger?

Since black holes grow and swallow other mass, there is a chance that a micro black hole will be formed that can swallow the entire Earth. There are several factors required to lead to this situation:

a) Micro black holes should be formed in the first place. The currently accepted models of micro black holes put the energy levels required at a thousand times the capability of the LHC. The doomsday scenario people claim that an alternative model makes it possible

b) The Micro black hole must not evaporate. This would happen only if Hawking Radiation is not a true phenomenon – that is, only a theoretical curiosity, not an actual fact.

c) The Micro black hole would be produced at low velocity so that it does not cross the Earth very quickly. At a high speed (remember the particles are colliding in the LHC at speed approaching that of light), the particle can leave the Earth very quickly without causing any harm

 

2) Strangelets:

All matter consists ultimately of a set of fundamental particles called Quarks and Leptons. I will not explain Quarks and Leptons here, you can read more on them on my other blog. There are 6 kinds of Quarks and one of them is called the Strange Quark. Don’t go on the name – the scientists named the quarks in creative ways and all quarks are equally strange. Typically matter is made of neutrons and protons which are made of Quarks called Up and Down quarks. However, t
he
re can also be matter made of Up, Down and Strange Quarks and such matter is called Strange Matter. Strange matter can occur in the core of very dense stars called neutron stars (these stars fall somewhat short of being a black hole) or in isolated droplets called Strangelets or a neutron star may get converted to a Strange Star if the mass is very high, but just short of being a black hole.

What is the Danger?

Typically we do not observe Strange matter because it is unstable. However, it is possible that if there is more of Strange Matter, it may be stable. The LHC may produce strangelets in enough quantity that they become stable. It is also possible that when a strangelet comes in contact with another nucleus it may convert that to a strangelet as well, thus setting off a chain reaction which may convert all of matter on Earth into Strange Matter.

 

3) Vacuum Bubbles:

There are theoretical models that predict that the vacuum in space is not the lowest-energy vacuum, but that there is a lower state vacuum called True Vacuum. This means that the Universe has False Vacuum. Thus at some point, our Universe may transition into the True Vacuum. However, this has not happened in 13 billion years so far. At LHC, since very high energy conditions would be there, it is possible that a region of True Vacuum – called a Vacuum Bubble may appear.

What is the Danger?

If a Vacuum Bubble is created, space surrounding it would also collapse into a lower energy state and it would expand. Such a bubble can grow at the speed of light and convert the entire Universe in a state of True Vacuum. This can destroy the Universe as we know it.

 

4) Magnetic monopoles:

If you have seen a magnet, you know it has two poles – North and South. Ever seen a magnet without two poles? What do you think would happen if you cut a magnet in half? People have tried this, and it leads to two magnets – each with two poles. No one has a seen a magnet with a single pole – a magnetic monopole. However, current theories predict magnetic monopoles, in fact they require that magnetic monopoles exist in order to be self-consistent. Just that they have not been observed anywhere, simply because they are obviously very rare and hence not likely to be encountered by a particle detector. Also, because they are very heavy, they are not created by particle accelerators. However, the LHC is a bigger and more powerful accelerator than any. And hence the possibility of a magnetic monopole getting formed.

What is the Danger?

In some models, it is predicted that a magnetic monopole can catalyze nucleon decay – this means that if a magnetic monopole strikes a large number of nuclei, causing them to decay and to release massive amounts of energy creating a gigantic nuclear explosion.

 

What Does CERN say?

Having understood the various threats in a little bit of detail, let’s now try and see why CERN and a majority of Earth’s physicists are not worried about this. The basic argument has to do with the LHC experiment being common in nature. To understand that we need to understand a phenomenon called Cosmic Rays.

Earth is constantly bombarded by high energy protons coming from deep space. These particles zipping all around us – called Cosmic Rays – originate in the Sun, in Supernova explosions in the outer Galaxy, in the Super-Heavy nuclei of various Galaxies, in Gamma Ray bursts, and in phenomena we do not know / understand yet. These particles can travel at very high energies. The highest energy comic ray possible is defined by the GZK limit and is held at 6 x 1019 eV – around that of a tennis ball thrown at 90 km/hr. However, cosmic rays with higher energies have also been observed and these are called Ultra High Energy Cosmic Rays. The magnetic field of the Earth and the Sun is able to divert most Cosmic Rays towards the poles – this is what leads to the beautiful Northern Lights phenomenon, and that is one reason why life was able to originate on Earth – the magnetic field allowed creation of more complex molecules which would not have been possible under an open exposure to the Cosmic Ray shower. But the point is this – Cosmic Rays are common.

What does this have to do with the LHC? Just this – the particles produced in the LHC would have energies around 1017 eV – three orders of magnitudes less than the highest energy cosmic rays! What? I hear you ask! Then why spend 6 billion dollars of tax payers money to produce stuff that nature already produces??? Well, one reason is that big boys like playing with big toys. The other reason is that the particle accelerator provide controlled environments to reproduce and study such phenomena which is not possible in the open Cosmic Ray environment. Nevertheless, the collision energy produced in the LHC is far less than what occurs in nature and hence it would not produce anything which is not otherwise found in nature. It would just provide an environment to study the phenomenon. According to CERN:

"Over the past billions of years, Nature has already generated on Earth as many collisions as about a million LHC experiments – and the planet still exists. Astronomers observe an enormous number of larger astronomical bodies throughout the Universe, all of which are also struck by cosmic rays. The Universe as a whole conducts more than 10 million million LHC-like experiments per second. The possibility of any dangerous consequences contradicts what astronomers see – stars and galaxies still exist."

While CERN also goes on to publish rebuttals to each type of threat individually, to me the argument above sounds like a good one. If you are interested in a more detailed discussion, read http://cern.ch/lsag/LSAG-Report.pdf.

 

So my recommendation is to plan your life on the assumption that the world is not ending. Not tomorrow when they inject the first set of particles, not after a couple of months when the LHC is fully functional, and not after a few years when enough meta-stable micro-black holes, magnetic monopoles, strangelets and vacuum bubbles have accumulated to destroy everything we know and love. And yes, do pay your EMI. I am continuing to pay, and so are the doomsayers!