##(see the SpecSpec for an explanation)

 * '''Launchpad Entry''': UbuntuSpec:server-maverick-cloud-datastores
 * '''Created''': 2010-05-20
 * '''Contributors''': [[ClintByrum]]
 * '''Packages affected''': couchdb, mongodb

== Summary ==

This spec details the steps we will take in Maverick to improve support for popular distributed/cloud-based storage technologies such as CouchDB, MongoDB, and Cassandra.

== Release Note ==

Maverick includes CouchDB, a distributed data store, in main, and a number of other cloud-friendly data storage packages in Universe/Multiverse, including MongoDB, Cassandra, and Drizzle.

== Rationale ==

In order to make Ubuntu Server the platform of choice in cloud environments, we need to support the workloads that users typically run in such environments. This includes cloud-oriented databases and datastores.

== User stories ==

As a web developer building modern, distributed, highly scalable applications, I want to deploy applications using distributed data storage with a minimum of customization.

As a web developer, I want to be able to try out popular cloud/distributed data storage solutions and remove them cleanly if they do not fit my needs.

As an ops engineer supporting distributed applications in and out of the cloud, I want to deploy critical infrastructure pieces such as data storage from known distributions without having to create custom builds of complicated software.

== Assumptions ==

== Design ==

=== CouchDB ===

With desktop couch already in main, CouchDB is a logical choice for promotion from universe to main.

=== MongoDB ===

[[http://www.mongodb.org/|MongoDB]] is already packaged in Debian. Ensure that the merge/sync with Debian brings in the most up-to-date version possible for Maverick.

=== Cassandra ===

[[http://cassandra.apache.org/|Cassandra]] is a very popular, fast-moving database engine that needs to be made available to users of Ubuntu.
It is not quite ready for Universe inclusion, as its dependencies are in flux and releases are deprecated at a very rapid rate.

==== Recommended PPA, or "RPA" ====

The Cassandra project publishes binary packages that are not hard to modify to build in a PPA on Launchpad. These packages, with the changes required to build, will be uploaded there.

==== Discoverability ====

===== Upstream Invitation for Involvement =====

We will encourage the Cassandra development team to contribute here and to upload their packages directly to the RPA above as well, to help users stay up to date as Cassandra releases are made. Membership in the restricted cassandra-ubuntu team on Launchpad will allow committing and uploading, so upstream developers will be free to update the packages if they are granted membership.

===== Server Team Endorsement =====

We will endorse the PPAs on our blog and mailing list.

===== RPA List =====

Once a list of "recommended package archives" is available, Cassandra's will be listed in it. It should materialize at [[ServerTeam/RPA]].

== Implementation ==

See blueprint whiteboard UbuntuSpec:server-maverick-cloud-datastores

== Test/Demo Plan ==

== Unresolved issues ==

N/A

== UDS session agenda and discussion ==

Discussion notes:

=== Top candidates: ===

 * Couchdb
 * Mongo
 * Cassandra
 * Drizzle

=== Overview ===

 * Document store
   * CouchDB (in main for karmic+)
   * MongoDB (in universe for lucid+)
     * Debian is up to date, we are not
 * Eventually-consistent key-value store (Dynamo implementation):
   1. Cassandra (not in Ubuntu)
   1. Project_Voldemort (LinkedIn) (not in Ubuntu)
 * Tabular
   * Hbase (built on top of hadoop, see server-maverick-hadoop-pig): available in maverick
   * hypertable
 * Key/value store on disk
   * redis (in universe karmic+)
   * tokyo cabinet (in universe hardy+) - 1.4.37 (from Debian) vs 1.4.44 (upstream)
     * maybe bug in watchfile
   * tokyotyrant - in Debian unstable, up to date 1.1.40 - network enabled tokyocabinet storage - native, memcache, http REST access - async replication
   * memcachedb (in universe jaunty+) - current but 'stable' for a long time
     * there are better ways to do things now
     * Oracle is interested in it because of Berkeley DB
 * Key/value store in RAM: see server-m-web20-workloads
 * Other NOSQL databases (from http://en.wikipedia.org/wiki/Nosql)
   * Neo4j (graph db)
   * Keyspace (graph db)
   * ndb -- up to date - part of mysql cluster
   * RIAK
     * how fast does it move? is it worth packaging?
     * seeing adoption
 * SQL cloud-oriented databases
   * Drizzle
     * in Debian, but moving fast
     * should be synced to maverick, keep following
     * candidate for nightly vcs

=== Actions ===

 * CouchDB: move couchdb server binary pkg into main => YES
 * MongoDB: merge/sync with Debian => YES
 * Cassandra: package for universe
   * 8+ missing build-deps (ttx to doublecheck this)
     * avro
     * paranamer
     * .(maven).
     * ConcurrentLinkedHashMap
     * hadoop => support for data analysis ( http://architects.dzone.com/news/cassandra-adds-hadoop )
     * high-scale-lib
     * java.util.concurrent
     * java.util.hashtable
     * jackson
     * json-simple
     * thrift => packaged by digg
       * .(none?).
       * moving away from thrift 'soon' http://about.digg.com/opensource
   * worst case: only go in multiverse in current form + nightly build

== Notes ==

=== Other Databases ===

The following would be interesting to revisit for later cycles:

 * tokyotyrant
 * RIAK
 * Project_Voldemort

----
CategorySpec