<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0"><channel><atom:link rel="hub" href="http://tumblr.superfeedr.com/" xmlns:atom="http://www.w3.org/2005/Atom"/><description></description><title>Big Data Cookbook</title><generator>Tumblr (3.0; @bigdata-cookbook)</generator><link>http://www.bigdata-cookbook.com/</link><item><title>How-to: Configure Eclipse for Hadoop Contributions</title><description>&lt;a href="http://blog.cloudera.com/blog/2013/05/how-to-configure-eclipse-for-hadoop-contributions/"&gt;How-to: Configure Eclipse for Hadoop Contributions&lt;/a&gt;: &lt;p&gt;Useful how-to on setting up Hadoop in Eclipse to enable dev and contribution.&lt;/p&gt;

&lt;p&gt;—Jason&lt;/p&gt;</description><link>http://www.bigdata-cookbook.com/post/50900333433</link><guid>http://www.bigdata-cookbook.com/post/50900333433</guid><pubDate>Mon, 20 May 2013 07:20:41 -0400</pubDate><category>hadoop</category><category>eclipse</category><category>bigdata</category><category>cloudera</category><category>apache</category></item><item><title>Really useful EC2 Pricing Breakdown</title><description>&lt;p&gt;The guys at &lt;a href="http://www.infochimps.com/"&gt;Infochimps&lt;/a&gt; published this &lt;a href="https://github.com/infochimps-labs/ironfan/wiki/ec2-pricing_and_capacity"&gt;pricing breakdown&lt;/a&gt; for the various EC2 instances.&lt;/p&gt;

&lt;p&gt;Seems like it will come in handy one day&amp;#8230;&lt;/p&gt;

&lt;p&gt;(via @rjurney)&lt;/p&gt;

&lt;p&gt;&amp;#8212;Jason&lt;/p&gt;</description><link>http://www.bigdata-cookbook.com/post/50569814879</link><guid>http://www.bigdata-cookbook.com/post/50569814879</guid><pubDate>Thu, 16 May 2013 07:02:17 -0400</pubDate><category>EC2</category><category>aws</category><category>pricing</category><category>bigdata</category></item><item><title>Machine Learning over Storm</title><description>&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/pmerienne/trident-ml"&gt;trident-ml&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Trident-ML is a realtime online machine learning library. It allows you to build realtime predictive features using scalable online algorithms. This library is built on top of Storm, a distributed stream processing framework which runs on a cluster of machines and supports horizontal scaling. The packaged algorithms are designed to fit into limited memory and processing time, but they don&amp;#8217;t work in a distributed way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/quintona/storm-r"&gt;storm-r&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Storm-R uses the &lt;a href="https://github.com/nathanmarz/storm/wiki/Multilang-protocol"&gt;multilang protocol&lt;/a&gt; to integrate R function calls with a trident Function.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/quintona/storm-pattern"&gt;storm-pattern&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Based on the cascading.pattern project, the pattern sub-project for &lt;a href="http://Cascading.org/"&gt;http://Cascading.org/&lt;/a&gt;, which uses flows as containers for machine learning models and imports PMML model descriptions from R, SAS, Weka, RapidMiner, KNIME, SQL Server, etc.&lt;/p&gt;

&lt;p&gt;&amp;#8212;Jason&lt;/p&gt;</description><link>http://www.bigdata-cookbook.com/post/50334182923</link><guid>http://www.bigdata-cookbook.com/post/50334182923</guid><pubDate>Mon, 13 May 2013 06:38:38 -0400</pubDate><category>storm</category><category>machine learning</category><category>realtime</category><category>analytics</category></item><item><title>Running a Multi-Broker Apache Kafka 0.8 Cluster on a Single Node</title><description>&lt;a href="http://www.michael-noll.com/blog/2013/03/13/running-a-multi-broker-apache-kafka-cluster-on-a-single-node/"&gt;Running a Multi-Broker Apache Kafka 0.8 Cluster on a Single Node&lt;/a&gt;: &lt;p&gt;&lt;a href="http://www.michael-noll.com/blog/2013/03/13/running-a-multi-broker-apache-kafka-cluster-on-a-single-node/"&gt;This&lt;/a&gt; is an excellent howto by &lt;a href="https://twitter.com/miguno"&gt;Michael Noll&lt;/a&gt; on setting up a multi-broker &lt;a href="http://kafka.apache.org/"&gt;Kafka&lt;/a&gt; cluster.&lt;/p&gt;

&lt;p&gt;—Jason&lt;/p&gt;</description><link>http://www.bigdata-cookbook.com/post/45843402075</link><guid>http://www.bigdata-cookbook.com/post/45843402075</guid><pubDate>Wed, 20 Mar 2013 13:15:45 -0400</pubDate><category>storm</category><category>hadoop</category><category>kafka</category><category>streaming</category><category>queue</category><category>realtime</category><category>bigdata</category><category>messaging</category></item><item><title>Parquet columnar storage format for Hadoop released</title><description>&lt;p&gt;&lt;a href="http://parquet.github.com/"&gt;Parquet&lt;/a&gt; is a columnar storage format for Hadoop.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/parquet/parquet-format"&gt;Format&lt;/a&gt; and &lt;a href="https://github.com/parquet/parquet-mr"&gt;MapReduce&lt;/a&gt; code on &lt;a href="https://github.com/Parquet"&gt;GH&lt;/a&gt;. Includes a &lt;a href="https://github.com/Parquet/parquet-mr/blob/master/parquet-pig/src/main/java/parquet/pig/ParquetLoader.java"&gt;Loader&lt;/a&gt; and &lt;a href="https://github.com/Parquet/parquet-mr/blob/master/parquet-pig/src/main/java/parquet/pig/ParquetStorer.java"&gt;Storer&lt;/a&gt; for &lt;a href="http://pig.apache.org/"&gt;pig&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Released under the Apache 2.0 License.&lt;/p&gt;</description><link>http://www.bigdata-cookbook.com/post/45205603097</link><guid>http://www.bigdata-cookbook.com/post/45205603097</guid><pubDate>Tue, 12 Mar 2013 15:31:49 -0400</pubDate><category>hadoop</category><category>mapreduce</category><category>serde</category><category>columnar</category><category>storage</category></item><item><title>Spatial Hadoop: MapReduce framework for spatial data</title><description>&lt;p&gt;&lt;img src="http://media.tumblr.com/ae30f1df77830b24a63ae23cfd19c85a/tumblr_inline_mjk9ba5z281qz4rgp.png" alt=""/&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="http://spatialhadoop.cs.umn.edu/"&gt;Spatial Hadoop&lt;/a&gt; is a MapReduce framework designed specifically to handle huge datasets of spatial data. SpatialHadoop ships with a built-in spatial high-level language, spatial data types, spatial indexes, and efficient spatial operations.&lt;/p&gt;

&lt;p&gt;Code on &lt;a href="https://github.com/aseldawy/spatialhadoop"&gt;GH&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&amp;#8212;Jason&lt;/p&gt;</description><link>http://www.bigdata-cookbook.com/post/45202536273</link><guid>http://www.bigdata-cookbook.com/post/45202536273</guid><pubDate>Tue, 12 Mar 2013 14:45:09 -0400</pubDate><category>hadoop</category><category>mapreduce</category><category>bigdata</category><category>maps</category><category>spatial</category><category>data</category></item><item><title>NetflixGraph: Compact in-memory representation of directed graph data</title><description>&lt;p&gt;&lt;img src="http://media.tumblr.com/761d8e64b2d53a7d86afffcd2a5c39f8/tumblr_inline_mjgjkgp3IO1qz4rgp.gif" alt=""/&gt;&lt;/p&gt;

&lt;p&gt;Great &lt;a href="http://techblog.netflix.com/2013/01/netflixgraph-metadata-library_18.html"&gt;write-up&lt;/a&gt; and &lt;a href="http://netflix.github.com/netflix-graph/"&gt;release&lt;/a&gt; of an open source (Apache 2) project from &lt;a href="https://www.netflix.com"&gt;Netflix&lt;/a&gt;.  Code on &lt;a href="https://github.com/Netflix/netflix-graph"&gt;GH&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&amp;#8212;Jason&lt;/p&gt;</description><link>http://www.bigdata-cookbook.com/post/45041410971</link><guid>http://www.bigdata-cookbook.com/post/45041410971</guid><pubDate>Sun, 10 Mar 2013 14:34:40 -0400</pubDate><category>graph</category><category>analytics</category><category>efficient</category><category>open source</category><category>data</category></item><item><title>Fnordmetric: Nice metrics visualization</title><description>&lt;p&gt;&lt;img src="http://media.tumblr.com/ff6bc3ccc9c3f7e8607de12c3f0a6ea3/tumblr_inline_mjgiwdh5jW1qz4rgp.png" alt=""/&gt;&lt;/p&gt;

&lt;p&gt;I found &lt;a href="http://fnordmetric.io/"&gt;this&lt;/a&gt; recently and it looks pretty nice.  Check out the &lt;a href="http://fnordmetric.io/screenshots/"&gt;screenshots&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Code on &lt;a href="https://github.com/paulasmuth/fnordmetric"&gt;GH&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&amp;#8212;Jason&lt;/p&gt;</description><link>http://www.bigdata-cookbook.com/post/45040226817</link><guid>http://www.bigdata-cookbook.com/post/45040226817</guid><pubDate>Sun, 10 Mar 2013 14:19:52 -0400</pubDate><category>stats</category><category>metrics</category><category>visualization</category><category>graphs</category><category>charts</category><category>counts</category></item><item><title>Write-up on Clustering Graphite</title><description>&lt;p&gt;&lt;img src="http://media.tumblr.com/920281334e2ede73f9f7b187d83c3c34/tumblr_inline_mjgi5snZvL1qz4rgp.png" alt=""/&gt;&lt;/p&gt;

&lt;p&gt;Short &lt;a href="http://bitprophet.org/blog/2013/03/07/graphite/"&gt;write-up&lt;/a&gt; on clustering &lt;a href="http://graphite.wikidot.com/"&gt;Graphite&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&amp;#8212;Jason&lt;/p&gt;</description><link>http://www.bigdata-cookbook.com/post/45039001902</link><guid>http://www.bigdata-cookbook.com/post/45039001902</guid><pubDate>Sun, 10 Mar 2013 14:04:20 -0400</pubDate><category>graphite</category><category>metrics</category><category>counts</category><category>stats</category></item><item><title>[Video] ElasticSearch: The Missing Intro</title><description>&lt;a href="https://air.mozilla.org/elasticsearch/"&gt;[Video] ElasticSearch: The Missing Intro&lt;/a&gt;: &lt;p&gt;A nice intro &lt;a href="https://air.mozilla.org/elasticsearch/"&gt;video&lt;/a&gt; on &lt;a href="http://www.elasticsearch.org/"&gt;ElasticSearch&lt;/a&gt; from &lt;a href="https://air.mozilla.org/"&gt;Air Mozilla&lt;/a&gt;.  It is ~56min.&lt;/p&gt;

&lt;p&gt;—Jason&lt;/p&gt;</description><link>http://www.bigdata-cookbook.com/post/44535820164</link><guid>http://www.bigdata-cookbook.com/post/44535820164</guid><pubDate>Mon, 04 Mar 2013 06:37:54 -0500</pubDate><category>elasticsearch</category><category>search</category><category>bigdata</category></item><item><title>Spark and Shark Tutorial from Stratconf 2013</title><description>&lt;p&gt;&lt;img src="http://media.tumblr.com/f5f4251212d1dd848abf1ddfa7020177/tumblr_inline_mj3pjlluzV1qz4rgp.png"/&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="http://spark-project.org/"&gt;Spark&lt;/a&gt; and &lt;a href="http://shark.cs.berkeley.edu/"&gt;Shark&lt;/a&gt; tutorial/course given at &lt;a href="http://strataconf.com/strata2013/public/schedule/detail/27438"&gt;Strata&lt;/a&gt;, materials &lt;a href="http://ampcamp.berkeley.edu/amp-camp-strata-2013/"&gt;online&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/7c496281062b78721bf711db815a3f2f/tumblr_inline_mj3pk21zLR1qz4rgp.gif" alt=""/&gt;&lt;/p&gt;

&lt;p&gt;&amp;#8212;Jason&lt;/p&gt;</description><link>http://www.bigdata-cookbook.com/post/44480336832</link><guid>http://www.bigdata-cookbook.com/post/44480336832</guid><pubDate>Sun, 03 Mar 2013 15:21:50 -0500</pubDate><category>spark</category><category>shark</category><category>strataconf</category><category>bigdata</category><category>realtime</category><category>analytics</category><category>course</category><category>tutorial</category></item><item><title>Graph Based Recommendations using "How-To" Guides Dataset</title><description>&lt;p&gt;&lt;img src="http://media.tumblr.com/8af1d90d4f26069c9033c38f600d08d7/tumblr_inline_mj3l09ZJXJ1qz4rgp.png" alt=""/&gt;&lt;/p&gt;

&lt;p&gt;This is a &lt;a href="http://aimotion.blogspot.com/2013/03/graph-based-recommendations-using-how.html"&gt;great post&lt;/a&gt; using Python, &lt;a href="http://www.neo4j.org/"&gt;Neo4j&lt;/a&gt;, and &lt;a href="http://bulbflow.com/"&gt;Bulbflow&lt;/a&gt; to build a recommendation system using a graph database.  It looks like they crawled &lt;a href="http://snapguide.com/"&gt;SnapGuide&lt;/a&gt; to get their data for this.&lt;/p&gt;

&lt;p&gt;The code for bulbflow is on &lt;a href="https://github.com/espeed/bulbs"&gt;GH&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/50df5b090dc8358261c59ce598cd5c42/tumblr_inline_mj3l0jT9aQ1qz4rgp.png" alt=""/&gt;&lt;/p&gt;

&lt;p&gt;Here is another post on &lt;a href="http://markorodriguez.com/2011/09/22/a-graph-based-movie-recommender-engine/"&gt;Graph Recommendation Systems using Gremlin&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&amp;#8212;Jason&lt;/p&gt;</description><link>http://www.bigdata-cookbook.com/post/44471616714</link><guid>http://www.bigdata-cookbook.com/post/44471616714</guid><pubDate>Sun, 03 Mar 2013 13:37:51 -0500</pubDate><category>graph</category><category>neo4j</category><category>python</category><category>bulbflow</category><category>analytics</category><category>recommender</category></item><item><title>Using SOLR Cloud as a NoSQL database for low latency analytics</title><description>&lt;p&gt;&lt;a href="http://chimpler.wordpress.com/2013/02/27/playing-with-solr-cloud-for-responsive-analytics/"&gt;Article&lt;/a&gt; on using &lt;a href="http://wiki.apache.org/solr/SolrCloud"&gt;SolrCloud&lt;/a&gt; for low latency analytics.  Example configs on &lt;a href="https://github.com/chimpler/blog-solr-cloud-example"&gt;GH&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&amp;#8212;Jason&lt;/p&gt;</description><link>http://www.bigdata-cookbook.com/post/44396358138</link><guid>http://www.bigdata-cookbook.com/post/44396358138</guid><pubDate>Sat, 02 Mar 2013 15:56:39 -0500</pubDate><category>solr</category><category>nosql</category><category>analytics</category><category>search</category><category>lucene</category></item><item><title>Agile Analytics Applications [Presentation]</title><description>&lt;p&gt;This is an awesome &lt;a href="http://www.slideshare.net/russell_jurney/hortonworks-roadshow"&gt;presentation&lt;/a&gt; by &lt;a href="https://twitter.com/rjurney"&gt;Russell Jurney&lt;/a&gt; on building data-driven applications that use big data.&lt;/p&gt;

&lt;center&gt;
&lt;img height="328" width="250" src="http://media.tumblr.com/767ef995d9f9d1bcfc8293232b4a0580/tumblr_inline_mif25r1skP1qz4rgp.jpg" alt="image"/&gt;&lt;/center&gt;

&lt;p&gt;Russell is in the midst of writing a book on this topic, and it is currently available for review on &lt;a href="http://ofps.oreilly.com/titles/9781449326265/"&gt;O&amp;#8217;Reilly&amp;#8217;s Open Feedback Publishing System&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&amp;#8212;Jason&lt;/p&gt;</description><link>http://www.bigdata-cookbook.com/post/43395593731</link><guid>http://www.bigdata-cookbook.com/post/43395593731</guid><pubDate>Mon, 18 Feb 2013 07:50:00 -0500</pubDate><category>bigdata</category><category>presentation</category><category>mapreduce</category><category>pig</category><category>agile</category><category>analytics</category><category>data science</category></item><item><title>Computing Rollups on Impression Logs using Storm</title><description>&lt;p&gt;&lt;img src="http://media.tumblr.com/ea243a89e8b87d49fad7ded25f694c6f/tumblr_inline_mif1tz8fRk1qz4rgp.png" alt="Chimpler"/&gt;&lt;/p&gt;

&lt;p&gt;Great &lt;a href="http://chimpler.wordpress.com/2013/02/16/a-hadoop-alternative-building-a-real-time-data-pipeline-with-storm/"&gt;blog post&lt;/a&gt; at &lt;a href="http://chimpler.wordpress.com"&gt;Chimpler&lt;/a&gt; on using Storm to compute rollups on impression logs and store the results in MongoDB.  Code on &lt;a href="https://github.com/chimpler/blog-storm-adnetwork-example"&gt;GH&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&amp;#8212;Jason&lt;/p&gt;</description><link>http://www.bigdata-cookbook.com/post/43395270273</link><guid>http://www.bigdata-cookbook.com/post/43395270273</guid><pubDate>Mon, 18 Feb 2013 07:39:49 -0500</pubDate><category>storm</category><category>realtime</category><category>bigdata</category><category>analytics</category></item><item><title>Habakkuk: Finding Religious Tweets using Storm</title><description>&lt;p&gt;Great &lt;a href="http://technicalelvis.com/blog/2012/06/21/habakkuk-starter/"&gt;blog post&lt;/a&gt; on mining the Twitter stream for religious tweets using Storm.  Code on &lt;a href="https://github.com/telvis07/habakkuk"&gt;GH&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&amp;#8212;Jason&lt;/p&gt;</description><link>http://www.bigdata-cookbook.com/post/43393659127</link><guid>http://www.bigdata-cookbook.com/post/43393659127</guid><pubDate>Mon, 18 Feb 2013 06:39:26 -0500</pubDate><category>storm</category><category>bigdata</category><category>data mining</category><category>analytics</category><category>realtime</category><category>Twitter</category></item><item><title>Storing Time Series Metrics With Cassandra and Composite Columns</title><description>&lt;p&gt;Great (short) &lt;a href="http://www.slideshare.net/charmalloc/jsteinmeetupcassandra20111102"&gt;presentation&lt;/a&gt; on &amp;#8220;Storing Time Series Metrics With Cassandra and Composite Columns&amp;#8221; by Joe Stein from Medialets.  With code on &lt;a href="https://github.com/joestein/apophis"&gt;GH&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&amp;#8212;Jason&lt;/p&gt;</description><link>http://www.bigdata-cookbook.com/post/43244051764</link><guid>http://www.bigdata-cookbook.com/post/43244051764</guid><pubDate>Sat, 16 Feb 2013 14:04:01 -0500</pubDate><category>cassandra</category><category>bigdata</category><category>counts</category><category>analytics</category><category>howto</category></item><item><title>Polyglot Persistence and Query with Gremlin</title><description>&lt;a href="http://thinkaurelius.com/2013/02/04/polyglot-persistence-and-query-with-gremlin/"&gt;Polyglot Persistence and Query with Gremlin&lt;/a&gt;: &lt;p&gt;Great &lt;a href="http://thinkaurelius.com/2013/02/04/polyglot-persistence-and-query-with-gremlin/"&gt;article&lt;/a&gt; on using &lt;a href="https://github.com/tinkerpop/gremlin"&gt;Gremlin&lt;/a&gt; to query graph data in various datastores.&lt;/p&gt;

&lt;p&gt;—Jason&lt;/p&gt;</description><link>http://www.bigdata-cookbook.com/post/43009881635</link><guid>http://www.bigdata-cookbook.com/post/43009881635</guid><pubDate>Wed, 13 Feb 2013 12:21:39 -0500</pubDate><category>graph</category><category>gremlin</category><category>data mining</category><category>analytics</category></item><item><title>A Tutorial on Writing Hive UDFs</title><description>&lt;a href="http://snowplowanalytics.com/blog/2013/02/08/writing-hive-udfs-and-serdes/"&gt;A Tutorial on Writing Hive UDFs&lt;/a&gt;: &lt;p&gt;Nice &lt;a href="http://snowplowanalytics.com/blog/2013/02/08/writing-hive-udfs-and-serdes/"&gt;tutorial&lt;/a&gt; on writing UDFs for Hive.&lt;/p&gt;

&lt;p&gt;—Jason&lt;/p&gt;</description><link>http://www.bigdata-cookbook.com/post/43008851830</link><guid>http://www.bigdata-cookbook.com/post/43008851830</guid><pubDate>Wed, 13 Feb 2013 12:01:41 -0500</pubDate><category>hadoop</category><category>hive</category><category>UDF</category><category>mapreduce</category><category>bigdata</category><category>tutorial</category><category>howto</category></item><item><title>Flatten entire HBase column families with Pig and Python UDFs</title><description>&lt;a href="http://chase-seibert.github.com/blog/2013/02/10/pig-hbase-flatten-column-family.html"&gt;Flatten entire HBase column families with Pig and Python UDFs&lt;/a&gt;: &lt;p&gt;Some nice &lt;a href="http://pig.apache.org/"&gt;Pig&lt;/a&gt; and &lt;a href="http://hbase.apache.org/"&gt;HBase&lt;/a&gt; &lt;a href="http://chase-seibert.github.com/blog/2013/02/10/pig-hbase-flatten-column-family.html"&gt;hackery&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;—Jason&lt;/p&gt;</description><link>http://www.bigdata-cookbook.com/post/43008701676</link><guid>http://www.bigdata-cookbook.com/post/43008701676</guid><pubDate>Wed, 13 Feb 2013 11:58:55 -0500</pubDate><category>hbase</category><category>hadoop</category><category>pig</category><category>mapreduce</category><category>bigdata</category><category>bigtable</category></item></channel></rss>
