Posts tagged "data"

SpatialHadoop is a MapReduce framework designed specifically to handle huge spatial datasets. It ships with a built-in spatial high-level language, spatial data types, spatial indexes, and efficient spatial operations.

Code on GH.

—Jason

Great write-up and release of an open-source (Apache 2) project from Netflix. Code on GH.

—Jason

First I’ve heard of this.

“god is a scalable, performant, persistent, in-memory data structure server. It allows massively distributed applications to update and fetch common data in a structured and sorted format.

Its main inspirations are Redis and Chord/DHash. Like Redis it focuses on performance, ease of use, and a small, simple yet powerful feature set, while from the Chord/DHash projects it inherits scalability, redundancy, and transparent failover behaviour.”

Code on GH.

—Jason

A great video lecture on Social Network Analysis (SNA) from a course on “Computational Journalism” from the University of Hong Kong.

—Jason

Indiana University is releasing a huge HTTP request dataset for research purposes. Looks pretty awesome for big data and machine learning research.

During collection, the system generated data at a rate of about 60 million requests per day, or about 30 GB/day of raw data. The data was collected between Sep 2006 and May 2010. Data is missing for about 275 days. The dataset has two collections:
  1. raw: About 25 billion requests, where only the host name of the referrer is retained. Collected between 26 Sep 2006 and 3 Mar 2008; missing 98 days of data, including the entire month of Jun 2007. Approximately 0.85 TB, compressed.
  2. raw-url: About 28.6 billion requests, where the full referrer URL is retained. Collected between 3 Mar 2008 and 31 May 2010; missing 179 days of data, including the entire months of Dec 2008, Jan 2009, and Feb 2009. Approximately 1.5 TB, compressed.
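
A quick back-of-the-envelope check of those numbers, as a rough sketch. It uses only the figures quoted above; treating 1 GB as 10^9 bytes is my assumption, since the description does not say which unit it means.

```python
# Figures quoted in the dataset description above.
REQUESTS_PER_DAY = 60_000_000    # "about 60 million requests per day"
RAW_BYTES_PER_DAY = 30 * 10**9   # "about 30 GB/day", assuming 1 GB = 10^9 bytes

RAW_REQUESTS = 25.0e9            # "raw" collection
RAW_URL_REQUESTS = 28.6e9        # "raw-url" collection

# Average size of one raw request record.
bytes_per_request = RAW_BYTES_PER_DAY / REQUESTS_PER_DAY

# Requests across both collections.
total_requests = RAW_REQUESTS + RAW_URL_REQUESTS

# Missing days listed per collection.
missing_days = 98 + 179

print(f"~{bytes_per_request:.0f} bytes per request")
print(f"~{total_requests / 1e9:.1f} billion requests in total")
print(f"{missing_days} missing days across both collections")
```

That works out to roughly 500 bytes of raw data per request, and the per-collection gaps add up to 277 days, in line with the "about 275" quoted above.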


Fields:
* a timestamp
* the requested URL
* the referring URL
* a boolean classification of the user agent (browser or bot)
* a boolean flag for whether the request was generated inside or outside IU.
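
To make the record layout concrete, here is a minimal sketch of how one request could be modelled in Python. The `Click` class, its field names, the Unix-seconds timestamp, and the helper functions are all hypothetical; the description above lists the fields but not the on-disk format, so this is not a parser for the real files.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Click:
    """One request record (hypothetical layout, field names are assumptions)."""
    timestamp: int        # assumed Unix seconds; the source only says "a timestamp"
    requested_url: str
    referrer: str         # host name only in "raw", full URL in "raw-url"
    is_bot: bool          # user agent classified as bot (True) or browser (False)
    from_inside_iu: bool  # request generated inside (True) or outside (False) IU


def referrer_host(click: Click) -> str:
    """Return the referrer's host name for records from either collection."""
    if "://" in click.referrer:
        # Full URL (as in "raw-url"): extract the host part.
        return urlsplit(click.referrer).hostname or ""
    # Bare host name (as in "raw"): already what we want.
    return click.referrer


def browser_clicks(clicks):
    """Keep only requests whose user agent was classified as a browser."""
    return [c for c in clicks if not c.is_bot]


# Made-up sample records for illustration only.
sample = [
    Click(1204502400, "http://example.org/a", "http://news.example.com/story?id=1", False, True),
    Click(1204502401, "http://example.org/b", "search.example.net", True, False),
]
```

Normalizing the referrer to a host name is one way to analyze both collections together, since the older one never stored more than the host.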

—Jason