
SpatialHadoop is a MapReduce framework designed specifically to handle huge spatial datasets. It ships with a built-in high-level spatial language, spatial data types, spatial indexes, and efficient spatial operations.
The code is on GitHub.
—Jason
First I’ve heard of this.
“god is a scalable, performant, persistent, in-memory data structure server. It allows massively distributed applications to update and fetch common data in a structured and sorted format.
Its main inspirations are Redis and Chord/DHash. Like Redis it focuses on performance, ease of use, and a small, simple yet powerful feature set, while from the Chord/DHash projects it inherits scalability, redundancy, and transparent failover behaviour.”
The code is on GitHub.
—Jason
Coursera:
I will keep updating this as I find more.
—Jason
A great video lecture on Social Network Analysis (SNA) from the University of Hong Kong's “Computational Journalism” course.
—Jason
Indiana University is releasing a huge dataset of HTTP requests for research purposes. It looks like a great resource for big data and machine learning research.
During collection, the system generated data at a rate of about 60 million requests per day, or about 30 GB/day of raw data. The data was collected between Sep 2006 and May 2010. Data is missing for about 275 days. The dataset has two collections:
- raw: About 25 billion requests, where only the host name of the referrer is retained. Collected between 26 Sep 2006 and 3 Mar 2008; missing 98 days of data, including the entire month of Jun 2007. Approximately 0.85 TB, compressed.
- raw-url: About 28.6 billion requests, where the full referrer URL is retained. Collected between 3 Mar 2008 and 31 May 2010; missing 179 days of data, including the entire months of Dec 2008, Jan 2009, and Feb 2009. Approximately 1.5 TB, compressed.
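As a quick sanity check on those figures, here is a small Python sketch that derives an average request size and the dataset-wide totals. It only uses the numbers quoted above; the variable names and the assumption that 1 GB means 10^9 bytes are mine, not the dataset's.

```python
# Back-of-the-envelope figures derived from the numbers quoted in the post.
REQUESTS_PER_DAY = 60_000_000     # "about 60 million requests per day"
RAW_BYTES_PER_DAY = 30 * 10**9    # "about 30 GB/day", assuming 1 GB = 10^9 bytes

# Per-collection figures as quoted; the dictionary layout is illustrative only.
COLLECTIONS = {
    "raw": {"requests": 25_000_000_000, "missing_days": 98, "compressed_tb": 0.85},
    "raw-url": {"requests": 28_600_000_000, "missing_days": 179, "compressed_tb": 1.5},
}


def summarize(collections):
    """Sum the per-collection figures into dataset-wide totals."""
    return {
        "requests": sum(c["requests"] for c in collections.values()),
        "missing_days": sum(c["missing_days"] for c in collections.values()),
        "compressed_tb": sum(c["compressed_tb"] for c in collections.values()),
    }


bytes_per_request = RAW_BYTES_PER_DAY / REQUESTS_PER_DAY
totals = summarize(COLLECTIONS)

print(f"average raw bytes per request: {bytes_per_request:.0f}")
print(f"total requests: {totals['requests']:,}")
print(f"total missing days: {totals['missing_days']}")
print(f"total compressed size: {totals['compressed_tb']:.2f} TB")
```

The per-collection gaps add up to 277 days, which lines up with the "about 275 days" quoted for the whole dataset.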
Fields:
- a timestamp
- the requested URL
- the referring URL
- a boolean classification of the user agent (browser or bot)
- a boolean flag for whether the request was generated inside or outside IU
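To get a feel for working with records like this, here is a minimal Python sketch of the five fields as a data class, plus one simple filter. The class name, field names, and sample values are all hypothetical. The post does not describe the on-disk format, so this models only the logical record, not how the files are actually encoded.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Request:
    """One logical record, with hypothetical field names."""
    timestamp: datetime
    requested_url: str
    referring_url: str    # host name only in the "raw" collection
    is_bot: bool          # user agent classified as bot (True) or browser (False)
    from_inside_iu: bool  # request generated inside (True) or outside (False) IU


def external_browser_requests(requests):
    """Keep requests made by browsers from outside IU."""
    return [r for r in requests if not r.is_bot and not r.from_inside_iu]


# Made-up sample records, for illustration only.
sample = [
    Request(datetime(2008, 3, 3, 12, 0, tzinfo=timezone.utc),
            "http://example.com/a", "http://example.org/", False, False),
    Request(datetime(2008, 3, 3, 12, 1, tzinfo=timezone.utc),
            "http://example.com/b", "http://example.org/", True, False),
    Request(datetime(2008, 3, 3, 12, 2, tzinfo=timezone.utc),
            "http://example.com/c", "http://example.net/", False, True),
]

kept = external_browser_requests(sample)
print(len(kept), kept[0].requested_url)
```

With tens of billions of requests, a real analysis would stream records rather than hold them in a list, but the same filter logic would apply.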
—Jason