Context
This dataset is a crawl of the blog posts of the Techcrunch technology blog, conducted in April 2010. It served as the experimental dataset for the following research paper:
L. Akritidis, D. Katsaros, P. Bozanis, "Identifying the Productive and Influential Bloggers in a Community", IEEE Transactions on Systems, Man, and Cybernetics-Part C: Applications and Reviews, vol. 41, no. 5, pp. 759-764, 2011.
The primary goal of this dataset was to provide an active community in which to identify members who are both productive and influential. However, since the full text of the posts is included, it can also be used for a wide variety of text mining and NLP tasks, such as sentiment analysis and opinion retrieval. A (My)SQL version is also available from here.
Researchers who use this dataset are kindly asked to cite the aforementioned article in their work.
If you find this dataset useful, you may also want to check out my TUAW dataset for identifying influential bloggers.
Content
The repository consists of four files:
- A list of the bloggers of Techcrunch, along with their (unique) IDs and some statistics.
- A database of the retrieved blog posts.
- The incoming links to the blog posts of Techcrunch, automatically retrieved using the Google Blog Search service.
- The comments submitted to the posts.
Precise descriptions and record counts for each file are provided below.