Methodology:
All news links were crawled from a Google search for "Stock Market" restricted to each specific date. Only the first 2 pages of results were kept, with each page containing 10 results.
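To get a feel for the size of the crawl, the following minimal sketch enumerates the dates covered by the dataset (2021-10-09 to 2022-10-08, as given for the date column) and computes the upper bound on collected links. The function name `crawl_dates` and the constants are illustrative and not taken from the original crawler. The actual number of rows is lower because of the filtering described in the remarks.

```python
from datetime import date, timedelta

START = date(2021, 10, 9)   # first date in the dataset
END = date(2022, 10, 8)     # last date in the dataset (inclusive)
PAGES_PER_DATE = 2          # first 2 result pages per date
RESULTS_PER_PAGE = 10       # 10 results per page


def crawl_dates(start, end):
    """Yield every calendar date from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


dates = list(crawl_dates(START, END))
# Upper bound only: paywalled or too-short articles are dropped later.
max_links = len(dates) * PAGES_PER_DATE * RESULTS_PER_PAGE
print(len(dates), max_links)  # prints: 365 7300
```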
Columns:
- date: 2021-10-09 to 2022-10-08 (Cleaned)
- url: the news link (Not cleaned)
- full_text: the entire article text, scraped using the Newspaper3k library (Not cleaned)
- summary: output of the Newspaper3k summary function, containing 10 sentences (Not cleaned)
- close: close price from Yahoo Finance (Cleaned; empty means the market was closed that day)
- volume: trading volume from Yahoo Finance (Cleaned; empty means the market was closed that day)
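The columns above could be read roughly as follows. This is a minimal sketch that assumes the data is stored as a CSV file with a header row matching the column names; the storage format is an assumption, and the sample rows are made up purely to show the layout. Empty close and volume values are converted to None so that non-trading days are easy to filter out.

```python
import csv
import io

# Made-up sample rows, only to illustrate the column layout.
SAMPLE = """date,url,full_text,summary,close,volume
2021-10-09,https://example.com/a,Some text,Some summary,,
2021-10-11,https://example.com/b,Other text,Other summary,100.5,1000
"""


def parse_optional(value, cast):
    """Return cast(value), or None when the cell is empty (market closed)."""
    value = value.strip()
    return cast(value) if value else None


def load_rows(handle):
    """Read rows from a CSV file handle and convert the numeric columns."""
    rows = []
    for row in csv.DictReader(handle):
        row["close"] = parse_optional(row["close"], float)
        # Go through float first so values such as "1000.0" also parse.
        row["volume"] = parse_optional(row["volume"], lambda v: int(float(v)))
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE))
trading_days = [r for r in rows if r["close"] is not None]
print(len(rows), len(trading_days))  # prints: 2 1
```

With a real file, `io.StringIO(SAMPLE)` would be replaced by an opened file handle, for example `open(path, newline="", encoding="utf-8")`.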
Remarks:
1). Some content is blocked by paywalls, so news items whose full text has fewer than 500 characters were excluded (e.g.
"$0.99 Subscription for reading!"). However, this filter is not guaranteed to catch every paywalled article.
2). Some URLs may be duplicated
3). The same full text may appear under different links
4). full_text and summary are not guaranteed to be in English
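Remarks 1 to 3 suggest a simple cleaning pass before using the data. The sketch below is one possible approach, not the method used to build the dataset: it reapplies the 500-character threshold, then keeps only the first occurrence of each URL and of each full text. The function name `clean` and the dictionary-based row format are assumptions. Language detection (remark 4) is not covered because it needs a third-party library.

```python
MIN_CHARS = 500  # threshold from remark 1


def clean(rows, min_chars=MIN_CHARS):
    """Drop short texts, repeated URLs, and repeated full texts."""
    seen_urls = set()
    seen_texts = set()
    kept = []
    for row in rows:
        url = row.get("url")
        text = (row.get("full_text") or "").strip()
        if len(text) < min_chars:
            continue  # likely a paywall stub or a failed scrape
        if url in seen_urls or text in seen_texts:
            continue  # duplicate link, or same article under another link
        seen_urls.add(url)
        seen_texts.add(text)
        kept.append(row)
    return kept


# Made-up rows to exercise each rule.
long_a = "a" * 600
long_b = "b" * 600
example_rows = [
    {"url": "https://example.com/1", "full_text": long_a},
    {"url": "https://example.com/1", "full_text": long_b},  # repeated URL
    {"url": "https://example.com/2", "full_text": long_a},  # repeated text
    {"url": "https://example.com/3", "full_text": "$0.99 Subscription for reading!"},
    {"url": "https://example.com/4", "full_text": long_b},
]
cleaned = clean(example_rows)
print([r["url"] for r in cleaned])
# prints: ['https://example.com/1', 'https://example.com/4']
```

Exact string matching only catches verbatim duplicates; near-duplicates with small differences (for example in whitespace or bylines) would need a fuzzier comparison.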
For more information, please visit my GitHub. Thanks!