information from the borsaitaliana.it website to support a publication
Dataset Description
This dataset contains information prepared for a forthcoming publication:
- extracted from public open data accessible via web (see "Production Notes" section at the end for details)
- overall aim: comparing company data pre- and post-COVID, i.e. evolution from 2019 to 2022 (balance sheet due July 2023)
General description: see LinkedIn post
Rationale of dataset and the associated project: Reading pre- and post-COVID corporate narratives, the Italian case: a dataset in fieri
See associated notebook (more charts will be added as further information is integrated)
Structure of the file: listino_catalog_kaggle.csv
The first file contained in this dataset is the list of stocks and warrants presented on the Borsa Italiana website as of 2023-07-11, with the following structure:
| column | name | datatype | description |
|---|---|---|---|
| 1 | # | numeric | position index |
| 2 | stock | text | name of the company, as per Borsa Italiana website |
| 3 | link | URL | URL link to the page |
| 4 | market | text | subsection of the "listino", as per Borsa Italiana website |
| 5 | ISIN | text | stock identification code (12 characters): a 2-letter country code followed by 10 alphanumeric characters, the last of which is a check digit |
| 6 | profile | URL | URL link to the profile page for the stock (if filled by the company) |
| 7 | detailspresent | char | Y=if a page with details was linked, N=details page not present |
| 8 | withinstudy | string | only for ISINs starting with IT where there was a value within the profile URL: blank if retained within the study, "MissingReports" if financial reports are partial or not available, "NotCoveringPeriod" if some financial reports 2019-2021 are missing |
| 9 | covidstudy | string | within those selected in column 8, further restricted, based on data available, companies for a study comparing pre- and post-Covid financial and operational information; values: Y = within the study / N = excluded due to data / outofscope = not within the scope |
| 10 | industry | string | na = not available: if a value is present = as listed by industry on BorsaItaliana.it |
| 11 | subindustry | string | na = not available: if a value is present = as listed by subindustry within the industry on BorsaItaliana.it |
| 12 | 2019accounts | string | languages of the 2019 accounts for companies whose "covidstudy" (column 9) is "Y"; if both English and Italian are available, EN is listed |
| 13 | 2021accounts | string | languages of the 2021 accounts for companies whose "covidstudy" (column 9) is "Y"; if both English and Italian are available, EN is listed |
| 14 | UsedforENG | string | Y if used for the text-based part of the study, i.e. those that have EN in both "2019accounts" and "2021accounts" |
| 15 | YahooFinanceURL | URL | using the ISIN as main point of reference, the link to the YahooFinance page presenting financials; where none was available, "na" |
| 16 | checkvs2021yahoo | string | included=data reconciliation successful and company included in sample; bankassfin=company excluded but included in a future study on bank/insurance/finance; excluded=company excluded for other reasons |
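As a minimal sketch of how the file could be read with pandas, the snippet below loads every column as text and checks that the 16 documented columns are present. It assumes that the CSV header uses exactly the names in the "name" column of the table and that the separator is a comma; `load_catalog`, `EXPECTED_COLUMNS` and the sample row are illustrative and not part of the dataset.

```python
import io

import pandas as pd

# Column names as listed in the table above (assumed to match the CSV header).
EXPECTED_COLUMNS = [
    "#", "stock", "link", "market", "ISIN", "profile", "detailspresent",
    "withinstudy", "covidstudy", "industry", "subindustry",
    "2019accounts", "2021accounts", "UsedforENG",
    "YahooFinanceURL", "checkvs2021yahoo",
]


def load_catalog(source):
    """Read the catalog as text and check that the documented columns exist."""
    # dtype=str leaves codes such as the ISIN untouched.
    # keep_default_na=False keeps blank cells as "" instead of NaN, so a blank
    # "withinstudy" (= retained within the study) can be tested with == "".
    df = pd.read_csv(source, dtype=str, keep_default_na=False)
    missing = [name for name in EXPECTED_COLUMNS if name not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return df


# Invented row, only to show the call.
# Real use would be: load_catalog("listino_catalog_kaggle.csv")
SAMPLE = ",".join(EXPECTED_COLUMNS) + "\n" + (
    "1,EXAMPLE SPA,https://example.invalid/stock,EXAMPLE MARKET,IT0000000000,"
    "https://example.invalid/profile,Y,,Y,na,na,EN,EN,Y,na,included\n"
)
catalog = load_catalog(io.StringIO(SAMPLE))
print(catalog[["stock", "ISIN", "withinstudy", "covidstudy"]])
```

Reading everything as text is a deliberate choice here: several columns mix codes, "na" markers and blanks, and converting them to numbers or missing values would blur those distinctions.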
Note:
- this table is kept as a CSV source, which was built on 2023-07-12 using the information extracted on 2023-07-11 from the Borsa Italiana website (specifically, the 30 pages of the "listino A-Z")
- only the latest version of this dataset is visible
- it has been updated on 2023-08-04, adding column 8 ("withinstudy") after retrieving the financial reports for all the companies on Borsa Milano that fulfill the condition described in the table above for column 8
- it has been updated on 2023-09-03, adding column 9 ("covidstudy") after identifying which companies are part of the study (i.e. beside the other conditions, annual reports for 2019 and 2021 are available)
- it has been updated on 2023-11-02, adding columns 10 ("industry") and 11 ("subindustry") as on the BorsaItaliana.it website (the lists by industry and subindustry posted online as of 2023-11-02 covered only part of the "azioni" listed on the website; those lists and the whole taxonomy can be provided on request)
- it has been updated on 2023-12-22, adding columns 12 ("2019accounts"), 13 ("2021accounts") and 14 ("UsedforENG"), as the next publication step is to share information comparing 2019 and 2021 as within the financial reports collected; if a company's fiscal year only partially overlapped with the calendar year, "2019" and "2021" could stand for "2018/2019" and "2021/2022", to avoid including part of 2020, as accounts for that year were partially unpublished
- it has been updated on 2024-02-02, searching YahooFinance by ISIN and then selecting the Financials page; where not available this way, or ambiguous, the search has been done contextually and by looking at the data to identify the one relevant to the dataset; where even this search did not provide unambiguous results or provided no results, it was marked "na"
This dataset will be complemented with other information by the end of 2024
Whenever new data items are added:
- new columns will be added to this dataset and used in the sample notebook
- new files will be documented here, and a table of contents will be added
Update History
UPDATE 2023-07-23
Added the following columns:
| column | name | datatype | description |
|---|---|---|---|
| 5 | ISIN | text | stock identification code (12 characters): a 2-letter country code followed by 10 alphanumeric characters, the last of which is a check digit |
| 6 | profile | URL | URL link to the profile page for the stock (if filled by the company) |
| 7 | detailspresent | char | Y=if a page with details was linked, N=details page not present |
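The ISIN column lends itself to a quick sanity check. The sketch below is not part of the original extraction: it tests the general ISIN layout (2 letters, 9 alphanumeric characters, 1 check digit) and the Luhn-style check digit. The function name `is_valid_isin` is illustrative, and the two codes in the demo are a widely cited US example and a deliberately altered copy of it.

```python
import re

# Two letters, nine letters or digits, one final check digit.
ISIN_SHAPE = re.compile(r"[A-Z]{2}[A-Z0-9]{9}[0-9]")


def is_valid_isin(code: str) -> bool:
    """Return True if the code has the ISIN shape and a correct check digit."""
    if not ISIN_SHAPE.fullmatch(code):
        return False
    # Letters become two-digit numbers (A=10 ... Z=35); digits stay as they are.
    digits = "".join(str(int(ch, 36)) for ch in code)
    total = 0
    # Luhn check: starting from the right, double every second digit.
    for position, ch in enumerate(reversed(digits)):
        value = int(ch)
        if position % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


print(is_valid_isin("US0378331005"))  # well-formed code
print(is_valid_isin("US0378331006"))  # same code with the check digit altered
```

Applied to the catalog, a check like this would flag rows where the extracted ISIN was truncated or mistyped before it is used as a key for other sources.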
UPDATE 2023-08-04
Added the following column:
| column | name | datatype | description |
|---|---|---|---|
| 8 | withinstudy | string | only for ISINs starting with IT where there was a value within the profile URL: blank if retained within the study, "MissingReports" if financial reports are partial or not available, "NotCoveringPeriod" if some financial reports 2019-2021 are missing |
- it has been updated on 2023-08-04, adding column 8 ("withinstudy") after retrieving the financial reports for all the companies on Borsa Milano that fulfill the condition described in the table above for column 8
UPDATE 2023-09-03
Added the following column:
| column | name | datatype | description |
|---|---|---|---|
| 9 | covidstudy | string | within those selected in column 8, further restricted, based on data available, companies for a study comparing pre- and post-Covid financial and operational information; values: Y = within the study / N = excluded due to data / outofscope = not within the scope |
- it has been updated on 2023-09-03, adding column 9 ("covidstudy") after identifying which companies are part of the study (i.e. beside the other conditions, annual reports for 2019 and 2021 are available)
UPDATE 2023-11-02
Added the following columns:
| column | name | datatype | description |
|---|---|---|---|
| 10 | industry | string | na = not available: if a value is present = as listed by industry on BorsaItaliana.it |
| 11 | subindustry | string | na = not available: if a value is present = as listed by subindustry within the industry on BorsaItaliana.it |
- it has been updated on 2023-11-02, adding columns 10 ("industry") and 11 ("subindustry") as on the BorsaItaliana.it website (the lists by industry and subindustry posted online as of 2023-11-02 covered only part of the "azioni" listed on the website; those lists and the whole taxonomy can be provided on request)
UPDATE 2023-12-22
Added the following columns:
| column | name | datatype | description |
|---|---|---|---|
| 12 | 2019accounts | string | languages of the 2019 accounts for companies whose "covidstudy" (column 9) is "Y"; if both English and Italian are available, EN is listed |
| 13 | 2021accounts | string | languages of the 2021 accounts for companies whose "covidstudy" (column 9) is "Y"; if both English and Italian are available, EN is listed |
| 14 | UsedforENG | string | Y if used for the text-based part of the study, i.e. those that have EN in both "2019accounts" and "2021accounts" |
- it has been updated on 2023-12-22, adding columns 12 ("2019accounts"), 13 ("2021accounts") and 14 ("UsedforENG"), as the next publication step is to share information comparing 2019 and 2021 as within the financial reports collected; if a company's fiscal year only partially overlapped with the calendar year, "2019" and "2021" could stand for "2018/2019" and "2021/2022", to avoid including part of 2020, as accounts for that year were partially unpublished
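The rule behind "UsedforENG" can be sketched as a filter on the two language columns. The snippet assumes that the language value is stored literally as "EN" and that the columns were read as text; `english_text_sample` and the three demo rows are invented for illustration, not taken from the dataset.

```python
import pandas as pd


def english_text_sample(df: pd.DataFrame) -> pd.DataFrame:
    """Rows within the COVID study whose 2019 and 2021 accounts are both in English."""
    in_study = df["covidstudy"] == "Y"
    both_english = (df["2019accounts"] == "EN") & (df["2021accounts"] == "EN")
    return df[in_study & both_english]


# Invented rows, only to show the rule at work.
demo = pd.DataFrame(
    {
        "stock": ["ALPHA", "BETA", "GAMMA"],
        "covidstudy": ["Y", "Y", "N"],
        "2019accounts": ["EN", "IT", "EN"],
        "2021accounts": ["EN", "EN", "EN"],
    }
)
print(english_text_sample(demo)["stock"].tolist())
```

Comparing the result of such a filter with the rows marked Y in "UsedforENG" is one way to cross-check the column against its own definition.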
UPDATE 2024-02-06
Added the following column:
| column | name | datatype | description |
|---|---|---|---|
| 15 | YahooFinanceURL | URL | using the ISIN as main point of reference, the link to the YahooFinance page presenting financials; where none was available, "na" |
- it has been updated on 2024-02-02, searching YahooFinance by ISIN and then selecting the Financials page; where not available this way, or ambiguous, the search has been done contextually and by looking at the data to identify the one relevant to the dataset; where even this search did not provide unambiguous results or provided no results, it was marked "na"
UPDATE 2024-02-23
Added the following column:
| column | name | datatype | description |
|---|---|---|---|
| 16 | checkvs2021yahoo | string | included=data reconciliation successful and company included in sample; bankassfin=company excluded but included in a future study on bank/insurance/finance; excluded=company excluded for other reasons |
- following the previous update on 2024-02-06, reconciled the 2021 annual reports with the information available on YahooFinance, to identify reporting practices and further focus the selection
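The successive updates narrow the sample step by step. A rough way to see how many rows survive each step is sketched below; it assumes that the filters are cumulative (first "covidstudy" equal to Y, then "checkvs2021yahoo" equal to included), which is an interpretation of the column descriptions rather than a documented rule. `selection_funnel` and the demo rows are invented for illustration.

```python
import pandas as pd


def selection_funnel(df: pd.DataFrame) -> pd.Series:
    """Count the rows left after each successive selection step."""
    steps = {
        "all rows": pd.Series(True, index=df.index),
        "covidstudy == Y": df["covidstudy"] == "Y",
        "checkvs2021yahoo == included": df["checkvs2021yahoo"] == "included",
    }
    mask = pd.Series(True, index=df.index)
    counts = {}
    for label, condition in steps.items():
        # Each step is applied on top of the previous ones.
        mask = mask & condition
        counts[label] = int(mask.sum())
    return pd.Series(counts)


# Invented rows, only to show the shape of the result.
demo = pd.DataFrame(
    {
        "covidstudy": ["Y", "Y", "N", "outofscope"],
        "checkvs2021yahoo": ["included", "bankassfin", "excluded", "excluded"],
    }
)
print(selection_funnel(demo))
```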
Production Notes
After manually extracting a sample for each layer (listing, sample pages, sample profiles, sample details), to identify structure:
- read from the "listino" (search feature) to extract automatically the list of stocks
- used the information retrieved to identify the presence of a profile page and other keys
- used the keys found to extract automatically further information from the Borsa Italiana website
All the accesses to the website (except single-page sample testing) were done:
- after standard market hours
- adding a delay of a few seconds between each read
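The delayed reads described above can be sketched as follows. This is not the code used to build the dataset: `fetch_pages`, the 3-second default and the 30-second timeout are assumptions, and the `get` and `sleep` parameters exist only so that the loop can be tried with stand-ins instead of real web requests.

```python
import time


def fetch_pages(urls, get=None, delay_seconds=3.0, sleep=time.sleep):
    """Download each URL in turn, pausing between consecutive reads."""
    if get is None:
        # Imported here so the loop can also be exercised with a stand-in "get".
        import requests

        get = requests.get
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every read except the first one.
            sleep(delay_seconds)
        response = get(url, timeout=30)
        # Stop at the first HTTP error rather than storing an error page.
        response.raise_for_status()
        pages[url] = response.text
    return pages


# Hypothetical call; the list of URLs would come from the "link" column:
# pages = fetch_pages(list_of_urls, delay_seconds=5)
```

Keeping the pause inside the loop, rather than relying on the caller, makes it harder to run an extraction without the delay by accident.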
Language used: Python within Jupyter Notebook
Libraries (in alphabetical order):
- glob
- json
- os
- pandas
- pathlib
- requests
- time
Related Datasets
- 2021-2027 Achievements Planned (latest) (@esifunds)
- Bovespa - 03.07.2019 (@kaggle)
- EU SPI 2020 Scores And Other Statistics (@esifunds)