Refined: Large Metal Lyrics Archive (228K songs)
This dataset represents a refined version of "The Large Metal Lyrics Archive (LMLA)" [https://www.kaggle.com/datasets/markkorvin/large-metal-lyrics-archive-228k-songs]. The initial version posed several challenges as identified by the original author, which included:
- The presence of purely instrumental songs/entries, which were not consistently indicated with markers such as 'instrumental'.
- A diversity of languages in the songs without any language classification.
- The inclusion of non-lyrical elements within the lyrics, such as 'Chorus', 'Solo', or others.
Additionally,
- The original dataset suffered from numerous encoding errors.
These issues complicate Natural Language Processing (NLP) tasks that require clean data, necessitating thorough data cleaning steps.
Correcting Encoding Errors
A significant problem within the LMLA was encoding errors, evident from characters like [ГВўЂ™]. An initial regex search revealed that this issue affected 44,843 rows (song entries), or about 19.64% of the dataset, highlighting the need for encoding error correction to maintain data integrity and utility for analysis. These characters not only misrepresent the text but could also interfere with subsequent text processing tasks. A rigorous data correction and cleaning process was thus implemented.
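The row count above can be reproduced in spirit with a short Python sketch. It is only an illustration: the function name `count_affected` and the sample rows are made up, and the original search may have been run in a different tool.

```python
import re

# Character class quoted in the text; each character is typical mojibake debris.
MOJIBAKE = re.compile(r"[ГВўЂ™]")

def count_affected(rows):
    """Return (rows containing a suspicious character, their share of all rows)."""
    hits = sum(1 for text in rows if MOJIBAKE.search(text))
    share = hits / len(rows) if rows else 0.0
    return hits, share

# Tiny made-up sample, not taken from the dataset.
sample = ["donвЂ™t look back", "clean line", "cafГ© noir", "another clean line"]
hits, share = count_affected(sample)
print(hits, f"{share:.2%}")  # 2 50.00%
```

The class also matches genuine Cyrillic Г and В and the ™ sign, so a hit marks a candidate row rather than a confirmed encoding error.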
Python Script for Encoding Correction: The process began with an analysis to identify incorrect encoding patterns, revealing issues likely due to a mix of character sets and encodings (windows-1251, latin-1, and utf-8). A Python script using a specific encoding and decoding algorithm, along with the ftfy library for text encoding repair, was developed.
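The original script is not reproduced here. As a minimal sketch of the kind of repair described, assuming the most common failure was UTF-8 text wrongly decoded as windows-1251, a single encode/decode round trip using only the standard library could look like this. The function name `fix_mojibake` is hypothetical, and the actual script additionally relied on ftfy.

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8 text that was wrongly decoded as windows-1251.

    Assumption: this single round trip is illustrative only; the original
    script also used ftfy and may have handled latin-1 and other cases.
    """
    try:
        # Recover the raw bytes, then decode them with the intended encoding.
        return text.encode("windows-1251").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not reversible this way (e.g. genuine Cyrillic): keep the text as-is.
        return text

print(fix_mojibake("donвЂ™t"))  # don’t
print(fix_mojibake("Привет"))   # Привет (left unchanged)
```

Falling back to the input on any codec error keeps genuine Cyrillic lyrics intact, at the cost of leaving strings that mix damaged and undamaged characters unrepaired.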
Outcome and Verification: After applying the script, a second regex search for the problematic characters [ГВўЂ™] found only 430 matches, reducing the issue to about 0.19% of the dataset. This decrease demonstrated the strategy's effectiveness. The remaining matches also indicated the presence of genuine Cyrillic text and the trademark sign (™), which the pattern matches even though they are not encoding errors.
Removing Non-Lyrical Elements
The data cleaning for the Large Metal Lyrics Archive continued with the removal of non-lyrical elements, such as metadata within brackets and sourcing notes. This was achieved using regex (regular expression) searches in OpenRefine, a user-friendly tool for data preprocessing.
Removing Brackets without Line Breaks: The first regex search targeted content within brackets without line breaks, effectively removing a wide range of non-lyrical elements. This step removed strings representing irrelevant annotations, affecting 61,832 cells.
Regex-Search: \[(.*?)\]
Removing Brackets with Line Breaks: The second regex pattern focused on brackets spanning multiple lines, addressing more complex annotations. This step affected 458 cells. It occasionally removed content that might have been lyrical, but with minimal impact on the dataset's integrity.
Regex-Search: \[[^\]]*?(?:\r?\n)[^\]]*?\]
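The two bracket patterns can be tried out in Python as a rough illustration. The cleaning itself was done in OpenRefine, so this sketch assumes the patterns behave the same way under Python's `re` module. The function name `strip_brackets` and the sample text are made up.

```python
import re

# Pattern 1: brackets on a single line ("." does not match a newline by default).
SINGLE_LINE = re.compile(r"\[(.*?)\]")
# Pattern 2: brackets whose content contains at least one line break.
MULTI_LINE = re.compile(r"\[[^\]]*?(?:\r?\n)[^\]]*?\]")

def strip_brackets(lyrics: str) -> str:
    """Apply both bracket-removal patterns in the order described above."""
    lyrics = SINGLE_LINE.sub("", lyrics)
    return MULTI_LINE.sub("", lyrics)

text = "[Chorus]\nRide the storm\n[Solo:\nguitar]\nInto the night"
print(repr(strip_brackets(text)))  # '\nRide the storm\n\nInto the night'
```

The first pattern leaves `[Solo:\nguitar]` untouched because its content crosses a line break, which is why the second pass is needed.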
Removing Contributors' Notes: Notes likely from the original compilation source were identified and removed, affecting another 4,303 cells.
Regex-Search: Thanks to (.*?) lyrics.
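A small Python sketch of this step, again assuming the OpenRefine pattern carries over unchanged to Python's `re` module. The sample attribution line and the name `someuser` are invented for illustration.

```python
import re

# Pattern as given above; note the final "." is unescaped and matches any character.
ATTRIBUTION = re.compile(r"Thanks to (.*?) lyrics.")

def strip_attribution(lyrics: str) -> str:
    """Remove contributor notes and any trailing whitespace left behind."""
    return ATTRIBUTION.sub("", lyrics).rstrip()

text = "Burning skies above\nThanks to someuser for these lyrics."
print(repr(strip_attribution(text)))  # 'Burning skies above'
```

Because the trailing dot is unescaped, the pattern also consumes whatever single character follows "lyrics"; writing `lyrics\.` would restrict it to a literal full stop.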
Table: Overview of Data Cleansing Steps in the Large Metal Lyrics Archive (LMLA), Removal of Non-Lyric Elements; N total = 228,288 cells.
| Nr. | Step | Regex Pattern | Matching Cells |
| --- | --- | --- | --- |
| 1 | Remove Brackets (without Line Breaks) | \[(.*?)\] | 61,832 |
| 2 | Remove Brackets (with Line Breaks) | \[[^\]]*?(?:\r?\n)[^\]]*?\] | 458 |
| 3 | Remove Attributions | Thanks to (.*?) lyrics. | 4,303 |
Language Classification
The spaCy library was used to classify the language of each entry. The process added two columns, DetectedLanguage and Certainty, to the dataset: DetectedLanguage holds the identified language and Certainty a confidence score. For entries where no language could be classified, DetectedLanguage was set to the default value UNKNOWN and Certainty to 0.0.
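The fallback logic for the two columns can be sketched as follows. The detector itself is deliberately left abstract: `toy_detector` is a hypothetical stand-in, not the spaCy-based component used for the dataset, and `classify_language` is an illustrative name.

```python
from typing import Callable, Optional, Tuple

# A detector returns (language, certainty), or None if it cannot decide.
Detector = Callable[[str], Optional[Tuple[str, float]]]

def classify_language(lyrics: str, detect: Detector) -> dict:
    """Build the DetectedLanguage / Certainty values for one entry."""
    result = detect(lyrics) if lyrics.strip() else None
    if result is None:
        # Defaults described above for entries that cannot be classified.
        return {"DetectedLanguage": "UNKNOWN", "Certainty": 0.0}
    language, certainty = result
    return {"DetectedLanguage": language, "Certainty": certainty}

def toy_detector(text: str) -> Optional[Tuple[str, float]]:
    # Hypothetical stand-in: calls a text English if it contains the word "the".
    return ("en", 0.99) if "the" in text.lower().split() else None

print(classify_language("Ride the storm", toy_detector))
print(classify_language("", toy_detector))
```

Passing the detector in as a function keeps the default handling independent of whichever language-identification component is plugged in.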