Collection
Because no well-structured dataset exists for the problem of understanding poems, we created one ourselves.
Because writing style and expression differ so much between ancient and modern times, people today often find Chinese classical poetry hard to study and understand. Our goal is to restate Chinese classical poems in modern Chinese. Since translating modern language into other languages is relatively straightforward, this approach can also make Chinese classical poetry easier to understand across languages😊.
Our dataset therefore includes two key components: the original text and its restatement in modern Chinese.
We wrote a script, fetch.py, to fetch our data from 古诗词网. Because fetching is time-consuming, we only fetched the first 350 pages of the website. If you want more training data, you can make a few small changes to the script and continue fetching🧐. Our dataset is also available at: Chinese Poems with Chinese Annotations.
Preprocessing
As far as we understand, small LLMs tend to perform poorly on long-text understanding problems, and some Chinese poems are very long. We therefore divide each poem into sentences and use an index attribute to record the position of each sentence in the original poem. If you want to try training larger models such as Gemma-2-27b, you can use this attribute to concatenate the sentences back into a complete poem😉.
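The splitting and re-joining described above can be sketched roughly as follows. This is only an illustration, not the script we actually used: the field names `title` and `content`, the helper names, and the choice of sentence-ending punctuation are all assumptions; only the `index` attribute comes from the description above.

```python
import re
from collections import defaultdict

# Punctuation treated as ending a sentence (an assumption for this sketch).
_SENTENCE_END = "。！？；"


def split_poem(title, text):
    """Split a poem into sentence records that remember their position."""
    # Each match is a run of non-terminators plus an optional terminator,
    # so the punctuation stays attached to the sentence it closes.
    parts = re.findall(rf"[^{_SENTENCE_END}]+[{_SENTENCE_END}]?", text)
    sentences = [p.strip() for p in parts if p.strip()]
    return [
        {"title": title, "index": i, "content": s}
        for i, s in enumerate(sentences)
    ]


def rebuild_poems(records):
    """Concatenate sentence records back into full poems, ordered by index."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["title"]].append(record)
    return {
        title: "".join(r["content"] for r in sorted(rows, key=lambda r: r["index"]))
        for title, rows in grouped.items()
    }


poem = "床前明月光，疑是地上霜。举头望明月，低头思故乡。"
records = split_poem("静夜思", poem)
# Order of the records does not matter when rebuilding, because of the index.
restored = rebuild_poems(list(reversed(records)))
```

Grouping by `title` alone is a simplification; if two poems share a title, a unique poem identifier would be needed instead.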
The dynasty is also an important attribute, because the writing styles and modes of expression of poets from different dynasties vary significantly. We believe that poems from the same dynasty share similar expression styles, so their data can be treated as coming from the same distribution. In a fine-tuning task, it is important to use data from the same distribution to avoid introducing noise into the training process. This attribute therefore helps users fine-tune Gemma 2 on Chinese poems from a specific period😉.
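As a minimal sketch of how the dynasty attribute might be used to build period-specific training subsets, the records can be grouped before fine-tuning. The field name `dynasty`, the example values, and the helper name are assumptions for illustration and may not match the actual column names in the dataset.

```python
from collections import defaultdict


def group_by_dynasty(records):
    """Bucket sentence records by their dynasty attribute."""
    groups = defaultdict(list)
    for record in records:
        groups[record["dynasty"]].append(record)
    return dict(groups)


# Toy records with hypothetical field names and values.
sample = [
    {"dynasty": "唐", "index": 0, "content": "床前明月光，疑是地上霜。"},
    {"dynasty": "宋", "index": 0, "content": "明月几时有？"},
    {"dynasty": "唐", "index": 1, "content": "举头望明月，低头思故乡。"},
]

subsets = group_by_dynasty(sample)
# Pick a single period as the fine-tuning set.
tang_only = subsets["唐"]
```

Each subset can then be passed to the fine-tuning pipeline on its own, so that one run only sees poems from a single period.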