Line-Level Code Vs Text Classification Dataset
@kaggle.violetakastreva_line_level_code_vs_text_classification_dataset
Binary line-level classification: each line is labeled as either text (0) or code (1).
The task is intentionally defined at the line granularity, rather than document or block level, to support applications such as preprocessing pipelines, mixed-content filtering, and robustness evaluation.
The dataset consists of three columns:
| Column | Description |
|---|---|
| line | Raw line content (text or code). Punctuation, symbols, and formatting are preserved. |
| label | Binary label (0 = text, 1 = code). |
| source | Data origin: stackoverflow_2019, stackoverflow_2020, or twitch. |
No aggressive normalization has been applied; punctuation, symbols, and formatting are preserved.
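The schema above can be checked at load time. The following is a minimal sketch using Python's standard `csv` module, assuming the data ships as a CSV file with a `line,label,source` header row (the file layout is an assumption, not something stated on this page). The sample rows and the `load_rows` helper are hypothetical and exist only for illustration.

```python
import csv
import io

EXPECTED_COLUMNS = ["line", "label", "source"]


def load_rows(fileobj):
    """Read rows, keep `line` exactly as stored, and convert `label` to int."""
    reader = csv.DictReader(fileobj)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    rows = []
    for record in reader:
        label = int(record["label"])
        if label not in (0, 1):
            raise ValueError(f"unexpected label: {label}")
        # No stripping or lowercasing: punctuation and formatting are preserved.
        rows.append({"line": record["line"], "label": label, "source": record["source"]})
    return rows


# Hypothetical sample standing in for the real file.
SAMPLE_CSV = (
    "line,label,source\n"
    '"for (int i = 0; i < n; i++) {",1,stackoverflow_2020\n'
    '"How do I reverse a list, in place?",0,stackoverflow_2019\n'
    "lol that was close!!,0,twitch\n"
)

rows = load_rows(io.StringIO(SAMPLE_CSV))
for row in rows:
    print(row["label"], row["source"], repr(row["line"]))
```

To read an actual file, replace the `io.StringIO` object with `open(path, newline="", encoding="utf-8")`, where `path` points to wherever the dataset was downloaded.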
The Stack Overflow portion of the dataset was created by parsing HTML post bodies and explicitly separating structural elements:
- `<code>` tags
- `<p>` (paragraph) tags

To ensure that extracted code corresponds to actual programming content, code blocks are filtered by post tags, retaining only posts associated with the following technologies:

- `<c++>`, `<java>`, `<python>`, `<php>`, `<c#>`, `<javascript>`, `<c>`, `<go>`
- `<react-native>`, `<laravel>`, `<django>`, `<typescript>`, `<node.js>`, `<.net-core>`

Additional filtering steps:

- … `pip`) are excluded

This filtering aims to retain clean, representative programming code while avoiding noisy or auxiliary content.
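The extraction described above could look roughly like the sketch below, which uses Python's standard `html.parser`. It is not the dataset author's pipeline. Several choices are assumptions: post tags are taken to arrive as a string such as `<python><django>`, posts without an allowed tag are skipped entirely, and a `<code>` element nested inside a `<p>` is kept as part of the paragraph text. The additional filtering steps are not reproduced, and the names `PostBodySplitter`, `has_allowed_tag`, and `extract_lines` are hypothetical.

```python
import re
from html.parser import HTMLParser

# Technologies listed above.
ALLOWED_TAGS = {
    "c++", "java", "python", "php", "c#", "javascript", "c", "go",
    "react-native", "laravel", "django", "typescript", "node.js", ".net-core",
}


class PostBodySplitter(HTMLParser):
    """Collect <p> content as text lines (0) and standalone <code> content as code lines (1)."""

    def __init__(self):
        super().__init__()
        self.in_p = 0
        self.in_code = 0
        self.buffer = []
        self.lines = []  # (line, label) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p += 1
        elif tag == "code":
            self.in_code += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            self.in_p -= 1
            if self.in_p == 0:
                self._flush(label=0)
        elif tag == "code" and self.in_code:
            self.in_code -= 1
            # Assumption: <code> nested in <p> stays part of the paragraph.
            if self.in_code == 0 and self.in_p == 0:
                self._flush(label=1)

    def handle_data(self, data):
        if self.in_p or self.in_code:
            self.buffer.append(data)

    def _flush(self, label):
        for line in "".join(self.buffer).splitlines():
            if line.strip():  # skip blank lines, keep indentation
                self.lines.append((line, label))
        self.buffer = []


def has_allowed_tag(post_tags):
    """post_tags is assumed to look like '<python><django>'."""
    return any(tag in ALLOWED_TAGS for tag in re.findall(r"<([^<>]+)>", post_tags))


def extract_lines(body_html, post_tags):
    if not has_allowed_tag(post_tags):
        return []  # assumption: the whole post is skipped
    parser = PostBodySplitter()
    parser.feed(body_html)
    parser.close()
    return parser.lines


body = (
    "<p>Use <code>len(x)</code> to get the size.</p>\n"
    "<pre><code>def f(x):\n    return len(x)\n</code></pre>"
)
for line, label in extract_lines(body, "<python><list>"):
    print(label, repr(line))
```

Keeping indentation in the emitted lines matches the "no aggressive normalization" property of the `line` column; whether blank lines are kept in the published data is not stated here, so dropping them is also an assumption.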
The Twitch portion consists exclusively of natural language text and is included to introduce informal, unstructured, and ambiguous language that is difficult to distinguish from code using surface-level heuristics.
Key properties:
The purpose of this source is to increase dataset difficulty and improve robustness to domain shift.
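To make "surface-level heuristics" concrete, here is a deliberately naive baseline: a hypothetical symbol-density rule, scored per source. The rule, its threshold, and the sample rows are all made up for illustration and are not part of the dataset; the point is only to show how informal chat lines with emoticons and brackets can trip such a rule while ordinary prose does not.

```python
from collections import defaultdict

# Characters that a naive rule might treat as "code-like".
CODE_SYMBOLS = set("{}();=<>[]")


def looks_like_code(line, threshold=0.1):
    """Return 1 if the share of code-like symbols reaches the threshold, else 0."""
    stripped = line.strip()
    if not stripped:
        return 0
    symbol_count = sum(ch in CODE_SYMBOLS for ch in stripped)
    return int(symbol_count / len(stripped) >= threshold)


def accuracy_by_source(rows):
    """Compute accuracy of the heuristic separately for each `source` value."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        total[row["source"]] += 1
        correct[row["source"]] += int(looks_like_code(row["line"]) == row["label"])
    return {source: correct[source] / total[source] for source in total}


# Hypothetical rows in the dataset's three-column layout.
sample_rows = [
    {"line": "int x = foo(y);", "label": 1, "source": "stackoverflow_2020"},
    {"line": "How do I parse this file?", "label": 0, "source": "stackoverflow_2019"},
    {"line": "gg <3 ;) (that was close)", "label": 0, "source": "twitch"},
    {"line": "lol nice one", "label": 0, "source": "twitch"},
]

scores = accuracy_by_source(sample_rows)
for source, accuracy in sorted(scores.items()):
    print(f"{source}: {accuracy:.2f}")
```

Reporting accuracy per source rather than as a single number is one way to see whether a model has learned the code/text distinction or merely the style of one source.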
The dataset is approximately balanced, enabling standard classification benchmarks without mandatory resampling.
| Source | Text (0) | Code (1) |
|---|---|---|
| stackoverflow_2019 | 14,698 | 0 |
| stackoverflow_2020 | 7,948 | 24,940 |
| twitch | 4,684 | 0 |
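The overall class balance follows directly from the table. The short sketch below recomputes it from the listed counts; nothing beyond the table is assumed.

```python
# Counts copied from the table above.
COUNTS = {
    "stackoverflow_2019": {"text": 14_698, "code": 0},
    "stackoverflow_2020": {"text": 7_948, "code": 24_940},
    "twitch": {"text": 4_684, "code": 0},
}

total_text = sum(c["text"] for c in COUNTS.values())
total_code = sum(c["code"] for c in COUNTS.values())
total = total_text + total_code

text_share = total_text / total
code_share = total_code / total

print(f"text: {total_text} ({text_share:.1%})")  # text: 27330 (52.3%)
print(f"code: {total_code} ({code_share:.1%})")  # code: 24940 (47.7%)

# Per-source view: which sources contain both classes?
mixed_sources = [name for name, c in COUNTS.items() if c["text"] and c["code"]]
print("sources with both classes:", mixed_sources)
```

The balance holds only in aggregate: all code lines come from stackoverflow_2020, while stackoverflow_2019 and twitch contribute text only, so the `source` column alone predicts the label for those two sources.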
Overall:
By source:
- … `<code>` tags is labeled as text

Users should account for these properties when designing downstream tasks.
This dataset is derived from publicly available sources.
Please refer to the following datasets for original data and licensing information:
Users are responsible for complying with the original licenses when redistributing or building upon this dataset.