A Line-Level Dataset for Distinguishing Programming Code from Natural Language
Dataset Description
Task Definition
Binary line-level classification:
- label = 0 — natural language text
- label = 1 — programming code
The task is intentionally defined at line granularity, rather than at the document or block level, to support applications such as preprocessing pipelines, mixed-content filtering, and robustness evaluation.
Dataset Structure
The dataset consists of three columns:
| Column | Description |
|---|---|
| line | Raw line content (text or code). Punctuation, symbols, and formatting are preserved. |
| label | Binary label (0 = text, 1 = code). |
| source | Data origin: stackoverflow_2019, stackoverflow_2020, or twitch. |
No aggressive normalization has been applied; punctuation, symbols, and formatting are preserved.
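As a rough illustration of the three-column layout, the sketch below reads rows and checks them against the schema described above. The CSV serialization, the `read_rows` helper, and the three sample rows are assumptions made for this example; the card does not state the file format, and the sample lines are not taken from the dataset.

```python
import csv
import io

# Source values listed in the column description above.
ALLOWED_SOURCES = {"stackoverflow_2019", "stackoverflow_2020", "twitch"}


def read_rows(handle):
    """Yield (line, label, source) tuples, validating each row (hypothetical helper)."""
    for row in csv.DictReader(handle):
        label = int(row["label"])
        if label not in (0, 1):
            raise ValueError(f"unexpected label: {label}")
        if row["source"] not in ALLOWED_SOURCES:
            raise ValueError(f"unexpected source: {row['source']}")
        # The line is returned untouched, so punctuation and spacing survive.
        yield row["line"], label, row["source"]


# Made-up sample standing in for the real file (assumed to be CSV).
SAMPLE = io.StringIO(
    "line,label,source\n"
    '"for (int i = 0; i < n; i++) {",1,stackoverflow_2020\n'
    '"How do I reverse a list?",0,stackoverflow_2019\n'
    '"lol nice play",0,twitch\n'
)

rows = list(read_rows(SAMPLE))
for line, label, source in rows:
    print(label, source, repr(line))
```

Validating on read keeps schema problems visible early. If the data is distributed in another format, only the reader needs to change.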
Data Collection and Processing
Stack Overflow Data
The Stack Overflow portion of the dataset was created by parsing HTML post bodies and explicitly separating structural elements:
- Code lines are extracted from <code> tags
- Text lines are extracted from <p> (paragraph) tags
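The tag-based separation can be sketched with Python's standard `html.parser` module. This is a minimal illustration, not the pipeline used to build the dataset: the `LineExtractor` class and the sample post body are hypothetical, and treating inline <code> inside a <p> as code (which splits the paragraph) is an assumption that may differ from the original processing.

```python
from html.parser import HTMLParser


class LineExtractor(HTMLParser):
    """Collect (line, label) pairs: 1 for text inside <code>, 0 for text inside <p>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._open = []    # currently open <code>/<p> tags, innermost last
        self._buffer = []  # text collected for the innermost open tag

    def handle_starttag(self, tag, attrs):
        if tag in ("code", "p"):
            self._flush()
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open:
            self._flush()
            # Drop the closed tag and anything left open inside it.
            while self._open.pop() != tag:
                pass

    def handle_data(self, data):
        if self._open:
            self._buffer.append(data)

    def _flush(self):
        text, self._buffer = "".join(self._buffer), []
        if not self._open:
            return
        label = 1 if self._open[-1] == "code" else 0
        # Split into lines and skip blank ones; content is otherwise kept as-is.
        for line in text.splitlines():
            if line.strip():
                self.rows.append((line, label))


# Made-up post body for illustration only.
html = "<p>How do I loop?</p><pre><code>for x in xs:\n    print(x)\n</code></pre>"
parser = LineExtractor()
parser.feed(html)
parser.close()
print(parser.rows)
```

Indentation inside the code block is kept, in line with the statement that formatting is preserved.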
To ensure that extracted code corresponds to actual programming content, code blocks are filtered by post tags, retaining only posts associated with the following technologies:
<c++>, <java>, <python>, <php>, <c#>, <javascript>, <c>, <go>, <react-native>, <laravel>, <django>, <typescript>, <node.js>, <.net-core>
Additional filtering steps:
- Lines containing package installation commands (e.g. pip) are excluded
- Lines dominated by error messages or stack traces are ignored
- Duplicate lines are removed after normalization
This filtering aims to retain clean, representative programming code while avoiding noisy or auxiliary content.
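The additional filtering steps could look roughly like the sketch below. The exact rules used for the dataset are not documented here, so everything specific is an assumption: the two regular expressions (a `pip install` pattern and Python-style traceback lines), the whitespace-collapsing `normalize` function, and the `filter_code_lines` helper are all hypothetical.

```python
import re

# Assumed patterns; the dataset's actual rules are not specified.
INSTALL_RE = re.compile(r"^\s*(?:\$\s*)?pip3?\s+install\b")
TRACE_RE = re.compile(r'^\s*(Traceback \(most recent call last\)|File ".*", line \d+)')


def normalize(line):
    """Collapse runs of whitespace; used only as a deduplication key (assumption)."""
    return " ".join(line.split())


def filter_code_lines(lines):
    """Drop install commands, traceback lines, and duplicates; keep original text."""
    seen = set()
    kept = []
    for line in lines:
        if INSTALL_RE.search(line) or TRACE_RE.search(line):
            continue
        key = normalize(line)
        if not key or key in seen:
            continue
        seen.add(key)
        kept.append(line)  # the unnormalized line is what gets stored
    return kept


candidates = [
    "pip install requests",
    "x = 1",
    "x  =  1",
    "Traceback (most recent call last):",
    '  File "main.py", line 3, in <module>',
    "print(x)",
]
print(filter_code_lines(candidates))
```

Using the normalized form only as a lookup key means deduplication does not alter the stored lines, which fits the statement that no aggressive normalization is applied to the content itself.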
Twitch Data
The Twitch portion consists exclusively of natural language text and is included to introduce informal, unstructured, and ambiguous language that is difficult to distinguish from code using surface-level heuristics.
Key properties:
- No code labels are present in this subset
- Duplicate lines are removed
- Language is short, conversational, and noisy
The purpose of this source is to increase dataset difficulty and improve robustness to domain shift.
Dataset Statistics
Overall Size
- Total lines: 52,270
- Text (label 0): 27,330
- Code (label 1): 24,940
The dataset is approximately balanced, enabling standard classification benchmarks without mandatory resampling.
Label Distribution by Source
| Source | Text (0) | Code (1) |
|---|---|---|
| stackoverflow_2019 | 14,698 | 0 |
| stackoverflow_2020 | 7,948 | 24,940 |
| twitch | 4,684 | 0 |
Source Distribution
- Stack Overflow 2020: 32,888 lines (62.9%)
- Stack Overflow 2019: 14,698 lines (28.1%)
- Twitch: 4,684 lines (9.0%)
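The totals and percentages above follow directly from the per-source label counts. The short check below recomputes them from the table; the variable names are illustrative only.

```python
# (text lines, code lines) per source, copied from the label distribution table.
counts = {
    "stackoverflow_2019": (14698, 0),
    "stackoverflow_2020": (7948, 24940),
    "twitch": (4684, 0),
}

total_text = sum(text for text, _ in counts.values())
total_code = sum(code for _, code in counts.values())
total = total_text + total_code

# Share of all lines contributed by each source, in percent.
shares = {
    source: round(100 * (text + code) / total, 1)
    for source, (text, code) in counts.items()
}

print(total_text, total_code, total)  # 27330 24940 52270
print(shares)
print(round(100 * total_code / total, 1))  # code share of all lines: 47.7
```

Note that every code line comes from stackoverflow_2020, so the source column is strongly correlated with the label and is best treated as metadata rather than as an input feature.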
Line Length Statistics (Characters)
Overall:
- Mean: 59.53
- Median: 34
- Minimum: 1
- Maximum: 9,082
- Standard deviation: 85.95
By source:
- Stack Overflow 2019: mean 88.71, median 58
- Stack Overflow 2020: mean 51.60, median 33
- Twitch: mean 23.62, median 17
Intended Use Cases
- Line-level code vs text classification
- Mixed-content preprocessing and filtering
- Robustness evaluation across domains
- Stylometry and authorship analysis
- Code-aware language modeling pipelines
Limitations
- Labels are structural, not semantic
- Code-like text outside <code> tags is labeled as text
- Logs or pseudo-code inside code blocks are labeled as code
- Context from adjacent lines is not preserved; each line is an independent example
Users should account for these properties when designing downstream tasks.
Data Sources and Licensing
This dataset is derived from publicly available sources.
Please refer to the following datasets for original data and licensing information:
- Stack Overflow dataset: (link to be added)
- Twitch dataset: (link to be added)
Users are responsible for complying with the original licenses when redistributing or building upon this dataset.