A Line-Level Dataset for Distinguishing Programming Code from Natural Language

Task Definition

Binary line-level classification:

label = 0 — natural language text
label = 1 — programming code

The task is intentionally defined at line granularity, rather than at the document or block level, to support applications such as preprocessing pipelines, mixed-content filtering, and robustness evaluation.

Dataset Structure

The dataset consists of three columns:

Column	Description
`line`	Raw line content (text or code). Punctuation, symbols, and formatting are preserved.
`label`	Binary label (0 = text, 1 = code).
`source`	Data origin: `stackoverflow_2019`, `stackoverflow_2020`, or `twitch`.

No aggressive normalization has been applied to the `line` column; each line is stored as extracted.
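A minimal sketch of reading the three columns is shown here. It assumes the data is distributed as a CSV file with a header row; the file format and the sample rows are assumptions made for illustration, not part of the dataset description.

```python
import csv
import io

# Hypothetical sample in the three-column layout described above.
# In practice, pass an open file handle for the real dataset file instead.
sample = io.StringIO(
    "line,label,source\n"
    "\"for (int i = 0; i < n; i++) {\",1,stackoverflow_2020\n"
    "How do I reverse a list?,0,stackoverflow_2019\n"
    "lol that was close,0,twitch\n"
)

def load_rows(handle):
    """Read rows, converting `label` to int and leaving `line` untouched."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "line": row["line"],
            "label": int(row["label"]),
            "source": row["source"],
        })
    return rows

rows = load_rows(sample)
for r in rows:
    print(r["label"], r["source"], repr(r["line"]))
```

Because lines are stored raw, it is probably safest to avoid stripping or lowercasing `line` at load time and to leave any normalization to the downstream model.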

Data Collection and Processing

Stack Overflow Data

The Stack Overflow portion of the dataset was created by parsing HTML post bodies and explicitly separating structural elements:

Code lines are extracted from <code> tags
Text lines are extracted from <p> (paragraph) tags
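The extraction step above can be sketched with Python's standard `html.parser`. This is an illustration under stated assumptions, not the authors' actual pipeline: the class name `PostBodySplitter`, the helper `split_post`, and the choice to treat inline `<code>` inside a paragraph as code are all hypothetical.

```python
from html.parser import HTMLParser

class PostBodySplitter(HTMLParser):
    """Collect character data inside <code> as code and inside <p> as text."""

    def __init__(self):
        super().__init__()
        self.code_depth = 0
        self.p_depth = 0
        self.code_chunks = []
        self.text_chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "code":
            self.code_depth += 1
            self.code_chunks.append("")
        elif tag == "p":
            self.p_depth += 1
            self.text_chunks.append("")

    def handle_endtag(self, tag):
        if tag == "code" and self.code_depth:
            self.code_depth -= 1
        elif tag == "p" and self.p_depth:
            self.p_depth -= 1

    def handle_data(self, data):
        # Assumption: <code> wins over <p>, so inline code counts as code.
        if self.code_depth:
            self.code_chunks[-1] += data
        elif self.p_depth:
            self.text_chunks[-1] += data

def split_post(html_body):
    """Return (text_lines, code_lines) for one HTML post body."""
    parser = PostBodySplitter()
    parser.feed(html_body)
    parser.close()
    # Code lines keep their indentation; blank lines are dropped.
    code_lines = [ln for chunk in parser.code_chunks
                  for ln in chunk.splitlines() if ln.strip()]
    text_lines = [ln.strip() for chunk in parser.text_chunks
                  for ln in chunk.splitlines() if ln.strip()]
    return text_lines, code_lines

body = ("<p>How do I print a list?</p>"
        "<pre><code>for x in xs:\n    print(x)\n</code></pre>")
text_lines, code_lines = split_post(body)
print(text_lines)
print(code_lines)
```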

To ensure that extracted code corresponds to actual programming content, code blocks are filtered by post tags, retaining only posts associated with the following technologies:

<c++>, <java>, <python>, <php>, <c#>, <javascript>, <c>, <go>
<react-native>, <laravel>, <django>, <typescript>, <node.js>, <.net-core>
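A tag filter along these lines could look as follows. The sketch assumes that a post's tags arrive as a single string of the form `<tag1><tag2>` and that a post is kept when at least one tag is in the list; both points are assumptions, and the names `ALLOWED_TAGS`, `parse_tags`, and `has_allowed_tag` are hypothetical.

```python
import re

# Tag names copied from the list above.
ALLOWED_TAGS = {
    "c++", "java", "python", "php", "c#", "javascript", "c", "go",
    "react-native", "laravel", "django", "typescript", "node.js", ".net-core",
}

def parse_tags(tag_string):
    """Split a string such as '<python><django>' into ['python', 'django']."""
    return re.findall(r"<([^<>]+)>", tag_string)

def has_allowed_tag(tag_string):
    """True if any tag on the post is in ALLOWED_TAGS (assumed any-match rule)."""
    return any(tag in ALLOWED_TAGS for tag in parse_tags(tag_string))

print(has_allowed_tag("<python><pandas>"))
print(has_allowed_tag("<r><ggplot2>"))
```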

Additional filtering steps:

Lines containing package installation commands (e.g. pip) are excluded
Lines dominated by error messages or stack traces are ignored
Duplicate lines are removed after normalization

This filtering aims to retain clean, representative programming code while avoiding noisy or auxiliary content.
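The additional filtering steps might be approximated as below. The regular expressions are illustrative guesses: the description only names pip as an example of an installation command and does not define how error messages or stack traces are detected. The sketch also assumes that "normalization" means collapsing whitespace for the duplicate check only, so that the stored line stays raw.

```python
import re

# Assumed patterns; the real rules are not specified in the description.
INSTALL_RE = re.compile(
    r"^\s*(?:\$\s*)?(?:sudo\s+)?(?:python3?\s+-m\s+)?(?:pip3?|npm)\s+install\b"
)
TRACE_RE = re.compile(
    r'^\s*(?:Traceback \(most recent call last\):'
    r'|File ".*", line \d+'
    r'|at [\w.$<>]+\(.*\))'
)

def dedup_key(line):
    """Whitespace-collapsed form, used only to detect duplicates."""
    return " ".join(line.split())

def filter_code_lines(lines):
    """Drop install commands, trace-like lines, and duplicates (first one wins)."""
    seen = set()
    kept = []
    for line in lines:
        if INSTALL_RE.search(line) or TRACE_RE.search(line):
            continue
        key = dedup_key(line)
        if not key or key in seen:
            continue
        seen.add(key)
        kept.append(line)  # keep the raw line, not the normalized key
    return kept

candidates = [
    "pip install requests",
    "import requests",
    "import  requests",
    "Traceback (most recent call last):",
    '  File "app.py", line 3, in <module>',
    "    at com.example.Main.run(Main.java:12)",
    "print(resp.status_code)",
]
kept = filter_code_lines(candidates)
print(kept)
```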

Twitch Data

The Twitch portion consists exclusively of natural language text and is included to introduce informal, unstructured, and ambiguous language that is difficult to distinguish from code using surface-level heuristics.

Key properties:

No code labels are present in this subset
Duplicate lines are removed
Language is short, conversational, and noisy

The purpose of this source is to increase dataset difficulty and improve robustness to domain shift.
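To illustrate why surface-level heuristics struggle on this kind of text, here is a small sketch of a symbol-density rule. The threshold and the example lines are invented for illustration and are not taken from the dataset.

```python
def symbol_ratio(line):
    """Fraction of non-space characters that are neither letters nor digits."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 0.0
    return sum(not c.isalnum() for c in chars) / len(chars)

def looks_like_code(line, threshold=0.3):
    """Naive heuristic: call a line code if it is dense in symbols."""
    return symbol_ratio(line) >= threshold

examples = [
    "x = arr[i] + 1;",    # code, flagged as code
    "gg wp",              # chat, flagged as text
    "!drops ???? :D <3",  # chat, but symbol-heavy
    "return value",       # code, but made of plain words
]
for line in examples:
    print(looks_like_code(line), repr(line))
```

The last two lines are the interesting cases: symbol-heavy chat is flagged as code, and a word-only statement is flagged as text.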

Dataset Statistics

Overall Size

Total lines: 52,270
Text (label 0): 27,330
Code (label 1): 24,940

The dataset is approximately balanced (52.3% text, 47.7% code), enabling standard classification benchmarks without mandatory resampling.

Label Distribution by Source

Source	Text (0)	Code (1)
stackoverflow_2019	14,698	0
stackoverflow_2020	7,948	24,940
twitch	4,684	0

Source Distribution

Stack Overflow 2020: 32,888 lines (62.9%)
Stack Overflow 2019: 14,698 lines (28.1%)
Twitch: 4,684 lines (9.0%)
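The counts above can be cross-checked with a few lines of arithmetic. The numbers are copied from the tables in this section; the variable names are arbitrary.

```python
# Per-source counts from the "Label Distribution by Source" table.
counts = {
    "stackoverflow_2019": {"text": 14698, "code": 0},
    "stackoverflow_2020": {"text": 7948, "code": 24940},
    "twitch": {"text": 4684, "code": 0},
}

total_text = sum(c["text"] for c in counts.values())
total_code = sum(c["code"] for c in counts.values())
total = total_text + total_code

# Share of each source in the full dataset, in percent.
shares = {
    source: round(100 * (c["text"] + c["code"]) / total, 1)
    for source, c in counts.items()
}

print(total_text, total_code, total)
print(shares)
```

Note that every code line comes from `stackoverflow_2020`, so `source` is partly confounded with `label`; splits or features that expose the source should be used with that in mind.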

Line Length Statistics (Characters)

Overall:

Mean: 59.53
Median: 34
Minimum: 1
Maximum: 9,082
Standard deviation: 85.95

By source:

Stack Overflow 2019: mean 88.71, median 58
Stack Overflow 2020: mean 51.60, median 33
Twitch: mean 23.62, median 17
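Statistics of this kind can be recomputed from the `line` column roughly as follows. The sample lines are made up, and the use of the sample standard deviation is an assumption, since the description does not say which variant was reported.

```python
import statistics

def length_stats(lines):
    """Character-length summary for a list of lines."""
    lengths = [len(s) for s in lines]
    return {
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "min": min(lengths),
        "max": max(lengths),
        # Sample standard deviation (n - 1 denominator); an assumption.
        "stdev": statistics.stdev(lengths) if len(lengths) > 1 else 0.0,
    }

# Hypothetical lines, one per source style.
lines = ["gg", "print(x)", "How do I reverse a list in Python?"]
stats = length_stats(lines)
print(stats)
```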

Intended Use Cases

Line-level code vs text classification
Mixed-content preprocessing and filtering
Robustness evaluation across domains
Stylometry and authorship analysis
Code-aware language modeling pipelines
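For the robustness use case, one possible setup is to hold out an entire source and evaluate on it. The sketch below is only an illustration: the rows, the helper names, and the placeholder predictor `toy_predict` are hypothetical.

```python
def split_by_source(rows, held_out):
    """Return (train, test), where test holds every row from `held_out`."""
    train = [r for r in rows if r["source"] != held_out]
    test = [r for r in rows if r["source"] == held_out]
    return train, test

def false_positive_rate(predict, rows):
    """Share of text rows (label 0) that `predict` marks as code (1)."""
    text_rows = [r for r in rows if r["label"] == 0]
    if not text_rows:
        return 0.0
    return sum(predict(r["line"]) == 1 for r in text_rows) / len(text_rows)

# Hypothetical rows in the dataset's three-column layout.
rows = [
    {"line": "x = foo(1);", "label": 1, "source": "stackoverflow_2020"},
    {"line": "Why does this fail?", "label": 0, "source": "stackoverflow_2019"},
    {"line": "gg wp", "label": 0, "source": "twitch"},
    {"line": "lol ;) <3", "label": 0, "source": "twitch"},
]

def toy_predict(line):
    # Placeholder model: flag any line containing ';' as code.
    return 1 if ";" in line else 0

train, test = split_by_source(rows, held_out="twitch")
fpr = false_positive_rate(toy_predict, test)
print(len(train), len(test), fpr)
```

Since `twitch` and `stackoverflow_2019` contain no code lines, only the false positive rate is meaningful when either of them is the held-out source.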

Limitations

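Labels are structural, not semantic: a label reflects where a line was extracted from (HTML element or source), not what the line contains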
Code-like text outside <code> tags is labeled as text
Logs or pseudo-code inside code blocks are labeled as code
Line-level context across adjacent lines is not preserved

Users should account for these properties when designing downstream tasks.

Data Sources and Licensing

This dataset is derived from publicly available sources.
Please refer to the following datasets for original data and licensing information:

Stack Overflow dataset: (link to be added)
Twitch dataset: (link to be added)

Users are responsible for complying with the original licenses when redistributing or building upon this dataset.
