Line-Level Code Vs Text Classification Dataset

Kaggle

@kaggle.violetakastreva_line_level_code_vs_text_classification_dataset

A Line-Level Dataset for Distinguishing Programming Code from Natural Language

Dataset Description

Task Definition

Binary line-level classification:

  • label = 0 — natural language text
  • label = 1 — programming code

The task is intentionally defined at line granularity, rather than at the document or block level, to support applications such as preprocessing pipelines, mixed-content filtering, and robustness evaluation.


Dataset Structure

The dataset consists of three columns:

| Column | Description |
|--------|-------------|
| line   | Raw line content (text or code). Punctuation, symbols, and formatting are preserved. |
| label  | Binary label (0 = text, 1 = code). |
| source | Data origin: stackoverflow_2019, stackoverflow_2020, or twitch. |

No aggressive normalization has been applied; punctuation, symbols, and formatting are preserved.
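
As a minimal sketch of working with this schema, the snippet below reads the three columns and validates the label and source values. It assumes the data is distributed as a CSV file with a header row `line,label,source`; the page does not state the file format, and `load_rows` is a hypothetical helper, not part of the dataset. The sample rows are illustrative and are not taken from the dataset.

```python
import csv
import io

VALID_SOURCES = {"stackoverflow_2019", "stackoverflow_2020", "twitch"}

def load_rows(handle):
    """Read (line, label, source) rows from an open CSV file and validate them."""
    rows = []
    for record in csv.DictReader(handle):
        label = int(record["label"])
        if label not in (0, 1):
            raise ValueError(f"unexpected label: {label!r}")
        if record["source"] not in VALID_SOURCES:
            raise ValueError(f"unexpected source: {record['source']!r}")
        # Keep the line exactly as stored: punctuation and formatting are preserved.
        rows.append((record["line"], label, record["source"]))
    return rows

# Tiny in-memory stand-in for the real file (assumed CSV layout).
sample = io.StringIO(
    "line,label,source\n"
    '"for (int i = 0; i < n; i++) {",1,stackoverflow_2020\n'
    "How do I reverse a list?,0,stackoverflow_2019\n"
)
rows = load_rows(sample)
```

With a real file, the same function could be called on `open(path, newline="", encoding="utf-8")`, where `path` is wherever the file was downloaded.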


Data Collection and Processing

Stack Overflow Data

The Stack Overflow portion of the dataset was created by parsing HTML post bodies and explicitly separating structural elements:

  • Code lines are extracted from <code> tags
  • Text lines are extracted from <p> (paragraph) tags
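
The extraction described above can be sketched with Python's standard `html.parser`. This is an illustration under assumptions, not the authors' pipeline: `PostBodyParser` and `split_post` are hypothetical names, and the sketch assumes that inline `<code>` inside a paragraph is counted as code and left out of the paragraph text, which the description does not specify.

```python
from html.parser import HTMLParser

class PostBodyParser(HTMLParser):
    """Collect character data inside <code> as code and inside <p> as text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.code_depth = 0
        self.p_depth = 0
        self.code_chunks = []
        self.text_chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "code":
            self.code_depth += 1
            self.code_chunks.append("")
        elif tag == "p":
            self.p_depth += 1
            self.text_chunks.append("")

    def handle_endtag(self, tag):
        if tag == "code" and self.code_depth:
            self.code_depth -= 1
        elif tag == "p" and self.p_depth:
            self.p_depth -= 1

    def handle_data(self, data):
        # Assumption: data inside <code> always counts as code, even within a <p>.
        if self.code_depth:
            self.code_chunks[-1] += data
        elif self.p_depth:
            self.text_chunks[-1] += data

def split_post(html_body):
    """Return (line, label) pairs: label 1 for code lines, 0 for text lines."""
    parser = PostBodyParser()
    parser.feed(html_body)
    parser.close()
    code = [ln for chunk in parser.code_chunks for ln in chunk.splitlines() if ln.strip()]
    text = [ln for chunk in parser.text_chunks for ln in chunk.splitlines() if ln.strip()]
    return [(ln, 1) for ln in code] + [(ln, 0) for ln in text]

body = (
    "<p>How do I sum a list?</p>"
    "<pre><code>total = 0\nfor x in xs:\n    total += x\n</code></pre>"
)
pairs = split_post(body)
```

Indentation inside code lines is kept, in line with the dataset's statement that formatting is preserved.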

To ensure that extracted code corresponds to actual programming content, posts are filtered by tag: code is retained only from posts tagged with at least one of the following technologies:

  • <c++>, <java>, <python>, <php>, <c#>, <javascript>, <c>, <go>
  • <react-native>, <laravel>, <django>, <typescript>, <node.js>, <.net-core>
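
A tag filter of this kind might look like the sketch below. It assumes each post carries its tags as a single string such as `<python><django>`, and that one matching tag is enough to keep a post; neither detail is confirmed by the description. `parse_tags` and `has_allowed_tag` are hypothetical helpers.

```python
import re

# The fourteen tags listed above.
ALLOWED_TAGS = {
    "c++", "java", "python", "php", "c#", "javascript", "c", "go",
    "react-native", "laravel", "django", "typescript", "node.js", ".net-core",
}

def parse_tags(tag_string):
    """Split a string such as '<python><django>' into ['python', 'django']."""
    return re.findall(r"<([^<>]+)>", tag_string)

def has_allowed_tag(tag_string):
    """True if at least one tag is in the allow-list (assumed 'any' semantics)."""
    return any(tag in ALLOWED_TAGS for tag in parse_tags(tag_string))
```

For example, `has_allowed_tag("<python><pandas>")` is true because of the `python` tag, while a post tagged only `<excel><vba>` would be dropped.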

Additional filtering steps:

  • Lines containing package installation commands (e.g. pip) are excluded
  • Lines dominated by error messages or stack traces are ignored
  • Duplicate lines are removed after normalization

This filtering aims to retain clean, representative programming code while avoiding noisy or auxiliary content.
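
The additional filtering steps could be approximated as follows. The exact rules used to build the dataset are not published, so every pattern here is an assumption: the install pattern covers only `pip install`, the stack-trace pattern covers a few common Python and Java trace shapes, and `normalize` (trim plus whitespace collapsing) is a guess at what "normalization" means for duplicate detection.

```python
import re

# Assumed patterns; the dataset's real rules are not documented.
INSTALL_RE = re.compile(r"\bpip3?\s+install\b")
TRACE_RE = re.compile(
    r'^\s*(Traceback \(most recent call last\):'
    r'|File ".+", line \d+'
    r'|at [\w.$]+\(.*\)$)'
)

def normalize(line):
    """Key used only for duplicate detection: trimmed, whitespace collapsed."""
    return " ".join(line.split())

def filter_lines(lines):
    """Drop install commands, trace lines, and duplicates; keep first occurrences."""
    seen = set()
    kept = []
    for line in lines:
        if INSTALL_RE.search(line) or TRACE_RE.search(line):
            continue
        key = normalize(line)
        if not key or key in seen:
            continue
        seen.add(key)
        kept.append(line)  # store the original, unnormalized line
    return kept

candidates = [
    "pip install requests",
    "import requests",
    "import  requests",
    'File "app.py", line 3, in <module>',
    "Traceback (most recent call last):",
    "r = requests.get(url)",
]
kept = filter_lines(candidates)
```

Normalization is applied only to the comparison key, so the stored lines keep their original spacing.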


Twitch Data

The Twitch portion consists exclusively of natural language text and is included to introduce informal, unstructured, and ambiguous language that is difficult to distinguish from code using surface-level heuristics.

Key properties:

  • No code labels are present in this subset
  • Duplicate lines are removed
  • Language is short, conversational, and noisy

The purpose of this source is to increase dataset difficulty and improve robustness to domain shift.
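
To illustrate why surface-level heuristics struggle on this kind of text, here is a naive symbol-density rule. It is a hypothetical baseline, not something shipped with the dataset; the symbol set, the 0.2 threshold, and the example lines are all made up for illustration.

```python
# Characters that are common in code; an arbitrary, illustrative choice.
CODE_SYMBOLS = set("{}()[];=<>+-*/&|!:")

def symbol_ratio(line):
    """Fraction of characters in the stripped line that are code-like symbols."""
    stripped = line.strip()
    if not stripped:
        return 0.0
    return sum(ch in CODE_SYMBOLS for ch in stripped) / len(stripped)

def looks_like_code(line, threshold=0.2):
    """Naive rule: call a line code if its symbol ratio reaches the threshold."""
    return symbol_ratio(line) >= threshold

examples = [
    "x = foo(bar[0]);",        # code: flagged as code
    ">>> gg!!! :) :)",         # chat-style text: also flagged as code
    "that was a great play",   # plain text: not flagged
]
flags = [looks_like_code(line) for line in examples]
```

The second example is the failure case: emoticons and repeated punctuation push a chat line over the threshold even though it contains no code.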


Dataset Statistics

Overall Size

  • Total lines: 52,270
  • Text (label 0): 27,330
  • Code (label 1): 24,940

The dataset is approximately balanced, enabling standard classification benchmarks without mandatory resampling.


Label Distribution by Source

| Source             | Text (0) | Code (1) |
|--------------------|----------|----------|
| stackoverflow_2019 | 14,698   | 0        |
| stackoverflow_2020 | 7,948    | 24,940   |
| twitch             | 4,684    | 0        |
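
A breakdown like the one in this table can be recomputed from the rows with a small counter. The sketch assumes rows are available as `(line, label, source)` tuples; `label_counts_by_source` is a hypothetical helper and the sample rows are illustrative.

```python
from collections import Counter

def label_counts_by_source(rows):
    """rows: iterable of (line, label, source). Returns {source: [n_text, n_code]}."""
    counts = Counter((source, label) for _, label, source in rows)
    sources = sorted({source for source, _ in counts})
    return {s: [counts[(s, 0)], counts[(s, 1)]] for s in sources}

sample_rows = [
    ("How do I reverse a list?", 0, "stackoverflow_2019"),
    ("xs.reverse()", 1, "stackoverflow_2020"),
    ("Try the built-in method.", 0, "stackoverflow_2020"),
    ("gg wp", 0, "twitch"),
]
table = label_counts_by_source(sample_rows)
```

Because all code lines come from stackoverflow_2020, a model that sees the source column could partly infer the label from it, so it is probably safer to leave `source` out of the input features.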

Source Distribution

  • Stack Overflow 2020: 32,888 lines (62.9%)
  • Stack Overflow 2019: 14,698 lines (28.1%)
  • Twitch: 4,684 lines (9.0%)
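
The shares above follow directly from the per-source counts. As a quick arithmetic check, using only the published numbers:

```python
# Published counts per source: text (label 0) and code (label 1).
text_counts = {"stackoverflow_2019": 14698, "stackoverflow_2020": 7948, "twitch": 4684}
code_counts = {"stackoverflow_2019": 0, "stackoverflow_2020": 24940, "twitch": 0}

totals = {s: text_counts[s] + code_counts[s] for s in text_counts}
grand_total = sum(totals.values())

# Percentage of all lines contributed by each source, rounded to one decimal.
shares = {s: round(100 * n / grand_total, 1) for s, n in totals.items()}

# Share of code lines overall, for the balance claim.
code_share = round(100 * sum(code_counts.values()) / grand_total, 1)
```

The totals add up to 52,270 lines, and code makes up about 47.7% of them, which is consistent with the dataset being described as approximately balanced.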

Line Length Statistics (Characters)

Overall:

  • Mean: 59.53
  • Median: 34
  • Minimum: 1
  • Maximum: 9,082
  • Standard deviation: 85.95

By source:

  • Stack Overflow 2019: mean 88.71, median 58
  • Stack Overflow 2020: mean 51.60, median 33
  • Twitch: mean 23.62, median 17
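
Statistics of this kind can be computed with the standard `statistics` module. The sketch below is an assumption about the method: the page does not say whether the reported standard deviation is the sample or population version, so the sample version is used here (with tens of thousands of lines the difference is negligible). `length_stats` is a hypothetical helper.

```python
import statistics

def length_stats(lines):
    """Character-length summary for a list of lines."""
    lengths = [len(line) for line in lines]
    return {
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "min": min(lengths),
        "max": max(lengths),
        # Sample standard deviation (assumed; needs at least two lines).
        "stdev": statistics.stdev(lengths),
    }

stats = length_stats(["ab", "abcd", "abcdef"])
```

In the reported figures the mean (59.53) sits well above the median (34), which points to a right-skewed distribution with a small number of very long lines, up to the 9,082-character maximum.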

Intended Use Cases

  • Line-level code vs text classification
  • Mixed-content preprocessing and filtering
  • Robustness evaluation across domains
  • Stylometry and authorship analysis
  • Code-aware language modeling pipelines

Limitations

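  • Labels are structural, not semantic: for Stack Overflow lines they reflect the HTML tag a line was extracted from, not what the line actually contains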
  • Code-like text outside <code> tags is labeled as text
  • Logs or pseudo-code inside code blocks are labeled as code
  • Context across adjacent lines is not preserved

Users should account for these properties when designing downstream tasks.


Data Sources and Licensing

This dataset is derived from publicly available sources.
Please refer to the following datasets for original data and licensing information:

  • Stack Overflow dataset: (link to be added)
  • Twitch dataset: (link to be added)

Users are responsible for complying with the original licenses when redistributing or building upon this dataset.

