Dataset: Named Entity Recognition (NER) Corpus

About this Dataset

Task

Named Entity Recognition (NER) is the task of identifying entities in a text and classifying them into categories such as persons, locations, and organizations.

Dataset

Each row in the CSV file contains a complete sentence, a list of POS tags (one per word in the sentence), and a list of NER tags (one per word in the sentence).

You can use a pandas DataFrame to read and manipulate this dataset.

Because the list-valued columns are stored as text in the CSV file, reading the file with pandas.read_csv() returns each tag list as a string rather than a list:

>>> data['tag'][0]
"['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']"
>>> type(data['tag'][0])
<class 'str'>

You can use ast.literal_eval to convert the string back to a list:

>>> from ast import literal_eval
>>> literal_eval(data['tag'][0])
['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']
>>> type(literal_eval(data['tag'][0]))
<class 'list'>

Acknowledgements

This dataset is derived from the Annotated Corpus for Named Entity Recognition dataset by Abhinav Walia, which was then processed into the format described above.

Annotated Corpus for Named Entity Recognition is built on the GMB (Groningen Meaning Bank) corpus. It is annotated for entity classification and extended with features commonly used in natural language processing.

Essential info about entities:

geo = Geographical Entity
org = Organization
per = Person
gpe = Geopolitical Entity
tim = Time indicator
art = Artifact
eve = Event
nat = Natural Phenomenon
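
The tags shown earlier combine a prefix with one of these entity codes (for example, B-geo), and O marks words outside any entity. The sketch below groups tagged words into entity spans. It assumes a BIO-style scheme in which B- starts an entity and I- continues it; the I- prefix does not appear in the sample above, so treat it as an assumption. The function name extract_entities and the sample sentence are hypothetical.

```python
def extract_entities(words, tags):
    """Group BIO-tagged words into (entity_text, entity_type) pairs."""
    entities = []
    current_words, current_type = [], None
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A new entity begins; flush any entity still in progress.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [word], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the entity currently being built.
            current_words.append(word)
        else:
            # "O" or an I- tag that does not match: close the current entity.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


# Invented example sentence, already split into words.
words = ["Barack", "Obama", "visited", "Paris", "in", "July"]
tags = ["B-per", "I-per", "O", "B-geo", "O", "B-tim"]
entities = extract_entities(words, tags)
print(entities)
```

Each returned pair holds the entity text and its code from the list above, such as per or geo.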

Tables

Ner

@kaggle.naseralqaydeh_named_entity_recognition_ner_corpus.ner

6.66 MB
47959 rows
4 columns


CREATE TABLE ner (
  "sentence" VARCHAR,
  "sentence_3e3809" VARCHAR,
  "pos" VARCHAR,
  "tag" VARCHAR
);