
Arabic(Indian) Digits MADBase

Arabic numbers dataset

@kaggle.hossamahmedsalah_arabicindian_digits_madbase


About this Dataset

Arabic(Indian) Digits MADBase

This dataset contains flattened images, where each image is stored as a single row.

  • Objective: establish benchmark results for Arabic digit recognition using different classification techniques.
  • Objective: compare the performance of different classification techniques on Arabic and Latin digit recognition problems.
  • A valid comparison requires the Arabic and Latin digit databases to be in the same format.
  • A modified version of ADBase (MADBase), with the same size and format as MNIST, was created.
  • MADBase is derived from ADBase by size-normalizing each digit to a 20x20 box while preserving the aspect ratio.
  • The size-normalization procedure introduces gray levels because of its anti-aliasing filter.
  • MADBase and MNIST have the same size and format.
  • MNIST is a modified version of the NIST digits database.
  • MNIST is available for download.
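The MNIST-style size-normalization described above can be sketched as follows. This is an illustration, not the authors' exact pipeline; the function name and the simple geometric centering (MNIST actually centers by center of mass) are assumptions:

```python
import numpy as np
from PIL import Image

def madbase_normalize(img, box=20, canvas=28):
    """Fit a digit into a box x box region (preserving aspect ratio),
    then center it on a canvas x canvas grayscale field.
    The antialiased resize is what introduces the gray levels."""
    img = img.convert('L')
    w, h = img.size
    scale = box / max(w, h)                          # preserve aspect ratio
    new_w = max(1, round(w * scale))
    new_h = max(1, round(h * scale))
    digit = img.resize((new_w, new_h), Image.LANCZOS)  # anti-aliasing filter
    out = Image.new('L', (canvas, canvas), 0)
    out.paste(digit, ((canvas - new_w) // 2, (canvas - new_h) // 2))
    return np.array(out)

# e.g. a synthetic 60x40 "digit" scan
demo = Image.fromarray((np.random.rand(60, 40) * 255).astype(np.uint8))
out = madbase_normalize(demo)
print(out.shape)  # (28, 28)
```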
    I used the following code to turn the 70k Arabic digit images into tabular data, for ease of use and to spend less time on preprocessing:
import os

import numpy as np
import pandas as pd
from PIL import Image

# Define the root directory of the dataset
root_dir = "MAHD"

# Define the names of the folders containing the images: Part01 .. Part12
folder_names = ['Part{:02d}'.format(i) for i in range(1, 13)]

# Define the names of the subfolders containing the training and testing images
train_test_folders = ['MAHDBase_TrainingSet', 'test']

# Initialize empty lists to store the image data and labels
data = []
labels = []

# Loop over the Part folders inside the training and testing subfolders
for tt in train_test_folders:
    for folder_name in folder_names:
        # The test set only contains Part01 and Part02, so stop there
        if tt == train_test_folders[1] and folder_name == 'Part03':
            break
        subfolder_path = os.path.join(root_dir, tt, folder_name)
        for filename in os.listdir(subfolder_path):
            # Check the file format: skip anything that is not a BMP image
            if os.path.splitext(filename)[1].lower() != '.bmp':
                continue

            # Load the image
            img_path = os.path.join(subfolder_path, filename)
            img = Image.open(img_path)

            # Convert the image to grayscale and flatten it into a 1D array
            img_grey = img.convert('L')
            img_data = np.array(img_grey).flatten()

            # Extract the label from the "digitN" part of the filename
            label = int(filename.split('_')[2].replace('digit', '').split('.')[0])

            # Add the image data and label to the lists
            data.append(img_data)
            labels.append(label)

# Convert the image data and labels to a pandas DataFrame
df = pd.DataFrame(data)
df['label'] = labels
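Once the frame is assembled, it can be persisted so later runs skip decoding 70k BMPs entirely. A minimal sketch with a toy stand-in frame; the CSV filename is an assumption:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the assembled frame: 3 flattened 28x28 "images" + labels
data = [np.zeros(784, dtype=np.uint8) for _ in range(3)]
df = pd.DataFrame(data)
df['label'] = [0, 1, 2]

# Persist once; later runs read the CSV instead of re-decoding the images
df.to_csv('madbase_flat.csv', index=False)   # filename is an assumption
reloaded = pd.read_csv('madbase_flat.csv')
print(reloaded.shape)  # (3, 785): 784 pixel columns + 1 label column
```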

This dataset was made using data from
https://datacenter.aucegypt.edu/shazeem
which provides 2 databases:

  • ADBase
  • MADBase (✅ the one this dataset is derived from; similar in form to MNIST)

Tables

Mahd

@kaggle.hossamahmedsalah_arabicindian_digits_madbase.mahd
  • 21.63 MB
  • 70000 rows
  • 786 columns

CREATE TABLE mahd (
  "unnamed_0" BIGINT,  -- row index carried over from the CSV
  "n_0" BIGINT,        -- pixel columns "n_0" .. "n_783",
  "n_1" BIGINT,        -- one per pixel of the flattened 28x28 image
  ...
  "n_783" BIGINT,
  "label" BIGINT       -- digit class, 0-9
);
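A row of this table can be turned back into an image by dropping the index column, separating the label, and reshaping the 784 pixel values. A sketch on a toy row, assuming the pixel columns run n_0 .. n_783 with a trailing label column:

```python
import numpy as np
import pandas as pd

# Toy row in the table's layout: index column, 784 pixel columns, label
row = pd.Series({'unnamed_0': 0, **{f'n_{i}': 0 for i in range(784)}, 'label': 7})

pixels = row.drop(['unnamed_0', 'label']).to_numpy(dtype=np.uint8)
img = pixels.reshape(28, 28)       # 784 = 28 x 28, same geometry as MNIST
print(img.shape, row['label'])     # (28, 28) 7
```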
