Arabic(Indian) Digits MADBase

Kaggle

@kaggle.hossamahmedsalah_arabicindian_digits_madbase

Arabic numbers dataset

Dataset Description

This dataset contains flattened images: each row holds the pixel values of one image, plus a label column.

  • Objective: Establish benchmark results for Arabic digit recognition using different classification techniques.
  • Objective: Compare performances of different classification techniques on Arabic and Latin digit recognition problems.
  • Valid comparison requires Arabic and Latin digit databases to be in the same format.
  • A modified version of ADBase (MADBase) with the same size and format as MNIST was created.
  • MADBase is derived from ADBase by size-normalizing each digit to fit a 20x20 box while preserving its aspect ratio.
  • The size-normalization procedure produces gray levels because of the anti-aliasing filter.
  • MADBase and MNIST have the same size and format.
  • MNIST is a modified version of the NIST digits database.
  • MNIST is available for download.
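
The size-normalization step described above can be illustrated with a minimal sketch. This is not the MADBase authors' procedure: the function name `normalize_digit`, the 28x28 canvas (taken from MNIST's image size), the Lanczos filter, and the geometric centering are all assumptions made for illustration. MNIST itself centers digits by center of mass.

```python
from PIL import Image


def normalize_digit(img, box=20, canvas=28):
    """Fit a digit image into a box x box area, keeping its aspect ratio,
    then place it in the middle of a canvas x canvas black image.

    Hypothetical helper: box=20 follows the description above, while
    canvas=28 and geometric centering are assumptions.
    """
    gray = img.convert("L")
    w, h = gray.size
    # Scale so that the longer side becomes exactly `box` pixels.
    scale = box / max(w, h)
    new_size = (max(1, round(w * scale)), max(1, round(h * scale)))
    # An anti-aliasing filter such as Lanczos introduces intermediate gray levels
    # along stroke edges when the input is a black-and-white image.
    resized = gray.resize(new_size, Image.LANCZOS)
    out = Image.new("L", (canvas, canvas), 0)
    offset = ((canvas - new_size[0]) // 2, (canvas - new_size[1]) // 2)
    out.paste(resized, offset)
    return out


# Usage with a synthetic 40x80 white rectangle standing in for a digit.
demo = Image.new("L", (40, 80), 255)
normalized = normalize_digit(demo)
```

The longer side is scaled to 20 pixels and the shorter side follows proportionally, so a tall, narrow digit stays tall and narrow inside the box.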

I used this code to turn the 70k Arabic digit images into tabular data, for ease of use and to spend less time on preprocessing:

import os

import numpy as np
import pandas as pd
from PIL import Image

# Define the root directory of the dataset
root_dir = "MAHD"

# Define the names of the folders containing the images
folder_names = ['Part{:02d}'.format(i) for i in range(1, 13)]
# folder_names = ['Part{}'.format(i) if i>9 else 'Part0{}'.format(i)  for i in range(1, 13)]


# Define the names of the subfolders containing the training and testing images
train_test_folders = ['MAHDBase_TrainingSet', 'test']

# Initialize an empty list to store the image data and labels
data = []
labels = []

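# Loop over the Part folders inside the training and testing folders
for tt in train_test_folders:
   for folder_name in folder_names:
       # Stop at Part03 for the test split (this code assumes the test folder only has Part01 and Part02)
       if tt == train_test_folders[1] and folder_name == 'Part03':
           break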
       subfolder_path = os.path.join(root_dir, tt, folder_name)
       print(subfolder_path)
       print(os.listdir(subfolder_path))
       for filename in os.listdir(subfolder_path):
           # Skip any file that is not a .bmp image
           if os.path.splitext(filename)[1].lower() != '.bmp':
               continue
           # Load the image
           img_path = os.path.join(subfolder_path, filename)
           img = Image.open(img_path)

           # Convert the image to grayscale and flatten it into a 1D array
           img_grey = img.convert('L')
           img_data = np.array(img_grey).flatten()

           # Extract the label from the filename and convert it to an integer
           label = int(filename.split('_')[2].replace('digit', '').split('.')[0])

           # Add the image data and label to the lists
           data.append(img_data)
           labels.append(label)

# Convert the image data and labels to a pandas dataframe
df = pd.DataFrame(data)
df['label'] = labels
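
To go back from the tabular form to images, each row can be reshaped into a 2D array. The sketch below assumes 28x28 images (784 pixel columns, following the MNIST format mentioned above) and a `label` column as built by the code above. The helper name `split_images_and_labels` is hypothetical, and the demo uses random values rather than real digits.

```python
import numpy as np
import pandas as pd


def split_images_and_labels(df, side=28):
    """Return (images, labels) from a dataframe of flattened images.

    Assumes every column except 'label' is a pixel value and that each
    image is side x side pixels (28 is an assumption based on MNIST).
    """
    labels = df['label'].to_numpy()
    pixels = df.drop(columns='label').to_numpy(dtype=np.uint8)
    # One row per image: reshape (n, side * side) into (n, side, side).
    images = pixels.reshape(-1, side, side)
    return images, labels


# Demo on synthetic data shaped like the dataframe built above.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(rng.integers(0, 256, size=(5, 784)))
demo_df['label'] = [0, 1, 2, 3, 4]
images, labels = split_images_and_labels(demo_df)
```

Keeping the pixels as `uint8` matches the 0 to 255 range of grayscale images and keeps memory use low for 70k rows.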

The original data was made available at
https://datacenter.aucegypt.edu/shazeem
as two datasets:

  • ADBase
  • MADBase (✅ the one this dataset is derived from; similar in format to MNIST)
