
AI-Based Job Site Matching

Leveraging 400k+ Hours of Resource & Performance Data

@kaggle.thedevastator_ai_based_job_site_matching


By [source]


About this dataset

As savvy job-seekers know, selecting an optimal site for GlideinWMS jobs is no small feat: weighing so many critical variables and performing the sophisticated calculations needed to maximize the gains can be a tall order. Our dataset offers a valuable helping hand: with detailed insight into resource metrics and time-series analysis spanning over 400K hours of data, this treasure trove of information will hasten your journey towards finding just the right site for all your jobs.

Specifically, our dataset contains three files: dataset_classification.csv, which provides information on critical elements such as disk usage and CPU cache size; dataset_time_series_analysis.csv, featuring takeaways from careful time-series analysis; and dataset_400k_hour.csv, gathering computation results from over 400K hours of testing. With columns such as Failure (whether or not the job failed), TotalCpus (the total number of CPUs used by the job), CpuIsBusy (whether or not the CPU is busy), and SlotType (the type of slot used by the job), it's easier than ever to plot the perfect path to success!



How to use the dataset

This dataset can be used to help identify the most suitable site for GlideinWMS jobs. It contains resource metrics and time-series analysis, which provide useful insight into the suitability of each potential site. The dataset consists of three files: dataset_classification.csv, dataset_time_series_analysis.csv, and dataset_400k_hour.csv.

The first file provides a high-level view of the critical resource metrics that matter when matching a job to a site: DiskUsage, TotalCpus, TotalMemory, TotalDisk, CpuCacheSize, TotalVirtualMemory, and TotalSlots. It also records whether the CpuIsBusy, the SlotType used by each job at each potential site, and a Failure flag for issues that arise during processing. Finally, Site is provided so that users can ensure they are matching jobs to sites within their own environment, if required by policy or business rules (a query sketch follows).
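
As a quick illustration, here is a minimal SQL sketch that summarizes failure rate and average resource availability per site. It assumes the dataset_classification table defined under Tables below; adjust the table reference to your own environment.

SELECT
  site,
  COUNT(*) AS jobs,
  AVG(failure) AS failure_rate,      -- failure is stored as 0/1
  AVG(cpuisbusy) AS busy_fraction,   -- cpuisbusy is stored as 0/1
  AVG(totalcpus) AS avg_cpus,
  AVG(totalmemory) AS avg_memory,
  AVG(totaldisk) AS avg_disk
FROM dataset_classification
GROUP BY site
ORDER BY failure_rate ASC;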

The second file provides detailed time-series analysis of these metrics over longer timeframes. LastUpdate indicates when each observation was generated, while ydate, mdate, and hdate break that timestamp down into the year, month, and hour of the last update, so up-to-the-minute decisions can be made during busy times such as peak workloads, or during reallocations caused by anomalies in usage patterns within existing systems (see the sketch below).
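
For example, a sketch of the hourly aggregation this enables, assuming standard SQL and the dataset_time_series_analysis table defined under Tables below:

SELECT
  glidein_site,
  EXTRACT(HOUR FROM lastupdate) AS hour_of_day,  -- derive hour of day from the timestamp
  AVG(totalcpus) AS avg_cpus,
  AVG(totalmemory) AS avg_memory,
  AVG(totaldisk) AS avg_disk
FROM dataset_time_series_analysis
GROUP BY glidein_site, EXTRACT(HOUR FROM lastupdate)
ORDER BY glidein_site, hour_of_day;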

Finally, our third file takes things one step further with detailed information from our 400k+ hour analytical data collection, letting you maximize efficiency while selecting the best possible matches across multiple sites and criteria using a single tool, conveniently packaged together in this Kaggle dataset.
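
A minimal sketch for inspecting long-term trends in that collection, assuming standard SQL and the dataset_400k_hour table defined under Tables below:

SELECT
  CAST("time" AS DATE) AS day,   -- collapse hourly samples into daily averages
  AVG(totalcpus) AS avg_cpus,
  AVG(totalmemory) AS avg_memory,
  AVG(totaldisk) AS avg_disk
FROM dataset_400k_hour
GROUP BY CAST("time" AS DATE)
ORDER BY day;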

By taking advantage of our AI driven approach you will be able benefit from optimal job selection across many different scenarios such maximum efficiency scenarios with boosts in throughput through realtime scaling along with accountability boost ensuring proper system governance when moving from static systems utilizing static strategies towards ones more reactive working utilization dynamics within new agile deployments increasing stability while lowering maintenance costs over longer run!

Research Ideas

  • Use the total CPU, memory, and disk usage metrics to identify jobs that need additional resources to complete quickly, and suggest alternative sites with more optimal resource availability (a query sketch follows this list)
  • Utilize the time-series analysis, using the failure rate, the LastUpdate series, and the month/hour/year of last update, to build predictive models for job-site matching and failure avoidance on future jobs
  • Identify inefficiencies in scheduling by cross-examining job types (SlotType) and CPU cache size requirements against historical data to find opportunities for optimization or new approaches to job organization
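
A sketch for the first idea: rank candidate sites for a given resource request. The numeric thresholds here are illustrative placeholders, not values taken from the dataset.

SELECT
  site,
  AVG(failure) AS failure_rate,
  AVG(totalmemory) AS avg_memory,
  AVG(totaldisk) AS avg_disk
FROM dataset_classification
WHERE totalcpus >= 4          -- hypothetical CPU requirement
  AND totalmemory >= 8192     -- hypothetical memory requirement
GROUP BY site
ORDER BY failure_rate ASC, avg_disk DESC;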

Acknowledgements

If you use this dataset in your research, please credit the original authors.
Data Source

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: dataset_classification.csv

DiskUsage: The amount of disk space used by the job. (Numeric)
TotalCpus: The total number of CPUs allocated to the job. (Numeric)
TotalMemory: The total amount of memory allocated to the job. (Numeric)
TotalDisk: The total amount of disk space allocated to the job. (Numeric)
CpuCacheSize: The size of the CPU cache allocated to the job. (Numeric)
TotalVirtualMemory: The total amount of virtual memory allocated to the job. (Numeric)
TotalSlots: The total number of slots allocated to the job. (Numeric)
CpuIsBusy: A boolean value indicating whether the CPU is busy. (Boolean)
SlotType: The type of slot allocated to the job. (String)
Failure: A boolean value indicating whether the job failed. (Boolean)
Site: The site where the job is running. (String)

File: dataset_time_series_analysis.csv

TotalCpus: The total number of CPUs allocated to the job. (Numeric)
TotalMemory: The total amount of memory allocated to the job. (Numeric)
TotalDisk: The total amount of disk space allocated to the job. (Numeric)
LastUpdate: The date and time of the last update to the job. (DateTime)
ydate: The year of the job's last update. (Numeric)
mdate: The month of the job's last update. (Numeric)
hdate: The hour of the job's last update. (Numeric)


Tables

Dataset 400k Hour

@kaggle.thedevastator_ai_based_job_site_matching.dataset_400k_hour
  • 43.48 KB
  • 1554 rows
  • 4 columns

CREATE TABLE dataset_400k_hour (
  "time" TIMESTAMP,
  "totalcpus" DOUBLE,
  "totalmemory" DOUBLE,
  "totaldisk" DOUBLE
);

Dataset Classification

@kaggle.thedevastator_ai_based_job_site_matching.dataset_classification
  • 618.11 KB
  • 25783 rows
  • 13 columns

CREATE TABLE dataset_classification (
  "unnamed_0" BIGINT,
  "diskusage" DOUBLE,
  "totalcpus" BIGINT,
  "totalmemory" BIGINT,
  "totaldisk" BIGINT,
  "cpucachesize" BIGINT,
  "totalvirtualmemory" BIGINT,
  "glidein_job_max_time" BIGINT,
  "totalslots" BIGINT,
  "cpuisbusy" BIGINT,
  "slottype" BIGINT,
  "failure" BIGINT,
  "site" VARCHAR
);

Dataset Time Series Analysis

@kaggle.thedevastator_ai_based_job_site_matching.dataset_time_series_analysis
  • 1.6 MB
  • 91843 rows
  • 9 columns

CREATE TABLE dataset_time_series_analysis (
  "unnamed_0" BIGINT,
  "lastupdate" TIMESTAMP,
  "glidein_site" VARCHAR,
  "totalcpus" BIGINT,
  "totalmemory" BIGINT,
  "totaldisk" BIGINT,
  "ydate" TIMESTAMP,
  "mdate" TIMESTAMP,
  "hdate" TIMESTAMP
);
