Vacancies From Indeed Scraped Biweekly 2024
Scraped the same keywords for almost the entire year 2024. Includes both processed and raw data.
@kaggle.artemfedorov_vacancies_from_indeed_scraped_biweekly_2024
The dataset was obtained by web-scraping Indeed twice a week.
Vacancies were scraped for Data Scientist/Analyst/Engineer starting from 11.01.2024, for Business Intelligence starting from 26.02.2024, and for Machine Learning Engineer starting from 16.09.2024 (dates are DD.MM.YYYY). The location for all searches was limited to Israel.
The raw, unprocessed data includes a total of 141 thousand rows, as most vacancies were scraped in more than one search. The raw dataset includes more than 8000 vacancies with a unique id and full-text description, and it is available in both csv and parquet formats.
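Because the same vacancy shows up in several searches, the raw rows can be collapsed to one record per vacancy. The following is a minimal sketch, not the author's actual procedure. It assumes that the `url` column identifies a vacancy and that `scrape_month` and `scrape_day` order the scrapes within 2024. The helper name `collapse_by_url` and the sample rows are hypothetical.

```python
def collapse_by_url(rows):
    """Collapse repeated scrapes into one record per vacancy.

    Assumption: `url` identifies a vacancy, and (scrape_month, scrape_day)
    orders scrapes within the year. Keeps the most recent row per url and
    counts how many times the vacancy was seen.
    """
    latest = {}
    for row in rows:
        key = row["url"]
        stamp = (row["scrape_month"], row["scrape_day"])
        if key not in latest:
            latest[key] = {"row": row, "stamp": stamp, "times_seen": 1}
            continue
        entry = latest[key]
        entry["times_seen"] += 1
        if stamp > entry["stamp"]:
            # A later scrape replaces the stored row.
            entry["row"], entry["stamp"] = row, stamp
    return {url: (e["row"], e["times_seen"]) for url, e in latest.items()}


# Hypothetical sample rows, not taken from the dataset.
sample = [
    {"url": "https://example.com/job/1", "job_title": "Data Analyst",
     "scrape_month": 1, "scrape_day": 11},
    {"url": "https://example.com/job/1", "job_title": "Data Analyst",
     "scrape_month": 1, "scrape_day": 15},
    {"url": "https://example.com/job/2", "job_title": "Data Engineer",
     "scrape_month": 2, "scrape_day": 26},
]
unique = collapse_by_url(sample)
```

Keeping the latest row rather than the first is a design choice for this sketch. If the earliest sighting matters more, flip the comparison.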
I selected the vacancies that I considered relevant, by both title and description, and used Gemini to extract the required skills and the required experience in years from the full-text description. There are 2500 relevant vacancies in the processed dataset.
My visualization of the processed dataset is available here. Would be happy to get some feedback!
NOTES for raw dataset:
NOTES for processed dataset:
CREATE TABLE all_vacancies (
"job_title" VARCHAR,
"url" VARCHAR,
"company" VARCHAR,
"location" VARCHAR,
"text_short" VARCHAR,
"scrape_day" BIGINT,
"scrape_month" BIGINT,
"last_update" VARCHAR,
"text_full" VARCHAR,
"company_rating" DOUBLE,
"is_responsive" VARCHAR,
"key_word" VARCHAR,
"result_pos" BIGINT,
"job_type" VARCHAR,
"employer_active" VARCHAR,
"tagged_new" VARCHAR
);

CREATE TABLE processed_data (
"unnamed_0" BIGINT, -- Unnamed: 0
"url" VARCHAR,
"experience_years" VARCHAR,
"experience_mandatory" VARCHAR,
"education" VARCHAR,
"languages" VARCHAR,
"mandatory" VARCHAR,
"advantage" VARCHAR,
"job_type" VARCHAR,
"company" VARCHAR,
"job_title" VARCHAR,
"city" VARCHAR,
"district" VARCHAR,
"first_online" TIMESTAMP,
"last_update" VARCHAR,
"total_dates" BIGINT,
"is_direct" BIGINT,
"min_experience" DOUBLE,
"last_online" TIMESTAMP,
"is_unique_text" BIGINT,
"cloud_skills" VARCHAR,
"viz_tools" VARCHAR,
"type_da" BIGINT,
"type_ds" BIGINT,
"type_de" BIGINT,
"type_bi" BIGINT,
"type_aiml" BIGINT
);
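As an illustration of how the `processed_data` table might be queried, here is a small sketch using Python's built-in `sqlite3` module. It recreates only a subset of the columns in an in-memory database. It assumes that each `type_*` column is a 0/1 flag and that a vacancy may carry more than one flag. Neither assumption is confirmed by the schema, and the inserted rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Subset of the processed_data columns, using SQLite type names.
conn.execute(
    """
    CREATE TABLE processed_data (
        url TEXT,
        company TEXT,
        min_experience REAL,
        type_da INTEGER,
        type_ds INTEGER,
        type_de INTEGER,
        type_bi INTEGER,
        type_aiml INTEGER
    )
    """
)

# Hypothetical rows, not taken from the dataset.
conn.executemany(
    "INSERT INTO processed_data VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("https://example.com/job/1", "Acme", 2.0, 1, 0, 0, 1, 0),
        ("https://example.com/job/2", "Acme", 3.0, 0, 1, 0, 0, 1),
        ("https://example.com/job/3", "Globex", None, 0, 0, 1, 0, 0),
    ],
)

# Assumption: type_* columns are 0/1 flags, so SUM counts flagged vacancies.
counts = conn.execute(
    """
    SELECT SUM(type_da), SUM(type_ds), SUM(type_de),
           SUM(type_bi), SUM(type_aiml), AVG(min_experience)
    FROM processed_data
    """
).fetchone()

role_counts = dict(zip(["da", "ds", "de", "bi", "aiml"], counts[:5]))
avg_min_experience = counts[5]  # AVG skips NULL values
conn.close()
```

The query uses only standard aggregates, so it should carry over to other SQL engines with little change once the real files are loaded.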