Vacancies From Indeed Scraped Biweekly 2024

The same keywords were scraped for almost all of 2024. Includes both processed and raw data.

@kaggle.artemfedorov_vacancies_from_indeed_scraped_biweekly_2024

About this Dataset

The dataset was obtained by web-scraping Indeed twice a week.
Vacancies were scraped for Data Scientist/Analyst/Engineer starting from 11.01.2024, for Business Intelligence starting from 26.02.2024, and for Machine Learning Engineer starting from 16.09.2024. The location for all searches was limited to Israel.

The raw, unprocessed data includes a total of 141 thousand rows, because most vacancies were scraped in more than one search. It contains more than 8000 vacancies with a unique ID and a full-text description, and is available in both CSV and Parquet formats.
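Because the same vacancy appears in many scrapes, a common first step is to collapse the raw rows to one row per vacancy. The following is a minimal sketch of that idea, not the author's processing code. The "url" and "text_full" names come from the notes on the raw dataset; the "scrape_date" field and the sample rows are assumptions made for illustration.

```python
from datetime import date

# Hypothetical raw rows: one row per vacancy per scrape.
# "scrape_date" is an assumed field name; the values are made up.
raw_rows = [
    {"url": "job-a", "scrape_date": date(2024, 1, 11), "text_full": "Data Analyst, SQL"},
    {"url": "job-a", "scrape_date": date(2024, 1, 15), "text_full": "Data Analyst, SQL"},
    {"url": "job-b", "scrape_date": date(2024, 1, 15), "text_full": "Data Engineer, Spark"},
    {"url": "job-a", "scrape_date": date(2024, 1, 18), "text_full": "Data Analyst, SQL, Python"},
]


def latest_row_per_url(rows):
    """Keep only the most recently scraped row for each "url"."""
    latest = {}
    for row in rows:
        kept = latest.get(row["url"])
        if kept is None or row["scrape_date"] > kept["scrape_date"]:
            latest[row["url"]] = row
    return latest


unique_vacancies = latest_row_per_url(raw_rows)
print(len(raw_rows), "rows ->", len(unique_vacancies), "unique vacancies")
```

Keeping the latest row is only one possible choice. Since the full text is not extracted on every scrape, keeping the latest row that has a non-empty "text_full" may suit some analyses better.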

I selected the vacancies that I considered relevant, judging by both title and description, and used Gemini to extract the required skills and the required years of experience from the full-text description. The processed dataset contains 2500 relevant vacancies.

My visualization of the processed dataset is available here. I would be happy to get some feedback!

NOTES for raw dataset:

  • Columns "is_responsive", "job_type" and "company_rating" are not available for all the dates (I had to add or remove columns as the site changed).
  • "url" is the unique identifier used by Indeed. Sometimes vacancies with different IDs have exactly the same description, and sometimes the full-text description changes over time for the same "url".
  • The "last_update" field contains information on how old the vacancy is. The publication date of vacancies older than 30 days is hidden; only the text "older than 30 days" is available.
  • Extracting the "text_full" and "company_rating" fields requires an additional click. This data was extracted only for vacancies newer than 1 month (and also for the first record on each page, as it does not require an additional click).
  • With only one exception, the site was scraped on Mondays and Thursdays.
  • Some vacancies are not from direct employers but from recruiting companies like "Gotfriends", "Nisha Group" and "SQLink".
  • UPDATE: after 11.11 the "last_update" column is unavailable due to a site change, so it is empty for later dates. Two new columns were added: "employer_active" and "tagged_new". I also started scraping on Saturdays, to better determine the week in which a job description first appeared.
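The two quirks of "url" and "text_full" noted above (one URL whose text changes over time, and several URLs sharing identical text) can be checked with simple grouping. This is an illustrative sketch on made-up rows; only the "url" and "text_full" names are taken from the notes, and rows without extracted text are skipped.

```python
from collections import defaultdict

# Made-up sample rows; None stands for a scrape where the text was not extracted.
rows = [
    {"url": "job-a", "text_full": "Data Engineer in Tel Aviv"},
    {"url": "job-a", "text_full": "Data Engineer in Tel Aviv, hybrid"},
    {"url": "job-b", "text_full": "BI Developer"},
    {"url": "job-c", "text_full": "BI Developer"},
    {"url": "job-d", "text_full": None},
]


def text_versions_by_url(rows):
    """Map each url to the set of distinct descriptions seen for it."""
    versions = defaultdict(set)
    for row in rows:
        if row["text_full"]:
            versions[row["url"]].add(row["text_full"])
    return versions


def urls_by_text(rows):
    """Map each description to the set of urls it appeared under."""
    urls = defaultdict(set)
    for row in rows:
        if row["text_full"]:
            urls[row["text_full"]].add(row["url"])
    return urls


# URLs whose description changed between scrapes.
changed_urls = sorted(u for u, v in text_versions_by_url(rows).items() if len(v) > 1)
# Groups of different URLs that share exactly the same description.
shared_text_groups = sorted(sorted(u) for u in urls_by_text(rows).values() if len(u) > 1)

print("changed:", changed_urls)
print("shared:", shared_text_groups)
```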

NOTES for processed dataset:

  • Vacancies that require data analyst, data scientist, data engineer or business intelligence skills are considered relevant, in addition to those that have these words in the title. ML/AI jobs are all the jobs that have a mandatory requirement of either the "Machine Learning", "Deep Learning" or "AI" skill.
  • The "Cloud skills" column has the value "any cloud" for vacancies that either mention more than one cloud provider or mention cloud skills without details. The same applies to the "viz_tools" column: the value "any_viz_tool" means that either several visualization tools were mentioned, or knowledge of visualization tools was required without naming any specific tool.
  • The "first_online" column contains the calculated date when the vacancy first appeared on Indeed. It is calculated from the date the vacancy was first scraped and the value of the "last_update" field on that date.
  • If a vacancy has text like "It is possible either to work from our office in city A or from our office in city B", Indeed creates two distinct vacancy records with different IDs. I used the "is_unique_text" field to mark such cases, where exactly the same vacancy text is also available for another city.
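The "first_online" calculation described above can be sketched as a date subtraction. The exact method used for the dataset is not documented here, so this is only one plausible reading: it assumes the "last_update" text has already been parsed into a whole number of days (the hypothetical `age_days` argument), and that an unknown age, such as "older than 30 days" or an empty value, is passed as `None`.

```python
from datetime import date, timedelta


def estimate_first_online(first_scraped, age_days):
    """Estimate the publication date as first_scraped minus age_days.

    first_scraped: date of the first scrape in which the vacancy appeared.
    age_days: vacancy age in days on that date (assumed to be pre-parsed
        from "last_update"), or None when the age is not known.
    """
    if age_days is None:
        return None  # e.g. "older than 30 days": no exact date can be derived
    return first_scraped - timedelta(days=age_days)


# A vacancy first scraped on 14.03.2024 and reported as 3 days old.
print(estimate_first_online(date(2024, 3, 14), 3))
# A vacancy whose age was hidden on the first scrape.
print(estimate_first_online(date(2024, 3, 14), None))
```

Returning `None` for hidden ages keeps unknown dates explicit instead of guessing a value such as "first scrape minus 30 days".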
