From f6fbacfdf447f8e7a26234ee4bf7769e379831c0 Mon Sep 17 00:00:00 2001
From: Marcel <14852157+Marcel0024@users.noreply.github.com>
Date: Mon, 24 Jun 2024 18:16:16 +0200
Subject: [PATCH] Update docs

---
 README.md | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/README.md b/README.md
index 994cab4..9bb1ddf 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 [![Build and Publish](https://github.com/Marcel0024/FundaScraper/actions/workflows/build-and-publish-image.yaml/badge.svg?branch=main)](https://github.com/Marcel0024/FundaScraper/actions/workflows/build-and-publish-image.yaml)
 
-# FundaScraper - New listings to webhooks (and file)
+# FundaScraper - Automate Listings with Docker and Webhooks
 
 `marcel0024/funda-scraper` docker image provides the easiest way to perform web scraping on Funda, the Dutch housing website. You simply provide the URL that you want to be scraped with the prefilled search criteria, and the image does the rest.
 
@@ -9,7 +9,7 @@ Scraping times are set by a CRON expression, so you can set it to once a day, tw
 What makes this scraper unique is, it imitates a real user browsing the website. It opens a browser, loads the page, and waits for the page to load and then scrapes it. Further more you can override all selectors to make it work with future changes on the website.
-That way you don't have to wait for the image to be updated.
+That way you don't have to wait for the image to be updated. Note that the browser windows are all opened inside the container; you won't physically see a browser.
 
 Please note:
@@ -42,28 +42,28 @@ services:
     environment:
       - FUNDA_URL=https://www.funda.nl/zoeken/koop?selected_area=%5B%22amsterdam%22%5D&object_type=%5B%22house%22%5D&price=%22-450000%22
       - WEBHOOK_URL=http://homeassistantlocal.ip/api/webhook/123-redacted-key
+      - CRON=0 7,19 * * * # Every day at 7 AM and 7 PM
     volumes:
       - /data/fundascraper:/data
 ```
 
 ## Environment Variables
 
-| Variable | Required | Default | Description |
-| -------------------------- | ---------------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| `CRON` | No (has default) | `0 7 * * *` | Every day at 7AM in the morning. |
-| `FUNDA_URL` | Yes | - | The starting URL to scrape. You can build the parameters in the browser and just copy the link. Pricing, area, location, etc are all embedded in the URL, so make sure you filter it on the website before you copy it. |
-| `WEBHOOK_URL` | No | - | The webhook URL to send the new listings to. |
-| `ERROR_WEBHOOK_URL` | No | - | The webhook URL to send errors to parsing fails and stops the app. |
-| `START_PAGE` | No | 1 | The page to start with (pagination) |
-| `TOTAL_PAGES` | No | 5 | Total pages to scrape. Increase this you're quering a big area. |
-| `RUN_ON_STARTUP` | No | `false` | Run the crawl on startup. If `false` the next run depends on the `CRON` value. |
-| `PAGE_CRAWL_LIMIT` | No | `500` | The total pages it can crawl for each run. Highly unlikely this needs to be edited. |
-| `TOTAL_PARALLELISM_DEGREE` | No | 5 | Total browsers that can be open at the same time. It's a balance with hardware specs/site limits before blocking and how fast the scraping needs to be done. These are all done within the container you won't physically see the browser. |
+| Variable | Required | Default | Description |
+| -------------------------- | ---------------- | ----------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `CRON` | No (has default) | `0 7 * * *` | Every day at 7 AM. |
+| `FUNDA_URL` | Yes | - | The starting URL to scrape. You can build the search parameters in the browser and just copy the link. Pricing, area, location, etc. are all embedded in the URL, so make sure you apply your filters on the website before you copy it. |
+| `WEBHOOK_URL` | No | - | The webhook URL to send the new listings to. Note: on the first run for a new area you will get spammed, since every listing is considered new. |
+| `ERROR_WEBHOOK_URL` | No | - | The webhook URL to send errors to when parsing fails and the app stops. |
+| `START_PAGE` | No | 1 | The page to start with (pagination). |
+| `TOTAL_PAGES` | No | 5 | Total pages to scrape. Increase this if you're querying a big area. |
+| `RUN_ON_STARTUP` | No | `false` | Run the crawl on startup. If `false` the next run depends on the `CRON` value. |
+| `TOTAL_PARALLELISM_DEGREE` | No | 5 | Total browsers that can be open at the same time. It's a balance between hardware specs, the site's limits before it blocks scraping, and how fast you want the scraping done. Browsers all run inside the container; you won't physically see them. |
 
 ### Selector variables
 
-| Variable | Default | Description |
-| ---------------------- | ------------------- | ------------------------------- |
+| Variable | Default | Description |
+| ---------------------- | -------------------------------- | ------------------------------- |
 | `LISTING_SELECTOR` | See `FundaScraper/defaults.json` | The selector to click a listing |
 | `TITLE_SELECTOR` | See `FundaScraper/defaults.json` | The selector for the address |
 | `ZIP_CODE_SELECTOR` | See `FundaScraper/defaults.json` | The selector for the zipcode |