Scraping from a WebSite 101

So you are probably thinking, "I want to be chilling at home and, when I suddenly want to know what the word of the day is, I only have to press a button!". OK, maybe not everyone, but if the website is served over HTTP or HTTPS, you can (mostly) get that information.

HTTP/s library

If you want to access this information using your very own ESP8266, you can use these libraries, which give you the abstraction you need:

#include <ESP8266WiFi.h>
#include <ESP8266WiFiMulti.h>
#include <ESP8266HTTPClient.h>
#include <WiFiClientSecureBearSSL.h>

We recommend checking out this repository, as it has other cool tools and examples for you to use! https://github.com/esp8266/Arduino/tree/master/libraries/ESP8266HTTPClient

HTTPS Fingerprint

When making an HTTPS connection, you also need to provide the website's certificate fingerprint, and this is how you get it:

Fingerprint_1 Fingerprint_2 Fingerprint_3

As you might have noticed, this is the fingerprint for the www.stcp.pt website, since our example later on is based on it 😉
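Browsers usually show the fingerprint as text (hex pairs separated by colons), while the sketch needs it as 20 raw bytes. Here is a minimal sketch of that conversion in plain C++. The helper name `parseFingerprint` is our own invention, not part of the ESP8266 libraries, and it assumes a SHA-1 fingerprint (20 bytes, 40 hex digits).

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of any ESP8266 library): converts a
// fingerprint such as "7E:6D:A5:..." into the 20 raw bytes of a SHA-1 digest.
// ':' and ' ' separators are skipped. Returns false if the text contains any
// other non-hex character or does not hold exactly 40 hex digits.
bool parseFingerprint(const std::string& text, uint8_t out[20]) {
  std::size_t digits = 0;
  for (char c : text) {
    if (c == ':' || c == ' ') continue;  // skip separators
    const unsigned char uc = static_cast<unsigned char>(c);
    if (!std::isxdigit(uc)) return false;  // not a hex digit
    if (digits >= 40) return false;        // too long for SHA-1
    const int value = std::isdigit(uc) ? c - '0'
                                       : std::toupper(uc) - 'A' + 10;
    const std::size_t i = digits / 2;
    if (digits % 2 == 0) {
      out[i] = static_cast<uint8_t>(value << 4);  // high nibble first
    } else {
      out[i] = static_cast<uint8_t>(out[i] | value);  // then low nibble
    }
    ++digits;
  }
  return digits == 40;
}
```

If the call returns `true`, the filled array holds the same kind of bytes as the `fingerprint[20]` constant used in the example, so it can be handed to `client->setFingerprint(...)` in the same way.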

HTTPS Example

The example we will be using is the same as the one in the repository above, but it accesses the STCP website to find out when a given bus arrives at the stop. Who doesn't hate buses that are always late!

#include <Arduino.h>

#include <ESP8266WiFi.h>
#include <ESP8266WiFiMulti.h>

#include <ESP8266HTTPClient.h>

#include <WiFiClientSecureBearSSL.h>
// Fingerprint for demo URL, expires on June 2, 2021, needs to be updated well before this date
const uint8_t fingerprint[20] = {0x7E, 0x6D, 0xA5, 0x17, 0x1E, 0xC1, 0x20, 0x9B, 0x78, 0x03, 0xD4, 0x9B, 0x60, 0xE2, 0x0A, 0x60, 0xF2, 0x46, 0xA9, 0xA5};

ESP8266WiFiMulti WiFiMulti;

#define URL "https://www.stcp.pt/pt/itinerarium/soapclient.php?codigo=AML1&linha=0&hash123=nq_2PRqorJYQ_7oMZYBZpbEASufOxOgsVk40VBQ2OkM"

#define SSID "***********"
#define PASSWORD "***********"

void setup() {

  Serial.begin(115200);
  // Serial.setDebugOutput(true);

  Serial.println();
  Serial.println();
  Serial.println();

  for (uint8_t t = 4; t > 0; t--) {
    Serial.printf("[SETUP] WAIT %d...\n", t);
    Serial.flush();
    delay(1000);
  }

  WiFi.mode(WIFI_STA);
  WiFiMulti.addAP(SSID, PASSWORD);
}

void loop() {
  // wait for WiFi connection
  if ((WiFiMulti.run() == WL_CONNECTED)) {

    std::unique_ptr<BearSSL::WiFiClientSecure> client(new BearSSL::WiFiClientSecure);

    client->setFingerprint(fingerprint);
    // Or, if you are happy to ignore the SSL certificate, then use the following line instead:
    // client->setInsecure();

    HTTPClient https;

    Serial.print("[HTTPS] begin...\n");
    if (https.begin(*client, URL)) {  // HTTPS

      Serial.print("[HTTPS] GET...\n");
      // start connection and send HTTP header
      int httpCode = https.GET();

      // httpCode will be negative on error
      if (httpCode > 0) {
        // HTTP header has been sent and the server response header has been handled
        Serial.printf("[HTTPS] GET... code: %d\n", httpCode);

        // file found at server
        if (httpCode == HTTP_CODE_OK || httpCode == HTTP_CODE_MOVED_PERMANENTLY) {
          String payload = https.getString();
          Serial.println(payload);
        }
      } else {
        Serial.printf("[HTTPS] GET... failed, error: %s\n", https.errorToString(httpCode).c_str());
      }

      https.end();
    } else {
      Serial.printf("[HTTPS] Unable to connect\n");
    }
  }

  Serial.println("Wait 10s before next round...");
  delay(10000);
}

Scraping

This code prints the HTML that says when the next bus will arrive and how long it will take. How do you find where this information comes from? In this case it was simple, and we will explain how shortly, but for some websites it might be a little harder. So we are leaving here 2 great YouTube videos on how to web scrape with an ESP8266.

https://youtu.be/kWdfCCwhQTE

https://youtu.be/HUjFMVOpXBM

Now for the STCP website 😃. First we found the page that displays when a given bus will arrive, then we inspected it. In the inspector we selected the Network tab and refreshed the page to see the requests going through:

taking _packages_1

Then, as there might be a lot of unneeded requests, you can filter for Fetch/XHR and look for the one that interests you. In this case, it is this one:

finding_packet

Now just paste the link into your browser and voilà, easy HTML information to get. That is the link to use as the URL in the example. Good luck parsing the data ;).
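To get you started with the parsing, here is a minimal sketch in plain C++ that pulls out the text sitting between two markers. Everything in it is an assumption for illustration: the helper name `extractBetween` is ours, and we do not show the real STCP response here, so check the actual HTML in your browser and pick the markers that surround the value you want.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the text between the first occurrence of
// `open` (searching from position `from`) and the next occurrence of `close`.
// Returns an empty string if either marker is missing.
std::string extractBetween(const std::string& html,
                           const std::string& open,
                           const std::string& close,
                           std::size_t from = 0) {
  std::size_t start = html.find(open, from);
  if (start == std::string::npos) return "";
  start += open.size();  // move past the opening marker
  const std::size_t end = html.find(close, start);
  if (end == std::string::npos) return "";
  return html.substr(start, end - start);
}
```

With the example above, one possible way to call it is `extractBetween(payload.c_str(), "<td>", "</td>")`, where `payload` is the `String` returned by `https.getString()` and the `<td>` markers are only a guess at what the page might use. Plain string searching like this is fragile if the page layout changes, but it is cheap enough for a small microcontroller.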

Main Menu | Next