    Repositories list

    • Page Object pattern for Scrapy
      Python
      BSD 3-Clause "New" or "Revised" License
      2811995 Updated Oct 18, 2024
    • Software stack with latest Scrapy and updated deps
      Dockerfile
      BSD 3-Clause "New" or "Revised" License
      206120 Updated Oct 17, 2024
    • web-poet

      Public
      Web scraping Page Objects core library
      Python
      BSD 3-Clause "New" or "Revised" License
      15951413 Updated Oct 16, 2024
    • andi

      Public
      Library for annotation-based dependency injection
      Python
      BSD 3-Clause "New" or "Revised" License
      52031 Updated Oct 16, 2024
    • spidermon

      Public
      Scrapy extension for monitoring spider execution.
      Python
      BSD 3-Clause "New" or "Revised" License
      97533397 Updated Oct 10, 2024
    • Python
      BSD 3-Clause "New" or "Revised" License
      141320 Updated Oct 2, 2024
    • Python parser for human-readable dates
      Python
      BSD 3-Clause "New" or "Revised" License
      4662.5k28550 Updated Oct 2, 2024
    • A Python binding for CRFsuite
      Python
      MIT License
      221770453 Updated Oct 1, 2024
    • streamparse lets you run Python code against real-time streams of data. Integrates with Apache Storm.
      Python
      Apache License 2.0
      218201 Updated Sep 20, 2024
    • Parse numbers written in natural language
      Python
      BSD 3-Clause "New" or "Revised" License
      23108126 Updated Sep 16, 2024
    • Formasaurus tells you the type of an HTML form and its fields using machine learning
      HTML
      47701 Updated Aug 7, 2024
    • splash

      Public
      Lightweight, scriptable browser as a service with an HTTP API
      Python
      BSD 3-Clause "New" or "Revised" License
      5144.1k37726 Updated Aug 2, 2024
    • extruct

      Public
      Extract embedded metadata from HTML markup
      Python
      BSD 3-Clause "New" or "Revised" License
      1138473815 Updated Jul 25, 2024
    • A Postgres-backed ContentsManager implementation for IPython
      Python
      Apache License 2.0
      83201 Updated Jul 18, 2024
    • Crawl Frontier HCF backend
      Python
      BSD 3-Clause "New" or "Revised" License
      5721 Updated Jul 17, 2024
    • shublang

      Public
      Pluggable DSL that uses pipes to perform a series of linear transformations to extract data
      Python
      BSD 3-Clause "New" or "Revised" License
      815236 Updated Jul 9, 2024
    • Scrapy entrypoint for Scrapinghub job runner
      Python
      BSD 3-Clause "New" or "Revised" License
      162570 Updated Jul 8, 2024
    • An opinionated fork of the Drone CI system
      Go
      Other
      364005 Updated Jul 7, 2024
    • varanus

      Public
      A command-line spider monitoring tool
      Python
      7822 Updated Jul 6, 2024
    • scrapyrt

      Public
      HTTP API for Scrapy spiders
      Python
      BSD 3-Clause "New" or "Revised" License
      162832246 Updated Jun 28, 2024
    • portia

      Public
      Visual scraping for Scrapy
      Python
      BSD 3-Clause "New" or "Revised" License
      1.4k9.3k11119 Updated Jun 26, 2024
    • scikit-learn-inspired API for CRFsuite
      Python
      216200 Updated Jun 18, 2024
    • Python
      MIT License
      2403 Updated Jun 17, 2024
    • autologin

      Public
      A project that attempts to log in to a website automatically, given a single seed
      Python
      Apache License 2.0
      441102 Updated Jun 17, 2024
    • Python wrapper for the Intercom API.
      Python
      Other
      145101 Updated Jun 17, 2024
    • luigi

      Public
      Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, etc. It also comes with Hadoop support built in.
      Python
      Apache License 2.0
      2.4k401 Updated Jun 7, 2024
    • mrjob

      Public
      Run MapReduce jobs on Hadoop or Amazon Web Services
      Python
      Other
      587001 Updated Jun 6, 2024
    • Keep Docker hosts tidy
      Python
      Apache License 2.0
      50001 Updated May 21, 2024
    • aduana

      Public
      Frontera backend that guides a crawl using PageRank, HITS, or other ranking algorithms based on the link structure of the web graph, even for large crawls (one billion pages).
      C
      BSD 3-Clause "New" or "Revised" License
      95592 Updated May 21, 2024
    • exporters

      Public
      Exporters is an extensible export pipeline library that supports filtering, transformation, and several sources and destinations
      Python
      BSD 3-Clause "New" or "Revised" License
      104057 Updated May 21, 2024