Soluvas Scrape

A declarative website scraping library/framework for Java, built on Jaunt.

Underlying Library

Jaunt. Primary engine.
TBD: GhostDriver + PhantomJS, for advanced use cases that require JavaScript support.
TBD: jsoup, which may be enough for simpler use cases.

Planned Output Formats and Databases

JSON (TODO)
PostgreSQL (TODO)
CSV (TODO)
MongoDB (TODO)
Parquet (TODO). So you can query with Hive/Pig.
HBase (TODO). So you can query with Hive/Pig.
XLSX (TODO)
ODS (TODO)

Concepts

Collection. Contains a list of entities that all share the same schema of properties.
Property. Part of a schema. All entities in a collection have the same property definitions, but not every entity fills every property.
Entity. A single element in a Collection.
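
The three concepts above can be sketched as plain Java classes. This is only an illustration under stated assumptions: the class names `PropertyDef`, `ScrapedEntity`, and `ScrapedCollection`, and the sample property names, are hypothetical and are not taken from the library's actual API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical: one property definition in a collection's schema. */
class PropertyDef {
    final String name;
    final Class<?> type;

    PropertyDef(String name, Class<?> type) {
        this.name = name;
        this.type = type;
    }
}

/** Hypothetical: a single element of a collection. */
class ScrapedEntity {
    private final Map<String, Object> values = new LinkedHashMap<>();

    void put(String property, Object value) {
        values.put(property, value);
    }

    /** Returns null when the property is defined but was not filled. */
    Object get(String property) {
        return values.get(property);
    }
}

/** Hypothetical: a list of entities that all share one schema. */
class ScrapedCollection {
    private final Map<String, PropertyDef> schema = new LinkedHashMap<>();
    private final List<ScrapedEntity> entities = new ArrayList<>();

    ScrapedCollection(List<PropertyDef> properties) {
        for (PropertyDef p : properties) {
            schema.put(p.name, p);
        }
    }

    /** Adds an entity; it may omit properties, but filled ones must match the schema. */
    ScrapedEntity add(Map<String, Object> filled) {
        ScrapedEntity entity = new ScrapedEntity();
        for (Map.Entry<String, Object> e : filled.entrySet()) {
            PropertyDef def = schema.get(e.getKey());
            if (def == null) {
                throw new IllegalArgumentException("Unknown property: " + e.getKey());
            }
            if (e.getValue() != null && !def.type.isInstance(e.getValue())) {
                throw new IllegalArgumentException("Wrong type for: " + e.getKey());
            }
            entity.put(e.getKey(), e.getValue());
        }
        entities.add(entity);
        return entity;
    }

    List<ScrapedEntity> entities() {
        return Collections.unmodifiableList(entities);
    }
}

class ConceptsDemo {
    public static void main(String[] args) {
        ScrapedCollection registrations = new ScrapedCollection(Arrays.asList(
                new PropertyDef("registration_id", Long.class),
                new PropertyDef("score", Double.class)));
        // "score" is defined in the schema but left unfilled for this entity.
        ScrapedEntity e = registrations.add(
                Collections.<String, Object>singletonMap("registration_id", 1L));
        System.out.println(e.get("registration_id") + " " + e.get("score"));
    }
}
```

Keeping the schema on the collection rather than on each entity is what lets an entity leave a property empty while still rejecting properties the schema does not define.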

Planned Features

HTTP-level Caching with Apache HttpClient + ManagedHttpCache. Effective only for GET requests; it has no effect on POST/JSON-RPC requests.
Application-level Caching with Spring Caching + an EhCache alternative. This will work for POST/JSON-RPC/social media APIs, but EhCache cannot be used because its localRestartable persistence is Enterprise-only. GridGain/Apache Ignite seems to be the best fit here, not least because of its Spark/Hadoop integration.
Historical/Time Series Data support (TODO). So you can track, e.g., the estimated passing grade for several schools over time. Three possible approaches: materialized (write-first), embedded (calculate-on-query), and materialized + some embedded.

With materialized, time series data is also written to separate tables during data collection. However:
- It's hard to go back and change the time series data if there's a mistake in the implementation, although this should happen less over time.
- A newly defined time series can only apply to future data points, not past ones.
With embedded, historical data is saved in the same table as the original data (which requires modifying the original data schema), then processed at reporting/query time.
We'll use materialized + some embedded, as it is more scalable and can be implemented in any database, e.g. PostgreSQL, MongoDB, Cassandra, HBase, or even file-based formats like Parquet. "Some embedded" means we also store a list of essential foreign keys in the materialized table, such as registration_id for each option_id + snapshottime, so we can refer to them later.
Generate JPA Entity class from ScrapeTemplate (TODO).
Generate Liquibase from ScrapeTemplate (TODO).
Generate R ggplot2 charts and histograms (TODO).
Generate R Shiny charts and histograms (TODO).
Generate interactive D3 charts and histograms (TODO).
Generate interactive Tableau visualization definition (TODO).
Save raw requests & server responses to JSON or HTML files, with request metadata (TODO).
Save raw requests & server responses to Postman Catalog (TODO).
Snapshot important lists for historical/time series information. For example, save registration_id[] (array) keyed by option_id and snapshottime. While main purpose of each
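
The "materialized + some embedded" approach described in this section could look roughly like the following. It is a minimal in-memory sketch, not the planned implementation: `OptionSnapshot` and `SnapshotStore` are hypothetical names, the map stands in for a database table keyed by option_id and snapshottime, and the registration count is just one example of a value materialized at write time.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical row of a materialized snapshot table. */
class OptionSnapshot {
    final long optionId;
    final Instant snapshotTime;
    final int registrationCount;       // materialized: computed once, at write time
    final List<Long> registrationIds;  // "some embedded": foreign keys kept for later lookups

    OptionSnapshot(long optionId, Instant snapshotTime, List<Long> registrationIds) {
        this.optionId = optionId;
        this.snapshotTime = snapshotTime;
        this.registrationIds = Collections.unmodifiableList(new ArrayList<>(registrationIds));
        this.registrationCount = this.registrationIds.size();
    }
}

/** In-memory stand-in for a table keyed by (option_id, snapshottime). */
class SnapshotStore {
    private final Map<String, OptionSnapshot> rows = new LinkedHashMap<>();

    /** Called during data collection, alongside the normal write of scraped data. */
    OptionSnapshot record(long optionId, Instant snapshotTime, List<Long> registrationIds) {
        OptionSnapshot row = new OptionSnapshot(optionId, snapshotTime, registrationIds);
        rows.put(optionId + "@" + snapshotTime, row);
        return row;
    }

    /** Returns the snapshots of one option in the order they were first recorded. */
    List<OptionSnapshot> history(long optionId) {
        List<OptionSnapshot> result = new ArrayList<>();
        for (OptionSnapshot row : rows.values()) {
            if (row.optionId == optionId) {
                result.add(row);
            }
        }
        return result;
    }
}

class SnapshotDemo {
    public static void main(String[] args) {
        SnapshotStore store = new SnapshotStore();
        store.record(7L, Instant.parse("2015-06-01T00:00:00Z"), Arrays.asList(101L, 102L));
        store.record(7L, Instant.parse("2015-06-02T00:00:00Z"), Arrays.asList(101L, 102L, 103L));
        for (OptionSnapshot s : store.history(7L)) {
            System.out.println(s.snapshotTime + " count=" + s.registrationCount);
        }
    }
}
```

Because the list of registration IDs is stored with each snapshot, a time series defined later can still be derived for past snapshots from those IDs, which softens the "future data points only" limitation of the purely materialized approach.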

Supported Protocols/Input Formats

JSON-RPC over HTTP(S).
HTML over HTTP(S) (TODO).
Twitter API (TODO).
Foursquare API (TODO).
Facebook API (TODO).
Atom/RSS Feed (TODO).
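
JSON-RPC calls are typically sent as HTTP POST, which is why the HTTP-level cache listed under Planned Features does not help them. A minimal sketch of an application-level cache for such calls follows, assuming the response can be keyed by endpoint, method name, and serialized parameters. `RpcResponseCache`, the example URL, and the `listRegistrations` method are hypothetical; the planned Spring Caching integration is not shown.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical application-level cache for JSON-RPC responses. */
class RpcResponseCache {
    private final Map<String, String> responses = new ConcurrentHashMap<>();

    /** The key includes the request parameters, which a GET-only HTTP cache never sees. */
    static String key(String endpoint, String method, String paramsJson) {
        return endpoint + "\n" + method + "\n" + paramsJson;
    }

    /** Returns the cached response, or calls the fetcher once and stores its result. */
    String getOrFetch(String endpoint, String method, String paramsJson,
                      Supplier<String> fetcher) {
        return responses.computeIfAbsent(key(endpoint, method, paramsJson),
                k -> fetcher.get());
    }
}

class RpcCacheDemo {
    public static void main(String[] args) {
        RpcResponseCache cache = new RpcResponseCache();
        // Stand-in for a real POST to the JSON-RPC endpoint.
        Supplier<String> fakeCall = () -> {
            System.out.println("calling remote endpoint");
            return "{\"result\": []}";
        };
        String params = "{\"optionId\":7}";
        cache.getOrFetch("https://example.org/rpc", "listRegistrations", params, fakeCall);
        // Same endpoint, method, and params: served from the cache, no second call.
        cache.getOrFetch("https://example.org/rpc", "listRegistrations", params, fakeCall);
    }
}
```

The sketch has no expiry or persistence, and it treats two parameter strings as equal only when they match character for character; a real cache would need an eviction policy and a canonical parameter serialization.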

Database Storage

Integrated User/Organization Workspace. All workspace tables are stored inside the app database, in the tenant's schema, with prefixes for namespacing (i.e. p_ for person and o_ for organization). The schema of every workspace table is fully managed by the app and is the app's responsibility. Currently stored in PostgreSQL for storage efficiency and fast indexing/aggregate reports, though MongoDB or Cassandra may also work well.
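
The naming scheme can be illustrated with a small helper. This is a sketch under assumptions: `WorkspaceTables` and `OwnerKind` are hypothetical names, and the exact layout after the prefix (here simply the collection name) is a guess, since only the p_ and o_ prefixes are specified above.

```java
/** Hypothetical helper that builds schema-qualified workspace table names. */
class WorkspaceTables {
    enum OwnerKind {
        PERSON("p_"),
        ORGANIZATION("o_");

        final String prefix;

        OwnerKind(String prefix) {
            this.prefix = prefix;
        }
    }

    /** Returns tenantSchema + "." + prefix + collection, e.g. "acme.p_registration". */
    static String qualifiedName(String tenantSchema, OwnerKind kind, String collection) {
        return requireIdentifier(tenantSchema) + "." + kind.prefix + requireIdentifier(collection);
    }

    /** Accepts only simple lowercase identifiers so names are safe to put in SQL. */
    private static String requireIdentifier(String value) {
        if (value == null || !value.matches("[a-z_][a-z0-9_]*")) {
            throw new IllegalArgumentException("Not a simple identifier: " + value);
        }
        return value;
    }
}

class WorkspaceTablesDemo {
    public static void main(String[] args) {
        System.out.println(WorkspaceTables.qualifiedName(
                "acme", WorkspaceTables.OwnerKind.PERSON, "registration"));
        System.out.println(WorkspaceTables.qualifiedName(
                "acme", WorkspaceTables.OwnerKind.ORGANIZATION, "registration"));
    }
}
```

Since the app fully manages these schemas, generating table names in one place keeps the prefixes consistent and avoids building SQL identifiers from unchecked input.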

Sample Data

For Hendy, sample data is at: ~/Dropbox/Hendy_Projects/soluvas-scrape/sample/ppdb

Related Projects

Soluvas Analytics. Presents the scraped data in a pleasing and engaging way.
Soluvas ETL. Integrates incoming scraped data with other systems, and posts to or syncs with social media.
Soluvas AI. Adds some intelligence to fill in missing information or predict new information.
Soluvas Buzz. Runs social media, email, and online marketing campaigns with scraped data.
Soluvas Publisher will be superseded by Soluvas Scrape + Soluvas ETL/Soluvas Buzz.
