Declarative web site scraping library/framework in Java. Uses Jaunt.
- Jaunt. Primary engine.
- TBD: Ghost Driver + PhantomJS for advanced use cases requiring JavaScript support.
- TBD: jsoup. For simpler use cases, jsoup may be enough.
- JSON (TODO)
- PostgreSQL (TODO)
- CSV (TODO)
- MongoDB (TODO)
- Parquet (TODO). So you can query with Hive/Pig.
- HBase (TODO). So you can query with Hive/Pig.
- XLSX (TODO)
- ODS (TODO)
- Collection. Contains a list of entities that all share the same schema of properties (see the sketch after this list).
- Property. Part of a schema. All entities in a collection have the same property definitions, but not all values are filled in.
- Entity. A single element in a Collection.
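
A minimal sketch of how these three concepts could fit together in Java; the class and field names are illustrative, not the actual Soluvas Scrape API:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** A Property is one named, typed slot in a Collection's schema. */
class Property {
    final String name;
    final Class<?> type;
    Property(String name, Class<?> type) {
        this.name = name;
        this.type = type;
    }
}

/** An Entity holds values for some (not necessarily all) of the schema's properties. */
class Entity {
    final Map<String, Object> values = new LinkedHashMap<>();
}

/** A Collection owns the shared schema and the list of entities conforming to it. */
class Collection {
    final List<Property> schema = new ArrayList<>();
    final List<Entity> entities = new ArrayList<>();
}
```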
- HTTP-level Caching with Apache HttpClient + ManagedHttpCache. Effective only for GET requests, and has no effect for POST/JSON-RPC requests (see the first sketch after this feature list).
- Application-level Caching with Spring Caching + an EhCache alternative. Will work for POST/JSON-RPC/social media APIs, but cannot use EhCache because its `localRestartable` persistence is Enterprise-only. GridGain/Apache Ignite seems to be the best fit here, not to mention its Spark/Hadoop integration (see the second sketch after this feature list).
- Historical/Time Series Data support (TODO). So you can track, e.g., the estimated passing grade for several schools over time. Three possible approaches: materialized (write-first), embedded (calculate-on-query), and materialized + some embedded. With materialized, time series data is also written to separate tables during data collection. However:
  - it's hard to go back and change the time series data if there's any mistake in the implementation, though over time this should happen less;
  - if you define a new time series, it can only apply to future data points, not past ones.

  With embedded, historical data is saved in the same table as the original data (which requires modifying the original schema), then processed at reporting/query time. We'll use materialized + some embedded because it is more scalable and can be implemented in any database, e.g. PostgreSQL, MongoDB, Cassandra, HBase, even file-based formats like Parquet. By "some embedded", we also store a list of essential foreign keys in the materialized table, like `registration_id` for each `option_id` + `snapshottime`, so we can refer to them later (see the third sketch after this feature list).
- Generate JPA Entity class from ScrapeTemplate (TODO).
- Generate Liquibase changelog from ScrapeTemplate (TODO).
- Generate R ggplot2 charts and histograms (TODO).
- Generate R Shiny charts and histograms (TODO).
- Generate interactive D3 charts and histograms (TODO).
- Generate interactive Tableau visualization definition (TODO).
- Save raw requests & server responses to JSON or HTML files, with request metadata (TODO).
- Save raw requests & server responses to a Postman Collection (TODO).
- Snapshot important lists for historical/time series information. For example, save `registration_id[]` (array) keyed by `option_id` and `snapshottime`. While main purpose of each
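
A minimal sketch of the HTTP-level caching feature above, assuming Apache HttpClient 4.3+ with its httpclient-cache module (the URL and size limits are placeholders):

```java
import org.apache.http.client.cache.HttpCacheContext;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.cache.CacheConfig;
import org.apache.http.impl.client.cache.CachingHttpClients;
import org.apache.http.impl.client.cache.ManagedHttpCacheStorage;

public class HttpLevelCacheDemo {
    public static void main(String[] args) throws Exception {
        CacheConfig cacheConfig = CacheConfig.custom()
                .setMaxCacheEntries(1000)
                .setMaxObjectSize(1_000_000) // only cache responses up to ~1 MB
                .build();
        // ManagedHttpCacheStorage keeps cache entries in memory and releases them
        // once they are no longer referenced.
        CloseableHttpClient client = CachingHttpClients.custom()
                .setCacheConfig(cacheConfig)
                .setHttpCacheStorage(new ManagedHttpCacheStorage(cacheConfig))
                .build();
        HttpCacheContext context = HttpCacheContext.create();
        try (CloseableHttpResponse response = client.execute(
                new HttpGet("http://example.com/"), context)) {
            // CACHE_MISS on the first call, CACHE_HIT or VALIDATED on repeats of the
            // same GET; POST/JSON-RPC requests bypass this cache entirely.
            System.out.println(context.getCacheResponseStatus());
        }
        client.close();
    }
}
```

Repeating the same GET reuses the cached entry (subject to the response's cache headers); POST/JSON-RPC traffic is never cached at this layer, which is why the application-level cache sketched next is needed.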
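
A minimal sketch of application-level caching with Spring's cache abstraction, which does cover the POST/JSON-RPC case. The service, cache name, and method are illustrative; the backing CacheManager (e.g. Apache Ignite's SpringCacheManager from the ignite-spring module) is configured separately:

```java
import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

/**
 * Illustrative service: results of expensive JSON-RPC calls are cached by
 * Spring's cache abstraction, keyed on the method arguments, independently
 * of HTTP cache semantics.
 */
@Service
public class JsonRpcScrapeService {

    @Cacheable("scrapeResults")
    public String fetchViaJsonRpc(String method, String params) {
        return doJsonRpcCall(method, params); // the expensive network call
    }

    private String doJsonRpcCall(String method, String params) {
        // placeholder for the real JSON-RPC invocation
        return "{\"jsonrpc\": \"2.0\", \"result\": null}";
    }
}
```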
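
A rough illustration of the "materialized + some embedded" approach and the snapshot lists item above, written as a hypothetical JPA entity; the table and column names are assumptions, not the actual schema:

```java
import java.util.Date;
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.Table;
import javax.persistence.Temporal;
import javax.persistence.TemporalType;

/**
 * Hypothetical materialized snapshot row written during data collection:
 * one record per option_id per snapshottime, carrying the essential foreign
 * keys ("some embedded") so later reports can join back to the source rows.
 */
@Entity
@Table(name = "option_snapshot")
public class OptionSnapshot {

    @Id
    @GeneratedValue
    private Long id;

    @Column(name = "option_id")
    private Long optionId;

    @Temporal(TemporalType.TIMESTAMP)
    @Column(name = "snapshottime")
    private Date snapshotTime;

    /** Essential foreign keys, e.g. the registration_id list captured at snapshot time. */
    @Column(name = "registration_ids")
    private String registrationIds;

    // getters and setters omitted for brevity
}
```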
- JSON-RPC over HTTP(S).
- HTML over HTTP(S) (TODO).
- Twitter API (TODO).
- Foursquare API (TODO).
- Facebook API (TODO).
- Atom/RSS Feed (TODO).
- Integrated User/Organization Workspace. All workspace tables are stored inside the app database, inside the tenant's schema, with appropriate prefixes for namespacing (i.e. `p_` for person and `o_` for organization). The schema of all workspace tables is fully managed by the app and is considered the app's responsibility. Currently stored in PostgreSQL for storage efficiency and speed of indexing/aggregate reports, but MongoDB or Cassandra may be good too (a small naming sketch follows this item).
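
A small illustrative helper for the prefix convention above (hypothetical, not part of the app):

```java
/** Builds a fully-qualified workspace table name per the p_/o_ prefix convention. */
public class WorkspaceTableNamer {

    /** e.g. tableName("acme", true, "bookmark") -> "acme.p_bookmark" */
    public static String tableName(String tenantSchema, boolean personOwned, String baseName) {
        String prefix = personOwned ? "p_" : "o_";
        return tenantSchema + "." + prefix + baseName;
    }
}
```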
For Hendy, sample data is at: ~/Dropbox/Hendy_Projects/soluvas-scrape/sample/ppdb
- Soluvas Analytics. After getting the data, present it in a pleasing and engaging way.
- Soluvas ETL. Integrate incoming scraped data with other systems; post to or sync with social media.
- Soluvas AI. Add some intelligence to fill in missing information or to predict new values.
- Soluvas Buzz. Run social media, email, and online marketing campaigns with scraped data.
- Soluvas Publisher will be superseded by Soluvas Scrape + Soluvas ETL/Soluvas Buzz.