This is a Spark project that generates reports about providers.
The project is complete, and the code is available in the `src` folder. The code is written in Scala, uses Spark to process the data, and is supported by unit tests. The reports are saved in JSON format in the `data` folder, in the `output1` and `output2` folders.
- Language: Scala
- Framework: Spark
- Output: JSON file(s)
Within the `data` subfolder, you'll find two sets of data:
- `providers.csv` - A CSV containing data about providers and their respective specialties.
- `visits.csv` - A CSV with the unique visit ID, the provider ID (the ID of the provider who was visited), and the date of service of the visit.
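As a starting point, the two files could be loaded roughly as follows. This is only a sketch: the object name `LoadInputs`, the helper `readCsv`, the header settings, and the visit column names (`visit_id`, `provider_id`, `date_of_service`) are assumptions for illustration, so check them against the actual files before relying on them.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object LoadInputs {
  // Reads a CSV file into a DataFrame of string columns.
  // Whether a file has a header row is an assumption the caller must verify.
  def readCsv(spark: SparkSession, path: String, hasHeader: Boolean): DataFrame =
    spark.read
      .option("header", hasHeader)
      .csv(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("provider-reports")
      .master("local[*]")
      .getOrCreate()

    // Assumed: providers.csv starts with a header row.
    val providers = readCsv(spark, "data/providers.csv", hasHeader = true)

    // Assumed: visits.csv has no header row and exactly three columns.
    // toDF fails if the column count differs, which surfaces a wrong assumption early.
    val visits = readCsv(spark, "data/visits.csv", hasHeader = false)
      .toDF("visit_id", "provider_id", "date_of_service")

    providers.printSchema()
    visits.printSchema()
    spark.stop()
  }
}
```

Printing the schemas first is a cheap way to confirm the real column names and delimiter before writing any aggregation.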
Feel free to use this repository as a basis for solving these problems. You can also supply your own code.
- Given the two datasets, calculate the total number of visits per provider. The resulting set should contain the provider's ID, name, and specialty, along with the number of visits. Output the report in JSON, partitioned by the provider's specialty.
- Given the two datasets, calculate the total number of visits per provider per month. The resulting set should contain the provider's ID, the month, and the total number of visits. Output the result set in JSON.
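The two reports above can be sketched with the DataFrame API as shown here. This is not necessarily how the code in the repository does it. The object name `ProviderReports`, the column names (`provider_id`, `name`, `specialty`, `visit_id`, `date_of_service`), the ISO `yyyy-MM-dd` date format, and the mapping of the first report to `output1` and the second to `output2` are all assumptions.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, date_format, to_date}

object ProviderReports {
  // Report 1: total visits per provider, with name and specialty.
  // The left join keeps providers that have no visits; counting the
  // nullable visit_id column gives them a total of 0.
  def visitsPerProvider(providers: DataFrame, visits: DataFrame): DataFrame =
    providers
      .join(visits, Seq("provider_id"), "left")
      .groupBy("provider_id", "name", "specialty")
      .agg(count(col("visit_id")).as("total_visits"))

  // Report 2: total visits per provider per month.
  // Assumes date_of_service is a yyyy-MM-dd string; the month is kept
  // as "yyyy-MM" so that different years do not collapse together.
  def visitsPerProviderPerMonth(visits: DataFrame): DataFrame =
    visits
      .withColumn("month", date_format(to_date(col("date_of_service")), "yyyy-MM"))
      .groupBy("provider_id", "month")
      .agg(count(col("visit_id")).as("total_visits"))

  // Writes both reports as JSON under outDir (assumed folder mapping).
  def writeReports(providers: DataFrame, visits: DataFrame, outDir: String): Unit = {
    visitsPerProvider(providers, visits)
      .write.mode("overwrite")
      .partitionBy("specialty") // one sub-folder per specialty value
      .json(s"$outDir/output1")

    visitsPerProviderPerMonth(visits)
      .write.mode("overwrite")
      .json(s"$outDir/output2")
  }
}
```

Whether providers with zero visits belong in the first report is a judgment call; switching the join type to `"inner"` would drop them. Note that `partitionBy("specialty")` moves the specialty value into the folder name (`specialty=...`), so it no longer appears inside each JSON record.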
Please provide your code in a format that lets us run it during evaluation. Examples include GitHub/GitLab links, .scala files, or zipped folders containing the modified source project.