OSHA Inspection Data Scraper

This project is a set of Python scripts that scrape inspection data from the OSHA (Occupational Safety and Health Administration) website. The scripts extract inspection numbers, inspection details, and other relevant data, using BeautifulSoup for HTML parsing and Selenium for browser automation.

Features

Scrapes data such as "Inspection Nr", "Report ID", "Date Opened", and violation summaries.
Supports both batch and individual scraping modes for efficiency.
Stores the scraped data in Excel and text formats.
Includes logging for easy debugging and monitoring of the scraping process.

Prerequisites

To use this project, ensure you have the following:

Python 3.8+
Required Python Packages:
- requests
- beautifulsoup4
- selenium
- webdriver-manager
- tqdm
- pandas

You can install all dependencies using the following command:

$ pip install -r requirements.txt

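ChromeDriver: The project uses Chrome as the default browser, so Chrome itself must be installed. The webdriver-manager package downloads a matching ChromeDriver automatically at run time, so no manual ChromeDriver installation should be needed.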
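
For reference, the usual way to let webdriver-manager supply ChromeDriver to Selenium 4 looks roughly like the sketch below. The function name make_driver and the headless default are assumptions for illustration; the project's scripts may configure the browser differently.

```python
def make_driver(headless: bool = True):
    """Start Chrome with a ChromeDriver fetched by webdriver-manager.

    Hypothetical helper, not taken from the project's scripts.
    """
    # Imported here so that merely defining the helper does not
    # require the browser stack to be installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    options = webdriver.ChromeOptions()
    if headless:
        # Run without opening a visible browser window.
        options.add_argument("--headless=new")

    # install() downloads a driver matching the local Chrome (if it is
    # not cached already) and returns the path to the executable.
    service = Service(ChromeDriverManager().install())
    return webdriver.Chrome(service=service, options=options)
```

Calling make_driver() returns a WebDriver instance; remember to call its quit() method when scraping is finished so the browser process is closed.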

Files in the Project

inspection_bs4.py: Scrapes inspection numbers using BeautifulSoup and stores the data in logs. This script is configured to handle smaller batches of data efficiently.
inspection_selenium.py: Uses Selenium to scrape OSHA inspection data by simulating a browser. It allows processing larger batches of data and extracts inspection details.
summary.py: Extracts "Summary Nrs" from HTML files and saves them into a text file for further processing.
utils.py: Contains utility functions to assist with reading files, fetching inspection numbers, and handling HTML data.
inspection_detail.py: Retrieves detailed information about specific inspections from OSHA by navigating the website via Selenium.
Summary_Nrs.txt: A sample file containing a list of Summary Nrs to process.

Usage

1. Extract Summary Numbers

To extract summary numbers from HTML files, run:

$ python summary.py --directory /path/to/html/files
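
As a rough illustration of what this step involves, the sketch below collects numbers from saved results pages using only the standard library. The link pattern (accident_detail?id=), the class name, and the function names are assumptions made for this example; summary.py itself may rely on BeautifulSoup and different selectors.

```python
import re
from html.parser import HTMLParser
from pathlib import Path
from typing import List

# Assumed shape of a summary link; adjust to the markup of the saved pages.
SUMMARY_LINK = re.compile(r"accident_detail\?id=(\d+)")


class SummaryLinkParser(HTMLParser):
    """Collects the numeric id from every <a> tag matching SUMMARY_LINK."""

    def __init__(self) -> None:
        super().__init__()
        self.summary_nrs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # An attribute without a value is reported as None.
        href = dict(attrs).get("href") or ""
        match = SUMMARY_LINK.search(href)
        if match:
            self.summary_nrs.append(match.group(1))


def extract_summary_nrs(html: str) -> List[str]:
    """Return the summary numbers found in one HTML document, in page order."""
    parser = SummaryLinkParser()
    parser.feed(html)
    return parser.summary_nrs


def extract_from_directory(directory: str) -> List[str]:
    """Run extract_summary_nrs over every .html file in a directory."""
    numbers: List[str] = []
    for path in sorted(Path(directory).glob("*.html")):
        html = path.read_text(encoding="utf-8", errors="replace")
        numbers.extend(extract_summary_nrs(html))
    return numbers
```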

2. Scrape Inspection Numbers

To scrape inspection numbers based on the summary numbers, you can use:

$ python inspection_bs4.py --file Summary_Nrs.txt
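
The input file is a plain list of numbers, one per line. A minimal sketch of reading such a file and splitting it into batches is shown below; read_numbers, batched, and the batch size are hypothetical names and values chosen for this example, not functions taken from utils.py.

```python
from pathlib import Path
from typing import Iterator, List


def read_numbers(path: str) -> List[str]:
    """Read one number per line, skipping blank lines and stray whitespace."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Processing the list in batches makes it easier to resume after a failure, since only the batch that was in progress needs to be repeated.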

3. Scrape Detailed Inspection Information

To scrape detailed inspection data for specific inspection numbers, run:

$ python inspection_detail.py --input-file_path Summary_Nrs.txt

Command Line Options

--directory or -D: Specifies the directory containing HTML files (used in summary.py).
--file or -F: Specifies the file containing the list of Summary Nrs (used in inspection_bs4.py).
--input-file_path or -I: Path to the file containing the list of Summary Nrs (used in inspection_detail.py).
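
The options above map naturally onto argparse. The sketch below shows one way two of them could be declared; the function names, help texts, and the choice to mark each option as required are assumptions for illustration. Note that argparse replaces hyphens with underscores when deriving the attribute name, so --input-file_path is read back as input_file_path.

```python
import argparse


def summary_parser() -> argparse.ArgumentParser:
    """Parser for the option used by summary.py (hypothetical layout)."""
    parser = argparse.ArgumentParser(prog="summary.py")
    parser.add_argument(
        "--directory", "-D",
        required=True,  # assumption: the script needs a directory to run
        help="Directory containing the saved HTML files.",
    )
    return parser


def detail_parser() -> argparse.ArgumentParser:
    """Parser for the option used by inspection_detail.py (hypothetical layout)."""
    parser = argparse.ArgumentParser(prog="inspection_detail.py")
    parser.add_argument(
        "--input-file_path", "-I",
        required=True,  # assumption, as above
        help="File containing the list of Summary Nrs.",
    )
    return parser
```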

Logging

Logs are automatically created and stored in the logs/ directory. The logging format includes timestamps and relevant information about the operations being performed.
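
A setup along these lines would produce timestamped log files in a logs/ directory. It is a minimal sketch using the standard logging module; the function name get_logger, the file naming scheme, and the exact format string are assumptions and may differ from what the scripts actually use.

```python
import logging
from pathlib import Path


def get_logger(name: str, log_dir: str = "logs") -> logging.Logger:
    """Return a logger that writes timestamped records to <log_dir>/<name>.log."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)

    # Attach the file handler only once, so repeated calls with the same
    # name do not produce duplicate log lines.
    if not logger.handlers:
        handler = logging.FileHandler(Path(log_dir) / f"{name}.log", encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger
```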

Output

The output data is saved in the following formats:

Text Files: Inspection numbers are saved in .txt files.
Excel Files: Detailed inspection data is saved in .xlsx files.
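
The two output formats could be written roughly as follows. This is a sketch under assumptions: the function names are hypothetical, and the Excel step relies on pandas together with an Excel writer engine such as openpyxl, which has to be installed separately from pandas.

```python
from pathlib import Path
from typing import Dict, List


def save_inspection_nrs(numbers: List[str], path: str) -> None:
    """Write one inspection number per line to a .txt file."""
    Path(path).write_text("\n".join(numbers) + "\n", encoding="utf-8")


def save_details_xlsx(rows: List[Dict[str, str]], path: str) -> None:
    """Write one row per inspection to an .xlsx file.

    Each dict maps a column name (for example "Inspection Nr",
    "Report ID", "Date Opened") to its value.
    """
    # Imported lazily so the text output works without pandas installed.
    import pandas as pd

    pd.DataFrame(rows).to_excel(path, index=False)
```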

License

This project is licensed under the MIT License.

Acknowledgements

This project uses data provided by OSHA (Occupational Safety and Health Administration).
Web scraping is performed using the BeautifulSoup and Selenium libraries.

Author

Created by @pikaybh
