find software to convert scanned docs -> searchable text #5

Open
chooliu opened this issue May 12, 2017 · 3 comments

Comments

@chooliu
Contributor

chooliu commented May 12, 2017

Are there open-source tools that convert the text in a scanned letter or agenda into searchable text?

@chooliu
Contributor Author

chooliu commented May 12, 2017

@davidmooreppf

This relatively new software service from the nonprofit Open Media Foundation helps small governments publish agendas as searchable text.

I'll add that our open-source Councilmatic could be enhanced to take PDFs of agendas and publish them in structured formats. (The PDFs currently come mostly from Legistar, the government vendor that is the primary publisher of city data.)

@andysbolton
Contributor

andysbolton commented Jul 18, 2017

Here's one option I've tried out: PyPDF2

Output for the San Leandro agenda for 7/17/2017 at 5:30pm (link here):

>>> import PyPDF2
>>> pdf = open("C:/Users/andys/Downloads/Agenda (6).pdf", "rb")
>>> pdfReader = PyPDF2.PdfFileReader(pdf)
>>> page = pdfReader.getPage(0)
>>> page.extractText()
"City CouncilCity of San LeandroMeeting AgendaCivic Center835 East 14th StreetSan Leandro, CaliforniaWelcome to your City of San Leandro City Council meeting.Your City Councilmembers are:Mayor Pauline Russo CutterDeborah Cox, District 1Ed Hernandez, District 2Lee Thomas, District 3Benny Lee, District 4Corina N. Lopez, District 5Pete Ballew, District 6City Manager's Large Conference Room5:30 PMMonday, July 17, 2017Special Meeting and Closed Session1.CALL TO ORDER1.A.ROLL CALLMembers Ballew, Cox, Hernandez, Lee, Lopez, Thomas; Mayor Cutter1.B.ANNOUNCEMENTS2.PUBLIC COMMENTSPublic Comments are limited to 3 minutes per speaker, subject to adjustment by the Mayor.  The public is invited to make comments on Closed Session items only at this time.3.CLOSED SESSION3.A.CONFERENCE WITH LABOR NEGOTIATORS Agency designated representatives: Bill Avery/ Emily Hung/ Chris Zapata Employee organization: San Leandro Confidential Employees Association (SLCEA)3.B.CONFERENCE WITH LEGAL COUNSELŠEXISTING LITIGATION(Paragraph (1) of subdivision (d) of Section 54956.9)Name of case: Coalition for the San Leandro Shoreline v. City of San Leandro, Case No. RG15782404, Alameda County Superior Court4.ADJOURNAdjourn to Regular Meeting at 7:00 p.m. in City Council ChambersPage 1 City of San LeandroPrinted on 7/11/2017"

Some PDFs yield extracted text that contains newline characters, but others, as in this example, just return a single string without any real structure. I don't know how easy this would be for the text wranglers to parse.
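One possible way to recover some structure from a run-together string like the one above is to split on the agenda's item labels. This is only a rough sketch: `split_agenda_items` and `ITEM_LABEL` are names made up for illustration, and the pattern assumes labels look like `1.` or `1.A.` followed directly by an upper-case heading, as they do in this San Leandro agenda. Other cities' agendas may well need a different pattern.

```python
import re

# Assumed label shape (based only on the sample above): "1." or "1.A."
# followed directly by at least two upper-case letters of a heading.
# The lookbehind avoids matching in the middle of a number like "54956.9".
ITEM_LABEL = re.compile(r"(?<![\d.])(\d{1,2}\.(?:[A-Z]\.)?)(?=[A-Z]{2})")


def split_agenda_items(text):
    """Return (label, body) pairs for each numbered item found in text."""
    parts = ITEM_LABEL.split(text)
    # parts[0] is whatever precedes the first label; after that,
    # labels and bodies alternate because the pattern has one capture group.
    labels = parts[1::2]
    bodies = parts[2::2]
    return [(label, body.strip()) for label, body in zip(labels, bodies)]


# A shortened excerpt of the extracted text shown above.
sample = (
    "Special Meeting and Closed Session1.CALL TO ORDER1.A.ROLL CALL"
    "Members Ballew, Cox, Hernandez, Lee, Lopez, Thomas; Mayor Cutter"
    "1.B.ANNOUNCEMENTS2.PUBLIC COMMENTSPublic Comments are limited to "
    "3 minutes per speaker, subject to adjustment by the Mayor."
)

for label, body in split_agenda_items(sample):
    print(label, body)
```

Note that this only separates the numbered items. A heading and the sentence after it still run together (for example `ROLL CALLMembers ...`), so the text wranglers would probably need a second pass for that.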
