lack of validation on returned case citations #142

Open
markmatney opened this issue May 18, 2018 · 3 comments

Comments

@markmatney

I am experimenting with some of the "Statutes at Large" searchable PDFs on FDsys. The text layer presumably contains raw OCR output, since it has a lot of errors. I am extracting the text layer and sending it to cite-server running locally.

The following code snippets return false positives:

Citation.find("pursuant to 5 use 552(a)(1)(E) and") // "use" instead of "usc"
Citation.find("pursuant to 5 GARBAGE 552(a)(1)(E) and")
Citation.find("The sum of 27 and 42 is a number between 68 and 70.") // two citations found!

I see the first case ("use") often, because US Code citations in historical documents frequently omit the periods in the abbreviation "USC" (see https://www.gpo.gov/fdsys/pkg/STATUTE-70/content-detail.html, open the PDF, search for the string "use", and see it highlighted often in the margins). I think the OCR engine that generated the text guessed "use", a word more common in everyday English than "USC". (To be clear, I'm NOT suggesting that it is the citation finder's responsibility to anticipate and fix things like OCR errors.)

The last case, a single word between two numbers, pops up every once in a while (see #100).

Generally, the issue seems to be that citations of the reporter type are not being properly validated before being returned to the caller of Citation.find.
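A minimal sketch of the kind of validation meant here, written as standalone JavaScript rather than against the library's internals. Everything named in it is an assumption for illustration: the `KNOWN_REPORTERS` allowlist, the loose `CANDIDATE` pattern, and the `findValidatedReporterCitations` helper are hypothetical and are not part of the citation library. The idea is simply that a "number, word, number" match is only returned when the middle token is a recognized reporter abbreviation.

```javascript
// Hypothetical allowlist of reporter abbreviations (illustration only;
// a real list would be much longer and would handle multi-word reporters).
const KNOWN_REPORTERS = new Set(["U.S.", "F.2d", "F.3d", "F.Supp."]);

// Deliberately loose candidate pattern: <number> <single token> <number>.
// On its own it matches "5 use 552" and "27 and 42", like the cases above.
const CANDIDATE = /\b(\d+)\s+(\S+)\s+(\d+)\b/g;

// Return only the candidates whose middle token is a known reporter.
function findValidatedReporterCitations(text) {
  const results = [];
  for (const match of text.matchAll(CANDIDATE)) {
    const [, volume, reporter, page] = match;
    if (KNOWN_REPORTERS.has(reporter)) {
      results.push({ volume, reporter, page });
    }
  }
  return results;
}

// Usage: the three false positives are rejected, a real reporter is kept.
console.log(findValidatedReporterCitations("pursuant to 5 use 552(a)(1)(E) and").length); // 0
console.log(findValidatedReporterCitations("See 347 U.S. 483 for details.").length); // 1
```

An allowlist trades recall for precision: an abbreviation missing from the list is silently dropped, so the list would need to be kept reasonably complete.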

@konklone
Member

konklone commented Jun 2, 2018

This is a great writeup of the problem, thank you! I'll take a look into this, though I don't have an ETA for it. If you're using this in something where time is of the essence, let me know -- and I'd welcome a pull request with a fix, if you have one.

@mlissner
Contributor

mlissner commented Jun 4, 2018

Weird. Seems like the regex here would only allow USC or U.S.C.:

https://github.com/unitedstates/citation/blob/master/citations/usc.js#L51

Is it being picked up as a U.S.C. citation?

@markmatney
Author

@konklone not time sensitive for me. I'm happy to collaborate on this issue though. I wouldn't be able to take it on entirely myself (not too knowledgeable about law and legal citations) but I am quite good with regular expressions.
@mlissner each of the examples I mentioned is being interpreted as a citation of type reporter, not as a U.S.C. citation.
