Patents from the United States Patent and Trademark Office (USPTO) are arguably one of the best labeled text datasets, and one of my favorites. Every patent is freely available with labeled images, an abstract, claims, a long description, authors, dates, classification labels, etc.
Data in the provided format can be used for a lot of natural language processing (NLP) tasks, particularly document classification, text generation, summarization, etc.
What makes this dataset so excellent is that there are millions of patents freely available on the web for download: bulkdata.uspto.gov. For reference, the patent text alone runs to hundreds of gigabytes per year of patents.
The only downside is that the USPTO bundles multiple patents into one large file that is not compliant XML (even though it is labeled .xml), then zips the file.
Over the next few sections we’ll cover how to download, unzip, then parse the patents into a usable format (primarily for NLP-related tasks).
Downloading USPTO Patents
Downloading patents from the USPTO is very easy. The USPTO has a portion of its website dedicated to bulk downloads. For the use cases described here (NLP), the documents in the most accessible format are listed under the following header:
Patent Grant Full Text Data (No Images) (JAN 1976 – PRESENT)
Contains the full text of each patent grant issued weekly (Tuesdays) from January 1, 1976 to present (excludes images/drawings). Subset of the Patent Grant Full Text Data with Embedded TIFF Images.
Patents in this section will primarily be in an xml-esque format with HTML tags that can be parsed easily in Python with BeautifulSoup or a similar library.
To download patents, select a year. I generally use 2002 and later, since those are in the best format.
When you click on a year, you’ll be presented with zip files for all the patents, one file for each week.
Click a file to download all the granted patents for that week. The download will be a .zip file, such as ipg050104.zip.
To unzip it, use your favorite unzip tool. If you’re on Linux, the unzip command works: unzip ipg050104.zip
After the file is unzipped you’ll get a file of patents for the given week, often hundreds of megabytes.
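If you would rather stay in Python for the unzip step, the standard library's zipfile module can do the same thing. This is a minimal sketch, not part of the original workflow: the function name extract_weekly_archive and the output directory are made up for illustration, and the archive name is the example from above.

```python
import zipfile
from pathlib import Path


def extract_weekly_archive(zip_path, out_dir="."):
    """Extract every member of a weekly archive and return the extracted paths.

    Hypothetical helper: the name and signature are illustrative only.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # Weekly archives are large, so this can take a while.
        archive.extractall(out_dir)
        return [out_dir / name for name in archive.namelist()]


# Example usage (assumes ipg050104.zip is in the current directory):
# paths = extract_weekly_archive("ipg050104.zip", "patents")
```

The function returns the paths of the extracted files, so the next step can open them directly.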
Parsing USPTO Patents
Parsing the USPTO documents is relatively straightforward: load the unzipped document, split it on the XML header, then pull the application or grant out of each split XML patent. All of this takes only a few lines of Python.
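A minimal sketch of the splitting step, assuming each patent in the bundle begins with its own `<?xml` declaration (which is what makes the combined file non-compliant). The function name split_patents and the constant XML_HEADER are hypothetical names chosen for this example.

```python
XML_HEADER = "<?xml"  # assumption: every bundled patent starts with this declaration


def split_patents(bulk_text):
    """Split the concatenated bulk file into one XML string per patent."""
    chunks = bulk_text.split(XML_HEADER)
    # str.split drops the separator, so put the header back on each chunk
    # and skip the empty piece that precedes the first header.
    return [XML_HEADER + chunk for chunk in chunks if chunk.strip()]


# Example usage (the .xml file name is an assumption about what the zip contains):
# with open("ipg050104.xml", encoding="utf-8") as handle:
#     patents = split_patents(handle.read())
```

Reading the whole file into memory is the simplest approach. For weekly files that run to hundreds of megabytes, reading line by line and starting a new patent whenever a line begins with the header would use less memory.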
Once each application has been split out, BeautifulSoup makes it easy to pull out the HTML-style tags (which is how the USPTO applications are structured). By parsing the tags, it’s possible to pull the titles, international patent classification (IPC), inventors, abstract, claims, etc.
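A sketch of the tag-extraction step with BeautifulSoup, under some assumptions: the tag names invention-title, abstract, and claim are my reading of the grant format and may differ between years, so check them against a real file. The function name parse_patent is hypothetical. IPC codes and inventors are left out here because their tags change between format versions.

```python
from bs4 import BeautifulSoup


def parse_patent(xml_text):
    """Pull a few text fields out of a single split patent document."""
    # Python's built-in HTML parser is lenient about the XML-esque markup.
    soup = BeautifulSoup(xml_text, "html.parser")

    def text_of(tag_name):
        tag = soup.find(tag_name)
        return tag.get_text(" ", strip=True) if tag else None

    return {
        "title": text_of("invention-title"),  # assumed tag name
        "abstract": text_of("abstract"),      # assumed tag name
        # One entry per <claim> element (assumed tag name).
        "claims": [c.get_text(" ", strip=True) for c in soup.find_all("claim")],
    }
```

Missing fields come back as None (or an empty list for claims), which keeps the function usable on documents in the bundle that are not ordinary grants.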
That’s all there is to it! The text we’d need for most NLP tasks is now parsed.
If you’d like to run it yourself, check out the Python script to parse patents on my GitHub.
Next up! I’ll show how to use this data to write a classifier for patents, which will take in part of the patent text and label the patent based on the international patent classifications (IPC).