CAMBIA's Sequence Project

Now you can search the protein and DNA sequences listed in patents and patent applications, and specifically search sequences in claims, using bioinformatics tools such as NCBI's BLAST.

Because not all of the sequences in these datasets have been submitted to GenBank with full annotation data, we are unable to offer NCBI's client-server support for all types of BLAST queries (such as filtering by organism name), but we welcome collaboration to improve what we can offer here through the Patent Lens.

Sequence listings have been available from the USPTO patent search output page for any given patent application, but you had to have a reason to go to that patent application among the thousands of others first.  (Also, for any patent application or patent that has a biological sequence listing longer than 300 pages, the listings were only downloadable from a separate webpage!)  CAMBIA's work has brought them together in one readily searchable database, and delineated which sequences are actually mentioned in the claims of the US patent applications.

For those who would like to analyze larger data sets rather than search single sequences, this project provides FASTA-format files of biological sequences extracted from USPTO patent grants and applications. Users of any derived data products produced from the original data set should cite the data, in any publication or in the metadata, in the following form:

Bacon N, Ashton D, Jefferson RA, Connett MB (2006) Biological sequences named and claimed in US patents and patent applications, CAMBIA Patent Lens OS4 Initiative, http://www.patentlens.net.
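For readers who want to work with the FASTA files in bulk, the following is a minimal sketch of reading FASTA records and summarizing their lengths in Python. It relies only on the general FASTA convention (a header line beginning with ">" followed by one or more sequence lines). The sample records and their header text are hypothetical, since the header layout used in the Patent Lens files is not described here.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def length_summary(records: List[Tuple[str, str]]) -> Dict[str, int]:
    """Return the record count, total residues, and longest sequence length."""
    lengths = [len(seq) for _, seq in records]
    return {
        "count": len(lengths),
        "total": sum(lengths),
        "longest": max(lengths, default=0),
    }


# Hypothetical sample data; real headers in the downloadable files may differ.
SAMPLE = """>example_record_1 hypothetical header
ACGTACGT
ACGT
>example_record_2 hypothetical header
MKTAYIAK
"""

records = list(read_fasta(SAMPLE.splitlines()))
summary = length_summary(records)
print(summary)
```

To process a downloaded file instead of the sample, pass an open file handle to read_fasta, which consumes it line by line without loading the whole file into memory.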

  • You may be surprised to know that until CAMBIA did this work starting in June 2006, there wasn't a publicly available cost-free listing, in searchable form, of the biological sequences listed in US patent applications!  
  • CAMBIA also developed a method to delineate which sequences are actually mentioned in the claims of granted patents or patent applications.  The software used to produce the data is also being made available here, in the hope that:
  1. it discloses the methods used and their strengths and weaknesses and may stimulate suggestions for improvements;
  2. it may be extended to handle data formats used by other patent offices;
  3. it provides a parser for WIPO's ST.25 format, which could be used to improve patent data quality by validation prior to publication.
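To illustrate the two ideas above (reading an ST.25 sequence listing and checking which of its sequences are mentioned in the claims), here is a rough sketch in Python. It is not CAMBIA's software and makes no claim to reproduce its method. It assumes only the ST.25 numeric identifiers <210> (SEQ ID NO), <211> (length), <212> (molecule type) and <213> (organism), and it assumes that claims refer to sequences with phrases like "SEQ ID NO: 1". The listing and claim text are invented for the example, and ranges such as "SEQ ID NOS: 1-5" are not handled.

```python
import re
from typing import Dict, Set

# An ST.25 numeric identifier line, e.g. "<212>  DNA".
TAG_LINE = re.compile(r"^\s*<(\d{3})>\s*(.*?)\s*$")

# Assumed claim wording: "SEQ ID NO: 3" or "SEQ ID NOS: 1, 2, and 4".
SEQ_REF = re.compile(
    r"SEQ\s+ID\s+NOS?\s*[:.]?\s*(\d+(?:\s*(?:,\s*(?:and|or)?|and|or)\s*\d+)*)",
    re.IGNORECASE,
)

# Only these header fields are collected in this sketch.
FIELD_NAMES = {"211": "length", "212": "type", "213": "organism"}


def parse_listing(text: str) -> Dict[int, Dict[str, str]]:
    """Collect the <211>-<213> header fields for each <210> entry."""
    entries: Dict[int, Dict[str, str]] = {}
    current = None
    for line in text.splitlines():
        match = TAG_LINE.match(line)
        if not match:
            continue  # sequence data and position numbers are ignored
        tag, value = match.groups()
        if tag == "210" and value.isdigit():
            current = int(value)
            entries[current] = {}
        elif tag in FIELD_NAMES and current is not None:
            entries[current][FIELD_NAMES[tag]] = value
    return entries


def claimed_ids(claims: str) -> Set[int]:
    """Return the SEQ ID numbers referenced in the claim text."""
    ids: Set[int] = set()
    for match in SEQ_REF.finditer(claims):
        ids.update(int(n) for n in re.findall(r"\d+", match.group(1)))
    return ids


# Invented listing and claim, for illustration only.
LISTING = """<160>  2
<210>  1
<211>  12
<212>  DNA
<213>  Artificial Sequence
<400>  1
acgtacgtac gt                                                12
<210>  2
<211>  8
<212>  PRT
<213>  Homo sapiens
<400>  2
Met Lys Thr Ala Tyr Ile Ala Lys
1               5
"""
CLAIMS = "1. An isolated nucleic acid comprising SEQ ID NO: 1."

entries = parse_listing(LISTING)
claimed = claimed_ids(CLAIMS)
# Map each listed sequence to whether the claims mention it.
status = {seq_id: seq_id in claimed for seq_id in entries}
print(status)
```

A pattern match like this only shows that a sequence is named in a claim, not how the claim uses it, so the result is best treated as a starting point for reading the claims themselves.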

We've supplied the information on the sequences in patent applications, and in claims, to GenBank, which is also working with the USPTO. GenBank sequences can now link to the relevant patent application documents in both the USPTO database and the Patent Lens, as currently happens for US granted patent documents.

Besides offering downloadable PDFs, which the USPTO does not supply, the link to the patent documents on the Patent Lens has a further advantage: entries are also linked to information on the status of related patents and applications in other countries that report this information. Since many of these countries do not provide searchable databases to the general public, this may be the only notification the public has about pending gene sequence applications in these countries. We welcome collaboration to develop this dataset further, for example by extending it to sequence listings submitted in patent applications in other jurisdictions.
