CAMBIA's Sequence Software
The directory below contains software to:
- extract biological sequences from USPTO XML sequence listings into FASTA format;
- parse sequence listings in WIPO ST.25 format (a non-XML format that the USPTO uses for bulk sequence listings) and extract the sequences into FASTA format;
- parse patent text, extracting the SEQ ID NOS referenced (e.g. in the claims);
- filter FASTA files (e.g. to extract only the sequences referenced in the claims).
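The last two steps in the list above can be illustrated with a short Java sketch. This is not CAMBIA's code (which uses ANTLR-generated parsers); it is a minimal regex-based illustration. The class name `SeqIdFilterSketch`, both method names, and the assumption that each FASTA header starts with the bare sequence number (e.g. `>2 description`) are hypothetical, chosen only for this example. Ranges such as "SEQ ID NOS: 1-5" are not expanded.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the actual CAMBIA implementation.
class SeqIdFilterSketch {

    // Matches references such as "SEQ ID NO: 3", "SEQ ID No. 3" or
    // "SEQ ID NOS: 1, 2 and 5". Numeric ranges ("1-5") are not expanded.
    private static final Pattern SEQ_ID_REF = Pattern.compile(
        "SEQ\\.?\\s*ID\\.?\\s*NOS?\\.?\\s*:?\\s*"
            + "(\\d+(?:(?:\\s*,\\s*(?:(?:and|or)\\s+)?|\\s+(?:and|or)\\s+)\\d+)*)",
        Pattern.CASE_INSENSITIVE);

    private static final Pattern NUMBER = Pattern.compile("\\d+");

    // Collects every sequence number referenced in the text, sorted and de-duplicated.
    static Set<Integer> extractSeqIds(String text) {
        Set<Integer> ids = new TreeSet<>();
        Matcher ref = SEQ_ID_REF.matcher(text);
        while (ref.find()) {
            Matcher num = NUMBER.matcher(ref.group(1));
            while (num.find()) {
                ids.add(Integer.parseInt(num.group()));
            }
        }
        return ids;
    }

    // Keeps only the FASTA records whose header begins with a wanted number.
    // Assumption: the first token after '>' is the sequence number.
    static String filterFasta(String fasta, Set<Integer> wanted) {
        StringBuilder out = new StringBuilder();
        boolean keep = false;
        for (String line : fasta.split("\\R")) {
            if (line.startsWith(">")) {
                String token = line.substring(1).trim().split("\\s+")[0];
                keep = token.matches("\\d+") && wanted.contains(Integer.parseInt(token));
            }
            if (keep) {
                out.append(line).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String claims = "1. An isolated nucleic acid comprising SEQ ID NO: 2 "
            + "or SEQ ID NOS: 4 and 5.";
        String fasta = ">1 first\nACGT\n>2 second\nGGCC\nTTAA\n>3 third\nAAAA\n";

        Set<Integer> ids = extractSeqIds(claims);
        System.out.println(ids);                    // prints [2, 4, 5]
        System.out.print(filterFasta(fasta, ids));  // prints only the ">2 second" record
    }
}
```

A regular expression is enough to show the idea, but real claim language varies widely (ranges, "SEQ ID NO:s", line-wrapped references), which is one reason a grammar-based parser is a more robust choice for production use.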
The software:
- was developed by CAMBIA as part of CAMBIA's Patent Lens;
- is written in Java;
- uses the ANTLR parser generator;
- is licensed under the GPL; and
- is built using Maven.
Comments and suggestions for improvements are welcome at webmaster@cambia.org.
Acknowledgement
Significant funding for this software and for the collection of these data was provided by the Ministry of Foreign Affairs of Norway through the International Rice Research Institute for CAMBIA's Patent Lens (the OS4 Initiative: Open Source, Open Science, Open Society, Oryza sativa).
Citation
Users of any data product derived from the original data set should cite the data, in any publication or in the metadata, in the following form:
Bacon N, Ashton D, Jefferson RA, Connett MB (2006) Biological sequences named and claimed in US patents and patent applications, CAMBIA Patent Lens OS4 Initiative, http://www.patentlens.net/daisy/patentlens/2205.html.
It is not ethical to publish data without proper attribution or co-authorship and acknowledgement of the ideas and funding. Compilation of this dataset required intellectual, financial and time investment in the conception, preparation and collection of data. Co-authorship in the publication of descriptive or interpretive results derived directly from the data is the privilege and responsibility of these investigators.
Notification
To further the public good mission served by the funder and sponsoring organization, we request that users notify the originators of the data set when any derivative work or publication based on or derived from the data set is distributed.


