Fast Gene Information Extraction from NCBI Entrez
Working with biological data often means dealing with NCBI's Entrez API, a powerful but slow gateway to vast databases like PubMed and Gene.
The challenge?
Entrez responses can be massive (several megabytes), while you often need just a few fields from the beginning of the XML response.
I previously wrote about streaming XML parsing for HTTP responses, showing how to extract data without waiting for complete downloads.
Today, let’s dive deeper into a real-world application: building a smart gene information API using the http-stream-xml library.
The Entrez Challenge
When you request gene information from NCBI Entrez, you get detailed XML responses that can easily exceed 2MB.
But here’s the key insight: the essential gene information (summary, description, synonyms, and locus) appears within the first 5-10KB of the response.
Traditional approaches force you to:
- Wait for the entire multi-megabyte response
- Parse the complete XML document
- Extract just the fields you need
This is wasteful and slow, especially when dealing with unreliable government servers.
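To make the waste concrete, here's a minimal sketch of that traditional pattern, using plain requests and ElementTree against NCBI's public efetch endpoint (gene ID 5465, human PPARA, is just an example):

import requests
import xml.etree.ElementTree as ET

url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
params = {"db": "gene", "id": "5465", "retmode": "xml"}  # 5465 = human PPARA

response = requests.get(url, params=params, timeout=60)  # blocks until all 2MB+ arrive
root = ET.fromstring(response.content)                   # parses the whole document in memory
summary = root.findtext(".//Entrezgene_summary")         # the few bytes we actually wanted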
Smart Streaming Solution
The http-stream-xml library includes a specialized Genes class that demonstrates how to build an intelligent API wrapper:
from http_stream_xml.entrez import genes, GeneFields
# Simple case-insensitive gene lookup with caching
gene_info = genes['PPARA']
print(gene_info[GeneFields.description])
Behind this simple interface lies sophisticated streaming logic:
1. Early Termination Strategy
extractor = XmlStreamExtractor(self.fields)
for line in request.iter_lines(chunk_size=1024):
    if line:
        extractor.feed(line)
        if extractor.extraction_completed:
            break  # Stop as soon as we have all required fields
The parser stops immediately when all required XML tags are found, typically after downloading just 5-10KB instead of the full 2MB response.
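You can watch the early exit happen by feeding the extractor chunks by hand. In this sketch the top-level import path for XmlStreamExtractor is an assumption (adjust it to wherever the library exports the class); the constructor-takes-tag-names usage mirrors the loop above:

from http_stream_xml import XmlStreamExtractor  # import path is an assumption

chunks = [
    b"<Entrezgene><Entrezgene_summary>Nuclear receptor...</Entrezgene_summary>",
    b"<Entrezgene_locus>PPARA</Entrezgene_locus>",
    b"<!-- imagine megabytes more XML here -->",
]

extractor = XmlStreamExtractor(["Entrezgene_summary", "Entrezgene_locus"])
chunks_fed = 0
for chunk in chunks:
    extractor.feed(chunk)
    chunks_fed += 1
    if extractor.extraction_completed:
        break

assert chunks_fed == 2  # the third "chunk" was never consumed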
2. Intelligent Caching Layer
def __getitem__(self, gene_name: str) -> dict[str, Any]:
    gene_name = self.canonical_gene_name(gene_name)  # Case-insensitive
    if gene_name in self.db and len(self.db[gene_name]) >= len(self.fields):
        return self.db[gene_name]  # Return cached result
    gene = self.get_gene_details(gene_name)
    if gene:
        self.db[gene_name] = gene  # Cache for future requests
    return gene
The caching system is smart about partial results—if a previous request didn’t find all fields, it will retry the request.
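Here's a sketch of that behavior, poking at the internal db dict from the snippet above (an implementation detail, not a public API; I'm also assuming canonical names are upper-cased):

info = genes['ppara']        # first lookup: network fetch, then cached
info_again = genes['PPARA']  # same canonical name: answered from genes.db

# Simulate a partial result: with fewer fields than required,
# the length check above fails and the next lookup re-fetches.
genes.db['PPARA'].pop(GeneFields.summary, None)
info_refreshed = genes['PPARA']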
3. Robust Error Handling
Public research database servers can be unreliable. The implementation includes:
from functools import lru_cache
from typing import Collection

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


@lru_cache(maxsize=100)
def requests_retry_session(
    retries: int = 3,
    backoff_factor: float = 1.0,
    status_forcelist: Collection[int] = (500, 502, 504),
):
    """Retry policy for unreliable government servers."""
    session = requests.Session()
    retry = Retry(
        total=retries,
        read=retries,
        connect=retries,
        backoff_factor=backoff_factor,
        status_forcelist=status_forcelist,
    )
    # Configure adapters for both HTTP and HTTPS
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session
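Calling the factory yields a ready-to-use session; with a backoff factor of 1.0, urllib3 sleeps for exponentially growing intervals (roughly 1s, 2s, 4s, depending on the urllib3 version) between retries. A usage sketch against the einfo endpoint:

session = requests_retry_session()  # defaults: 3 retries, exponential backoff
response = session.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi", timeout=30)
response.raise_for_status()

Note the lru_cache decorator: identical parameter combinations share one session, and with it one connection pool, instead of rebuilding both on every call.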
4. Multiple Gene ID Handling
Real-world gene searches often return multiple IDs. The system intelligently handles this:
def get_gene_id(self, gene_name: str) -> Optional[str]:
    # ... search logic ...
    if len(ids) > 1:
        # Try each ID until we find an exact locus match
        for gene_id in ids:
            gene = self.get_gene_details_by_id(gene_id)
            if self.canonical_gene_name(gene[GeneFields.locus]) == gene_name:
                return gene_id
This ensures you get the exact gene you’re looking for, even when multiple matches exist.
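To see why the disambiguation matters, query the raw esearch endpoint yourself; a bare symbol search typically matches gene records for several species, which is exactly what get_gene_id has to sort through:

import requests

response = requests.get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
    params={"db": "gene", "term": "PPARA[sym]", "retmode": "json"},
    timeout=30,
)
print(response.json()["esearchresult"]["idlist"])  # IDs for human, mouse, rat, ...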
Performance Benefits
The streaming approach delivers significant performance improvements:
- Speed: Extract data in ~1-2 seconds instead of 10-30 seconds
- Bandwidth: Download 5-10KB instead of 2MB+ per request
- Reliability: Early termination reduces exposure to network timeouts
- Scalability: Built-in caching eliminates redundant API calls
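These figures will vary with your network and NCBI's load, but they're easy to sanity-check with a rough timer:

import time

start = time.perf_counter()
genes['MYO5B']  # uncached lookup: streamed and early-terminated
print(f"first lookup: {time.perf_counter() - start:.1f}s")

start = time.perf_counter()
genes['MYO5B']  # repeat lookup: answered from the cache
print(f"cached lookup: {time.perf_counter() - start:.4f}s")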
Practical Usage Patterns
Basic Gene Lookup
from http_stream_xml.entrez import genes, GeneFields
# Get gene description
description = genes['SLC9A3'][GeneFields.description]
print(f"Gene function: {description}")
Batch Processing
gene_names = ['PPARA', 'SLC9A3', 'MYO5B', 'PDZK1']
for name in gene_names:
    if gene_data := genes[name]:
        print(f"{name}: {gene_data[GeneFields.summary]}")
Custom Fields
from http_stream_xml.entrez import Genes, GeneFields
# Create specialized instance for specific fields
custom_genes = Genes(fields=[GeneFields.summary, GeneFields.synonyms])
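A custom instance is used exactly like the module-level genes singleton; it just extracts (and caches) fewer fields per request:

synonyms = custom_genes['PPARA'][GeneFields.synonyms]
print(f"Also known as: {synonyms}")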
Configuration Options
The Genes class offers flexible configuration:
genes = Genes(
    timeout=30,                   # Request timeout
    max_bytes_to_fetch=10*1024,   # Safety limit
    api_key="your_entrez_key",    # For higher rate limits
    fields=[GeneFields.summary],  # Customize extracted fields
)
Conclusion
The http-stream-xml library's Entrez integration demonstrates how streaming XML parsing can transform API interactions with large, slow data sources. By combining early termination, intelligent caching, and robust error handling, you can build APIs that are both fast and reliable.
This approach isn’t limited to biological data—any scenario involving large XML responses with front-loaded important data can benefit from similar streaming strategies.
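As a non-biology sketch: the same loop can pull just the channel title out of a large RSS feed after a kilobyte or two. The XmlStreamExtractor import path is again my assumption, and any sizable feed URL works:

import requests
from http_stream_xml import XmlStreamExtractor  # import path is an assumption

extractor = XmlStreamExtractor(["title"])  # the feed's first <title> is the channel's
with requests.get("https://hnrss.org/frontpage", stream=True, timeout=30) as response:
    for line in response.iter_lines(chunk_size=1024):
        if line:
            extractor.feed(line)
            if extractor.extraction_completed:
                break  # stop long before the end of the feed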
The next time you’re faced with slow, large API responses, consider whether the data you need appears early in the response. If so, streaming parsing might be your performance salvation.