Portfolio
The Portfolio class lets you interact with SEC submissions. A Portfolio consists of a folder that contains subfolders named after SEC submission accession numbers.
Attributes
- `portfolio.path` - path to the portfolio folder
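Example
A minimal sketch of creating a Portfolio; the folder name `'test'` is arbitrary:

```python
from datamule import Portfolio

# Create or open a portfolio backed by the 'test' folder
portfolio = Portfolio('test')
print(portfolio.path)  # path to the portfolio folder
```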
set_api_key
Use this if you don't want to store the API key as an environment variable.

```python
set_api_key(api_key)
```
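A minimal sketch, shown here on a Portfolio instance to mirror the other methods in this section; the key string is a placeholder:

```python
portfolio.set_api_key('your-api-key-here')  # placeholder, not a real key
```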
download_submissions

```python
download_submissions(self, cik=None, ticker=None, submission_type=None, filing_date=None, provider=None, document_type=None, keep_filtered_metadata=False, requests_per_second=5, skip_existing=True, **kwargs)
```
Parameters
- cik - company CIK, e.g. `0001318605`, `1318605`, or `['0001318605','789019']`
- ticker - e.g. `'TSLA'` or `['TSLA','AAPL','MSFT']`
- submission_type - the submission type, e.g. `'10-K'` or `['3','4','5']`
- document_type - download only a specific document type within a submission, e.g. setting this to `'PROXY VOTING RECORD'` for `submission_type='N-PX'` will download only the proxy voting record file
- filing_date - e.g. `'2023-05-09'`, `('2023-01-01','2024-01-01')`, or `['2023-01-01','2024-11-21','2025-01-23']`
- provider - e.g. `sec` or `datamule`; if not set, uses defaults from the config
- requests_per_second - the SEC's hard rate limit is 10 requests/second; the soft limit is 5 requests/second over long durations
- keep_filtered_metadata - whether metadata on documents within a submission should be kept or discarded when documents are filtered
- skip_existing - whether to skip downloading submissions that are already in the Portfolio
- `**kwargs`
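Example
An illustrative call; the ticker, submission type, and date range are arbitrary choices:

```python
# Download Tesla's 10-K filings filed during 2019
portfolio.download_submissions(
    ticker='TSLA',
    submission_type='10-K',
    filing_date=('2019-01-01', '2019-12-31'),
)
```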
Filtering
Filters restrict which submissions are downloaded. Filters can be chained.
Example

```python
portfolio.filter_text('"climate change"', filing_date=('2019-01-01', '2019-01-31'), submission_type='10-K')
portfolio.filter_text('drought', filing_date=('2019-01-01', '2019-01-31'), submission_type='10-K')
portfolio.download_submissions(filing_date=('2019-01-01', '2019-01-31'), submission_type='10-K')
```
filter_text

```python
filter_text(self, text_query, cik=None, ticker=None, submission_type=None, filing_date=None, **kwargs)
```
Parameters
- text_query - e.g. `"machine learning"`. See the full-text search documentation for query syntax
- cik - company CIK, e.g. `0001318605`, `1318605`, or `['0001318605','789019']`
- ticker - e.g. `'TSLA'` or `['TSLA','AAPL','MSFT']`
- submission_type - the submission type, e.g. `'10-K'` or `['3','4','5']`
- filing_date - e.g. `'2023-05-09'`, `('2023-01-01','2024-01-01')`, or `['2023-01-01','2024-11-21','2025-01-23']`
- `**kwargs`
filter_xbrl

```python
filter_xbrl(self, taxonomy, concept, unit, period, logic, value)
```

For help filling out these arguments, see the XBRL documentation.
Parameters
- taxonomy - e.g. `dei`, `us-gaap`, etc.
- concept - e.g. `AccountsPayableCurrent`
- unit - e.g. `USD`
- period - e.g. `CY2019Q4I`
- logic - `'>'`, `'>='`, `'=='`, `'!='`, `'<='`, or `'<'`
- value - the value to compare the concept against
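Example
An illustrative call using the argument values from the list above; the threshold is an arbitrary choice:

```python
# Keep only submissions where accounts payable exceeded $100M
# as of Q4 2019 (threshold chosen for illustration)
portfolio.filter_xbrl(
    taxonomy='us-gaap',
    concept='AccountsPayableCurrent',
    unit='USD',
    period='CY2019Q4I',
    logic='>',
    value=100000000,
)
```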
monitor_submissions
Monitor the SEC for new submissions.

```python
monitor_submissions(data_callback=None, interval_callback=None,
                    polling_interval=1000, quiet=True, start_date=None,
                    validation_interval=60000)
```
hits format:

```python
[{'accession': 176693425000001, 'submission_type': 'D/A', 'ciks': ['1766934'], 'filing_date': '2025-05-28'}...]
```
Parameters
- data_callback - function that consumes hits
- interval_callback - function that is called between polls
- quiet - whether to suppress printed output
- start_date - start date for backfill
- polling_interval - interval in ms at which to poll the RSS feed. If set to None, will never poll
- validation_interval - interval in ms at which to run a more robust check of what submissions have been submitted. If set to None, will never validate
rate limit sharing
Documentation for rate limit sharing (e.g. downloading each submission as it comes out) will be added later.
Example

```python
from datamule import Portfolio
from time import time

start_time = time()

portfolio = Portfolio('test')
portfolio.monitor.set_domain_rate_limit(domain='sec.gov', rate=3)

def interval_callback():
    global start_time
    print(f"elapsed time: {time() - start_time} seconds")

def data_callback(hits):
    print(f"Number of new hits: {len(hits)}")

portfolio.monitor_submissions(validation_interval=60000, start_date='2025-04-25',
                              quiet=True, polling_interval=1000,
                              interval_callback=interval_callback, data_callback=data_callback)
```
Architecture
There are two ways to monitor SEC submissions in real time:
1. RSS - ~25% of submissions are missed
2. EFTS - often 30-50s slower than the RSS feed
The `Monitor` class is a compromise that uses both systems. I will likely do a write-up on how it works later, because both systems are annoying to work with. If you have a use case that requires insane levels of speed, feel free to email me for advice.
stream_submissions
Get new SEC submissions by listening to datamule's websocket. Requires an API key.

```python
stream_submissions(data_callback=None, api_key=None, quiet=False)
```
hits format:

```python
[{'accession': '95017025085535', 'submission_type': '4', 'ciks': ['109198', '1278731'], 'filing_date': '2025-06-12', 'detected_time': 1749762028168, 'source': 'rss'}...]
```
Parameters
- data_callback - function that consumes hits
- api_key - datamule API key (see set_api_key)
- quiet - whether to suppress printed output
Example

```python
portfolio = Portfolio('websockettest')

def data_callback(hits):
    for hit in hits:
        print(hit)

portfolio.stream_submissions(data_callback=data_callback)
```
document_type
Iterate through documents in a portfolio based on type.
Example

```python
for document in portfolio.document_type('10-K'):
    print(document.path)
```
Iterable
Iterate through submissions in a portfolio.
Example

```python
for submission in portfolio:
    print(submission.path)
```
process_submissions
Process submissions within a portfolio using threading (faster).

```python
process_submissions(self, callback)
```

Example

```python
def callback_function(submission):
    print(submission.metadata['cik'])

# Process submissions - note that filters are applied here
portfolio.process_submissions(callback=callback_function)
```
compress
Compress all individual submissions into batch tar files for efficient storage.

```python
compress(self, compression=None, compression_level=None, threshold=1048576, max_batch_size=1024*1024*1024, max_workers=None)
```
Parameters
- compression - compression algorithm for large documents: `'gzip'`, `'zstd'`, or `None` (default: `None`)
- compression_level - compression level; if None, uses the algorithm's default (gzip=6, zstd=3)
- threshold - size threshold in bytes above which individual documents are compressed (default: `1048576` = 1MB)
- max_batch_size - maximum size per batch tar file in bytes (default: `1024*1024*1024` = 1GB)
- max_workers - number of threads for parallel document processing (default: `portfolio.MAX_WORKERS`)
Example

```python
# Compress all submissions using zstd compression
portfolio.compress(compression='zstd')

# Use gzip compression with custom threshold and compression level
portfolio.compress(compression='gzip', compression_level=9, threshold=500000)

# No document compression, just bundle into batch tars
portfolio.compress(compression=None)

# Use custom number of worker threads
portfolio.compress(compression='zstd', max_workers=4)
```
Storage Efficiency
This method combines document-level compression (gzip/zstd for large files) with batch tar creation following the downloader's naming convention (`batch_000_001.tar`, `batch_000_002.tar`, etc.). Original individual submission directories/tars are removed after successful compression.
decompress
Decompress all batch tar files back to individual submission directories.

```python
decompress(self, max_workers=None)
```
Parameters
- max_workers - number of threads for parallel file processing (default: `portfolio.MAX_WORKERS`)
Example

```python
# Extract all batch tar files to individual submission directories
portfolio.decompress()

# Use custom number of worker threads
portfolio.decompress(max_workers=4)
```
Complete Extraction
This method extracts all submissions from batch tar files back to individual `accession_number/` directories in the portfolio root. Compressed documents (`.gz`/`.zst`) are automatically decompressed during extraction. Batch tar files are removed after successful extraction.
process_documents
Process documents within a portfolio using threading (faster).

```python
process_documents(self, callback)
```

Example

```python
def callback_function(document):
    print(document.path)

# Process documents - note that filters are applied here
portfolio.process_documents(callback=callback_function)
```