Portfolio
The Portfolio class lets you interact with SEC Submissions. A Portfolio consists of a folder that contains subfolders named after SEC Submission accession numbers.
Attributes
- portfolio.path - path to the Portfolio's folder
set_api_key
Use this if you don't want to store the API key as an environment variable.
set_api_key(api_key)
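A rough sketch of the fallback behavior this replaces (the environment-variable name DATAMULE_API_KEY is an assumption for illustration, not confirmed by these docs):

```python
import os

def resolve_api_key(explicit_key=None):
    # prefer a key set explicitly; otherwise fall back to the
    # environment (the variable name here is illustrative)
    return explicit_key or os.environ.get('DATAMULE_API_KEY')
```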
download_submissions
There are three providers to download from:
- sec, which has a rate limit of 5 requests/second over long durations, 10/second over short durations.
- datamule-sgml, which has no rate limit; speed depends on your internet connection and CPU. Users have reported getting 1 Gbps while running in the cloud. See How to host the SEC archive for $20/month.
- datamule-tar, which is like datamule-sgml but much faster when you only need some documents within a submission. New, may have bugs. See Programmatically downloading SEC attachments in bulk.
datamule-sgml and datamule-tar also accept more args to filter on, because they query datamule's internal databases instead of the SEC.
Note: the provider formerly named datamule has been renamed datamule-sgml.
provider='sec'
download_submissions(self, cik=None, ticker=None, submission_type=None, filing_date=None, provider=None, document_type=None, keep_filtered_metadata=False, requests_per_second=5, skip_existing=True, **kwargs)
Example
Download every IBM 10-K between 2019 and 2024
portfolio = Portfolio('ibm')
portfolio.download_submissions(filing_date=('2019-01-01', '2024-01-01'), submission_type='10-K',
provider='sec')
Parameters
- cik - company CIK, e.g. 0001318605, 1318605, or ['0001318605','789019']
- ticker - e.g. 'TSLA' or ['TSLA','AAPL','MSFT']
- submission_type - the submission type, e.g. '10-K' or ['3','4','5']
- document_type - download only a specific document type within a submission, e.g. setting this to 'PROXY VOTING RECORD' for submission_type='N-PX' will download only the proxy voting record file
- filing_date - e.g. '2023-05-09', ('2023-01-01','2024-01-01'), or ['2023-01-01','2024-11-21','2025-01-23']
- provider - e.g. sec or datamule; if not set, uses defaults from config
- requests_per_second - the SEC's hard rate limit is 10/s; the soft limit is 5/s over long durations
- keep_filtered_metadata - whether metadata on documents within a submission should be kept or discarded when documents are filtered out
- skip_existing - whether to skip submissions that are already in the Portfolio
- **kwargs
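For context on the requests_per_second parameter: staying under a per-second limit client-side amounts to spacing out calls. A minimal sketch (illustrative only; datamule handles this for you when you set requests_per_second):

```python
import time

class RateLimiter:
    # sleeps just enough to keep calls at or below requests_per_second
    def __init__(self, requests_per_second=5):
        self.min_interval = 1.0 / requests_per_second
        self.last = float('-inf')

    def wait(self):
        now = time.monotonic()
        delay = self.min_interval - (now - self.last)
        if delay > 0:
            time.sleep(delay)
        self.last = time.monotonic()
```

Call limiter.wait() before each request; the first call returns immediately and subsequent calls block until the minimum interval has elapsed.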
provider='datamule-sgml' / 'datamule-tar'
download_submissions(self, cik=None, ticker=None, submission_type=None, filing_date=None, provider=None, document_type=[],
keep_filtered_metadata=False, standardize_metadata=True, skip_existing=True,
accession_numbers=None, report_date=None, detected_time=None, contains_xbrl=None, sequence=None,
quiet=False, filename=None, **kwargs)
Example
Only download the graphics from Apple's submissions.
portfolio = Portfolio('apple_graphics')
portfolio.download_submissions(ticker='AAPL',document_type="GRAPHIC",
provider='datamule-tar')
Download every IBM Form 4 since 1994.
portfolio = Portfolio('ibm')
portfolio.download_submissions(submission_type=['4','4/A'],
provider='datamule-sgml')
Additional Parameters
- report_date - Report period of the Submission.
- contains_xbrl - Whether the Submission contains XBRL.
- sequence - order of documents within a Submission.
- filename - filename of a document within the Submission.
Filtering
Filtering restricts which submissions are downloaded. Filters can be chained.
Example
portfolio.filter_text('"climate change"', filing_date=('2019-01-01', '2019-01-31'), submission_type='10-K')
portfolio.filter_text('drought', filing_date=('2019-01-01', '2019-01-31'), submission_type='10-K')
portfolio.download_submissions(filing_date=('2019-01-01', '2019-01-31'), submission_type='10-K')
filter_text
filter_text(self, text_query, cik=None, ticker=None, submission_type=None, filing_date=None, **kwargs)
Parameters
- text_query - e.g. "machine learning". For more information click here
- cik - company CIK, e.g. 0001318605, 1318605, or ['0001318605','789019']
- ticker - e.g. 'TSLA' or ['TSLA','AAPL','MSFT']
- submission_type - the submission type, e.g. '10-K' or ['3','4','5']
- filing_date - e.g. '2023-05-09', ('2023-01-01','2024-01-01'), or ['2023-01-01','2024-11-21','2025-01-23']
- **kwargs
filter_xbrl
filter_xbrl(self, taxonomy, concept, unit, period, logic, value)
For help filling out args, see this.
Parameters
- taxonomy - e.g. dei, us-gaap, etc.
- concept - e.g. AccountsPayableCurrent
- unit - e.g. USD
- period - e.g. CY2019Q4I
- logic - one of '>', '>=', '==', '!=', '<=', '<'
- value - the value to compare against
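Conceptually, the logic/value pair selects a comparison like this (a sketch of the semantics, not datamule's implementation):

```python
import operator

# map each logic string to its comparison function
OPS = {'>': operator.gt, '>=': operator.ge, '==': operator.eq,
       '!=': operator.ne, '<=': operator.le, '<': operator.lt}

def fact_matches(fact_value, logic, value):
    # True if the reported XBRL fact satisfies the comparison
    return OPS[logic](fact_value, value)
```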
monitor_submissions
Monitor the SEC for new submissions.
monitor_submissions(data_callback=None, interval_callback=None,
polling_interval=1000, quiet=True, start_date=None,
validation_interval=60000)
hits format:
[{'accession': 176693425000001, 'submission_type': 'D/A', 'ciks': ['1766934'], 'filing_date': '2025-05-28'}...]
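Note that accession comes back as an integer. SEC accession numbers are 18 digits (10-digit CIK of the submitter, 2-digit year, 6-digit sequence), so restoring the familiar dashed form is straightforward (a small helper, not part of datamule):

```python
def format_accession(accession):
    # zero-pad to 18 digits, then insert dashes: CIK-YY-SEQ
    s = f"{int(accession):018d}"
    return f"{s[:10]}-{s[10:12]}-{s[12:]}"
```

For the hit above, format_accession(176693425000001) returns '0001766934-25-000001'.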
Parameters
- data_callback - function that uses hits
- interval_callback - function that is called between polls
- quiet - whether to print output
- start_date - start date for backfill
- polling_interval - time in ms between polls of the RSS feed. If set to None, will never poll
- validation_interval - time in ms between more robust checks of what submissions have been submitted. If set to None, will never validate
rate limit sharing
Documentation for rate limit sharing (e.g. downloading each submission as it comes out) will be added later.
Example
from datamule import Portfolio
from time import time

start_time = time()

portfolio = Portfolio('test')
portfolio.monitor.set_domain_rate_limit(domain='sec.gov', rate=3)

def interval_callback():
    global start_time
    print(f"elapsed time: {time() - start_time} seconds")

def data_callback(hits):
    print(f"Number of new hits: {len(hits)}")

portfolio.monitor_submissions(validation_interval=60000, start_date='2025-04-25',
                              quiet=True, polling_interval=1000,
                              interval_callback=interval_callback, data_callback=data_callback)
Architecture
There are two ways to monitor SEC submissions in real time:
1. RSS - ~25% of submissions are missed
2. EFTS - often 30-50s slower than the RSS feed
The Monitor class is a compromise that uses both systems. I will likely do a write up on how it works later on, because both systems are annoying to work with. If you have a use-case that requires insane levels of speed, feel free to email me for advice.
stream_submissions
Get new SEC submissions by listening to datamule's websocket. Requires an API Key.
stream_submissions(data_callback=None,api_key=None,quiet=False)
hits format:
[{'accession': '95017025085535', 'submission_type': '4', 'ciks': ['109198', '1278731'], 'filing_date': '2025-06-12', 'detected_time': 1749762028168, 'source': 'rss'}...]
Parameters
- data_callback - function that uses hits
- quiet - whether to print output
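The detected_time field in a hit appears to be a Unix epoch timestamp in milliseconds; converting it to a datetime (a small helper, not part of datamule):

```python
from datetime import datetime, timezone

def detected_at(hit):
    # detected_time is epoch milliseconds; divide by 1000 for seconds
    return datetime.fromtimestamp(hit['detected_time'] / 1000, tz=timezone.utc)
```

For instance, detected_at({'detected_time': 1749762028168}) falls on 2025-06-12 UTC, matching that hit's filing_date above.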
Example
portfolio = Portfolio('websockettest')

def data_callback(hits):
    for hit in hits:
        print(hit)

portfolio.stream_submissions(data_callback=data_callback)
document_type
Iterate through documents in a portfolio based on type.
Example
for document in portfolio.document_type('10-K'):
    print(document.path)
Iterable
Iterate through submissions in a portfolio.
Example
for submission in portfolio:
    print(submission.path)
process_submissions
Process submissions within a portfolio using threading (faster).
process_submissions(self, callback)
decompress
Decompress all batch tar files back to individual submission directories.
decompress(self, max_workers=None)
Parameters
- max_workers - Number of threads for parallel file processing (default: portfolio.MAX_WORKERS)
Example
# Extract all batch tar files to individual submission directories
portfolio.decompress()
# Use custom number of worker threads
portfolio.decompress(max_workers=4)
Complete Extraction
This method extracts all submissions from batch tar files back to individual accession_number/ directories in the portfolio root. Batch tar files are removed after successful extraction.
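The extraction step can be pictured with the standard tarfile module (an illustrative sketch, not datamule's actual implementation):

```python
import tarfile
from pathlib import Path

def extract_batches(portfolio_path):
    # extract each batch tar into the portfolio root, then
    # remove the tar once extraction succeeds
    root = Path(portfolio_path)
    for tar_path in sorted(root.glob('*.tar')):
        with tarfile.open(tar_path) as tar:
            tar.extractall(root)
        tar_path.unlink()
```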
Example
def callback_function(submission):
    print(submission.metadata['cik'])

# Process submissions - note that filters are applied here
portfolio.process_submissions(callback=callback_function)
process_documents
Process documents within a portfolio using threading (faster).
process_documents(self, callback)
Example
def callback_function(document):
    print(document.path)

# Process documents - note that filters are applied here
portfolio.process_documents(callback=callback_function)
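The threaded fan-out that process_submissions and process_documents perform can be sketched with concurrent.futures (illustrative only; datamule's internals may differ):

```python
from concurrent.futures import ThreadPoolExecutor

def process_all(items, callback, max_workers=4):
    # run callback over every item using a thread pool
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # consume the iterator so exceptions in callbacks surface here
        list(pool.map(callback, items))
```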
delete
Deletes the portfolio's folder and reinitializes an empty Portfolio object.
delete()
Example
from datamule import Portfolio
port = Portfolio('deletetest')
port.delete()
port.download_submissions(ticker='MSFT',submission_type='10-K')
port.delete()
port.download_submissions(ticker='MSFT',submission_type='10-K')