
Document

The Document class represents a single file in a SEC Submission.

Attributes

Metadata

  • document.accession - submission accession number
  • document.path - document file path
  • document.filing_date - submission filing date
  • document.extension - document file extension, e.g. '.xml'

document.content

The document's raw content, as either a string or bytes.

document.data

Available for html, text and some pdf files. Document content parsed into dictionary form.

document.data_tuples

Available for html, text and some pdf files. document.data flattened into the form (id, type, content, level).
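A rough sketch of what this flattening might look like. The `flatten` function and the sample dict below are illustrative only; datamule's actual output schema may differ.

```python
def flatten(node, level=0, out=None):
    """Recursively flatten a parsed-document dict into (id, type, content, level) tuples."""
    if out is None:
        out = []
    out.append((node.get("id"), node.get("type"), node.get("content"), level))
    for child in node.get("children", []):
        flatten(child, level + 1, out)
    return out

doc_data = {
    "id": 0, "type": "title", "content": "PART I",
    "children": [
        {"id": 1, "type": "text", "content": "Item 1. Business...", "children": []},
    ],
}
print(flatten(doc_data))
```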

document.text

Available for html, text and some pdf files. Returns the document's text.

document.markdown

Available for html, text and some pdf files. Returns the document in markdown format.

document.tables

Available for xml and html files. In the case of xml files, the document is flattened into multiple tables. In the case of html files, tables are extracted with basic cleaning using doc2dict, then the local context is included as description. Should be useful for LLMs.

tags

Tags extracted from documents, for example cusips, persons, etc. Currently only works for html and text files.

Tags are an experimental feature that adds "good enough" NLP to the SEC corpus without compromising speed or bloating the package. Tags work by combining basic pattern matching (fast and lightweight) with dictionary lookups against pre-computed NLP datasets.

It is highly recommended to use a pre-computed dataset to improve quality. Or don't, if you want to see how bad older forms of NLP can be.
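The pattern-matching-plus-dictionary approach can be sketched like this. The regex is deliberately naive (real CUSIP validation includes a check digit) and `KNOWN_CUSIPS` is a stand-in for a pre-computed dictionary; datamule's actual patterns are more involved.

```python
import re

# Naive CUSIP-shaped pattern: 9 alphanumeric characters (simplified).
CUSIP_RE = re.compile(r"\b[0-9A-Z]{8}[0-9]\b")

# A pre-computed dictionary of known-good CUSIPs filters out false positives.
KNOWN_CUSIPS = {"037833100"}  # Apple Inc., for illustration

def tag_cusips(text, dictionary=None):
    """Return (match, start, end) tuples, mirroring document.text.tags.cusips."""
    hits = [(m.group(), m.start(), m.end()) for m in CUSIP_RE.finditer(text)]
    if dictionary is not None:
        hits = [h for h in hits if h[0] in dictionary]
    return hits

sample = "The issuer's CUSIP is 037833100; 123456789 also matches the pattern."
print(tag_cusips(sample))                # both candidates
print(tag_cusips(sample, KNOWN_CUSIPS))  # dictionary-filtered
```

This is why the dictionary matters: the bare pattern happily tags any nine-character run, and the lookup is what removes the junk.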

Usage

document.text.tags.cusips # Get cusips from text in the form (match, start position, end position)
document.data.tags.cusips # Get cusips from data in the form (match, id of text or title segment, start position, end position)

Example

from datamule import Portfolio
from datamule.tags.config import set_dictionaries

set_dictionaries(['13fhr_information_table_cusips'])

portfolio = Portfolio('13fhr')
portfolio.download_submissions(submission_type=['13F-HR'],filing_date=('2008-09-01','2008-09-30'))

for sub in portfolio:
    for doc in sub:
        results = doc.text.tags.cusips
        if results is not None:
            print(results)

Supported Tags:

  • cusips
  • isins
  • figis
  • persons
  • tickers
    • nyse
    • nasdaq
    • nyse_american
    • london_stock_exchange
    • toronto_stock_exchange
    • euronext_paris
    • euronext_amsterdam
    • euronext_brussels
    • euronext_lisbon
    • euronext_milan
    • deutsche_borse_xetra
    • six_swiss_exchange
    • tokyo_stock_exchange
    • hong_kong_stock_exchange
    • shanghai_stock_exchange
    • shenzhen_stock_exchange
    • australian_securities_exchange
    • singapore_exchange
    • nse_bse
    • sao_paulo_b3
    • mexico_bmv
    • korea_exchange
    • taiwan_stock_exchange
    • johannesburg_stock_exchange
    • tel_aviv_stock_exchange
    • moscow_exchange
    • istanbul_stock_exchange
    • nasdaq_stockholm
    • oslo_bors
    • otc_markets_us
    • pink_sheets

similarity

Usage

document.text.similarity.loughran_mcdonald 
document.data.similarity.loughran_mcdonald # get similarity by section
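datamule's exact similarity metric isn't documented here. As an illustration of dictionary-based scoring against a word list like Loughran-McDonald, a simple token-overlap measure might look like the following (the word list and function are assumptions, not datamule's implementation):

```python
import re

# Tiny stand-in for a sentiment word list; the real Loughran-McDonald
# dictionary has thousands of entries per category.
NEGATIVE_WORDS = {"loss", "impairment", "litigation", "adverse"}

def dictionary_score(text, words):
    """Share of document tokens that appear in the word list."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in words for t in tokens) / len(tokens)

print(dictionary_score("Adverse litigation led to a loss.", NEGATIVE_WORDS))
```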

Dictionaries

To improve tag quality, use a dictionary. On first load, these dictionaries are downloaded into the user's home directory, e.g. on Windows: C:\Users\{username}\.datamule\dictionaries.

from datamule.tags.config import set_dictionaries
set_dictionaries(['ssa_baby_names'], overwrite=False) # set overwrite=True to re-download the latest dataset
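The storage location described above resolves the same way on every platform with pathlib; the layout below is inferred from the Windows example, not confirmed for other operating systems.

```python
from pathlib import Path

# ~/.datamule/dictionaries on any platform
# (C:\Users\{username}\.datamule\dictionaries on Windows)
dictionaries_dir = Path.home() / ".datamule" / "dictionaries"
print(dictionaries_dir)
```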

Tags

Persons

  • ssa_baby_names (Uses all baby first names since 1880 with more than 5 occurrences in a year.)
  • 8k_2024_persons (Uses a multistage spaCy + human-parser pipeline to extract names from all documents within 2024 8-K filings.)

CUSIP

  • sc13dg_cusips (Uses SC 13D/G, somewhat incomplete)
  • 13fhr_information_table_cusips (Uses the 13F-HR INFORMATION TABLE, should be better)

ISIN

  • npx_isins (Uses isins detected in N-PX filings, very incomplete)

FIGI

  • npx_figis (Uses figis detected in N-PX filings, very incomplete)

Similarity

Loughran McDonald

  • loughran_mcdonald

Methods

contains_string

contains_string(self, pattern)

Checks if the document content contains a specified pattern. Works for HTML, XML, and TXT files.

get_section

Gets section by title. Formats are: text, markdown, dict.

get_section(title=None, title_regex=None, title_class=None, format='dict')

# returns a list of dictionaries preserving hierarchy - a list, in case multiple sections share the same title
get_section(title='parti', format='dict')

# returns a list of flattened versions of the dict form
get_section(title='signatures', format='text')

# returns all sections whose title matches item1.*, e.g. item1, item1a, ...
# title_class='item' restricts matches to nodes whose class is 'item'
get_section(title_regex=r"item1.*", format='dict', title_class='item')

# returns all sections starting with income
get_section(title_regex=r"income.*", format='dict')

Note that get_section matches against either title (the original title) or standardized_title (the standardized title, e.g. "ITEM 1A. RISK FACTORS" -> 'item1a').
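The standardization step can be sketched roughly as follows. The regex below is an assumption that reproduces the example above; datamule's actual normalization rules are internal to the package.

```python
import re

def standardize_title(title):
    """'ITEM 1A. RISK FACTORS' -> 'item1a' (illustrative, not datamule's code)."""
    m = re.match(r"\s*item\s*(\d+[a-z]?)", title, flags=re.IGNORECASE)
    if m:
        return "item" + m.group(1).lower()
    # Fallback: lowercase and strip non-alphanumerics
    return re.sub(r"[^a-z0-9]", "", title.lower())

print(standardize_title("ITEM 1A. RISK FACTORS"))
print(standardize_title("SIGNATURES"))
```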

get_tables

Gets tables by exact name or by a regex over their description. For html files, the description is autogenerated from nearby context.

get_tables(description_regex=None, name=None)

Example

from datamule import Portfolio
portfolio = Portfolio('DEFM14A')
portfolio.download_submissions(cik='943324', submission_type='DEFM14A', document_type='DEFM14A', filing_date=('2011-10-25','2011-10-25'))
for doc in portfolio.document_type('DEFM14A'):
    print(str(doc.get_tables(description_regex=r'(?i)golden parachute')[0]))

visualize

Opens the parsed version of a document in a web browser. Only works for certain file extensions.

open

Opens the document using webbrowser.

Example (contains_string)

pattern = r'(?i)chatgpt'
document.contains_string(pattern)

write_csv

write_csv(self, output_folder)

If the document has extension '.xml', parses the XML into tables, then writes to disk using append mode.
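Append mode means multiple documents can write into the same per-table CSV file. The core pattern looks like this (the helper, filenames, and rows are illustrative, not datamule's code):

```python
import csv
import tempfile
from pathlib import Path

def append_rows(output_folder, table_name, header, rows):
    """Append rows to <output_folder>/<table_name>.csv, writing the header once."""
    path = Path(output_folder) / f"{table_name}.csv"
    new_file = not path.exists()
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(header)
        writer.writerows(rows)
    return path

folder = tempfile.mkdtemp()
# Two successive writes land in the same file, header written only once.
append_rows(folder, "information_table", ["cusip", "value"], [["037833100", "1000"]])
out = append_rows(folder, "information_table", ["cusip", "value"], [["594918104", "2000"]])
```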

write_json

write_json(self, output_filename)

Writes document.data to JSON format (automatically parses document if not already parsed).

Legacy Methods

parse

Parses a document into dictionary form and applies a mapping dict. Currently supports html and xml files, with limited support for pdf and some txt formats. Most file types do not have mapping dicts written yet, so their output is less standardized.

Note: You typically don't need to call parse() manually - accessing document.data will automatically trigger parsing if needed.

Architecture

Lazy Loading

The Document class uses lazy loading: attributes such as document.data, document.text, and document.tables are computed on first access and cached, rather than when the document is created.
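A minimal version of that pattern, as a sketch only (not datamule's actual implementation):

```python
class LazyDocument:
    def __init__(self, content):
        self.content = content
        self._data = None  # not parsed yet

    @property
    def data(self):
        # Parse on first access, then reuse the cached result.
        if self._data is None:
            self._data = self._parse()
        return self._data

    def _parse(self):
        print("parsing...")  # runs only once
        return {"type": "document", "content": self.content}

doc = LazyDocument("<html>...</html>")
doc.data  # triggers parsing
doc.data  # cached; no second parse
```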