Document

The Document class represents a single file in an SEC submission.

Attributes

Metadata

document.accession - submission accession number
document.path - document file path
document.filing_date - submission filing date
document.extension - document file extension, e.g. '.xml'

`document.content`

The raw document content, as either a string or bytes.

`document.data`

Available for html, text and some pdf files. Document content parsed into dictionary form.

`document.data_tuples`

Available for html, text and some pdf files. document.data flattened into tuples of the form (id, type, content, level, class).
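To illustrate what flattening into (id, type, content, level, class) tuples can look like, here is a minimal sketch. The nested layout (the 'title', 'text', 'class' and 'contents' keys) and the flatten helper are assumptions made for illustration, not the parser's actual schema or implementation.

```python
def flatten(node, level=0):
    """Yield (id, type, content, level, class) tuples from a nested dict.

    Assumed (hypothetical) layout: each key is a node id and each value is
    a dict that may hold 'title', 'text', 'class' and nested 'contents'.
    """
    for node_id, body in node.items():
        node_class = body.get("class")
        if "title" in body:
            yield (node_id, "title", body["title"], level, node_class)
        if "text" in body:
            yield (node_id, "text", body["text"], level, node_class)
        if "contents" in body:
            # Children sit one level deeper than their parent
            yield from flatten(body["contents"], level + 1)


# Stand-in for parsed data; real parser output may be shaped differently
data = {
    "1": {
        "title": "ITEM 1A. RISK FACTORS",
        "class": "item",
        "contents": {
            "2": {"text": "Our business is subject to risks."},
        },
    }
}

data_tuples = list(flatten(data))
for row in data_tuples:
    print(row)
```

The flat form is convenient for loading into a table or dataframe, since each row carries its own nesting level.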

`document.text`

Available for html, text and some pdf files. Returns the document's text.

`document.markdown`

Available for html, text and some pdf files. Returns the document in markdown format.

`document.tables`

Available for xml and html files. For xml files, the document is flattened into multiple tables. For html files, tables are extracted with basic cleaning using doc2dict, and the local context around each table is included as its description, which should be useful for LLMs.

`tags`

Tags extracted from documents, for example cusips and persons. Currently only works for html or text files.

Tags are an experimental feature that adds "good enough" NLP to the SEC corpus without compromising speed or bloating the package. They combine basic pattern matching (fast and lightweight) with dictionary lookup against precomputed NLP datasets.

Using a precomputed dataset is highly recommended to improve quality. Or don't, if you want to see how bad older forms of NLP can be.
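The combination of pattern matching and dictionary lookup can be sketched roughly as follows. This is not the package's implementation: the regex, the find_cusips helper and the sample lookup set are assumptions chosen only to show why a dictionary improves quality.

```python
import re

# Rough CUSIP shape: 9 uppercase alphanumeric characters.
# No check-digit validation is done here, so false positives are expected.
CUSIP_PATTERN = re.compile(r"\b[0-9A-Z]{9}\b")


def find_cusips(text, known_cusips=None):
    """Return (match, start, end) tuples, filtered by a lookup set if given."""
    results = []
    for m in CUSIP_PATTERN.finditer(text):
        if known_cusips is None or m.group() in known_cusips:
            results.append((m.group(), m.start(), m.end()))
    return results


text = "Holding 037833100 and ABCDEFGHI"
known = {"037833100"}  # stand-in for a precomputed dictionary

pattern_only = find_cusips(text)            # also picks up 'ABCDEFGHI'
with_dictionary = find_cusips(text, known)  # keeps only known cusips
print(pattern_only)
print(with_dictionary)
```

Pattern matching alone accepts anything with the right shape; the dictionary lookup is what discards the false positive.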

Usage

document.text.tags.cusips # Get cusips from text in the form (match, start position, end position)
document.data.tags.cusips # Get cusips from data in the form (match, id of text or title segment, start position, end position)

Example

from datamule import Portfolio
from datamule.tags.config import set_dictionaries

# Load the precomputed CUSIP dictionary to improve tag quality
set_dictionaries(['13fhr_information_table_cusips'])

portfolio = Portfolio('13fhr')
portfolio.download_submissions(submission_type=['13F-HR'], filing_date=('2008-09-01', '2008-09-30'))

# Print the cusips found in each document's text
for sub in portfolio:
    for doc in sub:
        results = doc.text.tags.cusips
        if results is not None:
            print(results)

Supported Tags:

cusips
isins
figis
persons
tickers
- nyse
- nasdaq
- nyse_american
- london_stock_exchange
- toronto_stock_exchange
- euronext_paris
- euronext_amsterdam
- euronext_brussels
- euronext_lisbon
- euronext_milan
- deutsche_borse_xetra
- six_swiss_exchange
- tokyo_stock_exchange
- hong_kong_stock_exchange
- shanghai_stock_exchange
- shenzhen_stock_exchange
- australian_securities_exchange
- singapore_exchange
- nse_bse
- sao_paulo_b3
- mexico_bmv
- korea_exchange
- taiwan_stock_exchange
- johannesburg_stock_exchange
- tel_aviv_stock_exchange
- moscow_exchange
- istanbul_stock_exchange
- nasdaq_stockholm
- oslo_bors
- otc_markets_us
- pink_sheets

`similarity`

Usage

document.text.similarity.loughran_mcdonald 
document.data.similarity.loughran_mcdonald # get similarity by section

Dictionaries

To improve tag quality, use a dictionary. On first load, dictionaries are downloaded into the user's home directory, e.g. on Windows: C:\Users\{username}\.datamule\dictionaries.

from datamule.tags.config import set_dictionaries
set_dictionaries(['ssa_baby_names'], overwrite=False) # set overwrite=True to download the latest dataset
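The download location and the effect of overwrite can be sketched as below. The dictionaries_dir and needs_download helpers are hypothetical names used for illustration; only the .datamule/dictionaries location and the meaning of overwrite come from the description above.

```python
from pathlib import Path


def dictionaries_dir(home=None):
    """Folder where dictionaries are stored, under the user's home directory."""
    base = Path(home) if home is not None else Path.home()
    return base / ".datamule" / "dictionaries"


def needs_download(path, overwrite=False):
    """Download when nothing is cached yet, or when overwrite is requested."""
    return overwrite or not Path(path).exists()


print(dictionaries_dir())
```

With overwrite=False an existing local copy is reused; overwrite=True forces a fresh download.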

Similarity

Loughran McDonald

loughran_mcdonald (Link)

Methods

`contains_string`

contains_string(self, pattern)

Checks if the document content contains a specified pattern. Works for HTML, XML, and TXT files.

`get_section`

Gets sections by title. Supported formats: text, markdown, dict.

get_section(title=None, title_regex=None, title_class=None, format='dict')

# returns a list of dictionaries preserving hierarchy (a list, in case multiple sections share the same title)
get_section(title='parti', format='dict')

# returns a list of sections as text, i.e. the dict form flattened
get_section(title='signatures', format='text')

# returns all sections whose title matches item1, e.g. item1, item1a, ...; title_class restricts matches to nodes whose class is 'item'
get_section(title_regex=r"item1.*", format='dict', title_class='item')

# returns all sections whose title starts with income
get_section(title_regex=r"income.*", format='dict')

Note that get_section will return matches for title (original title) or standardized_title (standardized title - e.g. "ITEM 1A. RISK FACTORS" -> 'item1a').
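A rough sketch of matching against both the original and the standardized title is shown below. The standardize_title and matches helpers are hypothetical and written only to illustrate the idea; the library's actual standardization rules may differ.

```python
import re


def standardize_title(title):
    """Hypothetical normalizer, e.g. 'ITEM 1A. RISK FACTORS' -> 'item1a'."""
    m = re.match(r"\s*(item|part)\s+([0-9]+[a-z]?|[ivx]+)\b", title.lower())
    if m:
        return m.group(1) + m.group(2)
    return title.lower().strip()


def matches(section_title, title=None, title_regex=None):
    """Check the original title and its standardized form."""
    candidates = (section_title, standardize_title(section_title))
    if title is not None:
        return title in candidates
    if title_regex is not None:
        return any(re.match(title_regex, c) for c in candidates)
    return False


print(standardize_title("ITEM 1A. RISK FACTORS"))
print(matches("ITEM 1A. RISK FACTORS", title_regex=r"item1.*"))
```

Under these assumptions a pattern such as r"item1.*" would also match item10 through item19, so a tighter pattern such as r"item1[a-z]?$" may be preferable when only item1 and its lettered subsections are wanted.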

`get_tables`

Gets tables by exact name, by a regex on the description, or by regex patterns that must all appear in the table data. For html files, the description is autogenerated from nearby context.

Parameters:

description_regex: Regex pattern to match against table descriptions
description_fields: Which description fields to search; default is ['preamble', 'postamble', 'footnotes']
name: Exact table name to match (works for xml derived tables)
contains_regex: List of regex patterns that must ALL match somewhere in the table's data
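The "must ALL match" behavior of contains_regex can be illustrated with a small stand-alone sketch. The table_matches helper and the list-of-rows table layout are assumptions for illustration, not the library's internals.

```python
import re


def table_matches(rows, contains_regex):
    """True only if every pattern matches at least one cell in the table."""
    cells = [str(cell) for row in rows for cell in row]
    return all(
        any(re.search(pattern, cell) for cell in cells)
        for pattern in contains_regex
    )


# Stand-in table data: a header row followed by one data row
table = [["Name", "Role", "Year"], ["Jane Doe", "Director", "2011"]]

print(table_matches(table, [r"Director", r"20\d{2}"]))
print(table_matches(table, [r"Director", r"19\d{2}"]))
```

Each pattern may match a different cell; a table is rejected as soon as one pattern matches nowhere.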

Examples:

Get tables by description:

from datamule import Portfolio

portfolio = Portfolio('DEFM14A')
portfolio.download_submissions(cik='943324', submission_type='DEFM14A', document_type='DEFM14A', filing_date=('2011-10-25','2011-10-25'))
for doc in portfolio.document_type('DEFM14A'):
    print(str(doc.get_tables(description_regex=r'(?i)golden parachute')[0]))

Get tables by content:

doc.get_tables(contains_regex=[r'Director', r'20\d{2}'])

`visualize`

Opens the parsed version of a document using webbrowser. Only works for certain file extensions.

`open`

Opens the document using webbrowser.

Example (contains_string)

pattern = r'(?i)chatgpt'
document.contains_string(pattern)

`write`

Writes the document to disk.

write(self, file)

`write_csv`

write_csv(self, output_folder)

If the document has extension '.xml', parses the XML into tables, then writes to disk using append mode.
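Append mode means repeated calls add rows to an existing file instead of replacing it. The sketch below shows one way to do that with the standard csv module; the append_rows helper, the file name and the write-header-once rule are assumptions, not the library's exact behavior.

```python
import csv
import os
import tempfile


def append_rows(path, fieldnames, rows):
    """Append rows to a CSV file, writing the header only for a new file."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)


out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, "information_table.csv")  # hypothetical table name

# Two separate calls, e.g. one per document, end up in the same file
append_rows(path, ["cusip", "value"], [{"cusip": "037833100", "value": "100"}])
append_rows(path, ["cusip", "value"], [{"cusip": "594918104", "value": "200"}])

with open(path, newline="") as f:
    lines = f.read().splitlines()
print(lines)
```

Because output accumulates across calls, writing into a folder that already holds results from an earlier run will mix old and new rows.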

`write_json`

write_json(self, output_filename)

Writes document.data to JSON format (automatically parses document if not already parsed).

Legacy Methods

`parse`

Parses a document into dictionary form and applies a mapping dict. Currently supports html and xml files, with limited support for .pdf and some txt formats. Most formats do not have mapping dicts written yet, so their output is less standardized.

Note: You typically don't need to call parse() manually - accessing document.data will automatically trigger parsing if needed.

Architecture

Lazy Loading

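The Document class uses lazy loading: parsed attributes such as document.data are computed on first access rather than when the Document is created.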
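The lazy-loading pattern can be sketched with a property that parses on first access and caches the result. LazyDocument is a hypothetical stand-in, not the real Document class, and its parse step is a placeholder.

```python
class LazyDocument:
    """Minimal illustration of parse-on-first-access."""

    def __init__(self, content):
        self.content = content
        self._data = None      # nothing is parsed at construction time
        self.parse_calls = 0   # counter kept only to make the behavior visible

    def parse(self):
        # Placeholder for real parsing work
        self.parse_calls += 1
        self._data = {"text": self.content.strip()}

    @property
    def data(self):
        # Parse only if no cached result exists yet
        if self._data is None:
            self.parse()
        return self._data


doc = LazyDocument("  Annual report  ")
print(doc.parse_calls)  # no parsing has happened yet
print(doc.data)         # first access triggers parse()
print(doc.data)         # second access reuses the cached result
print(doc.parse_calls)
```

Documents that are only filtered or written to disk never pay the parsing cost.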