Document
The Document class represents a single file in an SEC Submission.
Attributes
- document.accession - submission accession number
- document.path - document file path
- document.filing_date - submission filing date
- document.extension - document file extension, e.g. '.xml'
- document.content - document in either string or bytes format
- document.data - parsed document content (automatically parsed when first accessed)
- document.text - available for html or txt files. Returns the text without formatting such as tags (automatically parsed when first accessed)
- document.tables - parsed tables from XML documents (automatically parsed when first accessed)
Lazy Loading
The Document class uses lazy loading for both the data and tables attributes. This means:
- document.data automatically calls parse() when first accessed
- document.tables automatically calls parse_tables() when first accessed
- You don't need to manually call parse() before accessing document content
- Parsing only happens once; subsequent accesses return the cached result
Example
doc = Document(...)
# This automatically parses the document
content = doc.data
# This automatically parses tables (if XML) or returns empty list
tables = doc.tables
# No manual parsing needed
doc.visualize() # Works automatically
sections = doc.get_section("item1") # Works automatically
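The parse-once, cache-afterwards behavior described above can be sketched with a Python property. This is an illustration of the general pattern only, not the library's implementation: LazyDocument, _parse, and parse_calls are hypothetical names, and the parser is a stand-in.

```python
class LazyDocument:
    """Hypothetical stand-in showing parse-on-first-access with caching."""

    def __init__(self, content):
        self.content = content
        self._data = None      # cache for the parsed result
        self.parse_calls = 0   # counts how often parsing actually runs

    def _parse(self):
        # Stand-in parser; a real parser would build a structured dict.
        self.parse_calls += 1
        return {"length": len(self.content)}

    @property
    def data(self):
        # Parse only on first access; later accesses reuse the cache.
        if self._data is None:
            self._data = self._parse()
        return self._data


doc = LazyDocument("<html>example</html>")
first = doc.data    # triggers the stand-in parser
second = doc.data   # served from the cache
print(doc.parse_calls)
```

Both accesses return the same cached object, and the stand-in parser runs a single time.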
Methods
contains_string
contains_string(self, pattern)
Checks if the document content contains a specified pattern. Works for HTML, XML, and TXT files.
Example
pattern = r'(?i)chatgpt'
document.contains_string(pattern)
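The (?i) prefix in the pattern above is an inline flag in Python's re syntax that makes the match case-insensitive. The sketch below shows its effect on a plain string using re.search. It assumes contains_string applies the pattern as a regular expression, and the sample text is made up for illustration.

```python
import re

pattern = r'(?i)chatgpt'
sample_text = "The company began piloting ChatGPT in its support workflow."

# With (?i), the pattern matches regardless of letter case.
found = re.search(pattern, sample_text) is not None

# Without the flag, the lowercase pattern misses the mixed-case word.
found_without_flag = re.search(r'chatgpt', sample_text) is not None

print(found, found_without_flag)
```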
parse
Parses a document into dictionary form and applies a mapping dict. Currently supports html and xml files, with limited support for .pdf and some txt formats. Most formats do not have mapping dicts written yet, so their output is less standardized.
Note: You typically don't need to call parse() manually. Accessing document.data automatically triggers parsing if needed.
visualize
Opens the parsed version of a document using webbrowser. Only works for certain file extensions.
open
Opens the document using webbrowser.
get_section
Gets section by title.
get_section(title=None, title_regex=None, title_class=None, format='dict')
# returns a list of dictionaries preserving hierarchy (a list, in case multiple sections share the same title)
get_section(title='parti', format='dict')
# returns a list of flattened text versions of the dict form
get_section(title='signatures', format='text')
# returns all sections whose title matches item1.*, e.g. item1, item1a, ...; title_class restricts matches to nodes where class is 'item'
get_section(title_regex=r"item1.*", format='dict', title_class='item')
# returns all sections starting with income
get_section(title_regex=r"income.*", format='dict')
Note that get_section returns matches on either title (the original title) or standardized_title (the standardized title, e.g. "ITEM 1A. RISK FACTORS" -> 'item1a').
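To preview which titles a title_regex such as r"item1.*" would select, the pattern can be tried against candidate standardized titles with Python's re module. This sketch assumes the pattern is matched from the start of the title, which the examples suggest but the source does not confirm. The title list and the helper titles_matching are hypothetical.

```python
import re

# Invented standardized titles, for illustration only.
standardized_titles = ["parti", "item1", "item1a", "item1b", "item10", "item2", "signatures"]


def titles_matching(pattern, titles):
    # re.match anchors at the start of the string (an assumption about
    # how title_regex is applied, not confirmed library behavior).
    return [t for t in titles if re.match(pattern, t)]


broad = titles_matching(r"item1.*", standardized_titles)
narrow = titles_matching(r"item1[a-z]?$", standardized_titles)
print(broad)
print(narrow)
```

Under this assumption, r"item1.*" also picks up item10, while the tighter r"item1[a-z]?$" keeps only item1 and its lettered subsections.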
tables
document.tables
If the document has extension '.xml', automatically parses the XML into tables when first accessed. For non-XML documents, returns an empty list.
write_csv
write_csv(self, output_folder)
If the document has extension '.xml', parses the XML into tables, then writes to disk using append mode.
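Because write_csv uses append mode, writing to the same output folder twice would add rows to existing files rather than replace them. The sketch below shows that effect with Python's standard csv module. The file name, the columns, and the helper append_rows are hypothetical, and how the library handles header rows on repeated writes is not stated in the source.

```python
import csv
import os
import tempfile


def append_rows(path, fieldnames, rows):
    # Hypothetical helper: write the header only when the file is new or empty,
    # then append the rows.
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)


output_folder = tempfile.mkdtemp()
csv_path = os.path.join(output_folder, "example_table.csv")
rows = [{"name": "A", "value": "1"}]

# Two writes to the same file accumulate rows instead of overwriting.
append_rows(csv_path, ["name", "value"], rows)
append_rows(csv_path, ["name", "value"], rows)

with open(csv_path, newline="") as f:
    row_count = len(list(csv.DictReader(f)))
print(row_count)
```

If duplicate rows are a concern, pointing each run at a fresh or emptied output folder avoids the accumulation.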
write_json
write_json(self, output_filename)
Writes document.data to JSON format (automatically parses the document if not already parsed).
Deprecated Methods
parse_xbrl
Functionality moved to the Submission Class
parse_fundamentals
Functionality moved to the Submission Class