Parsing

Currently parses documents with:

  • .xml extension

  • .txt extension if 10-K, 10-Q, 8-K, SC 13D, SC 13G

  • .htm/.html extension if 10-K, 10-Q, 8-K, SC 13D, SC 13G
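
The current coverage rules can be sketched as a small check. This is only an illustration of the rules stated above, not the library's implementation: the function name `is_currently_parseable`, its parameters, and the `TEXT_AND_HTML_TYPES` constant are hypothetical and are not part of the package's API. The sketch assumes the document type is passed exactly as written above, so variants such as amendments are not handled.

```python
from pathlib import Path

# Document types listed above for .txt and .htm/.html parsing.
TEXT_AND_HTML_TYPES = {"10-K", "10-Q", "8-K", "SC 13D", "SC 13G"}


def is_currently_parseable(filename: str, document_type: str) -> bool:
    """Hypothetical helper: mirrors the documented rules, nothing more."""
    extension = Path(filename).suffix.lower()

    # .xml documents are parsed regardless of document type.
    if extension == ".xml":
        return True

    # .txt and .htm/.html documents are parsed only for the listed types.
    if extension in {".txt", ".htm", ".html"}:
        return document_type.strip().upper() in TEXT_AND_HTML_TYPES

    # Everything else (for example .pdf) is not parsed yet.
    return False
```

For example, `is_currently_parseable("filing.htm", "10-K")` is true under these rules, while the same file name with a document type outside the list is not.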

Note: The parser will soon be updated to parse almost every document type.

Future

  • parses all .htm/.html files

  • parses most .pdf files (some are image-based and cannot be parsed)

Standardization

Parsing uses doc2dict to convert documents to a dictionary format. Documents can be further standardized using the mapping_dicts. Contributions to the mapping_dicts are highly appreciated!