HTML¶
Quickstart¶
# Load your html file
with open('apple_10k_2024.htm','r') as f:
content = f.read()
# Convert to dictionary
dct = html2dict(content,mapping_dict=None)
Example¶
...
"37": {
"title": "PART I",
"standardized_title": "parti",
"class": "part",
"contents": {
"38": {
"title": "ITEM 1. BUSINESS",
"standardized_title": "item1",
"class": "item",
"contents": {
"39": {
"title": "GENERAL",
"standardized_title": "",
"class": "predicted header",
"contents": {
"40": {
"title": "Embracing Our Future",
...
"292": {
"table": [
[
"Name",
"Age",
"Position with the Company"
],
[
"Satya Nadella",
"56",
"Chairman and Chief Executive Officer"
],
...
Tweaking the engine for your use case¶
I will make this section better soon
I just want to get the basic docs out!
Debugging¶
from doc2dict import convert_html_to_instructions, convert_instructions_to_dict, visualize_instructions, visualize_dict
# load your html file
with open('tesla10k.htm','r') as f:
content = f.read()
# convert html to a series of instructions
instructions = convert_html_to_instructions(content)
# visualize the conversion
visualize_instructions(instructions)
# convert instructions to dictionary
dct = html2dict(content,mapping_dict=None)
# visualize dictionary
visualize_dict(dct)
Writing your own mapping dictionaries¶
Experimental
If you write a mapping dict, and I change something so it stops working - please email me.
Mapping dicts currently work by specifying the class of the section header: part
, regex for section header r'^part\s*([ivx]+)$'
where the capture group ([ivx]+)
and class part
determine the standardized_title
, and the level, where 0
is the root.
In this example, items
will always be nested under parts
.
dict_10k_html = {
('part',r'^part\s*([ivx]+)$') : 0,
('signatures',r'^signatures?\.*$') : 0,
('item',r'^item\s*(\d+)\.?([a-z])?') : 1,
}