Compiler From Scratch: Phase 1 - Tokenizer Generator 022: Resolving DFA state ambiguity
Streamed on 2024-12-13 (https://www.twitch.tv/thediscouragerofhesitancy)
Zero Dependencies Programming!
Last week we got stuck on the fact that DFAs can have overlapping transitions. In these cases the first test always wins, and once the DFA starts down a track there is no way to jump off an invalid track onto another valid one. So we fixed that today.
The trick is to think of each transition as a set of characters; a set in the mathematical sense. When processing the transitions we look for overlapping sets. If they overlap we compute:
1) Which characters are only in the first set
2) Which characters are only in the second set
3) Which characters are in both sets, and treat that as a new transition that follows the lowest rule number (rule order precedence).
Then add each of these three onto the unprocessed list for further checking. Once a transition makes it all the way through the unprocessed list without overlapping any other set, that transition is put on the processed list. This process also has the added benefit of making the order we check the transitions in not matter at all. We can shuffle the transitions into any order, and since they don't overlap any more, they will all be checked eventually and not get cut off by an overlapping rule. Now the DFA looks much messier, but it is finally correct.
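The splitting process described above can be sketched roughly as follows. This is only an illustration in Python, not the code written on stream: the `Transition` tuple, the `split_overlaps` and `char_map` names, and the use of a string for the target state are all assumptions made for the example.

```python
from typing import NamedTuple


class Transition(NamedTuple):
    # Hypothetical representation: a character set, the rule number it
    # came from, and the name of the state it leads to.
    chars: frozenset
    rule: int
    target: str


def split_overlaps(transitions):
    """Split overlapping character sets until every set is disjoint."""
    unprocessed = list(transitions)
    processed = []
    while unprocessed:
        current = unprocessed.pop()
        for i, other in enumerate(unprocessed):
            both = current.chars & other.chars
            if not both:
                continue
            # Overlap found: remove the other transition and replace the
            # pair with up to three pieces.
            del unprocessed[i]
            # The shared characters follow the lowest rule number.
            winner = current if current.rule <= other.rule else other
            pieces = [
                Transition(current.chars - other.chars, current.rule, current.target),
                Transition(other.chars - current.chars, other.rule, other.target),
                Transition(both, winner.rule, winner.target),
            ]
            # Empty pieces are dropped; the rest go back for more checking.
            unprocessed.extend(p for p in pieces if p.chars)
            break
        else:
            # No overlap with anything still unprocessed.
            processed.append(current)
    return processed


def char_map(transitions):
    """Map each character to the (rule, target) of its transition."""
    return {ch: (t.rule, t.target) for t in transitions for ch in t.chars}
```

For example, splitting a rule 1 transition on the characters `abc` against a rule 0 transition on `bcd` leaves `a` with rule 1 and gives `b`, `c`, and `d` to rule 0, whichever order the two transitions are supplied in. Each split strictly reduces the total number of characters held across all sets, so the loop always terminates.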
With that change done, testing of the VVProject tokenizer proceeded. It didn't take much fiddling to get it the way we want. We had already done most of the plumbing into VVProject last week, and with a bit of tweaking it was working just fine there as well.
I started down the road of making a tokenizer for the VVTokenizerDefinition, but got sidetracked thinking about multiple-encodings support in that tokenizer. I started down a dark road trying to make that work, but where VVProject calls VVTokenizerDefinition was where I found the problem: for the "Multi" encoding to work it would have to know the encoding when we generate the tokenizer itself. We can't switch that encoding behavior at tokenizer runtime with this system, only at tokenizer generation time. And the biggest problem is the REGEX tokenizing rule. That will have to support anything at tokenization time. I have a plan, but ran out of time by the time I had thought it through. We'll have to remove some of the work we did today, but that will have to wait for next week.