{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Translating [Markdown] to [Python]\n",
"\n",
"A primary translation is literate programming is the tangle step that converts the literate program into \n",
"the programming language. The 1979 implementation converts `\".WEB\"` files to valid pascal - `\".PAS\"` - files.\n",
"The `pidgy` approach begins with [Markdown] files and proper [Python] files as the outcome. The rest of this \n",
"document configures how [IPython] acknowledges the transformation and the heuristics the translate [Markdown] to [Python].\n",
"\n",
"[Markdown]: #\n",
"[Python]: #"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
" import typing, mistune, IPython, pidgy.util\n",
" __all__ = 'tangle', 'Tangle'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `pidgy` tangle workflow has three steps:\n",
"\n",
"1. Block-level lexical analysis to tokenize [Markdown].\n",
"2. Normalize the tokens to compacted `\"code\" and not \"code\"` tokens.\n",
"3. Translate the normalized tokens to a string of valid [Python] code.\n",
"\n",
"[Markdown]: #\n",
"[Python]: #"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
" @pidgy.implementation\n",
" def tangle(str:str)->str:\n",
" translate = Tangle()\n",
" return translate.stringify(translate.parse(''.join(str)))"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
" class pidgyManager(IPython.core.inputtransformer2.TransformerManager):\n",
" def transform_cell(self, cell): return super(type(self), self).transform_cell(tangle(str=cell))"
]
},
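{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of wiring the manager into a live shell. It assumes an active IPython session and that replacing the shell's `input_transformer_manager` attribute is the supported hook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"    shell = IPython.get_ipython()\n",
"    # Hypothetical wiring: swap in the pidgy manager so every cell is\n",
"    # tangled from Markdown to Python before execution.\n",
"    if shell is not None:\n",
"        shell.input_transformer_manager = pidgyManager()"
]
},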
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Block level lexical analysis.\n",
"\n",
"`pidgy` uses a modified `mistune.BlockLexer` to create block level tokens\n",
"for a [Markdown] source. A specific `pidgy` addition is the addition off \n",
"a `doctest` block object, `doctest` are testable strings that are ignored by the tangle\n",
"step. The tokens are to be normalized and translated to [Python] strings.\n",
"\n",
"BlockLexer
"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
" class BlockLexer(mistune.BlockLexer, pidgy.util.ContextDepth):\n",
" class grammar_class(mistune.BlockGrammar):\n",
" doctest = __import__('doctest').DocTestParser._EXAMPLE_RE\n",
" block_code = __import__('re').compile(r'^((?!\\s+>>>\\s) {4}[^\\n]+\\n*)+')\n",
" default_rules = \"newline hrule block_code fences heading nptable lheading block_quote list_block def_links def_footnotes table paragraph text\".split()\n",
"\n",
" def parse_doctest(self, m): self.tokens.append({'type': 'paragraph', 'text': m.group(0)})\n",
"\n",
" def parse_fences(self, m):\n",
" if m.group(2): self.tokens.append({'type': 'paragraph', 'text': m.group(0)})\n",
" else: super().parse_fences(m)\n",
"\n",
" def parse_hrule(self, m): self.tokens.append(dict(type='hrule', text=m.group(0)))\n",
" \n",
" def parse_def_links(self, m):\n",
" super().parse_def_links(m)\n",
" self.tokens.append(dict(type='def_link', text=m.group(0)))\n",
" \n",
" def parse_front_matter(self): ...\n",
" def parse(self, text: str, default_rules=None, normalize=True) -> typing.List[dict]:\n",
" front_matter = None\n",
" if not self.depth: \n",
" self.tokens = []\n",
" if text.strip() and text.startswith('---\\n') and '\\n---\\n' in text[4:]:\n",
" front_matter, sep, text = text[4:].partition('---\\n')\n",
" front_matter = {'type': 'front_matter', 'text': F\"\\n{front_matter}\"}\n",
"\n",
" with self: tokens = super().parse(pidgy.util.whiten(text), default_rules)\n",
" if normalize and not self.depth: tokens = normalizer(text, tokens)\n",
" if front_matter: tokens.insert(0, front_matter)\n",
" return tokens"
]
},
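{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sketch of the lexer on an assumed two-block document; `normalize=False` exposes the raw `mistune` tokens before the `normalizer` defined below compacts them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"    # One paragraph token and one indented code token are expected.\n",
"    BlockLexer().parse(\"A sentence.\\n\\n    code = True\\n\", normalize=False)"
]
},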
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Normalizing the tokens\n",
"\n",
"Tokenizing [Markdown] typically extracts conventions at both the block and inline level.\n",
"Fortunately, `pidgy`'s translation is restricted to block level [Markdown] tokens, and mitigating some potential complexities from having opinions about inline code while tangling.\n",
"\n",
"normalizer
"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
" def normalizer(text, tokens):\n",
" compacted = []\n",
" while tokens:\n",
" token = tokens.pop(0)\n",
" if 'text' not in token: continue\n",
" if not token['text'].strip(): continue\n",
" block, body = token['text'].splitlines(), \"\"\n",
" while block:\n",
" line = block.pop(0)\n",
" if line:\n",
" before, line, text = text.partition(line)\n",
" body += before + line\n",
" if token['type']=='code':\n",
" compacted.append({'type': 'code', 'lang': None, 'text': body})\n",
" elif compacted and compacted[-1]['type'] == 'paragraph':\n",
" compacted[-1]['text'] += body\n",
" else: compacted.append({'type': 'paragraph', 'text': body})\n",
" \n",
" if compacted and compacted[-1]['type'] == 'paragraph':\n",
" compacted[-1]['text'] += text\n",
" elif text.strip():\n",
" compacted.append({'type': 'paragraph', 'text': text})\n",
" # Deal with front matter\n",
" if compacted and compacted[0]['text'].startswith('---\\n') and '\\n---' in compacted[0]['text'][4:]:\n",
" token = compacted.pop(0)\n",
" front_matter, sep, paragraph = token['text'][4:].partition('---')\n",
" compacted = [{'type': 'front_matter', 'text': F\"\\n{front_matter}\"},\n",
" {'type': 'paragraph', 'text': paragraph}] + compacted\n",
" return compacted"
]
},
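{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of the normalizer compacting the raw tokens for the same assumed document."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"    text = \"A sentence.\\n\\n    code = True\\n\"\n",
"    # The raw tokens compact to one \"paragraph\" and one \"code\" token\n",
"    # that together reproduce the original source.\n",
"    normalizer(text, BlockLexer().parse(text, normalize=False))"
]
},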
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Flattening the tokens to a [Python] string.\n",
"\n",
"The tokenizer controls the translation of markdown strings to python strings. Our major constraint is that the Markdown input should retain line numbers.\n",
"\n",
"Flatten
"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
" class Tangle(BlockLexer):\n",
" def stringify(self, tokens: typing.List[dict], source: str = \"\"\"\"\"\", last: int =0) -> str:\n",
" import textwrap\n",
" INDENT = indent = pidgy.util.base_indent(tokens) or 4\n",
" for i, token in enumerate(tokens):\n",
" object = token['text']\n",
" if token and token['type'] == 'code':\n",
" if object.lstrip().startswith(pidgy.util.FENCE):\n",
"\n",
" object = ''.join(''.join(object.partition(pidgy.util.FENCE)[::2]).rpartition(pidgy.util.FENCE)[::2])\n",
" indent = INDENT + pidgy.util.num_first_indent(object)\n",
" object = textwrap.indent(object, INDENT*pidgy.util.SPACE)\n",
"\n",
" if object.lstrip().startswith(pidgy.util.MAGIC): ...\n",
" else: indent = pidgy.util.num_last_indent(object)\n",
" elif token and token['type'] == 'front_matter': \n",
" object = textwrap.indent(\n",
" F\"locals().update(__import__('ruamel.yaml').yaml.safe_load({pidgy.util.quote(object)}))\\n\", indent*pidgy.util.SPACE)\n",
"\n",
" elif not object: ...\n",
" else:\n",
" object = textwrap.indent(object, pidgy.util.SPACE*max(indent-pidgy.util.num_first_indent(object), 0))\n",
" for next in tokens[i+1:]:\n",
" if next['type'] == 'code':\n",
" next = pidgy.util.num_first_indent(next['text'])\n",
" break\n",
" else: next = indent \n",
" Δ = max(next-indent, 0)\n",
"\n",
" if not Δ and source.rstrip().rstrip(pidgy.util.CONTINUATION).endswith(pidgy.util.COLON): \n",
" Δ += 4\n",
"\n",
" spaces = pidgy.util.indents(object)\n",
" \"what if the spaces are ling enough\"\n",
" object = object[:spaces] + Δ*pidgy.util.SPACE+ object[spaces:]\n",
" if not source.rstrip().rstrip(pidgy.util.CONTINUATION).endswith(pidgy.util.QUOTES): \n",
" object = pidgy.util.quote(object)\n",
" source += object\n",
"\n",
" # add a semicolon to the source if the last block is code.\n",
" for token in reversed(tokens):\n",
" if token['text'].strip():\n",
" if token['type'] != 'code': \n",
" source = source.rstrip() + pidgy.util.SEMI\n",
" break\n",
"\n",
" return source"
]
},
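{
"cell_type": "markdown",
"metadata": {},
"source": [
"A round trip over an assumed document: the narrative is quoted into a [Python] string, the indented block stays executable, and the line numbers are preserved.\n",
"\n",
"[Python]: #"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"    source = \"A sentence.\\n\\n    code = True\\n\"\n",
"    tangled = Tangle().stringify(Tangle().parse(source))\n",
"    # The tangled Python keeps the Markdown line count intact.\n",
"    print(tangled)"
]
},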
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Append the lexer for nested rules."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
" for x in \"default_rules footnote_rules list_rules\".split():\n",
" setattr(BlockLexer, x, list(getattr(BlockLexer, x)))\n",
" getattr(BlockLexer, x).insert(getattr(BlockLexer, x).index('block_code'), 'doctest')\n",
" if 'block_html' in getattr(BlockLexer, x):\n",
" getattr(BlockLexer, x).pop(getattr(BlockLexer, x).index('block_html'))\n",
" del x"
]
},
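{
"cell_type": "markdown",
"metadata": {},
"source": [
"With the `doctest` rule registered, an interactive example tokenizes as prose rather than code; a sketch with an assumed input:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"    # The doctest is quoted rather than executed by the tangle step.\n",
"    print(Tangle().stringify(Tangle().parse(\">>> 1 + 1\\n2\\n\")))"
]
},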
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## More `pidgy` langauge features\n",
"\n",
"`pidgy` experiments extra language features for python, using the same system\n",
"that IPython uses to add features like line and cell magics."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
" import ast, pidgy, IPython"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Recently, IPython introduced a convention that allows top level await statements outside of functions. Building of this convenience, `pidgy` allows for top-level __return__ and __yield__ statements. These statements are replaced with the an IPython display statement."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
" class ExtraSyntax(ast.NodeTransformer):\n",
" def visit_FunctionDef(self, node): return node\n",
" visit_AsyncFunctionDef = visit_FunctionDef \n",
"\n",
" def visit_Return(self, node):\n",
" replace = ast.parse('''__import__('IPython').display.display()''').body[0]\n",
" replace.value.args = node.value.elts if isinstance(node.value, ast.Tuple) else [node.value]\n",
" return ast.copy_location(replace, node)\n",
"\n",
" def visit_Expr(self, node):\n",
" if isinstance(node.value, (ast.Yield, ast.YieldFrom)): return ast.copy_location(self.visit_Return(node.value), node)\n",
" return node\n",
"\n",
" visit_Expression = visit_Expr"
]
},
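{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small sketch of the transformer in isolation. `ast.parse` accepts a top-level `return` (the \"outside function\" error is only raised when compiling to bytecode), so the node can be rewritten before compilation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"    # The top-level return becomes an expression calling IPython display.\n",
"    tree = ExtraSyntax().visit(ast.parse(\"return 1\"))\n",
"    print(ast.dump(tree.body[0]))"
]
},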
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We know naming is hard, there is no point focusing on it. `pidgy` allows authors\n",
"to use emojis as variables in python. They add extra color and expression to the narrative."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
" def demojize(lines, delimiters=('_', '_')):\n",
" str = ''.join(lines)\n",
" import tokenize, emoji, stringcase; tokens = []\n",
" try:\n",
" for token in list(tokenize.tokenize(\n",
" __import__('io').BytesIO(str.encode()).readline)):\n",
" if token.type == tokenize.ERRORTOKEN:\n",
" string = emoji.demojize(token.string, delimiters=delimiters\n",
" ).replace('-', '_').replace(\"’\", \"_\")\n",
" if tokens and tokens[-1].type == tokenize.NAME: tokens[-1] = tokenize.TokenInfo(tokens[-1].type, tokens[-1].string + string, tokens[-1].start, tokens[-1].end, tokens[-1].line)\n",
" else: tokens.append(\n",
" tokenize.TokenInfo(\n",
" tokenize.NAME, string, token.start, token.end, token.line))\n",
" else: tokens.append(token)\n",
" return tokenize.untokenize(tokens).decode()\n",
" except BaseException: raise SyntaxError(str)"
]
},
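{
"cell_type": "markdown",
"metadata": {},
"source": [
"A hypothetical example, assuming the `emoji` package is installed: the panda emoji becomes a legal, readable name."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"    # 🐼 is an error token to the tokenizer; demojize renames it to\n",
"    # something like `_panda_face_` before the cell is compiled.\n",
"    print(demojize([\"🐼 = 10\\n\", \"print(🐼)\\n\"]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`init_json` aliases JSON- and YAML-style literals, like `true` and `null`, as [Python] builtins.\n",
"\n",
"[Python]: #"
]
},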
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
" def init_json():\n",
" import builtins\n",
" builtins.yes = builtins.true = True\n",
" builtins.no = builtins.false = False\n",
" builtins.null = None"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}