{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Translating [Markdown] to [Python]\n", "\n", "A primary translation in literate programming is the tangle step that converts the literate program into \n", "the programming language. The 1979 implementation converts `\".WEB\"` files to valid Pascal (`\".PAS\"`) files.\n", "The `pidgy` approach begins with [Markdown] files and produces valid [Python] files as the outcome. The rest of this \n", "document configures how [IPython] acknowledges the transformation and the heuristics that translate [Markdown] to [Python].\n", "\n", "[Markdown]: #\n", "[Python]: #\n", "[IPython]: #" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "    import typing, mistune, IPython, pidgy.util\n", "    __all__ = 'tangle', 'Tangle'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `pidgy` tangle workflow has three steps:\n", "\n", "1. Block-level lexical analysis to tokenize [Markdown].\n", "2. Normalize the tokens into compacted \"code\" and \"not code\" tokens.\n", "3. Translate the normalized tokens to a string of valid [Python] code.\n", "\n", "[Markdown]: #\n", "[Python]: #" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "    @pidgy.implementation\n", "    def tangle(str: str) -> str:\n", "        translate = Tangle()\n", "        return translate.stringify(translate.parse(''.join(str)))" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "    class pidgyManager(IPython.core.inputtransformer2.TransformerManager):\n", "        def transform_cell(self, cell): return super().transform_cell(tangle(str=cell))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Block level lexical analysis.\n", "\n", "`pidgy` uses a modified `mistune.BlockLexer` to create block level tokens\n", "for a [Markdown] source. 
A specific `pidgy` addition is a `doctest` block object; `doctest`s are testable strings\n", "that are ignored by the tangle step. The tokens are then normalized and translated to\n", "[Python] strings.\n", "\n", "
BlockLexer" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ " class BlockLexer(mistune.BlockLexer, pidgy.util.ContextDepth):\n", " class grammar_class(mistune.BlockGrammar):\n", " doctest = __import__('doctest').DocTestParser._EXAMPLE_RE\n", " block_code = __import__('re').compile(r'^((?!\\s+>>>\\s) {4}[^\\n]+\\n*)+')\n", " default_rules = \"newline hrule block_code fences heading nptable lheading block_quote list_block def_links def_footnotes table paragraph text\".split()\n", "\n", " def parse_doctest(self, m): self.tokens.append({'type': 'paragraph', 'text': m.group(0)})\n", "\n", " def parse_fences(self, m):\n", " if m.group(2): self.tokens.append({'type': 'paragraph', 'text': m.group(0)})\n", " else: super().parse_fences(m)\n", "\n", " def parse_hrule(self, m): self.tokens.append(dict(type='hrule', text=m.group(0)))\n", " \n", " def parse_def_links(self, m):\n", " super().parse_def_links(m)\n", " self.tokens.append(dict(type='def_link', text=m.group(0)))\n", " \n", " def parse_front_matter(self): ...\n", " def parse(self, text: str, default_rules=None, normalize=True) -> typing.List[dict]:\n", " front_matter = None\n", " if not self.depth: \n", " self.tokens = []\n", " if text.strip() and text.startswith('---\\n') and '\\n---\\n' in text[4:]:\n", " front_matter, sep, text = text[4:].partition('---\\n')\n", " front_matter = {'type': 'front_matter', 'text': F\"\\n{front_matter}\"}\n", "\n", " with self: tokens = super().parse(pidgy.util.whiten(text), default_rules)\n", " if normalize and not self.depth: tokens = normalizer(text, tokens)\n", " if front_matter: tokens.insert(0, front_matter)\n", " return tokens" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
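A quick check of the `doctest` rule above: the grammar reuses `doctest.DocTestParser._EXAMPLE_RE` from the standard library, so a small stdlib-only sketch shows what the lexer will treat as a doctest block and leave out of the tangled source." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "    import doctest\n", "    # the same stdlib pattern the grammar above borrows for its doctest rule.\n", "    match = doctest.DocTestParser._EXAMPLE_RE.search('>>> 1 + 1\\n2\\n')\n", "    # the match separates the example source from its expected output.\n", "    match.group('source'), match.group('want')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "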
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Normalizing the tokens\n", "\n", "Tokenizing [Markdown] typically extracts conventions at both the block and inline level.\n", "Fortunately, `pidgy`'s translation is restricted to block level [Markdown] tokens, which mitigates some potential complexities of interpreting inline code while tangling.\n", "\n", "
normalizer" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ " def normalizer(text, tokens):\n", " compacted = []\n", " while tokens:\n", " token = tokens.pop(0)\n", " if 'text' not in token: continue\n", " if not token['text'].strip(): continue\n", " block, body = token['text'].splitlines(), \"\"\n", " while block:\n", " line = block.pop(0)\n", " if line:\n", " before, line, text = text.partition(line)\n", " body += before + line\n", " if token['type']=='code':\n", " compacted.append({'type': 'code', 'lang': None, 'text': body})\n", " elif compacted and compacted[-1]['type'] == 'paragraph':\n", " compacted[-1]['text'] += body\n", " else: compacted.append({'type': 'paragraph', 'text': body})\n", " \n", " if compacted and compacted[-1]['type'] == 'paragraph':\n", " compacted[-1]['text'] += text\n", " elif text.strip():\n", " compacted.append({'type': 'paragraph', 'text': text})\n", " # Deal with front matter\n", " if compacted and compacted[0]['text'].startswith('---\\n') and '\\n---' in compacted[0]['text'][4:]:\n", " token = compacted.pop(0)\n", " front_matter, sep, paragraph = token['text'][4:].partition('---')\n", " compacted = [{'type': 'front_matter', 'text': F\"\\n{front_matter}\"},\n", " {'type': 'paragraph', 'text': paragraph}] + compacted\n", " return compacted" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
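To sketch the compaction, hand-made tokens for a two-line paragraph followed by an indented code block pass through `normalizer`; the compacted code token keeps the blank lines that precede it, so line numbers in the original Markdown survive. The token values below are illustrative assumptions, not captured mistune output." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "    text = 'one\\ntwo\\n\\n    x = 1\\n'\n", "    tokens = [{'type': 'paragraph', 'text': 'one\\ntwo'}, {'type': 'code', 'lang': None, 'text': 'x = 1\\n'}]\n", "    # the compacted code token retains its leading blank lines.\n", "    normalizer(text, tokens)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "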
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Flattening the tokens to a [Python] string.\n", "\n", "The `Tangle` class controls the translation of [Markdown] strings to [Python] strings. Our major constraint is that the tangled [Python] must retain the line numbers of the original [Markdown] source.\n", "\n", "[Markdown]: #\n", "[Python]: #\n", "\n", "
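The constraint can be sketched with plain strings: wrapping a Markdown paragraph in quotes, as the tangle step does, must not change the number of lines. The triple-quote wrapper below is a deliberate simplification of `pidgy.util.quote`, not its real implementation.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "    # hypothetical simplification of the quoting step: wrap prose in triple quotes.\n", "    markdown = 'a paragraph\\nspanning two lines'\n", "    python = '\"\"\"' + markdown + '\"\"\"'\n", "    # quoting adds no newlines, so line numbers still align with the Markdown source.\n", "    assert markdown.count('\\n') == python.count('\\n')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "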
Flatten" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "    class Tangle(BlockLexer):\n", "        def stringify(self, tokens: typing.List[dict], source: str = \"\"\"\"\"\", last: int = 0) -> str:\n", "            import textwrap\n", "            INDENT = indent = pidgy.util.base_indent(tokens) or 4\n", "            for i, token in enumerate(tokens):\n", "                object = token['text']\n", "                if token and token['type'] == 'code':\n", "                    if object.lstrip().startswith(pidgy.util.FENCE):\n", "\n", "                        object = ''.join(''.join(object.partition(pidgy.util.FENCE)[::2]).rpartition(pidgy.util.FENCE)[::2])\n", "                        indent = INDENT + pidgy.util.num_first_indent(object)\n", "                        object = textwrap.indent(object, INDENT*pidgy.util.SPACE)\n", "\n", "                    if object.lstrip().startswith(pidgy.util.MAGIC): ...\n", "                    else: indent = pidgy.util.num_last_indent(object)\n", "                elif token and token['type'] == 'front_matter':\n", "                    object = textwrap.indent(\n", "                        F\"locals().update(__import__('ruamel.yaml').yaml.safe_load({pidgy.util.quote(object)}))\\n\", indent*pidgy.util.SPACE)\n", "\n", "                elif not object: ...\n", "                else:\n", "                    object = textwrap.indent(object, pidgy.util.SPACE*max(indent-pidgy.util.num_first_indent(object), 0))\n", "                    for next in tokens[i+1:]:\n", "                        if next['type'] == 'code':\n", "                            next = pidgy.util.num_first_indent(next['text'])\n", "                            break\n", "                    else: next = indent\n", "                    Δ = max(next-indent, 0)\n", "\n", "                    if not Δ and source.rstrip().rstrip(pidgy.util.CONTINUATION).endswith(pidgy.util.COLON):\n", "                        Δ += 4\n", "\n", "                    spaces = pidgy.util.indents(object)\n", "                    # what happens if the leading spaces are long enough?\n", "                    object = object[:spaces] + Δ*pidgy.util.SPACE + object[spaces:]\n", "                    if not source.rstrip().rstrip(pidgy.util.CONTINUATION).endswith(pidgy.util.QUOTES):\n", "                        object = pidgy.util.quote(object)\n", "                source += object\n", "\n", "            # append a semicolon when the trailing block is not code to suppress its display.\n", "            for token in reversed(tokens):\n", "                if token['text'].strip():\n", "                    if token['type'] != 
'code': \n", "                        source = source.rstrip() + pidgy.util.SEMI\n", "                    break\n", "\n", "            return source" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Append the `doctest` rule to the lexer rule sets, including the nested rules, and drop block HTML parsing." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "    for x in \"default_rules footnote_rules list_rules\".split():\n", "        setattr(BlockLexer, x, list(getattr(BlockLexer, x)))\n", "        getattr(BlockLexer, x).insert(getattr(BlockLexer, x).index('block_code'), 'doctest')\n", "        if 'block_html' in getattr(BlockLexer, x):\n", "            getattr(BlockLexer, x).pop(getattr(BlockLexer, x).index('block_html'))\n", "    del x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## More `pidgy` language features\n", "\n", "`pidgy` experiments with extra language features for Python, using the same system\n", "that IPython uses to add features like line and cell magics." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "    import ast, pidgy, IPython" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recently, IPython introduced a convention that allows top-level await statements outside of functions. Building on this convenience, `pidgy` allows top-level __return__ and __yield__ statements. These statements are replaced with an IPython display call." 
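, "\n\n", "To see why a transformer is needed at all, note that standard Python rejects a module-level `return`; the stdlib-only sketch below confirms the syntax error that `pidgy` rewrites away." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "    import ast\n", "    # plain Python refuses a top-level return; pidgy rewrites it instead.\n", "    try:\n", "        ast.parse('return 42')\n", "    except SyntaxError as error:\n", "        caught = error\n", "    caught.msg"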
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "    class ExtraSyntax(ast.NodeTransformer):\n", "        def visit_FunctionDef(self, node): return node\n", "        visit_AsyncFunctionDef = visit_FunctionDef\n", "\n", "        def visit_Return(self, node):\n", "            replace = ast.parse('''__import__('IPython').display.display()''').body[0]\n", "            replace.value.args = node.value.elts if isinstance(node.value, ast.Tuple) else [node.value]\n", "            return ast.copy_location(replace, node)\n", "\n", "        def visit_Expr(self, node):\n", "            if isinstance(node.value, (ast.Yield, ast.YieldFrom)): return ast.copy_location(self.visit_Return(node.value), node)\n", "            return node\n", "\n", "        visit_Expression = visit_Expr" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We know naming is hard, and there is no point belaboring it. `pidgy` allows authors\n", "to use emojis as variables in Python. They add extra color and expression to the narrative." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "    def demojize(lines, delimiters=('_', '_')):\n", "        str = ''.join(lines)\n", "        import tokenize, emoji; tokens = []\n", "        try:\n", "            for token in list(tokenize.tokenize(\n", "                    __import__('io').BytesIO(str.encode()).readline)):\n", "                if token.type == tokenize.ERRORTOKEN:\n", "                    string = emoji.demojize(token.string, delimiters=delimiters\n", "                                            ).replace('-', '_').replace(\"’\", \"_\")\n", "                    if tokens and tokens[-1].type == tokenize.NAME: tokens[-1] = tokenize.TokenInfo(tokens[-1].type, tokens[-1].string + string, tokens[-1].start, tokens[-1].end, tokens[-1].line)\n", "                    else: tokens.append(\n", "                        tokenize.TokenInfo(\n", "                            tokenize.NAME, string, token.start, token.end, token.line))\n", "                else: tokens.append(token)\n", "            return tokenize.untokenize(tokens).decode()\n", "        except BaseException: raise SyntaxError(str)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "    def 
init_json():\n", " import builtins\n", " builtins.yes = builtins.true = True\n", " builtins.no = builtins.false = False\n", " builtins.null = None" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 4 }