https://github.com/clvnkhr/tao2tex
Goes through a HTML version of one of Prof Tao's blogposts, and spits out a LaTeX version
blog latex mathematics webscraping wordpress
Last synced: 6 months ago
- Host: GitHub
- URL: https://github.com/clvnkhr/tao2tex
- Owner: clvnkhr
- License: mit
- Created: 2022-12-07T08:59:44.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2024-06-18T12:57:30.000Z (about 2 years ago)
- Last Synced: 2024-10-30T18:32:50.205Z (over 1 year ago)
- Topics: blog, latex, mathematics, webscraping, wordpress
- Language: Python
- Homepage:
- Size: 1.81 MB
- Stars: 0
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# tao2tex
## Description
Goes through the HTML of a WordPress math blog post (mainly, [Prof. Terry Tao’s blog](https://terrytao.wordpress.com)) using a combination of regexes and BeautifulSoup, and spits out a $\rm\LaTeX$ version. In some ways, it is a partial inverse of [LaTeX2WP](https://lucatrevisan.wordpress.com/latex-to-wordpress/using-latex2wp/). However, we also include the comments (which sometimes have great information). This should work well for many of Tao's blog posts, and issues with the generated `.tex` should be few and easy to fix.
**Note:** please observe Prof Tao's copyright notice on [this page](https://terrytao.wordpress.com/about/) and do not redistribute large numbers of Tao's blogposts without asking him for permission:
>Readers are welcome to copy, link to, quote from, or translate reasonable portions of the content of this blog (e.g. a single article) into other media, though for items longer than one or two paragraphs, I would appreciate it if a reference or citation to the URL that the content originates from is provided. If you wish to copy a significantly larger fraction of the content (e.g. an entire series of articles), please contact me about it first.
## Requirements and Installation
You need reasonably up-to-date installations of [Python 3](https://www.python.org/) and $\rm\LaTeX$ ([software](https://www.latex-project.org/get/) to compile the output of `tao2tex.py`). In addition, the following packages must be installed (e.g. via pip):
- [`lxml`](https://lxml.de/)
- `bs4` ([Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/))
- [`requests`](https://requests.readthedocs.io/en/latest/)
- [`emoji`](https://pypi.org/project/emoji/)
You could also use a cloud service like [Overleaf](https://www.overleaf.com/) in lieu of a new $\rm\TeX$ installation.
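To check whether the Python dependencies listed above are importable before running the tool, something like the following minimal sketch may help. The helper name `missing_modules` is ours and is not part of tao2tex; the module names are the import names from the list above.

```python
import importlib.util

# Import names of the dependencies listed above.
# Note: the `bs4` module is distributed on PyPI as `beautifulsoup4`.
REQUIRED = ["lxml", "bs4", "requests", "emoji"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All dependencies are importable.")
```

If anything is reported as missing, install it with pip (or `pdm install`, as described under Usage) and run the check again.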
## Usage
1. Clone the repo and install the dependencies. One way to do this is with `pdm install`.
2. Go to [Terry’s blog](https://terrytao.wordpress.com) and find a post you want to convert to $\rm\LaTeX$.
3. Copy the URL.
4. `cd` to the repo and run `python3 tao2tex.py URL`. (If using pdm, run `pdm run python tao2tex.py URL` instead.)
5. Wait a few seconds and a `.tex` file will be produced.
6. Run the `.tex` file in your favourite $\rm\LaTeX$ workflow to create a finished PDF.
For instance, if we copied [this](https://terrytao.wordpress.com/2018/12/09/254a-supplemental-weak-solutions-from-the-perspective-of-nonstandard-analysis-optional/) URL, we would type `python3 tao2tex.py https://terrytao.wordpress.com/2018/12/09/254a-supplemental-weak-solutions-from-the-perspective-of-nonstandard-analysis-optional/`.
tao2tex also supports a local mode, and a batch mode:
- For local mode, save the HTML of the page and then use the name of the file in place of the URL, with the option `-l`, e.g. `python3 tao2tex.py file.html -l`.
- For batch mode, save the list of URLs in a file, e.g. `batch.txt`, and call `python3 tao2tex.py batch.txt -b`. If you have a list of local files, you can use `-b -l`, e.g. with the provided `tested.txt` file. Everything after the first whitespace in each line is ignored, so you can leave comments after a space.
In addition, you can specify the name of the .tex file with the `-o` option, the `-p` option prints the output to the command-line, and `-d` enables a rudimentary debugger. If you do not have a specific post in mind, you can run `python3 tao2tex.py -i https://terrytao.wordpress.com` to get a list of blog posts on Prof Tao's front page.
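To illustrate the batch-file format described above (one target per line, with everything after the first whitespace ignored), here is a minimal sketch of how such a file could be read. It is not taken from `tao2tex.py`; the function name `parse_batch_lines` and the choice to skip blank lines are our assumptions.

```python
def parse_batch_lines(lines):
    """Extract one target (URL or file name) per line of a batch file.

    Assumption: only the first whitespace-separated token of each line is a
    target; the rest of the line is treated as a comment. Blank lines are
    skipped (also an assumption, not documented behaviour of tao2tex).
    """
    targets = []
    for line in lines:
        parts = line.split(maxsplit=1)
        if parts:
            targets.append(parts[0])
    return targets


if __name__ == "__main__":
    sample = [
        "https://terrytao.wordpress.com/2022/10/03/what-are-the-odds/ probability post\n",
        "\n",
        "file.html saved locally\n",
    ]
    for target in parse_batch_lines(sample):
        print(target)
```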
## Testing
Since the desired output is not precisely defined, we provide a `test.html` file which may be used for debugging (in particular, for adding features, adjusting to breaking changes, or for adapting to other blogs). It is a short sample HTML file that can be used to test the output of tao2tex via the command `python3 tao2tex.py test.html -l`.
## Customizing the output
The easiest way to customise the output is to modify `preamble.tex`. The theorems look very close to how they appear online. This is achieved with `\usepackage[framemethod=tikz]{mdframed}` and the simple style `\mdfdefinestyle{tao}{outerlinewidth = 1,roundcorner=2pt,innertopmargin=0}`. The more standard `amsthm` environments are provided as a commented-out block.
There are a number of keywords in the given `preamble.tex`; they are in all-caps and begin with `TTT-`, e.g. `TTT-BLOG-TITLE`. These are substituted via regex by tao2tex.py to create the `.tex` output. It is possible to create more of these keywords; to make tao2tex see them, you should modify the `preamble_formatter` function.
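The keyword substitution can be pictured with the following sketch. It is only an illustration of the idea, not the actual `preamble_formatter`: the function name `fill_keywords`, the pattern `TTT-[A-Z-]+`, and the keyword `TTT-AUTHOR` are hypothetical, while `TTT-BLOG-TITLE` is the example named above.

```python
import re

# Assumed shape of a keyword: "TTT-" followed by capital letters and hyphens.
KEYWORD = re.compile(r"TTT-[A-Z-]+")


def fill_keywords(template: str, values: dict) -> str:
    """Replace each known TTT- keyword in the template; leave unknown ones as-is."""

    def replace(match):
        # Using a function as the replacement means backslashes in the
        # substituted text are not interpreted as regex escapes.
        return values.get(match.group(0), match.group(0))

    return KEYWORD.sub(replace, template)


if __name__ == "__main__":
    template = r"\title{TTT-BLOG-TITLE}\author{TTT-AUTHOR}"
    print(fill_keywords(template, {"TTT-BLOG-TITLE": r"Notes on \(L^2\)"}))
```

Leaving unknown keywords untouched makes it easy to spot, in the generated `.tex`, any keyword that the formatter does not yet handle.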
Emoji that appear (for instance, in [certain](https://terrytao.wordpress.com/2022/10/07/a-bayesian-probability-worksheet/#comment-659640) comments) are processed (e.g. 😂 becomes `\emoji{face_with_tears_of_joy}`); `\emoji` is defined to simply be `\texttt`, as $\rm\LaTeX$ is unable to render emoji without help. But you can get the actual emoji if you comment out this definition, import the [`emoji`](https://www.ctan.org/pkg/emoji) package, and compile with $\rm Lua\TeX$, [a variant](https://www.luatex.org/) of $\rm pdf\TeX$.
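The emoji conversion described above can be approximated as follows. tao2tex itself relies on the `emoji` package; this sketch swaps in the standard library's `unicodedata` so that it runs without extra dependencies, which means the generated names may differ from the tool's for some emoji. The function name `emoji_to_macro` and the detection heuristic are our assumptions.

```python
import unicodedata


def emoji_to_macro(text: str) -> str:
    r"""Rewrite emoji-like characters as \emoji{name} macros."""
    pieces = []
    for ch in text:
        # Heuristic (an assumption): treat "Symbol, other" characters outside
        # the Basic Multilingual Plane as emoji.
        if ord(ch) > 0xFFFF and unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "unknown")
            name = name.lower().replace(" ", "_").replace("-", "_")
            pieces.append(r"\emoji{" + name + "}")
        else:
            pieces.append(ch)
    return "".join(pieces)


if __name__ == "__main__":
    print(emoji_to_macro("Nice proof 😂"))
```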
## Known Limitations or Issues
- More recent versions (since 2018) of $\rm pdf\LaTeX$ cope with many Unicode symbols (but not all) because [UTF8 is assumed to be the default input encoding](https://tex.stackexchange.com/questions/34604/entering-unicode-characters-in-latex). If you do not want to install a newer version, you can try using [Overleaf](https://www.overleaf.com/). With an older version, you might be able to get away with adding `\usepackage[utf8]{inputenc}` and `\usepackage[T1]{fontenc}` to the preamble.
- Sometimes (in section names, theorem names, etc.) the mathematics is skipped. This should be easy to fix once I have time to look into it.
- In `string_formatter`, we escape only a few Unicode characters to attempt to please the $\rm\TeX$ engine. We replace Greek characters, which do appear on [some](https://terrytao.wordpress.com/2022/10/03/what-are-the-odds/#comment-658396) of the blog posts, in an arguably naive and counterproductive manner (e.g. alpha into `\(\alpha\)`). For characters that are left unescaped, $\rm{}pdf\LaTeX$ will complain, whereas $\rm{}Xe\LaTeX$ and $\rm{}Lua\LaTeX$ will render them if you switch to a font that has the glyphs (without such a font, these two will still compile).
- Since we pull website data using the `requests` module, we do not see any HTML generated by JavaScript. For example, we are unable to process the occasional polls that Tao makes. However, the rest of the post should work as expected.
- In some posts, e.g. [this one](https://terrytao.wordpress.com/2020/04/13/247b-notes-2-decoupling-theory/#comments), there are so many comments that we check multiple pages. We skip this when running in `-l`/`--local` mode.
- The heuristics we use for labels are not perfect. However, we definitely include all labelled tags (formatted as `eq. number`). Most issues seem to be easy to regex away after running tao2tex; for example, I had success replacing `end{align}\\label{[a-z-]*}` with `end{align}` globally.
- Most likely, the `BeautifulSoup` part needs to be modified to work with other blogs, even those that are on WordPress. Despite looking quite similar, the precise way that the tags are laid out seems to differ from blog to blog.
- For similar reasons, if Prof Tao ever updates the layout of the blog, this tool will break. Hopefully such a new version will directly support a good print option, but in any case the posts pre-update with the older layout will still be accessible, thanks to the [Internet Archive](https://web.archive.org/web/20220000000000*/terrytao.wordpress.com).
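The label clean-up mentioned in the list above (replacing `end{align}\\label{[a-z-]*}` with `end{align}`) can also be scripted. The sketch below applies that same replacement in Python; the function name `strip_align_labels` is ours, and the braces are escaped explicitly so the pattern is unambiguous.

```python
import re

# Same pattern as in the list above, with the braces escaped explicitly.
LABEL_AFTER_ALIGN = re.compile(r"end\{align\}\\label\{[a-z-]*\}")


def strip_align_labels(tex: str) -> str:
    r"""Remove a \label{...} that immediately follows \end{align}."""
    return LABEL_AFTER_ALIGN.sub("end{align}", tex)


if __name__ == "__main__":
    sample = r"\begin{align} a &= b \end{align}\label{eq-main} Then we see that"
    print(strip_align_labels(sample))
```

Only labels made of lowercase letters and hyphens are matched, as in the original pattern; widen the character class if your labels look different.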