I concur with most of these arguments, especially about longevity. But this only applies to smallish files like configurations; I don't agree with the last paragraph regarding efficiency.
I have had to work with large 1GB+ JSON files, and it is not fun. Amazing projects exist, such as jsoncons for streaming JSON and simdjson for parsing JSON with SIMD, but as far as I know, the latter still does not support streaming and even has an open issue for files larger than 4 GiB. So you cannot have streaming for memory efficiency and SIMD parsing for computational efficiency at the same time. You want streaming because holding the whole JSON document in memory is wasteful and sometimes not even possible. JSONL tries to change the format to fix that, but now you have another format that you need to support.
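To make the memory point concrete, here is a rough sketch in plain Python (hypothetical file name, no third-party parsers) of loading everything at once versus streaming JSONL record by record:

    import json

    # Loading a giant JSON document keeps the whole parsed structure in memory at once.
    def load_all(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)  # peak memory is roughly proportional to the whole file

    # With JSONL (one JSON value per line) you can stream record by record instead.
    def stream_jsonl(path):
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank lines
                    yield json.loads(line)

    # Hypothetical usage: aggregate a field without ever holding the full dump in memory.
    # total = sum(rec.get("size", 0) for rec in stream_jsonl("dump.jsonl"))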
I was also contemplating the mentioned formats for another project, but they are hardly usable when you need to store binary data, such as images, compressed data, or simply arbitrary data. Storing binary data as base64 strings seems wasteful. Random access into these files is also an issue, depending on the use case. Sometimes it would be a nice feature to jump over some data, but for JSON, you cannot do that without parsing everything in search of the closing bracket or quotes, accounting for escaped brackets and quotes, and nesting.
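On the base64 point, the overhead is easy to quantify: base64 turns every 3 bytes into 4 characters, so you pay roughly a third extra before any quoting or escaping. A quick check:

    import base64
    import os

    payload = os.urandom(3_000_000)        # 3 MB of arbitrary binary data
    encoded = base64.b64encode(payload)    # how it would have to be embedded in JSON

    print(len(payload), len(encoded))      # 3000000 4000000
    print(len(encoded) / len(payload))     # ~1.33, i.e. about one third larger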
My rule of thumb, which has been surprisingly robust across several uses, is that if you gzip a JSON file you can expect it to shrink by a factor of about 15.
That is not the hallmark of a space-efficient file format.
Between repeated string keys and frequently repeated string values, which are often quite large due to being "human readable", it adds up fast.
"I was also contemplating the mentioned formats for another project, but they are hardly usable when you need to store binary data, such as images, compressed data, or simply arbitrary data."
One trick you can use is to prefix a file with some JSON or other readable value, then dump the binary afterwards. The JSON can have offsets into the binary as necessary for identifying things or labeling whether or not it is compressed or whatever. This often largely mitigates the inefficiency concerns because if you've got a big pile of binary data the JSON bloat by percent tends to be much smaller than the payload; if it isn't, then of course I don't recommend this.
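A minimal sketch of that layout, just to make it concrete (my own made-up framing, not any standard container): a length-prefixed JSON index up front, raw payload bytes after it, and offsets in the JSON pointing into the binary region.

    import json
    import struct

    def write_container(path, blobs):
        # blobs: dict of name -> bytes
        header = {"entries": []}
        offset = 0
        for name, data in blobs.items():
            header["entries"].append({"name": name, "offset": offset, "size": len(data)})
            offset += len(data)
        header_bytes = json.dumps(header).encode("utf-8")
        with open(path, "wb") as f:
            f.write(struct.pack("<I", len(header_bytes)))  # 4-byte little-endian header length
            f.write(header_bytes)                          # readable JSON index
            for data in blobs.values():                    # raw binary, no base64
                f.write(data)

    def read_entry(path, name):
        with open(path, "rb") as f:
            (hlen,) = struct.unpack("<I", f.read(4))
            header = json.loads(f.read(hlen))
            base = 4 + hlen
            for e in header["entries"]:
                if e["name"] == name:
                    f.seek(base + e["offset"])
                    return f.read(e["size"])
        raise KeyError(name)

    # write_container("bundle.bin", {"a.png": png_bytes, "b.gz": gz_bytes})
    # read_entry("bundle.bin", "a.png")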
I can confirm usual compression ratios of 10-20 for JSON. For example, wikidata-20220103.json.gz is quite fun to work with. It is 109 GB, which decompresses to 1.4 TB, and even the non-compressed index for random access with indexed_gzip is 11 GiB. The compressed random access index format, which gztool supports, would be 1.4 GB (compression ratio 8). And rapidgzip even supports the compressed gztool format with further file size reduction by doing a sparsity analysis of required seek point data and setting all unnecessary bytes to 0 to increase compressibility. The resulting index is only 536 MiB.
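In case it is useful to anyone, random access with indexed_gzip looks roughly like this (written from memory, so treat the exact method names as approximate; the file names are placeholders):

    import indexed_gzip as igzip

    # One slow pass to build seek points over the compressed stream, then save them.
    f = igzip.IndexedGzipFile("wikidata.json.gz")
    f.build_full_index()
    f.export_index("wikidata.json.gzidx")
    f.close()

    # Later: load the saved index and seek anywhere without decompressing from the start.
    f = igzip.IndexedGzipFile("wikidata.json.gz")
    f.import_index("wikidata.json.gzidx")
    f.seek(1_000_000_000)      # jump about 1 GB into the decompressed stream
    chunk = f.read(4096)
    f.close()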
The trick for the mix of JSON with binary is a good reminder. That's how the ASAR file archive format works. That could indeed be usable for what I was working on: a new file format for random seek indexes. Although the gztool index format seems to suffice for now.
> One trick you can use is to prefix a file with some JSON or other readable value, then dump the binary afterwards.
The GLB container (binary glTF) works almost exactly as you described, except there is a fixed size header before the JSON part.
https://registry.khronos.org/glTF/specs/2.0/glTF-2.0.html#bi...
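As I read that spec, a GLB file is a 12-byte header (magic, version, total length) followed by chunks, the first chunk being JSON with an optional binary chunk after it, so a parser sketch is short:

    import json
    import struct

    def read_glb(path):
        with open(path, "rb") as f:
            magic, version, length = struct.unpack("<4sII", f.read(12))
            assert magic == b"glTF", "not a GLB file"
            chunks = {}
            while f.tell() < length:
                chunk_len, chunk_type = struct.unpack("<II", f.read(8))
                chunks[chunk_type] = f.read(chunk_len)   # chunk length includes padding
            # 0x4E4F534A is the ASCII chunk type "JSON", 0x004E4942 is "BIN\0"
            gltf = json.loads(chunks[0x4E4F534A])
            binary = chunks.get(0x004E4942, b"")
            return gltf, binary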
I see sooo many comments on this submission talking about large files. It feels like a massively over-represented concern to me.
On Linux, a good number of FS have builtin compression. My JSON all gets hit with lz4 compression automatically.
It is indeed annoying having to compress & decompress files before sending. It'd be lovely if file transfer tools (including messaging apps) were a bit better at auto-compressing. I think btrfs tests for compressibility too, and will give up on trying to compress at some point: a similar effort ought to be applied here.
The large file question & efficiency question feels like it's dominating this discussion, and it just doesn't seem a particularly interesting or fruitful concern to me. It shouldn't matter much. The computer can and should generally be able to eliminate most of the downsides relatively effectively.
> I have had to work with large 1GB+ JSON files, and it is not fun.
I have also had to work with large JSON files, even though I would prefer other formats. I wrote some C code to split them into records, which is done by keeping track of the nesting level, of whether you are inside a string, and of escaping within a string (so that escaped quotation marks work properly). It is not too difficult.
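Roughly the same idea as a Python sketch (character-at-a-time for clarity only; a real version would read in blocks), assuming the input is one big top-level JSON array:

    def split_records(stream):
        """Yield the raw text of each top-level element of a huge JSON array,
        by tracking nesting depth, string state, and escapes (no full parse)."""
        depth = 0
        in_string = False
        escaped = False
        buf = []
        for ch in iter(lambda: stream.read(1), ""):
            if in_string:
                buf.append(ch)
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
                continue
            if ch == '"':
                in_string = True
                buf.append(ch)
            elif ch in "[{":
                depth += 1
                if depth > 1:              # don't include the outer array's bracket
                    buf.append(ch)
            elif ch in "]}":
                depth -= 1
                if depth >= 1:
                    buf.append(ch)
                elif depth == 0 and buf:   # closed the outer array: last record
                    yield "".join(buf).strip()
                    buf = []
            elif ch == "," and depth == 1:
                yield "".join(buf).strip() # record boundary at the top level
                buf = []
            else:
                if depth >= 1:
                    buf.append(ch)

    # with open("big.json") as f:
    #     for record_text in split_records(f):
    #         ...  # json.loads(record_text) per record if needed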
> I was also contemplating the mentioned formats for another project, but they are hardly usable when you need to store binary data, such as images, compressed data, or simply arbitrary data. Storing binary data as base64 strings seems wasteful.
I agree, which is one reason I do not like JSON (I prefer DER). In addition to that, there is the escaping of text.
> Random access into these files is also an issue, depending on the use case. Sometimes it would be a nice feature to jump over some data, but for JSON, you cannot do that
With DER you can easily skip over any data.
However, I think the formats with type/length/value (such as DER) do not work as well for streaming, and vice-versa.
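For context, skipping works because every DER element carries its length up front, so a reader can hop from element to element without decoding the contents. A rough sketch, for definite-length encodings only (which is all DER allows):

    def skip_der_element(buf, pos):
        """Return the offset just past the DER element starting at pos."""
        tag = buf[pos]
        pos += 1
        if tag & 0x1F == 0x1F:                 # high-tag-number form: tag continues
            while buf[pos] & 0x80:
                pos += 1
            pos += 1
        first = buf[pos]
        pos += 1
        if first < 0x80:                       # short form: length fits in 7 bits
            length = first
        else:                                  # long form: next (first & 0x7F) bytes hold the length
            n = first & 0x7F
            length = int.from_bytes(buf[pos:pos + n], "big")
            pos += n
        return pos + length                    # contents can be skipped wholesale

    # Example: walk the top-level elements of a DER file without decoding them.
    # data = open("cert.der", "rb").read()
    # pos = 0
    # while pos < len(data):
    #     pos = skip_der_element(data, pos)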
try clickhouse-local, it's amazing how it can crunch JSON/TSV or whatever at great speed
I can understand this for "small" data, say less than 10 MB.
In bioinformatics, basically all of the file formats are human-readable/text based. And file sizes range between 1-2 MB and 1 TB. I regularly encounter 300-600 GB files.
In this context, human-readable files are ridiculously inefficient, on every axis you can think of (space, parsing, searching, processing, etc.). It's a GD crime against efficiency.
And at that scale, "readable" has no value, since it would take you longer to read the file than 10 lifetimes.
I do not think the argument is that ALL data should be in human readable form, but I think there are far more cases of data being in a binary form when it would be better human readable. Your example of a case where it is human readable when it should be binary is rarer for most of us.
In some cases human readable data is for interchange and it should be processed and queried in other forms - e.g. CSV files to move data between databases.
An awful lot of data is small - and these days I think you can say small is quite a bit bigger than 10 MB.
Quite a lot of data that is extracted from a large system would be small at that point, and would benefit from being human readable.
The benefit of data being human readable is not necessarily that you will read it all, but that it is easier to read bits that matter when you are debugging.
> human-readable files are ridiculously inefficient on every axis you can think of (space, parsing, searching, processing, etc.).
In bioinformatics, most large text files are gzip'd. Decompression is a few times slower than proper file parsing in C/C++/Rust. Some pure Python parsers can be "ridiculously inefficient", but that is not the fault of human-readability. Binary files are compressed with existing libraries. Compressed binary files are not noticeably faster to parse than compressed text files. Binary formats can indeed be smaller, but space-efficient formats take years to develop and tend to have more compatibility issues. You can't skip the text format phase.
> And at that scale, "readable" has no value, since it would take you longer to read the file than 10 lifetimes.
You can't read the whole file by eye, but you can (and should often) eyeball small sections in a huge file. For that, you need a human-readable file format. A problem with this field IMHO is that not many people are literally looking at the data by eye.
One of the problems is that a lot of bioinformatics formats nowadays have to hold so much data that most text editors stop working properly. For example, FASTA splits DNA data into lines of 50-80 characters for readability. But in FASTQ, where the '>' and '+' characters collide with the quality scores, as far as I know, DNA and the quality data are always put into one line each. Trying to find a location in a 10k long line gets very awkward. And I'm sure some people can eyeball Phred scores from ASCII, but I think they are a minority, even among researchers.
Similarly, NEXUS files are also human-readable, but it'd be tough to discern the shape of inlined 200 node Newick trees.
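On the Phred point above: each FASTQ record is four lines (header, sequence, '+', quality), and in the common Phred+33 encoding, "eyeballing" a quality character means mentally subtracting 33 from its ASCII code. A small sketch, assuming well-formed 4-line records:

    from itertools import islice

    def read_fastq(path):
        """Yield (header, sequence, qualities) from a FASTQ file,
        assuming 4-line records and Phred+33 quality encoding."""
        with open(path) as f:
            while True:
                record = list(islice(f, 4))
                if len(record) < 4:
                    break
                header, seq, _plus, qual = (line.rstrip("\n") for line in record)
                scores = [ord(c) - 33 for c in qual]   # ASCII char -> Phred quality score
                yield header, seq, scores

    # for header, seq, scores in read_fastq("reads.fastq"):
    #     low_quality = sum(q < 20 for q in scores)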
When I asked people who do actual bioinformatics (well, genomics) about their annoyances with bioinformatics software, one complaint was having to do a bunch of busywork on files in between pipeline steps (compressing/uncompressing, indexing).
I think there's a place in bioinformatics for a unified binary format which can take care of compression, indexing, and metadata. But with that list of requirements it'd have to be binary. Data analysis moved from CSVs and Excel files to Parquet, and I think there's a similar transition waiting to happen here.
My hypothesis is that bioinformatics favors text files, because open source tools usually start as research code.
That means two things. First, the initial developers are rarely software engineers, and they have limited experience developing software. They use text files, because they are not familiar with the alternatives.
Second, the tools are usually intended to solve research problems. The developers rarely have a good idea what the tools will eventually end up doing and what data the files need to store. Text-based formats are a convenient choice, as it's easy to extend and change them. By the time anyone understands the problem well enough to write a useful specification, the existing file format may already be popular, and it's difficult to convince people to switch to a new format.
Totally. A good chunk of the formats are just TSV files with some metadata in the header. Setting aside the drawbacks, this approach is both straightforward and flexible.
I think we're seeing some change in that regard, though. VCF got BCF, and SAM got BAM.
Another thing is that human-readable is typically synonymous with unindexed, which becomes a problem when you have large files and care about performance. In bioinformatics we often distribute sidecar index files with the actual data, which is janky and inefficient. Why not have a decent format to begin with?
Further, when the file is unindexed it's even harder to read it as a human because you can't easily skip to a particular section. I have this trouble often where my code can efficiently access the data once it's loaded, but a human-eye check is tedious/impossible because you have to scroll through gigabytes to find what you want.
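Even a dumb sidecar index of per-line byte offsets makes the eyeball check tolerable: build it once, then seek straight to record N instead of scrolling. A sketch with hypothetical file names (real tools use fancier schemes, e.g. tabix over bgzip):

    import struct

    def build_line_index(data_path, index_path):
        """Record the byte offset of every line so we can seek to line N later."""
        offsets = []
        with open(data_path, "rb") as f:
            while True:
                offsets.append(f.tell())
                if not f.readline():
                    offsets.pop()      # drop the offset recorded at EOF
                    break
        with open(index_path, "wb") as idx:
            for off in offsets:
                idx.write(struct.pack("<Q", off))   # one 8-byte offset per line

    def read_line(data_path, index_path, n):
        with open(index_path, "rb") as idx:
            idx.seek(n * 8)
            (off,) = struct.unpack("<Q", idx.read(8))
        with open(data_path, "rb") as f:
            f.seek(off)
            return f.readline().decode()

    # build_line_index("variants.tsv", "variants.tsv.idx")
    # print(read_line("variants.tsv", "variants.tsv.idx", 1_000_000))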
> Another thing is human readable is typically synonymous with unindexed
Indexing is not directly related to binary vs text. Many text formats in bioinformatics are indexed and many binary formats are not when they are not designed with indexing in mind.
> a human-eye check is tedious/impossible because you have to scroll through gigabytes to find what you want.
Yes, indexing is better, but without it you can use command-line tools to extract the portion you want to look at and then pipe it to "more" or "less".
I journeyed from fancy commercial bookkeeping systems that changed data formats every few years (with no useful migration) to GNU Cash and finally to Plain-Text Accounting. I can finally get the information I need with easy backups (through VCS) and flexibility (through various tools that transform the data). The focus is on content, not tools or presentation or product.
When I write I write text. I can transform text using various tools to provide various presentations consumable through various products. The focus is on content, not presentation, tools, or product.
I prefer human-readable file formats, and that has only been reinforced over more than 4 decades as a computer professional.
I have recently migrated ~8y of Apple Numbers spreadsheets (an annoyingly non-portable format) to plaintext accounting.
It took me many hours and a few backtracks to get to a point where I am satisfied with it, and where errors are caught early. I would suggest to anyone starting now to enable --strict --pedantic on ledger-cli from day 1, and to write asserts for your accounts as well, e.g. to check that closed accounts don’t get new entries.
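For illustration, a simplified journal along those lines (written from memory, so double-check the exact directive syntax against the ledger-cli docs): declared accounts are what --strict/--pedantic check postings against, and a balance assertion after a posting amount catches drift early.

    ; Declared accounts: with --strict/--pedantic, postings to anything
    ; undeclared become warnings/errors instead of silently new accounts.
    account Assets:Checking
    account Expenses:Groceries
    account Equity:Opening

    2025/08/01 Opening balance
        Assets:Checking            $1000.00
        Equity:Opening

    2025/08/03 Grocery store
        Expenses:Groceries           $42.10
        Assets:Checking             $-42.10 = $957.90   ; balance assertion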
I really miss data entry being easier and not as prone to free-form text editing errors (most common are typos on the amount or copying the wrong source/dest account), but I am confident it matches reality much better than my spreadsheets did.
Ease of: reading, comprehension, manipulation, short- and long-term retrieval are not the same problems. All file formats are bad at at least one of these.
Given an arbitrary stream of bytes, readability only means the human can inspect the file. We say "text is readable" but that's really only because all our tooling for the last sixty years speaks ASCII and we're very US-centric. Pick up a text file from 1982 and it could be unreadable (EBCDIC, say). Time to break out dd and cross your fingers.
Comprehension breaks down very quickly beyond a few thousand words. No geneticist is loading up a gig of CTAGT... and keeping that in their head as they whiz up and down a genome. Humans have a working set size.
Short term retrieval is excellent for text and a PITA for everything else. Raise your hand if you've gotten a stream of bytes, thrown file(1) at it, then strings(1), and then resorted to od or picking through the bytes.
Long term retrieval sucks for everyone. Even textfiles. After all, a string of bytes has no intrinsic meaning except what the operating system and the application give it. So who knows if people in 2075 will recognise "48 65 6C 6C 6F 20 48 4E 21"?
I decoded that as "Hello HI!" using basic cryptanalysis, the assumption that the alphabet would be mostly contiguous, the assumption that capital and lower-case are separated by a bit, and the knowledge that 0x20 is space and 0x21 is exclamation mark. On a larger text, we wouldn't even need these assumptions: cryptanalysis is sufficiently-powerful, and could even reverse-engineer EBCDIC! (Except, it might be difficult to figure out where the punctuation characters go, without some unambiguous reference such as C source code: commas and question marks are easy, but .![]{} are harder.)
Edit: I can't count. H and I are consecutive in the alphabet, and it actually says "Hello HN!". I think my general point is valid, though.
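And if anything ASCII-ish survives to 2075, the lazy route stays a one-liner:

    print(bytes.fromhex("48 65 6C 6C 6F 20 48 4E 21".replace(" ", "")).decode("ascii"))
    # Hello HN!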
I see no issue using (or even receiving) an SQLite file, where I can see the tables structure and even export everything to pure text format.
The major problem of both human-readable and binary formats is not the serialized form, but the understanding of the schema (structure) of the data, which more often than not, is completely undocumented. Human-readable formats are worse in this regard, because they justify it by "it's obvious what this is".
Even "human-readable" formats are only readable if you have proper tools - i.e. editors or viewers.
If a binary file has a well-known format and tools available to view/edit it, I see zero problems with it.
/* Technically, most binary formats are legible to a human, given a proper renderer, e.g. journalctl. What TFA speaks about is ASCII/UTF-8 text formats that need no processing besides rendering CR, LF, and TAB characters specially. Assuming a Unix command line, I would call these "cat-readable" formats, or maybe even "less-readable". */
FWIW, journald's file format is one case where upon reading the spec I said it needs to be taken behind the barn and shot as a mercy.
And no, it had nothing to do with being binary, and everything to do with badly mixing "durable system log" and "quick retrieval" aspects in wrong order.
Given that the author mentions CSV and text table formats, the article's list of the "entire Unix toolchain" is significantly impoverished not only by the lack of ex (which is usefully scriptable) but by the lack of mlr.
* https://miller.readthedocs.io/
vis/unvis are fairly important tools for those text tables, too.
Also, FediVerse discussion: https://social.pollux.casa/@adele/statuses/01K1VA9NQSST4KDZP...
Wow, I've never heard of 'mlr' before. Looks like a synthesis of Unix tools, jq, and others? Very useful - hopefully it's packaged everywhere for easy access.
We learned the hard way: for some of us it's all too easy to make careless design errors that become baked in and can't be fixed in a backward-compatible way (either at the DSL or API level). An example in Graphviz is its handling of backslash in string literals: to escape special characters (like quotes \"), to map special characters (like several flavors of newline with optional justification \n \l \r) and to indicate variables (like node names in labels \N), along with magic code that knows that if the -default- node name is the empty string that actually means \N, but if a particular node name is the empty string, then it stays.
There was a published study, Wrangling Messy CSV Files by Detecting Row and Type Patterns by Gerrit J. J. van den Burg, Alfredo Nazábal, and Charles Sutton (Data Mining and Knowledge Discovery, 2019), that showed many pitfalls with parsing CSV files found on GitHub. Their detection approach achieved 97% accuracy. It's easy to write code that slings out some text fields separated by commas with the objective of using a human-readable portable format; parsing it all back reliably is the hard part.
You can learn even more by allowing autofuzz to test your nice simple code to parse human readable files.
I'd love to see us go even further! Having file formats at all is less human readable than going 9p style, less human (and script) readable than a directory tree of simple values.
The OS has a built-in way to create hierarchical structured data. I'd love to see the boldness to try using that!
There was a good example of a json->directory tool submitted yesterday, json2dir. https://github.com/alurm/json2dir https://news.ycombinator.com/item?id=44840307
This does mean you'd need to tar/zip up your file tree to send it around. It's less clear that this directory is a thing, is meant to be a file-like single entity. But by taking our data out of complex file formats and turning it into a filesystem hierarchy, it removes the arbitrariness of choosing any encoding at all, and it directly opens up all data to scripting. You can echo a new hex color into a foreground file to change it. You can watch a user directory for changes to update a rendering of it. You can btrfs-snapshot your window layout to save your desktop configuration.
We have so so many specific tools for computing. It's time that we try making some more general systems, that let us work broadly / regardless of specific application! "Just files" could be a key enabler!
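A toy sketch of what reading "just files" could look like for the config case (my own illustration, nothing standard): every leaf file is a value, every directory is a nested map.

    from pathlib import Path

    def read_tree(root):
        """Turn a directory of simple value-files into a nested dict."""
        result = {}
        for child in Path(root).iterdir():
            if child.is_dir():
                result[child.name] = read_tree(child)
            else:
                result[child.name] = child.read_text().strip()
        return result

    # echo '#ff8800' > theme/foreground   <- scripting the "format" is just file I/O
    # config = read_tree("theme")
    # print(config["foreground"])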
Readable files are great… until they’re 1TB and you just want to cry.
1TB of perfectly readable, human despair.
single, 1 terabyte long, line.
To be fair, nothing's great when I want to cry.
Let’s say that hypothetically one were to disagree with this. What would be the best alternative format? One that has ample tooling for editing and diffing, as though it were text, yet stores things more efficiently.
Most of the arguments presented in TFA are about openness, which can still be achieved with standard binary formats and a schema. Hence the problem left to solve is accessibility.
I’m thinking something like Parquet, protobuf, or SQLite. Despite their popularity, they still aren’t trivial for anyone to edit.
Protobuf has a text and binary format. https://protobuf.dev/reference/protobuf/textformat-spec/
Google uses it a lot for data dumps for tests or config that can be put into source control.
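For anyone who hasn't seen it, the text format is pleasantly diff-friendly; a hypothetical Person message might be dumped like this:

    # person.textproto: text-format rendering of a hypothetical Person message
    name: "Ada Lovelace"
    id: 42
    email: "ada@example.com"
    phones {
      number: "+44 20 7946 0000"
      type: HOME
    }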
I suppose with SQLite files, you could at least in theory diff their SQL-dump representations, though you'd presumably want a way to canonicalise said representation. In a way I suppose each (VCS) commit is a bit like a database migration.
ZIP archive of XML is used for Office documents
Clearly there’s a very real need for binary data formats, or we wouldn’t have them. For one, it’s much more space efficient. Does the author know how much storage cost in 1985? Or how slow computers were?
If I time traveled back to 1985 and told corporate to adopt CSV because it’d be useful in 50 years when unearthing old customer records I’d be laughed out of the cigar lounge.
Except there are many things for which we used human readable formats in the 1980s for which we use binary formats now - HTTP headers, for example.
CSV was definitely in wide use back then.
Text formats are compressible.
Text formats are compressible because they waste a lot of space to encode data. Instead of the space of 256 values per byte they use maybe 100.
I assumed that is common knowledge here. The point is that you need to take that into account when discussing storage requirements.
So I guess it's also common knowledge that a compressed stream needs to be uncompressed in order to use it. Arguing that a compressed stream takes up less space on a floppy disk doesn't mean it will take the same number of bytes in memory.
I'd also question whether HTTP/1 can be treated as a pure text format, since it requires \r and \n as EOL markers, even on systems that only use \n. Strict binary requirements like this shouldn't be needed if it were a text protocol.
It's often too late to overhaul your systems when performance becomes a severe limiting factor. By that point things like data format are already set in stone. The whole "premature optimization" quote was originally about peephole stuff, not architecture-defining concerns, and it's really sad to see it misapplied to "let's store everything as JSON and use O(n²) everywhere and hopefully it will be someone else's problem".
Human-readability was one of the aspects that I enjoyed about using CCL, the Categorical Configuration Language (https://chshersh.com/blog/2025-01-06-the-most-elegant-config...), in one of my projects recently.
It saves you from escaping stuff inside of multiline-strings by using meaningful whitespace.
What I did not like so much about CCL is that it leaves a bunch of stuff underspecified. You can make lists and comments with it, but YOU have to decide how.
> Unlike binary formats or database dumps, these files don't hide their meaning behind layers of abstraction. They're built for clarity, for resilience, and for people who like to know what's going on under the hood.
CSV files hide their meaning in external documentation or someone’s head, are extremely unclear in many cases (is this a number or a string? A date?), and are extremely fragile when it comes to people editing them in text editors. They entirely lack checks and verification at the most basic level, and worse still they’re often, but not always, perfectly line-based. Many tools then work fine until they completely break your file and you won’t even know. Until I get the file and tell you, I guess.
I’ve spent years fixing issues introduced by people editing them like they’re text.
If you’ve got to use tools to not completely bugger them then you might as well use a good format.
If you're reading in data, you need to parse and verify it anyway.
Which you might not be able to do after it’s been broken silently.
That's still an issue with binary files too, and you can't even look at them to fix.
Binary files are not usually edited by hand, and should not get broken without a broken parser or writer. There can be many more ways of fixing them too because it may not be as ambiguous as broken sv files.
They're standardized[0], so it's only stupid humans screwing them up.
Maybe you need a database or an app rather than flat files.
0. https://www.ietf.org/rfc/rfc4180.txt
That came far after CSV files started being used, and many parsers don’t follow the spec. Even if they do, editing the file manually can easily and silently break it - my criticisms are entirely valid for files that follow the new spec. The wide range of ways people make CSVs is a whole other thing I’ve spent years fixing.
It’s not about the stupidity of the humans, and if it was then planning for “no stupid people” is even stupider than those messing up the files.
> Maybe you need a database or an app rather than flat files.
Flat files are great. What’s needed are good file formats.
json makes for a great flat file format these days, with jq around to munge the data in it. csv is pretty bad for errors. mostly use it to dump data when I need to pass it to someone that will want to shove it in excel.
TOML
What's the problem?
What are you trying to ask? I don’t understand. I’m not talking about toml.
I gave you a good text file format. You're acting like there are no good file formats. Either invent a domain-specific one, use a standard one, or use a different modality rather than complain that a utopia you won't bother to create doesn't exist.
Csv files are bad for many reasons, some of which are listed as positives in the article. I’m not talking about other formats.
But TOML is not a good file format. Quite the opposite actually.
https://hitchdev.com/strictyaml/why-not/toml/
I found that post unconvincing.
> It's very verbose.
This is his example: https://github.com/crdoconnor/strictyaml/blob/master/hitch/s...
I think you shouldn't use yaml or toml for this.
> TOML's hierarchies are difficult to infer from syntax alone
True! The point of TOML is to flatten the hierarchical structures. I would argue your configuration files shouldn't have much nesting anyway.
> Overcomplication: Like YAML, TOML has too many features
Basically TOML has a date type and all associated problems and advantages. I think it's a reasonable thing to include.
> Syntax typing
I think this is a good thing. I want to know whether something is a string or a number.
> They're standardized[0]
From that article:
“This memo […] does not specify an Internet standard of any kind”
and
“Interoperability considerations:
Due to lack of a single specification, there are considerable differences among implementations. Implementors should "be conservative in what you do, be liberal in what you accept from others" (RFC 793 [8]) when processing CSV files”
Are you AI? I was replying to a comment, not the article.
Also, you're quoting me to myself: https://news.ycombinator.com/item?id=44837879
Me too.
Especially in ASCIIDOC/HTML/TXT/PDF files - for PDF files I also keep a copy in ODF/ODT for editing where possible.
Then I can search for anything needed with grep(1) or pdfgrep(1) commands.
It’s not about human-readable, it’s about a standard format & available tooling to read it. Be it .txt or json or yaml: standard format & tooling. Digital content isn’t really human-readable anyway without some digital interface.
Are there any binary formats that include the specification in the format itself?
Don't most binary formats have to have some specification somewhere (either private or public)?
Unless someone just decided to shove random stuff in binary mode and call it a day?
https://en.wikipedia.org/wiki/ASN.1
ASN.1 is not quite a "binary format that includes the specification in the format itself". ASN.1 has multiple encodings, such as DER, which is a binary format that includes the structure and type specifications (except implicit types) with the data. I think DER is better than many other formats, and I use it for some of my own programs. This does not necessarily mean that you will know what the data is used for (but CSV, JSON, etc won't tell you that either), but the structure and some of the data types can be decoded.
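To give an idea of what that means, here is a rough sketch (in Python, not my actual implementation, which is in C) of walking DER's tag-length-value structure. It handles only single-byte tags and definite lengths, so it is an illustration rather than a full ASN.1 decoder:

    def walk(data, depth=0):
        i = 0
        while i < len(data):
            tag = data[i]
            length = data[i + 1]
            i += 2
            if length & 0x80:                  # long form: next N bytes hold the length
                n = length & 0x7F
                length = int.from_bytes(data[i:i + n], "big")
                i += n
            value = data[i:i + length]
            print("  " * depth + f"tag 0x{tag:02x}, {length} bytes")
            if tag & 0x20:                     # constructed (SEQUENCE, SET, ...): recurse
                walk(value, depth + 1)
            i += length

    # SEQUENCE { INTEGER 1, OCTET STRING "hello" }
    walk(bytes.fromhex("300a020101040568656c6c6f"))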
I'll take sexprs over CSV/JSON/YAML/XML any day.
Do you have the Gemini:// URL? I’m getting a URL resolution error.
gemini://adele.pollux.casa/gemlog/2025-08-04_why_I_prefer_human-readble_file_formats.gmi
Let's hear it for RTF for documents.
You want JARs to be human-readable? PNGs? MP3s?
I think the author is thinking about a very narrow set of files.
I'm not sure the author knows much about binary formats.
Binary formats are binary for a reason. Speed of interpretation is one reason. Memory usage is another. Being able to map the file directly into memory and use it as-is is another. Binary formats can make assumptions about the system's memory page size. They can store internal offsets to make incremental reading faster. None of this is offered by text formats.
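A toy example of the offset/mapping point, in Python, with an invented little-endian layout (magic, record count, offset of a name table); with mmap only the pages you actually touch get read:

    import mmap, struct

    HEADER = struct.Struct("<4sIQ")            # magic, record_count, names_offset

    def write_example(path):
        with open(path, "wb") as f:
            f.write(HEADER.pack(b"DEMO", 2, HEADER.size))
            f.write(b"alpha\x00beta\x00")      # the "name table", right after the header

    def read_names(path):
        with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            magic, count, offset = HEADER.unpack_from(m, 0)
            assert magic == b"DEMO"
            return m[offset:].split(b"\x00")[:count]

    write_example("demo.bin")
    print(read_names("demo.bin"))              # [b'alpha', b'beta']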
Also, the claim that text formats can always be modified is wrong. Nothing can be changed once we introduce checksums inside a text format, and if we digitally sign it, nothing can be changed either, despite the fact that it's a text format.
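For example, a text file whose last line carries a checksum of everything above it is no longer hand-editable in any meaningful sense. The layout here is made up, but the point holds for any real integrity scheme:

    import hashlib

    # Made-up layout: the file body, then one line "# sha256: <hex of the body>".
    def stamp(path):
        with open(path, "rb") as f:
            body = f.read()
        with open(path, "ab") as f:
            f.write(b"# sha256: " + hashlib.sha256(body).hexdigest().encode() + b"\n")

    # Any hand edit that does not recompute the hash makes this return False.
    def verify(path):
        with open(path, "rb") as f:
            *body, last = f.read().splitlines()        # assumes plain \n line endings
        expected = last.removeprefix(b"# sha256: ")    # bytes.removeprefix needs Python 3.9+
        actual = hashlib.sha256(b"\n".join(body) + b"\n").hexdigest().encode()
        return expected == actual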
Also, comparing CSV files to an internal database binary format? That's like comparing a book cover to the ERP system of a library. Meaning, it's comparing two completely different things.
I used the URL gemini://adele.pollux.casa/gemlog/2025-08-04_why_I_prefer_human-readble_file_formats.gmi (the one linked to directly does not work on my computer).
I prefer binary file formats (including DER) for many things, and I will respond to the individual parts as well as my own comments.
> With human-readable formats, you're never locked out of your own data. Whether you're on a fresh Linux installation, a locked-down corporate machine, or troubleshooting a system with minimal tools, you can always inspect your configuration files, data exports, or documentation with nothing more than `cat`, `less`, or any basic text editor.
This is helpful especially for documentation.
However, not all of the data formats are going to be that easy to inspect in this way, and the meaning is not always clear even if it is a text format. Additionally, even if it is clear when reading, that does not necessarily mean that it is convenient to modify.
Furthermore, the use of text formats means that escaping may be needed, which can complicate decoding and encoding, and can also lead to "leaning toothpick syndrome".
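A quick illustration of the toothpicks: a Windows path or a regex grows an extra layer of backslashes every time it is re-encoded as a JSON string (Python here, but any JSON encoder behaves the same):

    import json

    pattern = r"C:\new\table\d+"
    once = json.dumps(pattern)
    twice = json.dumps(once)       # e.g. the value ends up embedded in another JSON document
    print(once)                    # "C:\\new\\table\\d+"
    print(twice)                   # "\"C:\\\\new\\\\table\\\\d+\""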
> A JSON configuration file works the same way whether you're viewing it in VS Code, vim, or even a web browser. This universality means fewer barriers to collaboration and fewer "it works on my machine" moments.
Although it can be displayed the same way (especially if it is purely ASCII), it can be difficult to read if packed together and inefficient if formatted nicely, and there is the issue with escaping that I mentioned above. When editing JSON, there is also the fact that JSON does not have comments and does not allow trailing commas, which can make it inconvenient to modify.
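Both of those editing annoyances are easy to check; strict JSON parsers reject them outright (shown with Python's json module, but any conforming parser does the same):

    import json

    for text in ('{"a": 1, }',                              # trailing comma
                 '{"a": 1}  // port for the dev server'):   # comment
        try:
            json.loads(text)
        except json.JSONDecodeError as e:
            print(f"{text!r}: {e}")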
JSON, CSV, etc have their own limitations though, which can be problems for some uses (e.g. storing binary data together with text, storing non-Unicode text data, and others). Sometimes this will then extend to the formats made using them, because they had not considered that.
> Digital archaeology is real, and proprietary formats are its enemy. How many documents from the 1990s are now trapped in obsolete file formats?
It is not quite that simple. I have sometimes found them easier to figure out than some modern text-based formats.
> Ok, sometimes, there are some character encoding conversions needed (you see, CP-1252, EBCDIC, IBM-850, ISO8859-15, UTF-8), but, these operations are easy nowadays.
That is also often done badly. There are good ways to handle character encoding, including ways that do not involve conversion, but they are not as commonly supported by some modern programs.
> Need to bulk-update settings? Write a simple script or use standard text processing tools.
This is not always good, depending on the format and on other things. (SQL might work better for many kinds of bulk updates.)
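As a sketch of that parenthetical, with a made-up settings table (sqlite3 ships with Python):

    import sqlite3

    con = sqlite3.connect("settings.db")
    con.execute("CREATE TABLE IF NOT EXISTS settings (key TEXT PRIMARY KEY, value TEXT)")
    # One statement updates every matching setting, with no text munging or escaping involved.
    con.execute("UPDATE settings SET value = 'debug' WHERE key LIKE 'log.%'")
    con.commit()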
> You don't need expensive software licenses, proprietary APIs, or vendor-specific tools to work with your data.
You don't need those things for many binary formats either. Sometimes you might, but if it is designed well then you shouldn't need it.
> The entire Unix toolchain, `grep`, `sed`, `awk`, `sort`, `cut`, becomes your toolkit. Want to extract all email addresses from a CSV? `grep` has you covered.
For some simple formats, especially TSV, it might work, but not all text-based formats are like that.
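The usual stumbling block is quoting: a quoted CSV field may contain the delimiter itself, so cut/grep-style splitting miscounts the columns (illustrated with Python's csv module; the data is made up):

    import csv, io

    line = 'alice,"Doe, Jane",jane@example.com'
    print(line.split(","))                       # 4 pieces: the quoted field is torn apart
    print(next(csv.reader(io.StringIO(line))))   # 3 fields, as intended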
> JSON, XML, CSV, these formats have multiple independent implementations, comprehensive specifications, and broad community support.
Yes, but so do many binary formats, such as DER (I wrote my own implementation in C).
> Version control systems like Git are optimized for text, and human-readable formats take full advantage of this optimization. Line-by-line diffs become meaningful, showing exactly what changed between versions. Merge conflicts are resolvable because you can actually read and understand the conflicting content.
It is true that such version control systems are made to work with text formats, but that does not mean it has to be that way. It also does not mean that they handle the details of all possible text-based formats.
For CSV it might work, but other formats, whether text or binary, cannot always have their merge conflicts resolved so easily, or their differences identified in a meaningful way, due to various things: the way blocks work in a format, indentation-oriented syntax, the lack of trailing commas in JSON (see the small example below), etc. Automatic merging does not necessarily produce the correct result either, and must be corrected manually, regardless of whether it is a text format or a binary format.
When showing what changed, you might also want to know which section it is in, for a text-based format that works that way. And when diffs and merges are made for specific text formats, they could just as well be made for binary formats.
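Small example of the trailing-comma point: appending one element to a pretty-printed JSON array also rewrites the previous line, because that line now needs a comma, so a one-item change produces a two-line diff:

    import difflib, json

    before = json.dumps({"items": ["a", "b"]}, indent=2)
    after = json.dumps({"items": ["a", "b", "c"]}, indent=2)
    print("\n".join(difflib.unified_diff(before.splitlines(),
                                         after.splitlines(), lineterm="")))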
> Text-based formats are often surprisingly compact, especially when compressed. They parse quickly, require minimal memory overhead, and can be streamed and processed incrementally. A well-structured JSON file can be more efficient than a complex binary format with similar information density.
Although compression can help, it adds another layer of complexity to the format. Parsing can require handling escaping, and whether streaming works depends on the specific use. A binary format does not have to be complex, and it can store binary data directly.
> Command-line processors like `jq` for JSON or standard Unix tools can handle massive files with minimal resource consumption.
Sometimes that applies, although such programs can be made for other formats as well. I have some ideas for how to make something similar work for the DER format; for some formats, such tools already exist.
> These formats also represent a philosophy: that technology should serve human understanding rather than obscure it.
Using JSON or XML will not solve that. What helps is to have better documentation.
Nice captcha. I made almost the same thing, except mine looks like an actual captcha box and sends a POST request upon clicking the fake checkbox. This works very well (100% of the time) against automated scrapers. I don't think anything more is needed.