
Reverse Engineering 僕の夏休み for Fun and Japanese Learning
Intro
I’m now a year into learning Japanese, and I’m starting to consume Japanese media. I’m certainly no expert in Japanese yet, but I know enough to follow along with most media as long as it’s not too advanced. One tool I’ve used to help me do this is the Yomitan browser plugin. If I come across a word I don’t know, I can use Yomitan to look it up, hear the pronunciation, and even add it to my Anki deck if I want to make sure I don’t forget it. This works great with any web content, even YouTube video subtitles. Unfortunately, it doesn’t help with video games, unless they have a game script available online. You can find Japanese game scripts for a handful of popular video games online, then follow along with the game and look up words when you need to.
Playing Japanese games is a huge driving force behind my learning of Japanese, and one Japanese-exclusive game I’ve wanted to play for a very long time is 僕の夏休み (Boku no Natsu Yasumi).

This game does not have a script online, however, and while I found my Japanese was good enough to understand large chunks of the game’s content, when I came across a word I didn’t know there was no easy way to look it up and add it to my studies. A game script would sure make that much easier…so I decided to extract the script from the game.
Finding the Game Text
The first and most obvious thing to do when trying to figure out a game’s content is to check the contents of the disc. The game disc has a __STR folder. Any time you see an STR folder on a PS1 game, it will almost always contain video files, and examining the folder contents shows this game is no exception; nothing here is of use to us for game text. Beyond this, there is only the main executable “SCPS_100.88” and “BOKU.BIN”. At 104 MB, BOKU.BIN is clearly the game’s content, but the smaller executable file could have useful info as well.
Searching Memory
With no obvious leads for locating text here, I jumped to searching the game’s memory while it’s running. The advantages to starting here versus the disc data are twofold. First, this narrows down the amount of data to 2 MB (the PS1’s RAM size) from the 104 MB of the entire game data. Second, it lets us watch for changes in real time as the game runs.
My default for this is RALibRetro, a version of RetroArch made for creating achievements for retroachievements.org. It’s got great tools for filtering down memory values while the game is running using different conditions. In this case, I let the game run without advancing the dialog and filtered down to memory values that stayed the same during this process. I then advanced the dialog and filtered down to only the values that changed. Repeat this a couple of times and I found the pointer to the dialog lines in memory. The pointer only contained the final four hex digits (the low 16 bits) of the address, but this was enough to find the dialog quickly by jumping 0x10000 at a time and editing values.
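That filtering process can be sketched as two complementary passes over candidate addresses. This is a toy model with tiny byte arrays standing in for 2 MB RAM snapshots, not RALibRetro's actual code:

```typescript
// Toy model of RetroAchievements-style memory filtering.
// Pass 1: keep addresses whose value did NOT change (dialog left idle).
// Pass 2: keep addresses whose value DID change (dialog advanced).
function filterUnchanged(candidates: number[], before: Uint8Array, after: Uint8Array): number[] {
  return candidates.filter(addr => before[addr] === after[addr]);
}

function filterChanged(candidates: number[], before: Uint8Array, after: Uint8Array): number[] {
  return candidates.filter(addr => before[addr] !== after[addr]);
}

// Example: start with every address as a candidate.
const idleA = Uint8Array.from([10, 20, 30, 40]);
const idleB = Uint8Array.from([10, 99, 30, 40]); // address 1 is noise
let candidates: number[] = [...idleA.keys()];
candidates = filterUnchanged(candidates, idleA, idleB); // drops address 1

const advanced = Uint8Array.from([10, 99, 31, 40]); // address 2 moved with the dialog
candidates = filterChanged(candidates, idleB, advanced); // only address 2 survives
```

Each pass shrinks the candidate set, and after a few rounds only the addresses tied to the dialog remain.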

We’ve found text! We’ve only found it in RAM, which doesn’t tell us where it lives on the disc, but it might tell us how it lives on the disc. Meaning, the values here are likely the same on disc, provided the game’s content isn’t compressed. Most games don’t compress their data, since 600+ MB of disc space is usually plenty, but some do, not just to save disc space but to decrease loading times.
Character Encoding
Before we move to checking the disc data, what character encoding is this? Looking at the first few characters, I thought this might be Shift-JIS, the same encoding used by the PS1 itself for memory card save names (I used this for my PS1 Memory Card Viewer for the Memcard Pro). However, checking the kanji, this was clearly not the case; some sort of custom character encoding is in use. While this can certainly be figured out by inputting values into the memory editor and seeing which character appears, with potentially 2,000+ characters, that is quite time consuming.
A number of years ago I created a tool to extract art from PS1 games. This came in handy for a few things while extracting the text, the first of which was finding the font image used in the game:

The font file is quite cleverly made: a single image that uses only 4 bits per pixel. This means each pixel is a lookup into a 16-color palette. Here, eight different palettes are used, four black and four white. Black versus white simply changes the color of the text, but the four palettes within each color actually change which characters are drawn from the image.
With all four palettes applied in order and the results laid out side by side above, you can see the font in order: start at the top left and read across all four palettes left to right, then move to the second row and continue. This is easy to see if you look at the English characters. Not only do we now have the character encoding, we can send these images into OCR and get the text back. With that, we can decode the game’s text. Now, let’s figure out the file system.
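For reference, decoding a 4-bits-per-pixel image is just a nibble-by-nibble palette lookup. Here is a minimal sketch; the low-nibble-first ordering and the toy grayscale palette are assumptions based on the common PS1 TIM layout, not details pulled from this game:

```typescript
// Decode 4bpp pixel data: each byte holds two pixels, low nibble first
// (the usual PS1 TIM ordering), and each nibble indexes a 16-entry palette.
function decode4bpp(data: Uint8Array, palette: number[]): number[] {
  const pixels: number[] = [];
  for (const byte of data) {
    pixels.push(palette[byte & 0x0f]);        // low nibble = first pixel
    pixels.push(palette[(byte >> 4) & 0x0f]); // high nibble = second pixel
  }
  return pixels;
}

// With 8 palettes, the same pixel data yields 8 different images:
// the nibble values stay fixed, only the colors they index change.
const palette = Array.from({ length: 16 }, (_, i) => i * 0x111111); // toy grayscale CLUT
const twoBytes = Uint8Array.from([0x21, 0x0f]); // nibble indices: 1, 2, 15, 0
const decoded = decode4bpp(twoBytes, palette);
```

Swapping in a different 16-entry palette array is all it takes to reveal a different set of characters from the same pixel data.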
The File System
We now have the byte values of text from the game’s memory. Can we find these values in the BOKU.BIN file?

Yes. This is all the text from the dialog at the beginning of the game. Finding this, and having found all the images on disc, we know the game data is not compressed. We now need to extract all of the text, much in the same way my PSX image parser could extract all the images. With images, we have the advantage of knowing the header format of PS1 image files, and so can search binary data for them. To do the same for Boku’s text, we need to figure out its internal file system.
Gaining Context
Finding the text was good, but the text alone was simply not enough to make sense of the game’s internal file system. At this point, I went down a couple of paths, including stepping through game loading in pcsx-redux’s debugger, parsing through the contents of the game’s main executable SCPS_100.88, and checking what looked to be file/directory entry headers against other common patterns.
While I found pieces of information here, ultimately there wasn’t enough context to understand what was going on. I went back to my image parser and modified the code slightly to log out ImHex pattern code for each image, declaring it as an array of bytes with the image’s offset and size. This let me map out large chunks of the file system, and immediately it made sense.

I now knew the sizes and offsets of full files, and with this it becomes readily apparent how the file system works. First comes the number of files in the “directory”, then the offset to the first file’s data, then the length of the first file. After this, the pattern repeats: for each file, its offset and length are listed.
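Under that layout, walking a directory takes only a few reads. This sketch assumes 32-bit little-endian fields and offset-then-length ordering for every entry; those field widths are my guesses rather than confirmed details:

```typescript
// Parse a directory table laid out as:
//   [file count][offset of file 0][length of file 0][offset of file 1][length of file 1]...
// All fields assumed to be 32-bit little-endian.
interface FileEntry { offset: number; length: number; }

function parseDirectory(view: DataView, base: number): FileEntry[] {
  const count = view.getUint32(base, true);
  const entries: FileEntry[] = [];
  for (let i = 0; i < count; i++) {
    const offset = view.getUint32(base + 4 + i * 8, true);
    const length = view.getUint32(base + 8 + i * 8, true);
    entries.push({ offset, length });
  }
  return entries;
}

// Example: a hand-built two-file directory.
const buf = new ArrayBuffer(4 + 2 * 8);
const dv = new DataView(buf);
dv.setUint32(0, 2, true);      // file count
dv.setUint32(4, 0x100, true);  // file 0 offset
dv.setUint32(8, 0x40, true);   // file 0 length
dv.setUint32(12, 0x140, true); // file 1 offset
dv.setUint32(16, 0x20, true);  // file 1 length
const entries = parseDirectory(dv, 0);
```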
With this information, I isolated a folder structure that had dialog in it, and found the exact same pattern in the dialog files themselves: offsets and lengths that pointed to the dialog. However, there were also offsets and lengths that pointed to other mystery data. As this wasn’t relevant to extracting the script, I haven’t looked into it further, but it could be any number of things: likely code that checks flags to determine which dialog to play, and/or code to load the dialog and play it. In either case, all I needed to do now was isolate the dialog.
Extracting the Dialog
The last challenge was isolating just the text. All dialog lines are surrounded by other event- or dialog-related data. Again I decided to look for patterns rather than trying to step through thousands of lines of assembly, and two patterns stood out that let me isolate only the text.
Before each dialog text entry, there was always another mysterious 12-byte entry. I don’t know what it does (again, likely a trigger to play audio?), but its functionality didn’t matter for my purposes, only that it was a marker for dialog. This alone almost worked:

There were extra characters appended to the end, and occasionally junk data was sneaking in. The final cleanup was to make use of the “End of Line” and “End of Dialog” markers (0x8001 and 0x8002 respectively). 0x8002 was always followed by a 4-byte section of data; once I skipped over it whenever I encountered 0x8002, the incorrect end characters (marked in my code as (?)) went away. Finally, I used the presence of the 0x8002 marker at the end of the dialog as a flag that the dialog was valid, which removed the junk data.
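Those cleanup rules can be sketched as a small loop over 16-bit values. The marker values come from the text above; everything else (the byte order, treating non-marker values as character codes, omitting the 12-byte header handling) is an assumed simplification:

```typescript
// Walk a dialog chunk as 16-bit values. 0x8001 ends a line; 0x8002 ends
// the dialog and is followed by 4 unrelated bytes we can simply stop before.
// A chunk only counts as valid dialog if 0x8002 was actually seen.
const END_OF_LINE = 0x8001;
const END_OF_DIALOG = 0x8002;

function extractDialog(view: DataView): number[][] | null {
  const lines: number[][] = [];
  let current: number[] = [];
  let offset = 0;
  while (offset + 1 < view.byteLength) {
    const value = view.getUint16(offset, false); // big-endian is an assumption
    offset += 2;
    if (value === END_OF_LINE) {
      lines.push(current);
      current = [];
    } else if (value === END_OF_DIALOG) {
      if (current.length > 0) lines.push(current);
      return lines; // stop here, skipping the trailing 4 bytes
    } else {
      current.push(value); // a character code in the game's custom encoding
    }
  }
  return null; // no end marker seen: junk data, discard
}

// Example chunk: "A" <EOL> "B" <EOD> plus 4 trailing bytes.
const bytes = Uint8Array.from([0x00, 0x41, 0x80, 0x01, 0x00, 0x42, 0x80, 0x02, 0x01, 0x02, 0x03, 0x04]);
const dialog = extractDialog(new DataView(bytes.buffer));
```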
Ultimately the final parsing didn’t involve much code at all. With some helper classes that checked that a chunk of data followed the rules of the sector or scene, I could pull out all the text data like this:
function getBokuText(data: DataView) {
  // Scan the file for sector headers, stepping 2 bytes at a time between hits.
  const sectors: BokuFileSector[] = [];
  let runningOffset = 0;
  while (runningOffset < data.byteLength) {
    const isSector = BokuFileSector.CheckIfDataIsFileSector(runningOffset, data);
    if (isSector) {
      const sector = new BokuFileSector(runningOffset, data);
      sectors.push(sector);
      runningOffset += sector.totalByteLength;
    } else {
      runningOffset += 2;
    }
  }

  // Keep only the scene files that actually contain dialog text.
  const withTextEvents: BokuSceneFile[] = [];
  sectors.forEach(sector => {
    sector.fileInfos.forEach(sectorFile => {
      const isEventFile = BokuSceneFile.CheckIfFileIsDialog(sectorFile.dataView);
      if (isEventFile) {
        const eventFile = new BokuSceneFile(sectorFile.dataView);
        if (eventFile.scenes.some(scene => scene.dialogs.length > 0)) {
          withTextEvents.push(eventFile);
        }
      }
    });
  });

  return withTextEvents;
}

Extra Features
Finally, I decided to add a few extra features to make studying easier. First, I cross-referenced the kanji in the game’s font file with the actual script; just because the font file has 1,200+ kanji in it doesn’t mean all of them are used in this game. Sure enough, doing that gave back only 844 kanji. I broke these down by JLPT level to make it easier to pull out the ones I don’t know yet.
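The cross-reference itself is just a set intersection between the font sheet’s kanji and the characters that occur in the extracted script. A minimal sketch with made-up sample data:

```typescript
// Keep only the kanji from the font sheet that actually appear in the script.
function usedKanji(fontKanji: string[], script: string): string[] {
  const scriptChars = new Set(script); // iterating a string yields its characters
  return fontKanji.filter(k => scriptChars.has(k));
}

// Toy example: 山 is in the font sheet but never appears in this script.
const font = ['夏', '休', '山', '海'];
const script = '夏休みは海へ行った';
const used = usedKanji(font, script);
```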
Finally, I sent the entire script through a Japanese tokenizer to extract individual vocab words. Unlike English, where you can simply split text on spaces, Japanese requires specialized software to break text down into individual words. The tokenizer also gave base words back, so a single verb that had been conjugated in different ways would return the same base word. I used this to filter down to only unique base words and came up with a vocab list of 1,654 words.
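The deduplication step looks roughly like this, assuming the tokenizer returns each token with its dictionary form. The `Token` shape and sample data here are hypothetical; real tokenizers such as MeCab or kuromoji expose an equivalent base-form field:

```typescript
// Each token carries the surface form as it appeared in the text plus the
// dictionary (base) form, so different conjugations of one verb collapse
// into a single vocab entry.
interface Token { surface: string; baseForm: string; }

function uniqueBaseWords(tokens: Token[]): string[] {
  return [...new Set(tokens.map(t => t.baseForm))];
}

// Hypothetical tokenizer output: two conjugations of 食べる plus one noun.
const tokens: Token[] = [
  { surface: '食べた', baseForm: '食べる' },
  { surface: '食べて', baseForm: '食べる' },
  { surface: '夏', baseForm: '夏' },
];
const vocab = uniqueBaseWords(tokens);
```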
Wrap Up
I spent about a week working on this in my spare time, and I think the results are pretty cool. I got to learn about the inner workings of a great game, and also created study material that helps me learn Japanese. Plus, in the end, I get to use that knowledge to play a great game. While this was just a short, fun personal project, I think it might be cool to do some more things with it, like: break down vocab by JLPT level, extract the game audio for accompanying listening practice, and potentially even expand this into a full app with other games, TV shows, and movies, and full flash card practice.
Thanks for reading, please check out the tool here: https://roblouie.com/boku/
You can view the vocab and kanji right on the page, but to see the full game script you must provide your own BOKU.BIN file from the game disc.