Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sentencepiece preset #1384

Merged
merged 13 commits into from
Jul 25, 2023
Merged

Add sentencepiece preset #1384

merged 13 commits into from
Jul 25, 2023

Conversation

sbrunk
Copy link
Contributor

@sbrunk sbrunk commented Jul 8, 2023

I'm trying to add a preset for the sentencepiece library, an

unsupervised text tokenizer for Neural Network-based text generation

frequently used in current large language model architectures.

C++ API examples

Current state

The native library builds, and the parser runs sucessfully, but the generated JNI code still fails to compile. Grateful for any feedback and apologies if I'm missing something obvious here, my C++ knowledge is still rather limited.

I think it could be related to the use of string_view, which I've just tried to map to a std:string, but perhaps needs to treated diffently.

infoMap.put(new Info("string_view", "absl::string_view").pointerTypes("@StdString String"))

Here's one example where it fails:

jnisentencepiece.cpp:2560:53: error: no matching member function for call to 'Decode'
        rptr = new sentencepiece::util::Status(ptr->Decode(*(const std::vector<int>*)ptr0, (std::basic_string< char >&)adapter1));
                                               ~~~~~^~~~~~

sentencepiece_processor.h:310:24: note: candidate function not viable: no known conversion from 'std::basic_string<char>' to 'std::string *' (aka 'basic_string<char, char_traits<char>, allocator<char>> *') for 2nd argument; take the address of the argument with &
  virtual util::Status Decode(const std::vector<int> &ids,
                       ^
sentencepiece_processor.h:420:24: note: candidate function not viable: no known conversion from 'std::basic_string<char>' to 'sentencepiece::SentencePieceText *' for 2nd argument
  virtual util::Status Decode(const std::vector<int> &ids,
                       ^
sentencepiece_processor.h:302:24: note: candidate function not viable: no known conversion from 'const vector<int>' to 'const vector<std::string>' for 1st argument
  virtual util::Status Decode(const std::vector<std::string> &pieces,
                       ^
sentencepiece_processor.h:306:24: note: candidate function not viable: no known conversion from 'const vector<int>' to 'const vector<absl::string_view>' for 1st argument
  virtual util::Status Decode(const std::vector<absl::string_view> &pieces,
                       ^
sentencepiece_processor.h:414:24: note: candidate function not viable: no known conversion from 'const vector<int>' to 'const vector<std::string>' for 1st argument
  virtual util::Status Decode(const std::vector<std::string> &pieces,
                       ^
sentencepiece_processor.h:417:24: note: candidate function not viable: no known conversion from 'const vector<int>' to 'const vector<absl::string_view>' for 1st argument
  virtual util::Status Decode(const std::vector<absl::string_view> &pieces,

@saudet
Copy link
Member

saudet commented Jul 8, 2023

If you don't need those functions, we could just skip them for now.

@sbrunk sbrunk force-pushed the sentencepiece branch 2 times, most recently from 69fc311 to 8bdb65d Compare July 8, 2023 13:55
@sbrunk
Copy link
Contributor Author

sbrunk commented Jul 8, 2023

I've skipped all functions triggering the error for now, to get the build and an encoding example working. The build works now except for Windows, which keeps causing symlink and path issues so I'm inclined to also disable it for now.

The one important function triggering the issue is sentencepiece::SentencePieceProcessor::Decode, I'm hoping to get that working.

I was able to get encoding working via the example, well half of the time on my mac.

/**
* To try encoding you can download an existing model, i.e.
* wget https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.model
* mvn compile exec:java -e -Dexec.mainClass=SentencepieceExample -D exec.args="en.wiki.bpe.vs10000.model"
*/
public final class SentencepieceExample {
public static void main(String[] args) {
SentencePieceProcessor processor = new SentencePieceProcessor();
Status status = processor.Load(args[0]);
if (!status.ok()) {
throw new RuntimeException(status.ToString().getString());
}
IntVector ids = new IntVector();
processor.Encode("hello world!", ids);
for (int id : ids.get()) {
System.out.print(id + " ");
}
}
}

The other half it crashes even if I only instantiate the SentencePieceProcessor. From your experience, do you have any guess where to start looking first to find the cause?

java(89975,0x70000b369000) malloc: *** error for object 0x103122000: pointer being freed was not allocated

@saudet
Copy link
Member

saudet commented Jul 8, 2023

Please try to set the "org.bytedeco.javacpp.nopointergc" system property to "true".

@sbrunk
Copy link
Contributor Author

sbrunk commented Jul 9, 2023

It still shows the same behaviour with nopointergc enabled.

@saudet
Copy link
Member

saudet commented Jul 9, 2023

You could try it on Linux, and if it works there, we'll know the problem is something related to Mac.

@sbrunk
Copy link
Contributor Author

sbrunk commented Jul 9, 2023

Thanks! Looks like some mac weirdness indeed as it's working fine on my Linux machine.

Perhaps it's even something weird on my local machine so think I should try the mac CI builds to see if it makes a difference. I think I've asked you this before, but right now it's not possible to download the artifacts from the CI build if they are not yet published to sonatype right?

@saudet
Copy link
Member

saudet commented Jul 9, 2023

Right, but the builds need to pass any to get anything published.

make: *** No targets specified and no makefile found. Stop.

GNU make doesn't Visual Studio, we typically use ninja, like this:
https://github.com/bytedeco/javacpp-presets/blob/master/tvm/cppbuild.sh#L176

@sbrunk
Copy link
Contributor Author

sbrunk commented Jul 9, 2023

I figured that I can use cmake itself for building by looking at the CI of upstream sentencepiece (they're building for windows too). The windows build also surfaced a missing linker config.

So with decoding skipped, it builds on all enabled platforms now.

@sbrunk sbrunk marked this pull request as ready for review July 9, 2023 16:19
@saudet
Copy link
Member

saudet commented Jul 9, 2023

Looking good! Next, as per commits 5b31213 and 56806a7 please don't use Docker in the workflows for Linux. It's not something I'm going to maintain.

@sbrunk
Copy link
Contributor Author

sbrunk commented Jul 10, 2023

I've removed Docker and configured cross-compiling for the arm64 build.

@saudet
Copy link
Member

saudet commented Jul 11, 2023

I've removed Docker and configured cross-compiling for the arm64 build.

I can see how it works for linux-arm64, but cross-compilation is missing for macosx-arm64.

@sbrunk
Copy link
Contributor Author

sbrunk commented Jul 11, 2023

I've now tried to enable cross-building for macosx-arm64 as well.

I'm not really sure if it works as intended though, because before it seems like it linked the arm64 compiled jni lib against the x86_64 sentencepiece lib without errors. Could be due to Apple's compat layer perhaps? I guess I could try to build it locally (I have an intel mac) with this new config and try to see if the object files are different.

@sbrunk
Copy link
Contributor Author

sbrunk commented Jul 11, 2023

Ok I built both variants locally on my intel mac with maven, and I think this is what we want:

❯ mvn package --projects sentencepiece -Djavacpp.platform=macosx-x86_64
❯ lipo -info sentencepiece/cppbuild/macosx-x86_64/lib/*.a
Non-fat file: sentencepiece/cppbuild/macosx-x86_64/lib/libsentencepiece.a is architecture: x86_64
Non-fat file: sentencepiece/cppbuild/macosx-x86_64/lib/libsentencepiece_train.a is architecture: x86_64
❯ lipo -info sentencepiece/target/native/org/bytedeco/sentencepiece/macosx-x86_64/libjnisentencepiece.dylib
Non-fat file: sentencepiece/target/native/org/bytedeco/sentencepiece/macosx-x86_64/libjnisentencepiece.dylib is architecture: x86_64

❯ mvn package --projects sentencepiece -Djavacpp.platform=macosx-arm64 
❯ lipo -info sentencepiece/cppbuild/macosx-arm64/lib/*.a
Non-fat file: sentencepiece/cppbuild/macosx-arm64/lib/libsentencepiece.a is architecture: arm64
Non-fat file: sentencepiece/cppbuild/macosx-arm64/lib/libsentencepiece_train.a is architecture: arm64
❯ lipo -info sentencepiece/target/native/org/bytedeco/sentencepiece/macosx-arm64/libjnisentencepiece.dylib
Non-fat file: sentencepiece/target/native/org/bytedeco/sentencepiece/macosx-arm64/libjnisentencepiece.dylib is architecture: arm64

@sbrunk
Copy link
Contributor Author

sbrunk commented Jul 12, 2023

I'm currently mapping strings like this to make the Decode method compile:

infoMap.put(new Info("std::string").annotations("@StdString").valueTypes("String").pointerTypes("@Cast({\"char*\", \"std::string*\"}) BytePointer"))

Now for decoding, in C++ I'd create a std:string and give the address to the Decode method:

std::vector<std::string> pieces = { "▁This", "▁is", "▁a", "", "te", "st", "." };   // sequence of pieces
std::string text;
processor.Decode(pieces, &text);
std::cout << text << std::endl;

I'm not sure how to create std::string text; here via JavaCPP. Do I need a different mapping to be able to call the right constructor?


The linux-arm64 build fails again but I think it is unrelated to the changes I made since I didn't touch that in the latest commit. It's already failing while installing the system packages so perhaps some update of ubuntu/ubuntu-ports packages could have caused it.

@saudet
Copy link
Member

saudet commented Jul 12, 2023

Don't map that to BytePointer, it most likely won't work. Map it to a something like a StringVector with Info.define():
https://github.com/bytedeco/javacpp/wiki/Mapping-Recipes#defining-wrappers-for-basic-c-containers

@saudet
Copy link
Member

saudet commented Jul 12, 2023

If you're talking only about the std::string, that can be mapped to a BytePointer like this yes. What issue are you having with String constructor?
http://bytedeco.org/javacpp/apidocs/org/bytedeco/javacpp/BytePointer.html#BytePointer-java.lang.String-

@sbrunk
Copy link
Contributor Author

sbrunk commented Jul 12, 2023

If you're talking only about the std::string, that can be mapped to a BytePointer like this yes. What issue are you having with String constructor? http://bytedeco.org/javacpp/apidocs/org/bytedeco/javacpp/BytePointer.html#BytePointer-java.lang.String-

Ah, I hadn't realized the role of BytePointer. Should have looked into the JavaDoc. 🙈

The peer class to native pointers and arrays of signed char, including strings.

Thanks! Decoding is working now. I've updated the example and added the missing README.

@sbrunk
Copy link
Contributor Author

sbrunk commented Jul 22, 2023

@saudet if you don't have any additional remarks, this PR is ready now.

@saudet saudet merged commit 251f317 into bytedeco:master Jul 25, 2023
6 checks passed
@sbrunk sbrunk deleted the sentencepiece branch July 25, 2023 14:55
@sbrunk
Copy link
Contributor Author

sbrunk commented Jul 25, 2023

Thanks for your help and patience @saudet!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants